CS771A Assignment 2
The Boys
Rajarshi Dutta 200762, Udvas Basak 201056, Shivam Pandey 200938

1 Solution 1

1.1 Theory

Figure 1: Illustration of a working ID3 algorithm

We have used the Iterative Dichotomiser 3 (ID3) decision tree algorithm, one of the most widely used algorithms for supervised machine learning classification tasks. The algorithm works by recursively partitioning the training data on their attributes until the decision tree produces the purest splits (referred to as the leaves). ID3 uses a top-down greedy approach to build the decision tree: the tree is constructed from the top, and the greedy approach means that at each iteration the best feature is selected to split the node.

1.2 Our approach

The decision tree constructed to solve the given Wordle-Solvr problem consists of the following components:

1.2.1 process_node

• For every non-root node of the constructed ID3 tree, the process_node function calls get_entropy, which computes the index of the vocabulary word that minimizes the entropy after the split, i.e., maximizes the information gain. This index is used to access the most appropriate query at that step.
• All words are extracted from the all_words list, and the reveal function uncovers the mask between the query and the required word, which is stored in split_dict as the key. The value under that key is an array of the indices of the words that return the given mask when queried.

get_entropy

• The get_entropy function iterates over all the words present in the node (the possible candidate queries) and passes the array to the function calc_entropy, which returns the entropy of the current node.
• Another dictionary is maintained which, given a mask, records the number of words that produce that particular mask when queried. These counts are used to compute the weighted sum of entropies after splitting the node.
• The difference between the initial entropy of the parent node and the weighted sum of entropies after splitting gives the information gain, which is used to pick the best split among all words:

    Gain(S, A) = E(S) − Σ_{v∈V} (|S_v| / |S|) · E(S_v)    (1)

where V is the set of possible values of attribute A, S is the set of examples X, and S_v is the subset where X_A = v.
• Information gain describes the mutual information between the attribute and the class labels of S.

calc_entropy

• This function simply calculates the Shannon entropy

    E(S) = − Σ_i p_i · log2(p_i)    (2)

where p_i is the probability of each class label produced by the corresponding split on the attribute.

2 Solution 2

The entire algorithm has been implemented in the submit.py file.
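To make equations (1) and (2) concrete, below is a minimal Python sketch of how calc_entropy and get_entropy could fit together. The function names come from this report; the uniform distribution over candidate words, the reveal signature, and the bucketing details are assumptions for illustration, not the submitted code.

    import math

    def calc_entropy(words):
        # Shannon entropy (eq. 2) of a node; every remaining candidate word
        # is assumed equally likely, so p_i = 1/|S| and E(S) = log2(|S|).
        n = len(words)
        return -sum((1 / n) * math.log2(1 / n) for _ in words) if n else 0.0

    def get_entropy(words, query, reveal):
        # Information gain (eq. 1) of asking `query` at this node: bucket the
        # node's words by the mask reveal(query, word) produces, then subtract
        # the size-weighted entropy of the buckets from the parent entropy.
        split_dict = {}
        for word in words:
            split_dict.setdefault(reveal(query, word), []).append(word)
        weighted = sum(len(b) / len(words) * calc_entropy(b)
                       for b in split_dict.values())
        return calc_entropy(words) - weighted

    def best_query_index(words, reveal):
        # Index of the word whose split maximizes the information gain.
        return max(range(len(words)),
                   key=lambda i: get_entropy(words, words[i], reveal))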
Write your name and section number at the top right of this page:

Class          | Section Number
Bret 8:50      | 001
Sahifa 1:20    | 003
Miranda 9:55   | 004
Sahifa 3:30    | 005
Sahifa 8:50    | 006
Cameron 1:20   | 007

As you complete the exam, write your initials at the top right of each other page. If you need more room, there is a blank page at the end of the exam, or we can give you some scratch paper. Some multiple choice questions are "Select ONE" while others are "Select ALL that apply". Pay attention to the question type and only mark one option if it says "Select ONE". Fill in the circles completely.

1. A dataframe lions has 32 rows, each representing a unique lion. It has two numeric columns: age, each lion's age in decimal years, and proportion.black, the percentage of that lion's nose which is black. Finally, it has a logical column adult, which is TRUE if age is greater than 3 and FALSE otherwise. Consider the following code, which attempts to make a scatter plot of age on the x axis vs. proportion.black on the y axis. Which of the following statements are true about errors in the above code? Select ALL that apply.
   - fill = adult must be changed to color = adult to color the points differently.
   - alpha = 0.5 must be moved outside of aes(), like size = 2 currently is.
   - size = 2 must be moved within aes(), like alpha = 0.5 is.
   - fill = adult must be moved to the aes() in the ggplot() call, not geom_point().

2. Using the same dataframe as Question 1, consider the following code which creates a new dataframe, lions_summary. Which of the following statements are true about lions_summary? Select ALL that apply.
   - lions_summary has 2 rows.
   - lions_summary has 2 columns. (False: adult, averageAge, and averagePropBlack.)
   - lions_summary has a column called 'adult'.
   - If there were any NA values in the age column of lions, all the values of averageAge in lions_summary will be NA. (Sneakily false: if there are only NA values in one category of adult, the other can still function!)

3. A random viewer's ideal movie length, in minutes, is approximately X ∼ N(120, 15). Actual movie lengths are given by Y ∼ N(143, 19).
   (a) Which R code below calculates the probability that a random movie is shorter than the 40th percentile for ideal movie length? Select ONE.
       - qnorm(0.4, 120, 15) %>% pnorm(143, 19)
       - qnorm(0.4, 143, 19) %>% pnorm(120, 15)
       - pnorm(0.4, 120, 15) %>% qnorm(143, 19)
       - pnorm(0.4, 143, 19) %>% qnorm(120, 15)
   (b) What movie length y corresponds to a z-score of 1? Select ONE.
       - 19
       - 143
       - 124
       - 162 (143 + 19 = 162)

4. Your friend claims that a random variable, X, follows the following probability distribution. Which of the following are true about your friend's claim? Select ONE.
   - This distribution has negative values of x, therefore X is not a random variable.
   - This distribution has negative values of P(X = x), therefore X is not a random variable.
   - The area under this distribution is not 1, therefore X is not a random variable.
   - X is a valid random variable.

5. Consider ordering a "footlong" (12 inch) sub from Subway. Consider trying to estimate µ, the average length of the sub you receive when you order a 12 inch sub. You do this by ordering 40 footlong subs and measuring their lengths in inches. (What a great problem this is.) You observe that the average length of your 40 subs is 11.5 inches, and the standard deviation of their lengths is 0.4 inches. Consider testing H0: µ = 12 vs. Hα: µ < 12. Write a numeric expression (only containing numbers and arithmetic symbols like + and ×) for the test statistic of this test.

   Answer: (x̄ − µnull)/(sx/√n) = (11.5 − 12)/(0.4/√40)
6. The fictitious distribution of the number of children in a household (X) is given below.

   x        | 0 | 1   | 2    | 3
   P(X = x) | ? | 0.4 | 0.25 | 0.15

   (a) Which of the following statements about X is true? Select ONE.
       - X is a discrete RV.
       - X is a binomial RV.
       - X is a continuous RV.
       - X is a normal RV.
   (b) What value of P(X = 0) makes this a valid probability distribution? Select ONE.
       - 0.1
       - 0.15
       - 0.2
       - 0.25
   (c) What is the median number of children per household? Select ONE.
       - 0
       - 1
       - 2
       - Not enough information to determine

7. Data is collected on the duration each winter that a lake's surface is frozen over a period of 103 consecutive years. The correlation coefficient between the variables is r = −0.4. The year variable has a mean x̄ = 1950 and a standard deviation sx = 30. The freeze duration variable has ȳ = 90 and a standard deviation sy = 17. Consider fitting a simple linear regression model to this data. Write a numerical expression (using only numbers and arithmetic symbols like + or ×) for the predicted freeze duration in the year 1890. No need to simplify.

   Answer: 90 + (−0.4) × (−2) × 17
   (1890 is two x standard deviations BELOW the mean of x, so our predicted value will be r × two y standard deviations below the mean of y.)
   You could also take the long way: β̂1 = r × sy/sx = −0.4 × (17/30) and β̂0 = 90 − β̂1 × 1950, so ŷ1 = β̂0 + β̂1 × 1890.

8. We continue to consider the lake freezing data from Problem 7. A 95% confidence interval for the freeze duration of the true regression line at x = 1890 has the form ŷ1 ± c × SEC. A 95% prediction interval for the duration that the lake was frozen in the single year 1890 has the form ŷ1 ± c × SEP, where c is a critical value from some distribution and SEC, SEP are some positive values. Reminder: There are 103 years in the dataset. Which R code calculates the numeric critical value c? Select ONE.
   - qnorm(0.95)
   - qnorm(0.975)
   - qt(0.95, 101)
   - qt(0.975, 101)
   - qt(0.95, 102)
   - qt(0.975, 102)

9. Using the expressions from the previous problem, circle the true relationship between SEC and SEP and briefly explain why.
   SEC < SEP    SEC = SEP    SEC > SEP
   Answer: A confidence interval for the height of the regression line is narrower than a prediction interval at the same x, but they have the same center and critical value. Therefore, it must be that SEC < SEP.

10. Consider trying to estimate the proportion of UW-Madison undergraduate students who will graduate at the end of this semester, p. You ask 50 random students, and 7 of them will graduate at the end of the semester. Consider using this information to calculate a confidence interval for p. Which of the following statements are true? Select ALL that apply.
    - If we were to calculate an Agresti-Coull confidence interval for p, its center would be (7 + 1)/(50 + 2). (False: it would be (7 + 2)/(50 + 4).)
    - If we were to calculate a Wald confidence interval for p, its center would be 7/50.
    - The upper bound of the Agresti-Coull confidence interval would be greater than the upper bound of the Wald confidence interval. (It's shifted towards 0.5.)
    - As we increase the confidence level of our interval towards 100%, our interval would widen and approach [0, 1].

11. We are interested in comparing the proportions of individuals with bachelor degrees among adult men aged 25 and older in the states of Minnesota and Wisconsin. Let pM and pW represent these two population proportions. In random samples of nM = 300 Minnesota men and nW = 400 Wisconsin men aged 25 and older, the numbers of individuals with bachelor degrees are xM = 90 and xW = 100.
    Without simplification, write a numerical expression using the provided data for the standard error SEci used in a 95% confidence interval for pM − pW of the form (point estimate) ± zcrit × SEci when using the Agresti-Coull method.

    Answer: SEci =

12. In the setting of the problem above, write a numerical expression using the provided data for the standard error SEht of the test statistic

    Z = (p̂M − p̂W)/SEht ∼ N(0, 1)

    Answer: SEht =

13. Suppose that the test statistic from the previous problem has the value z = 1.47 and that the p-value from a two-sided test is 0.14. Which of the following statements are true? Select ALL that apply.
    - The test is statistically significant at an α = 0.05 level.
    - A 95% confidence interval for pM − pW will contain the value 0. (Insignificant p-value.)

14. You ask a group of 30 people from Country A to each privately give you a random number between 1 and 1000. You then ask a group of 40 people from Country B to each privately give you a random number between 1 and 1000. Let µA be the true average value that all members of country A would give upon this request, and similarly for µB. You are interested in the quantity µA − µB. Which of the following statements are true about how to approach this problem through statistical inference? Select ONE.
    - Two-sample means inference is impossible to conduct because the sample sizes are different.
    - Two-sample means inference is impossible to conduct because the random variables are technically discrete.
    - Two-sample inference is more appropriate than paired-sample inference in this case.
    - Paired-sample inference is more appropriate than two-sample inference in this case.

15. Assume the correct value of the test statistic from problem 14 is -50. Which R code correctly calculates the p-value for this test? Select ONE.
    - pt(-50, df = W)
    - 2*pt(-50, df = W)
    - 2*pt(abs(-50), df = W)
    - 1 - pt(-50, df = W)
Get your feet wet in GHIDRA, or maybe your eyes :'(

Instructions:
• Compile the hello world C code we saw in class into an executable.
  Code:
  GCC Command:
• Load the executable into GHIDRA and locate the main function that you wrote. You may be surprised to see how much additional code is added by the compiler!
• Comment the same instructions that you commented for Extra Credit #1.

Note: You can write and compile the code above in the AWS workspace. After you are done with commenting, you can submit your ASCII Listing file directly from the AWS workspace (highly recommended). Doing this will help you avoid transferring the file from the AWS workspace to your personal computer.

Exporting Comments from GHIDRA (-5 points for not following these steps properly):
Default export settings in GHIDRA will truncate comments!
• Click on File and then click on Export Program as shown in the figure below.
• This will open a new pop-up; in it, choose ASCII format as shown in the figure below.
• Click on Options to increase the end-of-line width (max comment width).

Transferring Files:
To transfer files from your personal device to the AWS workspace:
• Create a zip folder of all the files that you would like to transfer to the AWS workspace.
• Every GT student has Box and OneDrive accounts given free by the institution. Log in to either of those two and upload the desired files.
• Now go back to the AWS workspace, log in to whichever of the two services you uploaded your zip folder to, download the folder to the AWS workspace, and use the appropriate 7z command to unzip your folder.

Grade: 100 points
This assignment can be done individually or in a team of 2. Please join a group in Gradescope if you are collaborating. Do not create or join a group in Canvas. Canvas groups are different from Gradescope groups. New to Gradescope? This link provides instructions for how to create groups in Gradescope: https://help.gradescope.com/article/m5qz2xsnjy-student-add-group-members
Zoom can also provide the ability to collaborate and video conference with your teammate.

Submission:
Upload your commented ASCII Listing file exported from GHIDRA (called submission.txt) to the assignment in Gradescope.
Note: Gradescope will only check the formatting of your submission. Gradescope will not automatically check for correctness and provide a grade.

Extra Credit #1 – Get your assembly skillz up to speed!

Instructions:
• Compile the hello world C code we saw in class into assembly code.
  Code:
  GCC Command: gcc -W -Wall -Wextra -Wpedantic -fno-asynchronous-unwind-tables -O0 -S -masm=intel hello.c -o hello.s
• For each line with an assembly instruction, add a comment explaining what that instruction is doing. Be smart about it! No "moves 2 into eax"; instead say: "the number of args must be 2".

Grade: 5 extra credit points (fills in lost points on real labs)

Teams:
This assignment can be done individually or in a team of 2. Please join a group in Gradescope if you are collaborating. Do not create or join a group in Canvas. Canvas groups are different from Gradescope groups. New to Gradescope? This link provides instructions for how to create groups in Gradescope: https://help.gradescope.com/article/m5qz2xsnjy-student-add-group-members
Zoom can also provide the ability to collaborate and video conference with your teammate.

Submission:
Upload your commented assembly code file (called hello.s) to the assignment in Gradescope.
Note: Gradescope will only check the formatting of your submission. Gradescope will not automatically check for correctness and provide a grade.
This assignment has in total 55 base points and 10 extra points, and the cap is 55. Bonus questions are indicated using the ⋆ mark. Please specify the following information before submission:
• Your Name: Yufeng Xu
• Your NetID: yx3038

Problem 1: Infinite knapsack [10⋆ + 15 pts]

(a)⋆ Prove that in any optimal solution (n1, …, nm), the total number of copies of the items I2, …, Im included in the knapsack is at most wmax − 1, i.e., Σ_{i=2}^{m} n_i ≤ wmax − 1.

(b) Based on the conclusion of (a), design an algorithm Knapsack(m, W, (w1, v1), …, (wm, vm)) for infinite knapsack with running time O(m · wmax²). For convenience, your algorithm only needs to return the maximum value achieved. Describe the basic idea and give the pseudocode. Briefly justify the correctness and analyze the time complexity. (Hint: According to (a), we know that, in an optimal solution, the total weight of the items I2, …, Im is less than wmax², and hence the item I1 occupies "most" of the space of the knapsack. Try to combine this observation with the O(mW)-time DP algorithm.)

Solution. Please write down your solution to Problem 1 here.

(a) We prove the claim by contradiction. Suppose there exists an optimal solution with Σ_{i=2}^{m} n_i ≥ wmax. List the copies of the items I2, …, Im as w′1, w′2, …, w′n, where n ≥ wmax and, for each j, there exists i ∈ {2, …, m} such that w′j = wi (several w′j may be equal to the same wi). We want to show that there exist 1 ≤ l ≤ r ≤ n such that Σ_{j=l}^{r} w′j is a multiple of w1. Consider the prefix sums Sk = Σ_{j=1}^{k} w′j, and write Sk ≡ Rk (mod w1) with Rk ∈ {0, 1, …, w1 − 1}.
(i) If there exists k such that Rk = 0, take l = 1 and r = k; then Sk is a multiple of w1.
(ii) If no Rk equals 0, then Rk can only take w1 − 1 possible values. Because n ≥ wmax ≥ w1 > w1 − 1, by the pigeonhole principle there exist k1 < k2 such that Rk1 = Rk2, hence Sk2 − Sk1 ≡ 0 (mod w1); take l = k1 + 1 and r = k2.
In either case Σ_{j=l}^{r} w′j = t · w1 for some integer t ≥ 1. Therefore, we can replace w′l, …, w′r with t copies of I1: the total weight is unchanged and, because I1 has the largest value per unit weight, the total value becomes larger, which contradicts our assumption that the original solution is optimal. Therefore Σ_{i=2}^{m} n_i ≤ wmax − 1.

(b) We know from (a) that Σ_{i=2}^{m} n_i ≤ wmax − 1, hence the total weight of the items I2, …, Im is at most wmax(wmax − 1) < wmax². Consider the naive infinite-knapsack algorithm. Let DP(i, w) represent the maximum value obtainable by choosing from items {i, i + 1, …, m} with total weight at most w. Because we can either take a copy of the i-th item or take none of them, DP(i, w) = max{DP(i + 1, w), vi + DP(i, w − wi)}. Because i = 1, 2, …, m and w = 0, 1, …, W, this algorithm has time complexity O(mW). Now, with the conclusion from (a), we can limit the search space of w for items 2, …, m to 0, 1, …, min(W, wmax²). The global maximum is then max over w of DP(2, w) + ⌊(W − w)/w1⌋ · v1. For items 2, …, m, given any total weight limit w, the infinite-knapsack DP yields the optimal solution that maximizes the total value of the items chosen. For the whole problem, we learned from (a) that the total weight of items 2, …, m is less than wmax²; therefore we can cap the total weight for items 2, …, m at wmax² and simply pack as many copies of item I1 as possible into the weight limit we haven't used. By using this method, we obtain the correct optimal solution. Because the weight limit for items 2, …, m shrinks from W to wmax², and it takes O(1) time to add the copies of item I1 to a partial solution, the time complexity of the revised algorithm is O(m · wmax²). Below is the pseudocode:

    def knapsack(m, W, I):  # I is a list of (w, v) items
        # Sort so that I[0] is the item with the largest value-to-weight ratio.
        I = sorted(I, key=lambda item: item.v / item.w, reverse=True)
        w_max = max(item.w for item in I)
        B = min(W, w_max ** 2)  # weight budget reserved for items I[1:]
        # DP[idx][w]: best value using items I[idx:] with total weight at most w;
        # row m is all zeros and handles the boundary case.
        DP = [[0] * (B + 1) for _ in range(m + 1)]
        for idx in range(m - 1, 0, -1):
            for w in range(B + 1):
                if w >= I[idx].w:
                    DP[idx][w] = max(I[idx].v + DP[idx][w - I[idx].w],
                                     DP[idx + 1][w])
                else:
                    DP[idx][w] = DP[idx + 1][w]
        # Fill the remaining capacity with copies of the best-ratio item I[0].
        return max(DP[1][w] + (W - w) // I[0].w * I[0].v for w in range(B + 1))
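A quick smoke test of the pseudocode above (the Item type and the toy instance are my own illustration, not part of the assignment):

    from collections import namedtuple

    Item = namedtuple('Item', 'w v')
    # Item (2, 3) has the best value-to-weight ratio, so it should fill the
    # bag: five copies give weight 10 and value 15, beating (5, 4) twice.
    print(knapsack(2, 10, [Item(2, 3), Item(5, 4)]))  # -> 15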
Problem 2: Counting shortest paths [15 pts]

Given an undirected graph G = (V, E) and a vertex s ∈ V, we want to know, for every vertex t ∈ V, how many (distinct) shortest paths from s to t there are. This can be viewed as a generalization of the unique shortest-path problem. Design an algorithm Count(G, s) to solve this problem. The algorithm should return a list L that contains a pair (v, δv) for each vertex v of G, where δv is the number of shortest paths from s to v. Describe the basic idea and give the pseudocode. Briefly justify the correctness and analyze the time complexity. Your algorithm should run in O(n + m) time, where n = |V| and m = |E|.

A sample example. Suppose the input graph is G = (V, E) where V = {s, a, b, c, d, e} and E = {(a,b), (b,c), (c,d), (a,d), (s,a), (c,e)}; see the figure below. Then the algorithm Count(G, s) should return L = [(s,1), (a,1), (b,1), (c,2), (d,1), (e,2)]. (Hint: Exploit the BFS tree and extend the idea in the unique shortest-path problem.)

Solution. Please write down your solution to Problem 2 here.

We solve the problem by BFS. Consider the BFS tree of the graph rooted at s and perform a level-order traversal. All vertices with the same shortest-path length to s lie on the same level, and every shortest path to a vertex v enters v from a neighbor on the previous level. Hence the number of shortest paths to v is the sum of the counts of its neighbors on the previous level; neighbors on the same or a deeper level do not contribute, because an edge between them cannot lie on a shortest path to v. The pseudocode is as follows:

    class Node:
        def __init__(self):
            self.neighbors = []

        def add_neighbor(self, other):
            self.neighbors.append(other)
            other.neighbors.append(self)

    def count(G, s):
        V, E = G  # V: list of Node; E: list of 2-tuples of Nodes
        for u, v in E:
            u.add_neighbor(v)
        dist = {s: 0}
        cnt = {s: 1}
        queue = [s]
        while queue:
            next_queue = []
            for u in queue:
                for v in u.neighbors:
                    if v not in dist:  # first time v is seen: one level deeper
                        dist[v] = dist[u] + 1
                        cnt[v] = 0
                        next_queue.append(v)
                    if dist[v] == dist[u] + 1:  # edge from the previous level
                        cnt[v] += cnt[u]
            queue = next_queue
        return list(cnt.items())

This algorithm is correct because all vertices with the same shortest-path length are on the same level of the BFS tree, and the number of shortest paths to a vertex equals the sum of the numbers of shortest paths to its neighbors on the previous level, which is exactly what the update cnt[v] += cnt[u] accumulates. Each vertex is enqueued exactly once and each edge is examined a constant number of times (once from each endpoint), so the algorithm runs in O(n + m) time.
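Running the pseudocode on the sample graph from the problem statement reproduces the expected counts (the name-to-node bookkeeping below is just for printing):

    nodes = {name: Node() for name in 'sabcde'}
    edges = [(nodes[u], nodes[v])
             for u, v in [('a','b'), ('b','c'), ('c','d'),
                          ('a','d'), ('s','a'), ('c','e')]]
    names = {node: name for name, node in nodes.items()}
    L = count((list(nodes.values()), edges), nodes['s'])
    print(sorted((names[v], c) for v, c in L))
    # -> [('a', 1), ('b', 1), ('c', 2), ('d', 1), ('e', 2), ('s', 1)]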
Problem 3: Reaching the one [15 pts]

Let n ≥ 3 be a given positive integer and we are going to play a game as follows. During the game, we have a number k which is initially equal to 2. In each round of the game, we are allowed to change the current number k to a new number in one of the following ways:
• Type-A: Change k to (k + 1) mod n.
• Type-B: Change k to a divisor of k that is greater than 1.
• Type-C: Change k to k² mod n.
Note that the above three rules guarantee that our number k is always in the range {0, 1, …, n − 1} during the entire game. The game terminates when k becomes 1. Our goal is to finish the game in the fewest rounds.

For example, suppose n = 91 and we can play the game as follows.
• Round 1: 2 ⟹ 3 using a Type-A change.
• Round 2: 3 ⟹ 9 using a Type-C change.
• Round 3: 9 ⟹ 81 using a Type-C change.
• Round 4: 81 ⟹ 27 using a Type-B change.
• Round 5: 27 ⟹ 1 using a Type-C change.
As you can verify, this is an optimal solution. Design an algorithm Game(n) which returns an optimal solution to the game as a list. So Game(91) should return [2, 3, 9, 81, 27, 1] (or another solution with 5 rounds). Describe the basic idea and give the pseudocode. Briefly justify the correctness and analyze the time complexity. Your algorithm should run in O(n log n) time. (Hint: Formulate the problem as a graph problem with n vertices and O(n log n) edges.)

Solution. Please write down your solution to Problem 3 here.

We construct a graph based on the problem setting: the n possible values 0, 1, …, n − 1 become n vertices. For the Type-A operation, we connect vertex k to vertex k + 1 for k = 0, 1, …, n − 2, and connect n − 1 to 0. For the Type-B operation, we connect vertex k to all its divisors larger than 1. For the Type-C operation, we connect vertex k to k² mod n for k = 2, …, n − 1.

Next, we consider the total number of directed edges. For Type-A, the number of directed edges is n; for Type-C, the total number of edges is at most n − 2. For Type-B, every edge goes from a multiple j to a divisor d with 1 < d < j, so the total number of edges is at most Σ_{d=2}^{n−1} ⌊n/d⌋ ≤ n · Σ_{d=2}^{n−1} 1/d. By the integral test, Σ_{d=1}^{n} 1/d ≤ ln(n) + 1, so the number of Type-B edges is O(n log n). Also, n is O(n log n), hence the total number of directed edges is O(n log n).

Next, we perform a BFS on this graph G, maintained with a queue. We first append the starting vertex 2 to the queue. Each time, for the first vertex in the queue: if it is 1, we quit the BFS; otherwise we append all of its unvisited neighbors to the end of the queue, and record the current vertex as the previous vertex of each such neighbor (if the neighbor does not already have a recorded previous vertex). This algorithm always finds the shortest path between 2 and 1, because vertices reach the head of the queue in nondecreasing order of distance, so when 1 is visited the path is guaranteed to be shortest. The time complexity of the BFS is O(n log n) because the total number of edges in the graph is O(n log n), and each edge is visited at most once. The time needed to construct the graph is also O(n log n) because each edge takes O(1) time to construct. Therefore, the overall complexity of the algorithm is O(n log n). The pseudocode is given as follows:

    class Vertex:
        def __init__(self, val):
            self.val = val
            self.neighbors = []

        def add_neighbor(self, other):
            self.neighbors.append(other)

    def game(n):
        V = [Vertex(i) for i in range(n)]
        # Type-A edges: k -> (k + 1) mod n
        for i in range(n - 1):
            V[i].add_neighbor(V[i + 1])
        V[n - 1].add_neighbor(V[0])
        # Type-B edges: j -> each divisor i of j with 1 < i < j
        for i in range(2, n):
            for j in range(2 * i, n, i):
                V[j].add_neighbor(V[i])
        # Type-C edges: k -> k^2 mod n (skip self-loops)
        for i in range(2, n):
            if i ** 2 % n != i:
                V[i].add_neighbor(V[i ** 2 % n])
        # BFS from vertex 2 until vertex 1 is reached
        visited = [False] * n
        prev = [-1] * n
        queue = [2]
        visited[2] = True
        while queue:
            head = queue.pop(0)
            if head == 1:
                break
            for nb in V[head].neighbors:
                if not visited[nb.val]:
                    visited[nb.val] = True
                    prev[nb.val] = head
                    queue.append(nb.val)
        # Walk the predecessor links back from 1 to 2
        sol = [1]
        while sol[-1] != 2:
            sol.append(prev[sol[-1]])
        return sol[::-1]
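Checking the pseudocode against the worked example (any 5-round path is acceptable per the problem statement):

    path = game(91)
    print(path, len(path) - 1)  # e.g. [2, 3, 9, 81, 27, 1] 5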
Problem 4: Finding a strongly connected component [10 pts]

Given a directed graph G = (V, E) and a vertex s ∈ V, we want to compute the strongly connected component of G containing s. Design an algorithm SCC(G, s) which returns the set of vertices of G that lie in the same strongly connected component as s. Describe the basic idea and give the pseudocode. Briefly justify the correctness and analyze the time complexity. Your algorithm should run in O(n + m) time, where n = |V| and m = |E|. (Hint: By definition, a vertex v is in the same strongly connected component as s iff v is reachable from s and s is reachable from v. We already know how to find the vertices reachable from s using DFS. Can you use a similar idea to compute the vertices from which s is reachable?)

Solution. Please write down your solution to Problem 4 here.

For each node u, we store all nodes v with (u, v) ∈ E as its out-neighbors and all nodes w with (w, u) ∈ E as its in-neighbors. We then perform one DFS along out-neighbors to obtain all the nodes that s can reach, and one DFS along in-neighbors to obtain all the nodes that can reach s. Finally, we take the intersection of the two results, which is the set of vertices in the strongly connected component containing s.

This algorithm is correct: a DFS along out-neighbors reaches a vertex t iff t is reachable from s, and a DFS along in-neighbors (equivalently, a DFS after reversing the directions of all edges) reaches t iff there is a path from t to s, i.e., s is reachable from t. By taking the intersection of the results of the two DFS's, a node t is kept iff it is reachable from s and it can reach s, i.e., it is strongly connected to s. The time to process the in-neighbors and out-neighbors is O(m). The time complexity of each DFS is O(n + m). Therefore, the time complexity of the whole algorithm is O(n + m). The pseudocode is as follows:

    class Node:
        def __init__(self, val):
            self.val = val
            self.neighbors = [[], []]  # index 0: in-neighbors, 1: out-neighbors

        def add_neighbor(self, other):
            # add an edge from self to other
            self.neighbors[1].append(other)
            other.neighbors[0].append(self)

    def scc(G, s):
        V, E = G
        for u, v in E:
            u.add_neighbor(v)
        # res[1]: all nodes reachable from s; res[0]: all nodes that can reach s
        res = [set(), set()]

        def dfs(u, direction):
            # direction = 1 follows out-neighbors, 0 follows in-neighbors
            res[direction].add(u)
            for v in u.neighbors[direction]:
                if v not in res[direction]:
                    dfs(v, direction)

        dfs(s, 0)
        dfs(s, 1)
        return res[0] & res[1]
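A tiny example run (the four-vertex digraph is my own illustration): vertices 1 → 2 → 3 → 1 form a cycle, and 4 hangs off the cycle, so only the cycle is in the component of vertex 1.

    a, b, c, d = (Node(i) for i in (1, 2, 3, 4))
    G = ([a, b, c, d], [(a, b), (b, c), (c, a), (c, d)])
    print(sorted(v.val for v in scc(G, a)))  # -> [1, 2, 3]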
CSC718 Operating Sys & Parallel Programming
Homework 4: MPI, OpenMP, and GPU Programming Assignments
Total: 100 points

The homework assignment will be graded based on the following criteria:
• Accuracy: 1) the solution meets the specific requirements in the problem description; 2) the solution produces correct results; 3) the procedures adopted in the solution are technically sound.
• Efficiency: efficiency will be one of the criteria when grading programming assignments. The solution should produce the desired results efficiently.
• Effort/neatness: the solution shows excellent effort, and all related work is shown neatly and organized well.

Homework assignment feedback will be available through the DropBox folder on D2L. For all the programming assignments, you can choose any operating system you want. I will usually provide C/C++ samples for the programming assignments. If you prefer to use other languages, e.g., Java, they are accepted too. A README.txt is required to submit any programming assignment. In the README.txt, you need to provide the following information: 1) How to compile your program? 2) How to run your program? 3) What is the average running time of your program? 4) What are the expected output results when I run your program?

Zip all your source code, project files, supporting files, and README.txt, and submit the all-in-one zip file to the D2L Dropbox. If you have any questions about the homework, please let me know.

(10 points) What are the strengths and limitations of GPU programming?

(15 points) Among the four parallel frameworks we discussed in class (multithreaded programming, MPI programming, OpenMP programming, and GPU programming, e.g., CUDA), what strategies are you going to use in general when selecting a parallel framework for your applications?

(10 points) Benchmarking of a sequential program reveals that 95 percent of the execution time is spent inside functions that are amenable to parallelization. What is the maximum speedup we could expect from executing a parallel version of this program on 10 processors, using Amdahl's Law?

The "acceptable" identifier must meet the following constraints:
• The identifier is a six-digit combination of the numerals 0-9.

1. (15 points) Write a sequential program to count the different identifiers given these constraints.
2. (20 points) Convert the sequential program to an MPI parallel program.
3. (20 points) (choose only one: A or B, not both)
   A. Convert the sequential program to an OpenMP parallel program.
   B. Convert the sequential program to a GPU program using a parallel framework of your choice (e.g., CUDA, OpenCL, etc.)
4. (10 points) Benchmark the three programs based on your choice in 3).

Problem                     | 1 process / 1 thread | 2 processes / 2 threads | 3 processes / 3 threads | 4 processes / 4 threads
Program1.c (sequential)     |                      | n/a                     | n/a                     | n/a
Program2.c (MPI)            |                      |                         |                         |
Program3.c (openMP or CUDA) |                      |                         |                         |
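For the Amdahl's Law question above, the relevant bound, with f the fraction of execution time that can be parallelized and p the number of processors, is

    S(p) = \frac{1}{(1 - f) + \frac{f}{p}}

and the maximum expected speedup follows by substituting f = 0.95 and p = 10.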
CS771A Assignment 3
The Boys
Rajarshi Dutta 200762, Udvas Basak 201056, Shivam Pandey 200938

1 Question 1

The first part of the solution uses a linear model, Elastic Net regression, which takes into account both the Ridge regression penalty (L2 loss) and the Lasso regression penalty (L1 loss). The objective/loss function of the model is

    C(w) = argmin_w { ‖y − Xw‖² + α(λ1‖w‖²₂ + λ2‖w‖₁) }    (1)

where
• w′ = the weights multiplied with each of the features present in the dataset taken into account for the construction of the regression model;
• α = a constant that multiplies both the lasso and ridge penalties;
• λ1 = the regularization parameter for the Ridge regression penalty;
• λ2 = the regularization parameter for the Lasso regression penalty.

During the regularization procedure, the lasso (λ2) part of the penalty leads to the formation of a sparse model, whereas the quadratic ridge (λ1) part makes the regularization path more stable and decreases the number of variables to be selected, thereby promoting the grouping effect. The grouping effect enables correlated variables to be identified together, leading to an enhancement of the sampling procedure. This method first finds the Ridge regression coefficients and then, in a second step, applies a Lasso-style shrinkage of the coefficients.

In order to determine the best performing linear model, the Randomized Search Cross Validation technique has been applied for efficient hyperparameter selection over the following sets of hyperparameters of the model. Here grid is a dictionary whose keys are the hyperparameters of the Elastic Net regression model and whose values are the candidate settings.

    grid = dict()
    grid['alpha'] = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 0.0, 0.2, 1.0]
    grid['l1_ratio'] = np.arange(0, 1, 0.01)
    grid['selection'] = ['cyclic', 'random']
    grid['max_iter'] = [2, 10, 100, 1000]
    grid['tol'] = [1e-4, 1e-3, 1e-5, 1e-2, 1e-1, 0.0]

The Randomized Search Cross Validation returns the best estimator with the following hyperparameters:

    ElasticNet(alpha=0.1, l1_ratio=0.62, selection='random', tol=0.001, warm_start=True)

The corresponding MAE scores on the training data are 5.6248 (for the O3 levels) and 6.5136 (for the NO2 levels) respectively.
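A minimal sketch of how such a search might be wired up with scikit-learn follows. Only the grid above comes from this report; the search budget (n_iter), the MAE scoring choice, the cross-validation fold count, and the placeholder training arrays are assumptions for illustration.

    import numpy as np
    from sklearn.linear_model import ElasticNet
    from sklearn.model_selection import RandomizedSearchCV

    # Placeholder arrays standing in for the weather features and the
    # O3/NO2 targets described in this report.
    X_train = np.random.rand(100, 10)
    y_train = np.random.rand(100)

    grid = {
        'alpha': [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 0.0, 0.2, 1.0],
        'l1_ratio': np.arange(0, 1, 0.01),
        'selection': ['cyclic', 'random'],
        'max_iter': [2, 10, 100, 1000],
        'tol': [1e-4, 1e-3, 1e-5, 1e-2, 1e-1, 0.0],
    }

    search = RandomizedSearchCV(
        ElasticNet(warm_start=True),
        param_distributions=grid,
        n_iter=50,                          # assumed search budget
        scoring='neg_mean_absolute_error',  # matches the MAE metric reported
        cv=5,
        random_state=0,
    )
    search.fit(X_train, y_train)
    print(search.best_estimator_)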
2 Question 2

A suitable non-linear model which can effectively fit the given dataset is the K Nearest Neighbours (KNN) algorithm. KNN can be used both as a classification and a regression model; it essentially classifies unknown data points based on their distance to the k nearest data points. The model can work with several kinds of distance metrics, namely the Euclidean, Minkowski, and Manhattan distances. One strength of the KNN regression algorithm that we would like to draw attention to at this point is its ability to work well with non-linear relationships. In the case of regression tasks, the algorithm chooses the k nearest points under the given distance metric and predicts the new point as approximately the average of these nearest points. With the distance-weighted k-nearest-neighbour rule, the weight varies with the distance between the sample and the considered neighbour, decreasing as the sample-to-neighbour distance increases:

    ŷ = ( Σ_{i=1}^{k} w_i · y_i ) / ( Σ_{i=1}^{k} w_i ),  with  w_i = 1 / d(x, x_i)    (2)

Here, the optimal value of k has been selected by plotting the train MAE loss and the test MAE loss for both the O3 and the NO2 sensor values. The MAE loss variation for different k values is shown below.

Figure 1: MAE loss variation for K values

We find that the KNN regressor provides the minimum mean absolute error for a k value of 5.

    model = KNeighborsRegressor(n_neighbors=5, p=1, algorithm="auto", n_jobs=-1)

Some other models tested for the given regression problem include XGBoost, Multi-layer Perceptron, and Random Forest regressor models.

• XGBoost Regressor: XGBoost uses a combination of two techniques, gradient boosting and tree boosting, to improve the accuracy of the model. Gradient boosting involves minimizing the loss function (mean absolute error) by iteratively adding weak learners (e.g., decision trees) to the model. Tree boosting, on the other hand, involves optimizing the split points of the decision trees by minimizing the loss function.
• Random Forest Regressor: The algorithm works by creating a large number of decision trees, each of which is trained on a subset of the data and a subset of the features. The final prediction is then made by averaging the predictions of all the trees in the forest. Random forest regression also includes several regularization techniques to prevent overfitting and improve the generalization of the model, such as maximum depth, minimum samples per leaf, and maximum features.
• Artificial Neural Network: The network consists of layers of 64, 32, and 8 units. The optimizer used is Adam, the kernel initialization is He uniform, and the loss function used is the mean absolute error loss. The network is trained by backpropagation for around 100 epochs on the training data with a batch size of 64.

Non-linear Models          | O3 MAE loss | NO2 MAE loss
K Neighbours Regressor     | 3.3284      | 2.3228
XGBoost Regressor          | 5.1433      | 4.7230
Random Forest Regressor    | 6.0784      | 5.3145
MultiLayer Perceptron      | 4.8807      | 4.9178
Table 1: Stats of different non-linear models

3 Question 3

The code is available in the submit.py file. The code snippet is mentioned below:

    import numpy as np
    import pickle as pkl

    # Define your prediction method here
    # df is a dataframe containing timestamps, weather data and potentials
    def my_predict(df):
        X = np.array(df.iloc[:, 1:])
        # Load your model file
        model = pkl.load(open("knn_model.pkl", "rb"))
        test_preds = np.transpose(model.predict(X))
        # Make two sets of predictions, one for O3 and another for NO2
        pred_o3, pred_no2 = test_preds[0], test_preds[1]
        # Return both sets of predictions
        return (pred_o3, pred_no2)
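For completeness, a minimal sketch of the k-sweep behind Figure 1 in Question 2 (the data arrays here are random placeholders standing in for the assignment's train/test split; the k range is an assumption):

    import numpy as np
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.metrics import mean_absolute_error

    rng = np.random.default_rng(0)
    X_train, y_train = rng.random((200, 5)), rng.random(200)
    X_test, y_test = rng.random((50, 5)), rng.random(50)

    for k in range(1, 21):
        # p=1 selects the Manhattan distance, as in the chosen model above.
        model = KNeighborsRegressor(n_neighbors=k, p=1, n_jobs=-1)
        model.fit(X_train, y_train)
        train_mae = mean_absolute_error(y_train, model.predict(X_train))
        test_mae = mean_absolute_error(y_test, model.predict(X_test))
        print(k, round(train_mae, 4), round(test_mae, 4))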
This assignment focuses on problems related to Lecture 9 and Lectures 18 through 22.
• Lecture 18: Algorithms for RA operations
• Lecture 19: Query processing and query plans
• Lecture 20: Object-relational database programming
• Lecture 20: Key-value stores; NoSQL in MapReduce style
• Lecture 21: Key-value stores; NoSQL in Spark style
• Lecture 22: Graph databases

Other lectures that are relevant for this assignment are Lectures 8, 13, and 14:
• Lecture 8: Translating Pure SQL queries into RA expressions
• Lecture 9: Query optimization
• Lecture 13: Object-Relational databases and queries
• Lecture 14: Nested Relational, Semi-structured Databases, Document Databases

This assignment has problems that are required to be solved. Others, identified as such, are practice problems that you should attempt since they serve as preparation for the final exam. Turn in a single assignment7.sql file that contains the PostgreSQL code of the solutions for the problems that require such code. (Do not include solutions for the practice problems.) Also turn in an assignment7.txt file that contains all the output associated with the problems in this assignment. For all the other problems, submit a single assignment7.pdf file with your solutions.

1 Analysis of Queries Using Query Plans

Consider Lecture 19 on Query Processing: Query Planning in (Object) Relational Systems, and in particular the analysis, using query plans, for the SOME generalized quantifier.

1. Assume the relation schemas P(x), Q(x), R(x, y) and S(x, z). Consider the NOT ALL generalized query

   {(p.x, q.x) | P(p) ∧ Q(q) ∧ R(p.x) ⊉ S(q.x)}

   where

   R(p.x) = {r.y | R(r) ∧ r.x = p.x}
   S(q.x) = {s.z | S(s) ∧ s.x = q.x}

   Now to the problem. In analogy with the analysis for the SOME generalized quantifier, do an analysis for the NOT ALL generalized quantifier.

2 Experiments to Test the Effectiveness of Query Optimization

In the following problems, you will conduct experiments in PostgreSQL to gain insight into whether or not query optimization can be effective. In other words, can it be determined experimentally if optimizing an SQL or an RA expression improves the time (and space) complexity of query evaluation? Additionally, can it be determined if the PostgreSQL query optimizer attains the same (i.e., better or worse) optimization as optimization by hand?

Recall that in SQL you can specify each RA expression as an RA SQL query. This implies that each of the optimization rules for RA can be applied directly to queries formulated in RA SQL. In the following problems you will need to generate artificial data of increasing size and measure the time of evaluating non-optimized and optimized queries. The size of this data can be in the tens or hundreds of thousands of tuples. This is necessary because on very small data it is not possible to gain sufficient insight into the quality (or lack of quality) of optimization. You can use the data generation functions that were developed in Assignment 6. Additionally, you are advised to examine the query plans generated by PostgreSQL.

For the problems in this assignment, we will use three relations:¹

P(a int)
R(a int, b int)
S(b int)

¹A typical case could be where P is Person, R is Knows, and S is the set of persons with the Databases skill.
Another case could be where P is the set of persons who work for Amazon, R is personSkill, and S is the set of skills of persons who live in Bloomington. Etc.

To generate P or S, you should use the function SetOfIntegers, which generates a set of up to n randomly selected integers in the range [l, u]:

    create or replace function SetOfIntegers(n int, l int, u int)
    returns table (x int) as
    $$
      select floor(random() * (u-l+1) + l)::int as x
      from generate_series(1,n)
      group by (x)
      order by 1;
    $$ language sql;

To generate R, you should use the function BinaryRelationOverIntegers, which generates up to n randomly selected pairs with first components in the range [l1, u1] and second components in the range [l2, u2]:

    create or replace function BinaryRelationOverIntegers(n int, l_1 int, u_1 int, l_2 int, u_2 int)
    returns table (x int, y int) as
    $$
      select floor(random() * (u_1-l_1+1) + l_1)::int as x,
             floor(random() * (u_2-l_2+1) + l_2)::int as y
      from generate_series(1,n)
      group by (x,y)
      order by 1,2;
    $$ language sql;

Example 1 Consider the query Q1

    select distinct r1.a
    from R r1, R r2
    where r1.b = r2.a;

This query can be translated and optimized to the query Q2

    select distinct r1.a
    from R r1 natural join (select distinct r2.a as b from R r2) r2;

Imagine that you have generated a relation R. Then when you execute

    explain analyze select distinct r1.a from R r1, R r2 where r1.b = r2.a;

the system will return its query plan as well as the execution time to evaluate Q1, measured in ms. And, when you execute

    explain analyze select distinct r1.a from R r1 natural join (select distinct r2.a as b from R r2) r2;

the system will return its query plan as well as the execution time to evaluate Q2, measured in ms. This permits us to compare the non-optimized query Q1 with the optimized query Q2 for various differently-sized relations R. Here are some of these comparisons for various differently-sized random relations R. In this table, R was generated with lower and upper bounds l1 = l2 = 1000 and u1 = u2 = 1000.²

    |R|  | Q1 (in ms) | Q2 (in ms)
    10⁴  | 27.03      | 7.80
    10⁵  | 3176.53    | 58.36
    10⁶  | 69251.58   | 400.54

²All the experiments were done on a MacMini.

Notice the significant difference between the execution times of the non-optimized query Q1 and the optimized query Q2. So clearly, optimization works on query Q1. Incidentally, below are the query plans for Q1 and Q2. Examining these query plans should reveal why Q1 runs much slower than Q2. (Why?)

    QUERY PLAN for Q1
    ------------------------------------
    HashAggregate
      Group Key: r1.a
      ->  Hash Join
            Hash Cond: (r1.b = r2.a)
            ->  Seq Scan on r r1
            ->  Hash
                  ->  Seq Scan on r r2

    QUERY PLAN for Q2
    ------------------------------------
    HashAggregate
      Group Key: r1.a
      ->  Hash Join
            Hash Cond: (r1.b = r2.a)
            ->  Seq Scan on r r1
            ->  Hash
                  ->  HashAggregate
                        Group Key: r2.a
                        ->  Seq Scan on r r2

We now turn to the problems for this section.

2. Consider query Q3

    select distinct p.a
    from P p, R r1, R r2, R r3, S s
    where p.a = r1.a and r1.b = r2.a and r2.b = r3.a and r3.b = s.b;

Intuitively, if we view R as a graph, and P and S as node types (properties), then Q3 determines each P-node in the graph from which there emanates a path of length 3 that ends at an S-node.³ I.e., a P-node n0 is in the answer if there exists a sequence of nodes (n0, n1, n2, n3) such that (n0, n1), (n1, n2), and (n2, n3) are edges in R and n3 is an S-node.

(a) Translate and optimize this query and call it Q4. Then write Q4 as an RA SQL query just as was done for query Q2 in Example 1.
(b) Compare queries Q3 and Q4 in a similar way as we did for Q1 and Q2 in Example 1.

³Such a query is typical in Graph Databases.
You should experiment with different sizes for R. Incidentally, these relations do not need to use the same parameters as those shown in the above table for Q1 and Q2 in Example 1.
(c) What conclusions do you draw from the results of these experiments regarding the effectiveness of query optimization in PostgreSQL and/or by hand?

3. Consider the Pure SQL query Q5, which is a formulation of a variation of the not-subset (not only) set semijoin query {p.a | P(p) ∧ R(p.a) ⊈ S} where R(p.a) = {r.b | R(r) ∧ r.a = p.a}.

    select p.a
    from P p
    where exists (select 1
                  from R r
                  where r.a = p.a and
                        not exists (select 1 from S s where r.b = s.b));

(a) Translate and optimize this query and call it Q6. Then write Q6 as an RA SQL query just as was done for Q2 in Example 1.
(b) An alternative way to write a query equivalent with Q5 is as the object-relational query

    with nestedR as (select P.a, array_agg(R.b) as bs
                     from P natural join R
                     group by (P.a)),
         Ss as (select array(select b from S) as bs)
    select a
    from nestedR
    where not (bs <@ (select bs from Ss));

A relation such as R(a, b, c) can be encoded as a key-value store and decoded back with an object-relational query:⁸ ⁹

    select p.value->'a' as a, p.value->'b' as b, p.value->'c' as c
    from encodingofR p;

    a | b | c
    ---+---+---
    1 | 2 | 3
    4 | 5 | 6
    1 | 2 | 4

⁸PostgreSQL supports both json and jsonb objects. For this assignment, you should use the jsonb object type since it comes with more functionality and offers more efficient computation.
⁹Notice that this strategy works in general for any relation, independent of the number of attributes of the relation.

An important aspect of this encoding strategy is that it is possible to put multiple relations, possibly with different schemas and arities, into the same key-value store. Besides R, let us also consider a binary relation S(a, d).

    create table S (a int, d int);
    insert into S values (1,2), (5,6), (2,1), (2,3);
    table S;

    a | d
    ---+---
    1 | 2
    5 | 6
    2 | 1
    2 | 3
    (4 rows)

We can now encode both R and S into a single key-value store encodingofRandS as follows:

    create table encodingofRandS(key text, value jsonb);
    insert into encodingofRandS
    select 'R' as key, jsonb_build_object('a', r.a, 'b', r.b, 'c', r.c) as value
    from R r
    union
    select 'S' as key, jsonb_build_object('a', s.a, 'd', s.d) as value
    from S s
    order by 1, 2;

    table encodingofRandS;

    key | value
    ----+--------------------------
    R   | {"a": 1, "b": 2, "c": 3}
    R   | {"a": 1, "b": 2, "c": 4}
    R   | {"a": 4, "b": 5, "c": 6}
    S   | {"a": 1, "d": 2}
    S   | {"a": 2, "d": 1}
    S   | {"a": 2, "d": 3}
    S   | {"a": 5, "d": 6}
    (7 rows)

Furthermore, we can decode this key-value store using 2 object-relational SQL queries and recover R and S.

    select p.value->'a' as a, p.value->'b' as b, p.value->'c' as c
    from encodingofRandS p
    where p.key = 'R';

    a | b | c
    ---+---+---
    1 | 2 | 3
    4 | 5 | 6
    1 | 2 | 4
    (3 rows)

    select p.value->'a' as a, p.value->'d' as d
    from encodingofRandS p
    where p.key = 'S';

    a | d
    ---+---
    1 | 2
    5 | 6
    2 | 1
    2 | 3
    (4 rows)

Example 2 Consider the following problem. Write, in PostgreSQL, a basic MapReduce program, i.e., a mapper function and a reducer function, as well as a 3-phases simulation that implements the set intersection of two unary relations R(a) and S(a), i.e., the relation R ∩ S. You can assume that the domain of the attribute 'a' is integer.
    -- EncodingOfRandS
    drop table R;
    drop table S;
    create table R(a int);
    insert into R values (1),(2),(3),(4);
    create table S(a int);
    insert into S values (2),(4),(5);

    drop table EncodingOfRandS;
    create table EncodingOfRandS(key text, value jsonb);
    insert into EncodingOfRandS
    select 'R' as key, jsonb_build_object('a', r.a) as value from R r
    union
    select 'S' as key, jsonb_build_object('a', s.a) as value from S s
    order by 1;

    table EncodingOfRandS;

    key | value
    ----+----------
    R   | {"a": 1}
    R   | {"a": 4}
    R   | {"a": 2}
    R   | {"a": 3}
    S   | {"a": 4}
    S   | {"a": 5}
    S   | {"a": 2}
    (7 rows)

    -- mapper function
    CREATE OR REPLACE FUNCTION mapper(key text, value jsonb)
    RETURNS TABLE(key jsonb, value text) AS
    $$
      SELECT value, key;
    $$ LANGUAGE SQL;

    -- reducer function
    CREATE OR REPLACE FUNCTION reducer(key jsonb, valuesArray text[])
    RETURNS TABLE(key text, value jsonb) AS
    $$
      SELECT 'R intersect S'::text, key WHERE ARRAY['R','S'] <@ valuesArray;
    $$ LANGUAGE SQL;

    SELECT p.value->'a' as a FROM Reduce_Phase p order by 1;

    a
    ---
    2
    4
    (2 rows)

We now turn to the problems for this section.

13. Practice problem–not graded. Write, in PostgreSQL, a basic MapReduce program, i.e., a mapper function and a reducer function, as well as a 3-phases simulation that implements the symmetric difference of two unary relations R(a) and S(a), i.e., the relation (R − S) ∪ (S − R). You can assume that the domain of the attribute 'a' is integer.

14. Write, in PostgreSQL, a basic MapReduce program, i.e., a mapper function and a reducer function, as well as a 3-phases simulation that implements the semijoin of two relations R(A, B) and S(A, B, C), i.e., the relation R ⋉ S. You can assume that the domains of A, B, and C are integer. Use the encoding and decoding methods described above.

15. Practice problem–not graded. Write, in PostgreSQL, a basic MapReduce program, i.e., a mapper function and a reducer function, as well as a 3-phases simulation that implements the natural join R ⋈ S of two relations R(A, B) and S(B, C). You can assume that the domains of A, B, and C are integer. Use the encoding and decoding methods described above.

16. Write, in PostgreSQL, a basic MapReduce program, i.e., a mapper function and a reducer function, as well as a 3-phases simulation that implements the SQL query

    SELECT r.A, array_agg(r.B), sum(r.B)
    FROM R r
    GROUP BY (r.A)
    HAVING COUNT(r.B) < 3;

Here R is a relation with schema (A, B). You can assume that the domains of A and B are integers. Use the encoding and decoding methods described above.

We now turn to some problems that relate to query processing in Spark. Note that in Spark it is possible to operate on multiple key-value stores.

17. Let R(K, V) and S(K, W) be two binary key-value pair relations. You can assume that the domains of K, V, and W are integers. Consider the cogroup transformation R.cogroup(S) introduced in the lecture on Spark.
(a) Define a PostgreSQL view coGroup that computes a complex-object relation that represents the co-group transformation R.cogroup(S). Show that this view works.
(b) Write a PostgreSQL query that uses this coGroup view to compute the semijoin R ⋉ S, in other words compute the relation R ⋈ πK(S).
(c) Write a PostgreSQL query that uses this coGroup view to implement the SQL query

    SELECT distinct r.K as rK, s.K as sK
    FROM R r, S s
    WHERE NOT ARRAY(SELECT r1.V FROM R r1 WHERE r1.K = r.K) &&
              ARRAY(SELECT s1.W FROM S s1 WHERE s1.K = s.K);

18. Practice problem–not graded. Let A(x) and B(x) be the schemas that represent two sets of integers A and B. Consider the cogroup transformation introduced in the lecture on Spark.
Using an approach analogous to the one in Problem 17, solve the following problems:¹⁰
(a) Write a PostgreSQL query that uses the cogroup transformation to compute A ∩ B.
(b) Write a PostgreSQL query that uses the cogroup operator to compute the symmetric difference of A and B, i.e., the expression (A − B) ∪ (B − A).

¹⁰An important aspect of this problem is to represent A and B as key-value stores.

5 Graph query languages

Each of the following problems is a practice problem.

19. Practice Problem–not graded. Consider the database schema Person, Company, companyLocation, Knows, jobSkill, and personSkill.
(a) Specify an Entity-Relationship Diagram that models this database schema.
(b) Specify the node and relationship types of a Property Graph for this database schema. In addition, specify the properties, if any, associated with each such type.

20. Practice Problem–not graded. Using the Property Graph model in Problem 19b, formulate the following queries in the Cypher query language:
(a) Find the types of the relationships associated with Person nodes.
(b) Find each person (node) whose name is 'John' and has a salary that is at least 50000.
(c) Find each jobSkill (node) that is the job skill of a person who knows a person who works for 'Amazon' and who has a salary that is at least 50000.
(d) Find each person (node) who knows directly or indirectly (i.e., recursively) another person who works for Amazon.
(e) Find, for each company node, that node along with the number of persons who work for that company and who have both the Databases and Networks job skills.
For this assignment, you will need the material covered in the lectures
• Lecture 15: External Merge Sorting
• Lecture 16: Indexing
• Lecture 17: B+ trees and Hashing

In addition, you should read the following sections in Chapters 14 and 15 of the textbook Database Systems: The Complete Book by Garcia-Molina, Ullman, and Widom:
• Section 14.1: Index-structure Basics
• Section 14.2: B-Trees
• Section 14.3: Hash Tables
• Section 14.7: Bitmap Indexes
• Section 15.8.1: Multipass Sort-Based Algorithms

In the file performingExperiments.sql supplied with this assignment, we have included several PostgreSQL functions that should be useful for running experiments. Of course, you may wish to write your own functions and/or adapt the functions in this .sql file to suit the required experiments for the various problems in this assignment.

The problems that need to be included in assignment6.sql are marked with a blue bullet •. The problems that need to be included in assignment6.pdf are marked with a red bullet •. (You should aim to use LaTeX to construct your .pdf file.) In addition, submit a file assignment6Code.sql that contains all the SQL code that you developed for this assignment. Practice problems should not be submitted.

1 Data generation PostgreSQL functions and clauses

In this assignment there will be a need to do simulations with various-size relations. Many of these relations will have randomized data. PostgreSQL provides useful functions and clauses that make such relations:

random(): returns a random real number between 0 and 1 using the uniform distribution
floor(random() * (u − l + 1) + l)::int: returns a random integer in the range [l, u] using the uniform distribution
generate_series(m, n): generates the set of integers in the range [m, n]
generate_series(m, n, k): generates the set of integers in the range [m, n] separated by step-size k
order by random(): randomly orders the tuples that are the result of a query
limit(n): returns the first n tuples from the result of a query
offset(n): returns all but the first n tuples from the result of a query
limit(m) offset(n): returns the first m tuples from all but the first n tuples from the result of a query
row_number(): a window function that assigns a sequential integer to each row in a result set
vacuum: a garbage collection function to clean and reclaim secondary memory space

For more detail, consult the manual pages:
https://www.postgresql.org/docs/13/functions-math.html
https://www.postgresql.org/docs/current/functions-srf.html
https://www.postgresql.org/docs/current/queries-limit.html
https://www.postgresql.org/docs/8.4/functions-window.html
https://www.postgresql.org/docs/9.5/routine-vacuuming.html
Generating sets
To generate a set, i.e., a unary relation, of n randomly selected integers in the range [l, u], you can use the following function:1

create or replace function SetOfIntegers(n int, l int, u int)
returns table (x int) as
$$
select floor(random() * (u-l+1) + l)::int
from generate_series(1,n);
$$ language sql;

Example 1 To generate a unary relation with 3 randomly selected integers in the range 5 to 10, do the following:

select x from SetOfIntegers(3,5,10);

Of course, running this query multiple times results in different sets.
1 Typically the function SetOfIntegers returns a bag (multiset), but this is fine for this assignment. In case we want a set, we can always eliminate duplicates.

Generating binary relations
The idea behind generating a set can be generalized to the generation of a binary relation.2 To generate a binary relation of n randomly selected pairs of integers (x, y) such that x ∈ [l1, u1] and y ∈ [l2, u2], you can use the following function:

create or replace function BinaryRelationOverIntegers(n int, l_1 int, u_1 int, l_2 int, u_2 int)
returns table (x int, y int) as
$$
select floor(random() * (u_1-l_1+1) + l_1)::int as x,
       floor(random() * (u_2-l_2+1) + l_2)::int as y
from generate_series(1,n);
$$ language sql;

Example 2 To generate a binary relation with 20 randomly selected pairs with first components in the range [3, 8] and second components in the range [2, 11], do the following:

select x, y from BinaryRelationOverIntegers(20,3,8,2,11);

Generating functions
A relation generated by BinaryRelationOverIntegers is in general not a function, since it is possible that the relation has pairs (x, y1) and (x, y2) with y1 ≠ y2. To create a (partial) function f : [l1, u1] → [l2, u2] of n randomly selected function pairs, we can use the following function:

create or replace function FunctionOverIntegers(n int, l_1 int, u_1 int, l_2 int, u_2 int)
returns table (x int, y int) as
$$
select x, floor(random() * (u_2-l_2+1) + l_2)::int as y
from generate_series(l_1,u_1) x
order by random()
limit(n);
$$ language sql;

Example 3 To generate a partial function [1, 20] → [3, 8] of 15 randomly selected function pairs, do the following:3

select x, y from FunctionOverIntegers(15,1,20,3,8);

2 Clearly, all of this can be generalized to higher-arity relations.
3 When using this function, it is customary to use n such that n ∈ [0, u1 − l1 + 1].

Generating relations with categorical (non-numeric) data
Thus far, the sets, binary relations, and functions have all just involved integer ranges. It is possible to include ranges that have different types, including categorical data such as text strings. The technique to accomplish this is to first pair the categorical range with an integer range, associating with each element of the categorical range a unique value of the integer range. The next example illustrates this.
Example 4 Consider the jobSkill relation and assume that its contents are
skill
AI
Databases
Networks
OperatingSystems
Programming
Suppose that we want to generate a personSkill(pid, skill) relation. Let us assume that the pid's are integers in the range [1, m]. There are 5 skills in jobSkill and it is therefore natural to associate with each skill a separate integer (index value) in the range [1, 5].
This can be done with a query involving the row_number() window function:

select skill, row_number() over (order by skill) as index from Skill;

The result is the following relation:
skill | index
AI | 1
Databases | 2
Networks | 3
OperatingSystems | 4
Programming | 5
Using this technique, we can write a PostgreSQL function that generates a personSkill relation with n randomly selected (pid, skill) tuples, with pid's in the range [l, u]:

create or replace function GeneratepersonSkillRelation(n int, l int, u int)
returns table (pid int, skill text) as
$$
with skillNumber(skill, index) as
  (select skill, row_number() over (order by skill) from Skill),
pS as
  (select x, y
   from BinaryRelationOverIntegers(n,l,u,1,(select count(1) from Skill)::int))
select x as pid, skill
from pS join skillNumber on y = index
group by (x, skill)
order by 1,2;
$$ language sql;

In this function, the skillNumber view associates with each job skill an integer index in the range [1, |Skill|]. The pS view is a randomly generated binary relation with n tuples, with pid's in the range [l, u] and skill numbers in the range [1, |Skill|]. The join operation associates the numeric range with the Skill range. The 'group by (x, skill) order by 1,2' clause eliminates duplicate tuples and orders the result. The query

select * from GeneratepersonSkillRelation(10,1,15);

might produce the personSkill relation:
pid | skill
1 | AI
2 | Programming
3 | Databases
4 | Databases
6 | Networks
6 | OperatingSystems
6 | Programming
9 | Databases
14 | Databases
14 | Networks

Problems
We now turn to the problems in this section.
1. • Given a discrete probability mass function P, as specified in a relation P(outcome: int, probability: float), over a range of possible outcomes [l2, u2], design a PostgreSQL function RelationOverProbabilityFunction(n, l1, u1, l2, u2) that generates a relation of up to n pairs (x, y) such that
• x is uniformly selected in the range [l1, u1]; and
• y is selected in accordance with the probability mass function P in the range [l2, u2].
An example of a possible P as stored in relation P is as follows:4
outcome | probability
1 | 0.25
2 | 0.10
3 | 0.40
4 | 0.10
5 | 0.15
4 Notice that the probabilities in the probability column of P sum to 1.
Note that when P is the uniform probability mass function, RelationOverProbabilityFunction and BinaryRelationOverIntegers produce the same kind of binary relation. Hint: For insight into this problem, consult the method of Inverse Transform Sampling for discrete probability mass functions. Test your function for the cases listed in the Assignment-Script-2021-Fall-assignment6.sql file.
2. Practice Problem–not graded. Use the technique in Problem 1 and the method for generating categorical data discussed above to write a PostgreSQL function that generates a personSkill relation, given a probability mass function over the Skill relation. Your function should work for any Skill relation and any probability distribution defined over it. Test your function for the cases listed in the Assignment-Script-2021-Fall-assignment6.sql file.

2 Sorting
We have learned about external sorting. The problems in this section are designed to look into this sorting method as it is implemented in PostgreSQL.
3. • Create successively larger sets of n randomly selected integers in the range [1, n].
You can do this using the following function and with the following experiment.5

create or replace function makeS (n integer) returns void as
$$
begin
  drop table if exists S;
  create table S (x int);
  insert into S select * from SetOfIntegers(n,1,n);
end;
$$ language plpgsql;

This function generates a bag S of size n, with randomly selected integers in the range [1, n]. Now consider the following SQL statements:

select makeS(10);
explain analyze select x from S;
explain analyze select x from S order by 1;

• The 'select makeS(10)' statement makes a bag S with 10 elements;
• The 'explain analyze select x from S' statement provides the query plan and execution time in milliseconds (ms) for a simple scan of S;
• The 'explain analyze select x from S order by 1' statement provides the query plan and execution time in milliseconds (ms) for sorting S.6

QUERY PLAN
------------------------------------------------------------------
Sort (cost=179.78..186.16 rows=2550 width=4) (actual time=0.025..0.026 rows=10 loops=1)
  Sort Key: x
  Sort Method: quicksort  Memory: 25kB
  -> Seq Scan on s (cost=0.00..35.50 rows=2550 width=4) (actual time=0.004..0.005 rows=10 loops=1)
Planning Time: 0.069 ms
Execution Time: 0.034 ms

Now construct the following timing table:7
5 You should make it a habit to use the PostgreSQL vacuum command to perform garbage collection between experiments.
6 Recall that 1 ms is 1/1000 of a second.
7 It is possible that you may not be able to run the experiments for the largest S. If that is the case, just report the results for the smaller sizes.

size n of relation S | avg execution time to scan S (in ms) | avg execution time to sort S (in ms)
10^1 | |
10^2 | |
10^3 | |
10^4 | |
10^5 | |
10^6 | |
10^7 | |
10^8 | |

(a) What are your observations about the query plans for the scanning and sorting of such differently sized bags S?
(b) What do you observe about the execution time to sort S as a function of n?
(c) Does this conform with the formal time complexity of (external) sorting? Explain.
(d) It is possible to set the working memory of PostgreSQL using the set work_mem command. For example, set work_mem = '16MB' sets the working memory to 16MB.8 The smallest possible working memory in PostgreSQL is 64kB and the largest depends on your computer's memory size, but you can try, for example, 1GB. Repeat question 3a for memory sizes 64kB and 1GB and report your observations.
(e) Now create a relation indexedS(x integer), create a B-tree index on indexedS, and insert into indexedS the sorted relation S.9

create table indexedS (x integer);
create index on indexedS using btree (x);
insert into indexedS select x from S order by 1;

Then run the range query

select x from indexedS where x between 1 and n;

where n denotes the size of S. Then construct the following table, which contains (1) the average execution time to build the btree index and (2) the average time to run the range query.
8 The default working memory size is 4MB.
9 For information about indexes in PostgreSQL consult the manual page https://www.postgresql.org/docs/13/indexes.html.

size n of relation S | avg execution time to create index indexedS (in ms) | avg execution time to run the range query (in ms)
10^1 | |
10^2 | |
10^3 | |
10^4 | |
10^5 | |
10^6 | |
10^7 | |
10^8 | |

What are your observations about the query plans and execution times to create indexedS and the execution times for the range queries on the differently sized bags indexedS? Compare your answers with those for the above sorting problems.
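As a sketch of the experimental workflow for question 3d, one might toggle work_mem and re-run the sort, watching the Sort Method line of the plan change; the exact plan output will differ per machine, and the relation size below is just an example.

-- Build a bag of one million integers, then sort it under a small and a
-- large working memory. With 64kB the plan typically reports
-- 'Sort Method: external merge  Disk: ...'; with 1GB it typically
-- reports an in-memory quicksort.
select makeS(1000000);
vacuum;

set work_mem = '64kB';
explain analyze select x from S order by 1;

set work_mem = '1GB';
explain analyze select x from S order by 1;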
4. Practice problem–not graded. Typically, the makeS function returns a bag instead of a set. In the problems below, you are to conduct experiments to measure the execution times to eliminate duplicates.
(a) Write a SQL query that uses the DISTINCT clause to eliminate duplicates in S and report your results in a table such as that in Problem 3a.
(b) Write a SQL query that uses the GROUP BY clause to eliminate duplicates in S and report your results in a table such as that in Problem 3a.
(c) Compare and contrast the results you obtained in problems 4a and 4b. Again, consider using explain analyze to look at query plans.

3 Indexes and Indexing
Indexes on data (1) permit faster lookup of data items and (2) may speed up query processing on such data. Speedups can be substantial. The purpose of the problems in this section is to explore this. Some other problems in this section are designed to build an understanding of the workings of the B+-tree and extensible hashing data structures.
Discussion
PostgreSQL permits the creation of a variety of indexes on tables. We will review such index creation and examine its impact on data lookup and query processing. For more details, see the PostgreSQL manual:
https://www.postgresql.org/docs/13/indexes.html
Example 5 The following SQL statements create indexes on columns or combinations of columns of the personSkill relation.10 Notice that there are 2^arity(personSkill) − 1 = 2^2 − 1 = 3 such possible indexes.

create index pid_index on personSkill (pid); -- index on pid attribute
create index skill_index on personSkill (skill); -- index on skill attribute
create index pid_skill_index on personSkill (pid,skill); -- index on (pid, skill)

Example 6 It is possible to declare the type of index: btree or hash. When no index type is specified, the default is btree. If instead of a B-tree a hash index is desired, then it is necessary to specify a using hash qualifier:

create index pid_hash on personSkill using hash (pid); -- hash index on pid attribute

Example 7 It is possible to create an index on a relation based on a scalar expression or a function defined over the attributes of that relation. Consider the following (immutable) function which computes the number of skills of a person:

create or replace function numberOfSkills(p integer) returns integer as
$$
select count(1)::int from personSkill where pid = p;
$$ language SQL immutable;

10 Incidentally, when a primary key is declared when a relation is created, PostgreSQL will create a btree index on this key for the relation.
Then the following is an index defined on the numberOfSkills values of persons:

create index numberOfSkills_index on personSkill (numberOfSkills(pid));

Such an index is useful for queries that use this function, such as

select pid, skill from personSkill where numberOfSkills(pid) > 2;

We now turn to the problems in this section.
5. • Consider a relation Student(sid text, sname text, major, byear). A tuple (s, n, m, y) is in Student when s is the sid of a student and n, m, and y are that student's name, major, and birth year. Further, consider a relation Enroll(sid text, cno text, grade text). A triple (s, c, g) is in Enroll when the student with sid s was enrolled in the course with cno c and obtained the letter grade g in that course. We are interested in answering queries of the form

select sid, sname, major, byear
from Student
where sid in (select sid
              from Enroll
              where cno = c [and|or|and not] grade = g);

Here c denotes a course name and g denotes a letter grade.
Read Section 14.1.7 'Indirection in Secondary Indexes' in your textbook Database Systems: The Complete Book by Garcia-Molina, Ullman, and Widom. Of particular interest are (a) the concept of buckets (Figure 14.7), which are sets of tids, and (b) the technique of performing set operations (like intersections) on relevant buckets (Figure 14.8) to answer queries of the form shown above. The goal of this problem is to use object-relational SQL to simulate these concepts. To make things more concrete, consider the following Student and Enroll relations:
Student:
sid | sname | major | byear
s100 | Eric | CS | 1988
s101 | Nick | Math | 1991
s102 | Chris | Biology | 1977
s103 | Dinska | CS | 1978
s104 | Zanna | Math | 2001
s105 | Vince | CS | 2001
Enroll:
sid | cno | grade
s100 | c200 | A
s100 | c201 | B
s100 | c202 | A
s101 | c200 | B
s101 | c201 | A
s101 | c202 | A
s101 | c301 | C
s101 | c302 | A
s102 | c200 | B
s102 | c202 | A
s102 | c301 | B
s102 | c302 | A
s103 | c201 | A
s104 | c201 | D
Now consider associating a tuple id (tid) with each tuple in Enroll:
tid | sid | cno | grade
1 | s100 | c200 | A
2 | s100 | c201 | B
3 | s100 | c202 | A
4 | s101 | c200 | B
5 | s101 | c201 | A
6 | s101 | c202 | A
7 | s101 | c301 | C
8 | s101 | c302 | A
9 | s102 | c200 | B
10 | s102 | c202 | A
11 | s102 | c301 | B
12 | s102 | c302 | A
13 | s103 | c201 | A
14 | s104 | c201 | D
Use object-relational SQL to construct two secondary indexes indexOnCno and indexOnGrade on the Enroll relation. These indexes should be stored in two separate relations, which you can conveniently call indexOnCno and indexOnGrade, respectively, defined as two object-relational views in a manner that simulates the situation illustrated in Figure 14.8. In particular, do not use the 'create index' mechanism of SQL to construct these indexes. Then, using the indexOnCno and indexOnGrade views and the technique of intersecting buckets, write a function FindStudents(booleanOperation text, cno text, grade text) that can be used to answer queries of the form shown above. (Here booleanOperation is a string which can be 'and', 'or', or 'and not'.) For example, the query

select * from FindStudents('and', 'c202', 'A');

should return the same result as that of the query

select sid, sname, major, byear
from Student
where sid in (select sid from Enroll where cno = 'c202' and grade = 'A');

Test your solution for the following cases on the Student and Enroll relations given for this problem:
(a) select * from FindStudents('and', 'c202', 'A');
(b) select * from FindStudents('or', 'c202', 'A');
(c) select * from FindStudents('and not', 'c202', 'A');
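One way to materialize such tids, sketched here under the assumption that ordering by (sid, cno) reproduces the numbering shown above, is to copy Enroll into a new table with a row_number() column; the table name EnrollWithTid is illustrative and not part of the problem statement.

-- Hypothetical helper table: Enroll extended with a tid column.
-- The order by clause fixes which tuple receives which tid.
create table EnrollWithTid as
select row_number() over (order by sid, cno) as tid,
       sid, cno, grade
from Enroll;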
6. Practice problem–not graded. Read Section 14.7 'Bitmap Indexes' in your textbook Database Systems: The Complete Book by Garcia-Molina, Ullman, and Widom. In particular, look at Example 14.39 for an example of a bitmap index for a secondary index. Next, revisit Problem 5. There, we considered two secondary indexes indexOnCno and indexOnGrade. We will now consider the corresponding bitmap indexes bitmapIndexOnCno and bitmapIndexOnGrade:
bitmapIndexOnCno
cno | bit-vector
c200 | 10010000100000
c201 | 01001000000011
c202 | 00100100010000
c301 | 00000010001000
c302 | 00000001000100
and bitmapIndexOnGrade
grade | bit-vector
A | 10101101010110
B | 01010000101000
C | 00000010000000
D | 00000000000001
Use object-relational SQL to construct the two secondary indexes bitmapIndexOnCno and bitmapIndexOnGrade as two object-relational relations in a manner that simulates the situation just illustrated above. Then, using the bitmapIndexOnCno and bitmapIndexOnGrade relations and the technique of forming the bitmap-AND, bitmap-OR, and bitmap-AND NOT of two bit-vectors, write a function FindStudents(booleanOperation text, cno text, grade text) that can be used to answer queries of the form shown in Problem 5. For example, the query

select * from FindStudents('and', 'c202', 'A');

should return the same result as that of the query

select sid, sname, major, byear
from Student
where sid in (select sid from Enroll where cno = 'c202' and grade = 'A');

Test your solution for the following cases on the Student and Enroll relations given for this problem:
(a) select * from FindStudents('and', 'c202', 'A');
(b) select * from FindStudents('or', 'c202', 'A');
(c) select * from FindStudents('and not', 'c202', 'A');
7. • Consider the following parameters:
block size = 4096 bytes
block-address size = 9 bytes
block access time (I/O operation) = 10 ms
record size = 150 bytes
record primary-key size = 8 bytes
Assume that there is a B+-tree, adhering to these parameters, that indexes 1 billion (10^9) records on their primary-key values. Provide answers to the following problems and show the intermediate computations leading to your answers.
(a) Specify (in ms) the minimum time to determine if there is a record with key k in the B+-tree. (In this problem, you cannot assume that a key value that appears in a non-leaf node has a corresponding record with that key value.)
(b) Specify (in ms) the maximum time to insert a record with key k in the B+-tree, assuming that this record was not already in the data file.
(c) How large must main memory be to hold the first two levels of the B+-tree? How about the first three levels?
8. • Consider the following B+-tree of order 2 that holds records with keys 2, 8, and 11.11
(a) Show the contents of your B+-tree after inserting records with keys 4, 10, 12, 1, 7, and 5, in that order. Strategy for splitting leaf nodes: when a leaf node n needs to be split into two nodes n1 and n2 (where n1 will point to n2), use the rule that an even number of keys in n is moved into n1 and an odd number of keys is moved into n2. So if n conceptually becomes k1|k2|k3, then n1 should be k1|k2 and n2 should be k3|, with n1 → n2.
(b) Starting from your answer to question 8a, show the contents of your B+-tree after deleting records with keys 12, 2, and 11, in that order.
(c) Starting from your answer to question 8b, show the contents of your B+-tree after deleting records with keys 5, 1, and 4, in that order.
11 Recall that (a) an internal node of a B+-tree of order 2 can have either 1 or 2 key values, and 2 or 3 sub-trees, and (b) a leaf node can have either 1 or 2 key values.
9. • Consider an extensible hashing data structure wherein (1) the initial global depth is set at 1 and (2) all directory pointers point to the same empty bucket, which has local depth 0.
So the hashing structure looks like this: Assume that a bucket can hold at most two records.
(a) Show the state of the hash data structure after each of the following insert sequences:12
i. records with keys 6 and 7.
ii. records with keys 1 and 2.
iii. records with keys 9 and 4.
iv. records with keys 8 and 3.
(b) Starting from the answer you obtained for question 9a, show the state of the hash data structure after each of the following delete sequences:
i. records with keys 9 and 6.
ii. records with keys 0 and 1.
iii. records with keys 1 and 8.
As in the textbook, the bit strings are interpreted starting from their left-most bit and continuing to the right through the subsequent bits.
12 You should interpret the key values as bit strings of length 4. So, for example, key value 7 is represented as the bit string 0111 and key value 2 is represented as the bit string 0010.
The goal in the next problems is to study the behavior of indexes for various different-sized instances13 of the Person, worksFor, and Knows relations and for various queries:
10. • Create an appropriate index on the worksFor relation that speeds up the lookup query
select pid from worksFor where cname = c;
Here c is a company name. Illustrate this speedup by finding the execution times for this query for various sizes of the worksFor relation.
11. • Create an appropriate index on the worksFor relation that speeds up the range query
select pid, cname from worksFor where s1
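As a sketch of how the speedups in Problems 10 and 11 can be measured, one might compare explain analyze timings before and after creating an index; the example below reuses the personSkill relation and the skill_index of Example 5 as a stand-in rather than the worksFor index you are asked to design.

-- Time the lookup without an index...
drop index if exists skill_index;
vacuum;
explain analyze select pid from personSkill where skill = 'Databases';

-- ...then create the index and time the same lookup again.
create index skill_index on personSkill (skill);
explain analyze select pid from personSkill where skill = 'Databases';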
For this assignment, you will need the material covered in the lectures
• Lecture 13: Object-relational databases and queries
• Lecture 14: Nested relational and semi-structured databases
To turn in your assignment, you will need to upload to Canvas a single file with name assignment5.sql which contains the necessary SQL statements that solve the problems in this assignment. The assignment5.sql file must be such that the AIs can run it in their PostgreSQL environment. You should use the Assignment-Script-2021-Fall-assignment5.sql file to construct the assignment5.sql file. (Note that the data to be used for this assignment is included in this file.) In addition, you will need to upload a separate assignment5.txt file that contains the results of running your queries. You will also see several problems that are listed as practice problems. You should not include your solutions for practice problems in the materials you submit for this assignment.

1 Formulating Queries in Object-Relational SQL
For the problems in this section, you will need to use the polymorphically defined functions and predicates that are defined in the document SetOperationsAndPredicates.sql.
Functions
set_union(A,B): A ∪ B
set_intersection(A,B): A ∩ B
set_difference(A,B): A − B
add_element(x,A): {x} ∪ A
remove_element(x,A): A − {x}
make_singleton(x): {x}
choose_element(A): choose some element from A
bag_union(A,B): the bag union of A and B
bag_to_set(A): coerce the bag A to the corresponding set
Predicates
is_in(x,A): x ∈ A
is_not_in(x,A): x ∉ A
is_empty(A): A = ∅
is_not_emptyset(A): A ≠ ∅
subset(A,B): A ⊆ B
superset(A,B): A ⊇ B
equal(A,B): A = B
overlap(A,B): A ∩ B ≠ ∅
disjoint(A,B): A ∩ B = ∅
We now turn to the problems in this section. You will need to use the data provided for the Person, Company, companyLocation, worksFor, jobSkill, personSkill, and Knows relations. But before turning to the problems, we will introduce various object-relational views defined over these relations:1
• The view companyHasEmployees(cname,employees), which associates with each company, identified by a cname, the set of pids of persons who work for that company.

create or replace view companyHasEmployees as
select cname,
       array(select pid from worksfor w where w.cname = c.cname order by 1) as employees
from company c
order by 1;

1 The various order by clauses in these views are not essential: they simply make the data easier to read.
• The view cityHasCompanies(city,companies), which associates with each city the set of cnames of companies that are located in that city.

create or replace view cityHasCompanies as
select city, array_agg(cname order by 1) as companies
from companyLocation
group by city
order by 1;

• The view companyHasLocations(cname,locations), which associates with each company, identified by a cname, the set of cities in which that company is located.

create or replace view companyHasLocations as
select cname,
       array(select city from companyLocation cc where c.cname = cc.cname order by 1) as locations
from company c
order by 1;

• The view knowsPersons(pid,persons), which associates with each person, identified by a pid, the set of pids of persons he or she knows.

create or replace view knowsPersons as
select p.pid,
       array(select k.pid2 from knows k where k.pid1 = p.pid order by pid2) as persons
from person p
order by 1;

• The view isKnownByPersons(pid,persons), which associates with each person, identified by a pid, the set of pids of persons who know that person. Observe that there may be persons who are not known by anyone.
create or replace view isKnownByPersons as
select distinct p.pid,
       array(select k.pid1 from knows k where k.pid2 = p.pid) as persons
from person p
order by 1;

• The view personHasSkills(pid,skills), which associates with each person, identified by a pid, his or her set of job skills.

create or replace view personHasSkills as
select distinct p.pid,
       array(select s.skill from personSkill s where s.pid = p.pid order by 1) as skills
from person p
order by 1;

• The view skillOfPersons(skill,persons), which associates with each job skill the set of pids of persons who have that job skill.

create or replace view skillOfPersons as
select js.skill,
       array(select ps.pid from personSkill ps where ps.skill = js.skill order by pid) as persons
from jobSkill js
order by skill;

In the problems in this section, you are asked to formulate queries in object-relational SQL. You should use the set operations and set predicates defined in the document SetOperationsAndPredicates.sql, the relations Person, Company, Skill, and worksFor, and the views companyHasEmployees, cityHasCompanies, companyHasLocations, knowsPersons, isKnownByPersons, personHasSkills, and skillOfPersons. However, you are not permitted to use the Knows, companyLocation, and personSkill relations in the object-relational SQL formulation of the queries. Observe that you actually don't need these relations, since they are encapsulated in these views. Before listing the queries that you are asked to formulate, we present some examples of queries that are formulated in object-relational SQL using the assumptions stated in the previous paragraph. Your solutions need to be in the style of these examples. The goal is to maximize the use of the functions and predicates defined in the document SetOperationsAndPredicates.sql.
Example 1 Consider the query "Find the pid of each person who knows a person who has a salary greater than 55000."2

select distinct pk.pid
from knowsPersons pk, worksfor w
where is_in(w.pid, pk.persons) and w.salary > 55000
order by 1;

Note that the following formulation of this query is not allowed, since it uses the relation Knows, which is not permitted.
2 In this example, focus on the is_in predicate.

select distinct k.pid1
from knows k, worksfor w
where k.pid2 = w.pid and w.salary > 55000;

Example 2 Consider the query "Find the pid and name of each person p who (1) has both the AI and Programming skills and (2) knows at least 5 persons, and report the number of persons who know p."3

select p.pid, p.pname,
       (select cardinality(kp.persons)
        from isKnownByPersons kp
        where kp.pid = p.pid) as ct_knownByPersons
from person p
where p.pid in (select ps.pid
                from personHasSkills ps
                where subset('{"AI", "Programming"}', ps.skills))
and cardinality((select kp.persons from knowsPersons kp where kp.pid = p.pid)) >= 5;

Example 3 Consider the query "Find the pid and name of each person along with the set of his or her skills that are not among the skills of persons who work for 'Netflix'."4

select p.pid, p.pname,
       set_difference((select ps.skills from personHasSkills ps where ps.pid = p.pid),
                      array(select unnest(ps.skills)
                            from personHasSkills ps
                            where is_in(ps.pid, (select employees
                                                 from companyHasEmployees
                                                 where cname = 'Netflix'))))
from person p;

1. Formulate the following queries in object-relational SQL.
(a) Find the cname and headquarter of each company that employs at least two persons who each have both the AI and the Programming job skills.
(b) Find each skill that is not a job skill of any person who works for Yahoo or for Netflix.
(c) Find the set of companies that employ at least 3 persons who each know at least five persons. (So this query returns only one object, i.e., the set of companies specified in the query.)
(d) Find the pid and name of each person p along with the set of pids of persons who (1) know p and (2) have the AI skill but not the Programming skill.
3 In this example, focus on the set (array) construction '{"AI", "Programming"}' and the subset predicate. Also focus on the use of the cardinality function.
4 In this example, focus on (1) the set_difference operation and (2) the unnest operation followed by a set (array) construction.
(e) Find each pair (s1, s2) of different skills s1 and s2 such that the number of employees who have skill s1 and who make strictly more than 55000 is strictly less than the number of employees who have skill s2 and who make at most 55000.
(f) (Practice Problem: not graded). Find each (c, p) pair where c is the cname of a company and p is the pid of a person who works for that company and who is known by all other persons who work for that company.
(g) (Practice Problem: not graded). Find the pid and name of each person who has all the skills of the combined set of job skills of the highest-paid persons who work for Yahoo.
2. Find the following set of sets {S | S ⊆ Skill ∧ |S| ≤ 3}. I.e., this is the set consisting of each set of job skills whose size (cardinality) is at most 3.
3. (Practice Problem: not graded). Reconsider Problem 2. Let S = {S | S ⊆ Skill ∧ |S| ≤ 3}. Find the following set of sets {X | X ⊆ S ∧ |X| ≤ 2}.
4. Let t be a number called a threshold. We say that an (unordered) pair of different person pids {p1, p2} co-occurs with frequency at least t if there are at least t skills that are skills of both the person with pid p1 and the person with pid p2. Write a function coOccur(t integer) that returns the (unordered) pairs {p1, p2} of person pids that co-occur with frequency at least t. Test your coOccur function for t in the range [0, 3].
5. Let A and B be sets such that A ∪ B ≠ ∅. The Jaccard index J(A, B) is defined as the quantity |A ∩ B| / |A ∪ B|. The Jaccard index is a frequently used measure to determine the similarity between two sets. Note that if A ∩ B = ∅ then J(A, B) = 0, and if A = B then J(A, B) = 1. Let t be a number called a threshold. We assume that t is a float in the range [0, 1]. Write a function JaccardSimilar(t float) that returns the set of unordered pairs {s1, s2} of different skills such that the set of persons who have skill s1 and the set of persons who have skill s2 have a Jaccard index of at least t. Test your function JaccardSimilar for the following values for t: 0, 0.25, 0.5, 0.75, and 1.

2 Nested Relations and Semi-structured databases
Consider the lecture on Nested relational and semi-structured databases. In that lecture we considered the studentGrades nested relation and the jstudentGrades semi-structured database, and we constructed these using a PostgreSQL query starting from the Enroll relation.
6. Write a PostgreSQL view courseGrades that creates the nested relation of type
(cno, gradeInfo{(grade, students{(sid)})})
This view should compute, for each course, the grade information of the students enrolled in that course. In particular, for each course and for each grade, this relation stores in a set the sids of the students who obtained that grade in that course. Test your view.
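As a generic illustration of how a two-level nested structure can be assembled in PostgreSQL (this is not the courseGrades view itself), the following sketch nests a hypothetical toy relation Sales(region, product) twice: the inner level groups rows into (product, count) pairs via a composite type, and the outer level collects those pairs per region. All names here are assumptions made for the example.

-- Hypothetical toy relation and a composite type for the inner level.
create table Sales (region text, product text);
create type productGroup as (product text, cnt bigint);

-- Inner grouping produces one (product, cnt) pair per (region, product);
-- outer grouping collects those pairs into an array per region.
select region,
       array_agg((product, cnt)::productGroup) as productInfo
from (select region, product, count(1) as cnt
      from Sales
      group by region, product) perProduct
group by region;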
7. Starting from the courseGrades view in Problem 6, solve the following queries:
(a) Find each pair (c, S) where c is the cno of a course and S is the set of sids of students who received an 'A' or a 'B' in course c. The type of your answer relation should be (cno : text, Students : {(sid : text)}).
(b) Find each pair (s, C) where s is the sid of a student and C is the set of cnos of courses in which the student received an 'A'. The type of your answer relation should be (sid : text, Courses : {(cno : text)}).
(c) (Practice Problem: not graded). Find each cno c where c is a course in which all students received the same grade.
8. Write a PostgreSQL view jcourseGrades that creates a semi-structured database which stores jsonb objects whose structure conforms with the structure of tuples as described for courseGrades in Problem 6. Test your view.
9. Starting from the jcourseGrades view in Problem 8, solve the following queries. Note that the output of each of these queries is a nested relation.
(a) Find each pair (c, s) where c is the cno of a course and s is the sid of a student who did not receive an 'A' in course c. The type of your answer relation should be (cno : text, sid : text).
(b) Find each pair ({c1, c2}, S) where c1 and c2 are the course numbers of two different courses and S is the set of sids of students who received a 'B' in both courses c1 and c2. The type of your answer relation should be (coursePair : {(cno : text)}, Students : {(sid : text)}).
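Since Problems 8 and 9 work with jsonb objects, the following free-standing snippet shows the two jsonb operations one typically needs: field access with -> and ->> and array unnesting with jsonb_array_elements. The literal below is merely shaped after the type described in Problem 6; it is an illustration, not the jcourseGrades view.

-- A hand-written jsonb value shaped like one course with grade groups.
with doc as (
  select '{"cno":"c200",
           "gradeInfo":[{"grade":"A","students":[{"sid":"s100"}]},
                        {"grade":"B","students":[{"sid":"s101"},{"sid":"s102"}]}]}'::jsonb as j
)
-- Unnest the gradeInfo array, then the students array inside each group.
select j ->> 'cno' as cno,
       g ->> 'grade' as grade,
       s ->> 'sid' as sid
from doc,
     jsonb_array_elements(j -> 'gradeInfo') as g,
     jsonb_array_elements(g -> 'students') as s;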
This assignment relies on the lectures
• Functions and expressions in SQL;
• Aggregate functions and partitioning; and
• Triggers.
To turn in your assignment, you will need to upload to Canvas a single file with name assignment4.sql which contains the necessary SQL statements that solve the problems in this assignment. The assignment4.sql file must be such that the AIs can run it in their PostgreSQL environment. You should use the Assignment-Script-2021-Fall-assignment4.sql file to construct the assignment4.sql file. (Note that the data to be used for this assignment is included in this file.) In addition, you will need to upload a separate assignment4.txt file that contains the results of running your queries. You will also see several problems that are listed as practice problems. You should not include your solutions for such problems in the materials you submit for this assignment. Only solutions for problems 1 through 10 need to be submitted.

Database schema and instances
For the problems in this assignment we will use the following database schema:1
Person(pid, pname, city)
Company(cname, headquarter)
Skill(skill)
worksFor(pid, cname, salary)
companyLocation(cname, city)
personSkill(pid, skill)
hasManager(eid, mid)
Knows(pid1, pid2)
In this database we maintain a set of persons (Person), a set of companies (Company), and a set of (job) skills (Skill). The pname attribute in Person is the name of the person. The city attribute in Person specifies the city in which the person lives. The cname attribute in Company is the name of the company. The headquarter attribute in Company is the name of the city wherein the company has its headquarters. The skill attribute in Skill is the name of a (job) skill. A person can work for at most one company. This information is maintained in the worksFor relation. (We permit that a person does not work for any company.) The salary attribute in worksFor specifies the salary made by the person. The city attribute in companyLocation indicates a city in which the company is located. (Companies may be located in multiple cities.) A person can have multiple job skills. This information is maintained in the personSkill relation. A job skill can be the job skill of multiple persons. (A person may not have any job skills, and a job skill may have no persons with that skill.) A pair (e, m) in hasManager indicates that person e has person m as one of his or her managers. We permit that an employee has multiple managers and that a manager may manage multiple employees. (It is possible that an employee has no manager and that an employee is not a manager.) We further require that an employee and his or her managers must work for the same company. The relation Knows maintains a set of pairs (p1, p2) where p1 and p2 are pids of persons. The pair (p1, p2) indicates that the person with pid p1 knows the person with pid p2. We do not assume that the relation Knows is symmetric: it is possible that (p1, p2) is in the relation but that (p2, p1) is not. The domain for the attributes pid, pid1, pid2, salary, eid, and mid is integer. The domain for all other attributes is text.
1 The primary key, which may consist of one or more attributes, of each of these relations is underlined.
We assume the following foreign key constraints:
• pid is a foreign key in worksFor referencing the primary key pid in Person;
• cname is a foreign key in worksFor referencing the primary key cname in Company;
• cname is a foreign key in companyLocation referencing the primary key cname in Company;
• pid is a foreign key in personSkill referencing the primary key pid in Person;
• skill is a foreign key in personSkill referencing the primary key skill in Skill;
• eid is a foreign key in hasManager referencing the primary key pid in Person;
• mid is a foreign key in hasManager referencing the primary key pid in Person;
• pid1 is a foreign key in Knows referencing the primary key pid in Person; and
• pid2 is a foreign key in Knows referencing the primary key pid in Person.

Pure SQL and RA SQL
In this assignment, we distinguish between Pure SQL and RA SQL. Below we list the only features that are allowed in Pure SQL and in RA SQL. In particular, notice that
• join, NATURAL join, and CROSS join are not allowed in Pure SQL.
• The predicates [not] IN, SOME, ALL, and [not] exists are not allowed in RA SQL.
The only features allowed in Pure SQL:
select … from … where
WITH …
union, intersect, except operations
exists and not exists predicates
IN and not IN predicates
ALL and SOME predicates
VIEWs that can only use the above Pure SQL features
The only features allowed in RA SQL:
select … from … where
WITH …
union, intersect, except operations
join … ON …, natural join, and CROSS join operations
VIEWs that can only use the above RA SQL features
commas in the from clause are not allowed
Full SQL:
all the features of Pure SQL and RA SQL
user-defined functions
aggregate functions
group by … having …

1 Solving queries using aggregate functions
Formulate the following queries in SQL. You should use aggregate functions to solve these queries. You can use views, temporary views, parameterized views, and user-defined functions.
1. Find each pair (c, n) where c is the cname of a company that pays an average salary between 50000 and 55000 and where n is the number of employees who work for company c.
2. Find the pid and name of each person who lacks at least 4 job skills and who knows at least 4 persons.
3. Find the pid and name of each person who has fewer than 2 of the combined set of job skills of persons who work for Google. By combined set of job skills we mean the set {s | s is a job skill of an employee of Google}.
4. Find the cname of each company that employs at least 3 persons and that pays the lowest average salary among such companies.
5. Find each pair (c1, c2) of different company cnames such that, among all companies, company c2 pays the closest average salary compared to the average salary paid by company c1.
6. Without using set predicates, find each pid of a person who does not know each person who (1) works for Apple and (2) makes less than 55000.
7. Without using set predicates, find each pair (s1, s2) of skills such that the set of persons with skill s1 is the same as the set of persons with skill s2.
8. Find each triple (s1, s2, n) of different skills s1 and s2 and a number n such that (1) the number of persons with skill s1 is the same as the number of persons with skill s2 and (2) n is the number of such persons associated with s1 and s2.
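As a reminder of the Pure SQL versus RA SQL split defined above, here is one query, the pids of persons who have at least one job skill, written both ways; the query itself is illustrative and is not one of the graded problems.

-- Pure SQL style: membership is tested with an IN predicate.
select p.pid
from Person p
where p.pid in (select ps.pid from personSkill ps);

-- RA SQL style: the same query written with an explicit join
-- (commas in the from clause are avoided; distinct removes duplicates
-- introduced by persons with several skills).
select distinct p.pid
from Person p join personSkill ps on p.pid = ps.pid;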
9.
(a) Using the GROUP BY counting method, define a function

create or replace function numberOfSkills(c text)
returns table (pid integer, salary int, numberOfSkills bigint) as
$$ … $$ language sql;

that returns, for a company identified by its cname, each triple (p, s, n) where (1) p is the pid of a person who is employed by that company, (2) s is the salary of p, and (3) n is the number of job skills of p. (Note that a person may not have any job skills.)
(b) Test your function for Problem 9a for the companies Apple, Amazon, and ACM.
(c) Write the same function numberOfSkills as in Problem 9a, but this time without using the GROUP BY clause.
(d) Test your function for Problem 9c for the companies Apple, Amazon, and ACM.
(e) Using the function numberOfSkills, but without using set predicates, write the following query: "Find each pair (c, p) where c is the name of a company and where p is the pid of a person who (1) works for company c, (2) makes more than 50000, and (3) has the most job skills among all the employees who work for company c."

2 Introduction to Dynamic SQL
The examples in this section introduce you to Dynamic SQL. Dynamic SQL is a powerful generalization of SQL: it permits writing programs that generate queries, and furthermore, these queries can subsequently be evaluated in the programs that generated them. Intuitively, in a dynamic program, a string is constructed that represents a SQL query. When an execute statement in the dynamic program is then applied to that string, the corresponding query is evaluated. We will consider a simple example illustrating Dynamic SQL. (For more information about Dynamic SQL, we refer to the PostgreSQL manual under the topic of Dynamic SQL.)
Example 1 Assume that there are 3 unary relations
P(p : boolean)  Q(q : boolean)  R(r : boolean)
You should think of P, Q, and R as boolean variables that may be true or false. This situation is set up in SQL as follows:

create table P (p boolean);
create table Q (q boolean);
create table R (r boolean);
insert into P values (true), (false);
insert into Q values (true), (false);
insert into R values (true), (false);

So we have the following situation:
P(p): t, f   Q(q): t, f   R(r): t, f
Next, we consider the set of boolean propositions (propositions, for short) over these variables P, Q, and R. An inductive definition of these propositions is as follows:2
2 We have used parentheses around the boolean connectives 'or', 'and', and 'not' to remove issues related to the order of precedence for these connectives. We could have omitted such parentheses, but then we would need to impose a precedence order. Typically, this is such that 'not' has higher precedence than 'or' and 'and', and 'and' has a higher precedence than 'or'.
• P, Q, and R are propositions.
• If E and F are propositions then (E or F) is a proposition.
• If E and F are propositions then (E and F) is a proposition.
• If E is a proposition then (not E) is a proposition.
• If E is a proposition then (E) is a proposition.
Here are some examples of propositions over P, Q, and R:
P
Q
R
(not P)
(not (P))
(P or R)
(P and P)
(P and (not (R and Q)))
We will now consider the truth table of a proposition. (For the concept of the truth table of a proposition, we refer to the document Propositional Logic in the course module Notes on Discrete Mathematics.)
For example, the truth table for the proposition (P or R) is as follows:3
p | q | r | truthValue
t | t | t | t
t | t | f | t
t | f | t | t
t | f | f | t
f | t | t | t
f | t | f | f
f | f | t | t
f | f | f | f
and that for (P and (not (R and Q))) is
3 Notice that we have 3 propositional variables in play, i.e., P, Q, and R, and therefore each truth table will have 2^3 = 8 truth assignments. This is even the case when the proposition does not mention some of these variables.
p | q | r | truthValue
t | t | t | f
t | t | f | t
t | f | t | t
t | f | f | t
f | t | t | f
f | t | f | f
f | f | t | f
f | f | f | f
It is possible to generate the truth table of a proposition in SQL. For example, for the proposition '(P and (not (R and Q)))' we can write the SQL query (let's call it Q1)

select p, q, r, (P and (not (R and Q))) from P, Q, R;

The problem with this approach is that to determine the truth table for another proposition such as '(not (Q and (not P)))', we would need to write a different query (let's call it Q2)

select p, q, r, (not (Q and (not P))) from P, Q, R;

even though Q1 and Q2 are identical except for the proposition in the select clause. A more general approach to generate the truth table of an arbitrary proposition is to use Dynamic SQL. Here is a user-defined function, written in Dynamic SQL, that facilitates this:

create or replace function truthTable(proposition text)
returns table(p boolean, q boolean, r boolean, truthValue boolean) as
$$
begin
  return query execute 'select p, q, r, ' || proposition || ' from P, Q, R;';
end;
$$ language plpgsql;

To generate the truth table for the proposition (P and (not (R and Q))), we would call the truthTable function with the string representation '(P and (not (R and Q)))' of this proposition as follows:

select * from truthTable('(P and (not (R and Q)))');

The critical component of the Dynamic SQL code for the function truthTable is the statement

return query execute 'select p, q, r, ' || proposition || ' from P, Q, R;';

What happens in this statement is that we build a string that represents an SQL query by concatenating the string 'select p, q, r, ' with the string that is passed into the truthTable function's proposition parameter (in this case the string '(P and (not (R and Q)))') and, finally, with the string ' from P, Q, R;'. This concatenation of strings is accomplished using the string concatenation operator ||. As such we get the string

'select p, q, r, (P and (not (R and Q))) from P, Q, R;'

This is a string that represents a valid SQL query which we can then evaluate with the execute operation. The rest of the code, i.e., the return query statement, then returns the result of this evaluation, i.e., the truth table of the proposition (P and (not (R and Q))). Now that we have the truthTable function, we will consider in the next example the problem of verifying whether two Propositional Logic propositions are logically equivalent.
Example 2 Consider two Propositional Logic propositions E and F. We say that E and F are logically equivalent if their respective truth tables are the same. We will write a boolean user-defined SQL function

logicallyEquivalent(E text, F text) returns boolean

which takes as arguments two strings that represent the propositions E and F, respectively, and that returns 'true' if E and F are logically equivalent, and 'false' otherwise. The code for this function relies on the following insight. Let TE and TF be the truth tables of E and F, respectively. Then, by definition, E and F are logically equivalent if and only if TE = TF. The latter set condition is equivalent to the condition (TE − TF) = ∅ and (TF − TE) = ∅.
Here is the function:

create or replace function logicallyEquivalent(E text, F text) returns boolean as
$$
select not exists (select truthTable(E) except (select truthTable(F)))
   and not exists (select truthTable(F) except (select truthTable(E)));
$$ language sql;

3 Operations on polynomials and matrices using SQL aggregate functions
In the problems in this section, you will practice working with aggregate functions. A useful side benefit of solving these problems is that you will learn how relations can be used to represent polynomials and matrices, and how SQL can be used to define operations on such objects.
10. Let P(x) and Q(x) be two polynomials with integer coefficients. Let P(coefficient, degree) and Q(coefficient, degree) be two binary relations representing P(x) and Q(x), respectively. E.g., if P(x) = 2x^2 − 5x + 5 and Q(x) = 4x^4 + 3x^3 + x^2 − x, then their representations in the relations P and Q are as follows:
P
coefficient | degree
2 | 2
−5 | 1
5 | 0
Q
coefficient | degree
4 | 4
3 | 3
1 | 2
−1 | 1
0 | 0
(a) Write a dynamic SQL function

create or replace function multiplyPolynomials(polynomial1 text, polynomial2 text)
returns table(coefficient integer, degree integer) as
$$ … $$ language plpgsql;

that computes a binary relation representing the multiplication of P(x) and Q(x), i.e., the polynomial P(x) ∗ Q(x). For example, consider P(x) = 2x^2 − 5x + 5 and Q(x) = 4x^4 + 3x^3 + x^2 − x. Then
P(x) ∗ Q(x) = (8)x^6 + (6 − 20)x^5 + (2 − 15 + 20)x^4 + (−2 − 5 + 15)x^3 + (5 + 5)x^2 + (−5)x
= 8x^6 − 14x^5 + 7x^4 + 8x^3 + 10x^2 − 5x.
So, for these polynomials, your function, when called with the names of the relations representing the polynomials, should return the relation
coefficient | degree
8 | 6
−14 | 5
7 | 4
8 | 3
10 | 2
−5 | 1
0 | 0
(b) Your solution should work for arbitrary polynomials P(x) and Q(x). Show how your function can be used to compute
i. the polynomial P ∗ Q;
ii. the polynomial P ∗ P; and
iii. the polynomial P ∗ (Q ∗ P).
11. (Practice problem – not graded) Let P(x) and Q(x) be two polynomials with integer coefficients. We can consider their composition P(Q(x)). For example, consider P(x) = 2x^3 − 5x + 5 and Q(x) = 4x^4 + 3x^3. Then
P(Q(x)) = 2(Q(x))^3 − 5Q(x) + 5
= 2(4x^4 + 3x^3)(4x^4 + 3x^3)(4x^4 + 3x^3) − 5(4x^4 + 3x^3) + 5
= 128x^12 + 288x^11 + 216x^10 + 54x^9 − 20x^4 − 15x^3 + 5
and
Q(P(x)) = 4(P(x))^4 + 3(P(x))^3
= 4(2x^3 − 5x + 5)^4 + 3(2x^3 − 5x + 5)^3
= 64x^12 − 640x^10 + 664x^9 + 2400x^8 − 4980x^7 − 1420x^6 + 12450x^5 − 10400x^4 − 5925x^3 + 16125x^2 − 11125x + 2875
(a) Write a dynamic SQL function

create or replace function compositionPolynomials(polynomial1 text, polynomial2 text)
returns table(coefficient integer, degree integer) as
$$ … $$ language plpgsql;

that computes a binary relation representing the composition P(Q(x)). Your solution should work for arbitrary polynomials P(x) and Q(x). Show how your function can be used to compute
i. the polynomial P(Q(x));
ii. the polynomial Q(P(x)); and
iii. the polynomial P(P(P(x))).
12. (Practice problem – not graded) Let M be an n × n matrix of boolean values. (We will assume that n ≥ 0.) We will use T to denote 'true' and F to denote 'false'. For i, j ∈ [1, n], we will denote by M[i, j] the boolean value in matrix M at row i and column j. (Notice that when n = 0, M has no elements.) A boolean matrix M can be represented by a relation M with schema (rw: integer, colmn: integer, value: boolean) such that for each i, j ∈ [1, n], (i, j, M[i, j]) ∈ M.
Notice that we use the attribute names 'rw' and 'colmn' since the words 'row' and 'column' are reserved words in PostgreSQL. For example, if M is the 3 × 3 boolean matrix
T F T
F T T
T F T
then M is the following relation of 9 tuples:
M
rw | colmn | value
1 | 1 | T
1 | 2 | F
1 | 3 | T
2 | 1 | F
2 | 2 | T
2 | 3 | T
3 | 1 | T
3 | 2 | F
3 | 3 | T
Let M and N be two n × n boolean matrices represented by the two relations M and N.
(a) Write a dynamic SQL function

create or replace function booleanMatrixMultiplication(M text, N text)
returns table (rw integer, colmn integer, value boolean) as
$$
begin
  return query execute …;
end;
$$ language plpgsql;

that computes a relation over the schema (rw, colmn, value) that represents the matrix M ∗ N, when given the names 'M' and 'N' of the relations that represent M and N, respectively. Your solution should work for any n ≥ 1. For example, if M and N are the 3 × 3 boolean matrices stored as the relations
M
rw | colmn | value
1 | 1 | T
1 | 2 | F
1 | 3 | T
2 | 1 | F
2 | 2 | T
2 | 3 | T
3 | 1 | T
3 | 2 | F
3 | 3 | T
N
rw | colmn | value
1 | 1 | T
1 | 2 | T
1 | 3 | T
2 | 1 | F
2 | 2 | F
2 | 3 | F
3 | 1 | T
3 | 2 | F
3 | 3 | T
then your function should produce the relational representation of M ∗ N, i.e., the relation
M ∗ N
rw | colmn | value
1 | 1 | T
1 | 2 | T
1 | 3 | T
2 | 1 | T
2 | 2 | F
2 | 3 | T
3 | 1 | T
3 | 2 | T
3 | 3 | T
Consider the Person relation. The relation Knows, which is a binary relation over Person, can be modeled by a boolean matrix M_Knows by using the Person pids as row and column indices for this matrix, and by using the value T at position (i, j) if (i, j) ∈ Knows, and the value F at position (i, j) if (i, j) ∉ Knows. We can then represent this boolean matrix with a relation M_Knows(rw integer, colmn integer, value boolean) by using the following SQL statements:

create table M_Knows(rw integer, colmn integer, value boolean);
insert into M_Knows
select rw.pid, colmn.pid,
       exists (select 1 from Knows k where rw.pid = k.pid1 and colmn.pid = k.pid2)
from person rw, person colmn;

(b) Compute the matrix M_Knows ∗ M_Knows. What does this matrix represent?
(c) Compute the matrix M_Knows ∗ (M_Knows ∗ M_Knows). What does this matrix represent?

4 Triggers (Practice problems – not graded)
To begin the problems in this section, you should first remove the entire database, including the relation schemas. You should then create the relations, but without specifying the primary and foreign key constraints. You should also not yet populate the relations with data. Solve the following problems:
13. Develop appropriate insert and delete triggers that implement the primary key and foreign key constraints that are specified for the Person, Company, and worksFor relations. Your triggers should report appropriate error conditions. For this problem, implement the triggers such that foreign key constraints are maintained using the cascading-delete semantics. For a reference on cascading deletes associated with foreign-key maintenance, consult the PostgreSQL manual page
https://www.postgresql.org/docs/9.2/ddl-constraints.html
Test your triggers using appropriate inserts and deletes.
14. Repeat Problem 13, but subject to the constraint that a person may not appear in the worksFor relation if he or she has fewer than 2 job skills.
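As a starting point for Problems 13 and 14, here is a minimal sketch of the trigger machinery in PostgreSQL: a trigger function that rejects a duplicate primary key on Person, plus the trigger that attaches it. It handles only one constraint and one event; the full problems also require triggers for the foreign key constraints and for deletes. The function and trigger names are illustrative.

-- Trigger function: runs before each insert on Person and raises an
-- error when the new pid already occurs in the table.
create or replace function checkPersonPK() returns trigger as
$$
begin
  if exists (select 1 from Person p where p.pid = new.pid) then
    raise exception 'primary key violation: pid % already exists', new.pid;
  end if;
  return new;
end;
$$ language plpgsql;

create trigger personPKTrigger
before insert on Person
for each row execute procedure checkPersonPK();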
15. Consider the following view definition:

create or replace view PersonIsKnownByNumberofPersons as
(select p1.pid, p1.pname,
        (select count(1)
         from Person p2
         where (p2.pid, p1.pid) in (select k.pid1, k.pid2 from Knows k)) as NumberofKnownByPersons
 from Person p1);

This view defines the set of tuples (p, n, c) where p and n are the pid and name of a person and c is the number of persons who know the person with pid p. You should not create this view! Rather, your task is to create a relation PersonIsKnownByNumberofPersons that maintains a materialized view in accordance with the above view definition under insert and delete operations on the Person and Knows relations. Your triggers should be designed to be incremental. In particular, when an insert or delete occurs, you should not have to recompute the known-by counts of all persons. Rather, the maintenance of PersonIsKnownByNumberofPersons should only apply to the person information that is affected by the insert or delete. Provide tests that show that your solution works.
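One possible shape for the incremental maintenance in Problem 15, sketched only for inserts into Knows, is a per-row trigger that bumps a single counter rather than recomputing the whole view; handling deletes and changes to Person would follow the same pattern. This assumes the PersonIsKnownByNumberofPersons table has already been created and seeded once from the view definition above.

-- After each insert into Knows, only the person being newly known
-- (new.pid2) is affected, so only that row's counter is updated.
create or replace function bumpKnownByCount() returns trigger as
$$
begin
  update PersonIsKnownByNumberofPersons
  set NumberofKnownByPersons = NumberofKnownByPersons + 1
  where pid = new.pid2;
  return new;
end;
$$ language plpgsql;

create trigger knowsInsertTrigger
after insert on Knows
for each row execute procedure bumpKnownByCount();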
This assignment relies on the lectures
• SQL Part 1 and SQL Part 2 (Pure SQL);
• Relational Algebra (RA);
• joins and semijoins;
• Translating Pure SQL queries into RA expressions; and
• Query optimization,
with particular focus on the last two lectures.
To turn in your assignment, you will need to upload to Canvas a single file with name assignment3.sql which contains the necessary SQL statements that solve the problems in this assignment. The assignment3.sql file must be such that the AIs can run it in their PostgreSQL environment. You should use the Assignment-Script-2021-Fall-Assignment3.sql file to construct the assignment3.sql file. (Note that the data to be used for this assignment is included in this file.) In addition, you will need to upload a separate assignment3.txt file that contains the results of running your queries. Finally, you need to upload a file assignment3.pdf that contains the solutions to the problems that require it. The problems that need to be included in assignment3.sql are marked with a blue bullet •. The problems that need to be included in assignment3.pdf are marked with a red bullet •. (You should aim to use LaTeX to construct your .pdf file.)

Database schema and instances
For the problems in this assignment we will use the following database schema:1
Person(pid, pname, city)
Company(cname, headquarter)
Skill(skill)
worksFor(pid, cname, salary)
companyLocation(cname, city)
personSkill(pid, skill)
hasManager(eid, mid)
Knows(pid1, pid2)
In this database we maintain a set of persons (Person), a set of companies (Company), and a set of (job) skills (Skill). The pname attribute in Person is the name of the person. The city attribute in Person specifies the city in which the person lives. The cname attribute in Company is the name of the company. The headquarter attribute in Company is the name of the city wherein the company has its headquarters. The skill attribute in Skill is the name of a (job) skill. A person can work for at most one company. This information is maintained in the worksFor relation. (We permit that a person does not work for any company.) The salary attribute in worksFor specifies the salary made by the person. The city attribute in companyLocation indicates a city in which the company is located. (Companies may be located in multiple cities.) A person can have multiple job skills. This information is maintained in the personSkill relation. A job skill can be the job skill of multiple persons. (A person may not have any job skills, and a job skill may have no persons with that skill.) A pair (e, m) in hasManager indicates that person e has person m as one of his or her managers. We permit that an employee has multiple managers and that a manager may manage multiple employees. (It is possible that an employee has no manager and that an employee is not a manager.) We further require that an employee and his or her managers must work for the same company. The relation Knows maintains a set of pairs (p1, p2) where p1 and p2 are pids of persons. The pair (p1, p2) indicates that the person with pid p1 knows the person with pid p2. We do not assume that the relation Knows is symmetric: it is possible that (p1, p2) is in the relation but that (p2, p1) is not. The domain for the attributes pid, pid1, pid2, salary, eid, and mid is integer. The domain for all other attributes is text.
1 The primary key, which may consist of one or more attributes, of each of these relations is underlined.
3 Pure SQL and RA SQL

In this assignment, we distinguish between Pure SQL and RA SQL. Below we list the only features that are allowed in Pure SQL and in RA SQL. In particular, notice that

• join, NATURAL join, and CROSS join are not allowed in Pure SQL.
• The predicates [not] IN, SOME, ALL, [not] exists are not allowed in RA SQL.

The only features allowed in Pure SQL:
  select … from … where
  WITH …
  union, intersect, except operations
  exists and not exists predicates
  IN and not IN predicates
  ALL and SOME predicates
  VIEWs that can only use the above Pure SQL features

The only features allowed in RA SQL:
  select … from … where
  WITH …
  union, intersect, except operations
  join … ON …, natural join, and CROSS join operations
  VIEWs that can only use the above RA SQL features
  commas in the from clause are not allowed

1 Theoretical problems related to query translation and optimization

1. Consider two RA expressions E1 and E2 over the same schema. Furthermore, consider an RA expression F with a schema that is not necessarily the same as that of E1 and E2. Consider the following if-then-else query:

  if F = ∅ then return E1 else return E2

So this query evaluates to the expression E1 if F = ∅ and to the expression E2 if F ≠ ∅. We can formulate this query in SQL as follows (here E1, E2, and F denote SQL queries corresponding to the RA expressions E1, E2, and F, respectively):

  select e1.* from E1 e1
  where not exists (select distinct row() from F)
  union
  select e2.* from E2 e2
  where exists (select distinct row() from F);

Remark 1 The subquery select distinct row() from F returns the empty set if F = ∅ and returns the tuple () if F ≠ ∅. (The tuple () is often referred to as the empty tuple, i.e., the tuple without components. It is akin to the empty string in the theory of formal languages, i.e., the string without alphabet characters.) In RA, this query can be written as π_{()}(F), i.e., the projection of F on an empty list of attributes.

• In terms of E1, E2, and F, write an RA expression in standard notation that expresses this if-then-else query. (Hint: consider using the Pure SQL to RA SQL translation algorithm.)

2. Let R(x) be a unary relation that can store a set of integers R. Consider the following boolean SQL query:

  select not exists (select 1
                     from R r1, R r2
                     where r1.x <> r2.x) as fewerThanTwo;

This boolean query returns the constant "true" if R has fewer than two elements and returns the constant "false" otherwise.

• Using the insights you gained from Problem 1, write an RA expression in standard notation that expresses the above boolean SQL query. (Hint: recall that, in general, a constant value "a" can be represented in RA by an expression of the form (A: a), where A is some arbitrary attribute name. Furthermore, recall that we can express (A: a) in SQL as "select a as A". Thus RA expressions for the constants "true" and "false" can be the expressions (A: true) and (A: false), respectively.)
3. In the translation algorithm from Pure SQL to RA we tacitly assumed that the argument of each set predicate was a (possibly parameterized) Pure SQL query that did not use a union, intersect, nor an except operation. In this problem, you are asked to extend the translation algorithm from Pure SQL to RA so that it also eliminates [not] exists predicates whose argument is a Pure SQL query (possibly with parameters) that uses a union, intersect, or except operation. More specifically, consider the following types of queries using the [not] exists set predicate:

  select L1(r1,…,rn)
  from R1 r1, …, Rn rn
  where C1(r1,…,rn) and
        [not] exists (select L2(s1,…,sm)
                      from S1 s1, …, Sm sm
                      where C2(s1,…,sm,r1,…,rn)
                      [union | intersect | except]
                      select L3(t1,…,tk)
                      from T1 t1, …, Tk tk
                      where C3(t1,…,tk,r1,…,rn))

Observe that there are six cases to consider:

(a) exists (… union …)
(b) exists (… intersect …)
(c) exists (… except …)
(d) not exists (… union …)
(e) not exists (… intersect …)
(f) not exists (… except …)

• Show how such SQL queries can be translated to equivalent RA expressions in standard notation. Be careful in the translation since you should take into account that projections do not in general distribute over intersections and over set differences. To get practice, first consider the following special case where n = 1, m = 1, and k = 1 (once you can handle this case, the general case is similar):

  select L1(r)
  from R r
  where C1(r) and
        [not] exists (select L2(s)
                      from S s
                      where C2(s,r)
                      [union | intersect | except]
                      select L3(t)
                      from T t
                      where C3(t,r))

4. • Let R be a relation with schema (a, b, c) and let S be a relation with schema (d, e). Prove, from first principles, the correctness of the following rewrite rule (in particular, do not use the rewrite rule of pushing projections over joins; rather, use Predicate Logic or TRC to provide a proof):

  π_{a,d}(R ⋈_{c=d} S) = π_{a,d}(π_{a,c}(R) ⋈_{c=d} π_{d}(S)).

5. • Consider the same rewrite rule π_{a,d}(R ⋈_{c=d} S) = π_{a,d}(π_{a,c}(R) ⋈_{c=d} π_{d}(S)) as in Problem 4. Furthermore, assume that S has primary key d and that R has foreign key c referencing this primary key in S. How can you simplify this rewrite rule? Argue why this rewrite rule is correct.

2 Translating Pure SQL queries to RA expressions and optimized RA expressions

In this section, you are asked to translate Pure SQL queries into RA SQL queries as well as into standard RA expressions using the translation algorithm given in class. You are required to show the intermediate steps that you took during the translation. After the translation, you are asked to optimize the resulting RA expressions. You can use the following letters, or indexed letters, to denote relation names in RA expressions:

  P, P1, P2, …  Person
  C, C1, C2, …  Company
  S, S1, S2, …  Skill
  W, W1, W2, …  worksFor
  cL, cL1, cL2, …  companyLocation
  pS, pS1, pS2, …  personSkill
  hM, hM1, hM2, …  hasManager
  K, K1, K2, …  Knows
We illustrate what is expected using an example.

Example 1 Consider the query "Find each (p, c) pair where p is the pid of a person who works for a company c located in Bloomington and whose salary is the lowest among the salaries of persons who work for that company." A possible formulation of this query in Pure SQL is

  select w.pid, w.cname
  from worksfor w
  where w.cname in (select cl.cname
                    from companyLocation cl
                    where cl.city = 'Bloomington')
    and w.salary <= all (select w1.salary
                         from worksfor w1
                         where w1.cname = w.cname);

which (translating the 'in' and '<= all' predicates) is translated to the RA SQL query

  select q.pid, q.cname
  from (select w.*
        from worksfor w natural join
             (select cl.*
              from companyLocation cl
              where cl.city = 'Bloomington') cl
        intersect
        (select w.*
         from worksfor w
         except
         select w.*
         from worksfor w join worksfor w1
              on (w.salary > w1.salary and w1.cname = w.cname))) q;

This RA SQL query can be formulated as an RA expression in standard notation as follows:

  π_{W.pid,W.cname}(E ∩ (W − F))

where E = π_{W.*}(W ⋈ σ_{city='Bloomington'}(cL)) and F = π_{W.*}(W ⋈_{W.salary>W1.salary ∧ W1.cname=W.cname} W1).

We can now commence the optimization. The expression E can be written as the semijoin W ⋉ σ_{city='Bloomington'}(cL); we will call this expression E^opt. Since E^opt ⊆ W, the intersection E ∩ (W − F) simplifies to the difference E^opt − F. In F, we can push the projection over the join and get the expression

  π_{W.*}(W ⋈_{W.salary>W1.salary ∧ W1.cname=W.cname} π_{W1.cname,W1.salary}(W1)).

We will call this expression F^opt. Therefore, the fully optimized RA expression is π_{W.pid,W.cname}(E^opt − F^opt), i.e.,

  π_{W.pid,W.cname}(W ⋉ σ_{city='Bloomington'}(cL) − π_{W.*}(W ⋈_{W.salary>W1.salary ∧ W1.cname=W.cname} π_{W1.cname,W1.salary}(W1))).

We now turn to the problems in this section.

6. Consider the query "Find the cname and headquarter of each company that employs persons who earn less than 55000 and who do not live in Bloomington." A possible way to write this query in Pure SQL is

  select c.cname, c.headquarter
  from company c
  where c.cname in (select w.cname
                    from worksfor w
                    where w.salary < 55000
                      and w.pid = SOME (select p.pid
                                        from person p
                                        where p.city <> 'Bloomington'));

(a) • Using the Pure SQL to RA SQL translation algorithm, translate this Pure SQL query to an equivalent RA SQL query. Show the translation steps you used to obtain your solution.
(b) • Optimize this RA SQL query and provide the optimized expression in standard RA notation. Specify at least three conceptually different rewrite rules that you used during the optimization.

7. Consider the query "Find the pid of each person who has all-but-one job skill." A possible way to write this query in Pure SQL is

  select p.pid
  from person p
  where exists (select 1
                from skill s
                where (p.pid, s.skill) not in (select ps.pid, ps.skill
                                               from personSkill ps))
    and not exists (select 1
                    from skill s1, skill s2
                    where s1.skill <> s2.skill
                      and (p.pid, s1.skill) not in (select ps.pid, ps.skill
                                                    from personSkill ps)
                      and (p.pid, s2.skill) not in (select ps.pid, ps.skill
                                                    from personSkill ps));

(a) • Using the Pure SQL to RA SQL translation algorithm, translate this Pure SQL query to an equivalent RA SQL query. Show the translation steps you used to obtain your solution.
(b) • Optimize this RA SQL query and provide the optimized expression in standard RA notation. Specify at least three conceptually different rewrite rules that you used during the optimization.
8. Consider the query "Find the pid and name of each person who works for a company located in Bloomington but who does not know any person who lives in Chicago." A possible way to write this query in Pure SQL is

  select p.pid, p.pname
  from person p
  where exists (select 1
                from worksFor w, companyLocation c
                where p.pid = w.pid and w.cname = c.cname
                  and c.city = 'Bloomington')
    and p.pid not in (select k.pid1
                      from knows k
                      where exists (select 1
                                    from person p
                                    where k.pid2 = p.pid
                                      and p.city = 'Chicago'));

(a) • Using the Pure SQL to RA SQL translation algorithm, translate this Pure SQL query to an equivalent RA SQL query. Show the translation steps you used to obtain your solution.
(b) • Optimize this RA SQL query and provide the optimized expression in standard RA notation. Specify at least three conceptually different rewrite rules that you used during the optimization.

9. Consider the query "Find the cname and headquarter of each company that (1) employs at least one person and (2) whose workers who make at most 70000 have both the programming and AI skills." A possible way to write this query in Pure SQL is

  select c.cname, c.headquarter
  from company c
  where exists (select 1
                from worksfor w
                where w.cname = c.cname)
    and not exists (select 1
                    from worksfor w
                    where w.cname = c.cname and w.salary
This assignment relies on the lectures

• SQL Part 1 and SQL Part 2 (Pure SQL);
• Views;
• Relational Algebra (RA); and
• Joins and semijoins.

Furthermore, the lecture on the correspondence between Tuple Relational Calculus and SQL should help you to solve problems in this assignment. To turn in your assignment, you will need to upload to Canvas a single file with name assignment2.sql which contains the necessary SQL statements that solve the problems in this assignment. The assignment2.sql file must be such that the AIs can run it in their PostgreSQL environment. You should use the Assignment2Script.sql file to construct the assignment2.sql file. (Note that the data to be used for this assignment is included in this file.) In addition, you will need to upload a separate assignment2.txt file that contains the results of running your queries. Finally, you need to upload a file assignment2.pdf that contains the solutions to the problems that require it. The problems that need to be included in the assignment2.sql are marked with a blue bullet •. The problems that need to be included in the assignment2.pdf are marked with a red bullet •. (You should aim to use Latex to construct your .pdf file.)

1 Database schema and instances

For the problems in this assignment we will use the following database schema (the primary key, which may consist of one or more attributes, of each of these relations is underlined):

Person(pid, pname, city)
Company(cname, headquarter)
Skill(skill)
worksFor(pid, cname, salary)
companyLocation(cname, city)
personSkill(pid, skill)
hasManager(eid, mid)
Knows(pid1, pid2)

In this database we maintain a set of persons (Person), a set of companies (Company), and a set of (job) skills (Skill). The pname attribute in Person is the name of the person. The city attribute in Person specifies the city in which the person lives. The cname attribute in Company is the name of the company. The headquarter attribute in Company is the name of the city wherein the company has its headquarters. The skill attribute in Skill is the name of a (job) skill. A person can work for at most one company. This information is maintained in the worksFor relation. (We permit that a person does not work for any company.) The salary attribute in worksFor specifies the salary made by the person. The city attribute in companyLocation indicates a city in which the company is located. (Companies may be located in multiple cities.) A person can have multiple job skills. This information is maintained in the personSkill relation. A job skill can be the job skill of multiple persons. (A person may not have any job skills, and a job skill may have no persons with that skill.) A pair (e, m) in hasManager indicates that person e has person m as one of his or her managers. We permit that an employee has multiple managers and that a manager may manage multiple employees. (It is possible that an employee has no manager and that an employee is not a manager.) We further require that an employee and his or her managers must work for the same company. The relation Knows maintains a set of pairs (p1, p2) where p1 and p2 are pids of persons. The pair (p1, p2) indicates that the person with pid p1 knows the person with pid p2. We do not assume that the relation Knows is symmetric: it is possible that (p1, p2) is in the relation but that (p2, p1) is not. The domain for the attributes pid, pid1, pid2, salary, eid, and mid is integer. The domain for all other attributes is text.
We assume the following foreign key constraints:

• pid is a foreign key in worksFor referencing the primary key pid in Person;
• cname is a foreign key in worksFor referencing the primary key cname in Company;
• cname is a foreign key in companyLocation referencing the primary key cname in Company;
• pid is a foreign key in personSkill referencing the primary key pid in Person;
• skill is a foreign key in personSkill referencing the primary key skill in Skill;
• eid is a foreign key in hasManager referencing the primary key pid in Person;
• mid is a foreign key in hasManager referencing the primary key pid in Person;
• pid1 is a foreign key in Knows referencing the primary key pid in Person; and
• pid2 is a foreign key in Knows referencing the primary key pid in Person.

The file Assignment2Script.sql contains the data supplied for this assignment.

3 Pure SQL and RA SQL

In this assignment, we distinguish between Pure SQL and RA SQL. Below we list the only features that are allowed in Pure SQL and in RA SQL. In particular, notice that

• JOIN, NATURAL JOIN, and CROSS JOIN are not allowed in Pure SQL.
• The predicates [NOT] IN, SOME, ALL, [NOT] EXISTS are not allowed in RA SQL.

The only features allowed in Pure SQL:
  SELECT … FROM … WHERE
  WITH …
  UNION, INTERSECT, EXCEPT operations
  EXISTS and NOT EXISTS predicates
  IN and NOT IN predicates
  ALL and SOME predicates
  VIEWs that can only use the above Pure SQL features

The only features allowed in RA SQL:
  SELECT … FROM … WHERE
  WITH …
  UNION, INTERSECT, EXCEPT operations
  JOIN … ON …, NATURAL JOIN, and CROSS JOIN operations
  VIEWs that can only use the above RA SQL features
  commas in the FROM clause are not allowed

1 Formulating queries in Pure SQL with and without set predicates

An important consideration in formulating queries is to contemplate how they can be written in different, but equivalent, ways. In fact, this is an aspect of programming in general and, as can be expected, is also true for SQL. A learning outcome of this course is to acquire skills for writing queries in different ways. One of the main motivations for this is to learn that different formulations of the same query can dramatically impact performance, sometimes by orders of magnitude. For the problems in this section, you will need to formulate queries in Pure SQL with and without set predicates. You can use the SQL operators INTERSECT, UNION, and EXCEPT, unless otherwise specified. You are, however, allowed and encouraged to use views, including temporary and user-defined views. To illustrate what you need to do, consider the following example.

Example 1 Consider the query "Find the pid and name of each person who (a) works for a company located in Bloomington and (b) knows a person who lives in Chicago."

(a) Formulate this query in Pure SQL by only using the EXISTS or NOT EXISTS set predicates. You can not use the set operations INTERSECT, UNION, and EXCEPT. A possible solution is

  select p.pid, p.pname
  from Person p
  where exists (select 1
                from worksFor w
                where p.pid = w.pid
                  and exists (select 1
                              from companyLocation c
                              where w.cname = c.cname
                                and c.city = 'Bloomington'))
    and exists (select 1
                from Knows k
                where p.pid = k.pid1
                  and exists (select 1
                              from Person p2
                              where k.pid2 = p2.pid
                                and p2.city = 'Chicago'));

(b) Formulate this query in Pure SQL by only using the IN, NOT IN, SOME, or ALL set membership predicates. You can not use the set operations INTERSECT, UNION, and EXCEPT.
A possible solution is

  select p.pid, p.pname
  from Person p
  where p.pid in (select w.pid
                  from worksFor w
                  where w.cname in (select c.cname
                                    from companyLocation c
                                    where c.city = 'Bloomington'))
    and p.pid in (select k.pid1
                  from Knows k
                  where k.pid2 in (select p2.pid
                                   from Person p2
                                   where p2.city = 'Chicago'));

Another possible solution, using the SOME and IN predicates, is

  select p.pid, p.pname
  from Person p
  where p.pid = some (select w.pid
                      from worksFor w
                      where w.cname = some (select c.cname
                                            from companyLocation c
                                            where c.city = 'Bloomington'))
    and p.pid in (select k.pid1
                  from Knows k
                  where k.pid2 in (select p2.pid
                                   from Person p2
                                   where p2.city = 'Chicago'));

(c) Formulate this query in Pure SQL without using set predicates. A possible solution is

  select p.pid, p.pname
  from Person p, worksFor w, companyLocation c
  where p.pid = w.pid and w.cname = c.cname and c.city = 'Bloomington'
  intersect
  select p1.pid, p1.pname
  from Person p1, Knows k, Person p2
  where k.pid1 = p1.pid and k.pid2 = p2.pid and p2.city = 'Chicago';

We now turn to the problems for this section.

1. Consider the query "Find the pid and name of each person who works for Google and who has a strictly higher salary than some other person who he or she knows and who also works for Google."
(a) • Formulate this query in Pure SQL by only using the EXISTS or NOT EXISTS set predicates.
(b) • Formulate this query in Pure SQL by only using the IN, NOT IN, SOME, or ALL set membership predicates.
(c) • Formulate this query in Pure SQL without using set predicates.

2. Consider the query "Find the cname of each company with headquarter in Cupertino, but that is not located in Indianapolis, along with the pid, name, and salary of each person who works for that company and who has the next-to-lowest salary at that company." (By definition, a salary s is next-to-lowest if there exists a salary s1 with s1 < s, but there do not exist two salaries s1 and s2 with s2 < s1 < s.)
(a) • Formulate this query in Pure SQL by only using the EXISTS or NOT EXISTS set predicates. You can not use the set operations INTERSECT, UNION, and EXCEPT.
(b) • Formulate this query in Pure SQL by only using the IN, NOT IN, SOME, or ALL set membership predicates. You can not use the set operations INTERSECT, UNION, and EXCEPT.
(c) • Formulate this query in Pure SQL without using set predicates.

3. Consider the query "Find each (c, p) pair where c is the cname of a company and p is the pid of a person who works for that company and who is known by all other persons who work for that company."
(a) • Formulate this query in Pure SQL by only using the EXISTS or NOT EXISTS set predicates. You can not use the set operations INTERSECT, UNION, and EXCEPT.
(b) • Formulate this query in Pure SQL by only using the IN, NOT IN, SOME, or ALL set membership predicates. You can not use the set operations INTERSECT, UNION, and EXCEPT.
(c) • Formulate this query in Pure SQL without using set predicates.

4. Consider the query "Find each skill that is not a jobskill of any person who works for Yahoo or for Netflix."
(a) • Formulate this query in Pure SQL using subqueries and set predicates. You can not use the set operations INTERSECT, UNION, and EXCEPT.
(b) • Formulate this query in Pure SQL without using set predicates.

5. Consider the query "Find the pid and name of each person who manages all-but-one person who works for Google."
(a) • Formulate this query in Pure SQL using subqueries and set predicates. You can not use the set operations INTERSECT, UNION, and EXCEPT.
(b) • Formulate this query in Pure SQL without using set predicates.

2 Formulating queries in Relational Algebra and RA SQL

Reconsider the queries in Section 1. The goal of the problems in this section is to formulate these queries in Relational Algebra in standard notation and in RA SQL. There are some further restrictions:

• You can only use WHERE clauses that use conditions involving constants. For example, conditions of the form "t.A θ 'a'" are allowed, but conditions of the form "t.A θ s.B" are not allowed. The latter conditions can be absorbed in JOIN operations in the FROM clause.
• You can not use commas in any FROM clause. Rather, you should use JOIN operations.

You can use the following letters, or indexed letters, to denote relation names in RA expressions:

  P, P1, P2, …  Person
  C, C1, C2, …  Company
  S, S1, S2, …  Skill
  W, W1, W2, …  worksFor
  cL, cL1, cL2, …  companyLocation
  pS, pS1, pS2, …  personSkill
  M, M1, M2, …  hasManager
  K, K1, K2, …  Knows

To illustrate what you need to do, reconsider the query of Example 1 in Section 1.

Example 2 Consider the query "Find the pid and name of each person who (a) works for a company located in Bloomington and (b) knows a person who lives in Chicago."

(a) Formulate this query in Relational Algebra in standard notation. A possible solution is

  π_{pid,pname}(Person ⋈ worksFor ⋈ π_{cname}(σ_{city='Bloomington'}(companyLocation)))
  ∩ π_{Person1.pid,Person1.pname}(Person1 ⋈_{Person1.pid=pid1} Knows ⋈_{pid2=Person2.pid} π_{Person2.pid}(σ_{city='Chicago'}(Person2)))

If we use the letters in the above box, this expression becomes more succinct:

  π_{pid,pname}(P ⋈ W ⋈ π_{cname}(σ_{city='Bloomington'}(cL)))
  ∩ π_{P1.pid,P1.pname}(P1 ⋈_{P1.pid=pid1} K ⋈_{pid2=P2.pid} π_{P2.pid}(σ_{city='Chicago'}(P2)))

You are also allowed to introduce letters that denote expressions. For example, let E and F denote the expressions

  π_{pid,pname}(P ⋈ W ⋈ π_{cname}(σ_{city='Bloomington'}(cL)))

and

  π_{P1.pid,P1.pname}(P1 ⋈_{P1.pid=pid1} K ⋈_{pid2=P2.pid} π_{P2.pid}(σ_{city='Chicago'}(P2))),

respectively. Then we can write the solution as follows: E ∩ F.

(b) Formulate this query in RA SQL. A possible solution is

  select pid, pname
  from Person NATURAL JOIN worksFor NATURAL JOIN
       (select cname from companyLocation where city = 'Bloomington') C
  INTERSECT
  select P1.pid, P1.pname
  from Person P1 JOIN Knows ON (P1.pid = pid1)
       JOIN (select pid from Person where city = 'Chicago') P2 ON (pid2 = P2.pid)
  order by 1,2;

Observe that the WHERE clauses only use conditions involving constants.

We now turn to the problems in this section.

6. Reconsider Problem 1. Find the pid and name of each person who works for Google and who has a strictly higher salary than some other person who he or she knows and who also works for Google.
(a) • Formulate this query in Relational Algebra in standard notation.
(b) • Formulate this query in RA SQL.

7. Reconsider Problem 2. Find the cname of each company with headquarter in Cupertino, but that is not located in Indianapolis, along with the pid, name, and salary of each person who works for that company and who has the next-to-lowest salary at that company.
(a) • Formulate this query in Relational Algebra in standard notation.
(b) • Formulate this query in RA SQL.

8. Reconsider Problem 3. Find each (c, p) pair where c is the cname of a company and p is the pid of a person who works for that company and who is known by all other persons who work for that company.
(a) • Formulate this query in Relational Algebra in standard notation.
(b) • Formulate this query in RA SQL.
9. Reconsider Problem 4. Find each skill that is not a jobskill of any person who works for Yahoo or for Netflix.
(a) • Formulate this query in Relational Algebra in standard notation.
(b) • Formulate this query in RA SQL.

10. Reconsider Problem 5. Find the pid and name of each person who manages all-but-one person who works for Google.
(a) • Formulate this query in Relational Algebra in standard notation.
(b) • Formulate this query in RA SQL.

3 Formulating constraints using Relational Algebra

Recall that it is possible to express constraints in TRC and as boolean SQL queries. It is also possible to write constraints using RA expressions. Let E, F, and G denote RA expressions. Then we can write RA expression comparisons that express constraints:

  E ≠ ∅   which is true if E evaluates to a non-empty relation
  E = ∅   which is true if E evaluates to the empty relation
  F ⊆ G   which is true if F evaluates to a relation that is a subset of the relation obtained from G
  F ⊈ G   which is true if F evaluates to a relation that is not a subset of the relation obtained from G

Here are some examples of writing constraints in this manner.

Example 3 "Some person works for Google." This constraint can be written as follows:

  π_{pid}(σ_{cname='Google'}(worksFor)) ≠ ∅.

Indeed, the RA expression π_{pid}(σ_{cname='Google'}(worksFor)) computes the set of all person pids who work for Google. If this set is not empty, then there are indeed persons who work for Google. Incidentally, the constraint "No one works for Google" can be written as follows:

  π_{pid}(σ_{cname='Google'}(worksFor)) = ∅.

Example 4 "Each person has at least two skills." This constraint can be written as follows:

  π_{pid}(P) ⊆ π_{pS1.pid}(σ_{pS1.skill≠pS2.skill}(pS1 ⋈_{pS1.pid=pS2.pid} pS2)).

Indeed, π_{pid}(P) computes the set of all person pids and π_{pS1.pid}(σ_{pS1.skill≠pS2.skill}(pS1 ⋈_{pS1.pid=pS2.pid} pS2)) computes the set of all pids of persons who have at least two skills. When the first set is contained in the second, we must have that each person has at least two skills. Incidentally, the constraint "Some person has fewer than two skills" can be written as follows:

  π_{pid}(P) ⊈ π_{pS1.pid}(σ_{pS1.skill≠pS2.skill}(pS1 ⋈_{pS1.pid=pS2.pid} pS2)).

We now turn to the problems in this section. Formulate each of the following constraints using RA expressions as illustrated in Example 3 and Example 4.

11. • Some person has a salary that is strictly lower than that of each of the persons who he or she manages.
12. • No person knows all persons who work for Google.
13. • Each person knows all of his or her managers.
14. • Each employee and his or her managers work for the same company.
15. • The attribute pid is a primary key of the Person relation.

4 Formulating queries in SQL using views

Formulate the following views and queries in SQL. You are allowed to combine the features of both Pure SQL and RA SQL. (A sketch of one way to realize a parameterized view appears after Problem 17.)

16. • Create a view Triangle that contains each triple of pids of different persons (p1, p2, p3) that mutually know each other. Then test your view.

17. • Define a parameterized view SalaryBelow(cname text, salary integer) that returns, for a given company identified by cname and a given salary value, the subrelation of Person of persons who work for company cname and whose salary is strictly below salary. Test your view for the parameter values ('IBM',60000), ('IBM',50000), and ('Apple',65000).
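PostgreSQL has no parameterized views as such; one common way to realize them is a function returning a table, as in the following sketch. The view PersonsInCity is a hypothetical example for illustration, not one of the graded views.

  create or replace function PersonsInCity(c text)
  returns setof Person as $$
    select p.*
    from Person p
    where p.city = c;   -- the parameter acts like a view argument
  $$ language sql;

  -- usage
  select * from PersonsInCity('Bloomington');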
18. • Define a parameterized view KnowsPersonAtCompany(p integer, c text) that returns, for a person with pid p, the subrelation of Person of persons whom p knows and who work for the company with cname c. Test your view for the parameters (1001,'Amazon'), (1001,'Apple'), and (1015,'Netflix').

19. • Define a parameterized view KnownByPersonAtCompany(p integer, c text) that returns the subrelation of Person of persons who know the person with pid p and who work for the company with cname c. Test your view for the parameters (1001,'Amazon'), (1001,'Apple'), and (1015,'Netflix').

20. • Let Tree(parent: integer, child: integer) be a rooted parent-child tree. So a pair (n, m) in Tree indicates that node n is a parent of node m. The sameGeneration(n1, n2) binary relation is inductively defined using the following two rules:

• If n is a node in Tree, then the pair (n, n) is in the sameGeneration relation. (Base rule)
• If n1 is the parent of m1 in Tree and n2 is the parent of m2 in Tree and (n1, n2) is a pair in the sameGeneration relation, then (m1, m2) is a pair in the sameGeneration relation. (Inductive rule)

Write a recursive view for the sameGeneration relation. Test your view.
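For reference, a recursive view in PostgreSQL follows the WITH RECURSIVE pattern sketched below on a hypothetical Edge(parent, child) relation; it computes ancestor pairs, and sameGeneration follows the same base-rule/inductive-rule shape with the rules given above.

  create or replace view Ancestor as
  with recursive anc(up, down) as (
    select parent, child from Edge            -- base rule
    union
    select a.up, e.child                      -- inductive rule
    from anc a join Edge e on a.down = e.parent
  )
  select up as ancestor, down as descendant from anc;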
1 Introduction

The goals for this assignment are to

1. become familiar with the PostgreSQL system (to solve this assignment, you will need to download and install PostgreSQL, version 12 or higher, on your computer);
2. create a relational database and populate it with data;
3. examine the side-effects on the state of the database caused by inserts and deletes in the presence or absence of primary and foreign key constraints;
4. formulate some queries in SQL and evaluate them in PostgreSQL; and
5. translate TRC queries to SQL and formulate queries and constraints in TRC. (To solve problems related to TRC, follow the syntax and semantics described in the TRC SQL.pdf document in the module Tuple Relational Calculus and SQL (lecture 4). That document contains multiple examples of TRC queries and constraints and how they can be translated to SQL.)

This assignment covers lectures 1 through 4.

To turn in your assignment, you will need to upload to Canvas a single file with name assignment1.sql which contains the necessary SQL statements that solve the problems in this assignment. The assignment1.sql file must be such that the AIs can run it in their PostgreSQL environment. In addition, you will need to upload a separate assignment1.txt file that contains the results of running your queries. We have posted the exact requirements and an example for uploading your solution files. (See the module Instructions for turning in assignments.) Finally, you will need to upload an assignment1.pdf file that contains the solutions for problems related to TRC. (It is strongly recommended that you use Latex to write TRC formulas and queries. For a good way to learn about Latex, look at https://www.overleaf.com/learn/latex/Free online introduction to LaTeX (part 1). You can also inspect the Latex source code for this assignment as well as the document TRC SQL.tex provided in Module 4.)

For the problems in this assignment we will use the following database schema (the primary key, which may consist of one or more attributes, of each of these relations is underlined):

Person(pid, pname, city)
Company(cname, headquarter)
Skill(skill)
worksFor(pid, cname, salary)
companyLocation(cname, city)
personSkill(pid, skill)
hasManager(eid, mid)

In this database we maintain a set of persons (Person), a set of companies (Company), and a set of (job) skills (Skill). The pname attribute in Person is the name of the person. The city attribute in Person specifies the city in which the person lives. The cname attribute in Company is the name of the company. The headquarter attribute in Company is the name of the city wherein the company has its headquarters. The skill attribute in Skill is the name of a (job) skill. A person can work for at most one company. This information is maintained in the worksFor relation. (We permit that a person does not work for any company.) The salary attribute in worksFor specifies the salary made by the person. The city attribute in companyLocation indicates a city in which the company is located. (Companies may be located in multiple cities.) A person can have multiple job skills. This information is maintained in the personSkill relation. A job skill can be the job skill of multiple persons. (A person may not have any job skills, and a job skill may have no persons with that skill.) A pair (e, m) in hasManager indicates that person e has person m as one of his or her managers. We permit that an employee has multiple managers and that a manager may manage multiple employees. (It is possible that an employee has no manager and that an employee is not a manager.) We further require that an employee and his or her managers must work for the same company. The domain for the attributes pid, salary, eid, and mid is integer. The domain for all other attributes is text.
We assume the following foreign key constraints:

• pid is a foreign key in worksFor referencing the primary key pid in Person;
• cname is a foreign key in worksFor referencing the primary key cname in Company;
• cname is a foreign key in companyLocation referencing the primary key cname in Company;
• pid is a foreign key in personSkill referencing the primary key pid in Person;
• skill is a foreign key in personSkill referencing the primary key skill in Skill;
• eid is a foreign key in hasManager referencing the primary key pid in Person; and
• mid is a foreign key in hasManager referencing the primary key pid in Person.

The file data.sql contains the data supplied for this assignment.

2 Database creation and impact of constraints on insert and delete statements

Create a database in PostgreSQL that stores the data provided in the data.sql file. Make sure to specify primary and foreign keys.

1. Provide 4 conceptually different examples that illustrate how the presence or absence of primary and foreign keys affects inserts and deletes in these relations. To solve this problem, you will need to experiment with the relation schemas and instances for this assignment. For example, you should consider altering primary key and foreign key constraints and then consider various sequences of insert and delete operations. You may need to change some of the relation instances to observe the desired effects. Certain inserts and deletes should succeed but others should generate error conditions. (Consider the lecture notes about keys, foreign keys, and inserts and deletes as a guide to solve this problem.)

3 Formulating queries in SQL

For this assignment, you are required to use tuple variables in your SQL statements. For example, in formulating the query "Find the pid and pname of each person who lives in Bloomington" you should write the query

  SELECT p.pid, p.pname
  FROM Person p
  WHERE p.city = 'Bloomington'

rather than

  SELECT pid, pname
  FROM Person
  WHERE city = 'Bloomington'

Write SQL statements for the following queries. Make sure that each of your queries returns a set but not a bag. In other words, make appropriate use of the DISTINCT clause where necessary. You can not use the SQL JOIN operations or SQL aggregate functions such as COUNT, SUM, MAX, MIN, etc. in your solutions.

2. Find the pid and pname of each person who (a) lives in Bloomington, (b) works for a company where he or she earns a salary that is higher than 30000, and (c) has at least one manager.
3. Find the pairs (c1, c2) of names of companies whose headquarters are located in the same city.
4. Find the pid and pname of each person who lives in a city that is different from each city in which his or her managers live. (Persons who have no manager should not be included in the answer.)
5. Find each skill that is the skill of at most 2 persons.
6. Find the pid, pname, and salary of each employee who has at least two managers such that these managers have a common job skill, provided that it is not the 'Networks' skill.
7. Find the cname of each company that not only employs persons who live in MountainView.
(In other words, there exists at least one employee of such a company who does not live in MountainView.)

8. For each company, list its name along with the highest salary made by employees who work for it.
9. Find the pid and pname of each employee who has a salary that is higher than the salary of each of his or her managers. (Employees who have no manager should not be included.)

4 Translating TRC queries to SQL

Consider the following queries formulated in TRC. Translate each of these queries to an equivalent SQL query. (You can not use SQL JOIN operations or aggregate functions.) You should note that this translation, modulo the handling of universal quantifiers, is almost a syntactic rewrite of the way in which the queries are formulated in TRC. This underscores the close correspondence between TRC and SQL. The SQL queries should be included in the assignment1.sql file and their outputs should be reported in the assignment1.txt file.

10. {p.pid, p.pname, w.cname, w.salary | Person(p) ∧ worksFor(w) ∧ p.pid = w.pid ∧ p.city = 'Bloomington' ∧ 40000 ≤ w.salary ∧ w.cname ≠ 'Apple'}.

11. {p.pid, p.pname | Person(p) ∧ ∃c∃w(Company(c) ∧ worksFor(w) ∧ c.cname = w.cname ∧ p.pid = w.pid ∧ c.headquarter = 'LosGatos' ∧ ∃hm∃m(hasManager(hm) ∧ Person(m) ∧ hm.eid = p.pid ∧ hm.mid = m.pid ∧ m.city ≠ 'LosGatos'))}.

In abbreviated form,

{p.pid, p.pname | Person(p) ∧ ∃c ∈ Company ∃w ∈ worksFor(c.cname = w.cname ∧ p.pid = w.pid ∧ c.headquarter = 'LosGatos' ∧ ∃hm ∈ hasManager ∃m ∈ Person(hm.eid = p.pid ∧ hm.mid = m.pid ∧ m.city ≠ 'LosGatos'))}.

12. {s.skill | Skill(s) ∧ ¬(∃p∃ps(Person(p) ∧ personSkill(ps) ∧ p.pid = ps.pid ∧ ps.skill = s.skill ∧ p.city = 'Bloomington'))}.

In abbreviated form,

{s.skill | Skill(s) ∧ ¬(∃p ∈ Person ∃ps ∈ personSkill(p.pid = ps.pid ∧ ps.skill = s.skill ∧ p.city = 'Bloomington'))}.

13. {m.pid, m.pname | Person(m) ∧ ∀hm((hasManager(hm) ∧ hm.mid = m.pid) → ∃e(Person(e) ∧ hm.eid = e.pid ∧ e.city = m.city))}

In abbreviated form,

{m.pid, m.pname | Person(m) ∧ ∀hm ∈ hasManager(hm.mid = m.pid → ∃e ∈ Person(hm.eid = e.pid ∧ e.city = m.city))}

5 Formulating queries in the Tuple Relational Calculus

Formulate each of the queries in the even-numbered problems (i.e., Problems 2, 4, 6, and 8) of Section 3 as TRC queries. The solutions of these problems should be included in the assignment1.pdf file.

14. (Problem 2) Find the pid and pname of each person who (a) lives in Bloomington, (b) works for a company where he or she earns a salary that is higher than 30000, and (c) has at least one manager.
15. (Problem 4) Find the pid and pname of each person who lives in a city that is different from each city in which his or her managers live. (Persons who have no manager should not be included in the answer.)
16. (Problem 6) Find the pid, pname, and salary of each employee who has at least two managers such that these managers have a common job skill, provided that it is not the 'Networks' skill.
17. (Problem 8) For each company, list its name along with the highest salary made by employees who work for it.

6 Formulating constraints in the Tuple Relational Calculus

Formulate the following constraints in TRC and as boolean SQL queries. The TRC solutions of these problems should be included in the assignment1.pdf file and the SQL solutions should be included in the assignment1.sql file. Here is an example of what is expected for your answers.
Example 1 Consider the constraint "Each skill is the skill of a person." In TRC, this constraint can be formulated as follows:

  ∀s (Skill(s) → ∃ps (personSkill(ps) ∧ ps.skill = s.skill))

or, alternatively,

  ¬∃s (Skill(s) ∧ ¬∃ps (personSkill(ps) ∧ ps.skill = s.skill)).

This constraint can be specified using the following boolean SQL query:

  select not exists (select 1
                     from Skill s
                     where not exists (select 1
                                       from personSkill ps
                                       where ps.skill = s.skill));

18. Each person works for a company and has at least two job skills.
19. Some person has a salary that is strictly higher than the salary of each of his or her managers.
20. Each employee and his or her managers work for the same company.
Part I Theory

Before implementing our own 3D reconstruction, let's take a look at some simple theory questions that may arise. The answers to the questions below should be relatively short, consisting of a few lines of math and text (maybe a diagram if it helps your understanding).

Q1.1 (5 points) Suppose two cameras fixate on a point P (see Figure 1) in space such that their principal axes intersect at that point. Show that if the image coordinates are normalized so that the coordinate origin (0, 0) coincides with the principal point, the F33 element of the fundamental matrix is zero.

Figure 1: Figure for Q1.1. C1 and C2 are the optical centers. The principal axes intersect at point P.

Q1.2 (5 points) Consider the case of two cameras viewing an object such that the second camera differs from the first by a pure translation that is parallel to the x-axis. Show that the epipolar lines in the two cameras are also parallel to the x-axis. Back up your argument with relevant equations.

Q1.3 (5 points) Suppose we have an inertial sensor which gives us the accurate positions (Ri and ti, where R is the rotation matrix and t is the corresponding translation vector) of the robot at time i. What will be the effective rotation (Rrel) and translation (trel) between two frames at different time stamps? Supposing the camera intrinsics (K) are known, express the essential matrix (E) and the fundamental matrix (F) in terms of K, Rrel, and trel.

Figure 2: Figure for Q1.3. C1 and C2 are the optical centers. The rotation and the translation are obtained using inertial sensors. Rrel and trel are the relative rotation and translation between two frames.

Q1.4 (10 points) Suppose that a camera views an object and its reflection in a plane mirror. Show that this situation is equivalent to having two images of the object which are related by a skew-symmetric fundamental matrix. You may assume that the object is flat, meaning that all points on the object are of equal distance to the mirror. (Hint: draw the relevant vectors to understand the relationships between the camera, the object, and its reflected image.)

Part II Practice

1 Overview

In this part you will begin by implementing the two different methods seen in class to estimate the fundamental matrix from corresponding points in two images (Section 2). Next, given the fundamental matrix and calibrated intrinsics (which will be provided), you will compute the essential matrix and use this to compute a 3D metric reconstruction from 2D correspondences using triangulation (Section 3). Then, you will implement a method to automatically match points taking advantage of epipolar constraints and make a 3D visualization of the results (Section 4). Finally, you will implement RANSAC and bundle adjustment to further improve your algorithm (Section 5).

Figure 3: Temple images for this assignment

Figure 4: displayEpipolarF in helper.py creates a GUI for visualizing epipolar lines

2 Fundamental matrix estimation

In this section you will explore different methods of estimating the fundamental matrix given a pair of images. In the data/ directory, you will find two images (see Figure 3) from the Middlebury multi-view dataset (http://vision.middlebury.edu/mview/data/), which is used to evaluate the performance of modern 3D reconstruction algorithms.

The Eight Point Algorithm

The 8-point algorithm (discussed in class, and outlined in Section 10.1 of Forsyth & Ponce) is arguably the simplest method for estimating the fundamental matrix.
For this section, you can use the provided correspondences in data/some_corresp.npz.

Q2.1 (10 points) Finish the function eightpoint in submission.py. Make sure you follow the signature for this portion of the assignment:

  F = eightpoint(pts1, pts2, M)

where pts1 and pts2 are N × 2 matrices corresponding to the (x, y) coordinates of the N points in the first and second image respectively, and M is a scale parameter.

• You should scale the data as was discussed in class, by dividing each coordinate by M (the maximum of the image's width and height). After computing F, you will have to "unscale" the fundamental matrix. Hint: if x_normalized = Tx, then F_unnormalized = T^T F T. You must enforce the singularity condition of F before unscaling.
• You may find it helpful to refine the solution by using local minimization. This probably won't fix a completely broken solution, but may make a good solution better by locally minimizing a geometric cost function. For this we have provided a helper function refineF in helper.py taking in F and the two sets of points, which you can call from eightpoint before unscaling F.
• Remember that the x-coordinate of a point in the image is its column entry, and the y-coordinate is the row entry. Also note that eight-point is just a figurative name; it means that you need at least 8 points, and your algorithm should use an over-determined system (N > 8 points). (A sketch of this procedure appears at the end of this section.)
• To visualize the correctness of your estimated F, use the supplied function displayEpipolarF in helper.py, which takes in F and the two images. This GUI lets you select a point in one of the images and visualize the corresponding epipolar line in the other image (Figure 4).
• Output: Save your matrix F and scale M to the file q2_1.npz.

In your write-up: Write your recovered F and include an image of some example output of displayEpipolarF.

The Seven Point Algorithm

Q2.2 (15 points) Since the fundamental matrix only has seven degrees of freedom, it is possible to calculate F using only seven point correspondences. This requires solving a polynomial equation. In this section, you will implement the seven-point algorithm (described in class, and outlined in Section 15.6 of Forsyth and Ponce). Manually select 7 points from the provided points in data/some_corresp.npz, and use these points to recover a fundamental matrix F. The function should have the signature:

  Farray = sevenpoint(pts1, pts2, M)

where pts1 and pts2 are 7 × 2 matrices containing the correspondences, M is the normalizer (use the maximum of the images' height and width), and Farray is a list of length either 1 or 3 containing the fundamental matrix/matrices. Use M to normalize the point values between [0, 1] and remember to "unnormalize" your computed F afterwards.

• Use displayEpipolarF to visualize F and pick the correct one.
• Output: Save your matrix F, scale M, and 2D points pts1 and pts2 to the file q2_2.npz. In your write-up: Write your recovered F and include an image of some example output of displayEpipolarF using the seven point algorithm.
• Hints: You can use Numpy's function roots(). The epipolar lines may not match exactly due to imperfectly selected correspondences, and the algorithm is sensitive to small changes in the point correspondences. You may want to try with different sets of matches.
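For reference, a minimal sketch of the normalized eight-point estimator described in Q2.1 follows, assuming pts1 and pts2 are numpy arrays; the refineF refinement step is omitted, and the name eightpoint_sketch is illustrative rather than part of the starter code.

  import numpy as np

  def eightpoint_sketch(pts1, pts2, M):
      T = np.diag([1.0 / M, 1.0 / M, 1.0])       # scaling transform
      x1, x2 = pts1 / M, pts2 / M                # normalized coordinates
      # each correspondence contributes one row of the system A f = 0
      A = np.stack([
          x2[:, 0] * x1[:, 0], x2[:, 0] * x1[:, 1], x2[:, 0],
          x2[:, 1] * x1[:, 0], x2[:, 1] * x1[:, 1], x2[:, 1],
          x1[:, 0], x1[:, 1], np.ones(len(x1)),
      ], axis=1)
      F = np.linalg.svd(A)[2][-1].reshape(3, 3)  # least-squares solution
      U, S, Vt = np.linalg.svd(F)                # enforce rank 2 (singularity)
      F = U @ np.diag([S[0], S[1], 0.0]) @ Vt
      return T.T @ F @ T                         # unscale: F = T^T F' T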
3 Metric Reconstruction

You will compute the camera matrices and triangulate the 2D points to obtain the 3D scene structure. To obtain the Euclidean scene structure, first convert the fundamental matrix F to an essential matrix E. Examine the lecture notes and the textbook to find out how to do this when the internal camera calibration matrices K1 and K2 are known; these are provided in data/intrinsics.npz.

Q3.1 (5 points) Write a function to compute the essential matrix E given F, K1, and K2 with the signature:

  E = essentialMatrix(F, K1, K2)

In your write-up: Write your estimated E using F from the eight-point algorithm.

Given an essential matrix, it is possible to retrieve the projective camera matrices M1 and M2 from it. Assuming M1 is fixed at [I | 0], M2 can be retrieved up to a scale and four-fold rotation ambiguity. For details on recovering M2, see Section 7.2 in Szeliski. We have provided you with the function camera2 in python/helper.py to recover the four possible M2 matrices given E. Note: the M1 and M2 here are projection matrices of the form M1 = [I | 0] and M2 = [R | t].

Q3.2 (10 points) Using the above, write a function to triangulate a set of 2D coordinates in the image to a set of 3D points with the signature:

  [P, err] = triangulate(C1, pts1, C2, pts2)

where pts1 and pts2 are the N × 2 matrices with the 2D image coordinates and P is an N × 3 matrix with the corresponding 3D points per row. C1 and C2 are the 3 × 4 camera matrices. Remember that you will need to multiply the given intrinsics matrices with your solution for the canonical camera matrices to obtain the final camera matrices. Various methods exist for triangulation; probably the most familiar to you is based on least squares (see Szeliski Chapter 7 if you want to learn about other methods). For each point i, we want to solve for the 3D coordinates Pi = [xi, yi, zi]^T such that when they are projected back to the two images, they are close to the original 2D points. To project the 3D coordinates back to the 2D images, we first write Pi in homogeneous coordinates and compute C1 Pi and C2 Pi to obtain the 2D homogeneous coordinates projected to camera 1 and camera 2, respectively. For each point i, we can write this problem in the following form:

  Ai Pi = 0,

where Ai is a 4 × 4 matrix and Pi is a 4 × 1 vector of the 3D coordinates in homogeneous form. Then, you can obtain the homogeneous least-squares solution (discussed in class) to solve for each Pi. In your write-up: Write down the expression for the matrix Ai. Once you have implemented triangulation, check the performance by looking at the reprojection error:

  err = Σ_i ‖p1i − p̂1i‖² + ‖p2i − p̂2i‖²

where p̂1i = Proj(C1, Pi) and p̂2i = Proj(C2, Pi). Note: C1 and C2 here are projection matrices of the form C1 = K1 M1 = K1 [I | 0] and C2 = K2 M2 = K2 [R | t]. (A sketch of this triangulation appears at the end of this section.)

Q3.3 (10 points) Write a script findM2.py to obtain the correct M2 from M2s by testing the four solutions through triangulation. Use the correspondences from data/some_corresp.npz. Output: Save the correct M2, the corresponding C2, and the 3D points P to q3_3.npz.
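The following sketch shows the homogeneous least-squares triangulation described in Q3.2, assuming numpy inputs; the row construction makes one common form of Ai explicit, but deriving the expression is still your job in the write-up.

  import numpy as np

  def triangulate_sketch(C1, pts1, C2, pts2):
      P = np.zeros((len(pts1), 3))
      for i, ((x1, y1), (x2, y2)) in enumerate(zip(pts1, pts2)):
          A = np.stack([
              x1 * C1[2] - C1[0],   # x1 * (row 3) - (row 1) of camera 1
              y1 * C1[2] - C1[1],
              x2 * C2[2] - C2[0],
              y2 * C2[2] - C2[1],
          ])
          X = np.linalg.svd(A)[2][-1]   # homogeneous least-squares solution
          P[i] = X[:3] / X[3]           # dehomogenize

      def proj(C, pts3d):               # reproject 3D points to 2D
          Ph = np.hstack([pts3d, np.ones((len(pts3d), 1))])
          p = (C @ Ph.T).T
          return p[:, :2] / p[:, 2:3]

      err = np.sum((proj(C1, P) - pts1) ** 2) + np.sum((proj(C2, P) - pts2) ** 2)
      return P, err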
4 3D Visualization

You will now create a 3D visualization of the temple images. By treating our two images as a stereo pair, we can triangulate corresponding points in each image and render their 3D locations.

Q4.1 (15 points) Implement a function with the signature:

  [x2, y2] = epipolarCorrespondence(im1, im2, F, x1, y1)

This function takes in the x and y coordinates of a pixel on im1 and your fundamental matrix F, and returns the coordinates of the pixel on im2 which corresponds to the input point. The match is obtained by computing the similarity of a small window around the (x1, y1) coordinates in im1 to various windows around possible matches in im2 and returning the closest. Instead of searching for the matching point at every possible location in im2, we can use F and simply search over the set of pixels that lie along the epipolar line (recall that the epipolar line passes through a single point in im2 which corresponds to the point (x1, y1) in im1). There are various possible ways to compute the window similarity. For this assignment, simple methods such as the Euclidean or Manhattan distances between the intensities of the pixels should suffice. See Szeliski Chapter 11, on stereo matching, for a brief overview of these and other methods.

Implementation hints:
• Experiment with various window sizes.
• It may help to use a Gaussian weighting of the window, so that the center has greater influence than the periphery.
• Since the two images only differ by a small amount, it might be beneficial to consider matches for which the distance from (x1, y1) to (x2, y2) is small. (A sketch of this search appears after this question.)

To help you test your epipolarCorrespondence, we have included a helper function epipolarMatchGUI in python/helper.py, which takes in the two images and the fundamental matrix. This GUI allows you to click on a point in im1, and will use your function to display the corresponding point in im2. See Figure 5. It's not necessary for your matcher to get every possible point right, but it should get easy points (such as those with distinctive, corner-like windows). It should also be good enough to render an intelligible representation in the next question. Output: Save the matrix F and the points pts1 and pts2 which you used to generate the screenshot to the file q4_1.npz. In your write-up: Include a screenshot of epipolarMatchGUI with some detected correspondences.

Figure 5: epipolarMatchGUI shows the corresponding point found by calling epipolarCorrespondence
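A minimal sketch of the window search in Q4.1 follows, assuming a grayscale image, integer input coordinates away from the border, and the convention x2ᵀ F x1 = 0; the window size, search range, and plain SSD metric are illustrative defaults, not required choices.

  import numpy as np

  def epipolar_correspondence_sketch(im1, im2, F, x1, y1, w=5, search=40):
      l = F @ np.array([x1, y1, 1.0])          # epipolar line in image 2
      patch1 = im1[y1 - w:y1 + w + 1, x1 - w:x1 + w + 1].astype(float)
      best, best_xy = np.inf, (x1, y1)
      for y2 in range(y1 - search, y1 + search + 1):   # stay near (x1, y1)
          if abs(l[0]) < 1e-9:
              continue
          x2 = int(round(-(l[1] * y2 + l[2]) / l[0]))  # solve ax + by + c = 0
          if w <= x2 < im2.shape[1] - w and w <= y2 < im2.shape[0] - w:
              patch2 = im2[y2 - w:y2 + w + 1, x2 - w:x2 + w + 1].astype(float)
              d = np.sum((patch1 - patch2) ** 2)       # SSD window similarity
              if d < best:
                  best, best_xy = d, (x2, y2)
      return best_xy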
Q4.2 (10 points) Included in this homework is a file data/templeCoords.npz which contains 288 hand-selected points from im1, saved in the variables x1 and y1. Now, we can determine the 3D location of these point correspondences using the triangulate function. These 3D point locations can then be plotted using the Matplotlib or plotly package. Write a script visualize.py, which loads the necessary files from ../data/ to generate the 3D reconstruction using scatter. An example is shown in Figure 6. Output: Again, save the matrix F and the matrices M1, M2, C1, C2 which you used to generate the screenshots to the file q4_2.npz. In your write-up: Take a few screenshots of the 3D visualization so that the outline of the temple is clearly visible, and include them with your homework submission.

Figure 6: An example point cloud

5 Bundle Adjustment

Q5.1 (15 points) In some real-world applications, manually determining correspondences is infeasible, and often there will be noisy correspondences. Fortunately, the RANSAC method seen in class can be applied to the problem of fundamental matrix estimation. Implement the above algorithm with the signature:

  [F, inliers] = ransacF(pts1, pts2, M)

where M is defined in the same way as in Section 2 and inliers is a boolean vector of size equivalent to the number of points. Here inliers is set to true only for the points that satisfy the threshold defined for the given fundamental matrix F. We have provided some noisy correspondences in some_corresp_noisy.npz in which around 75% of the points are inliers. Compare the result of RANSAC with the result of eightpoint when run on the noisy correspondences. Briefly explain the error metrics you used, how you decided which points were inliers, and any other optimizations you may have made.

• Hints: Use the seven-point algorithm to compute the fundamental matrix from the minimal set of points. Then compute the inliers, and refine your estimate using all the inliers.

Q5.2 (15 points) So far we have independently solved for the camera matrix Mj and the 3D projections Pi. In bundle adjustment, we will jointly optimize the reprojection error with respect to the points Pi and the camera matrix Mj:

  err = Σ_ij ‖p_ij − Proj(Cj, Pi)‖²,

where Cj = Kj Mj, the same as in Q3.2. For this homework we are going to only look at optimizing the extrinsic matrix. To do this we will parametrize the rotation matrix R using the Rodrigues formula to produce a vector r ∈ R³. Write a function that converts a Rodrigues vector r to a rotation matrix R,

  R = rodrigues(r)

as well as the inverse function that converts a rotation matrix R to a Rodrigues vector r,

  r = invRodrigues(R)
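For reference, the forward direction of the Rodrigues conversion in Q5.2 can be sketched as below; the inverse (invRodrigues) needs extra care near rotation angles of 0 and π, which is omitted here.

  import numpy as np

  def rodrigues_sketch(r):
      # R = cos(t) I + (1 - cos(t)) u u^T + sin(t) [u]_x, with t = |r|, u = r/t
      theta = np.linalg.norm(r)
      if theta < 1e-12:
          return np.eye(3)                  # zero rotation
      u = (np.asarray(r, dtype=float) / theta).reshape(3)
      K = np.array([[0, -u[2], u[1]],
                    [u[2], 0, -u[0]],
                    [-u[1], u[0], 0]])      # cross-product matrix [u]_x
      return np.cos(theta) * np.eye(3) \
          + (1 - np.cos(theta)) * np.outer(u, u) \
          + np.sin(theta) * K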
Q5.3 (10 points) Using this parameterization, write an optimization function

residuals = rodriguesResidual(K1, M1, p1, K2, p2, x)

where x is the flattened concatenation of P, r2, and t2. P are the 3D points; r2 and t2 are the rotation (in Rodrigues vector form) and translation vectors associated with the projection matrix M2. The residuals are the difference between the original image projections and the estimated projections (the square of the 2-norm of this vector corresponds to the error we computed in Q3.2):

residuals = numpy.concatenate([(p1-p1_hat).reshape([-1]), (p2-p2_hat).reshape([-1])])

Using this error function and Scipy's nonlinear least-squares optimizer leastsq, write a function to optimize for the best extrinsic matrix and 3D points, using the inlier correspondences from some_corresp_noisy.npz and the RANSAC estimate of the extrinsics and 3D points as an initialization.

[M2, P] = bundleAdjustment(K1, M1, p1, K2, M2_init, p2, P_init)

In your write-up: include an image of the original 3D points and the optimized points, as well as the reprojection error with your initial M2 and P, and with the optimized matrices.

6 Deliverables

If your andrew id is bovik, your submission should be the writeup bovik.pdf and a zip file bovik.zip. Please submit the zip file to Canvas and the pdf file to Gradescope. The zip file should include the following directory structure:
• submission.py: your implementation of algorithms.
• findM2.py: script to compute the correct camera matrix.
• visualize.py: script to visualize the 3D points.
• q2_1.npz: file with output of Q2.1.
• q2_2.npz: file with output of Q2.2.
• q3_3.npz: file with output of Q3.3.
• q4_1.npz: file with output of Q4.1.
• q4_2.npz: file with output of Q4.2.

1 Theory

Q1.1 Theory [2 points] Prove that softmax is invariant to translation, that is,

$$\mathrm{softmax}(x) = \mathrm{softmax}(x + c) \quad \forall c \in \mathbb{R}$$

Softmax is defined as below, for each index i in a vector x:

$$\mathrm{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$$

Often we use $c = -\max x_i$. Why is that a good idea? (Tip: consider the range of values that the numerator will have with $c = 0$ and $c = -\max x_i$.)

Q1.2 Theory [2 points] Softmax can be written as a three-step process, with $s_i = e^{x_i}$, $S = \sum_i s_i$ and $\mathrm{softmax}(x_i) = \frac{1}{S} s_i$.
• As $x \in \mathbb{R}^d$, what are the properties of $\mathrm{softmax}(x)$, namely what is the range of each element? What is the sum over all elements?
• One could say that "softmax takes an arbitrary real-valued vector x and turns it into a ______".
• Can you see the role of each step in the multi-step process now? Explain them.
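To make the Q1.1 trick concrete, here is a minimal numpy sketch of a numerically stable softmax for a single vector; the helper name is hypothetical.

```python
import numpy as np

def softmax_stable(x):
    # Subtracting the max makes every exponent <= 0, so np.exp cannot
    # overflow; translation invariance (Q1.1) guarantees the same output.
    s = np.exp(x - np.max(x))
    return s / np.sum(s)
```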
Q1.3 Theory [2 points] Show that multi-layer neural networks without a non-linear activation function are equivalent to linear regression.

Q1.4 Theory [3 points] Given the sigmoid activation function $\sigma(x) = \frac{1}{1 + e^{-x}}$, derive the gradient of the sigmoid function and show that it can be written as a function of $\sigma(x)$ (without having access to x directly).

Q1.5 Theory [12 points] Given $y = x^T W + b$ (or $y_j = \sum_{i=1}^{d} x_i W_{ij} + b_j$), and the gradient of some loss J with respect to y, show how to get $\frac{\partial J}{\partial W}$, $\frac{\partial J}{\partial x}$ and $\frac{\partial J}{\partial b}$. Be sure to do the derivatives with scalars and re-form the matrix form afterwards. Here are some notational suggestions:

$$\frac{\partial J}{\partial y} = \delta \in \mathbb{R}^{k \times 1} \qquad W \in \mathbb{R}^{d \times k} \qquad x \in \mathbb{R}^{d \times 1} \qquad b \in \mathbb{R}^{k \times 1}$$

We won't grade the derivative with respect to b, but you should do it anyway; you will need it later in the assignment.

Q1.6 Theory [4 points] When the neural network applies an elementwise activation function (such as sigmoid), the gradient of the activation function scales the backpropagation update. This follows directly from the chain rule, $\frac{d}{dx} f(g(x)) = f'(g(x)) \, g'(x)$.

1. Consider the sigmoid activation function for deep neural networks. Why might it lead to a "vanishing gradient" problem if it is used for many layers (consider plotting the gradient you derived in Q1.4)?
2. Often it is replaced with $\tanh(x) = \frac{1 - e^{-2x}}{1 + e^{-2x}}$. What are the output ranges of both tanh and sigmoid? Why might we prefer tanh?
3. Why does tanh(x) have less of a vanishing gradient problem? (Plotting the derivatives helps! For reference: $\tanh'(x) = 1 - \tanh(x)^2$.)
4. tanh is a scaled and shifted version of the sigmoid. Show how $\tanh(x)$ can be written in terms of $\sigma(x)$. (Hint: consider how to make it have the same range.)

2 Implement a Fully Connected Network

All of these functions should be implemented in python/nn.py.

2.1 Network Initialization

Q2.1.1 Theory [2 points] Why is it not a good idea to initialize a network with all zeros? If you imagine that every layer has weights and biases, what can a zero-initialized network output after training?

Q2.1.2 Code [1 point] Implement a function to initialize neural network weights with Xavier initialization [1], where $\mathrm{Var}[w] = \frac{2}{n_{in} + n_{out}}$, where n is the dimensionality of the vectors and you use a uniform distribution to sample random numbers (see eq 16 in the paper).

Q2.1.3 Theory [1 point] Why do we initialize with random numbers? Why do we scale the initialization depending on layer size (see near Fig 6 in the paper)?

2.2 Forward Propagation

The appendix (sec 8) has the math for forward propagation; we will implement it here.

Q2.2.1 Code [4 points] Implement sigmoid, along with forward propagation for a single layer with an activation function, namely $y = \sigma(XW + b)$, returning the output and intermediate results for an N × D dimension input X, with examples along the rows and data dimensions along the columns.

Q2.2.2 Code [3 points] Implement the softmax function. Be sure to use the numerical stability trick you derived in Q1.1.

Q2.2.3 Code [3 points] Write a function to compute the accuracy of a set of labels, along with the scalar loss across the data. The loss function generally used for classification is the cross-entropy loss:

$$L_f(D) = -\sum_{(x, y) \in D} y \cdot \log(f(x))$$

Here D is the full training dataset of data samples x (N × 1 vectors, N = dimensionality of data) and labels y (C × 1 one-hot vectors, C = number of classes).

2.3 Backwards Propagation

Q2.3.1 Code [7 points] Compute backpropagation for a single layer, given the original weights, the appropriate intermediate results, and the gradient of the loss with respect to the layer output. You should return the gradient with respect to X so you can feed it into the next layer. As a size check, your gradients should be the same dimensions as the original objects.

2.4 Training Loop

You will tend to see gradient descent in three forms: "normal", "stochastic" and "batch". "Normal" gradient descent aggregates the updates for the entire dataset before changing the weights. Stochastic gradient descent applies updates after every single data example. Batch gradient descent is a compromise, where random subsets of the full dataset are evaluated before applying the gradient update.

Q2.4.1 Code [5 points] Write a training loop that generates random batches, iterates over them for many iterations, does forward and backward propagation, and applies a gradient update step.

2.5 Numerical Gradient Checker

Q2.5.1 [5 points] Implement a numerical gradient checker. Instead of using the analytical gradients computed from the chain rule, add a small offset $\epsilon$ to each element in the weights, and compute the numerical gradient of the loss with central differences. Central differences is just $\frac{f(x + \epsilon) - f(x - \epsilon)}{2\epsilon}$. Remember, this needs to be done for each scalar dimension in all of your weights independently.
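A minimal sketch of such a checker, assuming a loss function that takes the (possibly perturbed) weight array and returns a scalar; the names and the choice of epsilon are illustrative only.

```python
import numpy as np

def numeric_grad_sketch(loss_fn, W, eps=1e-5):
    # Central differences: perturb one scalar weight at a time and use
    # grad[i] = (L(W_i + eps) - L(W_i - eps)) / (2 * eps).
    grad = np.zeros_like(W)
    it = np.nditer(W, flags=['multi_index'])
    for _ in it:
        idx = it.multi_index
        orig = W[idx]
        W[idx] = orig + eps
        plus = loss_fn(W)
        W[idx] = orig - eps
        minus = loss_fn(W)
        W[idx] = orig                      # restore the weight
        grad[idx] = (plus - minus) / (2 * eps)
    return grad
```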
3 Training Models

First, be sure to run the script get_data.sh from inside the scripts folder. This will use wget and unzip to download
https://cmu.box.com/shared/static/ebklocte1t2wl4dja61dg1ct285l65md.zip
https://cmu.box.com/shared/static/147jyd6uewh1fff96yfh6e11s98v11xm.zip
and extract them to the data and image folders.

Since our input images are 32 × 32 images, unrolled into one 1024-dimensional vector that gets multiplied by $W^{(1)}$, each row of $W^{(1)}$ can be seen as a weight image. Reshaping each row into a 32 × 32 image can give us an idea of what types of images each unit in the hidden layer has a high response to.

We have provided you three data .mat files to use for this section. The training data in nist36_train.mat contains samples for each of the 26 upper-case letters of the alphabet and the 10 digits. This is the set you should use for training your network. The cross-validation set in nist36_valid.mat contains samples from each class, and should be used in the training loop to see how the network is performing on data that it is not training on. This will help to spot overfitting. Finally, the test data in nist36_test.mat contains testing data, and should be used for the final evaluation of your best model to see how well it will generalize to new unseen data.

Q3.1.1 Code [3 points] Train a network from scratch. Use a single hidden layer with 64 hidden units, and train for at least 30 epochs. Modify the script to generate two plots: one showing the accuracy on both the training and validation set over the epochs, and the other showing the cross-entropy loss averaged over the data. The x-axis should represent the epoch number, while the y-axis represents the accuracy or loss. With these settings, you should see an accuracy on the validation set of at least 75%.

Q3.1.2 Writeup [2 points] Use your modified training script to train three networks: one with your best learning rate, one with 10 times that learning rate and one with one tenth that learning rate. Include all 4 plots in your writeup. Comment on how the learning rates affect the training, and report the final accuracy of the best network on the test set.

Q3.1.3 Writeup [3 points] Visualize the first layer weights that your network learned (using reshape and ImageGrid). Compare these to the network weights immediately after initialization. Include both visualizations in your writeup. Comment on the learned weights. Do you notice any patterns?

Q3.1.4 Writeup [2 points] Visualize the confusion matrix for your best model. Comment on the top few pairs of classes that are most commonly confused.

4 Extract Text from Images

Figure 1: Sample image with handwritten characters annotated with boxes around each character.

Now that you have a network that can recognize handwritten letters with reasonable accuracy, you can use it to parse text in an image. Given an image with some text on it, our goal is to have a function that returns the actual text in the image. However, since your neural network expects a binary image with a single character, you will need to process the input image to extract each character. There are various possible approaches, so feel free to use any strategy you like. Here we outline one possible method (a sketch of the first two steps follows below); another is given in a tutorial.

1. Process the image (blur, threshold, opening morphology, etc., perhaps in that order) to classify all pixels as being part of a character or background.
2. Find connected groups of character pixels (see skimage.measure.label). Place a bounding box around each connected component.
3. Group the letters based on which line of the text they are a part of, and sort each group so that the letters are in the order they appear on the page.
4. Take each bounding box one at a time, resize it to 32 × 32, classify it with your network, and report the characters in order (inserting spaces when it makes sense).

Since the network you trained likely does not have perfect accuracy, you can expect there to be some errors in your final text parsing. Whichever method you choose to implement for the character detection, you should be able to place a box on most of the characters in the image. We have provided you with 01_list.jpg, 02_letters.jpg, 03_haiku.jpg and 04_deep.jpg to test your implementation on.
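A minimal sketch of steps 1 and 2 of the outlined method, using skimage; the thresholding choice (Otsu), the area cutoff, and the function name are all assumptions for illustration, not the required recipe.

```python
import skimage.color
import skimage.filters
import skimage.measure
import skimage.morphology

def find_letter_boxes_sketch(image):
    gray = skimage.color.rgb2gray(image)
    # Step 1: Otsu threshold separates dark ink from the light background
    bw = gray < skimage.filters.threshold_otsu(gray)
    bw = skimage.morphology.binary_opening(bw)        # remove speckle noise
    # Step 2: label connected components and box each one
    labels = skimage.measure.label(bw, connectivity=2)
    boxes = []
    for region in skimage.measure.regionprops(labels):
        if region.area > 100:                         # drop tiny components
            boxes.append(region.bbox)                 # (y1, x1, y2, x2)
    return boxes, 1.0 - bw.astype(float)              # characters black on white
```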
Q4.1 Theory [2 points] The method outlined above is pretty simplistic and makes several assumptions. What are two big assumptions that the sample method makes? In your writeup, include two example images where you expect the character detection to fail (either miss valid letters, or respond to non-letters).

Q4.2 Code [10 points] Find letters in the image. Given an RGB image, this function should return bounding boxes for all of the located handwritten characters in the image, as well as a binary black-and-white version of the image im. Each row of the matrix should contain [y1, x1, y2, x2], the positions of the top-left and bottom-right corners of the box. The black-and-white image should be floating point, 0 to 1, with the characters in black and the background in white.

Q4.3 Writeup [3 points] Run findLetters(..) on all of the provided sample images in images/. Plot all of the located boxes on top of the image to show the accuracy of your findLetters(..) function. Include all the result images in your writeup.

Q4.4 Code/Writeup [8 points] Now you will load the image, find the character locations, classify each one with the network you trained in Q3.1.1, and return the text contained in the image. Be sure you try to make your detected images look like the images from the training set. Visualize them and act accordingly. Run your run_q4 script on all of the provided sample images in images/. Include the extracted text in your writeup.

5 Image Compression with Autoencoders

An autoencoder is a neural network that is trained to attempt to copy its input to its output, but it usually allows copying only approximately. This is typically achieved by restricting the number of hidden nodes inside the autoencoder; in other words, the autoencoder is forced to learn to represent data with this limited number of hidden nodes. This is a useful way of learning compressed representations. In this section, we will continue using the NIST36 dataset you have from the previous questions.

5.1 Building the Autoencoder

Q5.1.1 Code [5 points] Due to the difficulty in training autoencoders, we have to move to the relu(x) = max(x, 0) activation function. It is provided for you in util.py. Implement a 2-hidden-layer autoencoder where the layers are
• 1024 to 32 dimensions, followed by a ReLU
• 32 to 32 dimensions, followed by a ReLU
• 32 to 32 dimensions, followed by a ReLU
• 32 to 1024 dimensions, followed by a sigmoid (this normalizes the image output for us)
The loss function that you're using is the total squared error for the output image compared to the input image (they should be the same!).

Q5.1.2 Code [3 points] To help even more with convergence speed, we will implement momentum. Now, instead of updating $W = W - \alpha \frac{\partial J}{\partial W}$, we will use the update rules $M_W = 0.9 M_W - \alpha \frac{\partial J}{\partial W}$ and $W = W + M_W$. To implement this, populate the parameters dictionary with zero-initialized momentum accumulators, one for each parameter. Then simply perform both update equations for every batch.

5.2 Training the Autoencoder

Q5.2 Writeup/Code [2 points] Using the provided default settings, train the network for 100 epochs. What do you observe in the plotted training loss curve as it progresses?

5.3 Evaluating the Autoencoder

Q5.3.1 Writeup/Code [2 points] Now let's evaluate how well the autoencoder has been trained. Select 5 classes from the total 36 classes in your dataset, and for each selected class include in your report 2 validation images and their reconstructions. What differences do you observe in the reconstructed validation images, compared to the original ones?

Q5.3.2 Writeup [2 points] Let's evaluate the reconstruction quality using Peak Signal-to-Noise Ratio (PSNR). PSNR is defined as

$$\mathrm{PSNR} = 20 \times \log_{10}(\mathrm{MAX}_I) - 10 \times \log_{10}(\mathrm{MSE}) \quad (1)$$

where $\mathrm{MAX}_I$ is the maximum possible pixel value of the image, and MSE (mean squared error) is computed across all pixels. You may use skimage.measure.compare_psnr for convenience. Report the average PSNR you get from the autoencoder across all validation images.
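Equation (1) is only a couple of lines of numpy; a minimal sketch (assuming images scaled to [0, 1], so MAX_I = 1, and ignoring the degenerate MSE = 0 case):

```python
import numpy as np

def psnr_sketch(original, reconstructed, max_i=1.0):
    # Equation (1): PSNR = 20*log10(MAX_I) - 10*log10(MSE)
    mse = np.mean((original - reconstructed) ** 2)
    return 20 * np.log10(max_i) - 10 * np.log10(mse)
```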
6 Comparing against PCA

As a baseline for comparison, we will use one of the most popular methods for data dimensionality reduction: Principal Component Analysis (PCA). PCA allows one to find the best low-rank approximation of the data by keeping only a specified number of principal components. To perform PCA, we will use a factorization method called Singular Value Decomposition (SVD).

Run SVD on the training data. One of the matrices will be an orthonormal matrix that indicates the components of your data, sorted by importance. Extract the first 32 principal components and form a projection matrix; you will need to figure out how to do this from the U, S, V matrices.
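A minimal sketch of forming such a projection matrix with numpy; the variable names are hypothetical, and whether you mean-center the data first depends on how you define the projection (the centering shown here is one common convention, not a requirement of the handout).

```python
import numpy as np

def pca_projection_sketch(train_x, dim=32):
    # train_x: (num_samples, 1024) matrix with one unrolled image per row
    mean = train_x.mean(axis=0)
    # Rows of Vt are the principal directions, sorted by singular value
    U, S, Vt = np.linalg.svd(train_x - mean, full_matrices=False)
    P = Vt[:dim]                           # (32, 1024) projection matrix
    return P, mean

# Reconstruction of test images from their 32 coefficients:
#   recon = (test_x - mean) @ P.T @ P + mean
```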
Q6.1 Writeup [2 points] What is the size of your projection matrix? What is its rank? This projection matrix was "trained" from our training data. Now let's "test" it on our test data. Use the projection matrix on the test data to obtain the reconstructed test images. Note that these reconstructions can also be represented with only 32 numbers.

Q6.2 Writeup [2 points] Use the classes you selected in Q5.3.1, and for each of these 5 classes, include in your report 2 test images and their reconstructions. You may use the test labels to help you find the corresponding classes. What differences do you observe in the reconstructed test images, compared to the original ones? How do they compare to the ones reconstructed by your autoencoder?

Q6.3 Writeup [2 points] Report the average PSNR you get from PCA. Is it better than your autoencoder? Why?

Q6.4 Writeup [2 points] Count the number of learned parameters for both your autoencoder and the PCA model. How many parameters do the respective approaches learn? Why is there such a difference in terms of performance?

7 PyTorch

While you were able to derive manual backpropagation rules for sigmoid and fully-connected layers, wouldn't it be nice if someone did that for lots of useful primitives and made it fast and easy to use for general computation? Meet automatic differentiation. Since we have high-dimensional inputs (images) and low-dimensional outputs (a scalar loss), it turns out reverse-mode AD is very efficient. Popular autodiff packages include pytorch (Facebook), tensorflow (Google), and autograd (Boston-area academics). Autograd provides a drop-in replacement for numpy's operators, except you can ask for gradients now. The other two are able to act as shim layers for cuDNN, Nvidia's library of GPU-accelerated deep-network primitives. Since GPUs are able to perform large amounts of math much faster than CPUs, this makes the former two packages very popular among researchers who train large networks. Tensorflow asks you to build a computational graph using its API, and then is able to pass data through that graph. PyTorch builds a dynamic graph and allows you to mix autograd functions with normal python code much more smoothly, so it is currently more popular among CMU students.

For extra credit, we will use PyTorch as a framework. Many computer vision projects use neural networks as a basic building block, so familiarity with one of these frameworks is a good skill to develop. Here, we basically replicate and slightly expand our handwritten character recognition networks, but do it in PyTorch instead of doing it ourselves. Feel free to use any tutorial you like, but we like the official one, this tutorial (in a jupyter notebook), or these slides (starting from number 35). For this section, you're free to implement these however you like. All of the tasks required here are fairly small and don't require a GPU if you use small networks; this includes Section 7.2.

7.1 Train a neural network in PyTorch

Q7.1.1 Code/Writeup [5 points] Re-write and re-train your fully-connected network on NIST36 in PyTorch. Plot training accuracy and loss over time.

Q7.1.2 Code/Writeup [2 points] Train a convolutional neural network with PyTorch on MNIST. Plot training accuracy and loss over time.

Q7.1.3 Code/Writeup [3 points] Train a convolutional neural network with PyTorch on the included NIST36 dataset.

Q7.1.4 Code/Writeup [10 points] Train a convolutional neural network with PyTorch on the EMNIST Balanced dataset and evaluate it on the findLetters bounding boxes from the images folder.

7.2 Fine Tuning

When training from scratch, a lot of epochs and data are often needed to learn anything meaningful. One way to avoid this is to instead initialize the weights more intelligently. These days, it is most common to initialize a network with weights from another deep network that was trained for a different purpose. This is because, whether we are doing image classification, segmentation, recognition, etc., most real images share common properties. Simply copying the weights from the other network to yours gives your network a head start, so your network does not need to learn these common weights from scratch all over again. This is also referred to as fine-tuning.

Q7.2.1 Code/Writeup [5 points] Fine-tune a single-layer classifier using pytorch on the flowers 17 (or flowers 102!) dataset using squeezenet1_1, as well as an architecture you've designed yourself (3 conv layers followed by 2 fc layers, a standard design) trained from scratch. How do they compare? We include a script in scripts/ to fetch the flowers dataset and extract it into data/oxford-flowers17 in a way that PyTorch's ImageFolder can consume it (see the example there). You should look at how SqueezeNet is defined, and just replace the classifier layer. There exists a pretty good example for fine-tuning in PyTorch.
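A minimal sketch of the classifier swap for Q7.2.1, assuming 17 output classes (flowers-17); exact torchvision APIs vary by version (newer releases use a weights argument instead of pretrained), so treat this as illustrative.

```python
import torch
import torch.nn as nn
import torchvision

model = torchvision.models.squeezenet1_1(pretrained=True)
for p in model.parameters():
    p.requires_grad = False               # freeze the pretrained features

# SqueezeNet's classifier head is a 1x1 Conv2d over the final feature map;
# replace it so the output has 17 channels instead of 1000.
model.classifier[1] = nn.Conv2d(512, 17, kernel_size=1)
model.num_classes = 17

# Only the new classifier parameters are optimized
optimizer = torch.optim.SGD(model.classifier.parameters(), lr=1e-3, momentum=0.9)
```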
References

[1] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. 2010. http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf.
[2] P. J. Grother. NIST Special Database 19: handprinted forms and characters database. https://www.nist.gov/srd/nist-special-database-19, 1995.

8 Appendix: Neural Network Overview

Deep learning has quickly become one of the most widely applied machine learning techniques in computer vision. Convolutional neural networks have been applied to many different computer vision problems such as image classification, recognition, and segmentation with great success. In this assignment, you will first implement a fully connected feed-forward neural network for handwritten character classification. Then, in the second part, you will implement a system to locate characters in an image, which you can then classify with your deep network. The end result will be a system that, given an image of handwritten text, will output the text contained in the image.

8.1 Basic Use

Here we will give a brief overview of the math for a single-hidden-layer feed-forward network. For a more detailed look at the math and derivation, please see the class slides.

A fully-connected network f, for classification, applies a series of linear and non-linear functions to an input data vector x of size N × 1 to produce an output vector f(x) of size C × 1, where each element i of the output vector represents the probability of x belonging to class i. Since the data samples are of dimensionality N, this means the input layer has N input units. To compute the value of the output units, we must first compute the values of all the hidden layers. The first hidden layer's pre-activation $a^{(1)}(x)$ is given by

$$a^{(1)}(x) = W^{(1)} x + b^{(1)}$$

Then the post-activation values of the first hidden layer $h^{(1)}(x)$ are computed by applying a non-linear activation function g to the pre-activation values:

$$h^{(1)}(x) = g(a^{(1)}(x)) = g(W^{(1)} x + b^{(1)})$$

Subsequent hidden layers' ($1 < t \leq T$) pre- and post-activations are given by:

$$a^{(t)}(x) = W^{(t)} h^{(t-1)}(x) + b^{(t)}$$
$$h^{(t)}(x) = g(a^{(t)}(x))$$

The output layer pre-activations $a^{(T)}(x)$ are computed in a similar way,

$$a^{(T)}(x) = W^{(T)} h^{(T-1)}(x) + b^{(T)}$$

and finally the post-activation values of the output layer are computed with

$$f(x) = o(a^{(T)}(x)) = o(W^{(T)} h^{(T-1)}(x) + b^{(T)})$$

where o is the output activation function. Please note the difference between g and o! For this assignment, we will be using the sigmoid activation function for the hidden layer, so:

$$g(y) = \frac{1}{1 + \exp(-y)}$$

Figure 2: Samples from the NIST Special Database 19 dataset [2]

where, when g is applied to a vector, it is applied elementwise across the vector. Since we are using this deep network for classification, a common output activation function to use is the softmax function. This allows us to turn the real-valued, possibly negative values of $a^{(T)}(x)$ into a set of probabilities (a vector of positive numbers that sum to 1). Letting $y_i$ denote the $i$th element of the vector y, the softmax function is defined as:

$$o_i(y) = \frac{\exp(y_i)}{\sum_j \exp(y_j)}$$

Gradient descent is an iterative optimization algorithm used to find local optima. To find a local minimum, we start at a point on the function and move in the direction of the negative gradient (steepest descent) until some stopping criterion is met.

8.2 Backprop

The update equations for a general weight $W^{(t)}_{ij}$ and bias $b^{(t)}_i$ are

$$W^{(t)}_{ij} = W^{(t)}_{ij} - \alpha \frac{\partial L_f(x)}{\partial W^{(t)}_{ij}} \qquad b^{(t)}_i = b^{(t)}_i - \alpha \frac{\partial L_f(x)}{\partial b^{(t)}_i}$$

where $\alpha$ is the learning rate. Please refer to the backpropagation slides for more details on how to derive the gradients. Note that here we are using the softmax loss (which is different from the least-squares loss in the slides).
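To tie the appendix equations together, here is a minimal numpy sketch of the single-hidden-layer forward pass (sigmoid hidden activation g, softmax output o); the function and variable names are illustrative, not the required nn.py interface.

```python
import numpy as np

def forward_sketch(x, W1, b1, W2, b2):
    # Hidden layer: pre-activation a1, then elementwise sigmoid g
    a1 = W1 @ x + b1
    h1 = 1.0 / (1.0 + np.exp(-a1))
    # Output layer: softmax o turns scores into class probabilities
    a2 = W2 @ h1 + b2
    e = np.exp(a2 - np.max(a2))            # stability trick from Q1.1
    return e / e.sum()
```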
This homework consists of four sections. In the first section you will implement a simple Lucas-Kanade (LK) tracker with a single template; in the second section, the tracker will be generalized to accommodate large appearance variation. The third section requires you to implement a motion subtraction method for tracking moving pixels in a scene. In the final section you will study efficient tracking, which includes inverse composition and correlation filters. Note that the first 3 sections are based on the Lucas-Kanade tracking framework, with the final section also incorporating correlation filters. Other than the course slide decks, the following references may also be helpful:

1. Simon Baker, et al. Lucas-Kanade 20 Years On: A Unifying Framework: Part 1, CMU-RI-TR-02-16, Robotics Institute, Carnegie Mellon University, 2002
2. Simon Baker, et al. Lucas-Kanade 20 Years On: A Unifying Framework: Part 2, CMU-RI-TR-03-35, Robotics Institute, Carnegie Mellon University, 2003

Both are available at:
https://www.ri.cmu.edu/pub_files/pub3/baker_simon_2002_3/baker_simon_2002_3.pdf
https://www.ri.cmu.edu/publications/lucas-kanade-20-years-on-a-unifying-framework-part-2/

1 Lucas-Kanade Tracking

In this section you will be implementing a simple Lucas & Kanade tracker with a single template. In the scenario of two-dimensional tracking with a pure translation warp function,

$$W(x; p) = x + p. \quad (1)$$

The problem can be described as follows: starting with a rectangular neighborhood of pixels $N = \{x_d\}_{d=1}^{D}$ on frame $I_t$, the Lucas-Kanade tracker aims to move it by an offset $p = [p_x, p_y]^T$ to obtain another rectangle on frame $I_{t+1}$, so that the pixel squared difference in the two rectangles is minimized:

$$p^* = \arg\min_p \sum_{x \in N} \| I_{t+1}(x + p) - I_t(x) \|_2^2 \quad (2)$$

$$= \arg\min_p \left\| \begin{bmatrix} I_{t+1}(x_1 + p) \\ \vdots \\ I_{t+1}(x_D + p) \end{bmatrix} - \begin{bmatrix} I_t(x_1) \\ \vdots \\ I_t(x_D) \end{bmatrix} \right\|_2^2 \quad (3)$$

Q1.1 (5 points) Starting with an initial guess of p (for instance, $p = [0, 0]^T$), we can compute the optimal $p^*$ iteratively. In each iteration, the objective function is locally linearized by first-order Taylor expansion,

$$I_{t+1}(x' + \Delta p) \approx I_{t+1}(x') + \frac{\partial I_{t+1}(x')}{\partial x'^T} \frac{\partial W(x; p)}{\partial p^T} \Delta p \quad (4)$$

where $\Delta p = [\Delta p_x, \Delta p_y]^T$ is the template offset. Further, $x' = W(x; p) = x + p$ and $\frac{\partial I(x')}{\partial x'^T}$ is a vector of the x- and y-image gradients at pixel coordinate $x'$. In a similar manner to Equation 3, one can incorporate these linearized approximations into a vectorized form such that

$$\arg\min_{\Delta p} \| A \Delta p - b \|_2^2 \quad (5)$$

such that $p \leftarrow p + \Delta p$ at each iteration.
• What is $\frac{\partial W(x; p)}{\partial p^T}$?
• What are A and b?
• What conditions must $A^T A$ meet so that a unique solution to $\Delta p$ can be found?

Q1.2 (15 points) Implement a function with the following signature

LucasKanade(It, It1, rect, p0 = np.zeros(2))

that computes the optimal local motion from frame $I_t$ to frame $I_{t+1}$ that minimizes Equation 3. Here It is the image frame $I_t$, It1 is the image frame $I_{t+1}$, rect is the 4-by-1 vector that represents a rectangle describing all the pixel coordinates within N within the image frame $I_t$, and p0 is the initial parameter guess ($\delta x$, $\delta y$). The four components of the rectangle are $[x_1, y_1, x_2, y_2]^T$, where $[x_1, y_1]^T$ is the top-left corner and $[x_2, y_2]^T$ is the bottom-right corner. The rectangle is inclusive, i.e., it includes all four corners. To deal with fractional movement of the template, you will need to interpolate the image using the Scipy module ndimage.shift or something similar. You will also need to iterate the estimation until the change in $\|\Delta p\|_2^2$ is below a threshold. In order to perform interpolation, you might find RectBivariateSpline from the scipy.interpolate package useful. Read the documentation on defining the spline (RectBivariateSpline) as well as evaluating the spline using RectBivariateSpline.ev carefully. A minimal sketch of the update loop follows.
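This sketch of the Q1.2 update loop is one possible structure under the stated conventions (grayscale frames, inclusive rect); the tolerance, iteration cap, and boundary handling are placeholders you would need to tune.

```python
import numpy as np
from scipy.interpolate import RectBivariateSpline

def lucas_kanade_sketch(It, It1, rect, p0=np.zeros(2), tol=1e-3, max_iters=100):
    x1, y1, x2, y2 = rect
    H, W = It.shape
    spline_t = RectBivariateSpline(np.arange(H), np.arange(W), It)
    spline_t1 = RectBivariateSpline(np.arange(H), np.arange(W), It1)
    ys, xs = np.meshgrid(np.arange(y1, y2 + 1), np.arange(x1, x2 + 1), indexing='ij')
    T = spline_t.ev(ys, xs)                       # template from frame t
    p = np.asarray(p0, dtype=float).copy()
    for _ in range(max_iters):
        wy, wx = ys + p[1], xs + p[0]             # warped coordinates in frame t+1
        err = (T - spline_t1.ev(wy, wx)).ravel()  # the b vector of Equation 5
        Ix = spline_t1.ev(wy, wx, dy=1).ravel()   # gradient along image x (spline axis 1)
        Iy = spline_t1.ev(wy, wx, dx=1).ravel()   # gradient along image y (spline axis 0)
        A = np.stack([Ix, Iy], axis=1)
        dp = np.linalg.lstsq(A, err, rcond=None)[0]
        p += dp
        if np.sum(dp ** 2) < tol:                 # stop once the update is tiny
            break
    return p
```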
Q1.3 (10 points) Write a script testCarSequence.py that loads the video frames from carseq.npy and runs the Lucas-Kanade tracker that you have implemented in the previous task to track the car. carseq.npy can be located in the data directory, and it contains one single three-dimensional matrix: the first two dimensions correspond to the height and width of the frames respectively, and the third dimension contains the indices of the frames (that is, the first frame can be visualized with imshow(frames[:, :, 0])). The rectangle in the first frame is $[x_1, y_1, x_2, y_2]^T = [59, 116, 145, 151]^T$. Report your tracking performance (image + bounding rectangle) at frames 1, 100, 200, 300 and 400 in a format similar to Figure 1. Also, create a file called carseqrects.npy, which contains one single n × 4 matrix, where each row stores the rect that you have obtained for each frame, and n is the total number of frames.

Figure 1: Lucas-Kanade Tracking with One Single Template

Q1.4 (20 points) As you might have noticed, the image content we are tracking in the first frame differs from that in the last frame. This is understandable, since we are updating the template after processing each frame and the error can accumulate. This problem is known as template drifting. There are several ways to mitigate this problem. Iain Matthews et al. (2003, https://www.ri.cmu.edu/publication_view.html?pub_id=4433) suggested one possible approach. Write a script testCarSequenceWithTemplateCorrection.py with a similar functionality to Q1.3, but with a template correction routine incorporated. Save the resulting rects as carseqrects-wcrt.npy, and also report the performance at those frames. An example is given in Figure 2.

Figure 2: Lucas-Kanade Tracking with Template Correction

Here the green rectangles are created with the baseline tracker in Q1.3, and the yellow ones with the tracker in Q1.4. The tracking performance has been improved non-trivially. Note that you do not necessarily have to draw two rectangles in each frame, but make sure that the performance improvement can be easily visually inspected.

2 Lucas-Kanade Tracking with Appearance Basis

The tracker we have implemented in the first section, with or without template drift correction, may suffice if the object being tracked is not subject to drastic appearance variation. However, in real life, this can hardly be the case. We have prepared another sequence, sylvseq.npy (the initial rectangle is [101, 61, 155, 107]), with exactly the same format as carseq.npy, on which you can test the baseline implementation and see what happens. In this section, you will implement a variant of the Lucas-Kanade tracker (see Section 3.4 in [2]) to model linear appearance variation in the tracking.

2.1 Appearance Basis

One way to address this issue is to use an eigen-space approach (aka principal component analysis, or PCA). The idea is to analyze the historic data we have collected on the object and produce a few bases, whose linear combination would most likely constitute the appearance of the object in the new frame. This is actually similar to the idea of having a lot of templates, but looking for too many templates may be expensive, so we only worry about the principal templates.
Mathematically, suppose we are given a set of K image bases $\{B_k\}_{k=1}^{K}$ of the same size. We can approximate the appearance variation of the new frame $I_{t+1}$ as a linear combination of the previous frame $I_t$ and the bases weighted by $w = [w_1, \ldots, w_K]^T$, such that

$$I_{t+1}(x) = I_t(x) + \sum_{k=1}^{K} w_k B_k(x) \quad (6)$$

Q2.1 (5 points) Express w as a function of $I_{t+1}$, $I_t$, and $\{B_k\}_{k=1}^{K}$, given Equation 6. Note that since the $B_k$'s are orthobases, they are orthogonal to each other.

2.2 Tracking

Given K bases $\{B_k\}_{k=1}^{K}$, our goal is then to simultaneously find the translation $p = [p_x, p_y]^T$ and the weights $w = [w_1, \ldots, w_K]^T$ that minimize the following objective function:

$$\min_{p, w} \sum_{x \in N} \left\| I_{t+1}(x + p) - I_t(x) - \sum_{k=1}^{K} w_k B_k(x) \right\|_2^2. \quad (7)$$

Again, starting with an initial guess of p (for instance, $p = [0, 0]^T$), one can linearize $I_{t+1}(x + p + \Delta p)$ with respect to $\Delta p$. In a similar manner to Equation 5, one can incorporate these linearized approximations into a vectorized form such that

$$\arg\min_{\Delta p, w} \| A \Delta p - b - B w \|_2^2. \quad (8)$$

As discussed in Section 3.4 of [2] (ignore the inverse compositional discussion), this can be simplified down to

$$\arg\min_{\Delta p} \| B^{\perp} (A \Delta p - b) \|_2^2 \quad (9)$$

where $B^{\perp}$ spans the null space of B. Note that $\| B^{\perp} z \|_2^2 = \| z - B B^T z \|_2^2$ when B is an orthobasis.

Q2.2 (15 points) Implement a function with the following signature

LucasKanadeBasis(It, It1, rect, bases, p0 = np.zeros(2))

where bases is a three-dimensional matrix that contains the bases. It has the same format as frames, as described earlier, and can be found in sylvbases.npy.

Q2.3 (15 points) Write a script testSylvSequence.py that loads the video frames from sylvseq.npy and runs the new Lucas-Kanade tracker to track the sylv (the toy). The bases are available in sylvbases.npy in the data directory. The rectangle in the first frame is $[x_1, y_1, x_2, y_2]^T = [101, 61, 155, 107]^T$. Please report the performance of this tracker at frames 1, 200, 300, 350 and 400 (the frame + bounding box), in comparison to that of the tracker in the first section. That is, there should be two rectangles for each frame, as exemplified in Figure 3. Also, create a sylvseqrects.npy for all the rects you have obtained for each frame. It should contain one single N × 4 matrix named rects, where N is the number of frames, and each row contains $[x_1, y_1, x_2, y_2]^T$, where $[x_1, y_1]^T$ is the coordinate of the top-left corner of the tracking box, and $[x_2, y_2]^T$ the bottom-right corner.

Figure 3: Lucas-Kanade Tracking with Appearance Basis

3 Affine Motion Subtraction

In this section, you will implement a tracker for estimating dominant affine motion in a sequence of images, and subsequently identify pixels corresponding to moving objects in the scene. You will be using the images in the file aerialseq.npy, which consists of aerial views of moving vehicles from a non-stationary camera.

3.1 Dominant Motion Estimation

In the first section of this homework we assumed that the motion is limited to pure translation. In this section you shall implement a tracker for affine motion using a planar affine warp function. To estimate dominant motion, the entire image $I_t$ will serve as the template to be tracked in image $I_{t+1}$; that is, $I_{t+1}$ is assumed to be approximately an affine warped version of $I_t$. This approach is reasonable under the assumption that a majority of the pixels correspond to stationary objects in the scene whose depth variation is small relative to their distance from the camera. Using a planar affine warp function you can recover the vector $\Delta p = [p_1, \ldots, p_6]^T$,

$$x' = W(x; p) = \begin{bmatrix} 1 + p_1 & p_2 \\ p_4 & 1 + p_5 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} + \begin{bmatrix} p_3 \\ p_6 \end{bmatrix}. \quad (10)$$

One can represent this affine warp in homogeneous coordinates as

$$\tilde{x}' = M \tilde{x} \quad (11)$$

where

$$M = \begin{bmatrix} 1 + p_1 & p_2 & p_3 \\ p_4 & 1 + p_5 & p_6 \\ 0 & 0 & 1 \end{bmatrix}. \quad (12)$$

Note that M will differ between successive image pairs. Starting with an initial guess of p = 0 (i.e. M = I), you will need to solve a sequence of least-squares problems to determine $\Delta p$ such that $p \leftarrow p + \Delta p$ at each iteration. Note that unlike previous examples, where the template to be tracked is usually small in comparison with the size of the image, image $I_t$ will almost always not be contained fully in the warped version $I_{t+1}$. Hence, one must only consider pixels lying in the region common to $I_t$ and the warped version of $I_{t+1}$ when forming the linear system at each iteration.

Q3.1 (15 points) Write a function with the following signature

LucasKanadeAffine(It, It1)

which returns the affine transformation matrix M, where It and It1 are $I_t$ and $I_{t+1}$ respectively. LucasKanadeAffine should be relatively similar to LucasKanade from the first section (you will probably also find scipy.ndimage.affine_transform helpful).

3.2 Moving Object Detection

Once you are able to compute the transformation matrix M relating an image pair $I_t$ and $I_{t+1}$, a naive way of determining pixels lying on moving objects is as follows: warp the image $I_t$ using M so that it is registered to $I_{t+1}$, and subtract it from $I_{t+1}$; the locations where the absolute difference exceeds a threshold can then be declared as corresponding to locations of moving objects. To obtain better results, you can check out the following scipy.ndimage functions: binary_erosion and binary_dilation.

Q3.2 (10 points) Using the function you have developed for dominant motion estimation, write a function with the following signature

SubtractDominantMotion(image1, image2)

where image1 and image2 form the input image pair, and the return value mask is a binary image of the same size that indicates which pixels are considered to correspond to moving objects. You should invoke LucasKanadeAffine in this function to derive the transformation matrix M, and produce the aforementioned binary mask accordingly.
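A minimal sketch of that Q3.2 pipeline, assuming your LucasKanadeAffine from Q3.1 returns the 3 × 3 matrix M of Equation 12; the threshold is an arbitrary placeholder, and whether M or its inverse goes into affine_transform depends on the direction convention you adopt.

```python
import numpy as np
import scipy.ndimage

def subtract_dominant_motion_sketch(image1, image2, tol=0.1):
    M = LucasKanadeAffine(image1, image2)          # your Q3.1 implementation
    # Warp image1 toward image2's frame; affine_transform pulls samples via
    # the given map, so pass M or np.linalg.inv(M) to match your convention.
    warped = scipy.ndimage.affine_transform(image1, M[:2, :2], offset=M[:2, 2])
    mask = np.abs(image2 - warped) > tol
    # Morphology cleans speckle and consolidates the detected regions
    mask = scipy.ndimage.binary_erosion(mask)
    mask = scipy.ndimage.binary_dilation(mask, iterations=2)
    return mask
```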
Q3.3 (10 points) Write a script testAerialSequence.py that loads the image sequence from aerialseq.npy and runs the motion detection routine you have developed to detect the moving objects. Report the performance at frames 30, 60, 90 and 120 with the corresponding binary masks superimposed, as exemplified in Figure 4. Feel free to visualize the motion detection performance in a way that you prefer, but please make sure it can be visually inspected without undue effort.

Figure 4: Moving object detection on the aerial sequence

4 Efficient Tracking

4.1 Inverse Composition

The inverse compositional extension of the Lucas-Kanade algorithm (see [1]) has been used in the literature to great effect for the task of efficient tracking. When utilized within tracking, it attempts to linearize the current frame as

$$I_t(W(x; 0 + \Delta p)) \approx I_t(x) + \frac{\partial I_t(x)}{\partial x^T} \frac{\partial W(x; 0)}{\partial p^T} \Delta p. \quad (13)$$

In a similar manner to the conventional Lucas-Kanade algorithm, one can incorporate these linearized approximations into a vectorized form such that

$$\arg\min_{\Delta p} \| A' \Delta p - b' \|_2^2 \quad (14)$$

For the specific case of an affine warp, where $p \leftarrow M$ and $\Delta p \leftarrow \Delta M$, this results in the update $M = M (\Delta M)^{-1}$.

Q4.1 (15 points) Reimplement the function LucasKanadeAffine(It, It1) as InverseCompositionAffine(It, It1) using the inverse compositional method. In your own words, please describe why the inverse compositional approach is more computationally efficient than the classical approach.

4.2 Correlation Filters

Enter the directory Corr-Filters. Run the script file example.py, and observe a visual depiction of the extraction of sub-images from the Lena image (Figure 1) as well as the desired output response from our linear discriminant (Figure 2). Inspecting example.py, you can see that it extracts a set of sub-images $X = [x_1, \ldots, x_N]$ from within the Lena image. These sub-images are stored in vector form so that $x_n \in \mathbb{R}^D$ (in the example code all the sub-images are 29 × 45, therefore D = 1305). Associated with these images are desired output labels $y = [y_1, \ldots, y_N]$ where $y_n$ lies between zero and one. For this example we have made D = N on purpose. Please proceed to answer the following questions:

Q4.2 (15 points) A linear least-squares discriminant can be estimated by solving

$$\arg\min_g \sum_{n=1}^{N} \frac{1}{2} \| y_n - x_n^T g \|_2^2$$

We can simplify this objective in vector form and include an additional penalty term:

$$\arg\min_g \frac{1}{2} \| y - X^T g \|_2^2 + \frac{\lambda}{2} \| g \|_2^2. \quad (15)$$

Please write down the solution to Equation 15 in terms of the matrices $S = X X^T$, X and y. Place the answer in your written response.
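Looking ahead to how the result gets used in Q4.3, here is a minimal sketch that assumes you have already solved Equation 15 for the weight vector g and loaded the Lena image as a 2D float array; names and patch shape follow the example code's description above.

```python
import scipy.ndimage

def apply_discriminant_sketch(g, lena, shape=(29, 45)):
    filt = g.reshape(shape)                 # back from a D-vector to a 2D template
    # Correlation slides the template over the image without flipping it,
    # so peaks in the response mark locations resembling the learned pattern.
    response = scipy.ndimage.correlate(lena, filt, mode='constant')
    return response
```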
Q4.3 (15 points) Add your solution to Equation 15 to the example.py code. Visualize the resulting linear discriminant weight vector g for the penalty values λ = 0 and λ = 1 (remember to use the reshape function to convert g back into a 2D array). Apply the filter to the entire Lena image using the scipy.ndimage.correlate function. Visualize the responses for both values of λ. Include these visualizations in your written response document. Can you comment on which value of λ performs best and why? Place your answers and figures in your written response.

Q4.4 (15 points) Visualize the response you get if you attempt to use the 2D convolution function scipy.ndimage.convolve. Why does this give a different response from the one obtained using correlate? How could you use numpy indexing operations to get a response more similar to the one obtained using correlate? Place the answer in your written response.

5 Deliverables

The assignment should be submitted to Canvas. The writeup should be submitted as a pdf named <andrewid>.pdf. The code should be submitted as a zip named <andrewid>.zip. The zip, when uncompressed, should produce the following files:
• LucasKanade.py
• LucasKanadeAffine.py
• LucasKanadeBasis.py
• SubtractDominantMotion.py
• InverseCompositionAffine.py
• testCarSequence.py
• testSylvSequence.py
• testCarSequenceWithTemplateCorrection.py
• testAerialSequence.py
• carseqrects.npy
• carseqrects-wcrt.npy
• sylvseqrects.npy
• aerialseqrects.npy
Do not include the data directory in your submission.
In this homework, you will implement an interest point (keypoint) detector, a feature descriptor and an image stitching tool based on feature matching and homography. Interest point (keypoint) detectors find particularly salient points in an image. We can then extract a feature descriptor that helps describe the region around each of the interest points. SIFT, SURF and BRIEF are all examples of commonly used feature descriptors. Once we have extracted the interest points, we can use feature descriptors to match them (find correspondences) between images to do interesting things like panorama stitching or scene reconstruction.

We will implement the BRIEF feature descriptor in this homework. It has a compact representation, is quick to compute, has a discriminative yet easily computed distance metric, and is relatively simple to implement. This allows for real-time computation, as you have seen in class. Most importantly, as you will see, it is also just as powerful as more complex descriptors like SIFT for many cases.

After matching the features that we extract, we will explore the homography between images based on the locations of the matched features. Specifically, we will look at planar homographies. Why is this useful? In many robotics applications, robots must often deal with tabletops, ground, and walls, among other flat planar surfaces. When two cameras observe a plane, there exists a relationship between the captured images. This relationship is defined by a 3 × 3 transformation matrix, called a planar homography. A planar homography allows us to compute how a planar scene would look from a second camera location, given only the first camera image. In fact, we can compute how images of the planes will look from any camera at any location without knowing any internal camera parameters and without actually taking the pictures, all using the planar homography matrix.

1 Keypoint Detector

We will implement an interest point detector similar to SIFT. A good reference for its implementation can be found in [3]. Keypoints are found by using the Difference of Gaussian (DoG) detector. This detector finds points that are extrema in both scale and space of a DoG pyramid. This is described in [1], an important paper in computer vision. Here, we will implement a simplified version of the DoG detector described in Section 3 of [3].

HINT: All the functions to implement here are located in keypointDetect.py.

NOTE: The parameters to use for the following sections are: $\sigma_0 = 1$, $k = \sqrt{2}$, levels = [−1, 0, 1, 2, 3, 4], $\theta_c = 0.03$ and $\theta_r = 12$.

1.1 Gaussian Pyramid

In order to create a DoG pyramid, we will first need to create a Gaussian pyramid. Gaussian pyramids are constructed by progressively applying a low-pass Gaussian filter to the input image. This function is already provided for you in keypointDetect.py:

GaussianPyramid = createGaussianPyramid(im, sigma0, k, levels)

The function takes as input an image, which is going to be converted to grayscale with intensity between 0 and 1 (hint: cv2.cvtColor(…)), the scale of the zeroth level of the pyramid sigma0, the pyramid factor k, and a vector levels specifying the levels of the pyramid to construct. At level l in the pyramid, the image is smoothed by a Gaussian filter with $\sigma_l = \sigma_0 k^l$. The output GaussianPyramid is an R × C × L matrix, where R × C is the size of the input image im and L is the number of levels. An example of a Gaussian pyramid can be seen in Figure 1.
Figure 1: Example Gaussian pyramid for model_chickenbroth.jpg
Figure 2: Example DoG pyramid for model_chickenbroth.jpg

You can visualize this pyramid with the provided function displayPyramid(pyramid).

1.2 The DoG Pyramid (5 pts)

The DoG pyramid is obtained by subtracting successive levels of the Gaussian pyramid. Specifically, we want to find:

$$D_l(x, y, \sigma_l) = (G(x, y, \sigma_l) - G(x, y, \sigma_{l-1})) * I(x, y) \quad (1)$$

where $G(x, y, \sigma_l)$ is the Gaussian filter used at level l and ∗ is the convolution operator. Due to the distributive property of convolution, this simplifies to

$$D_l(x, y, \sigma_l) = G(x, y, \sigma_l) * I(x, y) - G(x, y, \sigma_{l-1}) * I(x, y) \quad (2)$$
$$= GP_l - GP_{l-1} \quad (3)$$

where $GP_l$ denotes level l in the Gaussian pyramid.

Q 1.2: Write the following function to construct a Difference of Gaussian pyramid:

DoGPyramid, DoGLevels = createDoGPyramid(GaussianPyramid, levels)

The function should return DoGPyramid, an R × C × (L − 1) matrix of the DoG pyramid created by differencing the GaussianPyramid input. Note that you will have one level less than the Gaussian pyramid. DoGLevels is an (L − 1) vector specifying the corresponding levels of the DoG pyramid (it should be the last L − 1 elements of levels). An example of the DoG pyramid can be seen in Figure 2.

Figure 3.a: Without edge suppression. Figure 3.b: With edge suppression.
Figure 3: Interest point (keypoint) detection without and with edge suppression for model_chickenbroth.jpg

1.3 Edge Suppression (10 pts)

The Difference of Gaussian function responds strongly on corners and edges in addition to blob-like objects. However, edges are not desirable for feature extraction, as they are not as distinctive and do not provide a substantially stable localization for keypoints. Here, we will implement the edge removal method described in Section 4.1 of [3], which is based on the principal curvature ratio in a local neighborhood of a point. The paper presents the observation that edge points will have a "large principal curvature across the edge but a small one in the perpendicular direction."

Q 1.3: Implement the following function:

PrincipalCurvature = computePrincipalCurvature(DoGPyramid)

The function takes in DoGPyramid generated in the previous section and returns PrincipalCurvature, a matrix of the same size where each point contains the curvature ratio R for the corresponding point in the DoG pyramid:

$$R = \frac{\mathrm{Tr}(H)^2}{\mathrm{Det}(H)} = \frac{(\lambda_{min} + \lambda_{max})^2}{\lambda_{min} \lambda_{max}} \quad (4)$$

where H is the Hessian of the Difference of Gaussian function (i.e. one level of the DoG pyramid) computed by using pixel differences, as mentioned in Section 4.1 of [3] (hint: use the Sobel filter cv2.Sobel(…)).

$$H = \begin{bmatrix} D_{xx} & D_{xy} \\ D_{yx} & D_{yy} \end{bmatrix} \quad (5)$$

This is similar in spirit to, but different from, the Harris corner detection matrix you saw in class. Both methods examine the eigenvalues λ of a matrix, but the method in [3] performs a test without requiring the direct computation of the eigenvalues. Note that you need to compute each term of the Hessian before being able to take the trace and determinant. We can see that R reaches its minimum when the two eigenvalues $\lambda_{min}$ and $\lambda_{max}$ are equal, meaning that the curvature is the same in the two principal directions. Edge points, in general, will have a principal curvature significantly larger in one direction than the other. To remove edge points, we simply check against a threshold, $R > \theta_r$. Fig. 3 shows the DoG detector with and without edge suppression.
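A minimal sketch of Equation 4 with Sobel derivatives; the level-by-level loop is kept explicit for readability, and the small constant guarding against division by zero is an assumption on my part rather than part of the spec.

```python
import numpy as np
import cv2

def principal_curvature_sketch(DoGPyramid):
    R = np.zeros_like(DoGPyramid)
    for l in range(DoGPyramid.shape[2]):
        D = DoGPyramid[:, :, l]
        Dx = cv2.Sobel(D, cv2.CV_64F, 1, 0)
        Dy = cv2.Sobel(D, cv2.CV_64F, 0, 1)
        Dxx = cv2.Sobel(Dx, cv2.CV_64F, 1, 0)    # second derivatives for the Hessian
        Dxy = cv2.Sobel(Dx, cv2.CV_64F, 0, 1)
        Dyy = cv2.Sobel(Dy, cv2.CV_64F, 0, 1)
        trace = Dxx + Dyy
        det = Dxx * Dyy - Dxy * Dxy
        R[:, :, l] = trace ** 2 / (det + 1e-12)  # Equation 4, guarded against /0
    return R
```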
1.4 Detecting Extrema (10 pts)

To detect corner-like, scale-invariant interest points, the DoG detector chooses points that are local extrema in both scale and space. Here, we will consider a point's eight neighbors in space and its two neighbors in scale (one in the scale above and one in the scale below).

Q 1.4: Write the function:

locsDoG = getLocalExtrema(DoGPyramid, DoGLevels, PrincipalCurvature, th_contrast, th_r)

This function takes as input DoGPyramid and DoGLevels from Section 1.2 and PrincipalCurvature from Section 1.3. It also takes two threshold values, th_contrast and th_r. The threshold $\theta_c$ should remove any point that is a local extremum but does not have a Difference of Gaussian (DoG) response magnitude above this threshold (i.e. $|D(x, y, \sigma)| > \theta_c$). The threshold $\theta_r$ should remove any edge-like points that have too large a principal curvature ratio specified by PrincipalCurvature. The function should return locsDoG, an N × 3 matrix where the DoG pyramid achieves a local extremum in both scale and space, and which also satisfies the two thresholds. The first and second columns of locsDoG should be the (x, y) values of the local extremum and the third column should contain the corresponding level of the DoG pyramid where it was detected. (Try to eliminate loops in the function so that it runs efficiently.)

NOTE: In all implementations, we assume the x coordinate corresponds to columns and the y coordinate corresponds to rows. For example, the coordinate (10, 20) corresponds to (row 21, column 11) in the image.

1.5 Putting it together (5 pts)

Q 1.5: Write the following function to combine the above parts into a DoG detector:

locsDoG, GaussianPyramid = DoGdetector(im, sigma0, k, levels, th_contrast, th_r)

The function should take in a grayscale image, im, scaled between 0 and 1, and the parameters sigma0, k, levels, th_contrast, and th_r. It should use each of the above functions and return the keypoints in locsDoG and the Gaussian pyramid in GaussianPyramid. Figure 3 shows the keypoints detected for an example image. Note that we are dealing with real images here, so your keypoint detector may find points with high scores that you do not perceive to be corners. Include the image with the detected keypoints in your report (similar to the one shown in Fig. 3(b)). You can use any of the provided images.

2 BRIEF Descriptor

Now that we have interest points that tell us where to find the most informative feature points in the image, we can compute descriptors that can be used to match to other views of the same point in different images. The BRIEF descriptor encodes information from a 9 × 9 patch p centered around the interest point at the characteristic scale of the interest point. See the lecture notes to refresh your memory.

HINT: All the functions to implement here are located in BRIEF.py.

2.1 Creating a Set of BRIEF Tests (5 pts)

The descriptor itself is a vector that is n bits long, where each bit is the result of the following simple test:

$$\tau(P; x, y) := \begin{cases} 1 & \text{if } P[x] < P[y] \\ 0 & \text{otherwise} \end{cases} \quad (6)$$

Set n to 256 bits. There is no need to encode the test results as actual bits; it is fine to encode them as a 256-element vector. There are many choices for the 256 test pairs {x, y} (remember x and y are 2D vectors relating to discrete 2D coordinates within the 2D image patch matrix P) used to compute $\tau(P; x, y)$ (each of the n bits). The authors describe and test some of them in [2]. Read Section 3.2 of that paper and implement one of these solutions. You should generate a static set of test pairs and save that data to a file. You will use these pairs for all subsequent computations of the BRIEF descriptor.

Q 2.1: Write the function to create the x and y pairs that we will use for comparison to compute τ:

compareX, compareY = makeTestPattern(patchWidth, nbits)

patchWidth is the width of the image patch (usually 9) and nbits is the number of tests n in the BRIEF descriptor. compareX and compareY are linear indices into the patchWidth × patchWidth image patch and are each nbits × 1 vectors. Run this routine for the given parameters patchWidth = 9 and n = 256 and save the results in testPattern.npy. Include this file in your submission.
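A minimal sketch of one of the sampling strategies from Section 3.2 of the BRIEF paper (both coordinates drawn i.i.d. uniformly over the patch); the fixed seed is an assumption used here to make the pattern static across runs.

```python
import numpy as np

def make_test_pattern_sketch(patch_width=9, nbits=256):
    rng = np.random.default_rng(0)          # fixed seed -> reproducible pattern
    compareX = rng.integers(0, patch_width ** 2, size=(nbits, 1))
    compareY = rng.integers(0, patch_width ** 2, size=(nbits, 1))
    return compareX, compareY

compareX, compareY = make_test_pattern_sketch()
np.save('testPattern.npy', np.hstack([compareX, compareY]))  # persist for reuse
```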
2.2 Compute the BRIEF Descriptor (10 pts)

Now we can compute the BRIEF descriptor for the detected keypoints.

Q 2.2: Write the function:

locs, desc = computeBrief(im, GaussianPyramid, locsDoG, k, levels, compareX, compareY)

where im is a grayscale image with values from 0 to 1, locsDoG are the keypoint locations returned by the DoG detector from Section 1.5, levels are the Gaussian scale levels that were given in Section 1, and compareX and compareY are the test patterns computed in Section 2.1 and saved into testPattern.npy. The function returns locs, an m × 3 vector, where the first two columns are the image coordinates of keypoints and the third column is the pyramid level of the keypoints, and desc, an m × nbits matrix of stacked BRIEF descriptors. m is the number of valid descriptors in the image and will vary. You may have to be careful about the input DoG detector locations, since they may be at the edge of an image where we cannot extract a full patch of width patchWidth. Thus, the number of output locs may be less than the input locsDoG.

Note: It's possible that you may not require all the arguments to this function to compute the desired output. They have just been provided to permit the use of several different approaches to solve this problem.

2.3 Putting it all Together (5 pts)

Q 2.3: Write a function:

locs, desc = briefLite(im)

which accepts a grayscale image im with values between 0 and 1 and returns locs, an m × 3 vector, where the first two columns are the image coordinates of keypoints and the third column is the pyramid level of the keypoints, and desc, an m × nbits matrix of stacked BRIEF descriptors. m is the number of valid descriptors in the image and will vary. n is the number of bits for the BRIEF descriptor. This function should perform all the necessary steps to extract the descriptors from the image, including:
• Compute the DoG pyramid.
• Get keypoint locations.
• Compute a set of valid BRIEF descriptors.

2.4 Check Point: Descriptor Matching (5 pts)

A descriptor's strength is in its ability to match to other descriptors generated by the same world point despite change of view, lighting, etc. The distance metric used to compute the similarity between two descriptors is critical. For BRIEF, this distance metric is the Hamming distance. The Hamming distance is simply the number of bits in two descriptors that differ. (Note that the position of the bits matters.) To perform the descriptor matching mentioned above, we have provided the following function in BRIEF.py:

matches = briefMatch(desc1, desc2, ratio)

which accepts an m1 × nbits stack of BRIEF descriptors from a first image and an m2 × nbits stack of BRIEF descriptors from a second image, and returns a p × 2 matrix of matches, where the first column contains indices into desc1 and the second column contains indices into desc2. Note that m1, m2, and p may be different sizes and p ≤ min(m1, m2).
Q 2.4: Write a test script, or utilize the code provided in the main function of BRIEF.py, to load two of the chickenbroth images and compute feature matches. Use the provided plotMatches and briefMatch functions to visualize the result:

plotMatches(im1, im2, matches, locs1, locs2)

where im1 and im2 are two color images stored as uint8, matches is the list of matches returned by briefMatch, and locs1 and locs2 are the locations of keypoints from briefLite. Save the resulting figure and submit it in your PDF report. Also, present results with the two incline images and with the computer vision textbook cover page (the template is in file pf_scan_scaled.jpg) against the other pf_* images. Briefly discuss any cases that perform worse or better. See Figure 4 for an example result. Suggestion for debugging: a good test of your code is to check that you can match an image to itself.

Figure 4: Example of BRIEF matches for model_chickenbroth.jpg and chickenbroth_01.jpg.

2.5 BRIEF and rotations (10 pts)

You may have noticed worse performance under rotations. Let's investigate this!

Q 2.5: Take the model_chickenbroth.jpg test image and match it to itself while rotating the second image (hint: cv2.getRotationMatrix2D(…), cv2.warpAffine(…)) in increments of 10 degrees. Count the number of correct matches at each rotation and construct a bar graph showing rotation angle vs. the number of correct matches. Include this in your PDF and explain why you think the descriptor behaves this way. Create a separate script briefRotTest.py that performs this task.
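A minimal sketch of the rotation sweep; it assumes the briefLite and briefMatch functions described above, approximates "correct" matches by the raw match count rather than geometric verification, and the ratio value is an arbitrary placeholder.

```python
import cv2

im = cv2.imread('model_chickenbroth.jpg')
gray = cv2.cvtColor(im, cv2.COLOR_BGR2GRAY) / 255.0   # briefLite expects [0, 1]
locs1, desc1 = briefLite(gray)
h, w = gray.shape
counts = []
for angle in range(0, 360, 10):
    R = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    rotated = cv2.warpAffine(gray, R, (w, h))         # rotate the second copy
    locs2, desc2 = briefLite(rotated)
    matches = briefMatch(desc1, desc2, ratio=0.8)
    counts.append(len(matches))                       # bar-graph this vs. angle
```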
4 Planar Homographies: Implementation (10 pts)

Note: Implement the method in planarH.py.

Now that we have derived how to find H mathematically in Q 3.1, we will implement it in this section.

Q 4.1 (10pts): Implement the function

H2to1 = computeH(X1, X2)

Inputs: X1 and X2 are 2 × N matrices of corresponding $(x, y)^T$ coordinates between two images. Output: H2to1 is a 3 × 3 matrix encoding the homography that best matches the linear equation derived above for Equation 8 (in the least squares sense). Hint: remember that a homography is only determined up to scale. The numpy.linalg functions eigh() or svd() will be useful. Note that this function can be written without an explicit for-loop over the data points.

5 RANSAC (15 pts)

Note: Implement the method in planarH.py.

The least squares method you implemented for computing homographies is not robust to outliers. If all the correspondences are good matches, this is not a problem. But even a single false correspondence can completely throw off the homography estimation. When correspondences are determined automatically (using BRIEF feature matching, for instance), some mismatches in a set of point correspondences are almost certain. RANSAC (Random Sample Consensus) can be used to fit models robustly in the presence of outliers.

Q 5.1 (15pts): Write a function that uses RANSAC to compute homographies automatically between two images:

bestH = ransacH(matches, locs1, locs2, nIter, tol)

The inputs and output of this function should be as follows:

• Inputs: locs1 and locs2 are matrices specifying point locations in each of the images, and matches is a matrix specifying matches between these two sets of point locations. These matrices are formatted identically to the output of the provided briefMatch function.
• Algorithm input parameters: nIter is the number of iterations to run RANSAC for, and tol is the tolerance value for considering a point to be an inlier. Define your function so that these two parameters have reasonable default values.
• Outputs: bestH should be the homography model with the most inliers found during RANSAC.

Figure 5: Example output for Q 6.1. (a) incline_L.jpg (img1); (b) incline_R.jpg (img2); (c) img2 warped to img1's frame. Original images img1 and img2 (left and center) and img2 warped to fit img1 (right). Notice that the warped image clips out of the frame; we will fix this in Q 6.2.

6 Stitching it together: Panoramas (40 pts)

Note: All the functions to implement here are in panoramas.py.

We can also use homographies to create a panorama image from multiple views of the same scene. This is possible, for example, when there is no camera translation between the views (e.g., only rotation about the camera center). First, you will generate panoramas using matched point correspondences between images from the BRIEF matching in Section 2.4. We will assume that there is no error in your matched point correspondences between images (although there might be some, and even small errors can have drastic impacts). In the next section you will extend the technique to deal with noisy keypoint matches.

You will need to use the perspective warping function from OpenCV:

warp_im = cv2.warpPerspective(im, H, out_size)

which warps image im using the homography transform H. The pixels in warp_im are sampled at coordinates in the rectangle (0, 0) to (out_size[0]-1, out_size[1]-1). The coordinates of the pixels in the source image are taken to be (0, 0) to (im.shape[1]-1, im.shape[0]-1) and transformed according to H.
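As a quick illustration of the out_size convention, here is a small usage sketch (the file path and the identity homography are placeholders of ours):

import cv2
import numpy as np

im = cv2.imread('data/incline_R.jpg')   # placeholder input path
H = np.eye(3)                           # identity homography: a pure copy
# out_size is (width, height); destination pixels that map outside the
# source image are filled with black by default.
out_size = (2 * im.shape[1], im.shape[0])
warp_im = cv2.warpPerspective(im, H, out_size)
# warp_im now holds im in its left half and black in its right half.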
To understand this function, you may review Homework 0.

Q 6.1 (15pts): In this problem you will implement and use the function

panoImg = imageStitching(img1, img2, H2to1)

on two images from the Duquesne Incline. This function accepts two images and the output of the homography estimation function. It:

(a) warps img2 into img1's reference frame using the aforementioned perspective warping function, and
(b) blends img1 and the warped img2 and outputs the panorama image.

For this problem, use the provided images incline_L as img1 and incline_R as img2. The point correspondences pts are generated by your BRIEF descriptor matching. Apply your ransacH() to these correspondences to compute H2to1, the homography from incline_R onto incline_L. Then apply this homography to incline_R using warpH(). Note: since the warped image will be translated to the right, you will need a larger target image. Visualize the warped image and save the figure as results/6_1.jpg using OpenCV's cv2.imwrite() function, and save only H2to1 as results/q6_1.npy using NumPy's np.save() function.

Q 6.2 (15pts): Notice how the output from Q 6.1 is clipped at the edges? We will fix this now. Implement a function

panoImg = imageStitching_noClip(img1, img2, H2to1)

that takes the same input types and produces the same outputs as in Q 6.1. To prevent clipping at the edges, we instead warp both image 1 and image 2 into a common third reference frame in which both images can be displayed without any clipping. Specifically, we want to find a matrix M that performs only scaling and translation, such that:

warp_im1 = cv2.warpPerspective(im1, M, out_size)
warp_im2 = cv2.warpPerspective(im2, np.matmul(M, H2to1), out_size)

This produces warped images in a common reference frame where all points in im1 and im2 are visible. To do this, we will take as input only the width or the height of out_size and compute the other from the given images, so that the warped images are not squeezed or elongated in the panorama. For now, assume we take as input only the width (i.e., out_size[0]) and must therefore compute the correct height (i.e., out_size[1]).

Hint: the computation will be done in terms of H2to1 and the extreme points (corners) of the two images. Make sure M includes only scale (find the aspect ratio of the full-sized panorama image) and translation.

Again, pass incline_L as img1 and incline_R as img2. Save the resulting panorama as results/q6_2_pan.jpg.

Q 6.3 (10pts): You now have all the tools you need to automatically generate panoramas. Write a function

im3 = generatePanorama(im1, im2)

that accepts two images as input, computes keypoints and descriptors for both images, finds putative feature correspondences by matching the keypoint descriptors, estimates a homography using RANSAC, warps one of the images with that homography so that the two are aligned, and then overlays them; a possible skeleton is sketched after the figure below. Run your code on the image pair data/incline_L.jpg, data/incline_R.jpg. During debugging, however, try scaled-down versions of the images to keep the running time low. Save the resulting panorama on the full-sized images as results/q6_3.jpg (see Figure 6 for example output). Include the figure in your writeup.

Figure 6: Final panorama view, with the homography estimated using RANSAC.
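For orientation only, a minimal sketch of how the pieces might fit together; the helper names follow this handout, but the module imports, parameter values, and grayscale conversion are our assumptions, not the required implementation:

import cv2
from BRIEF import briefLite, briefMatch
from planarH import ransacH
from panoramas import imageStitching_noClip

def generatePanorama(im1, im2):
    # im1, im2: color uint8 images as loaded by cv2.imread.
    gray1 = cv2.cvtColor(im1, cv2.COLOR_BGR2GRAY) / 255.0
    gray2 = cv2.cvtColor(im2, cv2.COLOR_BGR2GRAY) / 255.0
    # Keypoints and BRIEF descriptors (Section 2.3).
    locs1, desc1 = briefLite(gray1)
    locs2, desc2 = briefLite(gray2)
    # Putative correspondences via Hamming-distance matching (Section 2.4).
    matches = briefMatch(desc1, desc2, ratio=0.8)
    # Robust homography estimation in the presence of outliers (Section 5).
    H2to1 = ransacH(matches, locs1, locs2, nIter=5000, tol=2)
    # Warp into a common frame and blend without clipping (Section 6.2).
    return imageStitching_noClip(im1, im2, H2to1)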
7 Augmented Reality (25 pts)

Homographies are also very useful for Augmented Reality (AR) applications. For this AR exercise we will be using the image in Figure 7 with the following data:

$$W = \begin{bmatrix} 0.0 & 18.2 & 18.2 & 0.0 \\ 0.0 & 0.0 & 26.0 & 26.0 \\ 0.0 & 0.0 & 0.0 & 0.0 \end{bmatrix}$$

are the 3D planar points of the textbook (in cm);

$$X = \begin{bmatrix} 483 & 1704 & 2175 & 67 \\ 810 & 781 & 2217 & 2286 \end{bmatrix}$$

are the corresponding projected 2D points; and

$$K = \begin{bmatrix} 3043.72 & 0.0 & 1196.00 \\ 0.0 & 3043.72 & 1604.00 \\ 0.0 & 0.0 & 1.0 \end{bmatrix}$$

is the intrinsics matrix for the image.

Figure 7: Image file we will be using in this exercise, where we provide you with the 3D planar points of the corners of the book (W), the 2D projected points (X) in the image, and the camera intrinsic parameters (K).

The relationship between a 3D planar point and its 2D projection can be defined through

$$\lambda_n \tilde{x}_n = K [R,\ t]\, \tilde{w}_n \quad (10)$$

where $x_n$ and $w_n$ are the n-th column vectors of X and W.

Q7.1 (10pts): Using the method from Section 15.4.1 (pp. 335-337) of Prince's textbook, implement the function

R, t = compute_extrinsics(K, H)

where K contains the intrinsic parameters and H contains the estimated homography. As discussed in class, the extrinsic parameters are the relative 3D rotation R and translation t of the camera between views. Remember: Prince uses the notation Ω and τ to refer to the extrinsic parameters R and t, respectively, and Λ to refer to the intrinsics matrix K.

Q7.2 (15pts): Next implement the function

X = project_extrinsics(K, W, R, t)

that applies a pinhole camera projection (described in Equation 10) to any set of 3D points, using a pre-determined set of extrinsics and intrinsics. Then use this function to project the set of 3D points in the file sphere.txt (which has been fixed to have the same radius as a tennis ball, 6.858 cm) onto the image prince_book.jpg. Specifically, attempt to place the bottom of the ball in the middle of the "o" in the "Computer" text of the book title. Use matplotlib's plot function to display the points of the sphere in yellow. For this question, disregard self-occlusion (i.e., points on the projected sphere occluding other points). Capture the image and include it in your written response.

8 Submission and Deliverables

The assignment should be submitted to Canvas as a zip file named .zip and to Gradescope as a PDF file named hw2.pdf. You can use check_files.py to ensure you have all required files ready before submission. The zip file should contain:

• A folder code containing all the .py files you were asked to write.
• A PDF hw2.pdf.
• A folder results containing all the .npy files you were asked to generate.

The PDF hw2.pdf should contain the results, explanations, and images asked for in the assignment, along with the answers to the questions on homographies. Submit all the code needed to make your panorama generator run. Make sure all the .py files that need to run are accessible from the code folder without any editing of the path variable. You may leave the data folder in your submission, but it is not needed. Note: failure to follow this structure will incur a heavy penalty on your score!

Appendix: Image Blending

Note: This section is not for credit and is for informational purposes only.

For overlapping pixels, it is common to blend the values of both images. You can simply average the values, but that will leave a seam at the edges of the overlapping images. Alternatively, you can compute a blending weight for each image that fades one image into the other.
To do this, first create a mask like the following for each image you wish to blend:

import numpy as np
import scipy.ndimage

# Start from a zero image and mark the border pixels.
mask = np.zeros((im.shape[0], im.shape[1]))
mask[0, :] = 1
mask[-1, :] = 1
mask[:, 0] = 1
mask[:, -1] = 1
# Distance from each interior pixel to the nearest border pixel.
mask = scipy.ndimage.distance_transform_edt(1 - mask)
# Normalize so the weights lie in [0, 1].
mask = mask / mask.max()

The function distance_transform_edt computes the distance transform of the binarized input image, so this mask will be zero at the borders and one at the center of the image. You can warp this mask just as you warped your images. How would you use the mask weights to compute a linear combination of the pixels in the overlap region? Your function should behave well where one or both of the blending weights are zero; one possible scheme is sketched below.
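To close the loop, here is one hedged sketch of such a linear combination, assuming mask1 and mask2 have been warped into the common panorama frame alongside their images (this is our illustration, not the only valid scheme):

import numpy as np

def blend(warp_im1, warp_im2, mask1, mask2):
    # Weight each image by its mask and normalize by the total weight, so
    # the result fades smoothly across the overlap region.
    w = mask1 + mask2
    w[w == 0] = 1.0  # avoid dividing by zero where neither image contributes
    out = (warp_im1 * mask1[..., None] + warp_im2 * mask2[..., None]) / w[..., None]
    return out.astype(np.uint8)

Where only one image is present its weight dominates, so the pixel is copied through unchanged; where both weights are zero the output is simply black.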