Assignment Chef

Browse assignments

Assignment catalog

33,401 assignments available

[SOLVED] Small program 5 cop3223c introduction to programming with c

For each problem in the assignment, you will create the definition of the user-defined function that is asked for in the description. If you do not create a user-defined function for each of the problems, then you will receive no credit for that problem. Creating user-defined functions is good practice! You must also write the function prototypes! Missing function prototypes will result in points being deducted. Function prototypes are good practice as well.

The file must be named smallprogram5_lastname_firstname.c, where lastname and firstname are your last and first names (as registered in Webcourses). For example, Dr. Steinberg's file would be named smallprogram5_Steinberg_Andrew.c. Make sure to include the underscore character _. If your file is not named properly, points will be deducted. The script the graders use pulls your name from the file name. It is imperative you follow these steps so you can get your points!

In this small programming assignment, you have two problems that involve the use of FILE I/O. The necessary text files have been provided for you in Webcourses on the assignment page.

Testing on Eustis

It is your responsibility to test your code on Eustis. If you submit your assignment without testing on Eustis, you risk points being deducted for things that may behave differently on your operating system. Remember, you cannot dispute grades if your code didn't work properly on Eustis just because it worked on your machine. The Eustis environment gives the final say. Plan accordingly to test on Eustis!!

The Python Script File! Read Carefully! NEW INFORMATION!

A Python script has been provided for you to test your code against a sample output of Dr. Steinberg's solution. This script will check that your output matches Dr. Steinberg's solution file exactly, as the graders are using it to grade your assignments. The script removes leading and trailing white space, but not white space in the actual text. If there is anything off with the output (including the expected answer), the script will say your output is not correct. This includes cases where your output produces the correct answer but something is off with the output display.

The script is going to run 5 unique scenarios for each problem (5 test cases). Each test case contains a different set of input values to ensure your code produces the correct answer. In your previous assignments, Dr. Steinberg would provide 1 sample solution that you would upload to Eustis. Now, there are 5 solution text files you are going to need to upload to Eustis. Before you test your program, your directory on Eustis should look something like this:

NEW INFO FOR FILE I/O

In this assignment there are 2 FILE I/O problems. In order to properly test them with the script, you are going to need to upload 10 text files (for problems 2 and 3) that are going to be read, or compared against what your program writes. Those files are:

1. grades1.txt
2. grades2.txt
3. grades3.txt
4. grades4.txt
5. grades5.txt
6. steinbergrecipt_testcase_1.txt
7. steinbergrecipt_testcase_2.txt
8. steinbergrecipt_testcase_3.txt
9. steinbergrecipt_testcase_4.txt
10. steinbergrecipt_testcase_5.txt

Your Eustis directory should look something like this: If you have these files, you are ready to run the script. Use the following command to test your code with Dr. Steinberg's provided solution sample.

python3 sp5test.py

Figure 1: Your setup for testing on Eustis.
If the script says your output is incorrect, check out the sample text file that was generated (a new text file will be created by the script that contains YOUR output). If your numbers are off or different from Dr. Steinberg's, then something is not right with your code's logic for calculating the answer. However, if your numbers match Dr. Steinberg's solution, then there is extra/missing white space or there are extra/missing newlines. Compare the text file generated by the script with the solution text file line by line to find the missing/extra white space or newlines. Once you believe you have found the error, rerun the script to see if the output matches.

The Rubric

Please see the assignment page for the established rubric on Webcourses.

Comment Header

Make sure you place a comment header at the top of your C file. You will use single-line comments to write your name, professor, course, and assignment. For example, Dr. Steinberg's header would be:

//Andrew Steinberg
//Dr. Steinberg
//COP3223C Section 1
//Small Program 5

Missing a comment header will result in point deductions!

Problem 1

Write a user-defined function called change. The user enters the amount paid and the amount due in the main function. The program determines how many dollars, quarters, dimes, nickels, and pennies should be given as change. The function has two double reference parameters. The first parameter represents the amount paid and the second represents the amount due. Assume the user will always pay over the amount due. Pointers are required to receive full credit for the problem. If pointers aren't used, then no credit will be given.

IMPORTANT! In this problem some of you may run into a round-off error (due to loss of information). This is very normal to witness in the industry and it's good to learn how to deal with it. Some programmers will define a small ϵ value such as 0.00025 to deal with this round-off error. Other methods exist as well. Make sure to talk to the TAs and ULAs if you are struggling with this problem. You do not need to worry about invalid input for this problem. The following figure shows the sample output for this problem.

Figure 2: Sample output for problem 1. Make sure it matches this for the script when testing.
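As a rough illustration of Problem 1, here is a minimal sketch of a change function that takes the two double reference parameters and uses a small epsilon when converting the remaining amount to whole cents, as suggested above. The prompt and output wording are assumptions; the real output must match the sample in figure 2.

#include <stdio.h>

void change(double *paid, double *due);   /* prototype, as required */

void change(double *paid, double *due)
{
    double epsilon = 0.00025;                            /* guards against round-off error */
    int cents = (int)((*paid - *due + epsilon) * 100);    /* change owed, in whole cents */

    int dollars  = cents / 100;  cents %= 100;
    int quarters = cents / 25;   cents %= 25;
    int dimes    = cents / 10;   cents %= 10;
    int nickels  = cents / 5;    cents %= 5;
    int pennies  = cents;

    printf("Dollars: %d\nQuarters: %d\nDimes: %d\nNickels: %d\nPennies: %d\n",
           dollars, quarters, dimes, nickels, pennies);
}

int main(void)
{
    double paid, due;
    printf("Enter the amount paid: ");   /* prompt wording is an assumption */
    scanf("%lf", &paid);
    printf("Enter the amount due: ");
    scanf("%lf", &due);
    change(&paid, &due);                 /* addresses passed, so pointers are required */
    return 0;
}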
Problem 2

Figure 3: Table that shows each option with its associated item, price, and the response printed to the terminal after the user selects that option.

You have been hired to work at Bob's Burgers Restaurant in a temp contract position. Your job is to help the restaurant create receipts for customers who want a copy of their order transaction. Write a user-defined function called resterauntReceipt. The function has no parameters and does not return anything. Inside the function, you are going to collect the customer's order. This includes the number of each item along with the associated price. After the user finishes ordering, the function will print a receipt to a text file. The receipt is a text file called myreceipt.txt. Use the table in figure 3 to create the menu of options for adding items to the order, along with the response that is displayed in the terminal window. You do not need to worry about invalid options being entered in this problem. Once the user is done entering their order, they must enter 0 to proceed with the receipt printing. When 0 is entered, the terminal will display the message Order is now placed. Printing receipt. This will cause the text file to be generated with the items selected along with the total cost. Check out the sample text file in Webcourses to see the formatting of the text file.

Figure 4: Sample output for problem 2.

Problem 3

Dr. Steinberg needs your help! He just finished assigning final letter grades in his Microcomputer Applications course and wants to know the grade distribution. The grade distribution is simply the number of students that were awarded each letter grade. In his Microcomputer Applications class, Dr. Steinberg doesn't use the plus/minus system, so the only letter grades are A, B, C, D, and F. The registrar site downloads all the letters as a txt file called grades.txt (already provided in Webcourses). Each line in the file contains a letter grade that was awarded. Write a user-defined function called gradeDistribution that takes no parameters. The function opens the text file of letter grades and reads it line by line. After reading all the grades, the program will display the distribution of each grade assigned. The text file that you read from must be in the same directory as your C source file. You do not need to worry about invalid input, and each letter will be on its own line.

Figure 5: Sample output for problem 3.

Problem 4

Write a user-defined function called incrementUpdate. The function takes one integer reference called val. The function updates the value stored in val by 1. The user will be able to update val until the user enters another option. This other option will cause the user-defined function to terminate. Display the value stored in val in the main function before and after the call. Inside the main function, declare and initialize val to 0. If pointers aren't used, then no credit will be given.

Figure 6: Sample output for problem 4. Make sure it matches this for the script when testing.
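As an illustration of the FILE I/O that Problem 3 asks for, here is a minimal sketch of a gradeDistribution function that reads grades.txt one letter per line and counts each grade. The display wording is an assumption and must be checked against figure 5.

#include <stdio.h>

void gradeDistribution(void);

void gradeDistribution(void)
{
    FILE *fp = fopen("grades.txt", "r");   /* file sits in the same directory as the source */
    char letter;
    int a = 0, b = 0, c = 0, d = 0, f = 0;

    if (fp == NULL)
        return;                            /* nothing to count if the file is missing */

    /* " %c" skips the newline before each letter grade */
    while (fscanf(fp, " %c", &letter) == 1) {
        if (letter == 'A') a++;
        else if (letter == 'B') b++;
        else if (letter == 'C') c++;
        else if (letter == 'D') d++;
        else if (letter == 'F') f++;
    }
    fclose(fp);

    /* output format is an assumption; match the sample output when testing */
    printf("A: %d\nB: %d\nC: %d\nD: %d\nF: %d\n", a, b, c, d, f);
}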

$25.00 View

[SOLVED] Small program 4 cop3223c introduction to programming with c

This assignment contains a set of problems that are to be completed in one C file. You have learned about creating user-defined functions and why they are so beneficial to us programmers. For each problem in the assignment, you will create the definition of the user-defined function that is asked for in the description. If you do not create a user-defined function for each of the problems, then you will receive no credit for that problem. Creating user-defined functions is good practice! You must also write the function prototypes! Missing function prototypes will result in points being deducted. Function prototypes are good practice as well.

The file must be named smallprogram4_lastname_firstname.c, where lastname and firstname are your last and first names (as registered in Webcourses). For example, Dr. Steinberg's file would be named smallprogram4_Steinberg_Andrew.c. Make sure to include the underscore character _. If your file is not named properly, points will be deducted. The script the graders use pulls your name from the file name. It is imperative you follow these steps so you can get your points!

Testing on Eustis

It is your responsibility to test your code on Eustis. If you submit your assignment without testing on Eustis, you risk points being deducted for things that may behave differently on your operating system. Remember, you cannot dispute grades if your code didn't work properly on Eustis just because it worked on your machine. The Eustis environment gives the final say. Plan accordingly to test on Eustis!!

Displaying Statements! Read Carefully!

Dr. Steinberg and his TAs have noticed that students are not using the newline escape sequence for the last display statement of past small programs. Please make sure to use the newline escape character for all statements. The only time you will not use the newline escape character is when user input is collected. Pay close attention to the screenshots.

The Python Script File! Read Carefully! NEW INFORMATION!

A Python script has been provided for you to test your code against a sample output of Dr. Steinberg's solution. This script will check that your output matches Dr. Steinberg's solution file exactly, as the graders are using it to grade your assignments. The script removes leading and trailing white space, but not white space in the actual text. If there is anything off with the output (including the expected answer), the script will say your output is not correct. This includes cases where your output produces the correct answer but something is off with the output display.

New Info: The script is going to run 5 unique scenarios for each problem (5 test cases). Each test case contains a different set of input values to ensure your code produces the correct answer. In your previous assignments, Dr. Steinberg would provide 1 sample solution that you would upload to Eustis. Now, there are 5 solution text files you are going to need to upload to Eustis. Before you test your program, your directory on Eustis should look something like this:

Figure 1: Your setup for testing on Eustis. 5 sample txt files (provided for you in Webcourses), your C program, and the python test script. These files are the solution output for each test case.

If you have these files, you are ready to run the script. After you run the script, 5 new text files are going to be generated. Use the following command to test your code with Dr. Steinberg's provided solution sample.
python3 sp4test.py

Figure 2: Your Eustis setup after running the script in Eustis.

The Rubric

Please see the assignment page for the established rubric on Webcourses.

Comment Header

Make sure you place a comment header at the top of your C file. You will use single-line comments to write your name, professor, course, and assignment. For example, Dr. Steinberg's header would be:

//Andrew Steinberg
//Dr. Steinberg
//COP3223C Section 1
//Small Program 4

Missing a comment header will result in point deductions!

Problem 1

Write a user-defined function definition called perfectSquare that prints a nice hollow square made out of * characters. The function has no parameters and does not return any values. Inside the function definition, you will prompt the user for a number that will be used in generating the square. If the user inputs an invalid number (0 or negative), then an error message should be displayed informing the user and asking them to try again. See figure 3 for a sample output of the problem.

Figure 3: Sample output for problem 1. Make sure your output matches this for the script!

Problem 2

Write a user-defined function definition called elevator that simulates the menu of an elevator with options the user can select. The function takes no arguments and does not return anything. The user-defined function will ask the user which floor they would like to go to. The user will select one of the twelve options and the program will display the floor selected. See figure 4 for the message that is displayed for each respective floor number. If the user selects any valid option 1-12, the program will ask the user to enter another option. If the user selects option 5, the program will not loop again and will display the message "Elevator door is now open. Please exit now." If the user selects an invalid option, the message "That is not a valid option." is displayed. Figure 5 shows a sample run on the terminal.

Figure 4: Output based on the floor selected.

Figure 5: Sample output for problem 2. Make sure the output matches for the script to test.

Problem 3

Write a user-defined function definition called pyramid that prints the following pattern. The function has no parameters and does not return any values. Inside the function definition, you will prompt the user for a number that will be used in generating the pattern of '-'. If the user inputs an invalid number (0 or negative), then an error message should be displayed informing the user and asking them to try again. See figure 6 for a sample output of the problem.

Figure 6: Sample output for problem 3. Make sure your output matches this for the script!
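For Problem 1, a minimal sketch of a hollow square drawn with nested loops is shown below. The prompt and error message wording are assumptions; the exact text and spacing must come from figure 3.

#include <stdio.h>

void perfectSquare(void);

void perfectSquare(void)
{
    int n, row, col;

    printf("Enter a number: ");            /* prompt wording is an assumption */
    scanf("%d", &n);
    while (n <= 0) {                       /* 0 or negative is invalid, ask again */
        printf("Invalid number. Try again: ");
        scanf("%d", &n);
    }

    for (row = 0; row < n; row++) {
        for (col = 0; col < n; col++) {
            /* print * only on the border so the square is hollow */
            if (row == 0 || row == n - 1 || col == 0 || col == n - 1)
                printf("*");
            else
                printf(" ");
        }
        printf("\n");
    }
}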
Problem 4

You have been asked by the legendary Ms. Valerie Frizzle to calculate the average of a recent test from a magic school bus field trip in space. Write a user-defined function called classAvg. The function has one parameter that represents the number of students in the class. That value is collected in the main function (you also have to assume that an invalid number, 0 or negative, could be entered). If an invalid number is entered, then the user should be asked to enter another value. Once the proper value for the number of students is entered, the function is then going to prompt the user to enter the test score of each student. The scores can range from 0 ≤ score ≤ 100 (there can even be decimal values). The function must also handle the case where the user enters an invalid score that is not in the provided range. If the user enters an invalid score, the program will prompt the user again to input a proper value. Once all scores are entered, the average is computed and sent back to the main function to be displayed. The value of the average is of type double. The resulting average is displayed as a percentage up to four decimal places. The following figure shows a sample run with a class size of 4. For this problem, you cannot use arrays (we have not covered them yet)! If arrays are used, then No Credit will be given for this problem.
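A minimal sketch of how classAvg could validate each score and keep a running sum without using an array is shown below. The prompt and output wording are assumptions and must match the sample run.

#include <stdio.h>

double classAvg(int numStudents);

double classAvg(int numStudents)
{
    double sum = 0.0, score;
    int i;

    for (i = 1; i <= numStudents; i++) {
        printf("Enter the score for student %d: ");   /* wording is an assumption */
        scanf("%lf", &score);
        while (score < 0.0 || score > 100.0) {         /* reject out-of-range scores */
            printf("Invalid score. Enter again: ");
            scanf("%lf", &score);
        }
        sum += score;                                  /* running sum, no array needed */
    }
    return sum / numStudents;                          /* average returned to main */
}

int main(void)
{
    int n;
    printf("Enter the number of students: ");
    scanf("%d", &n);
    while (n <= 0) {                                   /* main validates the class size */
        printf("Invalid number. Enter again: ");
        scanf("%d", &n);
    }
    printf("The class average is %.4lf%%\n", classAvg(n));
    return 0;
}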

$25.00 View

[SOLVED] Small program 3 cop3223c introduction to programming with c

For each problem in the assignment, you will create the definition of the user-defined function that is asked for in the description. If you do not create a user-defined function for each of the problems, then you will receive no credit for that problem. Creating user-defined functions is good practice! You must also write the function prototypes! Missing function prototypes will result in points being deducted. Function prototypes are good practice as well.

The file must be named smallprogram3_lastname_firstname.c, where lastname and firstname are your last and first names (as registered in Webcourses). For example, Dr. Steinberg's file would be named smallprogram3_Steinberg_Andrew.c. Make sure to include the underscore character _. If your file is not named properly, points will be deducted. The script the graders use pulls your name from the file name. It is imperative you follow these steps so you can get your points!

Testing on Eustis

It is your responsibility to test your code on Eustis. If you submit your assignment without testing on Eustis, you risk points being deducted for things that may behave differently on your operating system. Remember, you cannot dispute grades if your code didn't work properly on Eustis just because it worked on your machine. The Eustis environment gives the final say. Plan accordingly to test on Eustis!!

The Python Script File! Read Carefully!

A Python script has been provided for you to test your code against a sample output of Dr. Steinberg's solution. This script will check that your output matches Dr. Steinberg's solution file exactly, as the graders are using it to grade your assignments. The script removes leading and trailing white space, but not white space in the actual text. If there is anything off with the output (including the expected answer), the script will say your output is not correct. This includes cases where your output produces the correct answer but something is off with the output display. The script does not point directly to where your mistake(s) are in the code. It will only produce a success or unsuccessful message as a whole. If you get an unsuccessful output, my suggestion is to look at the sample solution text file provided to see what is different between your answer and Dr. Steinberg's. If you have extra white space or new lines, or are even just missing a space/new line, you will lose points that won't be changed!

Make sure you place the Python script in the same directory as your C file. You can use the ls command to check that the following items are in the directory.

ls

You should see these three files in your directory.

1. Your C file
2. The Python script file
3. The sample solution text file

If you have these three files, you are ready to run the script. Use the following command to test your code with Dr. Steinberg's provided solution sample.

python3 sp3test.py

If the script says your output is incorrect, check out the sample text file that was generated (a new text file will be created by the script that contains YOUR output). If your numbers are off or different from Dr. Steinberg's, then something is not right with your code's logic for calculating the answer. However, if your numbers match Dr. Steinberg's solution, then there is extra/missing white space or there are extra/missing newlines. Compare the text file generated by the script with the solution text file line by line to find the missing/extra white space or newlines.
Once you believe you have found the error, rerun the script to see if the output matches.

The Rubric

Please see the assignment page for the established rubric on Webcourses.

Comment Header

Make sure you place a comment header at the top of your C file. You will use single-line comments to write your name, professor, course, and assignment. For example, Dr. Steinberg's header would be:

//Andrew Steinberg
//Dr. Steinberg
//COP3223C Section 1
//Small Program 3

Missing a comment header will result in point deductions!

The Solution Text File

You are provided a solution file that was created by Dr. Steinberg's Python script. You may notice some strange things about the file. In this assignment, you are going to write statements that involve interacting with the user, so you are probably wondering where in the solution text file the interaction is happening. The Python script handles the interaction: it creates an input stream that feeds the program its input. That is why you don't see the input directly in the text file. For example, the first line of the text file says Enter a key from the keyboard: Lower!. It looks like we should be typing input here, but the Python script has already fed it input. That is why you don't see the values, only the results. In each problem, a screenshot of the C file being executed manually without the Python script shows how it looks on a normal run. Carefully look at the output in the pictures provided. Note: In the pdf, long lines of the sample file are wrapped; in the text file itself each of these is one whole line.

samplesolutionsp3.txt

Enter a key from the keyboard: Lower!
Hi! Thank you for calling the Superhuman Law Division at GLK&H! Our associates are currently working hard for super people like you.
Please listen carefully to the options of who you would like to speak to in regards to your situation.
Option 1: Fined for thousands of dollars worth of damage to the city you were trying to protect.
Option 2: Accidentally create a sentient robot who got the feels and tried to destroy the world.
Option 3: You are an Asgardian god who unintentionally leaves a giant burning imprint on private property every time you visit Earth.
Option 4: You just gained new superhero strength that is not recognized by the department of damage control, and they are chasing you down after you performed a good deed.
Option 5: Your secret identity was revealed by a notorious person and now your personal life is no longer the same.
Option 6: Another super being issue that was not mentioned previously in the options given.
Selection: You have selected option 4. Do not worry! We will talk to the department of damage control.
Enter the x-coordinate: Enter the y-coordinate: 2.53, -4.29 is in quadrant IV.
Enter three sides separated by a whitespace: Checking these logistics from the input.
1 + 2 > 3
1 + 3 > 2
2 + 3 > 1
Not Triangle!

Problem 1

Write the definition of a user-defined function called letters. The function will determine if a character value is a letter of the English alphabet. The function has no parameters. Inside the definition, you will prompt the user to input a character. If the letter is lowercase, then the message "Lower!" is displayed to the terminal. If the letter is uppercase, then the message "Upper!" is displayed to the terminal.
If the character value is not part of the alphabet, then the message "Not a Letter!" is displayed to the terminal. Hint: think about what happens when characters are evaluated. The following figure shows a sample output scenario when the letter is lowercase. Make sure your output matches to receive potential credit. Do not worry if the user enters multiple characters at once; we have not learned strings yet!

Figure 1: Sample output when the key entered from the keyboard is a lowercase letter from the English alphabet.

Problem 2

You are a super being in need of a super lawyer. You learned that Jennifer Walters is in your city representing super humans who may unintentionally leave a mess behind from saving the day. You dial the number 1-877-SHE-HULK. You connect with the operator, who sounds like Jennifer Walters, telling you all of the services that the Superhuman Law Division at GLK&H offers. Write a user-defined function called greenLawyer that displays the list of options, allows the user to select one, and then displays the result. If none of the values match, then the message "I'm sorry. I don't recognize that super being option." is displayed to the terminal window. Inside the user-defined function you will display a welcoming message and list the options for the user. Figure 3 shows sample text to be displayed as the welcoming message along with the option selected and the result. Make sure it is exact to receive potential credit, as the script will check for it. You must also use a switch statement for this question to receive full points. If a switch statement is not used, then the highest mark you can receive is half marks on the rubric evaluation.

Figure 2: Table for problem 2 that shows the message for each option.

Figure 3: Sample output for problem 2. This shows the welcoming message along with the result of the option selected.

Problem 3

Write a user-defined function definition called coordinates that takes the x-y coordinates of a point in the Cartesian plane and prints a message telling either the axis on which the point lies or the quadrant in which it is found. The user-defined function takes two double arguments and doesn't return anything. Hint: Don't forget the x-axis, y-axis, and origin. The coordinates are collected in the main function. The following figure shows a sample output. Make sure the output matches for the script to receive credit. The values of the coordinates must be displayed up to two decimal places.

Figure 4: Quadrants in the 2D coordinate plane.

Figure 5: Sample output for problem 3. Make sure your output is in this format for the script!
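A minimal sketch of the quadrant logic for Problem 3 follows. The quadrant message mirrors the sample solution line above ("2.53, -4.29 is in quadrant IV."); the axis and origin messages are assumptions to be checked against figure 5.

#include <stdio.h>

void coordinates(double x, double y);

void coordinates(double x, double y)
{
    if (x == 0 && y == 0)
        printf("%.2lf, %.2lf is at the origin.\n", x, y);   /* wording assumed */
    else if (x == 0)
        printf("%.2lf, %.2lf is on the y-axis.\n", x, y);   /* wording assumed */
    else if (y == 0)
        printf("%.2lf, %.2lf is on the x-axis.\n", x, y);   /* wording assumed */
    else if (x > 0 && y > 0)
        printf("%.2lf, %.2lf is in quadrant I.\n", x, y);
    else if (x < 0 && y > 0)
        printf("%.2lf, %.2lf is in quadrant II.\n", x, y);
    else if (x < 0 && y < 0)
        printf("%.2lf, %.2lf is in quadrant III.\n", x, y);
    else
        printf("%.2lf, %.2lf is in quadrant IV.\n", x, y);
}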
Problem 4

You are working for an architecture company that loves designing triangle-shaped buildings. The company has grown so much that it is able to create its own software to perform some blueprint designs rather than having someone design manually by hand. Part of the software is checking measurements to ensure that the potential building is in the shape of a valid triangle. Figure 6 shows the geometric definition of what determines whether a shape is a valid triangle.

Figure 6: The triangle and its properties of validation.

Using these properties, write the definition of a user-defined function called triangle. The function has three parameters that are all of integer type. Each parameter represents the length of a side. Assume that the input is always collected in a, b, c order. Inside the main function, you will ask the user for the length of each of the sides. Once the input is collected, you will call the user-defined function and perform the respective operation that was asked for in this problem. The function will display the logistics it is checking to determine if the values represent a triangle or not. The function also returns an integer value which represents the outcome. The value 1 returned means the shape is a valid triangle. The value -1 means otherwise. Using those values, you will then display the outcome message in the main function. If the value is 1, then "Triangle!" is displayed. Otherwise display "Not Triangle!". The following figure shows a sample output of what the python script expects. Make sure to have it exact in terms of white space, newlines, and characters or else the script will mark it wrong.

Figure 7: Sample output of problem 4. Make sure your output is in this format for the script!
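The triangle-inequality check behind Problem 4 can be sketched as follows; the "logistics" lines mirror the sample solution text above (e.g. "1 + 2 > 3"), but treat the exact formatting as an assumption to verify against the sample.

#include <stdio.h>

int triangle(int a, int b, int c);

int triangle(int a, int b, int c)
{
    /* display the checks the function is performing */
    printf("Checking these logistics from the input.\n");
    printf("%d + %d > %d\n", a, b, c);
    printf("%d + %d > %d\n", a, c, b);
    printf("%d + %d > %d\n", b, c, a);

    /* every pair of sides must sum to more than the remaining side */
    if (a + b > c && a + c > b && b + c > a)
        return 1;    /* valid triangle */
    return -1;       /* not a triangle */
}

int main(void)
{
    int a, b, c;
    printf("Enter three sides separated by a whitespace: ");
    scanf("%d %d %d", &a, &b, &c);
    if (triangle(a, b, c) == 1)
        printf("Triangle!\n");
    else
        printf("Not Triangle!\n");
    return 0;
}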

$25.00 View

[SOLVED] Small program 2 cop3223c introduction to programming with c

For each problem in the assignment, you will create the definition of the user-defined function that is asked for in the description. If you do not create a user-defined function for each of the problems, then you will receive no credit for that problem. Creating user-defined functions is good practice! You must also write the function prototypes! Missing function prototypes will result in points being deducted. Function prototypes are good practice as well.

The file must be named smallprogram2_lastname_firstname.c, where lastname and firstname are your last and first names (as registered in Webcourses). For example, Dr. Steinberg's file would be named smallprogram2_Steinberg_Andrew.c. Make sure to include the underscore character _. If your file is not named properly, points will be deducted. The script the graders use pulls your name from the file name. It is imperative you follow these steps so you can get your points!

Testing on Eustis

It is your responsibility to test your code on Eustis. If you submit your assignment without testing on Eustis, you risk points being deducted for things that may behave differently on your operating system. Remember, you cannot dispute grades if your code didn't work properly on Eustis just because it worked on your machine. The Eustis environment gives the final say. Plan accordingly to test on Eustis!!

The Python Script File! Read Carefully!

A Python script has been provided for you to test your code against a sample output of Dr. Steinberg's solution. This script will check that your output matches Dr. Steinberg's solution file exactly, as the graders are using it to grade your assignments. The script removes leading and trailing white space, but not white space in the actual text. If there is anything off with the output (including the expected answer), the script will say your output is not correct. This includes cases where your output produces the correct answer but something is off with the output display. The script does not point directly to where your mistake(s) are in the code. It will only produce a success or unsuccessful message as a whole. If you get an unsuccessful output, my suggestion is to look at the sample solution text file provided to see what is different between your answer and Dr. Steinberg's. If you have extra white space or new lines, or are even just missing a space/new line, you will lose points that won't be changed!

Make sure you place the Python script in the same directory as your C file. You can use the ls command to check that the following items are in the directory.

ls

You should see these three files in your directory.

1. Your C file
2. The Python script file
3. The sample solution text file

If you have these three files, you are ready to run the script. Use the following command to test your code with Dr. Steinberg's provided solution sample.

python3 sp2test.py

If the script says your output is incorrect, check out the sample text file that was generated (a new text file will be created by the script that contains YOUR output). If your numbers are off or different from Dr. Steinberg's, then something is not right with your code's logic for calculating the answer. However, if your numbers match Dr. Steinberg's solution, then there is extra/missing white space or there are extra/missing newlines. Compare the text file generated by the script with the solution text file line by line to find the missing/extra white space or newlines.
Once you believe you have found the error, rerun the script to see if the output matches.

The Rubric

Please see the assignment page for the established rubric on Webcourses.

Comment Header

Make sure you place a comment header at the top of your C file. You will use single-line comments to write your name, professor, course, and assignment. For example, Dr. Steinberg's header would be:

//Andrew Steinberg
//Dr. Steinberg
//COP3223C Section 1
//Small Program 2

Missing a comment header will result in point deductions!

The Solution Text File

You are provided a solution file that was created by Dr. Steinberg's Python script. You may notice some strange things about the file. In this assignment, you are going to write statements that involve interacting with the user, so you are probably wondering where in the solution text file the interaction is happening. The Python script handles the interaction: it creates an input stream that feeds the program its input. That is why you don't see the input directly in the text file. For example, the third line of the text file says How many hours will you keep your car parked here> Car will be parked for 9 hours…. It looks like we should be typing input here, but the Python script has already fed it input. That is why you don't see the values, only the results. In each problem, a screenshot of the C file being executed manually without the Python script shows how it looks on a normal run. Carefully look at the output in the pictures provided. Note: In the pdf, long lines of the sample file are wrapped; in the text file itself each of these is one whole line.

samplesolutionsp2.txt

Enter the radius: Enter the height: The total surface area of the cone is 4432.16
Welcome to the Parking Garage!
How many hours will you keep your car parked here> Car will be parked for 5 hours and will be charged $21.05.
Enter a year after 2016: Predicted Wakanda's population for 2022 in thousands: 76.885
Enter the value for n: 12! is approximately 478858054.1927

Problem 1

Write a user-defined function definition called coneSurfaceArea. This function calculates the surface area of a cone. The formula for calculating the surface area is:

SA = π × r × (r + √(h² + r²))

The function has three parameters. The first parameter is the radius of the base of the cone, which can be represented as a double type value. The second parameter is the height of the cone, which is also double type. The last parameter is the value of pi (defined as π = 3.14), which is also double type. For this problem you will need to collect input from the user inside the main function of your program. After collecting the input, you will then call the user-defined function to have it perform the calculation, and display the result up to two decimal places. The function returns the result, and it is displayed in the main function. Make sure your output matches the expected sample output in the following figure. Do not worry about invalid input. We have not covered conditions yet.

Figure 1: Sample output for question 1. Make sure your output matches exactly for the test script.
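A minimal sketch of Problem 1, with the surface-area formula written out as C code (the prompt wording follows the sample solution file above, and π is passed in as 3.14 per the instructions):

#include <stdio.h>
#include <math.h>   /* sqrt */

double coneSurfaceArea(double radius, double height, double pi);

double coneSurfaceArea(double radius, double height, double pi)
{
    /* SA = pi * r * (r + sqrt(h^2 + r^2)) */
    return pi * radius * (radius + sqrt(height * height + radius * radius));
}

int main(void)
{
    double r, h;
    printf("Enter the radius: ");
    scanf("%lf", &r);
    printf("Enter the height: ");
    scanf("%lf", &h);
    printf("The total surface area of the cone is %.2lf\n", coneSurfaceArea(r, h, 3.14));
    return 0;
}

Since sqrt comes from the math library, remember to link it when compiling on Eustis (for example, gcc yourfile.c -lm).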
Problem 2

A parking garage charges customers to park their car at a rate of $4.21 per hour. Write code that welcomes the user and prompts them to enter the number of hours they plan to leave their car parked. The program will display the number of hours and the amount charged to the user. The system deals with whole-number hours only. The monetary value displayed has to include $ and two decimal places. Check out the following figure for the sample output that the script expects. All of this is to be completed in a user-defined function called parkingCharge. The function will not return anything. Do not worry about invalid input. We have not covered conditions yet.

Figure 2: Sample output for question 2. Make sure your output matches exactly for the test script.

Problem 3

After studying the population growth of the area of Wakanda in the last decade of the 20th century, we have modeled Wakanda's population with the function

P(t) = 51.451 + 4.239t

where t is years after 2016, and P is the population in thousands. Write a user-defined function called wakandaPopulation that predicts Wakanda's population in a given year after 2016 and displays it to the user. The user-defined function does not return a value. The user-defined function also takes one parameter argument of type int called year, which is the year after 2016. Inside the main function, you will ask the user for the year after 2016. Do not worry about invalid input. We have not learned conditions yet. Check out the following figure for the sample output that the script expects.

Figure 3: Sample output for question 3. Make sure your output matches exactly for the test script.

Problem 4

For any integer n > 0, n! is defined as the product n × (n−1) × (n−2) × ... × 2 × 1. 1! is defined to be 1. It is sometimes useful to have a closed-form definition instead; for this purpose, an approximation can be used. R.W. Gosper proposed the following approximation formula:

n! ≈ n^n × e^(−n) × √((2n + 1/3) × π)

For this problem, define a user-defined function that performs the calculation. The user-defined function takes one argument of type int and will return the result of the computation as type double, to be displayed in the main function. The result is displayed up to four decimal places. Name the user-defined function factorialApprox. Use the same value of π from problem 1. Do not worry about invalid input. Check out the following figure for the sample output that the test script expects. Note: You don't need to define Euler's Number. Use the math library's natural exponent function.

Figure 4: Sample output for question 4. Make sure your output matches exactly for the test script.
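Gosper's approximation translates almost directly into C. A minimal sketch of factorialApprox, using pi = 3.14 as in problem 1 and exp() from the math library as the note suggests (prompt and output wording follow the sample solution file above):

#include <stdio.h>
#include <math.h>   /* pow, exp, sqrt */

double factorialApprox(int n);

double factorialApprox(int n)
{
    double pi = 3.14;   /* same value of pi as problem 1 */
    /* n! ~ n^n * e^(-n) * sqrt((2n + 1/3) * pi) */
    return pow(n, n) * exp(-n) * sqrt((2.0 * n + 1.0 / 3.0) * pi);
}

int main(void)
{
    int n;
    printf("Enter the value for n: ");
    scanf("%d", &n);
    printf("%d! is approximately %.4lf\n", n, factorialApprox(n));
    return 0;
}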

$25.00 View

[SOLVED] Small program 1 cop3223c introduction to programming with c

This assignment contains a set of problems that are to be completed in one C file with one main function only. We have not learned user-defined functions yet; that will start in Small Program 2 and beyond. This means the main function in your file will execute each problem in a single run. You do not write separate main functions for each problem. If separate main functions are created, then no credit will be given for the assignment! If you create separate program files for each question, then no credit will be given for the assignment!

The file must be named smallprogram1_lastname_firstname.c, where lastname and firstname are your last and first names (as registered in Webcourses). For example, Dr. Steinberg's file would be named smallprogram1_Steinberg_Andrew.c. Make sure to include the underscore character _. If your file is not named properly, points will be deducted. The script the graders use pulls your name from the file name. It is imperative you follow these steps so you can get your points!

Testing on Eustis

It is your responsibility to test your code on Eustis. If you submit your assignment without testing on Eustis, you risk points being deducted for things that may behave differently on your operating system. Remember, you cannot dispute grades if your code didn't work properly on Eustis just because it worked on your machine. The Eustis environment gives the final say. Plan accordingly to test on Eustis!!

The Python Script File! Read Carefully!

A Python script has been provided for you to test your code against a sample output of Dr. Steinberg's solution. This script will check that your output matches Dr. Steinberg's solution file exactly, as the graders are using it to grade your assignments. The script removes leading and trailing white space, but not white space in the actual text. If there is anything off with the output (including the expected answer), the script will say your output is not correct. This includes cases where your output produces the correct answer but something is off with the output display. The script does not point directly to where your mistake(s) are in the code. It will only produce a success or unsuccessful message as a whole. If you get an unsuccessful output, my suggestion is to look at the sample solution text file provided to see what is different between your answer and Dr. Steinberg's. If you have extra white space or new lines, or are even just missing a space/new line, you will lose points that won't be changed!

Make sure you place the Python script in the same directory as your C file. You can use the ls command to check that the following items are in the directory.

ls

You should see these three files in your directory.

1. Your C file
2. The Python script file
3. The sample solution text file

If you have these three files, you are ready to run the script. Use the following command to test your code with Dr. Steinberg's provided solution sample.

python3 sp1test.py

If the script says your output is incorrect, check out the sample text file that was generated (a new text file will be created by the script that contains YOUR output). If your numbers are off or different from Dr. Steinberg's, then something is not right with your code's logic for calculating the answer. However, if your numbers match Dr. Steinberg's solution, then there is extra/missing white space detected.
Compare the text file generated by the script with the solution text file line by line to find the missing/extra white space or newlines. Once you believe you have found the error, rerun the script to see if the output matches.

The Rubric

Please see the assignment page for the established rubric on Webcourses.

Comment Header

Make sure you place a comment header at the top of your C file. You will use single-line comments to write your name, professor, course/section, and assignment. For example, Dr. Steinberg's header would be:

//Andrew Steinberg
//Dr. Steinberg
//COP3223C Section 1
//Small Program 1

Missing a comment header will result in point deductions!

The Solution Text File

You are provided a solution file that was created by Dr. Steinberg's Python script. You may notice some strange things about the file. In this assignment, you are going to write statements that involve interacting with the user, so you are probably wondering where in the solution text file the interaction is happening. The Python script handles the interaction: it creates an input stream that feeds the program its input. That is why you don't see the input directly in the text file. For example, the last line of the text file says Enter the radius: Enter the height:… It looks like we should be typing input here, but the Python script has already fed it input. That is why you don't see the values, only the results. In each problem, a screenshot of the C file being executed manually without the Python script shows how it looks on a normal run. Carefully look at the output in the pictures provided.

samplesolutionsp1.txt

VV VV VV VV VV VV VV VV VV VV VVVV VV
Mileage Reimbursement Calculator
Enter beginning odometer reading=> Enter ending odometer reading=> You traveled 7.0 miles. At $2.61 per mile, your reimbursement is $18.27
Enter the weight in pounds: Enter the total height in inches: BMI = 3.649
Enter the radius: Enter the height: The volume of the cone is 78.5397

Problem 1

Write a set of statements in the main function that displays to the monitor a large letter 'V' made up of the character 'V'. Here is a sample output of how it should look when the program runs. Make sure to follow this output to receive full points. Any differences will cause points to be deducted. You do not need to use loops for this problem. Use the text file output to assist with the number of white spaces.

Figure 1: Output for question 1. The output must match exactly to receive credit. This includes white space and new lines.

Problem 2

Write some code that calculates mileage reimbursement for an employee at a rate of $2.61 per mile. This code needs to interact with the user. Hint: Use scanf statements. Make sure to follow this output to receive full points. Any differences will cause points to be deducted. Think about the type of data we are working with. Make sure you display up to two decimal places when displaying the dollar amount and one decimal place when displaying the distance traveled. Make all variables of type double. You do not need to worry about negative values. We have not discussed conditions yet.

Figure 2: Output for question 2. The output must match exactly to receive credit. This includes white space and new lines.
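A minimal sketch of the mileage reimbursement calculation from Problem 2 is shown below; the prompts follow the sample solution file above, and all variables are double as required.

#include <stdio.h>

int main(void)
{
    double begin, end, miles, reimbursement;
    double rate = 2.61;                          /* dollars per mile */

    printf("Mileage Reimbursement Calculator\n");
    printf("Enter beginning odometer reading=> ");
    scanf("%lf", &begin);
    printf("Enter ending odometer reading=> ");
    scanf("%lf", &end);

    miles = end - begin;
    reimbursement = miles * rate;

    /* one decimal place for distance, two for the dollar amounts */
    printf("You traveled %.1lf miles. At $%.2lf per mile, your reimbursement is $%.2lf\n",
           miles, rate, reimbursement);
    return 0;
}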
Problem 3

Write some code that calculates the body mass index (BMI). The BMI can be calculated using the formula

BMI = (weightInPounds × 703) / (heightInInches × heightInInches)

You are going to ask the user for some input: you will ask for weightInPounds and heightInInches. Both variables are float type. Make sure to follow this output to receive full points. Any differences will cause points to be deducted. Your output result should only display up to three decimal places. Note: Assume the total height is input. For example, 5 feet and 6 inches would result in 66 inches as the input. You do not need to worry about the denominator being 0 or negative values. We have not discussed conditions.

Figure 3: Output for question 3. The output must match exactly to receive credit. This includes white space and new lines.

Problem 4

Write some code inside the main function that calculates the volume of a cone. The formula for calculating the volume of a cone is

V = (1/3) × π × r² × h

The value of π = 3.14159. Make sure to declare pi as a constant variable or else points will be deducted. The graders will be checking for this, not the script! You should have the following exact output in your code for the script to give you credit. Display the result up to four decimal places. Be very careful with this problem. Some students may run into small errors due to loss of information. Make sure 1/3 is 0.3 repeating and not just 0. All variables are type double. You do not need to worry about negative values. We have not discussed conditions yet.

Figure 4: Output for question 4. The output must match exactly to receive credit. This includes white space and new lines.
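The warning about 1/3 in Problem 4 is about integer division: written as 1/3 with int literals, C evaluates it to 0 and the volume comes out as 0. A minimal sketch that avoids the pitfall (prompts follow the sample solution file above):

#include <stdio.h>

int main(void)
{
    const double PI = 3.14159;   /* pi declared as a constant, as required */
    double radius, height, volume;

    printf("Enter the radius: ");
    scanf("%lf", &radius);
    printf("Enter the height: ");
    scanf("%lf", &height);

    /* 1.0 / 3.0 keeps the math in double; 1 / 3 would truncate to 0 */
    volume = (1.0 / 3.0) * PI * radius * radius * height;

    printf("The volume of the cone is %.4lf\n", volume);
    return 0;
}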

$25.00 View

[SOLVED] Cop3223c large program 4 database records

You will write a menu-driven program in C that simulates an inventory database system. The theme of the database is up to the student; however, please keep the theme appropriate. In order to do this you will have to use an array of typedef structs. The program will have the following setup.

• The program will ask the user what kind of action they would like to perform. Actions include:
– Insert (add a brand new record to the array)
– Remove (remove an existing record from the array)
– Search (check to see if a record exists in the array)
– Display (display all records in the array)
– Exit (program exits and writes all records of the array to a text file)

You will create a file called LargeProgram4_lastnamefirstname.c for this assignment and write out the code. You are allowed to use the same messages that are provided in the sample output for your program. You can also make modifications to the text of the messages as long as the program follows all directions as stated in this assignment. Check out the sample output of the fully working program on Webcourses. In my sample output, the theme of my database comes from the Harry Potter series: I've created a Hogwarts School Database System.

Requirements for Large Program 4

Requirements for the program (a sketch of requirements 1, 4, and 5 follows the Due Date section below):

1. Please do not use dynamic memory for this assignment. Only static arrays!
2. The program must have a minimum of 6 user-defined functions. It is up to the student how the functions are utilized in the program.
3. The program must be able to support insert, remove, locate, and display functionality.
4. The maximum size of the array is 30 elements.
5. The struct must be a typedef. The minimum number of components for the typedef struct is 4. Students are allowed to have more components based on the theme of their respective large program.
6. When the program runs, at the beginning you are required to populate the array with 6 records. Hint: I recommend students create a "hardcode six" function that will hard-code values into the array.
7. Use the good coding practices you learned throughout the semester. For example, do not use global variables. Anything that is not good practice will result in point deductions.
8. Before your program terminates, make sure to print all records stored in the array to a text file called Records.txt.
9. Your program must be able to handle invalid input without crashing!

Due Date

The assignment is due on November 30th at 11:59pm EST via Webcourses. Do not email the professor or TAs your submissions, as they will not be accepted! This assignment is accepted late up to 24 hours with a penalty. Please see the syllabus for more information on this. Make sure to submit on time to get potential full credit. Make sure to also take into consideration the uploading time. In the past, students who were working last minute on the assignment sometimes ran into uploading issues where their Internet ran slow, resulting in late submissions. The timestamp Webcourses applies to your submission will be the final say. Please do not email the instructor or TAs saying your Internet was running slow. If the time is off by a second from the due date, then the assignment is considered late. Plan accordingly!
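To illustrate requirements 1, 4, and 5 from the list above, here is a minimal sketch of what the typedef struct and the static record array might look like. The component names and the MAX_RECORDS macro are hypothetical and should be adapted to your own theme.

#include <stdio.h>

#define MAX_RECORDS 30              /* requirement 4: at most 30 elements */

/* requirement 5: a typedef struct with at least 4 components (names are hypothetical) */
typedef struct
{
    int  id;
    char name[50];
    char house[30];
    int  year;
} Record;

int main(void)
{
    Record database[MAX_RECORDS];   /* requirement 1: static array, no dynamic memory */
    int count = 0;                  /* how many slots are currently in use */

    /* the insert/remove/search/display functions would operate on database and count */
    printf("Records in use: %d of %d\n", count, MAX_RECORDS);
    return 0;
}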
Important! Read Carefully!

You will create a file called LargeProgram4_lastnamefirstname.c for this assignment and write out the code. You are allowed to use the same messages that are provided in the sample output for your program. You can also make modifications to the text of the messages as long as the program follows all directions as stated in this assignment. Check out the sample output of the fully working program on Webcourses. In my sample output, the theme of my database comes from the Harry Potter series: I've created a Hogwarts School Database System. You can add any additional functions, but cannot remove the required ones requested in this assignment.

Submit the .c file only of your program to Webcourses. Make sure to name your file LargeProgram4_lastnamefirstname.c. For example, my file would be named LargeProgram4_SteinbergAndrew.c. You do not need to submit the text file output. This will be graded through the compiler on Eustis. If it does not compile, it does not work properly!

Testing on Eustis

It is your responsibility to test your code on Eustis. If you submit your assignment without testing on Eustis, you risk points being deducted for things that may behave differently on your operating system. Remember, you cannot dispute grades if your code didn't work properly on Eustis just because it worked on your machine. The Eustis environment gives the final say. Plan accordingly to test on Eustis!!

Why was I not provided a Python Script File?

You are probably wondering why Dr. Steinberg didn't provide a python script to check your code like in the small programs. The reason is that Dr. Steinberg wants his students to enjoy the assignment without being told exactly what must be displayed to the terminal and matched. For large programs, Dr. Steinberg allows students to change the text output to the terminal; however, your program must at least perform the specific functionality that is requested to receive potential full credit.

The Rubric

Please see the assignment page for the established rubric on Webcourses.

Comment Header

Make sure you place a comment header at the top of your C file. You will use single-line comments to write your name, professor, course/section, and assignment. For example, Dr. Steinberg's header would be:

//Andrew Steinberg
//Dr. Steinberg
//COP3223C Section 1
//Large Program 4

Missing a comment header will result in point deductions!

The Solution Text File

I have provided a sample output of a working program of what is expected on Webcourses. As I stated before, you are welcome to change the text output to the terminal, or you can keep it exactly like in the sample.

Recommendations for Completing the Assignment

Dr. Steinberg believes in you all! However, the TAs/ULAs and Dr. Steinberg are still here to help!

Basic Rules for Completing the Assignment

Please make sure to follow these rules when completing the assignment.

• Please use all items that were presented to you in class and labs. Everything that was presented to you can help you complete the assignment.
• DO NOT COPY ANYONE'S CODE! This is cheating as stated in the syllabus. This assignment is individual work.
• DO NOT SHOW YOUR CODE to classmates. The only group of people that can see your code are the TAs, ULAs, and course instructor. Follow the rules from the syllabus if you talk to other students. Otherwise you may be committing cheating violations!
• Make sure to use good coding practices. Anything that is considered bad practice will result in point deductions. For example, goto statements should not be seen in anyone's code. There is a reason I don't cover them: they are extremely bad practice and just awful.
Some Advice

Here are some tips and tricks that will help you with this assignment and make the experience enjoyable.

• Do not try to write out all the code and only build it at the end to find syntax errors. For every few new lines of code written (my rule of thumb is 2-3 lines), build to see if it compiles successfully. It will go a long way!
• After any successful build, run the code to see what happens and what state the program is currently in, so you know what to do next. If the program performs what you expected, you can then move on to the next step of the code writing. If you try to write everything at once, build it successfully, and then find out it doesn't work properly, you will get frustrated trying to find the logical error in your code! Remember, logical errors are the hardest to identify and fix in a program!
• Start the assignment early! Do not wait until the last minute (the day of) to begin the assignment.
• Ask questions! It's ok to ask questions. If any clarifications are needed, please ask the TAs and the instructor! We are here to help!!! Do not wait until the last minute to seek help, as you could be waiting in a long line of fellow classmates who may need help. As stated previously, start early and ask your questions right away!

$25.00 View

[SOLVED] Cop3223c large program 3 hangman

In this assignment, the primary objective is to apply your skills with strings! Strings are very important to understand, and this assignment will explore some of the topics discussed in our lectures. You will write a menu-driven program in C that simulates the classic word game Hangman. For those that have never played hangman, click here to see the rules. In this large programming assignment, you will write a simple terminal hangman game application. The program will have the following setup.

• The program will first welcome the user with a friendly message. It will provide the user with instructions and rules on how the game works.
• After the welcome message, the game will begin with the first round.
• For each round:
– The user will be asked to enter a letter and will be shown the masked word with *'s.
– The program will inform the user if the letter is in the word.
• If the letter is in the word, the program will reveal where in the masked word the letter occurs.
• If the letter is not in the word, the program will inform the user that a strike has been received.
• After the round, the program will determine if the user won or lost.
• The program will ask the user if they want to play again.

In order to create this program, you will use the function prototypes that are provided for you in the assignment directions.

Requirements for Large Program 3

Requirements for the program:

1. The program will only deal with single words (NO PHRASES!).
2. The words must be read from a text file. The text file is provided for you on Webcourses. The text file must be in the same directory as your C source file. Note: If you are first working locally on your machine and not on Eustis, you will have to provide a file path. If you go this route, you are responsible for testing on Eustis, which means you will have to remove the file path so it doesn't crash when your grader tests it!
3. The maximum number of strikes is 6. Use a macro constant!
4. The maximum word size is 20 characters. Use a macro constant!
5. Words can have the same letter in different places. For example, state, cellphone, and paper are words that can be used in the game.
6. The program must be able to properly handle any kind of input without crashing.

Due Date

The assignment is due on November 22nd at 11:59pm EST via Webcourses. Do not email the professor or TAs your submissions, as they will not be accepted! This assignment is accepted late up to 24 hours with a penalty. Please see the syllabus for more information on this. Make sure to submit on time to get potential full credit. Make sure to also take into consideration the uploading time. In the past, students who were working last minute on the assignment sometimes ran into uploading issues where their Internet ran slow, resulting in late submissions. The timestamp Webcourses applies to your submission will be the final say. Please do not email the instructor or TAs saying your Internet was running slow. If the time is off by a second from the due date, then the assignment is considered late. Plan accordingly!

Important! Read Carefully!
In this assignment, you are going to implement 6 user defined functions void rules(void); //display rules of the game void maskWord (char starword[], int size); //mask the word with stars to display int playRound(char starword[], char answer[]); //play a round of hangman int occurancesInWord(char userguess, char answer[]); //number of times letter occurs in word void updateStarWord(char starword[], char answer[], char userguess); //replace respective * void playAgain(int *play); //ask user if to play again. 1 is yes 2 is no You cannot add any additional functions or remove them. Modifying them in any way (name, parameters, etc. . . ) will result in point deductions! Utilize them the way they are provided!The function prototypes are provided for you here. You will create a file called largeprogram3_lastnamefirstname.c for this assignment and write out the code. There is NO Large Program 3 Page 2 SKELETON PROVIDED FOR THIS LARGE PROGRAM! Do NOT modify any of the function prototypes that are provided for you! Any modifications from the provided function prototypes will result in points being deducted! You are allowed to use the same message that is provided in the sample output for your program. You can also make modifications to the text of messages as long as the program follows all directions as stated in this assignment. The following section will discuss the prototypes more in detail. Make sure that you name your C file largeprogram3_lastname_firstname.c, where lastname and firstname is your last and first name (as registered in webcourses). For example Dr. Steinberg’s file would be named largeprogram3_Steinberg_Andrew.c. Make sure to include the underscore character _. If your file is not named properly, points will be deducted. You are also provided a text file called ‘words.txt’ containing words to use for the game.The Function Prototypes This section will discuss the function prototypes. void rules(void); //display rules of the game This function will display the rules of the game. See the sample output to get a general idea of what is displayed to the user when the program runs. void maskWord (char starword[], int size); //mask the word with stars to display The mask function will “mask” the solution word with the * character for the user to see on the terminal window. This will allow the user to keep track of letters that were guessed correctly in forming the word solution. The function has two parameters. The first parameter is a string that will represent the starword. The second parameter is an integer that represents the number of characters for a particular solution in the round of hangman. int playRound(char starword[], char answer[]); //play a round of hangmanThe playRound function simulates an entire round of the Hangman game. The function returns an integer representing the outcome of the game. If 1 is returned, then the user won. Otherwise return 0 if the user lost. The function has two parameters. The first parameter is a string to the starword (string mixed with *’s and letters) and the second parameter is another string representing the solution. This function serves as the heart of the game. This function will have to call other functions except for playAgain, maskWord, and rules. Think carefully how this would work.int occurancesInWord(char userguess, char answer[]); //number of times letter occurs in word The occurancesInWord function counts the number of times a letter occurs in the solution. 
If the function returns a positive number, then that means it occurs at least once. Otherwise return 0 if it doesn’t occur at all in the solution. This function has two parameters. The first parameter is a character that represents the letter guessed by the user. The second parameter is the answer string.Large Program 3 Page 3 void updateStarWord(char starword[], char answer[], char userguess); //replace respective * The updateStarWord function will update the masked string by replacing the respective * character(s) with the the corresponding the letter that was guessed correctly. The function has three parameters. The first parameter is a string to the masked word. The second parameter is another string representing the answer. The third parameter is the letter that the user guessed. void playAgain(int *play); //ask user if to play again. 1 is yes 2 is no The playAgain function will ask the user if they would like to play another round of hangman. The user will input an option and it will be stored in the play variable. The function has one parameter which represents an integer reference to the variable that keeps track if the user will play again or not. Testing on Eustis It is your responsibility to test your code on Eustis. If you submit your assignment without testing on Eustis, you risk points being deducted for things that may behave differently on your operating system. Remember, you cannot dispute grades if your code didn’t work properly on Eustis all because it worked on your machine. The Eustis environment gives the final say. Plan accordingly to test on Eustis!! Why was I not provided a Python Script File? You are probably wondering why Dr. Steinberg didn’t provide a python script to check your code like in the small programs. The reason is that Dr. Steinberg wants his students to enjoy the assignment without being told of what must be displayed to the terminal and matching exactly. For large programs, Dr. Steinberg allows students to change the text output to the terminal, however your program must perform the specific functionalities that is requested at least to receive potential full credit. The Rubric Please see the assignment page for the established rubric on webcourses. Comment Header Make sure you place a comment header at the top of your C file. You will use single line comments to write your name, professor, course, and assignment. For example, Dr. Steinberg’s header would be: //Andrew Steinberg //Dr. Steinberg Large Program 3 Page 4 //COP3223C Section 1 //Large Program 3 Missing a comment header will result in point deductions! The Solution Text File I have provided a sample output of a working program of what is expected. As I stated before, you are welcome to change the text output to the terminal, or you can keep it exact like in the sample. Large Program 3 Sample Text File Solution Welcome to the Hangman Game! Here are the rules. I will provide you a set of Each You must figure out each letter of the missing ,→ word. For every correct letter guessed, I will reveal its place in the word. Each mistake will result in a strike. 6 strikes will result in a loss that round. Are you ready? Here we go! Welcome to the Round! The size of the word has 4 letters. You currently have 0 strikes. Letters you have guessed: Enter your guess: A The letter a is NOT in the word. You currently have 1 strikes. Letters you have guessed: a Enter your guess: G The letter g is in the word. You currently have 1 strikes. 
Letters you have guessed: ag Enter your guess: f The letter f is in the word. You currently have 1 strikes. Letters you have guessed: agf Large Program 3 Page 5 f Enter your guess: 0 You did not enter a letter from the alphabet. You currently have 1 strikes. Letters you have guessed: agf f Enter your guess: ! You did not enter a letter from the alphabet. You currently have 1 strikes. Letters you have guessed: agf f Enter your guess: o The letter o is in the word. You currently have 1 strikes. Letters you have guessed: agfo f Enter your guess: r Congratulations! You won! The word was frog. Would you like to play another round? 1: Yes 2: No Choice: 1 Welcome to the Round! The size of the word has 6 letters. You currently have 0 strikes. Letters you have guessed: Enter your guess: a The letter a is NOT in the word. You currently have 1 strikes. Letters you have guessed: a Enter your guess: e Large Program 3 Page 6 The letter e is in the word. You currently have 1 strikes. Letters you have guessed: ae Enter your guess: i The letter i is in the word. You currently have 1 strikes. Letters you have guessed: aei Enter your guess: o The letter o is NOT in the word. You currently have 2 strikes. Letters you have guessed: aeio Enter your guess: u The letter u is NOT in the word. You currently have 3 strikes. Letters you have guessed: aeiou Enter your guess: q The letter q is NOT in the word. You currently have 4 strikes. Letters you have guessed: aeiouq Enter your guess: w The letter w is in the word. You currently have 4 strikes. Letters you have guessed: aeiouqw wi Enter your guess: r The letter r is in the word. You currently have 4 strikes. Large Program 3 Page 7 Letters you have guessed: aeiouqwr wi Enter your guess: t The letter t is in the word. You currently have 4 strikes. Letters you have guessed: aeiouqwrt wi Enter your guess: y The letter y is NOT in the word. You currently have 5 strikes. Letters you have guessed: aeiouqwrty wi Enter your guess: p The letter p is NOT in the word. Sorry you did not win the round. The word was winter. Would you like to play another round? 1: Yes 2: No Choice: 1 Welcome to the Round! The size of the word has 9 letters. You currently have 0 strikes. Letters you have guessed: Enter your guess: a The letter a is in the word. You currently have 0 strikes. Letters you have guessed: a Enter your guess: e The letter e is in the word. You currently have 0 strikes. Letters you have guessed: ae Large Program 3 Page 8 Enter your guess: i The letter i is in the word. You currently have 0 strikes. Letters you have guessed: aei Enter your guess: o The letter o is NOT in the word. You currently have 1 strikes. Letters you have guessed: aeio Enter your guess: u The letter u is NOT in the word. You currently have 2 strikes. Letters you have guessed: aeiou Enter your guess: s The letter s is in the word. You currently have 2 strikes. Letters you have guessed: aeious s Enter your guess: b The letter b is NOT in the word. You currently have 3 strikes. Letters you have guessed: aeiousb s Enter your guess: c The letter c is in the word. You currently have 3 strikes. Letters you have guessed: aeiousbc s Enter your guess: T The letter t is in the word. Large Program 3 Page 9 You currently have 3 strikes. Letters you have guessed: aeiousbct stai Enter your guess: 5 You did not enter a letter from the alphabet. You currently have 3 strikes. Letters you have guessed: aeiousbct stai Enter your guess: r Congratulations! You won! The word was staircase. 
Would you like to play another round? 1: Yes 2: No Choice: 2 Thank you for playing today! Recommendations for Completing the Assignment At this point in the course, you should start to build a foundation of building programs without Dr. Steinberg directly telling you where to begin. Something Dr. Steinberg will point out is that you can use the sample output to help you build your program, HOWEVER, make sure to consider and think carefully where those lines of output are occurring (i.e. main function or one of the user defined functions). Please don’t feel discourage that Dr. Steinberg didn’t provide this as it wasn’t on purpose to make the course harder (it was with good intentions). We (Dr. Steinberg and his TAs/ULAs) are still here to help you succeed! My objective is to help you build your programming and problem solving skills for your future foundational CS courses such as CS1. If you having trouble of where to begin or where something should occur, please come see us!!! We will still provide the same level of help! Something Dr. Steinberg will provide to assist is how the main function should work in this assignment to help you get started. Inside the main function you are going to welcome to the user to the game. After welcoming the user, you will then have to open the word text file (provided in Webcourses) and read one string at a time. Once a string is read successfully, create the masked version (hint hint maskWord) and then begin the actual round (hint hint call playRound)! Once playRound ends you will then ask the user if they want to play again (OMG this sounds like the function playAgain right!?!) Based on the result, you either have to repeat the action (scan new word and play a new round, or terminate the program). Large Program 3 Page 10 Some Advice Here are some tips and tricks that will help you with this assignment and make the experience enjoyable. • Do not try to write out all the code and build it at the end to find syntax errors. For each new line of code written (my rule of thumb is 2-3 lines), build it to see if it compiles successfully. It will go a long way! • After any successful build, run the code to see what happens and what current state you are at with the program writing so you know what to do next! If the program performs what you expected, you can then move onto the next step of the code writing. If you try to write everything at once and build it successfully to find out it doesn’t work properly, you will get frustrated trying find out the logical error in your code! Remember, logical errors are the hardest to fix and identify in a program! • Start the assignment early! Do not wait last minute (the day of) to begin the assignment. • Ask questions! It’s ok to ask questions. If there are any clarifications needed, please ask TAs and the Instructor! We are here to help!!! You can also utilize the discussion board on Webcourses to share a general question about the large program as long as it doesn’t violate the academic dishonesty policy.
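To make the division of labor among the required prototypes concrete, here is a minimal sketch of the two string helpers described above. The function names and parameters are the ones the assignment requires; the bodies below are only one possible illustration (they assume a case-insensitive comparison, matching the sample output where 'A' and 'a' are treated the same) and are not Dr. Steinberg's reference solution. You would still need to write your own playRound logic and input handling around them.

#include <ctype.h>

//number of times letter occurs in word (0 if it does not occur at all)
int occurancesInWord(char userguess, char answer[])
{
    int count = 0;
    for (int i = 0; answer[i] != '\0'; i++)
    {
        if (tolower(answer[i]) == tolower(userguess))
            count++;
    }
    return count;
}

//replace the respective *'s in the masked word with the correctly guessed letter
void updateStarWord(char starword[], char answer[], char userguess)
{
    for (int i = 0; answer[i] != '\0'; i++)
    {
        if (tolower(answer[i]) == tolower(userguess))
            starword[i] = answer[i];
    }
}

A round could then test occurancesInWord(guess, answer) > 0 to decide between revealing letters with updateStarWord and adding a strike.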


[SOLVED] Cop3223c large program 2 the unlimited vending machine and wallet

In this assignment, the primary objective is to apply your skills in user-defined functions and pointers!! Pointers are an extremely imperative topic to understand in this class and your future as a CS student. In this assignment, you are going to implement the following scenario.Scenario: You have a magical wallet that contains an unlimited amount of cash and you can pull any amounts of $1, $5, and $10 bills. You find a vending a machine that contains (you guessed it from the title of the assignment) an unlimited amount of drinks. You decide that you are going to purchase a lot of drinks from the unlimited vending machine with your unlimited wallet. In this large programming assignment, you will simulate this scenario that utilizes pointers heavily.The program will have the following setup. • The program will first welcome the user with a friendly message. It will provide the user with instructions on how the program works. • The program will be menu driven. This means the program will display a list of options of what the user would like to do. From the list of options, the user will select one of the following options.– User will be able to view the current amount in their hand. When the program starts for the first time, the user will have $0.00. – User will be able to order a drink from the vending machine. Note: In order for the user to make a purchase, the program must make sure the amount of money in the user’s hand will be able to handle the transaction. If not, the user will have to make sure to pull money from the unlimited wallet until enough money is added to make the purchase.– User will be able to view the menu drinks along with its respective prices. – User will be able to pull money from the unlimited wallet. – User will be able to exit the program and see the amount of money in their hand before the program terminates.Large Program 2 Page 1 • The following table shows the vending machine items you will be using along with their respective price. Make sure to use these items and prices or else points will be deducted.Due Date The assignment is due on October 30th at 11:59pm EST via Webcourses. Do not email the professor or TAs your submissions as they will not be accepted! This assignment is accepted late up to 24 hours with a penalty. Please see the syllabus for more information on this. Make sure to submit on time to get potential full credit. Make sure to also take into consideration the uploading time. In the past, students who are working last minute on the assignment sometimes run into uploading issues where their Internet may run slow, resulting in late submissions. The timestamp Webcourses uses for your submission will be applied and will be the final say. Please do not email the instructor or TAs saying your Internet was running slow. If the time is off by a second of the due date, then the assignment is considered late. Plan accordingly!Important! Read Carefully! In this assignment, you are going to implement 6 user defined functions void greeting(); //welcome the user void order(double *balance); //user will make a purchase void viewHand(double *balance); //display current amount in hand void transaction(double *balance, double price); //transaction with user void pullMoney(double *balance); //grab more money from the unlimited ←- wallet void displayVendingOptions(); //display beverage options and prices You cannot add any additional functions or remove them. Modifying them in any way (name, parameters, etc. . . ) will result in point deductions! 
Utilize them the way they are provided! The function prototypes are provided for you here. You will create a file called largeproLarge Program 2 Page 2 gram2_lastnamefirstname.c for this assignment and write out the code. There is NO SKELETON PROVIDED FOR THIS LARGE PROGRAM! Do NOT modify any of the function prototypes that are provided for you! Any modifications from the provided function prototypes will result in points being deducted! You are allowed to use the same message that is provided in the sample output for your program. You can also make modifications to the text of messages as long as the program follows all directions as stated in this assignment. The following section will discuss the prototypes more in detail. Make sure that you name your C file largeprogram2_lastname_firstname.c, where lastname and firstname is your last and first name (as registered in webcourses). For example Dr. Steinberg’s file would be named largeprogram2_Steinberg_Andrew.c. Make sure to include the underscore character _. If your file is not named properly, points will be deducted. The Function Prototypes This section will discuss the function prototypes. void greeting(); This function welcomes the user to the vending machine by printing a friendly message to the user. Important: The menu driven component should not be placed in the greeting function. That should be placed in your main function after the greeting function is invoked. void order(double *balance); The order function will handle the collecting of the respective item the user would like to order from the vending machine. Inside the function definition you will ask the user what they would like to order. Once the user makes a selection, the program should display the selection along with the cost of the item. Important: Now that you know conditions, make sure to handle if the user inputs an invalid selection. Once a proper selection was made, the function will begin the transaction. The function has one parameter that holds a reference to the address of the balance variable. Make sure to consider invalid input. void viewHand(double *balance); The viewHand function will display the user’s account balance to the terminal. The function’s parameter is a reference to the address of the variable balance. Note. If the user selects the functionality first during the program run, the account balance should be $0.00. void transaction(double *balance, double price); The transaction function handles the actual transaction based on the item that was purchased from the vending machine. The function has two parameters. One is a reference to the amount of money in the user’s hand and the other is the price of the item selected. Important: The second parameter is passed by value. It is not a typo. Also, there can be a situation where the amount of money in the user’s hand is not enough for the transaction. If such a scenario Large Program 2 Page 3 occurs you must have the user pull money from the unlimited wallet. This is all done in the function definition. Hint: Think about conditional while loop in this scenario. Once the balance can afford the item, then proceed with the actual transaction. void pullMoney(double *balance); The pullMoney function adds money to the user’s hand. Three options are given for how much they can pull from the wallet. The user can add $1, $5, or $10. One of these options is selected and the amount stored in their hand will increase based on the respective amount. 
The function has one parameter that represents a reference to the balance. Make sure to consider invalid input. void displayVendingOptions(); The displayVendingOptions function displays the beverage items the vending machine has available along with its respective price. The table of items and prices is on page 2 of the assignment. Important: The displayVendingOptions function does not display the menu of options the user can perform in the vending machine. That component should be in the main function. Testing on Eustis It is your responsibility to test your code on Eustis. If you submit your assignment without testing on Eustis, you risk points being deducted for things that may behave differently on your operating system. Remember, you cannot dispute grades if your code didn’t work properly on Eustis all because it worked on your machine. The Eustis environment gives the final say. Plan accordingly to test on Eustis!! Why was I not provided a Python Script File? You are probably wondering why Dr. Steinberg didn’t provide a python script to check your code like in the small programs. The reason is that Dr. Steinberg wants his students to enjoy the assignment without being told of what must be displayed to the terminal and matching exactly. For large programs, Dr. Steinberg allows students to change the text output to the terminal, however your program must perform the specific functionalities that is requested at least to receive potential full credit. The Rubric Please see the assignment page for the established rubric on webcourses. Large Program 2 Page 4 Comment Header Make sure you place a comment header at the top of your C file. You will use single line comments to write your name, professor, course/section, and assignment. For example, Dr. Steinberg’s header would be: //Andrew Steinberg //Dr. Steinberg //COP3223C Section 1 //Large Program 2 Missing a comment header will result in point deductions! The Solution Text File I have provided a sample output of a working program of what is expected in webcourses. As I stated before, you are welcome to change the text output to the terminal, or you can keep it exact like in the sample. Recommendations for Completing the Assignment Here is the order of steps I would strongly consider when attempting the large program assignment. After each of these steps, see if your program builds/compiles successfully and performs the correct task. See if the output is what you were expecting. 1. Start with the Message Greeting. Try to get the program to welcome the user. 2. After a successful message greeting, try to design menu driven component. Hint: Think of how to use a while loop and switch statement. 3. After getting the menu working. Work on the display menu function to display all drink prices. Note. This is different from menu that shows user options of what they can do. 4. After getting the display menu function working, work on the viewHand function. 5. After finishing the viewHand function, work on the pullMoney function to increase the amount of money in your hand. 6. After getting the pullMoney function to work, begin the ordering function. This is where the user will pick an item to purchase and make a transaction. 7. Test your program with different inputs to see if it works successfully. Large Program 2 Page 5 Basic Rules for Completing The Assignment Please make sure to follow these rules when completing the assignment. • Please use all items that were presented to you in class and labs. 
Everything that was presented to you can help you compete the assignment. • DO NOT COPY ANYONE’S CODE! This is cheating as stated in syllabus. This assignment is individual work. • DO NOT SHOW YOUR CODE to classmates. The only group of people that can see your code are the TAs, ULAs, and Course Instructor. Follow the rules from the syllabus if you talk to other students. Otherwise you may be in cheating violations! • Make sure to use good practice of writing code. Anything that is considered bad practice will result point deductions. Example GOTO statements should not be seen anyone’s code. There is a reason I don’t go cover them. They are extremely bad in practice to use and just awful. Some Advice Here are some tips and tricks that will help you with this assignment and make the experience enjoyable. • Do not try to write out all the code and build it at the end to find syntax errors. For each new line of code written (my rule of thumb is 2-3 lines), build it to see if it compiles successfully. It will go a long way! • After any successful build, run the code to see what happens and what current state you are at with the program writing so you know what to do next! If the program performs what you expected, you can then move onto the next step of the code writing. If you try to write everything at once and build it successfully to find out it doesn’t work properly, you will get frustrated trying find out the logical error in your code! Remember, logical errors are the hardest to fix and identify in a program! • Start the assignment early! Do not wait last minute (the day of) to begin the assignment. • Ask questions! It’s ok to ask questions. If there are any clarifications needed, please ask TAs and the Instructor! We are here to help!!! Do not wait last minute to seek help as generally you could be waiting in a long line of fellow classmates who may need help. As stated previously start early and ask your questions right away!
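As one concrete illustration of how the pointer parameters described above are meant to be used, here is a minimal sketch of pullMoney and transaction. The prototypes are the ones required by the assignment; the bodies, the menu text, and the simple scanf-based input handling are only assumptions for illustration (the assignment expects you to handle invalid, including non-numeric, input more carefully) and are not the official solution.

#include <stdio.h>

//grab more money from the unlimited wallet: $1, $5, or $10
void pullMoney(double *balance)
{
    int option = 0;
    do
    {
        printf("How much would you like to pull? 1: $1  2: $5  3: $10\n");
        printf("Choice: ");
        scanf("%d", &option);
        if (option < 1 || option > 3)
            printf("Invalid option. Please try again.\n");
    } while (option < 1 || option > 3);

    if (option == 1)
        *balance += 1.00;
    else if (option == 2)
        *balance += 5.00;
    else
        *balance += 10.00;
}

//transaction with user: keep pulling money until the item can be afforded, then pay
void transaction(double *balance, double price)
{
    while (*balance < price)
    {
        printf("You only have $%.2f in hand but need $%.2f.\n", *balance, price);
        pullMoney(balance);
    }
    *balance -= price;
    printf("Purchase complete! You have $%.2f left in hand.\n", *balance);
}

Note that inside order, which already receives double *balance, the call would simply be transaction(balance, price); every function works on the address of the single balance variable declared in main, which is the whole point of the pointer parameters.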


[SOLVED] Cop3223c large program 1 the toothpick game

In this assignment, you are going to implement the toothpick game! If you haven’t played the game, here are the general rules. A table contains 31 toothpicks. There are two players (in this assignment you and the computer) who will each take turns picking up either 1, 2, or 3 (cannot grab 4 or more toothpicks or even put back toothpicks) toothpicks off from the table.The objective of winning the game is to not be the last person to pick up the final toothpick. The player that picks up the last toothpick loses. Utilizing the concepts of user-defined functions, control flow, and conditions, you are going to implement this classic game. In this programming assignment, the user will always go first.Due Date The assignment is due on October 2nd at 11:59pm EST via Webcourses. Do not email the professor or TAs your submissions as they will not be accepted! This assignment is accepted late up to 24 hours with a penalty. Please see the syllabus for more information on this.Make sure to submit on time to get potential full credit. Make sure to also take into consideration the uploading time. In the past, students who are working last minute on the assignment sometimes run into uploading issues where their Internet may run slow, resulting in late submissions.The timestamp Webcourses uses for your submission will be applied and will be the final say. Please do not email the instructor or TAs saying your Internet was running slow. If the time is off by a second of the due date, then the assignment is considered late. Plan accordingly and start early!Large Program 1 Page 1 Important! Read Carefully! In this assignment, you are going to implement 6 user defined functions to simulate the classic game. void greeting(); //display welcome message to user int playRound(int round); //play one round int humanPick(); //retrieve the user’s guess int computerPick(); //computer makes its pick int leftOnTable(int toothpicks, int taken); //calculate number of ←- toothpicks left void winnerAnnouncment(int user); //overall winner of round announcementYou cannot add any additional functions or remove them. Modifying them in any way (name, parameters, etc. . . ) will result in point deductions! Utilize them the way they are provided!The function prototypes and partial definitions are provided for you in the file LargeProgram1_Skeleton.c for this assignment. Pay close attention to the comments of where you will fill in the missing code. Download the file and fill in the missing blanks to make the program run perfectly. Do NOT modify any of the code that was already provided for you! You can remove the comments that provide hints of missing code. Any modifications from the provided code will result in points being deducted! You are allowed to use the same message that is provided in the sample output for your program. You can also make modifications to the text of messages as long as the program follows all directions as stated in this assignment.The following section will discuss the prototypes more in detail. Make sure that you name your C file largeprogram1_lastname_firstname.c, where lastname and firstname is your last and first name (as registered in webcourses). For example Dr. Steinberg’s file would be named largeprogram1_Steinberg_Andrew.c. Make sure to include the underscore character _. If your file is not named properly, points will be deducted.Large Program 1 Page 2 The Function Prototypes This section will discuss the function prototypes. Figure 1: Sample output from the greeting function. 
void greeting(); The greeting function will welcome the user to the game and explain the rules on the terminal window. See the figure 1 of what the function produces when invoked. This should be the first thing that is displayed when the program begins its execution. int playRound(int round); The playRound function simulates an entire round of the game. The function has one parameter of type int which represents the current round being played. Inside the function, a loop has been provided for you (do not modify the loop). You are going to fill in the code that simulates the entire round of the game. That means you will need think about how execute each turn between the user and computer (think about how you will call those other user defined functions). The function should display the number of toothpicks left on the table and allow the user to make a selection (hint think about humanPick function). Also, you will need to take into consideration if the user decides to cheat by taking the incorrect number of toothpicks. If so, the function should display message that tells the user that they are breaking the rules. In this case, make sure to let the user go again (hint think about using a condition). This also includes the scenario where the user might grab extra toothpicks when there is not enough on the table (example, user tries to grab 3, but there are only 2 on the table should not be allowed). If the user makes a valid selection, then the program should let the computer make a pick (call the computerPick function). After each player makes a valid pick, display the number toothpicks taken off the table. Once the table has 0 toothpicks, the function should terminate and return an integer representing who went last in picking up the final toothpick(s). Note: This function has already been called for you in the skeleton file of the main function. You will just need to implement the definition of the function of what it is suppose to be accomplished. You were also provided some components of the code which is the while loop since loops haven’t been discussed yet. Please note that anything inside the control structure of loop will be repeated. Make sure to think about the lines of code you want to see repeated in execution.Large Program 1 Page 3 int humanPick(); The humanPick function will ask the user how many toothpicks they want to take. The user will enter an integer value from the keyboard. They can only take 1, 2, or 3 toothpicks. After the user makes a selection, he function returns the value. Important! It is possible that the user can make an invalid choice (typing in the wrong number). However, you do not need to worry about a non numerical character being typed in by accident. If the user makes an invalid numerical choice, the program will inform the user by displaying a message to the terminal (that part is done inside of the playRound function). That means the invalid numerical number is also returned. int computerPick(int choice, int leftover);The computerPick function allows the computer to make a selection. The function has two paramters. The first paramter is a value that represents the number of toothpicks the user took in the last turn (1, 2, or 3). The second parameter represents the value of toothpicks left on the table. Now the computer has a secret strategy of selecting a number and it based on what the user does (how sneaky). Here is how the computer makes it moves. It chooses one of three options.1. 
If there are more than 4 toothpicks left on the table, then the computer should take 4 − x toothpicks, where x is the number of toothpicks the user took in the previous turn. 2. If there are 2 to 4 (both inclusive) toothpicks left, then the computer should withdraw enough toothpicks to leave 1 left on the table for the user to select. 3. If there is 1 toothpick left on the table, then the computer takes the last toothpick and of course loses the round. int leftOnTable(int toothpicks, int taken);The leftOnTable function will simply calculate the number of toothpicks left on the table of a player removes them from the table. It has two parameters. The first parameter represents the number of toothpicks on the table and the second parameter represents the number of toothpicks taken based on the respective player’s turn. The resulting value is returned.Large Program 1 Page 4 void winnerAnnouncment(int user); The winnerAnnouncment function determines the overall winner of the round. The function has one parameters. The value represent the user who won. You will use this value to determine the winner. Based on the winner, you will display some of sort of message to the terminal. If you the user wins, then the message You won! I’ll let you have this one. (this is the computer talking to you) is displayed. If the computer wins, then the message I won! Haha better luck next time! is displayed to the terminal. This function is invoked in the main function after the playRound function terminates.The Skeleton File For this large program, you are given a skeleton file with some code. The function prototypes and partial definitions are provided for you in the file LargeProgram1_Skeleton.c for this assignment. Pay close attention to the comments of where you will fill in the missing code. Download the file and fill in the missing blanks to make the program run perfectly. Do NOT remove the loops that were provided for you. Any loops removed from the provided code will result in points being deducted! You are allowed to use the same message that is provided in the sample output for your program. You can also make modifications to the text of messages as long as the program follows all directions as stated in this assignment.You can find the skeleton file in this pdf along with the actual C file in the assignment page of webcourses. You will see the partial comment header, preprocessor directives, function prototypes, the main function, and one of the function definitions partially implemented. The main function provided has some missing items that you will need to fill in (comments are provided in skeleton). Some of you are probably curious about the for statement. This is a counting loop that will allow us to execute code a number of times without rewriting mulitiple times. This topic will be covered after Exam 1. There is nothing else that needs to be added for the main function.As for the partial definition of playRound, there will be code that you need to fill in. Now, some of you probably have not seen the while statement before. This is another loop we will cover after Exam 1. The while loop you see in the code is conditional. This means anything inside the control structure of the while loop will keep executing until the condition between the parenthesis has evaluated to false (note you should be able to recognize the expression from learning conditions). The code in the while loop’s control structure will keep executing until there no more toothpicks left on the table. 
You will have to write code inside the control structure of the while loop to complete this function's definition. You will also have to write code outside the while loop's control structure as well. Look at the sample output provided and please ask the TAs or course instructor for clarification.

//Name:
//Dr. Steinberg
//COP3223C
//Large Program 1 Skeleton

#define ROUNDS 3

#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>

//void greeting(); //display welcome message to user
int playRound(int round); //play one round
//int humanPick(); //retrieve the user's guess
//int computerPick(int choice, int leftover); //computer makes its pick
//int leftOnTable(int toothpicks, int taken); //calculate number of toothpicks left
//void winnerAnnouncment(int user); //overall winner of round announcement

int main()
{
    //insert some code here that will greet the user

    for (int x = 0; x < ROUNDS; ++x)
    {
        int result = playRound(x + 1); //call playRound and assign result the value the function returns
        //insert some code here that will determine the winner
    }

    printf("************************************************************");
    printf("Thank you for playing!\n");
    return 0;
}

int playRound(int round)
{
    printf("Welcome to a new round %d!\n", round);
    printf("You may go first!\n");

    int toothpicks = 31; //number of toothpicks to start with
    //you can insert code here

    //loop that keeps track of toothpicks until no more toothpicks are left. we will learn about conditional loops soon :)
    while (toothpicks != 0)
    {
        //insert code here that simulates the round properly based on assignment directions

        return 0; //terminates loop HOWEVER YOU WILL NEED TO CHANGE THIS WHEN BUILDING YOUR PROGRAM. THIS WAS PUT IN THE SKELETON SO THAT THE INITIAL RUN ISN'T STUCK IN AN INFINITE LOOP
    }

    return 0; //returns 0 HOWEVER YOU WILL NEED TO CHANGE THIS PART OF THE CODE TO MAKE THE PROGRAM WORK PROPERLY! YOU DON'T WANT THE SAME VALUE RETURNED ALWAYS
}

Testing on Eustis It is your responsibility to test your code on Eustis. If you submit your assignment without testing on Eustis, you risk points being deducted for things that may behave differently on your operating system. Remember, you cannot dispute grades if your code didn't work properly on Eustis all because it worked on your machine. The Eustis environment gives the final say. Plan accordingly to test on Eustis!! Why was I not provided a Python Script File? You are probably wondering why Dr. Steinberg didn't provide a python script to check your code like in the small programs. The reason is that Dr. Steinberg wants his students to enjoy the assignment without being told of what must be displayed to the terminal and matching exactly. For large programs, Dr. Steinberg allows students to change the text output to the terminal, however your program must perform the specific functionalities that is requested at least to receive potential full credit.
The Rubric Please see the assignment page for the established rubric on webcourses. Comment Header Make sure you place a comment header at the top of your C file. You will use single line comments to write your name, professor, course/section number, and assignment. For example, Dr. Steinberg’s header would be: //Andrew Steinberg //Dr. Steinberg //COP3223C Section 1 //Large Program 1 Missing a comment header will result in point deductions! The Solution Text File I have provided a sample output of a working program of what is expected in webcourses. As I stated before, you are welcome to change the text output to the terminal, or you can keep it exact like in the sample. Large Program 1 Page 7 Recommendations for Completing the Assignment Here is the order of steps I would strongly consider when attempting the large program assignment. After each of these steps, see if your program builds/compiles successfully and performs the correct task. See if the output is what you were expecting. 1. Work on the greeting function. See if you can get the program to welcome the user to the game. 2. Work on the leftOnTable function. It is actually simple. 3. Work on the humanPick function. See if you can successfully collect the user’s input and return. I would recommend testing this by writing a printf statement after the function is invoked. 4. Work on the computerPick function. Think about all possible scenarios that can happen based on the user’s selection. 5. Work on the playRound function. This function is the heart and soul of the program. Think about the order of the steps taken in implementing the game. How does humanPick and computerWork help out? How would the leftOnTable function call work in this user-defined function? How would you terminate the function properly when the round is over and what value would you send back to the main function. 6. Work on the winnerAnnouncment function. Using the value passed, how are you able to determine who won? Basic Rules for Completing The Assignment Please make sure to follow these rules when completing the assignment. • Please use all items that were presented to you in class and labs. Everything that was presented to you can help you compete the assignment. • DO NOT COPY ANYONE’S CODE! This is cheating as stated in syllabus. This assignment is individual work. • DO NOT SHOW YOUR CODE to classmates. The only group of people that can see your code are the TAs, ULAs, and Course Instructor. Follow the rules from the syllabus if you talk to other students. Otherwise you may be in cheating violations! • Make sure to use good practice of writing code. Anything that is considered bad practice will result point deductions. Example GOTO statements should not be seen anyone’s code. There is a reason I don’t go cover them. They are extremely bad in practice to use and just awful. Large Program 1 Page 8 Some Last bit of Advice Here are some tips and tricks that will help you with this assignment and make the experience enjoyable. • Do not try to write out all the code and build it at the end to find syntax errors. For each new line of code written (my rule of thumb is 2-3 lines), build it to see if it compiles successfully. It will go a long way! • After any successful build, run the code to see what happens and what current state you are at with the program writing so you know what to do next! If the program performs what you expected, you can then move onto the next step of the code writing. 
If you try to write everything at once, build it successfully, and then find out it doesn't work properly, you will get frustrated trying to find the logical error in your code! Remember, logical errors are the hardest to identify and fix in a program! • Start the assignment early! Do not wait until the last minute (the day it is due) to begin the assignment. • Ask questions! It's OK to ask questions. If any clarifications are needed, please ask the TAs and the Instructor! We are here to help!!! Do not wait until the last minute to seek help, as you could otherwise be waiting in a long line of fellow classmates who also need help. As stated previously, start early and ask your questions right away!
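To make the computer's secret strategy described above concrete, here is a minimal sketch of computerPick and leftOnTable that follows the three rules listed in the assignment. The prototypes are the ones given in the skeleton; the exact bodies below are only an illustration, not the official solution.

//calculate number of toothpicks left after a player removes some
int leftOnTable(int toothpicks, int taken)
{
    return toothpicks - taken;
}

//computer makes its pick based on the user's last choice and what is left on the table
int computerPick(int choice, int leftover)
{
    if (leftover > 4)
        return 4 - choice;   //rule 1: take 4 minus whatever the user just took
    else if (leftover >= 2)
        return leftover - 1; //rule 2: leave exactly 1 toothpick for the user
    else
        return 1;            //rule 3: forced to take the last toothpick and lose
}

Inside playRound, each valid human turn would then be followed by something like taken = computerPick(userTaken, toothpicks); toothpicks = leftOnTable(toothpicks, taken);, with the table count displayed after each pick.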


[SOLVED] Cse 6242 / cx 4242: data and visual analytics hw 3: spark, docker, databricks, aws and gcp

CSE 6242 / CX 4242: Data and Visual Analytics HW 3: Spark, Docker, DataBricks, AWS and GCPDownload the HW3 Skeleton, Q1 Data, Q2 Data, and Q4 Data before you begin. Also, create an AWS Academy account as outlined in Step 1 of the AWS Setup GuideModern-day datasets are large. For example, the NASA Terra and Aqua satellites each produces over 300GB of satellite imagery daily. These datasets are too large for typical computer hard drives and requires advanced technologies for processing. In this assignment, you will work with a dataset of over 1 billion taxi trips from the New York City Taxi & Limousine Commission (TLC). Further details on this dataset are available here. This assignment aims to familiarize you with various tools that will be valuable for future projects, research, or career opportunities. By including AWS, Azure and GCP, we want to provide the opportunity to explore and compare these rapidly evolving platforms. This experience will help you make informed decisions when selecting a cloud platform in the future, allowing you to get started quickly and confidently. Many of the computational tasks in this assignment are straightforward, though quite a bit of “setup” will be needed before reaching the actual “programming” stage. Setting up work environments, launching clusters, monitoring compute usage, and running large-scale experiments on cloud platforms are important skills. This assignment familiarizes you with using machine clusters and understanding the pay-per-use model of most cloud services, offering a valuable first experience with cloud computing for many students. The maximum possible score for this homework is 100 points Homework Overview…………………………………………………………………………………………………………………. 1 Important Notes ……………………………………………………………………………………………………………………….. 2 Submission Notes…………………………………………………………………………………………………………………….. 2 Do I need to use the specific version of the software listed?……………………………………………………………. 2 Q1 [15 points] Analyzing trips data with PySpark…………………………………………………………………………… 3 Tasks and point breakdown…………………………………………………………………………………………………. 3 Q2 [30 pts] Analyzing dataset with Spark/Scala on Databricks ………………………………………………………… 6 Tasks and point breakdown…………………………………………………………………………………………………. 7 Q3 [35 points] Analyzing Large Amount of Data with PySpark on AWS…………………………………………….. 9 Tasks and point breakdown………………………………………………………………………………………………..10 Q4 [10 points] Analyzing a Large Dataset using Spark on GCP………………………………………………………12 Tasks and point breakdown………………………………………………………………………………………………..13 Q5 [10 points] Regression: Automobile price prediction using Azure Machine Learning ……………………..14 Tasks and point breakdown………………………………………………………………………………………………..14 2 Version 1Important Notes A. Submit your work by the due date on the course schedule. a. Every assignment has a generous 48-hour grace period, allowing students to address unexpected minor issues without facing penalties. You may use it without asking. b. Before the grace period expires, you may resubmit as many times as you need. c. TA assistance is not guaranteed during the grace period. d. Submissions during the grace period will display as “late” but will not incur a penalty. e. We will not accept any submissions executed after the grace period ends. B. Always use the most up-to-date assignment (version number at the bottom right of this document). The latest version will be listed in Ed Discussion. C. 
You may discuss ideas with other students at the “whiteboard” level (e.g., how cross-validation works, use HashMap instead of an array) and review any relevant materials online. However, each student must write up and submit the student’s own answers. D. All incidents of suspected dishonesty, plagiarism, or violations of the Georgia Tech Honor Code will be subject to the institute’s Academic Integrity procedures, directly handled by the Office of Student Integrity (OSI). Consequences can be severe, e.g., academic probation or dismissal, a 0 grade for assignments concerned, and prohibition from withdrawing from the class. Submission Notes A. All questions are graded on the Gradescope platform, accessible through Canvas. B. We will not accept submissions anywhere else outside of Gradescope. C. Submit all required files as specified in each question. Make sure they are named correctly. D. You may upload your code periodically to Gradescope to obtain feedback on your code. There are no hidden test cases. The score you see on Gradescope is what you will receive. E. You must not use Gradescope as the primary way to test your code. It provides only a few test cases and error messages may not be as informative as local debuggers. Iteratively develop and test your code locally, write more test cases, and follow good coding practices. Use Gradescope mainly as a “final” check. F. Gradescope cannot run code that contains syntax errors. If you get the “The autograder failed to execute correctly” error, verify: a. The code is free of syntax errors (by running locally) b. All methods have been implemented c. The correct file was submitted with the correct name d. No extra packages or files were imported G. When many students use Gradescope simultaneously, it may slow down or fail. It can become even slower as the deadline approaches. You are responsible for submitting your work on time. H. Each submission and its score will be recorded and saved by Gradescope. By default, your last submission is used for grading. To use a different submission, you MUST “activate” it (click the “Submission History” button at the bottom toolbar, then “Activate”). Do I need to use the specific version of the software listed? Under each question, you will see a set of technologies with specific versions – this is what is installed on the autograder and what it will run your code with. Thus, installing those specific versions on your computer to complete the question is highly recommended. You may be able to complete the question with different versions installed locally, but you are responsible for determining the compatibility of your code. We will not award points for code that works locally but not on the autograder. 3 Version 1 Q1 [15 points] Analyzing trips data with PySpark Follow these instructions to download and set up a preconfigured Docker image that you will use for this assignment. that you will use for this assignment. Why use Docker? In earlier iterations of this course, students installed software on their own machines, and we (both students and instructor team) ran into many issues that could not be resolved satisfactorily. Docker allows us to distribute a cross-platform, preconfigured image with all the requisite software and correct package versions. Once Docker is installed and the container is running, access Jupyter by browsing to http://localhost:6242. 
There is no need to install any additional Java or PySpark dependencies as they are all bundled as part of the Docker container.You will use the yellow_tripdata_2019-01_short.csv dataset, a modified record of the NYC Green Taxi trips that includes information about the pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, fare amounts, payment types, and driver-reported passenger counts. When processing the data or performing calculations, do not round any values, unless specifically instructed to. Technology PySpark, Docker Deliverables [Gradescope] q1.ipynb: your solution as a Jupyter Notebook file IMPORTANT NOTES: • Only regular PySpark Dataframe Operations can be used. • Do NOT use PySpark SQL functions, i.e., sqlContext.sql(‘select * … ‘). We noticed that students frequently encountered difficult-to-resolve issues when using these functions. Additionally, since you already worked extensively with SQL in HW1, completing this task in SQL would offer limited educational value. • Do not reference sqlContext within the functions you are defining for the assignment. • If you re-run cells, remember to restart the kernel to clear the Spark context, otherwise an existing Spark context may cause errors. • Be sure to save your work often! If you do not see your notebook in Jupyter, then double check that the file is present in the folder and that your Docker has been set up correctly. If, after checking both, the file still does not appear in Jupyter then you can still move forward by clicking the “upload” button in the Jupyter notebook and uploading the file – however, if you use this approach, then your file will not be saved to disk when you save in Jupyter, so you would need to download your work by going to File > Download as… > Notebook (.ipynb), so be sure to download often to save your work! • Do not add any cells or additional library imports to the notebook. • Remove all your additional debugging code that renders output, as it will crash Gradescope. For instance, any additional print, display and show statements used for debugging must be removed. Tasks and point breakdown 1. [1 pt] You will be modifying the function clean_data to clean the data. Cast the following columns into the specified data types: a. passenger_count — integer b. total_amount — float c. tip_amount — float d. trip_distance — float e. fare_amount — float f. tpep_pickup_datetime — timestamp 4 Version 1 g. tpep_dropoff_datetime — timestamp 2. [4 pts] You will be modifying the function common_pair. Return the top 10 pickup-dropoff location pairs that have the highest sum of passenger_count who have traveled between them. Sort the location pairs by total passengers between pairs. For each location pair, also compute the average amount per passenger over all trips (name this per_person_rate), utilizing total_amount. For pairs with the same total passengers, sort them in descending order of per_person_rate. Filter out any trips that have the same pick-up and drop-off location. Rename the column for total passengers to total_passenger_count. Sample Output Format — The values below are for demonstration purposes: PULocationID DOLocationID total_passenger_count per_person_rate 1 2 23 5.242345 3 4 5 6.61345634 3. [4 pts] You will be modifying the function distance_with_most_tip . Filter the data for trips having fares (fare_amount) greater than $2.00 and a trip distance (trip_distance) greater than 0. Calculate the tip percent (tip_amount * 100 / fare_amount) for each trip. 
Round all trip distances up to the closest mile and find the average tip_percent for each trip_distance. Sort the result in descending order of tip_percent to obtain the top 15 trip distances which tip the most generously. Rename the column for rounded trip distances to trip_distance, and the column for average tip percents tip_percent. Sample Output Format — The values below are for demonstration purposes: trip_distance tip_percent 2 6.2632344561 1 4.42342882 4. [6 pts] You will be modifying the function time_with_most_traffic to determine which hour of the day has the most traffic. Calculate the traffic for a particular hour using the average speed of all taxi trips which began during that hour. Calculate the average speed as the average trip_distance divided by the average trip duration, as distance per hour. Make sure to determine the average durations and average trip distances before calculating the speed. It will likely be helpful to cast the dates to the long data type when determining the interval. A day with low average speed indicates high levels of traffic. The average speed may be 0, indicating very high levels of traffic. Additionally, you must separate the hours into AM and PM, with hours 0:00-11:59 being AM, and hours 12:00-23:59 being PM. Convert these times to the 12 hour time, so you can match the output below. For example, the row with 1 as time of day, should show the average speed between 1 am and 2 am in the am_avg_speed column, and between 1 pm and 2pm in the pm_avg_speed column. Use date_format along with the appropriate pattern letters to format the time of day so that it matches the example output below. Your final table should contain values sorted from 0-11 for time_of_day. There may be data missing for a time of day, and it may be null for am_avg_speed 5 Version 1 or pm_avg_speed. If an hour has no data for am or pm, there may be missing rows. You will not have rows for all possible times of day, and do not need to add them to the data if they are missing. Sample Output Format — The values below are for demonstration purposes: time_of_day am_avg_speed pm_avg_speed 1 0.953452345 9.23345272 2 5.2424622 null 4 null 2.55421905 6 Version 1 Q2 [30 pts] Analyzing dataset with Spark/Scala on Databricks Firstly, go over this Spark on Databricks Tutorial, to learn the basics of creating Spark jobs, loading data, and working with data. You will analyze nyc-tripdata.csv1 using Spark and Scala on the Databricks platform. (A short description of how Spark and Scala are related can be found here.) You will also need to use the taxi zone lookup table using taxi_zone_lookup.csv that maps the location ID into the actual name of the region in NYC. The nyc-trip data dataset is a modified record of the NYC Green Taxi trips and includes information about the pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, fare amounts, payment types, and driverreported passenger counts. Technology Spark/Scala, Databricks Deliverables [Gradescope] • q2.dbc: Your solution as Scala Notebook archive file (.dbc) exported from Databricks (see Databricks Setup Guide below) • q2.scala: Your solution as a Scala source file exported from Databricks (see Databricks Setup Guide below) • q2_results.csv: The output results from your Scala code in the Databricks q2 notebook file. You must carefully copy the outputs of the display()/show() function into a file titled q2_results.csv under the relevant sections. 
Q2 [30 pts] Analyzing a dataset with Spark/Scala on Databricks

First, go over this Spark on Databricks Tutorial to learn the basics of creating Spark jobs, loading data, and working with data. You will analyze nyc-tripdata.csv [1] using Spark and Scala on the Databricks platform. (A short description of how Spark and Scala are related can be found here.) You will also need to use the taxi zone lookup table, taxi_zone_lookup.csv, which maps each location ID to the actual name of the region in NYC. The nyc-tripdata dataset is a modified record of the NYC Green Taxi trips and includes information about the pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, fare amounts, payment types, and driver-reported passenger counts.
[1] Graph derived from the NYC Taxi and Limousine Commission.

Technology
Spark/Scala, Databricks

Deliverables [Gradescope]
• q2.dbc: Your solution as a Scala Notebook archive file (.dbc) exported from Databricks (see Databricks Setup Guide below)
• q2.scala: Your solution as a Scala source file exported from Databricks (see Databricks Setup Guide below)
• q2_results.csv: The output results from your Scala code in the Databricks q2 notebook file. You must carefully copy the outputs of the display()/show() function into a file titled q2_results.csv under the relevant sections. Please double-check and compare your actual output with the results you copied.

IMPORTANT NOTES:
• Use only Firefox, Safari or Chrome when configuring anything related to Databricks. The setup process has been verified to work on these browsers.
• Carefully follow the instructions in the Databricks Setup Guide. (You should have already downloaded the data needed for this question using the link provided before Homework Overview.)
  o You must choose the Databricks Runtime (DBR) version "10.4 (includes Apache Spark 3.2.1, Scala 2.12)". We will grade your work using this version.
  o Note that you do not need to install Scala or Spark on your local machine. They are provided with the DBR environment.
• You must use only Scala DataFrame operations for this question. Scala DataFrames are just another name for a Spark DataSet of rows. You can use the DataSet API in Spark to work on these DataFrames. Here is a Spark document that will help you get started on working with DataFrames in Spark. You will lose points if you use SQL queries, Python, or R to manipulate a DataFrame.
  o After selecting the default language as SCALA, do not use the language magic % with other languages like %r, %python, %sql, etc. The language magics are used to override the default language, which you must not do for this assignment.
  o You must not use full SQL queries in lieu of the Spark DataFrame API. That is, you must not use functions like sql(), which allow you to directly write full SQL queries such as spark.sql("SELECT * FROM col1 WHERE …"). This should be df.select("*") instead.
• The template Scala notebook q2.dbc (in hw3-skeleton) provides you with code that reads the data file nyc-tripdata.csv. The input data is loaded into a DataFrame, inferring the schema using reflection (refer to the Databricks Setup Guide above). It also contains code that filters the data to only keep the rows where the pickup location is different from the drop-off location, and the trip distance is strictly greater than 2.0 (>2.0).
  o All tasks listed below must be performed on this filtered DataFrame, or you will end up with wrong answers.
  o Carefully read the instructions in the notebook, which provide hints for solving the problems.
• Some tasks in this question have specified data types for the results that are of lower precision (e.g., float). For these tasks, we will accept relevant higher precision formats (e.g., double). Similarly, we will accept results stored in data types that offer "greater range" (e.g., long, bigint) than what we have specified (e.g., int).
• Remove all your additional debugging code that renders output, as it will crash Gradescope. For instance, any additional print, display and show statements used for debugging must be removed.
• Hint: You may find some of the following DataFrame operations helpful: toDF, join, select, groupBy, orderBy, filter, agg, window(), partitionBy, orderBy, etc.

Tasks and point breakdown
1. List the top 5 most popular locations for:
   a. [2 pts] dropoff, based on "DOLocationID", sorted in descending order by popularity. If there is a tie, the one with the lower "DOLocationID" gets listed first.
   b. [2 pts] pickup, based on "PULocationID", sorted in descending order by popularity. If there is a tie, the one with the lower "PULocationID" gets listed first.
2. [4 pts] List the top 3 LocationIDs with the maximum overall activity.
Here, overall activity at a LocationID is simply the sum of all pick-ups and all drop-offs at that LocationID. In case of a tie, the lower LocationID gets listed first. Note: If a taxi picked up 3 passengers at once, we count it as 1 pickup and not 3 pickups.
3. [4 pts] List all the boroughs (of NYC: Manhattan, Brooklyn, Queens, Staten Island, Bronx, along with "Unknown" and "EWR") and their total number of activities, in descending order of total number of activities. Here, the total number of activities for a borough (e.g., Queens) is the sum of the overall activities (as defined in part 2) of all the LocationIDs that fall in that borough (Queens). An example output format is shown below.
4. [5 pts] List the top 2 days of the week with the largest number of daily average pick-ups, along with the average number of pick-ups on each of the 2 days, in descending order (no rounding off required). Here, the average pickup is calculated by taking an average of the number of pick-ups on different dates falling on the same day of the week. For example, 02/01/2021, 02/08/2021 and 02/15/2021 are all Mondays, so the average pick-ups for these is the sum of the pick-ups on each date divided by 3. An example output is shown below. Note: The day of week is a string of the day's full spelling, e.g., "Monday" instead of the number 1 or "Mon". Also, the pickup_datetime is in the format yyyy-mm-dd.
5. [6 pts] For each hour of a day (0 to 23, 0 being midnight), in the order from 0 to 23 (inclusive), find the zone in the Brooklyn borough with the largest number of total pick-ups. Note: All dates for each hour should be included.
6. [7 pts] Find the 3 different days in the month of January, in Manhattan, that saw the largest positive percentage increase in pick-ups compared to the previous day, in the order from largest percentage increase to smallest percentage increase. An example output is shown below. Note: All years need to be aggregated to calculate the pick-ups for a specific day of January. The change from Dec 31 to Jan 1 can be excluded.
List the results of the above tasks in the provided q2_results.csv file under the relevant sections. These preformatted sections also show you the required output format from your Scala code with the necessary columns — while column names can be different, their resulting values must be correct.
• You must manually enter the output generated into the corresponding sections of the q2_results.csv file, preferably using spreadsheet software such as MS Excel (but make sure to keep the csv format). For generating the output in the Scala notebook, refer to the show() and display() functions of Scala.
• Note that you can edit this csv file using a text editor, but please be mindful about putting the results under the designated columns.
• If you encounter a "UnicodeDecodeError", please save the file as ".csv UTF-8" to resolve it.
Note: Do NOT modify anything other than filling in the required output values in this csv file. We grade by running the Spark Scala code you write and by looking at your results listed in this file. So, make sure your output is obtained from the Spark Scala code you write. Failure to include the dbc and scala files will result in a deduction from your overall score.

Q3 [35 points] Analyzing a Large Amount of Data with PySpark on AWS

You will try out PySpark for processing data on Amazon Web Services (AWS). Here you can learn more about PySpark and how it can be used for data analysis.
You will be completing a task that may be accomplished using a commodity computer (e.g., consumer-grade laptops or desktops). However, we would like you to use this exercise as an opportunity to learn distributed computing on AWS and to gain experience that will help you tackle more complex problems. The services you will primarily be using are Amazon S3 storage and Amazon Athena. You will be creating an S3 bucket, running code using Athena and its serverless PySpark engine, and then storing the output into that S3 bucket. Amazon Athena is serverless, meaning that you pay only for what you use; there are no servers to maintain that accrue costs whether they are being used or not. For this question, you will only use up a very small fraction of your AWS credit. If you have any issues with the AWS Academy account, please post in the dedicated AWS Setup Ed Discussion thread.
In this question, you will use a dataset of trip records provided by the New York City Taxi and Limousine Commission (TLC). You will be accessing the dataset directly through AWS via the code outlined in the homework skeleton. Specifically, you will be working with two samples of this dataset, one small and one much larger. Optionally, if you would like to learn more about the dataset, check out here and here; also optionally, you may explore the structure of the data by referring to [1] [2].
You are provided with a Python notebook (q3.ipynb) file which you will complete and load into EMR. You are provided with the load_data() function, which loads two PySpark DataFrames. The first DataFrame, trips, contains trip data, where each record refers to one (1) trip. The second DataFrame, lookup, maps a LocationID to its trip information. It can be linked to either the PULocationID or DOLocationID fields in the trips DataFrame.

Technology
PySpark, AWS

Deliverables [Gradescope]
• q3.ipynb: PySpark notebook for this question (for the larger dataset).
• q3_output_large.csv: output file (comma-separated) for the larger dataset.

IMPORTANT NOTES
• Use Firefox, Safari or Chrome when configuring anything related to AWS.
• EXTREMELY IMPORTANT: Both of the datasets are in the US East (N. Virginia) region. Using machines in other regions for computation will incur data transfer charges. Hence, set your region to US East (N. Virginia) in the beginning (not Oregon, which is the default). This is extremely important; otherwise your code may not work, and you may be charged extra.
• Strictly follow the guidelines below, or your answer may not be graded.
  a. Ensure that the parameters for each function remain as defined and that the output order and names of the fields in the PySpark DataFrames are maintained.
  b. Do not import any functions which were not already imported within the skeleton.
  c. You must NOT round any numeric values. Rounding numbers can introduce inaccuracies. Our grader will be checking the first 8 decimal places of each value in the DataFrame.
  d. You will not have access to the Spark object directly in the autograder. If you use it in your functions, the autograder will fail! You can use the Spark Context from the DataFrame.
  e. Double-check that you are submitting the correct files and that the filenames follow the correct naming standard — we only want the script and output from the larger dataset. Also, double-check that you are writing the right dataset's output to the right file.
  f. You are welcome to store your script's output in any bucket you choose, as long as you can download and submit the correct files.
  g. Do not make any manual changes to the output files.
  h. Please ensure that you do not remove #export from the HW skeleton.
  i. Do not import any additional packages, INCLUDING pyspark.sql.functions, as this may cause the autograder to work incorrectly. Everything you need should be imported for you.
  j. Using .rdd() can cause issues in the Gradescope environment. You can accomplish this assignment without it. In general, since the RDD API is outdated (though not deprecated), you should be wary of using this API.
  k. Remove all your additional debugging code that renders output, as it will crash Gradescope. For instance, any additional print, display and show statements used for debugging must be removed.
  l. Regular PySpark DataFrame operations and PySpark SQL operations can be used. To use PySpark SQL operations, you must use the SQL Context on the Spark DataFrame. Example:
     • df.createOrReplaceTempView("some_table")
     • df.sql_ctx.sql("SELECT * FROM some_table")
Hints:
  a. Refer to DataFrame commands such as filter, join, groupBy, agg, limit, sort, withColumnRenamed and withColumn. Documentation for the DataFrame APIs is located here.
  b. Testing on a single, small dataset (i.e., a "test case") is helpful, but it is not sufficient for discovering all potential issues, especially issues that only become apparent when the code is run on larger datasets. It is important for you to develop more ways to review and verify your code logic.
  c. Overwriting the DataFrames from the function parameters can cause unintended side effects when it comes to rounding. Be sure to preserve the DataFrames in each function.
  d. Precision in data analytics is very important. Keep in mind that precision reduction in an earlier step can accumulate and be magnified, subsequently significantly affecting the final output's precision (e.g., for a dataset with 1,000,000 data points, a 0.0001 difference for each data point can lead to a total difference of 100 over the whole dataset). This is called precision loss. Check out this post for hints on how to avoid precision loss.
  e. Check if you are reducing the precision (or "scale") too aggressively. Can you relax the restriction during intermediate steps?
  f. Make sure you return a DataFrame. If you get NoneType errors, you are most likely not returning what you think you are.
  g. Some columns may need to be cast to the right data type. Keep that in mind!

Tasks and point breakdown
Your objective is to locate profitable pick-up locations in Manhattan by analyzing taxi trip data (only trips 2 miles or longer). Follow the steps below to identify top pick-up locations based on a "weighted profit" calculation (illustrative sketches of these functions follow the task list):
1. [0 pts] Setting up the AWS environment.
   a. Go through all the steps in the AWS Setup Guide (you should have already completed Step 1 to create your account) to set up your AWS environment, e.g., creating an S3 storage bucket and uploading the skeleton file.
2. [1 pt] user()
   a. Returns your GT username as a string (e.g., gburdell3).
3. [2 pts] long_trips(trips)
   a. This function filters trips to keep only trips 2 miles or longer (i.e., >= 2).
   b. Returns a PySpark DataFrame with the same schema as trips.
   c. Note: Parts 4, 5 and 6 will use the result of this function.
4. [6 pts] manhattan_trips(trips, lookup)
   a. This function determines the top 20 locations with a DOLocationID in Manhattan by sum of passenger count.
   b. Returns a PySpark DataFrame (mtrips) with the schema (DOLocationID, pcount).
   c. Note: If you encounter the error "Can only compare identically labeled DataFrame objects," it is likely due to the use of the RDD API. We recommend avoiding the RDD API since it is not compatible with the autograder. Instead, we suggest rewriting the logic using a join clause.
5. [6 pts] weighted_profit(trips, mtrips)
   a. This function determines:
      i. the average total_amount,
      ii. the total count of trips, and
      iii. the total count of trips ending in the top 20 destinations.
   b. Using the above values:
      i. determine the proportion of trips that end in one of the popular drop-off locations (# trips that end in a popular drop-off location divided by total # of trips), and
      ii. multiply that proportion by the average total_amount to get a weighted_profit value based on the probability of passengers going to one of the popular destinations.
      iii. Return the weighted_profit.
   c. Returns a PySpark DataFrame with the schema (PULocationID, weighted_profit).
6. [5 pts] final_output(wp, lookup)
   a. This function:
      i. takes the results of weighted_profit,
      ii. links it to the borough and zone through the lookup DataFrame, and
      iii. returns the top 20 locations with the highest weighted_profit.
   b. Returns a PySpark DataFrame with the schema (Zone, Borough, weighted_profit).
   c. Note: If you encounter issues with "3.5 Test Final Output," primarily due to the DataFrame returned from final_output() containing incorrect data, it is essential to reformat column data types, particularly when applying agg() operations in previous sections.
Once you have implemented all these functions, run the main() function, which is already implemented, and update the line of code to include the name of your output S3 bucket and a location. This function will fail if the output directory already exists, so make sure to change it each time you run the function.
Example: final.write.csv('s3://cse6242-gburdell3/output-large3')
Your output file will appear in a folder in your S3 bucket as a csv file with a name similar to part-0000-4d992f7a-0ad3-48f8-8c72-0022984e4b50-c000.csv. Download this file and rename it to q3_output_large.csv for submission. Do NOT make any other changes to the file.
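The following is a minimal sketch of how the Q3 functions above could be structured. It is an illustration under assumptions, not the official solution: the exact column names in the lookup table (e.g., Borough, LocationID, Zone) and the column names produced by the dict form of agg() (e.g., "count(1)", "sum(passenger_count)") should be verified against the skeleton and your intermediate output. Only plain DataFrame methods are used, since the skeleton forbids additional imports.

    def long_trips(trips):
        # Keep only trips of 2 miles or longer; the schema is unchanged.
        return trips.filter(trips["trip_distance"] >= 2)

    def manhattan_trips(trips, lookup):
        # Join on the drop-off location, keep Manhattan zones, and take the 20
        # drop-off locations with the largest summed passenger_count.
        manhattan = lookup.filter(lookup["Borough"] == "Manhattan")
        joined = trips.join(
            manhattan, trips["DOLocationID"] == manhattan["LocationID"], "inner"
        )
        return (
            joined.groupBy(trips["DOLocationID"])
                  .agg({"passenger_count": "sum"})
                  .withColumnRenamed("sum(passenger_count)", "pcount")
                  .orderBy("pcount", ascending=False)
                  .limit(20)
        )

    def weighted_profit(trips, mtrips):
        # Per pick-up location: average total_amount, total trip count, and the
        # count of trips ending in one of the top-20 drop-offs (mtrips).
        totals = (
            trips.groupBy("PULocationID")
                 .agg({"total_amount": "avg", "*": "count"})
                 .withColumnRenamed("avg(total_amount)", "avg_total_amount")
                 .withColumnRenamed("count(1)", "trip_count")
        )
        top_counts = (
            trips.join(mtrips, on="DOLocationID", how="inner")
                 .groupBy("PULocationID")
                 .agg({"*": "count"})
                 .withColumnRenamed("count(1)", "top_trip_count")
        )
        combined = totals.join(top_counts, on="PULocationID", how="inner")
        return combined.select(
            "PULocationID",
            (combined["top_trip_count"] / combined["trip_count"]
             * combined["avg_total_amount"]).alias("weighted_profit"),
        )

    def final_output(wp, lookup):
        # Attach Zone and Borough to each pick-up location and keep the 20
        # highest weighted_profit rows.
        joined = wp.join(lookup, wp["PULocationID"] == lookup["LocationID"], "inner")
        return (
            joined.select("Zone", "Borough", "weighted_profit")
                  .orderBy("weighted_profit", ascending=False)
                  .limit(20)
        )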
Q4 [10 points] Analyzing a Large Dataset using Spark on GCP

The goal of this question is to familiarize you with creating storage buckets/clusters and running Spark programs on Google Cloud Platform. This question asks you to create a new Google Storage Bucket and load the NYC Taxi & Limousine Commission Dataset. You are also provided with a Jupyter Notebook q4.ipynb file, which you will load and complete in a Google Dataproc Cluster. Inside the notebook, you are provided with the skeleton for the load_data() function, which you will complete to load a PySpark DataFrame from the Google Storage Bucket you created as part of this question. Using this PySpark DataFrame, you will complete the following tasks using Spark DataFrame functions.
You will use the data file yellow_tripdata09-08-2021.csv. The preceding link allows you to download the dataset you are required to work with for this question from the course DropBox. Each line represents a single taxi trip consisting of the comma-separated columns bulleted below. All columns are of string data type. You must convert the highlighted columns below into decimal data type (do NOT use the float datatype) inside their respective functions when completing this question. Do not convert any datatypes within the load_data function. While casting to a decimal datatype, use a precision of 38 and a scale of 10.
• vendorid
• tpep_pickup_datetime
• tpep_dropoff_datetime
• passenger_count
• trip_distance (decimal data type)
• ratecodeid
• store_and_fwd_flag
• pulocationid
• dolocationid
• payment_type
• fare_amount (decimal data type)
• extra
• mta_tax
• tip_amount (decimal data type)
• tolls_amount (decimal data type)
• improvement_surcharge
• total_amount

Technology
Spark, Google Cloud Platform (GCP)

Deliverables [Gradescope]
q4.ipynb: the PySpark notebook for this question.

IMPORTANT NOTES:
• Use Firefox, Safari or Chrome when configuring anything related to GCP.
• Strictly follow the guidelines below, or your answer may not be graded.
  o Regular PySpark DataFrame operations can be used.
  o Do NOT use any functions from the RDD API or your code will break the autograder. In general, the RDD API is considered outdated, so you should use the DataFrame API for better performance and compatibility.
  o Make sure to download the notebook from your GCP cluster before deleting the GCP cluster (otherwise, you will lose your work).
  o Do not add new cells to the notebook, as this may break the auto-grader.
  o Remove all your additional debugging code that renders output, as it will crash Gradescope. For instance, any additional print, display and show statements used for debugging must be removed.
  o Do not use any .rdd function in your code. Not only will this break the autograder, but you should also be wary of using this function in general.
  o Ensure that you are only submitting a COMPLETE solution to Gradescope. Anything less will break the autograder. Write local unit tests to help test your code.

Tasks and point breakdown
1. [0 pts] Set up your GCP environment.
   a. Instructions to set up GCP Credits, GCP Storage and a Dataproc Cluster are provided here: written instructions.
   b. Helpful tips/FAQs for special scenarios:
      i. If GCP service is disabled for your Google account, try the steps in this Google support link.
      ii. If you have any issues with the GCP free credits, please post in the dedicated GCP Setup Ed Discussion thread.
2. [0 pts — required] Function load_data() to load data from a Google Storage Bucket into a Spark DataFrame.
   a. You must first perform this task (part 2) BEFORE performing parts 3, 4, 5, 6 and 7. No points are allocated to task 2, but it is essential that you correctly implement the load_data() function, as the remaining graded tasks depend upon this task and its correct implementation. Upload code to Gradescope ONLY after completing all tasks and removing/commenting out all the testing code. Anything else will break the autograder.
3. [2 pts] Function exclude_no_pickup_locations() to exclude trips with no pick-up locations (pick-up location id column is null or is zero) in the original data from part 2a (see the sketch after this list).
4. [2 pts] Function exclude_no_trip_distance() to exclude trips with no distance (i.e., trip distance column is null or zero) in the dataframe output by exclude_no_pickup_locations().
5. [2 pts] Function include_fare_range() to include trips with fare from $20 (inclusive) to $60 (inclusive) in the dataframe output by exclude_no_trip_distance().
6. [2 pts] Function get_highest_tip() to identify the highest tip (rounded to 2 decimal places) in the dataframe output by include_fare_range().
7. [2 pts] Function get_total_toll() to calculate the total toll amount (rounded to 2 decimal places) in the dataframe output by include_fare_range().
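Below is a minimal sketch of a few of the Q4 functions, under stated assumptions: column names follow the lower-case list above, the decimal(38,10) cast is done with the plain Column.cast() method, and get_highest_tip() is assumed to return a plain Python number (adjust if the skeleton expects a DataFrame or a different return shape).

    def exclude_no_pickup_locations(df):
        # All columns arrive as strings; cast the pick-up location id to an
        # integer so that null, empty, and "0" values can be filtered out.
        pu = df["pulocationid"].cast("int")
        return df.filter(pu.isNotNull() & (pu != 0))

    def include_fare_range(df):
        # Keep fares between $20 and $60, both inclusive, comparing on a
        # decimal(38,10) cast as required by the assignment.
        fare = df["fare_amount"].cast("decimal(38,10)")
        return df.filter((fare >= 20) & (fare <= 60))

    def get_highest_tip(df):
        # Cast tip_amount to decimal(38,10), take the maximum, and round only
        # the final value to 2 decimal places.
        tips = df.withColumn("tip_amount", df["tip_amount"].cast("decimal(38,10)"))
        highest = tips.agg({"tip_amount": "max"}).collect()[0][0]
        return round(highest, 2)

exclude_no_trip_distance() and get_total_toll() follow the same two patterns, respectively: a cast-then-filter on trip_distance and a cast-then-aggregate (sum) on tolls_amount.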
Q5 [10 points] Regression: Automobile price prediction using Azure Machine Learning

The primary purpose of this question is to introduce you to Microsoft Machine Learning Studio by familiarizing you with its basic functionality and machine learning workflows. Go through the Automobile Price Prediction tutorial and create/run ML experiments to complete the following tasks. You will not incur any cost if you save your experiments on Azure until submission. Once you are sure about the results and have reported them, feel free to delete your experiments. You will manually modify the given file q5.csv using a plain text editor, adding the results from the following tasks.

Technology
Azure Machine Learning

Deliverables [Gradescope]
q5.csv: a csv file containing results for all parts

IMPORTANT NOTES:
• Strictly follow the guidelines below, or your answer may not be graded.
  o DO NOT change the order of the questions.
  o Report the exact numerical values that you get in your output. DO NOT round any of them.
  o When manually entering a value into the csv file, append it immediately after a comma, so there will be NO space between the comma and your value, and no trailing spaces or commas after your value.
  o Follow the tutorial and do not change the values for L2 regularization. For tasks 3 and 4, select the columns given in the tutorial.

Tasks and point breakdown
1. [0 pts] Create and use a free workspace instance on Azure Machine Learning. Use your Georgia Tech username (e.g., jdoe3) to log in.
2. [0 pts] Update q5.csv by replacing gburdell3 with your GT username.
3. [3 pts] Repeat the experiment described in the tutorial and report the values of all metrics mentioned in the Evaluate Model section of the tutorial. Make sure the Split Data module looks as it does below:
4. [3 pts] Repeat the experiment mentioned in task 3 with a different value of "Fraction of rows in the first output dataset" in the Split Data module. Change the value to 0.8 from the originally set value of 0.7. Report the corresponding values of the metrics.
5. [4 pts] After fully completing tasks 3 and 4, run a new experiment — evaluate the model using 5-fold cross-validation (CV).
   a. Select parameters in the Partition and Sample component in accordance with the figure below.
   b. For the Cross Validate Model component, set the column name as "price" for CV and use 0 as the random seed.
   c. Report the values of Root Mean Squared Error (RMSE) and Coefficient of Determination for each of the five folds (the 1st fold corresponds to fold number 0, and so on). Do NOT round the results. Report exact values.
   d. HINT: To see the results, right-click Cross Validate Model and select Preview data → Evaluation results by fold. Make sure to utilize the same data cleaning/processing steps as you did before.
Figure: Property Tab of Partition and Sample Module


[SOLVED] CSE 6242 / CX 4242: Data and Visual Analytics HW 2: Tableau, D3 Graphs, and Visualization

CSE 6242 / CX 4242: Data and Visual Analytics HW 2: Tableau, D3 Graphs, and Visualization“Visualization gives you answers to questions you didn’t know you have” – Ben Schneiderman Download the HW2 Skeleton before you beginData visualization is an integral part of exploratory analysis and communicating key insights. This homework focuses on exploring and creating data visualizations using two of the most popular tools in the field; Tableau and D3.js. All 5 questions use data on the same topic to highlight the uses and strengths of different types of visualizations. The data comes from BoardGameGeek and includes games’ ratings, popularity, and metadata. Below are some terms you will often see in the questions: • Rating – a value from 0 to 10 given to each game. BoardGameGeek calculates a game’s overall rating in different ways including Average and Bayes, so make sure you are using the correct rating called for in a question. A higher rating is better than a lower rating. • Rank – the overall rank of a boardgame from 1 to n, with ranks closer to 1 being better and n being the total number of games. The rank may be for all games or for a subgroup of games such as abstract games or family games. The maximum possible score for this homework is 100 points. Students have the option to complete any 90 points’ worth of work to receive 100% (equivalent to 15 course total grade points) for this assignment. They can earn more than 100% if they submit additional work. For example, a student scoring 100 points will receive 111% for the assignment (equivalent to 16.67 course total grade points, as shown on Canvas). Download the HW2 Skeleton before you begin ……………………………………………………………………………….. 1Homework Overview …………………………………………………………………………………………………………………… 1 Important Notes………………………………………………………………………………………………………………………….. 2 Submission Notes ………………………………………………………………………………………………………………………. 2 Do I need to use the specific version of the software listed?………………………………………………………………. 2 Q1 [25 points] Designing a good table. Visualizing data with Tableau. ………………………………………………… 3 Important Points about Developing with D3 in Questions 2–5 ……………………………………………………………. 7 Q2 [15 points] Force-directed graph layout……………………………………………………………………………………… 8 Q3 [15 points] Line Charts……………………………………………………………………………………………………………10 Q4 [20 points] Interactive Visualization…………………………………………………………………………………………..14 Q5 [25 points] Choropleth Map of Board Game Ratings……………………………………………………………………18 2 Version 0Important Notes A. Submit your work by the due date on the course schedule. a. Every assignment has a generous 48-hour grace period, allowing students to address unexpected minor issues without facing penalties. You may use it without asking. b. Before the grace period expires, you may resubmit as many times as needed. c. TA assistance is not guaranteed during the grace period. d. Submissions during the grace period will display as “late” but will not incur a penalty. e. We will not accept any submissions executed after the grace period ends. B. Always use the most up-to-date assignment (version number at the bottom right of this document). The latest version will be listed in Ed Discussion. C. You may discuss ideas with other students at the “whiteboard” level (e.g., how cross-validation works, use HashMap instead of array) and review any relevant materials online. However, each student must write up and submit the student’s own answers. D. 
All incidents of suspected dishonesty, plagiarism, or violations of the Georgia Tech Honor Code will be subject to the institute’s Academic Integrity procedures, directly handled by the Office of Student Integrity (OSI). Consequences can be severe, e.g., academic probation or dismissal, a 0 grade for assignments concerned, and prohibition from withdrawing from the class. Submission Notes A. All questions are graded on the Gradescope platform, accessible through Canvas. a. Question 1 will be manually graded after the final HW due date and Grace Period. b. Questions 2-5 are auto graded at the time of submission. B. We will not accept submissions anywhere else outside of Gradescope. C. Submit all required files as specified in each question. Make sure they are named correctly. D. You may upload your code periodically to Gradescope to obtain feedback on your code. There are no hidden test cases. The score you see on Gradescope is what you will receive. E. You must not use Gradescope as the primary way to test your code. It provides only a few test cases and error messages may not be as informative as local debuggers. Iteratively develop and test your code locally, write more test cases, and follow good coding practices. Use Gradescope mainly as a “final” check. F. Gradescope cannot run code that contains syntax errors. If you get the “The autograder failed to execute correctly” error, verify: a. The code is free of syntax errors (by running locally) b. All methods have been implemented c. The correct file was submitted with the correct name d. No extra packages or files were imported G. When many students use Gradescope simultaneously, it may slow down or fail. It can become even slower as the deadline approaches. You are responsible for submitting your work on time. H. Each submission and its score will be recorded and saved by Gradescope. By default, your last submission is used for grading. To use a different submission, you MUST “activate” it (click the “Submission History” button at the bottom toolbar, then “Activate”). Do I need to use the specific version of the software listed? Under each question, you will see a set of technologies with specific versions – this is what is installed on the autograder and what it will run your code with. Thus, installing those specific versions on your computer to complete the question is highly recommended. You may be able to complete the question with different versions installed locally, but you are responsible for determining the compatibility of your code. We will not award points for code that works locally but not on the autograder. 3 Version 0 Q1 [25 points] Designing a good table. Visualizing data with Tableau. Goal Design a table, a grouped bar chart, and a stacked bar chart with filters in Tableau. Technology Tableau Desktop Deliverables Gradescope: After selecting HW2 – Q1, click Submit Images. You will be taken to a list of questions for your assignment. Click Select Images and submit the following four PNG images under the corresponding questions: ● table.png: Image/screenshot of the table in Q1.1 ● grouped_barchart.png: Image of the chart in Q1.2 ● stacked_barchart_1.png: Image of the chart in Q1.3 after filtering data for Max.Players = 2 ● stacked_barchart_2.png: Image of the chart in Q1.3 after filtering data for Max.Players = 4 a Q1 will be manually graded after the grace period. Setting Up Tableau Install and activate Tableau Desktop by following “HW2 Instructions” on Canvas. 
The product activation key is for your use in this course only. Do not share the key with anyone. If you already have Tableau Desktop installed on your machine, you may use this key to reactivate it. a If you do not have access to a Mac or Windows machine, use the 14-day trial version of Tableau Online: 1. Visit https://www.tableau.com/trial/tableau-online 2. Enter your information (name, email, GT details, etc.) 3. You will then receive an email to access your Tableau Online site 4. Go to your site and create a workbook a If neither of the above methods work, use Tableau for Students. Follow the link and select “Get Tableau For Free”. You should be able to receive an activation key which offers you a one-year use of Tableau Desktop at no cost by providing a valid Georgia Tech email. Connecting to Data 1. It is optional to use Tableau for Q1.1. Otherwise, complete all parts using a single Tableau workbook. 2. Q1 will require connecting Tableau to two different data sources. You can connect to multiple data sources within one workbook by following the directions here. 3. For Q1.1 and Q1.2: a. Open Tableau and connect to a data source. Choose To a File – Text file. Select the popular_board_game.csv file from the skeleton. b. Click on the graph area at the bottom section next to “Data Source” to create worksheets. 4. For Q1.3: a. You will need a data.world account to access the data for Q1.3. Add a new data source by clicking on Data – New Data Source. b. When connecting to a data source, choose To a Server – Web Data Connector. c. Enter this URL to connect to the data.world data set on board games. You may be prompted to log in to data-world and authorize Tableau. If you haven’t used data.world before, you will be required to create an account by clicking “Join Now”. Do not edit the provided SQL query. a NOTE: If you cannot connect to data-world, you can use the provided csv files for Q1 in the skeleton. The provided csv files are identical to those hosted online and can be loaded directly into Tableau. a d. Click the graph area at the bottom section to create another worksheet, and Tableau will automatically create a data extract. 4 Version 0 Table and Chart Design 1. [5 points] Good table design. Visualize the data contained in popular_board_game.csv as a data table (known as a text table in Tableau). In this part (Q1.1), you can use any tool (e.g., Excel, HTML, Pandas, Tableau) to create the table. We are interested in grouping popular games into “support solo” (min player = 1) and “not support solo” (min player > 1). Your table should clearly communicate information about these two groups simultaneously. For each group (Solo Supported, Solo Not Supported), show: a a. Total number of games in each category (fighting, economic, …) b. In each category, the game with the highest number of ratings. If more than one game has the same (highest) number of ratings, pick the game you prefer. NOTE: Level of Detail expressions may be useful if you use Tableau. c. Average rating of games in each category (use simple average), rounded to 2 decimal places. d. Average playtime of games in each category, rounded to 2 decimal places. e. In the bottom left corner below your table, include your GT username (In Tableau, this can be done by including a caption when exporting an image of a worksheet or by adding a text box to a dashboard. If you use Tableau, refer to the tutorial here). f. Save the table as table.png. (If you use Tableau, go to Worksheet/Dashboard  Export  Image). 
NOTE: Do not take screenshots in Tableau since your image must have high resolution. You can take a screenshot If you use HTML, Pandas, etc. a Your learning goal here is to practice good table design, which is not strongly dependent on the tool that you use. Thus, we do not require that you use Tableau in this part. You may decide the most meaningful column names, the number of columns, and the column order. You are not limited to only the techniques described in the lecture. For OMS students, the lecture video on this topic is Week 4 – Fixing Common Visualization Issues – Fixing Bar Charts, Line Charts. For campus students, review lecture slides 42 and 43. 2. [10 points] Grouped bar chart. Visualize popular_board_game.csv as a grouped bar chart in Tableau. Your chart should display game category (e.g., fighting, economic,…) along the horizontal axis and game count along the vertical axis. Show game playtime (e.g.,


[SOLVED] CSE 6242 / CX 4242: Data and Visual Analytics HW 1: End-to-end analysis of TMDb data, SQLite, D3 warmup, OpenRefine,

CSE 6242 / CX 4242: Data and Visual Analytics HW 1: End-to-end analysis of TMDb data, SQLite, D3 Warmup, OpenRefine, FlaskVast amounts of digital data are generated each day, but raw data is often not immediately “usable”. Instead, we are interested in the information content of the data such as what patterns are captured? This assignment covers useful tools for acquiring, cleaning, storing, and visualizing datasets. In questions 1 & 2, we’ll perform a simple end-to-end analysis using data from The Movie Database (TMDb).We will collect movie data via API, store the data in csv files, and analyze data using SQL queries. For Q3, we will complete a D3 warmup to prepare our students for visualization questions in HW2. Q4 & 5 will provide an opportunity to explore other industry tools used to acquire, store, and clean datasets. The maximum possible score for this homework is 100 points. Contents Download the HW1 Skeleton before you begin. ………………………………………………………………………… 1Homework Overview…………………………………………………………………………………………………………………. 1 Important Notes ………………………………………………………………………………………………………………………. 2 Submission Notes…………………………………………………………………………………………………………………….. 2 Do I need to use the specific version of the software listed?……………………………………………………………. 2 Q1 [40 points] Collect data from TMDb to build a co-actor network…………………………………………………… 3 Q2 [35 points] SQLite ……………………………………………………………………………………………………………….. 4 Q3 [15 points] D3 Warmup – Visualizing Wildlife Trafficking by Species…………………………………………….. 7 Q4 [5 points] OpenRefine …………………………………………………………………………………………………………10 Q5 [5 points] Introduction to Python Flask …………………………………………………………………………………..12 2 Version 1Important Notes A. Submit your work by the due date on the course schedule. a. Every assignment has a generous 48-hour grace period, allowing students to address unexpected minor issues without facing penalties. You may use it without asking. b. Before the grace period expires, you may resubmit as many times as you need. c. TA assistance is not guaranteed during the grace period. d. Submissions during the grace period will display as “late” but will not incur a penalty. e. We will not accept any submissions executed after the grace period ends. B. Always use the most up-to-date assignment (version number at the bottom right of this document). The latest version will be listed in Ed Discussion. C. You may discuss ideas with other students at the “whiteboard” level (e.g., how cross-validation works, use HashMap instead of an array) and review any relevant materials online. However, each student must write up and submit the student’s own answers. D. All incidents of suspected dishonesty, plagiarism, or violations of the Georgia Tech Honor Code will be subject to the institute’s Academic Integrity procedures, directly handled by the Office of Student Integrity (OSI). Consequences can be severe, e.g., academic probation or dismissal, a 0 grade for assignments concerned, and prohibition from withdrawing from the class. Submission Notes A. All questions are graded on the Gradescope platform, accessible through Canvas. B. We will not accept submissions anywhere else outside of Gradescope. C. Submit all required files as specified in each question. Make sure they are named correctly. D. You may upload your code periodically to Gradescope to obtain feedback on your code. There are no hidden test cases. The score you see on Gradescope is what you will receive. E. You must not use Gradescope as the primary way to test your code. 
It provides only a few test cases and error messages may not be as informative as local debuggers. Iteratively develop and test your code locally, write more test cases, and follow good coding practices. Use Gradescope mainly as a “final” check. F. Gradescope cannot run code that contains syntax errors. If you get the “The autograder failed to execute correctly” error, verify: a. The code is free of syntax errors (by running locally) b. All methods have been implemented c. The correct file was submitted with the correct name d. No extra packages or files were imported G. When many students use Gradescope simultaneously, it may slow down or fail. It can become even slower as the deadline approaches. You are responsible for submitting your work on time. H. Each submission and its score will be recorded and saved by Gradescope. By default, your last submission is used for grading. To use a different submission, you MUST “activate” it (click the “Submission History” button at the bottom toolbar, then “Activate”). Do I need to use the specific version of the software listed? Under each question, you will see a set of technologies with specific versions – this is what is installed on the autograder and what it will run your code with. Thus, installing those specific versions on your computer to complete the question is highly recommended. You may be able to complete the question with different versions installed locally, but you are responsible for determining the compatibility of your code. We will not award points for code that works locally but not on the autograder. 3 Version 1 Q1 [40 points] Collect data from TMDb to build a co-actor network Leveraging the power of APIs for data acquisition, you will build a co-actor network of highly rated movies using information from The Movie Database (TMDb). Through data collection and analysis, you will create a graph showing the relationships between actors based on their highly rated movies. This will not only highlight the practical application of APIs in collecting rich datasets, but also introduce the importance of graphs in understanding and visualizing the real-world dataset. Technology • Python 3.10.x • TMDb API version 3 Allowed Libraries The Python Standard Library and Requests only. Max runtime 10 minutes. Submissions exceeding this will receive zero credit. Deliverables • Q1.py: The completed Python file • nodes.csv: The csv file containing nodes • edges.csv: The csv file containing edges Follow the instructions found in Q1.py to complete the Graph class, the TMDbAPIUtils class, and the one global function. The Graph class will serve as a re-usable way to represent and write out your collected graph data. The TMDbAPIUtils class will be used to work with the TMDb API for data retrieval. Tasks and point breakdown 1. [10 pts] Implementation of the Graph class according to the instructions in Q1.py. a. The graph is undirected, thus {a, b} and {b, a} refer to the same undirected edge in the graph; keep only either {a, b} or {b, a} in the Graph object. A node’s degree is the number of (undirected) edges incident on it. In/ out-degrees are not defined for undirected graphs. 2. [10 pts] Implementation of the TMDbAPIUtils class according to instructions in Q1.py. Use version 3 of the TMDb API to download data about actors and their co-actors. To use the API: a. Create a TMDb account and follow the instructions on this document to obtain an API key. b. Be sure to use the key, not the token. This is the shorter of the two. c. 
Refer to the TMDB API Documentation as you work on this question. 3. [20 pts] Producing correct nodes.csv and edges.csv. a. If an actor’s name has comma characters (“,”), remove those characters before writing that name into the CSV files. 4 Version 1 Q2 [35 points] SQLite SQLite is a lightweight, serverless, embedded database that can easily handle multiple gigabytes of data. It is one of the world’s most popular embedded database systems. It is convenient to share data stored in an SQLite database — just one cross-platform file that does not need to be parsed explicitly (unlike CSV files, which must be parsed). You can find instructions to install SQLite here. In this question, you will construct a TMDb database in SQLite, partition it, and combine information within tables to answer questions. You will modify the given Q2.py file by adding SQL statements to it. We suggest testing your SQL locally on your computer using interactive tools to speed up testing and debugging, such as DB Browser for SQLite. Technology • SQLite release 3.37.2 • Python 3.10.x Allowed Libraries Do not modify import statements. Everything you need to complete this question has been imported for you. Do not use other libraries for this question. Max runtime 10 minutes. Submissions exceeding this will receive zero credit. Deliverables • Q2.py: Modified file containing all the SQL statements you have used to answer parts a – h in the proper sequence. IMPORTANT NOTES: • If the final output asks for a decimal column, format it to two places using printf(). Do NOT use the ROUND() function, as in rare cases, it works differently on different platforms. If you need to sort that column, be sure you sort it using the actual decimal value and not the string returned by printf. • A sample class has been provided to show example SQL statements; you can turn off this output by changing the global variable SHOW from True to False. • In this question, you must only use INNER JOIN when performing a join between two tables, except for part g. Other types of joins may result in incorrect results. Tasks and point breakdown 1. [9 points] Create tables and import data. a. [2 points] Create two tables (via two separate methods, part_ai_1 and part_ai_2, in Q2.py) named movies and movie_cast with columns having the indicated data types: i. movies 1. id (integer) 2. title (text) 3. score (real) ii. movie_cast 1. movie_id (integer) 2. cast_id (integer) 3. cast_name (text) 4. birthday (text) 5. popularity (real) b. [2 points] Import the provided movies.csv file into the movies table and movie_cast.csv into the movie_cast table i. Write Python code that imports the .csv files into the individual tables. This will include looping though the file and using the ‘INSERT INTO’ SQL command. Make sure you use paths relative to the Q2 directory. c. [5 points] Vertical Database Partitioning. Database partitioning is an important technique that divides large tables into smaller tables, which may help speed up queries. Create a new table cast_bio from the movie_cast table. Be sure that the values are unique when inserting into the new cast_bio table. Read this page for an example of vertical database partitioning. 5 Version 1 i. cast_bio 1. cast_id (integer) 2. cast_name (text) 3. birthday (text) 4. popularity (real) 2. [1 point] Create indexes. Create the following indexes. Indexes increase data retrieval speed; though the speed improvement may be negligible for this small database, it is significant for larger databases. a. 
movie_index for the id column in movies table b. cast_index for the cast_id column in movie_cast table c. cast_bio_index for the cast_id column in cast_bio table 3. [3 points] Calculate a proportion. Find the proportion of movies with a score between 7 and 20 (both limits inclusive). The proportion should be calculated as a percentage. a. Output format and example value: 7.70 4. [4 points] Find the most prolific actors. List 5 cast members with the highest number of movie appearances that have a popularity > 10. Sort the results by the number of appearances in descending order, then by cast_name in alphabetical order. a. Output format and example row values (cast_name,appearance_count): Harrison Ford,2 5. [4 points] List the 5 highest-scoring movies. In the case of a tie, prioritize movies with fewer cast members. Sort the result by score in descending order, then by number of cast members in ascending order, then by movie name in alphabetical order. a. Output format and example values (movie_title,score,cast_count): Star Wars: Holiday Special,75.01,12 Games,58.49,33 6. [4 points] Get high scoring actors. Find the top ten cast members who have the highest average movie scores. Sort the output by average_score in descending order, then by cast_name alphabetically. a. Exclude movies with score < 25 before calculating average_score. b. Include only cast members who have appeared in three or more movies with score >= 25. i. Output format and example value (cast_id,cast_name,average_score): 8822,Julia Roberts,53.00 7. [2 points] Creating views. Create a view (virtual table) called good_collaboration that lists pairs of actors who have had a good collaboration as defined here. Each row in the view describes one pair of actors who appeared in at least 2 movies together AND the average score of these movies is >= 40. The view should have the format: good_collaboration( cast_member_id1, cast_member_id2, movie_count, average_movie_score) For symmetrical or mirror pairs, only keep the row in which cast_member_id1 has a lower numeric value. For example, for ID pairs (1, 2) and (2, 1), keep the row with IDs (1, 2). There should not be any “self-pair” where cast_member_id1 is the same as cast_member_id2. Remember that creating a view will not produce any output, so you should test your view with a few simple select statements during development. One such test has already been added to the code as part of the auto-grading. NOTE: Do not submit any code that creates a ‘TEMP’ or ‘TEMPORARY’ view that 6 Version 1 you may have used for testing. Optional Reading: Why create views? 8. [4 points] Find the best collaborators. Get the 5 cast members with the highest average scores from the good_collaboration view, and call this score the collaboration_score. This score is the average of the average_movie_score corresponding to each cast member, including actors in cast_member_id1 as well as cast_member_id2. a. Order your output by collaboration_score in descending order, then by cast_name alphabetically. b. Output format and example values(cast_id,cast_name,collaboration_score): 2,Mark Hamil,99.32 1920,Winoa Ryder,88.32 9. [4 points] SQLite supports simple but powerful Full Text Search (FTS) for fast text-based querying (FTS documentation). a. [1 point] Import movie overview data from the movie_overview.csv into a new FTS table called movie_overview with the schema: movie_overview id (integer) overview (text) NOTE: Create the table using fts3 or fts4 only. 
Also note that keywords like NEAR, AND, OR, and NOT are case-sensitive in FTS queries. NOTE: If you have issues that fts is not enabled, try the following steps • Go to sqlite3 downloads page: https://www.sqlite.org/download.html • Download the dll file for your system • Navigate to your Python packages folder, e.g., C:Users… …Anaconda3pkgssqlite-3.29.0- he774522_0Librarybin • Drop the downloaded .dll file in the bin. • In your IDE, import sqlite3 again, fts should be enabled. b. [1 point] Count the number of movies whose overview field contains the word ‘fight’. Matches are not case sensitive. Match full words, not word parts/sub-strings. i. Example: Allowed: ‘FIGHT’, ‘Fight’, ‘fight’, ‘fight.’ Disallowed: ‘gunfight’, ‘fighting’, etc. ii. Output format and example value: 12 c. [2 points] Count the number of movies that contain the terms ‘space’ and ‘program’ in the overview field with no more than 5 intervening terms in between. Matches are not case sensitive. As you did in h(i)(1), match full words, not word parts/sub-strings. i. Example: Allowed: ‘In Space there was a program’, ‘In this space program’ Disallowed: ‘In space you are not subjected to the laws of gravity. A program.’ ii. Output format and example value: 6 7 Version 1 Q3 [15 points] D3 Warmup – Visualizing Wildlife Trafficking by Species In this question, you will utilize a dataset provided by TRAFFIC, an NGO working to ensure the global trade of wildlife is both legal and sustainable. TRAFFIC provides data through their interactive Wildlife Trade Portal, some of which we have already downloaded and pre-processed for you to utilize in Q3. Using species-related data, you will build a bar chart to visualize the most frequently illegally trafficked species between 2015 and 2023. Using D3, you will get firsthand experience with how interactive plots can make data more visually appealing, engaging, and easier to parse. Read chapters 4-8 of Scott Murray’s Interactive Data Visualization for the Web, 2nd edition (sign in using your GT account, e.g., [email protected]). This reading provides an important foundation you will need for Homework 2. The question and autograder have been developed and tested for D3 version 5 (v5), while the book covers v4. What you learn from the book is transferable to v5, as v5 introduced few breaking changes. We also suggest briefly reviewing chapters 1-3 for background information on web development. TRAFFIC International (2024) Wildlife Trade Portal. Available at www.wildlifetradeportal.org. Technology • D3 Version 5 (included in the lib folder) • Chrome 97.0 (or newer): the browser for grading your code • Python HTTP server (for local testing) Allowed Libraries D3 library is provided to you in the lib folder. You must NOT use any D3 libraries (d3*.js) other than the ones provided. Deliverables • Q3.html: Modified file containing all html, javascript, and any css code required to produce the bar plot. Do not include the D3 libraries or q3.csv dataset. IMPORTANT NOTES: • Setup an HTTP server to run your D3 visualizations as discussed in the D3 lecture (OMS students: watch lecture video. Campus students: see lecture PDF.). The easiest way is to use http.server for Python 3.x. Run your local HTTP server in the hw1-skeleton/Q3 folder. • We have provided sections of skeleton code and comments to help you complete the implementation. While you do not need to remove them, you need to write additional code to make things work. 
• All d3*.js files are provided in the lib folder and referenced using relative paths in your html file. For example, since the file “Q3/Q3.html” uses d3, its header contains:. It is incorrect to use an absolute path such as:. The 3 files that are referenced are: a. lib/d3/d3.min.js b. lib/d3-dsv/d3-dsv.min.js c. lib/d3-fetch/d3-fetch.min.js • In your html / js code, use a relative path to read the dataset file. For example, since Q3 requires reading data from the q3.csv file, the path must be “q3.csv” and NOT an absolute path such as “C:/Users/polo/HW1-skeleton/Q3/q3.csv”. Absolute paths are specific locations that exist only on your computer, which means your code will NOT run on our machines when we grade, and you will lose points. As file paths are case-sensitive, ensure you correctly provide the relative path. • Load the data from q3.csv using D3 fetch methods. We recommend d3.dsv(). Handle any data conversions that might be needed, e.g., strings that need to be converted to integer. See https://github.com/d3/d3-fetch#dsv. • VERY IMPORTANT: Use the Margin Convention guide to specify chart dimensions and layout. Tasks and point breakdown Q3.html: When run in a browser, should display a horizontal bar plot with the following specifications: 8 Version 1 1. [3.5 points] The bar plot must display one bar for each of the five most trafficked species by count. Each bar’s length corresponds to the number of wildlife trafficking incidents involving that species between 2015 and 2023, represented by the ‘count’ column in our dataset. 2. [1 point] The bars must have the same fixed thickness, and there must be some space between the bars, so they do not overlap. 3. [3 points] The plot must have visible X and Y axes that scale according to the generated bars. That is, the axes are driven by the data that they are representing. They must not be hard-coded. The x-axis must be a element having the id: “x_axis” and the y-axis must be a element having the id: “y_axis”. 4. [2 points] Set x-axis label to ‘Count’ and y-axis label to ‘Species’. The x-axis label must be a element having the id: “x_axis_label” and the y-axis label must be a element having the id: “y_axis_label”. 5. [2 points] Use a linear scale for the X-axis to represent the count (recommended function: d3.scaleLinear()). Only display ticks and labels at every 500 interval. The X-axis must be displayed below the plot. 6. [2 points] Use a categorical scale for the Y-axis to represent the species names (recommended function: d3.scaleBand()). Order the species names from greatest to least on ‘Count’ and limit the output to the top 5 species. The Y-axis must be displayed to the left of the plot. 7. [1 point] Set the HTML title tag and display a title for the plot. Those two titles are independent of each other and need to be set separately. Set the HTML title tag (i.e.,). Position the title “Wildlife Trafficking Incidents per Species (2015 to 2023)” above the bar plot. The title must be a element having the id: “title”. 8. [0.25 points] Add your GT username (usually includes a mix of letters and numbers) to the area beneath the bottom-right of the plot. The GT username must be a element having the id: “credit” 9. [0.25 points] Fill each bar with a unique color. We recommend using a colorblind-safe pallete. NOTE: Gradescope will render your plot using Chrome and present you with a Dropbox link to view the screenshot of your plot as the autograder sees it. 
This visual feedback helps you adjust and identify errors, e.g., a blank plot indicates a serious error. Your design does not need to replicate the solution plot. However, the autograder requires the following DOM structure (including using correct IDs for elements) and sizing attributes to know how your chart is built. 9 Version 1 plot | width: 900 | height: 370 | +– containing Q3.a plot elements | +– containing bars | +– x-axis | | | +– (x-axis elements) | +– x-axis label | +– y-axis | | | +– (y-axis elements) | +– y-axis label | +– GTUsername | +– chart title 10 Version 1 Q4 [5 points] OpenRefine OpenRefine is a powerful tool for working with messy data, allowing users to clean and transform data efficiently. Use OpenRefine in this question to clean data from Mercari. Construct GREL queries to filter the entries in this dataset. OpenRefine is a Java application that requires Java JRE to run. However, OpenRefine v.3.6.2 comes with a compatible Java version embedded with the installer. So, there is no need to install Java separately when working with this version. Go through the main features on OpenRefine’s homepage. Then, download and install OpenRefine 3.6.2. The link to release 3.6.2 is https://github.com/OpenRefine/OpenRefine/releases/tag/3.6.2 Technology • OpenRefine 3.6.2 Deliverables • properties_clean.csv: Export the final table as a csv file. • changes.json: Submit a list of changes made to file in json format. Go to ‘Undo/Redo’ Tab → ‘Extract’ → ‘Export’. This downloads ‘history.json’ . Rename it to ‘changes.json’. • Q4Observations.txt: A text file with answers to parts b.i, b.ii, b.iii, b.iv, b.v, b.vi. Provide each answer in a new line in the output format specified. Your file’s final formatting should result in a .txt file that has each answer on a new line followed by one blank line. Tasks and point breakdown 1. Import Dataset a. Run OpenRefine and point your browser at https://127.0.0.1:3333. b. We use a products dataset from Mercari, derived from a Kaggle competition (Mercari Price Suggestion Challenge). If you are interested in the details, visit the data description page. We have sampled a subset of the dataset provided as “properties.csv”. c. Choose “Create Project” → This Computer → properties.csv. Click “Next”. d. You will now see a preview of the data. Click “Create Project” at the upper right corner. 2. [5 points] Clean/Refine the Data a. [0.5 point] Select the category_name column and choose ‘Facet by Blank’ (Facet → Customized Facets → Facet by blank) to filter out the records that have blank values in this column. Provide the number of rows that return True in Q4Observations.txt. Exclude these rows. Output format and sample values: i.rows: 500 NOTE: OpenRefine maintains a log of all changes. You can undo changes by the “Undo/Redo” button at the upper left corner. You must follow all the steps in order and submit the final cleaned data file properties_clean.csv. The changes made by this step need to be present in the final submission. If they are not done at the beginning, the final number of rows can be incorrect and raise errors by the autograder. b. [1 point] Split the column category_name into multiple columns without removing the original column. For example, a row with “Kids/Toys/Dolls & Accessories” in the category_name column would be split across the newly created columns as “Kids”, “Toys” and “Dolls & Accessories”. 
Use the existing functionality in OpenRefine that creates multiple columns from an existing column based on a separator (i.e., in this case ‘/’) and does not remove the original category_name column. Provide the number of new columns that are created by this operation, excluding the original category_name column. Output format and sample values: ii.columns: 10 11 Version 1 NOTE: While multiple methods can split data, ensure new columns aren’t empty. Validate by sorting and checking for null values after using our suggested method in step b. c. [0.5 points] Select the column name and apply the Text Facet (Facet → Text Facet). Cluster by using (Edit Cells → Cluster and Edit …) this opens a window where you can choose different “methods” and “keying functions” to use while clustering. Choose the keying function that produces the smallest number of clusters under the “Key Collision” method. Click ‘Select All’ and ‘Merge Selected & Close’. Provide the name of the keying function and the number of clusters produced. Output format and sample values: iii.function: fingerprint, 200 NOTE: Use the default Ngram size when testing Ngram-fingerprint. d. [1 point] Replace the null values in the brand_name column with the text “Unknown” (Edit Cells → Transform). Provide the expression used. Output format and sample values: iv.GREL_categoryname: endsWith(“food”, “ood”) NOTE: “Unknown” is case and space sensitive (“Unknown” is different from “unknown” and “Unknown “.) e. [0.5 point] Create a new column high_priced with the values 0 or 1 based on the “price” column with the following conditions: if the price is greater than 90, high_priced should be set as 1, else 0. Provide the GREL expression used to perform this. Output format and sample values: v.GREL_highpriced: endsWith(“food”, “ood”) f. [1.5 points] Create a new column has_offer with the values 0 or 1 based on the item_description column with the following conditions: If it contains the text “discount” or “offer” or “sale”, then set the value in has_offer as 1, else 0. Provide the GREL expression used to perform this. Convert the text to lowercase in the GREL expression before you search for the terms. Output format and sample values: vi.GREL_hasoffer: endsWith(“food”, “ood”) 12 Version 1 Q5 [5 points] Introduction to Python Flask Flask is a lightweight web application framework written in Python that provides you with tools, libraries, and technologies to build a web application quickly and scale it up as needed. In this question, you will build a web application that displays a table of TMDb data on a single-page website using Flask. You will modify the given file: wrangling_scripts/Q5.py Technology Python 3.10.x Flask Allowed Libraries Python standard libraries Libraries already imported in Q5.py Deliverables Q5.py: Completed Python file with your changes Tasks and point breakdown 1. username() – Update the username() method inside Q5.py by including your GT username. 2. Install Flask on your machine by running $ pip install Flask a. You can optionally create a virtual environment by following the steps here. Creating a virtual environment is purely optional and can be skipped. 3. To run the code, navigate to the Q5 folder in your terminal/command prompt and execute the following command: python run.py. After running the command, go to http://127.0.0.1:3001/ on your browser. This will open up index.html, showing a table in which the rows returned by data_wrangling() are displayed. 4. You must solve the following two sub-questions: a. 
[2 points] Read and store the first 100 rows in a table using the data_wrangling() method. NOTE: The skeleton code, by default, reads all the rows from movies.csv. You must add the required code to ensure that you are reading only the first 100 data rows. The skeleton code already handles reading the table header for you. b. [3 points]: Sort this table in descending order of the values, i.e., with larger values at the top and smaller values at the bottom of the table in the last (3rd) column. Note that this column needs to be returned as a string for the autograder, but sorting may require float casting.
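To make parts 4.a and 4.b concrete, here is a minimal sketch of the kind of logic data_wrangling() needs. It assumes a plain csv.reader over movies.csv and hypothetical names, so adapt it to the signatures and file path already present in the Q5.py skeleton rather than copying it verbatim:

    import csv

    def data_wrangling():
        # Sketch only: read the header plus the first 100 data rows of movies.csv,
        # then sort the rows by the 3rd column in descending numeric order while
        # keeping every stored value a string (the autograder expects strings).
        with open("movies.csv", newline="", encoding="utf-8") as f:
            reader = csv.reader(f)
            header = next(reader)          # header handling mirrors the skeleton
            rows = []
            for i, row in enumerate(reader):
                if i >= 100:               # keep only the first 100 data rows
                    break
                rows.append(row)
        rows.sort(key=lambda r: float(r[2]), reverse=True)  # cast only for sorting
        return header, rows

The two key points are the early stop after 100 data rows and casting to float only inside the sort key, so the returned values remain strings.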


[SOLVED] Cse 6242 / cx 4242 hw 3: spark, docker, databricks, aws and gcp

Homework Overview Many modern-day datasets are huge and truly exemplify “big data”. For example, the Facebook social graph is petabytes large (over 1M GB); every day, Twitter users generate over 12 terabytes of messages; and the NASA Terra and Aqua satellites each produce over 300 GB of MODIS satellite imagery per day. These raw data are far too large to even fit on the hard drive of an average computer, let alone to process and analyze. Luckily, there are a variety of modern technologies that allow us to process and analyze such large datasets in a reasonable amount of time. For the bulk of this assignment, you will be working with a dataset of over 1 billion individual taxi trips from the New York City Taxi & Limousine Commission (TLC). Further details on this dataset are available here. In Q1, you will work with a subset of the TLC dataset to get warmed up with PySpark. Apache Spark is a framework for distributed computing, and PySpark is its Python API. You will use this tool to answer questions such as “what are the top 10 most common trips in the dataset”? You will be using your own machine for computation, using an environment defined by a Docker container. In Q2, you will perform further analysis on a different subset of the TLC dataset using Spark on DataBricks, a platform combining datasets, machine learning models, and cloud computing. This part of the assignment will be completed in the Scala programming language, a modern general-purpose language with a robust support for functional programming. The Spark distributed computing framework is in fact, written using Scala. In Q3, you will use PySpark on AWS using Elastic MapReduce (EMR), and in Q4 you will use Spark on Google Cloud Platform, to analyze even larger samples from the TLC dataset. Finally, in Q5 you will use the Microsoft Azure ML Studio to implement a regression model to predict automobile prices using a sample dataset already included in the Azure workspace. A main goal of this assignment is to help students gain exposure to a variety of tools that will be useful in the future (e.g., future project, research, career). The reasoning behind intentionally including AWS, Azure and GCP (most courses use only one), because we want students to be able to try and compare these platforms as they evolve rapidly. This will help the students in the future. Should they need to select a cloud platform to use, they can make more informed decisions and be able to get started right away. You will find that a number of computational tasks in this assignment are not very difficult, and there seems to be quite a bit of “setup” to do before getting to the actual “programming” part of the problem. Being able to set up work environments, start clusters, monitor compute usage, and run large-scale experiments on cloud platforms are important skills. Through this assignment, you will be able to familiarize yourself with using clusters of machines, and the pay-per-use model used by most cloud services. This is a helpful first cloud service experience for many students. 4 Version 0 Q1 [15 points] Analyzing trips data with PySpark Technology PySpark, Docker Allowed Libraries NA Max allowed runtime NA Deliverables [Gradescope] q1.ipynb: your solution as a Jupyter Notebook file Imagine that your boss gives you a large dataset which contains trip information of New York City Taxi and Limousine Commission (TLC). You are asked to provide summaries for the most common trips, as well as information related to fares and traffic. 
This information might help in positioning taxis depending on the demand at each location. Follow these instructions to download and set up a preconfigured Docker image that you will use for this assignment. Why use Docker? In earlier iterations of this course, students installed software on their own machines, and we (both students and instructor team) ran into many issues that could not be resolved satisfactorily. Docker allows us to distribute a cross-platform, preconfigured image with all the requisite software and correct package versions. Once Docker is installed and the container is running, access Jupyter by browsing to https://localhost:6242. There is no need to install any additional Java or PySpark dependencies as they are all bundled as part of the Docker container. Imagine that your boss gives you a large dataset which contains trip information of New York City Taxi and Limousine Commission (TLC). You are asked to provide summaries for the most common trips, as well as information related to fares and traffic. This information might help in positioning taxis depending on the demand at each location. You are provided with a Jupyter notebook (q1.ipynb) file which you will complete using PySpark using the provided Docker image. Note: 1. Regular PySpark Dataframe Operations and PySpark SQL operations can be used. 2. If you re-run cells, remember to restart the kernel to clear the Spark context, otherwise an existing Spark context may cause errors. 3. Be sure to save your work often! If you do not see your notebook in Jupyter, then double check that the file is present in the folder and that your Docker has been set up correctly. If, after checking both, the file still does not appear in Jupyter then you can still move forward by clicking the “upload” button in the Jupyter notebook and uploading the file – however, if you use this approach, then your file will not be saved to disk when you save in Jupyter, so you would need to download your work by going to File > Download as… > Notebook (.ipynb), so be sure to download often to save your work! Tasks You will use the yellow_tripdata_2019-01_short.csv dataset. This dataset is a modified record of the NYC Green Taxi trips and includes information about the pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, fare amounts, payment types, and driver-reported passenger counts. When processing the data or performing calculations, do not round any values. a. [1 pt] You will be modifying the function clean_data to clean the data. Cast the following columns into the specified data types: 5 Version 0 a. passenger_count — integer b. total_amount — float c. tip_amount — float d. trip_distance — float e. fare_amount — float f. tpep_pickup_datetime — timestamp g. tpep_dropoff_datetime — timestamp b. [4 pts] You will be modifying the function common_pair. Return the top 10 pickup-dropoff location pairs having the highest number of trips (count). The location pairs should be ordered by count in descending order. If two or more pairs have the same number of trips, break the tie using the trip amount per distance traveled (trip_rate) in descending order. Use columns total_amount and trip_distance to calculate the trip amount per distance. In certain situations, the pick-up and drop-off locations may be the same (include such entries as well). While calculating trip_rate, first get the average trip_distance and the average total_amount for each pair of PULocationID and DOLocationID (using group by). 
Then take their ratio to get the trip_rate for a pickup-drop pair. Example: Sample Output Format (values are examples only): PULocationID DOLocationID Count trip_rate 1 2 23 5.242345 3 3 5 6.61345634 c. [4 pts] You will be modifying the function time_of_cheapest_fare. Divide each day into two periods: Day (from 9am to 8:59:59pm, both inclusive), and Night (from 9pm to 8:59:59am, both inclusive). Calculate the average total amount per unit distance traveled (use column total_amount) for both time periods. Sort the result by trip_rate in ascending order to determine when the fare rate is the cheapest. Use tpep_pickup_datetime to divide trips into Day and Night. Output: day_night trip_rate Day 4.2632344561 Night 6.42342882 d. [4 pts] You will be modifying the function passenger_count_for_most_tip . Filter the data for 6 Version 0 trips having fares (fare_amount) greater than $2 and the number of passengers (passenger_count) greater than 0. Calculate the average fare and tip (tip_amount) for all passenger group sizes and calculate the tip percent (tip_amount * 100 / fare_amount). Sort the result in descending order of tip percent to obtain the group size that tips the most generously. Output: passenger_count tip_percent 2 14.22345234 1 12.523334576 3 12.17345231 e. [3 pts] You will be modifying the function day_with_traffic . Sort the days of the week (using tpep_pickup_datetime) in descending order of traffic (day having the highest traffic should be at the top). Calculate traffic for a particular day using the average speed of all taxi trips on that day of the week. Calculate the average speed as the average trip distance divided by the average trip time, as distance per hour. If the average_speed is equal for multiple days, order the days alphabetically. A day with low average speed indicates high levels of traffic. The average speed may be 0, indicating very high levels of traffic. Not all days of the week may be present in the data (do not include the missing days of the week in your output). Use date_format along with the appropriate pattern letters to format the day of the week so that it matches the example output below. Output: day_of_week average_speed Fri 0.953452345 Mon 5.2424622 Tue 9.23345272 IMPORTANT: Strictly follow the requirements below, or your answers may not be graded. 1. Do not add any cells to the notebook. 2. Remove all “testing” code that renders output, or Gradescope will crash. For instance, any additional print, display, and show statements used for debugging must be removed. Q2 [30 pts] Analyzing dataset with Spark/Scala on Databricks Technology Spark/Scala, Databricks Allowed Libraries NA Max allowed runtime NA Deliverables [Gradescope] • q2.dbc: Your solution as Scala Notebook archive file (.dbc) exported from Databricks (see Databricks Setup Guide below) • q2.scala: Your solution as a Scala source file exported from Databricks (see Databricks Setup Guide below) • q2_results.csv: The output results from your Scala code in the Databricks q2 notebook file. You must carefully copy the outputs of the display()/show() function into a file titled q2_results.csv under the relevant sections. Please double-check and compare your actual output with the results you copied. 7 Version 0 Tutorial Firstly, go over this Spark on Databricks Tutorial, to learn the basics of creating Spark jobs, loading data, and working with data. You will analyze nyc-tripdata.csv1 using Spark and Scala on the Databricks platform. 
(A short description of how Spark and Scala are related can be found here.) You will also need to use the taxi zone lookup table using taxi_zone_lookup.csv that maps the location ID into the actual name of the region in NYC. The nyc-trip data dataset is a modified record of the NYC Green Taxi trips and includes information about the pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, fare amounts, payment types, and driverreported passenger counts. VERY IMPORTANT 1. Use only Firefox, Safari or Chrome when configuring anything related to Databricks. The setup process has been verified to work on these browsers. 2. Carefully follow the instructions in the Databricks Setup Guide. (You should have already downloaded the data needed for this question using the link provided before Homework Overview.) a. You must choose the Databricks Runtime (DBR) version as “6.4 (includes Apache Spark 2.4.5, Scala 2.11)”. We will grade your work using this version. b. You must not choose the default DBR version of >= 7.2 c. Note that you do not need to install Scala or Spark on your local machine. They are provided with the DBR environment. 3. You must use only Scala DataFrame operations for this question. Scala DataFrames are just another name for Spark DataSet of rows. You can use the DataSet API in Spark to work on these DataFrames. Here is a Spark document that will help you get started on working with DataFrames in Spark. You will lose points if you use SQL queries, Python, or R to manipulate a DataFrame. a. After selecting the default language as SCALA, do not use the language magic % with other languages like %r, %python, %sql etc. The language magics are used to override the default language, which you must not do for this assignment. b. You must not use full SQL queries in lieu of the Spark DataFrame API. That is, you must not use functions like sql(), which allows you to directly write full SQL queries like spark.sql (“SELECT* FROM col1 WHERE …”). This should be df.select(“*”) instead. 4. The template Scala notebook q2.dbc (in hw3-skeleton) provides you with code that reads a data file nyc-tripdata.csv. The input data is loaded into a DataFrame, inferring the schema using reflection (Refer to the Databricks Setup Guide above). It also contains code that filters the data to only keep the rows where the pickup location is different from the drop location, and the trip distance is strictly greater than 2.0 (>2.0). a. All tasks listed below must be performed on this filtered DataFrame, or you will end up with wrong answers. b. Carefully read the instructions in the notebook, which provides hints for solving the problems. 5. Some tasks in this question have specified data types for the results that are of lower precision (e.g., float). For these tasks, we will accept relevant higher precision formats (e.g., double). Similarly, we will accept results stored in data types that offer “greater range” (e.g., long, bigint) than what we have specified (e.g., int). 1 Graph derived from the NYC Taxi and Limousine Commission 8 Version 0 6. Remove all “testing” code that renders output, or Gradescope will crash. For instance, any additional print, display, and show statements used for debugging must be removed. Tasks 1) List the top-5 most popular locations for: a. [2 pts] dropoff based on “DOLocationID”, sorted in descending order by popularity. If there is a tie, then one with a lower “DOLocationID” gets listed first. b. 
[2 pts] pickup based on “PULocationID”, sorted in descending order by popularity. If there is a tie, then one with a lower “PULocationID” gets listed first. 2) [4 pts] List the top-3 locationID’s with the maximum overall activity. Here, overall activity at a LocationID is simply the sum of all pick-ups and all drop-offs at that LocationID. In case of a tie, the lower LocationID gets listed first. Note: If a taxi picked up 3 passengers at once, we count it as 1 pickup and not 3 pickups. 3) [4 pts] List all the boroughs (of NYC: Manhattan, Brooklyn, Queens, Staten Island, Bronx along with “Unknown” and “EWR”) and their total number of activities, in descending order of a total number of activities. Here, the total number of activities for a borough (e.g., Queens) is the sum of the overall activities (as defined in part 2) of all the LocationIDs that fall in that borough (Queens). An example output format is shown below. 4) [5 pts] List the top 2 days of the week with the largest number of daily average pick-ups, along with the average number of pick-ups on each of the 2 days in descending order (no rounding off required). Here, the average pickup is calculated by taking an average of the number of pick-ups on different dates falling on the same day of the week. For example, 02/01/2021, 02/08/2021 and 02/15/2021 are all Mondays, so the average pick-ups for these is the sum of the pickups on each date divided by 3. An example output is shown below. Note: The day of week is a string of the day’s full spelling, e.g., “Monday” instead of the number 1 or “Mon”. Also, the pickup_datetime is in the format: yyyy-mm-dd 5) [6 pts] For each hour of a day (0 to 23, 0 being midnight) — in the order from 0 to 23 (inclusively), find the zone in the Brooklyn borough with the largest number of total pick-ups. 9 Version 0 Note: All dates for each hour should be included. 6) [7 pts] Find which 3 different days in the month of January, in Manhattan, that saw the largest positive percentage increase in pick-ups compared to the previous day, in the order from largest percentage increase to smallest percentage increase. An example output is shown below. Note: All years need to be aggregated to calculate the pickups for a specific day of January. The change from Dec 31 to Jan 1 can be excluded. List the results of the above tasks in the provided q2_results.csv file under the relevant sections. These preformatted sections also show you the required output format from your Scala code with the necessary columns — while column names can be different, their resulting values must be correct. • You must manually enter the output generated into the corresponding sections of the q2_results.csv file, preferably using some spreadsheet software like MS-Excel (but make sure to keep the csv format). For generating the output in the Scala notebook, refer to show() and display()functions of Scala. • Note that you can edit this csv file using text editor, but please be mindful about putting the results under designated columns. Note: Do NOT modify anything other than filling in those required output values in this csv file. We grade by running the Spark Scala code you write and by looking at your results listed in this file. So, make sure that your output is actually obtained from the Spark Scala code you write. Hint: You may find some of the following DataFrame operations helpful: toDF, join, select, groupBy, orderBy, filter, agg, Window(), partitionBy, orderBy, etc. 
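Q2 itself must be solved with Scala DataFrames, but the same "group, aggregate, then order with a tie-breaker" pattern also drives Q1 above (for example, common_pair in part b). A rough PySpark illustration of that pattern, using the column names from the Q1 description and not intended as the graded solution (check the notebook's existing imports before adding any):

    from pyspark.sql import functions as F

    def common_pair(df):
        # Sketch: count trips per (PULocationID, DOLocationID) pair, compute
        # trip_rate as avg(total_amount) / avg(trip_distance), then order by
        # Count (desc) with trip_rate (desc) as the tie-breaker and keep 10.
        agg = (df.groupBy("PULocationID", "DOLocationID")
                 .agg(F.count("*").alias("Count"),
                      F.avg("total_amount").alias("avg_amount"),
                      F.avg("trip_distance").alias("avg_distance")))
        rated = agg.withColumn("trip_rate", F.col("avg_amount") / F.col("avg_distance"))
        return (rated.orderBy(F.desc("Count"), F.desc("trip_rate"))
                     .select("PULocationID", "DOLocationID", "Count", "trip_rate")
                     .limit(10))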
Q3 [35 points] Analyzing Large Amount of Data with PySpark on AWS Technology PySpark, AWS Allowed Libraries NA Max allowed runtime NA Deliverables [Gradescope] • q3.ipynb: PySpark notebook for this question (for the larger dataset). • q3_output_large.csv: output file (comma-separated) for the larger dataset. VERY IMPORTANT: Use Firefox, Safari or Chrome when configuring anything related to AWS. You will try out PySpark for processing data on Amazon Web Services (AWS). Here you can learn more about PySpark and how it can be used for data analysis. You will be completing a task that may be accomplished using a commodity computer (e.g., consumer-grade laptops or desktops). However, we would like you to use this exercise as an opportunity to learn distributed computing on Amazon EC2, and to gain experience that will help you tackle more complex problems. 10 Version 0 The services you will primarily be using are Amazon S3 storage, Amazon Elastic Cloud Computing (EC2) virtual servers, and Amazon Elastic MapReduce (EMR) managed Hadoop framework. You will be creating an S3 bucket, running code through EMR, and then storing the output into that S3 bucket. For this question, you will only use up a very small fraction of your AWS credit. If you have any issues with the AWS Academy account, please fill out this form. Setting Up AWS Environment Go through all the steps in the AWS Setup Guide (You should have already completed Step 1 to create your account) to set up your AWS environment, e.g., setting up billing alert, creating S3 storage bucket, uploading skeleton file, and, EXTREMELY IMPORTANTLY learning how to terminate all AWS clusters properly, or you will run out of AWS credits and may not be able to complete this question. Datasets In this question, you will use a dataset of trip records provided by the New York City Taxi and Limousine Commission (TLC). You will be accessing the dataset directly through AWS via the code outlined in the homework skeleton. Specifically, you will be working with two samples of this dataset, one small, and one much larger. Further details about this dataset are available here and here, and you may explore the structure of the data via [1] [2]. EXTREMELY IMPORTANT: Both the datasets are in the US East (N. Virginia) region. Using machines in other regions for computation will incur data transfer charges. Hence, set your region to US East (N. Virginia) in the beginning (not Oregon, which is the default). This is extremely important, otherwise your code may not work, and you may be charged extra. Goal You work at NYC TLC, and since the company bought a few new taxis, your boss has asked you to locate potential places where taxi drivers can pick up more passengers. Of course, the more profitable the locations are, the better. Your boss also tells you not to worry about short trips for any of your analysis, so only analyze trips which are 2.0 miles or longer. First, find the 20 most popular drop off locations in the Manhattan borough by finding which of these destinations had the greatest passenger count. Now, analyze all pick-up locations. • For each pick-up location determine o the average total amount per trip, o the total count of all trips that start at that location, and o the count of all trips that start at that location and end at one of most popular drop-off locations. 
• Using the above values, o determine the proportion of trips that end in one of the popular drop-off locations (# trips that end in drop off location divided by total # of trips) and o multiply that proportion by the average total amount to get a weighted profit value based on the probability of passengers going to one of the popular destinations. 11 Version 0 Bear in mind, your boss is not as savvy with the data as you are and is not interested in location IDs. To make it easy for your boss, provide the Borough and Zone for each of the top 20 pick-up locations you determined. Tasks You are provided with a python notebook (q3.ipynb) file which you will complete and load into EMR. You are provided with the load_data() function, which loads two PySpark DataFrames. The first is trips which contain a DataFrame of trip data, where each record refers to one (1) trip. The second is lookup which maps a LocationID to its information. It can be linked to either the PULocationID or DOLocationID fields in the trips DataFrame. The following functions must be completed for full credit. VERY IMPORTANT • Ensure that the parameters for each function remain as defined and the output order and names of the fields in the PySpark DataFrames are maintained. • Do not import any functions which were not already imported within the skeleton. • You must NOT round any numeric values. Rounding numbers can introduce inaccuracies. Our grader will be checking the first 8 decimal places of each value in the DataFrame. a) [1 pts] user() i. Returns your GT Username as a string (e.g., gburdell3) b) [2 pts] long_trips(trips) i. This function filters trips to keep only trips 2 miles or longer (e.g., >= 2). ii. Returns PySpark DataFrame with the same schema as trips iii. Note: Parts c, d and e will use the result of this function c) [6 pts] manhattan_trips(trips, lookup) i. This function determines the top 20 locations with a DOLocationID in Manhattan by sum of passenger count. ii. Returns a PySpark DataFrame (mtrips) with the schema (DOLocationID, pcount) d) [6 pts] weighted_profit(trips, mtrips) i. This function determines i. the average total_amount, ii. the total count of trips, and iii. the total count of trips ending in the top 20 destinations and return the weighted_profit as discussed earlier in the homework document. iv. Returns a PySpark DataFrame with the schema (PULocationID, weighted_profit) for the weighted_profit as discussed earlier in this homework document. e) [5 pts] final_output(wp, lookup) i. This function i. takes the results of weighted_profit, ii. links it to the borough and zone through the lookup data frame, and iii. returns the top 20 locations with the highest weighted_profit. ii. Returns a PySpark DataFrame with the schema (Zone, Borough, weighted_profit) Once you have implemented all these functions, run the main() function, which is already implemented, and update the line of code to include the name of your output s3 bucket and a location. This function will fail if the output directory already exists, so make sure to change it each time you run the function. 12 Version 0 Example: final.write.csv(‘s3://cse6242-gburdell3/output-large3’) Your output file will appear in a folder in your s3 bucket as a csv file with a name which is similar to part0000-4d992f7a-0ad3-48f8-8c72-0022984e4b50-c000.csv. Download this file and rename it to q3_output_large.csv for submission. Do NOT make any other changes to the file. Hints: 1. 
Refer to DataFrame commands such as filter, join, groupBy, agg, limit, sort, withColumnRenamed and withColumn. Documentation for the DataFrame APIs is located here. 2. Testing on a single, small dataset (i.e., a “test case”) is helpful, and is insufficient in discovering all potential issues, especially if such issues only become apparent when the code is run on larger datasets. Thus, it is important for you to develop more ways to review and verify your code logic. 3. Precision in data analytics is very important. Keep in mind that precision reduction in an earlier step can accumulate and be magnified, subsequently significantly affecting the final output’s precision (e.g., for a dataset with 1,000,000 data points, a 0.0001 difference for each data point can lead to a total difference of 100 over the whole dataset). 4. Check if you’re reducing the precision (or “scale”) too aggressively. Can you relax the restriction during intermediate steps? 5. Make sure you return a DataFrame. If you get NoneType errors, you are most likely not returning what you think you are. 6. Some columns may need to be cast to the right data type. Keep that in mind! IMPORTANT: Strictly follow the guidelines below, or your answer may not be graded. 1. Double check that you are submitting the correct files — we only want the script and output from the larger dataset. Also, double check that you are writing the right dataset’s output to the right file. 2. You are welcome to store your script’s output in any bucket you choose, as long as you can download and submit the correct files. 3. Do not make any manual changes to the output files. 4. Regular Pyspark Dataframe Operations and PySpark SQL operations can be used. 4.1. To use PySpark SQL operations, you must use the SQL Context on the Spark Dataframe. Example: df.sql_ctx.sql(“SELECT * FROM some_data”) 5. Do not import any additional packages, INCLUDING pyspark.sql.functions, as this may cause the autograder to work incorrectly. Everything you need should be imported for you. 6. Remove all “testing” code that renders output, or Gradescope will crash. For instance, any additional print, display, and show statements used for debugging must be removed. Q4 [10 points] Analyzing a Large Dataset using Spark on GCP Technology Spark, Google Cloud Platform (GCP) Allowed Libraries NA Max allowed runtime NA Deliverables [Gradescope] q4.ipynb: the PySpark notebook for this question. VERY IMPORTANT: Use Firefox, Safari or Chrome when configuring anything related to GCP. 13 Version 0 GCP Guidelines Instructions to set up GCP Credits, GCP Storage and Dataproc Cluster are provided as video tutorials (part 1, part 2, and part 3) and as written instructions. Helpful tips/FAQs for special scenarios: a) If GCP service is disabled for your google account, try the steps in this google support link b) If you have any issues with GCP free credits, please fill out this form Goal The goal of this question is to familiarize you with creating storage buckets/clusters and running Spark programs on Google Cloud Platform. This question asks you to create a new Google Storage Bucket and load the NYC Taxi & Limousine Commission Dataset. You are also provided with a Jupyter Notebook q4.ipynb file, which you will load and complete in a Google Dataproc Cluster. Inside the notebook, you are provided with the skeleton for the load_data() function, which you will complete to load a PySpark DataFrame from the Google Storage Bucket you created as part of this question. 
Using this PySpark DataFrame, you will complete the following tasks using Spark DataFrame functions. You will use the data file yellow_tripdata09-08-2021.csv; the preceding link allows you to download the dataset you are required to work with for this question from the course DropBox. Each line represents a single taxi trip consisting of the comma-separated columns bulleted below. All columns are of string data type. You must convert the highlighted columns below into decimal data type (do NOT use float datatype) inside their respective functions when completing this question. Do not convert any datatypes within the load_data function. While casting to a decimal datatype, use a precision of 38 and a scale of 10. • vendorid • tpep_pickup_datetime • tpep_dropoff_datetime • passenger_count • trip_distance (decimal data type) • ratecodeid • store_and_fwd_flag • pulocationid • dolocationid • payment_type • fare_amount (decimal data type) • extra • mta_tax • tip_amount (decimal data type) • tolls_amount (decimal data type) • improvement_surcharge • total_amount Tasks VERY IMPORTANT: you must first perform the task a BEFORE performing task b, c, d, e and f. No points are allocated to task a, but it is essential that you correctly implement the load_data() function as the remaining graded tasks depend upon this task and its correct implementation. 14 Version 0 a) [0 pts — required] Function load_data() to load data from a Google Storage Bucket into a Spark DataFrame b) [2 pts] Function exclude_no_pickuplocations() to exclude trips with no pick-up locations (i.e., pick-up location id column is null or is zero. In other words, assume zero is not a valid pickup location id.) in the original data from a. c) [2 pts] Function exclude_no_tripdistance() to exclude trips with no distance (i.e., trip distance column is null or zero) in the dataframe output by exclude_no_pickuplocations(). . Note: Cast the trip_distance column to decimal datatype before filtering. d) [2 pts] Function include_fare_range() to include trips with fare from $20 (inclusively) to $60 (inclusively) in the dataframe output by exclude_no_tripdistance(). Note: Cast the fare_amount column to decimal datatype before filtering. e) [2 pts] Function get_highest_tip() to identify the highest tip (rounded to 2 decimal places) in the dataframe output by include_fare_range(). Note: Cast the tip_amount column to decimal datatype before filtering. f) [2 pts] Function get_total_toll() to calculate the total toll amount (rounded to 2 decimal places) in the dataframe output by include_fare_range(). Note: Cast the tolls_amount column to decimal datatype before filtering. IMPORTANT: Strictly follow the guidelines below, or your answer may not be graded. 1. Regular PySpark Dataframe Operations and PySpark SQL operations can be used. 2. Make sure to download the notebook from your GCP cluster before deleting the GCP cluster (otherwise, you will lose your work). 3. Do not add new cells to the notebook, as this may break the auto-grader. 4. Remove all “testing” code that renders output, or Gradescope will crash. For instance, any additional print, display, and show statements used for debugging must be removed. Q5 [10 points] Regression: Automobile price prediction, using Microsoft Machine Learning Studio Note: Create and use a free workspace instance on Microsoft Machine Learning Studio. Use your Georgia Tech username (e.g., jdoe3) to login. 
Goal The primary purpose of this question is to introduce you to Microsoft Machine Learning Studio, familiarize you to its basic functionalities and typical machine learning workflows. Go through the “Automobile price prediction” tutorial and create/run ML experiments to complete the following tasks. You will not incur any cost if you save your experiments on Azure till submission. Once you are sure about the results and have reported them, feel free to delete your experiments. Technology Microsoft Machine Learning Studio Allowed Libraries NA Max allowed runtime NA Deliverables [Gradescope] q5_results.csv: a csv file containing results for all parts 15 Version 0 Tasks You will manually modify the given file q5_results.csv by adding to it the results from the following tasks (e.g., using a plain text editor). Your solution will be autograded. Hence, • DO NOT change the order of the questions. • Report the exact numerical values that you get in your output, and DO NOT round any of them. • When manually entering a value into the csv file, append it immediately after a comma, so there will be NO space between the comma and your value, and no trailing spaces or commas after your value. • Follow the tutorial and do not change values for L2 regularization. For parts b and c, please select the columns given in the tutorial. a) Update your GT username in the q5_results.csv file to replace gburdell3. b) [3 pts] Repeat the experiment described in the tutorial and report values of all metrics as mentioned in the ‘Evaluate Model’ section of the tutorial. c) [3 pts] Repeat the experiment mentioned in part b with a different value of ‘Fraction of rows in the first output’ in the split module. Change the value to 0.8 from the originally set value, 0.75. Report corresponding values of the metrics. d) [4 pts] Run a new experiment — evaluate the model using 5-fold cross-validation (CV). Select parameters in the module ‘Partition and sample’ (Partition and Sample) in accordance with the figure below. Set the column name as “price” for CV. Also, use 0 as a random seed. Report the values of Root Mean Squared Error (RMSE) and Coefficient of Determination for each of the five folds (1st fold corresponds to fold number 0 and so on). Do NOT round the results. Report exact values. To summarize, for part d, you MUST exactly follow each step below to run the experiment: A. Import the entire dataset (Automobile Price Data (Raw)) B. Clean the missing data by dropping rows with missing values (select all columns in the dataset and do not “exclude the normalized losses” from the original tutorial). Leave the maximum missing value ratio to 1. C. Partition and sample the data. (Note: do not use “Split Data”) D. Create a new model: Linear Regression (add the default Linear regression, i.e., do not change any values here) E. Finally, perform cross-validation on the dataset. (Hint: use the price column here) F. Visualize/report the values. 16 Version 0 Figure: Property Tab of Partition and Sample Module Hint: For part 4, follow each of the outline steps carefully. This should result in 5 blocks in your final workflow (including the Automobile price data (Raw) block).
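Circling back to Q4's casting requirement above: each of those functions is expected to cast the relevant column to decimal(38, 10) before filtering. A minimal PySpark sketch of one of them is shown below; it is illustrative only, so keep to the skeleton's imports and function signatures when you implement it:

    from pyspark.sql.functions import col

    def exclude_no_tripdistance(df):
        # Sketch: cast trip_distance to decimal(38,10) first, then drop rows
        # where the distance is null or zero.
        casted = df.withColumn("trip_distance",
                               col("trip_distance").cast("decimal(38,10)"))
        return casted.filter(col("trip_distance").isNotNull() &
                             (col("trip_distance") != 0))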


[SOLVED] Cse 6242 / cx 4242 homework 2: tableau, d3 graphs and visualization

Homework Overview “Visualization gives you answers to questions you didn’t know you have” – Ben Schneiderman This homework focuses on exploring and creating data visualizations using two of the most popular tools in the field. Data visualization is an integral part of exploratory analysis and communicating key insights. All of the questions in this homework use data on the same topic in order to highlight some of the uses and strengths of different types of visualizations. The data for this homework comes from BoardGameGeek and includes information on games’ ratings, popularity, and metadata. Part 1 of the homework uses Tableau to connect to online data which feeds multiple visualizations including a table and bar charts. Part 2 of the homework uses D3 and includes graphs with different scales, network graphs, and a map. Below are some terms you will often see in the questions: • Rating – a value from 0 to 10 given to each game. BoardGameGeek calculates a game’s overall rating in different ways including Average and Bayes, so make sure you are using the correct rating called for in a question. A higher rating is better than a lower rating. • Rank – the overall rank of a boardgame from 1 to n, with ranks closer to 1 being better and n being the total number of games. The rank may be for all games or for a subgroup of games such as abstract games or family games. In Q1, you will design a table, a grouped bar chart, and a stacked bar chart with filters. The data for this question is hosted online and will help you practice connecting Tableau to online data sources. Questions 2-5 highlight different features of D3. The provided skeletons scaffold coding in D3 with the most complete template code being provided for Q2. Q4 and Q5 provide scaled back templates. Q3 does not provide template code, and is an excellent opportunity to separate html, css, and js files because a separate js file can be used for each of the visualizations. Q2: a network graph shows relationships between games. You will add interactive features like pinning nodes to give the viewer some control over the visualization. Q3: you will explore temporal patterns in the BoardGameGeek data, using line charts to compare how the number of ratings grew from month to month for 8 games. You will also integrate additional data about board game rankings onto these line charts and explore the effect of axis scale choice on what information is emphasized in the graph. Q4: you will create line charts that use interactive elements to display additional data. This time, the line charts will show the number of games with each rating for multiple years. You will then implement a bar chart that appears when you mouse over a point on the line chart. Q5: you will create a choropleth map to explore the average rating of each game in different countries. 3 version 0 Note the following important points 1. We highly recommend that you use the latest Firefox browser to complete this question. We will grade your work using Firefox 80.0. 2. You will work with version 5 of D3 in this homework. You must NOT use any D3 libraries (d3*.js) other than the ones provided in the lib folder. 3. You may need to setup a local HTTP server in the root (hw2-skeleton) folder to run your D3 visualizations, depending on your web browser, as discussed in the D3 lecture (OMS students: the video “Week 5 – Data Visualization for the Web (D3) – Prerequisites: JavaScript and SVG”. Campus students: see lecture PDF.). The easiest way is to use http.server for Python 3.x. 
(for more details, see link). 4. All d3*.js files in the lib folder must be referenced using relative paths, e.g., “../lib/” in your html files. For example, suppose the file “Q2/graph.html” uses d3, its header should contain:It is incorrect to use an absolute path such as:5. For questions that require reading from a dataset, you may be required to submit the dataset in the deliverables too. For such questions, in your html/js code, use a relative path to read in the dataset file. For example, suppose a question reads data from earthquake.csv, the path should simply be “earthquake.csv” and NOT an absolute path such as “C:/Users/polo/hw2- skeleton/Q/earthquake.csv”. 6. You can and are encouraged to decouple the style, functionality and markup in the code for each question. That is, you can use separate files for CSS, JavaScript and html. Q1 [25 points] Designing a good table. Visualizing data with Tableau. Setting Up Tableau Tableau has provided us with student licenses for Tableau Desktop, available for Mac and Windows. Go to Tableau and select “Products/Tableau Desktop”. After installation, you will be asked to provide an activation key, which you can find on the Canvas page for this assignment. This key is for your use in this course only. Do not share the key with anyone. If you already have Tableau installed on your machine, for example from a previous trial, you may use this key to reactivate it. If you do not have access to a Mac or Windows machine, please use the 14-day trial version of Tableau Online: 1. Visit https://www.tableau.com/trial/tableau-online 2. Enter your information (name, email, GT details, etc.) 3. You will then receive an email to access your Tableau Online site 4. Go to your Site and create a workbook One final option, if neither of the above methods work, is to take advantage of Tableau for Students. Follow the link and select “Get Tableau For Free”. You should be able to receive an activation key which offers you a one-year use of Tableau Desktop at no cost by providing a valid Georgia Tech email. Note that it is unclear whether Tableau intends for these licenses to be renewable, so you may only be eligible to receive one in the event that you have never used a Tableau for Students license before. Connecting to the Data Complete all parts of Q1 using a single Tableau workbook. (Technically, you could use multiple workbooks, but we do not recommend that here. The directions below assume you are using one workbook.) 1. You will need a data.world account (created using any email you want) to access the data for Q1. 2. Q1 will require connecting Tableau to multiple data sources. You can connect multiple data sources within one workbook by following the directions here. 4 version 0 3. Open Tableau and when prompted to connect to a data source choose To a Server – Web Data Connector. You may need to select “More…” to see Web Data Connector as an option. 4. Enter this URL (with SQL query embedded) to connect to part of the data.world data set on board games. This data will be used in Q1a and Q1b. You may be prompted to log in to data.world and authorize Tableau. Do not edit the provided SQL query. 5. We recommend renaming the data connection since you will have multiple connections to mjpetrey/boardgamegeek. Rename the connection to something that makes sense to you. (Clicking on the text lets you edit it.) 6. Click to create a new worksheet, and Tableau will then automatically create a data extract. You now have the data needed for Q1a and Q1b! 
(Live data connections are not an option when connecting to data.world. You can read a comparison of Tableau’s data connection options here.) 7. To add a new data source Click on Data – New Data Source. Then repeat steps 3-6 using this URL for Q1c. If you are unable to connect to data.world for any reason, flat data files for Q1 have also been provided in the skeleton folder. The preferred data source is connecting online as that provides valuable experience (and something you may choose to use in your final projects). The provided csv files are identical to those hosted online and can be loaded directly into Tableau. That is, if data.world does not work for you, use the csv files. a. [5 points] Good table design. You want to help a board game design company to analyze the current popular board game data from the website BoardGameGeek. Create a well-designed table to visualize the data contained in popular_board_game.csv. You can use any tool (e.g., Excel, HTML, Tableau) to create the table. If you choose to use a tool other than Tableau to make the table, you will still need to load the same data into Tableau for use in Q1b. The company is interested in grouping popular games into “support solo” (minimum player = 1) and “not support solo” (minimum player > 1), because single-player games require a different design strategy. Instructions: Your table should clearly communicate information about these two groups (games that support solo & games that do not support solo) simultaneously. For each group, show: 1. Total game count in each category (fighting, economic, …) 2. The most representative game (game with the most ratings) in each category. If more than one game have the same ratings, pick the game that you prefer. 3. Average rating of games in each category, rounded to the nearest 2 decimal places 4. Average playtime of games in each category, rounded to the nearest 2 decimal places 5. In the bottom left corner below your table include your GT username. In Tableau, this can be done by including a caption when exporting an image of a worksheet or by adding a text box to a dashboard. Refer to the tutorial here. 6. Save the table as table.png 7. In Tableau, to save a worksheet image, go to Worksheet – Export – Image. And to save a dashboard image, go to Dashboard – Export Image (Do not simply take a screenshot since your image should have a high resolution). Note: If there is no game under a particular group and category, think about how to visually represent missing data in your table. You may decide on the most meaningful column names to use, the number of columns, and the column order. Keep suggestions from lecture in mind when designing your table. You are not limited to use only the techniques described in lecture. For OMS students, the online lecture video pertaining to this topic is Week 4 – Fixing Common Visualization Issues – Fixing Bar Charts, Line Charts. For campus student, please review slide 52 and onwards of the lecture slides. 5 version 0 b. [10 points] Grouped bar chart. You want to help this board game design company better understand the relationship between game playtime and game category among popular board games. Visualize popular_board_game.csv as a grouped bar chart. Your chart should display game category (e.g., fighting, economic) along the horizontal axis and game count along the vertical axis. Also show game playtime (e.g.,


[SOLVED] Cse 6242 / cx 4242 hw 1: end-to-end analysis of tmdb data, argo-lite, sqlite, d3 warmup, openrefine, flask

Homework Overview Vast amounts of digital data are generated each day, but raw data are often not immediately “usable.” Instead, we are interested in the information content of the data: what patterns are captured? This assignment covers a few useful tools for acquiring, cleaning, storing, and visualizing datasets. In Question 1 (Q1), you will collect data using an API for The Movie Database (TMDb). You will construct a graph representation of this data that will show which actors have acted together in various movies, and use Argo Lite to visualize this graph and highlight patterns that you find. This exercise demonstrates how visualizing and interacting with data can help with discovery. In Q2, you will construct a TMDb database in SQLite, with tables capturing information such as how well each movie did, which actors acted in each movie, and what the movie was about. You will also partition and combine information in these tables in order to more easily answer questions such as “which actors acted in the highest number of movies?”. In Q3, you will visualize temporal trends in movie releases, using a JavaScript-based library called D3. This part will show how creating interactive rather than static plots can make data more visually appealing, engaging and easier to parse. Data analysis and visualization is only as good as the quality of the input data. Real-world data often contain missing values, invalid fields, or entries that are not relevant or of interest. In Q4, you will use OpenRefine to clean data from Mercari, and construct GREL queries to filter the entries in this dataset. Finally, in Q5, you will build a simple web application that displays a table of TMDb data on a single-page website. To do this, you will use Flask, a Python framework for building web applications that allows you to connect Python data processing on the back end with serving a site that displays these results. Grading and Feedback The maximum possible score for this homework is 100 points. We will auto-grade Q1 and Q2 using the Gradescope platform. We believe our students (you all!) may benefit from being able to use Gradescope to obtain feedback as you work on these questions. Using Gradescope is optional — you can complete these questions without using it. If you decide to use Gradescope, keep the following important points in mind. 1. Every student will receive an email invitation to join Gradescope via the student’s email address listed on Canvas. We expect the invitations will arrive at your inboxes within a few hours after the release of this homework. 2. You may upload your code periodically to Gradescope to obtain feedback for your code. This is accomplished by having Gradescope auto-grade your submission using the same test cases that we will use to grade your work. The test cases’ results may help inform you of potential errors and ways to improve your code. 3. Your grades for Q1 and Q2 will be determined only based on what you submit to Canvas, as some students may not choose to use Gradescope to test their code. We will ignore any code that you may have uploaded to Gradescope (and we will not grade it). In other words, if you decide to use Gradescope to test your work — and when you are happy with your work, you must submit that work via Canvas for us to grade it. 3 Version 0 4. When a lot of students use Gradescope, it is possible for it slow down or fail to communicate with the tester, and it can become even slower as the submission deadline approaches. 
You are responsible for submitting your work in time. Q1 [40 points] Collect data from TMDb and visualize co-actor network Q1.1 [30 points] Collect data from TMDb and build a graph For this Q1.1, you will be using and submitting a python file. Complete all tasks according to the instructions found in submission.py to complete the Graph class, the TMDbAPIUtils class, and the two global functions. The Graph class will serve as a re-usable way to represent and write out your collected graph data. The TMDbAPIUtils class will be used to work with the TMDB API for data retrieval. NOTE: You must only use a version of Python ≥ 3.7.0 and < 3.8 for this question. This question has been developed, tested for these versions. You must not use any other versions (e.g., Python 3.8). While we want to be able to extend to more Python versions, the specified versions are what we can definitively support at this time. NOTE: You must only use the modules and libraries provided at the top of submission.py and modules from the Python Standard Library. Pandas and Numpy CANNOT be used — while we understand that they are useful libraries to learn, completing this question is not critically dependent on their functionality. In addition, to enable our TAs to provide better, more consistent support to our students, we have decided to focus on the subset of libraries. NOTE: We will call each function once in submission.py during grading. You may lose some points if your program runs for unreasonably long time, such as more than 10 minutes during “non-busy” times. The average runtime of the code during grading is expected to take approximately 4 seconds. When we grade, we will take into account what your code does, and aspects that may be out of your control. For example, sometimes the server may be under heavy load, which may significantly increase the response time (e.g., the closer it is to HW1 deadline, likely the longer the response time!). a) [10 pts] Implementation of the Graph class according to the instructions in submission.py b) [10 pts] Implementation of the TMDbAPIUtils class according to the instructions in submission.py. You will use version 3 of the TMDb API to download data about actors and their co-actors. To use the TMDb API: o Create a TMDb account and obtain your client id / client secret which are required to obtain an authentication Token. Refer to this document for detailed instructions (log in using your GT account). o Refer to the TMDB API Documentation as you work on this question. The documentation contains a helpful ‘try-it-out’ feature for interacting with the API calls. c) [10 pts] Producing correct nodes.csv and edges.csv. You must upload your nodes.csv and edges.csv file as directed in Q1.2. NOTE: Q1.2 builds on the results of Q1.1 4 Version 0 Q1.2 [10 points] Visualizing a graph of co-actors using Argo-Lite Using Argo Lite, visualize a network of actors and their co-actors. You can access Argo Lite here You will produce an Argo Lite graph snapshot your edges.csv and nodes.csv from Q1.1.c. a. To get started, review Argo Lite’s readme on GitHub. Argo Lite has been open-sourced. b. Importing your Graph ● Launch Argo Lite ● From the menu bar, click ‘Graph’ → ‘Import CSV’. 
In the dialogue that appears: o Select ‘I have both nodes and edges file’ ● Under Nodes, use ‘Choose File’ to select nodes.csv from your computer o Leave ‘Has Headers’ selected o Verify ‘Column for Node ID’ is ‘id’ ● Under Edges, use ‘Choose File’ to select edges.csv from your computer o Verify ‘Column for Source ID’ is ‘source’ o Select ‘Column for Target ID’ to ‘target’ o Verify ‘Selected Delimiter’ is ‘,’ ● At the bottom of the dialogue, verify that ‘After import, show’ is ‘All Nodes’ ● The graph will load in the window. Note that the layout is paused by default; you can select to ‘Resume’ or ‘Pause’ layout as needed. ● Dragging a node will ‘pin’ it, freezing its position. Selecting a pinned node, right clicking it, then choosing ‘unpin selected’ will unpin that node, so its position will once again be computed by the graph layout algorithm. Experiment with pinning and unpinning nodes. c. [7 points] Setting graph display options ● On “Graph Options” panel, under ‘Nodes’ → ‘Modifying All Nodes’, expand ‘Color’ menu o Select Color by ‘degree’, with scale: ‘Linear Scale’ o Select a color gradient of your choice that will assign lighter colors to nodes with higher node degrees, and darker colors to nodes with lower degrees ● Collapse the ‘Color’ options, expand the ‘Size’ options. o Select ‘Scale by’ to ‘degree’, with scale: Linear Scale’ o Select meaningful Size Range values of your choice or use the default range. ● Collapse the ‘Size’ options ● On the Menu, click ‘Tools’ → ‘Data Sheet’ ● Within the ‘Data Sheet’ dialogue: o Click ‘Hide All’ o Set ‘10 more nodes with highest degree’ o Click ‘Show’ and then close the ‘Data Sheet’ dialogue ● Click and drag a rectangle selection around the visible nodes ● With the nodes selected, configure their node visibility by setting the following: o Go to ‘Graph Options’ → ‘Labels’ o Click ‘Show Labels of Selected Nodes’ o At the bottom of the menu, select ‘Label By’ to ‘degree’ o Adjust the ‘Label Length’ so that the full text of the actor name is displayed ● On the Menu, click ‘Tools’ -> ‘Filters’ -> ‘Show All Nodes’ The result of this workflow yields a graph with the sizing and coloring depending upon the node degree and the nodes with the highest degree are emphasized by showing their labels. ● d. [3 points] Designing a meaningful graph layout 5 Version 0 Using the following guidelines, create a visually meaningful and appealing layout: ● Reduce as much edge crossing as possible ● Reduce node overlap as much as possible ● Keep the graph compact and symmetric as possible ● Use the nodes’ spatial positions to convey information (e.g., “clusters” or groups) ● Experiment with showing additional node labels. If showing all node labels creates too much visual complexity, show at least 10 “important” nodes. You may decide what “importance” mean to you. For example, you may consider nodes (actors) having higher connectivity as potentially more “important” (based on how the graph is built). The objective of this task is to familiarize yourself with basic, important graph visualization features. Therefore, this is an open-ended task, and most designs receive full marks. So please experiment with Argo Lite’s features, changing node size and shape, etc. In practice, it is not possible to create “perfect” visualizations for most graph datasets. The above guidelines are ones that generally help. However, like most design tasks, creating a visualization is about making selective design compromises. 
Some guidelines could create competing demands, and following all of them may not guarantee a “perfect” design. If you want to save your Argo Lite graph visualization snapshot locally to your device, so you can continue working on it later, we recommend the following workflow. ● Select ‘Graph’ → ‘Save Snapshot’ o In the ‘Save Snapshot’ dialog, click ‘Copy to Clipboard’ o Open an external text editor program such as TextEdit or Notepad. Paste the clipboard contents of the graph snapshot, and save it to a file with a .json extension. You should be able to accomplish this with a default text editor on your computer by overriding the default file extension and manually entering ‘.json’. o You may save your progress by saving the snapshot and loading it into Argo Lite to continue your work. ● To load a snapshot, choose ‘Graph’ → ‘Open Snapshot’ ● Select the graph snapshot you created. NOTE: Q1.2 (d) will not be graded on Gradescope. We will give a qualitative score on the overall design and presentation of your graph visualization in Argo Lite. e. Publish and Share your graph snapshot ● Select ‘Graph’ → ‘Publish and Share Snapshot’ → ‘Share’ ● Next, click ‘Copy to Clipboard’ to copy the generated URL ● Return the URL in the return_argo_lite_snapshot() function in submission.py If you modify your graph after you publish and share a URL, you will need to re-publish and obtain a new URL for your latest graph. Only the graph snapshot shared via the URL will be graded. Deliverables: Place the files listed below in the Q1 folder. ● submission.py: the completed Python file Q2 [35 points] SQLite SQLite is a lightweight, serverless, embedded database that can easily handle multiple gigabytes of data. It is one of the world’s most popular embedded database systems. It is convenient to share data stored in an SQLite database — just one cross-platform file, which does not need to be parsed explicitly (unlike CSV files, which have to be parsed). You will modify the given Q2_SQL.py file by adding SQL statements to it. NOTE: You must only use a version of Python ≥ 3.7.0 and < 3.8 for this question. This question has been developed and tested for these versions. You must not use any other versions (e.g., Python 3.8). NOTE: Do not modify the import statements; everything you need to complete this question has been imported for you. You may not use other libraries for this assignment. A Sample class has been provided for you to see some sample SQL statements; you can turn off this output by changing the global variable SHOW to False. NOTE: This must be set to False before uploading to Gradescope and turning it in to Canvas. GTusername – Please update the GTusername method with your credentials. NOTE: For the questions in this section, you must only use INNER JOIN when performing a join between two tables. Other types of joins may result in incorrect results. a. [9 points] Create tables and import data. i. [2 points] Create two tables (via two separate methods) named movies and movie_cast with columns having the indicated data types: 1. movies 1. id (integer) 2. title (text) 3. score (real) 2. movie_cast 1. movie_id (integer) 2. cast_id (integer) 3. cast_name (text) 4. birthday (text) 5. popularity (real) ii. [2 points] Import the provided movies.csv file into the movies table and movie_cast.csv into the movie_cast table. 1. You will write Python code that imports the .csv files into the individual tables. This will include looping through the file and using the ‘INSERT INTO’ SQL command.
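For reference, here is a minimal sketch of what parts a.i–a.ii can look like with Python’s built-in sqlite3 and csv modules. The database name, the assumption that the CSV has no header row, and the exact method structure are placeholders — follow the method signatures and connection handling already provided in the Q2_SQL.py skeleton.

import csv
import sqlite3

# Minimal sketch only: connection handling and method names in the real
# Q2_SQL.py skeleton may differ.
connection = sqlite3.connect("sample.db")   # assumed database name
cursor = connection.cursor()

# Part a.i: create the movies table with the required column types.
cursor.execute("CREATE TABLE movies (id INTEGER, title TEXT, score REAL)")

# Part a.ii: loop through the CSV and insert each row (relative path only).
with open("movies.csv", newline="", encoding="utf-8") as f:
    for row in csv.reader(f):               # skip a header row here if the file has one
        cursor.execute(
            "INSERT INTO movies (id, title, score) VALUES (?, ?, ?)",
            (int(row[0]), row[1], float(row[2])),
        )
connection.commit()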
Only use relative paths while importing files since absolute/local paths are specific locations that exist only on your computer and will cause the auto-grader to fail. iii. [5 points] Vertical Database Partitioning. Database partitioning is an important technique that divides large tables into smaller tables, which may help speed up queries. For this question you will create a new table cast_bio from the movie_cast table (i.e., columns in cast_bio will be a subset of those in movie_cast). Do not edit the movie_cast table. Be sure that when you insert into the new cast_bio table the values are unique. Please read this page for an example of vertical database partitioning. cast_bio 1. cast_id (integer) 2. cast_name (text) 3. birthday (date) 4. popularity (real) b. [1 point] Create indexes. Create the following indexes for the tables specified below. This step increases the speed of subsequent operations; though the improvement in speed may be negligible for this small database, it is significant for larger databases. i. movie_index for the id column in movies table ii. cast_index for the cast_id column in movie_cast table iii. cast_bio_index for the cast_id column in cast_bio table c. [3 points] Calculate a proportion. Find the proportion of movies having a score > 50 and that have ‘war’ in the name. Treat each row as a different movie. The proportion should only be based on the total number of rows in the movies table. Format all decimals to two places using printf(). Do NOT use the ROUND() function as it does not work the same on every OS. Output format and sample value: 7.70 d. [4 points] Find the most prolific actors. List 5 cast members with the highest number of movie appearances that have a popularity > 10. Sort the results by the number of appearances in descending order, then by cast_name in alphabetical order. Output format and sample values (cast_name,appearance_count): Harrison Ford,2 e. [4 points] Find the highest scoring movies with the smallest cast. List the 5 highest-scoring movies that have the fewest cast members. Sort the results by score in descending order, then by number of cast members in ascending order, then by movie name in alphabetical order. Format all decimals to two places using printf(). Output format and sample values (movie_title,movie_score,cast_count): Star Wars: Holiday Special,75.01,12 War Games,58.49,33 f. [4 points] Get high scoring actors. Find the top ten cast members who have the highest average movie scores. Format all decimals to two places using printf(). ▪ Sort the output by average score in descending order, then by cast_name in alphabetical order. ▪ Do not include movies with score = 40. The view should have the format: good_collaboration( cast_member_id1, cast_member_id2, movie_count, average_movie_score) For symmetrical or mirror pairs, only keep the row in which cast_member_id1 has a lower numeric value. For example, for ID pairs (1, 2) and (2, 1), keep the row with IDs (1, 2). There should not be any “self pair” where the value of cast_member_id1 is the same as that of cast_member_id2. NOTE: Full points will only be awarded for queries that use joins for part g. Remember that creating a view will not produce any output, so you should test your view with a few simple select statements during development. One such test has already been added to the code as part of the auto-grading. NOTE: Do not submit any code that creates a ‘TEMP’ or ‘TEMPORARY’ view that you may have used for testing.
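To illustrate the expected shape of such a view and how to sanity-check it while developing, here is a hedged sketch. The join and grouping conditions below are placeholders — only the output column names and the rule that cast_member_id1 must be the smaller id come from the problem statement — and any movie-count or score thresholds required by the assignment still need to be added where indicated.

# Sketch of a view built from a self-join on movie_cast, plus a quick test SELECT.
cursor.execute("""
    CREATE VIEW good_collaboration AS
    SELECT a.cast_id    AS cast_member_id1,
           b.cast_id    AS cast_member_id2,
           COUNT(*)     AS movie_count,
           AVG(m.score) AS average_movie_score
    FROM movie_cast a
    INNER JOIN movie_cast b ON a.movie_id = b.movie_id
                            AND a.cast_id < b.cast_id   -- removes self pairs and mirror pairs
    INNER JOIN movies m ON m.id = a.movie_id
    GROUP BY a.cast_id, b.cast_id
    -- HAVING ...  add the movie_count / average-score conditions required by the assignment
""")

# Temporary check while developing (remember: do not submit TEMP views).
for row in cursor.execute("SELECT * FROM good_collaboration LIMIT 5"):
    print(row)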
Optional Reading: Why create views? i. [4 points] Find the best collaborators. Get the 5 cast members with the highest average scores from the good_collaboration view, and call this score the collaboration_score. This score is the average of the average_movie_score corresponding to each cast member, including actors in cast_member_id1 as well as cast_member_id2. Format all decimals to two places using printf(). • Sort your output by this score in descending order, then by cast_name alphabetically. Output format (cast_id,cast_name,collaboration_score): 2,Mark Hamil,99.32 1920,Winoa Ryder,88.32 h. [4 points] SQLite supports simple but powerful Full Text Search (FTS) for fast text-based querying (FTS documentation). Import movie overview data from the movie_overview.csv into a new FTS table called movie_overview with the schema: movie_overview ▪ id (integer) ▪ overview (text) NOTE: Create the table using fts3 or fts4 only. Also note that keywords like NEAR, AND, OR and NOT are case sensitive in FTS queries. i. [1 point] Count the number of movies whose overview field contains the word ‘fight’. Matches are not case sensitive. Match full words, not word parts/sub-strings. e.g., Allowed: ‘FIGHT’, ‘Fight’, ‘fight’, ‘fight.’. Disallowed: ‘gunfight’, ‘fighting’, etc. Output format: 12 ii. [2 points] Count the number of movies that contain the terms ‘space’ and ‘program’ in the 9 Version 0 overview field with no more than 5 intervening terms in between. Matches are not case sensitive. As you did in h(i)(1), match full words, not word parts/sub-strings. e.g., Allowed: ‘In Space there was a program’, ‘In this space program’. Disallowed: ‘In space you are not subjected to the laws of gravity. A program.’, etc. Output format: 6 Deliverables: Place all the files listed below in the Q2 folder 1. Q2_SQL.py: Modified file containing all the SQL statements you have used to answer parts a – h in the proper sequence. Q3 [15 points] D3 (v5) Warmup Read chapters 4-8 of Scott Murray’s Interactive Data Visualization for the Web, 2nd edition (sign in using your GT account, e.g., [email protected]). You may also briefly review chapters 1-3 if you need additional background on web development. This simple reading provides important foundation you will need for Homework 2. This question uses D3 version v5, while the book covers D3 v4. What you learn from the book is transferable to v5. In Homework 2, you will work with D3 extensively. NOTE the following important points: 1. We highly recommend that you use the latest Firefox browser to complete this question. We will grade your work using Firefox 79.0 (or newer). 2. For this homework, the D3 library is provided to you in the lib folder. You must NOT use any D3 libraries (d3*.js) other than the ones provided. 3. You may need to setup an HTTP server to run your D3 visualizations (depending on which web browser you are using, as discussed in the D3 lecture (OMS students: the video “Week 5 – Data Visualization for the Web (D3) – Prerequisites: JavaScript and SVG”. Campus students: see lecture PDF.). The easiest way is to use http.server for Python 3.x. Run your local HTTP server in the hw1-skeleton/Q3 folder. 4. We have provided sections of code along with comments in the skeleton to help you complete the implementation. While you do not need to remove them, you may need to write additional code to make things work. 5. All d3*.js files in the lib folder are referenced using relative paths in your html file. 
For example, since the file “Q3/index.html” uses d3, its header must reference the d3 script files using relative paths; it is incorrect to reference them with absolute paths that point to locations on your own machine. The 3 files that are referenced are: – lib/d3/d3.min.js – lib/d3-dsv/d3-dsv.min.js – lib/d3-fetch/d3-fetch.min.js 6. For a question that reads in a dataset, you are required to submit the dataset too (as part of your deliverable). In your html / js code, use a relative path to read in the dataset file. For example, since Q3 requires reading data from the q3.csv file, the path should be ‘q3.csv’ and NOT an absolute path such as “C:/Users/polo/HW1-skeleton/Q3/q3.csv”. Absolute/local paths are specific locations that exist only on your computer, which means your code will NOT run on our machines when we grade (and you will lose points). 7. You can and are encouraged (though not required) to decouple the style, functionality and markup in the code for each question. That is, you can use separate files for CSS, JavaScript and HTML — this is a good programming practice in general. Deliverables: Place all the files/folders listed below in the Q3 folder ● A folder named lib containing folders d3, d3-fetch, d3-dsv ● q3.csv: the file that we have provided you, in the hw1 skeleton under the Q3 folder, which contains the data that will be loaded into the D3 plot. ● index.(html / css / js): when run in a browser, it should display a barplot with the following specifications: a. [1.5 points] Load the data from q3.csv using D3 fetch methods. We recommend d3.dsv(). b. [2 points] The barplot must display one bar per row in the q3.csv dataset. Each bar corresponds to the running total of movies for a given year. The height of each bar represents the running total. The bars are ordered by ascending time with the earliest observation at the far left, i.e., 1880, 1890, …, 2000. c. [1 point] The bars must have the same fixed width, and there must be some space between two bars, so that the bars do not overlap. d. [3 points] The plot must have visible X and Y axes that scale according to the generated bars. That is, the axes are driven by the data that they are representing. Likewise, the ticks on these axes must adjust automatically based on the values within the datasets, i.e., they must not be hard-coded. e. [2 points] Set the x-axis label to ‘Year’ and the y-axis label to ‘Running Total’. f. [1 point] Use a linear scale for the Y axis to represent the running total (recommended function: d3.scaleLinear()). g. [3 points] Use a time scale for the X axis to represent year (recommended function: d3.scaleTime()). It may be necessary to use time parsing / formatting when you load and display the year data. The axis would be overcrowded if you displayed every year value, so set the X-axis ticks to display one tick for every 10 years. h. [1 point] Set the HTML title tag and display a title for the plot. ■ Position the title “Running Total of TMDb Movies by Year” above the barplot. ■ Set the HTML title tag (i.e., the <title> element) as well. i. [0.5 points] Add your GT username (usually includes a mix of letters and numbers) to the area beneath the bottom-right of the plot (see example image). The barplot should appear similar in style to the sample data plot provided below. Q4 [5 points] OpenRefine OpenRefine is a Java application and requires Java JRE to run. Download and install Java if you do not have it (you can verify by typing ‘java -version’ in your computer’s terminal or command prompt). a. Watch the videos on OpenRefine’s homepage for an overview of its features.
Then, download and install OpenRefine release 3.3. Do not use version 3.4 (which is in beta status). b. Import Dataset ● Run OpenRefine and point your browser at 127.0.0.1:3333. ● We use a products dataset from Mercari, derived from a Kaggle competition (Mercari Price Suggestion Challenge). If you are interested in the details, visit the data description page. We have sampled a subset of the dataset provided as “properties.csv”. ● Choose “Create Project” → This Computer → properties.csv”. Click “Next”. ● You will now see a preview of the data. Click “Create Project” at the upper right corner. c. Clean/Refine the data NOTE: OpenRefine maintains a log of all changes. You can undo changes. Use the “Undo/Redo” button at the upper left corner. Follow the exact output format specified in every part below. i. [0.5 point] Select the category_name column and choose ‘Facet by Blank’ (Facet → Customized Facets → Facet by blank) to filter out the records that have blank values in this column. Provide the number of rows that return True in Q4Observations.txt. Exclude these rows. 12 Version 0 Output format and sample values: i.rows: 500 ii. [1 point] Split the column category_name into multiple columns without removing the original column. For example, a row with “Kids/Toys/Dolls & Accessories” in the category_name column would be split across the newly created columns as “Kids”, “Toys” and “Dolls & Accessories”. Use the existing functionality in OpenRefine that creates multiple columns from an existing column based on a separator (i.e., in this case ‘/’) and does not remove the original category_name column. Provide the number of new columns that are created by this operation, excluding the original category_name column. Output format and sample values: ii.columns: 10 NOTE: There are many possible ways to split the data. While we have provided one way to accomplish this in step ii, some methods could create columns that are completely empty. In this dataset, none of the new columns should be completely empty. Therefore, to validate your output, we recommend that you verify that there are no columns that are completely empty, by sorting and checking for null values. iii. [0.5 points] Select the column name and apply the Text Facet (Facet → Text Facet). Cluster by using (Edit Cells → Cluster and Edit …) this opens a window where you can choose different “methods” and “keying functions” to use while clustering. Choose the keying function that produces the smallest number of clusters under the “Key Collision” method. Click ‘Select All’ and ‘Merge Selected & Close’. Provide the name of the keying function and the number of clusters that was produced. Output format and sample values: iii.function: fingerprint, 200 NOTE: Use the default Ngram size when testing Ngram-fingerprint. iv. [1 point] Replace the null values in the brand_name column with the text “Unknown” (Edit Cells – > Transform). Provide the General Refine Evaluation Language (GREL) expression used. Output format and sample values: iv.GREL_categoryname: endsWith(“food”, “ood”) v. [1 point] Create a new column high_priced with the values 0 or 1 based on the “price” column with the following conditions: if the price is greater than 90, high_priced should be set as 1, else 0. Provide the GREL expression used to perform this. Output format and sample values: v.GREL_highpriced: endsWith(“food”, “ood”) vi. 
[1 point] Create a new column has_offer with the values 0 or 1 based on the item_description column with the following conditions: If it contains the text “discount” or “offer” or “sale”, then set the value in has_offer as 1, else 0. Provide the GREL expression used to perform this. Convert the text to lowercase before you search for the terms. 13 Version 0 Output format and sample values: vi.GREL_hasoffer: endsWith(“food”, “ood”) Deliverables: Place all the files listed below in the Q4 folder ● properties_clean.csv : Export the final table as a comma-separated values (.csv) file. ● changes.json : Submit a list of changes made to file in json format. Use the “Extract Operation History” option under the Undo/Redo tab to create this file. ● Q4Observations.txt : A text file with answers to parts c.i, c.ii, c.iii, c.iv, c.v, c.vi. Provide each answer in a new line in the exact output format specified. Your file’s final formatting should result in a .txt file that has each answer on a new line followed by one blank line (to help visually separately the answers) Q5 [5 points] Introduction to Python Flask Flask is a lightweight web application framework written in Python that provides you with tools, libraries and technologies to quickly build a web application. It allows you to scale up your application as needed. You will modify the given file: • wrangling_scripts/wrangling.py NOTE: You must only use a version of Python ≥ 3.7.0 and < 3.8 for this question. This question has been developed, tested for these versions. You must not use any other versions (e.g., Python 3.8). NOTE: You must only use the modules and libraries provided at the top of submission.py and modules from the Python Standard Library (except Flask). Pandas and Numpy CANNOT be used — while we understand that they are useful libraries to learn, completing this question is not critically dependent on their functionality. In addition, to enable our TAs to provide better, more consistent support to our students, we have decided to focus on the subset of libraries. Username()- Update the username() method inside wrangling.py by including your GTUsername. • Get started by installing Flask on your machine by running pip install Flask (Note that you can optionally create a virtual environment by following the steps here. Creating a virtual environment is purely optional and can be skipped.) • To run the code, you must navigate to the Q5 folder in your terminal/command prompt and execute the following command: python run.py. After running the command go to https://127.0.0.1:3001/ on your browser. This will open up index.html showing a table in which the rows returned by data_wrangling() are displayed. • You must solve the following 2 sub-questions: a. [2 points] Read the top 100 rows using the data_wrangling() method. NOTE: The skeleton code by default reads all the rows from movies.csv. You must add the required code to ensure reading only the first 100 rows. The skeleton code already handles reading the table header for you. 14 Version 0 b. [3 points]: Sort the table in descending order of the values i.e., with larger values at the top and smaller values at the bottom of the table in the last (3rd) column. Deliverables: Place the file listed below in the Q5 folder ● wrangling.py : Submit wrangling.py file with your changes. 
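As a rough illustration of parts (a) and (b), a sketch of the row-limiting and sorting logic follows. The file path, the header handling, and the assumption that the third column holds numeric values are guesses — keep the structure the provided wrangling.py skeleton already uses.

import csv

def data_wrangling():
    # Sketch only: the real skeleton already reads movies.csv and its header.
    with open("data/movies.csv", newline="", encoding="utf-8") as f:   # assumed path
        reader = csv.reader(f)
        header = next(reader)
        table = []
        for i, row in enumerate(reader):
            if i >= 100:        # part (a): keep only the first 100 rows
                break
            table.append(row)
    # Part (b): sort by the last (3rd) column in descending order,
    # assuming that column holds numeric values stored as strings.
    table.sort(key=lambda row: float(row[2]), reverse=True)
    return header, table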
Extremely Important: folder structure & content of submission zip file We understand that some of you may work on this assignment until just prior to the deadline, rushing to submit your work before the submission window closes. Please take the time to validate that all files are present in your submission and that you have not forgotten to include any deliverables! If a deliverable is not submitted, you will receive zero credit for the affected portion of the assignment — this is a very sad way to lose points, since you have already done the work! You are submitting a single zip file named HW1-GTusername.zip (e.g., HW1-jdoe3.zip). The files included in each question’s folder have been clearly specified at the end of the question’s problem description. The zip file’s folder structure must exactly be (when unzipped): HW1-GTusername/ Q1/ submission.py Q2/ Q2_SQL.py Q3/ index.(html / js / css) q3.csv lib/ d3/ d3.min.js d3-fetch/ d3-fetch.min.js d3-dsv/ d3-dsv.min.js Q4/ properties_clean.csv changes.json Q4Observations.txt Q5/ wrangling.py


[SOLVED] CSE 6242 / CX 4242 homework 4: scalable PageRank via virtual memory (mmap), random forest, scikit-learn

Q1 [30 pts] Scalable single-machine PageRank on 70M edge graph In this question, you will learn how to use your computer’s virtual memory to implement the PageRank algorithm so that it scales to graph datasets with as many as billions of edges using a single computer (e.g., your laptop). As discussed in class, a standard way to work with larger datasets has been to use computer clusters (e.g., Spark, Hadoop), which may involve steep learning curves, may be costly (e.g., pay for hardware and personnel), and importantly may be “overkill” for smaller datasets (e.g., a few tens or hundreds of GBs). The virtual-memory-based approach offers an attractive, simple solution that allows practitioners and researchers to more easily work with such data (visit the NSF-funded MMap project’s homepage to learn more about the research). The main idea is to place the dataset in your computer’s (unlimited) virtual memory, as it is often too big to fit in RAM. When running algorithms on the dataset (e.g., PageRank), the operating system will automatically decide when to load the necessary data (a subset of the whole dataset) into RAM. This technical approach of putting data into your machine’s virtual memory space is called “memory mapping”, which allows the dataset to be treated as if it were an in-memory dataset. In your (PageRank) program, you do not need to know whether the data that you need is stored on the hard disk or kept in RAM. Note that memory-mapping a file does NOT cause the whole file to be read into memory. Instead, data is loaded and kept in memory only when needed (determined by strategies like least-recently-used paging and anticipatory paging). You will use the Python modules mmap and struct to map a large graph dataset into your computer’s virtual memory. The mmap() function does the “memory mapping”, establishing a mapping between a program’s (virtual) memory address space and a file stored on your hard drive — we call this file a “memory-mapped” file. Since memory-mapped files are viewed as a sequence of bytes (i.e., a binary file), your program needs to know how to convert bytes to and from numbers (e.g., integers). struct supports such conversions via “packing” and “unpacking”, using format specifiers that represent the desired endianness and data type to convert to/from. Q1.1 Set up PyPy Install PyPy, a Just-In-Time compilation runtime for Python that supports fast packing and unpacking. C++ and Java are generally faster than Python; however, several projects aim to boost Python speed, and PyPy is one of them. Ubuntu: sudo apt-get install pypy MacOS: install Homebrew, then run brew install pypy Windows: download the package and then install it. Run the following command in the Q1 directory to learn more about the helper utility that we have provided to you for this question: $ pypy q1_utils.py --help Q1.2 Warm Up (10 pts) Get started with memory mapping concepts using the code-based tutorial in warmup.py. You should study the code and modify parts of it as instructed in the file. You can run the tutorial code as-is (without any modifications) to test how it works (run “python warmup.py” on the terminal to do this). The warmup code is set up to pack the integers from 0 to 63 into a binary file, and unpack them back into a memory map object. You will need to modify this code to do the same thing for all odd integers in the range of 1 to 42. The lines that need to be updated are clearly marked. Note: You must not modify any other parts of the code.
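To make the packing/unpacking and memory-mapping pattern concrete, here is a small self-contained sketch. The file name, the range of values, and the ‘<q’ (little-endian long long) format specifier are illustrative only — follow the clearly marked lines in the provided warmup.py rather than copying this verbatim.

import mmap
import struct

# Pack some integers into a binary file, then read one back through a memory map.
values = range(0, 64)                       # illustrative values only
with open("example.bin", "wb") as f:
    for v in values:
        f.write(struct.pack("<q", v))       # '<q' = little-endian 8-byte integer

with open("example.bin", "r+b") as f:
    mm = mmap.mmap(f.fileno(), 0)           # map the whole file into virtual memory
    third = struct.unpack("<q", mm[2 * 8 : 3 * 8])[0]   # bytes 16..23 hold the 3rd value
    print(third)                            # prints 2
    mm.close()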
When you are done, you can run the following command to test whether it works as expected: $ python q1_utils.py test_warmup out_warmup.bin It prints True if the binary file created after running warmup.py contains the expected output. Q1.3 Implementing and running PageRank (20 pts) You will implement the PageRank algorithm, using the power iteration method, and run it on the LiveJournal dataset (an online community with millions of users to maintain journals and blogs). We recommend you revisit the MMap lecture to refresh your memory about the PageRank algorithm and the data structures and files that you may need to memory­map. (For more details, read the MMap paper.) You will perform three steps (subtasks) as described below. Step 1: Download the LiveJournal graph dataset (an edge list file) The LiveJournal graph contains almost 70 million edges. It is available on the SNAP website. We are hosting the graph, to avoid high traffic bombarding their site. Step 2: Convert the graph’s edge list to binary files (you only need to do this once) Since memory mapping works with binary files, you will convert the graph’s edge list into its binary format by running the following command at the terminal/command prompt: $ python q1_utils.py convert Example: Consider the following toy­graph.txt, which contains 7 edges: 0 1 1 0 1 2 2 1 3 4 4 5 5 2 To convert the graph to its binary format, you will type: $ python q1_utils.py convert toy­graph/toy­graph.txt This generates 3 files: toy­graph/ toy­graph.bin: binary file containing edges (source, target) in little­endian “int” C type toy­graph.idx: binary file containing (node, degree) in little­endian “long long” C type toy­graph.json: metadata about the conversion process (required to run pagerank) In toy­graph.bin we have, 0000 0000 0100 0000 # 0 1 (in little­endian “int” C type) 0100 0000 0000 0000 # 1 0 0100 0000 0200 0000 # 1 2 0200 0000 0100 0000 # 2 1 0300 0000 0400 0000 # 3 4 0400 0000 0500 0000 # 4 5 0500 0000 0200 0000 # 5 2 ffff ffff ffff ffff … ffff ffff ffff ffff ffff ffff ffff ffff In toy­graph.idx we have, 0000 0000 0000 0000 0100 0000 0000 0000 # 0 1 (in little­endian “long long” C type ) 0100 0000 0000 0000 0200 0000 0000 0000 # 1 2 … ffff ffff ffff ffff ffff ffff ffff ffff Note: there are extra values of ­1 (ffff ffff or ffff ffff ffff ffff) added at the end of the binary file as padding to ensure that the code will not break in case you try to read a value greater than the file size. You can ignore these values as they will not affect your code. Step 3: Implement and run the PageRank algorithm on LiveJournal graph’s binary files Follow the instructions in pagerank.py to implement the PageRank algorithm. You will only need to write/modify a few lines of code. Next, run the following command to execute your PageRank implementation: $ pypy q1_utils.py pagerank This will output the 10 nodes with the highest PageRank scores. For example: $ pypy q1_utils.py pagerank toy­graph/toy­graph.json node_id score 1 0.4106875 2 0.2542078125 0 0.1995421875 5 0.0643125 4 0.04625 3 0.025 (Note that only 6 nodes are printed here since the toy graph only has 6 nodes.) Step 4: Experiment with different number of iterations. Find the output for the top 10 nodes for the LiveJournal graph for n=10, 25, 50 iterations (try the ­­iterations n argument in the command above; the default number of iterations is 10). A file in the format pagerank_nodes_n.txt for “n” number of iterations will be created. 
For example: $ pypy q1_utils.py pagerank toy­graph/toy­graph.json ­­iterations 25 You may notice that while the top nodes’ ordering starts to stabilize as you run more iterations, the nodes’ PageRank scores may still change. The speed at which the PageRank scores converge depends on the PageRank vector’s initial values. The closer the initial values are to the actual pagerank scores, the faster they converge. Deliverables 1. warmup.py [6pt]: your modified implementation. 2. out_warmup.bin [3pt]: the binary file, automatically generated by your modified warmup.py. 3. out_warmup_bytes.txt [1pt]: the text file with the number of bytes, automatically generated by your modified warmup.py. 4. pagerank.py [14pt]: your modified implementation. 5. pagerank_nodes_n.txt [6pt]: the 3 files (as given below) containing the top 10 node IDs and their pageranks for n iterations, automatically generated by q1_utils.py. ○ pagerank_nodes_10.txt [2pt] for n=10 ○ pagerank_nodes_25.txt [2pt] for n=25 ○ pagerank_nodes_50.txt [2pt] for n=50 Q2 [50 pts] Random Forest Classifier Note: You must use Python 3.x for this question. You will implement a random forest classifier in Python. The performance of the classifier will be evaluated via the out­of­bag (OOB) error estimate, using the provided dataset. Note: You must not use existing machine learning or random forest libraries like scikit­learn. The dataset you will use is extracted from the UCI Bank Marketing dataset where each record is data related with direct marketing campaigns of a Portuguese banking institution. The dataset has been cleaned to remove missing attributes. The data is stored in a comma­separated file (csv) in your Q2 folder as hw4­data.csv. Each line describes an instance using 20 columns: the first 19 columns represent the attributes of the application, and the last column is the ground truth label for the term deposit subscription (0 means “not subscribed”, 1 means “subscribed”). Note: The last column should not be treated as an attribute. You will perform binary classification on the dataset to determine if a client will subscribe to a term deposit ot not. Essential Reading Decision Trees To complete this question, you need to develop a good understanding of how decision trees work. We recommend you review the lecture on decision tree. Specifically, you need to know how to construct decision trees using Entropy and Information Gain to select the splitting attribute and split point for the selected attribute. These slides from CMU (also mentioned in lecture) provide an excellent example of how to construct a decision tree using Entropy and Information Gain. Random Forests To refresh your memory about random forests, see Chapter 15 in the “Elements of Statistical Learning” book and the lecture on random forests. Here is a blog post that introduces random forests in a fun way, in layman’s terms. Out­of­Bag Error Estimate In random forests, it is not necessary to perform explicit cross­validation or use a separate test set for performance evaluation. Out­of­bag (OOB) error estimate has shown to be reasonably accurate and unbiased. Below, we summarize the key points about OOB described in the original article by Breiman and Cutler. Each tree in the forest is constructed using a different bootstrap sample from the original data. Each bootstrap sample is constructed by randomly sampling from the original dataset with replacement (usually, a bootstrap sample has the same size as the original dataset). 
Statistically, about one­third of the cases are left out of the bootstrap sample and not used in the construction of the kth tree. For each record left out in the construction of the kth tree, it can be assigned a class by the kth tree. As a result, each record will have a “test set” classification by the subset of trees that treat the record as an out­of­bag sample. The majority vote for that record will be its predicted class. The proportion of times that a predicted class is not equal to the true class of a record averaged over all records is the OOB error estimate. Starter Code We have prepared starter code written in Python for you to use. This would help you load the data and evaluate your model. The following files are provided for you: ● util.py: utility functions that will help you build a decision tree ● decision_tree.py: a decision tree class that you will use to build your random forest ● random_forest.py: a random forest class and a main method to test your random forest What you will implement Below, we have summarized what you will implement to solve this question. Note that you MUST use information gain to perform the splitting in the decision tree. The starter code has detailed comments on how to implement each function. 1. util.py: implement the functions to compute entropy, information gain, and perform splitting. 2. decision_tree.py: implement the learn() method to build your decision tree using the utility functions above. 3. decision_tree.py: implement the classify() method to predict the label of a test record using your decision tree. 4. random_forest.py: implement the functions _bootstrapping(), fitting(), voting() Note: You must achieve a minimum accuracy of 80% for random_forest. As you solve this question, you will need to think about multiple parameters in your design, some may be more straightforward to determine, some may be not (hint: study lecture slides and essential reading above). For example, ● Which attributes to use when building a tree? ● How to determine the split point for an attribute? ● When do you stop splitting leaf nodes? ● How many trees should the forest contain? Note that, as mentioned in lecture, there are other approaches to implement random forests. For example, instead of information gain, other popular choices include Gini index, random attribute selection (e.g., PERT ­ Perfect Random Tree Ensembles). We decided to ask everyone to use an information gain based approach in this question (instead of leaving it open­ended), to help standardize students solutions to help accelerate our grading efforts. Deliverables 1. hw4­data.csv: The dataset used to develop your program. Do not modify this file. 2. util.py [10 pts]: The source code of your utility functions. 3. decision_tree.py [30 pts]: The source code of your decision tree implementation. 4. Random_forest.py [10 pts]: The source code of your random forest implementation with appropriate comments. Q3 [30 points] Using Scikit­Learn Note: You must use Python 3.x for this question. Scikit­learn is a popular Python library for machine learning. You will use it to train some classifiers on the Epileptic Seizure Recognition[1] dataset in the folder, called seizure_dataset.csv. Q3.1 ­ Classifier Setup [7 pts] Train each of the following classifiers on the dataset, using the classes provided in the links below. You will do hyperparameter tuning in Q3.2 to get the best accuracy for each classifier on the dataset. 1. Linear Regression 2. Multi­Layer Perceptron 3. Random Forest 4. 
Support Vector Machine (The link points to SVC, which is a particular implementation of SVM by scikit.) Scikit has additional documentation on each of these classes, explaining them in more detail, such as how they work and how to use them. Use the skeleton file called hw4q3.py to write your code. In report.txt, under section Q3.1, follow the skeleton and put your training and testing accuracies for each classifier. Report your accuracies as percentages and round them to the nearest whole number, e.g., 85%. As a reminder, the general flow of your machine learning code will look like: 1. Load dataset 2. Preprocess (you will do this in Q3.2) 3. Split the data into x_train, y_train, x_test, y_test 4. Train the classifier on x_train and y_train 5. Predict on x_test 6. Evaluate testing accuracy by comparing the predictions from step 5 with y_test. Here is an example. Scikit has many other examples as well that you can learn from. Q3.2 ­ Hyperparameter Tuning [17 pts] Tune your Random Forest and SVM to obtain their best accuracies on the dataset. For Random Forest, tune the model on the unmodified test and train datasets. For SVM, either standardize or normalize the dataset before using it to tune the model. Note: If you are using StandardScaler: ­ Pass x_train into the fit method. Then transform both x_train and x_test to obtain the standardized versions of both. ­ The reason we fit only on x_train and not the entire dataset is because we do not want to train on data that was affected by the testing set. Tune the hyperparameters specified below, using the GridSearchCV function that Scikit provides: ­ For random forest, tune the parameters “n_estimators” and “max_depth”. ­ For SVM, tune “C” and “kernel” (try only ‘linear’ and ‘rbf’). Use 10 folds by setting the cv parameter to 10. You should test at least 3 values for each of the numerical parameters. For C, the values should be different by factors of at least 10, for example, 0.001, 0.01, and 0.1, or 0.0001, 0.1 and 100. In section Q3.2 of report.txt, state the values you tested for each hyperparameter. Also follow the skeleton in report.txt to report the best combination of hyperparameter values for each classifier tuned, its testing accuracy from Q3.1, and its best testing accuracy from tuning. For each classifier the best testing accuracy from tuning should be at least as high as the testing accuracy from Q3.1. Note: If GridSearchCV is taking a long time to run for SVM, make sure you are standardizing or normalizing your data beforehand. Q3.3 ­ Cross­Validation Results [2 pts] Let’s practice getting the results of cross­validation. For your SVM (only), report the mean training score, mean testing score and mean fit time for the best combination of hyperparameter values that you obtained in Q3.2. The GridSearchCV class holds a  ‘cv_results_’ dictionary that should help you report these metrics easily. Report the metrics in report.txt under the Q3.3 section. Report your accuracies as percentages and round them to the nearest whole number, for example 85%. Q3.4 ­ Best Classifier [4 pts] Out of all 4 classifiers (for Random Forest and SVM take the best one from GridSearchCV for each), assess which one performed the best. Use testing accuracies, fit time or a combination of both in your reasoning. Put your explanation in report.txt under section Q3.4, using at most 50 words. Deliverables ­ report.txt ­ A text file containing your results and explanations for all parts. ­ hw4q3.py ­ Skeleton file filled with your code from Q3.1­Q3.3. 
- seizure_dataset.csv - the original dataset. Submission Guidelines Submit the deliverables as a single zip file named HW4-{GT account username}.zip. Write down the name(s) of any students you have collaborated with on this assignment, using the text box on the Canvas submission page. The zip file’s directory structure must exactly be (when unzipped): HW4-{GT account username}/ Q1/ warmup.py out_warmup.bin out_warmup_bytes.txt pagerank.py pagerank_nodes_10.txt pagerank_nodes_25.txt pagerank_nodes_50.txt Q2/ hw4-data.csv util.py decision_tree.py random_forest.py Q3/ report.txt hw4q3.py seizure_dataset.csv You must follow the naming convention specified above. [1] Derived from https://archive.ics.uci.edu/ml/datasets/Epileptic+Seizure+Recognition


[SOLVED] CSE 6242 / CX 4242 homework 3: Hadoop, Spark, Pig and Azure

Q1 [15 points] Analyzing a Graph with Hadoop/Java Imagine that your boss gives you a large dataset which contains an entire email communication network from a popular social network site. The network is organized as a directed graph where each node represents an email address and the edge between two nodes (e.g., Address A and Address B) has a weight stating how many times A wrote to B. You have been tasked with finding which people have sent the most emails. Your task is to write a MapReduce program in Java to report, for each node in the graph, the largest weight among all of the node’s weighted outbound edges. First, go over the Hadoop word count tutorial to familiarize yourself with Hadoop and some Java basics. You will be able to complete this question with only some knowledge about Java. You should have already loaded two graph files into HDFS and loaded into your HDFS file system in your VM. Each file stores a list of edges as tabseparated­values. Each line represents a single edge consisting of three columns: (source node ID, target node ID, edge weight), each of which is separated by a tab (t). Node IDs and weights are nonnegative integers. Below is a small toy graph, for illustration purposes (on your screen, the text may appear out of alignment). src tgt weight 110 10 3 200 10 1 150 200 30 110 100 10 200 110 15 110 130 67 Your program should not assume the edges to be sorted or ordered in any ways (i.e., your program should work even when the edge ordering is random). Your code should accept two arguments upon running. The first argument (args[0]) will be a path for the input graph file on HDFS (e.g., cse6242/graph1.tsv), and the second argument (args[1]) will be a path for output directory on HDFS (e.g., cse6242/q1output1). The default output mechanism of Hadoop will create multiple files on the output directory such as part­00000, part­00001, which will be merged and downloaded to a local directory by the supplied run script. Please use the run1.sh and run2.sh scripts for your convenience. The format of the output should be such that each line represents a node ID and the largest weight among all its outbound edges. The ID and the largest weight must be separated by a tab (t). Lines do not need to be sorted. The following example result is computed based on the toy graph above. Please exclude nodes that do not have outgoing edges (e.g., those email addresses which have not sent any communication). For the toy graph above, the output is as follows. 110 67 200 15 150 30 Deliverables 1. [5 points] Your Maven project directory including Q1.java. Please see detailed submission guide at the end of this document. You should implement your own MapReduce procedure and should not import external graph processing library. 2. [5 points] q1output1.tsv: the output file of processing graph1.tsv by run1.sh. 3. [5 points] q1output2.tsv: the output file of processing graph2.tsv by run2.sh. Q2 [25 pts] Analyzing a Large Graph with Spark/Scala on Databricks Tutorial: First, go over this Spark word count tutorial to get more background about Spark/Scala. Goal Your objectives: 1. Eliminate any duplicate rows. 2. Filter the graph such that only nodes containing an edge weight >= 5 are preserved. 3. Analyze the graph to find the nodes with the highest in­degree, out­degree, and total degree using DataFrame operations. 4. Download a new DataFrame to q2output.csv containing your analysis (schema provided below). You will analyze bitcoinalpha.csv [3] using Spark and Scala on the Databricks platform. 
This graph is a whotrusts­whom network of people who trade using Bitcoin on a platform called Bitcoin Alpha. For a node, the number of incoming edges is the in­degree of the node and the number of outgoing edges is called the outdegree. The total degree is the sum of all edges for a node. You should perform this task using the DataFrame API in Spark. Here is a guide that will help you get started on working with data frames in Spark. A template Scala notebook, q2­skeleton.dbc has been included in the HW3­Skeleton that reads in a sample graph file toygraph.csv. In the template, the input data is loaded into a dataframe, inferring the schema using reflection (Refer to the guide above). Note: You must use only Scala DataFrame operations for this task. You will lose points if you use SQL queries, Python, or R to manipulate a dataframe. You may find some of the following DataFrame operations helpful: toDF, join, select, groupBy, orderBy Upload the data file toygraph.csv and q2­skeleton.dbc to your Databricks workspace before continuing. Follow the Databricks Setup Guide for further instructions. Consider the following directed graph example and how to accomplish the stated objectives src tgt weight 1 2 5 1 3 5 1 4 5 2 1 5 2 3 5 2 5 5 3 4 5 4 2 5 4 5 5 4 6 5 5 2 5 1 2 5 3 1 4 1. Eliminate Duplicates: The second instance of src: 1 ­ tgt: 2 should be eliminated from graph snippet below. src tgt weight 1 2 5 1 3 5 . . . . . . 1 2 5 3 1 4 2. Filter the graph such that only nodes containing an edge weight >= 5 are preserved 3. Find node w/ highest in­degree, out­degree, and highest total degree. If we analyzed the toy graph, we would find the following: node out­degree in­degree total­degree 1 3 1 4 2 3 3 6 3 1 2 3 4 3 2 5 5 1 2 3 6 0 1 1 Nodes(s) with the highest in­degree : 2 Node(s) with the highest out­degree: 1, 2, 4 Node(s) with highest combined degree: 2 Notes If two or more nodes have the same out­degree, report the one with the lowest node id If two or more nodes have the same in­degree, report the one with the lowest node id If two or more nodes have the same total degree, report the one with the lowest node id 4. Create a dataframe to store your results using this schema: 3 columns, named: ‘v’, ‘d’, ‘c’ where: ­ v : vertex id ­ d : degree calculation (an integer value. one row with highest in­degree, a row w/ highest out­degree, a row w/ highest total degree ) ­ c : category of degree, containing one of three string values: ‘i’ : in­degree ‘o’ : out­degree ‘t’ : total­degree Your output will be downloaded as a .csv file that meets the following requirements: 1. Your output shall contain exactly 4 rows. (1 header row + 3 data rows) 2. Your output shall contain exactly the column order specified. 3. The order of rows does not matter. A correct output .csv for the input file toygraph.csv would look like: v,d,c 2,3,i 1,3,o 2,6,t Whereas: Node 1 has highest out­degree with a value of 3 Node 2 has highest in­degree with a value of 3 Node 2 has highest total degree with a value of 6 Deliverables 1. [10 pts] a. q2.dbc Your solution as Scala Notebook archive file (.dbc) exported from Databricks. See the Databricks Setup Guide on creating an exportable archive for details. b. q2.scala, Your solution as a Scala source file exported from Databricks. See the Databricks Setup Guide on creating an exportable source file for details. Notes: you are exporting your solution as both a .dbc & a .scala file. 2. [15 pts] q2output.csv: The output file of processing bitcoinalpha.csv from the q2 notebook file. 
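If you want to double-check the degree definitions and tie-breaking rules by hand, the plain-Python sketch below reproduces the toy-graph result shown above; it is only a sanity check — the submitted solution must use Scala DataFrame operations as stated earlier.

from collections import Counter

# Sanity check of the toy-graph analysis (not a substitute for the Scala notebook).
edges = {(1, 2, 5), (1, 3, 5), (1, 4, 5), (2, 1, 5), (2, 3, 5), (2, 5, 5),
         (3, 4, 5), (4, 2, 5), (4, 5, 5), (4, 6, 5), (5, 2, 5), (3, 1, 4)}

edges = {(s, t, w) for (s, t, w) in edges if w >= 5}    # the set already removed the duplicate row
out_deg = Counter(s for s, _, _ in edges)
in_deg = Counter(t for _, t, _ in edges)
nodes = set(out_deg) | set(in_deg)
total = {n: out_deg[n] + in_deg[n] for n in nodes}

# Highest degree in each category; ties go to the lowest node id.
print(max(nodes, key=lambda n: (in_deg[n], -n)))    # 2  (highest in-degree)
print(max(nodes, key=lambda n: (out_deg[n], -n)))   # 1  (highest out-degree; ties with 2 and 4)
print(max(nodes, key=lambda n: (total[n], -n)))     # 2  (highest total degree)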
Q3 [35 points] Analyzing Large Amount of Data with Pig on AWS You will try out Apache Pig for processing n­gram data on Amazon Web Services (AWS). This is a fairly simple task, and in practice you may be able to tackle this using commodity computers (e.g., consumer­grade laptops or desktops). However, we would like you to use this exercise to learn and solve it using distributed computing on Amazon EC2, and gain experience (very helpful for your career), so you will be prepared to tackle problems that are more complex. The services you will primarily be using are Amazon S3 storage, Amazon Elastic Cloud Computing (EC2) virtual servers in the cloud, and Amazon Elastic MapReduce (EMR) managed Hadoop framework. For this question, you will only use up a very small fraction of your $100 credit. AWS allows you to use up to 20 instances in total (that means 1 master instance and up to 19 core instances) without filling out a “limit request form”. For this assignment, you should not exceed this quota of 20 instances. Refer to details about instance types, their specs, and pricing. In the future, for larger jobs, you may want to use AWS’s pricing calculator. AWS Guidelines Please read the AWS Setup Guidelines provided to set up your AWS account. Datasets In this question, you will use subsets of the Google books n-grams dataset (full dataset for reference), on which you will perform some analysis. An ‘n-gram’ is a phrase with n words; the full n-gram dataset lists n-grams present in the books on books.google.com along with some statistics. You will perform your analysis on two custom datasets, extracted from the Google books bigrams (2­grams), that we have prepared for you: a small one s3://cse6242oan­2018fall­aws­small (~1GB) and a large one s3://cse6242oan­2018fall­aws­big (~150GB). VERY IMPORTANT: Both the datasets are in the US East (N. Virginia) region. Using machines in other regions for computation would incur data transfer charges. Hence, set your region to US East (N. Virginia) in the beginning (not Oregon, which is the default). This is extremely important otherwise your code may not work and you may be charged extra. The files in these two S3 buckets are stored in a tab (‘t’) separated format. Each line is in the following format: n­gram TAB year TAB occurrences TAB books NEWLINE Some example lines: I am 1936 342 90 I am 1945 211 25 I am 1951 47 12 very cool 1923 118 10 very cool 1980 320 100 very cool 2012 994 302 very cool 2017 1820 612 The above lines tell us that, in 1936, the bigram “I am” appeared 342 times in 90 different books. In 1945, “I am” appeared 211 times in 25 different books. And so on. Goal Output the 15 bigrams having the highest average number of appearances per book along with their corresponding averages, in tab­separated format, sorted in descending order. Only consider entries with at least 300 occurrences and at least 12 books. If multiple bigrams have the same average, order them alphabetically. For the example above, the output will be: I am 3.80 very cool 3.09 Refer to the calculations given below I am (342) / (90) = 3.80 very cool (320 + 994 + 1820) / (100 + 302 + 612) = 3.09 Sample Output To help you evaluate the correctness of your output, we provide you with the output for the small dataset. Note: Please strictly follow the formatting requirements for your output as shown in the small dataset output file. You can use https://www.diffchecker.com/ to make sure the formatting is correct. Improperly formatting outputs may not receive any points. 
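The required computation itself is small; the plain-Python restatement below reproduces the sample calculation above and may help you verify your understanding before writing the Pig script — the actual job must, of course, be implemented in Pig on EMR.

from collections import defaultdict

# Restates the goal on the sample lines: filter, aggregate per bigram, average, sort.
lines = [
    "I am\t1936\t342\t90",
    "I am\t1945\t211\t25",
    "I am\t1951\t47\t12",
    "very cool\t1923\t118\t10",
    "very cool\t1980\t320\t100",
    "very cool\t2012\t994\t302",
    "very cool\t2017\t1820\t612",
]

totals = defaultdict(lambda: [0, 0])            # bigram -> [occurrences, books]
for line in lines:
    bigram, year, occurrences, books = line.split("\t")
    if int(occurrences) >= 300 and int(books) >= 12:   # keep qualifying entries only
        totals[bigram][0] += int(occurrences)
        totals[bigram][1] += int(books)

averages = {b: occ / bks for b, (occ, bks) in totals.items()}
top15 = sorted(averages.items(), key=lambda kv: (-kv[1], kv[0]))[:15]
for bigram, avg in top15:
    print(f"{bigram}\t{avg:.2f}")               # prints: I am 3.80, very cool 3.09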
Using PIG (Read these instructions carefully) There are two ways to debug PIG on AWS (all instructions are in the AWS Setup Guidelines): 1. Use the interactive PIG shell provided by EMR to perform this task from the command line (grunt). Refer to Section 8: Debugging in the AWS Setup Guidelines for a detailed step­by­step procedure. You should use this method if you are using PIG for the first time as it is easier to debug your code. However, as you need to have a persistent ssh connection to your cluster until your task is complete, this is suitable only for the smaller dataset. 2. Upload a PIG script with all the commands which computes and direct the output from the command line into a separate file. Once you verify the output on the smaller dataset, use this method for the larger dataset. You don’t have to ssh or stay logged into your account. You can start your EMR job, and come back after a few hours when the job is complete! Note: In summary, verify the output for the smaller dataset with Method 1 and submit the results for the bigger dataset using Method 2. Sample Commands: Load data in PIG To load the data from the s3://cse6242oan­2018fall­aws­small bucket into a PIG table, you can use the following command: grunt> bigrams = LOAD ‘s3://cse6242oan­2018fall­aws­small/*’ AS (bigram:chararray, year:int, occurrences:int, books:int); Note: ● Refer to other commands such as LOAD, USING PigStorage, FILTER, GROUP, ORDER BY, FOREACH, GENERATE, LIMIT, STORE, etc. ● Copying the above commands directly from the PDF and pasting on console/script file may lead to script failures due to the stray characters and spaces from the PDF file. ● Your script will fail if your output directory already exists. For instance, if you run a job with the output folder as s3://cse6242oan­/output­small, the next job which you run with the same output folder will fail. Hence, please use a different folder for the output for every run. ● You might also want to change the input data type for occurrences and books to handle floating point values. ● While working with the interactive shell (or otherwise), you should first test on a small subset of the data instead of the whole data (the whole data is over 100 GB). Once you believe your PIG commands are working as desired, you can use them on the complete data and wait since it will take some time. Deliverables ● pig­script.txt: The PIG script for the question (using the larger data set). ● pig­output.txt: Output (tab­separated) (using the larger data set). Note: Please strictly follow the guidelines below, otherwise your solution may not be graded. ● Ensure that file names (case sensitive) are correct. ● Ensure file extensions (.txt) are correct. ● The size of each pig­script.txt and pig­output.txt file should not exceed 5 KB. ● Double check that you are submitting the correct set of files ­­­ we only want the script and output from the larger dataset. Also double check that you are writing the right dataset’s output to the right file. ● Ensure that unnecessary new lines, brackets, commas etc. aren’t in the file. ● Please use tabs (not space) in the output file for separating the 2 columns. Q4 [35 points] Analyzing a Large Graph using Hadoop on Microsoft Azure VERY IMPORTANT: Use Firefox or Chrome in incognito/private browsing mode when configuring anything related to Azure (e.g., when using Azure portal), to prevent issues due to browser caches. Safari sometimes loses connections. 
Goal The goal is to analyze graph using Microsoft Azure, and your task is to write a MapReduce program to compute the distribution of a graph’s node degree differences (see example below). Note that this question shares some similarities with Question 1 (e.g., both are analyzing graphs). Question 1 can be completed using your own computer. This question is to be completed using Azure. We recommend that you first complete Question 1. You will use two data files in this questions: ● small.tsv [4] (zipped as 10MB small.zip; ~38MB when unzipped) ● large.tsv [5] (zipped as 900MB large.zip; ~3GB when unzipped) Each file stores a list of edges as tab­separated­values. Each line represents a single edge consisting of two columns: (Source, Target), each of which is separated by a tab. Node IDs are positive integers and the rows are already sorted by Source. Source Target 1 2 2 1 2 3 3 2 4 2 4 3 Your code should accept two arguments upon running. The first argument (args[0]) will be a path for the input graph file, and the second argument (args[1]) will be a path for output directory. The default output mechanism of Hadoop will create multiple files on the output directory such as part­00000, part­00001, which will have to be merged and downloaded to a local directory. The format of the output should be as follows. Each line of your output is of the format diff count where (1) diff is the difference between a node’s out­degree and in­degree (out­degree ­ in­degree); and (2) count is the number of nodes that have the value of difference (specified in 1). The out­degree of a node is the number of edges where that node is the Source. The in­degree of a node is the number of edges where that node is the Target. diff and count must be separated by a tab (t), and lines do not have to be sorted. The following result is computed based on the toy graph above. ­1 2 0 1 2 1 The explanation of the above example result: Output Explanation ­1 2 There are 2 nodes (node 2 and 3) whose degree difference is ­1 0 1 There is 1 node (node 1) whose degree is 0 2 1 There is 1 node (node 4) whose degree difference is 2 Hint: One way of doing it is using the mapreduce procedure twice. The first one for finding the difference between out­degree and in­degree for each node, the second for calculating the node count of each degree difference. You will have to make changes in the skeleton code for this. In the Q4 folder of the hw3-skeleton, you will find the following files we have prepared for you: ● src directory contains a main Java file that you will work on. We have provided some code to help you get started. Feel free to edit it and add your files in the directory, but the main class should be called “Q4”. ● pom.xml contains necessary dependencies and compile configurations for the question. To compile, you can run the command: mvn clean package in the directory which contains pom.xml. This command will generate a single JAR file in the target directory (i.e. target/q4-1.0.jar). Creating Clusters in HDInsight using the Azure portal Azure HDInsight is an Apache Hadoop distribution. This means that it handles large amount of data on demand. The next step is to use Azure’s web-based management tool to create a Linux cluster. 
Creating Clusters in HDInsight using the Azure portal
Azure HDInsight is an Apache Hadoop distribution. This means that it handles large amounts of data on demand. The next step is to use Azure's web-based management tool to create a Linux cluster.
Follow the documentation here to create a new cluster, and make sure to use the following settings:
● Select "Quick Create" instead of "Custom"
● "Subscription" drop down menu: select "Microsoft Azure Sponsorship 2"
● "Cluster type": choose "Hadoop 2.7.3 (HDI 3.6)"
At the end of this process, you will have created and provisioned a new HDInsight Cluster and Storage (the provisioning will take some time depending on how many nodes you chose to create). Please record the following important information for later use:
● Cluster login credentials
● SSH credentials
● Container credentials
VERY IMPORTANT: HDInsight cluster billing starts once a cluster is created and stops when the cluster is deleted. To save your credit, you should delete your cluster when it is no longer in use. Please refer to https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-delete-cluster.
Uploading data files to HDFS-compatible Azure Blob storage
We have listed the main steps from the documentation for uploading data files to your Azure Blob storage here:
1. Install the Azure CLI.
2. Open a command prompt, bash, or other shell, and use the az login command to authenticate to your Azure subscription. When prompted, enter the username and password for your subscription.
3. The az storage account list command will list the storage accounts for your subscription.
4. The az storage account keys list --account-name <storage-account-name> --resource-group <resource-group-name> command should return the Primary and Secondary keys. Copy the Primary key value because it will be used in the next steps.
5. The az storage container list --account-name <storage-account-name> --account-key <primary-key> command will list your blob containers.
6. The az storage blob upload --account-name <storage-account-name> --account-key <primary-key> --file <source-file> --container-name <container-name> --name <path>/<blob-name> command will upload the source file to your blob storage container.
Using these steps, upload small.tsv and large.tsv to your blob storage container. After that, write your Hadoop code locally and convert it to a jar file using the steps mentioned above.
Uploading your jar file to HDFS-compatible Azure Blob storage
Azure Blob storage is a general-purpose storage solution that integrates with HDInsight. Your Hadoop code should directly access files on the Azure Blob storage.
Upload the jar file created in the first step to Azure storage using the following command:
scp <path-to>/q4-1.0.jar <ssh-username>@<cluster-name>-ssh.azurehdinsight.net:
SSH into the cluster using the following command:
ssh <ssh-username>@<cluster-name>-ssh.azurehdinsight.net
Note: if you see the warning REMOTE HOST IDENTIFICATION HAS CHANGED, you may clean /home/<username>/.ssh/known_hosts. Refer to host identification.
Run the ls command to make sure that the q4-1.0.jar file is present.
To run your code on the small.tsv file, run the following command:
yarn jar q4-1.0.jar edu.gatech.cse6242.Q4 wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/<path>/small.tsv wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/smalloutput
Command format: yarn jar jarFile packageName.ClassName dataFileLocation outputDirLocation
Note: if "Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException…" occurs, you need to delete the output folder from your Blob. You can do this at portal.azure.com.
The output will be located in wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/smalloutput.
If there are multiple output files, merge the files in this directory using the following command:
hdfs dfs -cat wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/smalloutput/* > small.out
Command format: hdfs dfs -cat <location>/* > <outputFile>
Exit to your local machine: exit
Download the merged file to the local machine (this can be done either from https://portal.azure.com/ or by using the scp command from the local machine). Here is the scp command for downloading this output file to your local machine:
scp <ssh-username>@<cluster-name>-ssh.azurehdinsight.net:/home/<ssh-username>/small.out .
Using the above command from your local machine will download the small.out file into the current directory.
Repeat this process for large.tsv.
Deliverables
1. [15pt] Q4.java & q4-1.0.jar: Your Java code and the compiled jar file. You should implement your own map/reduce procedure and should not import external graph processing libraries.
2. [10pt] small.out: the output file generated after processing small.tsv.
3. [10pt] large.out: the output file generated after processing large.tsv.
Q5 [10 points] Regression: Automobile price prediction, using Azure ML Studio
Note: Create and use a free workspace instance at https://studio.azureml.net/ instead of your Azure credit for this question.
Goal
This question's main purpose is to introduce you to Microsoft Azure Machine Learning Studio and familiarize you with its basic functionality and typical machine learning workflow. Go through the "Automobile price prediction" tutorial and complete the tasks below.
You will modify the given file, results.csv, by adding your results for each of the tasks below. We will autograde your solution, therefore DO NOT change the order of the questions or anything else. Report the exact numbers that you get in your output; DO NOT round the numbers.
1. [3 points] Repeat the experiment mentioned in the tutorial and report the values of the metrics as mentioned in the 'Evaluate Model' section of the tutorial.
2. [3 points] Repeat the same experiment, change the 'Fraction of rows in the first output' value in the split module to 0.85 (originally set to 0.75), and report the corresponding values of the metrics.
3. [4 points] Evaluate the model with 3-fold cross-validation (CV); select the parameters in the 'Partition and Sample' module (see figure below). Report the value of Root Mean Squared Error (RMSE) for each fold. Specifically, you need to do the following:
A. Create a new model (Linear Regression)
B. Import the entire dataset (Automobile Price Data (Raw))
C. Clean the missing data by removing rows that have any missing values
D. Perform cross validation on the dataset obtained after Step C
Deliverables
1. [10pt] results.csv: a csv file containing results for all of the three parts.
Important: folder structure of the zip file that you submit
You are submitting a single zip file HW3-GTUsername.zip (e.g., HW3-jdoe3.zip, where "jdoe3" is your GT username), which must unzip to the following directory structure (i.e., a folder "HW3-jdoe3", containing folders "Q1", "Q2", etc.). The files to be included in each question's folder have been clearly specified at the end of each question's problem description above.
HW3-GTUsername/
  Q1/
    src/main/java/edu/gatech/cse6242/Q1.java
    pom.xml
    run1.sh
    run2.sh
    q1output1.tsv
    q1output2.tsv
    (do not attach the target directory)
  Q2/
    q2.dbc
    q2.scala
    q2output.csv
  Q3/
    pig-script.txt
    pig-output.txt
  Q4/
    src/main/java/edu/gatech/cse6242/Q4.java
    pom.xml
    q4-1.0.jar (from the target directory)
    small.out
    large.out
    (do not attach the target directory)
  Q5/
    results.csv
Version 5
[1] Graph derived from the LiveJournal social network dataset, with around 30K nodes and 320K edges.
[2] Graph derived from the LiveJournal social network dataset, with around 300K nodes and 69M edges.
[3] Graph derived from the Stanford Large Network Dataset Collection.
[4] Subset of the YouTube social network data.
[5] Subset of the Friendster data.


[SOLVED] Cse 6242/cx 4242 homework 2: d3 graphs and visualization

Q1 [10 points] Designing a good table. Visualizing data with Tableau.
Imagine you are a data scientist working with the data for the Men's Gymnastics World Championships and the Summer Olympics. Perform task a to help the Championships committee analyze the Men's Gymnastics data and task b to help the Olympics Committee better understand the overall Medals by Country.
a. [5 points] Good table design. Create a table to display the details of the Men's Gymnastics World Championships provided in worldchamps_mens_gymnastics.csv [1]. You can use any tool (e.g., Excel, HTML) to create the table.
● The table should contain data from the following columns: Name, Overall Rank, Nationality, Apparatus, Total Score
● Save the table as table.png.
You may reorder the columns while creating the table. Keep suggestions from lecture in mind when designing your table. You are not required to use only the techniques described in lecture. For OMS students, the online lecture video pertaining to this topic is Week 4 - Fixing Common Visualization Issues (Fixing Bar Charts, Line Charts). For campus students, please review slide 43 and onwards of the lecture slides.
b. [5 points] Tableau: Visualize the Summer Olympics data as a stacked bar chart. Your chart should display the Count of Medals for the Olympics using the dataset summer_olympics.csv [2] (in the Q1 folder). (Optional reading: the effectiveness of stacked bar charts is often debated; sometimes, they can be confusing to understand and may make data series comparison challenging.)
Our main goal here is for you to try out Tableau, a popular information visualization tool. Thus, we keep this part more open-ended, so you can practice making design decisions. We will accept most designs. We show one possible design in the figure below, based on the tutorial from Tableau, and you are not limited to the techniques presented there. Please follow the instructions below:
● Your design should visualize the Count of Medals during the years 1992-2012 (inclusive) for the countries: France, Germany, United Kingdom, United States, and China.
● Your design should use a stacked bar chart to show the count for each type of medal (Bronze, Silver, Gold) for the following Sports: Aquatics, Athletics, Football, Gymnastics, Rowing.
● For each country, there should be two stacked bars (one for each gender). Make sure the bars are grouped by country.
● Your design should have clearly labeled axes and a chart title. Include a legend for your chart.
● Save the chart as barchart.png.
Tableau has provided us with student licenses for Tableau Desktop, available for Mac and Windows. Go to tableau activation and select "Tableau Desktop". After the installation, you will be asked to add an activation key. The Desktop Key for activation is available on the course github repository as "Tableau Desktop Key - Fall 2018" at download. This key is for your use in this course only. Do not share the key with anyone. If you do not have access to a Mac or Windows machine, please use the 14-day trial version of Tableau Online:
1. Visit https://www.tableau.com/trial/tableau-online
2. Enter your information (name, email, GT details, etc.)
3. You will then receive an email to access your Tableau Online site
4. Go to your Site and create a workbook
Q1 Deliverables: The directory structure should be as follows:
Q1/
  table.png
  barchart.png
  worldchamps_mens_gymnastics.csv
  summer_olympics.csv
● table.png - an image/screenshot of the table in Q1.a (png format only).
● barchart.png - an image of the chart in Q1.b (png format only; Tableau workbooks will not be graded!). The image should be clear and of high quality.
● worldchamps_mens_gymnastics.csv and summer_olympics.csv - the datasets.
Q2 [15 points] Force-directed graph layout
You will experiment with many aspects of D3 for graph visualization. To help you get started, we have provided the graph.html file (in the Q2 folder).
Note: You are welcome to split graph.html into graph.html, graph.css, and graph.js. Please also make certain that any paths in your code are relative paths. Nonfunctioning code will result in a five-point deduction.
a. [3 points] Adding node labels: Modify graph.html to show a node label (the node name, i.e., the source) to the right of each node. If a node is dragged, its label must also move with the node.
b. [3 points] Coloring links: Color the links based on the "value" field in the links array. Assign the following colors: if the value of the edge is equal to 0, assign the color blue to the link; if the value of the edge is equal to 1, assign the color red to the link.
c. [3 points] Scaling node sizes: Scale the radius of each node in the graph based on the degree of the node (you may try a linear or squared scale, but you are not limited to these choices). Note: Regardless of which scale you decide to use, you should avoid extreme node sizes (e.g., nodes that are mere points, or barely visible, as well as very large nodes). Failure to do so will result in a poor-quality visualization.
d. Pinning nodes (fixing node positions):
1. [2 points] Modify the code so that when you double click a node, it pins the node's position such that it will not be modified by the graph layout algorithm (note: pinned nodes can still be dragged around by the user, but they will remain at their positions otherwise). Node pinning is an effective interaction technique to help users spatially organize nodes during graph exploration.
2. [2 points] Mark pinned nodes to visually distinguish them from unpinned nodes, e.g., pinned nodes are shown in a different color, border thickness, or visually annotated with an "asterisk" (*), etc.
3. [2 points] Double clicking a pinned node should unpin (unfreeze) its position and unmark it.
(A minimal sketch of the label and pinning interactions is shown after the Q2 deliverables below.)
Q2 Deliverables: The directory structure should be as follows:
Q2/
  graph.html
  graph.js, graph.css (if not included in graph.html)
● graph.html - the html file created.
● graph.(js / css) - the js / css files if not included in graph.html.
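A minimal sketch of the node-label (Q2.a) and pinning (Q2.d) interactions, assuming the D3 v3 force layout that d3.v3.min.js in the lib folder provides. The selections node, svg, graph, and force, the name field on each node, and the pinned CSS class are assumptions for illustration, not the actual contents of the provided graph.html.

// Toggle pinning on double click: fixed is the flag the D3 v3 force layout respects.
node.on("dblclick", function(d) {
    d.fixed = !d.fixed;                            // pin or unpin the node's position
    d3.select(this).classed("pinned", d.fixed);    // visually mark pinned nodes (e.g., different fill via CSS)
});

// Node labels that follow their node (extend the existing tick handler so labels track positions).
var labels = svg.selectAll(".nodelabel")
    .data(graph.nodes)
  .enter().append("text")
    .attr("class", "nodelabel")
    .attr("dx", 12).attr("dy", ".35em")
    .text(function(d) { return d.name; });         // assumes each node object carries a name field

force.on("tick", function() {
    node.attr("cx", function(d) { return d.x; }).attr("cy", function(d) { return d.y; });
    labels.attr("x", function(d) { return d.x; }).attr("y", function(d) { return d.y; });
});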
Q3 [15 points] Scatter plots
Use the dataset [3] provided in the file movies.csv (in the Q3 folder) to create scatter plots. Refer to the scatter plot tutorial here. Attributes in the dataset:
Feature 1: Id of the movie
Feature 2: Title
Feature 3: Year
Feature 4: Runtime (minutes)
Feature 5: Country
Feature 6: IMDb Rating
Feature 7: IMDb Votes
Feature 8: Budget (in USD)
Feature 9: Gross (in USD)
Feature 10: Wins and nominations
Feature 11: Is good rating? (value 1 means "good", value 0 means "bad")
Optional: to learn more about IMDb, visit https://en.wikipedia.org/wiki/IMDb
a. [8 points] Creating scatter plots:
1. [6 points] Create two scatter plots, one for each feature combination specified below. In the scatter plots, visualize "good rating" class instances as blue crosses, and "bad rating" instances as red circles. Add a legend to the top right corner showing the symbols' mapping to the classes.
■ Feature 10 (Wins and nominations) vs. Feature 6 (IMDb Rating)
● Figure title: Wins+Nominations vs. IMDb Rating
● X axis (horizontal) label: IMDb Rating
● Y axis (vertical) label: Wins+Noms
■ Feature 8 (Budget) vs. Feature 6 (IMDb Rating)
● Figure title: Budget vs. IMDb Rating
● X axis (horizontal) label: IMDb Rating
● Y axis (vertical) label: Budget
2. [2 points] In explanation.txt, use no more than 50 words to discuss which feature combination is better at separating the classes and why.
Note: Your scatter plots should be placed one after the other on a single HTML page, similar to the example image below (Figure 3). Note that your design need NOT be identical to the example.
b. [3 points] Scaling symbol sizes. Create a scatter plot (append to the HTML page) using the feature combination specified below. Set the size of each symbol to be proportional to the value of Feature 10 (Wins and nominations); use a good scaling coefficient to make the scatter plot legible, visually attractive, and meaningful. Visualize "good rating" class instances as blue crosses, and "bad rating" instances as red circles.
■ Feature 7 (IMDb Votes) vs. Feature 6 (IMDb Rating), sized by Feature 10 (Wins+Nominations)
● Figure title: Votes vs. IMDb Rating sized by Wins+Nominations
● X axis (horizontal) label: IMDb Rating
● Y axis (vertical) label: IMDb Votes
c. [4 points] Axis scales in D3. Create two plots for this part (append to the HTML page) to try out two axis scales in D3: the first plot uses the square root scale for its y-axis (only), and the second plot uses the log scale for its y-axis (only). In explanation.txt, explain when we may want to use the square root scale and the log scale in charts, in no more than 50 words. Note: the x-axes should be kept in linear scale, and only the y-axes are affected. Hint: You may need to carefully set the scale domain to handle the 0s in the data.
■ First Figure: uses the square root scale for its y-axis (only)
● Figure title: Wins+Nominations (square-root-scaled) vs. IMDb Rating
● X axis (horizontal) label: IMDb Rating
● Y axis (vertical) label: Wins+Noms
■ Second Figure: uses the log scale for its y-axis (only)
● Figure title: Wins+Nominations (log-scaled) vs. IMDb Rating
● X axis (horizontal) label: IMDb Rating
● Y axis (vertical) label: Wins+Noms
Figure 3: Example for scatter plots, on a single HTML page.
Q3 Deliverables: The directory structure should be organized as follows:
Q3/
  scatterplot.(html / js / css)
  explanation.txt
  scatter_plots.pdf
  movies.csv
● scatterplot.(html / js / css) - the html / js / css files created.
● explanation.txt - the text file explaining your observations for Q3.a.2 and Q3.c.
● scatter_plots.pdf - a PDF document showing screenshots of the five scatter plots created above (two for Q3.a.1, one for Q3.b, and two for Q3.c). You may print the HTML page as a PDF file, and each PDF page shows one plot (hint: use a CSS page break). Clearly title the plots as instructed (see examples in Figure 3).
● movies.csv - the dataset.
Q4 [15 points] Heatmap and Select Box
Example: 2D Histogram, Select Options
Use the dataset provided in heatmap.csv (in the Q4 folder) that describes the number and types of Charms cast by each of the wizarding houses across J.K. Rowling's 7 Harry Potter books. Visualize the data using D3 heatmaps.
a. [5 points] Create a file named heatmap.html.
Within this file, create a heatmap of the number of spells, for each spell type used in each book for the wizarding house ‘Gryffindor’. Place the Spell Type on the heatmap’s horizontal axis and the Book on its vertical axis. b. [1 point] The color scheme of a heatmap is a very important part of its design. The number of Spell Types for each Book should be represented by colors in the heatmap. Pick a meaningful color scheme (hint: color gradients) with 9 color gradations for the heatmap. c. [3 pt] Add axis labels and a legend to the chart. Place the name of the Book (“Sorcerer’s Stone”, “Chamber of Secrets”, “Prisoner of Azkaban”, etc.) on the vertical axis in the order of publication (the earliest book goes to the top). Place the Spell Type (“Hex”, “Counter Spell”, “Jinx”, etc.) on the horizontal axis. Order the Spell Types alphabetically, from left to right. d. [6 pt] Now create a drop down select box with D3 that is populated with the wizarding house names (“Gryffindor”, “Hufflepuff”, “Ravenclaw”, “Slytherin”). When the user selects a different house in this select box, the heatmap and the legend should both be updated with values corresponding to the selected house. Note the differences in the legends for house “Gryffindor” and “Ravenclaw” in Figure 4a and Figure 4b below. While the 9 color gradations in the legend remain the same, the thresholds values are different. The default house when the page loads should be “Gryffindor”. Note: 1. The Harry Potter dataset being used here has been synthetically generated. It doesn’t accurately resemble the count of spell types used per Wizarding House across different books in the Harry Potter world. 2. The data provided in heatmap.csv would need to be “reshaped” in such a way that it can produce the expected output. All data reshaping must only be performed in javascript; you must not modify heatmap.csv. That is, your code should first read the data from heatmap.csv file as is, then you may reshape that loaded data using javascript, and then use it to create the heatmap. 3. The threshold values should not be hardcoded. They do not necessarily have to match the ones provided in the screenshots below. The screenshots provided below serve as an example only. You are not expected to produce an exact copy of the screenshots. Please feel free to experiment with fonts, placement, colors, etc., as long as the output looks reasonable for a heatmap. Figure 4a: Number of spells of each type used by house Gryffindor for all books. Figure 4b: Number of spells of each type used by house Ravenclaw for all books. Q4 Deliverables: The directory structure should look like: Q4/ heatmap.(html / js /css) heatmap.csv ● heatmap.(html / js/ css) ­ the html / js / css files created. ● heatmap.csv ­ the dataset Q5 [20 points] Interactive Visualization Use the dataset [4] provided in the dataset.txt file (in the Q5 folder) to create an interactive bar chart. Each line in the file represents population growth (per year) of an US city over the past five years, starting with total population of year 2012. You will copy the data contained in dataset.txt and paste it, directly into your code as is, as an array variable named data, similar to what is shown below. Note: You must NOT modify or reorder the content of the data file; what you paste into your code should be the same content that the data file contains. If you believe you want to sort or order the data in any way (e.g., by population), do so using javascript. Example: var data=[];a. 
[5 points] Create a horizontal bar chart with its vertical axis denoting the city names (ordered by population) and its horizontal axis denoting the total cumulative population (with "," as the thousand separator) at the end of 2017. Each bar should have its associated total population shown on top of it. Refer to the example shown in Figure 5a. Note: The vertical axis of the chart should use city names as labels.
Figure 5a. Bars representing the total population of each city.
b. [10 points] When hovering the mouse over a bar, a smaller line chart representing the population growth of that city for each year (2013-2017) should be displayed in the top right corner. For example, Los Angeles has growth values of 0.84%, 0.79%, 0.78%, 0.70%, and 0.47% for the years 2013, 2014, 2015, 2016, and 2017 respectively. On hovering over the bar representing Los Angeles, a line chart depicting these 5 values in % is displayed. See Figure 5b for an example. The calculation to show the percentage of population growth is:
Figure 5b. On hovering over the bar for Los Angeles, a smaller line chart representing its percentage of growth per year in decimal over the past 5 years is displayed at the top right corner.
c. [3 points] On mouse out, the line chart should no longer be visible.
d. [2 points] On hovering over any horizontal bar representing a city, the color of the bar should change. You can use any color that is visually distinct from the regular bars. On mouseout, the color should be reset.
Q5 Deliverables: The directory structure should be as follows:
Q5/
  interactive.(html/js/css)
● interactive.(html/js/css) - The html, javascript, and css to render the visualization in Q5 (dataset.txt is NOT required to be included in the final directory structure, as the data provided in dataset.txt should have already been integrated into the "data" variable in your code).
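One possible way to wire up the hover behavior in Q5.b-d is sketched below, assuming D3 v3. The selections bars and lineChartGroup, the growth array on each bound datum, and the specific scales, sizes, and colors are illustrative assumptions, not required names.

// Show a small line chart of yearly growth on hover, recolor the bar, and clean up on mouseout.
bars.on("mouseover", function(d) {
        d3.select(this).style("fill", "orange");                   // any color distinct from the regular bars
        var x = d3.scale.linear().domain([2013, 2017]).range([0, 180]);
        var y = d3.scale.linear().domain([0, d3.max(d.growth)]).range([80, 0]);
        var line = d3.svg.line()
            .x(function(g, i) { return x(2013 + i); })
            .y(function(g) { return y(g); });
        lineChartGroup.append("path")                              // lineChartGroup sits in the top right corner
            .attr("class", "hover-line")
            .attr("d", line(d.growth))                             // d.growth holds the five per-year growth values
            .style("fill", "none").style("stroke", "steelblue");
    })
    .on("mouseout", function(d) {
        d3.select(this).style("fill", "steelblue");                // reset the bar color
        lineChartGroup.selectAll(".hover-line").remove();          // hide the line chart
    });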
Q6 [20 points] Choropleth Map of County Data
Example: Unemployment rates
Use the provided dataset [5] in education.csv, us.json, and education_details.csv (in the Q6 folder) and visualize them as a choropleth map.
● Each record in education.csv represents a county and is of the form (id, name, percent_educated), where
○ id corresponds to the county id
○ name is the county name
○ percent_educated is the percentage of educated people living in that county
● The education_details.csv file contains a list of records, each having four fields:
○ an id field corresponding to a county in the United States,
○ a qualified_professionals field corresponding to the number of professionals in the county,
○ high_school and middle_school_or_lower fields corresponding to the number of high school students and middle school students respectively.
● The us.json file is a TopoJSON topology containing three geometry collections: counties, states, and nation.
a. [15 points] Create a choropleth map using the provided datasets; use Figure 6 below as a reference.
1. [10 points] The color of each county should correspond to the percentage of educated people in that county, i.e., darker colors correspond to a higher percentage in that county and lighter colors correspond to a lower percentage in that county. Use gradients of only one particular hue. Use d3-queue (in the lib folder) to easily load data from multiple files into a function [6]. Use topojson (present in lib) to draw the choropleth map.
2. [5 points] Add a legend showing how colors map to the percentage of educated people. Create the legend using a threshold scale as shown in the figure. The domain used for the legend should be [0,90] (inclusive) with a step size of 10.
b. [5 points] Add a tooltip using the d3.tip library (in the lib folder) that, on hovering over a county, shows the (1) county name, (2) percentage of educated people, (3) number of qualified professionals, (4) high school graduates, and (5) middle school or lower graduates, each on a separate line. The tooltip should appear when the mouse hovers over the county. On mouseout, the tooltip should disappear. Use Figure 6 below as a reference. We recommend that you position the tooltip some distance away from the mouse cursor, which will prevent the tooltip from "flickering" as you move the mouse around quickly (the tooltip disappears when your mouse leaves a county and enters the tooltip's bounding box). Please ensure that the tooltip is fully visible (i.e., not clipped). Note: You must create the tooltip by only using d3.tip.v0.6.3.js present in the lib folder. (A minimal d3.tip usage sketch is shown after the Q6 deliverables below.)
Figure 6. Reference example for Choropleth Maps
Q6 Deliverables: The directory structure should be organized as follows:
Q6/
  q6.(html/js/css)
  education.csv
  education_details.csv
  us.json
● q6.(html/js/css) - The html/js/css files to render the visualization.
● education.csv and education_details.csv - The datasets used.
● us.json - Dataset needed to draw the map.
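A minimal sketch of the d3.tip usage required in Q6.b, assuming d3.tip.v0.6.3.js is loaded and that counties is the selection of county paths whose bound datum already carries the merged fields; the selection name and field names below are illustrative, not prescribed by the assignment.

// Tooltip showing the five requested fields, one per line.
var tip = d3.tip()
    .attr("class", "d3-tip")
    .offset([-10, 0])                                   // keep the tip away from the cursor to reduce flicker
    .html(function(d) {
        return "County: " + d.name + "<br>" +
               "Percent educated: " + d.percent_educated + "<br>" +
               "Qualified professionals: " + d.qualified_professionals + "<br>" +
               "High school: " + d.high_school + "<br>" +
               "Middle school or lower: " + d.middle_school_or_lower;
    });

svg.call(tip);                                           // attach the tip to the chart's svg
counties.on("mouseover", tip.show)                       // show on hover over a county
        .on("mouseout", tip.hide);                       // hide on mouseout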
Q7 [5 points] Pros and Cons of Visualization Tools
This is an open-ended question. Your answer will depend on what you have learned from working through the questions in this assignment, and on your personal experience. Pick a visualization system/tool/library/framework that you are familiar with (R, R Shiny, Python, Plotly, Excel, JMP, Matlab, Mathematica, Julia, etc.), then, using no more than 150 words in total, compare it with Tableau and D3 in terms of:
1. Ease to develop (for developers)
2. Ease to maintain the visualization (for developers)
3. Usability of the visualization developed (for end users)
4. Scalability of the visualization to "large" datasets
5. System requirements to run the visualization (e.g., browsers, OS, software licensing) (for end users)
We recommend formatting your answers as bullet lists for better readability. For example:
1. Ease to …
Seaborn: …
Tableau: …
D3: …
2. Ease to …
…
Q7 Deliverables: The directory structure should be as follows:
Q7/
  analysis.txt
● analysis.txt - comparison of visualization tools.
Important: folder structure of the zip file that you submit
You are submitting a single zip file HW2-GTUsername.zip (e.g., HW2-jdoe3.zip, where "jdoe3" is your GT username), which must unzip to the following directory structure (i.e., a folder "HW2-jdoe3", containing folders "Q1", "Q2", etc.). The files to be included in each question's folder have been clearly specified at the end of each question's problem description above.
HW2-GTUsername/
  lib/
    d3.v3.min.js
    d3.tip.v0.6.3.js
    d3-queue.v3.min.js
    topojson.v1.min.js
  Q1/
    table.png
    barchart.png
    worldchamps_mens_gymnastics.csv
    summer_olympics.csv
  Q2/
    …
  Q3/
    scatterplot.(html / js / css)
    explanation.txt
    scatter_plots.pdf
    movies.csv
  Q4/
    heatmap.(html / js / css)
    heatmap.csv
  Q5/
    interactive.html
  Q6/
    …
  Q7/
    analysis.txt
Version 1
[1] Source: https://www.kaggle.com/cjdaffern/gymnastic-champs-mens-all-round
[2] Source: https://www.kaggle.com/the-guardian/olympic-games
[3] Source: derived from a "movies" dataset prepared by Dr. Guy Lebanon for an earlier version of OMSCS CSE 6242 (the source raw data is available at the following URL; you do not need to download it when working on this question: https://s3.amazonaws.com/content.udacity-data.com/courses/gt-cs6242/project/movies_merged)
[4] Source: US City Populations
[5] Source: Derived from USDA.
[6] d3-queue evaluates a number of asynchronous tasks concurrently; in this question, each task would be loading one data file. When all tasks have finished, d3-queue passes the results to a user-defined callback function.
