You must take your assignment 1 application and move it to a Kubernetes environment (e.g. using k3s) while making it flexible enough to analyse multiple data sources. You must also implement testing and monitoring for the application, and create a serverless function that could potentially be useful for your application.

Get your application running using Kubernetes. Lab 06 has a good example to work from; the best approach is to create a Docker Hub repository for each service. You must scale the pods (using deployments), but be careful not to overload your system. Exactly how many replicas you have isn’t hugely important. What is more important is the ratio from one microservice to the next, e.g. do you have more web front ends or analysis microservices? You will need to explain why you scaled the way you did.

In addition, you will need to access another CSV file (GBvideos.csv from https://www.kaggle.com/datasnaek/youtube-new), this time containing YouTube video details. This file should be processed in parallel with the other log file, i.e. don’t just process the files one after the other in the same pod. Here you could choose to modify your existing data reading/extraction microservice to be more flexible (e.g. pass in the file as a parameter and the field(s) to look at) or have a microservice per file format. You can be a bit creative in coming up with an architecture; just explain your thinking in the Word form. If you want, you can change the statistics from assignment 1 to something else, e.g. something that applies to both Reddit and YouTube posts. In the Word form, include screenshots of the application running with the web page showing statistics, etc.

You are to address the testing and monitoring of your application. You are to develop 1 test and 1 monitoring solution for your application.

1) Test (9%) – explain what type of test you have created (i.e. why it is functional or non-functional).
Provide the following:
– A paragraph on why you chose to implement the test
– Screenshots or video of the test in action
– Any supporting files, e.g. test scripts, along with instructions on how to run the test.

Best marks will be provided where you can automate the tests (or at least explain how the tests could be automated).

In the lectures, we looked at Gatling as a way to load test your application. We also looked at Postman, but this would need an API in your application, which you may not have unless you are using REST (you might implement REST somewhere in your app for demo purposes). However, you may choose an alternative tool / approach.

2) Monitoring solution (9%) – develop, using whatever tool(s) you choose, a monitoring solution for your application. Explain what type of monitoring you are doing and why you have taken the approach you did. Provide the following:
– A paragraph on why you chose to implement the monitor
– Screenshots or video of the monitor in action
– Any supporting files, e.g. config files, along with instructions on how to run the monitor.

Best marks will be provided where you can visualize the monitoring in some way, e.g. a chart. You may not succeed in setting everything up, but document what you did get working with any relevant screenshots and some credit can be given for effort.

Using Kubeless, create a function that would be useful for your application. It does not have to be called from within your application – it should run as per the quickstart guide from the command line. However, you must explain where in your application it would be placed and how it would be useful. Explain the inputs and outputs to/from your function.

See the following quick start guide: https://kubeless.io/docs/quick-start/

The hello-world function there is not much use. Come up with a useful function for your application. The function does not have to be embedded into your application – i.e. just run it from the command line as per the quick start guide.
Include a screenshot of the function being run with its return value(s).
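As one possible shape for such a function, the sketch below follows the Python handler convention shown in the Kubeless quick start (a function that takes `event` and `context` and returns a value). The handler name `title_stats` and the JSON input format with a "title" field are assumptions made purely for illustration; they are not part of the assignment.

```python
import json


def title_stats(event, context):
    """Hypothetical Kubeless-style handler: summarise one post title.

    Input  (event["data"]): a dict or JSON string with a "title" field
                            (an assumed format, not prescribed by the brief).
    Output: a JSON string holding the word and character counts of the title.
    """
    data = event["data"]
    if isinstance(data, (str, bytes)):
        data = json.loads(data)
    title = data.get("title", "")
    # str.split() with no arguments splits on any run of whitespace.
    return json.dumps({"words": len(title.split()), "characters": len(title)})
```

A function like this could sit between the data reading microservice and the analytics microservice, so that per-title statistics are computed on demand rather than inside a long-running pod. It can be deployed and called from the command line in the same way as the hello-world example in the quick start guide.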
Thus far, we have covered the concept of microservices, bounded contexts and various architectural and integration concepts. We have also covered tools and technologies that can help us build and deploy microservices (see lab documents 1 to 4). This assignment requires that you analyse a requirement and devise an appropriate architecture and implementation.

You are to analyse a stream of data from Reddit. You will simulate the stream of data by reading posts from a data file and serving them through a stream channel to an analytics client. The client should analyse each post as it arrives and calculate 4 different types of metric / result:

There should be a single web page that displays those metrics / results whenever the page is loaded or refreshed. It is up to you how and where the web page gets the data.

It is recommended that on average 2 posts per second should be streamed, though there should be some random variability. The file r_dataisbeautiful_posts.csv has more than 190,000 posts in CSV format. All the posts are from the “Data is Beautiful” subreddit. The file was downloaded from https://www.kaggle.com/unanimad/dataisbeautiful and has the following comma separated fields:

Id, title, score, author, author_flair_text, removed_by, total_awards_received, awarders, created_utc, full_link, num_comments, over_18

There is no post text or comments associated with the posts, just the post title.

There are likely to be at least 3 microservices in your architecture, e.g. one that reads the posts and streams them to a client connection, the client microservice that does the analytics, and a web server microservice that serves out that single web page. There may also be a data storage microservice.

Use gRPC as the communication mechanism and use a stream channel to send the posts to the client.

Flask is recommended for the web server and web page. A simple table is all that is required.
You could consider using a charting library or you could improvise by using hash symbols to build up a bar chart, for example. You have a lot of freedom here.

Because a sleep is placed between each post read, it should be possible to keep refreshing the web page (either automatically or manually) to see updated metrics / results.

Use Docker (with Dockerfiles) for each microservice and use Docker Compose to orchestrate your system. It should be possible to simply enter “docker-compose up” to bring your system up.
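The recommended streaming rate (2 posts per second on average, with some random variability) can be sketched as below. This is only one way to do it: the function name `inter_post_delays`, the `jitter` parameter and the choice of a uniform distribution are all assumptions made for illustration.

```python
import random


def inter_post_delays(n_posts, mean_rate=2.0, jitter=0.5, seed=None):
    """Return n_posts sleep intervals (in seconds) that average 1 / mean_rate.

    Each delay is drawn uniformly from
    mean_delay * (1 - jitter) .. mean_delay * (1 + jitter),
    so the default settings give delays between 0.25 s and 0.75 s.
    """
    rng = random.Random(seed)
    mean_delay = 1.0 / mean_rate
    low = mean_delay * (1 - jitter)
    high = mean_delay * (1 + jitter)
    return [rng.uniform(low, high) for _ in range(n_posts)]
```

The streaming microservice would call `time.sleep(delay)` with one of these values before sending each post over the gRPC stream.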
You are to develop a website that stores information about directors and the films they have directed. You may assume that a film has only one director.
• Any user can
  o view a list of the directors in the database in alphabetical order (by surname)
  o select a director from the list to view that director’s films
  o view a list of the films in the database in chronological order showing all information you have about that film
  o register by providing an email and password
• Authenticated users can
  o add a director to the database
    ▪ no two directors can have the same name.
    ▪ apply sensible limitations to the name e.g. use @Pattern and @Size
  o add a film providing the film’s title, year of release and a director, which must be selected from a list provided.
    ▪ Two films can have the same title. When a film is remade it often has the same name as the original.
    ▪ A film’s title cannot be left blank.
    ▪ The film’s year of release must be an integer between 1888 and the current year (which will not always be 2020).
  o edit the title of a film
    ▪ requires that the method’s signature created in the repository be annotated as @Transactional (facilitating rollback if necessary) and @Modifying.
• Administrators can
  o delete a film
  o delete a director along with their films
You must also provide
• Two REST API endpoints which you write yourself, to allow authenticated users to
  o access (in json format) films released in a particular year
  o delete a director (and associated films) given their id
• An example of consuming these two APIs, which requires another Spring project. Nothing fancy is needed here, just a second WebMVC project that has two controllers and two views. The controllers send authentication data to the REST APIs and present the result of the request in a suitable format. No forms are needed.
Technical Notes
You will use a WebMVC Spring Boot application. Use an in-memory h2 database to create an “out of the box” application.
This database must be populated with sample data. The application must be implemented using JPA, making use of all that offers. This project essentially has two entities, a film and its director, along with users and their roles. Use the Security module for authentication and authorisation. I am not interested in visual styling – just make the web site well structured, readable, navigable, etc. – although you should use some CSS. Credit any templates (CSS/HTML) if you use them. I recommend the use of fragments to reduce the amount of code. Forms must be validated using form binding, and your own (not Spring or client-side) error messages should be displayed by the view if the user makes a mistake. It must be possible to change the language of the website. User input must be validated with suitable (international) error messages. You do not have to translate content. I will accept subtle changes, e.g. “Film Title” becoming “Film Title_FR” to indicate the language has changed to French.
You must use the following:
• Spring Boot
• Spring MVC
• Thymeleaf (not JSP)
• Spring Data JPA
• Spring Security
• H2 for an embedded database
• Maven
Unit tests are not required. You may use Project Lombok to reduce boilerplate code. Provide a brief document outlining the high-level design of your system (1 or 2 A4 pages), including but not limited to your database design, class diagrams and the beans which you used.
The final project for the course will require you to complete some exploratory analyses for a women’s clothing firm. The data for this project come from the Kaggle website at the following link: https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews/data

I have attached the dataset to the Project entry on myCourses so that you do not have to register for an account. The dataset contains information on customer product ratings for an anonymous women’s clothing e-commerce firm. For this project, you only need to focus on the following fields:

Variable          Description
Review_ID         Row ID for the review
Clothing_ID       Product ID for particular article of clothing
Age               Age of the reviewer
Rating            Reviewer rating of article on 0-5 scale
Recommended       Whether the reviewer positively recommends the product (1) or not (0)
Department_Name   The department which is responsible for that article of clothing

Objectives and evaluation

The project requires you to complete three tasks, detailed below. You should prepare a report for the e-commerce firm answering their questions for each task. They would also like you to include the code for each task in your report, for reproducibility purposes. You may include the code as code chunks where the analyses are taking place or, if you prefer, you may include it at the end (although the code should be clearly commented so that it is clear which task each block of code corresponds to).

The completion of each task is worth 25 points. The quality of presentation will also be worth 25 points, i.e. clarity of explanation, plots, tables, and code. The length of the projects will vary, depending on the number and formatting of figures and tables and the conciseness of the writing.
Rather than focusing on the number of pages, I encourage students to focus on completing each task (and subtask) below to the best of their ability in the clearest and most efficient manner.

The first task is to provide some exploratory data analyses and describe the distributions of age, product rating, recommendations, and article departments amongst the respondents. For each of the four variables, first produce appropriate plots and summary tables and then, in words, describe the distribution of each variable individually. Note if there are any missing values and then remove them from the data set for the remainder of the tasks.

The firm is also interested in answering two questions about associations between some of the variables. Please address each question by using both graphical and numerical summaries and describe the nature of the associations described by the questions below.

Question 1: The firm would like to know whether the distribution of age of reviewers varies across product departments.

Question 2: For marketing purposes, they would also like to divide respondent age into five demographic categories (25 and under, 26-35, 36-45, 46-64, and 65 and over) and compare the distribution of product ratings amongst each of the five age groups to see which groups are most enthusiastic about their company’s products.

Task 3: For the final task, the company would like to compile a list of their ten most popular products based on recommendations (with each product indicated by ID number). However, they read an article which indicated that comparing products based just on the average review or the proportion positively recommended is dangerous, as some products have many fewer reviews than others, e.g. one could have 100% of reviews with positive recommendations, but only 3 total reviews.

The company feels that a measure of popularity should balance the number of reviews with the proportion recommended.
One such measure for binary values is known as Wilson’s lower confidence limit approximation for proportions.

Let \(\hat{p}_i\) be the proportion of respondents who positively recommended a certain product and \(n_i\) be the number of respondents who rated that product (positively or negatively). Then Wilson’s lower confidence limit can be computed via:

\[ a_i = \frac{1.96^2}{2 n_i}, \qquad b_i = \frac{\hat{p}_i (1 - \hat{p}_i)}{n_i}, \qquad c_i = \frac{a_i}{2 n_i} \]

\[ W_{LCL}(\hat{p}_i) = \frac{\hat{p}_i + a_i - 1.96 \times \sqrt{b_i + c_i}}{1 + 2 a_i} \]

This lower confidence limit is the value for which we are 97.5% confident that the true value of the proportion in the population who positively recommend product i lies above this quantity. For two products with the same proportion of positive recommendations, \(\hat{p}_i\), the one with the larger number of reviews will have the higher lower confidence limit. It is possible for a product with a larger number of reviews to have a smaller proportion of positive recommendations than a second product, but a larger lower confidence limit if the second product has many fewer total reviews.

For this task, compile three different lists in the form of tables:
a) the 10 product IDs with the highest average ratings;
b) the 10 product IDs with the highest proportion of positive recommendations; and
c) the 10 product IDs with the highest Wilson lower confidence limits for positive recommendations as described above.

In each table, include the product ID, the number of reviews for that product, the average rating, the proportion of positive recommendations, and the department. Which list do you think best represents the products which are the most popular? Explain your answer clearly to the firm in your report.
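The Wilson lower confidence limit defined above can be checked numerically with a short sketch. It is written here in Python purely as an illustration (the project itself may expect a different language), and the function name `wilson_lower_limit` is an assumption.

```python
import math


def wilson_lower_limit(p_hat, n, z=1.96):
    """Wilson lower confidence limit for a proportion.

    p_hat: proportion of positive recommendations for one product
    n:     number of reviews for that product
    z:     normal quantile (1.96, as in the task description)
    """
    a = z ** 2 / (2 * n)          # a_i
    b = p_hat * (1 - p_hat) / n   # b_i
    c = a / (2 * n)               # c_i
    return (p_hat + a - z * math.sqrt(b + c)) / (1 + 2 * a)


# A product with 3 reviews, all positive, scores far below a product with
# 1000 reviews of which 90% are positive.
few_reviews = wilson_lower_limit(1.0, 3)
many_reviews = wilson_lower_limit(0.9, 1000)
```

This illustrates the point made in the task: the limit penalises products with very few reviews even when every review is a positive recommendation.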
Question 1 The data for this question come from a high performance ceramics experiment done at NIST. The purpose of the experiment was to characterize the effect of five machining factors on the mean strength of the resulting ceramic.

The five factors were: Table Speed, Down Feed Rate, Wheel Grit, Direction, and Batch. Each factor had two levels. The Mean Speed was measured for each of the 2^5 = 32 possible combinations of factor levels.

You can read in the data from the file attached to the Assignment on myCourses by using ceramic_data
Question 1 (50 points)

The basics

Logistic regression is a fundamental prediction model in statistics and modern data science. Assume that we have observed two predictors, \(X_{i1}\) and \(X_{i2}\), and want to predict a binary outcome \(Y_i\) (i.e. \(Y_i = 0\) or \(Y_i = 1\)). A logistic regression model assumes that the probability that \(Y_i = 1\) can be modelled using the following function of \(X_{i1} = x_{i1}\) and \(X_{i2} = x_{i2}\):

\[ \Pr(Y_i = 1 \mid X_{i1} = x_{i1}, X_{i2} = x_{i2}, \theta_1, \theta_2, \theta_3) = p(x_{i1}, x_{i2}) = \frac{1}{1 + \exp(-x_{i1}\theta_1 - x_{i2}\theta_2 - \theta_3)}. \]

(a) Write a function to compute \(p(x_1, x_2)\) for n observations which takes as arguments:
i) A vector of three parameters \(\theta = (\theta_1, \theta_2, \theta_3)\).
ii) Two predictor vectors, \(x_1 = (x_{1,1}, \ldots, x_{n,1})\) and \(x_2 = (x_{1,2}, \ldots, x_{n,2})\)
and returns a length n vector corresponding to \(p(x_{11}, x_{12}), \ldots, p(x_{n1}, x_{n2})\) for the corresponding \(\theta\) values.

Hint: You can do this without loops by subscripting for \(\theta\) and using vectorized calculations for \(x_1\) and \(x_2\).

Given a dataset of n observations where we observe \((Y, X_1, X_2) = (y_i, x_{i1}, x_{i2})\) for each observation i, one way to estimate values for \(\theta_1\), \(\theta_2\) and \(\theta_3\) is to minimize the cross-entropy loss:

\[ L(\theta_1, \theta_2, \theta_3) = -\sum_{i=1}^{n} \left[ y_i \times \log(p(x_{i1}, x_{i2})) + (1 - y_i) \times \log(1 - p(x_{i1}, x_{i2})) \right] \]

Note that because \(0 \le p(x_1, x_2) \le 1\), \(L(\theta_1, \theta_2, \theta_3)\) will be smaller when \(p(x_{i1}, x_{i2})\) is close to 1 for \(y_i = 1\) and \(p(x_{i1}, x_{i2})\) is close to 0 for \(y_i = 0\).

(b) Write a function to compute \(L(\theta_1, \theta_2, \theta_3)\) for n observations which takes as arguments:
i) A vector of three parameters \(\theta = (\theta_1, \theta_2, \theta_3)\).
ii) Two predictor vectors, \(x_1 = (x_{1,1}, \ldots, x_{n,1})\) and \(x_2 = (x_{1,2}, \ldots, x_{n,2})\)
iii) An outcome vector, \(y = (y_1, \ldots, y_n)\)

Hint: Use your function \(p(x_1, x_2)\) from part (a).

Writing a function to use with optim

optim is an optimizer function that, by default, minimizes an argument function fn with respect to the first argument of fn (a vector), starting from initial values par. Other arguments for fn can be passed in ….
An example of a function for use with optim would be: ## The loss function is (x_1 - a)^4 + (x_2 - b)^4, which is minimized at ## x_1 = a and x_2 = b. f_x
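To make the definitions of \(p(x_{i1}, x_{i2})\) and the cross-entropy loss in parts (a) and (b) concrete, here is a minimal sketch. It is written in Python rather than R and uses list comprehensions instead of vectorized calculations, so it illustrates the formulas only and is not the requested solution; the function names `logistic_p` and `cross_entropy` are assumptions.

```python
import math


def logistic_p(theta, x1, x2):
    """Return p(x_i1, x_i2) for each observation i.

    theta: sequence of three parameters (theta1, theta2, theta3)
    x1, x2: predictor sequences of equal length n
    """
    t1, t2, t3 = theta
    return [1.0 / (1.0 + math.exp(-a * t1 - b * t2 - t3)) for a, b in zip(x1, x2)]


def cross_entropy(theta, x1, x2, y):
    """Cross-entropy loss L(theta1, theta2, theta3) for binary outcomes y."""
    p = logistic_p(theta, x1, x2)
    return -sum(
        yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
        for yi, pi in zip(y, p)
    )
```

With all three parameters set to zero, every predicted probability is 0.5, so the loss reduces to n × log(2); this is a convenient sanity check before handing the loss to an optimizer.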
Question 1 (30 points)

The FiveThirtyEight website publishes many of the datasets that are used in their articles. A former McGill statistics undergrad (now an assistant professor at Smith College) created an R package, fivethirtyeight, that allows easier access to all but the largest data sets.

For this question, we will look at the biopics dataset used in the article: Straight Outta Compton Is The Rare Biopic Not About White Dudes. Note that the article defines “biopics” as a class of movies that “are dramatizations, loosely based on the real-life events of actual people. Biopics offer an interpretation of lives deemed important (and profitable) by Hollywood, and they often try to make a statement about their subjects’ historical or cultural significance.”

## install.packages("fivethirtyeight")
library(fivethirtyeight)
data(biopics)

(a) Using the plot of your choice, assess whether the total number of biopics released per year has increased over time based on the data collected from the IMDB movie database.
(b) Produce a stacked barplot similar to the barplot in the original article showing the relative numbers of male and female subjects over time. (Note: the figures will not be exactly the same, as the data in the article figures are not the same as in the dataset.)
(c) Produce a stacked barplot similar to the barplot in the original article showing the relative numbers of white subjects, subjects who are persons of color, and unknown race subjects over time. (Note: the figures will not be exactly the same, as the data in the article figures are not the same as in the dataset.)
(d) Based on a mosaic plot (collapsing over year of release), which sex / white-nonwhite-NA group is the most underrepresented in biopics based on number of subjects?
(e) Produce a summary table containing counts and proportions of biopic subjects per year for each sex / white-nonwhite-NA factor combination.
(f) Create (i) a line plot showing the counts of these groups over time and (ii) a line plot showing the relative proportions of subjects over time. Would you infer from these plots that the imbalance is improving over time or not? Explain your answer.

Question 2 (30 points)

For this question, we will examine a famous diabetes dataset analyzed by Reaven and Miller (1979). We can obtain this from the heplots R package. We won’t use any of the functions in the package (for now), but we can access the data. Using the help (?) we can see the definition of the variables.

install.packages("heplots")
library(heplots)
data(Diabetes)
?Diabetes

(a) First, create a summary table that finds the mean and median for each of the six quantitative variables with a column for each group. (Hint: use summarise, pivot_longer, and pivot_wider). Which variable(s) seem to differentiate amongst the different types of diabetes?
(b) Create 3 scatterplots, comparing all possible pairs of the glucose test variable, the insulin test variable and the sspg variable. Which pair of variables seems to allow for the strongest distinction amongst the three groups?
(c) Using the pair of variables that you chose in part (b), make 2-d histograms and contour plots for each group separately. Do you find for this dataset that these plots provide useful summaries of the differences in distributions in the three groups? Feel free to adjust the amount of binning/smoothing and the number of levels from the default levels.
Question 1 (30 points)

You can access datasets from the R datasets package by using data(NAME_OF_DATASET). For this question, we will use the ToothGrowth data.

data(ToothGrowth)

(a) Determine the (i) mode and (ii) class of the ToothGrowth data object.
(b) Determine how many rows and columns the object has by using R functions.
(c) Use boxplots, histograms, and density plots to describe the distribution of odontoblast lengths by supplement type. Does one supplement seem to be associated with greater lengths? Explain your answer.
(d) Based on your output from part (c), which plot do you think is most effective for assessing whether there is a difference in distribution of lengths between the two groups? Explain your answer.
(e) Create an appropriate scatterplot to assess the association between the dose of the supplement and the lengths and to determine whether the nature of the association depends on the type of supplement. Does the association between length and dose seem to depend on the type of supplement? Explain your answer.
(f) Generate a summary table that contains the mean, median, and standard deviation of the lengths for each supplement type.

Question 2 (30 points)

One of the most popular datasets on the UCI Machine Learning repository is the Abalone dataset, which contains characteristics of sea abalone.
The goal of this analysis is to predict the number of rings of the abalone shell, which indicates the age of the abalone.

The dataset contains the following data:

Name             Data Type    Measurement Unit
Sex              nominal
Length           continuous   mm
Diameter         continuous   mm
Height           continuous   mm
Whole weight     continuous   grams
Shucked weight   continuous   grams
Viscera weight   continuous   grams
Shell weight     continuous   grams
Rings            integer

(a) Read in the data directly to a tibble object from the URL (https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data) by using the read_csv() function (note: the column names are NOT included in the dataset).
(b) Assign names to the columns of the tibble. The columns are in order of the measurements given in the table above.
(c) Create a new column for the radius of the abalone shell by using the diameter.
(d) Find the maximum and minimum number of rings for each value of the Sex variable by using R functions.
(e) Using only plots, explain whether you think the association between total weight and the number of rings depends on the value for Sex.

Question 3 (20 points)

Assume that Prof. Steele creates the following list in R to help manage his life: shopping_list
Write a parse function in the file e1000.c in the code in network-sockets-xv6-e1000-lab.zip to give a human readable dump of received packet details.

Your parse function should distinguish between UDP, ARP and TCP packets. For the MAC layer your parse function should detail the MAC addresses and the ethertype field. For the IP layer your parse function should detail the source and destination IP addresses and the flags. (9 marks)

Download the network-sockets-xv6-e1000-lab.zip from Canvas. The code is taken from here: https://pdos.csail.mit.edu/6.828/2019/labs/net.html

Note the code at this link https://pdos.csail.mit.edu/6.828/2019/labs/net.html is for the 2019 version of xv6, i.e. riscv. See Canvas for a port of the code to xv6 for the x86, i.e. network-sockets-xv6-e1000-lab.zip.

You may find it helpful to review “Traps and device drivers” and “File descriptor layer” from the xv6 book, and the lecture notes on networking.

We are using a virtual network device called the E1000 to handle network communication. To xv6, the E1000 looks like a real piece of hardware connected to a real Ethernet local area network (LAN). But in reality, the E1000 that the driver talks to is an emulation provided by qemu, connected to a LAN that is also emulated by qemu. On this LAN, xv6 (the “guest”) has an IP address of 10.0.2.15. The only other (emulated) computer on the LAN has IP address 10.0.2.2. qemu arranges that when xv6 uses the E1000 to send a packet to 10.0.2.2, it’s really delivered to the appropriate application on the (real) computer on which you’re running qemu (the “host”).

We will be using QEMU’s user mode network stack since it requires no administrative privileges to run. QEMU’s documentation has more about user-net here. We’ve updated the Makefile to enable QEMU’s user-mode network stack and the virtual E1000 network card.

QEMU’s network stack will record all incoming and outgoing packets to packets.pcap.
To get a hex/ASCII dump of captured packets use tcpdump like this:

tcpdump -XXnr packets.pcap

or use wireshark:

wireshark packets.pcap

Instructions
1. Download the code from Canvas – network-sockets-xv6-e1000-lab.zip
2. See the slides qemu-ethernet on Canvas and the associated video for details about the various network layer packet formats.
3. You are to write the parse function in e1000.c.
4. To test your parse code for TCP, ARP – use the browser to connect to http://localhost:20001/
5. To test your parse code for UDP, see net.h for a C struct to represent a UDP packet. See also the details in question 2.

Read the description below and also the “File descriptor layer” section in the xv6 book and describe how the user is able to send and receive packets to/from the E1000 device with simple system calls such as read and write. (9 marks)

Overview of the socket layer in xv6

Network sockets are a standard abstraction for OS networking that bear similarity to files. Sockets are accessed through ordinary file descriptors (just like files, pipes, and devices). Reading from a socket file descriptor receives a packet while writing to it sends a packet. If no packets are currently available to be received, the reader must block and wait for the next packet to arrive (i.e. allow rescheduling to another process). The code in xv6-e1000-sockets.zip is a stripped down version of sockets that supports the UDP network protocol.

Each network socket only receives packets for a particular combination of local and remote IP addresses and port numbers, and xv6 is required to support multiple sockets.
A socket can be created and bound to the requested addresses and ports via the connect system call, which returns a file descriptor. The implementation of this system call is in kernel/sysfile.c. The code for sockalloc() and related functions is in kernel/sysnet.c.

Take note of the provided data structures; one struct sock object is created for each socket. sockets is a singly linked list of all active sockets. It is useful for finding which socket to deliver newly received packets to. In addition, each socket object maintains a queue of mbufs waiting to be received. Received packets will stay in these queues until the read() system call dequeues them.

Running the test program

(in one terminal on your laptop)
$ python2 server.py 26099
listening on localhost port 26099
(then on xv6 in another terminal on the same machine run make qemu and run nettests at the xv6 shell prompt)
$ nettests

Question 3 – Firecracker, virtio and virtio-sock

Give an overview of firecracker and its use of virtio for networking. In your answer focus specifically on virtio-sock. (12 marks)

References:
https://www.usenix.org/system/files/nsdi20-paper-agache.pdf
https://github.com/firecracker-microvm/firecracker
https://github.com/firecracker-microvm/firecracker/blob/master/docs/vsock.md
https://terenceli.github.io/%E6%8A%80%E6%9C%AF/2020/04/18/vsock-internals
https://developer.ibm.com/technologies/linux/articles/l-virtio/
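For the parse function in Question 1, it can help to prototype the header layout outside the kernel before writing the C code in e1000.c. The sketch below decodes the Ethernet header and, for IPv4 frames, the flags, protocol and addresses, using the standard Ethernet and IPv4 field offsets. It is written in Python purely as a prototype; the names `parse_frame` and `format_mac` are hypothetical, and it assumes a frame without a VLAN tag.

```python
import struct

ETHERTYPES = {0x0800: "IPv4", 0x0806: "ARP"}
IP_PROTOCOLS = {6: "TCP", 17: "UDP"}


def format_mac(raw):
    """Render 6 raw bytes as aa:bb:cc:dd:ee:ff."""
    return ":".join(f"{b:02x}" for b in raw)


def parse_frame(frame):
    """Return a dict describing the MAC layer and, if present, the IPv4 layer."""
    # Ethernet header: destination MAC, source MAC, ethertype (network byte order).
    dst, src, ethertype = struct.unpack("!6s6sH", frame[:14])
    info = {
        "dst_mac": format_mac(dst),
        "src_mac": format_mac(src),
        "ethertype": ETHERTYPES.get(ethertype, hex(ethertype)),
    }
    if ethertype == 0x0800:
        ip = frame[14:34]  # fixed 20-byte part of the IPv4 header
        (flags_frag,) = struct.unpack("!H", ip[6:8])
        info["ip_flags"] = flags_frag >> 13  # top 3 bits: reserved, DF, MF
        info["protocol"] = IP_PROTOCOLS.get(ip[9], str(ip[9]))
        info["src_ip"] = ".".join(str(b) for b in ip[12:16])
        info["dst_ip"] = ".".join(str(b) for b in ip[16:20])
    return info
```

The same offsets carry over to C; the packet bytes from packets.pcap can be fed to this prototype to cross-check the output of the kernel parse function against tcpdump or wireshark.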
xv6 divides the disk into several sections, as shown in the Figure below. The file system does not use block 0 (it holds the boot sector). Block 1 is called the superblock; it contains metadata about the file system (the file system size in blocks, the number of data blocks, the number of inodes, and the number of blocks in the log).

Blocks starting at 2 hold inodes. After those come bitmap blocks tracking which data blocks are in use. Most of the remaining blocks are data blocks. The blocks at the end of the disk hold the logging layer’s log.

The term inode can have one of two related meanings. It might refer to the on-disk data structure (struct dinode, fs.h) containing a file’s size and list of data block numbers. Or “inode” might refer to an in-memory inode (struct inode, file.h), which contains a copy of the on-disk inode as well as extra information needed within the kernel.

The on-disk inode is defined by a struct dinode. The type field distinguishes between files, directories, and special files (devices). A type of zero indicates that an on-disk inode is free. The nlink field counts the number of directory entries that refer to this inode, in order to recognize when the on-disk inode and its data blocks should be freed. The size field records the number of bytes of content in the file. The addrs array records the block numbers of the disk blocks holding the file’s content.

The function readi is essentially called by the user level read function to read an amount of data from a file. The readi parameters are:

readi(struct inode *ip, char *dst, uint off, uint n)

The inode pointed to by ip abstracts the file layout/structure in the filesystem. readi uses the addrs array to find the block numbers that are associated with the file. It then reads in all the blocks from the disk needed to satisfy the read request. In xv6 each block is the same size as a disk sector, which is 512 bytes.

The function bmap makes it easy for readi and writei to get at an inode’s data.
readi (5503) starts by making sure that the offset and count are not beyond the end of the file. Reads that start beyond the end of the file return an error (5514-5515) while reads that start at or cross the end of the file return fewer bytes than requested (5516-5517).

The main loop processes each block of the file, copying data from the buffer into dst (5519-5524). writei (5553) is identical to readi, with three exceptions: writes that start at or cross the end of the file grow the file, up to the maximum file size (5566-5567); the loop copies data into the buffers instead of out (5572); and if the write has extended the file, writei must update its size (5577-5580).

Read chapter 6 of the xv6 book and briefly explain how the read calls in the following code taken from cat.c

    void
    cat(int fd)
    {
      int n;

      while((n = read(fd, buf, sizeof(buf))) > 0) {
        if (write(1, buf, n) != n) {
          printf(1, "cat: write error\n");
          exit();
        }
      }
      if(n < 0){
        printf(1, "cat: read error\n");
        exit();
      }
    }

are associated with sectors on the disk by the xv6 operating system. You may use a diagram such as the one here to illustrate your answer. (10 marks)

Read the sections on Drivers, Code: Drivers in chapter 3 and section 36.8 of filedevices.pdf, which gives a summary of the IDE disk controller protocol. See in particular the code in ide.c. See image below.

The xv6 source code includes a working IDE driver in ide.c. For example, the piece of code in idestart

    outb(0x1f6, 0xe0 | ((b->dev&1)<<4) | ((sector>>24)&0x0f));

is associated with:

    I/O Address 0x1F6 = 1B1D TOP4LBA: B=LBA, D=drive

The 0xe0 sets 1B1 in bits 5, 6 and 7; B=1 indicates that the low 4 bits at address 0x1f6 will hold the top 4 bits of the Logical Block Address (LBA). ((b->dev&1)<<4) places the drive number in bit 4, and ((sector>>24)&0x0f) places the top 4 bits of the LBA in the low four bits at address 0x1f6.

An IDE disk presents a simple interface to the disk system, consisting of four types of register: control, command block, status, and error.
These registers are available by reading or writing to specific “I/O addresses” (such as 0x3F6) using (on x86) the in and out I/O instructions.

On page 48 of the xv6 book we read: “The xv6 bootloader issues disk read commands and reads the disk controller status bits repeatedly until the data is ready (see Appendix B). This polling or busy waiting is fine in a boot loader, which has nothing better to do. In an operating system, however, it is more efficient to let another process run on the CPU and arrange to receive an interrupt when the disk operation has completed.”

a) Explain how the xv6 operating system uses interrupts to schedule I/O requests to the disk.
b) Explain how the bootloader interfaces with the IDE controller to load the xv6 operating system. See Appendix B of the xv6 book (section “Code: C bootstrap” in particular) and bootmain.c in the xv6 source code. (12 marks)

Currently xv6 files are limited to 140 sectors, or 71,680 bytes. This limit comes from the fact that an xv6 inode contains 12 “direct” block numbers and one “singly-indirect” block number, which refers to a block that holds up to 128 more block numbers, for a total of 12+128=140. You’ll change the xv6 file system code to support a “doubly-indirect” block in each inode, containing 128 addresses of singly-indirect blocks, each of which can contain up to 128 addresses of data blocks. The result will be that a file will be able to consist of up to 16523 sectors (or about 8.5 megabytes).

Preliminaries

Modify your Makefile’s CPUS definition so that it reads:

    CPUS := 1

Add

    QEMUEXTRA = -snapshot

right before QEMUOPTS.

The above two steps speed up qemu tremendously when xv6 creates large files.

mkfs initializes the file system to have fewer than 1000 free data blocks, too few to show off the changes you’ll make. Modify param.h to set FSSIZE to:

    #define FSSIZE 20000 // size of file system in blocks

Download big.c into your xv6 directory, add it to the UPROGS list, start up xv6, and run big.
It creates as big a file as xv6 will let it, and reports the resulting size. It should say 140 sectors.

What to Look At

The format of an on-disk inode is defined by struct dinode in fs.h. You’re particularly interested in NDIRECT, NINDIRECT, MAXFILE, and the addrs[] element of struct dinode. Look here for a diagram of the standard xv6 inode.

The code that finds a file’s data on disk is in bmap() in fs.c. Have a look at it and make sure you understand what it’s doing. bmap() is called both when reading and writing a file. When writing, bmap() allocates new blocks as needed to hold file content, as well as allocating an indirect block if needed to hold block addresses.

bmap() deals with two kinds of block numbers. The bn argument is a “logical block”, a block number relative to the start of the file. The block numbers in ip->addrs[], and the argument to bread(), are disk block numbers. You can view bmap() as mapping a file’s logical block numbers into disk block numbers.

Your Job

Modify bmap() so that it implements a doubly-indirect block, in addition to direct blocks and a singly-indirect block. You’ll have to have only 11 direct blocks, rather than 12, to make room for your new doubly-indirect block; you’re not allowed to change the size of an on-disk inode. The first 11 elements of ip->addrs[] should be direct blocks; the 12th should be a singly-indirect block (just like the current one); the 13th should be your new doubly-indirect block. You don’t have to modify xv6 to handle deletion of files with doubly-indirect blocks.

If all goes well, big will now report that it can write 16523 sectors. It will take big a few dozen seconds to finish.

Hints

Make sure you understand bmap(). Write out a diagram of the relationships between ip->addrs[], the indirect block, the doubly-indirect block and the singly-indirect blocks it points to, and data blocks.
Make sure you understand why adding a doubly-indirect block increases the maximum file size by 16,384 blocks (really 16383, since you have to decrease the number of direct blocks by one).

Think about how you’ll index the doubly-indirect block, and the indirect blocks it points to, with the logical block number.

If you change the definition of NDIRECT, you’ll probably have to change the size of addrs[] in struct inode in file.h. Make sure that struct inode and struct dinode have the same number of elements in their addrs[] arrays.

If you change the definition of NDIRECT, make sure to create a new fs.img, since mkfs uses NDIRECT too to build the initial file systems. If you delete fs.img, make on Unix (not xv6) will build a new one for you.

If your file system gets into a bad state, perhaps by crashing, delete fs.img (do this from Unix, not xv6). make will build a new clean file system image for you.

Don’t forget to brelse() each block that you bread().

You should allocate indirect blocks and doubly-indirect blocks only as needed, like the original bmap(). (8 marks)
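The size arithmetic and the indexing hint above can be checked with a small model. The following is only a sketch in Python, not xv6 kernel code: the function name classify and the constant NDINDIRECT are hypothetical, and it assumes 512-byte blocks holding 4-byte block numbers, as described in the text. It shows which part of the inode would cover a given logical block number and which indices would be used.

```python
# Illustrative model of the logical-block layout described above.
# Plain Python arithmetic only; nothing here is xv6 kernel code.

BSIZE = 512                          # bytes per block (one disk sector)
NDIRECT = 11                         # direct slots left after giving one up
NINDIRECT = BSIZE // 4               # 128 four-byte block numbers per block
NDINDIRECT = NINDIRECT * NINDIRECT   # hypothetical name: blocks reachable via the doubly-indirect slot
MAXFILE = NDIRECT + NINDIRECT + NDINDIRECT


def classify(bn):
    """Return the region that covers logical block bn, with the indices used."""
    if bn < 0 or bn >= MAXFILE:
        raise ValueError("logical block number out of range")
    if bn < NDIRECT:
        return ("direct", bn)
    bn -= NDIRECT
    if bn < NINDIRECT:
        return ("single", bn)
    bn -= NINDIRECT
    # The outer index picks a singly-indirect block inside the doubly-indirect
    # block; the inner index picks the data block inside that block.
    return ("double", bn // NINDIRECT, bn % NINDIRECT)


if __name__ == "__main__":
    print(MAXFILE, MAXFILE * BSIZE)
    for bn in (0, 10, 11, 138, 139, MAXFILE - 1):
        print(bn, classify(bn))
```

In a real bmap() each of these steps would also have to read the relevant block, allocate it if its slot is zero, and release the buffer afterwards; the model deliberately leaves that out.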
On completion please zip up your files and upload to Canvas.

Install qemu on your Ubuntu VM:

    sudo apt-get update && sudo apt-get install git nasm build-essential qemu gdb

Download xv6 using git. Open a terminal window:

    git clone git://github.com/mit-pdos/xv6-public.git
    cd xv6-public

Run make to build xv6 and make qemu to run xv6.

Execute some shell commands to get familiar with the user part of the OS. The shell commands are separate programs, e.g. ls.c, cat.c. To get a feel for how programs look in xv6, and how various APIs should be called, you can look at the source code for other utilities: echo.c, cat.c, wc.c, ls.c.

Hints: In places where something asks for a file descriptor, you can use either an actual file descriptor (i.e., the return value of the open function), or one of the standard I/O descriptors: 0 is “standard input”, 1 is “standard output”, and 2 is “standard error”. Writing to either 1 or 2 will result in something being printed to the screen.

The standard header files used by xv6 programs are “types.h” (to define some standard data types) and “user.h” (to declare some common functions). You can look at these files to see what code they contain and what functions they define.

Write a program for xv6 that prints “Hello world” to the xv6 console. This can be broken up into a few steps:

1. Create the file hello.c in the xv6 directory.
2. Edit the file Makefile, find the section UPROGS (which contains a list of programs to be built), and add a line to tell it to build your hello.c code. When you’re done, that portion of the Makefile should look like:

    UPROGS= _cat _echo _forktest _grep _init _kill _ln _ls _mkdir _rm _sh _stressfs _usertests _wc _zombie _hello

3. Run make to build xv6, including your new program.
4. Run make qemu to launch xv6, and then type hello in the QEMU window. You should see “Hello world” being printed out. (1 mark)

a) Write a program that prints the first 10 lines of its input for the xv6 operating system.
If a filename is provided on the command line (i.e., head FILE) then head should open it, read and print the first 10 lines, and then close it. If no filename is provided, head should read from standard input. See how the program cat.c works, e.g. cat README. The command grep the README | head should show ten lines, each of which contains the word “the”.

Hints: Many aspects of this are similar to the wc program: both can read from standard input if no arguments are passed or read from a file if one is given on the command line. Reading its code will help you if you get stuck.

b) What is a user program for the xv6 operating system? Explain how a user program links to library functions such as printf and how it accesses the operating system.
c) Explain how the xv6 shell works. Refer to the xv6 source code for your answer.
d) Explain how xv6 implements the ls program. Refer to the xv6 source code for your answer. (10 marks)

Add basic versions of the commands cp and mv to xv6. The cp and mv commands are to work on files only (no dirs). The system calls to be used are given in brackets. Please see user.h (on xv6) for a complete list of syscalls and library functions available. These commands are available on Linux.

* cp src dst (open, read, write)
* mv oldname newname (link, unlink)

The build procedure can be broken up into a few steps:

1. Create the file cp.c in the xv6 directory.
2. Edit the file Makefile, find the section UPROGS (which contains a list of programs to be built), and add a line to tell it to build your cp.c code.
3. Run make to build xv6, including your new program.
4. Run make qemu to launch xv6, and then execute cp in the QEMU window. (4 marks)

a) Read chapter 3 of the xv6 book and the xv6 source and describe how xv6 dispatches interrupts and system calls. Refer to the xv6 source code for your answer.
b) Explain how the keyboard driver buffers keystrokes for the xv6 operating system.
(12 marks)

The trace syntax is

    int trace(int)

When called with a non-zero parameter, e.g., trace(1), system call tracing is turned on for that process. Each system call from that process will be printed to the console in a user-friendly format showing:

* the process ID
* the process name
* the system call number
* the system call name

Any other processes will not have their system calls printed unless they also call trace(1).

Calling trace(0) turns tracing off for that process. System calls will no longer be printed to the console.

In all cases, the trace system call also returns the total number of system calls that the process has made since it started. Hence, you can write code such as:

    printf(1, "total system calls so far = %d\n", trace(0));

How to add a new system call to xv6

You need to touch several files to add a system call in xv6. Look at the implementation of existing system calls for guidance on how to add a new one. The files that you need to edit to add a new system call include:

user.h
This contains the user-side function prototypes of system calls as well as utility library functions (stat, strcpy, printf, etc.).

syscall.h
This file contains symbolic definitions of system call numbers. You need to define a unique number for your system call. Be sure that the numbers are consecutive, that is, there are no missing numbers in the sequence. These numbers are indices into a table of pointers defined in syscall.c (see next item).

syscall.c
This file contains entry code for system call processing. The syscall(void) function is the entry function for all system calls. Each system call is identified by a unique integer, which is placed in the processor’s eax register. The syscall function checks the integer to ensure that it is in the appropriate range and then calls the corresponding function that implements that call by making an indirect function call to a function in the syscalls[] table.
You need to ensure that the kernel function that implements your system call is in the proper sequence in the syscalls array.

usys.S
This file contains macros for the assembler code for each system call. This is user code (it will be part of a user-level program) that is used to make a system call. The macro simply places the system call number into the eax register and then invokes the system call. You need to add a macro entry for your system call here.

sysproc.c
This is a collection of process-related system calls. The functions in this file are called from syscall. You can add your new function to this file.

Per-process state is stored in a proc structure: struct proc in proc.h. You’ll need to extend that structure to keep track of the process-related metrics. You’ll also need to find where the proc structure is allocated so that you can ensure that the elements are initialized appropriately.

When you implement your trace call, you’ll need to retrieve the incoming parameter. The file syscall.c defines a few helper functions to do this. The functions argint, argptr, and argstr retrieve the nth system call argument, as either an integer, pointer, or a string. argint uses the esp register to locate the argument: esp points at the return address for the system call stub.

Implementation steps

1. Write a test program. Add the test program to the Makefile so that it will be compiled and built whenever you run make. In the Makefile, add your program (e.g., try.c) to the list of user commands in the UPROGS= section. That should be all you need to do to that file.

Beware that programs don’t have access to the typical stdio library that you expect to find on most systems. You’ll have many of the functions you expect but some of the behavior might be different. For example, printf accepts an initial parameter that is the output stream: 1 represents the standard output and 2 represents the standard error stream. There is no FILE* type and no fopen, fclose, fgets, etc. calls.
Look through usertests.c for examples on how all of the system calls provided with xv6 are used.

2. Add system call tracing to the kernel. Print a message identifying every system call that is requested by any process as well as the process ID and process name. You do not need to print the arguments to the system calls. When you run any program, including the shell, you will see output similar to this:

    …
    pid: 2 [sh] syscall(5=read)
    pid: 2 [sh] syscall(5=read)
    pid: 2 [sh] syscall(1=fork)
    pid: 2 [sh] syscall(3=wait)
    pid: 3 [sh] syscall(12=sbrk)
    pid: 3 [sh] syscall(7=exec)
    pid: 3 [try] syscall(20=mkdir)
    pid: 3 [try] syscall(15=open)
    pid: 3 [try] syscall(16=write)
    pid: 3 [try] syscall(21=close)
    pid: 3 [try] syscall(2=exit)
    pid: 2 [sh] syscall(16=write)
    pid: 2 [sh] syscall(16=write)
    …

Use the cprintf function in the kernel, which prints using direct access to the VGA controller. It works just like the normal Linux printf function. For example:

    cprintf("hello, I'm number %d\n", num);

3. Restrict this output to a single process. Create a new system call called trace(int). This turns console-based system call logging on and off for only the calling process. Extend the proc structure (the process control block) in proc.h to keep track of whether tracing for the process is on or off. Be sure that the elements are cleared (initialized) whenever a new process is created.

In implementing your system call, you’ll need to access the single parameter passed by trace. Use the helper functions defined in syscall.c (argint, argptr, and argstr). Take a look at how other system calls are implemented in xv6. For example, getpid is a simple system call that takes no arguments and returns the current process’ process ID; kill and sleep are examples of system calls that take a single integer parameter.

4. Add system call counting. Be sure to count calls on a per-process basis. You will need to keep track of this in the process control block, the proc structure.
(7 marks)

The process table and struct proc in proc.h are used to maintain information on the current processes that are running in the xv6 kernel. Since ps is a user space program, it cannot access the process table in the kernel. So we’ll add a new system call. The ps command should print:

* process id
* parent process id
* state
* size
* name

The system call you need to add to xv6 has the following interface:

    int getprocs(int max, struct uproc table[]);

struct uproc is defined as (add it to a new file uproc.h):

    struct uproc {
      int pid;
      int ppid;
      int state;
      uint sz;
      char name[16];
    };

Your ps program calls getprocs with an array of struct uproc objects and sets max to the size of that array (measured in struct uproc objects). Your kernel code copies up to max entries into your array, starting at the first slot of the array and filling it consecutively. The kernel returns the actual number of processes in existence at that point in time, or -1 if there was an error. (6 marks)
For the first two parts of this homework we will use the Amazon co-purchasing network dataset (Leskovec et al., 2007) to perform social network analysis. This dataset contains various products’ networks including books, music CDs, DVDs, and VHS video tapes. It was collected by crawling the Amazon website in March 2003, following the “Customers Who Bought This Item Also Bought” feature on the Amazon website. So, if a product A is always co-purchased with product B, the graph contains a directed edge from A to B.

We recommend that you use Jupyter Notebooks and Python libraries (NumPy, scikit-learn, pandas, and NetworkX) for this homework. The last part of this homework contains a novel peer-assessed exam question generation problem!

This homework is divided into three parts.
1. Exploratory Social Network Analysis.
2. Predicting Review Rating using features derived from Network Properties.
3. Generating a peer-assessed exam question.

2.1 Part 1: Exploratory Social Network Analysis [30 Points]

This part of the homework is designed to help you familiarize yourself with the dataset and basic concepts of network analysis. The insights from this part of the homework will help you in building the prediction models for Part 2 of the homework.

1. Read the NetworkX library documentation closely to understand the context and review some code examples of network analyses. [0 points]
2. Read the document linked below to understand the basics of Social Network Analysis. https://www.datacamp.com/community/tutorials/social-network-analysis-python [0 points]
3. Perform some basic network analyses and briefly explain each of your findings [30 points]:
(a) Load the directed network graph (G) from the file amazonNetwork.csv. [2 points]
(b) How many items are present in the network and how many co-purchases happened? [7 points]
(c) Compute the average shortest distance between the nodes in graph G. Explain your results briefly.
[7 points]
(d) Compute the transitivity and the average clustering coefficient of the network graph G. Explain your findings briefly based on the definitions of clustering coefficient and transitivity. [7 points]
(e) Apply the PageRank algorithm to network G with damping value 0.5 and find the 10 nodes with the highest PageRank. Explain your findings briefly. NetworkX documentation of the PageRank algorithm: https://networkx.github.io/documentation/networkx-1.10/reference/generated/networkx.algorithms.link_analysis.pagerank_alg.pagerank.html [7 points]

The main deliverable for this part of the homework is 1) a step-by-step exploration of data in your Jupyter Notebook and 2) a PDF document containing the answers to each of the questions above. You should also describe your conclusions.

2.2 Part 2: Predicting Review-Rating using Features derived from network properties [50 Points]

For this part of the homework, you will build a machine learning model to predict the review rating of the Amazon products on a scale of 0-5 using various network properties as features.

We provide you with the training dataset (reviewTrain.csv) which you should use judiciously to train your models. We also provide a test dataset reviewTest.csv where the “match” label is missing.

You need to extract at least 4 different features based on the network properties to train your model. The error metric that we will use for evaluating your match labels on the test dataset is the mean absolute error (MAE).
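As a reminder of what the metric measures, MAE is the average of the absolute differences between predicted and true ratings. The snippet below is a minimal sketch in plain Python; the function name mean_absolute_error_sketch and the sample ratings are made up for illustration and are not taken from the provided files.

```python
def mean_absolute_error_sketch(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(t - p) for t, p in zip(y_true, y_pred))
    return total / len(y_true)


if __name__ == "__main__":
    true_ratings = [5, 3, 4, 1]   # made-up example ratings
    predicted = [4, 3, 5, 3]      # made-up example predictions
    # Absolute errors are 1, 0, 1, 2, so the mean is 4 / 4 = 1.0
    print(mean_absolute_error_sketch(true_ratings, predicted))
```

A lower value is better, and a model that always predicts the exact rating would score 0.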
Some of the features that you can consider using include:
• Clustering Coefficient
• Page Rank
• Degree centrality
• Closeness centrality
• Betweenness centrality

Some of the models that you can consider using include:
• Logistic Regression
• Support Vector Machine (SVM)
• Multi-layer perceptron

The main deliverable for this part of the homework is a step-by-step analysis of your feature selection and extraction and model building exercise, describing clearly how you generated features from your dataset and why you chose a specific feature over the other. Your Jupyter notebook should contain the reproducible code for training various models as well as text descriptions of your conclusions after each step.

Your grade on this part of the homework will depend on the accuracy of your model on the test dataset as well as your step-by-step description of how you arrived at your final model. We will evaluate your model using mean absolute error (MAE).

Here’s the description of files included with this homework.
1. amazonNetwork.csv: This file contains the data for Part 1 of the homework. It contains 10841 observations and 2 columns with the numbers representing product IDs. Each node represents a product and each directed edge between two nodes represents a co-purchase. The column fromNodeId contains the ID of the main purchasing item and ToNodeId contains the ID of the co-purchased items.
2. reviewTrain.csv: This file contains the training data for Part 2 of the homework. It contains 1674 observations and 4 columns/features. The review column contains ratings on a scale of 1-5.
3. reviewTest.csv: This file contains the test data for Part 2 of the homework. Please insert your prediction results in the review column in the file.

After receiving some great feedback from the students regarding the questions on the midterm exam, we thought of having a “tiny” competition among the students to generate potential midterm questions!
Here are the details:
• You need to generate 1 question that can be a potential exam question for SI 671/721 based on the material that we have covered till 11/1/2022 (Streaming data).
• The question should be a multiple choice with 1 or more correct answers. In other words, questions with descriptive answers are not allowed.
• It can be a standalone question testing some course concepts, e.g., the midterm question “Which of the following are frequent itemsets...” OR it can be a composite question with a few sub-questions similar to the scenario-based questions on the midterm, e.g., “Planning the course paths for students...”
• You also need to provide the correct answer for the question.
• Your submitted questions will be evaluated anonymously by your fellow classmates! Each student will be assigned 5 questions (from other students), and they will 1) rank those questions from 1 to 5 in terms of quality, and 2) reply with yes/no regarding whether the submitted question was correct in the first place.

Here is how we will grade your submitted questions:
• Submitting the ranked list (and correctness) for the 5 questions assigned to you on time. [5 points]
• Correctness of your own submitted question. [5 points] (0 points if your submitted question/answer was incorrect as judged by your classmates)
• The remaining 10 points will be given based on the quality of your question. 1st rank = 10 points, 2nd rank = 8 points, 3rd rank = 6 points, 4th rank = 4 points, 5th rank = 2 points. For example, if my submitted question received 1 vote each for 1st, 2nd, 3rd, 4th, 5th ranks by the students, then I’d receive (10+8+6+4+2)/5 = 6 points out of 10. If my question got all 5th rank votes, then I’d receive (2+2+2+2+2)/5 = 2 points out of 10, and so on.

Note that questions that involve asking arcane facts embedded in a footnote on one of the slides might not be rated as high-quality by your peers!
So, get set to unleash your creativity!

5 Submission

All submissions should be made electronically. Here are the main deliverable files:
• HTML version of your Jupyter notebook. (Only one HTML file should be submitted.)
• The actual Jupyter notebook with “step-by-step analysis,” so that we can replicate your results.
• PDF document containing your Part 1 answers.
• File reviewTest.csv with your predicted ratings on a scale of 1-5 for Part 2 of the homework. Keep all the columns in the file reviewTest.csv which we shared with you, as they are. Just update the file with your predictions in the correct column.
• Submission details TBD for Question 4. (Most likely, the submission will be as an anonymous submission to Canvas.)
Homework 2: Mining Time Series – do the top 5 countries with the most cumulative COVID-19 cases demonstrate similar patterns?

Summary

We will continue to explore the data we used in the Time Series lab from the Johns Hopkins University CSSE COVID-19 dataset. However, this time, we are interested in the number of daily new cases exclusively from the top 5 countries that have the most cumulative cases as of August 21, 2020.

To explore and analyze this dataset, this assignment will focus on extracting the seasonal component from the countries’ time series, computing the similarity between them, and calculating the Dynamic Time Warping (DTW) Cost.

Data

For this assignment, we will be reusing the time_series_covid19_confirmed_global.csv file from the Time Series lab.

Packages

We recommend using the following Python packages in this assignment:
● numpy
● pandas
● matplotlib
● statsmodels
● math

Assignment Structure

This homework is divided into the following parts:
Part 1: Load & Transform the Data
Part 2: Extract Seasonal Components
Part 3: Time Series Similarities
Part 4: Dynamic Time Warping (DTW) Cost

a) [15 points] To begin, create a function called `load_data` that reads in the csv file and produces a `pd.DataFrame` that looks like:

where
● the index of the DataFrame is a `pd.DatetimeIndex`;
● the column names “?” are the top 5 countries with the most cumulative cases as of August 21, 2020, sorted in descending order from left to right;
● the values of the DataFrame are daily new cases; and
● the DataFrame doesn’t contain any `NaN` values.

This function should return a `pd.DataFrame` of shape (212, 5), whose index is a `pd.DatetimeIndex` and whose column labels are the top 5 countries.

b) [5 points] Then, using your newly created `load_data` function, plot one line for each country that is in the top 5 for most cumulative cases where the x-axis is the date and the y-axis is the number of cases.
Please do so within one figure.

Recall from lecture and lab that an additive Seasonal Decomposition decomposes a time series into the following components:

Y(t) = T(t) + S(t) + R(t)

where T(t) represents trends, S(t) represents seasonal patterns and R(t) represents residuals. In the rest of the assignment, we will work with the seasonal component S(t) to understand the similarities among the seasonal patterns of the five time series we have, so let’s write a function that extracts this very seasonal component.

a) [10 points] Complete a function, `sea_decomp`, that accepts a `pd.DataFrame` and returns another `pd.DataFrame` of the same shape that looks like:

where
● the index of the DataFrame is a `pd.DatetimeIndex`;
● the column names “?” are the top 5 countries with the most cumulative cases as of August 21, 2020, sorted in descending order from left to right;
● the values of the DataFrame are the seasonal components S(t) as returned by the `seasonal_decompose` function from the `statsmodels` package; and
● the DataFrame doesn’t contain any `NaN` values.

This function should return a `pd.DataFrame` of shape (len(df), 5), whose index is a `pd.DatetimeIndex` and whose column labels are the top 5 countries.

b) [5 points] Then, using this function, please plot one line for each country in the top 5 showing the seasonal component. You should have a total of 5 line graphs where the x-axis is the date and the y-axis is the seasonal component.

3.1 Euclidean Distance [20 points]

Now, we may start to ask questions like, “which of the top 5 countries is the most similar to Country A in terms of seasonal patterns?”. In addition to the seasonal components that reflect seasonal patterns, we also need a measure of similarity between two time series in order to answer questions like this. One such measure is the good old Euclidean Distance.
Recall that the Euclidean Distance between two vectors x and y is the length of the vector x – y:

$d(x, y) = \lVert x - y \rVert = \sqrt{\sum_i (x_i - y_i)^2}$

a) [15 points] Complete a function, `calc_euclidean_dist`, that accepts a `pd.DataFrame`, whose columns are time series for each country, and that returns all pairwise Euclidean Distances among these time series, similar to the following:

where
● the index and the column names “?” are the top 5 countries with the most cumulative cases as of August 21, 2020, sorted in descending order from top to bottom and from left to right; and
● the values of the DataFrame are pairwise Euclidean Distances, for example, `233760.757213` is the Euclidean Distance between the time series of the Rank 1 country and the Rank 2 country.

This function should return a `pd.DataFrame` of shape (5, 5) whose index and column labels are the top 5 countries.

b) [5 points] Then, use this new function to calculate the pairwise Euclidean Distance matrix for the extracted seasonal components from the top 5 countries with the most cumulative cases.

3.2 Cosine Similarity [20 points]

Another commonly used similarity measure is the Cosine Similarity.
Recall that the Cosine Similarity between two vectors x and y is the cosine of the angle between x and y:

$\cos(x, y) = \dfrac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$

a) [15 points] Complete a function, `calc_cos_sim`, that accepts a `pd.DataFrame`, whose columns are the time series for each country, and that returns all pairwise Cosine Similarities among these time series, similar to the following:

where
● the index and the column names “?” are the top 5 countries with the most cumulative cases as of August 21, 2020, sorted in descending order from top to bottom and from left to right; and
● the values of the DataFrame are pairwise Cosine Similarities, for example, `0.898664` is the Cosine Similarity between the time series of the Rank 1 country and the Rank 2 country.

This function should return a `pd.DataFrame` of shape (5, 5), whose index and column labels are the top 5 countries.

b) [5 points] Now, use this new function to calculate the pairwise Cosine Similarity between seasonal patterns.

4.1 Define a Function to Calculate DTW Cost [10 points]

Last but not least, the cost of aligning two time series can also be used as a similarity measure. Two time series are more similar if it incurs less cost to align them. One of the commonly used alignment costs is the Dynamic Time Warping (DTW) cost, which we will explore in this problem. Recall from lecture that the DTW cost is defined by the recursive relations given there, where we define $d(x_i, y_j) = (x_i - y_j)^2$.

a) [10 points] With reference to the demo of the DTW algorithm in the lecture slides, implement a function, `calc_pairwise_dtw_cost`, below that computes the DTW cost for two time series. We don’t take the square root of the results just yet, until later when we compare the DTW costs with the Euclidean Distance.
This function should EITHER return a `np.ndarray` of shape (len(y), len(x)) which represents the DTW cost matrix, OR a single `float` that represents the overall DTW cost, depending on whether the parameter `ret_matrix` is `True`.

4.2 Compute Pairwise DTW Cost [15 points]

Now let’s compute all pairwise DTW costs for our five time series.

a) [10 points] Implement a function, `calc_dtw_cost`, below that accepts a `pd.DataFrame`, whose columns are the time series for each country, and that returns all pairwise DTW costs among these time series, similar to the following:

where
● the index and the column names “?” are the top 5 countries with the most cumulative cases as of August 21, 2020, sorted in descending order from top to bottom and from left to right; and
● the values of the DataFrame are pairwise DTW costs, for example, `9.575974e+09` is the DTW cost between the time series of the Rank 1 country and the Rank 2 country.

This function should return a `pd.DataFrame` of shape (5, 5), whose index and column labels are the top 5 countries.

b) [5 points] Now, use this function to calculate the pairwise DTW costs between seasonal patterns. Please take the square root so that we can compare it with the Euclidean Distance. What can you say about the similarities among these seasonal patterns? Do the results of the pairwise Euclidean Distance, Cosine Similarity and DTW Cost calculations tell the same story?

Submission

All submissions should be made electronically. Here are the main deliverables:
● A PDF version of your executed Jupyter Notebook
● The actual Jupyter notebook, so that we can check your results

Please make sure to provide appropriate conclusions drawn from the code/results throughout the notebook.
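To make the three measures from Parts 3 and 4 concrete, here is a small sketch on plain Python lists using only the `math` package. The function names ending in `_sketch` are hypothetical and are not the functions the assignment asks for. The DTW part assumes the commonly used recurrence D(i, j) = d(x_i, y_j) + min(D(i-1, j), D(i, j-1), D(i-1, j-1)) with d the squared difference; the recurrence and boundary handling shown in lecture may differ, so treat this as an illustration only.

```python
import math


def euclidean_distance_sketch(x, y):
    """Length of the vector x - y."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine_similarity_sketch(x, y):
    """Cosine of the angle between x and y (undefined for zero vectors)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def dtw_cost_sketch(x, y, ret_matrix=False):
    """DTW cost with squared-difference local cost and no final square root.

    D[j][i] is the cost of aligning x[:i+1] with y[:j+1], so the matrix has
    len(y) rows and len(x) columns.
    """
    n, m = len(x), len(y)
    D = [[0.0] * n for _ in range(m)]
    for j in range(m):
        for i in range(n):
            local = (x[i] - y[j]) ** 2
            if i == 0 and j == 0:
                best = 0.0                      # start of both series
            elif j == 0:
                best = D[0][i - 1]              # first row: only move along x
            elif i == 0:
                best = D[j - 1][0]              # first column: only move along y
            else:
                best = min(D[j - 1][i], D[j][i - 1], D[j - 1][i - 1])
            D[j][i] = local + best
    return D if ret_matrix else D[m - 1][n - 1]


if __name__ == "__main__":
    print(euclidean_distance_sketch([0, 0], [3, 4]))
    print(cosine_similarity_sketch([1, 2], [2, 4]))
    print(dtw_cost_sketch([1, 2, 3], [1, 2, 2, 3]))
```

In the last call the two series differ only by a repeated value, so the warping path can absorb the repetition and the alignment cost is zero, whereas a point-by-point Euclidean comparison would not even be defined for series of different lengths.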
Do you ever have questions like “how long does coronavirus survive on surfaces,” “how do i beat Gattuso in Tales of Vesparia” or “how exactly do apps not running ‘in the background’ receive notifications”? If so, then this assignment is for you. In Homework 2, you’ll be building a vertical search engine (of sorts) by designing document ranking functions and testing them in three different domains: covid-19, gaming, and android software development.

The goal of this assignment is to understand how ranking works in a real retrieval system using a vector space model (VSM) of retrieval. One of the major tasks in building an IR system is to implement a basic ranking algorithm that scores documents based on their relevance to the query. This assignment has two parts where you will design different VSM rankers. In the first, you’ll re-implement the BM25 ranking function and Pivoted Length Normalization methods for ranking.

These algorithms are simple but effective and should help you familiarize yourself with the basic features and data you’ll likely use for part two. In part two, you’ll implement your own new ranking function. Part two is where you will likely spend most of your time.

To evaluate correctness, we’ll be using Kaggle to have you submit your document rankings for each query. Your rankings will be evaluated using NDCG@10 or NDCG@5, depending on the dataset. An untuned BM25 implementation, a simple yet effective method, will serve as the baseline for how well retrieval could work.

To participate, you must join three Kaggle InClass competitions, one linked to each dataset (see links in each section for when/where to submit). Please be sure to choose a username that we can identify as you, or specify your Kaggle username in your homework submission itself.

This assignment uses the pyserini library,¹ which is a modern Information Retrieval library that supports a huge variety of different functionality, including deep learning.
We recommend installing pyserini from pip (e.g., pip install pyserini). However, if this gives you issues, feel free to post on Piazza or run the assignment on Google colab (https://colab.research.google.com), which is particularly recommended if you’re running this on Windows.

This assignment has the following learning goals:
1. Become familiar with a major information retrieval software package (pyserini)
2. Gain new software development and debugging skills
3. Improve the ability to read software library documentation
4. Learn how to translate equations into code
5. Understand the influence of hyperparameters on the performance of BM25 and Pivoted Length Normalization
6. Understand how vector space ranking functions work through exploring new formulations of the functions
7. Learn about issues in domain transfer when applying a tuned ranker to a new type of document collection

¹ https://github.com/castorini/pyserini

2 Provided Code and Kaggle Setup

We have provided several files to help you get started.
• rankers.py: You will implement your various rankers here as subclasses of Ranker, which is provided in this file as well.
• main.py: This is basic skeleton code for running your ranking function. You can modify this code to have it write results or explore what kind of ranking your model gives you.

Ultimately, this code is just a starting point and you are welcome to rewrite any part of it, provided that we can clearly identify the three ranking methods we ask for (described next). When you submit, we’ll ask for all the code you used (even if you left some of these files unmodified) so that we can check that things work.

Your first task will be to re-implement two famous ranking functions we’ve talked about in class.
Implementing these functions will mostly require you to figure out how to translate the equations into pyserini operations.

In general, you need to write a function that, for a query, computes a relevance score for every document by aggregating the weight of every word that occurs in both the document and the query: $s(q, d) = g\left[f(w_1, q, d) + \dots + f(w_N, q, d),\ q,\ d\right]$, where $w_1, w_2, \dots, w_N$ are the terms in the query q that appear in the document d. That is, the score of a document d given a query q is a function g of the accumulated weight f for each matched term.

Before you can rank, you need some way of representing the query and documents in the same vector space. The first step is then to index the data using pyserini’s API and write this index to a file. We recommend reviewing pyserini’s documentation on indexing to get started: https://github.com/castorini/pyserini#how-do-i-index-and-search-my-own-documents. We’ll be using three document collections for this assignment, which are provided on the Kaggle InClass competitions. You will use your code to index each of the collections separately to create different index files.

NOTE: The indexing itself is actually very easy with pyserini (it can be done in a single line of code), so if your method seems complicated, something has gone amiss.

Once you have the data indexed, your second task will be to implement Pivoted Length Normalization as a ranking function. We have provided some skeleton code to get you started. You can find the formulas for Pivoted Length Normalization in the lecture slides or in the textbooks. Most of your effort in this task will be spent learning how to use pyserini and access the kind of data you need to calculate the scoring function.
The code itself is relatively simple, so one goal is to have you familiarize yourself with their documentation and, in general, practice how to learn a new complex software library.

Once you have finished your implementation, please upload your predictions to the following Kaggle InClass for the trec-covid dataset: https://www.kaggle.com/t/c9b042ef090f49f4bd70fa18d5fb4b3f. You are welcome to tune the hyperparameters of the scoring function, but that is not needed for Task 2 (but will be for Task 3!). Full credit will depend on a correct implementation.

The third task asks you to implement another, slightly more complicated ranking function, BM25. You can find the formulas for BM25 in the lecture slides. You should use the definitions in the course slides, as other definitions may be different (e.g., Wikipedia’s BM25 is different and missing the k3 term).

Once you have finished your implementation, submit your rankings to the Kaggle InClass for the gaming dataset: https://www.kaggle.com/t/992a05c211b940088942205b5f3b4e5a. You will be graded based on your best performing submission. You’ll get full credit if your retrieval function can beat the provided baseline on the dataset. The baseline is an untuned BM25, which means you’ll need to do some hyperparameter tuning.

The second half of the assignment will have you implement at least one retrieval function different from BM25, Dirichlet Prior, and Pivoted Length Normalization (and any variations of these). You will be graded based on your best performing function. You’ll get full credit if your retrieval function can beat the provided baseline on the dataset, which is an untuned BM25. By “beat,” we mean that your implemented function and your choice of parameters should reach a higher NDCG@5 than the baseline on Kaggle for our dataset, which you can check at any time.

Most of your time in this part of the assignment will be spent trying to see what kinds of adjustments you can make to improve a ranking function.
In past years, nearly all students have managed to outperform BM25, so it is very much possible. You are welcome to be creative in your approaches as well. The intent of this part is to help you understand how different parts of the function help or hurt ranking, which is often a very empirical and hands-on process when trying to fine-tune a ranker for a particular dataset.

In your submission, please include the code to implement the retrieval function, the parameters you used that achieved the best performance, and the best performance. In addition, explain what your function does, how you designed it, and why you decided to choose those hyperparameter values. Your explanation should be at least a few sentences, if not longer. You will lose points if you cannot explain why your function can reach a higher performance. You can include your explanations at the end of the submitted notebook. You should aim to explain your design in at least three sentences.

Note: Simply varying the value of parameters in Okapi/BM25, Dirichlet Prior or Pivoted Length Normalization does not count as a new retrieval function. Variations of any of the algorithms listed in the slides or in the textbooks also do not count as new functions. Once you have completed your function, upload your results to the following Kaggle InClass, which uses the android dataset: https://www.kaggle.com/t/4cd177bc66ba411192984e81466d9518. Full credit for this function depends on whether it can surpass an untuned BM25 ranking of the same dataset.

5 What to submit?

You need to submit three things:
1. Please submit your code in a runnable format to Canvas; .ipynb files are acceptable.
2. Please submit the rankings for all three ranking functions to their respective Kaggle InClass sites. Be sure to join using the links above. Please be sure to include your Kaggle username in this file so we can figure out which score is yours.
3.
Please submit a text (pdf/Word) file that describes your choices and implementation for Part 2 on designing your own ranking function. We will need to understand this to make sense of your code.

Everything needs to be submitted to Canvas. Please do not submit a zip; submitting each file separately is strongly preferred.

6 Late Policy

Throughout the semester, you have three free late days total. These are counted as whole days, so 1 minute past the deadline results in 1 late day used (Canvas tracks this so it’s easier/fair). However, if you have known issues (interviews, conference, etc.), let us know at least 24 hours in advance and we can work something out. Special Covid Times™ Policy: If you are dealing with Big Life Stuff®, let the instructor know and we’ll figure out a path forward (family/health should take priority over this course). Once the late days are used up, the homework cannot be submitted for credit, though speak with the instructor if you think this is actually a possibility before actually not submitting.
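To make the general scoring form s(q, d) from the ranking tasks above more concrete, here is a small sketch in plain Python. It is only an illustration under stated assumptions: the term weight `term_weight` is a hypothetical placeholder (query TF times document TF times a smoothed IDF), not BM25, not Pivoted Length Normalization, and not anything from the provided rankers.py; g is taken to be the identity; and the statistics are passed in as plain dictionaries rather than read from a pyserini index.

```python
import math
from collections import Counter


def term_weight(term, query_counts, doc_counts, doc_freq, num_docs):
    """Hypothetical f(w, q, d): query TF * document TF * a smoothed IDF."""
    idf = math.log(1.0 + num_docs / doc_freq[term])
    return query_counts[term] * doc_counts[term] * idf


def score(query_tokens, doc_tokens, doc_freq, num_docs):
    """s(q, d): accumulate f over terms that occur in both query and document."""
    query_counts = Counter(query_tokens)
    doc_counts = Counter(doc_tokens)
    matched = [t for t in query_counts if t in doc_counts]
    accumulated = sum(
        term_weight(t, query_counts, doc_counts, doc_freq, num_docs) for t in matched
    )
    return accumulated  # g is the identity in this sketch


if __name__ == "__main__":
    # Toy statistics, made up for illustration only.
    doc_freq = {"android": 1, "notifications": 2, "background": 1}
    print(score(["android", "notifications"],
                ["android", "android", "background"],
                doc_freq, num_docs=2))
```

In the actual assignment, the counts and document frequencies would come from the index, and f and g would be replaced by the formulas from the lecture slides.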
Sam lets a parent borrow their phone only to discover that they have opened up TikTok and managed to subscribe to thousands of accounts. Their feed is now a useless mess of spam accounts, weird dances, and outrage videos. Sam doesn’t have time to deal with all of this manually, so they decide to find a programmatic way of unfollowing accounts, starting with getting rid of spam accounts.

Luckily, Sam attended SI 650 / EECS 549 and decided to filter the accounts in a Bayesian way. They went through 12 accounts manually and noted down 4 observations for each account in Table 1 to figure out which features are more likely to be used in spam accounts.
1. S: whether the account posts mostly spam (1 for yes, 0 for no);
2. B: whether the account’s bio has the word “based” (1 for yes, 0 for no);
3. U: whether the account’s bio has a URL link (1 for yes, 0 for no);
4. E: whether the account’s bio has an emoji (1 for yes, 0 otherwise).

Sam now wants to build a filter with these observations. In terms of probabilistic reasoning, we can formulate the question as evaluating the conditional probability P(S|B, U, E). We say that the account is spam if P(S = 1|B, U, E) > P(S = 0|B, U, E). We make a further conditional independence assumption that P(B, U, E|S) = P(B|S)P(U|S)P(E|S).

In other words, we assume that if the status of whether an account is spam is known (i.e., the value of S is known), the values of B, U, and E are independent of each other.

a) (5 points) Fill in Table 2 with conditional probabilities using only the information present in the 12 samples.

b) (5 points) With the independence assumption, use the Bayes formula and the calculated conditional probabilities to compute the probabilities that account a with B = 0, U = 1, E = 0 is spam. That is, compute P(S = 1|B = 0, U = 1, E = 0) and P(S = 0|B = 0, U = 1, E = 0). Would you conclude that account a is a spam?
Show your computation.

c) (5 points) Now, compute P(S = 1|B = 0, U = 1, E = 0) and P(S = 0|B = 0, U = 1, E = 0) directly from the 12 examples in Table 1, just like what you did in problem A. Do you get the same value as in problem B? Why?

S  B  U  E
0  1  1  1
1  0  1  0
1  1  1  1
0  1  1  0
1  1  0  1
0  0  0  0
1  1  1  0
0  0  1  1
0  0  0  0
1  0  1  1
1  1  1  0
1  1  0  0
Table 1: Sample observations of the accounts

S   P(B = 1|S)   P(U = 1|S)   P(E = 1|S)   prior P(S)
1   0.71428      ?            ?            ?
0   ?            ?            ?            0.41666
Table 2: Conditional Probabilities and Prior

d) (5 points) Now, ignore Table 1, and consider any possibilities you can fill in Table 2. Are there any constraints on these values that we must respect when assigning these values? In other words, can we fill in Table 2 with 8 arbitrary values between 0 and 1? If not, are there any constraints on some values that we must follow? Describe your answer.

e) (5 points) Can you change your conclusion of problem a (i.e., whether account a is a spam) by only changing the value E (i.e., if the account bio has an emoji) in one example of Table 1? Describe your answer.

f) (5 points) Explain why the independence assumption P(B, U, E|S) = P(B|S)P(U|S)P(E|S) does not necessarily hold in reality.

In this exercise, we are going to get our hands dirty and play with some data in the wild. Download two collections from Canvas, reddit-questions.10k.txt and wiki-bios.10k.txt. The first collection is 100,000 questions randomly sampled from r/AskReddit. The second collection is 100,000 biographies of people taken from Wikipedia. You can also find a stopword list in stoplist.txt.

For text processing, we’ll use the SpaCy library, which is a modern NLP library that is well documented and highly performant. You will want to use the nlp function from SpaCy for tokenizing and part-of-speech (POS) tagging.

1. (5 points) Tokenize the text using SpaCy and compute the frequency of words.
Then, plot the frequency distribution of words in each collection after the removal of the stopwords: x-axis: each point is a word, sorted overall by frequency (number of times a word appears in the collection);¹ y-axis: how many times the word occurred. Plot this using a log scale on each axis. Does each plot look like a power-law distribution? Are the two distributions similar or different?

2. (10 points) Now compare the two collections more rigorously. Report the following properties of each collection, using SpaCy to POS tag. Can you explain these differences based on the nature of the two collections?
a) frequency of stopwords (percentage of the word occurrences that are stopwords);
b) percentage of capital letters;
c) average number of characters per word;
d) percentage of nouns, adjectives, verbs, adverbs, and pronouns;
e) the top 10 nouns, top 10 verbs, and top 10 adjectives.

3. (10 points) We would like to summarize each document with a few words. However, picking the most frequently used words in each document would be a bad idea, since they are more likely to appear in other documents as well. Instead, we pick the words with the highest TF-IDF weights in each document.

In this problem, term frequency (TF) and inverse document frequency (IDF) are defined as: $TF(t, d) = \log(c(t, d) + 1)$ and $IDF(t) = 1 + \log(N/k)$. Here c(t, d) is the frequency count of term t in doc d, N is the total number of documents in the collection, and k is the document frequency of term t in the collection. For each of the first 10 documents in the Wikipedia biographies collection, print out the 5 words that have the highest TF-IDF weights. Write whether you think these could be a good summary of the documents.

4. (5 points) As discussed in class, TF-IDF is a common way to weight the terms in each document. It can also be easily calculated from the inverted index (covered in Week 3), since TF can be obtained from the postings and IDF can be summarized as a dictionary.
Could you think of another weighting that cannot be calculated directly from the inverted index? What is the advantage of such a weighting?

• Hint 1: You can find a tutorial for SpaCy at https://spacy.io/usage/spacy-101 which covers all of the functionality you’ll need here as well as many more advanced preprocessing steps.
• Hint 2: You may find there are a lot of decisions to make: Should I lower-case the words? Should I use a stemmer or a lemmatizer? What to do with the punctuation? How should I handle html, markdown, or emoji? There is not always a right or wrong choice; different answers are accepted. But you should write down clearly how you process the data in each part and explain your decisions.

¹ This also gets called the “rank” of the word in the collection.

Suppose we have a query with a total of 20 relevant documents in a collection of 100 documents. A system has retrieved 20 documents whose relevance status is [++, -, +, ++, -, +, -, ++, +, -, -, +, ++, -, -, +, +, ++, +, -] in the order of ranking. A + or ++ indicates that the corresponding document is relevant, while a - indicates that the corresponding document is non-relevant.

• (10 points) Compute the precision, recall, F1 score, and the mean average precision (MAP).
• (10 points) Consider ++ as the corresponding document being highly relevant (r_i = 2), while + indicates somewhat relevant (r_i = 1) and - indicates non-relevant (r_i = 0). For the nine remaining relevant documents, treat them as somewhat relevant (r_i = 1). Calculate the Cumulative Gain (CG) at rank 10, Discounted Cumulative Gain (DCG) at rank 10, and Normalized Discounted Cumulative Gain (NDCG) at rank 10. Use log2 for the discounting function.

Note: You may find that the definition of DCG in Wikipedia is different from the definition in our lecture. Please use the one in our lecture to calculate DCG and NDCG. (i.e.
$DCG_p = rel_1 + \sum_{i=2}^{p} \frac{rel_i}{\log_2 i}$)

Let’s build a simple search engine using some of the techniques we learned in class and evaluate it in practice, both for its ability to retrieve relevant documents and its speed of retrieval. Here, we’ll use our Reddit questions and play the role of an auto-suggest: If someone is about to ask a question on Reddit, they can quickly see if someone else has already asked that question by searching for a few keywords. Specifically, you’ll implement a very simple search that (1) measures the cosine similarity of a bag-of-words representation of the query and a bag-of-words representation for a document and (2) returns the k most similar documents to the query.

To support the basic search functionality, we’ll use the scikit-learn package to convert our Reddit questions and queries to bag-of-words vectors.

• (5 points) Write a function that uses CountVectorizer to convert the Reddit questions corpus to vectors. Write a function that, given a new query, will convert its text to a vector (using the same vectorizer), estimate the cosine similarity between the query and each document (i.e., each Reddit question) and return the 10 most similar.

• (10 points) Using your method,
– run the following queries and show the questions they return: (1) programming, (2) pets, (3) college, (4) love, and (5) food.
– Score each retrieved question for relevance using a three-point scale as in Problem 3 (very relevant, somewhat relevant, not relevant)
– compute NDCG for each query and report it.

• (5 points) In 2-3 sentences, describe how well you think your IR system is doing. What kinds of queries do you think it would work well on? What kinds of queries do you think it will perform poorly on? (Feel free to describe example queries if you want to test things!)

Finally, let’s get a sense of how scalable our system is. Because we’re not imposing a minimum frequency for including a word in our CountVectorizer, we’re effectively indexing every word.
For this exercise, we have 100K questions, but what if we had 1M or 1B? How many terms would we be indexing? Let’s estimate this by looking at how our index size (i.e., the number of words recognized by the vectorizer) grows relative to our corpus size.

• (5 points) Re-run your CountVectorizer code with 1K, 5K, 10K, 50K, and 100K questions and, for each, plot how many terms appear in the vocabulary field for the vectorizer. In general, we recommend using Seaborn for all plotting. Write 2-3 sentences on why you think this approach will or won’t scale as we get more documents and justify your answer.

If you’re feeling particularly curious, try repeating this process with random samples of each number of questions (e.g., a random 1K questions) and show the mean and standard error.
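For the NDCG computations asked for in the evaluation and search-engine problems above, the lecture's DCG definition (the first result is not discounted; results at rank i >= 2 are divided by log2 i) can be sketched as follows. The function names `dcg_at_k` and `ndcg_at_k` and the optional `all_rels` argument are assumptions made for this illustration; the relevance values in the usage lines are toy numbers, not the assignment's ranking.

```python
import math


def dcg_at_k(rels, k):
    """DCG_k = rel_1 + sum over i = 2..k of rel_i / log2(i) (lecture definition)."""
    rels = list(rels)[:k]
    if not rels:
        return 0.0
    return rels[0] + sum(r / math.log2(i) for i, r in enumerate(rels[1:], start=2))


def ndcg_at_k(rels, k, all_rels=None):
    """NDCG_k = DCG_k / ideal DCG_k.

    all_rels: graded relevance of every relevant document for the query,
    including ones that were not retrieved. If omitted, the ideal ranking
    is built from the retrieved list only, which is an approximation.
    """
    pool = rels if all_rels is None else all_rels
    ideal_dcg = dcg_at_k(sorted(pool, reverse=True), k)
    return dcg_at_k(rels, k) / ideal_dcg if ideal_dcg > 0 else 0.0


if __name__ == "__main__":
    ranking = [1, 2, 0, 1]  # toy graded relevance in rank order
    print(dcg_at_k(ranking, 4))
    print(ndcg_at_k(ranking, 4))
```

Passing `all_rels` matters when relevant documents are missing from the retrieved list, because the ideal ranking should then be built from all relevant documents rather than only the retrieved ones.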
Pretrained language models like BERT and GPT have revolutionized many areas of NLP. These models have been trained on massive volumes of text and their parameters reflect a basic understanding of the structure and meaning of many kinds of text.

The huge advantage of such pre-training is that models can be quickly adapted to new tasks by performing fine-tuning, where the parameters are modified to reflect a particular type of text (e.g., social media) and to use them to perform some specific task (e.g., classification). As a result, with just a limited amount of data, these models are capable of generalizing to a much wider set of instances; for example, a BERT-based classifier trained on a few thousand examples might perform substantially better than a logistic regression classifier trained on the same data (like we saw in the demo notebook in class!). Homework 4 is designed to get you experience with these kinds of models.

In the long long ago of Lecture 2, we learned about language models that could generate text and briefly played around with the Talk to Transformer website, which can generate stories from the start of one. Wouldn’t it be cool to try building and using that kind of technology yourself? In this assignment, you’ll finally get face to face with some of the latest NLP deep learning libraries to do two tasks:
1. Generate the lyrics to a song
2. Detect whether a song you’ve been given was generated by a machine or not.

As a part of this assignment, you’ll work with one of the many libraries built on top of the HuggingFace transformers library (or work with transformers itself), which has implementations of many of the latest-and-greatest NLP models and, conveniently, has included example scripts for how to effectively fine-tune pretrained models to do both of these tasks.

The overarching goal of Homework 4 is to generate the most realistic song lyrics you can from the generation model.
However, you’re also tasked with detecting these machine-generated lyrics, which creates a dilemma: you want your generation system to fool your classification system (making it think the lyrics are human-authored), yet you also want the classification system to do well.

¹ https://talktotransformer.com/

Most of the homework effort will consist of familiarizing yourself with how to run these models and will likely not involve writing much code (unless you want to go wild on creating cool generators/classifiers, which I fully support).

Homework 4 has the following (very practical) learning goals:
1. Learn how to use modern deep learning libraries like transformers to fine-tune pretrained models like BERT (for classification) and GPT (for generation)
2. Learn how to run your code on a GPU (as provided by Great Lakes)
3. Get a sense of training time and complexity for working with these models
4. Try your hand at tuning hyperparameters to make better models, especially as it relates to convergence

In summary, we want to help you get familiar with the basics of these models in a way that gives you skills you might use on a job, a course project, or in a research setting (e.g., fine-tuning a BERT classifier for your particular domain).

Pre-trained language models are trained on a variety of text and, taken off the shelf, won’t generate “lyric-like text” on their own. As a result, we first need to fine-tune a language model to have it learn the structure and vocabulary of song lyrics.

For Part 1, you’ll fine-tune one of the transformers from the HuggingFace library to generate text. I personally recommend the openai-gpt or gpt2 model, as they’re small enough to reliably fit on the GPUs given to you on the Great Lakes clusters. Note that in deep learning terminology, for training your language generation system in Part 2, you’ll want to fine-tune what are known as causal language models (CLMs), which are the language models we talked about in class in Week 2. GPT and GPT2 are examples of CLMs.
These models capture the left-to-right ordering of words, e.g., $p(w_i \mid w_{i-1})$, and are trained to predict from only the prior context. CLMs are in contrast to masked language models (MLMs) like BERT, which condition on the whole context to predict missing words.

You can use either the transformers library with one of its training scripts or the simpletransformers library and its wrappers. Both are easy to use in practice and will mostly require you to learn how to call the training functions using the provided scripts or skeleton code.

² We’ll talk about this concept much more in class as a part of a special kind of network called a Generative Adversarial Network (GAN), but you are not implementing that here.

We have provided three large datasets for you to use on Piazza. These are real lyrics (see the copyright disclaimer below) and come with the artist and genre(s) that the artist is known for. The data is divided into train, dev, and test. You should use the training data to fine-tune your model. You do not need to use all of the training data. Given the time constraints (both end-of-semester and Great Lakes runtimes), you are welcome to use subsets of the data, provided that you can justify your decision and have enough data to train the model. For example, using the first 100K lyrics is fine, as is using all lyrics from a particular genre (as long as it is not rare); however, using all lyrics from a single artist is insufficient as there are likely too few examples. If you have some concern, please let us know.

The lyric data is provided to you in a JSON format, which preserves the line breaks. However, most (if not all) of the language models will read their input data as one instance per line.
Since you want to generate entire songs and not lines of a song, you’ll need to encode the lyric as a single line.

There are various ways of doing this: one is to add a special token to the vocabulary to represent a newline and include this in your model; another is to simply use a regular text word to indicate a newline. You are free to use any approach for training. Recognize that your model will generate lyrics in this format in Part 2, which you’ll then need to convert back to the actual new line.

When fine-tuning your model, you’ll need to choose hyperparameters that converge to a low loss. To figure out the right hyperparameters, you should use the dev data to select the best hyperparameters that give the lowest perplexity (reminder: perplexity is the metric we discussed back in Lecture 2 that measures how surprised the model is by the words that come next; a lower perplexity indicates the language model expects to be generating this kind of word, which is what we want).

If you want to get fancy (very optional!), your model can also incorporate more structure by including meta-data on the input. Part of the art of making these models is figuring out how to encode it. The encoding is like training a language model to predict $p(w_i \mid w_{i-1}, \text{genre=country}, \text{artist=“Taylor Swift”})$. One way to do this is to include this kind of information as input to the model, either by using special tokens for artist/genre or even just putting it as text the model learns to use.

Important Note: Some of the lyrics in the data we provide you may contain offensive language as written by the original songwriter. The lyrics’ use in the course is for instructional purposes in generating language and the instructional team does not condone their message or intent. As a part of generating language later, you may find that your model learns to reproduce such language, which is one of the potential harms addressed way back in Week 2 with the Stochastic Parrots paper.
You are welcome to filter out such messages from your fine-tuning data (though the underlying language model may still be biased towards generating them to begin with!).

What to do:
● Convert the train, dev, and test data to single-line input format (one song per line)
● Train a model using the training data and save it to an output directory. You can/should train multiple models and evaluate them using dev data to pick the model with the lowest perplexity on the dev set.
● Report the following
a. The loss on the training data for your best model and its perplexity (on train.txt)
b. the perplexity on dev data (remember you have to process it too in the same way)
c. once you have finalized the model, the perplexity on test data

Important Copyright Disclaimer

The lyrics contained within the provided data are the property of the original author and are provided here solely for educational purposes within the context of the SI 630 class, in accordance with the fair use exemption for education of the copyright act. Redistribution or use of the data for purposes other than for purposes of this class is strictly forbidden. This data is provided at present scale in order to facilitate the training of deep learning models for Homework 4.

Once you have your lyric-generating model fine-tuned (which is the hard part), it’s time to generate lyrics. You can again use the transformers scripts for this or the simpletransformers code (or whatever library you used to fine-tune if it was different). For both libraries, you’ll need to provide a prompt to start the generation process. You should use prompts from the initial word of a song from your data. E.g., for a lyric starting “I saw the”, you would begin the song with “I”.
You are welcome to use any of the initial words to generate your lyrics (though you need variety).

Note: if you decided to add more context to the lyrics (e.g., the genre) during fine-tuning, be sure to filter these out when generating your lyrics.

What to do:
● Using your final model and random seed 2021, generate lyrics for the following 5 prompts:
a. My
b. The
c. One
d. When
e. If
● Generate 500 new lyrics using a sample of the initial word of songs from your data. E.g., for a lyric starting “I saw the”, you would begin the song with “I”. You are welcome to use any of the initial words to generate your lyrics (though you need variety).
● Use your model to generate lyrics (or pick from the 500) and pick your “favorite” to post in the Piazza post called “Put your favorite song lyrics here.” Please pick something relatively clean, if possible.

Since the model is originally pre-trained on lots of language, feel free to try out new words (not in training) or write longer song prompts. Note that for the generation, you’ll eventually want to convert the output of the generation back to a human-readable format that looks like lyrics (e.g., with no special newline characters).

Part 3 will develop a classifier to recognize machine-generated text. You’ll once again use the transformers or simpletransformers library to train a model, this time a classifier. For Part 3, you’ll want to use a Masked Language Model (MLM) for classification (unlike Parts 1 and 2, which used a CLM). I personally recommend the distilbert-base-cased model, as it’s small enough to reliably fit on the GPUs given to you on the Great Lakes clusters. This particular model is a pared-down version of BERT with fewer parameters; however, you’re welcome to try out different models!

Both the transformers and simpletransformers libraries provide scripts and code to train these kinds of classification models. In particular, your task will involve several steps:
1. Create a dataset of machine-generated vs.
human-authored lyrics from the data we give you.
2. Convert the lyrics to a single-line-per-lyric format
3. (Optionally) fine-tune your classification model like you did for language generation
4. Train a classifier on the single-line-per-lyric data

For training data, we’ve provided 10K examples of machine-generated lyrics for train, and 5K each for dev and test. (More data may potentially be available if needed.) You will need to write some code to create the full training, dev, and test data from these machine-generated examples and the human-authored lyrics we’ve given you. You’ll assign the classification label of each lyric. This step is the majority of the code you’ll need to write for this part of the homework. If you want to go wild, you can use your lyric generation from Part 2 to generate more machine-generated training data.

What to include in your write-up:
● Describe in a few sentences how you constructed your train, dev, and test data (e.g., which kinds of lyrics did you use? How did you decide on how many to use?)
● Report the following
a. The accuracy on train.txt for your best model
b. the accuracy on dev.txt
c. once you have finalized the classification model,
■ the accuracy on test.txt
■ the accuracy on the 500 lyrics you have generated from Part 2, i.e., does the model predict your machine-generated lyrics as human or machine?
● Once finished scoring the model, describe what changes you might make to the lyric generation part to improve its ability to generate human-looking lyrics.

What to submit
1. A report with
a. Answers to all the questions (and all the models’ performances)
b. The 5 generated lyrics that had specific prompts we asked for
2. A json file with your 500 generated lyrics (these should be in the “normal” format with newlines)
3. All the code for your generation and classification systems (including any preprocessing)
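The single-line encoding described for Parts 1 to 3 (one song per line, with line breaks restored after generation) can be sketched with the "regular text marker" option as below. This is a minimal sketch under stated assumptions: the marker string `<nl>` and the helper names `encode_lyric` and `decode_lyric` are hypothetical, and the layout of the provided JSON data is not assumed here, so the functions work on a lyric that has already been loaded as a Python string.

```python
NEWLINE_MARKER = "<nl>"  # hypothetical marker; any string absent from the lyrics would do


def encode_lyric(lyric):
    """Collapse a multi-line lyric into one line, marking each line break."""
    lines = [line.strip() for line in lyric.splitlines()]
    return f" {NEWLINE_MARKER} ".join(lines)


def decode_lyric(encoded):
    """Turn a single-line (possibly generated) lyric back into real lines."""
    return "\n".join(part.strip() for part in encoded.split(NEWLINE_MARKER))


if __name__ == "__main__":
    song = "I saw the light\nacross the bay\n\nand then it went away"
    one_line = encode_lyric(song)
    print(one_line)
    print(decode_lyric(one_line) == song)
```

If the marker is instead added to the tokenizer as a special token, the same two helpers still apply; only the model-side vocabulary handling changes, and that part depends on the library used.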
Despite its seeming chaos, natural language has lots of structure. We’ve already seen some of this structure in part of speech tags and how the order of parts of speech is predictive of what kinds of words might come next (via their parts of speech).

In Homework 3, you’ll get a deeper view of this structure by implementing a dependency parser. We covered this topic in Week 10 of the course and it’s covered extensively in Speech & Language Processing chapter 13, if you want to brush up.[1] Briefly, dependency parsing identifies the syntactic relationship between word pairs to create a parse tree, like the one seen in Figure 1.

In Homework 3, you’ll implement the shift-reduce neural dependency parser of Chen and Manning [2014],[2] which was one of the first neural network-based parsers and is quite famous. Thankfully, its neural network is also fairly straightforward to implement. We’ve provided the parser’s skeleton code in Python 3 that you can use to finish the implementation, with comments that outline the steps you’ll need to finish. And, importantly, we’ve provided a lot of boilerplate code that handles loading in the training, evaluation, and test datasets, and converting that data into a representation suitable for the network. Your part essentially boils down to two steps: (1) fill in the implementation of the neural network and (2) fill in the main training loop that processes each batch of instances and does backprop.

Thankfully, unlike in Homeworks 1 and 2, you’ll be leveraging the miracles of modern deep learning libraries to accomplish both of these! Homework 3 has the following learning goals:
1. Gain a working knowledge of the PyTorch library, including constructing a basic network, using layers, dropout, and loss functions.
2. Learn how to train a network with PyTorch.
3. Learn how to use pre-trained embeddings in downstream applications.
4. Learn about the effects of changing different network hyperparameters and designs.
5. Gain a basic familiarity with dependency parsing and how a shift-reduce parser works.

[1] https://web.stanford.edu/~jurafsky/slp3/13.pdf
[2] https://cs.stanford.edu/~danqi/papers/emnlp2014.pdf

[Figure 1: An example dependency parse from the Chen and Manning [2014] paper, showing the sentence “He has good control .” with POS tags PRP VBZ JJ NN . and the arcs root, nsubj, dobj, amod, and punct. Note that each word is connected to another word to symbolize its syntactic relationship. Your parser will determine these edges and their types!]

You’ll notice that most of the learning goals are based on deep learning topics, which is the primary focus of this homework. The skills you learn with this homework will hopefully help you with your projects and (ideally) with any real-world situation where you’d need to build a new network. However, you’re welcome (encouraged, even!) to wade into the parsing setup and evaluation code to understand more of how this kind of model works.

In Homework 3, we’ve also included several optional tasks for those that feel ambitious. Please finish the regular homework first before even considering these tasks. There is no extra credit for completing any of these optional tasks, only glory and knowledge.

2 PyTorch

Homework 3 will use the PyTorch deep learning library. However, your actual implementation will use only a small part of the library’s core functionality, which should be enough to get you building networks. Rather than try to explain all of PyTorch in a mere homework write-up, we’ll refer you to the fantastic PyTorch community tutorials[3] for comprehensive coverage. Note that you do not need to read all these tutorials!
We’re only building a feed-forward network here, so there’s no need to read up on Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), or any variant thereof.

Instead, as a point of departure into deep learning land, try walking through this tutorial on Logistic Regression,[4] which is the PyTorch version of what you implemented in Homework 1. That tutorial will hopefully help you see how the things you had to implement in HW1 get greatly simplified when using these deep learning libraries (e.g., compare their stochastic gradient descent code with yours!).

The biggest conceptual change for using PyTorch with this homework will be using batching. We talked briefly about batching during the Logistic Regression lecture, where instead of using just a single data point to compute the gradient, you use several, or a batch. In practice, using a batch of instances greatly speeds up the convergence of the model. Further, when using a GPU, batching is often significantly more computationally efficient because you’ll be using more of the special matrix multiplication processor at once (GPUs are designed to do lots of multiplications in parallel, so batching helps “fill the capacity” of work that can be done in a single time step).

In practice, we’ve already set up the code for you to be in batches, so when you get an instance with k features, you’re really getting a tensor[5] of size b × k where b is the batch size. Thanks to the magic of PyTorch, you can effectively treat this as a vector of size k and PyTorch will deal with the fact that it’s “batched” for you;[6] i.e., you can write your neural network in a way that ignores batching and it will just happen naturally (and more efficiently).[7]

[3] https://pytorch.org/tutorials/
[4] https://www.kaggle.com/negation/pytorch-logistic-regression-tutorial
[5] Tensor is a fancier name for multi-dimensional data. A vector is a 1-dimensional tensor and a matrix is a 2-dimensional tensor. Most of the operations for deep learning libraries will talk about “tensors” so it’s important to get used to this terminology.

[Figure 2: The network architecture for the Chen and Manning [2014] parser. The figure shows an input layer $[x^w, x^t, x^l]$ built from the words, POS tags, and arc labels of the current stack and buffer configuration, a hidden layer $h = (W_1^w x^w + W_1^t x^t + W_1^l x^l + b_1)^3$, and a softmax layer $p = \mathrm{softmax}(W_2 h)$. Note that this is a feed-forward neural network, which is just one more layer than logistic regression!]

3 The Parser

The Chen and Manning [2014] parser is a feed-forward neural network that encodes lexical features from the context (i.e., words on the stack and words on the buffer) as well as their parts of speech and the current dependency arcs that have been produced. Figure 2 shows a diagram of the network.

The input layer consists of three pieces, $x^w$, $x^t$, and $x^l$, which denote the embeddings for the words, POS tags, and dependency arcs. Each of these embeddings is actually multiple embeddings concatenated together; i.e., if we represent the two words on the top of the stack, each with 50-dimensional embeddings, then $x^w$ has a total length of 50 × 2 = 100. Said another way, $x^w = [e^w_{w_1}; e^w_{w_2}; \ldots; e^w_{w_n}]$ where $e^w_{w_i}$ denotes the embedding for $w_i$ and ; is the concatenation operator.
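To make the concatenation concrete, the following minimal sketch builds the word part of the input layer in plain Python. The tiny vocabulary, the `make_embedding_table` and `concat_embeddings` helpers, and the random initialization are all hypothetical stand-ins for illustration; in the actual homework the lookup is done with tensors inside the model.

```python
import random


def make_embedding_table(vocab, dim, seed=0):
    """Map each token to a small random vector of length `dim` (illustrative only)."""
    rng = random.Random(seed)
    return {tok: [rng.uniform(-0.01, 0.01) for _ in range(dim)] for tok in vocab}


def concat_embeddings(tokens, table):
    """Look up each token's embedding and join them end to end, in order."""
    out = []
    for tok in tokens:
        out.extend(table[tok])
    return out


# Two words on top of the stack, each with a 50-dimensional embedding.
word_table = make_embedding_table(["<null>", "has", "good"], dim=50)
x_w = concat_embeddings(["has", "good"], word_table)
print(len(x_w))  # 100
```

The POS tag and arc label pieces are built the same way from their own tables, and the three pieces are then concatenated into one input vector.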
Each of the input types has separate weights $W_1$ for computing the hidden layer’s raw output (before the activation function is applied), i.e., $W_1^w$, $W_1^t$, and $W_1^l$. Normally, we’ve talked about activation functions like sigmoid or a Rectified Linear Unit (ReLU); however, in the Chen and Manning [2014] parser, we’ll use a different activation function. Specifically, you’ll implement a cubic activation function that cubes the raw output value of a neuron. As a result, the activation of the hidden layer in the original paper is computed as

$h = (W_1^w x^w + W_1^t x^t + W_1^l x^l + b_1)^3$

where $b_1$ is the bias term. Chen and Manning [2014] use separate weight matrices for words, POS tags, and dependencies to make use of several optimizations which are not described here. However, we can simplify this equation and the implementation by using one weight matrix $W_1$ that takes in the concatenated inputs to make the activation

$h = (W_1 [x^w; x^t; x^l] + b_1)^3$.

In your implementation, this means you can use a single layer to represent all three weight matrices ($W_1^w$, $W_1^t$, and $W_1^l$), which should simplify your bookkeeping.

[6] For a useful demo of how this process works, see https://adventuresinmachinelearning.com/pytorch-tutorial-deep-learning/
[7] In later assignments where we have sequences, we’ll need to revise this statement to make things even more efficient!

[Figure 3: An example snippet of the parsing stack and buffer from Chen and Manning [2014] using the sentence in Figure 1. The snippet shows a desired dependency tree, an intermediate configuration, and the following transition sequence of the arc-standard system:

  Transition        Stack                      Buffer                    A
                    [ROOT]                     [He has good control .]   ∅
  SHIFT             [ROOT He]                  [has good control .]
  SHIFT             [ROOT He has]              [good control .]
  LEFT-ARC(nsubj)   [ROOT has]                 [good control .]          A ∪ nsubj(has,He)
  SHIFT             [ROOT has good]            [control .]
  SHIFT             [ROOT has good control]    [.]
  LEFT-ARC(amod)    [ROOT has control]         [.]                       A ∪ amod(control,good)
  RIGHT-ARC(dobj)   [ROOT has]                 [.]                       A ∪ dobj(has,control)
  …                 …                          …                         …
  RIGHT-ARC(root)   [ROOT]                     []                        A ∪ root(ROOT,has)
]

This diagram is an example of how a transition-based dependency parser works. At each step, the parser decides whether to (1) shift a word from the buffer onto the stack or (2) reduce the size of the stack by forming an edge between the top two words on the stack (further deciding which direction the edge goes).

The final outputs are computed by multiplying the hidden layer activation by the second layer’s weights and passing that through a softmax:[8] $p = \mathrm{softmax}(W_2 h)$. This might all seem like a lot of math and/or implementation, but PyTorch is going to take care of most of this for you!

[8] Reminder: the softmax is the generalization of the sigmoid (σ) function for multi-class problems here.

4 Implementation Notes

The implementation is broken up into five key files, only two of which you need to deal with:
• main.py: This file drives the whole program and is what you’ll run from the command line. It also contains the basic training loop, which you’ll need to implement. You can run python main.py -h to see all the options.
• model.py: This file specifies the neural network model used to do the parsing, which you’ll implement.
• feature_extraction.py: This file contains all the gory details of reading the Penn Treebank data and turning it into training and test instances. This is where most of the parsing magic happens. You don’t need to read any of this file, but if you’re curious, please do!
• test_functions.py: This code does all the testing and, crucially, computes the Unlabeled Attachment Score (UAS) you’ll use to evaluate your parser. You don’t need to read this file either.
• general_utils.py: Random utility functions.

[Figure 4: The analogy for how it might seem to implement this parser based on the instructions, but in reality, it’s not too hard!]

We’ve provided skeleton code with TODOs for where the main parts of what you need to do are sketched out. All of your required code should be written in main.py or model.py, though you are welcome to read or adjust the other files’ code to help debug, be more verbose, or just poke around to see what it does.

5 Data

Data has already been provided for you in the data/ directory in CoNLL format. You do not need to deal with the data itself, as the feature_extraction.py code can already read the data and generate training instances.

6 Task 1: Finish the implementation

In Task 1, you’ll first implement the feed-forward neural network in model.py based on the description in this write-up or the original paper. Second, you’ll implement the core training loop in main.py, which will
1. Loop through the dataset for the specified number of epochs
2. Sample a batch of instances
3. Produce predictions for each instance in the batch
4. Score those predictions using your loss function
5. Perform backpropagation and update the parameters.

Many of these tasks are straightforward with PyTorch and none of them should require complex operations. Having a good understanding of how to implement/train logistic regression in PyTorch will go a long way.

The main.py file works with command line flags to enable quick testing and training. To train your system, run python main.py --train. Note that this will break on the released code since the model is not implemented!
However, once it’s working, you should see an output that looks something like the following after one epoch:

    Loading dataset for training
    Loaded Train data
    Loaded Dev data
    Loaded Test data
    Vocab Build Done!
    embedding matrix Build Done
    converting data into ids..
    Done!
    Loading embeddings
    Creating new trainable embeddings
    words: 39550
    some hyperparameters
    {'load_existing_vocab': True, 'word_vocab_size': 39550, 'pos_vocab_size': 48, 'dep_vocab_size': 42, 'word_features_types': 18, 'pos_features_types': 18, 'dep_features_types': 12, 'num_features_types': 48, 'num_classes': 3}
    Epoch: 1 [0], loss: 99.790, acc: 0.309
    Epoch: 1 [50], loss: 4.170, acc: 0.786
    Epoch: 1 [100], loss: 2.682, acc: 0.795
    Epoch: 1 [150], loss: 1.795, acc: 0.818
    Epoch: 1 [200], loss: 1.320, acc: 0.840
    Epoch: 1 [250], loss: 1.046, acc: 0.837
    Epoch: 1 [300], loss: 0.841, acc: 0.843
    Epoch: 1 [350], loss: 0.715, acc: 0.848
    Epoch: 1 [400], loss: 0.583, acc: 0.854
    Epoch: 1 [450], loss: 0.507, acc: 0.864
    Epoch: 1 [500], loss: 0.495, acc: 0.863
    Epoch: 1 [550], loss: 0.487, acc: 0.863
    Epoch: 1 [600], loss: 0.423, acc: 0.869
    Epoch: 1 [650], loss: 0.386, acc: 0.867
    Epoch: 1 [700], loss: 0.338, acc: 0.867
    Epoch: 1 [750], loss: 0.340, acc: 0.874
    Epoch: 1 [800], loss: 0.349, acc: 0.868
    Epoch: 1 [850], loss: 0.320, acc: 0.873
    Epoch: 1 [900], loss: 0.322, acc: 0.879
    End of epoch
    Saving current state of model to saved_weights/parser-epoch-1.mdl
    Evaluating on valudation data after epoch 1
    Validation acc: 0.341 - validation UAS: 70.42

Here, the core training loop is printing out the accuracy at each step as well as the cross-entropy loss. At the end of the epoch, the core loop scores the model on the validation data and reports the UAS, which is the score we care about. Further, the core loop will save the model after each epoch in saved_weights.
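If you capture this console output to a file, the per-step loss and accuracy can be pulled back out for plotting. The sketch below assumes the log lines look exactly like the sample output above; the regular expression and the `parse_training_log` and `mean_by_epoch` helper names are illustrative and not part of the provided code.

```python
import re

# Matches lines such as "Epoch: 1 [50], loss: 4.170, acc: 0.786".
STEP_RE = re.compile(
    r"Epoch:\s*(\d+)\s*\[(\d+)\],\s*loss:\s*(\d+(?:\.\d+)?),\s*acc:\s*(\d+(?:\.\d+)?)"
)


def parse_training_log(text):
    """Return one dict per logged training step found in the captured output."""
    return [
        {
            "epoch": int(m.group(1)),
            "step": int(m.group(2)),
            "loss": float(m.group(3)),
            "acc": float(m.group(4)),
        }
        for m in STEP_RE.finditer(text)
    ]


def mean_by_epoch(rows, key):
    """Average rows[key] within each epoch, e.g. key='loss' or key='acc'."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row["epoch"], []).append(row[key])
    return {epoch: sum(vals) / len(vals) for epoch, vals in grouped.items()}


sample = (
    "Epoch: 1 [0], loss: 99.790, acc: 0.309\n"
    "Epoch: 1 [50], loss: 4.170, acc: 0.786\n"
)
rows = parse_training_log(sample)
print(mean_by_epoch(rows, "acc"))
```

The resulting list of dicts can be handed to a plotting library directly, with one row per logged step.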
7 Task 2: Score Your System

We want to build a good parser, and to measure how well our parser is doing, we’ll use UAS for this assignment, which corresponds to the percentage of words that have the correct head in the dependency arc. Note that this score isn’t looking at the particular label (e.g., nsubj), just whether we’ve created the correct parsing structure. For lots more details on how to evaluate parsers, see Kübler et al. [2009], page 79.

For Task 2, you’ll measure the performance of your system relative to the number of training epochs and evaluate on the final test data. This breaks down into the following problems to solve.

Problem 2.1. Train your system for at least 5 epochs, which should generate 5 saved models in saved_weights. For each of these saved models, compute the UAS score.[9] You’ll make three plots: (1) the loss during training for each epoch, (2) the accuracy for each epoch during training, and (3) the UAS score for the test data for each epoch’s model.[10] You can make a nice plot of each run’s performance using Seaborn: http://seaborn.pydata.org/examples/wide_data_lineplot.html

Problem 2.2. Write at least three sentences describing what you see in the graphs and when you would want to stop training.

[9] The code makes it easy to do this where you can specify which saved model to use, e.g., python main.py --test --load_model_file saved_weights/parser-epoch-2.mdl
[10] For a fun exercise, try doing this process 4-5 times and see how much variance there is.

8 Task 3: Try different network designs and hyperparameters

Why stop at 1 hidden layer?? And why not use a ReLU instead of Cubic for an activation function? In Task 3, you get to try out different network architectures. We suggest trying out some of the following and then repeating Task 2 to see how performance changes. For easier debugging and replicability, you should make a new class that is a copy of ParserModel (once it’s working for Task 2) and make all your modifications to that class. Some suggested modifications are:
1. Add 1 or more layers to the network.
2. Add normalization or regularization to the layers.
3. Change to a different activation function.
4. Change the size (number of neurons) in layers.
5. Change the embedding size.

How high can you get the performance to go?

Problem 3.1. Train your system for at least 5 epochs and generate the same plots as in Problem 2.1 for this new model’s performance but include both the old model and the new model’s performances in each.

8.1 Task 4: What’s the parser doing, anyway?

A big part of the assignment is learning about how to build neural networks that solve NLP problems. However, we care about more than just a single metric! In Task 4, you’ll look at the actual shift-reduce parsing output to see how well your model is doing. We’ve already provided the functionality for you to input a sentence and have the model print out (1) the steps that the shift-reduce parser takes and (2) the resulting parse. This functionality is provided using the --parse_sentence argument that takes in a string.

    python main.py --parse_sentence "I eat" --load_model_file saved_weights/parser-epoch-5.mdl
    […model loading stuff…]
    Done!
    ----
    buffer: ['i', 'eat']
    stack: ['']
    action: shift
    ----
    buffer: ['eat']
    stack: ['', 'i']
    action: shift
    ----
    buffer: []
    stack: ['', 'i', 'eat']
    action: left arc, :compound:prt
    ----
    buffer: []
    stack: ['', 'eat']
    action: right arc, :det:predet
    | eat | i

In Task 4, you’ll take a look at these outputs and determine whether they were correct.

Problem 4.1. Using one of your trained models, report the shift-reduce output for the sentence “The big dog ate my homework” and the parse tree.

Problem 4.2. More than likely, the model has made a mistake somewhere. For the output, report what was the correct operation to make at each time step: shift, left-arc, right-arc (you do not need to worry about the specific dependency arc labels for this homework).

9 Optional Task

Homework 3 has lots of potential for exploration if you find parsing interesting or want to try building models. Here, we’ve listed a few different fully optional tasks you could try to help provide guidance. These are only for glory and will not change your score. Please please please make sure you finish the homework before trying any of these.

Optional Task 1: Measure the Effect of Pre-Trained Embeddings

In Task 2, your model learned word embeddings from scratch. However, there are plenty of rare words in the dataset which may not have useful embeddings. Another idea is to pre-train word embeddings from a large corpus and then use those during training. This leverages the massive corpus to learn the meanings so that your model can effectively make use of the information, even for words that are rare in training. But which corpus should we use?

In Optional Task 1, we’ve conveniently pre-trained 50-dimensional vectors for you from two sources: all of Wikipedia and 1B words from Twitter. Specifically, for Optional Task 1, you will update your code to allow providing a file containing word embeddings in word2vec’s binary format and use those embeddings in training instead of learning them from scratch. You shouldn’t update these vectors like you would do if you were learning from scratch, so you’ll need to turn off the gradient descent for them. Part of Optional Task 1 is thinking about why you shouldn’t change these vectors. Finally, once you have the vectors loaded, you’ll measure the performance just like you did in Task 2. This breaks down to the following steps:

Problem 3.1. Write code that loads in word vectors in word2vec’s binary format (see Homework 2’s code which has something like this). You’ll need to convert these vectors into PyTorch’s Embedding object to use.
Problem 3.2. Prevent the pretrained embeddings from being updated during gradient descent.

Problem 3.3. Write a few sentences about why we would turn off training. Be sure to describe what effect allowing the weights to change might have on future performance.[11]

Problem 3.4. Repeat the 5 epochs of training like you did in Task 2 using the Twitter and Wikipedia embeddings and plot the performance of each on the development data (feel free to include the learned-embedding performance in this plot too). Write a few sentences describing what you see and why you think the performance is the way it is. Are you surprised?

[11] If you’re struggling to write this part, try allowing their values to get updated and compare the performance difference between the development and test data. Feel free to report the scores!

10 Submission

Please upload the following to Canvas as separate files by the deadline:
1. a PDF (preferred) or .docx with your responses and plots for the questions above
2. your code for the parser[12]

Code should be submitted for any .py file you modified. Please upload your code and response document separately. We reserve the right to run any code you submit; code that does not run or produces substantially different outputs will receive a zero.

[12] No zip files

11 References

Danqi Chen and Christopher Manning. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 740–750, 2014.

Sandra Kübler, Ryan McDonald, and Joakim Nivre. Dependency parsing. Synthesis Lectures on Human Language Technologies, 1(1):1–127, 2009.
Training an NLP model can require lots of training data. We saw this in Homework 3 when we had to create our own data, and even that was a challenge! But what if we didn’t need a lot of labeled data? One branch of NLP has focused on few-shot learning, where a model is only given a few examples to learn from and is then asked to generalize to classify/label new examples. If it works, few-shot learning offers huge advantages since we no longer need to create massive datasets and can instead get by with a potentially much smaller set of examples. The central questions are then how to learn from a few examples and how much we can learn from just a few examples.

In Homework 4, we’ll work with one very recent approach to few-shot learning: pattern-based learning. Here, we’ll rely on large language models (LLMs) like BERT that already know a lot about word order and language to help learn from a few examples. Specifically, we’ll take advantage of these models’ abilities to fill in the blank. Consider the following sentence with a blank at the end:

    I loved it so much I bought three. I thought it was ____.

If you were to fill in that blank in the sentence above, you might say something like “great” or “amazing.” Hopefully, the LLM might look at the earlier part of the sentence to fill in the blank with something similar too! In Homework 4, we’ll use patterns like this with an approach called Pattern-Exploiting Training [PET; Schick and Schütze, 2020a,b], where we put in text that we want to classify and generate a prompt that a language model will fill in. The text that the language model fills in will tell us about the class. For example, say we were trying to do a sentiment task over movie reviews. Our prompt might look like

    [review text]. Overall, I thought it was [mask].

where we fill in the “[review text]” part with the text of the instance we’re trying to classify and then look at what the model generates for the masked token position at [mask].

For these kinds of prompts, we’ll write a pattern that lets us plug in the instance text that we want to label (the “[review text]” part above) and then specify a masked token that the LLM will fill in that is hopefully related to the label we want to know. The PET system aims to learn how to classify the instance based on what the model fills in. We will specify how to map some of what gets filled in using what’s known as a verbalizer. In essence, we specify words that correspond to our labels. For example, if we were doing sentiment analysis, we would specify a verbalizer where the positive class gets mapped to words like “good” or “great” and the negative class to words like “bad” or “terrible.” Your job as a practitioner is to figure out a few mappings. Your words don’t have to be adjectives either! You can write prompts that use verbs, nouns, or even adverbs to indicate the class of the text; be creative!

This assignment has the following learning goals:
• Familiarize you with the idea of few-shot learning and see how a model learns relative to how much data it is trained on.
• Learn how to use a tokenizer for a large language model.
• Improve your NLP skills when working with cutting-edge code with examples.
• Become able to train a PET model using limited data.

This last assignment is aimed at giving you one more skill in your arsenal for when you need an NLP classifier but don’t have labeled data and can’t create a large dataset of labeled examples.

In Homework 4, we’ll use PET to train a classifier on Jigsaw’s Toxic Language dataset. Conveniently, the PET authors have already provided code for you to use at https://github.com/timoschick/pet. Your task will be to (1) write your own custom verbalizer and patterns and (2) train your model by modifying one of their example scripts.
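To make the pattern and verbalizer idea concrete, here is a toy sketch in plain Python. It does not use the PET repository’s API: `apply_pattern`, `label_from_fill`, and the `token_scores` dictionary (standing in for the probabilities a language model would assign to words at the masked position) are hypothetical names chosen for this illustration, and the sentiment example mirrors the movie review prompt above.

```python
MASK = "[mask]"

# Verbalizer: each label maps to words the model might fill in for that label.
VERBALIZER = {
    "positive": ["good", "great"],
    "negative": ["bad", "terrible"],
}


def apply_pattern(text, pattern="{text}. Overall, I thought it was {mask}."):
    """Plug the instance text into the pattern, leaving a masked slot to fill."""
    return pattern.format(text=text.strip(), mask=MASK)


def label_from_fill(token_scores, verbalizer):
    """Pick the label whose verbalizer words get the highest total score.

    `token_scores` is a stand-in for the model's scores at the masked position;
    words missing from it count as 0.
    """
    totals = {
        label: sum(token_scores.get(word, 0.0) for word in words)
        for label, words in verbalizer.items()
    }
    return max(totals, key=totals.get)


prompt = apply_pattern("I loved it so much I bought three")
print(prompt)
print(label_from_fill({"great": 0.6, "good": 0.2, "bad": 0.1}, VERBALIZER))
```

PET itself goes further by training the language model with several such patterns, so treat this only as a picture of how the filled-in word is turned into a label.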
The PET repository has good documentation on how to set up their model, train it, and use the code.Like in Homework 3, in this assignment we will use a much smaller but nearly-as-performant version of BERT, https://huggingface.co/microsoft/MiniLM-L12-H384-uncased, to train our models. While PET can work on any LLM, MiniLM will make the homework much faster to finish.One small hitch to writing a verbalizer is that they need to be a single token in the LLM’s vocabulary. For word-piece based tokenerizers, this means you can’t use phrases like “super awesome” or longer words that the tokenizer will break up. For example, the word “tokenize” is broken up into the tokens “token” and “##ize” (the ## part lets the model know the token is connected to the preceding one). While the PET model can support multi-token words, using them requires more work and coding, which is not needed in this assignment (we want to keep it simple!). Since MiniLM is trained as a drop-in version of BERT, we can use BERT’s tokenizer to check whether a word is entirely in the vocabulary.Note that you do not need to load in the BERT model to check this; instead, you can load in its pretrained tokenizer, which is available on HuggingFace.1 Note that all the BERT-like models will use this tokenizer, so it’s helpful to see how it works! ■ Problem 1. (10 points) Write a simple piece code that takes a single word as input and then tokenizes it with the BERT tokenizer in huggingface and returns the word’s corresponding tokens (or token IDs) in the BERT vocabulary. You’ll want to use this piece of code in the next task to check that your verbalizer is using only single-token words. 1https://huggingface.co/docs/transformers/fast_tokenizers■ Problem 2. (40 points) Write 10 different prompts that can be used to classify toxic speech. Prompts should be relatively different (not just adding/changing one word). For each, come up with at least 2 verbalizations of each class (toxic/non-toxic). 
You can share verbalizations across prompts if needed. We really want to see some creativity across your prompts (this will also help the model learn more).

■ Problem 3. (10 points) For comparison with PET, train a regular classifier using Trainer and the MiniLM parameters on all the training data (very similar to what you did in Homework 3!). You should train your model for at least two epochs, but you’re not required to do any hyperparameter tuning (you just need a score). Predict the toxicity of the provided test data and calculate the F1.

■ Problem 4. (30 points) Using your patterns and verbalizers, train separate PET models on 10, 50, 100, and 500 instances of data. Your data should be randomly sampled from the training data, but be sure to have examples of each class. You are free to choose which instances you use and what distribution of toxic/non-toxic labels is in your training data (provided you have at least one example of each). For each model, predict the scores for the provided test data and calculate the Macro F1.

■ Problem 5. (10 points) Let’s compare our PET-based models and our regular all-data MiniLM model. Plot the score for each PET model and your full-data MiniLM model using Seaborn. If you are feeling curious, feel free to train models on different sizes/distributions of data and include those too. Write your guess on how many instances you think you need to train a PET model that will reach the performance of a MiniLM model trained on all the data.

3 What to submit

You should submit the following parts to Canvas.
• Your code for all parts
• A write-up showing your plot with models’ scores and a sentence saying how many instances you think PET needs to reach the performance of a MiniLM model trained on all the data.

References

Timo Schick and Hinrich Schütze. Exploiting cloze questions for few-shot text classification and natural language inference. Computing Research Repository, arXiv:2001.07676, 2020a. URL http://arxiv.org/abs/2001.07676.

Timo Schick and Hinrich Schütze. It’s not just size that matters: Small language models are also few-shot learners. Computing Research Repository, arXiv:2009.07118, 2020b. URL http://arxiv.org/abs/2009.07118.