Assignment Chef


Assignment catalog

33,401 assignments available

[SOLVED] COMPSCI5089 INTRODUCTION TO DATA SCIENCE AND SYSTEMS 2019

INTRODUCTION TO DATA SCIENCE AND SYSTEMS (M) COMPSCI 5089, Thursday 19 December 2019

1. Linear algebra, probability, visualisation and optimisation

Your data science team has been asked to analyse a subsystem for a car manufacturer. After some experimentation it is clear that the system you are considering can be described by the following set of coupled equations:

-14 + xα + zγ = -yβ
2xα - yzβ + 8 = -xγ + xα          (1)
-zγ = -5 - yβ

where x = 1, y = 2, z = 3 are scalar inputs to the system and the output of the system is denoted by c = f([x, y, z]^T, [α, β, γ]^T) = [14, -8, -5]^T. b = [α, β, γ]^T is a vector containing the parameters of the system.

(a) Convert the set of coupled equations in Eq. (1) into the matrix form Ab = c. [3]

(b) You are now asked to find the parameters, b, of the system using a numerical optimisation method without the availability of standard solvers and matrix inversion.
(i) Define an optimisation problem that would allow you to solve a problem of the type Ab = c with respect to b, under the constraint that you cannot use matrix inversion but have access to partial derivatives of A, b and c with respect to b. [2]
(ii) State a form of the update equations for standard gradient descent which will allow you to solve the optimisation problem outlined in the previous question, and explain under which conditions your gradient descent optimisation algorithm is guaranteed to converge. [4]

Figure 1: A scatter plot illustrating two datasets. The two different datasets can clearly be identified as two distinct clusters (as validated by the manager).

(c) Your manager has provided you with two datasets obtained on two different days, each containing several observations of x and y.
(i) Your manager has illustrated the observations in Figure 1. Criticise this graph, and redraw a sketch that corrects the issues you have identified. [3]
(ii) Your manager asks you to summarise each of the datasets using a separate Normal distribution for each dataset. Explain how you would parameterise the Normal distributions needed to model the (x, y) values from the two individual datasets, including a description of the array shape of any parameters that the distribution would have. [3]
(iii) Explain how eigendecomposition could be used on the parameters estimated in the previous question to identify the major axis of variation. Draw a simple sketch to show: the data points; the estimated Normal distributions; the relevant eigenvectors (for each dataset separately). [5]

2. Text processing in data science

(a) (i) Consider two documents with term frequency vectors as follows: D1 = [4, 2, 0] and D2 = [2, 0, 4]. Calculate the cosine similarity between these two documents. Give the formula for cosine similarity and show your workings. Note: the final result may be in the form of a formula. [3]
(ii) Name and describe an application where cosine similarity could be used. Justify why cosine should be used for this application, explain the key geometric properties of cosine similarity and why it is important for the application. [2]
(iii) You are given the following list of documents in Python: docs = ('The sky is green', 'The sun is yellow', 'We can see the shining sun, the bright sun in the sky'). Write Python code to compute the TF-IDF cosine similarity matrix of the docs list using the appropriate scikit-learn libraries. [3]
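A sketch of one way to answer 2(a)(iii) with scikit-learn (library defaults such as lowercasing and smoothed IDF are assumptions, not requirements stated in the question):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ['The sky is green',
        'The sun is yellow',
        'We can see the shining sun, the bright sun in the sky']

vectorizer = TfidfVectorizer()          # default tokenisation and IDF weighting
tfidf = vectorizer.fit_transform(docs)  # shape: (3 documents, |vocabulary|)

# 3x3 matrix; entry (i, j) is the cosine similarity of doc i and doc j
similarity_matrix = cosine_similarity(tfidf)
print(similarity_matrix)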
(iv) Define the concept of lemmatization. Compare and contrast it with stemming. [2]

(b) (i) Explain the k-means clustering algorithm using pseudo-code or precise word descriptions. Name and describe three key clustering properties of k-means. [3]
(ii) You work at a large social media company with an advertising network. Describe a task where k-means could be applied and describe how it would be implemented. Provide details including specifying appropriate textual features and their representation, the similarity function, and how to address issues of scale on large datasets. [3]
(iii) The default k-means algorithm runs on the task from part 2(b)(ii) for a very large data collection. The clustering is too slow and takes too long to complete. The product requirements dictate that the number of clusters and features are fixed. Discuss why it is slow and suggest a modification to the k-means algorithm that will speed it up. [2]
(iv) How many clusters would you guess the data illustrated in Figure 2 has? Describe the method you would use to determine a correct value of k. Does it matter if the value is determined over a single run vs many runs of the algorithm? Explain why or why not. [2]

Figure 2: Example cluster data

3. Database systems

(a) Consider a relation Employee(ID, Name, Age) where the primary key (ID) is a 64-bit integer, Age is an 8-bit integer, and 51 bytes are needed for the Name attribute. Assume that the relation has 1000 tuples, stored in a file on disk organised in 512-byte blocks, each having a 24-byte header. Note that the database system adopts fixed-length records, i.e. each file record corresponds to one tuple of the relation and vice versa.
(i) Compute the blocking factor and the number of blocks required to store this relation. [2]
(ii) Consider the following SQL query: SELECT Name FROM Employee WHERE ID >= 101 AND ID
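A worked sketch of part 3(a)(i) above, assuming unspanned fixed-length records:

import math

record_size = 8 + 51 + 1            # ID (64-bit) + Name (51 bytes) + Age (8-bit) = 60 bytes
usable_block = 512 - 24             # block size minus the 24-byte header = 488 bytes

bfr = usable_block // record_size   # blocking factor: floor(488 / 60) = 8 records/block
blocks = math.ceil(1000 / bfr)      # ceil(1000 / 8) = 125 blocks
print(bfr, blocks)                  # 8 125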

$25.00

[SOLVED] COMPSCI5089 INTRODUCTION TO DATA SCIENCE AND SYSTEMS April 2021

INTRODUCTION TO DATA SCIENCE AND SYSTEMS (M) COMPSCI 5089, Monday 26 April 2021

1. Computational linear algebra and optimisation

You have been asked to help design the subcomponents of a music streaming service. The service has access to 101,750 music tracks (i.e. the audio files). Each music track can be summarised based on the audio content using so-called audio features, resulting in a 15-dimensional vector, x ∈ R^(1x15), for each track. The meaning and importance of the individual dimensions in the vector is unknown. The vectors for the individual tracks are collected in a matrix X as row vectors. Aside from the audio file itself, the service has access to the title and artist for each track, the genre(s) associated with each track (e.g. jazz) and finally the popularity of each track as a scalar y ∈ R.

(a) The team wants to develop a function called "What is this track called?" where users can upload an audio file with the purpose of identifying the name of the track and artist. To this end we are interested in computing Euclidean distances between the music tracks based on their vector representations.
(i) Certain aspects of X are summarised in Table 1. Explain why it is a good idea to normalise the data in X before computing the similarity between the tracks and suggest a suitable normalisation approach. Justify your approach and make reference to specific elements in Table 1. [3]
(ii) Design a simple search routine which can find the closest match between the uploaded track and a track in the existing dataset. Write the procedure using equations or NumPy code (1-3 lines). Determine how many individual distances you will need to compute and discuss any potential scalability issues. [3]

(b) A subcomponent of the system relies on a mapping from tracks to popularity. This can be formulated as a matrix problem: Xw^T - y = 0, where X is a matrix containing the music features for the tracks, w is a 15-dimensional vector and y is a vector containing the popularity scores for each track. The team is interested in the most efficient and robust method for finding w using the squared error as the loss function.
(i) Specify the dimensions of the matrix X and determine whether w and y are considered row or column vectors, respectively. [1]
(ii) Determine a method for solving the matrix equation with respect to w. Justify your approach. [2]

Table 1: Basic statistics (μ and σ) for each dimension in X.

Figure 1: Eigenspectrum (unordered)

(c) The user interface team has requested that you provide a procedure for projecting the music tracks to 2D or 3D based on the vector representation so they can visualise the music tracks on a computer screen. You must use a linear map due to computational constraints.
(i) Outline a procedure for finding the 3D coordinates so that the projection preserves most of the variance and can be implemented using only basic Python and NumPy by a junior data scientist. You should not provide the code, but explain the individual steps in the procedure using text or equations and only recommend the suitable NumPy commands. You must specify the dimensions of all vectors or matrices required to compute the projection. [4]
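The question asks for steps rather than code, but as a sketch of what those steps compute (np.cov and np.linalg.eigh are commands one might recommend; the random X stands in for the real feature matrix):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((101750, 15))        # placeholder for the real (101750, 15) matrix

Xc = X - X.mean(axis=0)                      # centre each of the 15 dimensions
C = np.cov(Xc, rowvar=False)                 # 15 x 15 covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)         # eigh sorts eigenvalues ascending for symmetric C

W = eigvecs[:, -3:]                          # top-3 eigenvectors, shape (15, 3)
X3d = Xc @ W                                 # projected coordinates, shape (101750, 3)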
(ii) The eigenspectrum of the covariance matrix of X is shown in Figure 1. Discuss what the eigenspectrum says about the vectors representing the audio files and how this could be leveraged to make the system more efficient. Discuss whether the team's idea of a 2D or 3D interface is justified. [3]

(d) Your team is contemplating a new subcomponent which would enable users to generate a new music track. The team has already developed a function, r(x), which makes it possible to map from the vector representation, x, to the audio file. The aim is to create a new track based on a genre profile, which is a 5-dimensional vector, g ∈ R^5. Your team has provided a non-linear function, f: x → g, that maps from the track vector to the 5D genre profile. Provide a solution in the form of an optimisation problem and determine a suitable method to solve the stated problem. Justify your choice of method and explain under which circumstances it is guaranteed to converge to a sensible solution in this scenario. You will need to make assumptions, which must be clearly stated. [4]

2. Probabilities & Bayes rule

Consider a scenario where you are in charge of analysing the data and modelling a pandemic. We consider a given disease (let's call it 'VIRUS'), which has an unknown prevalence r in the population (we will assume that r ∈ [0, 1] is the proportion of the population that has the disease). We will write the probability that a person is diseased as p(D) = r.

(a) Your lab has developed a fast testing procedure to detect this disease. In order to evaluate the accuracy and reliability of this test, you have conducted trials on 132 subjects, and compared the results of your test with a perfectly accurate (supposedly more expensive) diagnostic. The results of those trials are collated in the following table:

             positive   negative
diseased        28          3
healthy         12         89

(i) Using Bayes' formula and the trial data in the table, provide an estimate of the probabilities: p(D|T), that a subject who tested positive is truly diseased; and p(D|T̄), that a subject who tested negative is actually diseased. [4]
(ii) Taking into consideration the test accuracy and reliability as evidenced in the trials, would this test be appropriate for the following situations: 1. regular testing of people working with vulnerable populations; 2. deciding on whether to administer a treatment with severe side effects; or 3. applying to the whole population to find all diseased individuals (justify your answers). [3]
(iii) You administer a test with probabilities p(D|T) = 0.7 and p(D|T̄) = 0.01 to a sample of 1000 subjects drawn randomly from the population. The test returns 980 negatives and 20 positives. From this data, calculate an estimate of the prevalence p(D), explaining your reasoning. [4]

(b) Let us consider that you are experimenting with a vaccine against the disease. You have 1000 subjects in group A who take the vaccine and 1000 in group B who take a placebo. Let us assume that you test the subjects in both groups daily, and after one month you obtain the following results: 2 subjects from group A tested positive at some point during the month, and 40 subjects from group B. In this part we will assume that we are using a test with the following statistics: the probability of having the disease if tested positive is p(D|T) = 0.7; the probability of having the disease if tested negative is p(D|T̄) = 0.01.
(i) Accounting for the limitations of the test, how many subjects in groups A and B did possibly catch the disease during this month? [5]
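A sketch of the expected-count correction in part (b)(i), assuming the stated p(D|T) and p(D|T̄) apply independently to every subject:

p_d_pos, p_d_neg = 0.7, 0.01   # p(D|T) and p(D|T-bar) from the question

def expected_diseased(positives, group_size):
    negatives = group_size - positives
    # expected truly diseased = positives * p(D|T) + negatives * p(D|T-bar)
    return positives * p_d_pos + negatives * p_d_neg

print(expected_diseased(2, 1000))    # group A: 2*0.7 + 998*0.01  = 11.38
print(expected_diseased(40, 1000))   # group B: 40*0.7 + 960*0.01 = 37.6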
(ii) The efficacy of a vaccine is typically calculated as the relative reduction in infection rate, efficacy = 1 - (infection rate in the vaccinated group) / (infection rate in the placebo group). Use your results from above to calculate the efficacy of the vaccine. Discuss what would happen if your test were less accurate: what would happen if p(D|T) were lower? If p(D|T̄) were higher? [4]

3. Database systems

Consider a relation Student(ID, Name, StudyPlan), abbreviated as S, where the primary key (ID) is a 64-bit integer, the Name attribute is a 40-byte (fixed-length) string, and StudyPlan is a 16-bit integer. Further consider a relation Marks(ID, CourseID, AssessmentID, Mark), abbreviated as M, with ID being a foreign key to Student's ID, CourseID and AssessmentID being 16-bit integers, Mark being a 64-bit float, and the first three attributes making up the relation's (composite) primary key. Assume that both relations are stored in files on disk organised in 512-byte blocks, with each block having a 10-byte header. Assume that S has r_S = 1,000 tuples and that M has r_M = 100,000 tuples. Last, assume that Student is stored organised in a heap file, and Marks is stored organised in a sequential file sorted by its primary key. Note that the database system adopts fixed-length records, i.e. each file record corresponds to one tuple of the relation and vice versa.

(a) Compute the blocking factors and the number of blocks required to store these relations. Show your work. [2]
(b) Consider the following query: SELECT S.Name, M.ID, M.Mark FROM Student as S, Marks as M WHERE S.ID = M.ID AND S.ID >= 10,000 and S.ID

$25.00

[SOLVED] COMPSCI 5096 TEXT AS DATA

DEGREES of MSci, MEng, BEng, BSc, MA and MA (Social Sciences) TEXT AS DATA (M) COMPSCI 5096, Wednesday 20 May, 09:15 BST

1. This question is about tokenisation and similarity.

(a) This part concerns processing text. Consider the input string: [He didn't like the U.S. movie "Snakes on a train, revenge of Viper-man!", now playing in the U.K.]
(i) Provide a tokenised form of the above string. Identify and discuss two elements of the above string that present ambiguities. Justify your tokenisation decision for each. [3]
(ii) Compare and contrast 'standard' word-based tokenisation with the tokenisation method used by BERT. Illustrate key differences using the example provided. Analyse and discuss why they differ and their relative advantages and disadvantages. (Hint: recall we used BERT's tokeniser in Lab 1 and in the in-class embedding exercise.) [4]

(b) Consider the two tokenised documents:
S1: [a, woman, is, under, a, mayan, curse]
S2: [a, woman, sees, a, mayan, shaman, to, lift, the, curse]
Create a dictionary from the two documents above (S1 and S2) with appropriate ordering. Give your answer in the form of a table with ID and token. Discuss the following properties of the dictionary and provide reasons for the decision: 1) what is included in the dictionary and 2) the order of the dictionary. [3]

(c) Critically evaluate the Bag-of-Words (BoW) model as a term weighting feature model for documents. Discuss its strengths and give three weaknesses of the model, proposing a modification that addresses each. You should relate each to scikit-learn vectorizers and their important parameters. [4]

(d) You are measuring the similarity between two molecular compounds for drug discovery research. They have been processed to create a series of unique structural 'fingerprints', and a one-hot encoding of the compounds is created. A compound has tens of thousands of fingerprints on average and all the compounds are approximately the same size. Also, most of the compounds in the dataset share more than 90% of fingerprints in common. A lab partner suggests using Jaccard overlap to measure the similarity between compounds. First, critically discuss why Jaccard is or is not appropriate for this task and the challenges it presents. Second, propose and justify a change to both the representation and similarity measure to address them. [6]

2. This question is about language modelling and classification.

(a) This task involves developing an order error corrector for a popular burger chain, 'out-and-in burger'.

Table 1: Five interactions for a burger restaurant ordering system (transcribed from a mobile app).

Sample text collection statistics for a bigram model are below:
V = 22 unique words (including reserved tokens)
N = 45 tokens, including padding

(i) Use the text provided in Table 1 above to compute word unigram probabilities. In a list or table format, complete the probability table with Laplace smoothing with K = 0.5. Show your workings. Discuss the impact on the probability values of increasing or decreasing the value of K. Describe the effect of K when these probabilities are used in a spelling (error) correction task. [5]
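A sketch of the add-K estimate in part (a)(i), using the collection statistics quoted above (the example counts are hypothetical stand-ins for the real table):

V, N, K = 22, 45, 0.5     # vocabulary size, token count, smoothing constant

def add_k_unigram(count, N=N, V=V, K=K):
    # add-K smoothed unigram probability: (c(w) + K) / (N + K*V)
    return (count + K) / (N + K * V)

print(add_k_unigram(3))   # a word seen 3 times -> (3 + 0.5) / (45 + 11) = 0.0625
print(add_k_unigram(0))   # an unseen word     -> 0.5 / 56 ≈ 0.0089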
(ii) A larger collection of restaurant ordering data is collected. It has the following statistics: N = 73194, V = 1996, from a total of 8565 documents (utterances). Compute the bigram probability of the following sequence: [i might like a cheeseburger] with Stupid Backoff smoothing with default values. Collection statistics for the required terms are provided below. Show your workings, including each bigram's probability. Describe how and why a smoothing method is used here. [6]

term                 count
i                     1926
i might                  0
might like               1
like a                  49
a cheeseburger           3
cheeseburger 〈/s〉

(b) Compare and contrast the APIs for sklearn Transformers (e.g. Count or TF-IDF) and Classifiers/Predictors (e.g. NaiveBayes, LogisticRegression). Include descriptions of their key interface functions with descriptions of their behaviour. Discuss how they are used together to solve machine learning tasks on text. [3]

(c) Below is a snippet of code to vectorize and classify text with scikit-learn. Assume that tokenize_normalize and evaluation_summary have been defined, as we did in the labs. The input data has been pre-processed into a vector of unnormalized text documents (each a single string).

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Data processing
data = ...  # Loads a vector of raw text documents
train_index = int(len(data) * 0.1)
train_data = data[:train_index,:]
validation_data = data[int(train_index*0.2):,:]
test_data = data[train_index:,:]

# Assume corresponding labels for each data subset
train_labels, test_labels, validation_labels = ...

# Vectorization
one_hot_vectorizer = CountVectorizer(tokenizer=tokenize_normalize,
                                     binary=True, max_features=20)
one_hot_vectorizer.fit(train_data)
train_features = one_hot_vectorizer.transform(train_features)
validation_features = one_hot_vectorizer.fit_transform(validation_data)
test_features = one_hot_vectorizer.transform(test_data)

# Classification
lr = LogisticRegression(solver='saga', max_iter=500)
lr_model = lr.fit(train_features, train_labels)
evaluation_summary("LR Train summary",
                   lr_model.predict(train_features), validation_labels)
lr_model = lr.fit(validation_features, validation_features)
evaluation_summary("LR Validation summary",
                   lr_model.predict(validation_features), validation_labels)
lr_model = lr.fit(test_features, test_labels)
evaluation_summary("LR Test summary",
                   lr_model.predict(validation_features), test_labels)

Copy and paste the code above and fix its mistakes. Although there may be more, discuss three important mistakes with their consequence, one from each section (data processing, vectorization, classification). [6]
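One possible corrected version of the snippet in 2(c), as a sketch rather than the official solution: splits must not overlap, the vectorizer is fitted on training data only, and each evaluation pairs predictions with labels from the same split. The 80/10/10 split and the lab helpers tokenize_normalize / evaluation_summary are assumptions; the data-loading placeholders are kept from the question.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Data processing: non-overlapping train/validation/test splits
data = ...                                   # placeholder from the question
n = len(data)
train_data = data[:int(n * 0.8)]
validation_data = data[int(n * 0.8):int(n * 0.9)]
test_data = data[int(n * 0.9):]
train_labels, validation_labels, test_labels = ...

# Vectorization: fit on training data only, then transform every split
one_hot_vectorizer = CountVectorizer(tokenizer=tokenize_normalize,
                                     binary=True, max_features=20)
train_features = one_hot_vectorizer.fit_transform(train_data)
validation_features = one_hot_vectorizer.transform(validation_data)
test_features = one_hot_vectorizer.transform(test_data)

# Classification: train once on the training set, evaluate on held-out splits
lr = LogisticRegression(solver='saga', max_iter=500)
lr_model = lr.fit(train_features, train_labels)
evaluation_summary("LR Validation summary",
                   lr_model.predict(validation_features), validation_labels)
evaluation_summary("LR Test summary",
                   lr_model.predict(test_features), test_labels)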
3. This question is about word embedding models and Natural Language Processing.

(a) Compare and contrast static word embeddings with contextual embedding models. Discuss the trade-offs between them for downstream tasks. [4]
(b) Using your knowledge of the self-attention mechanism, answer the following question considering the sentence: S1: [The president of the European Union spoke]. Use the following weight matrices and layer parameters to compute the unnormalised attention weights between the query "president" and the keys "spoke" and "the". What can you infer from these values? [3]
(c) Explain how and why attention-based encoders can be "stacked" to form layers in Transformer models. [2]
(d) In this question we explore what can be done when faced with a completely "alien" scenario. Klingon is a language originating from the TV series Star Trek. Many classic works such as Hamlet, Much Ado About Nothing, Tao Te Ching, and Gilgamesh have been translated by hand into Klingon. It is studied and formalised by the Klingon Language Institute (KLI) and was designed to be dissimilar from English. Below are some sample Klingon-English translations: bIpIv; arDaneH, "Where are you from?"
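The weight matrices for part 3(b) are not reproduced above, so as a generic sketch: the unnormalised attention weight between a query token and a key token is the dot product of their projected vectors. All matrices and embeddings below are hypothetical placeholders.

import numpy as np

d = 4                                 # hypothetical embedding/projection size
rng = np.random.default_rng(0)
W_Q = rng.standard_normal((d, d))     # placeholder query projection
W_K = rng.standard_normal((d, d))     # placeholder key projection

x_president, x_spoke, x_the = rng.standard_normal((3, d))   # stand-in embeddings

q = x_president @ W_Q                 # project the query token
for name, x in [("spoke", x_spoke), ("the", x_the)]:
    k = x @ W_K                       # project the key token
    score = q @ k                     # unnormalised attention weight q . k
    print(name, score)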

$25.00

[SOLVED] MongoDB Assignment

MongoDB Assignment

Design a MongoDB database with at least two collections that contain related documents. You must:
1. Create a minimum of 10 documents per collection.
2. Demonstrate at least 15 queries, covering:
- Basic CRUD operations (insert, update, delete, find).
- Aggregation Framework (e.g., $group, $match, $sort, $limit).
- Data validation (using MongoDB schema validation techniques).

Extra marks will be awarded for:
- Using document relationships (e.g., _id references or embedded documents).
- Real-time data ingestion (e.g., streaming IoT sensor data, stock prices, or Twitter API data).
- Indexing strategies (e.g., creating indexes on frequently queried fields).
- Joins using $lookup to relate data across collections.
- Sharding (demonstrating how MongoDB distributes data across servers).

Report: create a report discussing your project, covering the following:
- Brief background/context.
- Discussion of the dataset/data used.
- Screenshots of queries and output.
- Discussion of your main queries and any extra implementations such as joins, document relationships, indexing, etc.

Screencast presentation of your assignment: create an MP4 video (max 4 minutes) with voice-over, presenting your MongoDB work. Your video should include:
1. An overview of your MongoDB database: describe the collections and document structures.
2. A demo of your key MongoDB queries: show execution in the MongoDB shell.
Use Screencast-O-Matic (or a similar tool) to record your video.

Submission instructions: upload your report as "StudentNumber_MongoDB.docx", your Mongo queries in a separate Word document, and the screencast to the Moodle upload area. You must also submit a signed A1 plagiarism form for this assignment.

Final notes:
- Ensure you provide clear screenshots for all MongoDB queries.
- Add complexity to MongoDB queries for higher marks.
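A minimal sketch touching several of the graded items (CRUD, an index, a $lookup join inside an aggregation), using pymongo against a local server; the shop/customers/orders names and fields are hypothetical:

from pymongo import MongoClient, ASCENDING

db = MongoClient("mongodb://localhost:27017")["shop"]

# CRUD: insert related documents (orders reference customers by _id)
db.customers.insert_one({"_id": 1, "name": "Ada", "city": "Glasgow"})
db.orders.insert_many([{"customer_id": 1, "item": "lamp", "price": 30.0},
                       {"customer_id": 1, "item": "desk", "price": 120.0}])

# Indexing strategy: index a frequently queried field
db.orders.create_index([("customer_id", ASCENDING)])

# Aggregation with a $lookup join: total spend per customer, sorted
pipeline = [
    {"$lookup": {"from": "customers", "localField": "customer_id",
                 "foreignField": "_id", "as": "customer"}},
    {"$unwind": "$customer"},
    {"$group": {"_id": "$customer.name", "total": {"$sum": "$price"}}},
    {"$sort": {"total": -1}},
]
for row in db.orders.aggregate(pipeline):
    print(row)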

$25.00

[SOLVED] COMPSCI5089 INTRODUCTION TO DATA SCIENCE AND SYSTEMS December 2021

INTRODUCTION TO DATA SCIENCE AND SYSTEMS (M) COMPSCI 5089, Wednesday 15 December 2021

1. Computational linear algebra and optimisation

(a) Given a collection of N documents D = {D1, ..., DN}, your task is to implement a functionality that provides a list of suggested 'more like this' documents. With this problem context, answer the following questions.
(i) Explain how you would represent each document D ∈ D as a (real-valued) vector d. What is the dimension of each vector? [2]
(ii) What does the L0 norm of a document vector indicate (in plain English), as per your definition of the document vectors in the previous question? [1]
(iii) How would you define the Lp distance between two document vectors d and d′? [2]
(iv) What distance or similarity measure would you use for finding the set of 'more like this' documents for a current (given) document vector d, and why? [2]

(b) The probability density function of an n-dimensional Gaussian is given by f(x) = (2π)^(-n/2) |Σ|^(-1/2) exp(-(1/2)(x - μ)^T Σ^(-1) (x - μ)), where μ ∈ R^n is the mean vector and Σ ∈ R^(n×n) is a square and invertible matrix, called the covariance matrix. Consider the particular case of n = 2. Answer the following questions.
(i) Plot the contours of the following Gaussians. For each contour plot, show the conditional distributions along the two axes. [2]
(ii) Which one/ones of the above 4 Gaussian distributions can be reduced to a single-dimensional Gaussian with PCA on the covariance matrix without too much loss of information? Note that you do not need to explicitly compute the eigenvalues. You should rather derive your answer from a visual interpretation of the contour plots. Clearly explain your answer. [2]

(c) With respect to linear regression, answer the following questions.
(i) Derive the expression for stochastic gradient descent for linear regression with the squared loss function. Clearly introduce your notations for the input/output instances and the parameter vector. [4]
(ii) Explain how linear regression can be extended to polynomial (higher-order) regression. What is the problem of using high-degree polynomials for regression? How can that problem be alleviated? [3]
(iii) A common practice in stochastic gradient descent is to use a variable learning rate α for the parameter updates, where θ_j(t) denotes the j-th component of the parameter vector θ at iteration t, and α(t) denotes the value of the learning rate at iteration t. Which of the following alternatives of the learning rate update would you prefer (α is a constant), and why? [2]
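For 1(c)(i), the update that falls out of the derivation is θ ← θ - α(θ^T x_i - y_i) x_i for a single instance (x_i, y_i); a sketch on synthetic data (the data and hyperparameters are placeholders):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))                 # synthetic inputs
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(200)

theta, alpha = np.zeros(3), 0.01
for _ in range(10):                               # a few epochs over the data
    for x_i, y_i in zip(X, y):
        error = theta @ x_i - y_i                 # prediction residual
        theta -= alpha * error * x_i              # SGD step on the squared loss
print(theta)                                      # approaches [1, -2, 0.5]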
2. Probabilities & Bayes rule

Consider a card game where you have 4 suits (hearts, diamonds, clubs and spades) and in each suit the cards 7, 8, 9, 10, Jack (J), Queen (Q), King (K) and Ace (A). In this question we will use the following commonly used terms:
- the pack: the set of all cards that have not been drawn yet.
- to draw: to pick a card at random from the pack of remaining cards, removing it from the pack.
- the hand: the set of cards a player has drawn from the pack.
- a payout: the amount of points you get for a given hand.
- to fold: to stop playing and put your cards back in the pack, forfeiting any payout for this game.

(a) Assuming that you draw a single card at random from the pack, give the probabilities for the following events:
(i) Drawing an Ace?
(ii) Drawing a red card?
(iii) Drawing a diamond?
(iv) Drawing a royal figure (Jack, Queen or King)?
(v) Drawing the Ace of spades? [5]

(b) Now assume that you have already drawn the three cards 10, J, Q. When drawing two more cards from the pack, what is the probability of obtaining:
(i) A pair of two cards with the same value (e.g., two Jacks).
(ii) Two pairs (e.g., two Jacks and two Queens).
(iii) Three of a kind (e.g., three Jacks).
(iv) A sequence of 5 cards (e.g., 10, J, Q, K, A). Note that the cards can be of any suit, but there cannot be a break in the sequence. [4]

(c) Now let us assume the following payout table for each hand of 5 cards. As before, you have the cards 10, J, Q in hand.
(i) If you draw two more cards randomly from the deck, what is the expected value of the payout for this hand? [3]
(ii) Assuming that you need to pay 5 every time you draw a card (hence you would need to pay 10 to draw two cards), should you fold your hand or draw cards? [2]
(iii) Should you fold after drawing the first card (and having paid 5), if the card is: (i) the 7 of hearts, (ii) the 8 of spades or (iii) the Queen of diamonds? [6]

3. Database systems

An online retail company is trying to assess the performance of its DB systems and has asked you to investigate some of the operations. Consider a relation Seller(ID, Name, Country), abbreviated as S, where the primary key (ID) is a 32-bit integer, the Name attribute is a 54-byte (fixed-length) string, and Country is a 16-bit integer. Further consider a relation Product(ID, ProductID, ManufacturerID, Price), abbreviated as P, with ID being a foreign key to Seller's ID, ProductID and ManufacturerID being 64-bit integers, Price being a 32-bit float, and the first three attributes making up the relation's (composite) primary key. Assume that both relations are stored in files on disk organised in 512-byte blocks, with each block having a 10-byte header. Assume that S has r_S = 1,000 tuples and that P has r_P = 100,000 tuples. Last, assume that Product is stored organised in a sequential file sorted by its primary key and Seller is stored organised in a heap file. Finally, note that the database system adopts fixed-length records, i.e. each file record corresponds to one tuple of the relation and vice versa.

(a) Compute the blocking factors and the number of blocks required to store these relations. Show your work. [2]
(b) Consider the following query: SELECT S.Name, P.ID, P.Price FROM Seller as S, Product as P WHERE S.ID = P.ID AND S.ID >= 6,000 and S.ID
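A quick numerical check of question 2(a) above, assuming the 32-card pack described (4 suits × 8 ranks):

from fractions import Fraction

pack = 32                   # 4 suits x 8 ranks (7 through Ace)

print(Fraction(4, pack))    # an Ace:            4/32  = 1/8
print(Fraction(16, pack))   # a red card:        16/32 = 1/2
print(Fraction(8, pack))    # a diamond:         8/32  = 1/4
print(Fraction(12, pack))   # J, Q or K:         12/32 = 3/8
print(Fraction(1, pack))    # the Ace of spades: 1/32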

$25.00

[SOLVED] Developing Effective Consulting Skills

You should take the role of a management consultant. Using an organisational context of your choice, prepare a proposal for a management consulting project you feel will help the organisation.
1. Provide a short description of the organisation you have chosen, including the nature of the issue you are proposing to address in your consulting work (around 200 words).
2. Propose a consulting project to that organisation, setting out clearly the situation and key issues, your proposed approach and deliverables, and the resources and timescales involved in the delivery of the project. In doing so, it will be important to reflect on the appropriate elements of the consulting cycle introduced on the course (around 1,400 words).
A student failing to pass the assignment with a mark of 50% or more will be required to resubmit the entire assignment.

$25.00

[SOLVED] COMPSCI 5011 INFORMATION RETRIEVAL 2021

DEGREES OF MSc, MSci, MEng, BEng, BSc, MA and MA (Social Sciences) INFORMATION RETRIEVAL (M) COMPSCI 5011, Monday 10 May 2021

SECTION A

1. (a) The following documents have been processed by an IR system where stemming is not applied:

DocID  Text
Doc1   breakthrough vaccine for covid19
Doc2   new covid19 vaccine is approved
Doc3   new approach for treating patients
Doc4   new hopes for new covid19 patients in the world

(i) Assume that the following terms are stopwords: in, is, for, the. Construct an inverted file for these documents, showing clearly the dictionary and posting list components. Your inverted file needs to store sufficient information for computing a simple tf*idf term weight, where w_ij = tf_ij * log2(N/df_i). [5]
(ii) Compute the term weights of the two terms "breakthrough" and "vaccine" in Doc1. Show your working. [2]
(iii) Assuming the use of a best-match ranking algorithm, rank all documents using their relevance scores for the following query: covid19 vaccine. Show your working. Note that log2(0.75) = -0.4150 and log2(1.3333) = 0.4150. [3]
(iv) Typically, a log scale is applied to the tf (term frequency) component when scoring documents using a simple tf*idf term weighting scheme. Explain why this is the case, illustrating your answer with a suitable example in IR. Explain through examples how models such as BM25 and PL2 control the term frequency counts. [4]

(b) Consider the recall-precision graph below, showing the performances of two variants of a search engine that mimic Google Scholar on a collection of research papers. There is no difference between the two variants apart from how they score documents. Assume that you are a student looking to find all published papers on a given topic; in other words, you do not want to miss any of the relevant documents. Explain which search engine will be more suitable for your task and why. [5]

(c) Assume that you have decided to modify the approach you use to rank the documents of your collection. You have developed a new Web ranking approach that makes use of recent advances in neural networks. Explain in detail the steps you need to undertake to determine whether your new Web ranking approach produces a better retrieval performance than the original ranking approach. [5]

(d) Consider a query with two terms, whose posting lists are as follows:
term1 → [id=2, tf=2], [id=5, tf=1], [id=6, tf=1]
term2 → [id=2, tf=4], [id=4, tf=3], [id=5, tf=4]
Explain and provide the exact steps/order in which the posting lists will be traversed by the TAAT & DAAT query evaluation strategies, and the memory requirements of both strategies for obtaining a result set of K documents from a corpus of N documents (K , = , 10).
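A sketch of the computation behind parts 1(a)(i)-(ii), assuming whitespace tokenisation and the stopword list given in the question:

import math
from collections import Counter

stop = {"in", "is", "for", "the"}
docs = {1: "breakthrough vaccine for covid19",
        2: "new covid19 vaccine is approved",
        3: "new approach for treating patients",
        4: "new hopes for new covid19 patients in the world"}

# Inverted file: term -> postings list of (doc_id, tf)
index = {}
for doc_id, text in docs.items():
    for term, tf in Counter(w for w in text.split() if w not in stop).items():
        index.setdefault(term, []).append((doc_id, tf))

N = len(docs)
def weight(term, doc_id):                  # w_ij = tf_ij * log2(N / df_i)
    postings = dict(index[term])
    return postings[doc_id] * math.log2(N / len(index[term]))

print(weight("breakthrough", 1))           # 1 * log2(4/1) = 2.0
print(weight("vaccine", 1))                # 1 * log2(4/2) = 1.0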
Explain if the probability of q will become larger, smaller or if it will remain the same. Justify your answer. [2]
(vi) Assume another document doc2 in the corpus, which is identical to doc1 with the exception that one occurrence of w1 has been changed to word w5. Hence, we have ct(w1, doc2) = 1 and ct(w5, doc2) = 3. Let q1 = w1 w5 be the new query. If no smoothing is applied, using the query likelihood retrieval method, state which of the two documents (doc1 or doc2) will be ranked higher. Justify your answer. Using the query likelihood retrieval method but this time with Dirichlet prior smoothing applied (μ = 10), show which of the two documents (doc1 or doc2) would be ranked higher. Show your calculations. Discuss whether smoothing has an impact on the ranking order of doc1 and doc2, and how? Justify your answer. [6]

SECTION B

3. (a) Consider the following vector space scoring formula: where ct(w,d) and ct(w,q) are the raw counts of word w in document d and query q, respectively (in other words, the term frequency of w in d and q, respectively); N_w is the number of documents in the corpus that contain word w, and M is the total number of documents in the corpus. Provide 4 reasons why the retrieval formula above is very unlikely to perform well in a Web search context. Justify your answers. [5]

(b) For a particular query q, the multi-grade relevance judgements of all documents are {(d1,1), (d3,4), (d6,2), (d9,3), (d11,1), (d31,2)}, where each tuple represents a document ID and relevance judgement pair, and all the other documents are judged as non-relevant. Documents are judged on the scale 0-4 (0: not relevant; 4: highly relevant). Two IR systems return their retrieval results with respect to this query as follows (these are all the results they have returned for this query):
System A: {d1, d2, d3, d4, d5, d6, d7}
System B: {d31, d22, d3, d6, d15}
For both System A and System B, compute the following ranking evaluation metrics. You must clearly articulate how you compute each of these metrics. Since there are two DCG definitions discussed in the class, you should use the original one, where 1/log2(rank) is used as the discount factor applied to the gain:
(i) Average Precision (AP). Show your calculations. [3]
(ii) Normalised Discounted Cumulative Gain (NDCG) for each rank position. In your answer, provide the ideal DCG values for the perfect ranking for the given query. You might wish to note that log2 2 = 1; log2 3 = 1.59; log2 4 = 2; log2 5 = 2.32; log2 6 = 2.59 and log2 7 = 2.81. Show your calculations. [6]

(c) URL length has been shown to be an important feature for some Web search tasks. Discuss for which types of information needs on the Web the URL length feature is most appropriate. Consider a linear learning-to-rank model for Web search using 4 features: PL2, Proximity, PageRank and URL length. Using such a model, explain the main disadvantage of using linear learning-to-rank models in Web search. [5]

(d) A posting list for a term in an inverted index contains the following three entries:
id=3 tf=4    id=7 tf=3    id=10 tf=5
Applying delta compression to the ids, show the binary form of the unary-compressed posting list. What is the resulting (mean) compression rate, in bits per integer? [5]

(e) A Web search engine has devised a new interface to present its search results. Describe three specific approaches that could be used by the search engine to evaluate the interface change. Which approach would you recommend and why? [6]
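A sketch of the metrics in question 3(b), using binary relevance (grade > 0) for AP and the original DCG discount (no discount at rank 1, 1/log2(rank) afterwards):

import math

grades = {"d1": 1, "d3": 4, "d6": 2, "d9": 3, "d11": 1, "d31": 2}   # others: 0
run_a = ["d1", "d2", "d3", "d4", "d5", "d6", "d7"]
run_b = ["d31", "d22", "d3", "d6", "d15"]

def average_precision(run, total_relevant=len(grades)):
    hits, precisions = 0, []
    for rank, doc in enumerate(run, start=1):
        if grades.get(doc, 0) > 0:
            hits += 1
            precisions.append(hits / rank)    # precision at each relevant rank
    return sum(precisions) / total_relevant

def dcg(gains):
    # original DCG: gain at rank 1 undiscounted, gain/log2(rank) afterwards
    return gains[0] + sum(g / math.log2(r) for r, g in enumerate(gains[1:], start=2))

def ndcg(run):
    gains = [grades.get(d, 0) for d in run]
    ideal = sorted(grades.values(), reverse=True)[:len(run)]
    return dcg(gains) / dcg(ideal)

print(average_precision(run_a))   # (1 + 2/3 + 1/2) / 6 ≈ 0.361
print(average_precision(run_b))   # (1 + 2/3 + 3/4) / 6 ≈ 0.403
print(ndcg(run_a), ndcg(run_b))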

$25.00

[SOLVED] BU7530-202425 MSC FINANCE DISSERTATION

Assessing the Impact of EU State Aid on the Sustainable Transformation of the Iron and Steel Industries in Montenegro and Albania

PROGRAMME NAME: BU7530-202425 MSC FINANCE DISSERTATION, March 2025

(I) Aims and Objectives

1. Research Aim

The aim of this study is to critically assess the impact of EU state aid policies (including subsidies, tax incentives and regulatory frameworks) on the sustainable transformation of the iron and steel industries in Montenegro and Albania. These aid policies have played a positive and important role in reducing carbon emissions, adopting green technologies and improving industrial competitiveness. As such, these policies benefit not only the economic but also the environmental aspects, supporting the EU's broader decarbonisation and eco-transition objectives. As Montenegro and Albania are EU candidate countries, their alignment with EU industrial and environmental policies makes them ideal cases for examining how external financial and regulatory support can influence sustainable transformation. These two countries were selected because of their increasing integration with the EU, their strategic importance in the steel industry and the urgent need for economic reforms aimed at sustainable development. In addition, the need for these two countries to rely on state aid and foreign investment to modernise their industrial infrastructure provides an opportunity to assess the effectiveness of EU interventions in promoting low-carbon industrial transformation. The steel industry in Montenegro accounts for about 8 per cent of GDP and 12 per cent of industrial employment (World Bank, 2023). For Albania, the steel industry is the main driver of exports, a large part of which goes to the EU (European Commission, 2024). However, EU interventions have not been very effective for these economies due to systemic problems such as outdated domestic infrastructure (World Bank, 2023), lax regulatory enforcement (European Court of Auditors, 2023) and market fragmentation (European Commission, 2023). By integrating financial, environmental and governance indicators, this study will explore whether EU-funded interventions have been effective in achieving the objectives of the EU Green Deal (i.e. to reduce carbon emissions, promote innovation and improve industrial competitiveness).

2. Research Objectives

To achieve this aim, the study will pursue the following objectives. In order to unpack the causal mechanisms through which EU state aid may shape sustainable industrial transformation, the study adopts a multi-dimensional approach. This involves examining both financial and technological channels through which such interventions operate at the firm level.

2.1 Quantify the financial impact of EU state aid policies on steel firms in Montenegro and Albania by using targeted firm-level financial indicators. The analysis will focus on three main metrics: Return on Assets (ROA), to assess how efficiently firms use their assets to generate profits; Return on Equity (ROE), to evaluate profitability from the shareholders' perspective; and EBITDA (Earnings Before Interest, Taxes, Depreciation, and Amortization), to capture firms' operational cash flow and debt sustainability. These indicators are selected because they reflect different dimensions of firm performance: efficiency, profitability, and financial resilience.
The objective is to compare these outcomes between firms receiving EU state aid and those that do not, to test whether subsidised firms demonstrate higher investment returns and/or greater financial volatility. One analytical strand will focus on positive financial outcomes (e.g., improved ROE), while the other will assess financial risk exposure (e.g., higher variance in EBITDA for larger firms subject to corporate tax differentials).

2.2 Assess the influence of EU-funded green R&D subsidies and technology-specific state aid schemes, such as those supporting hydrogen-based steelmaking, carbon capture and storage (CCS), and energy-efficient retrofitting, on innovation outcomes in steel firms. The focus will be on identifying whether such interventions stimulate firm-level adoption of low-carbon technologies and increased patenting activity in environmental technologies. Emphasis will also be placed on how targeted aid differs in impact from broader industrial subsidies.

2.3 Examine the environmental outcomes of EU state aid by measuring changes in CO₂ emissions intensity (per ton of steel produced) and alignment with EU sustainability benchmarks, such as those set under the Green Deal and the Fit-for-55 package. In addition, firm-level ESG scores, drawn from recognised rating agencies (e.g., Sustainalytics, MSCI ESG Ratings), will be used to assess longer-term environmental and governance impacts. While carbon intensity remains the primary environmental indicator, ESG performance will be further explored in a dedicated robustness check.

2.4 Compare the financial, operational, and environmental performance of EU-subsidised and non-subsidised steel firms to assess the effectiveness of state aid in promoting sustainable industrial transformation. Financial metrics will include ROA, ROE, and EBITDA; operational performance will be measured through output growth and productivity levels; and environmental performance will focus on CO₂ intensity and energy efficiency. This comparison will help isolate the net effect of state aid on firm-level transformation.

2.5 Analyse regional implementation differences in EU state aid policy between Montenegro and Albania, with particular attention to governance challenges such as inconsistent criteria in subsidy allocation, limited monitoring capacity, bureaucratic delays, and institutional enforcement gaps. The analysis will also assess how variations in tax incentive structures and administrative procedures influence policy outcomes at the firm level.

2.6 Evaluate the role of firm-level ESG performance as a moderating or secondary explanatory variable in understanding variation in state aid outcomes. ESG scores, drawn from established rating providers such as MSCI or Sustainalytics, will be used to examine how firms with stronger environmental, social, and governance profiles respond differently to EU state aid, particularly in emissions reduction and access to green financing. ESG performance will also be considered in an additional robustness check to test the stability of the main financial and environmental results.

3. Research Questions

Q1: Has EU state aid policy improved the financial performance of subsidised steel companies?
Q2: Do state subsidies encourage increased investment in R&D?
Q3: Does state aid policy reduce carbon emissions in the steel sector?
Hypotheses

H1: EU-subsidised firms exhibit higher returns on investment compared to non-subsidised firms, due to relaxed capital constraints and improved competitiveness associated with state aid.
H2: Firms receiving EU state aid allocate a greater proportion of their budgets to R&D activities, particularly in low-carbon technologies, than firms that do not receive such support.
H3: Firms benefiting from targeted EU state aid experience a reduction in carbon intensity over time, relative to firms without such policy support.
H4: Firms with stronger ESG performance are more likely to qualify for favourable EU state aid conditions and demonstrate faster emissions reductions.

(II) Rationale and Contribution

1. Description of Topic

The steel industry is a crucial component of the European Union's industrial strategy, employing over 300,000 individuals and contributing approximately €140 billion annually to the EU economy (World Steel Association, 2024). However, the steel sector, which accounts for 11 per cent of total CO₂ emissions in the EU, remains one of the largest industrial sources of carbon emissions (European Commission, 2023). The European Green Deal requires the steel industry to reduce its emissions by 55% by 2030 and to achieve climate neutrality by 2050, under a legally binding decarbonisation framework (European Commission, 2023; UNEP, 2023). Since 2020, the EU has allocated over €50 billion in state aid to energy-intensive industries, including steel, in support of these environmental objectives (European Commission, 2023; European Court of Auditors, 2023). Yet the effectiveness of such interventions remains contested, especially in smaller Balkan economies like Albania and Montenegro, where outdated coal-based blast furnaces remain in operation and structural barriers persist (World Bank, 2023; OECD, 2023). These include inadequate infrastructure, fragmented markets, and limited institutional implementation capacity, all of which may constrain the transformative potential of EU-funded aid (European Commission, 2024; OECD, 2023). This study therefore addresses a critical knowledge gap by assessing how EU financial support influences sustainable transformation in steel firms operating in transitional economies.

2. Rationale for the Choice of Topic

Montenegro and Albania, as EU accession candidates, are required to progressively align their industrial and environmental policies with the EU's sustainability frameworks. This study seeks to generate empirically grounded insights to inform the design of EU state aid programmes that effectively support low-carbon industrial transformation in candidate countries with transitional economies. Although the literature on EU state aid has expanded in recent years, it remains disproportionately focused on Western European economies, such as Germany and France. In contrast, the specific institutional and structural challenges faced by the Western Balkans, namely weak regulatory enforcement, systemic corruption, and reliance on external financial assistance, have received limited scholarly attention. This research aims to address this gap by focusing on two underexamined but strategically important cases. The economic relevance of the steel industry in these countries further underscores the importance of this inquiry.
In Montenegro, the steel sector accounts for 8 per cent of GDP, 12 per cent of industrial employment and 22 per cent of total steel exports. In Albania, the iron and steel sector is the main source of industrial emissions, emitting about 1.8 million metric tonnes of carbon dioxide per year, or 15 per cent of the national total. The situation is even more pressing in the Montenegrin steel industry, which alone accounts for 22 per cent of the country's total carbon dioxide emissions. In this context, there are both environmental and economic reasons for accelerating industrial decarbonisation. Failure to achieve a low-carbon transition as soon as possible will not only jeopardise the requirements for EU accession, but also destabilise the domestic economy through increased unemployment, reduced competitiveness and a widening trade deficit. This study is therefore situated at the intersection of economic policy, environmental governance and regional integration.

3. Literature Review

The literature on EU state aid and industrial transformation can be broadly categorised into three areas: (1) firm-level financial outcomes, (2) innovation and environmental performance, and (3) governance and distributional challenges. This section reviews the main contributions in each of these areas.

3.1 State Aid and Financial Performance

Zhang and Wang (2020) show that state subsidies can improve firms' competitiveness and profitability, especially in capital-intensive industries; at the same time, unregulated subsidies may lead to misallocation of resources. The trade-off between short-term fiscal gains and long-term sustainability is also raised by Midttun (2021), who argues that financial assistance without incentives to innovate may delay structural transformation. An empirical study of strategic industries (Zhang & Wang, 2020) similarly reports significant improvements in financial indicators such as return on investment (ROI) and earnings before interest, taxes, depreciation, and amortisation (EBITDA) among firms receiving targeted subsidies. However, long-term investment returns can also be volatile if financial assistance is not strictly regulated. In conclusion, while state aid can improve business profitability in the short term, it must be strictly regulated to avoid inefficiencies and support sustainable competitiveness.

3.2 Innovation and Environmental Outcomes

Innovation plays an important role in achieving long-term environmental goals, especially in carbon-intensive industries such as steel. However, the development of innovation often depends on the availability of well-designed state aid instruments. When subsidies are linked to specific innovation and decarbonisation targets, they can accelerate technological progress and contribute to measurable environmental outcomes. The OECD (2023) criticises the use of subsidy packages that lack environmental conditionality and calls for financial support to be linked to tangible green outcomes, such as carbon emission reductions or investments in low-carbon technologies. Hartley (2021) supports this position, emphasising the role of conditional public-private finance in accelerating innovation, particularly in heavy industries such as steel. Midttun (2021) notes that green R&D is often relegated to the back burner in favour of immediate job retention and industrial stabilisation, especially in emerging economies.
Zhang and Wang (2020) add that untargeted aid may cause firms to delay long-term environmental investments due to weaker regulatory pressures. Innovation-oriented conditional subsidies are more effective than general financial support in advancing environmental transformation goals. Hartley (2021) also notes that long-term investments in decarbonisation are more successful when firms co-finance aid with private capital, such as green bonds or debt swaps.

3.3 Governance and Distributional Imbalance

Governance capacity and the fair distribution of EU state aid have emerged as key concerns in the literature on regional development policy. The European Court of Auditors (2023) reports that only 40% of EU member states apply a comprehensive monitoring framework for state aid, resulting in large regulatory inconsistencies across countries. These problems are particularly acute in candidate countries such as Montenegro and Albania, where institutional capacity to manage, monitor, and evaluate aid remains underdeveloped (European Commission, 2023). Recent research highlights the unequal spatial allocation of green transition funds. According to Pelikaan (2022) and the EU Industrial Policy White Paper (2023), Western European countries continue to receive the majority of EU funding, while newer and candidate member states in the Western Balkans receive significantly less per capita, despite facing more severe structural barriers. This disparity contributes to the emergence of a so-called 'two-speed Europe', whereby well-integrated economies accelerate their green industrial transformation while peripheral regions fall further behind in meeting EU climate and competitiveness targets. Moreover, transparency and enforcement mechanisms in aid disbursement remain weak in some newer member states. The State Aid Scoreboard (European Commission, 2023) warns that in the absence of stronger oversight, risks such as regulatory capture and selective implementation of conditionalities may undermine policy effectiveness in transitional economies. In sum, the literature suggests that unless issues of institutional governance and distributional equity are systematically addressed, EU state aid may reinforce existing disparities rather than mitigate them. This study contributes to this debate by empirically examining whether aid recipients in Albania and Montenegro are disadvantaged not only in terms of financial outcomes, but also in the consistency of policy implementation.

4. Contribution to Financial, Management, and Public Policy

Financial impact: the study will quantify the return on investment (ROI) of state aid to steel companies, enabling investors and policymakers to gain a more intuitive understanding of the profitability risks and opportunities of green subsidies.
Management strategy: by analysing the distribution model, the study will provide companies with recommendations for optimising state aid for technological upgrades, such as the transition from blast furnaces to electric arc furnaces (EAFs).
Public policy: the results of the study will help EU institutions redesign and refine the Country Assistance Framework (CAF) to address regional disparities, enhance accountability and prioritise high-impact decarbonisation projects.

(III) Methodology
1. Research Methodology

To analyse the impact of EU state aid on financial and environmental outcomes, this study will use a difference-in-differences (DiD) model. The two main dependent variables are return on assets (ROA), which indicates financial performance, and carbon intensity (CO₂ emissions per tonne of steel), which reflects environmental outcomes. Control variables include firm size (log of total assets) and leverage (debt-to-equity ratio), which reflect differences in economies of scale and financial structure. In addition, the study will include R&D intensity (R&D expenditure/revenue) as a mediating variable, rather than a control variable, to examine whether state aid indirectly affects environmental and financial outcomes by stimulating investment in innovation. This mediating relationship will be explored using interaction terms or causal mediation analyses, as appropriate. Semi-structured interviews with industry stakeholders will supplement the quantitative analysis by providing context on implementation mechanisms and firm-level decision-making. The DiD model is appropriate for identifying the causal impact of state aid, as it compares outcome changes over time between subsidised and non-subsidised firms, while accounting for firm-level heterogeneity in scale, capital structure, and innovation behaviour.

2. Data Collection

Financial data: data from Bureau van Dijk's Orbis database covering 100-200 steel companies in Montenegro and Albania (2015-2024). Key variables: profitability (ROA, ROE, EBITDA margins); liquidity (current ratio, quick ratio); investment (R&D expenditure, capital expenditure (CapEx)).
State aid data: extracted from the European Commission's State Aid Transparency Database, including subsidy amounts (grants, tax breaks), conditionalities (e.g., emissions reduction targets, job retention quotas) and compliance records (e.g., Montenegro's 2020 violations).
Environmental data: carbon emissions from the 2024 Sustainable Development Indicators Report (Scope 1 and 2). In addition, as part of the robustness checks, ESG performance will be analysed to assess its role in moderating the effectiveness of state aid policies. ESG scores from DSTI-SC(2024)1-FINAL.

3. Sample and Population

Treatment group: 50-100 enterprises in Montenegro and Albania that received EU state aid between 2015 and 2024.
Control group: 50-100 non-subsidised firms, matched by size (employees, revenue), capacity (tonnes/year) and pre-treatment financial performance (average 2015-2019).
Exclusions: companies with incomplete or questionable financial records, and companies that were acquired during the study period.

4. Data Analysis Techniques

4.1 Difference-in-Differences (DiD) Model:

Y_it = α + β1 Treatment_i + β2 Post_t + β3 (Treatment_i × Post_t) + γ X_it + ε_it

Dependent variables (Y): ROA, ROE, carbon intensity (CO₂/tonne of steel), ESG scores.
Treatment variable: binary indicator (1 = received state aid; 0 = did not receive).
Control variables (X): firm size (log of total assets), leverage (debt-to-equity ratio), R&D intensity (R&D expenditure as a share of revenue), EU policy milestones (e.g., Green Deal ratification), and country fixed effects to account for institutional and structural differences between Montenegro and Albania.
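A sketch of how the 4.1 specification could be estimated, assuming a hypothetical firm-year panel (firm_panel.csv and all column names are placeholders) and statsmodels:

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical panel: one row per firm-year with outcome, indicators and controls
df = pd.read_csv("firm_panel.csv")      # placeholder path
df["did"] = df["treatment"] * df["post"]

# Y_it = a + b1*Treatment + b2*Post + b3*(Treatment x Post) + controls + e
model = smf.ols("roa ~ treatment + post + did + log_assets + leverage + C(country)",
                data=df).fit(cov_type="cluster", cov_kwds={"groups": df["firm_id"]})
print(model.summary())                  # b3 (the 'did' coefficient) is the DiD estimate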
Event Study Analysis: To assess stock market reactions to state aid announcements, as a proxy for investor confidence in sustainability outcomes.

5. Ethical Considerations and Data Availability

Ethical Compliance: All data are anonymised and publicly accessible, complying with the GDPR and EU open data regulations. No confidential or personally identifiable information is used.

Data Limitations: Reliance on self-reported emissions data may bias measurements. Sensitivity analyses will assess robustness against alternative emission estimates.
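To make the estimation concrete, here is a minimal sketch of how the DiD specification above could be run in Python with pandas and statsmodels. It is an illustration under stated assumptions, not the study's actual pipeline: the file name steel_panel.csv and the column names (roa, treated, post, log_assets, leverage, country, firm_id) are hypothetical placeholders.

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical firm-year panel; one row per firm per year.
df = pd.read_csv("steel_panel.csv")

# treated: 1 if the firm ever received aid; post: 1 for years after the award.
# treated * post expands to treated + post + treated:post, matching
# Y_it = a + b1*Treatment + b2*Post + b3*(Treatment x Post) + controls.
model = smf.ols(
    "roa ~ treated * post + log_assets + leverage + C(country)",
    data=df,
).fit(cov_type="cluster", cov_kwds={"groups": df["firm_id"]})

print(model.summary())  # the treated:post coefficient is the DiD estimate

Clustering standard errors by firm is one common choice for panels like this; the same formula with carbon intensity as the dependent variable would give the environmental counterpart.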


[SOLVED] COMPSCI 5096 TEXT AS DATA 2022

DEGREES OF MSc, MSci, MEng, BEng, BSc, MA and MA (Social Sciences) TEXT AS DATA M COMPSCI 5096 Tuesday 3 May 2022

1. Question on Distributional Semantics and Word Embedding. (Total marks: 18)
Consider the problem of finding an outlier word from among a list of other similar words, e.g., out of the following set of words - linux, windows, solaris, android, java - the word 'java' is an outlier (because the other words are names of operating systems). Given a list of such words, your task is to automatically infer the outlier word. With respect to this task, answer the following questions.
(a) An approach to solve the word intrusion problem is to represent words as vectors and then make use of the relative distances/similarities between the vectors for finding the outlier. Assuming you know (by the output of some process) the vector w for a word w, describe the pseudo-code for finding the outlier word. Task: Describe the pseudo-code of this algorithm. Clearly state your assumptions and introduce your notations in the algorithm. [5]
One solution to the word outlier detection problem that does not require learning any parameters (via gradient descent) is the distributional semantics vector approach, where each word is represented via a bag-of-words vector of contexts. Now, answer the following:
(b) The window size, k, used to define the contexts for each word is an important parameter of this approach. What happens if k is too large or too small? [2]
(c) Describe the pseudo-code of this approach that requires only a single pass through a collection (clearly describe the data structures for an efficient solution). [5]
(d) Discuss (with an example) why the vectors of function words (frequent words, such as 'of', 'the' etc.) obtained with this approach are not reliable. [2]
Now consider word2vec, which is a noise contrastive estimation based method that learns the vectors for each word. With respect to word2vec, answer the following questions.
(e) What is the role of negative samples in the objective function of word2vec? [2]
(f) Comment on word2vec's output for a word with multiple meanings, such as jaguar, bank or python. What would you expect to find as the nearest neighbours of such polysemous words? What is the problem if you use the vectors of such words for another task such as text classification? [2]

2. Question on word frequencies and language model (Total marks: 15)
An alien probe crashes to Earth containing a short passage of alien text. The alien text uses a five letter alphabet: [a, b, c, d, e] with no punctuation or spaces. Below is a short section of the text: abcaedabccbaedabceda
(a) Using character n-grams, write out all of the trigrams that appear more than once, with their frequency, for the sample text above. [3]
Example Answer:
trigram  frequency
abc      3
eda      3
aed      2
dab      2
(b) Provide the theoretical maximum number of character n-grams for the alien probe full text for n = 1, 2, 3, 4 and 5. The full text found in the probe is 593 characters long. [3]
Example Answer:
n  max n-grams
1    5
2   25
3  125
4  590
5  589
(c) A linguist makes a breakthrough in understanding the tokens used in the alien text. She provides two possible ways to tokenize the sample text.
(i) In plain English, explain a single rule that could reproduce this first tokenization: a bca eda bccba eda bceda [1]
Example Answer: Start a new token whenever the previous character is 'a'.
(ii) In plain English, explain a single rule that could reproduce this second tokenization: ab caedab ccbaedab ceda [1]
(d) More alien probes crash land in different parts of the world. Scientists want to measure the similarity between the text found in each probe. Here are two tokenized probe text fragments.
Probe Text A: a eda bceda eda bcda bce
Probe Text B: ca eda bcba eda bceda eda bce
(i) Calculate the Sørensen–Dice Coefficient and Jaccard Similarity between the two probe texts. Show your work. [4]
(ii) Calculate the similarity between the third probe text (Probe Text C below) and the two prior probe texts using the Sørensen–Dice Coefficient. Using these results, show that the Sørensen–Dice Coefficient is a semi-metric, as it breaks the triangle inequality.
Probe Text C: beda bceda bceca ebeda bceda b [3]

3. This question is about Natural Language Processing (Total marks: 18)
You just landed an awesome job at the Intellectual Property Office. As your first project, you have been tasked with automatically classifying submitted patent applications into one of the eight broad International Patent Classification sections.
(a) You start by applying a typical pre-processing pipeline that consists of case normalisation and a stemmer. Within the context of the patent classification application, clearly justify these two pre-processing stages and provide an example that shows why each could lead to improved classification performance. [4]
(b) You recall from Text as Data that NLP features, such as parts of speech, are often helpful for classification tasks. Within the context of patent classification, provide and justify a specific example where considering a word along with its part-of-speech may help distinguish between two of the above sections. [3]
(c) Armed with the above intuition, you select an off-the-shelf part-of-speech tagger (based on a Hidden Markov Model) that reports 97% accuracy and apply it to some sample patents to ensure that it produces reasonable part-of-speech tags. To your dismay, you find that it frequently makes mistakes. On closer inspection, you observe that the errors are usually on specialised, domain-specific language in the patents. Explain why this problem arises and what you could do to fix it. [4]
(d) You want to identify whether two systems (called System A and System B) are better than a baseline method at the classification task. The following table shows intrinsic evaluation metrics obtained over the classification on the train and test sets:
           Train Set            Test Set
           Precision  Recall    Precision  Recall
Baseline   0.61       0.42      0.56       0.50
System A   0.62*      0.43*     0.58*      0.51
System B   0.67*      0.48*     0.51       0.42
* statistical significance w.r.t. baseline (t-test with p-value < 0.05)
Discuss the effectiveness (e.g., generalizability, overfit/underfit, performance on training/test sets etc.) of models A and B in comparison to the 'Baseline' method. [3]
(e) Meanwhile, another team has been busy building a BERT-based text classifier, and they have found that it also works well on the task. You decide to join forces with them. Without using an ensemble approach, how might you go about including explicit parts-of-speech into their BERT-based model? How is the technique different from the approach you took in your linear bag-of-features model? [4]
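As a worked aid for Question 2, here is a minimal Python sketch that counts the character trigrams and computes the Sørensen–Dice and Jaccard similarities. Treating each probe text as a set of unique tokens is our assumption; the question itself leaves the set-vs-multiset choice to the student.

from collections import Counter

text = "abcaedabccbaedabceda"
# Slide a window of width 3 over the text and count each trigram.
trigrams = Counter(text[i:i + 3] for i in range(len(text) - 2))
print({g: c for g, c in trigrams.items() if c > 1})  # abc, eda, aed, dab

def dice(a, b):
    # Sorensen-Dice over token sets: 2|A n B| / (|A| + |B|)
    return 2 * len(a & b) / (len(a) + len(b))

def jaccard(a, b):
    # Jaccard over token sets: |A n B| / |A u B|
    return len(a & b) / len(a | b)

A = set("a eda bceda eda bcda bce".split())
B = set("ca eda bcba eda bceda eda bce".split())
print(dice(A, B), jaccard(A, B))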


[SOLVED] COMPSCI 5089 Introduction to Data Science and Systems 2022

DEGREES of MSci, MEng, BEng, BSc, MA and MA (Social Sciences) Introduction to Data Science and Systems COMPSCI 5089

1. (a) You are designing an application for clothing shops to predict clothes size based on customer height and weight. Suppose we have a clothing dataset with the height, weight and corresponding T-shirt size of several customers. You can represent this dataset based on vector representations by regarding height and weight as two dimensions. Now there is a new client, Abel (U0), whose height is 173cm and weight is 62kg. You are asked to predict the T-shirt size for Abel.
(i) Calculate the Euclidean distance (L2 norm) between the new point and the existing points. [3]
(ii) Predict the size of Abel, based on the kNN algorithm, with k = 3 and the above calculated distances. Justify your prediction. [2]
(b) For all answers, include in your answer document both code and the output of that code.
(i) Calculate the covariance matrix for the clothing dataset using numpy. [1]
(ii) Calculate the eigenvectors and eigenvalues of the covariance matrix using numpy. [2]
(iii) Dimensionality reduction: map the clothing dataset onto the principal component with the largest eigenvalue of its covariance matrix. [2]
(c) (i) Find the SVD for A = ; you should include full working in your solution. [3]
(ii) State the relationships between the determinant, matrix inversion and non-singularity. [2]

2. Consider a tennis player, Ed Balls, who wants to prepare for a competition match against an opponent, let's call him Frank Racket. In order to prepare for the match, Ed has acquired records of the 100 previous matches of his opponent and wants to study statistics of Frank's play to choose where to focus his training. (Here is a quick summary of the rules of tennis: https://protennistips.net/tennis-rules/) Ed is interested in studying Frank's serve as this can be an important strategic advantage.
• For a serve to be valid, it must pass the net and bounce in the diagonally opposite service box.
• If the first serve is a fault (e.g., hits the net or bounces outside the service box), the player can attempt a second serve.
• If the player makes a second fault, he loses the point.
Ed wants to study where Frank's serves bounce in the service box to plan his positioning on the court. We have N_F = 1,000 examples of first serves from Frank, and N_S = 1,000 examples of second serves. We want to estimate the distributions of the bounce location x for Frank's first serve, p(x|first), and second serve, p(x|second). For simplicity,
• we denote the corner closer to the net and towards the centre of the court as position (0,0), and the corner towards the outside of the court and away from the net as (1,1);
• we will ignore serves that hit the net.
This means that values outside [0, 1] × [0, 1] indicate that the serve is a fault.
(a) How would you use the empirical distribution to get an estimate of p(x|first)? Explain the steps, the parameters that need to be set and the associated trade-offs. [4]
(b) Ed now wants to model Frank's serves using a normal distribution:
(i) Explain the parameters, their effect on the distribution and the best way to estimate them in this scenario. [4]
(ii) What could be the problem with this choice of model? Give an example of a situation where it would be inappropriate (you can use a diagram to illustrate your example). [2]
(c) Ed has found that his normal model is not accurate enough for him.
In order to get a more accurate model of the data, he decides to use a mixture of Gaussians. Explain how the model would be parameterised, and how you would fit the model to the available data (provide the relevant equations). [5]

3. Pretend that you are the new head of a local radio station, IDSS Radio, tasked with renewing the station's image and programme. The radio's programming and popularity have varied over the years and you want to use a data science approach to find the right type of programming for the local audience. To this end you start by categorising the programming of the radio between types of content: C = {music, news, business, fiction, comedy, advertisement}. You have historical records of the proportion of each content type in the radio programme for every month over the last ten years, as well as a rating r by a sample of the audience on a scale between 1 and 10, where 1 means "hate it" and 10 means "love it". Considering a programme p = [p_m, p_n, p_b, p_f, p_c, p_a] ∈ R^6 that gives the number of hours for each content type, we are interested in studying the function r(p) that gives the listeners' rating for this programme.
(a) As a first attempt, you decide to assume that the function r(p) is linear, and therefore to solve it using linear least squares, of the canonical form (from the lecture notes):
(i) Explain what each variable in this equation means in this scenario, specifying their dimensions, and what the result would be. [4]
(ii) Can you name a reason why this may not be a good model? How could you measure this using your data? [3]
(b) We want to try to fit another model, this time assuming that listeners' preferences peak for certain quantities of each programme type, and then decrease again if the quantity increases even more. We can model this quantity preference as a bell-shaped function over the quantity p_z for each type of content z: B_z(p_z) = α_z exp(−β ‖p_z − μ_z‖²), and the overall predicted preference for a programme p as:
(i) How many parameters do you need to estimate in this case? Explain the role of each parameter. [3]
(ii) What would be the most appropriate approach to fit this model to your data (note: all of the functions above are differentiable, but B_z is clearly not linear)? Explain how you would parametrise this problem (you are not asked to solve it!). [3]
(c) Using this model r̂, how would you use optimisation to find the best programme, knowing that you want to run the radio from 6am to midnight daily, and need at least 1 hour of advertisement per day to cover the radio's running costs? How would you resolve this optimisation? [2]

4. (a) Consider a relation Weather(Id, Time, Longitude, Latitude, Temperature, Humidity), where the primary key (Id) is a 116-byte string hash code, Time is an 8-byte Datetime, and the other fields are stored as 32-bit floats. Assume that the relation has 30000 tuples, stored in a file on disk organised in 4096-byte blocks. Note that the database system adopts fixed-length records, i.e., each file record corresponds to one tuple of the relation and vice versa.
(i) Compute the blocking factor and the number of blocks required to store this relation. [2]
(ii) You are told that you will need to frequently add new records and you will not often read and fetch a record. Describe in detail the file organisation that you would expect to exhibit the best performance characteristics. Explain your answer by comparing the cost of reasonable alternatives.
[3]
(b) Consider the following three relations:
• Student(Id, FirstName, LastName, DateOfBirth), or S, where:
– the primary key (Id) is a 32-bit integer,
– FirstName and LastName are both 96-byte strings, and
– DateOfBirth is a 32-bit integer.
• Course(Id, Description, Credits), or C, where:
– Id, the primary key of this relation, is a 32-bit integer,
– Description is a 195-byte string, and
– Credits is an 8-bit integer.
• Transcript(StudentId, CourseId, Mark), or T, where:
– StudentId is a foreign key to the primary key (Id) in the Student relation,
– CourseId is a foreign key to the primary key (Id) in the Course relation,
– Mark is an 8-byte double precision floating-point number, and
– the primary key consists of the combination of StudentId and CourseId.
Assume these relations are also organised in 4096-byte blocks, and that:
• relation Course (C) has rC = 32 records and nC = 2 blocks, organised in a heap file,
• relation Transcript (T) has rT = 51200 records and nT = 200 blocks, organised in a sequential file, ordered by StudentId,
• relation Student (S) has rS = 2000 records and nS = 100 blocks, stored in a heap file, and has a 4-level secondary index on StudentId.
Further assume that the memory of the database system can accommodate nB = 23 blocks for processing and that the blocking factor for the join-results block is bfrRS = 10 records per block. Last, assume we execute the following equi-join query:
SELECT * FROM Transcript AS T, Student AS S, Course AS C
WHERE T.StudentId = S.Id AND T.CourseId = C.Id
As this is a 3-way join, assume that you need to join T with C first, with each block of intermediate results stored only in RAM (in one of the nB blocks), then joined with S.
(i) Describe the join strategy that would be the most efficient in this case and estimate its total expected cost (in number of block accesses). Show your work. [8]
(ii) Compare the Naive Nested-Loop Join and the Index-based Nested-Loop Join. Which one is faster? Explain why. [2]
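For the storage arithmetic in Question 4(a)(i), the following minimal Python sketch shows the standard blocking-factor calculation; the helper names are ours, and the 140-byte record size simply adds up the field sizes given in the question (116 + 8 + 4 × 4).

import math

BLOCK_SIZE = 4096  # bytes per disk block, as stated in the question

def blocking_factor(record_bytes):
    # Fixed-length records: whole records per block, rounded down.
    return BLOCK_SIZE // record_bytes

def blocks_needed(num_records, record_bytes):
    # A partially filled block still occupies a whole block, so round up.
    return math.ceil(num_records / blocking_factor(record_bytes))

record = 116 + 8 + 4 * 4  # Id + Time + four 32-bit floats = 140 bytes
print(blocking_factor(record))       # 29 records per block
print(blocks_needed(30000, record))  # 1035 blocks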


[SOLVED] COMPSCI 5011 INFORMATION RETRIEVAL 2022

DEGREES OF MSc, MSci, MEng, BEng, BSc, MA and MA (Social Sciences) INFORMATION RETRIEVAL M COMPSCI 5011 Friday 29 April 2022

1. (a) The following documents have been processed by an IR system where stemming is not applied:
DocID  Text
Doc1   france is world champion 1998 france won
Doc2   croatia and france played each other in the semifinal
Doc3   croatia was in the semifinal 1998
Doc4   croatia won the other semifinal in russia 2018
(i) Assume that the following terms are stopwords: and, in, is, the, was. Construct an inverted file for these documents, showing clearly the dictionary and posting list components. Your inverted file needs to store sufficient information for computing a simple tf*idf term weight, where w_ij = tf_ij * log2(N/df_i). [5]
(ii) Compute the term weights of the two terms "champion" and "1998" in Doc1. Show your working. [2]
(iii) Assuming the use of a best match ranking algorithm, rank all documents using their relevance scores for the following query: 1998 croatia. Show your working. Note that log2(0.75) = -0.4150 and log2(1.3333) = 0.4150. [3]
(b) (i) In Web search, explain why the use of raw term frequency (TF) counts in scoring documents can hurt the effectiveness of the search engine. [2]
(ii) Suggest a solution to alleviate the problem, and show through examples how it might work. Explain through examples how modern term weighting models in IR control the raw term frequency counts. [3]
(c) Assume that you have decided to modify the approach you use to rank the documents of your collection. You have developed a new Web ranking approach that makes use of recent advances in neural networks. All other components of the system remain the same. Explain in detail the steps you need to undertake to determine whether your new Web ranking approach produces a better retrieval performance than the original ranking approach. [5]

2. (a) Consider a corpus of documents C written in English, where the frequency distribution of words approximately follows Zipf's law r * p(w_r|C) = 0.1, where r = 1, 2, ..., n is the rank of a word by decreasing order of frequency, w_r is the word at rank r, and p(w_r|C) is the probability of occurrence of word w_r in the corpus C. Compute the probability of occurrence of the most frequent word in C. Compute the probability of occurrence of the 2nd most frequent word in C. Justify your answers. [4]
(b) Consider the query "michael jackson music" and the following term frequencies for the three documents D1, D2 and D3, where the search engine is using raw term frequency (TF) but no IDF:
      indiana  jackson  life  michael  music  pop  really
D1    0        4        1     3        0      6    1
D2    4        0        3     4        1      0    2
D3    0        4        0     5        4      4    0
Assume that the system has returned the following ranking: D2, D3, D1. The user judges D3 to be relevant and both D1 and D2 to be non-relevant.
(i) Show the original query vector, clearly stating the dimensions of the vector. [2]
(ii) Use Rocchio's relevance feedback algorithm (with α=β=γ=1) to provide a revised query vector for "michael jackson music". Terms in the revised query that have negative weights can be dropped, i.e. their weights can be changed back to 0. Show all your calculations. [4]
(c) Suppose we have a corpus of documents with a dictionary of 6 words w1, ..., w6.
Consider the table below, which provides the estimated language model p(w|C) using the entire corpus of documents C (second column) as well as the word counts for doc1 (third column) and doc2 (fourth column), where ct(w, doci) is the count of word w (i.e. its term frequency) in document doci. Let the query q be the following: q = w1 w2
Word  p(w|C)  ct(w, doc1)  ct(w, doc2)
w1    0.8     2            7
w2    0.1     3            1
w3    0.025   2            1
w4    0.025   2            1
w5    0.025   1            0
w6    0.025   0            0
SUM   1.0     10           10
(i) Assume that we do not apply any smoothing technique to the language model for doc1 and doc2. Calculate the query likelihood for both doc1 and doc2, i.e. p(q|doc1) and p(q|doc2) (do not compute the log-likelihood; i.e. do not apply any log scaling). Show your calculations. Provide the resulting ranking of documents and state the document that would be ranked the highest. [3]
(ii) Suppose we now smooth the language model for doc1 and doc2 using Jelinek-Mercer smoothing with λ = 0.1. Recalculate the likelihood of the query for both doc1 and doc2, i.e., p(q|doc1) and p(q|doc2) (do not compute the log-likelihood; i.e. do not apply any log scaling). Show your calculations. Provide the resulting ranking of documents and state the document that would be ranked the highest. [4]
(iii) Explain which document you think should reasonably be ranked higher (doc1 or doc2) and why. [3]

3. (a) How would the IDF score of a word w change (i.e., increase, decrease or stay the same) in each of the following cases: (1) adding the word w to a document; (2) making each document twice as long as its original length by concatenating the document with itself; (3) adding some documents to the collection. You must suitably justify your answers. [4]
(b) Explain in detail why positive feedback is likely to be more useful than negative feedback to an information retrieval system. Illustrate your answer using an example from a suitable search scenario. [4]
(c) Neural retrieval models often use a re-ranking strategy over BM25 to reduce computational overhead. Explain the key limitation of this strategy. Describe in sufficient detail an approach that you might use to overcome this problem. [5]
(d) Consider a query q, which returns all webpages shown in the hyperlink structure below.
(i) Write the adjacency matrix A for the above graph. [1]
(ii) Using the iterative HITS algorithm, provide the hub and authority scores for all the webpages of the above graph after a complete single iteration of the algorithm. Show your workings. [3]
(iii) Describe in sufficient detail an alternative approach to compute the hub and authority scores for the above graph. You need to show all required steps to generate the scores, but you do not need to actually compute the final scores. [3]
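For Question 2(c)(ii), here is a minimal Python sketch of Jelinek-Mercer smoothed query likelihood using the table above; the variable names are ours, and only the two query words are included since the others do not affect p(q|d).

# p(q|d) = product over query words w of (1-lam)*ct(w,d)/|d| + lam*p(w|C)
p_C = {"w1": 0.8, "w2": 0.1}     # corpus language model, from the table
doc1 = {"w1": 2, "w2": 3}        # word counts in doc1
doc2 = {"w1": 7, "w2": 1}        # word counts in doc2
dlen = 10                        # both documents contain 10 words
lam = 0.1                        # Jelinek-Mercer interpolation weight

def query_likelihood(doc, query=("w1", "w2")):
    p = 1.0
    for w in query:
        p *= (1 - lam) * doc.get(w, 0) / dlen + lam * p_C[w]
    return p

print(query_likelihood(doc1), query_likelihood(doc2))

Setting lam = 0 recovers the unsmoothed maximum-likelihood estimate asked for in part (i).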


[SOLVED] COMPSCI5089 Intro to Data Sci Systems 2024

Intro to Data Sci & Systems M COMPSCI5089 Friday 20 December 2024

1. This question is concerned with the Linear Algebra part of the course.
Note: When answering this question, you are recommended to use either Numpy pseudo-code or Latex syntax (at your preference) for typing mathematical answers into Moodle. Incorrect syntax will not be penalised as long as it is clear and unambiguous. For example, the identity matrix could be written as [[1,0],[0,1]] and the matrix inverse as A^-1 or inv(A).
Consider that you are working for a shipment company and studying the movements of parcels between 5 sites: A, B, C, D and E. The transitions between those sites every day are expressed in the following graph:
(a) (i) What is the adjacency matrix for this graph? Provide the corresponding matrix (note: ensure that the edge weights are correctly encoded). [3]
(ii) Assume that at t = 0 you have the following distribution: A = 100, B = 10, C = 20, D = 0, E = 0. What would the distribution be at t = 1? [2]
(iii) How would you calculate the package distribution two days ago (x_(t=-2))? Detail the approach you would use, but you do not have to calculate the actual values. [2]
(iv) How would you transform this adjacency matrix to make the graph undirected (i.e., ensure that paths between any two nodes go both ways)? [2]
(b) What is a steady state of A? Explain two ways to calculate the steady state for this process. [3]
(c) Consider the 2 × 3 matrix A with the following SVD decomposition A = UΣV^T, where
(i) What are the singular values of A? [3]
(ii) How can you calculate the pseudo-inverse of A, A+, from this decomposition? Explain all steps. (Hint: we have (AB)+ = B+A+ and, if A is invertible, then A+ = A^-1.) [5]

2. This question is concerned with the optimisation part of the course.
Hint: For the following questions, you can use ((1 0) (0 1))^-1 to represent matrix inverses and simplify your typing.
You are given the following linear least squares optimisation problem. Minimise the cost function f(x) = ‖Ax − b‖², where:
• x ∈ R2 is the vector of unknowns,
• A ∈ R2×2 is a matrix of known values,
• b ∈ R2 is a vector of known values.
Given:
(a) Solve the least squares problem using the normal equations method to find the optimal solution x*. Hint: The normal equations are derived from the gradient of the least squares function, set to zero: A^T A x = A^T b. And [5]
(b) Solve the least squares problem using gradient descent, starting from an initial guess x0 = and using a step size α = 0.5. Perform two iterations. Hint: The gradient of the least squares cost function is given by ∇f(x) = 2A^T(Ax − b). You can use delta to represent ∇. [5]
(c) Discuss the merits of stochastic gradient descent (SGD) for solving least squares problems, especially in the context of large datasets. [4]
(d) Now, consider the same least squares problem, but with the additional constraint that x1 + x2 = 1. Solve this constrained optimisation problem using the Lagrange multiplier method. Hint: You don't need to substitute the values of x, but describe the overall steps with formulas. [6]

3. This question is concerned with the probabilities part of the course.
Note: When answering this question, you are recommended to use either Numpy pseudo-code or Latex syntax (at your preference) for typing mathematical answers into Moodle. Incorrect syntax will not be penalised as long as it is clear and unambiguous.
For example, the identity matrix could be written as [[1,0],[0,1]] and the matrix inverse as A^-1 or inv(A).
Let us assume that for a user study you have recorded the gaze of users when navigating a webpage (i.e., you have recorded which part of the webpage they were looking at). As a result, you have obtained a database D of 100 gaze locations for 100 users. Each record provides you with the x and y coordinates of the user's gaze location on the page. We will assume that the coordinates are normalised between 0 and 1, such that (0, 0) indicates the top left corner of the page and (1, 1) the bottom right corner.
(a) As a first attempt at the problem, you decide to assume that the distribution of users' gazes is normally distributed. Explain:
(i) how this distribution would be parametrised, stating the dimensionality of each parameter; [2]
(ii) and how you would estimate those parameters from D. [3]
(b) After initial experiments, your model appears to perform very poorly:
(i) In what case would this assumption of a normal distribution be obviously wrong? How would you identify that from the data in D? [2]
(ii) Propose an alternative model you could use and explain how it would be parametrised (stating all dimensions). [3]
(iii) How would you estimate those parameters? [3]
(c) Let us assume that out of the 100 users you recorded, 25 did actually buy something on the site, and that in addition to the users' gaze, you have also recorded which users decided to buy something and which did not. Using this data, how would you estimate how likely a user is to buy something given that they have gazed at a location g0? Explain all the steps of your approach in detail. [7]

4. This question is concerned with the databases part of the course.
(a) You are given the following two relations:
• Course (C):
– Schema: Course(Id, Description, Credits)
– Attributes:
* Id: a 4-byte integer (primary key)
* Description: a 256-byte string
* Credits: a 1-byte integer
– Total Records (rC): 32
• Transcript (T):
– Schema: Transcript(StudentId, CourseId, Mark)
– Attributes:
* StudentId: a 4-byte integer (foreign key)
* CourseId: a 4-byte integer (foreign key to Course(Id))
* Mark: an 8-byte double precision floating-point number
* Primary key: the combination of StudentId and CourseId
– Total Records (rT): 51,200
Assume: the size of a disk block is 4096 bytes, and CourseId in T references Id in C.
(i) Calculate the number of blocks needed to store each relation (C and T). Hint: To simplify your typing, you can use ceil(x) and floor(x) to represent the ceiling and floor functions used to round numbers to the nearest integer. [6]
(ii) Estimate the selection cardinality of joining the relations C and T on the attribute CourseId. Assume that the courses in C are uniformly enrolled by all the students and appear in the Transcript (T) records. [2]
(iii) Explain how the selectivity helps in the query process. [2]
(b) You are a data engineer at a company that develops personalised music streaming services. The platform needs to recommend songs to users based on their listening history and preferences. Each song is represented by a high-dimensional feature vector that includes acoustic attributes, artist information, and embedded representations from machine learning models.
(i) Which types of database would you choose to store and query the song data? Justify your choice.
[6]
(ii) Describe how you would store and index the song data to allow efficient retrieval and recommendation. [4]
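For Question 2, here is a minimal numpy sketch contrasting the normal-equations solution with a few gradient-descent iterations. Because the exam's actual A, b and x0 values are omitted in this copy, the values below are placeholders (and a smaller step size is used so the placeholder problem converges).

import numpy as np

A = np.array([[2.0, 0.0], [0.0, 1.0]])  # placeholder data
b = np.array([2.0, 3.0])                # placeholder data

# (a) Normal equations: solve A^T A x = A^T b.
x_star = np.linalg.solve(A.T @ A, A.T @ b)

# (b) Gradient descent on f(x) = ||Ax - b||^2 with grad f = 2 A^T (Ax - b).
x = np.zeros(2)      # placeholder initial guess x0
alpha = 0.1          # placeholder step size (the exam specifies 0.5)
for _ in range(2):
    x = x - alpha * 2 * A.T @ (A @ x - b)

print(x_star, x)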


[SOLVED] Building a Project Management Application

Task(s)

Part 1: Building a Project Management Application (70 Marks)
In this project, you must apply knowledge of C++ object-oriented programming to design and implement a console-based Project Management application that allows users to manage projects, tasks, team members, vendors and clients. The program should meet the following requirements:
1. Project Class:
a. Create a Project class with attributes such as project name, description, start date, end date, and status (e.g., "Not Started", "In Progress", "Completed").
b. Implement methods to add, update, and delete projects.
2. Task Class:
a. Create a Task class with attributes such as task name, description, start date, end date, and status (e.g., "Not Started", "In Progress", "Completed").
b. Implement methods to add, update, and delete tasks.
c. Each task should be associated with a project.
d. A task may have one level of sub-tasks. Each main task may have multiple sub-tasks.
3. Team Member Class:
a. Create a TeamMember class with attributes such as team member name, role, and contact information.
b. Implement methods to add, update, and delete team members.
c. Each team member should be associated with a project.
4. Vendor Class:
a. Create a Vendor class with attributes such as company name, company type (what resource does this vendor provide, e.g. IT services, cabling, software coding etc.), and contact person(s).
b. A project may have multiple vendors.
c. Vendors can also be assigned to a task.
d. Implement methods to add, update, and delete vendors.
5. Client Class:
a. Create a Client class with attributes such as company name, company type (which industry they are in, like finance, banking, IT, construction, plantation etc.), and contact person(s).
b. A project may have multiple clients.
c. Implement methods to add, update, and delete clients.
6. Project Management System Class:
a. Create a ProjectManagementSystem class that manages projects, tasks, and team members.
b. Implement methods to display project details, task details, and team member details.
c. Implement methods to assign tasks to team members and update task status.
7. Menu-Driven Interface:
a. Create a menu-driven interface that allows users to interact with the Project Management Application. The menu should include options to:
i. Add, update, and delete projects
ii. Add, update, and delete tasks
iii. Add, update, and delete team members
iv. Add, update, and delete vendors
v. Add, update, and delete clients
vi. Assign tasks to team members
vii. Update task status
viii. Display project details, task details, and team member details
8. Additional Functions: include 2 relevant additional functions (not described in this assignment) that will further demonstrate your ability to use object-oriented programming (OOP) concepts.
9. Code Standard:
a. You must demonstrate your ability to use as many OOP concepts as possible for your project. Use classes and objects, composition, inheritance, and polymorphism where necessary. At least one static function must also be applied in your program.
b. Your project should have one main.cpp source file, multiple header files, and text file(s).
c. All data should be saved into text files, preferably in CSV format.
d. Use exception handling to handle possible input errors.
e. Use suitable STL containers.

Part II: Project Documentation (30 Marks)
Create project documentation that covers the following details:
1. A set of UML diagrams that shows the functionalities, the relationships between classes, and the workflow of your project management application. Use appropriate UML tools and diagramming techniques.
2. Explain why the STL containers you used to develop the project management application are suitable in this project context.
3. Explain the rationale for the inheritance and polymorphism you implemented in your project. For example, you may clarify why a class should inherit from another class (or classes) and how you enabled polymorphism in your program code.
4. Explain the rationale for using the static function in your program.
5. Explain the 2 additional functions you implemented in your program and the OO concepts that you implemented in those 2 additional functions.


[SOLVED] MST 1013 Research in Education Assignment 1

Research in Education MST 1013 May-Sept 2025 Assignment 1: Developing Your Own Research Focus
Contribution: 30%
Submission Date: Week 5
Scope: Introduction to Educational Research

Objective
To help students understand the differences between a discipline, a topic, and a research title in educational research, and to apply this understanding by creating their own research focus based on personal interests or professional goals.

Instructions
1. Choose one discipline within the field of education that interests you. Examples: Science Education, Early Childhood Education (ECE), Music Education, Educational Technology, Inclusive Education, etc.
2. Based on your chosen discipline:
o Identify a specific research topic.
o Create one research title based on that topic (ensure clarity and focus).
o Formulate a research problem (approx. 150 words) that you aim to address.
Submission Date: Week 3


[SOLVED] PHAS0008 Practical Skills 1P Experiment T8

PHAS0008 "Practical Skills 1P" Experiment T8: Specific Latent Heat of Liquid Nitrogen (4 Sessions)

Experimental Objectives
To determine the specific latent heat of vaporisation of liquid nitrogen. This quantity is also known as the specific enthalpy change on vaporisation. To determine the specific latent heat of melting of water and/or heavy water ice.

Relevant Lecture Course
• Thermal Physics and the Properties of Matter (PHAS0006)

Potential Hazard: Latent Heat of Liquid Nitrogen
• Nitrogen is non-flammable and has approximately the same density as air. Inhalation of a nitrogen-enriched atmosphere (i.e. loss of oxygen) may cause dizziness, drowsiness, nausea, vomiting, excess salivation, diminished mental alertness, loss of consciousness, and ultimately death.
• Freeze burns from spilled liquid nitrogen that leaves the dewar or the equipment, for example when retrieving samples.

Existing Control Measures
• Prevent unauthorised people having access to areas used for delivering, storing, dispensing and using liquid nitrogen.
• Avoid direct skin contact with items which have recently been in proximity to liquid nitrogen, by using insulated gloves or tongs.
• Oxygen depletion monitors are situated around the laboratories and will detect any build-up of nitrogen in the laboratories.
• A technician or qualified demonstrator trained in the use of liquid nitrogen distributes the liquid nitrogen.
• Users are required to wear safety glasses at all times when using the nitrogen.
• Only persons fully trained in the use of cryogenic liquids may use the LN2.
• The container of the liquid nitrogen is covered once the nitrogen has been dispensed.
• The wiring of the circuit for experiments involving the use of liquid nitrogen is checked by the technician and demonstrator before work is allowed to continue.
• Supervision from technicians and demonstrators regarding health and safety.
• Safety guidelines are adhered to at all times.
• No lone working permitted at any time.

Risk Level with existing controls: Low/Tolerable

Safety Note: Use of Liquid Nitrogen
Please read this Safety Services policy:
• https://www.ucl.ac.uk/safety-services/policies/2022/dec/liquid-nitrogen
This experiment uses a small amount of liquid nitrogen. There is no reason for you to come into contact with the liquid nitrogen; however, if you do, there is a possibility that the extremely low temperature of the liquid may cold-burn you. To avoid accidents you should take the following precautions.
• Use the safety spectacles provided while performing the experiment.
• Remove rings. If liquid nitrogen falls on your hands it could be trapped behind a ring and this could result in burning.
• On finishing the experiment, ask the technician to return the unused liquid to the storage vessel.
Should you come into contact with the liquid, please note that small splashes of liquid nitrogen on your skin will not harm you. However, exposure to the liquid for more than 3 or 4 seconds may cause cold burns. If this should happen for any reason, call the lab technician for help and take steps to get the liquid away from your skin. If possible run cold water over the affected region. Please note that on evaporation, one litre of liquid nitrogen will produce around 700 litres of gas.

1. Introduction
The term "latent heat" was first used by Joseph Black in a posthumous work, Lectures on the Elements of Chemistry, published in 1803, but describing experiments done 40 years earlier [1, 2].
The term was first applied to the heat required to vaporise a liquid, but a similar effect is encountered when going from solid to liquid. The modern definition of the specific latent heat of vaporisation, as given, for example, in Chambers' Dictionary of Science and Technology [3], is "The heat required to change the state of unit mass of a substance from solid to liquid, or from liquid to gas, without change of temperature. Most substances have a latent heat of fusion and latent heat of vaporization. The specific latent heat is the difference in enthalpies of the substance in its two states. Unit J kg-1". In Black's day, the latent heat was quantified by comparing the time taken to boil a vessel of water dry with the time taken to bring it to boiling point, assuming a constant rate of heat flow. Nowadays we have more accurate ways of measuring the heat input. Latent heat is a key quantity in many natural and industrial processes, for example in temperature regulation and engine performance. It also plays a central role in atmospheric, oceanic and climate stability and modelling. [4-6]

2. Background and Theory
Nitrogen and other gases, such as helium and propane (C3H8), can be liquefied by compression/expansion cycles at around 30 bar and exploitation of the Joule-Thomson effect [see PHAS0006 and 7]. The boiling points of He, H2, N2 and C3H8 at atmospheric pressure are 4.21, 20.27, 77.35 and 231.1 K respectively. Liquefied gases used in experiments are kept in dewars (named after their inventor Sir James Dewar, the first person to liquefy hydrogen). These are flasks with a double wall of glass, separated by a vacuum, which are used to thermally insulate materials so as to keep them either hot or cold. Dewars insulate the liquid from nearly all sources of ambient heat in the laboratory, but are not 100% efficient. A quantity of liquid nitrogen in a dewar, assumed to be at 77 K (the boiling point of N2) [8], will slowly boil away due to background heat. The rate at which the liquid loses mass is proportional to the rate of influx of background heat:

(dQ/dt)_background = L (dm/dt)_background    (1)

where L is the specific latent heat of vaporisation and Q = mL [9]. If we supply additional heat, the rate of mass loss will increase:

(dQ/dt)_background + (dQ/dt)_additional = L (dm/dt)_on    (2)

Hence, even if we don't know the background rate of heat flow, as long as we do know the additional rate, we can calculate L from the difference between the two rates of mass loss, in other words by subtracting equation 1 from equation 2:

(dQ/dt)_additional = L [(dm/dt)_on - (dm/dt)_background]    (3)

and hence

L = (dQ/dt)_additional / [(dm/dt)_on - (dm/dt)_background]    (4)

Make sure you understand what is meant by the latent heat of vaporisation and latent heat of fusion of a substance, and how they differ from heat capacity.

3. The Experiment
In this experiment the additional heat will be supplied by a resistor in which a current is flowing. According to electrical theory, a resistor across which there is a potential difference V, and in which a current I is flowing, dissipates power (energy per unit time) according to the formula

P = VI.    (5)

So, if all this power is absorbed by the liquid nitrogen as latent heat, equation 4 becomes

L = VI / [(dm/dt)_on - (dm/dt)_background]    (6)

We therefore need to measure V, I, and the rates of mass loss with and without the current flowing.
Q3.1: Why might V and I fluctuate? Can this be controlled?

3.1 Equipment
The experimental set-up (see Figure 1) is very simple: a dewar with a loose-fitting polystyrene lid, through which two wires lead to the resistor, is placed on a weighing scale.
The mass of the dewar and contents will decrease with time as the liquid nitrogen evaporates.
Q3.2: Why is the polystyrene lid loose?
Q3.3: What methods of heat transfer are relevant?
The diagram in Figure 1 is useful, but whenever an electrical circuit is built as part of an experiment a circuit diagram should also be included. The electrical circuit should supply about 10 W of electrical power to the resistor.
Q3.4: What is the value of the resistance of the resistor?
Q3.5: What I-V combination(s) will you use? Is there a reason for your choice?

3.2 Safety Note
Ask a member of technical staff to fill the dewar nearly to the brim with liquid nitrogen. It should weigh around 200 ± 20 g. DO NOT TURN ON THE ELECTRIC POWER SUPPLY UNTIL THE RESISTOR IS IMMERSED IN NITROGEN - the resistor becomes very hot and needs to be in the liquid nitrogen before the power is turned on to prevent it from burning out. It should remain in the same position at all times, completely surrounded by liquid nitrogen (in both "background" and "power-on" runs) and not in contact with the dewar.
Q3.6: How will you ensure that the resistor stays where you want it to be?

3.3 Experimental Procedure
In this experiment you will evaluate the rate of mass loss of the liquid nitrogen under two sets of circumstances: [1] with the power on (you can use more than one power setting: is there an advantage in doing this?), and [2] with the power off (the background rate). During these two data runs, you should control all other factors that might influence the mass loss rate. Think carefully about the following questions:
Q3.7: As the liquid boils off due to background heat alone, is the rate of mass loss likely to be constant?
Q3.8: Is there anything in the design of the equipment that might cause this rate to vary?
Q3.9: If you are not sure, can you find out experimentally?
Q3.10: If you think there will be variation, how can you limit the effect of such variation on the result of your experiment?
It is recommended that your first data run is done under background conditions only, and lasts long enough for you to observe any changes that occur as the liquid boils away. N.B. Under normal background conditions, the liquid boils away quite slowly; it takes more than an hour for a full dewar to lose half its contents. Devise an initial plan for your method, write it in your lab book and discuss it with a demonstrator before proceeding. Remember that when taking data you must also estimate the associated uncertainties at the same time, not as an afterthought. After your initial background run, you should assess your data and decide whether your plan needs modification. The best way to do this is by drawing a graph of mass against time immediately after finishing the run. What do you deduce from the graph? Remember that the formula we are using to calculate L, namely (6), was based on combining equations (1) and (2), which correspond to the "background" and "power-on" runs respectively.
Q3.11: There is an assumption underlying this; what is it?
In order to use (6), therefore, you must ensure that your values for the rate of mass loss under "background" and "power-on" conditions are consistent with this assumption. When you draw up your experimental plan, you should also consider whether there is an advantage in measuring for more than one power input; please see equation 6.

4. Data Analysis
Plot a graph of the mass of liquid nitrogen versus time; determine dm/dt, together with its uncertainty, with power(s) on and off.
When fitting the data to obtain the gradient (and intercept), you should also determine and comment on the (reduced) χ². Using equation 6, obtain an estimate for L and compare your answer with that found in the literature. [8]
At the end of your first set of measurements you should have:
• graphs of mass versus time;
• estimates of dm/dt with power on and off;
• values of V and I;
• an estimate of L with an uncertainty estimate.
Q4.1: How does your estimate for L compare with the published value, in terms of its uncertainty?
If you conduct multiple runs at identical power, you may wish to consider whether it is appropriate to find an average value. Remember that we can only justify taking an average of two or more values if they were obtained under the same conditions. If you suspect that one of your values is more reliable than the others, you may choose this one as your final result as long as you can justify the choice. You cannot justify picking out a result simply because it is the nearest one to the accepted value! You may wish to consider the average of the data points, include all data points from all runs on a single graph, or simply take the average of your values of L. It will be important to consider your uncertainties and whether the uncertainties affect how much weight should be given to any datum point. As noted in section 3.3, you should also consider whether repeat runs at different power might help reduce the uncertainty in your measured value of L.

5. Discussion & Conclusions
If the uncertainties are large or your initial estimate for L is inconsistent with the published value, you may wish to consider some of the following:
• Is your value of VI an accurate estimate of the power dissipated in the resistor?
• What have you assumed about where this power goes? Is your assumption justified?
• Are your mass readings accurate estimates of the quantity of liquid in the dewar?
• Is the procedure you used to convert these mass readings to a rate of loss of mass reliable?
• What factors govern the background rate of mass loss?
Given what you can and cannot control and measure, you may wish to repeat the experiment with the same procedure or modify the procedure to reduce your uncertainties. Discuss any major modifications with a demonstrator before proceeding, since there are limits on what may be possible. In your conclusions you should discuss whether any modifications you made resulted in an improved result; if you have had any further ideas for modification but do not have either the time or the resources to implement them, describe them in your write-up.

6. Extension Experiment: Specific latent heat of melting for water ice (H2O) and/or heavy water ice (D2O)
After completing the main experiments, you should design and conduct an experiment using your apparatus to measure the specific latent heat of melting of water ice (H2O) and/or heavy water ice (D2O): would you expect the melting point and specific latent heat to be the same for the two different isotopic compositions? For this extension experiment, you can use your balance and dewar, and a member of technical staff can provide you with a digital thermometer. On request, Derek Thomas will be able to give you water ice cubes and/or a single heavy water ice cube. You may also use an IR thermal imaging camera, which can be borrowed from Derek Thomas. Liquid water can be used as a medium of known heat capacity.
Please note that phase change materials, which release their latent heat on freezing, are currently extremely topical for renewable energy storage. [10, 11] You will, of course, need to draw up a Risk Assessment for your experimental procedure: this MUST be approved by a member of staff before you conduct any measurements, and should include a description of how you will dispose of any samples once they have been used. The Material Safety Data Sheet (MSDS) for D2O is available on Moodle, and please see: https://www.ucl.ac.uk/safety-services/working-safely-chemicals
Safety Note: You must NOT place the resistor heater in water.

7. References
Notes:
• Please only quote these references if you have actually read and referred to them, and include relevant page numbers.
• The Digital Object Identifier (DOI) is a string of numbers, letters and symbols used to permanently identify an article or document and link to it on the web.
• The International Standard Book Number (ISBN) is a numeric commercial book identifier that is intended to be unique.
[1] "Joseph Black, carbon dioxide, latent heat, and the beginnings of the discovery of the respiratory gases", JB West, Am. J. Physiol. Lung Cell Mol. Physiol., 306: L1057-L1063 (2014). DOI: 10.1152/ajplung.00020.2014
[2] "April 23, 1762: Joseph Black and Latent Heat. Disappearing heat and the dog that did not bark", R Williams, APS News 21 (2012). https://www.aps.org/publications/apsnews/201204/physicshistory.cfm (Accessed 12/12/2022)
[3] "Chambers Dictionary of Science and Technology", 2nd edition, JM Lackie, General Editor. Edinburgh: Chambers (2007). ISBN: 9780550104571 (e-book). UCL username and password required for access.
[4] "Latent heat must be visible in climate communications", T Matthews et al., WIREs Climate Change, 13, e779 (2022). https://doi.org/10.1002/wcc.779
[5] "Factors of boreal summer latent heat flux variations over the tropical western North Pacific", Y Wang and R Wu, Clim Dyn 57, 2753-2765 (2021). https://doi.org/10.1007/s00382-021-05835-4
[6] "Sensible heat has significantly affected the global hydrological cycle over the historical period", G Myhre et al., Nat Commun 9, 1922 (2018). https://doi.org/10.1038/s41467-018-04307-4
[7] "Liquefaction of gases", WH Isalski, Thermopedia (2011). DOI: 10.1615/AtoZ.l.liquefaction_of_gases
[8] "Tables of Physical and Chemical Constants", 16th edition, originally compiled by G.W.C. Kaye and T.H. Laby; Longman, New York (1995). ISBN-13: 9780582226296. Available at Kaye and Laby online: http://www.kayelaby.npl.co.uk/toc/
[9] "Physics for Scientists and Engineers", 9th edition, RA Serway and JW Jewett. Australia: Brooks/Cole Cengage Learning (2014). ISBN: 9781473711143 (e-book). UCL username and password required for access.
[10] "Phase change materials for thermal energy storage", K Pielichowska & K Pielichowski, Progress in Materials Science 65, 67-123 (2014). https://doi.org/10.1016/j.pmatsci.2014.03.005
[11] "Trimodal thermal energy storage material for renewable energy applications", S Saher, S Johnston, R Esther-Kelvin, et al., Nature 636, 622-626 (2024). https://doi.org/10.1038/s41586-024-08214-1
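As a sketch of the Section 4 analysis, the short Python fragment below fits dm/dt for the two runs and propagates the gradient uncertainties into L via equation (6). The mass readings, V and I are placeholder numbers, not real measurements, and the error propagation uses only the first-order formula.

import numpy as np

t = np.array([0, 60, 120, 180, 240], dtype=float)       # seconds
m_bg = np.array([200.0, 199.4, 198.8, 198.1, 197.5])    # grams, power off
m_on = np.array([197.0, 193.6, 190.2, 186.7, 183.3])    # grams, power on

# Straight-line fits; cov=True also returns the parameter covariance matrix.
(grad_bg, _), cov_bg = np.polyfit(t, m_bg, 1, cov=True)
(grad_on, _), cov_on = np.polyfit(t, m_on, 1, cov=True)

V, I = 10.0, 1.0   # placeholder supply readings (volts, amps)
P = V * I          # power dissipated in the resistor, equation (5)

# Equation (6): L = P / (|dm/dt|_on - |dm/dt|_background), here in J/g.
delta = abs(grad_on) - abs(grad_bg)
L = P / delta

# First-order propagation of the two gradient variances into L.
sigma_L = (P / delta**2) * np.sqrt(cov_bg[0, 0] + cov_on[0, 0])
print(L * 1000, sigma_L * 1000)  # converted to J/kg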


[SOLVED] STAT3600 Linear Statistical Analysis 24 Statistics

STAT3600 Linear Statistical Analysis

1. [49] Consider the data of five observations.
i   xi   yi
1   26   3.2
2   23   1.8
3   62   4.0
4   20   2.3
5   17   4.8
a. [5] Write down the simple linear regression model of yi on xi. What are the four model assumptions? State them clearly.
b. [5] Let β̂1 be the least squares estimator for the unknown population slope in the simple linear regression model. Prove that
c. [5] Find the least squares estimates of the population intercept and slope. Interpret the estimate for the population slope.
d. [15] Construct the following ANOVA table by filling in the blanks labelled with letters from A to I. At the 5% significance level, test whether there is a linear relationship between the independent and dependent variables using the information in the ANOVA table. State clearly the null and alternative hypotheses, test statistic, null distribution, decision rule and conclusion.
Source  SS  df  MS
SSR     A   D   G
SSE     B   E   H
SST     C   F
e. [6] Using Bonferroni's method, construct simultaneous confidence intervals for the population intercept and slope with a family confidence level of at least 95%.
f. [2] Find the coefficient of determination and interpret the result.
g. [1] Find a point estimate for the population mean of Y when x is 25.
h. [4] Construct a 90% confidence interval for the population mean of Y when x is 25.
i. [6] Let Y(1) and Y(2) be future responses with the values of x being 30 and 35, respectively. Construct a 95% prediction interval for Y(1) − Y(2).

2. [51] You are given the following matrices computed from a multiple linear regression of yi = β0 + β1 xi1 + β2 xi2 + εi. The matrices are properly ordered according to the regression equation given above.
a. [4] Find the sample size and the sample mean of r.
b. [5] Show that the least squares estimator for β is given by β̂ = (X^T X)^(-1) X^T Y.
c. [5] Find the least squares estimates for β0, β1 and β2. Interpret the estimates for β1 and β2.
d. [15] Construct the ANOVA table and hence test whether the coefficients for the independent variables are jointly equal to zero at the 5% level of significance. Clearly define the null and alternative hypotheses and the decision rule. State your conclusion.
e. [7] At the 5% level of significance, conduct a t-test for H0: β1 = β2 vs. H1: β1 ≠ β2.
f. [6] Construct a 95% confidence interval for β1 + 2β2.
g. [9] Define  At the 5% level of significance, test the following hypotheses: H0: Cβ = d vs. H1: Cβ ≠ d.
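For parts (c) and (d) of Question 1, a minimal numpy sketch of the standard least squares formulas applied to the five observations; the variable names are ours.

import numpy as np

x = np.array([26, 23, 62, 20, 17], dtype=float)
y = np.array([3.2, 1.8, 4.0, 2.3, 4.8])

# beta1_hat = S_xy / S_xx ; beta0_hat = ybar - beta1_hat * xbar
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# ANOVA sums of squares for the table: SST = SSR + SSE.
y_hat = b0 + b1 * x
ssr = np.sum((y_hat - y.mean()) ** 2)
sse = np.sum((y - y_hat) ** 2)
sst = np.sum((y - y.mean()) ** 2)
print(b0, b1, ssr, sse, sst)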


[SOLVED] MAT223H5S - Linear Algebra I - Winter 2025 Make-Up Term Test Statistics

MAT223H5S - Linear Algebra I - Winter 2025 Make-Up Term Test

1
1.1 (2 points) Suppose that A is a matrix with A⁻¹ = , and b = . Determine the unique solution to the equation Ax = b.
1.2 (3 points) Suppose that T : R2 → R2 first rotates the plane counterclockwise by π/2, then reflects across the line y = x. Determine the matrix A_T.
1.3 (6 points) For which values (if any) of c ∈ R is the following set of vectors in R3 linearly independent?

2
Reminder: show your work and justify your steps using only techniques taught in this course.
2.1 (1 point) The equation x − y − 6z = 6 represents a plane in R3. Determine whether the point P = (3, −3, 0) is on the plane or not.
2.2 (1.5 points) Determine the angle between the vectors u = and v =
2.3 (2.5 points) Provide an example, with justification, of a 2 × 2 matrix which is skew-symmetric and invertible, or explain why no such matrix exists.
2.4 (5 points) Find the shortest distance between the lines L1 and L2 with the following equations: You should include a rough drawing illustrating the situation (e.g. it doesn't have to plot the lines accurately); the drawing should show any vectors and points that you compute as part of your solution.

3
3.1 (5 points) Show that U is a subspace.
3.2 (5 points) Determine a basis for U. Show your steps.

4
4.1 (5 points) Let T : R3 → R3 be the transformation given by T
Show that T is linear by verifying the two properties below. Do not simply say that T is a matrix transformation, or use any other technique, or you will receive 0 points.
(1) For all u, v ∈ R3, we have T(u + v) = T(u) + T(v).
(2) For all u ∈ R3 and r ∈ R, we have T(ru) = rT(u).
4.2 (5 points) In the pictures below, the fundamental parallelograms of two linear transformations S, T : R2 → R2 are shown. Determine all eigenvalues (if any) for each of S and T, and for each eigenvalue determine a set of basic eigenvectors for that eigenvalue. Make sure to justify your answers using geometric arguments only (i.e. ones which reference the pictures of the transformations, not any algebra involving the matrices of the transformations). Assume the grid lines are spaced 1 unit apart.

5
Determine if the statements below are true or false. Make sure to justify your answers! You will receive no credit for simply selecting "true" or "false", or for providing little explanation.
5.1 True or False: If x, y, z ∈ R3 and {x, y, z} is linearly independent, then {x, x + y + z, z} is also linearly independent.
5.2 True or False: If A is a 3 × 6 matrix, then dim(null(A)) > 2.
5.3 True or False: If A is a 4 × 4 matrix and the systems (A − I)x = 0 and (A + I)x = 0 each have two basic solutions, then A is diagonalizable.
5.4 True or False: If two planes in R3 pass through the origin and intersect in a line with direction vector d, then d is orthogonal to the normal vectors of both of those planes.
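For Question 1.2, a minimal numpy sketch composing the two maps; under the usual column-vector convention, the rotation is applied first and therefore sits rightmost in the matrix product (our ordering, consistent with the question's wording).

import numpy as np

# Counterclockwise rotation by pi/2, and reflection across the line y = x.
R = np.array([[0, -1],
              [1,  0]])
F = np.array([[0, 1],
              [1, 0]])

A_T = F @ R      # rotate first, then reflect
print(A_T)       # [[1, 0], [0, -1]]: reflection across the x-axis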
