Assignment Chef


Assignment catalog

33,401 assignments available

[SOLVED] COMPSCI5089 INTRODUCTION TO DATA SCIENCE AND SYSTEMS 2019

INTRODUCTION TO DATA SCIENCE AND SYSTEMS (M) COMPSCI 5089, Thursday 19 December 2019

1. Linear algebra, probability, visualisation and optimisation

Your data science team has been asked to analyse a subsystem for a car manufacturer. After some experimentation it is clear that the system you are considering can be described by the following set of coupled equations:

-14 + xα + zγ = -yβ
2xα - yzβ + 8 = -xγ + xα          (1)
-zγ = -5 - yβ

where x = 1, y = 2, z = 3 are scalar inputs to the system and the output of the system is denoted by c = f([x, y, z]^T, [α, β, γ]^T) = [14, -8, -5]^T. b = [α, β, γ]^T is a vector containing the parameters of the system.

(a) Convert the set of coupled equations in Eq. (1) into the matrix form Ab = c. [3]

(b) You are now asked to find the parameters, b, of the system using a numerical optimisation method without the availability of standard solvers and matrix inversion.
(i) Define an optimisation problem that would allow you to solve a problem of the type Ab = c with respect to b, under the constraint that you cannot use matrix inversion but have access to partial derivatives of A, b and c with respect to b. [2]
(ii) State a form of the update equations for standard gradient descent which will allow you to solve the optimisation problem outlined in the previous question, and explain under which conditions your gradient descent optimisation algorithm is guaranteed to converge. [4]

Figure 1: A scatter plot illustrating two datasets. The two different datasets can clearly be identified as two distinct clusters (as validated by the manager).

(c) Your manager has provided you with two datasets obtained on two different days, each containing several observations of x and y.
(i) Your manager has illustrated the observations in Figure 1. Criticise this graph, and redraw a sketch that corrects the issues you have identified. [3]
(ii) Your manager asks you to summarise each of the datasets using a separate Normal distribution for each dataset. Explain how you would parameterise the Normal distributions needed to model the (x, y) values from the two individual datasets, including a description of the array shape of any parameters that the distribution would have. [3]
(iii) Explain how eigendecomposition could be used on the parameters estimated in the previous question to identify the major axis of variation. Draw a simple sketch to show: the data points; the estimated Normal distributions; the relevant eigenvectors (for each dataset separately). [5]

2. Text processing in data science

(a) (i) Consider two documents with term frequency vectors as follows: D1 = [4, 2, 0] and D2 = [2, 0, 4]. Calculate the cosine similarity between these two documents. Give the formula for cosine similarity and show your workings. Note: the final result may be in the form of a formula. [3]
(ii) Name and describe an application where cosine similarity could be used. Justify why cosine should be used for this application, explain the key geometric properties of cosine similarity and why it is important for the application. [2]
(iii) You are given the following list of documents in Python: docs = ('The sky is green', 'The sun is yellow', 'We can see the shining sun, the bright sun in the sky'). Write Python code to compute the TF-IDF cosine similarity matrix of the docs list using the appropriate scikit-learn libraries. [3]
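A sketch of one way to answer 2(a)(iii) with scikit-learn (library defaults such as lowercasing and smoothed IDF are assumptions, not requirements stated in the question):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ['The sky is green',
        'The sun is yellow',
        'We can see the shining sun, the bright sun in the sky']

vectorizer = TfidfVectorizer()          # default tokenisation and IDF weighting
tfidf = vectorizer.fit_transform(docs)  # shape: (3 documents, |vocabulary|)

# 3x3 matrix; entry (i, j) is the cosine similarity of doc i and doc j
similarity_matrix = cosine_similarity(tfidf)
print(similarity_matrix)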
(iv) Define the concept of lemmatization. Compare and contrast it with stemming. [2]

(b) (i) Explain the k-means clustering algorithm using pseudo-code or precise word descriptions. Name and describe three key clustering properties of k-means. [3]
(ii) You work at a large social media company with an advertising network. Describe a task where k-means could be applied and describe how it would be implemented. Provide details including specifying appropriate textual features and their representation, the similarity function, and how to address issues of scale on large datasets. [3]
(iii) The default k-means algorithm runs on the task from part 2(b)(ii) for a very large data collection. The clustering is too slow and takes too long to complete. The product requirements dictate that the number of clusters and features are fixed. Discuss why it is slow and suggest a modification to the k-means algorithm that will speed it up. [2]
(iv) How many clusters would you guess the data illustrated in Figure 2 has? Describe the method you would use to determine a correct value of k. Does it matter if the value is determined over a single run vs many runs of the algorithm? Explain why or why not. [2]

Figure 2: Example cluster data

3. Database systems

(a) Consider a relation Employee(ID, Name, Age) where the primary key (ID) is a 64-bit integer, Age is an 8-bit integer, and 51 bytes are needed for the Name attribute. Assume that the relation has 1000 tuples, stored in a file on disk organised in 512-byte blocks, each having a 24-byte header. Note that the database system adopts fixed-length records, i.e. each file record corresponds to one tuple of the relation and vice versa.
(i) Compute the blocking factor and the number of blocks required to store this relation. [2]
(ii) Consider the following SQL query: SELECT Name FROM Employee WHERE ID >= 101 AND ID
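A worked sketch of part 3(a)(i) above, assuming unspanned fixed-length records:

import math

record_size = 8 + 51 + 1            # ID (64-bit) + Name (51 bytes) + Age (8-bit) = 60 bytes
usable_block = 512 - 24             # block size minus the 24-byte header = 488 bytes

bfr = usable_block // record_size   # blocking factor: floor(488 / 60) = 8 records/block
blocks = math.ceil(1000 / bfr)      # ceil(1000 / 8) = 125 blocks
print(bfr, blocks)                  # 8 125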

$25.00

[SOLVED] COMPSCI5089 INTRODUCTION TO DATA SCIENCE AND SYSTEMS April 2021

INTRODUCTION TO DATA SCIENCE AND SYSTEMS (M) COMPSCI 5089, Monday 26 April 2021

1. Computational linear algebra and optimisation

You have been asked to help design the subcomponents of a music streaming service. The service has access to 101,750 music tracks (i.e. the audio files). Each music track can be summarised based on the audio content using so-called audio features, resulting in a 15-dimensional vector, x ∈ R^(1x15), for each track. The meaning and importance of the individual dimensions in the vector is unknown. The vectors for the individual tracks are collected in a matrix X as row vectors. Aside from the audio file itself, the service has access to the title and artist for each track, the genre(s) associated with each track (e.g. jazz) and finally the popularity of each track as a scalar y ∈ R.

(a) The team wants to develop a function called "What is this track called?" where users can upload an audio file with the purpose of identifying the name of the track and artist. To this end we are interested in computing Euclidean distances between the music tracks based on their vector representations.
(i) Certain aspects of X are summarised in Table 1. Explain why it is a good idea to normalise the data in X before computing the similarity between the tracks and suggest a suitable normalisation approach. Justify your approach and make reference to specific elements in Table 1. [3]
(ii) Design a simple search routine which can find the closest match between the uploaded track and a track in the existing dataset. Write the procedure using equations or NumPy code (1-3 lines). Determine how many individual distances you will need to compute and discuss any potential scalability issues. [3]

(b) A subcomponent of the system relies on a mapping from tracks to popularity. This can be formulated as a matrix problem: Xw^T - y = 0, where X is a matrix containing the music features for the tracks, w is a 15-dimensional vector and y is a vector containing the popularity scores for each track. The team is interested in the most efficient and robust method for finding w using the squared error as the loss function.
(i) Specify the dimensions of the matrix X and determine whether w and y are considered row or column vectors, respectively. [1]
(ii) Determine a method for solving the matrix equation with respect to w. Justify your approach. [2]

Table 1: Basic statistics (μ and σ) for each dimension in X.

Figure 1: Eigenspectrum (unordered)

(c) The user interface team has requested that you provide a procedure for projecting the music tracks to 2D or 3D based on the vector representation so they can visualise the music tracks on a computer screen. You must use a linear map due to computational constraints.
(i) Outline a procedure for finding the 3D coordinates so that the projection preserves most of the variance and can be implemented using only basic Python and NumPy by a junior data scientist. You should not provide the code, but explain the individual steps in the procedure using text or equations and only recommend the suitable NumPy commands. You must specify the dimensions of all vectors or matrices required to compute the projection. [4]
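The question asks for steps rather than code, but as a sketch of what those steps compute (np.cov and np.linalg.eigh are commands one might recommend; the random X stands in for the real feature matrix):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((101750, 15))        # placeholder for the real (101750, 15) matrix

Xc = X - X.mean(axis=0)                      # centre each of the 15 dimensions
C = np.cov(Xc, rowvar=False)                 # 15 x 15 covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)         # eigh sorts eigenvalues ascending for symmetric C

W = eigvecs[:, -3:]                          # top-3 eigenvectors, shape (15, 3)
X3d = Xc @ W                                 # projected coordinates, shape (101750, 3)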
(ii) The eigenspectrum of the covariance matrix of X is shown in Figure 1. Discuss what the eigenspectrum says about the vectors representing the audio files and how this could be leveraged to make the system more efficient. Discuss whether the team's idea of a 2D or 3D interface is justified. [3]

(d) Your team is contemplating a new subcomponent which would enable users to generate a new music track. The team has already developed a function, r(x), which makes it possible to map from the vector representation, x, to the audio file. The aim is to create a new track based on a genre profile, which is a 5-dimensional vector, g ∈ R^5. Your team has provided a non-linear function, f: x → g, that maps from the track vector to the 5D genre profile. Provide a solution in the form of an optimisation problem and determine a suitable method to solve the stated problem. Justify your choice of method and explain under which circumstances it is guaranteed to converge to a sensible solution in this scenario. You will need to make assumptions, which must be clearly stated. [4]

2. Probabilities & Bayes rule

Consider a scenario where you are in charge of analysing the data and modelling a pandemic. We consider a given disease (let's call it 'VIRUS'), which has an unknown prevalence r in the population (we will assume that r ∈ [0, 1] is the proportion of the population that has the disease). We will write the probability that a person is diseased as p(D) = r.

(a) Your lab has developed a fast testing procedure to detect this disease. In order to evaluate the accuracy and reliability of this test, you have conducted trials on 132 subjects, and compared the results of your test with a perfectly accurate (supposedly more expensive) diagnostic. The results of those trials are collated in the following table:

             positive   negative
diseased        28          3
healthy         12         89

(i) Using Bayes' formula and the trial data in the table, provide an estimate of the probabilities: p(D|T), that a subject who tested positive is truly diseased; and p(D|T̄), that a subject who tested negative is actually diseased. [4]
(ii) Taking into consideration the test accuracy and reliability as evidenced in the trials, would this test be appropriate for the following situations: 1. regular testing of people working with vulnerable populations; 2. deciding on whether to administer a treatment with severe side effects; or 3. applying to the whole population to find all diseased individuals (justify your answers). [3]
(iii) You administer a test with probabilities p(D|T) = 0.7 and p(D|T̄) = 0.01 to a sample of 1000 subjects drawn randomly from the population. The test returns 980 negatives and 20 positives. From this data, calculate an estimate of the prevalence p(D), explaining your reasoning. [4]

(b) Let us consider that you are experimenting with a vaccine against the disease. You have 1000 subjects in group A who take the vaccine and 1000 in group B who take a placebo. Let us assume that you test the subjects in both groups daily, and after one month you obtain the following results: 2 subjects from group A tested positive at some point during the month, and 40 subjects from group B. In this part we will assume that we are using a test with the following statistics: the probability of having the disease if tested positive is p(D|T) = 0.7; the probability of having the disease if tested negative is p(D|T̄) = 0.01.
(i) Accounting for the limitations of the test, how many subjects in groups A and B did possibly catch the disease during this month? [5]
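A sketch of the expected-count correction in part (b)(i), assuming the stated p(D|T) and p(D|T̄) apply independently to every subject:

p_d_pos, p_d_neg = 0.7, 0.01   # p(D|T) and p(D|T-bar) from the question

def expected_diseased(positives, group_size):
    negatives = group_size - positives
    # expected truly diseased = positives * p(D|T) + negatives * p(D|T-bar)
    return positives * p_d_pos + negatives * p_d_neg

print(expected_diseased(2, 1000))    # group A: 2*0.7 + 998*0.01  = 11.38
print(expected_diseased(40, 1000))   # group B: 40*0.7 + 960*0.01 = 37.6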
(ii) The efficacy of a vaccine is typically calculated as the relative reduction in infection rate, efficacy = 1 - (infection rate in the vaccinated group) / (infection rate in the placebo group). Use your results from above to calculate the efficacy of the vaccine. Discuss what would happen if your test were less accurate: what would happen if p(D|T) were lower? If p(D|T̄) were higher? [4]

3. Database systems

Consider a relation Student(ID, Name, StudyPlan), abbreviated as S, where the primary key (ID) is a 64-bit integer, the Name attribute is a 40-byte (fixed-length) string, and StudyPlan is a 16-bit integer. Further consider a relation Marks(ID, CourseID, AssessmentID, Mark), abbreviated as M, with ID being a foreign key to Student's ID, CourseID and AssessmentID being 16-bit integers, Mark being a 64-bit float, and the first three attributes making up the relation's (composite) primary key. Assume that both relations are stored in files on disk organised in 512-byte blocks, with each block having a 10-byte header. Assume that S has r_S = 1,000 tuples and that M has r_M = 100,000 tuples. Last, assume that Student is stored organised in a heap file, and Marks is stored organised in a sequential file sorted by its primary key. Note that the database system adopts fixed-length records, i.e. each file record corresponds to one tuple of the relation and vice versa.

(a) Compute the blocking factors and the number of blocks required to store these relations. Show your work. [2]
(b) Consider the following query: SELECT S.Name, M.ID, M.Mark FROM Student as S, Marks as M WHERE S.ID = M.ID AND S.ID >= 10,000 and S.ID

$25.00

[SOLVED] COMPSCI 5096 TEXT AS DATA

DEGREES of MSci, MEng, BEng, BSc, MA and MA (Social Sciences) TEXT AS DATA (M) COMPSCI 5096, Wednesday 20 May, 09:15 BST

1. This question is about tokenisation and similarity.

(a) This part concerns processing text. Consider the input string: [He didn't like the U.S. movie "Snakes on a train, revenge of Viper-man!", now playing in the U.K.]
(i) Provide a tokenised form of the above string. Identify and discuss two elements of the above string that present ambiguities. Justify your tokenisation decision for each. [3]
(ii) Compare and contrast 'standard' word-based tokenisation with the tokenisation method used by BERT. Illustrate key differences using the example provided. Analyse and discuss why they differ and their relative advantages and disadvantages. (Hint: recall we used BERT's tokeniser in Lab 1 and in the in-class embedding exercise.) [4]

(b) Consider the two tokenised documents:
S1: [a, woman, is, under, a, mayan, curse]
S2: [a, woman, sees, a, mayan, shaman, to, lift, the, curse]
Create a dictionary from the two documents above (S1 and S2) with appropriate ordering. Give your answer in the form of a table with ID and token. Discuss the following properties of the dictionary and provide reasons for the decision: 1) what is included in the dictionary and 2) the order of the dictionary. [3]

(c) Critically evaluate the Bag-of-Words (BoW) model as a term weighting feature model for documents. Discuss its strengths and give three weaknesses of the model, proposing a modification that addresses each. You should relate each to scikit-learn vectorizers and their important parameters. [4]

(d) You are measuring the similarity between two molecular compounds for drug discovery research. They have been processed to create a series of unique structural 'fingerprints', and a one-hot encoding of the compounds is created. A compound has tens of thousands of fingerprints on average and all the compounds are approximately the same size. Also, most of the compounds in the dataset share more than 90% of fingerprints in common. A lab partner suggests using Jaccard overlap to measure the similarity between compounds. First, critically discuss why Jaccard is or is not appropriate for this task and the challenges it presents. Second, propose and justify a change to both the representation and similarity measure to address them. [6]

2. This question is about language modelling and classification.

(a) This task involves developing an order error corrector for a popular burger chain, 'out-and-in burger'.

Table 1: Five interactions for a burger restaurant ordering system (transcribed from a mobile app).

Sample text collection statistics for a bigram model are below:
V = 22 unique words (including reserved tokens)
N = 45 tokens, including padding

(i) Use the text provided in Table 1 above to compute word unigram probabilities. In a list or table format, complete the probability table with Laplace smoothing with K = 0.5. Show your workings. Discuss the impact on the probability values of increasing or decreasing the value of K. Describe the effect of K when these probabilities are used in a spelling (error) correction task. [5]
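A sketch of the add-K estimate in part (a)(i), using the collection statistics quoted above (the example counts are hypothetical stand-ins for the real table):

V, N, K = 22, 45, 0.5     # vocabulary size, token count, smoothing constant

def add_k_unigram(count, N=N, V=V, K=K):
    # add-K smoothed unigram probability: (c(w) + K) / (N + K*V)
    return (count + K) / (N + K * V)

print(add_k_unigram(3))   # a word seen 3 times -> (3 + 0.5) / (45 + 11) = 0.0625
print(add_k_unigram(0))   # an unseen word     -> 0.5 / 56 ≈ 0.0089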
(ii) A larger collection of restaurant ordering data is collected. It has the following statistics: N = 73194, V = 1996, from a total of 8565 documents (utterances). Compute the bigram probability of the following sequence: [i might like a cheeseburger] with Stupid Backoff smoothing with default values. Collection statistics for the required terms are provided below. Show your workings, including each bigram's probability. Describe how and why a smoothing method is used here. [6]

term                 count
i                     1926
i might                  0
might like               1
like a                  49
a cheeseburger           3
cheeseburger 〈/s〉

(b) Compare and contrast the APIs for sklearn Transformers (e.g. Count or TF-IDF) and Classifiers/Predictors (e.g. NaiveBayes, LogisticRegression). Include descriptions of their key interface functions with descriptions of their behaviour. Discuss how they are used together to solve machine learning tasks on text. [3]

(c) Below is a snippet of code to vectorize and classify text with scikit-learn. Assume that tokenize_normalize and evaluation_summary have been defined, as we did in the labs. The input data has been pre-processed into a vector of unnormalized text documents (each a single string).

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Data processing
data = ...  # Loads a vector of raw text documents
train_index = int(len(data) * 0.1)
train_data = data[:train_index,:]
validation_data = data[int(train_index*0.2):,:]
test_data = data[train_index:,:]

# Assume corresponding labels for each data subset
train_labels, test_labels, validation_labels = ...

# Vectorization
one_hot_vectorizer = CountVectorizer(tokenizer=tokenize_normalize,
                                     binary=True, max_features=20)
one_hot_vectorizer.fit(train_data)
train_features = one_hot_vectorizer.transform(train_features)
validation_features = one_hot_vectorizer.fit_transform(validation_data)
test_features = one_hot_vectorizer.transform(test_data)

# Classification
lr = LogisticRegression(solver='saga', max_iter=500)
lr_model = lr.fit(train_features, train_labels)
evaluation_summary("LR Train summary",
                   lr_model.predict(train_features), validation_labels)
lr_model = lr.fit(validation_features, validation_features)
evaluation_summary("LR Validation summary",
                   lr_model.predict(validation_features), validation_labels)
lr_model = lr.fit(test_features, test_labels)
evaluation_summary("LR Test summary",
                   lr_model.predict(validation_features), test_labels)

Copy and paste the code above and fix its mistakes. Although there may be more, discuss three important mistakes with their consequence, one from each section (data processing, vectorization, classification). [6]
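One possible corrected version of the snippet in 2(c), as a sketch rather than the official solution: splits must not overlap, the vectorizer is fitted on training data only, and each evaluation pairs predictions with labels from the same split. The 80/10/10 split and the lab helpers tokenize_normalize / evaluation_summary are assumptions; the data-loading placeholders are kept from the question.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Data processing: non-overlapping train/validation/test splits
data = ...                                   # placeholder from the question
n = len(data)
train_data = data[:int(n * 0.8)]
validation_data = data[int(n * 0.8):int(n * 0.9)]
test_data = data[int(n * 0.9):]
train_labels, validation_labels, test_labels = ...

# Vectorization: fit on training data only, then transform every split
one_hot_vectorizer = CountVectorizer(tokenizer=tokenize_normalize,
                                     binary=True, max_features=20)
train_features = one_hot_vectorizer.fit_transform(train_data)
validation_features = one_hot_vectorizer.transform(validation_data)
test_features = one_hot_vectorizer.transform(test_data)

# Classification: train once on the training set, evaluate on held-out splits
lr = LogisticRegression(solver='saga', max_iter=500)
lr_model = lr.fit(train_features, train_labels)
evaluation_summary("LR Validation summary",
                   lr_model.predict(validation_features), validation_labels)
evaluation_summary("LR Test summary",
                   lr_model.predict(test_features), test_labels)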
3. This question is about word embedding models and Natural Language Processing.

(a) Compare and contrast static word embeddings with contextual embedding models. Discuss the trade-offs between them for downstream tasks. [4]
(b) Using your knowledge of the self-attention mechanism, answer the following question considering the sentence: S1: [The president of the European Union spoke]. Use the following weight matrices and layer parameters to compute the unnormalised attention weights between the query "president" and the keys "spoke" and "the". What can you infer from these values? [3]
(c) Explain how and why attention-based encoders can be "stacked" to form layers in Transformer models. [2]
(d) In this question we explore what can be done when faced with a completely "alien" scenario. Klingon is a language originating from the TV series Star Trek. Many classic works such as Hamlet, Much Ado About Nothing, Tao Te Ching, and Gilgamesh have been translated by hand into Klingon. It is studied and formalised by the Klingon Language Institute (KLI) and was designed to be dissimilar from English. Below are some sample Klingon-English translations: bIpIv; arDaneH, "Where are you from?"
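The weight matrices for part 3(b) are not reproduced above, so as a generic sketch: the unnormalised attention weight between a query token and a key token is the dot product of their projected vectors. All matrices and embeddings below are hypothetical placeholders.

import numpy as np

d = 4                                 # hypothetical embedding/projection size
rng = np.random.default_rng(0)
W_Q = rng.standard_normal((d, d))     # placeholder query projection
W_K = rng.standard_normal((d, d))     # placeholder key projection

x_president, x_spoke, x_the = rng.standard_normal((3, d))   # stand-in embeddings

q = x_president @ W_Q                 # project the query token
for name, x in [("spoke", x_spoke), ("the", x_the)]:
    k = x @ W_K                       # project the key token
    score = q @ k                     # unnormalised attention weight q . k
    print(name, score)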

$25.00

[SOLVED] MongoDB Assignment

MongoDB Assignment

Design a MongoDB database with at least two collections that contain related documents. You must:
1. Create a minimum of 10 documents per collection.
2. Demonstrate at least 15 queries, covering:
- Basic CRUD operations (insert, update, delete, find).
- Aggregation Framework (e.g., $group, $match, $sort, $limit).
- Data validation (using MongoDB schema validation techniques).

Extra marks will be awarded for:
- Using document relationships (e.g., _id references or embedded documents).
- Real-time data ingestion (e.g., streaming IoT sensor data, stock prices, or Twitter API data).
- Indexing strategies (e.g., creating indexes on frequently queried fields).
- Joins using $lookup to relate data across collections.
- Sharding (demonstrating how MongoDB distributes data across servers).

Report: create a report discussing your project, covering the following:
- Brief background/context.
- Discussion of the dataset/data used.
- Screenshots of queries and output.
- Discussion of your main queries and any extra implementations such as joins, document relationships, indexing, etc.

Screencast presentation of your assignment: create an MP4 video (max 4 minutes) with voice-over, presenting your MongoDB work. Your video should include:
1. An overview of your MongoDB database: describe the collections and document structures.
2. A demo of your key MongoDB queries: show execution in the MongoDB shell.
Use Screencast-O-Matic (or a similar tool) to record your video.

Submission instructions: upload your report as "StudentNumber_MongoDB.docx", your Mongo queries in a separate Word document, and the screencast to the Moodle upload area. You must also submit a signed A1 plagiarism form for this assignment.

Final notes:
- Ensure you provide clear screenshots for all MongoDB queries.
- Add complexity to MongoDB queries for higher marks.
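A minimal sketch touching several of the graded items (CRUD, an index, a $lookup join inside an aggregation), using pymongo against a local server; the shop/customers/orders names and fields are hypothetical:

from pymongo import MongoClient, ASCENDING

db = MongoClient("mongodb://localhost:27017")["shop"]

# CRUD: insert related documents (orders reference customers by _id)
db.customers.insert_one({"_id": 1, "name": "Ada", "city": "Glasgow"})
db.orders.insert_many([{"customer_id": 1, "item": "lamp", "price": 30.0},
                       {"customer_id": 1, "item": "desk", "price": 120.0}])

# Indexing strategy: index a frequently queried field
db.orders.create_index([("customer_id", ASCENDING)])

# Aggregation with a $lookup join: total spend per customer, sorted
pipeline = [
    {"$lookup": {"from": "customers", "localField": "customer_id",
                 "foreignField": "_id", "as": "customer"}},
    {"$unwind": "$customer"},
    {"$group": {"_id": "$customer.name", "total": {"$sum": "$price"}}},
    {"$sort": {"total": -1}},
]
for row in db.orders.aggregate(pipeline):
    print(row)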

$25.00

[SOLVED] COMPSCI5089 INTRODUCTION TO DATA SCIENCE AND SYSTEMS December 2021

INTRODUCTION TO DATA SCIENCE AND SYSTEMS (M) COMPSCI 5089, Wednesday 15 December 2021

1. Computational linear algebra and optimisation

(a) Given a collection of N documents D = {D1, ..., DN}, your task is to implement a functionality that provides a list of suggested 'more like this' documents. With this problem context, answer the following questions.
(i) Explain how you would represent each document D ∈ D as a (real-valued) vector d. What is the dimension of each vector? [2]
(ii) What does the L0 norm of a document vector indicate (in plain English), as per your definition of the document vectors in the previous question? [1]
(iii) How would you define the Lp distance between two document vectors d and d′? [2]
(iv) What distance or similarity measure would you use for finding the set of 'more like this' documents for a current (given) document vector d, and why? [2]

(b) The probability density function of an n-dimensional Gaussian is given by f(x) = (2π)^(-n/2) |Σ|^(-1/2) exp(-(1/2)(x - μ)^T Σ^(-1) (x - μ)), where μ ∈ R^n is the mean vector and Σ ∈ R^(n×n) is a square and invertible matrix, called the covariance matrix. Consider the particular case of n = 2. Answer the following questions.
(i) Plot the contours of the following Gaussians. For each contour plot, show the conditional distributions along the two axes. [2]
(ii) Which one/ones of the above 4 Gaussian distributions can be reduced to a single-dimensional Gaussian with PCA on the covariance matrix without too much loss of information? Note that you do not need to explicitly compute the eigenvalues. You should rather derive your answer from a visual interpretation of the contour plots. Clearly explain your answer. [2]

(c) With respect to linear regression, answer the following questions.
(i) Derive the expression for stochastic gradient descent for linear regression with the squared loss function. Clearly introduce your notations for the input/output instances and the parameter vector. [4]
(ii) Explain how linear regression can be extended to polynomial (higher-order) regression. What is the problem of using high-degree polynomials for regression? How can that problem be alleviated? [3]
(iii) A common practice in stochastic gradient descent is to use a variable learning rate α for the parameter updates, where θ_j(t) denotes the j-th component of the parameter vector θ at iteration t, and α(t) denotes the value of the learning rate at iteration t. Which of the following alternatives of the learning rate update would you prefer (α is a constant), and why? [2]
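For 1(c)(i), the update that falls out of the derivation is θ ← θ - α(θ^T x_i - y_i) x_i for a single instance (x_i, y_i); a sketch on synthetic data (the data and hyperparameters are placeholders):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))                 # synthetic inputs
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(200)

theta, alpha = np.zeros(3), 0.01
for _ in range(10):                               # a few epochs over the data
    for x_i, y_i in zip(X, y):
        error = theta @ x_i - y_i                 # prediction residual
        theta -= alpha * error * x_i              # SGD step on the squared loss
print(theta)                                      # approaches [1, -2, 0.5]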
2. Probabilities & Bayes rule

Consider a card game where you have 4 suits (hearts, diamonds, clubs and spades) and in each suit the cards 7, 8, 9, 10, Jack (J), Queen (Q), King (K) and Ace (A). In this question we will use the following commonly used terms:
- the pack: the set of all cards that have not been drawn yet.
- to draw: to pick a card at random from the pack of remaining cards, removing it from the pack.
- the hand: the set of cards a player has drawn from the pack.
- a payout: the amount of points you get for a given hand.
- to fold: to stop playing and put your cards back in the pack, forfeiting any payout for this game.

(a) Assuming that you draw a single card at random from the pack, give the probabilities for the following events:
(i) Drawing an Ace?
(ii) Drawing a red card?
(iii) Drawing a diamond?
(iv) Drawing a royal figure (Jack, Queen or King)?
(v) Drawing the Ace of spades? [5]

(b) Now assume that you have already drawn the three cards 10, J, Q. When drawing two more cards from the pack, what is the probability of obtaining:
(i) A pair of two cards with the same value (e.g., two Jacks).
(ii) Two pairs (e.g., two Jacks and two Queens).
(iii) Three of a kind (e.g., three Jacks).
(iv) A sequence of 5 cards (e.g., 10, J, Q, K, A). Note that the cards can be of any suit, but there cannot be a break in the sequence. [4]

(c) Now let us assume the following payout table for each hand of 5 cards. As before, you have the cards 10, J, Q in hand.
(i) If you draw two more cards randomly from the deck, what is the expected value of the payout for this hand? [3]
(ii) Assuming that you need to pay 5 every time you draw a card (hence you would need to pay 10 to draw two cards), should you fold your hand or draw cards? [2]
(iii) Should you fold after drawing the first card (and having paid 5), if the card is: (i) the 7 of hearts, (ii) the 8 of spades or (iii) the Queen of diamonds? [6]

3. Database systems

An online retail company is trying to assess the performance of its DB systems and has asked you to investigate some of the operations. Consider a relation Seller(ID, Name, Country), abbreviated as S, where the primary key (ID) is a 32-bit integer, the Name attribute is a 54-byte (fixed-length) string, and Country is a 16-bit integer. Further consider a relation Product(ID, ProductID, ManufacturerID, Price), abbreviated as P, with ID being a foreign key to Seller's ID, ProductID and ManufacturerID being 64-bit integers, Price being a 32-bit float, and the first three attributes making up the relation's (composite) primary key. Assume that both relations are stored in files on disk organised in 512-byte blocks, with each block having a 10-byte header. Assume that S has r_S = 1,000 tuples and that P has r_P = 100,000 tuples. Last, assume that Product is stored organised in a sequential file sorted by its primary key and Seller is stored organised in a heap file. Finally, note that the database system adopts fixed-length records, i.e. each file record corresponds to one tuple of the relation and vice versa.

(a) Compute the blocking factors and the number of blocks required to store these relations. Show your work. [2]
(b) Consider the following query: SELECT S.Name, P.ID, P.Price FROM Seller as S, Product as P WHERE S.ID = P.ID AND S.ID >= 6,000 and S.ID
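A quick numerical check of question 2(a) above, assuming the 32-card pack described (4 suits × 8 ranks):

from fractions import Fraction

pack = 32                   # 4 suits x 8 ranks (7 through Ace)

print(Fraction(4, pack))    # an Ace:            4/32  = 1/8
print(Fraction(16, pack))   # a red card:        16/32 = 1/2
print(Fraction(8, pack))    # a diamond:         8/32  = 1/4
print(Fraction(12, pack))   # J, Q or K:         12/32 = 3/8
print(Fraction(1, pack))    # the Ace of spades: 1/32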

$25.00

[SOLVED] Developing Effective Consulting Skills

You should take the role of a management consultant. Using an organisational context of your choice, prepare a proposal for a management consulting project you feel will help the organisation.
1. Provide a short description of the organisation you have chosen, including the nature of the issue you are proposing to address in your consulting work (around 200 words).
2. Propose a consulting project to that organisation, setting out clearly the situation and key issues, your proposed approach and deliverables, and the resources and timescales involved in the delivery of the project. In doing so, it will be important to reflect on the appropriate elements of the consulting cycle introduced on the course (around 1,400 words).
A student failing to pass the assignment with a mark of 50% or more will be required to resubmit the entire assignment.

$25.00

[SOLVED] COMPSCI 5011 INFORMATION RETRIEVAL 2021

DEGREES OF MSc, MSci, MEng, BEng, BSc, MA and MA (Social Sciences) INFORMATION RETRIEVAL (M) COMPSCI 5011, Monday 10 May 2021

SECTION A

1. (a) The following documents have been processed by an IR system where stemming is not applied:

DocID  Text
Doc1   breakthrough vaccine for covid19
Doc2   new covid19 vaccine is approved
Doc3   new approach for treating patients
Doc4   new hopes for new covid19 patients in the world

(i) Assume that the following terms are stopwords: in, is, for, the. Construct an inverted file for these documents, showing clearly the dictionary and posting list components. Your inverted file needs to store sufficient information for computing a simple tf*idf term weight, where w_ij = tf_ij * log2(N/df_i). [5]
(ii) Compute the term weights of the two terms "breakthrough" and "vaccine" in Doc1. Show your working. [2]
(iii) Assuming the use of a best-match ranking algorithm, rank all documents using their relevance scores for the following query: covid19 vaccine. Show your working. Note that log2(0.75) = -0.4150 and log2(1.3333) = 0.4150. [3]
(iv) Typically, a log scale is applied to the tf (term frequency) component when scoring documents using a simple tf*idf term weighting scheme. Explain why this is the case, illustrating your answer with a suitable example in IR. Explain through examples how models such as BM25 and PL2 control the term frequency counts. [4]

(b) Consider the recall-precision graph below, showing the performances of two variants of a search engine that mimic Google Scholar on a collection of research papers. There is no difference between the two variants apart from how they score documents. Assume that you are a student looking to find all published papers on a given topic; in other words, you do not want to miss any of the relevant documents. Explain which search engine will be more suitable for your task and why. [5]

(c) Assume that you have decided to modify the approach you use to rank the documents of your collection. You have developed a new Web ranking approach that makes use of recent advances in neural networks. Explain in detail the steps you need to undertake to determine whether your new Web ranking approach produces a better retrieval performance than the original ranking approach. [5]

(d) Consider a query with two terms, whose posting lists are as follows:
term1 → [id=2, tf=2], [id=5, tf=1], [id=6, tf=1]
term2 → [id=2, tf=4], [id=4, tf=3], [id=5, tf=4]
Explain and provide the exact steps/order in which the posting lists will be traversed by the TAAT & DAAT query evaluation strategies, and the memory requirements of both strategies for obtaining a result set of K documents from a corpus of N documents (K , = , 10).
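A sketch of the computation behind parts 1(a)(i)-(ii), assuming whitespace tokenisation and the stopword list given in the question:

import math
from collections import Counter

stop = {"in", "is", "for", "the"}
docs = {1: "breakthrough vaccine for covid19",
        2: "new covid19 vaccine is approved",
        3: "new approach for treating patients",
        4: "new hopes for new covid19 patients in the world"}

# Inverted file: term -> postings list of (doc_id, tf)
index = {}
for doc_id, text in docs.items():
    for term, tf in Counter(w for w in text.split() if w not in stop).items():
        index.setdefault(term, []).append((doc_id, tf))

N = len(docs)
def weight(term, doc_id):                  # w_ij = tf_ij * log2(N / df_i)
    postings = dict(index[term])
    return postings[doc_id] * math.log2(N / len(index[term]))

print(weight("breakthrough", 1))           # 1 * log2(4/1) = 2.0
print(weight("vaccine", 1))                # 1 * log2(4/2) = 1.0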
Explain if the probability of q will become larger, smaller or if it will remain the same. Justify your answer. [2]
(vi) Assume another document doc2 in the corpus, which is identical to doc1 with the exception that one occurrence of w1 has been changed to word w5. Hence, we have ct(w1, doc2) = 1 and ct(w5, doc2) = 3. Let q1 = w1 w5 be the new query. If no smoothing is applied, using the query likelihood retrieval method, state which of the two documents (doc1 or doc2) will be ranked higher. Justify your answer. Using the query likelihood retrieval method but this time with Dirichlet prior smoothing applied (μ = 10), show which of the two documents (doc1 or doc2) would be ranked higher. Show your calculations. Discuss whether smoothing has an impact on the ranking order of doc1 and doc2, and how? Justify your answer. [6]

SECTION B

3. (a) Consider the following vector space scoring formula: where ct(w,d) and ct(w,q) are the raw counts of word w in document d and query q, respectively (in other words, the term frequency of w in d and q, respectively); N_w is the number of documents in the corpus that contain word w, and M is the total number of documents in the corpus. Provide 4 reasons why the retrieval formula above is very unlikely to perform well in a Web search context. Justify your answers. [5]

(b) For a particular query q, the multi-grade relevance judgements of all documents are {(d1,1), (d3,4), (d6,2), (d9,3), (d11,1), (d31,2)}, where each tuple represents a document ID and relevance judgement pair, and all the other documents are judged as non-relevant. Documents are judged on the scale 0-4 (0: not relevant; 4: highly relevant). Two IR systems return their retrieval results with respect to this query as follows (these are all the results they have returned for this query):
System A: {d1, d2, d3, d4, d5, d6, d7}
System B: {d31, d22, d3, d6, d15}
For both System A and System B, compute the following ranking evaluation metrics. You must clearly articulate how you compute each of these metrics. Since there are two DCG definitions discussed in the class, you should use the original one, where 1/log2(rank) is used as the discount factor applied to the gain:
(i) Average Precision (AP). Show your calculations. [3]
(ii) Normalised Discounted Cumulative Gain (NDCG) for each rank position. In your answer, provide the ideal DCG values for the perfect ranking for the given query. You might wish to note that log2 2 = 1; log2 3 = 1.59; log2 4 = 2; log2 5 = 2.32; log2 6 = 2.59 and log2 7 = 2.81. Show your calculations. [6]

(c) URL length has been shown to be an important feature for some Web search tasks. Discuss for which types of information needs on the Web the URL length feature is most appropriate. Consider a linear learning-to-rank model for Web search using 4 features: PL2, Proximity, PageRank and URL length. Using such a model, explain the main disadvantage of using linear learning-to-rank models in Web search. [5]

(d) A posting list for a term in an inverted index contains the following three entries:
id=3 tf=4    id=7 tf=3    id=10 tf=5
Applying delta compression to the ids, show the binary form of the unary-compressed posting list. What is the resulting (mean) compression rate, in bits per integer? [5]

(e) A Web search engine has devised a new interface to present its search results. Describe three specific approaches that could be used by the search engine to evaluate the interface change. Which approach would you recommend and why? [6]
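A sketch of the metrics in question 3(b), using binary relevance (grade > 0) for AP and the original DCG discount (no discount at rank 1, 1/log2(rank) afterwards):

import math

grades = {"d1": 1, "d3": 4, "d6": 2, "d9": 3, "d11": 1, "d31": 2}   # others: 0
run_a = ["d1", "d2", "d3", "d4", "d5", "d6", "d7"]
run_b = ["d31", "d22", "d3", "d6", "d15"]

def average_precision(run, total_relevant=len(grades)):
    hits, precisions = 0, []
    for rank, doc in enumerate(run, start=1):
        if grades.get(doc, 0) > 0:
            hits += 1
            precisions.append(hits / rank)    # precision at each relevant rank
    return sum(precisions) / total_relevant

def dcg(gains):
    # original DCG: gain at rank 1 undiscounted, gain/log2(rank) afterwards
    return gains[0] + sum(g / math.log2(r) for r, g in enumerate(gains[1:], start=2))

def ndcg(run):
    gains = [grades.get(d, 0) for d in run]
    ideal = sorted(grades.values(), reverse=True)[:len(run)]
    return dcg(gains) / dcg(ideal)

print(average_precision(run_a))   # (1 + 2/3 + 1/2) / 6 ≈ 0.361
print(average_precision(run_b))   # (1 + 2/3 + 3/4) / 6 ≈ 0.403
print(ndcg(run_a), ndcg(run_b))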

$25.00

[SOLVED] BU7530-202425 MSC FINANCE DISSERTATION

Assessing the Impact of EU State Aid on the Sustainable Transformation of the Iron and Steel Industries in Montenegro and Albania

PROGRAMME NAME: BU7530-202425 MSC FINANCE DISSERTATION, March 2025

(I) Aims and Objectives

1. Research Aim

The aim of this study is to critically assess the impact of EU state aid policies (including subsidies, tax incentives and regulatory frameworks) on the sustainable transformation of the iron and steel industries in Montenegro and Albania. These aid policies have played a positive and important role in reducing carbon emissions, adopting green technologies and improving industrial competitiveness. As such, these policies benefit not only the economic but also the environmental aspects, supporting the EU's broader decarbonisation and eco-transition objectives. As Montenegro and Albania are EU candidate countries, their alignment with EU industrial and environmental policies makes them ideal cases for examining how external financial and regulatory support can influence sustainable transformation. These two countries were selected because of their increasing integration with the EU, their strategic importance in the steel industry and the urgent need for economic reforms aimed at sustainable development. In addition, the need for these two countries to rely on state aid and foreign investment to modernise their industrial infrastructure provides an opportunity to assess the effectiveness of EU interventions in promoting low-carbon industrial transformation. The steel industry in Montenegro accounts for about 8 per cent of GDP and 12 per cent of industrial employment (World Bank, 2023). For Albania, the steel industry is the main driver of exports, a large part of which goes to the EU (European Commission, 2024). However, EU interventions have not been very effective for these economies due to systemic problems such as outdated domestic infrastructure (World Bank, 2023), lax regulatory enforcement (European Court of Auditors, 2023) and market fragmentation (European Commission, 2023). By integrating financial, environmental and governance indicators, this study will explore whether EU-funded interventions have been effective in achieving the objectives of the EU Green Deal (i.e. to reduce carbon emissions, promote innovation and improve industrial competitiveness).

2. Research Objectives

To achieve this aim, the study will pursue the following objectives. In order to unpack the causal mechanisms through which EU state aid may shape sustainable industrial transformation, the study adopts a multi-dimensional approach. This involves examining both financial and technological channels through which such interventions operate at the firm level.

2.1 Quantify the financial impact of EU state aid policies on steel firms in Montenegro and Albania by using targeted firm-level financial indicators. The analysis will focus on three main metrics: Return on Assets (ROA), to assess how efficiently firms use their assets to generate profits; Return on Equity (ROE), to evaluate profitability from the shareholders' perspective; and EBITDA (Earnings Before Interest, Taxes, Depreciation, and Amortization), to capture firms' operational cash flow and debt sustainability. These indicators are selected because they reflect different dimensions of firm performance: efficiency, profitability, and financial resilience.
The objective is to compare these outcomes between firms receiving EU state aid and those that do not, to test whether subsidised firms demonstrate higher investment returns and/or greater financial volatility. One analytical strand will focus on positive financial outcomes (e.g., improved ROE), while the other will assess financial risk exposure (e.g., higher variance in EBITDA for larger firms subject to corporate tax differentials).

2.2 Assess the influence of EU-funded green R&D subsidies and technology-specific state aid schemes, such as those supporting hydrogen-based steelmaking, carbon capture and storage (CCS), and energy-efficient retrofitting, on innovation outcomes in steel firms. The focus will be on identifying whether such interventions stimulate firm-level adoption of low-carbon technologies and increased patenting activity in environmental technologies. Emphasis will also be placed on how targeted aid differs in impact from broader industrial subsidies.

2.3 Examine the environmental outcomes of EU state aid by measuring changes in CO₂ emissions intensity (per ton of steel produced) and alignment with EU sustainability benchmarks, such as those set under the Green Deal and the Fit-for-55 package. In addition, firm-level ESG scores, drawn from recognised rating agencies (e.g., Sustainalytics, MSCI ESG Ratings), will be used to assess longer-term environmental and governance impacts. While carbon intensity remains the primary environmental indicator, ESG performance will be further explored in a dedicated robustness check.

2.4 Compare the financial, operational, and environmental performance of EU-subsidised and non-subsidised steel firms to assess the effectiveness of state aid in promoting sustainable industrial transformation. Financial metrics will include ROA, ROE, and EBITDA; operational performance will be measured through output growth and productivity levels; and environmental performance will focus on CO₂ intensity and energy efficiency. This comparison will help isolate the net effect of state aid on firm-level transformation.

2.5 Analyse regional implementation differences in EU state aid policy between Montenegro and Albania, with particular attention to governance challenges such as inconsistent criteria in subsidy allocation, limited monitoring capacity, bureaucratic delays, and institutional enforcement gaps. The analysis will also assess how variations in tax incentive structures and administrative procedures influence policy outcomes at the firm level.

2.6 Evaluate the role of firm-level ESG performance as a moderating or secondary explanatory variable in understanding variation in state aid outcomes. ESG scores, drawn from established rating providers such as MSCI or Sustainalytics, will be used to examine how firms with stronger environmental, social, and governance profiles respond differently to EU state aid, particularly in emissions reduction and access to green financing. ESG performance will also be considered in an additional robustness check to test the stability of the main financial and environmental results.

3. Research Questions

Q1: Has EU state aid policy improved the financial performance of subsidised steel companies?
Q2: Do state subsidies encourage increased investment in R&D?
Q3: Does state aid policy reduce carbon emissions in the steel sector?
Hypotheses

H1: EU-subsidised firms exhibit higher returns on investment compared to non-subsidised firms, due to relaxed capital constraints and improved competitiveness associated with state aid.
H2: Firms receiving EU state aid allocate a greater proportion of their budgets to R&D activities, particularly in low-carbon technologies, than firms that do not receive such support.
H3: Firms benefiting from targeted EU state aid experience a reduction in carbon intensity over time, relative to firms without such policy support.
H4: Firms with stronger ESG performance are more likely to qualify for favourable EU state aid conditions and demonstrate faster emissions reductions.

(II) Rationale and Contribution

1. Description of Topic

The steel industry is a crucial component of the European Union's industrial strategy, employing over 300,000 individuals and contributing approximately €140 billion annually to the EU economy (World Steel Association, 2024). However, the steel sector, which accounts for 11 per cent of total CO₂ emissions in the EU, remains one of the largest industrial sources of carbon emissions (European Commission, 2023). The European Green Deal requires the steel industry to reduce its emissions by 55% by 2030 and to achieve climate neutrality by 2050, under a legally binding decarbonisation framework (European Commission, 2023; UNEP, 2023). Since 2020, the EU has allocated over €50 billion in state aid to energy-intensive industries, including steel, in support of these environmental objectives (European Commission, 2023; European Court of Auditors, 2023). Yet the effectiveness of such interventions remains contested, especially in smaller Balkan economies like Albania and Montenegro, where outdated coal-based blast furnaces remain in operation and structural barriers persist (World Bank, 2023; OECD, 2023). These include inadequate infrastructure, fragmented markets, and limited institutional implementation capacity, all of which may constrain the transformative potential of EU-funded aid (European Commission, 2024; OECD, 2023). This study therefore addresses a critical knowledge gap by assessing how EU financial support influences sustainable transformation in steel firms operating in transitional economies.

2. Rationale for the Choice of Topic

Montenegro and Albania, as EU accession candidates, are required to progressively align their industrial and environmental policies with the EU's sustainability frameworks. This study seeks to generate empirically grounded insights to inform the design of EU state aid programmes that effectively support low-carbon industrial transformation in candidate countries with transitional economies. Although the literature on EU state aid has expanded in recent years, it remains disproportionately focused on Western European economies, such as Germany and France. In contrast, the specific institutional and structural challenges faced by the Western Balkans, namely weak regulatory enforcement, systemic corruption, and reliance on external financial assistance, have received limited scholarly attention. This research aims to address this gap by focusing on two underexamined but strategically important cases. The economic relevance of the steel industry in these countries further underscores the importance of this inquiry.
In Montenegro, the steel sector accounts for 8 per cent of GDP, 12 per cent of industrial employment and 22 per cent of total steel exports. In Albania, the iron and steel sector is the main source of industrial emissions, emitting about 1.8 million metric tonnes of carbon dioxide per year, or 15 per cent of the national total. The situation is even more pressing in the Montenegrin steel industry, which alone accounts for 22 per cent of the country's total carbon dioxide emissions. In this context, there are both environmental and economic reasons for accelerating industrial decarbonisation. Failure to achieve a low-carbon transition as soon as possible will not only jeopardise the requirements for EU accession, but also destabilise the domestic economy through increased unemployment, reduced competitiveness and a widening trade deficit. This study is therefore situated at the intersection of economic policy, environmental governance and regional integration.

3. Literature Review

The literature on EU state aid and industrial transformation can be broadly categorised into three areas: (1) firm-level financial outcomes, (2) innovation and environmental performance, and (3) governance and distributional challenges. This section reviews the main contributions in each of these areas.

3.1 State Aid and Financial Performance

Zhang and Wang (2020) show that state subsidies can improve firms' competitiveness and profitability, especially in capital-intensive industries; at the same time, unregulated subsidies may lead to misallocation of resources. The trade-off between short-term fiscal gains and long-term sustainability is also raised by Midttun (2021), who argues that financial assistance without incentives to innovate may delay structural transformation. An empirical study of strategic industries (Zhang & Wang, 2020) similarly reports significant improvements in financial indicators such as return on investment (ROI) and earnings before interest, taxes, depreciation, and amortisation (EBITDA) among firms receiving targeted subsidies. However, long-term investment returns can also be volatile if financial assistance is not strictly regulated. In conclusion, while state aid can improve business profitability in the short term, it must be strictly regulated to avoid inefficiencies and support sustainable competitiveness.

3.2 Innovation and Environmental Outcomes

Innovation plays an important role in achieving long-term environmental goals, especially in carbon-intensive industries such as steel. However, the development of innovation often depends on the availability of well-designed state aid instruments. When subsidies are linked to specific innovation and decarbonisation targets, they can accelerate technological progress and contribute to measurable environmental outcomes. The OECD (2023) criticises the use of subsidy packages that lack environmental conditionality and calls for financial support to be linked to tangible green outcomes, such as carbon emission reductions or investments in low-carbon technologies. Hartley (2021) supports this position, emphasising the role of conditional public-private finance in accelerating innovation, particularly in heavy industries such as steel. Midttun (2021) notes that green R&D is often relegated to the back burner in favour of immediate job retention and industrial stabilisation, especially in emerging economies.
Zhang and Wang (2020) add that untargeted aid may cause firms to delay long-term environmental investments due to weaker regulatory pressures. Innovation-oriented conditional subsidies are more effective than general financial support in advancing environmental transformation goals. Hartley (2021) also notes that long-term investments in decarbonisation are more successful when firms co-finance aid with private capital, such as green bonds or debt swaps.

3.3 Governance and Distributional Imbalance

Governance capacity and the fair distribution of EU state aid have emerged as key concerns in the literature on regional development policy. The European Court of Auditors (2023) reports that only 40% of EU member states apply a comprehensive monitoring framework for state aid, resulting in large regulatory inconsistencies across countries. These problems are particularly acute in candidate countries such as Montenegro and Albania, where institutional capacity to manage, monitor, and evaluate aid remains underdeveloped (European Commission, 2023). Recent research highlights the unequal spatial allocation of green transition funds. According to Pelikaan (2022) and the EU Industrial Policy White Paper (2023), Western European countries continue to receive the majority of EU funding, while newer and candidate member states in the Western Balkans receive significantly less per capita, despite facing more severe structural barriers. This disparity contributes to the emergence of a so-called 'two-speed Europe', whereby well-integrated economies accelerate their green industrial transformation while peripheral regions fall further behind in meeting EU climate and competitiveness targets. Moreover, transparency and enforcement mechanisms in aid disbursement remain weak in some newer member states. The State Aid Scoreboard (European Commission, 2023) warns that in the absence of stronger oversight, risks such as regulatory capture and selective implementation of conditionalities may undermine policy effectiveness in transitional economies. In sum, the literature suggests that unless issues of institutional governance and distributional equity are systematically addressed, EU state aid may reinforce existing disparities rather than mitigate them. This study contributes to this debate by empirically examining whether aid recipients in Albania and Montenegro are disadvantaged not only in terms of financial outcomes, but also in the consistency of policy implementation.

4. Contribution to Financial, Management, and Public Policy

Financial impact: the study will quantify the return on investment (ROI) of state aid to steel companies, enabling investors and policymakers to gain a more intuitive understanding of the profitability risks and opportunities of green subsidies.
Management strategy: by analysing the distribution model, the study will provide companies with recommendations for optimising state aid for technological upgrades, such as the transition from blast furnaces to electric arc furnaces (EAFs).
Public policy: the results of the study will help EU institutions redesign and refine the Country Assistance Framework (CAF) to address regional disparities, enhance accountability and prioritise high-impact decarbonisation projects.

(III) Methodology
1. Research Methodology

To analyse the impact of EU state aid on financial and environmental outcomes, this study will use a difference-in-differences (DiD) model. The two main dependent variables are return on assets (ROA), which indicates financial performance, and carbon intensity (CO₂ emissions per tonne of steel), which reflects environmental outcomes. Control variables include firm size (log of total assets) and leverage (debt-to-equity ratio), which reflect differences in economies of scale and financial structure. In addition, the study will include R&D intensity (R&D expenditure/revenue) as a mediating variable, rather than a control variable, to examine whether state aid indirectly affects environmental and financial outcomes by stimulating investment in innovation. This mediating relationship will be explored using interaction terms or causal mediation analyses, as appropriate. Semi-structured interviews with industry stakeholders will supplement the quantitative analysis by providing context on implementation mechanisms and firm-level decision-making. The DiD model is appropriate for identifying the causal impact of state aid, as it compares outcome changes over time between subsidised and non-subsidised firms, while accounting for firm-level heterogeneity in scale, capital structure, and innovation behaviour.

2. Data Collection

Financial data: data from Bureau van Dijk's Orbis database covering 100-200 steel companies in Montenegro and Albania (2015-2024). Key variables: profitability (ROA, ROE, EBITDA margins); liquidity (current ratio, quick ratio); investment (R&D expenditure, capital expenditure (CapEx)).
State aid data: extracted from the European Commission's State Aid Transparency Database, including subsidy amounts (grants, tax breaks), conditionalities (e.g., emissions reduction targets, job retention quotas) and compliance records (e.g., Montenegro's 2020 violations).
Environmental data: carbon emissions from the 2024 Sustainable Development Indicators Report (Scope 1 and 2). In addition, as part of the robustness checks, ESG performance will be analysed to assess its role in moderating the effectiveness of state aid policies. ESG scores from DSTI-SC(2024)1-FINAL.

3. Sample and Population

Treatment group: 50-100 enterprises in Montenegro and Albania that received EU state aid between 2015 and 2024.
Control group: 50-100 non-subsidised firms, matched by size (employees, revenue), capacity (tonnes/year) and pre-treatment financial performance (average 2015-2019).
Exclusions: companies with incomplete or questionable financial records, and companies that were acquired during the study period.

4. Data Analysis Techniques

4.1 Difference-in-Differences (DiD) Model:

Y_it = α + β1 Treatment_i + β2 Post_t + β3 (Treatment_i × Post_t) + γ X_it + ε_it

Dependent variables (Y): ROA, ROE, carbon intensity (CO₂/tonne of steel), ESG scores.
Treatment variable: binary indicator (1 = received state aid; 0 = did not receive).
Control variables (X): firm size (log of total assets), leverage (debt-to-equity ratio), R&D intensity (R&D expenditure as a share of revenue), EU policy milestones (e.g., Green Deal ratification), and country fixed effects to account for institutional and structural differences between Montenegro and Albania.
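A sketch of how the 4.1 specification could be estimated, assuming a hypothetical firm-year panel (firm_panel.csv and all column names are placeholders) and statsmodels:

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical panel: one row per firm-year with outcome, indicators and controls
df = pd.read_csv("firm_panel.csv")      # placeholder path
df["did"] = df["treatment"] * df["post"]

# Y_it = a + b1*Treatment + b2*Post + b3*(Treatment x Post) + controls + e
model = smf.ols("roa ~ treatment + post + did + log_assets + leverage + C(country)",
                data=df).fit(cov_type="cluster", cov_kwds={"groups": df["firm_id"]})
print(model.summary())                  # b3 (the 'did' coefficient) is the DiD estimate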
Event Study Analysis: To assess stock market reactions to state aid announcements, as a proxy for investor confidence in sustainability outcomes.

5. Ethical Considerations and Data Availability

Ethical Compliance: All data are anonymised and publicly accessible, complying with the GDPR and EU open data regulations. No confidential or personally identifiable information is used.

Data Limitations: Reliance on self-reported emissions data may bias measurements. Sensitivity analyses will assess robustness against alternative emission estimates.
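To make the estimation concrete, here is a minimal sketch of how the DiD specification above could be run in Python with pandas and statsmodels. It is an illustration under stated assumptions, not the study's actual pipeline: the file name steel_panel.csv and the column names (roa, treated, post, log_assets, leverage, country, firm_id) are hypothetical placeholders.

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical firm-year panel; one row per firm per year.
df = pd.read_csv("steel_panel.csv")

# treated: 1 if the firm ever received aid; post: 1 for years after the award.
# treated * post expands to treated + post + treated:post, matching
# Y_it = a + b1*Treatment + b2*Post + b3*(Treatment x Post) + controls.
model = smf.ols(
    "roa ~ treated * post + log_assets + leverage + C(country)",
    data=df,
).fit(cov_type="cluster", cov_kwds={"groups": df["firm_id"]})

print(model.summary())  # the treated:post coefficient is the DiD estimate

Clustering standard errors by firm is one common choice for panels like this; the same formula with carbon intensity as the dependent variable would give the environmental counterpart.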


[SOLVED] COMPSCI 5096 TEXT AS DATA 2022

DEGREES OF MSc, MSci, MEng, BEng, BSc, MA and MA (Social Sciences) TEXT AS DATA M COMPSCI 5096 Tuesday 3 May 2022

1. Question on Distributional Semantics and Word Embedding. (Total marks: 18)
Consider the problem of finding an outlier word from among a list of other similar words, e.g., out of the following set of words - linux, windows, solaris, android, java - the word 'java' is an outlier (because the other words are names of operating systems). Given a list of such words, your task is to automatically infer the outlier word. With respect to this task, answer the following questions.
(a) An approach to solve the word intrusion problem is to represent words as vectors and then make use of the relative distances/similarities between the vectors for finding the outlier. Assuming you know (by the output of some process) the vector w for a word w, describe the pseudo-code for finding the outlier word. Task: Describe the pseudo-code of this algorithm. Clearly state your assumptions and introduce your notations in the algorithm. [5]
One solution to the word outlier detection problem that does not require learning any parameters (via gradient descent) is the distributional semantics vector approach, where each word is represented via a bag-of-words vector of contexts. Now, answer the following:
(b) The window size, k, used to define the contexts for each word is an important parameter of this approach. What happens if k is too large or too small? [2]
(c) Describe the pseudo-code of this approach that requires only a single pass through a collection (clearly describe the data structures for an efficient solution). [5]
(d) Discuss (with an example) why the vectors of function words (frequent words, such as 'of', 'the' etc.) obtained with this approach are not reliable. [2]
Now consider word2vec, which is a noise contrastive estimation based method that learns the vectors for each word. With respect to word2vec, answer the following questions.
(e) What is the role of negative samples in the objective function of word2vec? [2]
(f) Comment on word2vec's output for a word with multiple meanings, such as jaguar, bank or python. What would you expect to find as the nearest neighbours of such polysemous words? What is the problem if you use the vectors of such words for another task such as text classification? [2]

2. Question on word frequencies and language model (Total marks: 15)
An alien probe crashes to Earth containing a short passage of alien text. The alien text uses a five letter alphabet: [a, b, c, d, e] with no punctuation or spaces. Below is a short section of the text: abcaedabccbaedabceda
(a) Using character n-grams, write out all of the trigrams that appear more than once, with their frequency, for the sample text above. [3]
Example Answer:
trigram  frequency
abc      3
eda      3
aed      2
dab      2
(b) Provide the theoretical maximum number of character n-grams for the alien probe full text for n = 1, 2, 3, 4 and 5. The full text found in the probe is 593 characters long. [3]
Example Answer:
n  max n-grams
1    5
2   25
3  125
4  590
5  589
(c) A linguist makes a breakthrough in understanding the tokens used in the alien text. She provides two possible ways to tokenize the sample text.
(i) In plain English, explain a single rule that could reproduce this first tokenization: a bca eda bccba eda bceda [1]
Example Answer: Start a new token whenever the previous character is 'a'.
(ii) In plain English, explain a single rule that could reproduce this second tokenization: ab caedab ccbaedab ceda [1]
(d) More alien probes crash land in different parts of the world. Scientists want to measure the similarity between the text found in each probe. Here are two tokenized probe text fragments.
Probe Text A: a eda bceda eda bcda bce
Probe Text B: ca eda bcba eda bceda eda bce
(i) Calculate the Sørensen–Dice Coefficient and Jaccard Similarity between the two probe texts. Show your work. [4]
(ii) Calculate the similarity between the third probe text (Probe Text C below) and the two prior probe texts using the Sørensen–Dice Coefficient. Using these results, show that the Sørensen–Dice Coefficient is a semi-metric, as it breaks the triangle inequality.
Probe Text C: beda bceda bceca ebeda bceda b [3]

3. This question is about Natural Language Processing (Total marks: 18)
You just landed an awesome job at the Intellectual Property Office. As your first project, you have been tasked with automatically classifying submitted patent applications into one of the eight broad International Patent Classification sections.
(a) You start by applying a typical pre-processing pipeline that consists of case normalisation and a stemmer. Within the context of the patent classification application, clearly justify these two pre-processing stages and provide an example that shows why each could lead to improved classification performance. [4]
(b) You recall from Text as Data that NLP features, such as parts of speech, are often helpful for classification tasks. Within the context of patent classification, provide and justify a specific example where considering a word along with its part-of-speech may help distinguish between two of the above sections. [3]
(c) Armed with the above intuition, you select an off-the-shelf part-of-speech tagger (based on a Hidden Markov Model) that reports 97% accuracy and apply it to some sample patents to ensure that it produces reasonable part-of-speech tags. To your dismay, you find that it frequently makes mistakes. On closer inspection, you observe that the errors are usually on specialised, domain-specific language in the patents. Explain why this problem arises and what you could do to fix it. [4]
(d) You want to identify whether two systems (called System A and System B) are better than a baseline method at the classification task. The following table shows intrinsic evaluation metrics obtained over the classification on the train and test sets:
           Train Set            Test Set
           Precision  Recall    Precision  Recall
Baseline   0.61       0.42      0.56       0.50
System A   0.62*      0.43*     0.58*      0.51
System B   0.67*      0.48*     0.51       0.42
* statistical significance w.r.t. baseline (t-test with p-value < 0.05)
Discuss the effectiveness (e.g., generalizability, overfit/underfit, performance on training/test sets etc.) of models A and B in comparison to the 'Baseline' method. [3]
(e) Meanwhile, another team has been busy building a BERT-based text classifier, and they have found that it also works well on the task. You decide to join forces with them. Without using an ensemble approach, how might you go about including explicit parts-of-speech into their BERT-based model? How is the technique different from the approach you took in your linear bag-of-features model? [4]
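As a worked aid for Question 2, here is a minimal Python sketch that counts the character trigrams and computes the Sørensen–Dice and Jaccard similarities. Treating each probe text as a set of unique tokens is our assumption; the question itself leaves the set-vs-multiset choice to the student.

from collections import Counter

text = "abcaedabccbaedabceda"
# Slide a window of width 3 over the text and count each trigram.
trigrams = Counter(text[i:i + 3] for i in range(len(text) - 2))
print({g: c for g, c in trigrams.items() if c > 1})  # abc, eda, aed, dab

def dice(a, b):
    # Sorensen-Dice over token sets: 2|A n B| / (|A| + |B|)
    return 2 * len(a & b) / (len(a) + len(b))

def jaccard(a, b):
    # Jaccard over token sets: |A n B| / |A u B|
    return len(a & b) / len(a | b)

A = set("a eda bceda eda bcda bce".split())
B = set("ca eda bcba eda bceda eda bce".split())
print(dice(A, B), jaccard(A, B))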


[SOLVED] COMPSCI 5089 Introduction to Data Science and Systems 2022

DEGREES of MSci, MEng, BEng, BSc, MA and MA (Social Sciences) Introduction to Data Science and Systems COMPSCI 5089

1. (a) You are designing an application for clothing shops to predict clothes size based on customer height and weight. Suppose we have a clothing dataset with the height, weight and corresponding T-shirt size of several customers. You can represent this dataset based on vector representations by regarding height and weight as two dimensions. Now there is a new client, Abel (U0), whose height is 173cm and weight is 62kg. You are asked to predict the T-shirt size for Abel.
(i) Calculate the Euclidean distance (L2 norm) between the new point and the existing points. [3]
(ii) Predict the size of Abel, based on the kNN algorithm, with k = 3 and the above calculated distances. Justify your prediction. [2]
(b) For all answers, include in your answer document both code and the output of that code.
(i) Calculate the covariance matrix for the clothing dataset using numpy. [1]
(ii) Calculate the eigenvectors and eigenvalues of the covariance matrix using numpy. [2]
(iii) Dimensionality reduction: map the clothing dataset onto the principal component with the largest eigenvalue of its covariance matrix. [2]
(c) (i) Find the SVD for A = ; you should include full working in your solution. [3]
(ii) State the relationships between the determinant, matrix inversion and non-singularity. [2]

2. Consider a tennis player, Ed Balls, who wants to prepare for a competition match against an opponent, let's call him Frank Racket. In order to prepare for the match, Ed has acquired records of the 100 previous matches of his opponent and wants to study statistics of Frank's play to choose where to focus his training. (Here is a quick summary of the rules of tennis: https://protennistips.net/tennis-rules/) Ed is interested in studying Frank's serve as this can be an important strategic advantage.
• For a serve to be valid, it must pass the net and bounce in the diagonally opposite service box.
• If the first serve is a fault (e.g., hits the net or bounces outside the service box), the player can attempt a second serve.
• If the player makes a second fault, he loses the point.
Ed wants to study where Frank's serves bounce in the service box to plan his positioning on the court. We have N_F = 1,000 examples of first serves from Frank, and N_S = 1,000 examples of second serves. We want to estimate the distributions of the bounce location x for Frank's first serve, p(x|first), and second serve, p(x|second). For simplicity,
• we denote the corner closer to the net and towards the centre of the court as position (0,0), and the corner towards the outside of the court and away from the net as (1,1);
• we will ignore serves that hit the net.
This means that values outside [0, 1] × [0, 1] indicate that the serve is a fault.
(a) How would you use the empirical distribution to get an estimate of p(x|first)? Explain the steps, the parameters that need to be set and the associated trade-offs. [4]
(b) Ed now wants to model Frank's serves using a normal distribution:
(i) Explain the parameters, their effect on the distribution and the best way to estimate them in this scenario. [4]
(ii) What could be the problem with this choice of model? Give an example of a situation where it would be inappropriate (you can use a diagram to illustrate your example). [2]
(c) Ed has found that his normal model is not accurate enough for him.
In order to get a more accurate model of the data, he decides to use a mixture of Gaussians. Explain how the model would be parameterised, and how you would fit the model to the available data (provide the relevant equations). [5]

3. Pretend that you are the new head of a local radio station, IDSS Radio, tasked with renewing the station's image and programme. The radio's programming and popularity have varied over the years and you want to use a data science approach to find the right type of programming for the local audience. To this end you start by categorising the programming of the radio between types of content: C = {music, news, business, fiction, comedy, advertisement}. You have historical records of the proportion of each content type in the radio programme for every month over the last ten years, as well as a rating r by a sample of the audience on a scale between 1 and 10, where 1 means "hate it" and 10 means "love it". Considering a programme p = [p_m, p_n, p_b, p_f, p_c, p_a] ∈ R^6 that gives the number of hours for each content type, we are interested in studying the function r(p) that gives the listeners' rating for this programme.
(a) As a first attempt, you decide to assume that the function r(p) is linear, and therefore to solve it using linear least squares, of the canonical form (from the lecture notes):
(i) Explain what each variable in this equation means in this scenario, specifying their dimensions, and what the result would be. [4]
(ii) Can you name a reason why this may not be a good model? How could you measure this using your data? [3]
(b) We want to try to fit another model, this time assuming that listeners' preferences peak for certain quantities of each programme type, and then decrease again if the quantity increases even more. We can model this quantity preference as a bell-shaped function over the quantity p_z for each type of content z: B_z(p_z) = α_z exp(−β ‖p_z − μ_z‖²), and the overall predicted preference for a programme p as:
(i) How many parameters do you need to estimate in this case? Explain the role of each parameter. [3]
(ii) What would be the most appropriate approach to fit this model to your data (note: all of the functions above are differentiable, but B_z is clearly not linear)? Explain how you would parametrise this problem (you are not asked to solve it!). [3]
(c) Using this model r̂, how would you use optimisation to find the best programme, knowing that you want to run the radio from 6am to midnight daily, and need at least 1 hour of advertisement per day to cover the radio's running costs? How would you resolve this optimisation? [2]

4. (a) Consider a relation Weather(Id, Time, Longitude, Latitude, Temperature, Humidity), where the primary key (Id) is a 116-byte string hash code, Time is an 8-byte Datetime, and the other fields are stored as 32-bit floats. Assume that the relation has 30000 tuples, stored in a file on disk organised in 4096-byte blocks. Note that the database system adopts fixed-length records, i.e., each file record corresponds to one tuple of the relation and vice versa.
(i) Compute the blocking factor and the number of blocks required to store this relation. [2]
(ii) You are told that you will need to frequently add new records and you will not often read and fetch a record. Describe in detail the file organisation that you would expect to exhibit the best performance characteristics. Explain your answer by comparing the cost of reasonable alternatives.
[3]
(b) Consider the following three relations:
• Student(Id, FirstName, LastName, DateOfBirth), or S, where:
– the primary key (Id) is a 32-bit integer,
– FirstName and LastName are both 96-byte strings, and
– DateOfBirth is a 32-bit integer.
• Course(Id, Description, Credits), or C, where:
– Id, the primary key of this relation, is a 32-bit integer,
– Description is a 195-byte string, and
– Credits is an 8-bit integer.
• Transcript(StudentId, CourseId, Mark), or T, where:
– StudentId is a foreign key to the primary key (Id) in the Student relation,
– CourseId is a foreign key to the primary key (Id) in the Course relation,
– Mark is an 8-byte double precision floating-point number, and
– the primary key consists of the combination of StudentId and CourseId.
Assume these relations are also organised in 4096-byte blocks, and that:
• relation Course (C) has rC = 32 records and nC = 2 blocks, organised in a heap file,
• relation Transcript (T) has rT = 51200 records and nT = 200 blocks, organised in a sequential file, ordered by StudentId,
• relation Student (S) has rS = 2000 records and nS = 100 blocks, stored in a heap file, and has a 4-level secondary index on StudentId.
Further assume that the memory of the database system can accommodate nB = 23 blocks for processing and that the blocking factor for the join-results block is bfrRS = 10 records per block. Last, assume we execute the following equi-join query:
SELECT * FROM Transcript AS T, Student AS S, Course AS C
WHERE T.StudentId = S.Id AND T.CourseId = C.Id
As this is a 3-way join, assume that you need to join T with C first, with each block of intermediate results stored only in RAM (in one of the nB blocks), then joined with S.
(i) Describe the join strategy that would be the most efficient in this case and estimate its total expected cost (in number of block accesses). Show your work. [8]
(ii) Compare the Naive Nested-Loop Join and the Index-based Nested-Loop Join. Which one is faster? Explain why. [2]
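For the storage arithmetic in Question 4(a)(i), the following minimal Python sketch shows the standard blocking-factor calculation; the helper names are ours, and the 140-byte record size simply adds up the field sizes given in the question (116 + 8 + 4 × 4).

import math

BLOCK_SIZE = 4096  # bytes per disk block, as stated in the question

def blocking_factor(record_bytes):
    # Fixed-length records: whole records per block, rounded down.
    return BLOCK_SIZE // record_bytes

def blocks_needed(num_records, record_bytes):
    # A partially filled block still occupies a whole block, so round up.
    return math.ceil(num_records / blocking_factor(record_bytes))

record = 116 + 8 + 4 * 4  # Id + Time + four 32-bit floats = 140 bytes
print(blocking_factor(record))       # 29 records per block
print(blocks_needed(30000, record))  # 1035 blocks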


[SOLVED] COMPSCI 5011 INFORMATION RETRIEVAL 2022

DEGREES OF MSc, MSci, MEng, BEng, BSc, MA and MA (Social Sciences) INFORMATION RETRIEVAL M COMPSCI 5011 Friday 29 April 2022

1. (a) The following documents have been processed by an IR system where stemming is not applied:
DocID  Text
Doc1   france is world champion 1998 france won
Doc2   croatia and france played each other in the semifinal
Doc3   croatia was in the semifinal 1998
Doc4   croatia won the other semifinal in russia 2018
(i) Assume that the following terms are stopwords: and, in, is, the, was. Construct an inverted file for these documents, showing clearly the dictionary and posting list components. Your inverted file needs to store sufficient information for computing a simple tf*idf term weight, where w_ij = tf_ij * log2(N/df_i). [5]
(ii) Compute the term weights of the two terms "champion" and "1998" in Doc1. Show your working. [2]
(iii) Assuming the use of a best match ranking algorithm, rank all documents using their relevance scores for the following query: 1998 croatia. Show your working. Note that log2(0.75) = -0.4150 and log2(1.3333) = 0.4150. [3]
(b) (i) In Web search, explain why the use of raw term frequency (TF) counts in scoring documents can hurt the effectiveness of the search engine. [2]
(ii) Suggest a solution to alleviate the problem, and show through examples how it might work. Explain through examples how modern term weighting models in IR control the raw term frequency counts. [3]
(c) Assume that you have decided to modify the approach you use to rank the documents of your collection. You have developed a new Web ranking approach that makes use of recent advances in neural networks. All other components of the system remain the same. Explain in detail the steps you need to undertake to determine whether your new Web ranking approach produces a better retrieval performance than the original ranking approach. [5]

2. (a) Consider a corpus of documents C written in English, where the frequency distribution of words approximately follows Zipf's law r * p(w_r|C) = 0.1, where r = 1, 2, ..., n is the rank of a word by decreasing order of frequency, w_r is the word at rank r, and p(w_r|C) is the probability of occurrence of word w_r in the corpus C. Compute the probability of occurrence of the most frequent word in C. Compute the probability of occurrence of the 2nd most frequent word in C. Justify your answers. [4]
(b) Consider the query "michael jackson music" and the following term frequencies for the three documents D1, D2 and D3, where the search engine is using raw term frequency (TF) but no IDF:
      indiana  jackson  life  michael  music  pop  really
D1    0        4        1     3        0      6    1
D2    4        0        3     4        1      0    2
D3    0        4        0     5        4      4    0
Assume that the system has returned the following ranking: D2, D3, D1. The user judges D3 to be relevant and both D1 and D2 to be non-relevant.
(i) Show the original query vector, clearly stating the dimensions of the vector. [2]
(ii) Use Rocchio's relevance feedback algorithm (with α=β=γ=1) to provide a revised query vector for "michael jackson music". Terms in the revised query that have negative weights can be dropped, i.e. their weights can be changed back to 0. Show all your calculations. [4]
(c) Suppose we have a corpus of documents with a dictionary of 6 words w1, ..., w6.
Consider the table below, which provides the estimated language model p(w|C) using the entire corpus of documents C (second column) as well as the word counts for doc1 (third column) and doc2 (fourth column), where ct(w, doci) is the count of word w (i.e. its term frequency) in document doci. Let the query q be the following: q = w1 w2
Word  p(w|C)  ct(w, doc1)  ct(w, doc2)
w1    0.8     2            7
w2    0.1     3            1
w3    0.025   2            1
w4    0.025   2            1
w5    0.025   1            0
w6    0.025   0            0
SUM   1.0     10           10
(i) Assume that we do not apply any smoothing technique to the language model for doc1 and doc2. Calculate the query likelihood for both doc1 and doc2, i.e. p(q|doc1) and p(q|doc2) (do not compute the log-likelihood; i.e. do not apply any log scaling). Show your calculations. Provide the resulting ranking of documents and state the document that would be ranked the highest. [3]
(ii) Suppose we now smooth the language model for doc1 and doc2 using Jelinek-Mercer smoothing with λ = 0.1. Recalculate the likelihood of the query for both doc1 and doc2, i.e., p(q|doc1) and p(q|doc2) (do not compute the log-likelihood; i.e. do not apply any log scaling). Show your calculations. Provide the resulting ranking of documents and state the document that would be ranked the highest. [4]
(iii) Explain which document you think should reasonably be ranked higher (doc1 or doc2) and why. [3]

3. (a) How would the IDF score of a word w change (i.e., increase, decrease or stay the same) in each of the following cases: (1) adding the word w to a document; (2) making each document twice as long as its original length by concatenating the document with itself; (3) adding some documents to the collection. You must suitably justify your answers. [4]
(b) Explain in detail why positive feedback is likely to be more useful than negative feedback to an information retrieval system. Illustrate your answer using an example from a suitable search scenario. [4]
(c) Neural retrieval models often use a re-ranking strategy over BM25 to reduce computational overhead. Explain the key limitation of this strategy. Describe in sufficient detail an approach that you might use to overcome this problem. [5]
(d) Consider a query q, which returns all webpages shown in the hyperlink structure below.
(i) Write the adjacency matrix A for the above graph. [1]
(ii) Using the iterative HITS algorithm, provide the hub and authority scores for all the webpages of the above graph after a complete single iteration of the algorithm. Show your workings. [3]
(iii) Describe in sufficient detail an alternative approach to compute the hub and authority scores for the above graph. You need to show all required steps to generate the scores, but you do not need to actually compute the final scores. [3]
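For Question 2(c)(ii), here is a minimal Python sketch of Jelinek-Mercer smoothed query likelihood using the table above; the variable names are ours, and only the two query words are included since the others do not affect p(q|d).

# p(q|d) = product over query words w of (1-lam)*ct(w,d)/|d| + lam*p(w|C)
p_C = {"w1": 0.8, "w2": 0.1}     # corpus language model, from the table
doc1 = {"w1": 2, "w2": 3}        # word counts in doc1
doc2 = {"w1": 7, "w2": 1}        # word counts in doc2
dlen = 10                        # both documents contain 10 words
lam = 0.1                        # Jelinek-Mercer interpolation weight

def query_likelihood(doc, query=("w1", "w2")):
    p = 1.0
    for w in query:
        p *= (1 - lam) * doc.get(w, 0) / dlen + lam * p_C[w]
    return p

print(query_likelihood(doc1), query_likelihood(doc2))

Setting lam = 0 recovers the unsmoothed maximum-likelihood estimate asked for in part (i).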


[SOLVED] COMPSCI5089 Intro to Data Sci Systems 2024

Intro to Data Sci & Systems M COMPSCI5089 Friday 20 December 2024

1. This question is concerned with the Linear Algebra part of the course.
Note: When answering this question, you are recommended to use either Numpy pseudo-code or Latex syntax (at your preference) for typing mathematical answers into Moodle. Incorrect syntax will not be penalised as long as it is clear and unambiguous. For example, the identity matrix could be written as [[1,0],[0,1]] and the matrix inverse as A^-1 or inv(A).
Consider that you are working for a shipment company and studying the movements of parcels between 5 sites: A, B, C, D and E. The transitions between those sites every day are expressed in the following graph:
(a) (i) What is the adjacency matrix for this graph? Provide the corresponding matrix (note: ensure that the edge weights are correctly encoded). [3]
(ii) Assume that at t = 0 you have the following distribution: A = 100, B = 10, C = 20, D = 0, E = 0. What would the distribution be at t = 1? [2]
(iii) How would you calculate the package distribution two days ago (x_(t=-2))? Detail the approach you would use, but you do not have to calculate the actual values. [2]
(iv) How would you transform this adjacency matrix to make the graph undirected (i.e., ensure that paths between any two nodes go both ways)? [2]
(b) What is a steady state of A? Explain two ways to calculate the steady state for this process. [3]
(c) Consider the 2 × 3 matrix A with the following SVD decomposition A = UΣV^T, where
(i) What are the singular values of A? [3]
(ii) How can you calculate the pseudo-inverse of A, A+, from this decomposition? Explain all steps. (Hint: we have (AB)+ = B+A+ and, if A is invertible, then A+ = A^-1.) [5]

2. This question is concerned with the optimisation part of the course.
Hint: For the following questions, you can use ((1 0) (0 1))^-1 to represent matrix inverses and simplify your typing.
You are given the following linear least squares optimisation problem. Minimise the cost function f(x) = ‖Ax − b‖², where:
• x ∈ R2 is the vector of unknowns,
• A ∈ R2×2 is a matrix of known values,
• b ∈ R2 is a vector of known values.
Given:
(a) Solve the least squares problem using the normal equations method to find the optimal solution x*. Hint: The normal equations are derived from the gradient of the least squares function, set to zero: A^T A x = A^T b. And [5]
(b) Solve the least squares problem using gradient descent, starting from an initial guess x0 = and using a step size α = 0.5. Perform two iterations. Hint: The gradient of the least squares cost function is given by ∇f(x) = 2A^T(Ax − b). You can use delta to represent ∇. [5]
(c) Discuss the merits of stochastic gradient descent (SGD) for solving least squares problems, especially in the context of large datasets. [4]
(d) Now, consider the same least squares problem, but with the additional constraint that x1 + x2 = 1. Solve this constrained optimisation problem using the Lagrange multiplier method. Hint: You don't need to substitute the values of x, but describe the overall steps with formulas. [6]

3. This question is concerned with the probabilities part of the course.
Note: When answering this question, you are recommended to use either Numpy pseudo-code or Latex syntax (at your preference) for typing mathematical answers into Moodle. Incorrect syntax will not be penalised as long as it is clear and unambiguous.
For example, the identity matrix could be written as [[1,0],[0,1]] and the matrix inverse as A^-1 or inv(A).
Let us assume that for a user study you have recorded the gaze of users when navigating a webpage (i.e., you have recorded which part of the webpage they were looking at). As a result, you have obtained a database D of 100 gaze locations for 100 users. Each record provides you with the x and y coordinates of the user's gaze location on the page. We will assume that the coordinates are normalised between 0 and 1, such that (0, 0) indicates the top left corner of the page and (1, 1) the bottom right corner.
(a) As a first attempt at the problem, you decide to assume that the distribution of users' gazes is normally distributed. Explain:
(i) how this distribution would be parametrised, stating the dimensionality of each parameter; [2]
(ii) and how you would estimate those parameters from D. [3]
(b) After initial experiments, your model appears to perform very poorly:
(i) In what case would this assumption of a normal distribution be obviously wrong? How would you identify that from the data in D? [2]
(ii) Propose an alternative model you could use and explain how it would be parametrised (stating all dimensions). [3]
(iii) How would you estimate those parameters? [3]
(c) Let us assume that out of the 100 users you recorded, 25 did actually buy something on the site, and that in addition to the users' gaze, you have also recorded which users decided to buy something and which did not. Using this data, how would you estimate how likely a user is to buy something given that they have gazed at a location g0? Explain all the steps of your approach in detail. [7]

4. This question is concerned with the databases part of the course.
(a) You are given the following two relations:
• Course (C):
– Schema: Course(Id, Description, Credits)
– Attributes:
* Id: a 4-byte integer (primary key)
* Description: a 256-byte string
* Credits: a 1-byte integer
– Total Records (rC): 32
• Transcript (T):
– Schema: Transcript(StudentId, CourseId, Mark)
– Attributes:
* StudentId: a 4-byte integer (foreign key)
* CourseId: a 4-byte integer (foreign key to Course(Id))
* Mark: an 8-byte double precision floating-point number
* Primary key: the combination of StudentId and CourseId
– Total Records (rT): 51,200
Assume: the size of a disk block is 4096 bytes, and CourseId in T references Id in C.
(i) Calculate the number of blocks needed to store each relation (C and T). Hint: To simplify your typing, you can use ceil(x) and floor(x) to represent the ceiling and floor functions used to round numbers to the nearest integer. [6]
(ii) Estimate the selection cardinality of joining the relations C and T on the attribute CourseId. Assume that the courses in C are uniformly enrolled by all the students and appear in the Transcript (T) records. [2]
(iii) Explain how the selectivity helps in the query process. [2]
(b) You are a data engineer at a company that develops personalised music streaming services. The platform needs to recommend songs to users based on their listening history and preferences. Each song is represented by a high-dimensional feature vector that includes acoustic attributes, artist information, and embedded representations from machine learning models.
(i) Which types of database would you choose to store and query the song data? Justify your choice.
[6]
(ii) Describe how you would store and index the song data to allow efficient retrieval and recommendation. [4]
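For Question 2, here is a minimal numpy sketch contrasting the normal-equations solution with a few gradient-descent iterations. Because the exam's actual A, b and x0 values are omitted in this copy, the values below are placeholders (and a smaller step size is used so the placeholder problem converges).

import numpy as np

A = np.array([[2.0, 0.0], [0.0, 1.0]])  # placeholder data
b = np.array([2.0, 3.0])                # placeholder data

# (a) Normal equations: solve A^T A x = A^T b.
x_star = np.linalg.solve(A.T @ A, A.T @ b)

# (b) Gradient descent on f(x) = ||Ax - b||^2 with grad f = 2 A^T (Ax - b).
x = np.zeros(2)      # placeholder initial guess x0
alpha = 0.1          # placeholder step size (the exam specifies 0.5)
for _ in range(2):
    x = x - alpha * 2 * A.T @ (A @ x - b)

print(x_star, x)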


[SOLVED] Building a Project Management Application

Task(s)

Part 1: Building a Project Management Application (70 Marks)
In this project, you must apply knowledge of C++ object-oriented programming to design and implement a console-based Project Management application that allows users to manage projects, tasks, team members, vendors and clients. The program should meet the following requirements:
1. Project Class:
a. Create a Project class with attributes such as project name, description, start date, end date, and status (e.g., "Not Started", "In Progress", "Completed").
b. Implement methods to add, update, and delete projects.
2. Task Class:
a. Create a Task class with attributes such as task name, description, start date, end date, and status (e.g., "Not Started", "In Progress", "Completed").
b. Implement methods to add, update, and delete tasks.
c. Each task should be associated with a project.
d. A task may have one level of sub-tasks. Each main task may have multiple sub-tasks.
3. Team Member Class:
a. Create a TeamMember class with attributes such as team member name, role, and contact information.
b. Implement methods to add, update, and delete team members.
c. Each team member should be associated with a project.
4. Vendor Class:
a. Create a Vendor class with attributes such as company name, company type (what resource does this vendor provide, e.g. IT services, cabling, software coding etc.), and contact person(s).
b. A project may have multiple vendors.
c. Vendors can also be assigned to a task.
d. Implement methods to add, update, and delete vendors.
5. Client Class:
a. Create a Client class with attributes such as company name, company type (which industry they are in, like finance, banking, IT, construction, plantation etc.), and contact person(s).
b. A project may have multiple clients.
c. Implement methods to add, update, and delete clients.
6. Project Management System Class:
a. Create a ProjectManagementSystem class that manages projects, tasks, and team members.
b. Implement methods to display project details, task details, and team member details.
c. Implement methods to assign tasks to team members and update task status.
7. Menu-Driven Interface:
a. Create a menu-driven interface that allows users to interact with the Project Management Application. The menu should include options to:
i. Add, update, and delete projects
ii. Add, update, and delete tasks
iii. Add, update, and delete team members
iv. Add, update, and delete vendors
v. Add, update, and delete clients
vi. Assign tasks to team members
vii. Update task status
viii. Display project details, task details, and team member details
8. Additional Functions: include 2 relevant additional functions (not described in this assignment) that will further demonstrate your ability to use object-oriented programming (OOP) concepts.
9. Code Standard:
a. You must demonstrate your ability to use as many OOP concepts as possible for your project. Use classes and objects, composition, inheritance, and polymorphism where necessary. At least one static function must also be applied in your program.
b. Your project should have one main.cpp source file, multiple header files, and text file(s).
c. All data should be saved into text files, preferably in CSV format.
d. Use exception handling to handle possible input errors.
e. Use suitable STL containers.

Part II: Project Documentation (30 Marks)
Create project documentation that covers the following details:
1. A set of UML diagrams that shows the functionalities, the relationships between classes, and the workflow of your project management application. Use appropriate UML tools and diagramming techniques.
2. Explain why the STL containers you used to develop the project management application are suitable in this project context.
3. Explain the rationale for the inheritance and polymorphism you implemented in your project. For example, you may clarify why a class should inherit from another class (or classes) and how you enabled polymorphism in your program code.
4. Explain the rationale for using the static function in your program.
5. Explain the 2 additional functions you implemented in your program and the OO concepts that you implemented in those 2 additional functions.


[SOLVED] MST 1013 Research in Education Assignment 1

Research in Education MST 1013 May-Sept 2025 Assignment 1: Developing Your Own Research Focus
Contribution: 30%
Submission Date: Week 5
Scope: Introduction to Educational Research

Objective
To help students understand the differences between a discipline, a topic, and a research title in educational research, and to apply this understanding by creating their own research focus based on personal interests or professional goals.

Instructions
1. Choose one discipline within the field of education that interests you. Examples: Science Education, Early Childhood Education (ECE), Music Education, Educational Technology, Inclusive Education, etc.
2. Based on your chosen discipline:
o Identify a specific research topic.
o Create one research title based on that topic (ensure clarity and focus).
o Formulate a research problem (approx. 150 words) that you aim to address.
Submission Date: Week 3


[SOLVED] PHAS0008 Practical Skills 1P Experiment T8

PHAS0008 "Practical Skills 1P" Experiment T8: Specific Latent Heat of Liquid Nitrogen (4 Sessions)

Experimental Objectives
To determine the specific latent heat of vaporisation of liquid nitrogen. This quantity is also known as the specific enthalpy change on vaporisation. To determine the specific latent heat of melting of water and/or heavy water ice.

Relevant Lecture Course
• Thermal Physics and the Properties of Matter (PHAS0006)

Potential Hazard: Latent Heat of Liquid Nitrogen
• Nitrogen is non-flammable and has approximately the same density as air. Inhalation of a nitrogen-enriched atmosphere (i.e. loss of oxygen) may cause dizziness, drowsiness, nausea, vomiting, excess salivation, diminished mental alertness, loss of consciousness, and ultimately death.
• Freeze burns from spilled liquid nitrogen that leaves the dewar or the equipment, for example when retrieving samples.

Existing Control Measures
• Prevent unauthorised people having access to areas used for delivering, storing, dispensing and using liquid nitrogen.
• Avoid direct skin contact with items which have recently been in proximity to liquid nitrogen, by using insulated gloves or tongs.
• Oxygen depletion monitors are situated around the laboratories and will detect any build-up of nitrogen in the laboratories.
• A technician or qualified demonstrator trained in the use of liquid nitrogen distributes the liquid nitrogen.
• Users are required to wear safety glasses at all times when using the nitrogen.
• Only persons fully trained in the use of cryogenic liquids may use the LN2.
• The container of the liquid nitrogen is covered once the nitrogen has been dispensed.
• The wiring of the circuit for experiments involving the use of liquid nitrogen is checked by the technician and demonstrator before work is allowed to continue.
• Supervision from technicians and demonstrators regarding health and safety.
• Safety guidelines are adhered to at all times.
• No lone working permitted at any time.

Risk Level with existing controls: Low/Tolerable

Safety Note: Use of Liquid Nitrogen
Please read this Safety Services policy:
• https://www.ucl.ac.uk/safety-services/policies/2022/dec/liquid-nitrogen
This experiment uses a small amount of liquid nitrogen. There is no reason for you to come into contact with the liquid nitrogen; however, if you do, there is a possibility that the extremely low temperature of the liquid may cold-burn you. To avoid accidents you should take the following precautions.
• Use the safety spectacles provided while performing the experiment.
• Remove rings. If liquid nitrogen falls on your hands it could be trapped behind a ring and this could result in burning.
• On finishing the experiment, ask the technician to return the unused liquid to the storage vessel.
Should you come into contact with the liquid, please note that small splashes of liquid nitrogen on your skin will not harm you. However, exposure to the liquid for more than 3 or 4 seconds may cause cold burns. If this should happen for any reason, call the lab technician for help and take steps to get the liquid away from your skin. If possible run cold water over the affected region. Please note that on evaporation, one litre of liquid nitrogen will produce around 700 litres of gas.

1. Introduction
The term "latent heat" was first used by Joseph Black in a posthumous work, Lectures on the Elements of Chemistry, published in 1803, but describing experiments done 40 years earlier [1, 2].
The term was first applied to the heat required to vaporise a liquid, but a similar effect is encountered when going from solid to liquid. The modern definition of the specific latent heat of vaporisation, as given, for example, in Chambers' Dictionary of Science and Technology [3], is "The heat required to change the state of unit mass of a substance from solid to liquid, or from liquid to gas, without change of temperature. Most substances have a latent heat of fusion and latent heat of vaporization. The specific latent heat is the difference in enthalpies of the substance in its two states. Unit J kg-1". In Black's day, the latent heat was quantified by comparing the time taken to boil a vessel of water dry with the time taken to bring it to boiling point, assuming a constant rate of heat flow. Nowadays we have more accurate ways of measuring the heat input. Latent heat is a key quantity in many natural and industrial processes, for example in temperature regulation and engine performance. It also plays a central role in atmospheric, oceanic and climate stability and modelling. [4-6]

2. Background and Theory
Nitrogen and other gases, such as helium and propane (C3H8), can be liquefied by compression/expansion cycles at around 30 bar and exploitation of the Joule-Thomson effect [see PHAS0006 and 7]. The boiling points of He, H2, N2 and C3H8 at atmospheric pressure are 4.21, 20.27, 77.35 and 231.1 K respectively. Liquefied gases used in experiments are kept in dewars (named after their inventor Sir James Dewar, the first person to liquefy hydrogen). These are flasks with a double wall of glass, separated by a vacuum, which are used to thermally insulate materials so as to keep them either hot or cold. Dewars insulate the liquid from nearly all sources of ambient heat in the laboratory, but are not 100% efficient. A quantity of liquid nitrogen in a dewar, assumed to be at 77 K (the boiling point of N2) [8], will slowly boil away due to background heat. The rate at which the liquid loses mass is proportional to the rate of influx of background heat:

(dQ/dt)_background = L (dm/dt)_background    (1)

where L is the specific latent heat of vaporisation and Q = mL [9]. If we supply additional heat, the rate of mass loss will increase:

(dQ/dt)_background + (dQ/dt)_additional = L (dm/dt)_on    (2)

Hence, even if we don't know the background rate of heat flow, as long as we do know the additional rate, we can calculate L from the difference between the two rates of mass loss, in other words by subtracting equation 1 from equation 2:

(dQ/dt)_additional = L [(dm/dt)_on - (dm/dt)_background]    (3)

and hence

L = (dQ/dt)_additional / [(dm/dt)_on - (dm/dt)_background]    (4)

Make sure you understand what is meant by the latent heat of vaporisation and latent heat of fusion of a substance, and how they differ from heat capacity.

3. The Experiment
In this experiment the additional heat will be supplied by a resistor in which a current is flowing. According to electrical theory, a resistor across which there is a potential difference V, and in which a current I is flowing, dissipates power (energy per unit time) according to the formula

P = VI.    (5)

So, if all this power is absorbed by the liquid nitrogen as latent heat, equation 4 becomes

L = VI / [(dm/dt)_on - (dm/dt)_background]    (6)

We therefore need to measure V, I, and the rates of mass loss with and without the current flowing.
Q3.1: Why might V and I fluctuate? Can this be controlled?

3.1 Equipment
The experimental set-up (see Figure 1) is very simple: a dewar with a loose-fitting polystyrene lid, through which two wires lead to the resistor, is placed on a weighing scale.
The mass of the dewar and contents will decrease with time as the liquid nitrogen evaporates.
Q3.2: Why is the polystyrene lid loose?
Q3.3: What methods of heat transfer are relevant?
The diagram in Figure 1 is useful, but whenever an electrical circuit is built as part of an experiment a circuit diagram should also be included. The electrical circuit should supply about 10 W of electrical power to the resistor.
Q3.4: What is the value of the resistance of the resistor?
Q3.5: What I-V combination(s) will you use? Is there a reason for your choice?

3.2 Safety Note
Ask a member of technical staff to fill the dewar nearly to the brim with liquid nitrogen. It should weigh around 200 ± 20 g. DO NOT TURN ON THE ELECTRIC POWER SUPPLY UNTIL THE RESISTOR IS IMMERSED IN NITROGEN - the resistor becomes very hot and needs to be in the liquid nitrogen before the power is turned on to prevent it from burning out. It should remain in the same position at all times, completely surrounded by liquid nitrogen (in both "background" and "power-on" runs) and not in contact with the dewar.
Q3.6: How will you ensure that the resistor stays where you want it to be?

3.3 Experimental Procedure
In this experiment you will evaluate the rate of mass loss of the liquid nitrogen under two sets of circumstances: [1] with the power on (you can use more than one power setting: is there an advantage in doing this?), and [2] with the power off (the background rate). During these two data runs, you should control all other factors that might influence the mass loss rate. Think carefully about the following questions:
Q3.7: As the liquid boils off due to background heat alone, is the rate of mass loss likely to be constant?
Q3.8: Is there anything in the design of the equipment that might cause this rate to vary?
Q3.9: If you are not sure, can you find out experimentally?
Q3.10: If you think there will be variation, how can you limit the effect of such variation on the result of your experiment?
It is recommended that your first data run is done under background conditions only, and lasts long enough for you to observe any changes that occur as the liquid boils away. N.B. Under normal background conditions, the liquid boils away quite slowly; it takes more than an hour for a full dewar to lose half its contents. Devise an initial plan for your method, write it in your lab book and discuss it with a demonstrator before proceeding. Remember that when taking data you must also estimate the associated uncertainties at the same time, not as an afterthought. After your initial background run, you should assess your data and decide whether your plan needs modification. The best way to do this is by drawing a graph of mass against time immediately after finishing the run. What do you deduce from the graph? Remember that the formula we are using to calculate L, namely (6), was based on combining equations (1) and (2), which correspond to the "background" and "power-on" runs respectively.
Q3.11: There is an assumption underlying this; what is it?
In order to use (6), therefore, you must ensure that your values for the rate of mass loss under "background" and "power-on" conditions are consistent with this assumption. When you draw up your experimental plan, you should also consider whether there is an advantage in measuring for more than one power input; please see equation 6.

4. Data Analysis
Plot a graph of the mass of liquid nitrogen versus time; determine dm/dt, together with its uncertainty, with power(s) on and off.
When fitting the data to obtain the gradient (and intercept), you should also determine and comment on the (reduced) χ². Using equation 6, obtain an estimate for L and compare your answer with that found in the literature. [8]
At the end of your first set of measurements you should have:
• graphs of mass versus time;
• estimates of dm/dt with power on and off;
• values of V and I;
• an estimate of L with an uncertainty estimate.
Q4.1: How does your estimate for L compare with the published value, in terms of its uncertainty?
If you conduct multiple runs at identical power, you may wish to consider whether it is appropriate to find an average value. Remember that we can only justify taking an average of two or more values if they were obtained under the same conditions. If you suspect that one of your values is more reliable than the others, you may choose this one as your final result as long as you can justify the choice. You cannot justify picking out a result simply because it is the nearest one to the accepted value! You may wish to consider the average of the data points, include all data points from all runs on a single graph, or simply take the average of your values of L. It will be important to consider your uncertainties and whether the uncertainties affect how much weight should be given to any datum point. As noted in section 3.3, you should also consider whether repeat runs at different power might help reduce the uncertainty in your measured value of L.

5. Discussion & Conclusions
If the uncertainties are large or your initial estimate for L is inconsistent with the published value, you may wish to consider some of the following:
• Is your value of VI an accurate estimate of the power dissipated in the resistor?
• What have you assumed about where this power goes? Is your assumption justified?
• Are your mass readings accurate estimates of the quantity of liquid in the dewar?
• Is the procedure you used to convert these mass readings to a rate of loss of mass reliable?
• What factors govern the background rate of mass loss?
Given what you can and cannot control and measure, you may wish to repeat the experiment with the same procedure or modify the procedure to reduce your uncertainties. Discuss any major modifications with a demonstrator before proceeding, since there are limits on what may be possible. In your conclusions you should discuss whether any modifications you made resulted in an improved result; if you have had any further ideas for modification but do not have either the time or the resources to implement them, describe them in your write-up.

6. Extension Experiment: Specific latent heat of melting for water ice (H2O) and/or heavy water ice (D2O)
After completing the main experiments, you should design and conduct an experiment using your apparatus to measure the specific latent heat of melting of water ice (H2O) and/or heavy water ice (D2O): would you expect the melting point and specific latent heat to be the same for the two different isotopic compositions? For this extension experiment, you can use your balance and dewar, and a member of technical staff can provide you with a digital thermometer. On request, Derek Thomas will be able to give you water ice cubes and/or a single heavy water ice cube. You may also use an IR thermal imaging camera, which can be borrowed from Derek Thomas. Liquid water can be used as a medium of known heat capacity.
Please note that phase change materials, which release their latent heat on freezing, are currently extremely topical for renewable energy storage. [10, 11] You will, of course, need to draw up a Risk Assessment for your experimental procedure: this MUST be approved by a member of staff before you conduct any measurements, and should include a description of how you will dispose of any samples once they have been used. The Material Safety Data Sheet (MSDS) for D2O is available on Moodle, and please see: https://www.ucl.ac.uk/safety-services/working-safely-chemicals
Safety Note: You must NOT place the resistor heater in water.

7. References
Notes:
• Please only quote these references if you have actually read and referred to them, and include relevant page numbers.
• The Digital Object Identifier (DOI) is a string of numbers, letters and symbols used to permanently identify an article or document and link to it on the web.
• The International Standard Book Number (ISBN) is a numeric commercial book identifier that is intended to be unique.
[1] "Joseph Black, carbon dioxide, latent heat, and the beginnings of the discovery of the respiratory gases", JB West, Am. J. Physiol. Lung Cell Mol. Physiol., 306: L1057-L1063 (2014). DOI: 10.1152/ajplung.00020.2014
[2] "April 23, 1762: Joseph Black and Latent Heat. Disappearing heat and the dog that did not bark", R Williams, APS News 21 (2012). https://www.aps.org/publications/apsnews/201204/physicshistory.cfm (Accessed 12/12/2022)
[3] "Chambers Dictionary of Science and Technology", 2nd edition, JM Lackie, General Editor. Edinburgh: Chambers (2007). ISBN: 9780550104571 (e-book). UCL username and password required for access.
[4] "Latent heat must be visible in climate communications", T Matthews et al., WIREs Climate Change, 13, e779 (2022). https://doi.org/10.1002/wcc.779
[5] "Factors of boreal summer latent heat flux variations over the tropical western North Pacific", Y Wang and R Wu, Clim Dyn 57, 2753-2765 (2021). https://doi.org/10.1007/s00382-021-05835-4
[6] "Sensible heat has significantly affected the global hydrological cycle over the historical period", G Myhre et al., Nat Commun 9, 1922 (2018). https://doi.org/10.1038/s41467-018-04307-4
[7] "Liquefaction of gases", WH Isalski, Thermopedia (2011). DOI: 10.1615/AtoZ.l.liquefaction_of_gases
[8] "Tables of Physical and Chemical Constants", 16th edition, originally compiled by G.W.C. Kaye and T.H. Laby; Longman, New York (1995). ISBN-13: 9780582226296. Available at Kaye and Laby online: http://www.kayelaby.npl.co.uk/toc/
[9] "Physics for Scientists and Engineers", 9th edition, RA Serway and JW Jewett. Australia: Brooks/Cole Cengage Learning (2014). ISBN: 9781473711143 (e-book). UCL username and password required for access.
[10] "Phase change materials for thermal energy storage", K Pielichowska & K Pielichowski, Progress in Materials Science 65, 67-123 (2014). https://doi.org/10.1016/j.pmatsci.2014.03.005
[11] "Trimodal thermal energy storage material for renewable energy applications", S Saher, S Johnston, R Esther-Kelvin, et al., Nature 636, 622-626 (2024). https://doi.org/10.1038/s41586-024-08214-1
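As a sketch of the Section 4 analysis, the short Python fragment below fits dm/dt for the two runs and propagates the gradient uncertainties into L via equation (6). The mass readings, V and I are placeholder numbers, not real measurements, and the error propagation uses only the first-order formula.

import numpy as np

t = np.array([0, 60, 120, 180, 240], dtype=float)       # seconds
m_bg = np.array([200.0, 199.4, 198.8, 198.1, 197.5])    # grams, power off
m_on = np.array([197.0, 193.6, 190.2, 186.7, 183.3])    # grams, power on

# Straight-line fits; cov=True also returns the parameter covariance matrix.
(grad_bg, _), cov_bg = np.polyfit(t, m_bg, 1, cov=True)
(grad_on, _), cov_on = np.polyfit(t, m_on, 1, cov=True)

V, I = 10.0, 1.0   # placeholder supply readings (volts, amps)
P = V * I          # power dissipated in the resistor, equation (5)

# Equation (6): L = P / (|dm/dt|_on - |dm/dt|_background), here in J/g.
delta = abs(grad_on) - abs(grad_bg)
L = P / delta

# First-order propagation of the two gradient variances into L.
sigma_L = (P / delta**2) * np.sqrt(cov_bg[0, 0] + cov_on[0, 0])
print(L * 1000, sigma_L * 1000)  # converted to J/kg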


[SOLVED] STAT3600 Linear Statistical Analysis 24 Statistics

STAT3600 Linear Statistical Analysis

1. [49] Consider the data of five observations.
i   xi   yi
1   26   3.2
2   23   1.8
3   62   4.0
4   20   2.3
5   17   4.8
a. [5] Write down the simple linear regression model of yi on xi. What are the four model assumptions? State them clearly.
b. [5] Let β̂1 be the least squares estimator for the unknown population slope in the simple linear regression model. Prove that
c. [5] Find the least squares estimates of the population intercept and slope. Interpret the estimate for the population slope.
d. [15] Construct the following ANOVA table by filling in the blanks labelled with letters from A to I. At the 5% significance level, test whether there is a linear relationship between the independent and dependent variables using the information in the ANOVA table. State clearly the null and alternative hypotheses, test statistic, null distribution, decision rule and conclusion.
Source  SS  df  MS
SSR     A   D   G
SSE     B   E   H
SST     C   F
e. [6] Using Bonferroni's method, construct simultaneous confidence intervals for the population intercept and slope with a family confidence level of at least 95%.
f. [2] Find the coefficient of determination and interpret the result.
g. [1] Find a point estimate for the population mean of Y when x is 25.
h. [4] Construct a 90% confidence interval for the population mean of Y when x is 25.
i. [6] Let Y(1) and Y(2) be future responses with the values of x being 30 and 35, respectively. Construct a 95% prediction interval for Y(1) − Y(2).

2. [51] You are given the following matrices computed from a multiple linear regression of yi = β0 + β1 xi1 + β2 xi2 + εi. The matrices are properly ordered according to the regression equation given above.
a. [4] Find the sample size and the sample mean of r.
b. [5] Show that the least squares estimator for β is given by β̂ = (X^T X)^(-1) X^T Y.
c. [5] Find the least squares estimates for β0, β1 and β2. Interpret the estimates for β1 and β2.
d. [15] Construct the ANOVA table and hence test whether the coefficients for the independent variables are jointly equal to zero at the 5% level of significance. Clearly define the null and alternative hypotheses and the decision rule. State your conclusion.
e. [7] At the 5% level of significance, conduct a t-test for H0: β1 = β2 vs. H1: β1 ≠ β2.
f. [6] Construct a 95% confidence interval for β1 + 2β2.
g. [9] Define  At the 5% level of significance, test the following hypotheses: H0: Cβ = d vs. H1: Cβ ≠ d.
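For parts (c) and (d) of Question 1, a minimal numpy sketch of the standard least squares formulas applied to the five observations; the variable names are ours.

import numpy as np

x = np.array([26, 23, 62, 20, 17], dtype=float)
y = np.array([3.2, 1.8, 4.0, 2.3, 4.8])

# beta1_hat = S_xy / S_xx ; beta0_hat = ybar - beta1_hat * xbar
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# ANOVA sums of squares for the table: SST = SSR + SSE.
y_hat = b0 + b1 * x
ssr = np.sum((y_hat - y.mean()) ** 2)
sse = np.sum((y - y_hat) ** 2)
sst = np.sum((y - y.mean()) ** 2)
print(b0, b1, ssr, sse, sst)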


[SOLVED] MAT223H5S - Linear Algebra I - Winter 2025 Make-Up Term Test Statistics

MAT223H5S - Linear Algebra I - Winter 2025 Make-Up Term Test

1
1.1 (2 points) Suppose that A is a matrix with A⁻¹ = , and b = . Determine the unique solution to the equation Ax = b.
1.2 (3 points) Suppose that T : R2 → R2 first rotates the plane counterclockwise by π/2, then reflects across the line y = x. Determine the matrix A_T.
1.3 (6 points) For which values (if any) of c ∈ R is the following set of vectors in R3 linearly independent?

2
Reminder: show your work and justify your steps using only techniques taught in this course.
2.1 (1 point) The equation x − y − 6z = 6 represents a plane in R3. Determine whether the point P = (3, −3, 0) is on the plane or not.
2.2 (1.5 points) Determine the angle between the vectors u = and v =
2.3 (2.5 points) Provide an example, with justification, of a 2 × 2 matrix which is skew-symmetric and invertible, or explain why no such matrix exists.
2.4 (5 points) Find the shortest distance between the lines L1 and L2 with the following equations: You should include a rough drawing illustrating the situation (e.g. it doesn't have to plot the lines accurately); the drawing should show any vectors and points that you compute as part of your solution.

3
3.1 (5 points) Show that U is a subspace.
3.2 (5 points) Determine a basis for U. Show your steps.

4
4.1 (5 points) Let T : R3 → R3 be the transformation given by T
Show that T is linear by verifying the two properties below. Do not simply say that T is a matrix transformation, or use any other technique, or you will receive 0 points.
(1) For all u, v ∈ R3, we have T(u + v) = T(u) + T(v).
(2) For all u ∈ R3 and r ∈ R, we have T(ru) = rT(u).
4.2 (5 points) In the pictures below, the fundamental parallelograms of two linear transformations S, T : R2 → R2 are shown. Determine all eigenvalues (if any) for each of S and T, and for each eigenvalue determine a set of basic eigenvectors for that eigenvalue. Make sure to justify your answers using geometric arguments only (i.e. ones which reference the pictures of the transformations, not any algebra involving the matrices of the transformations). Assume the grid lines are spaced 1 unit apart.

5
Determine if the statements below are true or false. Make sure to justify your answers! You will receive no credit for simply selecting "true" or "false", or for providing little explanation.
5.1 True or False: If x, y, z ∈ R3 and {x, y, z} is linearly independent, then {x, x + y + z, z} is also linearly independent.
5.2 True or False: If A is a 3 × 6 matrix, then dim(null(A)) > 2.
5.3 True or False: If A is a 4 × 4 matrix and the systems (A − I)x = 0 and (A + I)x = 0 each have two basic solutions, then A is diagonalizable.
5.4 True or False: If two planes in R3 pass through the origin and intersect in a line with direction vector d, then d is orthogonal to the normal vectors of both of those planes.
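For Question 1.2, a minimal numpy sketch composing the two maps; under the usual column-vector convention, the rotation is applied first and therefore sits rightmost in the matrix product (our ordering, consistent with the question's wording).

import numpy as np

# Counterclockwise rotation by pi/2, and reflection across the line y = x.
R = np.array([[0, -1],
              [1,  0]])
F = np.array([[0, 1],
              [1, 0]])

A_T = F @ R      # rotate first, then reflect
print(A_T)       # [[1, 0], [0, -1]]: reflection across the x-axis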
