Science Communication and Public Engagement (HPSC0008)
Course Syllabus 2024-25

Course Information
This interdisciplinary course introduces the public dimensions of science and technology. Drawing on sociology, history, and cultural, media and communication studies, it explores the relationship between the professional world of science and the social, cultural and personal spaces in which science contributes to the shaping of society. It also develops students' critical analysis skills with respect to the communication of science in different public contexts, including the news media, museums, fiction and online. Ultimately it aims to develop students' skills in academically interrogating science communication and engagement.

Assessment 1
Word limit: 1,000 words
Contribution to final mark: 35%

EITHER: Writing the Life Scientific
Choose one episode from the BBC podcast The Life Scientific. Write it up as a magazine profile. 100 of your words should state what magazine you are writing for and how your content is angled towards its editorial policy, as well as its targeted audience. You can focus on explaining the science itself, or on the life story of the scientist, or a hybrid of the two. If you are writing about the science itself, make sure you explain how the science was done, not just the 'facts'. If you are writing a life story, avoid writing a straight, chronologically linear biography. There is no need for formal footnotes, though you can hyperlink sources as relevant to the audience. Make it interesting and lively.
Pick one of the below target magazines to write the assignment for:
i. Cocoa - www.cocoagirl.com
ii. OYLA Magazine - https://oyla.uk
iii. New Scientist - https://www.newscientist.com

OR: Defining Moments of Science
Write a blogpost about a defining moment of science (perhaps from global media, or of national relevance, or personal). What message(s) about science was conveyed? How did 'the public' respond to it at the time? What is your reflection on it now? What is its significance for your reader? How do you want to use it to challenge what the reader thinks about science? Make sure your blogpost touches on the themes of power and audience as we have discussed them in the course. There is no need for formal footnotes, though you can hyperlink sources as relevant to the audience. Make it interesting and lively.
Pick one of the below target blogsites to write the assignment for:
i. The Guardian - A newspaper column - https://www.theguardian.com/science
ii. Wired - Popular science and technology blog - https://www.wired.com/category/science
iii. Science Museum - Blog for a museum - https://blog.sciencemuseum.org.uk

Whichever version of the writing exercise you do, the aim is to provide evidence that you have understood the issues of power that have informed the first half of the module. You should never simply provide a descriptive account of the content of the science communication studied. The content of a piece of science communication is only relevant insofar as it allows you to answer more interesting questions about it (not just what it said; anyone can read/watch something to answer that!). Bear in mind also the difference between the research literature and your own experience.
This is particularly important when dealing with popular culture or media, subjects which we are all familiar with and have experience of in our everyday lives. You may experience the mass media and popular culture in one way, and thus form your own opinions about them, but this does not mean that your experiences and opinions are representative of everyone else's. Sociology is about society, not individuals. So be very wary of making statements like "the public will think this…", "this won't make sense to the public…" or "this will make everyone think x". You may feel that way, but unless you have concrete evidence backing up such claims, these are simply unsubstantiated assertions based upon one person's experience. You are at university to study these things in an academic and critical manner, so you should always ground your arguments and observations within the academic literature you have read. You should therefore justify your arguments through such mechanisms as sourcing, citing data, referencing, providing logical justification, etc. There is nothing wrong with having personal opinions concerning an issue, but we want to see that you have engaged with the context and issues rather than simply writing a polemical, one-sided and unsubstantiated editorial on the topic! If you want to bring your own opinions or values to bear on your research, you need to make sure that you reflect on how these articulate with other viewpoints or values from within the literature.

Assessment 2: Exam
Examples of exam questions:
1. Analyse example/s of either flat-earth, climate-sceptic or anti-vaxx activity and characterise participants' engagement with science. Do they present themselves as denying science or as practising science? Would you characterise this as evidence of the 'deficit model'? Why or why not?
2. Which of the three elements of scientific literacy (subject knowledge, knowledge creation, and disciplinary policing) are treated in the media, and how?
3. In what way(s) have the public been (re)presented by science communicators? What is the ideal way to conceive of the public for the purposes of public engagement?
4. Is citizen science real participation in science, or is it just free labour/just pretend?

Module aims & objectives
Aims
The course aims to impart knowledge and understanding, at an introductory level, of:
• Concepts in public understanding of, and engagement with, science
• Public spaces for science, including the mass media, science museums and everyday life
• Cultural, social and political issues around science communication

Objectives
By the end of this module students should have:
• Knowledge and understanding of the basic concepts and scope of science communication
• A broad understanding of the cultural, social and political issues around science in public
• Skills in written and spoken communication
• Skills in relating personal experience to the ideas, tools and values of academic research
• Skills in the recognition, collection and analysis of research materials
• Skills in argumentation, listening and constructive dialogue
• Confidence in contributing in class
HPSC0007 Investigating the Sociology and Politics of Science
Course Syllabus 2024-25

Course Information
In this module we will read and discuss some of the foundational work in the classical and post-classical discipline of sociology, with particular attention to the ways in which this has informed research within Science and Technology Studies. At the same time, we will situate and critique classical sociology within modernist and colonialist projects, and look at emerging post-sociological frameworks for understanding science and technology in society: decolonial, posthuman, queer and care-full.

Aims & Objectives
Aims:
The aim of this module is to introduce students to sociologically foundational literatures for science and technology studies (STS). The first half of the module broadly covers classical theorists, while the second part deconstructs their frameworks from decolonial, posthuman, queer and care-full perspectives. We consider how recent and contemporary STS topics are informed by both classical and decolonial theory.

Objectives:
By the end of this module students should be able to:
• Identify and explain key concepts in classical and post-classical sociology;
• Understand how STS has developed theoretically through engagement with classical and post-classical sociology;
• Understand the critical challenge brought to key classical sociological concepts by decolonial, queer, posthuman and care-full approaches;
• Understand at least one example of how STS has integrated the challenge of critique;
• Create relevant and critical bibliographies in the sociology of science;
• Present their work effectively in written formats;
• Apply the knowledge gained to interrogate the imprint of power and domination in our daily lives.
PADM-GP 4505: R Coding for Public Policy

R is among the most popular programming environments in the new generation of powerful and versatile software used in public policy and other research settings. Contemporary data engineering and analysis skills used in quantitatively rigorous public policy research depend on a number of interlocking tools available in R. The goal of this course is to lead students into the R world, foster mastery of the basic tools, approaches, and critical thinking therein, and establish a firm platform for future independence in these tools. This course offers students basic programming, data engineering, and data analysis skills in R, particularly focused on processing and manipulation techniques, statistical insights and visualization, and scientific reproducibility. Material is framed in the context of public policy-making and policy evaluation, with a particular emphasis on balancing theory with implementation.

Course and Learning Objectives
Students who successfully complete this course will install R and RStudio and become familiar with the IDE, understand and utilize core R concepts such as objects and commands from a number of key libraries including the tidyverse, and utilize best practices for project reproducibility and management. The course will emphasize use cases for R, focusing on cleaning, exploring, and analyzing data.

Takeaways
Upon completion of the course, you will be able to:
1. Install and set up R and RStudio
2. Find, install, and use many R packages
3. Understand basic programming concepts and how they apply to the R language
4. Read, manipulate, and clean data
5. Plot simple, clear graphics for analysis and communication
6. Conduct regression using R packages
7. Apply data management best practices using R
8. Demonstrate additional insight into software ecosystem elements like LaTeX and GitHub

Instructor
Emil Hafeez, MS, MSPH
Assistant Adjunct Professor of Public Service at NYU Wagner
Associate Research Scientist at NYU Langone Health
[email protected]

Class Sessions
Global Center for Academic & Spiritual Life (GCASL), Room 261
105 E 17th St, Room 120
New York, NY 10013
1/21/2025 - 03/04/2025, 6:45 PM - 8:25 PM

Office Hours
By appointment, via Brightspace Zoom invitation

Learning Resources
Hardware
● Access to a computer and an internet connection is necessary for completing the course goals, and as such, bringing a laptop to each class is important enough to make it mandatory. If you are not able to bring a laptop to class, contact Emil at [email protected]. Please note that there are available resources from NYU, including here.

Software
● R Core Team (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
● RStudio Team (2020). RStudio: Integrated Development for R. RStudio, Inc., Boston, MA. URL http://www.rstudio.com/.

Recommended References
● R for Data Science (2e) by H. Wickham, M. Çetinkaya-Rundel, and G. Grolemund:
○ Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (2023). R for Data Science. O'Reilly Media.
● ggplot2 by Hadley Wickham:
○ Wickham, H. (2023). ggplot2: Elegant Graphics for Data Analysis. Springer.
● Exploratory Data Analysis with R by R. Peng:
○ Peng, R. D. (2020). Exploratory Data Analysis with R. Springer.
● R Programming for Data Science by R. Peng:
○ Peng, R. D. (2022). R Programming for Data Science. Leanpub.
● Advanced R by H. Wickham:
○ Wickham, H. (2023). Advanced R. Chapman and Hall/CRC.
● R Graphics Cookbook by W. Chang:
○ Chang, W. (2023). R Graphics Cookbook. O'Reilly Media.
● R Cookbook by J. D. Long and P. Teetor:
○ Long, J. D., & Teetor, P. (2019). R Cookbook. O'Reilly Media.
● Pro Git by S. Chacon and B. Straub:
○ Chacon, S., & Straub, B. (2023). Pro Git. Apress.
● R Packages by H. Wickham:
○ Wickham, H. (2020). R Packages. O'Reilly Media.
● The Internet (StackOverflow, Google, blog posts)
● R's `?` operator

Brightspace
All announcements, resources, and assignments will be delivered through the Brightspace site. I may modify assignments, due dates, and other aspects of the course as we go through the term, with advance notice provided as soon as possible through the course website. You are expected to check Brightspace and your student email account regularly.

Assignments and Evaluation
The course grade is based on the following individual assessment:
● 5 Assignments: 90%
● Participation: 10%
The instructions for each assignment will be released after the lecture, and the assignment is due before class the following week or, when course/scheduling circumstances require, by a deadline specified directly. Participation refers to your presence and engagement in class, utilization of available resources to perform well in class and on assignments, and responding to any class surveys and/or Brightspace queries posted by the teaching team.

Late Work Policy
Points will be deducted from late submissions: 10% of full credit for each late day, starting the day after the assignment deadline on Brightspace. Extensions will be granted only on a case-by-case basis and in case of emergency, always requiring written confirmation by email prior to the due date. Reach out to Emil at [email protected].

Collaboration Policy
The course policy on collaboration applies to all assignments and is as follows. The course goal is to scaffold students' future independence and mastery of R by supplying challenging but fundamental tasks; accordingly, it's recognized that learning rarely occurs in isolation, and many data-oriented professional workflows occur in teams. Students are encouraged to collaborate in a way that improves their own individual understanding of the material without derogating from anyone else's. This may include reviewing lecture material and recommended references, reviewing examples, and even coarse-grained outlines of assignment problems. However, note that fair assessment of each individual is critical to any course, and as such any work submitted in completion of an assignment must reflect the individual's own skill, effort, and development. You are not allowed to copy code, literally or otherwise. This includes code from classmates, code found online or in references, or code generated by LLMs like ChatGPT, Gemini, Claude, Meta AI, or similar; see the Policy below. If you're unsure about what this means or if you do not know if something is permitted, you're welcome to email me, Professor Hafeez, at [email protected]. My guess is that if you have to ask, it's probably not allowed, but I'm happy to discuss. If you're having trouble keeping up with the course and want to stay consistent with the collaboration policy, it's recommended you attend an office hour or reach out to the teaching team for advice and support.
Policy for Generative AI, LLMs, and Related Software
In this course, we adopt conditional AI tool usage in alignment with NYU's Academic Code. Generative AI tools are permitted for specific uses within this course. They may be employed for tasks such as background research, ideation and reflection on computer science or statistics theory, and text editing or proofreading. However, the use of AI tools for generating novel drafts of text and for editing the functionality of code is strictly forbidden. Any usage of an AI tool must be clearly cited within your work.

Grading Scale and Rubric
Students will receive grades according to the following scale:
• A = 4.0 points
• A- = 3.7 points
• B+ = 3.3 points
• B = 3.0 points
• B- = 2.7 points
• C+ = 2.3 points
• C = 2.0 points
• C- = 1.7 points
• There are no D+/D/D- grades
• F (fail) = 0.0 points

Student grades will be assigned according to the following criteria:
(A) Excellent: Exceptional work for a graduate student. Work at this level is unusually thorough, well-reasoned, creative, methodologically sophisticated, and well written. Work is of exceptional, professional quality.
(A-) Very good: Very strong work for a graduate student. Work at this level shows signs of creativity, is thorough and well-reasoned, indicates strong understanding of appropriate methodological or analytical approaches, and meets professional standards.
(B+) Good: Sound work for a graduate student; well-reasoned and thorough, methodologically sound. This is the graduate student grade that indicates the student has fully accomplished the basic objectives of the course.
(B) Adequate: Competent work for a graduate student even though some weaknesses are evident. Demonstrates competency in the key course objectives but shows some indication that understanding of some important issues is less than complete. Methodological or analytical approaches used are adequate, but the student has not been thorough or has shown other weaknesses or limitations.
(B-) Borderline: Weak work for a graduate student; meets the minimal expectations for a graduate student in the course. Understanding of salient issues is somewhat incomplete. Methodological or analytical work performed in the course is minimally adequate. Overall performance, if consistent in graduate courses, would not suffice to sustain graduate status in "good standing."
(C+/C/C-) Deficient: Inadequate work for a graduate student; does not meet the minimal expectations for a graduate student in the course. Work is inadequately developed or flawed by numerous errors and misunderstanding of important issues. Methodological or analytical work performed is weak and fails to demonstrate knowledge or technical competence expected of graduate students.
(F) Fail: Work fails to meet even minimal expectations for course credit for a graduate student. Performance has been consistently weak in methodology and understanding, with serious limits in many areas. Weaknesses or limits are pervasive.
Style Note
The use of proper style is important in this course and when writing code; for that reason, in addition to the points allocated for questions within particular homework assignments, points can be deducted for coding and style issues including but not limited to:
● Incomplete sentences and particularly sparse explanation
● Poor text formatting and legibility
● Inappropriate spacing and indentation in code
● Unclear variable naming conventions
● Disorganized code and file structure
Whitespace and organization are covered by most R style guides, including the Advanced R textbook. Occasional typos won't be penalized, but consistent or severe legibility errors may be.

Overview of the Course
Expect to come to class with a laptop and participate in some combination of lecture, demonstration, practical implementation, troubleshooting, and collaborative work. Each week with a deliverable listed, you can expect to turn in that assignment before class.
• Week 1
o Topic: Introduction to R and RStudio
o Deliverable: Install R and RStudio
• Week 2
o Topic: Data objects, functions, and the tidyverse
o Deliverable: Assignment 1
• Week 3
o Topic: Application I: Data quality & cleaning data
o Deliverable: Assignment 2
• Week 4
o Topic: Graphics in R
o Deliverable: Assignment 3
• Week 5
o Topic: Application II: Exploratory data analysis
o Deliverable: Assignment 4
• Week 6
o Topic: R for analyses
o Begin Assignment 5
• Week 7
o Topic: Other real-world tools & applications
o Deliverable: Assignment 5
Smart Industry Operations 2024-2025
Individual (Repair) Assignment: Bias in Healthcare Operations

Context of the Assignment
Diagnosing severe schizophrenia correctly is important for individuals, families, and healthcare systems: it allows the condition to be addressed effectively and improves outcomes. There is evidence that clinicians may overemphasise psychotic symptoms or underemphasise depressive symptoms in Black African Americans in the US. This can be attributed to diagnostic bias. Misdiagnosis may have harmful effects, leading to disparities in healthcare provision. The context of this assignment is to study the extent to which a correct categorisation of diagnostic cases between schizophrenia and depression can be achieved based on a range of potential diagnostic attributes. Additionally, of further interest is to identify whether there might be diagnostic disparities between sensitive characteristics, such as gender (sex) and race. Such diagnostic disparities may have implications for disparities in healthcare provision, resulting in healthcare service inequalities. Diagnostic disparities may mask how sensitive attributes intersect with other important factors, such as family expenditure on healthcare, for example. If bias already exists within electronic patient records, it may eventually be amplified when machine learning models are operationalised, despite the models being agnostic of the presence of such disparities. In the considered case, a machine learning model may be considered by a healthcare provider eager to increase efficiency to be sufficiently accurate for operationalisation if overall accuracy exceeds 90%. Your task is to analyse the case data, build machine learning models, and make concrete recommendations, based on evidence obtained from your experiments, with respect to operationalising the use of your machine learning models.

Case Data
We consider gender and race as two sensitive features in a dataset that contains electronic healthcare data of people who have been diagnosed with a condition, but where it is uncertain whether the diagnosis should be schizophrenia or depression. The dataset contains records with the following attributes:
Diagnosis: affective disorder (0) or schizophrenia (1). This is the output attribute. The rest are input attributes and include:
Sensitive features:
Sex: takes the value Male or Female
Race: takes the values Asian, Hispanic, Black, White
Psychosocial features:
Delay: denoting delay in seeking care (takes values Yes/No)
Housing: takes values Stable or Unstable, denoting the housing status of the individual
In the attributes below, a clinician may be from different disciplines, or an attribute may refer to rating by different types of clinicians; for simplicity a single type of clinician is mentioned here.
Anhedonia: clinical assessment indicating inability to experience enjoyment in activities typically perceived as enjoyable or fulfilling; rated by a clinician.
Dep_Mood: clinical assessment indicating a persistent state of sadness, low energy, or emotional heaviness; rated by a clinician.
Sleep: average hours of sleep per day; rated by the patient.
Tired: whether the patient feels tired or not; rated by the patient.
Appetite: the extent to which the patient has a good appetite; rated by the patient.
Rumination: the extent to which the patient is trapped in the same thoughts; rated by a clinician.
Concentration: the ability of the patient to concentrate; rated by a clinician.
Psychomotor: the extent of abnormalities in how a patient's movements follow a thought/mental process; assessed via standardised tests by a clinician.
Delusion: the extent to which false beliefs feature in a patient's thinking; rated by a clinician.
Suspicious: the extent to which a patient is unreasonably over-suspicious and distrusts others; rated by a clinician.
Withdrawal: the extent to which a patient is in a state of disengagement from social interaction and activities; rated by a clinician.
Passive: the extent to which a patient feels a lack of control over their own thoughts; rated by a clinician.
Tension: the extent to which a patient is in a state of unease, strain or agitation; rated by a clinician.
Unusual_Thought: the extent to which a patient has thoughts, beliefs or perceptions that significantly deviate from what is typically expected in a given social or cultural context; rated by a clinician.

The provided datasets are:
A. diagnosis_train.csv: a dataset populated with all of the above information. You will use this for your analysis and for training classification models.
B. diagnosis_predict.csv: a similar dataset to A, but one for which you have no access to the outcome. You will use this to classify unknown cases as schizophrenia or depression.

Assignment Questions

A.1. Exploratory Data Analysis (15% of Repair Assignment mark)
In this part you are expected to:
A1.1. Explore the variables, their types, and their basic statistics.
A1.2. Analyse the data further regarding data distributions, range of values, existence of outliers and correlations between attributes, as well as between input attributes and Diagnosis. What are your observations? Additionally, to what extent is the dataset balanced regarding the different categories of the sex and race sensitive attributes? To what extent is the dataset balanced regarding the diagnosis per different categories of the sex and race sensitive attributes?

A.2. Classification (40% of Repair Assignment mark)
In this part you are expected to develop classifier models. You will have to consider how best to use your training data (diagnosis_train.csv), and you are asked to apply the developed models to the diagnosis_predict.csv data at the end and produce diagnosis predictions for them.
A2.1. Apply a decision tree classifier, choosing different hyperparameters (as a minimum, different tree depths), on the diagnosis_train.csv data. Motivate your solution and analysis in relation to overfitting and generalisation. Report and analyse performance using different performance metrics. Analyse your findings. Do you observe any difference in performance for different sensitive attribute categories (e.g. sex, race)? Finally, choose a developed model and apply it to the diagnosis_predict.csv data to produce your diagnosis predictions.
A2.2. Apply a random forest classifier, choosing different hyperparameters (as a minimum, different numbers of estimators and tree depths). Motivate your solution and analysis in relation to overfitting and generalisation. Report and analyse performance using different performance metrics. Analyse your findings. Do you observe any difference in performance for different sensitive attribute categories (e.g. sex, race)? Finally, choose a developed model and apply it to the diagnosis_predict.csv data to produce your predictions.
A2.3. Make a comparative analysis across all classifier experiments. Make a reasoned choice of a classifier to select, and motivate the choice by referring to the evidence obtained from performance metrics. (A minimal code sketch illustrating the kind of workflow involved appears below.)
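To make the expected workflow concrete, the following is a minimal, non-authoritative sketch in Python, assuming a pandas/scikit-learn Jupyter environment and the column names listed above; it is an illustration of the kind of steps involved, not the required solution:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Load the training data; categorical inputs are one-hot encoded for the sklearn models.
df = pd.read_csv("diagnosis_train.csv")
X = pd.get_dummies(df.drop(columns=["Diagnosis"]))
y = df["Diagnosis"]

# Hold out part of the labelled data to judge generalisation (A2.1/A2.2).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)

models = {
    "tree_depth3": DecisionTreeClassifier(max_depth=3, random_state=0),
    "tree_depth10": DecisionTreeClassifier(max_depth=10, random_state=0),
    "rf_100trees": RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0),
}

for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    print(name, "accuracy:", round(accuracy_score(y_te, pred), 3))
    # Per-group metrics for one sensitive attribute (race); repeat for sex or combinations.
    for race in df["Race"].unique():
        mask = X_te[f"Race_{race}"] == 1  # one-hot column created by get_dummies
        if mask.sum() == 0:
            continue
        p = precision_score(y_te[mask], pred[mask], zero_division=0)
        r = recall_score(y_te[mask], pred[mask], zero_division=0)
        print(f"  {race}: precision={p:.2f} recall={r:.2f} (n={mask.sum()})")
```

A fuller answer would also tune hyperparameters systematically (e.g. via cross-validation), compare training versus held-out scores to discuss overfitting, and finally apply the chosen model to diagnosis_predict.csv.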
A.3. Bias Analysis and Management (35% of Repair Assignment mark)
In this part you are expected to further analyse the data and the results you obtained regarding potential bias. Specifically, answer the following questions:
A3.1. Consider your results above. For which combination of sensitive attributes (sex, race) did you observe the largest diagnostic disparity (meaning the largest difference between precision and recall)? And for which combination did you observe the smallest diagnostic disparity?
A3.2. Now choose the combination with the largest diagnostic disparity. From your training data, retain only the data corresponding to this combination of sensitive attributes. Build the same types of model as in A2.1 and A2.2 using only these data and perform a similar analysis as in A2.1 and A2.2. What are your observations and how do you interpret your results?
A3.3. Apply resampling of the data records for the selected combination of sensitive attributes to balance precision and recall. Perform the same machine learning as in A3.2. Report and analyse the results. What are the observed differences with respect to diagnostic disparity? (A minimal resampling sketch appears at the end of this brief.)
A3.4. Without applying resampling, can you think of and apply an alternative method to improve the diagnostic disparity?

A.4. Overall Comparisons and Analysis (10% of Repair Assignment mark)
In this part you are expected to discuss the obtained results comparatively, highlighting only what you see as most interesting regarding the obtained performance and/or aspects of data imbalance and fairness, motivating your analysis on the basis of the obtained evidence. What would be your concluding recommendations?

Further Instructions
In this assignment you will address the questions provided. The assignment is delivered as a Jupyter Notebook. Jupyter Notebooks do not have a pre-specified length; however, a good notebook should be at the same time sufficiently explanatory and relatively compact. It should include insightful motivation, analysis and interpretations, grounded in evidence from the data, the processing of the data you performed, and the results you have obtained.
You should submit this assignment via email to: [email protected]
You should receive receipt confirmation within 24 hours. If not, this might indicate that your assignment was not received, in which case please submit again or enquire about it.
Submission deadline: 29 January, 23:59
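As flagged in A3.3 above, here is a minimal, illustrative resampling sketch (Python, assuming the same pandas/scikit-learn setup as the earlier sketch; the chosen subgroup is hypothetical and should be replaced by your own A3.1 finding). Balancing class counts by oversampling within the subgroup is just one possible approach:

```python
import pandas as pd
from sklearn.utils import resample

df = pd.read_csv("diagnosis_train.csv")

# Hypothetical worst-disparity subgroup from A3.1 -- substitute your own finding.
sub = df[(df["Sex"] == "Female") & (df["Race"] == "Black")]

# Oversample the minority diagnosis class within the subgroup so both classes
# have equal counts; this is one simple way to push precision and recall closer.
majority_label = sub["Diagnosis"].mode()[0]
majority = sub[sub["Diagnosis"] == majority_label]
minority = sub[sub["Diagnosis"] != majority_label]
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=0)
sub_balanced = pd.concat([majority, minority_up]).sample(frac=1, random_state=0)

print(sub["Diagnosis"].value_counts())
print(sub_balanced["Diagnosis"].value_counts())
# Then retrain the A2.1/A2.2 models on sub_balanced and re-compare precision vs recall.
```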
BIO 2101 Comprehensive Biology Laboratory
Exercise #3a: Culture of Animal and Plant Cells

In vitro cell culture systems enable the study of:
- Cell growth and division
- Cell differentiation
- Genetic manipulations for gene structure/function studies
- Biotechnology
Cells used may be either:
- Primary cultures, or
- Immortalized cell lines

Isolation of cells
Cells can be released from soft tissues by enzymatic digestion with proteases and collagenases, OR pieces of tissue can be placed in growth media, and the cells that grow out are available for culture (explant culture).

Primary cells vs. immortalized cells
Cells that are cultured directly from a subject are known as primary cells. With the exception of some derived from tumors, most primary cell cultures have a limited lifespan. An established or immortalized cell line has acquired the ability to proliferate indefinitely, either through random mutation or deliberate modification. When a primary culture is sub-cultured, it is known as a secondary culture, cell line, or sub-clone. Sub-culturing of primary cells through successive divisions leads to the generation of cell lines.
1. Cell lines derived from primary cultures of normal cells are finite cell lines.
2. When a finite cell line undergoes transformation and acquires the ability to divide indefinitely, it becomes an established or immortalized cell line.

Preparation of cell cultures from tissues
Most researchers "obtain" cell lines from ATCC or from other researchers.
Transformation of cultured cells implies a spontaneous or induced permanent phenotypic change resulting from a heritable change in DNA and gene expression. Transformation often involves the deletion or mutation of the p53 gene, which would normally arrest cell cycle progression if DNA were to become mutated, and overexpression of the telomerase gene.

Cell lines
- HL-60 (human leukemia cells)
- NG108-15 (mouse neuronal cells)
- Swiss 3T3 (mouse embryonic fibroblast)
- HEK293T (human embryonic kidney fibroblast)
- MG63 (human bone fibroblast)
- more …
Fibroblasts are the most common cells of connective tissue in animals. They synthesize the extracellular matrix and collagen, the structural framework (stroma) for animal tissues.

General culture requirements
- Appropriate temperature and gas mixture (typically 37 °C, 5% CO2) in a cell incubator
- pH 7.2-7.5 (buffered by sodium bicarbonate)
- Humidity is required
- Glucose, growth factors, and the presence of other nutrient components
Culture conditions vary widely for each cell type, and variation of conditions for a particular cell type can result in different phenotypes being expressed.

Manipulate in a biological safety cabinet ("tissue culture hoods") to prevent contamination
• Tissue culture experiments are typically carried out in special workstations called "tissue culture hoods"
• Such hoods provide personal protection from harmful agents within the cabinet
• They reduce the risk of microbial contamination
• They provide environmental protection from contaminants contained within the cabinet

Class 2 Biological Safety Cabinets
The Class 2 biological safety cabinet must meet the requirements for personnel, environmental and product protection. It is used when working with low- to moderate-risk biological agents (Biosafety Level 1, 2 and 3 agents). Examples include Salmonellae, Hepatitis B virus and Measles virus. When you are working with very high-risk biological agents, Class 3 cabinets should be used.
HPSC0006 Science, Policy and Politics
Course Syllabus 2025

Course Information
This course introduces ways of thinking about the role of science and technology in policy and the relationship between science and government. Science plays a vital role in shaping policy and society. At the same time, social, cultural and political forces shape the production of scientific knowledge. We will focus on developments in science policy, using case studies and current theory in science policy research and STS, asking questions such as: What is the role of the state in regulating, promoting and funding science? What makes an expert? Should scientists be the only ones to make decisions about the direction of scientific research?

Aims & objectives
This course aims to introduce students to social and political thinking about science. Students will explore a range of case studies against a backdrop of theory in order to understand science as a social and political process; how science is funded; what science policy is and how it affects our lives; and how decisions about science and technology are made; as well as thinking about questions such as: What makes an expert? Should scientists be involved in the policy-making process on science and technology? And to what extent should scientists be held to account in terms of their research?

By the end of this course students will:
• Be able to identify the main themes of science policy studies
• Be able to criticise popular but simplistic notions of the relationship between science, technology and society
• Have detailed knowledge of several case studies in science policy (and, in particular, the social and political dimensions of the cases)
• Have developed research skills through the seminar work and course assessment
Module name: Advanced Digital Design
Module code: ENGD3001
Title of the Assignment: Assignment 1
This coursework item is: Formative
This coursework will be marked anonymously: Yes
The module learning outcomes that are assessed by this coursework are:
1. "Knowledge and specialist analytic development techniques in the areas of VLSI design, ASM design and implementation, and VHDL design."
2. "Development of generic and transferable skills in advanced digital system design methodologies using industry standard design tools."
This coursework is: Individual
This coursework contributes 40% to the overall module mark.
Date Set: 18 Nov 2024 (Week 8)
Date & Time Due: by 12:00 noon on Monday, 17 Feb 2025 (Week 21)
When completed you are required to submit the following: an electronic copy of your assignment via Learning Zone by the advertised deadline. Please note that you can only make one submission, and once a submission is made it is final. No resubmissions or later additions are allowed under any circumstances, and by any means, so please double-check that your report is correct and complete before submitting it.

Assignment 1
Design a 4-bit universal decimal counter in VHDL using behavioural modelling, as specified below:
LD – Synchronous Parallel Load
D3,…,D0 – Parallel Data Inputs
Q3,…,Q0 – Data Outputs
RST – Asynchronous Reset Input
UD – Count Direction (up/down)
The operation of the universal counter is described by the following function table:

RST | LD | UD | Action
 0  | x  | x  | Asynchronous Reset
 1  | 0  | 0  | Count Down
 1  | 0  | 1  | Count Up
 1  | 1  | x  | Synchronous Parallel Load

Simulate this design with the aid of a 'graphical testbench' (also known as a "University Program VWF" file), using the Intel Quartus Prime Lite v23.1 software. It is a mandatory requirement of this assignment to use the correct software and the correct type of testbench. Failure to adhere to any of the mandatory assignment requirements will result in a mark of zero.

What you should submit
You should submit a formal report explaining your design and your results. Specifically, your report should contain at least the following information:
a) An introduction including the design brief,
b) A background section on counters, their types and their operation,
c) A section explaining how you've solved the design task given to you and, if applicable, why you've selected a particular solution out of several possible ones,
d) The complete listing of the code you've written, bearing in mind good programming and design practice,
e) A legible screenshot of your graphical testbench (or testbenches if using more than one),
f) The results of the simulations carried out (i.e. suitable, legible and detailed simulation waveforms) accompanied by detailed comments and explanations, and
g) Conclusions (and possible further improvements if applicable). A hedged reference model of the function table follows this list.
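To help sanity-check simulation waveforms against the function table, here is a minimal Python reference model of the counter's behaviour. This is an illustrative sketch only: the assignment deliverable must be written in VHDL, and the class and method names here are my own. Note that in this per-clock-edge model the asynchronous reset is approximated by simply checking RST first:

```python
class UniversalDecimalCounter:
    """Reference model of the 4-bit universal decimal (mod-10) counter."""

    def __init__(self):
        self.q = 0  # counter state, 0..9

    def step(self, rst, ld, ud, d):
        """Advance one clock edge. Per the function table, RST = 0 resets
        (and in hardware does so asynchronously, regardless of the clock);
        ld and ud are sampled synchronously; d is the 4-bit load value."""
        if rst == 0:
            self.q = 0                  # reset dominates everything else
        elif ld == 1:
            self.q = d % 10             # synchronous parallel load
        elif ud == 1:
            self.q = (self.q + 1) % 10  # count up, wrapping 9 -> 0
        else:
            self.q = (self.q - 1) % 10  # count down, wrapping 0 -> 9
        return self.q

# Example: load 8, then count up twice (expected sequence: 8, 9, 0).
c = UniversalDecimalCounter()
print(c.step(rst=1, ld=1, ud=0, d=8), c.step(1, 0, 1, 0), c.step(1, 0, 1, 0))
```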
Principles of Microeconomics

5. (Sugar Tax) We will consider two goods: Sugar Drink (X-axis) and Other "Healthier" Foods (Y-axis), assuming the prices of both goods are initially $1 each. Suppose you have $20 to spend. The City imposes a tax of $1 per Sugar Drink. The following table shows the relevant information on consumption behavior and utility levels.

               Px    X    Py    Y      Income   Utility Level
Before Tax     $1    10   $1    10     $20      100
After Tax      $2    5    $1    10     $20      50
Hypothetical   $2    8    $1    12.5   $28.50   100

a) (4 points) Depict the scenarios before and after taxation, and the hypothetical point, on the budget lines and indifference curves.
b) (2 points) Do we have enough information to infer whether Sugar Drink is a normal or an inferior good?
c) (6 points) In addition to taxing Sugar Drink, the City gives you additional income of $10 as goodwill, and hopes that you would spend it on the "Healthier" choice. Would you spend this $10 only on the "Healthier" choice? Would you be better off or worse off than before the taxation? Why? (Hint: With the $10 additional income, what is the budget constraint? Draw this budget constraint on a copy of the graphs you drew in part a), paying special attention to scale and numbers.)
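For reference, the budget lines implied by the three scenarios above (a worked restatement of the given prices and incomes, not a solution to the graphing task itself):

```latex
\begin{align*}
\text{Before tax:} \quad & 1\cdot X + 1\cdot Y = 20 \\
\text{After tax:} \quad & 2X + Y = 20 \\
\text{After tax, with \$10 transfer:} \quad & 2X + Y = 30
\end{align*}
```

As a quick check, the after-tax bundle $(X, Y) = (5, 10)$ satisfies $2(5) + 10 = 20$.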
Assignment #1: Personal Listening History
Listening to Music (MUZA99H3S), Winter 2025

Assignment Details
Key Details
Submission: Quercus, PDF file
Words: 500 (i.e., two pages double-spaced, 12-point font)
Weighting: 5 points (i.e., 5% of your total grade for the course)

General Description
This assignment is designed to engage your personal listening history through reflection on your musical experiences and preferences. The pedagogical goal of this assignment is to prime your mind for the learning and listening we will do throughout the course.

Specific Tasks
Answer each of the following 10 questions, in order. Write your answers using complete sentences (i.e., not point-form). Please make it clear that you are answering each question in your writing (i.e., re-state each question, then write your answer underneath the question), so that the TA grading your work doesn't deduct points because they think you haven't answered all the questions. Prompts have been provided under each question in order to help you generate ideas. You may choose to incorporate these prompts, or take your answers in another direction.

Question #1: What is your earliest sound memory?
Prompts: Music? Someone's voice? Another sound?
Question #2: What genre(s) of music surrounded you in your childhood?
Prompts: Where you lived? At school? In popular culture?
Question #3: What was your first experience witnessing live music?
Prompts: At a family gathering? A concert? A religious event?
Question #4: What is your experience with formal music education?
Prompts: At school? In a choir? 1-on-1 lessons?
Question #5: What is your experience with informal music education?
Prompts: Learning from family? From friends? From social media videos?
Question #6: In which situations do you currently listen to music?
Prompts: In transit? While studying? While eating at restaurants?
Question #7: What is the most powerful silence that you have ever experienced?
Prompts: At a memorial service? During a speech? In nature?
Question #8: What is your current favourite song/piece of music?
Prompt: …I know it's tough to pick just one!
Question #9: Who is your current favourite musical artist?
Prompts: A band? An individual performer? A producer or songwriter?
Question #10: What is your current favourite musical genre?
Prompt: Be as unspecific (e.g., metal) or specific (e.g., mid-1980s instrumental London doom metal) as you'd like.

*Note that you only have to respond to the 10 questions; you DO NOT have to respond to the prompts below each question. Prompts have only been provided to help you better understand the 10 questions, and to help you generate ideas for your answers.
Department of Accounting and Business Analytics
BTM 211/611 Management Information Systems
Syllabus – Winter 2025

B. COURSE DESCRIPTION AND OBJECTIVES
This course introduces aspects of information systems from a business perspective:
● Introduction to Business Technology Management: what it means to manage technology.
○ Includes a high-level exposure to the many aspects of BTM and how organizational strategies, missions and goals can guide and influence technology decisions.
● Database Management Systems, including technical foundations.
● System implementations: Business Process Modelling and Information Systems Development, including Systems Analysis and Design.
● Business analytics and artificial intelligence: an exposure to these topics and how they are becoming more important in the daily operation of organizations large and small.
● Emerging topics in Information and Communications Technology Management & Strategy.
We will examine how information systems are used in business organizations and some of the managerial, organizational and social implications. This will provide an appreciation of the major challenges that we face today in applying information technology effectively. The material will provide us with an understanding of and a basic foundation for computing and systems literacy. Most importantly, we will develop the critical thinking and problem-solving skills that are necessary to solve complex business problems.
The course schedule outlines the topics and requirements for the course. Please note that the outline is a plan, and this plan is subject to change (e.g. changing the delivery of classes or content due to the pandemic or other unforeseen circumstances).

C. COMPETENCY GOALS
1. At the end of this course, you will have developed the following course-specific skills or knowledge:
● Gain a basic understanding and awareness of a variety of BTM topics that are expanded on in other 400-level classes.
● Learn about database structure and construction, and about how business systems rely on them.
● Learn how information systems are developed, and experience the development of a simple system (which includes a database and SQL queries for creating tables and retrieving information).
● Learn how to use Tableau to visualize data and publish the visualization to a web-based dashboard.
● Discuss emerging topics in information technology, especially as they relate to information and communications technology management and strategy, including analytics, artificial intelligence, security, privacy and ethics.

2. This course incorporates the Competency Goals of the BCom Program, in particular:
Goal 1: Business Concepts and Theories
Description: Students will have expertise in utilizing business knowledge in evolving local and global contexts.
Outcome: Demonstrate expertise in utilizing business knowledge in evolving local and global contexts.
Goal 2: Critical Thinking
Description: Students will have the ability to think critically for informed decision-making.
Outcome: Demonstrate ability to navigate business challenges by evaluating information for effective decision-making.
Goal 3: Responsible Business Principles
Description: Students will have expertise in responsible business principles to achieve societal impact.
Outcome: Evaluate business decisions using responsible business principles including governance, sustainability, and ethics.
Goal 4: Professional Skills
Description: Students will excel as business professionals, skilled in communication and collaboration.
Outcomes: Deliver effective oral communication and written business documents using appropriate technology; demonstrate ability to collaborate effectively in diverse teams in order to achieve goals.

3. Final grading in this class is done on the basis of individual student achievement of the course and program outcomes. These outcomes are measured by the following assessments:

Assignments & In-Class Work
● Business Concepts and Theories, Entrepreneurial Thinking and Business Communication skills will be applied to some assignments and in-class questions.
● Assesses Business Concepts and Theories and the grasp of quantitative material (DBMS design, business analytics integration, and project planning and implementation) over the entire course.
● Assesses Ethical Awareness related to interpretation of different ethical viewpoints.

Final Exam
● Covers high-level knowledge of all the in-class lectures through the semester.
● Assesses Business Concepts and Theories, Entrepreneurial Thinking, Business Communication, and Ethical Awareness related to interpretation of different ethical viewpoints.
CLAS 207 Roman Social History
A study of the main features of Roman social history from the time of Augustus to AD 200. Topics include class structure, law, education, the family, slavery, poverty and public entertainment. Offered in alternate years.

Course content
This course explores, by way of lectures and tutorials, the realities and ideologies of Roman daily life, including family life, living conditions, disease and medicine, sex and sexuality, entertainments, slavery, the role and rights of women and children, religion, death and disposal.
In 2025, this course will be delivered primarily on campus, with online accessibility. Most students will attend on campus; however, the course can be completed online if needed. If you intend to take the course mostly on campus, please select the offering CRN 2128.

Course learning objectives
Students who pass this course will be able to:
1. Show that they are aware of the basic structures of Roman society, such as the economic system and family construction, in the first two centuries AD.
2. Show they possess a basic vocabulary of Roman social institutions (for instance, the key terms describing relationships in the Roman household or types of slavery).
3. Have a general understanding of the evidence for Roman society and its limitations.
4. Apply simple concepts derived from modern systems of analysis (e.g. from sociology or demography), such as status or life expectancy, to ancient evidence in order to understand these features in their historical context.
5. Recognize the differences between Roman society and modern societies (e.g. 21st-century New Zealand).
Assignment Title: Applications of MOFs infographic
Module Code: 6CCC0085
Module Title: Advanced Topics in Chemistry 2
Semester: Semester 2, 24-25
Submission deadline: 29th Jan 2025 (3 pm)
Marks: 33.33% of Module Mark
Word/Page Count: No word count. Infographic should be A1 sized (594 x 841 mm, or 59.4 x 84.1 cm, which is 23.4 x 33.1 inches).

Assignment Task
For this coursework you are to create an infographic on an application of MOFs of your choosing. Infographics are a graphic visual representation of information, data or knowledge intended to be presented quickly and clearly. An infographic is not the same as a research poster, and marks will be lost if you produce a poster. Although there are some similarities between posters and infographics, the latter tend to contain less text and employ more visual methods to convey the message.
You are required to create an A1-size infographic which communicates the applications of MOFs. The aim of your infographic is to promote your chosen MOF application to an end-of-2nd-year chemistry student – making it attractive, engaging, and at the appropriate level. You are required to complete this task in a group, on the understanding that all members contribute equally and will receive the same mark for the assessment.
Any chemical structures, schemes or mechanisms in your infographic must be produced yourself using ChemDraw (or similar chemical drawing software). The MOF structures can be drawn using software such as VESTA, which is available free in the Software Centre at King's. You can choose the MOF and the applications for your infographic; if you choose one of the MOFs already covered in the lecture material, then you must go beyond the lecture material by showing a particular context/application.
The format of your infographic can vary. Below are some suggested formats:
• Introducing a famous MOF such as UiO-66, and its applications. Here the sections should include an introduction to UiO-66, its unique properties, the various applications it is used for, and advantages/limitations.
• Another way to approach this would be introducing a problem such as carbon capture, then presenting literature on why MOFs are used for this application, giving examples of MOFs being used for carbon capture, which MOF is the best for this purpose, and advantages/limitations.

Learning Outcomes:
• How to read, evaluate and disseminate recent literature on advanced topics in chemistry
• How to apply core chemical skills at the interface of teaching and research

Purpose of the task
Sharing scientific research with other scientists and the general public using social media is becoming more common in modern society, and infographics are an effective medium for doing this. This assessment tests your ability to effectively analyse and communicate research from scientific paper(s) whilst also requiring you to demonstrate your understanding of core chemical concepts. It also allows you to develop your skills in presenting information from the literature to different target audiences.

Assignment Audience: End-of-Year-2 Chemistry student

Resources
The choice of software to use to create your infographic is up to you. MS PowerPoint or Publisher can be used in a similar way as for the creation of posters. There are also online platforms available, many with free access or a free trial period (e.g. Piktochart).
I have also included (see below) a couple of articles on the creation of effective infographics (although these are by no means exhaustive guides, and many more are available online), and some examples of very popular and high-quality scientific infographics (from the BMJ and Compound Interest).

Guides:
www.jmmnews.com/how-to-turn-journal-article-into-infographic/
www.impact.science/infographics-a-great-way-to-simplify-complex-science/
https://libguides.hull.ac.uk/infographics/home

Examples:
www.bmj.com/infographics (BMJ)
www.compoundchem.com/infographics/ (Compound Interest)

List of possible MOFs and applications (this list is not exhaustive; you can find other MOFs by searching the literature using search engines such as SciFinder, Reaxys, Google Scholar):
MOFs: UiO-66, UiO-67, MIL-101, MOF-5, HKUST-1, Mg-MOF-74
Applications: carbon capture, gas adsorption, catalysis, drug delivery, gas sensing, water harvesting

Evaluation criteria/marking rubric
The expected structure and content of the assessment is described above. Grades reflect the degree to which these expectations have been met, in particular with regard to the following criteria:
- The required structure of the assessment has been followed.
- A clear summary of the chosen application is presented.
- The background/context to the application is clear.
- Any specific research/outcomes/contributions to the science from the application are included.
- The science is displayed in a meaningful way, with a clear emphasis on the most important items.
- Relevant figures/graphics are included.
- Relevant literature references have been cited where appropriate.
- The style and level of the infographic is adhered to consistently.
- The layout and presentation of the infographic is logical and clear.

1st (exceptional), 80 – 100%: An excellent infographic throughout in terms of content, style and presentation, analysis and insight. All of the above criteria have been met fully, with strong evidence of independent extracurricular learning.
1st (very good), 70 – 79%: A very good infographic throughout in terms of content, style and presentation, analysis and insight. All of the above criteria have been met, with some evidence of independent extracurricular learning.
2.1 (good), 60 – 69%: A good infographic throughout in terms of content, style and presentation, analysis and insight, but with weaknesses in ONE or TWO areas. The above criteria have mostly been met.
2.2, 50 – 59%: The infographic presents an acceptable summary of the chosen application but lacks clarity and displays a more superficial understanding than required for a 2.1. It is WEAKER in THREE or MORE of the above areas.
3rd, 40 – 49%: A poor infographic, lacking clarity and depth, with many errors and omissions, poorly presented, with points repetitive and/or superficial. The infographic fails to meet many of the criteria described above.
Fail, 0 – 39%: A largely incomprehensible and unstructured infographic. Many errors and omissions in the infographic. Poor use of the literature. Poor presentation of results. The infographic fails in most of the above criteria.
GGR 203 - INTRODUCTION TO CLIMATOLOGY

1.1 Definitions
1.1.1 Climatology: the study of the global climate system, including the processes responsible for maintaining climate at different scales, and a description of the climates of different regions and environments.
1.1.2 System: a set of components that interact with each other. "A" and "B" form part of a system if "A" influences "B" and "B" influences "A".
1.1.3 Climate system:
- Atmosphere
- Oceans
- Cryosphere: sea ice, snow cover, alpine glaciers, ice sheets (the Greenland and Antarctic ice sheets today)
- Biosphere
- Lithosphere (Earth's crust)
See Table 1 for a matrix of interactions between all possible pairs of these components. To learn this table, I suggest making a list of the kinds of effects seen for each component (rows), then learning where (which column) they apply.
[Table 1: Matrix of interactions between components of the climate system, giving the influence of the component listed in the row on the component listed in the column; the table body was garbled in extraction and is omitted here.]
The sun is not part of the climate system; rather, it is an external forcing. There is a heat flux from the interior of the Earth (due to the hot core) to the surface of about 0.3 W/m2. This flux is also an external forcing (it is external to, or not part of, the climate system) even though it is physically surrounded by the system components. This flux, although 1000 times smaller than the heat input from the Sun (as we'll see later), nevertheless notably influences the climate system at the time scale of glacial-interglacial climate oscillations.
1.1.4 Climate: the mean state of the climate system plus the variability and other statistics. We can examine the mean and variability of:
- temperature
- winds
- pressure
- rainfall, soil moisture
Traditionally, climate refers to the state of the climate system near the Earth's surface. For agricultural and natural systems, the variability of temperature or rainfall from one year to the next or within the growing season can be just as important as the mean. Because climate is the statistical properties of the climate system, it needs to be based on a sample. By convention, climate statistics (means, variabilities) are based on a 30-year sample. Thus, a change in temperature from one year to the next, or even from one decade to the next, is not a change in climate. Rather, it is part of the variability that defines that climate.
1.1.5 Meteorology: the study of the day-to-day variations in the state of the atmosphere ("weather"). To predict the future weather, one starts from specific observed initial conditions and then computes the evolution of that state to a specific time in the future. Climatology, on the other hand, is concerned with means and variabilities averaged over a period of time (30 years).
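Referring back to the definition in 1.1.4, the 30-year convention can be written compactly (my notation, added for illustration) for a variable such as annual-mean temperature T:

```latex
\bar{T} = \frac{1}{30}\sum_{i=1}^{30} T_i,
\qquad
\sigma_T^2 = \frac{1}{30}\sum_{i=1}^{30}\left(T_i - \bar{T}\right)^2
```

where the T_i are the 30 yearly values (the usual n-1 sample correction may also be used). A "change in climate" then means a change in the mean or the variability (or other statistics), not a single anomalous year.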
The climate depends strongly on the boundary conditions (solar energy coming in at the top of the atmosphere, the nature of the land or ocean surface). Predicting a change in the climate, therefore, is quite different from the problem of predicting the weather. The fact that we can't predict the weather more than one week in advance is completely irrelevant to the problem of predicting changes in climate 100 years from now in response to, say, an increase in atmospheric CO2 concentration. [hockey analogy]
1.2 Overview of past natural climatic change
Because we have defined climate in terms of both the mean and variability, if either the mean or even just the variability of some climate variable (such as temperature) changes, the climate has changed. Figure 1.1 in the figures file for Chapter 1 gives an overview of estimated change in global average temperature over the past 500 million years. Key points:
- there were 4 episodes of several million years' duration with periodic glacial-interglacial oscillations during the past 500 million years, the most recent being during roughly the past 2 million years
- at other times temperatures were 10-14°C warmer than today
- during the last 700,000 years there were 7 saw-tooth shaped glacial-interglacial cycles of about 100,000 years' duration, with a gradual, oscillatory approach into full glacial conditions, followed by abrupt (within 10,000 years) transitions to interglacial conditions
This is especially evident in Figure 1.2 for the past 400,000 yrs, from which it can also be seen that:
- atmospheric CO2 and CH4 (methane) concentrations varied as well, in such a way as to reinforce the temperature changes (lower concentrations when it was getting colder, and vice versa)
- the last ice age ended around 10,000 yrs ago, and temperatures reached a peak (maybe 1°C warmer than during the late 1800s) about 6000 years ago
Figure 1.3 compares lake level status (low, intermediate, and high), as deduced from geomorphic and other evidence, for two time periods compared to present: the peak of the last ice age (about 18,000 years ago) and the mid-Holocene (6000 years ago). The US SW was much moister than now during the last ice age (with huge lakes where there are now only small remnants), while the Sahara desert and east Africa were much moister just 6000 yrs ago.
Returning to temperature, as seen from Figures 1.4 to 1.6:
- there had been a downward trend of about 0.2°C over the period AD 1000-1900
- the climate warmed by almost 1.0°C during the past 100 years (this is due without question to human emissions of CO2 and other greenhouse gases to the atmosphere)
- the warming trend has been particularly large in polar regions
1.3 Overview of the present climate
Take note of the following information from the indicated figures:
Fig 1.7 – all layers and boundaries; temperature, heights and pressures of the first 3 boundaries
Fig 1.8 – 1-cell early view vs 3 cells (names, locations of the 3 cells and the direction of flow), names and directions of winds, names and locations of high and low pressure cells; qualitative variation of zonal mean surface P with latitude
Fig 1.9 – trade winds location, ITCZ as convergence of trade winds, shifts in location with seasons
Fig 1.10 – monsoon regions, locations with winter rain, summer rain, and double rain
Fig 1.11 – seasonal reversal of winds over the Tibetan plateau and east Asia
Fig 1.12 – the westerly jet stream – note shift in position and strength with seasons (equatorward and stronger in winter, greater variation in NH than in SH)
Fig 1.13 – the cross-section shows the longitudinal average of the E-W (zonal) wind, where positive is from the west. Seasonal variations in position and strength are seen (much stronger and further equatorward during winter, especially in the NH; much less variation in the SH). Notice the easterly winds (negative values) in the stratosphere in summer in both hemispheres
Fig 1.14 – January and July temperature patterns: large changes in polar regions, small changes in tropical regions, so there is a large equator-to-pole temperature difference in winter and a small difference in summer
Fig 1.15 – surface pressure pattern in Jan and July: the huge and strong high-pressure cell over Siberia in January turns into a weak low-pressure cell in July. The large lows in January centred over Iceland and the Aleutian Islands are largely gone in July. A strong high-pressure cell sits over the N Atlantic in July (Azores High). High-pressure cells over mid-latitude oceans are strongest in summer in both hemispheres.
Fig 1.16 – seasonal rainfall: the ITCZ and the march of the monsoons are evident
Fig 1.17 – rainfall extremes: relative to the average precipitation, the extremes are strongest in dry regions. I.e., in the desert, it either doesn't rain, or it pours.
Fig 1.18 – an example of the alternating pattern of extreme warm and extreme cold regions due to a distortion in the airflow. When there is strong airflow from the north somewhere (bringing cold weather), there has to be compensating strong airflow at some other longitude (bringing warm weather)
1.4 Physical basis of climate
We can subdivide the processes responsible for determining the state of the climate system into:
Radiative processes, involving: - solar radiation - infrared radiation
Dynamics: - atmospheric and oceanic motions (winds and currents) - flow of ice sheets, crustal motions
Thermodynamics – deals with heat, internal energy, and work - leads to the study of the vertical stability of the atmosphere - leads to important relationships involving evaporation and absorbed energy at the Earth's surface
Surface processes – evaporation - exchange of heat and momentum with the atmosphere - occurrence of ice and snow
Clouds are extremely important, as they strongly affect, and are affected by, all of the above sets of processes.
At yearly and longer time scales, biological processes play a very important role in climate. At geological time scales, coupled biogeochemical cycles also play a very important role; for example, the coupled carbon-phosphate cycles.
Department of Accounting and Business Analytics
BTM 211 Management Information Systems
DM Assignment – Winter 2025
Case Study – Alberta Aerospace Museum
Background
Established in 2018, the Alberta Aerospace Museum (AAM) has become the standard for aerospace education in Canada. While its priority lies in showcasing Canada's contribution to space exploration, historic pieces from some of the most legendary space exploration missions can be seen at this state-of-the-art institution. Located on the outskirts of Edmonton, AAM is complete with a modern education center and its most impressive addition, a vast hangar that houses larger aircraft and ongoing restoration projects. AAM's curator, Christina Hadfield, prides herself on ensuring that the museum always offers top-of-the-line interactive exhibits that offer educational fun for everyone!
Problem
As AAM has attracted more patrons and showcased bigger and better artifacts, Christina has found that the museum's current computer system is ill-equipped to handle the amount of data they have. She has also begun to worry about the safety of her data, especially with how many priceless artifacts the museum houses. She would also like to update the museum's website to be more mobile-friendly. After evaluating her situation, Christina decided her best course of action would be to hire a team of business analysts. She contacted you for your expert opinion on her business operations, and during the initial meeting with her, she explained the current state of AAM.
"Our museum houses some of the best artifacts in the world, and protecting their data is one of my greatest priorities. I want to make sure our new system has top-of-the-line security measures to ensure the data remains safe."
You ask her to provide some information about the type of data she wants to track, and she begins to explain the museum's current process.
"I'll begin with information unique to our museum. We need to keep track of all 200 of our employees with a unique employee ID, including their first and last names, their job type, hire date, salary, and education level. Some of our employees are also supervisors, and we keep track of that person and the employees they manage."
"Next, we track our visitors (45,000). For the sake of our visitors' privacy, we only track information that will help us improve their experience. This includes their age, any feedback they provide us with, and whether the feedback is generally positive or negative, which we track with a P for positive or an N for negative. All of this is contained under their anonymous visitor ID."
"Finally, we track information regarding our 20 exhibits, which can be identified by a unique number. This includes tracking the employees assigned to work the exhibit, the visitors the exhibit attracts, and a description of the exhibit. For planning purposes, we also keep track of the square footage required for the exhibit, the budget, and whether the exhibit is temporary or not."
When she's finished, you look around her office and see boxes surrounding her desk. Your eye catches on a shard of moonrock she has encased in glass on her desk, and you ask her about the artifacts the museum has.
"Artifacts are what add life to our exhibits, and we track plenty of data specific to each artifact to ensure we are properly presenting it to the public. Broadly speaking, we categorize each artifact by an artifact type (25), which is represented by a four-letter code, and its corresponding description.
Given that many types of artifacts are fragile, we also add a special-storage indicator that tells us if it requires special handling."
Christina notices you are looking at the shard of moonrock, and she puts it in front of you so you can take a closer look as she continues.
"For each artifact (70,000) we have in our system there is an associated artifact ID. We also add a description to the artifact, the associated type-code for the artifact, and most importantly we track the acquisition date of the artifact. For our purposes we also track the exhibit the artifact is paired with. On top of everything, we track the mission code (200) the artifact is from, the mission name, the mission description, country, and the start and end dates. Oh, and the museum the artifact came from."
You ask her to clarify what she means by museum, and she begins to explain the process behind acquiring artifacts. She opens a binder on her desk and pulls out a piece of paper for you to examine. (see below) As you study the paper, you notice that it contains information about the loaning process.
"Most of the artifacts in this museum are on loan to us from other museums, but we also loan out some artifacts from time to time if we get a request for an item. For each museum, we track their unique ID, a code for the country they are located in, the name of the museum, their street address, municipality, and postal code, along with the last name of the curator."
You ask her to email you a copy of this document, and you conclude your meeting by saying that you should have everything you need to get started on a prototype data model for her.
"That's wonderful! I can't wait to see what you come up with!"
Requirements
Using the Draw.io software and the attached "AAM_DM_Assignment_Starting_Point.drawio" file as your starting point, create an Entity Relationship Diagram (i.e., data model) of the Alberta Aerospace Museum business, as shown in the case above. Starting from the provided starting-point diagram, completely and correctly specify all Entities, Relationships, and Attributes as described in the lecture and lab materials on Entity Relationship Diagramming. Include all relevant facts from the case in your model, including primary keys, volumes, data types for all fields, descriptive labels, and foreign key fields. There is no need to create or assume any new entities or attributes other than what is required for the above case. Your final model should have exactly NINE entities.
You can use any naming convention you want as long as we can understand what you're trying to say (e.g., first name instead of given names). When making relationships, ensure that your lines DO NOT overlap; marks will be deducted if they do. Use a white background when creating your diagram; if you use a dark background, marks will be deducted. When describing your relationships, you can use the same root words (has, is in, contains), but the overall descriptions should be different for each relationship. Use PascalCase for entities and camelCase for primary keys, foreign keys, and attributes; a hypothetical example of these conventions follows at the end of this section. NOTE: In the starting-point diagram, the number of lines with '?????'s indicates the total number of expected attributes (including all keys). DO NOT reorganize the diagram.
All work is to be done individually. Do not copy, in whole or in part, the work of others, including paper printouts, electronic files, or computer programs. Do not use the work of others as a starting point and then modify it. All work submitted under your name must be yours and yours alone.
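For illustration only, here is what an entity might look like under these conventions. The GiftShop entity below is hypothetical and is NOT one of the nine entities required by the case; its attributes, data types, and volume are invented purely to show the naming style:

GiftShop (volume: 1)                  <- PascalCase entity name, with a volume
    giftShopId    int                 <- camelCase primary key
    shopName      varchar(40)         <- camelCase attribute with its data type
    employeeId    int                 <- camelCase foreign key (foreign key fields are italicized in the diagram)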
Marking Scheme
DATA MODEL:
● Correct entities (well named, identified, volumes included)
● Correct relationships (connected to the correct entities; cardinalities (1-M? M-1?) correct; well described; NO crossing of lines)
● Correct attributes (each in the correct entity, well named, correct data type, unique primary key, and foreign key fields italicized)
● Submitted file created using draw.io, named correctly, delivered electronically through eClass, on time.
Homework 2 | Basic SQL Queries
CSE 414 - Data Management
Objectives: To create and import databases and to practice simple SQL queries using SQLite.
Assignment tools: SQLite 3, the flights dataset hosted here.
What to turn in: create-tables.sql, import-tables.sql, hw2-q1.sql, hw2-q2.sql, etc. (see below). You should compose these files in a code editor like Sublime Text (or your favorite IDE).
Where to turn in: Gradescope
Assignment Details
In this homework, you will write several SQL queries on a relational flights database. The data in this database is abridged from the Bureau of Transportation Statistics.
The database consists of four tables regarding a subset of flights that took place in 2015. The schema you should use is as follows. Be sure to use exactly these column names in this order.
FLIGHTS (fid int,
         month_id int,        -- 1-12
         day_of_month int,    -- 1-31
         day_of_week_id int,  -- 1-7, 1 = Monday, 2 = Tuesday, etc
         carrier_id varchar(7),
         flight_num int,
         origin_city varchar(34),
         origin_state varchar(47),
         dest_city varchar(34),
         dest_state varchar(46),
         departure_delay int, -- in mins
         taxi_out int,        -- in mins
         arrival_delay int,   -- in mins
         canceled int,        -- 1 means canceled
         actual_time int,     -- in mins
         distance int,        -- in miles
         capacity int,
         price int            -- in $
)
CARRIERS (cid varchar(7), name varchar(83))
MONTHS (mid int, month varchar(9))
WEEKDAYS (did int, day_of_week varchar(9))
In addition, make sure you impose the following constraints on the tables above:
● The primary key of the FLIGHTS table is fid.
● The primary keys for the other tables are cid, mid, and did, respectively. Other than these, do not assume any other attribute(s) is a key/unique across tuples.
● Flights.carrier_id references Carriers.cid
● Flights.month_id references Months.mid
● Flights.day_of_week_id references Weekdays.did
We provide the flights database as a set of plain-text data files in the linked .zip archive. Each file in this archive contains all the rows for the named table, one row per line.
In this homework, you need to do two things: 1. import the flights dataset into SQLite; 2. run SQL queries to answer a set of questions about the data.
IMPORTING THE FLIGHTS DATABASE (20 points)
SQLite does not enforce foreign keys by default. To enable foreign keys, use the following as the first command in your create-tables.sql file.
PRAGMA foreign_keys=ON;
To import the flights database into SQLite, you will need to run sqlite3 with a new database file, for example: sqlite3 hw2.db. Then you can run CREATE TABLE statements to create the tables while specifying all key constraints as described above:
CREATE TABLE table_name ( ... );
Then, you can use the SQLite .import command to read data from each text file into its table after setting the input data to be in CSV (comma-separated value) form:
.mode csv
.import filename tablename
See examples of .import statements in the SQLite documentation or sqlite3's help online for details. Depending on where you downloaded and extracted the data files, your import statement might look something like ".import /Users/maas/Downloads/filename.csv tablename". On most operating systems you can find the file path by right-clicking the file and looking at the properties.
Put all the code for creating your tables into a file called create-tables.sql and all the code for importing the data into these tables into a separate file called import-tables.sql.
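To make the constraint syntax concrete, here is a minimal sketch of creating the schema above and loading one table. It is written with Python's standard sqlite3 and csv modules purely for illustration; your graded create-tables.sql and import-tables.sql must be plain SQL scripts run in the sqlite3 shell, and the .import dot-command exists only in that shell, not in Python. The file name carriers.csv is an assumption; use the names actually found in the dataset archive.

import csv
import sqlite3

conn = sqlite3.connect("hw2.db")
conn.execute("PRAGMA foreign_keys=ON;")  # must be enabled before any rows are inserted

conn.executescript("""
CREATE TABLE CARRIERS (cid varchar(7) PRIMARY KEY, name varchar(83));
CREATE TABLE MONTHS (mid int PRIMARY KEY, month varchar(9));
CREATE TABLE WEEKDAYS (did int PRIMARY KEY, day_of_week varchar(9));
CREATE TABLE FLIGHTS (
    fid int PRIMARY KEY,
    month_id int REFERENCES MONTHS(mid),
    day_of_month int,
    day_of_week_id int REFERENCES WEEKDAYS(did),
    carrier_id varchar(7) REFERENCES CARRIERS(cid),
    flight_num int,
    origin_city varchar(34),
    origin_state varchar(47),
    dest_city varchar(34),
    dest_state varchar(46),
    departure_delay int,
    taxi_out int,
    arrival_delay int,
    canceled int,
    actual_time int,
    distance int,
    capacity int,
    price int
);
""")

# In the sqlite3 shell you would instead run:  .mode csv  followed by  .import carriers.csv CARRIERS
with open("carriers.csv", newline="") as f:
    conn.executemany("INSERT INTO CARRIERS VALUES (?, ?);", csv.reader(f))
conn.commit()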
If done correctly, you should be able to open up a new db file in sqlite and set up the database using these two commands:
.read create-tables.sql
.read import-tables.sql
WRITING SQL QUERIES (80 points, 10 points each)
For each question below, write a single SQL query to answer that question. Put each of your queries in a separate .sql file, i.e., hw2-q1.sql, hw2-q2.sql, etc.
Important points before starting:
● Like in HW 1, the code in your .sql files must be valid SQL. If running the file causes errors, we will subtract points.
● Your answer should NOT contain any subqueries. In HW 3 we will use subqueries, but for this homework you shouldn't use them.
● Make sure you name the output columns as indicated. Do not change the output column names/return more or fewer columns.
● If a query uses a GROUP BY clause, make sure that all attributes in your SELECT clause for that query are either GROUP BY attributes or contained in an aggregate function. SQLite will let you select other attributes, but that is wrong, as we discussed in lecture. Other database systems would reject the query, and we will subtract points for this mistake.
● Generally, the boolean filters in your queries should correspond to the English descriptions. For example, if a question asks you to find flights on a Tuesday, your query should test day_of_week = 'Tuesday'. It is not correct to instead test did = 2, as this isn't the description in the problem statement. The reasoning is that a database user doesn't know that Tuesday has did = 2; they need to join Weekdays to Flights to filter on particular weekday strings. This rule also applies for filters over carrier names, months, etc. (A sketch of this join-on-names pattern appears after the question list below.)
● A tip for solving these problems is to think about the FROM clause first. Which tables do you need to join, and what attributes do you need to compute? If you think of the acronym FWGHOS we learned in class, it might help you compose your query.
In the questions below, flights include canceled flights as well, unless otherwise noted. Also, when asked to output times, you can report them in minutes.
1. (10 points) List the distinct flight numbers of all flights from Seattle to Boston by Alaska Airlines Inc. on Mondays. Also notice that, in the database, the city names include the state, so Seattle appears as 'Seattle WA'. Please use the flight_num column instead of fid. Name the output column flight_num. [Hint: Output relation cardinality: 3 rows]
2. (10 points) Find all itineraries from Seattle to Boston on July 15th. Search only for itineraries that have one stop (i.e., flight 1: Seattle -> [somewhere], flight 2: [somewhere] -> Boston). Both flights must depart on the same date and must be with the same carrier. It's fine if the landing date is different from the departing date (in the case of an overnight flight). The total flight time (actual_time) of the entire itinerary should be fewer than 7 hours (but notice that actual_time is in minutes). For each itinerary, the query should return the name of the carrier, the first flight number, the origin and destination of that first flight, the flight time, the second flight number, the origin and destination of the second flight, the second flight time, and finally the total flight time. Only count flight times here; do not include any layover time. Name the output columns name (as in the name of the carrier), f1_flight_num, f1_origin_city, f1_dest_city, f1_actual_time, f2_flight_num, f2_origin_city, f2_dest_city, f2_actual_time, and actual_time as the total flight time.
List the output columns in this order. [Output relation cardinality: 1472 rows]
3. (10 points) Find the day of the week with the longest average arrival delay. Return the name of the day and the average delay. Name the output columns day_of_week and delay, in that order. (Hint: consider using LIMIT. Look up what it does!) [Output relation cardinality: 1 row]
4. (10 points) Find the names of all airlines that ever flew more than 1000 flights in one day (i.e., a specific day/month combination, not any 24-hour period). Return only the names of the airlines. Do not return any duplicates (i.e., airlines with the exact same name). Name the output column name. [Output relation cardinality: 12 rows]
5. (10 points) Find all airlines that had more than 0.5% (= 0.005) of their flights out of Seattle canceled. Return the name of the airline and the percentage of canceled flights out of Seattle. Percentages should be output in percent format (3.5% as 3.5, not 0.035). Order the results by the percentage of canceled flights in ascending order. Name the output columns name and percentage, in that order. [Output relation cardinality: 6 rows]
6. (10 points) Find the maximum price of tickets between Seattle and New York, NY (i.e., Seattle to NY or NY to Seattle). Show the maximum price for each airline separately. Name the output columns carrier and max_price, in that order. [Output relation cardinality: 3 rows]
7. (10 points) Find the total capacity of all direct flights that fly between Seattle and San Francisco, CA on July 10th (i.e., Seattle to SF or SF to Seattle). Name the output column capacity. [Output relation cardinality: 1 row]
8. (10 points) Compute the total departure delay of each airline across all flights. Some departure delays may be negative (indicating an early departure); they should reduce the total, so you don't need to handle them specially. Name the output columns name and delay, in that order. [Output relation cardinality: 22 rows]
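As promised above, here is a sketch of the join-on-names pattern. It deliberately answers none of the graded questions: it counts flights per month name, joining MONTHS to FLIGHTS rather than hard-coding mid values. The Python wrapper is only so it can be run as-is against hw2.db; for the assignment, only the SQL itself belongs in your .sql file.

import sqlite3

conn = sqlite3.connect("hw2.db")
query = """
SELECT m.month AS month, COUNT(*) AS num_flights
FROM FLIGHTS AS f, MONTHS AS m   -- join the lookup table...
WHERE f.month_id = m.mid         -- ...on its key, then filter/group by the name
GROUP BY m.month                 -- every non-aggregate SELECT column is a GROUP BY column
ORDER BY num_flights DESC;
"""
for month, num_flights in conn.execute(query):
    print(month, num_flights)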
English MO1A – English Composition
Spring 2025
Course Description
Builds critical reading and expository writing skills through the analysis and evaluation of college-level texts and the composition of well-organized, full-length essays containing properly documented evidence.
Course Student Learning Outcomes: Upon successful completion of the course, you will be able to:
· write a thesis-driven essay that is clearly organized, supported by relevant evidence, uses academic prose, and follows up-to-date MLA citation conventions.
· demonstrate critical reading, writing, thinking, and research skills through analysis, synthesis, and evaluation of a variety of material encompassing varying viewpoints.
Course Objectives
· compose several expository papers from 2 to 7 pages long, totaling 5,000 words, employing such skills as: logical organization, control of diction, awareness of audience and purpose, and adherence to the conventions of academic prose.
· compose timed essay examinations with a clear thesis, logical organization, convincing arguments, and specific supporting detail.
· organize and compose a 5-7-page research paper incorporating and accurately documenting a variety of appropriate source materials.
· analyze a variety of essays and at least one book-length work.
· demonstrate critical thinking skills in oral and written discussion of assigned readings.
· identify and assess the main idea of essays and write clear, relevant responses in informal journal entries and formal essays with a clear statement of thesis, focus, or controlling idea.
· utilize the stages of the writing process: generating ideas, drafting, revising, and editing.
· develop paragraphs which incorporate appropriate rhetorical strategies, effective transitions, and convincing support.
Course Philosophy
This class operates on a growth mindset. This means that there is no such thing as a naturally good writer—someone who was born with writing talent and doesn't have to work at it. We may all have subjects we prefer, but quality writing (or any other product) is the result of hours of practice and hard work. You can all be good writers, but you must be willing to put in the time and effort.
Large-Scale Data Mining: Models and Algorithms
ECE 219 Winter 25
Project 1: End-to-End Pipeline to Classify News Articles
Due Jan 24, 2025, by 11:59 PM
Overview
Statistical classification broadly refers to the task of learning to identify a subset of categories that pertain to a data point (a sample of text, an image, a video clip, a time-signal, etc.) from a predefined (generally human-guided) larger set of categories. The model attempts to master the task given a training data set (kept separate from an evaluation set) in which each data point is pre-labeled with its "correct" category membership(s). In this project, we deal with the classification of text data. The project consists of building an end-to-end pipeline to classify samples of news articles and involves the following ML components:
1. Feature Extraction: Construction of TF-IDF representations of textual data;
2. Dimensionality Reduction: Principal Component Analysis (PCA) and Non-negative Matrix Factorization (NMF) - generally necessary for classical ML methods;
3. Application of Simple Classification Models: Applying common classification methods to the extracted features, such as Logistic/Linear Classification and Support Vector Machines;
4. Evaluation of the Pipeline: Evaluating and diagnosing classification results using Grid-Search and Cross-Validation;
5. Replacing corpus-level features with pretrained features: Applying pre-training to a downstream classification task and evaluating comparative performance.
Getting familiar with the dataset
Please access the dataset at this link. We are using a custom dataset that was designed specifically for this quarter. Note: Do not attempt to open the downloaded file in Excel - the formatting of the file when visualized in Excel might suggest the data is corrupted, but this is not true. Consider exploring the dataset using Pandas. You might find the following Pandas functions helpful (in no specific order): read_csv, head, hist, shape.
QUESTION 1: Provide answers to the following questions:
• Overview: How many rows (samples) and columns (features) are present in the dataset?
• Histograms: Plot 3 histograms:
(a) the total number of alphanumeric characters per data point (row) in the feature full text, i.e., count on the x-axis and frequency on the y-axis;
(b) the column leaf label – class on the x-axis;
(c) the column root label – class on the x-axis.
• Interpret Plots: Provide qualitative interpretations of the histograms.
The two sets of labels, leaf label and root label, are hierarchically arranged as follows:
root_label: sports – leaf_label: basketball, baseball, tennis, football, soccer
root_label: climate – leaf_label: forest fire, flood, earthquake, drought, heatwave
Binary Classification
For the first part of the project, we will be using only the full text column as the raw features per sample (row) and the root label column as the label for each sample. The root labels are well-separated. Before continuing, please set the random seed as follows to ensure consistency:
import numpy as np
import random
np.random.seed(42)
random.seed(42)
1 Splitting the entire dataset into training and testing data
In order to measure the performance of our binary classification model, we split the dataset into a training and a testing set. The model is trained on the training set and evaluated on the testing set. Note: Do not train on the testing set. We create the sets from a Pandas dataframe input, df, that contains the entire dataset.
Please make sure that the random seeds are set and the fraction of the test set is 0.2:
from sklearn.model_selection import train_test_split
train, test = train_test_split(df[["full_text","root_label"]], test_size=0.2)
train and test are the dataframes containing the specific rows pertaining to the training data and testing data, respectively.
QUESTION 2: Report the number of training and testing samples.
2 Feature Extraction
The primary step in classifying a corpus of text is choosing a good representation of each data point. Since the full text column contains the raw features describing each data point, we seek a feature extraction module that encodes raw text features into processed, computationally compatible features. A good representation should retain enough class-discriminating information post-processing to perform classification competitively, yet at the same time be concise enough to avoid computational intractability and overfitting.
A first model: The Bag-of-Words (BOW): One feature extraction technique is the "Bag of Words" model, where a document – in this case the full text feature of one data point – is represented as a histogram of word frequencies, or other statistics, within a fixed vocabulary of words. The vocabulary is aggregated from the training set (process described below).
Compiling the Vocabulary:
(a) Each raw feature (text segment) is split into sentences;
(b) Each sentence is split into words;
(c) The list of words PER sentence is passed jointly into a part-of-speech tagger to identify the nouns, verbs, adjectives, etc.;
(d) These tags are used to lemmatize/stem each word;
(e) Words are filtered: very rare or very frequent words (by the number of documents they occur in, or the number of times they occur within a document) are removed, digit- and punctuation-dominant words are removed, and stopwords, words contained in a database of common words, are removed;
(f) Remaining words are added to the vocabulary.
Say that the selected set of words that compose the vocabulary forms a set W. For a dataset D, the processed features can be collectively represented as a data matrix X ∈ R^(|D|×|W|). So each row captures the count of each word (the histogram) per data point.
An Example: If a test data point's raw feature contains the text, "On Saturday, the NHL hockey team went to the school to volunteer and educate. Outreach is required for hockey, which is dying in popularity in recent years." and W = ["hockey", "volunteer", "sport"], the row in X corresponding to this data point would be [2, 1, 0] because "hockey" appears twice, "volunteer" appears once, and "sport" does not appear at all (though it might appear in another data point. Remember the vocabulary is aggregated across the training set.)
During Testing: Each text sample in the testing set is processed similarly to those during training; however, words are no longer added to W – this was fixed during training. Instead, if a word that exists in the vocabulary occurs in the processed words from a testing sample, its count is incremented in the resulting feature matrix X. To avoid adding to the vocabulary during testing, and to avoid training on the test set in general, please note as a rule: For most of this project you will be using the NLTK and scikit-learn (sklearn) libraries for text processing and classifier design. In sklearn, in particular, only use the functions fit_transform and fit on the training set, and only use transform on the testing set.
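A minimal sketch of this fit-on-train / transform-on-test rule, using CountVectorizer and TfidfTransformer on a toy corpus (the toy documents and min_df=1 are placeholders; in the project you would feed in the cleaned full text columns and use min_df=3):

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

train_docs = ["the hockey team went to the school to volunteer",
              "hockey is dying in popularity in recent years"]
test_docs = ["volunteer outreach is required for hockey"]

vectorizer = CountVectorizer(stop_words="english", min_df=1)
tfidf = TfidfTransformer()

X_train_counts = vectorizer.fit_transform(train_docs)  # fit: vocabulary W is built from TRAIN only
X_train_tfidf = tfidf.fit_transform(X_train_counts)    # fit: IDF weights also come from TRAIN only

X_test_counts = vectorizer.transform(test_docs)        # transform only: no new words enter W
X_test_tfidf = tfidf.transform(X_test_counts)

print(X_train_tfidf.shape, X_test_tfidf.shape)         # both have |W| columns, learned from training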
The better model: The Term Frequency-Inverse Document Frequency Model (TF-IDF): ("document" and "data point" are used interchangeably.) While the Bag-of-Words model continues to be used in many feature extraction pipelines, a normalized count vector that not only counts the number of times a word occurs in a document (the term frequency) but also scales the resulting count by the number of documents the word appears in (the document frequency) might provide a more valuable feature extraction approach. The focus on the document frequency (and more correctly the inverse of it) encourages the feature vector to discriminate between documents as well as represent a document. TF-IDF does not only ask "What is the frequency of different words in a document?" but rather "What is the frequency of words in a document specific to that document and which differentiates it from other documents?"
A human reading a particular news article in the sports section will usually ignore the contextually dominant words such as "sport", "competition", and "player", despite these words being frequent in every news article in the sports section. Such context-based conditioning of information is widely observed. The human perception system usually applies a saturating function (such as a logarithm or square-root) to the actual input values into the vision model before passing it on to the neuronal network in the brain. This makes sure that a contextually dominant signal does not overwhelm the decision-making processes in the brain. The TF-IDF functions draw their inspiration from such neuronal systems. Here we define the TF-IDF score to be:
TF-IDF(d, t) = TF(d, t) × IDF(t)
where TF(d, t) represents the frequency of word (processed, lemmatized, otherwise filtered) t in document d, and the inverse document frequency is defined as:
IDF(t) = log(n / DF(t)) + 1
where n is the total number of documents and DF(t) is the document frequency, i.e., the number of documents that contain the word t.
import re
def clean(text):
    # strip URLs at the start of lines
    text = re.sub(r'^https?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE)
    # replace HTML line breaks and entities left over from the crawler
    texter = re.sub(r"<br />", " ", text)
    texter = re.sub(r"&quot;", "\"", texter)
    texter = re.sub('&#39;', "\"", texter)
    texter = re.sub('\n', " ", texter)
    texter = re.sub(' u ', " you ", texter)
    texter = re.sub('`', "", texter)
    texter = re.sub(' +', ' ', texter)
    # collapse repeated ! and ?
    texter = re.sub(r"(!)\1+", r"!", texter)
    texter = re.sub(r"(\?)\1+", r"?", texter)
    texter = re.sub('&amp;', 'and', texter)
    texter = re.sub('\r', ' ', texter)
    # drop any remaining HTML tags and non-ASCII characters
    clean = re.compile('<.*?>')
    texter = texter.encode('ascii', 'ignore').decode('ascii')
    texter = re.sub(clean, '', texter)
    if texter == "":
        texter = ""
    return texter
QUESTION 3: Use the following specs to extract features from the textual data:
• Before doing anything, please clean each data sample using the code block provided above. This function helps remove many, but not all, HTML artefacts from the crawler's output. You can also build your own cleaning module if you find this function to be ineffective.
• Use the "english" stopwords of the CountVectorizer
• Exclude terms that are numbers (e.g. "123", "-45", "6.7", etc.)
• Perform lemmatization with nltk.wordnet.WordNetLemmatizer and pos_tag
• Use min_df=3
Please answer the following questions:
• What are the pros and cons of lemmatization versus stemming? How do these processes affect the dictionary size?
• min_df means minimum document frequency. How does varying min_df change the TF-IDF matrix?
• Should I remove stopwords before or after lemmatizing? Should I remove punctuation before or after lemmatizing?
Should I remove numbers before or after lemmatizing? Hint: Recall that the full sentence is input into the lemmatizer and the lemmatizer is tagging the position of every word based on the sentence structure.
• Report the shape of the TF-IDF-processed train and test matrices. The number of rows should match the results of Question 2. The number of columns should roughly be on the order of k×10^3. This dimension will vary depending on your exact method of cleaning and lemmatizing, and that is okay.
The following functions in sklearn will be useful: CountVectorizer, TfidfTransformer, About Lemmatization, and, for the daring, Pipeline. Please refer to the discussion section notebooks for more guidance.
3 Dimensionality Reduction
After applying the above operations, the dimensionality of the representation vectors (TF-IDF vectors) is large. Classical learning algorithms, like the ones required in this section, however, may perform poorly with such high-dimensional data. Since the TF-IDF matrix is sparse and low-rank, as a remedy, one can project the points from the larger dimensional space to a lower dimension.
In this project, we use two dimensionality reduction methods: Latent Semantic Indexing (LSI) and Non-negative Matrix Factorization (NMF), both of which minimize the mean squared residual error (MSE) between the original TF-IDF data matrix and a reconstruction of the matrix from its low-dimensional approximation. Recall that our data is the term-document TF-IDF matrix, whose rows correspond to the TF-IDF representations of the documents.
Latent Semantic Indexing (LSI): The LSI representation is obtained by computing the left and right singular vectors corresponding to the top k largest singular values of the term-document TF-IDF matrix X. We perform Singular Value Decomposition (SVD) on the matrix X, resulting in X = UΣV^T, where U and V are orthogonal. Let the singular values in Σ be sorted in descending order; then the first k columns of U and V are called U_k and V_k respectively. V_k consists of the principal components of matrix X in the feature space. Then we use XV_k (which is also equal to U_kΣ_k) as the dimension-reduced data matrix, where rows still correspond to documents, only now each data point can be represented in a (far) lower dimensional space. In this way, the number of features is reduced. LSI is similar to Principal Component Analysis (PCA), and you can see the lecture notes for their relationship. Having obtained U and V, to reduce the test data, we ONLY multiply the test TF-IDF matrix X_t by V_k, i.e., X_t,reduced = X_t V_k. By doing so, we actually project the test TF-IDF vectors onto the principal components previously learned from training, and use the projections as the dimensionality-reduced data.
Non-negative Matrix Factorization (NMF): NMF tries to approximate the data matrix X ∈ R^(n×m) (n = |D| docs and m = |W| terms) with WH (W ∈ R^(n×r), H ∈ R^(r×m)). Concretely, it finds the non-negative matrices W and H such that ‖X − WH‖²_F is minimized (where the Frobenius norm is ‖A‖_F ≡ sqrt(Σ_i Σ_j A_ij²)). Then we use W as the dimension-reduced data matrix, and in the fit step, we calculate both W and H. The intuition behind this is that we are trying to describe the documents (the rows in X) as a (non-negative) linear combination of r topics: here we see h_1^T, ..., h_r^T, the rows of H, as r "topics", each of which consists of m scores indicating how important each term is in the topic. Then x_i^T ≈ w_i1 h_1^T + w_i2 h_2^T + ... + w_ir h_r^T, for i = 1, ..., n. Now how do we calculate the dimension-reduced test data matrix?
Again, we try to describe the document vectors (rows, by our convention here) in the test data (call it X_t) with (non-negative) linear combinations of the "topics" we learned in the fit step. The "topics", again, are the rows of the H matrix, {h_i^T} for i = 1, ..., r. How do we do that? Just solve the optimization problem
minimize ‖X_t − W_t H‖²_F over W_t ≥ 0,
where H is fixed as the H matrix we learned in the fit step. Then W_t is used as the dimension-reduced version of X_t.
QUESTION 4: Reduce the dimensionality of the data using the methods above:
• Plot the explained variance ratio across multiple different k = [1, 5, 10, 25, 50, 100, 500, 1000] for LSI, and for the next few sections choose k = 25. What does the explained variance ratio plot look like? What does the plot's concavity suggest?
• With the k = 25 found in the previous sections, calculate the reconstruction residual MSE error when using LSI and NMF – they both should use the same k = 25. Which one is larger, the ‖X − WH‖²_F in NMF or the ‖X − U_kΣ_kV_k^T‖²_F in LSI, and why?
4 Classification Algorithms
In this part, you are asked to use the dimensionality-reduced training data from LSI with your choice of k to train (different types of) classifiers, and to evaluate the trained classifiers with test data. Your task is to classify the documents into two classes (for now a binary classification task): sports versus climate.
Classification Measures: Classification quality can be evaluated using different measures such as precision, recall, F-score, etc. Refer to the discussion material for their definitions. Depending on the application, the true positive rate (TPR) and the false positive rate (FPR) have different levels of significance. In order to characterize the trade-off between the two quantities, we plot the receiver operating characteristic (ROC) curve. For binary classification, the curve is created by plotting the true positive rate against the false positive rate at various threshold settings on the probabilities assigned to each class (let us assume probability p for class 0 and 1 − p for class 1). In particular, a threshold t is applied to the value of p to select between the two classes. The value of the threshold t is swept from 0 to 1, and a pair of TPR and FPR values is obtained for each value of t. The ROC is the curve of TPR plotted against FPR.
Support Vector Machines (SVM): Linear Support Vector Machines are efficient when dealing with sparse, high-dimensional datasets, including textual data. They have been shown to have good generalization and test accuracy, while having low computational complexity. These models learn a vector of feature weights, w, and an intercept, b, given the training dataset. Once the weights are learned, the label of a data point is determined by thresholding w^T x + b with 0, i.e., sign(w^T x + b). Alternatively, one can produce the probability that the data point belongs to either class by applying a logistic function instead of hard thresholding, i.e., calculating σ(w^T x + b).
The learning process of the parameters w and b involves solving the following optimization problem (with the class labels recoded as y_i ∈ {−1, +1}):
minimize over w, b, ξ:  (1/2)‖w‖² + γ Σ_i ξ_i
subject to:  y_i (w^T x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0 for all i,
where x_i is the ith data point and y_i is its class label. Minimizing the sum of the slack variables corresponds to minimizing the loss function on the training data. On the other hand, minimizing the first term, which is basically a regularization term, corresponds to maximizing the margin between the two classes. Note that in the objective function, each slack variable represents the amount of error that the classifier can tolerate for a given data sample.
The trade-off parameter γ controls the relative importance of the two components of the objective function. For instance, when γ ≫ 1, misclassification of individual points is highly penalized. This is called "Hard Margin SVM". In contrast, a "Soft Margin SVM", which is the case when γ ≪ 1, is very lenient towards misclassification of a few individual points as long as most data points are well separated.
QUESTION 5: Compare and contrast hard-margin and soft-margin linear SVMs:
• Train two linear SVMs:
– Train one SVM with γ = 2000 (hard margin), another with γ = 0.0005 (soft margin).
– Plot the ROC curve, report the confusion matrix and calculate the accuracy, recall, precision and F-1 score of both SVM classifiers on the testing set. Which one performs better? What about for γ = 100000?
– What happens for the soft margin SVM? Why is this the case? Analyze in terms of the confusion matrix.
∗ Does the ROC curve reflect the performance of the soft-margin SVM? Why?
• Use cross-validation to choose γ (use average validation accuracy to compare): Using 5-fold cross-validation, find the best value of the parameter γ in the range {10^k | −3 ≤ k ≤ 6, k ∈ Z}. Again, plot the ROC curve and report the confusion matrix and calculate the accuracy, recall, precision and F-1 score of this best SVM.
Logistic Regression: Logistic regression is a probability model that can be used for binary classification. In logistic regression, a logistic function (σ(ϕ) = 1/(1 + exp(−ϕ))) acting on a linear function of the features (ϕ(x) = w^T x + b) is used to calculate the probability that a data point belongs to a particular binary class, and during the training process, the parameters w and b that maximize the likelihood of predicting the correct labels on the training data are learned. One can also add a regularization term to the objective function, so that the goal of the training process is not only to maximize the likelihood, but also to minimize the regularization term, which is often some norm of the parameter vector w. Adding regularization helps prevent ill-conditioned results and over-fitting, and facilitates generalization. A coefficient is used to control the trade-off between maximizing likelihood and minimizing the regularization term.
QUESTION 6: Evaluate a logistic classifier:
• Train a logistic classifier without regularization (you may need to come up with some way to approximate this if you use sklearn.linear_model.LogisticRegression); plot the ROC curve and report the confusion matrix and calculate the accuracy, recall, precision and F-1 score of this classifier on the testing set.
• Find the optimal regularization coefficient:
– Using 5-fold cross-validation on the dimension-reduced-by-SVD training data, find the optimal regularization strength in the range {10^k | −5 ≤ k ≤ 5, k ∈ Z} for logistic regression with L1 regularization and logistic regression with L2 regularization, respectively.
– Compare the performance (accuracy, precision, recall and F-1 score) of 3 logistic classifiers: w/o regularization, w/ L1 regularization and w/ L2 regularization (with the best parameters you found from the part above), using test data.
– How does the regularization parameter affect the test error? How are the learnt coefficients affected? Why might one be interested in each type of regularization?
– Both logistic regression and linear SVM are trying to classify data points using a linear decision boundary. What is the difference between their ways of finding this boundary? Why do their performances differ?
Is this difference statistically significant?
Naïve Bayes Model: Scikit-learn provides a suite of Naïve Bayes classifiers, including MultinomialNB, BernoulliNB, and GaussianNB. Naïve Bayes classifiers use the assumption that features are statistically independent of each other when conditioned on the class the data point belongs to, in order to simplify the calculation for the Maximum A Posteriori (MAP) estimation of the labels. That is,
P(x_i | y, x_1, ..., x_(i−1), x_(i+1), ..., x_m) = P(x_i | y), for i ∈ {1, ..., m}
where the x_i's are features and y is the label of the data point. MultinomialNB, BernoulliNB, and GaussianNB use different underlying probability models.
QUESTION 7: Evaluate and profile a Naïve Bayes classifier: Train a GaussianNB classifier; plot the ROC curve and report the confusion matrix and calculate the accuracy, recall, precision and F-1 score of this classifier on the testing set.
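For Questions 5-7, the evaluation loop is the same once the LSI-reduced matrices exist; only the estimator changes. Here is a sketch for the GaussianNB case, assuming X_train_tfidf, X_test_tfidf, and 0/1 label arrays y_train and y_test already exist from the earlier questions (these variable names are assumptions). Swapping in an SVM or LogisticRegression covers the other questions; note that for an SVM without probability outputs, decision_function would replace predict_proba for the ROC scores.

from sklearn.decomposition import TruncatedSVD
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_curve)

# LSI: fit the projection on the training matrix only, then project the test matrix
svd = TruncatedSVD(n_components=25, random_state=42)
X_train_lsi = svd.fit_transform(X_train_tfidf)
X_test_lsi = svd.transform(X_test_tfidf)

clf = GaussianNB().fit(X_train_lsi, y_train)   # LSI output is dense, as GaussianNB requires
y_pred = clf.predict(X_test_lsi)
scores = clf.predict_proba(X_test_lsi)[:, 1]   # class-1 probabilities, swept to build the ROC

print(confusion_matrix(y_test, y_pred))
print("accuracy ", accuracy_score(y_test, y_pred))
print("recall   ", recall_score(y_test, y_pred))
print("precision", precision_score(y_test, y_pred))
print("F-1      ", f1_score(y_test, y_pred))
fpr, tpr, _ = roc_curve(y_test, scores)        # plot tpr against fpr to get the ROC curve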
GGR 203F - Introduction to Climatology
Problem Set One
Assigned: 6 Jan 2025
Due: 11:59 PM, Jan 27th, 2025, using the Assignments tab on the course Quercus site.
Taken up (via a recording that will be released): around 12:00 PM, Jan 30th
If you have not done so already, and have no previous experience using Excel for calculations, you should review the Excel guide that has been posted and spend some time exploring the various features and options with your own trial computations before starting this problem set. Doing so systematically now will save you a lot of time over the course of the 4 problem sets in this course. An Excel guide has been posted along with this problem set (suggestions on how to improve the guide are always welcome).
Meteorology and Climate
1. This first problem is a simple mathematical parable illustrating the difference between meteorology and climate. We shall consider a series of numbers generated by the formula
Xn+1 = (Xn - 2)^2
(a) In the first column of an Excel worksheet, set up numbers from 0 to 60. In the next column, next to the row with the zero, enter the value 0.7000. This is X0. Next, compute the next 60 terms of the series in the remaining rows of the second column of your Excel worksheet. Next, plot the time series (where the numbers in the first column are the X-values and the numbers in the second column are the Y-values). How would you describe the resulting plot? Do you see any obvious cycles? Marks: Calculations, 2; Word answers, 2; Graphs, 4
(b) Starting with X0 = 0.7002 in the third column, compute the next 60 terms of the series, as in (a), and plot them on the same graph. Compute the difference between the first two series in the 4th column. In predicting the weather, we start from some observed initial conditions and project into the future using equations based on the deterministic laws of physics. The simple equation we've used here illustrates some of the properties of the real atmosphere. Suppose that X0 = 0.7000 was the exact initial value, but that the value X0 = 0.7002 was the observed initial value, containing an error of +0.0002. What happens to this error as n increases (that is, for successive terms in the series)? 2 marks
(c) Let us suppose that we reduced our initial error by a factor of two. Recompute the above series starting with X0 = 0.7001 in the 5th column, plot it on the same graph as the first two series, and compute the difference between the 1st and 3rd series in the 6th column. Letting the series starting with X0 = 0.7000 be the "true" series, and letting the series starting with X0 = 0.7001 and X0 = 0.7002 be the "predicted" series, has our ability to predict into the future improved by cutting the initial error in half? If so, by how long? 2 marks
(d) Climate involves the average and variability of day-to-day weather events, but it is also only a sample, so, continuing our parable, prepare a single table that gives:
- the means based on the 1st 21 terms, the 1st 41 terms, and all 61 terms,
- the number of values less than 1.0 for all 61 terms, and
- the number of values greater than 3.0 for all 61 terms
for each of the above three series, and comment on what this might imply about our ability to predict future climates compared to future weather. 3 marks
Hand in a printout of your graph (only one, with 3 curves) and the first page of your Excel worksheet, the summary table from (d), and your comments to (a)-(d). [15 marks total Q1]
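If you want to sanity-check your spreadsheet, the series is small enough to reproduce anywhere. A sketch in Python (entirely optional; the assignment itself must be done in Excel):

# X_{n+1} = (X_n - 2)^2, starting from the exact and the "observed" initial values
x_exact, x_obs = 0.7000, 0.7002
for n in range(61):
    print(n, round(x_exact, 4), round(x_obs, 4), round(x_obs - x_exact, 4))
    x_exact = (x_exact - 2) ** 2
    x_obs = (x_obs - 2) ** 2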
Planck Function
2. The Planck function in terms of wavelength λ is given by
B(λ,T) = 2πhc² / [λ⁵ (exp(ch/(kλT)) − 1)]
which can be written as
B(λ,T) = C1 / [λ⁵ (exp(C2/(λT)) − 1)]
where C1 = 2πhc² = 3.7419 x 10^8 W m^-2 (μm)^4 and C2 = ch/k = 1.439 x 10^4 μm K (values and names of c, h, and k are given at the end of the problem set for the sake of completeness).
(a) In the above expression, the term C2/λT is said to be the "argument" of the exponential function. What do you notice about the dimensions (units) of the argument of an exponential function (at least in this case)? 2 marks
(b) Using the given values of C1 and C2, compute B(λ,T) in an Excel worksheet for T = 260 K. Put the constants that you will use and the temperature in separate rows at the top of the spreadsheet, along with the names of the constants or variables and the units. Below that, do the following (4 marks):
- in the first column, set up the boundaries of the various subintervals that will be used. These boundaries should go from 3.0 μm to 20.0 μm in 0.5-μm intervals, from 20.0 μm to 40.0 μm in 1-μm intervals, from 40.0 μm to 60.0 μm in 2-μm intervals, and from 60.0 μm to 80.0 μm in 5-μm intervals
- in the 2nd column, compute the midpoints of each of the intervals (you will have one fewer midpoint value than boundary values)
- in the 3rd column, compute the value of B(λ,T) at the λs given in the 2nd column. Note that you only need to type the formula once, in the first row; then you can drag it down, but don't forget to put in $ signs where they are needed (see the handout on using Excel). Do not type the values of the constants into your equation; rather, set the values of the constants in various cells near the top of the worksheet (and type the symbol for each constant, the name of the constant if it has a name, and the units of the constant next to each constant). Then, have the equation that you are typing use the cells at the top of the spreadsheet where the values are assigned, by clicking on that cell as you type the equation, and putting a $ before the letter for the column and before the row number.
- Plot your results.
- In the 4th column, compute the widths of the intervals centred at the λs given in the 2nd column.
- Use the 5th column to estimate the total emission graphically (that is, estimate the area under the curve shown in your plot), with the answer at the bottom of this column, then compute σT^4 (the total emission according to the Stefan-Boltzmann Law) below that, where σ = 5.673 x 10^-8 W m^-2 K^-4. How close are the two values? What could be done (without cheating) to make the two values closer to one another? 2 marks
(c) Confirm that the wavelength of peak emission agrees with Wien's law (compute the expected value from Wien's law somewhere in your worksheet). 2 marks
(d) Copy the worksheet two times, then change the temperature to 280 K for one sheet and 300 K for the other sheet. Plot the results for all three cases on the same graph (5 marks). Be sure to label both axes and to include units in the labels, to add a title to the graph, and to adjust the size of all fonts to about 14 or 16 so that everything is easily legible. Comment on how the emission of radiation and its distribution change with temperature. 2 marks
You should hand in the following: the graph of your 3 Planck function curves (all on one chart), the first worksheet, and a single sheet with a table giving the three graphical estimates of total radiation, the total radiation according to the Stefan-Boltzmann law for each case, the wavelength of peak emission, and your comments or answers to the above questions.
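Here too, a quick cross-check outside the required Excel workflow can catch formula errors early. A sketch that evaluates B(λ,T) at the given constants and compares an integrated total with σT^4 (the fine grid spacing is an arbitrary choice for this check, not the worksheet's subintervals):

import numpy as np

C1 = 3.7419e8     # 2*pi*h*c^2, in W m-2 (micrometre)^4
C2 = 1.439e4      # c*h/k, in micrometre K
sigma = 5.673e-8  # Stefan-Boltzmann constant, W m-2 K-4
T = 260.0         # temperature in K

def planck(lam, T):
    # blackbody emission per unit wavelength, W m-2 per micrometre; lam in micrometres
    return C1 / (lam**5 * np.expm1(C2 / (lam * T)))

lam = np.linspace(3.0, 80.0, 2000)
vals = planck(lam, T)
total = np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(lam))  # trapezoidal area under the curve
print(total, sigma * T**4)  # compare the graphical estimate with the Stefan-Boltzmann total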
[17 marks total Q2]
Emission and absorption of radiation
3. Based on the fundamental units, (i) show that 1 joule = 10^7 ergs, and (ii) show that 1 newton = 10^5 dynes. (iii) Given the density of water as 1 g cm^-3, what is the density in kg m^-3? 3 marks
4. What are the wavelengths of electromagnetic radiation emitted by electric oscillators of frequencies 1.2 x 10^15 s^-1, 1.2 x 10^14 s^-1, and 3.6 x 10^13 s^-1? In what part of the electromagnetic spectrum does each of these oscillators lie? What kind of oscillator is each likely to involve (i.e., oscillating electron, vibrating dipole molecule, or rotating dipole molecule)? 6 marks
5. The earth is 1.7% closer to the sun than average on 3 January, and 1.6% further than average on 5 July. Using a solar constant of 1370 W m^-2, by how much would the solar flux density (W m^-2) on a plane perpendicular to the sun's rays vary between these two dates? Given a global mean albedo of 0.3, what would be the change in absorbed energy from January to July, averaged over the entire globe? 5 marks
6. If the incident solar flux at a surface is 520 W m^-2, what would be the change in the amount of absorbed solar energy (in W m^-2) if the surface changed from snow-free grassland with surface albedo αs = 0.25 to a deep snow cover with αs = 0.80? 3 marks
Investigating the atmospheric ("greenhouse") effect
7. Consider the simple one-layer atmosphere + surface model developed in class.
(a) Compute the infrared emission of radiation to space for a surface temperature of 300 K and an atmospheric temperature of 250 K, for atmospheric emissivities of 0.0, 0.1, 0.5, and 1.0. Explain your results briefly.
(b) Compute the infrared emission of radiation to space for a surface temperature of 300 K, an atmospheric emissivity of 0.3, and atmospheric temperatures of 290 K, 300 K, and 310 K. Explain your results for each case in comparison to the surface emission. Which of these cases corresponds to a "greenhouse" effect and which to an "anti-greenhouse" effect?
(c) We can mimic the direct effect of a CO2 increase by increasing the atmospheric emissivity slightly, from 0.3 to 0.305. Compute the effect that this increase has on the infrared emission to space for a surface temperature of 300 K and atmospheric temperatures of 290 K and 310 K, and comment on your results. Do all your calculations in Excel, and summarize your results in a single table. 13 marks total for Q7
62 marks total
Constants:
Stefan-Boltzmann constant: 5.673 x 10^-8 W m^-2 K^-4
Speed of light, c: 2.998 x 10^8 m s^-1
Stylistic Issues Concerning Graphs
1. There should be a non-vague title at the top of the graph.
2. Make sure that each axis has a clear label, with units where appropriate.
3. Omit any unnecessary zeros after the decimal point in the axis tick labels (i.e., labels should be 5, 10, etc., and not 5.000, 10.000).
4. Use a constant precision in the axis tick labels (i.e., 0.0, 0.5, 1.0, 1.5 …, not 0, 0.5, 1, 1.5).
5. Use a good-sized font (such as 14 or 16 pt) for all tick and axis labels and for legend entries.
6. For graphs with a y-axis spanning positive and negative values, make sure that the x-axis and axis labels align with the lowest point on the y-axis, not with the zero point on the y-axis.
7. Legends should go in a box within the graph area, not at the bottom or side of the graph. Click on the legend once you've moved it and select "Fill", "White" to block out the lines behind the legend, and put a boundary around it.
Marks will be deducted if these rules are not followed.
Marks will also be deducted for spelling mistakes and improper grammar. The bottom line is that your work should look professional.
Eliminating Excess Digits in Worksheets
Sometimes Excel gives the result of calculations with 8-9 or more decimal digits, far more than can be justified or are necessary. These make the worksheet harder on the eyes and require wider columns, and so should be eliminated. To do this, highlight the cells to be changed, then right-click and select Format Cells, then Number, and in the "Decimal Places" window, select some reasonable number. Afterwards, make the columns narrower if necessary so that a table fits within one screen width (for example).
Format for submission of the problem set: Please make sure that you submit your problem set as a single pdf file, with all the answers to your questions in the correct order. The easiest way to do that is to type the answers to the word questions in the Excel sheet right below your calculations. You can use the "Save as Adobe PDF" command under "File", and Excel will create a single pdf file with separate pages for each worksheet - so make sure that your worksheets are in the correct order. Please verify that the resulting pdf file looks good (i.e., without the image taking up a small corner of the page instead of the entire page). You might need to highlight the part of the page that should go into the pdf (under the Page Layout tab and then "Print Area"), so that only that part is shown (that is, without wasted space at the end). Bottom line: would you want to mark 30 assignments that look like yours?