Skip to content
Menu
Timeless College
  • Alex Cavazzoni
  • Art Classes Near Me
  • AWS Training in Virginia
  • Bad Influence On Children
  • BROWZ safety compliance
  • Clarence McClendon
  • Digital Marketing Consultancy Kelowna
  • Freedom of speech on social media
  • Https://timelesscollege.xyz/
  • Https://timelesscollege.xyz/ – Timelesscollege.xyz
  • Https://www.timelesscollege.xyz/
  • Https://www.timelesscollege.xyz/ – Timelesscollege.xyz
  • In Home Tutoring
  • Integrated Atpl
  • Jewish Intimacy
  • Learn to play guitar online
  • Online Baseball Hitting Trainer
  • Prince George School
  • Quickbooks Classes
  • Sample Page
  • Schreibwettbewerb
  • STOCKS CRYPTO FOREX Trading
  • Timeless College
  • Timeless College – Timelesscollege.xyz
  • Timelesscollege.xyz/
  • Timelesscollege.xyz/ – Timelesscollege.xyz
  • Training as a Pilot
  • Website Creation Atlanta
Timeless College
COMP8210/COMP8210-2021Public – My Assignment Tutor

COMP8210/COMP8210-2021Public – My Assignment Tutor

December 29, 2021 by seo_automation_owner

Skip to content Why GitHub? TeamEnterpriseExplore MarketplacePricing  Sign inSign up COMP8210/COMP8210-2021Public NotificationsFork 4 Star 1 CodeIssuesPull requestsActionsProjectsWikiSecurityInsights  main  COMP8210-2021/assignments/A3/A3.md Go to filedmollaaliodA3 typo 2Latest commit 3d61893 on 12 Oct History1 contributor107 lines (61 sloc)  7.95 KBRawBlame Assignment 3 – Data Analysis Assessment weight: 25% of the total unit assessment. Penalty for late submission: 2.5 marks per day late (or part thereof). This represents 10% of the total available as indicated in the unit guide. Version 2 – 12 October 2021: Fixed typo in task 4. Assignment description In this assignment you will provide data analysis within the context of a practical application. The assignment is designed so that it can be completed using Python packages introduced during the lectures and workshops of weeks 7 to 10. All the exercises in this assignment must be completed in a single Python Jupyter notebook. Topic: Analysis of medical publications about COVID-19 The current pandemic has triggered an unprecedented amount of published medical research with the aim to find treatments and vaccinations as soon as possible. This has also led to efforts to gather these research publications and conduct text information retrieval and text mining on them. In this assignment we will use a fragment of the most widely used resource of this kind: CORD-19. This resource has been made available via several outlets, probably the most popular of which is Kaggle: https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge We will use Kaggle’s subset stored in the file metadata.csv. The data available in the above link is continuously changing. To ensure consistency among all submissions, please use the following local copy: metadata.csv.zip (390MB download) Open the CSV file and examine its contents. You will see that the data is organised in columns containing the following information (among others): Source (e.g. PMC, Medline, etc)Publication titleJournal titleDOIAbstractDate of publication This assignment is divided in several tasks. Each task addresses a question that you need to answer by writing Python code, producing one or more charts, and explaining how the chart(s) answer(s) the question. Some of the tasks will tell you exactly what kind of chart you need to produce, but other tasks will only express a need for some information and you need to decide what chart or set of charts would help find the information. Your report must be structured using the task titles as headings. Each section must contain the following subheadings or equivalent: Chart (or Charts when appropriate), where you insert the Python code that generates the charts, and the resulting charts. Make sure that the charts have the appropriate titles and legends.Discussion, where you explain how the chart(s) can be used to answer the question asked in the task. Note that, depending on your computer, the size of the data set may be too large for some of the tasks below. If that is the case, feel free to work using a random sample of the data. In fact, it is advisable that you conduct your first tests using a small sample and then report the results on a larger sample (or the entire data set if your computer allows). Task 1 (5 marks) – What is the distribution of journals per source? Identify the 10 most frequent journals. Then, for each publication source, display the distribution of publications in each journal. The final result should be presented as a clustered column chart like the one below. We have obscured the journal names and we have not added a title or sensible axis names — your solution should have this information. You also need to explain how you produced the chart, and how to interpret it. hint 1: The chart was made using panda’s plotting tools but feel free to use your favourite tool.hint 2: Note that the y axis is using a log transform.hint 3: If a publication has multiple sources, use the first source. Task 2 (5 marks) – What are the main clusters? Using the text from the abstracts, cluster the data set and visualise the clusters. To cluster the data, you can use KMeans or another clustering method of your choice. Choose a reasonable metric to determine the similarity between abstracts, and choose a reasonable number of clusters. To visualise the clusters, produce a 2D plot where the axes represent a 2D representation of each document, each document is represented as a marker, and the colour of the marker indicates the cluster to which the document belongs. The visualisation may look like this. Just note that the image below is using different data, so your clusters will look different, and the number of clusters may be different too (source: https://www.kaggle.com/maksimeren/covid-19-literature-clustering). hint: to obtain the 2D representation of a document, try Principal Component Analysis (PCA) or t-SNE. Task 3 (5 marks) – For each cluster, what are the most representative words? To complete this task, create a word cloud for each cluster so that the size of each word in a word cloud represents the importance of the word in the corresponding cluster. When you build the word clouds, make sure that the words displayed are informative. For example, remove stop words, and remove other words that are common across all clusters. It is your choice to determine how to determine the importance of a word. Possible options may be: Use word frequency and remove stop words plus other words you may think are not informative.Use tf.idf.Research techniques used for extracting keywords. Task 4 (5 marks) – What are the most common topics? To complete this task, you need to conduct topic modelling in order to identify the main topics of the opinions expressed in the abstracts. For example, you can perform Latent Dirichlet Allocation. Present the results in the appropriate chart or charts, and write how the charts answer the question of this task. To perform topic modelling you normally need to specify the number of topics. In your report, write how you determined the number of topics. For example, did you try different numbers of topics and choose the number that reported most satisfactory results? If so, how did you determine the most satisfactory results? Task 5 (5 marks) – What are the most common topics in each cluster? Use the information from the topics of task 4 to characterise the topics in each cluster. How to do this, and how to express this information, is up to you. What kind of information do we want to know? Here are some questions that you may want to answer: Are there any topics that are more (or less) common in the clusters?For each topic, what are the most representative clusters?For each cluster, what are its most representative topics?What additional information does the topic modelling stage tell us about each cluster? What you should not do: please don’t conduct a separate topic modelling in each cluster. We want to know whether, for example, cluster 1 and cluster 2 share many common topics. If you conduct a separate topic modelling in each cluster, we will not know if they have topics in common. Submission There are two submission boxes. Turnitin submission: In this submission box, submit your report as a PDF file. This file can be your Python notebook exported to PDF. Do not submit your Python .ipynb file here.Code submission: In this submission box, submit your Python notebook with all the code that generated the PDF report. Make sure that we understand how the code can produce the charts and other analysis results described in the PDF report. Make sure that the Python notebook contains the code, output, and your analysis. A note about plagiarism By submitting this assignment you acknowledge that it is your own work and that you did not copy material from other people or the web. There are penalties for plagiarism. Assessment Rubric Each task will be assessed according to the rubric attached to the assignment. © 2021 GitHub, Inc. TermsPrivacySecurityStatusDocsContact GitHubPricingAPITrainingBlogAbout

  • Assignment status: Already Solved By Our Experts
  • (USA, AUS, UK & CA PhD. Writers)
  • CLICK HERE TO GET A PROFESSIONAL WRITER TO WORK ON THIS PAPER AND OTHER SIMILAR PAPERS, GET A NON PLAGIARIZED PAPER FROM OUR EXPERTS
QUALITY: 100% ORIGINAL PAPER – NO PLAGIARISM – CUSTOM PAPER

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

Recent Posts

  • MMPH6006 Evidence Based Practice:Acquired Pressure Ulcers PUs
  • Nursing Midwifery And Health Management
  • Which communication method(s) would be most effective for each of the following scenarios?
  • MKT 806 Marketing Research Decision Making- Methods of Exploratory
  • MKT 806 Marketing Research Decision Making- Establish Priorities

Recent Comments

  • A WordPress Commenter on Hello world!

Archives

  • May 2022
  • April 2022
  • March 2022
  • February 2022
  • January 2022
  • December 2021
  • November 2021
  • October 2021
  • September 2021

Categories

  • Uncategorized

Meta

  • Log in
  • Entries feed
  • Comments feed
  • WordPress.org
©2022 Timeless College | Powered by WordPress and Superb Themes!