Q1(d) (11 marks)
We want to be able to analyse the words in long documents to find the most
frequently used ones. This can be used, for example, to identify the most important words for
language learning or to try to identify the authors of literary works. Later on we will ask you to
analyse two Shakespeare plays, Hamlet and The Merchant of Venice, to find the 20 most
frequent words and the number of times each word occurs. Because the most common words are
mainly stop words (articles, prepositions, etc.) and the names of the plays' characters (e.g. Hamlet,
Horatio, Portia, etc.), we will also want the ability to exclude certain words from the analysis.
First we want to explore the problem in a more general abstract form.
Given the name (string) of a text file containing words, a positive integer m and a text file
containing excluded words (strings), find the m most frequent words in the file (apart from the
excluded words) and their frequencies, given in descending order of frequency.
We define this more formally, as follows:
Operation: Most common in file
Inputs: filename, a string; excluded-words-filename, a string; m, an integer
Preconditions: The files named filename and excluded-words-filename are text files; m > 0
Outputs: most-common-words, a list of at most m items, where each item is a tuple
Postconditions: Each item of most-common-words is a tuple containing a word from the file
filename together with its frequency, with the list being in descending order of frequency, and no
tuples for words from excluded-words-filename are in the list.
The frequency component of each tuple in most-common-words is greater than or equal to the
frequency of occurrence of any word in filename that is not in the list, ignoring any words in
excluded-words-filename.
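To make the specification concrete, here is a minimal sketch of the operation working on words already held in memory rather than read from files. It assumes Python's collections.Counter as the bag and a set for the excluded words; the function name most_common_words and the sample data are hypothetical and not part of the assignment framework.

```python
from collections import Counter
from typing import Iterable


def most_common_words(words: Iterable[str], excluded: set, m: int) -> list:
    """Return up to m (word, frequency) tuples, most frequent first."""
    # Build the bag, skipping excluded words; set membership tests
    # take constant time on average.
    bag = Counter(word for word in words if word not in excluded)
    # Counter.most_common(m) returns the m highest counts in descending order.
    return bag.most_common(m)


# Hypothetical sample: "c" is excluded, so "b" (2) and "a" (1) remain.
sample = ["b", "a", "c", "b", "c", "c"]
print(most_common_words(sample, {"c"}, 2))
```

If fewer than m distinct words remain after exclusion, the list is simply shorter, which matches the "at most m items" wording of the outputs.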
Q1(d)(i) (3 marks)
The main ADT to use for storing the text should be a bag. You will also need to choose a
suitable ADT for the excluded words, and you can also use other standard simple built-in data
structures of Python such as lists or strings, if necessary.
State what sort of ADTs and data structures you would use for this problem and explain what is
stored in these ADTs. Do not explain your choices at this stage - we will ask about that later.
Add your answer for Q1(d)(i) here:
Q1(d)(ii) (3 marks)
Give a step-by-step explanation, showing how your solution would work.
Write your answer to Q1(d)(ii) here
Q1(d)(iii) (5 marks)
Now explain your chosen approach by outlining the characteristics and the expected performance
of the operations on bags and other ADTs/data structures you have used, in standard Python
implementations. You should reference the performance discussions for bags in Chapter 8 and
relevant performance discussions elsewhere in the module text for other ADTs/data structures.
Write your answer to Q1(d)(iii) here
Q1(e) (12 marks)
Implement your approach from part (d) to solve the abstract problem introduced in part (d) and
extended somewhat here:
Analyse two given literary texts to find the 20 most frequent words and the number of times each
word occurs. Exclude from the analysis the words that are often most common but less important
to the analysis: so-called stop words (articles, prepositions, etc.) and words naming the text's
characters. The original files for the two texts may contain line punctuation and extraneous
characters at the start or end of words, such as apostrophes and dashes, and these should be
removed before further processing. The excluded words relevant to each text are listed in a given
text file, which has been cleaned so that it contains only the relevant words, without any
punctuation or extraneous characters.
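The cleaning step described above is done for you by the provided read_and_clean_file function, whose implementation is not shown here. Purely as an illustration of the idea, the sketch below strips punctuation from the start and end of each word using str.strip with string.punctuation. The function name clean_words is hypothetical, and the provided function may well work differently.

```python
import string


def clean_words(line: str) -> list:
    """Split a line into words, stripping punctuation from both ends of each."""
    cleaned = []
    for token in line.split():
        # strip() removes the listed characters only from the two ends,
        # so an apostrophe inside a word (e.g. "play's") is kept.
        word = token.strip(string.punctuation)
        if word:  # tokens made only of punctuation, such as "--", become empty
            cleaned.append(word)
    return cleaned


print(clean_words("'Tis now -- the very \"witching\" time,"))
```

Note that string.punctuation covers ASCII punctuation only, so typographic quotes or dashes would need to be added to the stripped characters separately.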
In this case the first text is Shakespeare's Hamlet (in the given text file hamlet.txt) with excluded
words listed in the given text file hamlet_excluded_words.txt. The second text is Shakespeare's
Merchant of Venice (in the given text file merchant.txt) with excluded words listed in the given
text file merchant_excluded_words.txt.
We also want you to find the words that occur in both these texts and the number of occurrences
in common for these words, e.g. if dog occurs 10 times in the first text and 25 times in the second
text, then there are 10 occurrences in common (the smaller of the two counts).
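As a small illustration of "occurrences in common", and assuming the bags are represented by Python's collections.Counter, the & operator computes the bag intersection by keeping, for each word, the minimum of its two counts. The counts other than those for dog are invented for the example.

```python
from collections import Counter

# Hypothetical bags for two texts; only the dog counts come from the example above.
first_text = Counter({"dog": 10, "cat": 3})
second_text = Counter({"dog": 25, "bird": 4})

# Bag intersection: each shared word keeps the smaller of its two counts,
# and words that appear in only one bag are dropped.
in_common = first_text & second_text

print(in_common)
```

Here only dog appears in both bags, so the intersection holds dog with a count of 10.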
We have provided code frameworks for your solution below. We have split the problem and the
code framework into two parts so you can do one bit at a time and check each part is working.
Q1(e)(i) (5 marks)
The first part of the problem requires reading a text from a file, eliminating excluded words, and
storing the results in a bag.
As in part (d), we define this more formally, as follows:
Operation: Get bag from file
Inputs: filename, a string; excluded-words-filename, a string
Preconditions: The files named filename and excluded-words-filename are text files
Outputs: text, a bag of words
Postconditions: The bag text contains all words from the file filename together with their
frequency of occurrence, except that any words from excluded-words-filename are omitted.
Here is the code framework for this first part of the problem. We have included some simple test
code, to let you check if your code seems to be working so far.
Please make the required changes as indicated by the comments. When you have finished, run your
code to view the output.
# Change this code in the places indicated
# in order to implement and test your solution
%run -i m269_util
# Import the functions read_file and read_and_clean_file
%run -i m269_tma03_filehandling

# You will need to amend this function so that the excluded
# words extracted from the list are added to the data structure
# that is returned. You should also set the type annotation
# for the function return value
def get_excluded_words(word_list: list):
    """Returns the excluded words occurring in word_list
    in a suitable data structure. Here we use a set.
    """
    # replace the following with your code to initialise the data structure
    words = None
    # replace the following with your code to add words from the list if not blank
    pass
    return words

# You will need to amend this function so that the words from the list are
# added to the data structure that is returned, apart from the excluded words.
# You should also set the type annotations for the excluded_words argument
def bag_of_words(word_list: list, excluded_words) -> Counter:
    """Return the words occurring in word_list as a bag-of-words,
    omitting any excluded words
    """
    words = Counter()
    # replace the following with your code to add words from the list if not
    # an excluded word
    pass
    return words

# You should not need to change this function
def get_bag_from_file(filename: str, excluded_words_file: str) -> Counter:
    """Return a bag of the words in the text of file "filename", omitting
    all the words listed in file "excluded_words_file"
    """
    excluded_words_list = read_file(excluded_words_file)
    excluded_words = get_excluded_words(excluded_words_list)
    text_list = read_and_clean_file(filename)
    text = bag_of_words(text_list, excluded_words)
    return text

# We have provided the following code here to allow you to check if your
# code is working so far. You should not need to change it unless you wish
# to add more tests of your own.
print("Collecting words in text...")
text = get_bag_from_file('hamlet.txt', 'hamlet_excluded_words.txt')
test(size, [['Size of text', text, 20635]])
test(text.most_common, [['Most common word in text', 1, [('lord', 223)]]])

Más contenido relacionado

Similar a Q1(d) (11 marks)We want to be able to carry out an analysis of w.pdf

InstructionYou’ll probably want to import FileReader, PrintWriter,.pdf
InstructionYou’ll probably want to import FileReader, PrintWriter,.pdfInstructionYou’ll probably want to import FileReader, PrintWriter,.pdf
InstructionYou’ll probably want to import FileReader, PrintWriter,.pdfarsmobiles
 
It’s sometimes useful to make a little language for a simple problem.pdf
It’s sometimes useful to make a little language for a simple problem.pdfIt’s sometimes useful to make a little language for a simple problem.pdf
It’s sometimes useful to make a little language for a simple problem.pdfarri2009av
 
Functions in C++
Functions in C++Functions in C++
Functions in C++home
 
A Brief Overview of (Static) Program Query Languages
A Brief Overview of (Static) Program Query LanguagesA Brief Overview of (Static) Program Query Languages
A Brief Overview of (Static) Program Query LanguagesKim Mens
 
SessionTwo_MakingFunctionCalls
SessionTwo_MakingFunctionCallsSessionTwo_MakingFunctionCalls
SessionTwo_MakingFunctionCallsHellen Gakuruh
 
Matlab: Procedures And Functions
Matlab: Procedures And FunctionsMatlab: Procedures And Functions
Matlab: Procedures And Functionsmatlab Content
 
Procedures And Functions in Matlab
Procedures And Functions in MatlabProcedures And Functions in Matlab
Procedures And Functions in MatlabDataminingTools Inc
 
Information Retrieval-05(wild card query_positional index_spell correction)
Information Retrieval-05(wild card query_positional index_spell correction)Information Retrieval-05(wild card query_positional index_spell correction)
Information Retrieval-05(wild card query_positional index_spell correction)Jeet Das
 
Lab 1 Recursion  Introduction   Tracery (tracery.io.docx
Lab 1 Recursion  Introduction   Tracery (tracery.io.docxLab 1 Recursion  Introduction   Tracery (tracery.io.docx
Lab 1 Recursion  Introduction   Tracery (tracery.io.docxsmile790243
 
Python regular expressions
Python regular expressionsPython regular expressions
Python regular expressionsKrishna Nanda
 
C Language (All Concept)
C Language (All Concept)C Language (All Concept)
C Language (All Concept)sachindane
 
C interview-questions-techpreparation
C interview-questions-techpreparationC interview-questions-techpreparation
C interview-questions-techpreparationKushaal Singla
 
Article link httpiveybusinessjournal.compublicationmanaging-.docx
Article link httpiveybusinessjournal.compublicationmanaging-.docxArticle link httpiveybusinessjournal.compublicationmanaging-.docx
Article link httpiveybusinessjournal.compublicationmanaging-.docxfredharris32
 

Similar a Q1(d) (11 marks)We want to be able to carry out an analysis of w.pdf (20)

InstructionYou’ll probably want to import FileReader, PrintWriter,.pdf
InstructionYou’ll probably want to import FileReader, PrintWriter,.pdfInstructionYou’ll probably want to import FileReader, PrintWriter,.pdf
InstructionYou’ll probably want to import FileReader, PrintWriter,.pdf
 
FinalReport
FinalReportFinalReport
FinalReport
 
It’s sometimes useful to make a little language for a simple problem.pdf
It’s sometimes useful to make a little language for a simple problem.pdfIt’s sometimes useful to make a little language for a simple problem.pdf
It’s sometimes useful to make a little language for a simple problem.pdf
 
Functions in C++
Functions in C++Functions in C++
Functions in C++
 
A Brief Overview of (Static) Program Query Languages
A Brief Overview of (Static) Program Query LanguagesA Brief Overview of (Static) Program Query Languages
A Brief Overview of (Static) Program Query Languages
 
SessionTwo_MakingFunctionCalls
SessionTwo_MakingFunctionCallsSessionTwo_MakingFunctionCalls
SessionTwo_MakingFunctionCalls
 
Text Mining with R
Text Mining with RText Mining with R
Text Mining with R
 
Text Mining Analytics 101
Text Mining Analytics 101Text Mining Analytics 101
Text Mining Analytics 101
 
Matlab: Procedures And Functions
Matlab: Procedures And FunctionsMatlab: Procedures And Functions
Matlab: Procedures And Functions
 
Procedures And Functions in Matlab
Procedures And Functions in MatlabProcedures And Functions in Matlab
Procedures And Functions in Matlab
 
Pcd question bank
Pcd question bank Pcd question bank
Pcd question bank
 
Information Retrieval-05(wild card query_positional index_spell correction)
Information Retrieval-05(wild card query_positional index_spell correction)Information Retrieval-05(wild card query_positional index_spell correction)
Information Retrieval-05(wild card query_positional index_spell correction)
 
Lab 1 Recursion  Introduction   Tracery (tracery.io.docx
Lab 1 Recursion  Introduction   Tracery (tracery.io.docxLab 1 Recursion  Introduction   Tracery (tracery.io.docx
Lab 1 Recursion  Introduction   Tracery (tracery.io.docx
 
2013 - Andrei Zmievski: Clínica Regex
2013 - Andrei Zmievski: Clínica Regex2013 - Andrei Zmievski: Clínica Regex
2013 - Andrei Zmievski: Clínica Regex
 
Python regular expressions
Python regular expressionsPython regular expressions
Python regular expressions
 
C Language (All Concept)
C Language (All Concept)C Language (All Concept)
C Language (All Concept)
 
qb unit2 solve eem201.pdf
qb unit2 solve eem201.pdfqb unit2 solve eem201.pdf
qb unit2 solve eem201.pdf
 
Erlang session1
Erlang session1Erlang session1
Erlang session1
 
C interview-questions-techpreparation
C interview-questions-techpreparationC interview-questions-techpreparation
C interview-questions-techpreparation
 
Article link httpiveybusinessjournal.compublicationmanaging-.docx
Article link httpiveybusinessjournal.compublicationmanaging-.docxArticle link httpiveybusinessjournal.compublicationmanaging-.docx
Article link httpiveybusinessjournal.compublicationmanaging-.docx
 

Más de alsofshionchennai

Q15 Amabook has average variable costs of $1 and average total costs.pdf
Q15 Amabook has average variable costs of $1 and average total costs.pdfQ15 Amabook has average variable costs of $1 and average total costs.pdf
Q15 Amabook has average variable costs of $1 and average total costs.pdfalsofshionchennai
 
Provide background and analysis ono The Indian initial farmers p.pdf
Provide background and analysis ono The Indian initial farmers p.pdfProvide background and analysis ono The Indian initial farmers p.pdf
Provide background and analysis ono The Indian initial farmers p.pdfalsofshionchennai
 
Provide a detailed description for each of the following measures of.pdf
Provide a detailed description for each of the following measures of.pdfProvide a detailed description for each of the following measures of.pdf
Provide a detailed description for each of the following measures of.pdfalsofshionchennai
 
provide a brief description paragraph on the fungi, then the taxon.pdf
provide a brief description paragraph on the fungi, then the taxon.pdfprovide a brief description paragraph on the fungi, then the taxon.pdf
provide a brief description paragraph on the fungi, then the taxon.pdfalsofshionchennai
 
Proporcione un ejemplo de c�mo las pr�cticas deficientes de gobierno.pdf
Proporcione un ejemplo de c�mo las pr�cticas deficientes de gobierno.pdfProporcione un ejemplo de c�mo las pr�cticas deficientes de gobierno.pdf
Proporcione un ejemplo de c�mo las pr�cticas deficientes de gobierno.pdfalsofshionchennai
 
Prompt Your task is to create a connected list implementation and .pdf
Prompt Your task is to create a connected list implementation and .pdfPrompt Your task is to create a connected list implementation and .pdf
Prompt Your task is to create a connected list implementation and .pdfalsofshionchennai
 
Project ScheduleUse Goods Company Inc. HRM Standardization Project.pdf
Project ScheduleUse Goods Company Inc. HRM Standardization Project.pdfProject ScheduleUse Goods Company Inc. HRM Standardization Project.pdf
Project ScheduleUse Goods Company Inc. HRM Standardization Project.pdfalsofshionchennai
 
Project ScenarioPecos Company acquired 100 percent of Suaros outs.pdf
Project ScenarioPecos Company acquired 100 percent of Suaros outs.pdfProject ScenarioPecos Company acquired 100 percent of Suaros outs.pdf
Project ScenarioPecos Company acquired 100 percent of Suaros outs.pdfalsofshionchennai
 
Professor Jones is very particular when it comes to his morning coff.pdf
Professor Jones is very particular when it comes to his morning coff.pdfProfessor Jones is very particular when it comes to his morning coff.pdf
Professor Jones is very particular when it comes to his morning coff.pdfalsofshionchennai
 
Program Specifications ( please show full working code that builds s.pdf
Program Specifications ( please show full working code that builds s.pdfProgram Specifications ( please show full working code that builds s.pdf
Program Specifications ( please show full working code that builds s.pdfalsofshionchennai
 
Productos m�dicos de Penner El lunes 14 de abril, Neil Bennett, Ge.pdf
Productos m�dicos de Penner El lunes 14 de abril, Neil Bennett, Ge.pdfProductos m�dicos de Penner El lunes 14 de abril, Neil Bennett, Ge.pdf
Productos m�dicos de Penner El lunes 14 de abril, Neil Bennett, Ge.pdfalsofshionchennai
 
P�Pa+Ba Hice bit Holndsiteur soors places at Non bed.pdf
P�Pa+Ba Hice bit Holndsiteur soors places at Non bed.pdfP�Pa+Ba Hice bit Holndsiteur soors places at Non bed.pdf
P�Pa+Ba Hice bit Holndsiteur soors places at Non bed.pdfalsofshionchennai
 
Q1.7. What would happen if you could magically turn off decompositio.pdf
Q1.7. What would happen if you could magically turn off decompositio.pdfQ1.7. What would happen if you could magically turn off decompositio.pdf
Q1.7. What would happen if you could magically turn off decompositio.pdfalsofshionchennai
 
Progressive Corporation (a property and casualty insurance company) .pdf
Progressive Corporation (a property and casualty insurance company) .pdfProgressive Corporation (a property and casualty insurance company) .pdf
Progressive Corporation (a property and casualty insurance company) .pdfalsofshionchennai
 
Q1. (a) Briefly introduce how Force-directed algorithms encode netwo.pdf
Q1. (a) Briefly introduce how Force-directed algorithms encode netwo.pdfQ1. (a) Briefly introduce how Force-directed algorithms encode netwo.pdf
Q1. (a) Briefly introduce how Force-directed algorithms encode netwo.pdfalsofshionchennai
 
Q1. part A. can we use if statement and skip else part(ye.pdf
Q1. part A. can we use if statement and skip else part(ye.pdfQ1. part A. can we use if statement and skip else part(ye.pdf
Q1. part A. can we use if statement and skip else part(ye.pdfalsofshionchennai
 
Q1. Fiscal policy is often focused on replacing spending that is no.pdf
Q1.  Fiscal policy is often focused on replacing spending that is no.pdfQ1.  Fiscal policy is often focused on replacing spending that is no.pdf
Q1. Fiscal policy is often focused on replacing spending that is no.pdfalsofshionchennai
 
Q1 Which of the following would be considered a transport epithelium.pdf
Q1 Which of the following would be considered a transport epithelium.pdfQ1 Which of the following would be considered a transport epithelium.pdf
Q1 Which of the following would be considered a transport epithelium.pdfalsofshionchennai
 
Q1 Find two thoracic vertebrae that fit together and identify .pdf
Q1 Find two thoracic vertebrae that fit together and identify .pdfQ1 Find two thoracic vertebrae that fit together and identify .pdf
Q1 Find two thoracic vertebrae that fit together and identify .pdfalsofshionchennai
 

Más de alsofshionchennai (20)

Q15 Amabook has average variable costs of $1 and average total costs.pdf
Q15 Amabook has average variable costs of $1 and average total costs.pdfQ15 Amabook has average variable costs of $1 and average total costs.pdf
Q15 Amabook has average variable costs of $1 and average total costs.pdf
 
Provide background and analysis ono The Indian initial farmers p.pdf
Provide background and analysis ono The Indian initial farmers p.pdfProvide background and analysis ono The Indian initial farmers p.pdf
Provide background and analysis ono The Indian initial farmers p.pdf
 
Provide a detailed description for each of the following measures of.pdf
Provide a detailed description for each of the following measures of.pdfProvide a detailed description for each of the following measures of.pdf
Provide a detailed description for each of the following measures of.pdf
 
provide a brief description paragraph on the fungi, then the taxon.pdf
provide a brief description paragraph on the fungi, then the taxon.pdfprovide a brief description paragraph on the fungi, then the taxon.pdf
provide a brief description paragraph on the fungi, then the taxon.pdf
 
Proporcione un ejemplo de c�mo las pr�cticas deficientes de gobierno.pdf
Proporcione un ejemplo de c�mo las pr�cticas deficientes de gobierno.pdfProporcione un ejemplo de c�mo las pr�cticas deficientes de gobierno.pdf
Proporcione un ejemplo de c�mo las pr�cticas deficientes de gobierno.pdf
 
Prompt Your task is to create a connected list implementation and .pdf
Prompt Your task is to create a connected list implementation and .pdfPrompt Your task is to create a connected list implementation and .pdf
Prompt Your task is to create a connected list implementation and .pdf
 
Project ScheduleUse Goods Company Inc. HRM Standardization Project.pdf
Project ScheduleUse Goods Company Inc. HRM Standardization Project.pdfProject ScheduleUse Goods Company Inc. HRM Standardization Project.pdf
Project ScheduleUse Goods Company Inc. HRM Standardization Project.pdf
 
Procedure.pdf
Procedure.pdfProcedure.pdf
Procedure.pdf
 
Project ScenarioPecos Company acquired 100 percent of Suaros outs.pdf
Project ScenarioPecos Company acquired 100 percent of Suaros outs.pdfProject ScenarioPecos Company acquired 100 percent of Suaros outs.pdf
Project ScenarioPecos Company acquired 100 percent of Suaros outs.pdf
 
Professor Jones is very particular when it comes to his morning coff.pdf
Professor Jones is very particular when it comes to his morning coff.pdfProfessor Jones is very particular when it comes to his morning coff.pdf
Professor Jones is very particular when it comes to his morning coff.pdf
 
Program Specifications ( please show full working code that builds s.pdf
Program Specifications ( please show full working code that builds s.pdfProgram Specifications ( please show full working code that builds s.pdf
Program Specifications ( please show full working code that builds s.pdf
 
Productos m�dicos de Penner El lunes 14 de abril, Neil Bennett, Ge.pdf
Productos m�dicos de Penner El lunes 14 de abril, Neil Bennett, Ge.pdfProductos m�dicos de Penner El lunes 14 de abril, Neil Bennett, Ge.pdf
Productos m�dicos de Penner El lunes 14 de abril, Neil Bennett, Ge.pdf
 
P�Pa+Ba Hice bit Holndsiteur soors places at Non bed.pdf
P�Pa+Ba Hice bit Holndsiteur soors places at Non bed.pdfP�Pa+Ba Hice bit Holndsiteur soors places at Non bed.pdf
P�Pa+Ba Hice bit Holndsiteur soors places at Non bed.pdf
 
Q1.7. What would happen if you could magically turn off decompositio.pdf
Q1.7. What would happen if you could magically turn off decompositio.pdfQ1.7. What would happen if you could magically turn off decompositio.pdf
Q1.7. What would happen if you could magically turn off decompositio.pdf
 
Progressive Corporation (a property and casualty insurance company) .pdf
Progressive Corporation (a property and casualty insurance company) .pdfProgressive Corporation (a property and casualty insurance company) .pdf
Progressive Corporation (a property and casualty insurance company) .pdf
 
Q1. (a) Briefly introduce how Force-directed algorithms encode netwo.pdf
Q1. (a) Briefly introduce how Force-directed algorithms encode netwo.pdfQ1. (a) Briefly introduce how Force-directed algorithms encode netwo.pdf
Q1. (a) Briefly introduce how Force-directed algorithms encode netwo.pdf
 
Q1. part A. can we use if statement and skip else part(ye.pdf
Q1. part A. can we use if statement and skip else part(ye.pdfQ1. part A. can we use if statement and skip else part(ye.pdf
Q1. part A. can we use if statement and skip else part(ye.pdf
 
Q1. Fiscal policy is often focused on replacing spending that is no.pdf
Q1.  Fiscal policy is often focused on replacing spending that is no.pdfQ1.  Fiscal policy is often focused on replacing spending that is no.pdf
Q1. Fiscal policy is often focused on replacing spending that is no.pdf
 
Q1 Which of the following would be considered a transport epithelium.pdf
Q1 Which of the following would be considered a transport epithelium.pdfQ1 Which of the following would be considered a transport epithelium.pdf
Q1 Which of the following would be considered a transport epithelium.pdf
 
Q1 Find two thoracic vertebrae that fit together and identify .pdf
Q1 Find two thoracic vertebrae that fit together and identify .pdfQ1 Find two thoracic vertebrae that fit together and identify .pdf
Q1 Find two thoracic vertebrae that fit together and identify .pdf
 

Último

Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...Shubhangi Sonawane
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphThiyagu K
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDThiyagu K
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Disha Kariya
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxVishalSingh1417
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfAdmir Softic
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
fourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingfourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingTeacherCyreneCayanan
 
Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfChris Hunter
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB
 
Gardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch LetterGardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch LetterMateoGardella
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxVishalSingh1417
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxDenish Jangid
 

Último (20)

Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 

Q1(d) (11 marks)We want to be able to carry out an analysis of w.pdf

Q1(d) (11 marks)

We want to be able to carry out an analysis of words in long documents to find the most frequently used words. This can be used, for example, to identify the most important words for language learning or to try to identify authors of literary works. Later on we will ask you to analyse two Shakespeare plays, Hamlet and The Merchant of Venice, to find the 20 most frequent words and the number of times each word occurs. Because the most common words are mainly stop words (articles, prepositions, etc.) and the names of the play's characters (e.g. Hamlet, Horatio, Portia, etc.), we will also want the ability to exclude certain words from the analysis.

First we want to explore the problem in a more general abstract form.

Given the name (string) of a text file containing words, a positive integer m and a text file containing excluded words (strings), find the m most frequent words in the file (apart from the excluded words) and their frequencies, given in descending order of frequency.

We define this more formally, as follows:

Operation: Most common in file
Inputs: filename, a string; excluded-words-filename, a string; m, integer
Preconditions: Files of names filename and excluded-words-filename are text files; m > 0
Outputs: most-common-words, a list of at most m items, where each item is a tuple
Postconditions: Each item of most-common-words is a tuple containing a word from the file filename together with its frequency, with the list being in descending order of frequency, and no tuples for words from excluded-words-filename are in the list. The frequency component of each tuple in most-common-words is greater than or equal to the frequency of occurrence of any other word in filename, ignoring any words in excluded-words-filename.

Q1(d)(i) (3 marks)

The main ADT to use for storing the text should be a bag.
You will also need to choose a suitable ADT for the excluded words, and you can also use other standard simple built-in data structures of Python such as lists or strings, if necessary. State what sort of ADTs and data structures you would use for this problem and explain what is stored in these ADTs. Do not explain your choices at this stage - we will ask about that later.

Add your answer for Q1(d)(i) here:

Q1(d)(ii) (3 marks)

Give a step-by-step explanation, showing how your solution would work.

Write your answer to Q1(d)(ii) here

Q1(d)(iii) (5 marks)

Now explain your chosen approach by outlining the characteristics and the expected performance of the operations on bags and other ADTs/data structures you have used, in standard Python implementations. You should reference the performance discussions for bags in Chapter 8 and relevant performance discussions elsewhere in the module text for other ADTs/data structures.

Write your answer to Q1(d)(iii) here

Q1(e) (12 marks)

Implement your approach from part (d) to solve the abstract problem introduced in part (d) and extended somewhat here:

Analyse two given literary texts to find the 20 most frequent words and the number of times each word occurs. Exclude from the analysis the words that are often most common but less important to the analysis: so-called stop words (articles, prepositions, etc.) and words naming the text's characters.

The original files for the two texts may contain line punctuation and extraneous characters at the start or end of words such as apostrophes, dashes etc. and these should be removed before further processing. The excluded words relevant to each text are listed in a given text file - and this has been cleaned so that it just contains the relevant words, without any punctuation or extraneous characters.

In this case the first text is Shakespeare's Hamlet (in the given text file hamlet.txt) with excluded words listed in the given text file hamlet_excluded_words.txt. The second text is Shakespeare's Merchant of Venice (in the given text file merchant.txt) with excluded words listed in the given text file merchant_excluded_words.txt.

We also want you to find the words that occur in both these texts and the number of occurrences in common for these words, e.g. if dog occurs 10 times in the first text and 25 times in the second text, then there are 10 occurrences in common.
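
The "occurrences in common" rule can be illustrated on toy data. The sketch below assumes the bags are held as Python `collections.Counter` objects (the type used in the code framework) and uses the Counter intersection operator `&`, which keeps the minimum count of each word present in both bags. The names `first_text`, `second_text` and `common` and the toy counts other than those for dog are made up for illustration; this is one possible way to express the idea, not necessarily the intended solution.

```python
from collections import Counter

# Toy bags standing in for the two texts (illustrative values only)
first_text = Counter({"dog": 10, "cat": 3})
second_text = Counter({"dog": 25, "bird": 4})

# Bag intersection: for each word, keep the smaller of the two counts;
# words missing from either bag are dropped
common = first_text & second_text

print(common)  # Counter({'dog': 10})
```

Here cat and bird each occur in only one bag, so they do not appear in the result.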
We have provided code frameworks for your solution below. We have split the problem and the code framework into two parts so you can do one bit at a time and check each part is working.

Q1(e)(i) (5 marks)

The first part of the problem requires reading a text from a file, eliminating excluded words, and storing the results in a bag. As in part (d), we define this more formally, as follows:
Operation: Get bag from file
Inputs: filename, a string; excluded-words-filename, a string
Preconditions: Files of names filename and excluded-words-filename are text files
Outputs: text, a bag of words
Postconditions: The bag text contains all words from the file filename together with their frequency of occurrence, except that any words from excluded-words-filename are omitted.

Here is the code framework for this first part of the problem. We have included some simple test code, to let you check if your code seems to be working so far. Please make the required changes as indicated by comments. When you have finished, run your code to view the output.

# Change this code in the places indicated
# in order to implement and test your solution

# Counter is used below as the bag type; it is imported here in case
# m269_util does not already provide it
from collections import Counter

%run -i m269_util
# Import the functions read_file and read_and_clean_file
%run -i m269_tma03_filehandling

# You will need to amend this function so that the excluded
# words extracted from the list are added to the data structure
# that is returned. You should also set the type annotation
# for the function return value
def get_excluded_words(word_list : list) :
    """Returns the excluded words occurring in word_list
    in a suitable data structure. Here we use a set.
    """
    # replace the following with your code to initialise the data structure
    words = None
    # replace the following with your code to add words from the list if not blank
    pass
    return words

# You will need to amend this function so that the words from the list are
# added to the data structure that is returned, apart from the excluded words.
# You should also set the type annotations for the excluded_words argument
def bag_of_words(word_list : list, excluded_words) -> Counter:
    """Return the words occurring in word_list as a bag-of-words,
    omitting any excluded words
    """
    words = Counter()
    # replace the following with your code to add words from the list if not
    # an excluded word
    pass
    return words

# You should not need to change this function
def get_bag_from_file(filename : str, excluded_words_file : str) -> Counter:
    """Return a bag of the words in the text of file "filename",
    omitting all the words in file "excluded_words_file"
    """
    excluded_words_list = read_file(excluded_words_file)
    excluded_words = get_excluded_words(excluded_words_list)
    text_list = read_and_clean_file(filename)
    text = bag_of_words(text_list, excluded_words)
    return text

# We have provided the following code here to allow you to check if your
# code is working so far. You should not need to change it unless you wish
# to add more tests of your own.
print("Collecting words in text...")
text = get_bag_from_file('hamlet.txt', 'hamlet_excluded_words.txt')
test(size, [['Size of text', text, 20635 ]])
test(text.most_common, [['Most common word in text', 1, [('lord', 223)] ]])
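
To make the "Get bag from file" postconditions concrete, here is a minimal sketch on a small in-memory word list rather than the real files. It assumes a `set` for the excluded words and a `Counter` for the bag, as the framework's docstrings suggest. The function name `toy_bag_of_words` and the toy data are hypothetical, and the module's helpers (`read_file`, `read_and_clean_file`, `test`, `size`) are deliberately not used, so this shows one possible shape of the counting step rather than the expected answer.

```python
from collections import Counter


def toy_bag_of_words(word_list: list, excluded_words: set) -> Counter:
    """Count every non-blank word in word_list that is not excluded."""
    bag = Counter()
    for word in word_list:
        # Skip blank entries and any word in the excluded set
        if word and word not in excluded_words:
            bag[word] += 1
    return bag


# Hypothetical toy data, not taken from hamlet.txt
toy_words = ["lord", "the", "king", "lord", "hamlet", "the", "lord", "king"]
toy_excluded = {"the", "hamlet"}

toy_bag = toy_bag_of_words(toy_words, toy_excluded)

print(sum(toy_bag.values()))   # total occurrences kept: 5
print(toy_bag.most_common(1))  # [('lord', 3)]
print(toy_bag.most_common(2))  # [('lord', 3), ('king', 2)]
```

Membership tests on a set and increments on a Counter are both expected constant-time operations in standard Python, so this counting step is expected to be linear in the number of words. `most_common(m)` returns (word, frequency) tuples in descending order of frequency, which matches the output shape required by "Most common in file".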