Q1(d) (11 marks)
We want to be able to carry out an analysis of words in long documents to find the most
frequently used words. This can be used for example to identify the most important words for
language learning or to try to identify authors in literary works. Later on we will ask you to
analyse two Shakespeare plays, Hamlet and The Merchant of Venice, to find the 20 most
frequent words and the number of times each word occurs. Because the most common words are
mainly stop words (articles, prepositions, etc.) and the play's characters (e.g. Hamlet, Horatio,
Portia etc.) we will also want the ability to exclude certain words from the analysis.
First we want to explore the problem in a more general abstract form.
Given the name (string) of a text file containing words, a positive integer m and a text file
containing excluded words (strings), find the m most frequent words in the file (apart from the
excluded words) and their frequencies, given in descending order of frequency.
We define this more formally, as follows:
Operation: Most common in file
Inputs: filename, a string; excluded-words-filename, a string; m, integer
Preconditions: Files of names filename and excluded-words-filename are text files; m > 0
Outputs: most-common-words, a list of at most m items, where each item is a tuple
Postconditions: Each item of most-common-words is a tuple containing a word from the file
filename together with its frequency, with the list in descending order of frequency, and no
tuples for words from excluded-words-filename are in the list.
The frequency component of each tuple in most-common-words is greater than or equal to the
frequency of occurrence for any other words in filename, ignoring any words in excluded-words-
filename.
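To make the operation concrete, here is a small illustrative sketch (not a model answer) that works on in-memory data rather than files; it uses Python's built-in Counter as the bag, and the function name and sample words are our own invention:

```python
from collections import Counter

def most_common_demo(words: list, excluded: set, m: int) -> list:
    """Illustrate the operation: count every word not in the excluded
    set, then return the m most frequent words with their frequencies."""
    bag = Counter(w for w in words if w not in excluded)
    return bag.most_common(m)

# Tiny worked example:
sample = ["the", "dog", "saw", "the", "cat", "and", "the", "dog"]
print(most_common_demo(sample, {"the", "and"}, 2))
```

Here "the" and "and" are excluded, so "dog" (2 occurrences) heads the result; words tied on frequency may appear in either order.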
Q1(d)(i) (3 marks)
The main ADT to use for storing the text should be a bag. You will also need to choose a
suitable ADT for the excluded words, and you can also use other standard simple built-in data
structures of Python such as lists or strings, if necessary.
State what sort of ADTs and data structures you would use for this problem and explain what is
stored in these ADTs. Do not explain your choices at this stage - we will ask about that later.
Add your answer for Q1(d)(i) here:
Q1(d)(ii) (3 marks)
Give a step-by-step explanation, showing how your solution would work.
Write your answer to Q1(d)(ii) here
Q1(d)(iii) (5 marks)
Now explain your chosen approach by outlining the characteristics and the expected performance
of the operations on bags and other ADTs/data structures you have used, in standard Python
implementations. You should reference the performance discussions for bags in Chapter 8 and
relevant performance discussions elsewhere in the module text for other ADTs/data structures.
Write your answer to Q1(d)(iii) here
Q1(e) (12 marks)
Implement your approach from part (d) to solve the abstract problem introduced there, and
extended somewhat here:
Analyse two given literary texts to find the 20 most frequent words and the number of times each
word occurs. Exclude from the analysis the words that are often most common but less important
to the analysis: so-called stop words (articles, prepositions, etc.) and words naming the text's
characters. The original files for the two texts may contain punctuation and extraneous
characters at the start or end of words, such as apostrophes, dashes, etc., and these should be
removed before further processing. The excluded words relevant to each text are listed in a given
text file - and this has been cleaned so that it just contains the relevant words, without any
punctuation or extraneous characters.
In this case the first text is Shakespeare's Hamlet (in the given text file hamlet.txt) with excluded
words listed in the given text file hamlet_excluded_words.txt. The second text is Shakespeare's
Merchant of Venice (in the given text file merchant.txt) with excluded words listed in the given
text file merchant_excluded_words.txt.
We also want you to find the words that occur in both these texts and the number of occurrences
in common for these words e.g. if dog occurs 10 times in the first text and 25 times in the second
text, then there are 10 occurrences in common.
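This notion of "occurrences in common" corresponds to multiset intersection, which Python's Counter supports directly via the & operator. A small illustrative sketch, with made-up counts matching the dog example above:

```python
from collections import Counter

# Counter's & operator keeps each word with the minimum of its two counts,
# which is exactly the "occurrences in common" described above.
first = Counter({"dog": 10, "cat": 3})
second = Counter({"dog": 25, "bird": 7})
common = first & second
print(common)  # dog appears min(10, 25) = 10 times in common
```

Words appearing in only one of the two texts (here "cat" and "bird") drop out of the intersection entirely.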
We have provided code frameworks for your solution below. We have split the problem and the
code framework into two parts so you can do one bit at a time and check each part is working.
Q1(e)(i) (5 marks)
The first part of the problem requires reading a text from a file, eliminating excluded words, and
storing the results in a bag.
As in part(d), we define this more formally, as follows:
Operation: Get bag from file
Inputs: filename, a string; excluded-words-filename, a string
Preconditions: Files of names filename and excluded-words-filename are text files
Outputs: text, a bag of words
Postconditions: The bag text contains all words from the file filename together with their
frequency of occurrence, except that any words from excluded-words-filename are omitted.
Here is the code framework for this first part of the problem. We have included some simple test
code, to let you check if your code seems to be working so far.
Please make the required changes as indicated by comments. When you have finished run your
code to view the output.
# Change this code in the places indicated
# in order to implement and test your solution
%run -i m269_util
# Import the functions read_file and read_and_clean_file
%run -i m269_tma03_filehandling
# You will need to amend this function so that the excluded
# words extracted from the list are added to the data structure
# that is returned. You should also set the type annotation
# for the function return value
def get_excluded_words(word_list: list):
    """Returns the excluded words occurring in word_list
    in a suitable data structure. Here we use a set.
    """
    # replace the following with your code to initialise the data structure
    words = None
    # replace the following with your code to add words from the list if not blank
    pass
    return words
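One possible way to complete this function is sketched below (a sketch under our own assumptions, not the required answer). It assumes each element of word_list is one word or line read from the exclusions file, and it uses a set because sets offer O(1) average-time membership tests; the sketch's name is ours, to avoid clashing with the framework's function:

```python
def get_excluded_words_sketch(word_list: list) -> set:
    """Return the non-blank words in word_list as a set."""
    words = set()
    for word in word_list:
        if word.strip() != "":   # skip blank lines/entries
            words.add(word.strip())
    return words

# e.g. get_excluded_words_sketch(["hamlet", "", "the"]) gives {"hamlet", "the"}
```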
# You will need to amend this function so that the words from the list are
# added to the data structure that is returned, apart from the excluded words.
# You should also set the type annotation for the excluded_words argument
def bag_of_words(word_list: list, excluded_words) -> Counter:
    """Return the words occurring in word_list as a bag-of-words,
    omitting any excluded words
    """
    words = Counter()
    # replace the following with your code to add words from the list if not
    # an excluded word
    pass
    return words
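Again, one possible completion is sketched below (our own sketch, not the official answer), assuming word_list holds the cleaned words of the text and excluded_words is a set as produced above; the sketch's name is ours, to keep the framework's function untouched:

```python
from collections import Counter

def bag_of_words_sketch(word_list: list, excluded_words: set) -> Counter:
    """Return a Counter (bag) of the words in word_list,
    omitting blanks and any word in excluded_words."""
    words = Counter()
    for word in word_list:
        if word != "" and word not in excluded_words:
            words[word] += 1   # increment this word's count in the bag
    return words

# e.g. bag_of_words_sketch(["dog", "the", "dog"], {"the"}) gives Counter({"dog": 2})
```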
# You should not need to change this function
def get_bag_from_file(filename: str, excluded_words_file: str) -> Counter:
    """Return a bag of the words in the text of file "filename" together
    with their frequencies, excluding all the words in file
    "excluded_words_file".
    """
    excluded_words_list = read_file(excluded_words_file)
    excluded_words = get_excluded_words(excluded_words_list)
    text_list = read_and_clean_file(filename)
    text = bag_of_words(text_list, excluded_words)
    return text
# We have provided the following code here to allow you to check if your
# code is working so far. You should not need to change it unless you wish
# to add more tests of your own.
print("Collecting words in text...")
text = get_bag_from_file('hamlet.txt', 'hamlet_excluded_words.txt')
test(size, [['Size of text', text, 20635 ]])
test(text.most_common, [['Most common word in text', 1, [('lord', 223)] ]])