
Everything about TF-IDF algorithm in NLP

Applications, Use Cases and Innovations with TF-IDF Algorithm

Submitted by Srijit Panja (Reg No: 31019024)

In partial fulfillment of the requirements for the award of Master of Science in Computer Science with Specialization in Machine Intelligence of Cochin University of Science and Technology, Kochi

Conducted by Indian Institute of Information Technology and Management-Kerala, Technopark Campus, Thiruvananthapuram-695 581

April 2021
BONAFIDE CERTIFICATE

This is to certify that the project report entitled "Applications, Use Cases and Innovations with TF-IDF Algorithm" submitted by Srijit Panja (Reg. No: 31019024) in partial fulfillment of the requirements for the award of Master of Science in Computer Science with Specialization in Machine Intelligence is a bonafide record of the work carried out at the Centre for Artificial General Intelligence and Neuromorphic Systems (NeuroAGI), Indian Institute of Information Technology and Management - Kerala, under our supervision.

Supervisor: Dr. Alex P. James, Professor, IIITM-K
Course Coordinator: Dr. Asharaf S, Professor, IIITM-K
DECLARATION

I, Srijit Panja, student of Master of Science in Computer Science with specialization in Machine Intelligence, hereby declare that this report is substantially the result of my own work, except where explicitly indicated in the text, and has been carried out during the period January 2021 to April 2021.

Place: Thiruvananthapuram
Date: 15/04/2021
ACKNOWLEDGEMENT

Foremost, I would like to express my sincere gratitude to Prof. Dr. Alex P James for the continuous support of this thesis, and for his patience, motivation, enthusiasm, and immense knowledge. His guidance helped me throughout the research and writing of this project. I could not have imagined having a better advisor and mentor for my study.

I would also like to express my heartfelt gratitude to our course coordinator Prof. Dr. Asharaf S for his relentless support and motivation throughout the project. His insightful understanding and feedback made the project a productive one.

Moreover, I am extremely grateful to Dr Elizabeth Sherly, Director of IIITM-K, for providing us with such great facilities and a proper working environment, which gave us everything necessary to complete this project.

I would like to thank Mr. Shameel Abdulla (CEO) and the team of Clootrack Software Labs Pvt Ltd, an associate organization of NeuroAGI, for providing me with real use-case datasets for the purpose of my study.

Last but not least, I am thankful to my friends and classmates for their creative suggestions and criticism.
ABSTRACT

The TF-IDF algorithm is a widely used technique in Natural Language Processing. This study aims at building a comprehensive account of all aspects of the algorithm's usage for different purposes. Conventionally, TF-IDF is used for feature engineering in a variety of text analytics tasks, ranging from sentiment analysis, text summarization, keyword extraction, fake text identification and machine translation to other supervised and unsupervised tasks alike. However, this study also aims at deploying the TF-IDF algorithm beyond its conventional use as a word embedding technique. Beyond feature generation, its potential in prediction tasks with no or minimal learning algorithms has been the subject of experiment. An important part of this work is also to propose new approaches to sentiment analysis and fake text detection, employing score generators with no and minimal supervision respectively, in the context of the datasets provided.

A new lexicon-based approach with priority on the importance of keywords for determining sentiment is proposed for sentiment analysis. A method of ranking sentences for text summarization according to sentence scores, given as the sum of the TF-IDF scores of the constituent words, is also part of the study. In analog and digital media, the algorithm shows potential in a variety of research directions and products. The later part of the work is aimed at establishing its value in prevalent as well as critical industry problems based on clustering and classification, for grouping customer reviews into categories for the product 'car'.

A number of applications and new methods, including both conventional and proposed methods of sentiment analysis, text summarization and fake text detection, have been tested on an open-source text dataset for COVID (COVID Open Research Dataset - CORD) in view of the current ongoing situation.
Table of Contents

List of Tables
List of Figures
Glossary
1 Introduction
  1.1 Motivation
  1.2 Objective
  1.3 TF-IDF Algorithm - Definition
    1.3.1 TF-IDF Score and Vector
    1.3.2 General Importance of terms and Vectorization
2 Literature Review
3 Applications of TF-IDF
  3.1 Information Retrieval
    3.1.1 Inverted Indexing
    3.1.2 Ranked Information Retrieval
    3.1.3 Ranked Information Retrieval using TF-IDF weighting scheme
    3.1.4 Similarity Score Calculation
    3.1.5 Dynamic Ranking of documents
  3.2 Keyword Extraction
  3.3 Text Summarization
    3.3.1 Document summarization by TF-IDF
  3.4 Summary
4 Use Cases in Industry
  4.1 Industry problems
    4.1.1 Text pre-processing
    4.1.2 Problem 1: Clustering of customer reviews on cars - solved using K-Means over TF-IDF embeddings of complete sentences
    4.1.3 Problem 2: Clustering of customer reviews on cars - solved using K-Means over TF-IDF embeddings of bi-grams derived from original sentences
  4.2 Summary
5 Lexicon based Sentiment Analysis and Fake News Detection
  5.1 Sentiment Analysis
    5.1.1 Approach 1 - Innovation on Lexicon based Sentiment Analysis using TF-IDF scores
    5.1.2 Conventional Approach 2
  5.2 Truth Analysis and Belief Index Generation
    5.2.1 Conventional Approach 1
    5.2.2 Approach 2 - Innovation on Belief Index Generation with method of vectorization as TF-IDF Algorithm
  5.3 Summary
6 Summary and Future works
  6.1 Main Conclusions
  6.2 Open problems
  6.3 Future works
7 Publications
8 ANNEXURE
  8.1 Code for Problem 1: Clustering of customer reviews on cars - solved using K-Means over TF-IDF embeddings of complete sentences
  8.2 Code for Problem 2: Clustering of customer reviews on cars - solved using K-Means over TF-IDF embeddings of bi-grams derived from original sentences
REFERENCES
List of Tables

4.1 Table of accuracy of clustering of customer reviews on cars
4.2 Comparison of accuracies of clustering on word embeddings derived using different methods
4.3 Table showing structure of formation of clusters (Pure: based on a same keyword throughout, Miscellaneous: based on no particular keyword)
4.4 Comparison in accuracies of clustering (of different word embeddings) of bi-grams extracted from customer reviews from dataset of Problem 1 into clusters closest to three predefined clusters
5.1 Sentences with clear sentimental views derived from CORD19 (Covid19 Open Research Dataset)
5.2 Sentences procured from 'coronavirus' query in the highest ranked article from CoVid-19 Open Research Dataset by Kaggle, and their measures of sentiment, and similarity scores
5.3 Calibration of beliefs and dis-beliefs applied on some sentences
List of Figures

1.1 'TF-IDF score' (Y axis) vs 'Number of documents' (X axis) containing a specific term, with term frequency and total number of documents held constant at 5 and 15 respectively
3.1 Inverted Indexing Mechanism
4.1 Code implemented for text pre-processing
4.2 Retention of only nouns for customer reviews
4.3 Loading of dataset
4.4 Code for generation of word embeddings using TF-IDF vectorizer
4.5 Code of K-Means clustering
4.6 Code for representation of clusters
4.7 Code for extraction of bigrams
4.8 Code for generation of word embeddings using TF-IDF vectorizer
4.9 Code for first level of clustering
4.10 Code for second level of clustering
4.11 The clusters formed
Glossary

NLP - Natural Language Processing
ML - Machine Learning
AI - Artificial Intelligence
LM - Language Model
LSTM - Long Short Term Memory
Chapter 1
Introduction

1.1 Motivation

The primary purpose of word embeddings is essentially to encode text in numerical formats. The fact that these encodings, which are arrays of numbers, can be used to train a machine learning model by fitting it on a series of data (encodings tagged against labels) indicates that the word embeddings retain intrinsic features of the raw text through the conversion. The trend is then learnt by a learning model. This becomes even more significant in the case of TF-IDF word embeddings, as the constituent elements of the embedding array are the TF-IDF scores of the words corresponding to their presence in the document, which indicate their frequency-based general importance. Clearly, the embeddings should hold intrinsic features of the text, given that the elements of the embedding hold the general importances of the words. Such a word embedding technique therefore becomes potentially capable in a lot of text analytics mechanisms, whether with high or minimal learning requirements. Plainly, it is an embedding created from a sequence of general importances held together, and exploring its use cases is therefore worthwhile. Moreover, the scores themselves independently show potential for deriving insights when evaluating texts with respect to perspectives such as sentiment, truth etc. Thus a score-based analysis of text in different domains using the TF-IDF algorithm is also a significant motivation.

1.2 Objective

The work is aimed at showing the implementation and efficiency of the TF-IDF algorithm in tasks solved by learning models, as well as with minimal or no learning. In this pursuit, one objective is to solve two problems of clustering
or grouping customer reviews based on keywords, where we employ the TF-IDF algorithm for generating word embeddings. A second objective is to quantify the falsity in fake news articles in addition to detecting them. This is also where TF-IDF word embeddings are tagged with truth/falsity labels and used for a supervised classification task. A cosine similarity score has been used for the quantification mentioned. A part of the work is also to propose TF-IDF score based methods for lexicon-based sentiment analysis.

1.3 TF-IDF Algorithm - Definition

TF-IDF is an information retrieval strategy that gauges a term's frequency (TF) and its inverse document frequency (IDF). Each word or term present in the content has its individual TF and IDF score. The product of the TF and IDF scores of a term is known as the TF-IDF weight of that term.

1.3.1 TF-IDF Score and Vector

The TF-IDF[44] (Term Frequency-Inverse Document Frequency) score is devised for each word with respect to each document. This score will differ even for the same word when the document with respect to which it is measured changes.

Term frequency (TF)[51] calibrates the significance of a word in an article. The tf of the ith token in the collection of tokens forming a document d_j is formulated as:

    tf_{i,j} = n_{i,j} / Σ_k n_{k,j}    (1.1)

[39] where n_{i,j} is the count of occurrences of the word in article d_j and Σ_k n_{k,j} is the sum total of occurrences of every word existing in article d_j.

The inverse document frequency (IDF) calibrates the significance of the word token across an assemblage of articles, computed by dividing the total count of articles by the count of articles holding the word. For a larger collection this ratio explodes, so taking a logarithm suppresses the impact:

    idf_i = log( |D| / |{d_j : t_i ∈ d_j}| )    (1.2)

[41] where |D| is the net count of articles in the collection and |{d_j : t_i ∈ d_j}| is the count of articles in which the word token t_i is present.
The tfidf score[44] is then formulated as:

    tfidf_{i,j} = tf_{i,j} × idf_i    (1.3)
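As a minimal sketch over an invented toy corpus, equations (1.1)-(1.3) can be computed directly:

```python
import math

# Toy corpus (invented for illustration): each document is a list of tokens.
docs = [
    "the car is fast".split(),
    "the car is red and the car is new".split(),
    "the sky is blue".split(),
]

def tf(term, doc):
    # Eq. (1.1): count of the term divided by total tokens in the document.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Eq. (1.2): log of total documents over documents containing the term.
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / containing)

def tfidf(term, doc, docs):
    # Eq. (1.3): product of the two scores.
    return tf(term, doc) * idf(term, docs)

# 'car' occurs twice among the 9 tokens of docs[1], and in 2 of 3 documents.
print(round(tfidf("car", docs[1], docs), 4))
```

Note that the score for 'the', which occurs in every document, is zero: its idf is log(3/3) = 0, matching the behaviour discussed in Section 1.3.2.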
[19] And for every document in the collection, each term is assigned a tfidf score[3].

1.3.2 General Importance of terms and Vectorization

The specific mainstream purpose of using TF-IDF as a lexicon-based feature engineering tool is served by the following:

• Finding important words in a document:

    As |{d_j : t_i ∈ d_j}| → |D|    (1.4)
    log( |D| / |{d_j : t_i ∈ d_j}| ) → 0    (1.5)

That is to say, as the number of documents in which the word is present becomes large, the TF-IDF score of the word with respect to any particular document becomes low; the word becomes unimportant. Rarity of a word throughout the corpus makes it unique to a particular document, and it is considered important in that document. This matters mostly while calibrating the inter-document importance of a word. The quantifiable equivalent of whether a term is significant corpus-wide is IDF.

The quantifiable equivalent of the priority of a term within a particular document (which is a subset of a corpus) is TF (Term Frequency). The commonness of a term within a document is what makes it actually significant there. The frequency of a term, that is to say its number of occurrences, determines how strongly the term is related to the topic of the document, and the TF score quantifies the magnitude of that relation. Even if the IDF score of a word is high, if its TF score is zero, the general importance of that word for the specific document stands null:

    As n_{i,j} → 0    (1.6)
    tf_{i,j} → 0    (1.7)
Figure 1.1: 'TF-IDF score' (Y axis) vs 'Number of documents' (X axis) containing a specific term, with term frequency and total number of documents held constant at 5 and 15 respectively

• Process of vectorization of a document:
For each document, the corresponding document embedding will structurally be an n-blocked array (n being the total number of distinct words across all the documents, i.e. the vocabulary size of the corpus). Putting the TF-IDF scores of the words w.r.t. the document being vectorized into the positions allotted for each word (from the entire corpus) produces a resultant array of size n of numerical data type (generally floating point). This array becomes the document embedding of the corresponding document.

• Process of vectorization of a sentence:
The procedure here is exactly similar to that for vectorization of a document, with the perspective that a sentence can be seen as a one-sentence document.
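The vectorization process above can be sketched as follows, with an invented toy corpus; each document maps to an array with one block per vocabulary word:

```python
import math

# Toy corpus (invented for illustration).
docs = [
    "the car is fast".split(),
    "the car is red".split(),
    "the sky is blue".split(),
]

# Vocabulary of the corpus: one array block per distinct word, fixed order.
vocab = sorted({w for d in docs for w in d})

def tfidf(term, doc):
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)
    return tf * math.log(len(docs) / df)

def embed(doc):
    # The document embedding: an n-sized array of TF-IDF scores,
    # n being the vocabulary size of the corpus.
    return [tfidf(w, doc) for w in vocab]

print(len(vocab))
print(embed(docs[0]))
```

Words absent from a document get score 0 in their block, and corpus-wide words like 'the' also score 0 via their zero idf, so each embedding is sparse in practice.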
Chapter 2
Literature Review

Word embedding[8] methods in Natural Language Processing are the source of the encoding[23] of text elements for machine learning purposes. They are distinguished from each other by the technique they adopt to transfer the same text into numerical sets, based on whose nature or trend regression, classification[4][9] or clustering take place. Transfer of text to word embeddings is essentially done as a part of feature engineering, where features are drawn out from raw-form text. These features are quantifiable, and therefore models can fit upon them and learn based on the nature of this engineered data. After text pre-processing, feature engineering serves as the primer for Natural Language Processing and associated learning. The techniques for such feature engineering, or simply word embedding methods, trace back to syntactic processes which were mostly rule based. The Bag-of-Words (BOW) model, Continuous BOW (CBOW), continuous skip-gram model, Count Vectorizer and TF-IDF Vectorizer are some such techniques, designed on specific workflows that take into account the apparent presence of words in a syntax for feature engineering fitting on a text. The rules are statistical, and therefore keyword-based modelling[53] or word-frequency-prior solutions are mostly solved using one of these.

The Bag-of-Words (BOW) model[55][36], which later gave rise to the Count Vectorizer, essentially counts[26] the frequency of words over the vocabulary of an entire corpus of documents. The frequencies of the words are counted with respect to the documents. That is, for all the documents the net document embedding is of the same size, the total size of the vocabulary of words accumulated from all the documents in the corpus. The array is filled with the frequencies of the words in the blocks allotted for the respective words. This mechanism also holds true for more embedding techniques of the same kind, including TF-IDF.
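The shared array layout of count-based and TF-IDF embeddings can be seen side by side with Scikit-Learn's two vectorizers, over two invented sentences:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the car is fast", "the car is red and new"]

# Bag-of-Words / Count Vectorizer: raw frequencies in the blocks
# allotted for the vocabulary words.
bow = CountVectorizer().fit_transform(docs).toarray()

# TF-IDF: identical array layout, but each block holds a
# general-importance score instead of a raw count.
weights = TfidfVectorizer().fit_transform(docs).toarray()

print(bow)
print(weights.round(2))
```

Both matrices have one row per document and one column per vocabulary word; only the values in the blocks differ.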
But in TF-IDF, it is not simply the term frequencies that form the constituents of the embedding; it is a score that captures the general importance of the term, technically known as the TF-IDF score.

The Continuous Bag-of-Words (CBOW) model[52] is a text processing method that is effectively used for capturing contexts in a text, which in most cases are subsequently used for predicting target words. For text embedding, it embeds portions of text in a sequence of word tokens. Such contextual embeddings, when tagged against target words for prediction, form a data trend on which a neural network can fit for prediction tasks. The embedding mechanism it employs is similar to the BOW model; however, a sequence of contiguous or alternate tokens is embedded using the method, and not the exact raw text. This is indicative of the fact that context representations are majorly deployed using either a contiguous sequence of words, to denote the relative positions of words as part of a long text, or n-grams[45] (shorter versions being bi-grams and tri-grams), or alternate words in progression called skip-grams.

The continuous skip-gram[25] model acts in the same way as the CBOW model; however, it functions in reverse. In its case, it attempts to predict a context window from a given word, which previously was the target word. This essentially means that the input to the neural network is the 'target' word, while the outputs at the last layer are the probabilities for the labels, the context words. In both cases, however, the initial context design, that is to say the labelling of context (set of surrounding words) to target or target to context, is derived from an initial vocabulary that holds all tokens possible from a corpus. A matrix[31] captures the relationships between the contexts and the targets, and thus also the reverse relationships, whose derivative forms the dataset onto which a simple neural network fits.

The TF-IDF algorithm, however, is the most extended outcome from this approach, efficient at quantifying the general importance of words in both intra- and inter-document perspectives.
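As a minimal illustration (invented sentence, assumed window size of 2), the context-to-target labelling used by CBOW, and its skip-gram reversal, could be generated as:

```python
def cbow_pairs(tokens, window=2):
    # CBOW: (surrounding context words -> centre target word) per position.
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs

def skipgram_pairs(tokens, window=2):
    # Skip-gram reverses the direction: (target word -> one context word).
    return [(target, word)
            for context, target in cbow_pairs(tokens, window)
            for word in context]

tokens = "the quick brown fox jumps".split()
print(cbow_pairs(tokens)[2])
```

These pairs, once the words are one-hot encoded against the vocabulary, form the training data for the simple neural network described above.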
The following set of word embedding processes are mostly neural network based, but still have an underlying training on words and not on semantics or semantics-based word relationships. Word2Vec (or its extension Doc2Vec) and GloVe (Global Vectors for Word Representation) are examples, which mostly take syntactic word relationships in a matrix to train using neural network models. These matrices are known as co-occurrence matrices[35], which reflect the word-to-word relationships[32] in terms of existence within the same portion of the overall text concerned. The embeddings arising from such a setup are commonly prediction oriented, i.e. the next set of words arising out of a context is the centre of prediction, and the embeddings designed in the process have the goal of achieving accurate predictions in this direction. These are known as prediction-based word embedding techniques.

Word2Vec[13] is a model built from a combination of the CBOW and continuous skip-gram models. GloVe (Global Vectors for Word Representation) is a parallel technique to create word embeddings. It is primarily based on matrix factorization techniques over the word-context matrix. A large matrix of co-occurrence records is built, capturing each 'word' (the rows) and how often we see this word in some 'context' (the columns) in a large corpus. Usually, we scan our corpus in the following way: for every term, we search for context words within a region defined by a window size before the term and a window size after the term. Also, we give less weight to more distant words. The range of 'contexts' is, of course, large, since it is essentially combinatorial in size. We then factorize this matrix to yield a lower-dimensional matrix, in which every row yields a vector representation for a word. In general, this is done by minimizing a 'reconstruction loss'[24]. This loss tries to locate the lower-dimensional representations which can explain most of the variance in the high-dimensional data. In practice, we use both GloVe[37] and Word2Vec to transform our textual content into embeddings, and both showcase similar performances. In real applications we train our models over Wikipedia text with a window size of around 5-10. The number of words in the corpus is around 13 million, so it takes a large amount of time and resources to generate these embeddings.

With the advent of the BERT (Bidirectional Encoder Representations from Transformers) model, which is an attention[49] based bidirectional transformer[54] model designed primarily for end-to-end tasks that employ the encoder-decoder approach, extensions such as ALBERT and DistilBERT were seen to crop up, along with other neural network based encoding processes such as InferSent, Universal Sentence Encoder (USE) etc. Dominantly, a large challenge in NLP is the dearth of sufficient training data.
In total, a giant quantity of textual data is present, but when there is a requirement for task-specific datasets, we need to break the pile into many different fields. And when attempting that, the result is just a few thousand or perhaps a few hundred thousand manually labelled training examples. Unfortunately, in order to perform well, deep learning oriented NLP procedures require much larger quantities of data: they see major improvements when trained on millions, or even billions, of labelled training instances. To help fill this void, researchers have developed various strategies for training dedicated language representation models on the huge chunks of unlabelled text on the internet (this is called pre-training). These pre-trained models can subsequently be fine-tuned on smaller task-specific datasets, as when working on problems like sentiment analysis and question answering. This approach results in first-rate accuracy improvements compared with training on the smaller task-specific datasets from scratch.

BERT[48] is one of the latest additions to these strategies for pre-training in NLP; it caused a disruption within the deep learning domain, as it offered state-of-the-art results in a wide variety of text oriented tasks, like question answering. A notable aspect of BERT is that it can be downloaded and used for free: we can either utilize BERT models to procure high-quality language features from our text records, or we can fine-tune these models on a particular task, like sentiment analysis or question answering, using our own data to produce state-of-the-art predictions.

Before BERT, a language model would have interpreted a textual sequence during training in the left-to-right direction, or as a combination of separate right-to-left and left-to-right passes. Such a uni-directional procedure performs well for producing sentences: forecast the next word, add it to the sequence, then predict the next word after that, until we have a complete sentence. With BERT, a language model trained in both directions, we can now have an in-depth sense of language context and flow compared with the uni-directional language models. In place of predicting the next word in a sequence, BERT uses a method referred to as Masked Language Modelling[43] (MLM), by which it masks words in the sentence in no particular order and then attempts to predict them. Masking means that the model looks in both directions and uses the entire context of the text sequence, both the right as well as the left environment, as a means to predict the masked word. Contrary to the earlier language models, it considers both the previous and subsequent tokens simultaneously. The existing combined right-to-left and left-to-right LSTM[15] based models had not kept this 'same-time' component. BERT relies on a Transformer (the mechanism that learns contextual relationships between words within a text).
A basic Transformer comprises an encoder to read the text input as well as a decoder to produce a prediction for the task. Since BERT's intention is to generate a language representation model, it only requires the encoder element. The encoder input in the case of BERT is a token sequence, which is first transformed into vectors and subsequently processed inside the neural architecture.

A simplified alternative that remains in use for keyword based search processes, information retrieval and lexicon based NLP is the TF-IDF algorithm. It has several implementations, the most common being the averaged TF-IDF vectorization approach also implemented in Scikit-Learn.
Chapter 3
Applications of TF-IDF

3.1 Information Retrieval

The TF-IDF weighting scheme for information retrieval is a gradual derivation from generally quantified ranked information retrieval methods. Prior to that, information was retrieved subject to a search query in no particular order of importance with respect to, or coincidence with, the keywords in the search. Inverted indexing of documents meant the organization of documents in descending order of their priorities with respect to a search. This was a more general approach preceding the later, specific introduction of TF-IDF weights, which included both the general importance of the search keyword with respect to the document as well as the importance of the document in general within a corpus of similar documents, in the context of the keyword for which the search is conducted.

3.1.1 Inverted Indexing

For searching and retrieving documents relevant to a search query, there is a requirement of storing all documents containing the search query separately. This is opposed to the conventionally thought-out approach of finding which words a document contains and retrieving it in case the search query is a constituent element. Inverted indexing[34][14] is the indexing of documents corresponding to a search query. All the documents containing the search query as a constituent element are indexed into an array corresponding to the search query, and during the retrieval process, all the documents corresponding to the document indices in that array are delivered in response to a user-generated search. In cases where there is more than one word in a search query, the search query is tokenized into independent words, and one array corresponding to each token is allocated. Each of these arrays then contains the
Figure 3.1: Inverted Indexing Mechanism

document indices corresponding to the documents containing the word. Subsequently, the arrays are united following the set union operation. This gives the net list of documents in which any of the constituent word tokens is present. Beyond this general approach, a more concentrated information retrieval system can also be devised by applying set intersection over the individual arrays. This would ensure that only the documents containing all constituent elements from the search query are retrieved. This is used when the requirement of the user is very specific and the subject of the search is accurately conveyed by the search query the user inputs.

However, during retrieval, the documents are not presented to the user in the order of their relevance using this mechanism. The most relevant document might be presented later, while a less relevant one might be presented before it. This, on the whole, does not completely fulfil the purpose of devising an information retrieval system that would quench the curiosity associated with a user-generated query.

3.1.2 Ranked Information Retrieval

To solve the above problem, that is, to secure an order for the retrieval of documents[27] and their procurement by the end users, the documents need to be ranked with respect to their relevance to the search query. This relevance is quantified by the parameter of how similar the document is to the search query. The more common words there are between the two texts, and the greater the importance of those words in the document, the greater the similarity. This is quantified by a similarity score. So clearly, in a ranked mechanism[42],
there are n similarity scores (n = number of documents), each representing the similarity for a 'document ↔ search query' pair. In this pair, the search query remains the same while the document varies at each iteration to cover all the documents in the pool of data.

3.1.3 Ranked Information Retrieval using TF-IDF weighting scheme

In this mechanism, for every document in the collection, each term is assigned a tfidf score[3]. The document in which a weighted average or sum of the TF-IDF[18] scores corresponding to the word tokens of the search query is maximum should supposedly be the most relevant document. Thus we could easily imagine a ranked list of documents based simply on such scores. However, such a mechanism is not admissible, as it would not cover the documents in which the search query, or word tokens from it, are not present at all. Thus there are two inconveniences. The first is that all such documents would be ranked zero and the same, defeating the purpose of ordering documents as we approach the tail of the list. The second inconvenience lies in the fact that, in cases when none of the documents contain the word tokens from the search query, no document will be retrieved and presented to the user.

For this, the general approach is to represent both the search query and the documents in the form of vectors. Converting to vectors gives the privilege of finding how close these vectors are. By this method, there will always be a measure of closeness between a document and the search query, and irrespective of its containment of the search query terms, there will be a definite similarity score and subsequently a rank. These vector representations of texts are known as Text Embeddings. Based on the structure of the text being vectorized, they are called Word Embeddings, Sentence Embeddings, Document Embeddings etc.
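A minimal sketch of this ranked retrieval, assuming invented documents and query, and taking cosine similarity (detailed in the next subsection) as the closeness measure: the query is embedded into the same vocabulary space as the documents, so every document receives a definite score and rank even with no overlapping terms.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy document pool and query (invented for illustration).
docs = [
    "the car engine is powerful",
    "the car mileage is excellent",
    "the weather today is sunny",
]
query = "car engine"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)
# Embed the query in the same m-sized vocabulary space as the documents.
query_vector = vectorizer.transform([query])

# One similarity score per 'document <-> search query' pair.
scores = cosine_similarity(query_vector, doc_vectors)[0]
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)   # document indices, most relevant first
```

Here the third document shares no query term and scores zero, but it still receives a well-defined place at the tail of the ranking.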
A TF-IDF embedding[22] for a document is essentially an m-sized array (m = size of the vocabulary of the corpus). Each block in the array is allocated to a word and is filled with the tf-idf score of that word for the document being embedded.

3.1.4 Similarity Score Calculation

The similarity[1] between two texts (here the search query and the document) needs to be quantified for the subsequent ranking of documents[20] to take
place. There are various ways in which this can be done, the two most popular methods being Jaccard Similarity and Cosine Similarity.

Jaccard similarity shows the level of match between two given sets. Suppose we have two given sets, A and B. The Jaccard Similarity computed between them is given by the formula [40]

J(A, B) = |A ∩ B| / |A ∪ B|    (3.1)

So, to procure the degree of match between two particular texts, the corresponding texts are tokenized and, for each of them, a set is built from those tokens. The similarity score is commonly represented as a percentage, so the resulting value is J(A, B) × 100. In the case of search-query-to-document matching, the obvious approach is to tokenize both the search query and the document into words, and then find their Jaccard Similarity Score treating both collections of tokens as sets. This representation as sets essentially means the absence of duplicate terms.

Therefore, this method is not very effective, as it only indicates the presence of the tokens from the search query in a document and does not consider their importance with respect to the specific document concerned and the entire corpus of documents. The better way out is to quantify the similarity between two texts using the Cosine Similarity approach.

• Cosine Similarity over TF-IDF embeddings: The measure of coincidence between two given vectors can well be a function of the angle between them: the smaller the angle, the higher the coincidence. This relation also holds on adopting the cosine[6] of the angle in place of the angle itself. Cosine similarity can effectively be applied to n-dimensional vectors, where n ∈ N, n > 1. If n = 1, the similarity scores procured would be either 0 or 1, producing no clear indication or quantification of coincidence between mere points, be they along or parallel to an axis.
The Cosine Similarity[47] technique here is implemented over text embeddings. For two text embeddings A and B, which can also be visualized as vectors with an angle of inclination θ between them, the required similarity score is given by the following deduction[47]:

A · B = |A||B| cos θ    (3.2)

cos θ = (A · B) / (|A||B|) = Σᵢ AᵢBᵢ / ( √(Σᵢ Aᵢ²) · √(Σᵢ Bᵢ²) ),  i = 1, …, n    (3.3)
where A · B = Σᵢ AᵢBᵢ denotes the dot product of A and B, and |X| = √(Σᵢ Xᵢ²) denotes the norm of vector 'X', Xᵢ being the i-th component of vector 'X'. In our case, the text embeddings are TF-IDF embeddings, so the vectors A and B represent the search query and the document respectively.

3.1.5 Dynamic Ranking of documents

The ranking of the documents on the basis of their similarities with the search query results in potential search results. A definite rule applied upon this ranked retrieval (for example: all documents with similarity scores above a threshold θ are potential search results for the query) concludes the final procurement of documents by end users. This information can then be fed into dedicated mechanisms with the intention of drawing related insights.

• Document popularity based on retrieval to searches: As and when the Information Retrieval System is launched and made to reach people in the form of Search Engines or Query Processing Models, and they start using it, the repetition in documents retrieved across a spread of users needs to be marked and stored. For this, there should be a database acting parallel to the main data storage (which contains the documents). This is essentially a set of ⟨key, value⟩ pairs, the keys being the document IDs or the documents themselves, and the values storing the number of occurrences of each document across the search feeds of users. In case the documents with similarity scores above a threshold are considered the potential search results for a query, then for each retrieval of a document, i.e. each time a document attains a similarity score above the threshold, the value corresponding to its document ID in the parallel database is incremented by 1, showing that it has been searched for by the user. Thus at any point, the documents with the highest values in the parallel database are the most popular and those with the lowest are the least popular.
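Equation (3.3) translates directly into code. The sketch below ranks toy documents against a query using cosine similarity over pre-computed vectors; the three-component vectors here are stand-ins for real TF-IDF embeddings, not output of the actual pipeline.

```python
import math

def cosine_similarity(a, b):
    """cos θ = (A·B) / (|A||B|) as in eq. (3.3); returns 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Toy TF-IDF-style vectors for a query and two documents (illustrative values).
query_vec = [0.5, 0.0, 0.3]
doc_vecs = {"doc1": [0.4, 0.1, 0.2], "doc2": [0.0, 0.9, 0.0]}

ranked = sorted(doc_vecs, key=lambda d: cosine_similarity(query_vec, doc_vecs[d]),
                reverse=True)
print(ranked)  # doc1 shares weighted terms with the query, so it ranks first
```

Because every vector pair yields some angle, even a document sharing no query term gets a definite (possibly zero) score and hence a rank, which is exactly the advantage over the pure score-sum approach.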
• Net Popularity = Dynamic Popularity × TF-IDF based rank: In subsequent searches, an additional perspective for calibrating popularity and ranking documents corresponding to a search query is the dynamic popularity of documents based on user searches. This, when multiplied with the TF-IDF based rank in a list of documents procured for a search query, amplifies the recommendation of a document with respect to its priority of being seen by the user. This is purely
based on how many previous users procured or 'watched' it; it only emphasizes a document and is not an independent, absolute measure for ranking a document.

3.2 Keyword Extraction

Extracting a set of the most important keywords for topic identification of a body of text is one of the purposes for which the TF-IDF algorithm is used. Using it, TF-IDF scores are allotted to the words contained in a document, on the basis of which the word with the highest score is most specific to the document. The top n such keywords can easily be identified by keeping all words of a document ranked by their TF-IDF scores.

3.3 Text Summarization

A finite shortening of documents that still conveys the same sense, but only the important parts of the original texts, makes understanding documents much easier. Moreover, juxtaposition of more than one document followed by a collective summarization[16][29] is an effective way to connect independent research articles and gather insights from the connections. Checking whether two or more documents are related in any sense is similar to checking whether the summary of the juxtaposed version is meaningful and imparts proper sense.

In general, summarization of text documents[21] in Natural Language Processing is considered a supervised machine learning approach (where upcoming results are forecast depending upon given data). Usually, the following are the phases of an extraction-oriented method to summarize text:

• Introducing a process to procure the important keyphrases out of the given article. For example, one can utilize part-of-speech tagging, word sequences, or other language trends to recognize the keyphrases.

• Accumulation of text articles with keyphrases that are labelled positive. The keyphrases ought to be compatible with the predefined extraction procedure. To enhance accuracy, creating negatively-labelled keyphrases can also be practised.
• Making a binary machine learning classifier model learn to summarize text. A few parameters that can be taken for the purpose are:
1) Size of the keyphrase
2) Keyphrase frequency
3) The word of maximum recurrence in the keyphrase
4) Total count of characters in the keyphrase

• Eventually, in the testing part, building up every keyphrase's words and corresponding sentences and executing classification with respect to them.

3.3.1 Document summarization by TF-IDF:

For extractive summarization[11][5] of a document using the TF-IDF approach, the typical way is to find a finite number of words with the highest TF-IDF[12] scores. This suggests that these words are the most important words in the document, and the exact sense of the original text can well be conveyed using a meaningful concatenation of these words. These are essentially the keyphrases[2] which are pulled out from the original text and connected together.

Source text: Italy and England saw high peaks in death rates. The peaks were observed nearly within a gap of four months.
Summary: Italy England peaks death rates were observed gap four months.

In many cases, the derived summary is not grammatically correct. However, the sense of the content is almost accurately retained from the original document and conveyed.

3.4 Summary

The applications of the TF-IDF algorithm in the context of text analytics and even search mechanisms are therefore strewn widespread, starting from information retrieval, where its use is confined to a word embedding technique producing text vectors that go on to be compared for similarity with the search query. In other cases, as in Text Summarization, TF-IDF scores are important in understanding the priority of sentences. And in extractive text summarization, this priority of sentences becomes the factor based on which the constituents of a summary are determined. Such summaries are closer to, and are formed with, the individual sentences of the original text, unlike in abstractive text summarization. This makes the intra-document importance of keywords
vital in determining such extractive synopses. Word Clouds, a derivative of keyword extraction, are common in text analytics. Moreover, even in topic modelling and identification, extraction of keywords is a primary step. The general importance of keywords in varying domains of text is vital for such purposes, and it is quantified and indicated quite accurately by the TF-IDF algorithm. The TF-IDF scores here are more important than the net vector - the TF-IDF vector - which is formed for the accumulation of such keywords. Embeddings or word vectors are more important when it comes to understanding some trend over a uniform dataset based on which predictions are required to be made. The interpretation of texts with respect to tags and variations of the tagging requires a transformation of text to numerics, which is facilitated by word embeddings. In all applications where this dependency is determined more by the presence of specific keywords, TF-IDF word embeddings hold significance.
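The keyword-extraction step of Section 3.2 can be sketched as below. The TF-IDF variant (tf = count / length, idf = log(N / df)) is an illustrative assumption, and the toy corpus echoes the Italy/England example; in practice stopwords would be removed first, as untreated function words can also score highly.

```python
import math
from collections import Counter

def top_keywords(docs, doc_index, n=3):
    """Return the n words with the highest TF-IDF scores in one document."""
    tokenized = [d.lower().split() for d in docs]
    df = Counter(t for doc in tokenized for t in set(doc))
    doc = tokenized[doc_index]
    tf = Counter(doc)
    score = {t: (c / len(doc)) * math.log(len(docs) / df[t]) for t, c in tf.items()}
    return [t for t, _ in sorted(score.items(), key=lambda kv: kv[1], reverse=True)[:n]]

corpus = [
    "italy and england saw high peaks in death rates",
    "the weather in england is mild",
    "rates of recovery are improving",
]
kws = top_keywords(corpus, 0, n=3)
print(kws)
```

Words like 'england' and 'rates', which also occur in other documents, get a lower idf and so fall out of the top ranks, which is the corpus-level discounting that makes the survivors document-specific.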
Chapter 4
Use Cases in Industry

Post feature engineering using TF-IDF, machine learning models fitted upon the feature vectors can be used to solve a whole lot of problems pertaining to text classification, clustering and more specific tasks such as sentiment analysis, emotion detection, fake news detection etc.

4.1 Industry problems

4.1.1 Text pre-processing

Prior to Natural Language Processing and text analytics tasks, there is a series of pre-processing[50][46] steps that the data needs to undergo. In our industry problems, mainly revolving around clustering of customer reviews for various products, those were mainly:

• Conversion to lower case: The documents, here customer reviews, require a lower-cased presence such that the same words in different cases are not treated as different during clustering.

• Removal of stopwords: Stopwords need to be removed to avoid possibilities of weight shifting towards unnecessary words (unnecessary in regard to their possibilities of being the subject of clustering) during numerical encoding (technically called embedding) of documents using a word-document scoring or weighting scheme. Stopwords are generally prepositions, conjunctions, determiners and other generally small-sized words which act as connectors or specifiers in a language and do not add much value in conveying the subject of the matter being expressed.
Figure 4.1: Code implemented for text pre-processing

• Removal of non-semantic words: Usage of a standard dictionary provided by libraries like nltk, or a word corpus like WordNet, to check for the existence of words is an effective technique for obliterating meaningless words. These simply act as noise which elongates the shape of the document embeddings, giving weight-based importance to unnecessary words.

• Stemming and lemmatization: Converting derived words down to their roots helps count the presence of a word accurately. This in turn helps in the assignment of correct scores during document embedding using a weighted scheme. For example: a stemmer would stem 'roots' down to 'root', and the net presence of root/roots in a set of sentences like 'There are many edible roots. Carrot is one such root.' would be quantified as two and not one. Both stemming and lemmatization are techniques for reaching such root words. However, the difference remains in the fact that in lemmatization, the final word reached after the process compulsorily has a meaning, while that is not a criterion in stemming. In both cases, the most prevalent way is suffix stripping, that is, clipping away suffixes from words to reach the root word. Example: boxes - box, boys - boy, oxen - ox etc.
Figure 4.2: Retention of only nouns for customer reviews

• Specific text pre-processing: For our clustering purposes, we identified that the expected categories were majorly noun-oriented. That is to say, grouping of sentences into different topic heads based on nouns was the predicted outcome; it hardly depended upon words of other parts of speech. Therefore, only the nouns in each review were kept, so as to prevent weight shifting onto unimportant tokens in a customer review. This was the transformation of each conventionally pre-processed customer review text into its corresponding 'just nouns' version before TF-IDF feature engineering for possible clustering. In both industry problems dealt with (the categories being parameters for ratings on the product: cars), the clustering is largely based on nouns that define direct or related features of cars. The dataset for the first problem consisted of expected categories like road presence, safety features and build quality that the clustering needed to match, while the dataset for the second was completely unlabelled and the aim of clustering was to identify possible groups based on important keywords. It was largely a lexicon-based clustering. Therefore, TF-IDF features clustered efficiently, representing the clustering aimed at for the raw customer review sentences. The frequency of the nouns comprised, and that of the documents containing such nouns, was the important factor of clustering.
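The pre-processing bullets above can be sketched as a single function. The report's actual pipeline (Figures 4.1 and 4.2) uses NLTK stopword lists and part-of-speech tagging; the tiny STOPWORDS and NOUNS sets below are toy stand-ins for those, not the real resources.

```python
import re

# Toy stand-ins for NLTK's stopword list and a POS-tag-based noun filter.
STOPWORDS = {"the", "is", "a", "an", "and", "of", "in", "it", "very", "are"}
NOUNS = {"car", "engine", "mileage", "seats", "brakes", "quality"}

def preprocess(review, nouns_only=False):
    """Lower-case, strip punctuation, drop stopwords, optionally keep only nouns."""
    tokens = re.findall(r"[a-z]+", review.lower())
    tokens = [t for t in tokens if t not in STOPWORDS]
    if nouns_only:
        tokens = [t for t in tokens if t in NOUNS]
    return " ".join(tokens)

print(preprocess("The engine is very smooth and the seats are comfortable!",
                 nouns_only=True))
# keeps just the noun tokens: "engine seats"
```

The `nouns_only` switch corresponds to the 'just nouns' transformation applied before TF-IDF feature engineering.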
4.1.2 Problem 1: Clustering of customer reviews on cars - solved using K-Means over TF-IDF embeddings of complete sentences

• Nature and size of dataset: 398 customer reviews with expected category/label given

• Shape of TF-IDF embedding: (1 × 1198)

• K-Means clustering over TF-IDF embeddings:

K-Means Algorithm: The K-means clusterer is a straightforward unsupervised learning mechanism used to take care of clustering issues. It follows a simple process of partitioning a provided data set into a number of clusters, the number being marked by the letter "k", which is predefined beforehand. The clusters are situated as points and all observations or data points are associated with the nearest cluster, computed and adjusted, after which the cycle begins over again using the new adjustments until a desired result is reached.

The algorithm:
1) K points are set into the data space, representing the initial group of centroids.
2) Each object or data point is assigned to the nearest centroid.
3) After all objects are assigned, the positions of the k centroids are recalculated.
4) Steps 2 and 3 are iterated until the positions of the centroids become constant.

Here, Number of clusters = 3; Expected number of clusters = 3 (Road Presence, Safety Features, Build Quality). The sentences were clustered without prior bias representing the expected clusters.

• Results: Shown in Table 4.1 and 4.2
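The pipeline the following figures depict can be sketched with scikit-learn (the library the report's code screenshots appear to use). The six reviews below are invented stand-ins for the 398-review dataset, and k is set to 3 as in the problem statement.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy reviews covering the three expected aspects: safety features,
# build quality and road presence.
reviews = [
    "great safety features and airbags",
    "airbags and safety assist work well",
    "solid build quality and sturdy body",
    "body panels show excellent build quality",
    "strong road presence and imposing stance",
    "imposing road presence turns heads",
]

X = TfidfVectorizer().fit_transform(reviews)   # (6 x vocabulary) sparse matrix
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_)  # reviews about the same aspect should share a label
```

Accuracy against the expected categories (as in Table 4.1) can then be computed by matching each cluster to the majority expected label of its members.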
Figure 4.3: Loading of dataset
Figure 4.4: Code for generation of word embeddings using TF-IDF vectorizer
Figure 4.5: Code of K-Means clustering
Figure 4.6: Code for representation of clusters

Table 4.1: Table of accuracy of clustering of customer reviews on cars
Category | Accuracy of clustering
Safety Features | 0.865
Build Quality | 0.954
Road Presence | 1.0

Table 4.2: Comparison of accuracies of clustering on word embeddings derived using different methods
Category | TF-IDF | Word2Vec | BERT
Safety Features | 0.865 | 0.528 | 0.620
Build Quality | 0.954 | 0.663 | 0.753
Road Presence | 1.0 | 0.685 | 0.809
4.1.3 Problem 2: Clustering of customer reviews on cars - solved using K-Means over TF-IDF embeddings of bi-grams derived from original sentences

• Nature and size of dataset: 5506 customer reviews, labels not given

• Extraction of bigrams: Bigrams were extracted from the 'just nouns' versions of the original customer review texts/sentences, such that the keywords are clustered in the context of their occurrence and not absolutely independently. Clustering keywords present in sentences regardless of the context of their presence causes different sentences portraying dissimilar senses, but sharing a same word (which becomes the basis of the grouping), to be grouped into the same cluster. This potentially contributes to wrong groupings. Therefore, to mitigate that, the idea of capturing context in bigrams (of nouns) from individual sentences was experimented with, followed by transformation to keyword-prior word embeddings, here commonly TF-IDF word embeddings.

• Shape of TF-IDF embedding: (1 × 21244)

• Clustering: The K-Means clusterer was used to cluster over the bigram-converted TF-IDF embeddings. The elbow point for the plot of number of clusters vs SSE (sum of squared errors) was seen at 6. This was for the entire set of bigrams extracted directly from the pre-processed customer reviews. This later became a first round of clustering, as the clustering achieved at the end of it suggested potential grouping into more detailed clusters within these main 6 clusters. As a result, in a second round of grouping, with clustering done inside the primary clusters, a total of 19 clusters was achieved.

• Results: Shown in Table 4.3 and 4.4
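The bigram-extraction step can be sketched as below; the noun filtering itself (Figure 4.2) is assumed already done, and the noun list is an invented example.

```python
from itertools import tee

def noun_bigrams(tokens):
    """Pair each token with its successor, keeping local context together."""
    a, b = tee(tokens)
    next(b, None)  # advance the second iterator by one position
    return [f"{x} {y}" for x, y in zip(a, b)]

# 'Just nouns' version of one review (illustrative tokens).
nouns = ["engine", "noise", "cabin", "comfort"]
print(noun_bigrams(nouns))  # ['engine noise', 'noise cabin', 'cabin comfort']
```

The resulting bigram strings can then be fed to a TF-IDF vectorizer in place of raw words, so that a term like 'engine noise' is scored as one contextual unit rather than two independent keywords.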
Figure 4.7: Code for extraction of bigrams
Figure 4.8: Code for generation of word embeddings using TF-IDF vectorizer
Figure 4.9: Code for first level of clustering
Figure 4.10: Code for second level of clustering
Figure 4.11: The clusters formed

Table 4.3: Table showing structure of formation of clusters (Pure: based on a same keyword throughout; Miscellaneous: based on no particular keyword)
Primary cluster | Number of secondary clusters | Type of cluster
Cluster 1 | 3 | Miscellaneous, Pure, Pure
Cluster 2 | 4 | Pure (all)
Cluster 3 | 4 | Pure (all)
Cluster 4 | 2 | Pure (all)
Cluster 5 | 3 | Pure (all)
Cluster 6 | 3 | Pure (all)
Table 4.4: Comparison of accuracies of clustering (of different word embeddings) of bi-grams extracted from customer reviews from the dataset of Problem 1 into clusters closest to three predefined clusters
Category | TF-IDF | Word2Vec | BERT
Safety Features | 0.29 | 0.02 | 0.08
Build Quality | 0.38 | 0.05 | 0.14
Road Presence | 0.36 | 0.11 | 0.11

4.2 Summary

The use cases shown therefore depict a real industry problem where, in both perspectives, the goal remains grouping of customer reviews. This is a major challenge faced as far as products and companies that take in customer feedback are concerned. The methods deployed here can therefore be highly useful, majorly on e-commerce sites where customer reviews pour in heavily. The analysis addresses two major ways of dealing with the scenario depending on how the given resources are oriented, i.e. the nature of the customer reviews. Where the requirement sticks to grouping customer reviews based on a specific set of keywords irrespective of the context, the solution is shaped like the one demonstrated for problem 1. Majorly in such data, the customer reviews are of the nature that the particular keywords that become the centroids of clustering remain important in all contexts addressed throughout the data; therefore, additional pre-processing for identifying context is needless. However, in cases where grouping can further descend down to the context in which the keywords that are centroids of an apparent clustering are placed, identification of the context is important. As methods for this scenario, extracting the nearby words by splitting the original text into bi-grams, tri-grams, etc. is a way out. Apart from that, for different datasets, even generation of contexts by manually created dictionaries can be a potential solution.
Such clustering tasks can be supported with grouping expectations, with reference to which the performance of a clustering over word embeddings can be quantified, as is evident in problem 1. However, at the core of it, the task essentially remains a clustering task, and the labels or expectations provided cannot be used to provide supervision to the model.
Chapter 5
Lexicon based Sentiment Analysis and Fake News Detection

5.1 Sentiment Analysis

Sentiment Analysis[30][38] can actually be done at several layers of each retrieved piece of information. The overall sentiment of the search topic, combining all possible perspectives and usages linked with it, can be found by passing an aggregation of all relevant documents through a sentiment analyzer. Individual documents can also be inputs to the same mechanism, to find sentiments independently for each document. More narrowly, just the relevant sentences, that is, only the sentences containing the search query tokens, can give an overall sentiment completely concentrated on the search keyword. For example, if 'social distancing' is the search query, by this mechanism it is very much feasible to find how people think about it - considering all aspects of social distancing and related topics, just a subset of all aspects as confined in one document at a time, or simply the topic itself with far fewer related perspectives considered. The opinions[7][28] of people are quantified on three polarities - positive, negative and neutral.

5.1.1 Approach 1 - Innovation on Lexicon based Sentiment Analysis using TF-IDF scores

The workflow[17] for the same begins with building up a vocabulary with all possible words tagged against their basic sentiment - 0 for
neutral, 1 for positive and -1 for negative. Following that, for each word in each document, TF-IDF scores are found, which act as weights upon their respective basic sentiment scores and are therefore multiplied with them. This results in weighted sentiment measures[33] accurately capturing the importance of each word in determining the overall sentiment. The scores of words of the same sentiment type in a document are added, and we get three resultant scores for the three sentiments for the same document. These scores are then normalized, so that a greater presence of words of one sentiment type does not dominate over words of other sentiment types, which in some cases might in reality be more important words upon which the dominating sentiment should practically depend.

• Observation: Table 5.1
Sentiment scores using the Lexicon based Sentiment Analysis technique with TF-IDF scores, in three polarities (positive, neutral, negative), on sentences derived from the CoVid-19 Open Research Dataset are shown in this table.

Table 5.1: Sentences with clear sentimental views derived from CORD19 (Covid19 Open Research Dataset)
Sentence | Negative | Neutral | Positive
The virus is dangerous. | 0.472 | 0.178 | 0.0
More than sixty percent patients have recovered. | 0.0 | 0.229 | 0.229
Test rates are high in this country. | 0.0 | 0.197 | 0.229
This predicts good chances of survival. | 0.0 | 0.229 | 0.268
The vaccine has not yet been invented. | 0.229 | 0.210 | 0.0

5.1.2 Conventional Approach 2

The above approach was a no-training method. This means that without a training bulk applied over a considerable size of pre-labelled data, the above process can successfully quantify sentiments for random text. The actual sentiment scores might very well differ from those obtained using other approaches, but it has been experimentally found to keep the sentimental sense intact. That is, if a sentence is found majorly positive using
approach 1, it would remain majorly positive using other approaches as well; however, the score corresponding to its positive polarity might change with a change in approach. The second approach is rather a supervised learning approach circumscribing mainly a classification task. This remains the more widely used technique, as in the case of a no-training approach the inconvenience remains in tagging all possible words from the vocabulary with basic sentiment scores. In the supervised learning approach, there is a requirement of pre-labelled data. This means that texts (generally surrounding the domain in which the sentiment analyzer is being used) are labelled with three polarity values. The overall trend is learnt using a classification model such as a Naive Bayes Classifier, Logistic Regression, Support Vector Classifiers etc. This is then used to predict the sentiment polarity scores of any random foreign text.

• Observation: Table 5.2
Sentiment scores in three polarities (positive, neutral, negative) and the similarity of a ranked article from CORD (CoVid-19 Open Research Dataset) with respect to the search keyword 'coronavirus' are demonstrated there.

5.2 Truth Analysis and Belief Index Generation

During any pandemic or global outbreak, a lot of fake news crops up. In most cases, these apparently have high impact and are negative. Such false information gives rise to unnecessary panic, trauma and needless subsequent consequential measures. To avoid the ordeal, analysis of the truth of each piece of information or news article is an essential task.

5.2.1 Conventional Approach 1

The most prevalent approach requires a high volume dataset where a large number of informational pieces from each sub-domain of the domain for which the truth analyzer is being built are present.
These need to be tagged with true/false values, and a classification model thereafter can be used to learn trends from the orientation of the dataset to later predict the truth or falsity status of any random information. Since the labelling is discrete, and here even binary, the number of constituent labelled pieces of information required in the dataset to help a classification model reach accurate predictions
Table 5.2: Sentences procured for the 'coronavirus' query from the highest ranked article from the CoVid-19 Open Research Dataset by Kaggle, and their measures of sentiment and similarity scores
Sentence | Negative | Neutral | Positive | Similarity score
Researches so far have shown coronavirus-OC43 as the coronavirus of most prevalence in a lot of nations. | 0.0 | 1.0 | 0.0 | 0.47252167
Using the recently created coronavirus antigen structure, 6 CoV +ve patients were triumphantly detected. | 0.0 | 0.789 | 0.211 | 0.46485475
The research was directed at incrementing the data on clinically contextual CoV spread by regulating antigen dilution in CoV diseased patients using a recently created structure for the fast finding out of CoV-OC43 infections. | 0.0 | 0.935 | 0.065 | 0.43198237
This is not likely to be cross-reaction, be it with Middle East respiratory syndrome CoV or intense acute respiratory syndrome CoV. | 0.126 | 0.874 | 0.0 | 0.36930588
The clinical features of the CoV +ve individuals are structured in the respective tabular indices. | 0.0 | 1.0 | 0.0 | 0.33307608
A lot of new CoV specimens have been rising from 2000. | 0.0 | 1.0 | 0.0 | 0.33153056
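The no-training workflow of Section 5.1.1, which produced scores of the kind shown in Tables 5.1 and 5.2, can be sketched as below. The tiny lexicon and the TF-IDF variant are illustrative assumptions; the report presumes every vocabulary word carries a pre-tagged basic sentiment of +1, -1 or 0.

```python
import math
from collections import Counter

# Toy stand-in for the full pre-tagged sentiment vocabulary.
LEXICON = {"dangerous": -1, "recovered": 1, "good": 1}

def sentiment_scores(doc, corpus):
    """Weight each word's basic sentiment by its TF-IDF score, then normalize."""
    tokenized = [d.lower().split() for d in corpus]
    df = Counter(t for d in tokenized for t in set(d))
    tokens = doc.lower().split()
    tf = Counter(tokens)
    pos = neg = neu = 0.0
    for t, c in tf.items():
        w = (c / len(tokens)) * math.log(len(corpus) / df.get(t, 1))  # tf-idf weight
        s = LEXICON.get(t, 0)  # words outside the lexicon count as neutral
        if s > 0:
            pos += w
        elif s < 0:
            neg += w
        else:
            neu += w
    total = (pos + neg + neu) or 1.0  # normalize so the three polarities sum to 1
    return {"positive": pos / total, "negative": neg / total, "neutral": neu / total}

corpus = ["the virus is dangerous", "many patients have recovered", "test rates are high"]
res = sentiment_scores("the virus is dangerous", corpus)
print(res)
```

Because the weights come from TF-IDF rather than fixed per-word values, the same lexicon word can contribute differently in different documents, which is the innovation the section describes.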
is very high. The resultant dimension and variety of information makes it more difficult to fact-check by manually processing every piece of information and tagging it with truth and falsity values.

5.2.2 Approach 2 - Innovation on Belief Index Generation with TF-IDF Algorithm as the method of vectorization

False information identification[10] is traditionally done using a two-phase procedure following primary pre-processing. The first process is feature engineering, which comprises creating text embeddings of the original text documents by transforming text to text vectors. Some such techniques are Word2Vec[13], FastText, GloVe and TF-IDF. This is usually succeeded by classification by models trained on data for which predefined labels are present. Different types of vectorization processes and classification methods applied for the aim at hand, as well as their combinations, have given rise to a set of accuracies, provided the labelled dataset remains constant. However, the continuous challenge throughout was the inadequate labelling, with 'truth' and 'falsity' labels, of a size of data that is hardly finite, such that the model can precisely and least erroneously forecast the label for random text supplied as input. The mechanism mostly included binary as well as discrete tagging of textual documents. A particular thing that has been overlooked in the entire mechanism is the opportunity for efficient utilization of the distance measure depending upon which the requisite classification was carried out. Truth Analysis followed by Belief Index Generation is a three-phase implementation where, after the feature engineering and classification steps, there exists a quantification of coincidence between the vector of the text provided and the nearest neighbour previously marked in the erstwhile classification phase.
From experiments it has been inferred that, for any given text taken at random, the closest or nearest neighbour, which is the one having the smallest Euclidean distance, possesses the highest similarity measure. So in context, the way out was visualized as taking the highest similarity measure among the similarity scores computed for all possible pairs of the provided text with each of the labelled data. As this highest measure essentially means the maximum possible closeness of a particular datum to a pre-tagged one, the measure of this similarity could safely be taken as the quantity of disbelief or belief (based on the relevant classification) which could be put on that randomly supplied text. Therefore, appending to a classified discrete true or false label, we calibrate the probability of the information in the direction of the tag, which essentially solves
the requisite of tagging each probable piece of information to come to a conclusive point. For the purpose of building up word embeddings, the TF-IDF Vectorizer modelled upon the TF-IDF weighting scheme can be utilized, and in the procurement of belief indices, Cosine Similarity has a niche.

• Observation: Table 5.3
The quantification of beliefs and disbeliefs that can be associated with news articles is demonstrated here. The news articles used are from Poynter. The classification of the news articles by the model and their actual labels are also shown.

5.3 Summary

The ways in which sentiment analysis and fake news detection have been dealt with here are proposed methods which essentially use the TF-IDF algorithm as a generator of deterministic scores that define the end outcomes. Lexicon based sentiment analysis has been a way in which the importance of keywords is considered: the sentimental weights of the keywords together result in the overall sentiment. The same principle has been demonstrated here, wherein the TF-IDF scores act as weights to be multiplied with the basic sentimental weights. This makes the imposition of weights upon words dynamic and dependent upon the domain (document and corpus of documents) in which the keywords contribute to sentiment analysis. Fake news detection has been powered by a quantification of how much the news is false. The indicator for that is again a score; however, the score generator for this purpose is actually a cosine similarity score, and the use of TF-IDF is as a word embedding technique. This has been proposed as a separate paper and published in IEEE RAICS, 2020.
Table 5.3: Calibration of beliefs and dis-beliefs applied on some sentences.

1. "To minimize spread of MERS-coronavirus and stop new human infections, it is vital to point out all possible creation spaces of infections and risks associated and route(s) of spread."
   Belief Index: 0.95 | Prediction: True | Actual: True

2. "There is just a bound amount of epidemiological data published taking into account potential sources and spread dynamics of MERS-coronavirus infection in and among the Arabian Peninsula."
   Belief Index: 0.98 | Prediction: True | Actual: True

3. "A video presenting a policeman on duty in Hajipur Jail suffering from COVID-19."
   Belief Index: -0.88 | Prediction: False | Actual: False

4. "IFITM1 is recognized to have a critical say in preventing primary phases of viral cloning and it absolutely stops entry and infections by a significant count of extremely pathogenic viruses, which even includes HIV-1, filovirus, and SARS coronavirus."
   Belief Index: 0.99 | Prediction: True | Actual: True

5. "GloboNews, which is a Brazilian news platform, read out that a criminal succumbed to COVID-19 while he was involved in gunfire between him and the police."
   Belief Index: -0.91 | Prediction: False | Actual: False
Chapter 6

Summary and Future Works

6.1 Main Conclusions

The work is a complete analysis of text-based operations using TF-IDF, covering prediction tasks of the types classification and clustering. Its application as a word embedding technique for machine learning has been shown in problems pertaining to the clustering of customer reviews and, as a supervised task, the classification of news articles with respect to truth or falsity tags. The way fake news detection has been dealt with in this work demonstrates the potential of quantifying each label after classification using score generators; in fake news detection, cosine similarity as a distance metric is the main source of score generation. However, beyond the conventional learning-based approach in which TF-IDF is an essential tool for feature engineering, it has been established in this work that, with no or minimal machine learning, TF-IDF also enables independent score-based predictions, where the TF-IDF score becomes the deterministic score for a problem. This has been demonstrated in the part of the work where lexicon-based sentiment analysis is performed such that a linear combination of the TF-IDF scores of the constituent words expresses the net sentiment of a sentence in three polarities, making TF-IDF scores, or a linear combination of them, an independent score generator for sentiment analysis. As opposed to the VADER package of the NLTK library, the process does not act over predefined weights for all possible words enclosed in a vocabulary: the weights of the words are determined dynamically based on their importance in the current document. In both cases, however, the lexicons are the factors on which the scores are based, and therefore the scores derived by these similar approaches remain in relative synchronization.
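The TF-IDF-weighted lexicon scheme summarized above can be sketched roughly as follows. The four-word lexicon and the two-review corpus are invented for illustration and are not the thesis's actual lexicon or data:

```python
# Rough sketch: sentiment as a linear combination of TF-IDF scores and
# static lexicon weights, reduced to one of three polarities.
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny illustrative lexicon of sentimental weights
lexicon = {"good": 1.0, "great": 1.0, "bad": -1.0, "poor": -1.0}

corpus = [
    "good build quality but poor mileage",
    "great safety features and good road presence",
]

vec = TfidfVectorizer()
tfidf = vec.fit_transform(corpus)
vocab = vec.vocabulary_

def sentiment(doc_index):
    """Sum TF-IDF score * lexicon weight over the lexicon words present,
    so a word's influence depends on its importance in this corpus."""
    score = sum(tfidf[doc_index, col] * lexicon[word]
                for word, col in vocab.items() if word in lexicon)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment(0), sentiment(1))
```

Note how "poor" (which appears in only one review, hence a higher IDF) outweighs the more common "good" in the first review, which is exactly the dynamic weighting contrasted with VADER above.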
6.2 Open Problems

In deployments involving the TF-IDF algorithm, the focus within the major domains of text analytics is a keyword-centric approach in which the occurrence of words is significant. The loophole in such cases is the absence of the semantics and context that help interpret texts better for various purposes. Although this is not a property that TF-IDF showcases independently, such properties can be imbibed in applications and for data that require such treatment. The inclusion can well be facilitated by alternative word embedding techniques themselves, such as BERT, Word2Vec or InferSent, or even by methods that primarily contribute to pre-processing. Modifying TF-IDF embeddings by operating over the array, or using a TF-IDF embedding as an input embedding layer to neural networks for different tasks, is not efficient, as TF-IDF is dedicated to a particular set of requirements that are mostly keyword based. Expecting it to be part of a solution carried out using neural networks, encoder-decoder networks or generative networks is therefore not feasible.

6.3 Future Works

The potential use of TF-IDF as a score generator for solving a variety of tasks points towards rule-based solution deployment in products. Over mediums and platforms including social media, it can be involved in text mining, rule-based opinion mining and extractive text summarization through conceptually valid methods. Algorithmic integration of related purpose-oriented mechanisms can well strengthen current deployments of TF-IDF even beyond keyword-based outcomes. Technologies that capture context in the localized area where TF-IDF is used, an example being the n-grams approach or semantics, can help in executing text analytics better. To make an algorithm semantically efficient, dictionaries holding word tokens of similar meaning, or even a corpus of texts that mean the same, can be an added resource for a problem.
Syntactic and lexical grammar, represented in some form and implemented in the course of solving a problem, aids in achieving better results. Similarly, accumulating equivalent or identical words that appear in different parts of speech into the same level or array benefits the overall solution building. Such algorithmic integration is not confined to text pre-processing: such collections in different formats, such as lists or dictionaries (JSONs), which act as references during the main work on a problem, logically exist before feature engineering and can also aid in replacing tokens of the original text. As a pivot, keyword-based understanding and processing of language is a power that TF-IDF contributes. Expectedly, TF-IDF remains efficient in the processing of text content and the generation of rank-oriented end solutions.
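The n-grams idea mentioned above is directly expressible with scikit-learn's TfidfVectorizer via its ngram_range parameter; a minimal sketch over an invented toy corpus:

```python
# Minimal sketch: adding word bigrams to the TF-IDF feature space so that
# local context such as "build quality" survives as a single feature.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the build quality is excellent",
    "the safety features are excellent",
]

# ngram_range=(1, 2) keeps unigrams and adds bigrams, giving the
# embedding some local context beyond isolated keywords.
vec = TfidfVectorizer(ngram_range=(1, 2))
X = vec.fit_transform(corpus)

# Bigram features contain a space in their name
print(sorted(f for f in vec.vocabulary_ if " " in f))
```

This is the same mechanism exploited in the bigram-based clustering of the annexure's Problem 2, where the bigrams are built explicitly with NLTK before vectorization.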
Chapter 7

Publications

1) Panja, S., A. K. Maan, and A. P. James. "Vilokana - Lightweight COVID19 Document Analysis." In 2020 IEEE 63rd International Midwest Symposium on Circuits and Systems (MWSCAS), pp. 500-504. IEEE, 2020.

2) Panja, S., and A. P. James. "Belief Index for Fake COVID19 Text Detection." In 2020 IEEE Recent Advances in Intelligent Computational Systems (RAICS), pp. 63-67. IEEE, 2020.
Chapter 8

ANNEXURE

8.1 Code for Problem 1: Clustering of customer reviews on cars - solved using K-Means over TF-IDF embeddings of complete sentences

# -*- coding: utf-8 -*-
"""ClooTrackCustomerReviewClusteringUsingTFIDFVectorizer(1).ipynb

Automatically generated by Colaboratory.

Original file is located at
    https:// /1S40m5RAy5WEUuPmdkx8WlkuKJZY
"""

# Loading of dataset
import pandas as pd

df = pd.read_excel(r'clootrackbook.xlsx')
# for an earlier version of Excel, you may need to use the file extension '.xls'
print(df)

df.columns = [c.replace(' ', '_') for c in df.columns]
x_tr = df.Opinion
y_tr = df.Driver_Name
"""# Finding the number of clusters into which the opinions have been
classified according to the dataset given to us"""
count = 1
old = "road presence"
for each in y_tr:
    if each != old:
        count = count + 1
    old = each
numberofclusters = count
print(numberofclusters)

"""# Getting 'just nouns' versions of all opinions - we will name it 'NounArray'"""
from textblob import TextBlob

nounarray = []
for opinion in x_tr:
    concatenate = ""
    blob = TextBlob(opinion)
    txt = [n for n, t in blob.tags if t == 'NN']
    if len(txt) == 0:
        txt.append(opinion)
    for token in txt:
        concatenate = concatenate + " " + token
    nounarray.append(concatenate)
print(nounarray)

"""# Getting feature vectors (word embeddings) of all contents of NounArray
using the TF-IDF Vectorizer"""
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import numpy as np

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(nounarray)
"""# Passing the word embeddings through the KMeans clusterer"""
true_k = 3
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)  # the fit call was missing in the original listing

"""# Getting centroids of each cluster in terms of a group of m words (here m = 10)"""
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(true_k):
    print("Cluster ", i, "\n")
    for ind in order_centroids[i, :10]:
        print(terms[ind])

"""# Clustering and printing the clusters"""
prediction = []
for i in nounarray:
    X = vectorizer.transform([i])
    predicted = model.predict(X)
    prediction.append(predicted)
print(prediction)

finalclusters = []
for i in range(true_k):
    finalclusters.append([])
for counter in range(len(prediction)):
    finalclusters[prediction[counter][0]].append(x_tr[counter])

for eachclusterindex in range(len(finalclusters)):
    print("Cluster ", eachclusterindex + 1, ":\n")
    print(finalclusters[eachclusterindex], "\n")

"""# Calculating accuracies"""
# For road presence
rp = []
for i in range(len(y_tr)):
    if y_tr[i] == "road presence":
        rp.append(x_tr[i])

d = 0
calcrp = finalclusters[0]
print("Supposedly misclassified sentences:\n")
for i in calcrp:
    if i in rp:
        d = d + 1
    else:
        print(i)
if d == len(calcrp):
    print("None")
print("\nAccuracy: ", d / len(calcrp))

# For build quality
bd = []
for i in range(len(y_tr)):
    if y_tr[i] == "build quality":
        bd.append(x_tr[i])

d = 0
calcbd = finalclusters[1]
print("Supposedly misclassified sentences:\n")
for i in calcbd:
    if i in bd:
        d = d + 1
    else:
        print(i)
if d == len(calcbd):
    print("None")
print("\nAccuracy: ", d / len(calcbd))

# For safety features
sf = []
for i in range(len(y_tr)):
    if y_tr[i] == "safety features":
        sf.append(x_tr[i])
d = 0
calcsf = finalclusters[2]
print("Supposedly misclassified sentences:\n")
for i in calcsf:
    if i in sf:
        d = d + 1
    else:
        print(i)
if d == len(calcsf):
    print("None")
print("\nAccuracy: ", d / len(calcsf))

8.2 Code for Problem 2: Clustering of customer reviews on cars - solved using K-Means over TF-IDF embeddings of bigrams derived from original sentences

# -*- coding: utf-8 -*-
"""Untitled29.ipynb

Automatically generated by Colaboratory.

Original file is located at
    https:// /1pqSlWN5Rli1E9TXLQuan96KlguF
"""

import pandas as pd

df = pd.read_excel(r'reviewfile.xlsx')
# for an earlier version of Excel, you may need to use the file extension '.xls'
print(df.Review)

for i in range(len(df.Review)):
    df.Review[i] = str(df.Review[i])

import re
# clean text from noise
def clean_text(text):
    # filter to allow only alphabets (and apostrophes)
    text = re.sub(r"[^a-zA-Z']", ' ', text)
    # remove non-ASCII Unicode characters
    text = re.sub(r'[^\x00-\x7F]+', '', text)
    # convert to lowercase to maintain consistency
    text = text.lower()
    return text

df['clean_text'] = df.Review.apply(clean_text)

import nltk'words')'punkt')'averaged_perceptron_tagger')
words = set(nltk.corpus.words.words())

def clean_sent(sent):
    # keep only tokens that are dictionary words or non-alphabetic
    return " ".join(w for w in nltk.wordpunct_tokenize(sent)
                    if w.lower() in words or not w.isalpha())

df['clean_text'] = df['clean_text'].apply(clean_sent)

from textblob import TextBlob

nounarray = []
for opinion in df.clean_text:
    concatenate = ""
    blob = TextBlob(opinion)
    txt = [n for n, t in blob.tags if t == 'NN']
    if len(txt) == 0:
        txt.append(opinion)
    for token in txt:
        concatenate = concatenate + " " + token
    nounarray.append(concatenate)

df['clean_nouns'] = nounarray

sentences = []
from nltk.tokenize import word_tokenize
for i in df.clean_text:

import nltk
all_bigrams = []
for i in df.clean_nouns:
    nltk_tokens = nltk.word_tokenize(i)
    current_bigrams = list(nltk.bigrams(nltk_tokens))
    for j in current_bigrams:

from nltk.corpus import stopwords  # used below; the import was missing in the original listing'stopwords')
stop_words = set(stopwords.words('english'))

all_bigrams_new = []
for i in all_bigrams:
    if i[0] not in stop_words and i[1] not in stop_words \
            and len(i[0]) > 3 and len(i[1]) > 3:
print(all_bigrams_new)

bigrams_in_string = []
for i in all_bigrams_new:
    bigram_in_string = i[0] + ' ' + i[1]

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

tfidfconverter = TfidfVectorizer(max_features=1500, min_df=5, max_df=0.7,
X = tfidfconverter.fit_transform(bigrams_in_string).toarray()

# Elbow errors for the full bigram matrix (this loop was missing from the
# original listing; reconstructed from the per-cluster loop further below)
cluster_range = range(1, 10)
cluster_errors = []
for num_clusters in cluster_range:
    clusters = KMeans(n_clusters=num_clusters).fit(X)
clusters_df = pd.DataFrame({"cluster_errors": cluster_errors,
                            "num_clusters": cluster_range})

from kneed import KneeLocator
elbow = KneeLocator(clusters_df.num_clusters.values,
                    clusters_df.cluster_errors.values,
                    S=1.0, curve='convex', direction='decreasing')
print('create a K-means cluster with ' + str(elbow.knee) + ' clusters')

import matplotlib.pyplot as plt
plt.plot(clusters_df[0:10].num_clusters.values,

from sklearn.cluster import KMeans
import numpy as np

kmeans = KMeans(n_clusters=6, random_state=0).fit(X)
labels = kmeans.labels_

allclusters = []
for k in range(0, 6):
    cluster = []
    for i in range(len(labels)):
        if labels[i] == k:

c = 0
for i in allclusters:
    from sklearn.feature_extraction.text import TfidfVectorizer
    tfidfconverter = TfidfVectorizer(max_features=1500, min_df=5, max_df=0.7,
    Xone = tfidfconverter.fit_transform(i).toarray()
    print("Main cluster ", c)

    cluster_range = range(1, 10)  # test for cluster sizes 1 to 10
    cluster_errors = []           # create array to hold errors
    for num_clusters in cluster_range:
        clusters = KMeans(n_clusters=num_clusters).fit(Xone)  # fit added; missing in the original listing

    clusters_df = pd.DataFrame({"cluster_errors": cluster_errors,
                                "num_clusters": cluster_range})

    from kneed import KneeLocator
    elbow = KneeLocator(clusters_df.num_clusters.values,
                        clusters_df.cluster_errors.values,
                        S=1.0, curve='convex', direction='decreasing')
    print('create a K-means cluster with ' + str(elbow.knee) + ' clusters')
    print('\n\n')
    c = c + 1

final_allclusters = []
from sklearn.feature_extraction.text import TfidfVectorizer
tfidfconverter = TfidfVectorizer(max_features=1500, min_df=5, max_df=0.7,
Xone = tfidfconverter.fit_transform(allclusters[5]).toarray()

from sklearn.cluster import KMeans
import numpy as np

kmeans = KMeans(n_clusters=3, random_state=0).fit(Xone)
labels = kmeans.labels_
for k in range(0, 3):
    cluster = []
    for i in range(len(labels)):
        if labels[i] == k:

# list_of_tuples: rows pairing up the sub-clusters (its construction is not
# shown in the original listing)
columns = ['Cluster 1_1', 'Cluster 1_2', 'Cluster 1_3',
           'Cluster 2_1', 'Cluster 2_2', 'Cluster 2_3', 'Cluster 2_4',
           'Cluster 3_1', 'Cluster 3_2', 'Cluster 3_3', 'Cluster 3_4',
           'Cluster 4_1', 'Cluster 4_2',
           'Cluster 5_1', 'Cluster 5_2', 'Cluster 5_3',
           'Cluster 6_1', 'Cluster 6_2', 'Cluster 6_3']
df_new = pd.DataFrame(list_of_tuples, columns=columns)
df_new.to_csv('final_clusters.csv', columns=columns, header=False, index=False)
import pandas as pd

fcdf = pd.read_excel('/content/final_clusters_.xlsx')
fcdf