1. FEATURE ENGINEERING FOR TEXT DATA
Presenter : Shruti Kar
Instructor : Dr. Guozhu Dong
Class : Feature Engineering
(CS 7900-05)
“More data beats clever algorithms, but better data beats more data.” – Peter Norvig.
3. FEATURES FROM SEMI-STRUCTURED DATA
Examples: Book, Newspaper, XML Documents, PDFs etc.
Features:
• Table Of Content / Index
• Glossary
• Titles
• Subheadings
• Text (bold, color, & italics)
• Captions on Photographs / Diagrams
• Tables
• <> tags in XML documents
4. Cleaning:
• Convert accented characters
• Expand contractions
• Lowercase
• Repair broken spacing (“C a s a C a f é” -> “Casa Café”) (Not in example)
Removing:
• Stopwords
• Tags
• Rare words (Not in example)
• Common words (Not in example)
• Non-alphanumeric characters
Roots:
• Spelling correction (Not in example)
• Chop (Not in example)
• Stem (root word)
• Lemmatize (semantic root)
e.g., “I am late” -> “I be late”
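The cleaning steps above can be sketched with the standard library alone. The contraction map and stopword list here are tiny illustrative stand-ins; real pipelines use libraries such as NLTK or spaCy for stopwords and lemmatization:

```python
import re
import unicodedata

# Illustrative (not exhaustive) maps, for demonstration only.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "i'm": "i am"}
STOPWORDS = {"a", "an", "the", "is", "to", "and"}

def clean(text):
    # Lowercase.
    text = text.lower()
    # Expand contractions.
    for contraction, full in CONTRACTIONS.items():
        text = text.replace(contraction, full)
    # Convert accented characters to ASCII equivalents (café -> cafe).
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Remove non-alphanumeric characters.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Remove stopwords.
    return [token for token in text.split() if token not in STOPWORDS]

print(clean("I'm going to the Casa Café!"))
```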
NATURAL LANGUAGE PROCESSING:
FEATURES FROM UNSTRUCTURED DATA
7. TEXT VECTORIZATION
BAG OF WORDS MODEL:
• Vector space model – a mathematical model that represents unstructured text as numeric vectors, where each
dimension of the vector is a specific feature/attribute.
• Bag of Words model – the simplest vector space representation model. It represents each text document as a
numeric vector in which each dimension is a specific word from the corpus and the value could be:
its frequency in the document,
its occurrence (denoted by 1 or 0), or
a weighted value.
• Each document is represented literally as a ‘bag’ of its own words, disregarding:
word order,
sequences, and
grammar.
8. TEXT VECTORIZATION
BAG OF WORDS MODEL:
(1) John likes to watch movies. Mary likes movies too.
(2) John also likes to watch football games.
“John”, “likes”, “to”, “watch”, “movies”, “Mary”, “likes”, “movies”, “too”
“John”, “also”, “likes”, “to”, “watch”, “football”, “games”
BOW1 = {“John”:1, “likes”:2, “to”:1, “watch”:1, “movies”:2, “Mary”:1, “too”:1};
BOW2 = {“John”:1, “also”:1, “likes”:1, “to”:1, “watch”:1, “football”:1, “games”:1};
(3) John likes to watch movies. Mary likes movies too. John also likes to watch football games.
BOW3 = {“John”:2, “likes”:3, “to”:2, “watch”:2, “movies”:2, “Mary”:1, “too”:1, “also”:1, “football”:1, “games”:1};
         John  likes  to  watch  movies  Mary  too  also  football  games
Doc (1)   1     2     1    1      2       1     1    0       0        0
Doc (2)   1     1     1    1      0       0     0    1       1        1
Doc (3)   2     3     2    2      2       1     1    1       1        1
Each row is a Document Vector; each column is a Word Vector.
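The BOW vectors in the table can be reproduced in a few lines of plain Python (the tokenizer here is deliberately naive, just lowercasing and stripping periods):

```python
from collections import Counter

docs = [
    "John likes to watch movies. Mary likes movies too.",
    "John also likes to watch football games.",
]

def tokenize(text):
    # Lowercase and strip periods before splitting on whitespace.
    return text.lower().replace(".", "").split()

# Vocabulary: every distinct word in the corpus, in sorted order.
vocab = sorted({word for doc in docs for word in tokenize(doc)})

def bow_vector(text):
    # One dimension per vocabulary word; value = frequency in the document.
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocab]

for doc in docs:
    print(bow_vector(doc))
```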
11. TEXT VECTORIZATION
TF-IDF MODEL:
• Term frequencies are not necessarily the best representation for the text.
• Having a high raw count does not necessarily mean that the corresponding word is more important.
• TF-IDF “normalizes” the term frequency by weighing a term by the inverse of document frequency.
TF = (Number of times term t appears in a document)/(Number of terms in the document)
IDF = log(N/n),
where N is the total number of documents and
n is the number of documents in which term t appears.
TF-IDF = TF x IDF
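These formulas translate directly into Python (toy, pre-tokenized documents; library implementations such as scikit-learn's TfidfVectorizer use smoothed variants of IDF):

```python
import math

docs = [
    ["john", "likes", "movies"],
    ["john", "likes", "football"],
    ["mary", "likes", "movies", "too"],
]
N = len(docs)  # number of documents

def tf(term, doc):
    # (times term t appears in the document) / (terms in the document)
    return doc.count(term) / len(doc)

def idf(term):
    # log(N / n), where n = number of documents containing the term
    n = sum(1 for doc in docs if term in doc)
    return math.log(N / n)

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

print(tf_idf("movies", docs[0]))  # in 2 of 3 docs: positive weight
print(tf_idf("likes", docs[0]))   # in all 3 docs: idf = log(1) = 0
```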
17. TEXT VECTORIZATION
BAG OF N-GRAMS MODEL:
• Bag of Words model doesn’t consider order of words.
Thus different sentences can have exactly the same representation, as long as the same words are used.
• Bag of N-Grams model – an extension of the Bag of Words model that leverages N-gram based features.
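N-gram extraction is a sliding window of length n over the token list; a minimal sketch:

```python
def ngrams(tokens, n):
    # Slide a window of length n over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "john likes to watch movies".split()
print(ngrams(tokens, 2))
# [('john', 'likes'), ('likes', 'to'), ('to', 'watch'), ('watch', 'movies')]
```

A Bag of N-Grams model then counts these tuples exactly as Bag of Words counts single tokens, so "likes to" and "to likes" become distinct features.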
20. TEXT VECTORIZATION
CO-OCCURRENCE MATRIX:
PMI = Pointwise Mutual Information:
PMI(w, c) = log [ P(w, c) / (P(w) P(c)) ]
where w = word, c = context word.
Larger PMI -> higher correlation.
ISSUES: Many entries have PMI(w, c) = log 0 = -infinity (word-context pairs never observed together).
SOLUTION:
• Set PMI(w, c) = 0 for all unobserved pairs, or
• Drop all entries with PMI < 0 [POSITIVE POINTWISE MUTUAL INFORMATION (PPMI)]
Produces 2 different vectors for each word:
• Describes word when it is the ‘target word’ in the window
• Describes word when it is the ‘context word’ in window
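The PPMI computation can be sketched over toy (word, context) co-occurrence counts; the pairs below are made up for illustration:

```python
import math
from collections import Counter

# Toy (target word, context word) pairs from a small corpus window.
pairs = [("cat", "sat"), ("cat", "sat"), ("dog", "ran"), ("dog", "sat")]
total = len(pairs)
pair_counts = Counter(pairs)
word_counts = Counter(w for w, _ in pairs)
ctx_counts = Counter(c for _, c in pairs)

def ppmi(w, c):
    # Unobserved pairs would give log(0) = -inf; PPMI clamps them to 0.
    if pair_counts[(w, c)] == 0:
        return 0.0
    pmi = math.log(
        (pair_counts[(w, c)] / total)
        / ((word_counts[w] / total) * (ctx_counts[c] / total))
    )
    # Drop negative PMI entries as well (the "positive" in PPMI).
    return max(0.0, pmi)

print(ppmi("cat", "sat"))  # observed pair: positive score
print(ppmi("cat", "ran"))  # unobserved pair: 0.0
```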
21. PREDICTION BASED EMBEDDING:
Prediction based Embedding
• CBOW
• Skip-Gram
The CBOW and Skip-Gram neural network models differ in the input and output of the network.
• CBOW: Input to the neural network is the set of context words within a certain window surrounding a ‘target’ word. The
output predicts the ‘target’ word, i.e., which word should belong to the target position.
• Skip-Gram: The inverse of CBOW. Input is the ‘target’ word appearing at the center of the window, and the output
predicts each ‘context’ word in the window.
In both cases we learn a word vector wi and a context vector w̃i for each word in the vocabulary.
22. PREDICTION BASED EMBEDDING:
• The goal is to supply training samples and learn the weights.
• Use the learned weights to predict probabilities for a new input word.
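The (input, output) training samples the two models consume can be sketched in plain Python. This is only a toy sample generator, not a neural network; in practice a library such as gensim learns the actual vectors:

```python
def training_samples(tokens, window, mode):
    """Generate (input, output) training samples for CBOW or Skip-Gram."""
    samples = []
    for i, target in enumerate(tokens):
        # Context words within `window` positions of the target.
        context = [tokens[j]
                   for j in range(max(0, i - window),
                                  min(len(tokens), i + window + 1))
                   if j != i]
        if mode == "cbow":
            # CBOW: context words in, target word out.
            samples.append((context, target))
        else:
            # Skip-Gram: target word in, each context word out.
            samples.extend((target, c) for c in context)
    return samples

tokens = "the cat sat on the mat".split()
print(training_samples(tokens, 1, "cbow")[0])       # (['cat'], 'the')
print(training_samples(tokens, 1, "skipgram")[:2])  # [('the', 'cat'), ('cat', 'the')]
```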
23. PARAGRAPH VECTOR:
• Bag of Words – no order/sequence, and no semantics.
• Bag of N-grams – captures a little semantics, but suffers from data sparsity and high dimensionality.
• Methods:
• A weighted average of all the words in the document (loses word order)
• Combining the word vectors in an order given by a parse tree of the sentence, using
matrix-vector operations (works only for sentences)
• PARAGRAPH VECTOR – applicable to variable-length pieces of texts:
sentences,
paragraphs, and
documents
24. PARAGRAPH VECTOR:
A framework for learning word vectors. Context of three words
(“the,” “cat,” and “sat”) is used to predict the fourth word
(“on”). The input words are mapped to columns of the matrix
W to predict the output word.
26. DOCUMENT SIMILARITY:
• Document similarity – Similarity based on features extracted from the documents like
bag of words or tf-idf.
• Pairwise document similarity
• Several similarity and distance metrics
• cosine distance/similarity
• euclidean distance
• manhattan distance
• BM25 similarity
• jaccard distance
• Levenshtein distance
• Hamming distance
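Cosine similarity, the most common of these metrics, is easy to compute directly; the two vectors below are the document vectors for sentences (1) and (2) from the earlier Bag of Words table:

```python
import math

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Bag-of-words document vectors over the shared vocabulary.
doc1 = [1, 2, 1, 1, 2, 1, 1, 0, 0, 0]
doc2 = [1, 1, 1, 1, 0, 0, 0, 1, 1, 1]
print(round(cosine_similarity(doc1, doc2), 3))
```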
28. TOPIC MODELS
• We can also use summarization techniques to extract topic- or concept-based features from text documents.
• Topic modeling extracts key themes or concepts from a corpus of documents and represents them as topics.
• Each topic can be represented as a bag or collection of words/terms from the document corpus.
29. TOPIC MODELS
• Most topic models use matrix decomposition.
• E.g., Latent Semantic Indexing uses Singular Value Decomposition (SVD).
LATENT SEMANTIC INDEXING:
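A minimal sketch of the SVD step behind LSI, assuming NumPy is available; the 3×3 term-document matrix is a toy example:

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents.
X = np.array([
    [1.0, 1.0, 0.0],   # "movie"
    [1.0, 0.0, 0.0],   # "film"
    [0.0, 0.0, 1.0],   # "football"
])

# SVD factors X into U * diag(s) * Vt.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the top-k singular values for a k-topic approximation;
# documents are then compared in this reduced latent space.
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(X_k, 2))
```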
37. TOPIC MODELS
• Latent Dirichlet Allocation – uses a generative probabilistic model.
• Each document consists of a combination of several topics.
• Each term or word can be assigned to a specific topic.
• Similar to the pLSI model (probabilistic LSI), except that in LDA the latent topics have a Dirichlet prior over them.
LATENT DIRICHLET ALLOCATION:
• extract K topics
• from M documents
39. TOPIC MODELS
LATENT DIRICHLET ALLOCATION:
• When LDA is applied to a document-term matrix (a TF-IDF or Bag of Words feature matrix), it gets decomposed into
two main components:
• A document-topic matrix, which is the feature matrix we are looking for.
• A topic-term matrix, which helps us identify potential topics in the corpus.
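A minimal sketch of this decomposition, assuming scikit-learn is available; the four documents and K = 2 topics are toy choices:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "football game team players win",
    "movie film actor cinema watch",
    "team win match football score",
    "actor movie watch film scene",
]

# Bag-of-words document-term matrix.
X = CountVectorizer().fit_transform(docs)

# Decompose into K = 2 topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)   # document-topic matrix (the feature matrix)
topic_term = lda.components_       # topic-term matrix (describes the topics)

print(doc_topic.shape)   # (4 documents, 2 topics)
print(topic_term.shape)  # (2 topics, vocabulary size)
```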
40. TEXT CLASSIFICATION PIPELINE:
TEXT DOCUMENT (structured or unstructured data)
-> PREPROCESSING: Tokenization, Stop-word removal, Lemmatization, Stemming
-> FEATURE SELECTION: NLP techniques, NER, TF-IDF, Information Gain (IG), BOW, N-gram BOW
-> FEATURE EXTRACTION: Word Embeddings, GloVe, LSA, LDA
-> TEXT CLASSIFICATION: Neural Network, CNN, RNN, LSTM, SVM, RF
CONCLUSION
Flesch Reading Ease score = 206.835 – (1.015 × ASL) – (84.6 × ASW)
Flesch-Kincaid Grade Level score = (0.39 × ASL) + (11.8 × ASW) – 15.59
where:
ASL = average sentence length (the number of words divided by the number of sentences)
ASW = average number of syllables per word (the number of syllables divided by the number of words)
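Both readability formulas translate directly into Python; the word/sentence/syllable counts passed in below are made-up example values (counting syllables reliably is its own problem):

```python
def flesch_reading_ease(words, sentences, syllables):
    asl = words / sentences    # average sentence length
    asw = syllables / words    # average syllables per word
    return 206.835 - (1.015 * asl) - (84.6 * asw)

def flesch_kincaid_grade(words, sentences, syllables):
    asl = words / sentences
    asw = syllables / words
    return (0.39 * asl) + (11.8 * asw) - 15.59

# Example text with 100 words, 5 sentences, 150 syllables.
print(round(flesch_reading_ease(100, 5, 150), 2))
print(round(flesch_kincaid_grade(100, 5, 150), 2))
```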
The paragraph vectors are also asked to contribute to the prediction task of the next word given many contexts sampled from the paragraph.
The contexts are fixed-length and sampled from a sliding window over the paragraph. The paragraph vector is shared across all contexts generated from the same paragraph but not across paragraphs. The word vector matrix W, however, is shared across paragraphs. I.e., the vector for “powerful” is the same for all paragraphs.
The paragraph token can be thought of as another word. It acts as a memory that remembers what is missing from the current context – or the topic of the paragraph. For this reason, we often call this model the Distributed Memory Model of Paragraph Vectors (PV-DM).
Suppose that there are N paragraphs in the corpus, M words in the vocabulary, and we want to learn paragraph vectors such that each paragraph is mapped to p dimensions and each word is mapped to q dimensions; then the model has a total of N × p + M × q parameters (excluding the softmax parameters).
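The parameter count is a one-line computation; the corpus and vocabulary sizes below are made-up example values:

```python
def pv_dm_parameters(n_paragraphs, vocab_size, p_dim, q_dim):
    # N paragraphs x p dims for paragraph vectors, plus
    # M words x q dims for word vectors (softmax parameters excluded).
    return n_paragraphs * p_dim + vocab_size * q_dim

# e.g. 1,000 paragraphs, 10,000-word vocabulary, p = q = 100 dimensions.
print(pv_dm_parameters(1000, 10000, 100, 100))  # 1,100,000 parameters
```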
Document similarity uses a distance- or similarity-based metric to identify how similar a text document is to other document(s), based on features extracted from the documents such as bag of words or TF-IDF.