FEATURE ENGINEERING FOR
TEXT DATA
Presenter : Shruti Kar
Instructor : Dr. Guozhu Dong
Class : Feature Engineering
(CS 7900-05)
https://cdn-images-1.medium.com/max/2000/1*vXKKe3J-lfi1YQ7HC6onxQ.jpeg
“More data beats clever algorithms, but better data beats more data.” – Peter Norvig.
TEXT DATA
STRUCTURED DATA UNSTRUCTURED DATA
UNDERSTANDING THE TYPE OF TEXT DATA
SEMI-STRUCTURED DATA
FEATURES FROM SEMI-STRUCTURED DATA
Examples: books, newspapers, XML documents, PDFs, etc.
Features:
• Table of Contents / Index
• Glossary
• Titles
• Subheadings
• Text (Bold, Color, & Italics)
• Captions on Photographs / Diagrams
• Tables
• <> tags in XML documents
NATURAL LANGUAGE PROCESSING:
FEATURES FROM UNSTRUCTURED DATA
Cleaning:
• Converting accented characters
• Expanding contractions
• Lowercasing
• Repairing (“C a s a C a f &eacute;” ->
“Casa Café”) (Not in example)
Removing:
• Stopwords
• Tags
• Rare words (Not in example)
• Common words (Not in example)
• Non-alphanumeric characters
Roots:
• Spelling correction (Not in example)
• Chop (Not in example)
• Stem (root word)
• Lemmatize (semantic root)
e.g., “I am late” -> “I be late” (see the sketch below)
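A minimal sketch of these cleaning and root-extraction steps, assuming NLTK (with its stopwords and wordnet data downloaded); the sample sentence is hypothetical:

```python
import re
import unicodedata

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

text = "The children are running to the caf\u00e9s!"

# Convert accented characters to their ASCII equivalents.
text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
# Lowercase and remove non-alphanumeric characters.
text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
# Remove stopwords.
tokens = [t for t in text.split() if t not in stopwords.words("english")]

stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
print([stemmer.stem(t) for t in tokens])                   # stems (root words)
print([lemmatizer.lemmatize(t, pos="v") for t in tokens])  # semantic roots
```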
Tokenizing:
• Tokenize
• N-Grams
• Skip-Grams
• Char-grams
• Affixes (Not in example)
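A small sketch of these tokenization features, assuming NLTK (the sample sentence is hypothetical):

```python
from nltk.tokenize import word_tokenize          # requires nltk.download('punkt')
from nltk.util import ngrams, skipgrams

sentence = "the quick brown fox jumps"
tokens = word_tokenize(sentence)

print(list(ngrams(tokens, 2)))        # bigrams: ('the', 'quick'), ('quick', 'brown'), ...
print(list(skipgrams(tokens, 2, 1)))  # 2-grams that may skip up to 1 word
print([sentence[i:i + 3] for i in range(len(sentence) - 2)])  # character 3-grams
```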
Enrich:
• Entity Insertion/Extraction
"Microsoft Releases Windows" -> "Microsoft(company) releases Windows(application)"
• Parse Trees
"Alice hits Bill" -> Alice/Noun_subject hits/Verb Bill/Noun_object
[('Mark', 'NNP', u'B-PERSON'), ('and', 'CC', u'O'), ('John', 'NNP', u'B-PERSON'), ('are', 'VBP', u'O'), ('working', 'VBG', u'O'),
('at', 'IN', u'O'), ('Google', 'NNP', u'B-ORGANIZATION'), ('.', '.', u'O')] (reproduced in the sketch below)
• Reading Level
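The (token, POS tag, IOB entity tag) triples above can be produced with NLTK's chunker; a hedged sketch (requires the punkt, averaged_perceptron_tagger, maxent_ne_chunker, and words data):

```python
import nltk
from nltk.chunk import tree2conlltags

sentence = "Mark and John are working at Google."
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))  # POS tags such as 'NNP', 'VBP'
tree = nltk.ne_chunk(tagged)                         # named-entity chunk tree
print(tree2conlltags(tree))                          # [('Mark', 'NNP', 'B-PERSON'), ...]
```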
AFTER PREPROCESSING:
TEXT VECTORIZATION
BAG OF WORDS MODEL:
• The simplest vector space representation model is the Bag of Words model.
• Vector space model – a mathematical model that represents unstructured text as numeric vectors,
where each dimension of the vector is a specific feature/attribute.
• Bag of Words model – represents each text
document as a numeric vector, where each dimension
is a specific word from the corpus and the value
could be:
its frequency in the document,
its occurrence (denoted by 1 or 0), or
a weighted value.
• Each document is represented literally as a ‘bag’
of its own words, disregarding:
word order,
sequences, and
grammar.
TEXT VECTORIZATION
BAG OF WORDS MODEL:
(1) John likes to watch movies. Mary likes movies too.
(2) John also likes to watch football games.
“John”, “likes”, “to”, “watch”, “movies”, “Mary”, “likes”, “movies”, “too”
“John”, “also”, “likes”, “to”, “watch”, “football”, “games”
BOW1 = {“John”:1, “likes”:2, “to”:1, “watch”:1, “movies”:2, “Mary”:1, “too”:1};
BOW2 = {“John”:1, “also”:1, “likes”:1, “to”:1, “watch”:1, “football”:1, “games”:1};
(3) John likes to watch movies. Mary likes movies too. John also likes to watch football games.
BOW3 = {“John”:2, “likes”:3, “to”:2, “watch”:2, “movies”:2, “Mary”:1, “too”:1, “also”:1, “football”:1, “games”:1};
      John  likes  to  watch  movies  Mary  too  also  football  games
(1)    1     2      1    1      2      1     1    0      0        0
(2)    1     1      1    1      0      0     0    1      1        1
(3)    2     3      2    2      2      1     1    1      1        1

(Each row is a document vector; each column is a word vector.)
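A minimal scikit-learn sketch reproducing this document-term matrix (note that CountVectorizer lowercases by default and orders its columns alphabetically, so the column order differs from the table above):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "John likes to watch movies. Mary likes movies too.",
    "John also likes to watch football games.",
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # sparse document-term count matrix
print(vectorizer.get_feature_names_out())   # the vocabulary (column labels)
print(X.toarray())                          # each row is a document vector
```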
TEXT VECTORIZATION
BAG OF WORDS MODEL:
The Bag of Words values here are simply raw term frequencies.
TEXT VECTORIZATION
TF-IDF MODEL:
• Term frequencies are not necessarily the best representation for text.
• A high raw count does not necessarily mean the corresponding word is more important.
• TF-IDF “normalizes” the term frequency by weighting a term by the inverse of its document frequency.
TF = (number of times term t appears in a document) / (number of terms in the document)
IDF = log(N/n),
where N is the number of documents and
n is the number of documents in which term t appears (log is base 10 in the example below).
TF-IDF = TF × IDF
TEXT VECTORIZATION
TF-IDF MODEL:
TF (This, Document1) = 1/8
TF (This, Document2) = 1/5
TF (Messi, Document1) = 4/8
IDF (This) = log(2/2) = 0
IDF (Messi) = log(2/1) = 0.301
TF-IDF (This, Document1) = (1/8) × 0 = 0
TF-IDF (This, Document2) = (1/5) × 0 = 0
TF-IDF (Messi, Document1) = (4/8) × 0.301 ≈ 0.15
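A short pure-Python sketch of this arithmetic using the slide's log10 convention; the two documents are hypothetical stand-ins chosen to match the counts above (scikit-learn's TfidfVectorizer uses a smoothed IDF variant, so its numbers differ):

```python
import math

def tf(term, doc):
    words = doc.lower().split()
    return words.count(term.lower()) / len(words)

def idf(term, docs):
    n = sum(1 for d in docs if term.lower() in d.lower().split())
    return math.log10(len(docs) / n)

doc1 = "This is about Messi Messi Messi Messi goals"  # 8 terms, 'Messi' appears 4 times
doc2 = "This is about TF IDF"                         # 5 terms
docs = [doc1, doc2]

print(tf("This", doc1) * idf("This", docs))    # (1/8) * 0 = 0
print(tf("Messi", doc1) * idf("Messi", docs))  # (4/8) * 0.301 ≈ 0.15
```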
TEXT VECTORIZATION
BAG OF N-GRAMS MODEL:
• The Bag of Words model doesn’t consider the order of words, so different sentences can have exactly the same representation as long as the same words are used.
• Bag of N-Grams model – an extension of the Bag of Words model that leverages N-gram-based features.
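A minimal bag-of-n-grams sketch: the same CountVectorizer as before, widened to include bigrams (the corpus is hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["John likes to watch movies", "Mary likes movies too"]
vec = CountVectorizer(ngram_range=(1, 2))   # unigram + bigram features
X = vec.fit_transform(docs)
print(vec.get_feature_names_out())          # includes 'likes to', 'watch movies', ...
```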
TEXT VECTORIZATION
CO-OCCURRENCE MATRIX:
Corpus = “The quick brown fox jumps over the lazy dog.”
Window size: 2
TEXT VECTORIZATION
CO-OCCURRENCE MATRIX:
Corpus = “He is not lazy. He is intelligent. He is smart.”
Window size: 2
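A sketch of building this co-occurrence matrix by hand with a window size of 2:

```python
from collections import defaultdict

tokens = "he is not lazy he is intelligent he is smart".split()
window = 2

cooc = defaultdict(int)
for i, word in enumerate(tokens):
    # Count every word within `window` positions on either side.
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if i != j:
            cooc[(word, tokens[j])] += 1

print(cooc[("he", "is")])   # 4: 'is' falls within the window of 'he' four times
```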
TEXT VECTORIZATION
CO-OCCURRENCE MATRIX:
PMI = Pointwise Mutual Information:
PMI(w, c) = log( P(w, c) / (P(w) P(c)) ),
where w = word and c = context word.
Larger PMI → higher correlation.
ISSUE: many entries have PMI(w, c) = log 0 = −∞ (pairs that never co-occur).
SOLUTION:
• Set PMI(w, c) = 0 for all unobserved pairs, or
• Drop all entries with PMI < 0 [POSITIVE POINTWISE MUTUAL INFORMATION].
This produces 2 different vectors for each word:
• one describing the word when it is the ‘target word’ in the window, and
• one describing the word when it is the ‘context word’ in the window.
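A hedged NumPy sketch of PPMI over a small word-context count matrix (the counts are toy numbers, not from the slides):

```python
import numpy as np

# counts[i, j] = how often word i co-occurs with context word j.
counts = np.array([[4.0, 2.0, 0.0],
                   [2.0, 1.0, 1.0],
                   [0.0, 1.0, 2.0]])

total = counts.sum()
p_wc = counts / total
p_w = counts.sum(axis=1, keepdims=True) / total   # marginal P(w)
p_c = counts.sum(axis=0, keepdims=True) / total   # marginal P(c)

with np.errstate(divide="ignore"):                # unobserved pairs give log 0 = -inf
    pmi = np.log(p_wc / (p_w * p_c))

ppmi = np.maximum(pmi, 0)                         # clip -inf and negative PMI to 0
print(ppmi)
```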
PREDICTION BASED EMBEDDING:
Two prediction-based embedding models:
• CBOW
• Skip-Gram
The CBOW and Skip-Gram neural network models differ in the input and output of the network.
• CBOW: the input to the network is the set of context words within a certain window surrounding a ‘target’ word, and the
output predicts the ‘target’ word, i.e., what word should occupy the target position.
• Skip-Gram: the reverse of CBOW – the input is the ‘target’ word appearing at the center of the window, and the output
predicts each ‘context’ word around it.
In both cases we learn a word vector wᵢ and a context vector w̃ᵢ for each word in the vocabulary.
PREDICTION BASED EMBEDDING:
• The goal is to supply training samples and learn the weights.
• The learned weights are then used to predict probabilities for a new input word.
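A minimal gensim sketch of both models; sg=0 trains CBOW and sg=1 trains Skip-Gram (the two-sentence corpus is a toy, so the learned vectors are not meaningful):

```python
from gensim.models import Word2Vec

sentences = [["john", "likes", "to", "watch", "movies"],
             ["mary", "likes", "movies", "too"]]

cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["movies"][:5])             # first 5 dimensions of a learned word vector
print(skipgram.wv.most_similar("john"))  # nearest neighbours in the vector space
```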
PARAGRAPH VECTOR:
• Bag of Words – loses word order/sequence and ignores semantics.
• Bag of N-grams – captures a little semantics, but suffers from data sparsity and high dimensionality.
• Prior methods:
• a weighted average of all the word vectors in the document (loses the order of words), or
• combining the word vectors in an order given by a parse tree of a sentence, using
matrix-vector operations (works only for sentences).
• PARAGRAPH VECTOR – applicable to variable-length pieces of texts:
sentences,
paragraphs, and
documents
PARAGRAPH VECTOR:
A framework for learning word vectors: the context of three words
(“the,” “cat,” and “sat”) is used to predict the fourth word
(“on”). The input words are mapped to columns of the matrix
W to predict the output word.
PARAGRAPH VECTOR:
Distributed Memory Model of Paragraph Vectors (PV-DM)
Distributed Bag of Words version of Paragraph Vector (PV-DBOW)
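A hedged gensim Doc2Vec sketch; dm=1 gives PV-DM and dm=0 gives PV-DBOW (the two tagged documents are hypothetical):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=["john", "likes", "movies"], tags=["d0"]),
        TaggedDocument(words=["john", "likes", "football"], tags=["d1"])]

pv_dm = Doc2Vec(docs, vector_size=50, window=2, min_count=1, dm=1)
pv_dbow = Doc2Vec(docs, vector_size=50, window=2, min_count=1, dm=0)

print(pv_dm.dv["d0"][:5])                                 # learned paragraph vector
print(pv_dbow.infer_vector(["mary", "likes", "movies"]))  # vector for unseen text
```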
DOCUMENT SIMILARITY:
• Document similarity – similarity computed using features extracted from the documents,
such as bag of words or TF-IDF.
• Pairwise document similarity
• Several similarity and distance metrics
• Cosine distance/similarity
• Euclidean distance
• Manhattan distance
• BM25 similarity
• Jaccard distance
• Levenshtein distance
• Hamming distance
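A minimal sketch of pairwise cosine similarity over TF-IDF features with scikit-learn (toy corpus):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["John likes to watch movies",
        "Mary likes movies too",
        "John also likes football games"]

X = TfidfVectorizer().fit_transform(docs)
print(cosine_similarity(X))   # 3x3 matrix; entry (i, j) compares documents i and j
```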
TOPIC MODELS
• We can also use summarization techniques to extract topic- or concept-based features from text documents.
• Topic models extract the key themes or concepts in a corpus of documents, represented as topics.
• Each topic can be represented as a bag or collection of words/terms from the document corpus.
TOPIC MODELS
• Most topic models use matrix decomposition.
• E.g., Latent Semantic Indexing uses Singular Value Decomposition (SVD).
LATENT SEMANTIC INDEXING:
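A sketch of LSI as truncated SVD of a TF-IDF matrix, via scikit-learn's TruncatedSVD (the four-document corpus is hypothetical):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["John likes movies", "Mary likes movies too",
        "John likes football", "football games are fun"]

X = TfidfVectorizer().fit_transform(docs)
svd = TruncatedSVD(n_components=2)    # keep 2 latent 'topics'
doc_topics = svd.fit_transform(X)     # document-topic matrix
print(doc_topics)
print(svd.components_)                # topic-term matrix
```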
TOPIC MODELS
• Latent Dirichlet Allocation – uses a generative probabilistic model.
• Each document consists of a combination of several topics.
• Each term or word can be assigned to a specific topic.
• Similar to the pLSI (probabilistic LSI) model, except that LDA places a Dirichlet prior over the per-document topic distributions.
LATENT DIRICHLET ALLOCATION:
• Extracts K topics from M documents.
TOPIC MODELS
LATENT DIRICHLET ALLOCATION:
• When LDA is applied to a document-term matrix (a TF-IDF or Bag of Words feature matrix), it decomposes into
two main components:
• a document-topic matrix, which is the feature matrix we are looking for, and
• a topic-term matrix, which helps us look at potential topics in the corpus.
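A hedged scikit-learn sketch of this decomposition (toy corpus; K = 2 topics):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["John likes movies", "Mary likes movies too",
        "John likes football", "football games are fun"]

X = CountVectorizer().fit_transform(docs)   # bag-of-words feature matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)   # document-topic matrix (the feature matrix)
print(doc_topic)
print(lda.components_)             # topic-term matrix
```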
CONCLUSION
A typical text classification pipeline:
TEXT DOCUMENT (structured or unstructured data)
→ DOCUMENT PREPROCESSING (tokenization, stop-word removal, lemmatization, stemming)
→ FEATURE SELECTION (NLP techniques, NER, TF-IDF, Information Gain (IG), BOW, N-gram BOW)
→ FEATURE EXTRACTION (word embeddings, GloVe, LSA, LDA)
→ TEXT CLASSIFICATION (neural networks, CNN, RNN, LSTM, SVM, RF)
Thank You
Editor's Notes
1. Flesch Reading Ease score: 206.835 − (1.015 × ASL) − (84.6 × ASW). Flesch-Kincaid Grade Level score: (0.39 × ASL) + (11.8 × ASW) − 15.59, where ASL = average sentence length (the number of words divided by the number of sentences) and ASW = average number of syllables per word (the number of syllables divided by the number of words).
  2. The paragraph vectors are also asked to contribute to the prediction task of the next word given many contexts sampled from the paragraph. The contexts are fixed-length and sampled from a sliding window over the paragraph. The paragraph vector is shared across all contexts generated from the same paragraph but not across paragraphs. The word vector matrix W, however, is shared across paragraphs. I.e., the vector for “powerful” is the same for all paragraphs. The paragraph token can be thought of as another word. It acts as a memory that remembers what is missing from the current context – or the topic of the paragraph. For this reason, we often call this model the Distributed Memory Model of Paragraph Vectors (PV-DM).
3. Suppose that there are N paragraphs in the corpus and M words in the vocabulary, and we want to learn paragraph vectors such that each paragraph is mapped to p dimensions and each word is mapped to q dimensions; then the model has a total of N × p + M × q parameters (excluding the softmax parameters).
4. Document similarity is the process of using a distance- or similarity-based metric to identify how similar a text document is to other document(s), based on features extracted from the documents such as bag of words or TF-IDF.