NLP & Bigdata
Motivation and Action
Sarath P R
sarath.amrita@gmail.com
IIIT-MK
Thiruvananthapuram

November 09, 2013

About me

Working as Technical Lead - Bigdata
Like to develop software applications for good reasons
Independent Data Journalist at DScribe.IN
Holds Masters in Computer Science
Like to travel and meet people

Agenda

Introduction
Full text Search and Index
Document Clustering
Representing Data
Stanford NLP
R and Weka
Social Media and Sentiment Analysis
Introduction to Bigdata
Current Trends
Conclusion

Introduction

Sorry! No definitions of NLP are copied here.
In case you need a definition, tell me. Otherwise we will now 'see' what NLP is!

Introduction - 2 minutes Targit Video

Watch Targit Video Here http://youtu.be/32KE0rbGZ9c

So what is he (the Targit CTO) saying?
“Calling your system, and getting delivered an analysis is right around the corner”
Go to Targit’s website http://targit.com. You will see a lion standing on the front page
They say “Targit is a courage Company”
That was all about motivation. No hidden agenda to promote Targit!

Introduction - Innovation

What we just saw is one aspect of NLP
What is it?
It is speech recognition and analytics
And what did they do?
It is innovation!

Introduction - Search Engines & Information Retrieval

Tell me your opinion. A question follows:
Is Google an NLP company?
Yes, they are. The biggest one!
So, how does Google work? I mean the search engine!
Where do they bring the search results from?
The answer is three things: crawler, index and algorithms
Now we will go through a few NLP, machine learning and analytics related topics in detail

Full text Search and Inverted Index
In information retrieval, full-text search refers to techniques for
searching a single computer-stored document or a collection in a
full text database
When the number of documents to search is potentially large, or
the quantity of search queries to perform is substantial, the
problem of full-text search is often divided into two tasks:
indexing and searching
The indexing stage will scan the text of all the documents and
build a list of search terms, called an index
In the search stage, when performing a specific query, only the
index is referenced, rather than the text of the original documents
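The two stages can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the tokenizer, the function names `build_index` and `search`, and the sample documents are all invented here and are not from the talk. Note that `search` only looks at the index and never at the original text.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Indexing stage: scan every document once and map each term
    to the ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Search stage: consult only the index, never the document text.
    Returns the ids of documents containing every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)

# Hypothetical sample documents, invented for illustration.
documents = {
    "d1": "Full text search scans documents",
    "d2": "An index makes search fast",
    "d3": "Clustering groups similar documents",
}
index = build_index(documents)
print(sorted(search(index, "search")))            # ['d1', 'd2']
print(sorted(search(index, "search documents")))  # ['d1']
```

The cost of scanning the text is paid once at indexing time; each query afterwards is a handful of dictionary lookups and set intersections.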

Inverted index

It is the most popular data structure used in document
retrieval systems
Similar to the index in the back of a book
Used on a large scale for example in search engines

Reference: http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html

Index vs Inverted Index

Index
A forward index (or just index) is a list of documents and the words that appear in each of them
Inverted Index
An inverted index is a list of words and the documents in which each of them appears

Exercise

Have a look at the table below
Document    Words
Doc 1       talk, iiitmk, campus, nlp
Doc 2       algorithm, bigdata, nlp
Doc 3       researchers, talk

What kind of an Index is it ?
Create an inverted index from this forward index

Answer
Inverted Index
Words          Document
talk           Doc 1, Doc 3
iiitmk         Doc 1
campus         Doc 1
nlp            Doc 1, Doc 2
algorithm      Doc 2
bigdata        Doc 2
researchers    Doc 3

Search
A search query like ’nlp talk’ would deliver what results ?
Result
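The exercise can be checked with a small Python sketch that inverts the forward index from the table and then runs the query 'nlp talk'. The slide does not say whether the query terms are combined with AND or OR, so both readings are shown; the `mode` parameter and the function names are assumptions made for this illustration.

```python
from collections import defaultdict

# Forward index from the exercise: document -> words.
forward_index = {
    "Doc 1": ["talk", "iiitmk", "campus", "nlp"],
    "Doc 2": ["algorithm", "bigdata", "nlp"],
    "Doc 3": ["researchers", "talk"],
}

def invert(forward):
    """Turn a document -> words mapping into a word -> documents mapping."""
    inverted = defaultdict(list)
    for doc, words in forward.items():
        for word in words:
            inverted[word].append(doc)
    return dict(inverted)

def search(inverted, query, mode="and"):
    """Look up each query term in the inverted index and combine the
    posting lists: intersection for 'and', union for 'or'."""
    postings = [set(inverted.get(term, [])) for term in query.split()]
    if not postings:
        return []
    combine = set.intersection if mode == "and" else set.union
    return sorted(combine(*postings))

inverted_index = invert(forward_index)
print(inverted_index["talk"])                          # ['Doc 1', 'Doc 3']
print(search(inverted_index, "nlp talk", mode="and"))  # ['Doc 1']
print(search(inverted_index, "nlp talk", mode="or"))   # ['Doc 1', 'Doc 2', 'Doc 3']
```

Under the AND reading only Doc 1 contains both words; under the OR reading all three documents match.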

Apache Lucene Demo
Which tool to try for indexing and searching?
Apache Lucene is a full-featured text search engine library
Written entirely in Java
Open Source
Scalable and High Performance Indexing
Powerful, Accurate and Efficient Search Algorithms
Interesting Features of Lucene Core
Allows Simultaneous update and searching
Powerful query types like phrase queries, wildcard queries,
range queries etc
Fielded searching (e.g. title, author, contents)

Document Clustering
Definition
The process of grouping a set of physical or abstract objects into
classes of similar objects is called clustering.
A cluster is a collection of data objects that are similar to one
another within the same cluster and are dissimilar to the objects in
other clusters.
Clustering is applicable in many fields, including machine
learning, pattern recognition, image analysis, information
retrieval, and bioinformatics.
Clustering is an example of unsupervised learning in machine
learning
Cluster Analysis can be achieved by various algorithms

The Library Example
Reference
I found this example in the book Mahout In Action by Sean Owen,
Robin Anil, Ted Dunning, and Ellen Friedman
Inside the Library
A library has thousands of books
The books in this library are arranged in no particular order
Brainstorm!
Will you enjoy finding a book you want from there?
If not, give me some solutions

Solutions

What about sorting the books alphabetically by title?
Yes, for readers searching for a book by title, that will help.
What if someone is looking for books on a general subject? For example, health
Grouping books by topic will be more useful in this case
But how would you even begin this grouping?
You will start reading the books one by one and group them! Good work :-)

Steps in Clustering

Clustering involves the following
An algorithm, the method used to group the books together.
A notion of both similarity and dissimilarity.
In the library example we relied on our assessment of which
books belonged in an existing stack and which should start a
new one.
A stopping condition.
In the library example, this might have been the point beyond
which books can’t be stacked anymore, or when the stacks are
already quite dissimilar.

K-Means Algorithm

Let’s see an Algorithm first and after that how to automate the
grouping of books in the Library Example.
K-Means
k-Means clustering aims to partition n observations into k
clusters.
Takes the input parameter, k, and partitions a set of n objects
into k clusters so that the resulting intracluster similarity is
high but the intercluster similarity is low.
Cluster similarity is measured in regard to the mean value of
the objects in a cluster, which can be viewed as the cluster’s
centroid
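One common way to make this criterion precise is to minimise the within-cluster sum of squared distances to the centroids. This is the usual textbook formulation, not something the slides spell out; the symbols $S_i$ (the set of objects in cluster $i$) and $\mu_i$ (its centroid) are introduced here only for illustration.

```latex
\underset{S_1, \dots, S_k}{\arg\min} \; \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2,
\qquad
\mu_i = \frac{1}{|S_i|} \sum_{x \in S_i} x
```

A small value of the sum means that objects sit close to their own centroid, which corresponds to high intracluster similarity.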

K-Means Example
Reference
Teknomo, Kardi. K-Means Clustering Tutorials. http://people.revoledu.com/kardi/tutorial/kMean
Data
Object        Attribute 1 (X): weight index    Attribute 2 (Y): pH
Medicine A    1                                1
Medicine B    2                                1
Medicine C    4                                3
Medicine D    5                                4

Problem
We have 4 objects, each having 2 attributes
We also know beforehand that these objects belong to two groups of medicine (cluster 1 and cluster 2)
The problem now is to determine which medicines belong to cluster 1 and which medicines belong to the other cluster

Steps in K-means

Iterate until stable (i.e. no object moves to another group):
1. Determine the centroid coordinates
2. Determine the distance of each object to the centroids
3. Group the objects based on minimum distance (find the closest centroid)

Each medicine represents one point with two features (X, Y). We
can represent it as a coordinate in a feature space
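The loop above can be sketched in plain Python as follows. It is a minimal illustration, not a production implementation: the function name `kmeans`, the `max_iter` guard, and the choice of the first two medicines as starting centroids are assumptions made for this example. Ties in distance go to the lower-numbered group.

```python
import math

def kmeans(points, initial_centroids, max_iter=100):
    """Plain k-means: repeat assignment and centroid update until
    no object changes group (or max_iter passes have run)."""
    centroids = [tuple(c) for c in initial_centroids]
    assignment = None
    for _ in range(max_iter):
        # Steps 2 and 3: measure the distance of each object to every
        # centroid and pick the index of the closest one.
        new_assignment = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # stable: no object moved to another group
        assignment = new_assignment
        # Step 1 of the next pass: each centroid becomes the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, g in zip(points, assignment) if g == j]
            if members:  # an empty group keeps its previous centroid
                centroids[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return assignment, centroids

# Medicines A, B, C, D as (weight index, pH) points.
medicines = [(1, 1), (2, 1), (4, 3), (5, 4)]
groups, centroids = kmeans(medicines, initial_centroids=medicines[:2])
print(groups)     # [0, 0, 1, 1]
print(centroids)  # [(1.5, 1.0), (4.5, 3.5)]
```

With these starting centroids, medicines A and B end up in one group and C and D in the other.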

Euclidean distance

Every clustering problem is based on a notion of distance between points
Euclidean distance is the most commonly used distance measure
Mathematically, the Euclidean distance between points with coordinates (x1, y1) and (x2, y2) is
$$d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$

Iteration 0
Initial Value of Centroids
Take medicine A and medicine B as the first centroids.
Let c1 and c2 denote the coordinates of the centroids; then c1 = (1,1) and c2 = (2,1)
Objects-Centroids Distance
Calculate the distance from each cluster centroid to each object.
The distance matrix using Euclidean distance at iteration 0 is
$$D^0 = \begin{pmatrix} 0 & 1 & 3.61 & 5 \\ 1 & 0 & 2.83 & 4.24 \end{pmatrix}$$
(rows: c1, c2; columns: medicines A, B, C, D)
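The distance matrix for iteration 0 can be reproduced with a short Python sketch. The coordinates come from the data table of the example; the helper name `distance_matrix` and the layout (one row per centroid, one column per object) are choices made here for illustration.

```python
import math

# Medicines from the example as (weight index, pH) points.
points = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}
centroids = [(1, 1), (2, 1)]  # c1 and c2 at iteration 0

def distance_matrix(points, centroids):
    """Euclidean distances: one row per centroid, one column per object."""
    return [[math.dist(c, p) for p in points.values()] for c in centroids]

D0 = distance_matrix(points, centroids)
for row in D0:
    print([round(d, 2) for d in row])
# [0.0, 1.0, 3.61, 5.0]
# [1.0, 0.0, 2.83, 4.24]
```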

Iteration 0

Each column in the distance matrix corresponds to one object
The first row of the distance matrix holds the distance of each object to the first centroid, and the second row holds the distance of each object to the second centroid
For example, the distance from medicine C = (4, 3) to the first centroid c1 = (1,1) is
$$\sqrt{(4-1)^2 + (3-1)^2} = \sqrt{13} \approx 3.61$$
Similarly, the distance to the second centroid c2 = (2,1) is
$$\sqrt{(4-2)^2 + (3-1)^2} = \sqrt{8} \approx 2.83$$

Iteration 0
Objects clustering
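We assign each object to a group based on the minimum distance
Thus, medicine A is assigned to group 1, medicine B to group 2, and so on
Group Matrix
An element of the group matrix below is 1 if and only if the object is assigned to that group.
$$G^0 = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 1 \end{pmatrix}$$
(rows: group 1, group 2; columns: medicines A, B, C, D)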
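As a sketch, the group matrix can be derived from a distance matrix by marking, for every object (column), the row with the smallest distance. The function name `group_matrix` is hypothetical, and the numbers in `D0` are the rounded Euclidean distances of the four medicines to c1 = (1,1) and c2 = (2,1), computed for this illustration.

```python
def group_matrix(distances):
    """distances[i][j] is the distance from centroid i to object j.
    Returns a 0/1 matrix of the same shape with a 1 in the row of the
    closest centroid for each object (ties go to the first centroid)."""
    k, n = len(distances), len(distances[0])
    groups = [[0] * n for _ in range(k)]
    for j in range(n):
        closest = min(range(k), key=lambda i: distances[i][j])
        groups[closest][j] = 1
    return groups

# Rounded distances at iteration 0 (rows: c1, c2; columns: A, B, C, D).
D0 = [[0.0, 1.0, 3.61, 5.0],
      [1.0, 0.0, 2.83, 4.24]]
G0 = group_matrix(D0)
print(G0)  # [[1, 0, 0, 0], [0, 1, 1, 1]]
```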

Iteration 1

Determine new centroids
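Compute the new centroid of each group based on the new members
Group 1 has only one member (medicine A), thus its centroid remains c1 = (1,1)
Group 2 now has three members (medicines B, C and D), thus its centroid is the average coordinate of the three members
$$c_2 = \left(\frac{2+4+5}{3}, \frac{1+3+4}{3}\right) = \left(\frac{11}{3}, \frac{8}{3}\right) \approx (3.67, 2.67)$$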
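The centroid update is just a per-coordinate mean, as this small sketch shows. The helper name `centroid` is an assumption; the three points are medicines B, C and D from the data table.

```python
def centroid(members):
    """Average each coordinate over all member points."""
    n = len(members)
    return tuple(sum(axis) / n for axis in zip(*members))

group2 = [(2, 1), (4, 3), (5, 4)]  # medicines B, C and D
c2 = centroid(group2)
print(tuple(round(v, 2) for v in c2))  # (3.67, 2.67)
```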

Iteration 1

Objects-Centroids Distance
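Compute the distance of all objects to the new centroids
The distance matrix at iteration 1 is
$$D^1 = \begin{pmatrix} 0 & 1 & 3.61 & 5 \\ 3.14 & 2.36 & 0.47 & 1.89 \end{pmatrix}$$
(rows: c1 = (1,1), c2 = (11/3, 8/3); columns: medicines A, B, C, D)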

Iteration 1
Objects clustering
Again we assign each object based on the minimum distance
Based on the new distance matrix, we move medicine B to group 1 while all the other objects remain.
Group Matrix
The group matrix at iteration 1 is
$$G^1 = \begin{pmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \end{pmatrix}$$

Iteration 2

Determine new centroids
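Compute the new centroid of each group based on the new members
Group 1 and group 2 both have two members, thus the new centroids are
$$c_1 = \left(\frac{1+2}{2}, \frac{1+1}{2}\right) = (1.5, 1), \quad c_2 = \left(\frac{4+5}{2}, \frac{3+4}{2}\right) = (4.5, 3.5)$$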

Iteration 2

Objects-Centroids Distance
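The distance matrix at iteration 2 is
$$D^2 = \begin{pmatrix} 0.5 & 0.5 & 3.20 & 4.61 \\ 4.30 & 3.54 & 0.71 & 0.71 \end{pmatrix}$$
(rows: c1 = (1.5, 1), c2 = (4.5, 3.5); columns: medicines A, B, C, D)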

Iteration 2

Objects clustering
Again we assign each object based on the minimum distance
Group Matrix
The group matrix at iteration 2 is
$$G^2 = \begin{pmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \end{pmatrix}$$

Results

We obtain the result that G2 = G1, i.e. the group matrix at iteration 2 equals the one at iteration 1.
Comparing the grouping of the last iteration with this iteration
reveals that the objects do not move between groups anymore.
Thus, the computation of the k-means clustering has reached
stability and no more iterations are needed.
We take the final grouping as the result.

Document Representations

X-Y Plane Example
In the previous example the measure of similarity (or similarity
metric) for the points was the Euclidean distance between two
points
And that was in the X-Y plane
Library Example
The library example had no such clear, mathematical measure.
We relied entirely on our own judgment to decide book similarity

Document Representations

Brainstorm !
We need a metric that can be implemented on a computer.
One possible metric could be based on the number of words
common to two books’ titles.
So “Harry Potter: The Philosopher’s Stone” and “Harry
Potter: The Prisoner of Azkaban” have three words in
common: “Harry”, “Potter” and “The”.
But, even though the book “The Lord of the Rings: The Two
Towers” is similar to the Harry Potter series, this measure of
similarity doesn’t capture that.
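The title-based metric is easy to try out in Python. The sketch below is one possible reading of "number of words common to two titles": titles are lowercased and split into words, and the overlap is the set intersection. The function names and the tokenizing rule are assumptions made for this illustration.

```python
import re

def title_words(title):
    """Lowercase a title and return its set of words."""
    return set(re.findall(r"[a-z']+", title.lower()))

def common_words(title_a, title_b):
    """Words that occur in both titles."""
    return title_words(title_a) & title_words(title_b)

hp1 = "Harry Potter: The Philosopher's Stone"
hp3 = "Harry Potter: The Prisoner of Azkaban"
lotr = "The Lord of the Rings: The Two Towers"

print(sorted(common_words(hp1, hp3)))   # ['harry', 'potter', 'the']
print(sorted(common_words(hp1, lotr)))  # ['the']
```

The two Harry Potter titles share three words, while the Lord of the Rings title shares only "the" with the first one, even though a reader would place all three books in the same genre.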

Document Representations

Brainstorm !
We need a metric that can be implemented on a computer.
One possible metric could be based on the number of words
common to two books’ titles.
So “Harry Potter: The Philosopher’s Stone” and “Harry
Potter: The Prisoner of Azkaban” have three words in
common: “Harry”, “Potter” and “The”.
But, even though the book “The Lord of the Rings: The Two
Towers” is similar to the Harry Potter series, this measure of
similarity doesn’t capture that.

Sarath P R

sarath.amrita@gmail.com

NLP & Bigdata

Motivation and Action
Document Representations
Another Solution
We could assemble word counts for each book, and when the
counts are close for many words, judge the books similar.
But words like “a”, “an”, and “the” cannot contribute
much to the similarity, because they occur frequently in both
books.
We could use numeric weights in the computation, and apply
low weights to these words to reduce their effect on the
similarity value.
Once we give a weight value to each word in a book, we can
easily find out the similarity of two books.
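
A rough sketch of the weighting idea follows. The weight value 0.1, the three toy "books" and the overlap formula (smaller of the two counts, times the word's weight) are all assumptions chosen for illustration; the slide does not prescribe a particular formula.

```python
import re
from collections import Counter

LOW_WEIGHT_WORDS = {"a", "an", "the"}  # the words named on the slide
LOW_WEIGHT = 0.1                       # assumed value, for illustration only

def word_counts(text):
    """Count how often each word occurs in a text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def weight(word):
    """Low weight for very frequent words, full weight for everything else."""
    return LOW_WEIGHT if word in LOW_WEIGHT_WORDS else 1.0

def weighted_overlap(text_a, text_b):
    """Sum, over shared words, the smaller of the two counts times the word's weight."""
    a, b = word_counts(text_a), word_counts(text_b)
    return sum(weight(w) * min(a[w], b[w]) for w in a.keys() & b.keys())

# Hypothetical one-line "books".
book_a = "the wizard and the owl"
book_b = "the wizard lost the wand"
book_c = "the cat sat on the mat"

print(weighted_overlap(book_a, book_b))  # shares "the" twice and "wizard" once
print(weighted_overlap(book_a, book_c))  # shares only "the"
```

Because "the" carries a low weight, the pair that shares "wizard" scores clearly higher than the pair that shares only "the".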

Document Representations

What if one book is 300 pages long and the other 1000 pages
long?
We have to ensure that the weight of each word is relative
to the length of the text.
We will see a method called TF-IDF shortly.
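
One simple way to make weights relative to text length is to divide each word count by the total number of words. The sketch below assumes that approach; the word lists are made up, and the longer text is just the shorter one repeated.

```python
from collections import Counter

def relative_frequencies(words):
    """Divide each word count by the total number of words in the text."""
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}

short_text = ["magic", "wand", "magic", "owl"]
long_text = short_text * 250  # same wording, 250 times as long

print(relative_frequencies(short_text)["magic"])  # 0.5
print(relative_frequencies(long_text)["magic"])   # 0.5
```

The raw count of "magic" differs by a factor of 250 between the two texts, but the relative frequency is the same, so length alone no longer makes the texts look different.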

Document Representations

Task!
Explore the following distance measures:
1. Squared Euclidean distance measure
2. Manhattan distance measure
3. Cosine distance measure
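
As a starting point for the task, here is a minimal sketch of the three measures using their standard textbook definitions. The function names and sample points are chosen for this example only.

```python
import math

def squared_euclidean(a, b):
    """Sum of squared coordinate differences (Euclidean distance without the square root)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 minus the cosine of the angle between a and b; undefined for zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

p, q = (1.0, 1.0), (4.0, 5.0)
print(squared_euclidean(p, q))                   # 25.0
print(manhattan(p, q))                           # 7.0
print(cosine_distance((1.0, 0.0), (0.0, 1.0)))   # 1.0
```

Cosine distance looks only at direction, so two vectors pointing the same way are at (approximately) zero distance however long they are. That makes it a common choice for comparing texts of different lengths.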

Document Representations

Representing Data as Vectors
In mathematics, a vector is simply a point in space.
We saw how books can be clustered together based on the
similarity of their words.
In reality, clustering could be applied to any kind of object
provided we can distinguish similar and dissimilar items.
Clustering of anything via algorithms starts with representing
the object in a way that can be read by computers.
It is quite practical to think of objects in terms of their
measurable features or attributes.

Document Representations
Say we want to cluster a bunch of apples.
[Figure taken from Mahout in Action]

Document Representations

A small, round, red apple is more similar to a small, round,
green one than to a large, ovoid green one.
The process of vectorization starts with assigning each feature to a
dimension.
Let’s say weight is feature (dimension) 0, color is 1, and size is 2.
So the vector of a small, round, red apple looks like
[0: 100 gram, 1: red, 2: small].
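
To cluster with a distance measure, every dimension has to hold a number. The sketch below shows one way the apple could be encoded; the numeric codes for color and size are hypothetical, since the slide only names the features.

```python
# Hypothetical numeric codes; the slide only names the features, not their encoding.
COLOR_CODES = {"red": 0.0, "green": 1.0}
SIZE_CODES = {"small": 1.0, "medium": 2.0, "large": 3.0}

def apple_vector(weight_grams, color, size):
    """Dimension 0 = weight, 1 = color, 2 = size, as on the slide."""
    return [float(weight_grams), COLOR_CODES[color], SIZE_CODES[size]]

print(apple_vector(100, "red", "small"))    # [100.0, 0.0, 1.0]
print(apple_vector(100, "green", "small"))  # [100.0, 1.0, 1.0]
```

With codes like these, weight in grams spans a much larger range than the other two dimensions, so in practice the features would usually be rescaled before any distance is computed.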

Document Representations
A set of apples of different weights, sizes and colors, converted to
vectors.
[Figure taken from Mahout in Action]

Document Representations
Improving weighting with TF-IDF
Term Frequency - Inverse Document Frequency (TF-IDF)
weighting is a widely used improvement on simple term
frequency weighting.
Instead of simply using the term frequency (TF) as the value in the vector,
this value is multiplied by the term’s inverse document
frequency (IDF):
IDF = log(N / n)
N = total number of documents
n = number of documents that contain the term
TF-IDF = TF * IDF
A term that appears in every document gets IDF = log(1) = 0, so words like “the” drop out.
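
The formula can be checked on a tiny corpus. The sketch below assumes raw counts for TF and the natural logarithm for IDF (the slide does not state the base); the three documents are made up.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document, using IDF = log(N / n)."""
    n_docs = len(documents)  # N: total number of documents
    # n: number of documents that contain each term
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)  # raw term frequency within this document
        weights.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return weights

# Hypothetical mini-corpus of three "documents".
docs = [
    "the boy wizard".split(),
    "the ring and the hobbit".split(),
    "the wizard and the ring".split(),
]
weights = tf_idf(docs)
print(weights[0]["the"])     # 0.0, because "the" occurs in every document
print(weights[0]["boy"])     # log(3), because "boy" occurs in one of three documents
print(weights[0]["wizard"])  # log(3/2), because "wizard" occurs in two of three
```

The rarer a term is across the corpus, the higher its weight: "boy" outweighs "wizard", and "the" contributes nothing.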

Stanford NLP

NLP Toolkit
The Stanford NLP Group provides NLP toolkits for various major
computational linguistics problems.
Written in Java.
Open source.

Stanford NLP
Stanford Named Entity Recognizer
Named-entity recognition (NER) techniques locate and
classify atomic elements in text into predefined categories
such as the names of persons, organizations, locations, etc.
Consider the following text:
Hello Jona, I am in Indian Institute at Trivandrum
What are the entities in this text?
NER Demo
Stanford NER is also known as CRFClassifier.
Conditional Random Field (CRF) sequence models are used for
structured prediction.
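
To make the input and output of NER concrete, here is a toy dictionary-lookup tagger. It is not how Stanford NER works: CRFClassifier uses a trained sequence model, whereas this sketch only looks up phrases in a hand-written list. The list itself and the labels in it are assumptions about what a reasonable answer to the question above could look like.

```python
# Hypothetical gazetteer; a trained CRF model would learn such labels from annotated text.
GAZETTEER = {
    "Jona": "PERSON",
    "Indian Institute": "ORGANIZATION",
    "Trivandrum": "LOCATION",
}

def tag_entities(text):
    """Return (phrase, label, start offset) for the first occurrence of each gazetteer phrase."""
    found = []
    for phrase, label in GAZETTEER.items():
        start = text.find(phrase)
        if start != -1:
            found.append((phrase, label, start))
    return sorted(found, key=lambda item: item[2])  # order by position in the text

text = "Hello Jona, I am in Indian Institute at Trivandrum"
for phrase, label, start in tag_entities(text):
    print(f"{phrase}\t{label}\t{start}")
```

A lookup like this fails on any name that is not in the list, which is the gap that statistical models such as CRFs are meant to close.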

Social Media and Sentiment Analysis

Twitter
Twitter Streaming Demo
Sentiment Analysis
Sentiment analysis is one of the most active research areas in
computer science today.
A basic task in sentiment analysis is to classify the polarity of
a given text at the document, sentence, or aspect level,
that is, whether the opinion expressed in a document, a sentence, or
an entity feature or aspect is positive, negative, or neutral.

Social Media and Sentiment Analysis

Movie Review
Let’s look at a tweet about a recently released movie:
“Wow #Krish3 looks more exciting than Superman n
Spider-Man for sure ! The Roshans have made a truly world
class super hero film, again!”
Such snippets of text are a gold mine for companies and
individuals that want to monitor their reputation and get
timely feedback about their products and actions.

Social Media and Sentiment Analysis

Document-Level Sentiment Analysis
The main approach to document-level sentiment analysis is
supervised learning.
The system learns a classification model from the training data;
common classification algorithms such as SVM, Naive Bayes, and
Logistic Regression can be used.
New documents are then tagged with their sentiment
classes.
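
As an illustration of the supervised approach, here is a small Naive Bayes classifier written from scratch. It is a sketch under stated assumptions: the four training texts and their labels are made up, words are split on whitespace, and add-one smoothing is used. A real system would train on far more data, typically with an existing library.

```python
import math
from collections import Counter, defaultdict

def train(labelled_docs):
    """Count words per class and documents per class."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for words, label in labelled_docs:
        class_counts[label] += 1
        word_counts[label].update(words)
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocabulary

def classify(words, model):
    """Pick the class with the highest log-probability (add-one smoothing)."""
    word_counts, class_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for w in words:
            if w in vocabulary:  # ignore words never seen in training
                score += math.log(
                    (word_counts[label][w] + 1) / (total_words + len(vocabulary))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data.
training = [
    ("wow truly exciting film".split(), "positive"),
    ("world class super hero film".split(), "positive"),
    ("boring and slow film".split(), "negative"),
    ("truly awful plot".split(), "negative"),
]
model = train(training)
print(classify("exciting super hero film".split(), model))  # positive
print(classify("slow boring plot".split(), model))          # negative
```

The same train-then-tag structure applies when SVM or Logistic Regression is swapped in for Naive Bayes; only the model that is learned changes.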

Bigdata

Introduction to Bigdata
Big data is the term for a collection of data sets so large and
complex that it becomes difficult to process using on-hand
database management tools or traditional data processing
applications.
The challenges include capture, curation, storage, search, sharing,
transfer, analysis, and visualization.

Bigdata

3 Vs of Bigdata
Volume: Ever-growing data of all types
Velocity: For time-sensitive processes such as fraud detection and
intrusion detection, the speed at which data arrives is a
defining characteristic of Bigdata
Variety: Any type of data, structured or unstructured,
such as text, sensor data, audio, video, click streams, log files
and more

Bigdata

Tools and Technologies
Hadoop
NoSQL
Spark
D3

Bigdata

A Few Interesting Areas
Internet of Things
Data Journalism

Conclusion

Questions ?

References

Sean Owen, Robin Anil, Ted Dunning, Ellen Friedman, Mahout in Action,
Manning Publications
Jiawei Han, Micheline Kamber, Data Mining: Concepts and Techniques
Kardi Teknomo, K-Means Clustering Tutorials
A first take at building an inverted index,
http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html

Thanks

Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 

Último (20)

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 

NLP & Bigdata: Motivation and Action

  • 1. NLP & Bigdata: Motivation and Action. Sarath P R (sarath.amrita@gmail.com), IIIT-MK, Thiruvananthapuram, November 09, 2013.
  • 2. About me: I work as Technical Lead - Bigdata. I like to develop software applications for good reasons. I am an independent data journalist at DScribe.IN. I hold a Masters in Computer Science. I like to travel and meet people.
  • 3. Agenda: Introduction; Full-Text Search and Index; Document Clustering; Representing Data; Stanford NLP; R and Weka; Social Media and Sentiment Analysis; Introduction to Bigdata; Current Trends; Conclusion.
  • 4. Introduction: Sorry! No definitions of NLP are copied here. If you need a definition, tell me. Otherwise we will now 'see' what NLP is.
  • 6. Introduction - 2-minute Targit video: Watch the Targit video here: http://youtu.be/32KE0rbGZ9c
  • 7. So what is he (the Targit CTO) saying? “Calling your system, and getting delivered an analysis is right around the corner.” Go to Targit’s website, http://targit.com, and you will see a lion standing on the front page. They say “Targit is a courage Company”. That was all about motivation. No hidden agenda to promote Targit!
  • 12. Introduction - Innovation: What we just saw is one aspect of NLP. What is it? It is speech recognition and analytics. And what did they do? It is innovation!
  • 17. Introduction - Search Engines & Information Retrieval: Tell me your opinion; a question follows. Is Google an NLP company? Yes, they are. The biggest one! So how does Google work? I mean the search engine! Where do they bring the search results from? The answer is three things: crawler, index and algorithms. Now we will look at a few NLP, machine learning and analytics topics in detail.
  • 24. Full-Text Search and Inverted Index: In information retrieval, full-text search refers to techniques for searching a single computer-stored document or a collection in a full-text database. When the number of documents to search is potentially large, or the quantity of search queries to perform is substantial, the problem of full-text search is often divided into two tasks: indexing and searching. The indexing stage scans the text of all the documents and builds a list of search terms, called an index. In the search stage, when performing a specific query, only the index is referenced, rather than the text of the original documents.
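The two stages described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not how any particular search engine does it: the function names `tokenize`, `build_index` and `search`, the three sample documents, and the choice to require every query term to match are all assumptions made for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Indexing stage: scan every document once and record, per term, which documents contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return dict(index)


def search(index, query):
    """Search stage: consult only the index, never the original text.

    Assumption: a document matches only if it contains every query term.
    """
    result = None
    for term in tokenize(query):
        postings = index.get(term, set())
        result = postings if result is None else result & postings
    return sorted(result or [])


# Hypothetical sample documents, invented for this sketch.
documents = {
    1: "Full-text search scans stored documents.",
    2: "An index lists the search terms.",
    3: "Queries consult the index, not the documents.",
}

index = build_index(documents)
hits = search(index, "index search")
print(hits)
```

Once `build_index` has run, `search` needs nothing but the `index` dictionary, which is the point of splitting the problem into two tasks.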
  • 27. Inverted index: It is the most popular data structure used in document retrieval systems. It is similar to the index in the back of a book. It is used on a large scale, for example in search engines.
  • 29. Index vs Inverted Index: A forward index (or just index) is the list of documents and the words that appear in each of them. The inverted index is the list of words and the documents in which each of them appears.
  • 31. Exercise: Have a look at the table below.
       Document | Words
       Doc 1    | talk, iiitmk, campus, nlp
       Doc 2    | algorithm, bigdata, nlp
       Doc 3    | researchers, talk
     What kind of index is it? Create an inverted index from this forward index.
  • 35. Answer: Inverted index
       Word        | Documents
       talk        | Doc 1, Doc 3
       iiitmk      | Doc 1
       campus      | Doc 1
       nlp         | Doc 1, Doc 2
       algorithm   | Doc 2
       bigdata     | Doc 2
       researchers | Doc 3
     Search: What results would a search query like 'nlp talk' deliver?
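The exercise can also be worked through in code. The sketch below inverts the forward index from the exercise table and then runs the 'nlp talk' query. The helper names `invert` and `search` and the `match_all` flag are introduced here for illustration; the slides do not say whether the query means "all terms" or "any term", so both readings are shown.

```python
from collections import defaultdict

# Forward index from the exercise: document -> words.
forward_index = {
    "Doc 1": ["talk", "iiitmk", "campus", "nlp"],
    "Doc 2": ["algorithm", "bigdata", "nlp"],
    "Doc 3": ["researchers", "talk"],
}


def invert(forward):
    """Turn a document -> words mapping into a word -> documents mapping."""
    inverted = defaultdict(set)
    for doc, words in forward.items():
        for word in words:
            inverted[word].add(doc)
    return dict(inverted)


def search(inverted, query, match_all=True):
    """Look up each query term in the inverted index and combine the postings.

    match_all=True keeps documents containing every term (AND);
    match_all=False keeps documents containing at least one term (OR).
    """
    postings = [inverted.get(term, set()) for term in query.lower().split()]
    if not postings:
        return set()
    combine = set.intersection if match_all else set.union
    return combine(*postings)


inverted_index = invert(forward_index)
print(sorted(search(inverted_index, "nlp talk")))                   # AND reading
print(sorted(search(inverted_index, "nlp talk", match_all=False)))  # OR reading
```

Under the AND reading only Doc 1 contains both words; under the OR reading all three documents match, since each contains 'nlp' or 'talk'.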
  • 38. Apache Lucene Demo: Which tool to try for indexing and searching? Apache Lucene is a full-featured text search engine library. It is written entirely in Java and is open source. It offers scalable, high-performance indexing and powerful, accurate and efficient search algorithms. Interesting features of Lucene Core: it allows simultaneous update and searching; it has powerful query types such as phrase queries, wildcard queries and range queries; it supports fielded searching (e.g. title, author, contents).
  • 40. Document Clustering: Definition: The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters. Clustering is applicable in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics. Clustering is an example of unsupervised learning in machine learning. Cluster analysis can be achieved by various algorithms.
  • 42. The Library Example: Reference: I found this example in the book Mahout in Action by Sean Owen, Robin Anil, Ted Dunning, and Ellen Friedman. Inside the library: The library has thousands of books. The books are not arranged in any particular order. Brainstorm! Will you enjoy finding a book you want there? If not, give me some solutions.
  • 47. Solutions: What about sorting the books alphabetically by title? Yes, that will help readers searching for a book by title. What if someone is looking for books on a general subject, for example health? Grouping books by topic will be more useful in this case. But how would you even begin this grouping? You will start reading the books one by one and group them! Good work :-)
  • 53. Steps in Clustering: Clustering involves the following. An algorithm: the method used to group the books together. A notion of both similarity and dissimilarity: in the library example we relied on our assessment of which books belonged in an existing stack and which should start a new one. A stopping condition: in the library example, this might have been the point beyond which books can’t be stacked anymore, or the point at which the stacks are already quite dissimilar.
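To make the three ingredients concrete, here is a small sketch of the library stacking done greedily in Python. Everything specific in it is an assumption made for illustration: the book titles and topic tags are invented, Jaccard overlap of topic tags stands in for "our assessment" of similarity, the `threshold` value is arbitrary, and the procedure simply stops once every book has been placed.

```python
def jaccard(a, b):
    """Similarity of two topic sets: shared topics divided by all topics."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def stack_books(books, threshold=0.2):
    """Greedy single-pass grouping.

    Algorithm: take the books one by one.
    Similarity: Jaccard overlap between a book's topics and a stack's topics.
    Stopping condition: every book has been placed on some stack.
    """
    stacks = []
    for title, topics in books.items():
        best = max(stacks, key=lambda s: jaccard(s["topics"], topics), default=None)
        if best is not None and jaccard(best["topics"], topics) >= threshold:
            best["titles"].append(title)   # similar enough: join the existing stack
            best["topics"] |= topics
        else:
            stacks.append({"titles": [title], "topics": set(topics)})  # start a new stack
    return [s["titles"] for s in stacks]


# Hypothetical books and topic tags, invented for this sketch.
books = {
    "Healthy Eating": {"health", "diet"},
    "Yoga Basics": {"health", "exercise"},
    "Intro to Java": {"programming", "java"},
    "Python Tricks": {"programming", "python"},
}

stacks = stack_books(books)
print(stacks)
```

With these tags the two health books end up on one stack and the two programming books on another. A different similarity measure or threshold would give different stacks, which is why the notion of similarity is listed as a separate ingredient.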
  • 57. K-Means Algorithm: Let’s look at an algorithm first, and after that at how to automate the grouping of books in the library example. k-means clustering aims to partition n observations into k clusters. It takes the input parameter k and partitions a set of n objects into k clusters so that the resulting intracluster similarity is high but the intercluster similarity is low. Cluster similarity is measured with regard to the mean value of the objects in a cluster, which can be viewed as the cluster’s centroid.
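One common way to state this goal precisely is the within-cluster sum-of-squares objective. The formula is not on the slide; the symbols are introduced here: S_1, ..., S_k are the k clusters and \mu_i is the mean (centroid) of cluster S_i.

```latex
\underset{S_1,\dots,S_k}{\arg\min} \; \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2,
\qquad
\mu_i = \frac{1}{|S_i|} \sum_{x \in S_i} x
```

Minimising the squared distance of every object to its own centroid is what "high intracluster similarity" means when similarity is measured against the cluster mean.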
  • 59. K-Means Example: Reference: Teknomo, Kardi. K-Means Clustering Tutorials. http://people.revoledu.com/kardi/tutorial/kMean Data, given per object as (attribute 1 (X): weight index, attribute 2 (Y): pH): Medicine A (1, 1); Medicine B (2, 1); Medicine C (4, 3); Medicine D (5, 4). Problem: we have 4 objects, each with 2 attributes, and we know beforehand that these objects belong to two groups of medicine (cluster 1 and cluster 2). The problem is to determine which medicines belong to cluster 1 and which belong to cluster 2.
  • 61. Steps in K-means: Iterate until stable (i.e. no object moves to another group): 1. Determine the centroid coordinates. 2. Determine the distance of each object to the centroids. 3. Group the objects based on minimum distance (find the closest centroid). Each medicine represents one point with two features (X, Y), so we can represent it as a coordinate in a feature space.
  • 63. Euclidean distance: Every clustering problem is based on a distance between points. Euclidean distance is the most commonly used distance measure. Mathematically, the Euclidean distance between points with coordinates $(x_1, y_1)$ and $(x_2, y_2)$ is $d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$.
  • 65. Iteration 0: Initial value of centroids: take medicine A and medicine B as the first centroids. Let $c_1$ and $c_2$ denote the coordinates of the centroids; then $c_1 = (1, 1)$ and $c_2 = (2, 1)$. Objects-centroids distance: calculate the distance from each cluster centroid to each object. The distance matrix using Euclidean distance at iteration 0 is $D^0 = \begin{bmatrix} 0 & 1 & 3.61 & 5 \\ 1 & 0 & 2.83 & 4.24 \end{bmatrix}$, with one column per medicine (A, B, C, D) and one row per centroid.
  • 67. Iteration 0: Each column in the distance matrix corresponds to one object. The first row holds the distance of each object to the first centroid and the second row the distance of each object to the second centroid. For example, the distance from medicine C = (4, 3) to the first centroid $c_1 = (1, 1)$ is $\sqrt{(4 - 1)^2 + (3 - 1)^2} \approx 3.61$. Similarly, its distance to the second centroid $c_2 = (2, 1)$ is $\sqrt{(4 - 2)^2 + (3 - 1)^2} \approx 2.83$.
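The iteration-0 distance matrix can be reproduced in a few lines. This is a minimal sketch using the medicine data from the example; the function name `distance_matrix` is an illustrative choice, not something from the slides.

```python
import math

# Medicine data from the example: (weight index, pH).
objects = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}
# Initial centroids: medicines A and B.
centroids = [(1, 1), (2, 1)]

def distance_matrix(centroids, points):
    """Rows follow the centroids, columns follow the objects."""
    return [[math.dist(c, p) for p in points] for c in centroids]

d0 = distance_matrix(centroids, list(objects.values()))
for row in d0:
    print([round(v, 2) for v in row])
```

Each printed row lists the distances from one centroid to medicines A, B, C and D, in that order.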
  • 68. Iteration 0: Objects clustering: we assign each object to the group whose centroid is at minimum distance. Thus medicine A is assigned to group 1, and medicines B, C and D are assigned to group 2. Group matrix: an element of the group matrix is 1 if and only if the object is assigned to that group: $G^0 = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 1 \end{bmatrix}$.
  • 70. Iteration 1: Determine new centroids: compute the new centroid of each group based on its new members. Group 1 has only one member, so its centroid remains $c_1 = (1, 1)$. Group 2 now has three members, so its centroid is the average coordinate of the three members: $c_2 = \left(\frac{2 + 4 + 5}{3}, \frac{1 + 3 + 4}{3}\right) = \left(\frac{11}{3}, \frac{8}{3}\right) \approx (3.67, 2.67)$.
  • 72. Iteration 1: Objects-centroids distance: compute the distance of all objects to the new centroids. The distance matrix at iteration 1 is $D^1 = \begin{bmatrix} 0 & 1 & 3.61 & 5 \\ 3.14 & 2.36 & 0.47 & 1.89 \end{bmatrix}$.
  • 73. Iteration 1: Objects clustering: again we assign each object based on the minimum distance. Based on the new distance matrix, we move medicine B to group 1 while all the other objects stay where they are. Group matrix at iteration 1: $G^1 = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \end{bmatrix}$.
  • 75. Iteration 2: Determine new centroids: compute the new centroid of each group based on its new members. Group 1 and group 2 both have two members, so the new centroids are $c_1 = \left(\frac{1 + 2}{2}, \frac{1 + 1}{2}\right) = (1.5, 1)$ and $c_2 = \left(\frac{4 + 5}{2}, \frac{3 + 4}{2}\right) = (4.5, 3.5)$.
  • 77. Iteration 2: Objects-centroids distance: the distance matrix at iteration 2 is $D^2 = \begin{bmatrix} 0.5 & 0.5 & 3.20 & 4.61 \\ 4.30 & 3.54 & 0.71 & 0.71 \end{bmatrix}$.
  • 78. Iteration 2: Objects clustering: again we assign each object based on the minimum distance. Group matrix at iteration 2: $G^2 = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \end{bmatrix}$.
  • 80. Results: We obtain $G^2 = G^1$. Comparing the grouping of the last iteration with this iteration shows that the objects no longer move between groups. Thus the k-means computation has stabilized and no more iterations are needed. The final grouping is the result: medicines A and B form group 1, and medicines C and D form group 2.
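The whole walkthrough (assign, recompute centroids, stop when nothing moves) can be sketched as a short loop. This is a minimal illustration on the medicine data, not a production implementation; the function name `kmeans` and the `max_iter` safeguard are assumptions added here.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Plain k-means: repeat assignment and centroid update until stable."""
    assignment = None
    for _ in range(max_iter):
        # Assign each point to its closest centroid (Euclidean distance).
        new_assignment = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no object moved group, so the grouping is stable
        assignment = new_assignment
        # Recompute each centroid as the mean of its current members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # keep the old centroid if a group is empty
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignment, centroids

points = [(1, 1), (2, 1), (4, 3), (5, 4)]  # medicines A, B, C, D
groups, centers = kmeans(points, [(1, 1), (2, 1)])
print(groups)   # group index (0 or 1) for A, B, C, D
print(centers)  # final centroid coordinates
```

Group indices here start at 0, so index 0 corresponds to group 1 in the slides. Starting from other initial centroids may lead to a different grouping, since k-means depends on its starting points.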
  • 81. Document Representations: X-Y plane example: in the previous example, the measure of similarity (or similarity metric) for the points was the Euclidean distance between two points in the X-Y plane. Library example: the library example had no such clear mathematical measure, and we relied entirely on our own judgment to assess book similarity.
  • 83. Document Representations: Brainstorm! We need a metric that can be implemented on a computer. One possible metric could be based on the number of words common to two books’ titles. So “Harry Potter: The Philosopher’s Stone” and “Harry Potter: The Prisoner of Azkaban” have three words in common: “Harry”, “Potter” and “The”. But even though the book “The Lord of the Rings: The Two Towers” is similar to the Harry Potter series, this measure of similarity doesn’t capture that.
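The title-overlap metric is easy to try out. The sketch below counts shared title words; the tokenization (lower-casing and keeping letters and apostrophes) and the function names are assumptions made for illustration.

```python
import re

def title_words(title):
    """Lower-case a title and split it into a set of words."""
    return set(re.findall(r"[a-z']+", title.lower()))

def common_words(title_a, title_b):
    """Words that appear in both titles."""
    return title_words(title_a) & title_words(title_b)

hp1 = "Harry Potter: The Philosopher's Stone"
hp3 = "Harry Potter: The Prisoner of Azkaban"
lotr = "The Lord of the Rings: The Two Towers"

print(sorted(common_words(hp1, hp3)))
print(sorted(common_words(hp1, lotr)))
```

The two Harry Potter titles share three words, while the Lord of the Rings title shares only "the" with the first one, which shows why this metric misses the similarity between the two fantasy series.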
  • 88. Document Representations: Other solutions: we could assemble word counts for each book and, when the counts are close for many words, judge the books similar. But words like “a”, “an”, and “the” cannot contribute much to the similarity, because they occur frequently in both books. We could use numeric weights in the computation, and apply low weights to these words to reduce their effect on the similarity value. Once we give a weight value to each word in a book, we can easily compute the similarity of two books.
  • 94. Document Representations: What if one book is 300 pages long and the other 1000 pages long? We have to ensure that the weight of each word is relative to the length of the text. We will see a method called TF-IDF shortly.
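One simple way to make word weights relative to text length is to divide each count by the total number of words. This is a sketch under that assumption; the slides do not prescribe a specific normalization, and the sample sentences are made up.

```python
from collections import Counter

def relative_term_frequency(text):
    """Count of each word divided by the total number of words."""
    words = text.lower().split()
    counts = Counter(words)
    return {w: c / len(words) for w, c in counts.items()}

short_text = "the wizard went to school"
long_text = "the wizard went to school " * 10  # same content, ten times longer

print(relative_term_frequency(short_text)["wizard"])
print(relative_term_frequency(long_text)["wizard"])
```

Both texts give the word the same relative weight even though the raw count differs by a factor of ten.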
  • 97. Document Representations: Task! Explore the following distance measures: 1. Squared Euclidean distance. 2. Manhattan distance. 3. Cosine distance.
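As a starting point for the task, the three measures can be written directly from their usual definitions. This is a minimal sketch; cosine distance is taken here as 1 minus the cosine similarity, and it is undefined for an all-zero vector.

```python
import math

def squared_euclidean(a, b):
    """Sum of squared coordinate differences (Euclidean distance without the root)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 - cos(angle between a and b); depends on direction, not length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

c, d = (4, 3), (5, 4)  # medicines C and D from the k-means example
print(squared_euclidean(c, d), manhattan(c, d), cosine_distance(c, d))
```

Because cosine distance ignores vector length, it is often a reasonable fit for comparing documents of different sizes.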
  • 98. Document Representations: Representing data as vectors: in mathematics, a vector is simply a point in space. We saw how books can be clustered together based on their similarity in words. In reality, clustering can be applied to any kind of object, provided we can distinguish similar items from dissimilar ones. Clustering anything with an algorithm starts with representing the objects in a form that computers can read. It is practical to think of objects in terms of their measurable features or attributes.
  • 99. Document Representations: Say we want to cluster a bunch of apples. (Figure taken from Mahout in Action.)
  • 100. Document Representations: A small, round, red apple is more similar to a small, round, green one than to a large, ovoid green one. The process of vectorization starts with assigning each feature to a dimension. Let’s say weight is feature (dimension) 0, color is 1, and size is 2. So the vector of a small round red apple looks like [0: 100 gram, 1: red, 2: small].
  • 104. Document Representations: A set of apples of different weights, sizes and colors converted to vectors. (Figure taken from Mahout in Action.)
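To compare apples numerically, each feature has to become a number. The sketch below uses made-up encodings for color and size and made-up weights for the apples; the slides only name the three dimensions, so every numeric value here is an assumption for illustration.

```python
import math

# Hypothetical numeric encodings; the slides only name the features.
COLOR = {"red": 0.0, "green": 1.0}
SIZE = {"small": 1.0, "large": 3.0}

def apple_vector(weight_grams, color, size):
    """Dimension 0: weight, dimension 1: color, dimension 2: size."""
    return [float(weight_grams), COLOR[color], SIZE[size]]

small_red = apple_vector(100, "red", "small")
small_green = apple_vector(105, "green", "small")   # weight is an assumed value
large_green = apple_vector(250, "green", "large")   # weight is an assumed value

print(math.dist(small_red, small_green))
print(math.dist(small_red, large_green))
```

With these values the small red apple sits much closer to the small green one than to the large green one. Note that weight in grams dominates the distance here; in practice the features would usually be scaled to comparable ranges first.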
  • 105. Document Representations: Improving weighting with TF-IDF: term frequency-inverse document frequency (TF-IDF) weighting is a widely used improvement on simple term frequency weighting. Instead of simply using the term frequency as the value in the vector, this value is multiplied by the inverse of the term’s document frequency: $\mathrm{IDF} = \log(N / n)$, where $N$ is the total number of documents and $n$ is the number of documents that contain the term, so $\text{TF-IDF} = \mathrm{TF} \times \mathrm{IDF}$.
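The weighting can be sketched directly from the formula IDF = log(N/n). This is a minimal illustration using raw counts as TF and whitespace tokenization; the sample documents and the function name `tf_idf` are made up for the example.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document, using IDF = log(N / n)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # n: number of documents that contain each term
    doc_freq = Counter(term for words in tokenized for term in set(words))
    weights = []
    for words in tokenized:
        tf = Counter(words)  # raw term counts in this document
        weights.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return weights

docs = ["the boy wizard", "the dark lord", "the ring of the lord"]
w = tf_idf(docs)
print(w[0])
```

A word such as "the" that appears in every document gets weight 0, because log(N/N) = 0, while a word that appears in only one document gets the highest weight.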
  • 106. Stanford NLP: NLP toolkit: the Stanford NLP group provides NLP toolkits for several major computational linguistics problems. They are written in Java and are open source.
  • 107. Stanford NLP: Stanford Named Entity Recognizer: named-entity recognition (NER) techniques locate and classify atomic elements in text into predefined categories such as the names of persons, organizations and locations. Consider the following text: “Hello Jona, I am in Indian Institute at Trivandrum”. What are the entities in this? NER Demo. Stanford NER is also known as CRFClassifier. Conditional Random Field (CRF) sequence models are used for structured prediction.
  • 112. Social Media and Sentiment Analysis: Twitter: Twitter Streaming Demo. Sentiment analysis: sentiment analysis is one of the hottest research areas in computer science today. A basic task in sentiment analysis is to classify the polarity of a given text at the document, sentence, or aspect level, that is, to decide whether the opinion expressed in a document, a sentence, or an entity feature or aspect is positive, negative, or neutral.
  • 114. Social Media and Sentiment Analysis: Movie review: let’s see a tweet on a recently released movie: “Wow #Krish3 looks more exciting than Superman n Spider-Man for sure ! The Roshans have made a truly world class super hero film, again!” These snippets of text are a gold mine for companies and individuals that want to monitor their reputation and get timely feedback about their products and actions.
  • 117. Social Media and Sentiment Analysis: Document-level sentiment analysis: the main approach to document-level sentiment analysis is supervised learning. The system learns a classification model from the training data; common classification algorithms such as SVM, Naive Bayes and logistic regression can be used. New documents are then tagged with their sentiment classes.
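To make the supervised approach concrete, here is a small Naive Bayes classifier written from scratch. It is a sketch, assuming whitespace tokenization and add-one (Laplace) smoothing; the class name and the four training sentences are invented for illustration and are far too few for real use.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes over words, with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total_docs)  # log prior of the class
            for word in text.lower().split():
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)

# Hypothetical toy training data.
train = [
    ("a truly world class film", "positive"),
    ("exciting and fun", "positive"),
    ("boring and slow", "negative"),
    ("a truly dull film", "negative"),
]
model = NaiveBayes().fit(*zip(*train))
print(model.predict("exciting world class film"))
print(model.predict("boring slow film"))
```

The model learns word counts per class from the labelled examples and then tags a new document with the class that gives it the highest log-probability score.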
  • 118. Bigdata: Introduction to Bigdata: big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
  • 119. Bigdata: 3 Vs of Bigdata: Volume: ever-growing data of all types. Velocity: for time-sensitive processes such as fraud detection and intrusion detection, the speed at which data arrives is a defining characteristic of big data. Variety: any type of data, structured or unstructured, such as text, sensor data, audio, video, click streams, log files and more.
  • 120. Bigdata: Tools and technologies: Hadoop, NoSQL, Spark, D3.
  • 121. Bigdata: A few interesting areas: Internet of Things, data journalism.
  • 122. Conclusion: Questions?
  • 123. References: Sean Owen, Robin Anil, Ted Dunning, Ellen Friedman, Mahout in Action, Manning Publications. Jiawei Han, Micheline Kamber, Data Mining: Concepts and Techniques. Kardi Teknomo, K-Means Clustering Tutorials. A first take at building an inverted index, http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html
  • 124. Thanks