Social Media Analytics and NLP

Chapter 4
Social Media and Text Analytics
Asst. Prof. Rushikesh Chikane, MIT
ACSC, Alandi

Overview of Social Media Analytics
 Social media analytics is the process of collecting and
analyzing audience data shared on social networks to
improve an organization's strategic business decisions.
 Social media analytics is the ability to gather data from
social channels and find meaning in it to support business
decisions, and to measure the performance of actions based
on those decisions through social media.
 Social media analytics uses specifically designed
software platforms that work similarly to web search
tools.
 Data about keywords or topics is retrieved through
search queries or web ‘crawlers’ that span channels.
 Fragments of text are returned, loaded into a database,
categorized and analyzed to derive meaningful insights.

Social Media Analytics Process

Seven Layers of Social Media Analytics
 Social media at a minimum has seven layers of data.
 Each layer carries potentially valuable information
and insights that can be harvested for business
intelligence purposes.
 Out of the seven layers, some are visible or easily
identifiable (e.g., text and actions) and others are
invisible (e.g., social media and hyperlink networks).

 LAYER ONE: TEXT
 Social media text analytics deals with the extraction and
analysis of business insights from textual elements of
social media content, such as comments, tweets, blog
posts, and Facebook status updates. Text analytics is
mostly used to understand social media users’ sentiments
or identify emerging themes and topics.
 LAYER TWO: NETWORKS
 Social media network analytics extracts, analyzes, and
interprets personal and professional social networks, for
example, Facebook friendship networks and Twitter.
Network analytics seeks to identify influential nodes (e.g.,
people and organizations) and their position in the
network.

 LAYER THREE: ACTIONS
 Social media actions analytics deals with extracting,
analyzing, and interpreting the actions performed by
social media users, including likes, dislikes, shares,
mentions, and endorsements. Actions analytics is mostly
used to measure popularity and influence and to make
predictions in social media.
 LAYER FOUR: MOBILE
 Mobile analytics is the next frontier in the social business
landscape. Mobile analytics deals with measuring and
optimizing user engagement with mobile applications (or
apps for short), analyzing and understanding in-app
purchases, customer engagement, and mobile user
demographics.

 LAYER FIVE: HYPERLINKS
 Hyperlink analytics is about extracting, analyzing, and
interpreting social media hyperlinks (e.g., in-links and out-
links).
 Hyperlink analysis can reveal, for example, Internet traffic
patterns and sources of incoming or outgoing traffic to
and from a source.
 LAYER SIX: LOCATION
 Location analytics, also known as spatial analysis or
geospatial analytics, is concerned with mining and
mapping the locations of social media users,
contents, and data.

 LAYER SEVEN: SEARCH ENGINES
 Search engine analytics focuses on analyzing historical
search data to gain valuable insight into a range of
areas, including trend analysis, keyword monitoring,
search result and advertisement history, and
advertisement spending statistics.

Accessing Social Media Data
 Social media data is any type of data that can be
gathered through social media. In general, the term
refers to social media metrics and
demographics collected through analytics tools on
social platforms.
 Social media data can also refer to data collected
from content people post publicly on social media.
This type of social media data for marketing can be
collected through social listening tools.

Social Network Analysis
 Social network analysis (SNA) is the process of
investigating social structures through the use
of networks and graph theory. It characterizes
networked structures in terms of nodes (individual
actors, people, or things within the network) and
the ties, edges, or links (relationships or interactions)
that connect them.
 SNA is the practice of representing networks of
people as graphs and then exploring these graphs. A
typical social network representation has nodes for
people, and edges connecting two nodes to
represent one or more relationships between them.
 The resulting graph can reveal patterns of
connection among people. Small networks can be
represented visually; these visualizations are
intuitive, can make patterns of connection apparent,
and can reveal nodes that are highly connected or
that play a critical role in connecting groups
together.
 Social network analysis (SNA) is a process of
quantitative and qualitative analysis of a social
network. SNA measures and maps the flow of
relationships and relationship changes between
knowledge-possessing entities.
 These entities can be simple or complex and include
websites, computers, animals, humans, groups,
organizations and nations.
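
The idea of nodes, edges, and "highly connected" nodes described above can be sketched in a few lines of Python. This is only an illustration: the names and ties are invented, and degree centrality is just one of several measures that could be used to spot well-connected nodes.

```python
from collections import defaultdict

# Hypothetical friendship ties (undirected edges); all names are made up.
edges = [
    ("Asha", "Bina"), ("Asha", "Chen"), ("Bina", "Chen"),
    ("Chen", "Dev"),
    ("Dev", "Esha"), ("Dev", "Farid"), ("Esha", "Farid"),
]

def build_graph(edge_list):
    """Return an adjacency map: node -> set of neighbouring nodes."""
    graph = defaultdict(set)
    for a, b in edge_list:
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)

def degree_centrality(graph):
    """Degree of each node divided by the largest possible degree (n - 1)."""
    n = len(graph)
    return {node: len(neighbours) / (n - 1) for node, neighbours in graph.items()}

graph = build_graph(edges)
centrality = degree_centrality(graph)

# List nodes from most to least connected.
for node, score in sorted(centrality.items(), key=lambda item: -item[1]):
    print(f"{node}: {score:.2f}")
```

In this toy network "Chen" and "Dev" have the highest scores, and they are also the two nodes that join the two friend groups together.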

The benefits of social network analysis:
 Helps you understand your audience better
 Used for customer segmentation
 Used to design Recommendation Systems
 Detect fake news, among other things

 Link Prediction:
 Link prediction is one of the most important research
topics in the field of graphs and networks. The objective
of link prediction is to predict, for pairs of nodes that are
not yet connected, whether they will form a link in the
future.

 Link prediction has many real-world applications:
 Predict which customers are likely to buy which products
on online marketplaces like Amazon, which can help in
making better product recommendations.
 Suggest interactions or collaborations between
employees in an organization.
 Extract vital insights from terrorist networks.
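
One simple way to approach link prediction is to score every pair of unconnected nodes by how many neighbours they share. The sketch below uses the Jaccard coefficient for that score; the slides do not name a method, so this choice and the small network are assumptions made purely for illustration.

```python
from itertools import combinations

# Hypothetical undirected network as an adjacency map (made-up data).
graph = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "D"},
    "D": {"B", "C", "E"},
    "E": {"D"},
}

def jaccard(graph, u, v):
    """Number of shared neighbours divided by the number of distinct neighbours."""
    union = graph[u] | graph[v]
    return len(graph[u] & graph[v]) / len(union) if union else 0.0

def rank_candidate_links(graph):
    """Score every pair of nodes that is not yet connected, best score first."""
    scores = [
        (jaccard(graph, u, v), u, v)
        for u, v in combinations(sorted(graph), 2)
        if v not in graph[u]
    ]
    return sorted(scores, key=lambda s: (-s[0], s[1], s[2]))

ranked = rank_candidate_links(graph)
for score, u, v in ranked:
    print(f"{u} - {v}: {score:.2f}")
```

Pairs near the top of the ranking are the ones this heuristic considers most likely to connect in the future; a recommendation system could surface them as suggestions.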

Introduction to Natural Language
Processing
 Natural Language Processing is a branch of Computer
Science that deals with the understanding and
processing of natural language, e.g. texts or voice
recordings.
 The goal is for a machine to be able to communicate with
humans in the same way that humans have been
communicating with each other for centuries.
 Learning a new language is not easy for us humans
either and requires a lot of time and perseverance.
 When a machine wants to learn a natural language, it is
no different.
 Therefore, some sub-areas have emerged within Natural
Language Processing that are necessary for language to
be completely understood.

Text Analytics
 Tokenization
 Bag of words
 Word weighting : TF-IDF
 N-Grams
 Stop word
 Stemming and Lemmatization
 Synonyms and Part of speech tagging

Tokenization
 The text is cut into pieces called “tokens” or “terms.”
 These tokens are the most basic unit of information you’ll
use for your model.
 The terms are often words but this isn’t a necessity.
Entire sentences can be used for analysis.
 We’ll use unigrams: terms consisting of one word.
 Often, however, it’s useful to include bigrams (two words
per token) or trigrams (three words per token) to capture
extra meaning and increase the performance of your
models.
 This does come at a cost, though, because you’re
building bigger term-vectors by including bigrams and/or
trigrams in the equation.
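
As a minimal sketch of unigram tokenization, the following splits a text into lowercase word tokens with a regular expression. The pattern is an assumption chosen for simplicity; real tokenizers handle punctuation, numbers, and languages other than English far more carefully.

```python
import re

def tokenize(text):
    """Lowercase the text and return its word tokens (unigrams)."""
    # Letters, digits and apostrophes count as part of a token;
    # everything else (spaces, punctuation) separates tokens.
    return re.findall(r"[a-z0-9']+", text.lower())

tokens = tokenize("Data science is fun. Isn't it?")
print(tokens)
```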

Bag of words
 To build our classification model we’ll go with the bag of
words approach.
 Bag of words is the simplest way of structuring textual
data: every document is turned into a word vector.
 If a certain word is present in the document, it’s labeled
“True”; the others are labeled “False”. The figure shows a
simplified example of this, in case there are only two
documents: one about the television show Game of
Thrones and one about data science.
 The two word vectors together form the document-term
matrix.
 The document-term matrix holds a column for every term
and a row for every document.
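
A binary document-term matrix like the one described above could be built as follows. The two short documents are invented stand-ins for the Game of Thrones and data science examples, and whitespace splitting is used instead of a proper tokenizer to keep the sketch short.

```python
# Two made-up documents standing in for the examples on the slide.
documents = {
    "game_of_thrones": "winter is coming to the north",
    "data_science": "data science is coming of age",
}

def document_term_matrix(docs):
    """Return (vocabulary, matrix); each row marks which terms a document contains."""
    tokenized = {name: set(text.lower().split()) for name, text in docs.items()}
    vocabulary = sorted(set().union(*tokenized.values()))
    matrix = {
        name: [term in words for term in vocabulary]
        for name, words in tokenized.items()
    }
    return vocabulary, matrix

vocabulary, matrix = document_term_matrix(documents)
print(vocabulary)
for name, row in matrix.items():
    print(name, row)
```

Each row has one True/False entry per vocabulary term, so the matrix has one column for every term and one row for every document.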

Word weighting : TF-IDF
 Term Frequency - Inverse Document Frequency (TF-
IDF) is a widely used statistical method in natural
language processing and information retrieval. It
measures how important a term is within a document
relative to a collection of documents (i.e., relative to
a corpus). Words within a text document are
transformed into importance numbers by a text
vectorization process. There are many different text
vectorization scoring schemes, with TF-IDF being
one of the most common.

 As its name implies, TF-IDF vectorizes/scores a
word by multiplying the word’s Term Frequency (TF)
with the Inverse Document Frequency (IDF).
 Term Frequency: TF of a term or word is the
number of times the term appears in a document
divided by the total number of words in the
document.


 Inverse Document Frequency: IDF of a term
reflects the inverse of the proportion of documents in
the corpus that contain the term. Words unique to a small
percentage of documents (e.g., technical jargon
terms) receive higher importance values than words
common across all documents (e.g., a, the, and).

 The TF-IDF of a term is calculated by multiplying TF
and IDF scores.
 TF-IDF is useful in many natural language
processing applications. For example, Search
Engines use TF-IDF to rank the relevance of a
document for a query. TF-IDF is also employed in
text classification, text summarization, and topic
modeling.
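
The TF and IDF definitions above can be turned into a small calculation. This sketch assumes the plain variant, IDF = log10(N / number of documents containing the term), with no smoothing; libraries often use slightly different variants, and the three-sentence corpus is invented.

```python
import math

# Tiny made-up corpus; each string is one document.
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock markets fell sharply",
]
docs = [text.split() for text in corpus]

def tf(term, tokens):
    """Occurrences of the term divided by the document length."""
    return tokens.count(term) / len(tokens)

def idf(term, tokenized_docs):
    """log10 of (number of documents / documents containing the term)."""
    containing = sum(1 for doc in tokenized_docs if term in doc)
    if containing == 0:
        return 0.0  # assumption: unseen terms get a score of zero
    return math.log10(len(tokenized_docs) / containing)

def tf_idf(term, tokens, tokenized_docs):
    """Product of term frequency and inverse document frequency."""
    return tf(term, tokens) * idf(term, tokenized_docs)

for term in ("the", "mat"):
    print(term, round(tf_idf(term, docs[0], docs), 4))
```

Although "the" occurs twice in the first document and "mat" only once, "mat" receives the higher score because it appears in fewer documents.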

Example
 Imagine the term ’t’ appears 20 times in a document
that contains a total of 100 words.
 The Term Frequency (TF) of ’t’ is the ratio of these two
counts: 20 / 100 = 0.2.
 Assume a collection of related documents contains
10,000 documents. If 100 of those 10,000 documents
contain the term ’t’, the Inverse Document Frequency
(IDF) of ’t’ is derived from the ratio 10,000 / 100 = 100,
usually on a logarithmic scale.

 Using these two quantities, we can calculate the TF-IDF
score of the term ’t’ for the document by multiplying them.
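
The formulas for this example appear to have been images and are not present in the text. A plausible reconstruction is given below, assuming the common definition IDF = log(N / number of documents containing the term) and a base-10 logarithm; a different base or a smoothed variant would change the numbers but not the idea.

```latex
\begin{aligned}
\mathrm{TF}(t) &= \frac{20}{100} = 0.2 \\
\mathrm{IDF}(t) &= \log_{10}\frac{10{,}000}{100} = \log_{10} 100 = 2 \\
\mathrm{TF\text{-}IDF}(t) &= \mathrm{TF}(t) \times \mathrm{IDF}(t) = 0.2 \times 2 = 0.4
\end{aligned}
```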

N-Grams
 N-gram can be defined as the contiguous sequence
of n items from a given sample of text or speech.
The items can be letters, words, or base pairs
according to the application. N-grams are typically
collected from a text or speech corpus (a large text
dataset).
 N-grams of texts are extensively used in text mining
and natural language processing tasks. They are
essentially sets of co-occurring words within a given
window. When computing the n-grams, you typically
move the window one word forward at a time (although
you can move X words forward in more advanced
scenarios).

 For example, for the sentence “I reside in Bengaluru”:

Sl. No.  Type of n-gram  Generated n-grams
1        Unigram         [“I”, “reside”, “in”, “Bengaluru”]
2        Bigram          [“I reside”, “reside in”, “in Bengaluru”]
3        Trigram         [“I reside in”, “reside in Bengaluru”]

 When N=1, these are referred to as unigrams, which are
essentially the individual words in a sentence.
 When N=2, these are called bigrams, and
 when N=3, these are called trigrams.
 When N>3, these are usually referred to as four-grams,
five-grams, and so on.
 How many N-grams are there in a sentence?
 If X is the number of words in a given sentence K, the
number of N-grams for sentence K is X - (N - 1).
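
The sliding window described above can be sketched as a short function. It assumes words are separated by whitespace and moves the window one word at a time; the function name is illustrative.

```python
def ngrams(sentence, n):
    """Return all word n-grams of the sentence, moving one word forward each step."""
    words = sentence.split()
    # There are len(words) - (n - 1) windows; the range is empty if n is too large.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "I reside in Bengaluru"
for n in (1, 2, 3):
    print(n, ngrams(sentence, n))
```

For this four-word sentence the function yields 4 unigrams, 3 bigrams, and 2 trigrams, which matches the count X - (N - 1).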

Stop word
 Stop words are a set of commonly used words in a
language. Examples of stop words in English are “a,”
“the,” “is,” “are,” etc.
 Stop word lists are commonly used in Text Mining and
Natural Language Processing (NLP) to eliminate words
that are so widely used that they carry very little useful
information.
 When to remove stop words?
 For text classification or sentiment analysis, stop words
are usually removed because they add little information
to the model, which keeps unwanted words out of the
corpus. For language translation, stop words are useful,
as they have to be translated along with other words.

 Pros:
 Stop words are often removed from the text before
training deep learning and machine learning models since
stop words occur in abundance, hence providing little to
no unique information that can be used for classification
or clustering.
 On removing stop words, dataset size decreases, and the
time to train the model also decreases without a huge
impact on the accuracy of the model.
 Stop word removal can potentially help in improving
performance, as there are fewer and only significant
tokens left. Thus, the classification accuracy could be
improved.

 Cons:
 Improper selection and removal of stop words can change
the meaning of our text, so we have to be careful in
choosing our stop words.
 Ex: “This movie is not good.”
If we remove “not” in the pre-processing step, the sentence
becomes “this movie is good”, which is wrongly
interpreted as positive.
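
The effect of a badly chosen stop word list can be seen in a short sketch. Both stop word sets below are hypothetical and far smaller than the lists shipped with NLP libraries; the only difference between them is whether "not" is included.

```python
import re

# Two hypothetical stop word lists: one wrongly includes the negation "not".
AGGRESSIVE_STOP_WORDS = {"a", "an", "the", "is", "are", "this", "not"}
CAREFUL_STOP_WORDS = AGGRESSIVE_STOP_WORDS - {"not"}

def remove_stop_words(text, stop_words):
    """Tokenize on letters and drop every token found in stop_words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [token for token in tokens if token not in stop_words]

review = "This movie is not good."
aggressive = remove_stop_words(review, AGGRESSIVE_STOP_WORDS)
careful = remove_stop_words(review, CAREFUL_STOP_WORDS)
print(aggressive)
print(careful)
```

With the aggressive list the negation disappears and only "movie" and "good" remain, so a downstream classifier would likely read the review as positive.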

Stemming and Lemmatization
 What is Stemming?
 Stemming is a technique used to extract the base form of
words by removing affixes from them. It is just like
cutting down the branches of a tree to its stem. For
example, the stem of the words eating, eats, and
eaten is eat.
 Search engines use stemming when indexing words.
Rather than storing all forms of a word, a search engine
can store only the stems. In this way, stemming reduces
the size of the index and can improve retrieval accuracy.

What is Lemmatization?
 Lemmatization is a refinement of stemming: it
describes the process of grouping together the
different inflected forms of a word so they can be
analyzed as a single item.
 Lemmatization is similar to stemming, but it brings
context to the words, so it links words with similar
meanings to one word.
 Lemmatization algorithms usually also take the part of
speech as an input, such as whether the word is an
adjective, noun, or verb.
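
The difference between the two techniques can be illustrated with a deliberately tiny sketch. Both functions are toys written for this example, not real algorithms such as the Porter stemmer or a WordNet-based lemmatizer: the suffix list and the lookup table are assumptions that cover only the words shown.

```python
# Toy stemmer: strip the first matching suffix, keeping at least three letters.
SUFFIXES = ("ing", "en", "ed", "es", "s")

def toy_stem(word):
    """Remove a known suffix from the word, without using any context."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy lemmatizer: a hand-written (word, part of speech) -> lemma table.
LEMMAS = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("better", "ADJ"): "good",
    ("eaten", "VERB"): "eat",
}

def toy_lemmatize(word, pos):
    """Look up the dictionary form; fall back to the word itself if unknown."""
    return LEMMAS.get((word, pos), word)

stems = [toy_stem(w) for w in ("eating", "eats", "eaten")]
print(stems)
print(toy_lemmatize("saw", "VERB"), toy_lemmatize("saw", "NOUN"))
print(toy_stem("better"), toy_lemmatize("better", "ADJ"))
```

The stemmer only chops characters, so it cannot relate "better" to "good" or tell the verb "saw" from the noun. The lemmatizer can, because it is given the part of speech and a vocabulary to look words up in.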

Synonyms and Part of speech tagging
 Part-of-speech (POS) tagging is a process in natural
language processing (NLP) where each word in a text is
labeled with its corresponding part of speech. This can
include nouns, verbs, adjectives, and other grammatical
categories.
 POS tagging is useful for a variety of NLP tasks, such as
information extraction, named entity recognition, and
machine translation. It can also be used to identify the
grammatical structure of a sentence and to disambiguate
words that have multiple meanings.
 POS tagging is typically performed using machine
learning algorithms, which are trained on a large
annotated corpus of text. The algorithm learns to predict
the correct POS tag for a given word based on the
context in which it appears.

 Why POS tagging?
 POS tagging is an important part of NLP because it
is a prerequisite for further NLP analysis, such as the
following:
 Chunking
 Syntax Parsing
 Information extraction
 Machine Translation
 Sentiment Analysis
 Grammar analysis & word-sense disambiguation

 Tagging a list of sentences
 Rather than tagging a single sentence at a time,
NLTK’s TaggerI class also provides a tag_sents() method,
with which we can tag a list of sentences.
 Un-tagging a sentence
 We can also un-tag a sentence. NLTK provides the
nltk.tag.untag() function for this purpose. It takes a
tagged sentence as input and returns a list of words
without tags.
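
The original example for these two calls is not included in the text, so the following is a stand-in sketch. It assumes NLTK is installed and uses DefaultTagger, which simply assigns the same tag to every token, only because it needs no trained model or corpus download; the two sentences are made up.

```python
from nltk.tag import DefaultTagger, untag

# DefaultTagger implements the TaggerI interface and tags every token "NN".
# It is used here purely to demonstrate the API, not as a realistic tagger.
tagger = DefaultTagger("NN")

# tag_sents() expects a list of already tokenized sentences.
sentences = [["Hello", "world"], ["How", "are", "you"]]
tagged = tagger.tag_sents(sentences)
print(tagged)

# untag() strips the tags from one tagged sentence.
plain = untag(tagged[0])
print(plain)
```

A trained tagger could be dropped in place of DefaultTagger without changing the tag_sents() and untag() calls.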

Use of Parts of Speech Tagging in NLP
 To understand the grammatical structure of a sentence:
 By labeling each word with its POS, we can better understand the syntax
and structure of a sentence. This is useful for tasks such as machine
translation and information extraction, where it is important to know how
words relate to each other in the sentence.
 To disambiguate words with multiple meanings:
 Some words, such as “bank,” can have multiple meanings depending on
the context in which they are used. By labeling each word with its POS,
we can disambiguate these words and better understand their intended
meaning.
 To improve the accuracy of NLP tasks:
 POS tagging can help improve the performance of various NLP tasks,
such as named entity recognition and text classification. By providing
additional context and information about the words in a text, we can build
more accurate and sophisticated algorithms.
 To facilitate research in linguistics:
 POS tagging can also be used to study the patterns and characteristics of
language use and to gain insights into the structure and function of
different parts of speech.

Application of POS Tagging
 Information extraction:
 POS tagging can be used to identify specific types of information in a text, such as
names, locations, and organizations. This is useful for tasks such as extracting
data from news articles or building knowledge bases for artificial intelligence
systems.
 Named entity recognition:
 POS tagging can be used to identify and classify named entities in a text, such as
people, places, and organizations. This is useful for tasks such as building
customer profiles or identifying key figures in a news story.
 Text classification:
 POS tagging can be used to help classify texts into different categories, for example in
spam detection or sentiment analysis. By analyzing the POS tags of the words in a
text, algorithms can better understand the content and tone of the text.
 Machine translation:
 POS tagging can be used to help translate texts from one language to another by
identifying the grammatical structure and relationships between words in the
source language and mapping them to the target language.
 Natural language generation:
 POS tagging can be used to generate natural-sounding text by selecting
appropriate words and constructing grammatically correct sentences. This is useful
for tasks such as chatbots and virtual assistants.

Sentiment Analysis
 Sentiment analysis is the process of classifying
whether a block of text is positive, negative, or
neutral.
 Sentiment analysis is a subset of natural language
processing (NLP) that uses machine learning to
analyze and classify the emotional tone of text data.
 The goal of sentiment analysis is to analyze people’s
opinions in a way that can help businesses expand.
 It focuses not only on polarity (positive, negative, and
neutral) but also on emotions (happy, sad, angry,
etc.) as well as intentions to buy.

Why Use Sentiment Analysis?
Sentiment analysis examines the contextual meaning
of words to gauge the social sentiment around a
brand, and helps a business determine whether the
product it is manufacturing will be in demand in the
market.
Businesses can use insights from sentiment
analysis to improve their products, fine-tune
marketing messages, correct misconceptions,
and identify positive influencers.
It helps businesses gain insights, understand
customers, predict and enhance the customer
experience, tailor marketing campaigns, and make
decisions.

Types of Sentiment Analysis
Fine-grained sentiment analysis:
• This is based on polarity. Text can be categorized as very positive,
positive, neutral, negative, or very negative. The rating is done on a scale of 1 to 5:
a rating of 5 is very positive, 4 positive, 3 neutral, 2 negative, and 1 very negative.
Emotion detection:
• Sentiments such as happy, sad, angry, upset, jolly, and pleasant come under
emotion detection. It is often implemented with lexicon-based methods.
Aspect-based sentiment analysis:
• This focuses on a particular aspect of the subject. For instance, when reviewing a
cell phone, aspect-based analysis looks separately at aspects such as the battery,
screen, and camera quality.
Multilingual sentiment analysis:
• Here the texts to be classified as positive, negative, or neutral are written in
different languages. This is highly challenging and comparatively
difficult.
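
A very small polarity classifier can be sketched with a word lexicon. Note that this swaps in a lexicon-based approach rather than the machine learning approach mentioned earlier, and that the lexicon, the negation words, and the scoring rule are all assumptions invented for this illustration.

```python
import re

# Tiny hypothetical sentiment lexicon: word -> polarity score.
LEXICON = {"good": 1, "great": 2, "love": 2, "bad": -1, "poor": -1, "terrible": -2}
NEGATIONS = {"not", "never", "no"}

def sentiment_score(text):
    """Sum lexicon scores; a negation flips the next sentiment word that follows it."""
    score = 0
    negate = False
    for word in re.findall(r"[a-z']+", text.lower()):
        if word in NEGATIONS:
            negate = True
            continue
        value = LEXICON.get(word, 0)
        if value:
            score += -value if negate else value
            negate = False
    return score

def classify(text):
    """Map the numeric score to a positive / negative / neutral label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

for review in ("The battery is great", "This movie is not good", "It arrived on Monday"):
    print(review, "->", classify(review))
```

Replacing the three labels with five score ranges would give a fine-grained variant, and scoring the sentences about each product feature separately would move toward aspect-based analysis.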

Applications
Social Media:
• On social media sites such as Instagram, comments and
reviews are analyzed and categorized as positive,
negative, or neutral.
Customer Service:
• In the Play Store, comments accompanying ratings of 1 to 5
are analyzed with the help of sentiment analysis
approaches.
Marketing Sector:
• In marketing, sentiment analysis is used where a particular
product needs to be reviewed as good or bad.
Reviewer side:
• Reviewers look at the comments, check them, and
give an overall review of the product.

Document or text summarization
 Text summarization is a very useful and important
part of Natural Language Processing (NLP).
 We can summarize a text in a few lines by
removing unimportant content and converting the
text into a shorter form that keeps its meaning.
 In this approach we build algorithms or programs
that reduce the text size and create a summary
of our text data. This is called automatic text
summarization in machine learning.
 Text summarization is the process of creating a shorter
text without losing the meaning of the original.
 Text summarization is the practice of condensing long
publications into manageable paragraphs or sentences.
 The procedure extracts important information while also
ensuring that the sense of the text is preserved. This
shortens the time it takes to comprehend long materials
like research articles without omitting critical
information.
 Text summarization presents a number of challenges,
including text identification, interpretation, and summary
generation, as well as analysis of the resulting summary.
 Identifying important phrases in the document and
using them to find relevant information to include in
the summary are the critical tasks in extraction-based
summarization.

Two approaches to text summarization:
 Extraction-based summarization
 Abstractive summarization

 Extraction based summarization
 The extractive text summarization approach entails
extracting essential phrases or sentences from a source
document and combining them to create a summary.
 The extraction is done according to a given scoring
measure, without making any modifications to the text.
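
A minimal sketch of extraction-based summarization is shown below. It assumes word frequency as the scoring measure, which is only one of many possible choices, and it copies the selected sentences unchanged; the sample text is invented.

```python
import re
from collections import Counter

def summarize(text, max_sentences=1):
    """Return the highest-scoring sentences, unchanged and in their original order."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    frequency = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        # A sentence scores the sum of the document-wide counts of its words.
        return sum(frequency[w] for w in re.findall(r"[a-z']+", sentence.lower()))

    top = sorted(sentences, key=score, reverse=True)[:max_sentences]
    return [s for s in sentences if s in top]

text = (
    "Cats sleep a lot. "
    "Cats and dogs are popular pets, and cats are independent pets. "
    "Birds sing."
)
summary = summarize(text, max_sentences=1)
print(summary)
```

This simple measure favours long sentences and common words, so a practical version would normally remove stop words and normalize by sentence length.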

 Abstractive Summarization
 Another approach to text summarization is abstractive
summarization, in which we create new sentences from
the original content.
 This is in contrast to the extractive technique, in
which we only use phrases that are present in the source.
The phrases formed by abstractive summarization may not
be present in the original text.
 When deep learning is used for abstractive
summarization, it can overcome the grammatical
inconsistencies of the extractive method.
 Abstraction can perform better than extraction. The
summarization algorithms necessary for abstraction, on
the other hand, are more complex to build, which is why
extraction is still widely used.

Trend Analytics
 Trend analysis, also known as technical analysis in
financial markets, is used to monitor metrics and their
development over time. As such, the technique relies on
effective historical analysis.
 Trend analysis is a methodology used in research to
gather and study data in order to make predictions about
future consumer behavior based on observed and
recorded data from past and ongoing trends.
 It helps determine the main characteristics of the
stock market and the consumers associated with it.
 Trend analysis gives us the ability to look at data over
time, for example in a long-running survey.
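
Monitoring how a metric develops over time can be sketched by fitting a straight line to its history. The example assumes equally spaced observations and uses an ordinary least-squares slope; the weekly mention counts are made up.

```python
def linear_trend(values):
    """Least-squares slope of the values against their positions 0, 1, 2, ...

    Needs at least two observations; the slope is the average change per step.
    """
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator

# Hypothetical number of brand mentions per week.
weekly_mentions = [120, 135, 150, 160, 180]
slope = linear_trend(weekly_mentions)
print(f"Average change per week: {slope:.1f}")
```

A positive slope indicates an upward trend and a negative slope a downward one; extending the fitted line gives a rough forecast for the next period.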

Types of Trend Analysis:
 Temporal Method
 Geographic Method
 Intuitive Method

 Temporal Method
 This method is used to analyze patterns and trends in
a given group of relevant data or objects of study over
a specific period of time, as well as how they change
during that period.
 A clear example is a longitudinal study carried out
with the intention of detecting and analyzing trends
that arise from historical data.
 It is mainly used in ethnographic research and other
types of event-focused studies. The great
disadvantage of this type of trend analysis is that it is
exposed to many variables that could affect the final
result of the study.

 Geographic Method
 The geographic method of trend analysis is generally
easy and reliable; it can be the means to identify
commonalities and differences between user groups
belonging to the same or different geographies.
 The main purpose of the geographic method is the
analysis of market trends that develop in groups of
users identified by their geographic location.
 The downside of the geographic method is the
geographic limitation it places on data analysis, which
can be influenced by factors such as culture and
traditions that are specific to the geographic location
of the user groups.

 Intuitive Method
 The intuitive method is a type of trend analysis
implemented to analyze trends within groups of users
based on logical explanations, behavioral patterns, or
other elements perceived by a futurist.
 This market trend analysis is helpful for making
predictions without the need for large amounts of
statistical data. However, the methodology relies
heavily on the knowledge and logic provided by
futurists and researchers, which makes it prone to
researcher bias.
 The intuitive method is the most difficult type of trend
analysis and might not be as precise.

Challenges to Social media analytics
Data cleansing
• cleaning unstructured textual data (e.g., normalizing text),
especially high-frequency streamed real-time data, still
presents numerous problems and research challenges.
Scraping
• although social media data is accessible through APIs, due to
the commercial value of the data, most of the major sources
such as Facebook and Google are making it increasingly
difficult for academics to obtain comprehensive access to their
‘raw’ data; very few social data sources provide affordable data
offerings to academia and researchers. News services such as
Thomson Reuters and Bloomberg typically charge a premium
for access to their data.
• In contrast, Twitter has recently announced the Twitter
Data Grants program, where researchers can apply to get
access to Twitter’s public tweets and historical data in
order to get insights from its massive set of data (Twitter
has more than 500 million tweets a day).
Data protection
• once you have created a ‘big data’ resource, the data
needs to be secured, ownership and IP issues resolved
(i.e., storing scraped data is against most of the
publishers’ terms of service), and users provided with
different levels of access; otherwise, users may attempt to
‘suck’ all the valuable data from the database.

Holistic data sources
• researchers are increasingly bringing together and
combining novel data sources: social media data, real-
time market & customer data and geospatial data for
analysis.
Data visualization
• the visual representation of data, in which information is
abstracted into some schematic form with the goal of
communicating it clearly and effectively through
graphical means. Given the magnitude of the data
involved, visualization is becoming increasingly
important.

Analytics dashboards
• many social media platforms require users to write
APIs to access feeds or program analytics models
in a programming language, such as Java.
• While reasonable for computer scientists, these
skills are typically beyond most (social science)
researchers.
• Non-programming interfaces are required for giving
what might be referred to as ‘deep’ access to ‘raw’
data, for example, configuring APIs, merging social
media feeds, combining holistic sources and
developing analytical models.


THANK YOU
