Fake News and Their Detection
Data Science and Big Data Analysis
Professor: Antonino Nocera
Team Name: 4V’s
Group members:
Arnold Fonkou
Vignesh Kumar Kembu
Ashina Nurkoo
Seyedkourosh Sajjadi
WELFake
The Fake News Detection (WELFake) dataset
contains 72,134 news articles: 35,028 real
and 37,106 fake.
This dataset is part of ongoing research on
"Fake News Prediction on Social Media
Website" within the doctoral degree program
of Mr. Pawan Kumar Verma and is partially
supported by the ARTICONF project funded by
the European Union’s Horizon 2020 research
and innovation program.
Columns:
- Serial number (starting from 0)
- Title (the news headline)
- Text (the news content)
- Label (0 = fake and 1 = real)
Architecture
(Diagram: Data Stream → Ingestion (PySpark) → Hadoop (HDFS, MapReduce) → MongoDB Sandbox → Analysis)
Ingestion
From CSV to JSON
Data Conversion
We converted the CSV file into JSON to
better reflect a realistic data feed (a
minimal sketch of this step follows below).
Reading Data
Using PySpark
We used the Spark DataFrame API to read
our big data.
Saving to Hadoop
Write into Hadoop
We read from the DataFrame and then write
it to HDFS.
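Data Conversion (sketch)
A minimal sketch of the CSV-to-JSON conversion, assuming pandas; the input file name 'WELFake_Dataset.csv' is an assumption, while the output name matches the file read in the next slide.
import pandas as pd

csv_df = pd.read_csv('WELFake_Dataset.csv')

# Write a pretty-printed JSON array so it can be read back with
# spark.read.option("multiline", "true")
csv_df.to_json('project_data_sample.json', orient='records', indent=2)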
Reading Section
import findspark
findspark.init()

import pyspark
from pyspark.sql import *

spark = SparkSession.builder \
    .master("local[1]") \
    .appName("PySpark Read JSON") \
    .getOrCreate()

# Reading a multiline JSON file
multiline_dataframe = spark.read.option("multiline", "true") \
    .json("project_data_sample.json")
multiline_dataframe.head()
Saving Section
multiline_dataframe.write.save('/usr/local/hadoop/user3/dsba1.json', format='json')

Reading the data back confirms the write:

df = spark.read.format('json').load('/usr/local/hadoop/user3/dsba1.json')
df.show()
Hadoop
Components: HDFS, MapReduce
Mapper (BoW Creation)
Read Lines
Input Data
The data is given as input
lines to the mapper.
Extract Text
Title and Text Extraction
After reading each line as
a JSON object, we
extract the title and the
text related to that piece
of news from it.
Tokenize
Word Extraction
We perform some data
cleaning and then we
extract every single word
from it.
Text Cleaning
import sys
import re
import json

def clean_text(text):
    # Remove URLs
    text = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*(),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', text)
    # Keep only letters and whitespace
    text = re.sub(r'[^a-zA-Z\s]+', '', text)
    return text
Tokenizing
def tokenize(text):
    if not isinstance(text, str):
        text = str(text)
    text = clean_text(text)
    text = text.lower()
    return text.split()
Execution
for line in sys.stdin:
    line = line.strip()
    try:
        json_obj = json.loads(line)
    except json.JSONDecodeError:
        continue
    title = json_obj.get("title", "")
    text = json_obj.get("text", "")
    title_words = tokenize(title)
    text_words = tokenize(text)
    for word in title_words + text_words:
        print(f"{word}\t1")
Reducer
Read Lines
Input Data
The data is given as input lines, each
containing two tab-separated elements:
a word and a count.
Initialize Counter
Word and Count
Extraction
Each line is split into its word and
count; the count is then added to a
running Counter for that word.
Create BoW
Dictionary
Create a dictionary and
add each word as the key
and its associated count
value as the value.
Counter Initialization
import sys
from collections import Counter
import json
bag_of_words = Counter()
Execution
for line in sys.stdin:
    line = line.strip()
    try:
        word, count = line.split("\t")
    except ValueError:
        continue
    bag_of_words[word] += int(count)

with open('bow_data.json', 'w') as f:
    json.dump(bag_of_words, f)
Moving to MongoDB
import json
from pymongo import MongoClient
client = MongoClient('mongodb://localhost:27017')
db = client['bow']
collection = db['bow_collection']
with open('bow_data.json', 'r') as f:
    bow_data = json.load(f)
collection.insert_one(bow_data)
Performing MapReduce Operation
In the Terminal:
cat db.json | python3 bow_mapper.py | sort | python3 bow_reducer.py
HDFS
When dealing with big data, we can
partition the dataset into several batches
instead of saving it to a single file.
Instead of:
multiline_dataframe.write.save('/usr/local/hadoop/user3/dsba1.json', format='json')

Use:
partitioned_df = multiline_dataframe.repartition(4, "Unnamed: 0")
partitioned_df.write.save('/usr/local/hadoop/user3/dsba1.json', format='json')

partition_counts = partitioned_df.rdd.mapPartitions(lambda it: [sum(1 for _ in it)]).collect()
print(partition_counts)
# [482, 480, 519, 519]
Create
Database
Create a database to hold the data.
Import From
Hadoop
Import the JSON File
from Hadoop via
PySpark.
View Data &
Backup
View the data and, if it was inserted
correctly, create a backup before starting
the modifications.
Clean Data
Remove
non-alphanumeric
characters.
Display
Modified Data
Display the modified
content to view
changes.
MongoDB
Creating Database
Use an existing database or create a new one:
>use dsdb_dev
Viewing Data
>use dsdb_dev
>show collections
>db.fake_real_news.find()
>db.fake_real_news.aggregate([{$group : {_id: "$label", rest_number : {$sum : 1}}}])
Creating a Copy
In the Terminal:
mongodump --db dsdb_dev --collection fake_real_news --out /home/ds/Documents/
Importing From Hadoop
In the Terminal:
mongoimport --db dsdb_dev --collection fake_real_news --file /usr/local/hadoop/user3/dsba1.json/part-00000-d1623440-4fde-4b72-b87d-5943bec596d3-c000.json
Importing from Hadoop Using PySpark
df = spark.read.json("/usr/local/hadoop/user3/dsba1.json/part-00000-d1623440-4fde-4b72-b87d-5943bec596d3-c000.json")
sampled_df = df.sample(fraction=0.8, seed=42)

from pymongo import MongoClient
conn = MongoClient()
db = conn.dsdb_dev
collection = db['sampled_data']

# Write the sampled records to a local JSON-lines file, then load them into MongoDB
json_data = sampled_df.toJSON().collect()
with open('sampled_data.json', 'w') as file:
    for line in json_data:
        file.write(line + '\n')

import json
with open('sampled_data.json') as file:
    data = file.readlines()
collection.insert_many([json.loads(line) for line in data])
Data Cleaning
>db.fake_real_news.aggregate([
  { '$project': { '_id': 1, 'Unnamed: 0': 1, 'label': 1, 'text': 1, 'title': 1 } }
]).forEach(function(doc) {
  if (doc.title) {
    var newTitle = doc.title.replace(/[^a-zA-Z0-9 ]/g, '');
    db.fake_real_news.update({ '_id': doc._id }, { '$set': { 'title': newTitle } });
  }
});
Modified Content Display
>db.fake_real_news.aggregate([
  { '$project': { '_id': 1, 'Unnamed: 0': 1, 'label': 1, 'text': 1, 'title': 1 } }
]);
The file is now ready for word occurrence counting,
which can be done using Jupyter Notebook and
PyMongo.
Backup Restoration
If needed, restore the original collection:
>db.fake_real_news.drop()
mongorestore --db dsdb_dev --collection fake_real_news
/home/ds/Documents/dsdb_dev/fake_real_news.bson
Count the Number of Words
db.fake_real_news.aggregate([
    {
        '$match': {
            'label': "0"  # keep only fake news (label = 0)
        }
    },
    {
        '$project': {
            # Split the lowercase title into an array of words
            'words': {'$split': [{'$toLower': '$title'}, ' ']}
        }
    },
    {
        '$unwind': '$words'  # one document per word
    },
    {
        '$group': {
            '_id': {'word': '$words'},  # group by word
            'count': {'$sum': 1}
        }
    },
    {
        '$project': {
            # Return only the word and its count
            'word': '$_id.word',
            'count': 1
        }
    },
    {
        '$match': {
            'word': {'$ne': None}  # exclude null or missing values
        }
    },
    {
        '$match': {
            '$expr': {'$ne': ['$word', '']}  # exclude empty strings
        }
    },
    {
        '$sort': {'count': -1}
    }
])
Hypotheses
H1
Fake news is generated with heavier use of
stop words.
Metric: the average number of stop words
in the title should be higher for fake news.
H2
Real news should be short and crisp to get
its message across easily.
Metric: fake news should be longer than
real news.
H1
We used NLTK to extract stop words from
the title column and compared the
averages between fake and real titles.
The hypothesis is false: as the figure shows,
the average number of stop words in fake
news titles (label 0) is lower than in real
news titles (label 1).
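H1 check (sketch)
A minimal sketch of this comparison, assuming NLTK, pandas, and PyMongo and reading the cleaned collection from the earlier steps; the variable names are assumptions.
import nltk
import pandas as pd
from nltk.corpus import stopwords
from pymongo import MongoClient

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

client = MongoClient('mongodb://localhost:27017')
db = client['dsdb_dev']
df = pd.DataFrame(list(db['fake_real_news'].find({}, {'title': 1, 'text': 1, 'label': 1})))

# Count stop words per title and average per class (0 = fake, 1 = real)
df['title'] = df['title'].fillna('')
df['stopword_count'] = df['title'].apply(
    lambda t: sum(1 for w in str(t).lower().split() if w in stop_words))
print(df.groupby('label')['stopword_count'].mean())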
H2
The hypothesis is true, as shown by the figures:
fake news (0) tends to be longer than real news
(1).
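H2 check (sketch)
A sketch of the length comparison, reusing the DataFrame df loaded in the H1 sketch above; length is measured in characters, an assumption since the slide does not state the unit.
# Average text length per class (0 = fake, 1 = real)
df['text_length'] = df['text'].fillna('').astype(str).str.len()
print(df.groupby('label')['text_length'].mean())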
Insights on Data &
Pre-processing
To gain quick insights from the data, we
used word clouds for the titles overall and
for fake/real data.
(Word clouds: Real News and Fake News)
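Word cloud generation (sketch)
An illustrative sketch, assuming the wordcloud and matplotlib packages and the DataFrame df from the earlier sketches; labels are compared as strings since the stored type may vary after import.
import matplotlib.pyplot as plt
from wordcloud import WordCloud

for label, name in [('0', 'Fake News'), ('1', 'Real News')]:
    titles = ' '.join(df.loc[df['label'].astype(str) == label, 'title'].astype(str))
    wc = WordCloud(width=800, height=400, background_color='white').generate(titles)
    plt.figure()
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.title(name)
plt.show()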
Null Values
The title column contains some null
values, which may cause issues in data
analysis or processing.
We need to fill the null values in the title
column to ensure accurate data analysis.
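Null filling (sketch)
One possible way to fill the null titles directly in MongoDB, assuming the PyMongo db handle from the earlier sketch; replacing nulls with an empty string is an assumption.
# Replace missing or null titles with an empty string
db['fake_real_news'].update_many(
    {'$or': [{'title': None}, {'title': {'$exists': False}}]},
    {'$set': {'title': ''}}
)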
Text Normalization
To further prepare the data, we applied text normalization techniques, including converting
the title and text to lowercase and removing punctuation marks.
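Text normalization (sketch)
A minimal sketch of this step, assuming the pandas DataFrame df with 'title' and 'text' columns from the earlier sketches.
import string

def normalize(s):
    # Lowercase and strip punctuation
    s = str(s).lower()
    return s.translate(str.maketrans('', '', string.punctuation))

for col in ['title', 'text']:
    df[col] = df[col].fillna('').apply(normalize)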
Classification Model
For the binary classification of the news, we chose a
Random Forest classifier.
The data was split into X and y variables, and a
train/test split of 77 / 33 was performed.
A bag-of-words representation was built from the news
text (X_train and X_test), removing English stop words.
The labels (y_train and y_test) hold the class of each
article (fake = 0, real = 1).
The training data was fed to a RandomForestClassifier
with 500 trees; the model was then evaluated on the test
data, and the resulting confusion matrix is shown below,
followed by a sketch of the pipeline.
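Classification pipeline (sketch)
A minimal sketch of the described pipeline, assuming scikit-learn and the pandas DataFrame df with 'text' and 'label' columns from the earlier sketches; the 33% test size follows the slide, while the random seed and remaining parameters are assumptions.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X = df['text'].fillna('').astype(str)
y = df['label'].astype(int)  # fake = 0, real = 1

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# Bag of words on the news text, removing English stop words
vectorizer = CountVectorizer(stop_words='english')
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)

# Random forest with 500 trees, as stated on the slide
clf = RandomForestClassifier(n_estimators=500, random_state=42)
clf.fit(X_train_bow, y_train)

y_pred = clf.predict(X_test_bow)
print(confusion_matrix(y_test, y_pred))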
Thank You For Your Attention!