#CASHTAG
BIG DATA PIPELINE FOR USER SENTIMENT ANALYSIS
Shafi Bashar
MOTIVATION
• People have opinions
• Different sources, different mediums: Twitter, Reddit, Facebook, etc.
• Platform for aggregating and analyzing opinions on a topic
• v1.0: Users' opinions of the US stock market
DEMO
• Webpage: http://www.hashtagcashtag.com
• Video: http://youtu.be/7oMrJ7n1Hr4
• Alternate link: http://54.67.108.50
PIPELINE λ ARCHITECTURE
[Diagram: Data Ingestion, HDFS, Batch Layer (Batch View), Speed Layer (Real-time View), Serving Layer, Front End]
DATA INGESTION
• Two sources
1. Twitter Data
2. Stock Data
• Twitter Data from streaming API
{u'contributors': None,
u'coordinates': None,
u'created_at': u'Mon Feb 02 07:41:06 +0000 2015',
u'entities': {u'hashtags': [],
u'symbols': [{u'indices': [0, 3], u'text': u'FB'}],
u'urls': [{u'display_url': u'stocknewswires.com/2015/02/fb-goo\u2026',
u'expanded_url': u'http://www.stocknewswires.com/2015/02/fb-google-glass-might-be-googles-most-successful-failure-yet.html',
u'indices': [67, 89],
u'url': u'http://t.co/6iY3WYz82M'}],
u'user_mentions': []},
u'favorite_count': 0,
u'favorited': False,
u'geo': None,
u'id': 562153724219764737,
u'id_str': u'562153724219764737',
u'in_reply_to_screen_name': None,
u'in_reply_to_status_id': None,
u'in_reply_to_status_id_str': None,
u'in_reply_to_user_id': None,
u'in_reply_to_user_id_str': None,
u'lang': u'en',
u'metadata': {u'iso_language_code': u'en', u'result_type': u'recent'},
u'place': None,
u'possibly_sensitive': False,
u'retweet_count': 0,
u'retweeted': False,
u'source': u'<a href="http://fnsappbg.blogspot.com/" rel="nofollow">FNS_APP</a>',
u'text': u"$FB:\n\nGoogle Glass Might Be Google's Most Successful Failure Yet:\n\nhttp://t.co/6iY3WYz82M",
u'truncated': False,
u'user': {u'contributors_enabled': False,
u'created_at': u'Mon Nov 17 20:15:38 +0000 2014',
u'default_profile': True,
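As a rough illustration of how the ticker symbols might be pulled out of a tweet like the one above, here is a minimal Python sketch. It relies only on the `entities` / `symbols` / `text` fields visible in the sample; the function name `extract_cashtags` and the trimmed-down `sample` dict are hypothetical, and the original pipeline may well do this differently.

```python
def extract_cashtags(tweet):
    """Return the upper-cased ticker symbols tagged in a tweet dict."""
    # Tolerate tweets that lack the 'entities' or 'symbols' fields.
    entities = tweet.get("entities") or {}
    symbols = entities.get("symbols") or []
    return [s["text"].upper() for s in symbols if s.get("text")]


# Trimmed-down copy of the sample tweet (only the fields used here).
sample = {
    "created_at": "Mon Feb 02 07:41:06 +0000 2015",
    "entities": {"hashtags": [], "symbols": [{"indices": [0, 3], "text": "FB"}]},
    "text": "$FB:\n\nGoogle Glass Might Be Google's Most Successful Failure Yet",
}

print(extract_cashtags(sample))
```

Reading the `symbols` entity avoids re-parsing `$FB` out of the free text.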
DATA INGESTION
• Stock Data from www.netfonds.no
• Incremental CSV file for each individual stock
• Preprocessing to add ticker and timestamp
• Multi-topic, multi-consumer Kafka
20150126T153000 113.67 100 Auto trade
20150126T153000 113.65 161 Auto trade
20150126T153000 113.68 270 Auto trade
20150126T153000 113.67 100 Auto trade
20150126T153001 113.66 100 Auto trade
20150126T153001 113.65 100 Auto trade
20150126T153001 113.67 100 Auto trade
1422981307.82,AMGN,2015-02-03 17:17:05,149.76,300,,Auto trade,,,
1422981307.82,AMGN,2015-02-03 17:17:53,149.62,207,,Auto trade,,,
1422981307.83,AMGN,2015-02-03 17:16:08,149.81,100,,Auto trade,,,
1422981307.83,AMGN,2015-02-03 17:16:11,149.77,100,,Auto trade,,,
1422981308.26,LNKD,2015-02-03 17:15:28,228.2,200,,Auto trade,,,
1422981308.27,LNKD,2015-02-03 17:16:45,228.5,100,,Auto trade,,,
1422981308.27,LNKD,2015-02-03 17:19:02,229.27,100,,Auto trade,,,
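The preprocessing step (adding a ticker and a timestamp to each raw trade line) could look roughly like the Python sketch below. It assumes the raw line is whitespace-separated as `time price quantity source`, and it emits a simplified row without the empty columns seen in the sample output. The function name `preprocess_trade` and the ticker `XYZ` are hypothetical.

```python
import time
from datetime import datetime


def preprocess_trade(raw_line, ticker, ingest_ts=None):
    """Turn 'YYYYMMDDTHHMMSS price quantity source' into a CSV row."""
    # Split into at most four fields so a source such as "Auto trade" stays whole.
    trade_time, price, quantity, source = raw_line.split(None, 3)
    when = datetime.strptime(trade_time, "%Y%m%dT%H%M%S")
    if ingest_ts is None:
        ingest_ts = time.time()  # ingestion time as a Unix timestamp
    return ",".join([
        "%.2f" % ingest_ts,
        ticker,
        when.strftime("%Y-%m-%d %H:%M:%S"),
        price,
        quantity,
        source.strip(),
    ])


row = preprocess_trade("20150126T153000 113.67 100 Auto trade", "XYZ",
                       ingest_ts=1422981307.82)
print(row)
```

Each row produced this way carries its own ticker, so rows from many per-stock files can share one downstream topic.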
BATCH LAYER
• Spark batch job (written in Scala)
• Twitter
• Number of mentions and sentiment of the mentions per time granularity
• Top trending stocks
ticker | year | month | day | hour | minute | frequency | sentiment
--------+------+-------+-----+------+--------+-----------+-----------
TSLA | 2015 | 2 | 3 | 16 | 35 | 16 | 0
TSLA | 2015 | 2 | 3 | 16 | 34 | 1 | 0
TSLA | 2015 | 2 | 3 | 16 | 33 | 1 | 0
TSLA | 2015 | 2 | 3 | 16 | 32 | 1 | 0
TSLA | 2015 | 2 | 3 | 16 | 31 | 9 | 3
TSLA | 2015 | 2 | 3 | 16 | 30 | 4 | 0
year | month | day | hour | frequency | sentiment | ticker
------+-------+-----+------+-----------+-----------+--------
2015 | 1 | 31 | 13 | 33 | -3 | AAPL
2015 | 1 | 31 | 13 | 17 | 3 | TWTR
2015 | 1 | 31 | 13 | 16 | 2 | SBUX
2015 | 1 | 31 | 13 | 15 | 1 | KO
2015 | 1 | 31 | 13 | 14 | 3 | MCD
2015 | 1 | 31 | 13 | 13 | 0 | EBAY
2015 | 1 | 31 | 13 | 12 | -1 | MSFT
2015 | 1 | 31 | 13 | 11 | 0 | XOM
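The per-minute table above can be thought of as a group-by over (ticker, minute). The following plain-Python sketch shows that aggregation. The original job runs on Spark in Scala, so this is only an illustration of the logic; the function name `aggregate_mentions` and the input tuples are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def aggregate_mentions(mentions):
    """mentions: iterable of (ticker, created_at, sentiment_score) tuples."""
    totals = defaultdict(lambda: [0, 0])  # key -> [frequency, sentiment]
    for ticker, created_at, sentiment in mentions:
        # Twitter's created_at format, e.g. 'Mon Feb 02 07:41:06 +0000 2015'.
        t = datetime.strptime(created_at, "%a %b %d %H:%M:%S %z %Y")
        key = (ticker, t.year, t.month, t.day, t.hour, t.minute)
        totals[key][0] += 1
        totals[key][1] += sentiment
    return {key: tuple(value) for key, value in totals.items()}


mentions = [
    ("TSLA", "Tue Feb 03 16:35:10 +0000 2015", 0),
    ("TSLA", "Tue Feb 03 16:35:40 +0000 2015", 1),
    ("TSLA", "Tue Feb 03 16:34:05 +0000 2015", -1),
]
per_minute = aggregate_mentions(mentions)
print(per_minute)
```

Dropping `minute` from the key would give the hourly counts needed for a top-trending view.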
SENTIMENT ANALYSIS
Negative
"downgraded",
"bears",
"bear",
"bearish",
"volatile",
"short",
"sell",
"selling",
"forget",
"down",
"resistance",
"sold",
…
Positive
"upgrade",
"upgraded",
"long",
"buy",
"buying",
"growth",
"good",
"gained",
"well",
"great",
"nice",
"top",
…
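One simple way to turn word lists like these into a score is to count positive words and subtract negative ones. The slides do not state the exact scoring rule, so the Python sketch below is an assumption; it uses only the words listed on the slide (the lists there are truncated), and the function name `sentiment_score` is hypothetical.

```python
import re

# Only the words shown on the slide; the full lists are elided there.
NEGATIVE = {"downgraded", "bears", "bear", "bearish", "volatile", "short",
            "sell", "selling", "forget", "down", "resistance", "sold"}
POSITIVE = {"upgrade", "upgraded", "long", "buy", "buying", "growth",
            "good", "gained", "well", "great", "nice", "top"}


def sentiment_score(text):
    """Assumed rule: +1 per positive word, -1 per negative word."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum((w in POSITIVE) - (w in NEGATIVE) for w in words)


print(sentiment_score("$TSLA upgraded, great growth"))
print(sentiment_score("Bears are selling $FB"))
```

A lexicon score like this ignores negation and context; it is shown only because it matches the small integer sentiment values in the tables.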
BATCH LAYER
• Stocks
• high, low, open, close, volume
• Azkaban controls the flow and scheduling
• The batch layer uses a recomputation algorithm
ticker | year | month | day | hour | minute | close | high | low | open | volume
--------+------+-------+-----+------+--------+--------+--------+--------+--------+--------
TWTR | 2015 | 1 | 30 | 13 | 0 | 37.55 | 37.55 | 37.55 | 37.55 | 6740
TWTR | 2015 | 1 | 30 | 12 | 59 | 37.54 | 37.56 | 37.46 | 37.47 | 96070
TWTR | 2015 | 1 | 30 | 12 | 58 | 37.47 | 37.51 | 37.46 | 37.51 | 47839
TWTR | 2015 | 1 | 30 | 12 | 57 | 37.5 | 37.55 | 37.495 | 37.54 | 65830
TWTR | 2015 | 1 | 30 | 12 | 56 | 37.54 | 37.57 | 37.54 | 37.565 | 41758
TWTR | 2015 | 1 | 30 | 12 | 55 | 37.565 | 37.6 | 37.5 | 37.5 | 54317
TWTR | 2015 | 1 | 30 | 12 | 54 | 37.5 | 37.54 | 37.5 | 37.52 | 36129
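The open/high/low/close/volume rows above are per-minute bars built from individual trades. A minimal Python sketch of that roll-up follows, as an illustration of the logic rather than the original Scala job. The function name `minute_bars` is hypothetical, and the input reuses the AMGN trades from the ingestion sample.

```python
def minute_bars(trades):
    """trades: iterable of (ticker, 'YYYY-MM-DD HH:MM:SS', price, volume)."""
    bars = {}
    # Sort by timestamp string so open/close follow trade order.
    for ticker, ts, price, volume in sorted(trades, key=lambda t: t[1]):
        key = (ticker, ts[:16])  # 'YYYY-MM-DD HH:MM' is the minute bucket
        bar = bars.get(key)
        if bar is None:
            bars[key] = {"open": price, "high": price, "low": price,
                         "close": price, "volume": volume}
        else:
            bar["high"] = max(bar["high"], price)
            bar["low"] = min(bar["low"], price)
            bar["close"] = price
            bar["volume"] += volume
    return bars


trades = [
    ("AMGN", "2015-02-03 17:17:05", 149.76, 300),
    ("AMGN", "2015-02-03 17:17:53", 149.62, 207),
    ("AMGN", "2015-02-03 17:16:08", 149.81, 100),
    ("AMGN", "2015-02-03 17:16:11", 149.77, 100),
]
bars = minute_bars(trades)
print(bars[("AMGN", "2015-02-03 17:16")])
```

Because the batch layer recomputes from the full data set, a function like this can simply be rerun over all trades on each scheduled run.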
SPEED LAYER
• Spark Streaming (code written in Scala)
• Task 1: Incremental algorithm to supplement the batch layer in tab 3
• Task 2: Rolling count for the dashboard in tab 1
[Figure: data over time, covered by two batch operations and a series of speed-layer operations]
id | timestamp | frequency | sentiment | ticker
----+------------+-----------+-----------+--------
0 | 1430375561 | 55 | -5 | AAPL
0 | 1430370589 | 55 | -5 | AAPL
0 | 1430365508 | 54 | -5 | AAPL
0 | 1430360540 | 54 | -5 | AAPL
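To illustrate how an incremental speed layer can supplement a recomputed batch view, here is a small Python sketch. It is an assumption about the general lambda-architecture pattern, not the author's Spark Streaming code; the names `RealtimeView` and `query` and the minute-string keys are hypothetical.

```python
class RealtimeView:
    """Counts updated one tweet at a time, without recomputation."""

    def __init__(self):
        self.counts = {}  # (ticker, minute) -> (frequency, sentiment)

    def update(self, ticker, minute, sentiment):
        freq, total = self.counts.get((ticker, minute), (0, 0))
        self.counts[(ticker, minute)] = (freq + 1, total + sentiment)


def query(batch_view, realtime_view, ticker, minute):
    """Prefer the batch result; fall back to the speed layer otherwise."""
    key = (ticker, minute)
    if key in batch_view:
        return batch_view[key]
    return realtime_view.counts.get(key, (0, 0))


batch_view = {("AAPL", "2015-02-03 16:30"): (4, 0)}
realtime = RealtimeView()
realtime.update("AAPL", "2015-02-03 16:35", 1)
realtime.update("AAPL", "2015-02-03 16:35", -1)

print(query(batch_view, realtime, "AAPL", "2015-02-03 16:30"))
print(query(batch_view, realtime, "AAPL", "2015-02-03 16:35"))
```

In this sketch the batch value wins whenever it exists, so the speed layer only fills in minutes the last batch run has not yet covered.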
SERVING LAYER
• De-normalized tables in Cassandra
• Twitter time series
• partitioned by ticker symbol
• clustering order by (year, month, day, hour, minute)
• Top trending stocks
• partitioned by (year, month, day, hour)
• clustering order by number of mentions
ticker | year | month | day | hour | minute | frequency | sentiment
--------+------+-------+-----+------+--------+-----------+-----------
TSLA | 2015 | 2 | 3 | 16 | 35 | 16 | 0
TSLA | 2015 | 2 | 3 | 16 | 34 | 1 | 0
TSLA | 2015 | 2 | 3 | 16 | 33 | 1 | 0
TSLA | 2015 | 2 | 3 | 16 | 32 | 1 | 0
TSLA | 2015 | 2 | 3 | 16 | 31 | 9 | 3
TSLA | 2015 | 2 | 3 | 16 | 30 | 4 | 0
year | month | day | hour | frequency | sentiment | ticker
------+-------+-----+------+-----------+-----------+--------
2015 | 1 | 31 | 13 | 33 | -3 | AAPL
2015 | 1 | 31 | 13 | 17 | 3 | TWTR
2015 | 1 | 31 | 13 | 16 | 2 | SBUX
2015 | 1 | 31 | 13 | 15 | 1 | KO
2015 | 1 | 31 | 13 | 14 | 3 | MCD
2015 | 1 | 31 | 13 | 13 | 0 | EBAY
2015 | 1 | 31 | 13 | 12 | -1 | MSFT
2015 | 1 | 31 | 13 | 11 | 0 | XOM
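The effect of choosing a partition key and a clustering order can be modelled in a few lines of Python. This toy in-memory class is not Cassandra code; it only mimics how rows for one hour live together and stay sorted by number of mentions, so that a "top N" read is a prefix of one partition. The class name `TopTrendingTable` is hypothetical.

```python
from collections import defaultdict


class TopTrendingTable:
    """Toy model: partitioned by hour, clustered by mentions (descending)."""

    def __init__(self):
        self.partitions = defaultdict(list)

    def insert(self, year, month, day, hour, frequency, sentiment, ticker):
        rows = self.partitions[(year, month, day, hour)]
        rows.append((frequency, sentiment, ticker))
        # Keep each partition ordered by number of mentions, highest first.
        rows.sort(key=lambda r: r[0], reverse=True)

    def top(self, year, month, day, hour, limit=10):
        # A single-partition read; no cross-partition scan is needed.
        return self.partitions.get((year, month, day, hour), [])[:limit]


table = TopTrendingTable()
table.insert(2015, 1, 31, 13, 17, 3, "TWTR")
table.insert(2015, 1, 31, 13, 33, -3, "AAPL")
table.insert(2015, 1, 31, 13, 16, 2, "SBUX")
print(table.top(2015, 1, 31, 13, limit=2))
```

This is the point of de-normalizing: each table is laid out for exactly one query, so the front end never has to sort or join at read time.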
SPEED VIEW
• Cassandra TTL support could be used for the rolling-count operation in the dashboard application
• But TTL is not available in the Cassandra-Spark connector
• Instead, add a timestamp and ranking to each ticker generated in each 5-second window
• Partitioned by ranking, clustering order by timestamp
id | timestamp | frequency | sentiment | ticker
----+------------+-----------+-----------+--------
0 | 1430375561 | 55 | -5 | AAPL
0 | 1430370589 | 55 | -5 | AAPL
0 | 1430365508 | 54 | -5 | AAPL
0 | 1430360540 | 54 | -5 | AAPL
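The rows above, where `id` appears to hold the rank and AAPL sits at rank 0, could be produced per window along these lines. This is a Python sketch of the ranking step only. The function name `rank_window`, the tie-break by ticker name, and the sample counts for TWTR and FB are assumptions.

```python
def rank_window(window_counts, window_end_ts, top_n=5):
    """window_counts: {ticker: (frequency, sentiment)} for one 5-second window."""
    # Highest frequency first; ties broken alphabetically (an assumption).
    ranked = sorted(window_counts.items(), key=lambda kv: (-kv[1][0], kv[0]))
    return [
        {"id": rank, "timestamp": window_end_ts, "frequency": freq,
         "sentiment": sent, "ticker": ticker}
        for rank, (ticker, (freq, sent)) in enumerate(ranked[:top_n])
    ]


rows = rank_window({"AAPL": (55, -5), "TWTR": (17, 3), "FB": (20, 1)},
                   1430375561, top_n=2)
for row in rows:
    print(row)
```

With rows partitioned by rank and clustered by timestamp, the dashboard can read the newest row of each rank to get the current leaderboard without relying on TTL-based expiry.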
SHAFI BASHAR
• PhD, ECE, UC Davis
• Present - Intel Corporation
• Worked on 4G LTE, WiFi standardization
• Interests - Algorithms, Machine Learning
• Activities - backpacking, skiing, running,
photography
[Figure: data rates across wireless generations (2.5G, 3G, 4G): GPRS, WCDMA, HSPA, WiMAX, HSPA+, 4G LTE, 4G LTE-Advanced; rates shown: 40 Kbps, 384 Kbps, 14 Mbps, 128 Mbps, 168 Mbps, 300 Mbps, 1 Gbps]
