#CASHTAG
BIG DATA PIPELINE FOR USER SENTIMENT ANALYSIS
Shafi Bashar
MOTIVATION
• People have opinions
• Different sources, different mediums: Twitter, Reddit, Facebook, etc.
• Platform for aggregating opinions and analyzing them by topic
• v1.0: users' opinions of the US stock market
DEMO
• Webpage
http://www.hashtagcashtag.com
• Video
http://youtu.be/7oMrJ7n1Hr4
• Alternate Link
http://54.67.108.50
PIPELINE λ ARCHITECTURE
[Architecture diagram with components: Data Ingestion, HDFS, Batch Layer, Speed Layer, Serving Layer, Batch View, Real-time View, Front End]
DATA INGESTION
• Two sources
1. Twitter Data
2. Stock Data
• Twitter Data from streaming API
{u'contributors': None,
u'coordinates': None,
u'created_at': u'Mon Feb 02 07:41:06 +0000 2015',
u'entities': {u'hashtags': [],
u'symbols': [{u'indices': [0, 3], u'text': u'FB'}],
u'urls': [{u'display_url': u'stocknewswires.com/2015/02/fb-goo\u2026',
u'expanded_url': u'http://www.stocknewswires.com/2015/02/fb-google-glass-might-be-googles-most-successful-failure-yet.html',
u'indices': [67, 89],
u'url': u'http://t.co/6iY3WYz82M'}],
u'user_mentions': []},
u'favorite_count': 0,
u'favorited': False,
u'geo': None,
u'id': 562153724219764737,
u'id_str': u'562153724219764737',
u'in_reply_to_screen_name': None,
u'in_reply_to_status_id': None,
u'in_reply_to_status_id_str': None,
u'in_reply_to_user_id': None,
u'in_reply_to_user_id_str': None,
u'lang': u'en',
u'metadata': {u'iso_language_code': u'en', u'result_type': u'recent'},
u'place': None,
u'possibly_sensitive': False,
u'retweet_count': 0,
u'retweeted': False,
u'source': u'<a href="http://fnsappbg.blogspot.com/" rel="nofollow">FNS_APP</a>',
u'text': u"$FB:\n\nGoogle Glass Might Be Google's Most Successful Failure Yet:\n\nhttp://t.co/6iY3WYz82M",
u'truncated': False,
u'user': {u'contributors_enabled': False,
u'created_at': u'Mon Nov 17 20:15:38 +0000 2014',
u'default_profile': True,
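
As a minimal sketch of how the ticker mentions could be pulled out of a tweet like the one above, the Python 3 snippet below reads the `entities.symbols` list and parses `created_at`. The function names and the trimmed `sample_tweet` are illustrative assumptions, not code from the project.

```python
from datetime import datetime, timezone


def extract_cashtags(tweet):
    """Return the upper-cased ticker symbols mentioned in one tweet dict."""
    symbols = tweet.get("entities", {}).get("symbols", [])
    return [s["text"].upper() for s in symbols if s.get("text")]


def parse_created_at(tweet):
    """Parse the 'created_at' string into a timezone-aware datetime."""
    return datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")


# Trimmed copy of the sample tweet (only the fields used here).
sample_tweet = {
    "created_at": "Mon Feb 02 07:41:06 +0000 2015",
    "entities": {"hashtags": [], "symbols": [{"indices": [0, 3], "text": "FB"}]},
    "lang": "en",
}

print(extract_cashtags(sample_tweet))
print(parse_created_at(sample_tweet).isoformat())
```

Reading the `symbols` entity avoids re-parsing the tweet text for `$`-prefixed tokens.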
DATA INGESTION
• Stock Data from www.netfonds.no
• Incremental CSV file for each individual stock
• Preprocessing to add ticker and timestamp
• Multi-topic, multi-consumer Kafka
20150126T153000 113.67 100 Auto trade
20150126T153000 113.65 161 Auto trade
20150126T153000 113.68 270 Auto trade
20150126T153000 113.67 100 Auto trade
20150126T153001 113.66 100 Auto trade
20150126T153001 113.65 100 Auto trade
20150126T153001 113.67 100 Auto trade
1422981307.82,AMGN,2015-02-03 17:17:05,149.76,300,,Auto trade,,,
1422981307.82,AMGN,2015-02-03 17:17:53,149.62,207,,Auto trade,,,
1422981307.83,AMGN,2015-02-03 17:16:08,149.81,100,,Auto trade,,,
1422981307.83,AMGN,2015-02-03 17:16:11,149.77,100,,Auto trade,,,
1422981308.26,LNKD,2015-02-03 17:15:28,228.2,200,,Auto trade,,,
1422981308.27,LNKD,2015-02-03 17:16:45,228.5,100,,Auto trade,,,
1422981308.27,LNKD,2015-02-03 17:19:02,229.27,100,,Auto trade,,,
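
The preprocessing step can be sketched as follows. This is an illustration under assumptions: the raw file is taken to be tab-separated with the columns time, price, quantity, board, source, buyer, seller, initiator (the slide shows only the non-empty ones), and the function name and the ticker `XYZ` are hypothetical.

```python
import csv
import io
from datetime import datetime


def preprocess_rows(raw_text, ticker, ingest_time):
    """Prefix each raw trade row with an ingestion timestamp and a ticker."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(raw_text), delimiter="\t"):
        if not row:
            continue  # skip blank lines
        # 20150126T153000 -> 2015-01-26 15:30:00
        trade_time = datetime.strptime(row[0], "%Y%m%dT%H%M%S")
        writer.writerow(
            [f"{ingest_time:.2f}", ticker, trade_time.strftime("%Y-%m-%d %H:%M:%S")]
            + row[1:]
        )
    return out.getvalue()


# One raw row in the assumed tab-separated layout (empty board, buyer, seller, initiator).
raw = "20150126T153000\t113.67\t100\t\tAuto trade\t\t\t\n"
print(preprocess_rows(raw, "XYZ", 1422981307.82), end="")
```

Each output line would then be published to a Kafka topic; the producer call is left out because the slides do not name the client library.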
BATCH LAYER
• Spark batch job (written in Scala)
• Twitter
• Number of mentions and sentiment of the mentions per time granularity
• Top trending stocks
ticker | year | month | day | hour | minute | frequency | sentiment
--------+------+-------+-----+------+--------+-----------+-----------
TSLA | 2015 | 2 | 3 | 16 | 35 | 16 | 0
TSLA | 2015 | 2 | 3 | 16 | 34 | 1 | 0
TSLA | 2015 | 2 | 3 | 16 | 33 | 1 | 0
TSLA | 2015 | 2 | 3 | 16 | 32 | 1 | 0
TSLA | 2015 | 2 | 3 | 16 | 31 | 9 | 3
TSLA | 2015 | 2 | 3 | 16 | 30 | 4 | 0
year | month | day | hour | frequency | sentiment | ticker
------+-------+-----+------+-----------+-----------+--------
2015 | 1 | 31 | 13 | 33 | -3 | AAPL
2015 | 1 | 31 | 13 | 17 | 3 | TWTR
2015 | 1 | 31 | 13 | 16 | 2 | SBUX
2015 | 1 | 31 | 13 | 15 | 1 | KO
2015 | 1 | 31 | 13 | 14 | 3 | MCD
2015 | 1 | 31 | 13 | 13 | 0 | EBAY
2015 | 1 | 31 | 13 | 12 | -1 | MSFT
2015 | 1 | 31 | 13 | 11 | 0 | XOM
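
The two batch views above can be sketched in plain Python as a group-by over (ticker, minute) followed by an hourly roll-up ranked by frequency. The project ran this as a Spark job in Scala, so the snippet only illustrates the aggregation logic. The input tuple layout and function names are assumptions.

```python
from collections import defaultdict
from datetime import datetime, timezone


def aggregate_mentions(mentions):
    """mentions: iterable of (ticker, epoch_seconds, sentiment_score).

    Returns {(ticker, year, month, day, hour, minute): (frequency, sentiment)}.
    """
    totals = defaultdict(lambda: [0, 0])
    for ticker, ts, score in mentions:
        t = datetime.fromtimestamp(ts, tz=timezone.utc)
        key = (ticker, t.year, t.month, t.day, t.hour, t.minute)
        totals[key][0] += 1       # number of mentions in this minute
        totals[key][1] += score   # summed sentiment in this minute
    return {k: tuple(v) for k, v in totals.items()}


def top_trending(per_minute, year, month, day, hour, n=3):
    """Roll minute buckets up to one hour and rank tickers by frequency."""
    hourly = defaultdict(lambda: [0, 0])
    for (ticker, y, mo, d, h, _minute), (freq, sent) in per_minute.items():
        if (y, mo, d, h) == (year, month, day, hour):
            hourly[ticker][0] += freq
            hourly[ticker][1] += sent
    ranked = sorted(hourly.items(), key=lambda kv: kv[1][0], reverse=True)
    return [(ticker, freq, sent) for ticker, (freq, sent) in ranked[:n]]


def _ts(minute, second):
    return datetime(2015, 2, 3, 16, minute, second, tzinfo=timezone.utc).timestamp()


mentions = [
    ("TSLA", _ts(31, 5), 1),
    ("TSLA", _ts(31, 40), -1),
    ("TSLA", _ts(32, 0), 0),
    ("AAPL", _ts(31, 10), 1),
]
per_minute = aggregate_mentions(mentions)
print(top_trending(per_minute, 2015, 2, 3, 16, n=1))
```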
SENTIMENT ANALYSIS
Negative
"downgraded",
"bears",
"bear",
"bearish",
"volatile",
"short",
"sell",
"selling",
"forget",
"down",
"resistance",
"sold",
…
Positive
"upgrade",
"upgraded",
"long",
"buy",
"buying",
"growth",
"good",
"gained",
"well",
"great",
"nice",
"top",
…
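
A word-list approach like this can be sketched as a simple count: +1 for each positive word and -1 for each negative word in the tweet text. The scoring rule and tokenization are assumptions, since the slides show only the (partial) word lists. The sets below contain just the words listed on the slide.

```python
import re

# Only the words visible on the slide; the full lists are elided there.
NEGATIVE = {
    "downgraded", "bears", "bear", "bearish", "volatile", "short",
    "sell", "selling", "forget", "down", "resistance", "sold",
}
POSITIVE = {
    "upgrade", "upgraded", "long", "buy", "buying", "growth",
    "good", "gained", "well", "great", "nice", "top",
}


def sentiment_score(text):
    """Return (#positive words) - (#negative words) for one tweet text."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum((w in POSITIVE) - (w in NEGATIVE) for w in words)


print(sentiment_score("Upgraded to buy, great growth"))
print(sentiment_score("Bearish, selling and going short"))
```

Summing these per-tweet scores over a time bucket would give a net sentiment value like the one in the `sentiment` column of the batch views.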
BATCH LAYER
• Stocks
• high, low, open, close, volume
• Azkaban controls the job flow and scheduling
• Batch layer uses a recomputation algorithm
ticker | year | month | day | hour | minute | close | high | low | open | volume
--------+------+-------+-----+------+--------+--------+--------+--------+--------+--------
TWTR | 2015 | 1 | 30 | 13 | 0 | 37.55 | 37.55 | 37.55 | 37.55 | 6740
TWTR | 2015 | 1 | 30 | 12 | 59 | 37.54 | 37.56 | 37.46 | 37.47 | 96070
TWTR | 2015 | 1 | 30 | 12 | 58 | 37.47 | 37.51 | 37.46 | 37.51 | 47839
TWTR | 2015 | 1 | 30 | 12 | 57 | 37.5 | 37.55 | 37.495 | 37.54 | 65830
TWTR | 2015 | 1 | 30 | 12 | 56 | 37.54 | 37.57 | 37.54 | 37.565 | 41758
TWTR | 2015 | 1 | 30 | 12 | 55 | 37.565 | 37.6 | 37.5 | 37.5 | 54317
TWTR | 2015 | 1 | 30 | 12 | 54 | 37.5 | 37.54 | 37.5 | 37.52 | 36129
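
The per-minute open/high/low/close/volume view can be sketched as below. In keeping with the recomputation approach, the function rebuilds every bar from the full list of trades on each run rather than updating earlier results. The function name and input layout are assumptions. The sample trades are the AMGN rows from the ingestion slide.

```python
from datetime import datetime


def minute_bars(trades):
    """trades: iterable of (ticker, 'YYYY-MM-DD HH:MM:SS', price, volume).

    Returns {(ticker, year, month, day, hour, minute): bar dict}.
    """
    bars = {}
    # Sort by trade time so 'open' is the first and 'close' the last trade.
    for ticker, ts, price, volume in sorted(trades, key=lambda t: t[1]):
        t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        key = (ticker, t.year, t.month, t.day, t.hour, t.minute)
        bar = bars.get(key)
        if bar is None:
            bars[key] = {"open": price, "high": price, "low": price,
                         "close": price, "volume": volume}
        else:
            bar["high"] = max(bar["high"], price)
            bar["low"] = min(bar["low"], price)
            bar["close"] = price
            bar["volume"] += volume
    return bars


trades = [
    ("AMGN", "2015-02-03 17:17:05", 149.76, 300),
    ("AMGN", "2015-02-03 17:17:53", 149.62, 207),
    ("AMGN", "2015-02-03 17:16:08", 149.81, 100),
    ("AMGN", "2015-02-03 17:16:11", 149.77, 100),
]
bars = minute_bars(trades)
print(bars[("AMGN", 2015, 2, 3, 17, 16)])
```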
SPEED LAYER
• Spark Streaming (code written in Scala)
• Task 1: incremental algorithm to supplement the batch layer (tab 3)
• Task 2: rolling count operation for the dashboard (tab 1)
[Diagram: batch operations and speed-layer operations plotted against data over time]
id | timestamp | frequency | sentiment | ticker
----+------------+-----------+-----------+--------
0 | 1430375561 | 55 | -5 | AAPL
0 | 1430370589 | 55 | -5 | AAPL
0 | 1430365508 | 54 | -5 | AAPL
0 | 1430360540 | 54 | -5 | AAPL
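
The rolling count in Task 2 can be illustrated with a small sliding-window counter. This is a plain-Python sketch of the idea only: the project used Spark Streaming, which provides windowed operations such as `reduceByKeyAndWindow`, and the slides do not say which one was used. The class name, the window length, and the assumption that events arrive in timestamp order are all illustrative.

```python
from collections import Counter, deque


class RollingCounter:
    """Counts ticker mentions seen within the last `window_seconds`."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()    # (timestamp, ticker) in arrival order
        self.counts = Counter()  # current per-ticker counts in the window

    def add(self, timestamp, ticker):
        self.events.append((timestamp, ticker))
        self.counts[ticker] += 1
        self._evict(timestamp)

    def _evict(self, now):
        # Drop events that have fallen out of the window.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, old = self.events.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]

    def top(self, n):
        return self.counts.most_common(n)


rc = RollingCounter(window_seconds=60)
rc.add(0, "AAPL")
rc.add(10, "AAPL")
rc.add(20, "TSLA")
print(rc.top(2))
rc.add(65, "TSLA")  # the event at t=0 is now outside the 60 s window
print(rc.top(1))
```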
SERVING LAYER
• De-normalized tables in Cassandra
• Twitter Time Series
• partitioned by ticker symbol
• clustering order by (year, month, day, hour, minute)
• Top Trending Stocks
• partitioned by (year, month, day, hour)
• clustering order by number of mentions
ticker | year | month | day | hour | minute | frequency | sentiment
--------+------+-------+-----+------+--------+-----------+-----------
TSLA | 2015 | 2 | 3 | 16 | 35 | 16 | 0
TSLA | 2015 | 2 | 3 | 16 | 34 | 1 | 0
TSLA | 2015 | 2 | 3 | 16 | 33 | 1 | 0
TSLA | 2015 | 2 | 3 | 16 | 32 | 1 | 0
TSLA | 2015 | 2 | 3 | 16 | 31 | 9 | 3
TSLA | 2015 | 2 | 3 | 16 | 30 | 4 | 0
year | month | day | hour | frequency | sentiment | ticker
------+-------+-----+------+-----------+-----------+--------
2015 | 1 | 31 | 13 | 33 | -3 | AAPL
2015 | 1 | 31 | 13 | 17 | 3 | TWTR
2015 | 1 | 31 | 13 | 16 | 2 | SBUX
2015 | 1 | 31 | 13 | 15 | 1 | KO
2015 | 1 | 31 | 13 | 14 | 3 | MCD
2015 | 1 | 31 | 13 | 13 | 0 | EBAY
2015 | 1 | 31 | 13 | 12 | -1 | MSFT
2015 | 1 | 31 | 13 | 11 | 0 | XOM
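
To show why this layout suits the front-end queries, here is a toy in-memory model of the Twitter time-series table: one partition per ticker, with rows kept in descending time order inside the partition. It is not Cassandra client code. The class and method names are hypothetical, and the descending order is inferred from the sample output above.

```python
from collections import defaultdict


class TimeSeriesTable:
    """Toy model: partition key = ticker, clustering key = time (descending)."""

    def __init__(self):
        # ticker -> {(year, month, day, hour, minute): (frequency, sentiment)}
        self.partitions = defaultdict(dict)

    def upsert(self, ticker, year, month, day, hour, minute, frequency, sentiment):
        # Writing the same key again overwrites the row, as a recomputed batch would.
        self.partitions[ticker][(year, month, day, hour, minute)] = (frequency, sentiment)

    def latest(self, ticker, limit):
        """Read the newest `limit` rows from a single partition."""
        rows = self.partitions.get(ticker, {})
        keys = sorted(rows, reverse=True)[:limit]
        return [k + rows[k] for k in keys]


table = TimeSeriesTable()
table.upsert("TSLA", 2015, 2, 3, 16, 30, 4, 0)
table.upsert("TSLA", 2015, 2, 3, 16, 31, 9, 3)
table.upsert("TSLA", 2015, 2, 3, 16, 35, 16, 0)
print(table.latest("TSLA", 2))
```

A chart for one ticker touches only that ticker's partition, and the newest rows come first without a separate sort step.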
SPEEDVIEW
• Cassandra TTL support can be used for the rolling count operation in the dashboard application
• Not available in the Cassandra-Spark connector
• Instead, add a timestamp and ranking to each ticker generated in each 5-second window
• Partitioned by ranking, clustering order by timestamp
id | timestamp | frequency | sentiment | ticker
----+------------+-----------+-----------+--------
0 | 1430375561 | 55 | -5 | AAPL
0 | 1430370589 | 55 | -5 | AAPL
0 | 1430365508 | 54 | -5 | AAPL
0 | 1430360540 | 54 | -5 | AAPL
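
The ranking step can be sketched as follows: for each window, tickers are sorted by frequency and each row gets its rank as `id` plus the window timestamp, matching the columns in the table above. The function name, the tie-break by ticker name, and the sample counts for tickers other than AAPL are assumptions.

```python
def rank_window(window_counts, window_end, top_n=5):
    """Turn one window's counts into ranked rows for the speed view.

    window_counts: {ticker: (frequency, sentiment)} for a single window.
    window_end:    epoch seconds identifying the window.
    """
    # Highest frequency first; ties broken alphabetically for determinism.
    ranked = sorted(window_counts.items(), key=lambda kv: (-kv[1][0], kv[0]))
    return [
        {"id": rank, "timestamp": window_end, "frequency": freq,
         "sentiment": sent, "ticker": ticker}
        for rank, (ticker, (freq, sent)) in enumerate(ranked[:top_n])
    ]


rows = rank_window(
    {"AAPL": (55, -5), "TSLA": (12, 1), "FB": (12, 0)},
    window_end=1430375561,
    top_n=2,
)
for row in rows:
    print(row)
```

With rows partitioned by `id` and clustered by timestamp, the dashboard can read the newest row of each rank partition, so old windows do not need to expire via TTL before the display is correct.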
SHAFI BASHAR
• PhD, ECE, UC Davis
• Present: Intel Corporation
• Worked on 4G LTE and WiFi standardization
• Interests: algorithms, machine learning
• Activities: backpacking, skiing, running, photography
[Chart: peak data rates across wireless generations (2.5G, 3G, 4G), covering GPRS, WCDMA, HSPA, WiMAX, HSPA+, 4G LTE and 4G LTE-Advanced, ranging from 40 Kbps to 1 Gbps]
