
Analyzing Power of Tweets in Predicting Commodity Futures

888 views


Extracting signals from tweets to predict commodity futures.

Published in: Data & Analytics


  1. Analyzing the power of Tweets in predicting Commodity Futures
     Mar 17, 2014
     Srivatsan Ramanujam, Senior Data Scientist, Pivotal
     @gopivotal @being_bayesian
     © Copyright 2013 Pivotal. All rights reserved.
  2. Problem Definition
     • Can we predict Corn, Soybean and Wheat futures based on social chatter on Twitter?
     • The customer: a major agricultural cooperative
  3. Data
  4. Obtaining Data
     • Used to fetch 5 years of historical tweets matching any of a list of keywords of interest
     [Figures: Tweets Table; Poster Information]
  5. GNIP
     • As plugged-in partners, we had worked with GNIP before, and the experience was great
     • We needed historical data, and GNIP’s Historical PowerTrack came in handy
     • Clean API, quick quotes, and convenient download of the results of historical jobs
  6. Grain Futures vs. Volume of Tweets
  7. The Platform
  8. Data Science Toolkit
     • Appliance
       – Full Rack DCA with Greenplum Database
     • ETL
       – Python
     • Modeling
       – SQL
       – MADlib
       – PL/Python, PL/Java
       – Ark-Tweet-NLP [1] with PL/Java wrappers
     • Visualization
       – Tableau
     [1] CMU ARK Twitter Parts-of-Speech tagger: http://www.ark.cs.cmu.edu/TweetNLP (GPL 2)
  9. Pivotal Greenplum MPP DB
     • Think of it as multiple PostgreSQL servers: a master plus segments (workers)
     • Rows are distributed across segments by a particular field (or randomly)
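The row distribution described on slide 9 can be pictured with a small sketch. This is only an illustration under stated assumptions: the CRC32 hash, the `tweet_id` field, and the function names are hypothetical stand-ins and do not reflect Greenplum's actual hashing or internals.

```python
import zlib
from collections import defaultdict


def assign_segment(key, num_segments):
    """Map a distribution-key value to a segment index using a stable hash.

    CRC32 is an assumption made for illustration; it is not Greenplum's hash.
    """
    return zlib.crc32(str(key).encode("utf-8")) % num_segments


def distribute(rows, key_field, num_segments):
    """Group rows by the segment that their distribution key hashes to."""
    segments = defaultdict(list)
    for row in rows:
        segments[assign_segment(row[key_field], num_segments)].append(row)
    return segments


# Hypothetical table of 1000 tweets, distributed by tweet_id over 4 segments.
rows = [{"tweet_id": i, "body": f"tweet {i}"} for i in range(1000)]
segments = distribute(rows, "tweet_id", num_segments=4)
```

Because the hash is deterministic, rows with the same key value always land on the same segment, which is what lets each segment work on its own slice of the data.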
  10. PL/X : X in {pgsql, R, Python, Java, Perl, C etc.}
     • Allows users to write Greenplum/PostgreSQL functions in R, Python, Java, Perl, pgsql or C
     • The interpreter/VM of the language ‘X’ is installed on each node of the Greenplum Database cluster
     • Data parallelism: PL/X piggybacks on Greenplum’s MPP architecture
     [Diagram labels: SQL; Master Host (Master, Standby); Interconnect; Segment Hosts, each with multiple Segments]
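The data parallelism that slide 10 attributes to PL/X can be mimicked outside the database. The sketch below is plain Python, not PL/Python: it runs the same function on each "segment" and combines the partial results, with threads standing in for segment processes. The tweets, the `count_keyword` helper, and the three-segment layout are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def count_keyword(segment_rows, keyword):
    """Count the rows in one segment whose text mentions the keyword."""
    return sum(1 for text in segment_rows if keyword in text.lower())


# Synthetic tweets, already split across three hypothetical segments.
segments = [
    ["Corn prices up today", "rain in Iowa"],
    ["wheat harvest looks good", "CORN futures rally"],
    ["soybean exports", "nothing to see"],
]

# Each worker applies the same function to its own segment (the PL/X role);
# the partial counts are then combined, as the master would do.
with ThreadPoolExecutor(max_workers=len(segments)) as pool:
    partial_counts = list(pool.map(lambda seg: count_keyword(seg, "corn"), segments))

total = sum(partial_counts)
```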
  11. Scalable, in-database ML
     • Open source: https://github.com/madlib/madlib
     • Works on Greenplum DB and PostgreSQL
     • Active development by Pivotal
       – Latest release: 1.4 (Dec 2013)
     • Downloads and docs: http://madlib.net/
  12. MADlib In-Database Functions
     • Predictive Modeling Library
       – Generalized Linear Models: Linear Regression, Logistic Regression, Multinomial Logistic Regression, Cox Proportional Hazards Regression, Elastic Net Regularization, Sandwich Estimators (Huber-White, clustered, marginal effects)
       – Matrix Factorization: Singular Value Decomposition (SVD), Low-Rank
       – Machine Learning Algorithms: Principal Component Analysis (PCA), Association Rules (Affinity Analysis, Market Basket), Topic Modeling (Parallel LDA), Decision Trees, Ensemble Learners (Random Forests), Support Vector Machines, Conditional Random Field (CRF), Clustering (K-means), Cross Validation
       – Linear Systems: Sparse and Dense Solvers
     • Descriptive Statistics
       – Sketch-based Estimators: CountMin (Cormode-Muthukrishnan), FM (Flajolet-Martin), MFV (Most Frequent Values)
       – Correlation
       – Summary
     • Support Modules
       – Array Operations, Sparse Vectors, Random Sampling, Probability Functions
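To give a feel for one of the sketch-based estimators listed on slide 12, here is a minimal Count-Min sketch in plain Python. It is a generic textbook version written for illustration, not MADlib's implementation; the class name, the MD5-based hashing, and the default width and depth are all assumptions.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter that never underestimates a count."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Derive one hash per row by salting the item with the row number.
        digest = hashlib.md5(f"{row}:{item}".encode("utf-8")).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate a cell, so the minimum is the best guess.
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))


sketch = CountMinSketch()
for token in ["corn"] * 5 + ["wheat"] * 3 + ["soy"]:
    sketch.add(token)
```

The memory used is fixed by `width` and `depth` regardless of how many distinct tokens are seen, which is the property that makes this family of estimators attractive for large tweet volumes.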
  13. The Models
  14. The Approach
     • In addition to identifying textual cues in tweets that were correlated with commodity futures, we also wanted to analyze whether tweet sentiment was correlated with commodity futures
  15. Sentiment Analysis – Challenges
     • Language on Twitter doesn’t adhere to the rules of grammar, syntax or spelling
     • We don’t have labeled data for our problem: the tweets aren’t tagged with sentiment
     • Semi-supervised sentiment prediction can be achieved by dictionary look-ups of tokens in a tweet, but without context, sentiment prediction is futile (example on slide: “Cool”)
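The limitation of context-free dictionary look-ups mentioned on slide 15 can be shown with a toy scorer. The lexicon, the example tweets, and the reading of “Cool” as a word whose polarity depends on context are assumptions made for this sketch; the deck does not spell them out.

```python
# Hypothetical sentiment lexicon: token -> polarity.
LEXICON = {"cool": 1, "great": 1, "good": 1, "bad": -1, "poor": -1}


def dictionary_sentiment(tweet):
    """Sum the lexicon polarity of every token, ignoring all context."""
    tokens = tweet.lower().split()
    return sum(LEXICON.get(token.strip(".,!?"), 0) for token in tokens)


# "Cool" as approval: the positive score is reasonable.
positive = dictionary_sentiment("Cool, corn yields look great!")

# "Cool" as weather: the look-up still scores it positive, although
# delayed planting is hardly good news for the crop.
ambiguous = dictionary_sentiment("Cool temperatures delay corn planting")
```

Both tweets receive a positive score because the look-up sees only the token, not the phrase it belongs to.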
  16. Sentiment Analysis – Approach
     • Parallelized ArkTweetNLP to achieve fast parts-of-speech tagging on tweets
     • Custom (patent pending) algorithm to extract contextual cues and score the sentiment of tweets
     • Components shown on the slide:
       – Part-of-speech tagger [1]: break up tweets into tokens and tag their parts of speech
       – Phrase Extraction
       – Semi-Supervised Sentiment Classification
       – Phrasal Polarity Scoring: use learned phrasal polarities to score the sentiment of new tweets
       – Sentiment Scored Tweets
     [1] Parts-of-speech tagger: Gp-Ark-Tweet-NLP (http://vatsan.github.io/gp-ark-tweet-nlp/)
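The deck does not disclose the patent-pending algorithm, so the following is only a generic stand-in for the idea of scoring tweets by phrase rather than by token. It assumes tweets arrive already tagged as (token, tag) pairs with `A` for adjectives and `N` for common nouns; the phrase polarities, the adjective-noun extraction rule, and the averaging are all hypothetical choices.

```python
# Hypothetical "learned" polarities for (adjective, noun) phrases.
PHRASE_POLARITY = {
    ("cool", "weather"): -0.5,
    ("strong", "demand"): 0.8,
    ("record", "harvest"): 0.6,
}


def extract_phrases(tagged_tokens):
    """Return (adjective, noun) bigrams from a list of (token, tag) pairs."""
    phrases = []
    for (word, tag), (next_word, next_tag) in zip(tagged_tokens, tagged_tokens[1:]):
        if tag == "A" and next_tag == "N":
            phrases.append((word.lower(), next_word.lower()))
    return phrases


def score_tweet(tagged_tokens):
    """Average the polarity of known phrases; 0.0 if none are recognized."""
    phrases = extract_phrases(tagged_tokens)
    scores = [PHRASE_POLARITY[p] for p in phrases if p in PHRASE_POLARITY]
    return sum(scores) / len(scores) if scores else 0.0


# Hand-tagged example tweets (tags are assumptions, not tagger output).
weather_tweet = [("Cool", "A"), ("weather", "N"), ("hurts", "V"), ("corn", "N")]
demand_tweet = [("Strong", "A"), ("demand", "N"), ("and", "&"),
                ("record", "A"), ("harvest", "N")]

weather_score = score_tweet(weather_tweet)
demand_score = score_tweet(demand_tweet)
```

Scoring the phrase ("cool", "weather") instead of the bare token "cool" is what lets context flip the polarity in this toy setup.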
  17. Text Analytics Pipeline with GNIP stream
     • Tweet stream stored on HDFS (gpfdist)
     • Loaded as external tables into GPDB
     • Parallel parsing of JSON and extraction of fields using PL/Python
     • Topic analysis through MADlib pLDA
     • Sentiment analysis through custom PL/Python functions
     • D3.js
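The JSON parsing step of the pipeline on slide 17 might look roughly like the body of the function below. It is written as plain Python rather than as a PL/Python function, and the field names (`id`, `postedTime`, `body`, `actor.preferredUsername`) are assumptions about the payload layout, not something the deck specifies.

```python
import json


def extract_fields(raw_line):
    """Parse one JSON-encoded tweet and return the fields of interest.

    Returns None for malformed input so that bad rows can be filtered out.
    """
    try:
        activity = json.loads(raw_line)
    except (TypeError, ValueError):
        return None
    if not isinstance(activity, dict):
        return None
    actor = activity.get("actor")
    if not isinstance(actor, dict):
        actor = {}
    return {
        "id": activity.get("id"),
        "posted_time": activity.get("postedTime"),
        "body": activity.get("body"),
        "user": actor.get("preferredUsername"),
    }


# Synthetic record used only to exercise the function.
sample = json.dumps({
    "id": "tag:example:1",
    "postedTime": "2013-06-01T12:00:00.000Z",
    "body": "Corn planting delayed again",
    "actor": {"preferredUsername": "farmer_joe"},
})
parsed = extract_fields(sample)
```

Since the function looks at one row at a time and keeps no state, it can be applied independently on every segment, which is what makes the parsing parallel.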
  18. Key Take-Aways
     • There is significant signal in tweets for predicting commodity futures
     • Sentiment analysis of tweets can provide an additional signal in predicting commodity futures. Twitter sentiment was negatively correlated with commodity futures in the sample we analyzed
     • A blended model of text regression, sentiment analysis and tweet actor information gave us encouraging results, and we believe that combining it with market fundamentals like weather or yield will give better models
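A negative correlation like the one reported on slide 18 is typically measured with a Pearson coefficient over aligned daily series. The sketch below shows that computation only; the numbers are synthetic and chosen to produce a negative value, and they are not the study's data or results.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Synthetic daily series: average tweet sentiment and a futures closing price.
daily_sentiment = [0.6, 0.4, 0.1, -0.2, -0.5]
futures_close = [410.0, 418.0, 431.0, 440.0, 455.0]

r = pearson(daily_sentiment, futures_close)
```

A value of `r` below zero means that days with lower sentiment tended to coincide with higher prices in this toy series.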
  19. What’s in it for me?
  20. Pivotal Open Source Contributions
     http://gopivotal.com/pivotal-products/open-source-software
     • MADlib, in-database parallel ML: https://github.com/madlib/madlib
     • PyMADlib, Python wrapper for MADlib: https://github.com/gopivotal/pymadlib
     • PivotalR, R wrapper for MADlib: https://github.com/madlib-internal/PivotalR
     • Part-of-speech tagger for Twitter via SQL: http://vatsan.github.io/gp-ark-tweet-nlp/
     Questions? @being_bayesian
