QuantUniversity Summer School 2020 (https://qusummerschool.splashthat.com/)
https://quspeakerseries10.splashthat.com/
Lecture 1: Alexander Denev
In this talk, Alexander will introduce Alternative Data and discuss its uses, drawing on his book, The Book of Alternative Data
- What is alternative data?
- Adoption of alternative data
- Information value chain
- Risks associated with alternative data
- Processes required to develop signals
- Valuation of alternative data
Lecture 2: Saeed Amen
In this talk, Saeed will discuss use cases in Alternative Data
- Deciphering Federal Reserve communications
- Using CLS flow data to trade FX
- Geospatial Insight satellite data to estimate retailers' EPS
- Saving "alpha" with transaction cost analysis
- Using Bloomberg News data to trade FX
Frontiers in Alternative Data: Techniques and Use Cases
1. Qu Speaker Series
Frontiers in Alternative Data: Techniques and Use Cases
2020 Copyright QuantUniversity LLC.
Hosted By:
Sri Krishnamurthy, CFA, CAP
sri@quantuniversity.com
www.qu.academy
09/22/2020
Online
https://quspeakerseries9.splashthat.com/
2. 2
QuantUniversity
• Boston-based Data Science, Quant
Finance and Machine Learning
training and consulting advisory
• Trained more than 1000 students in
Quantitative methods, Data Science
and Big Data Technologies using
MATLAB, Python and R
• Building a platform for AI
and Machine Learning Exploration
and Experimentation
10. 1
The Book of Alternative
Data
September 2020
11. 2
The Book of Alternative Data
Maximising Data Value, a Vendor Perspective | Deloitte LLP 2019
• Co-authored by Alexander
Denev and Saeed Amen
• The Book of Alternative Data
(on Wiley)
• Hardback available on Amazon
USA now (elsewhere in Sep)
• Kindle available on Amazon
worldwide
• Presentation is based on the
book
12. 3
• Common properties
• Less commonly used by market participants
• Tends to be more expensive
• Often outside financial markets (is tick data “alternative”?)
• Shorter history
• More challenging to use
• “Exhaust data” a byproduct of other processes
• Digital footprint from individual and corporate activity
• Resulted in a rapid rise in the number of alternative datasets
• Can provide an additional revenue stream for those who collect “exhaust data”
• Not all alternative data is necessarily Big Data (but it can be!)
Saeed Amen / @saeedamenfx
What is alternative data?
13. 4
Alternative data & investments case studies
Several clear case studies have emerged demonstrating the value of analytics in combination with alternative data applied to the investment process
ON-LINE PRICE =
INFLATION
Global FSI Firm employs
technology to track prices of 5
million products on-line to
understand price shocks and
monitor shifts in inflation across 70
countries1
MOBILE FOOT TRAFFIC =
ECONOMY
Hedge Funds using location data
pulled from mobile devices to predict
outlook on economy and REIT
values4
SOCIAL+ SEARCH =
EARNINGS
$90B AUM Global Asset Manager mines
search engine data combined with social-
media data to predict results of corporate
events like quarterly earnings3
SATELLITE + SHIPS =
MISPRICED SECURITY
Hedge fund using satellite
intelligence on ships and tank levels
to identify upcoming impact to oil
producers and commodity prices5
WEB + TWITTER =
MARKET MOVING EVENT
Data provider using 300M Websites, 150M
Twitter feeds in combination with analyst
presentations and FactSet reports to measure
rise up media food chain (e.g. blogs to
newswire) to highlight potentially market
moving events6
APP + CREDIT CARD =
PERFORMANCE
Hedge Fund looks at combination of
alternative data including credit card
transactions, geo-location, and app
downloads to analyze burger chain
performance2
1.Innovative Asset Managers, Eagle Alpha
2.“Foursquare Wants To Be The Nielsen Of Measuring The Real World,” Research Briefs, CBInsights, June 8, 2016.
3.Simone Foxman and Taylor Hall, “Acadian to Use Microsoft's Big Data Technology to Help Make Bets,” Bloomberg, March 7, 2017.
4.Rob Matheson, “Measuring the Economy With Location Data,” MIT News, March 27, 2018.
5.Fred R. Bleakley, “CargoMetrics Cracks the Code on Shipping Data,” Institutional Investor, February 04, 2016.
6.Accern website
AUM of UK-Based Man Group’s
AI/Analytics driven AHL Dimension
fund up 5x over 3 years
Accelerating AI Adoption
Deployed AI (Artificial Intelligence)
techniques to four additional funds
managing $12.3B
Varied Data Sources
Processes terabytes of data ranging
from weather forecasts to container
ship movements
Increasing Valuation
Man Group’s stock price has increased
by 55% from January to October 2017
AI Driving Profit
Artificial intelligence contributed roughly
50% of 2015 profits for the AHL
Dimension Fund
Source: Adam Satariano, “The Massive Hedge Fund Betting on AI,” Bloomberg,
September 27, 2017.
14. 5
• Volume (increasing) – lots of data
• Variety (increasing) – not just numerical data, can be text, image, video etc.
• Velocity (increasing) – speed that data is being generated
• Variability (increasing) – inconsistencies in the data
• Veracity (decreasing) – difficult to tell if accurate (e.g. social media)
• Value (increasing) – business value of the data
The Vs of Big Data
15. 6
Quantitative investment strategies and vendor solutions with alpha generation capabilities are becoming a critical component in returning the buy and sell sides' ROE to pre-crisis levels
Addressing Market Challenges
Sophisticated Quants – systematic/quant investors, typically building their own analytics
Who:
• Hedge Funds
• Sophisticated Buy Side Firms
Key Challenges:
• Access to good quality raw data or to curated alternative data
• Maintaining access to cutting edge technology and algorithms
Customer Needs:
• Co-location of analytics and data
• Simplified access to data and computation
• Simplified, but bespoke, data access
Traditional Investors – most intuitive solutions needed; limited technology and programming capability
Who:
• Smaller Sell Side (DSIBs)
• Small Buy Side + Family offices
Key Challenges:
• Reducing technology costs associated with efficient research tools
• Building/maintaining an edge against passive benchmark returns
Customer Needs:
• Simplified access to data and computation
• Curated Signals
• Sophisticated, but low maintenance/build cost analytics platforms
• Elastic access to analytics and associated data science talent
Traditional Quants – interested in derived analytics and more intuitive solutions
Who:
• Large Sell Side (GSIBs)
• Traditional Buy Side Firms
Key Challenges:
• Reducing technology costs associated with efficient research tools
• Retention and expansion of innovation talent
Customer Needs:
• Simplified access to data and computation
• Curated Signals
• Simplified, but bespoke, data access
Fintechs – sophisticated but ultra small scale, with a focus on highly scalable business models
Who:
• Alternative Data Providers
• Signal Factories
Key Challenges:
• Simplified access to data
• Ability and agility to scale
• Support of cutting edge algorithms and alternative data sets
Customer Needs:
• Simplified access to data and computation
• Simplified, but bespoke, data access
• Sophisticated, but low maintenance/build cost analytics platforms
• Marketplace creation
Sources: Macro trends database, January 2019; The Shift From Active to Passive Investing: Potential Risks to Financial Stability?, Federal Reserve System 2019; Deloitte Global Cost Survey, 2019; Alternative data for investment decisions:
Today's innovation could be tomorrow's requirement, Deloitte Centre for Financial Services, 2018
16. 7
The Rising Adoption of Alternative Data
Hedge funds were the innovators in this space, but the technology is reaching a tipping point and may see exponential growth
over the next year
Alternative data adoption curve – likely constituents by phase:
• Innovators: largely hedge funds aggressively seeking information advantage
• Early adopters: aggressive long-only managers and PE firms
• Early majority: tech savvy large complex IM firms
• Late majority: traditional large complex IM firms
• Laggards: firms reluctant to embrace new approaches
With large scale adoption of alternative data, early
majority firms may face regulatory and talent risks
Late majority firms and laggards may face strategic
risks as they defer or decline the use of alternative
data
Innovators and early adopters faced data and
model risks as data sets were sourced from
nontraditional, heterogeneous sources
17. 8
Getting to Grips with New Data Sources and Techniques
Investors are increasingly spending on alternative data, but building data science and engineering teams, and the associated
analytics platforms to fully harness such diverse data, remains a significant barrier for all but the largest firms.
Setting up a data science/engineering team capable of harnessing alternative data
signals can be both expensive and time consuming:
• A diverse talent pool, typically not found within existing functions, is required
to find, analyse, model and productionise alternative insights
• The technology infrastructure required to integrate alternative and traditional
datasets further increases costs, serving as a pre-requisite for analysis
• Processes that are not optimally engineered (e.g. by inappropriate staffing)
often lead to technical debt, production failures and associated costs
• Alternative data variability makes proactive quality monitoring and
remediation an issue in which significant resources are invested
Market Trends Barriers to Entry
Source: Alternativedata.org
AUM           | 2016  | 2017  | 2018E | 2019E
<$1bn         | 35    | 63    | 107   | 158
$1bn - $10bn  | 213   | 340   | 506   | 764
>$10bn        | 1,288 | 1,954 | 3,104 | 4,041
Buy Side Avg. | 841   | 1,267 | 2,005 | 2,640
Role           | Entry Level Salary | Bonus
Data Analyst   | $80k-$100k         | ~25%
Data Scientist | $80k-$100k         | ~40%
Data Scout     | $70k-$90k          | ~15%
Data Engineer  | $80k-$110k         | ~30%
Head of Data   | $250k-$1,000k      | ~100%
Buy side spend on alternative data has increased over the previous 3 years and
is expected to continue to grow:
• Poor active investment performance is driving shift to passive products and
fee compression
• Active investing strategies are starting to require more diverse data to
generate strong alpha and beta predictive signals
• Savings from bundling of data streams are not currently possible due to the
segmentation of the data providers market but are becoming highly
desirable
Minimum Data Team
1 Head of Data
1 Data Scientist
1 Data Engineer
1 Data Scout
3 Data Analysts
Anticipated minimum spend
between $1m-$2m p/a,
dependent on technology
maturity, existing talent and
complexity of ambition
Average buy side spend on datasets ($k)
Total Buy Side Spend on Alternative Data ($m)
Annual Salaries of Associated Talent
18. 9
The Buy Side is Increasing its Focus on Alternative Data
The majority of the buy side believes alternative data will positively impact their investment performance. Deloitte has surveyed
over 100 investment managers (IMs) and has observed significant technological, talent and risk challenges that integrating
such diverse data presents.
IM firms' opinion about the impact of alternative data on investment processes (poll responses, in order shown: 10% / 40% / 42% / 8%):
• Minimal Impact
• Some impact, firms that utilise alternative data early may see some temporary advantages
• Alternative data leaders will see sustained advantages in some asset classes
• Alternative data represents a secular change in IM and expertise in this area will separate winners and losers over the next 5 years
What is your organization's status for utilizing alternative data? (poll responses, in order shown: 13% / 9% / 49% / 29%)
• No part of the strategic plan
• Considering it, but no action at this point
• Currently using alternative data in a test environment
• Using alternative data to augment portfolio management decisions
Source: c. 110 responses from IM firms from the polls conducted during the Alternative Data Dbrief session on April 24, 2018. Data has been cleaned to exclude blank and 'Don't Know/Not Applicable' responses
Do you think utilization of alternative data (or not) presents new or different risks to IM firms? (poll responses, in order shown: 8% / 11% / 51% / 15% / 15%)
• No, it's business as usual
• Our existing risk mgmt. framework can be adapted in the normal course of business to handle alternative data
• A fresh look at the risks associated with this development is appropriate
• The issues presented by alternative data are significant – our firm needs to refresh the risk mgmt. framework to assess them
• Other
Our organization's adoption journey for alternative data utilization will likely or already includes: (poll responses, in order shown: 46% / 15% / 9% / 14% / 16%)
• Proprietary platforms and processes
• Use of alternative data aggregators or brokers to facilitate acquisition
• Use of alternative data crowd-sourced insights supplied by vendors
• Use of insights developed by sell-side analysts from alternative data
• More than one of these
19. 10
Data as a Service Infra/Platform as a Service Analytics as a Service
Minimally refined data supplied directly
to customers.
State of the art provides:
• Connected data, via a single point of
access, and the ability to customize
the data feed to a client’s specific
requirements
• Cleansed data with appropriate
imputation and normalised data
concepts and entities
Flexible cloud infrastructure (and
platforms) provisioned with simplified
access to Data.
State of the art provides:
• Simplified access to data, while
improving usage monitoring
• Co-located cloud infrastructure
capable of supporting ultra low
latency algorithmic decisions (and
reducing comms infra costs)
• Access to cloud based elastic/burst
computing capabilities and a variety
of price point storage solutions
Analytics data platform built upon
IaaS/PaaS with pre-built environments
for large scale.
State of the art provides:
• Simplified access to data processing,
providing off-the-rack data platform
solutions that can be readily
accessed
• App store engagement model that
fosters agile fintech ecosystem
Combining AaaS model with a diverse
data science talent pool.
State of the art provides:
• Access to seasoned users of the
AaaS platform
• Access to rare skill sets such as
graph theory, natural language
processing, image processing etc.
able to generate signals from data
outside customer competencies
• Ultra flexible staffing model
minimising overheads for R&D
efforts
Pre-generated signals that are sold to
clients at a premium.
State of the art provides:
• Pre-built signals targeting market segments and use cases, with a series of robust quality checks applied where alternative data is used
• Support for 3rd party vendors (i.e.
those employing AaaS) to sell
signals
• Utilize spare capacity within the
Managed Analytics service
Managed Analytics Service Signal as a Service
Primary Buyers
Sophisticated quants who build their
own analytics and associated platforms
e.g.:
• Large Sell Side institutions
• Quantitatively advanced Hedge
Funds
As per DaaS, with greater focus on
latency dependent trading strategies.
• Large scale seeking ultra low latency
• Mid-Scale unable/unwilling to
develop complex, data-processing
centric, cloud platforms
• Large scale looking to simplify path
to innovation
• Fintechs seeking lean data science
focused operating model
• Mid-Scale unable/unwilling to
develop complex, data-processing
centric, cloud platforms
• Large scale looking to simplify path
to innovation
• Dependent on nature and pricing
strategy of signal
• Smaller Scale Wealth Managers
Individual investor firms must assess where their comparative advantage exists and opt for a consumption pattern that
maximizes their return on investment from data
Understanding Comparative Advantage in Data
Example Providers
Data Vendors:
• IHS Markit
• Bloomberg
• Refinitiv
• Euroclear (developing)
• Deutsche Borse (developing)
Data Vendors:
• Refinitiv
Other Examples:
• Google, Amazon, Microsoft (without
co-lo)
Data Vendors:
• No comprehensive/deep offerings
within the major players
Other Examples:
• Generic analytics vendors e.g. SAS,
Cloudera, Pivotal
Data Vendors:
• No major players (however Quandl
model is similar)
Other Examples:
• Prof. services; e.g. Deloitte,
Accenture, BCG Gamma etc.
Data Vendors:
• IHS Markit (Research Signals
service ~60 clients)
• Refinitiv (white papers on signals)
• Quandl
20. 11
In order to realise maximum value from data assets, a combination of prioritization, enhancement and analysis is required,
together with a sophisticated valuation structure that reflects the value of data assets to the firm
The Information Value Chain
Thorough risk assessment is required throughout the value chain to ensure that the data stored within the
vendor and delivered to customers is regulatory compliant, technologically robust and ethically sound!
21. 12
Alternative data carries greater risk than traditional data, and these datasets may also introduce new risk types
Alternative Data Adoption Alters Risk Exposure
New data sources have the potential to adversely impact investment models, and perhaps decision making, if:
• Applicability: where data is incorporated in the model incorrectly
• Variability: where the trading signal generated is irregular or
inconsistent under certain conditions
• Integration: where the output of the model is improperly
linked to the trading process
IM firms may face the following risks due to the rise in demand
for data science and advanced analytical skills to process
alternative data:
• Loss of intellectual capital through talent turnover
• Impact on alternative data utilization ability due to delayed
training for existing employees
Firms may face these types of data risks due to immature risk control
processes at data providers
• Data provenance risk: Violation of the terms and conditions from
the data originator while scraping websites
• Accuracy/validity risk: Data may prove unreliable or produce an
inaccurate trading signal
• Material non-public information (MNPI) risk: Receipt of a dataset
containing MNPI could result in risk events
Regulations governing the use of alternative data are still in the
early stages of maturity. There are open questions about
acceptable practices with respect to the use of alternative data.
Furthermore, recent regulation introduces significant penalties for
leaks of personally identifiable information, which could be included
in a dataset received from a source
Data Risk Model Risk
Regulatory Risk Talent Risk
22. 13
Define Value
Simplify
Entities
• Reconcile duplicative data assets and cleanse where appropriate to
drive data efficiency and minimise the risk of divergent and/or
conflicting data
• Link data from different sources together to realise network valuation
benefits
Access
• Document the accesses available per data source
Map
• Map all assets and
associated dictionaries
• Document existing
distribution and storage
approaches
Quality
• Assess data quality within
assets, focusing on Clarity &
Uniqueness, Validity &
Consistency, Timeliness &
Completeness and the
Accuracy, Credibility &
Confidence of the data
sources
Assess
• Third party risks
• Information compliance risks, e.g. GDPR
Plan
Internal
• Define and share explicit valuation methodology encompassing
collection, usage, storage coverage and governance
External
• Define appropriate pricing strategy for assets (exhaust data)
Maximising the value of a data estate requires a comprehensive mapping of the estate and embedding an appropriate
governance model prioritized by the estimated value of data
Mapping the Data Estate
Market Map & Gap
• Map the data assets to the
current and potential
consumers
• Match the demand for
analytics services with the
investment in data assets
Network Maximisation
• Close gaps in coverage to
realise network benefits of
connected data
• Enhance depth of assets with
proven value
23. 14
People
A diverse talent pool is required to both build
and maintain the data engineering and
analytics structure, but also to support an
external signals managed service model:
• Data & Machine Learning Engineers -
expected to both build and maintain the
infrastructure, and productize models
developed within the data science pool
• Data Scientists - including image, NLP and
network specialists, in addition to more
traditional finance quant analysts
• Business Analysts - expected to contain
financial analysts capable of analyzing and
translating business requests into data
science problems
• Data Scouts - to explore new datasets that
appear in the market
Process
Building a frictionless signal factory platform
and the data science talent that supports it
must rely upon robust governance of data,
technology and talent:
• Clear duty segregation to minimize key
person risks and bottlenecks
• Models to support autonomy and agility
• Strong data governance and stewardship to
ensure that data management is scalable
without the need to scale effort
• Fail fast proof-of-concepts
• State of the art cyber security, to both ring-
fence sensitive data and prevent external
attacks
Creating and maintaining a signals factory requires a diverse talent pool as a foundation, well designed processes and a high
end technology stack but reduces costs and allows scaling
Developing a Signals Factory Proposition
Technology
A robust and well maintained technology
platform is critical to a signal factory success
with a partnership with a cloud supplier likely
to be a pre-requisite. Key considerations
include:
• Building in a cloud native fashion, to take
full advantage of elastic storage and
compute capabilities
• Support for a variety of data storage
paradigms (e.g. graph, key value,
columnar, relational etc.)
• Seamless integration of exploration tools,
e.g. Jupyter Notebooks, Tableau etc.
• Model management frameworks, to simplify
the promotion to production of models
(likely to involve containerization)
• Support for diverse hardware including
GPUs, FPGAs etc.
24. 15
Valuation of Ingested Data Assets
As a non-depletable and non-degradable asset, data represents a unique valuation and backtesting challenge, particularly
pertinent in financial markets, where greater usage of an asset crowds out its value.
Qualitative
A qualitative approach is likely required to
support a benchmark to measure/complement
other approaches. Considerations include:
• Cost of integration and storage
• Data quality of signal (degree of imputation
etc.)
• Depth and breadth of signal coverage
• Value of other similar signal assets
• Uniqueness of the dataset/signal
License & Latency
Constraining the number of consumers of high value data feeds is a useful heuristic to prevent over-exploitation, but few data vendors do it. Consumers should:
• Negotiate licensing or latency based consumption
constraints to ensure they either receive data that
other investors do not have access to or before the
market in general
• Factor in the absence of these constraints when valuing vendors' signal data
Profit Sharing
While complex profit sharing mechanisms create
feedback within a pricing system that incentivizes
both vendor and consumer to maximize the value
of a given signal asset, significant complexities
exist within:
• Implementation e.g. the negotiation of the
degree of profit share, exposure in the event of
signal failures
• Monitoring the agreed terms of the share
Value Maximisation
Strategies
Backtesting
A solid backtesting program is needed to understand the alpha from alternative data, but one needs:
• to account for the usually short history of
alternative data
• to incorporate the statistical uncertainty of the
backtesting results into the price of data
26. The Book of Alternative Data
Use cases
A Guide for Investors, Traders and Risk Managers
Saeed Amen, Cuemacro
Co-authored by Alexander Denev & Saeed Amen
27. Case study: Federal Reserve Communications Cuemacro Index
28. Federal Reserve data
• Federal Reserve regularly communicates with markets
• Through speeches, statements, minutes etc.
• Market reacts to this!
• Can read publicly available communications from the web
• Create a dataset of web communications
• Apply NLP to determine the sentiment of individual texts
• Construct an index to give an overall view of FOMC sentiment
• Positive sentiment is hawkish whilst negative sentiment is dovish
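The index construction above can be sketched with a toy hawkish/dovish word list (the word lists and texts here are illustrative, not the lexicon used in the talk):

```python
# Minimal sketch: score Fed texts with a tiny hawkish/dovish word list, then
# average the scores into an index. Word lists and sample texts are made up.
HAWKISH = {"inflation", "tighten", "raise", "overheating"}
DOVISH = {"accommodation", "easing", "cut", "slack"}

def sentiment(text: str) -> float:
    words = [w.strip(".,").lower() for w in text.split()]
    hawk = sum(w in HAWKISH for w in words)
    dove = sum(w in DOVISH for w in words)
    total = hawk + dove
    # +1 = fully hawkish, -1 = fully dovish, 0 if no signal words found
    return 0.0 if total == 0 else (hawk - dove) / total

texts = [
    "The Committee will raise rates to contain inflation.",
    "Policy accommodation and easing remain appropriate given labor market slack.",
]
index = sum(sentiment(t) for t in texts) / len(texts)
```

A real implementation would replace the word lists with a proper NLP sentiment model and aggregate per speaker/date rather than per text.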
29. Fed sentiment vs. UST10Y yield
changes
• Can see a relationship between them, as we would expect
30. Case study: CLS FX flow data
to trade FX spot
31. CLS data
• FX is a more fragmented market than other asset classes
• Vast majority is OTC
• Many different trading venues
• Bilateral trading
• Difficult to find comprehensive FX volume & flow data
• CLS settles most OTC deliverable FX – covering over 50% of the market
• They collect and distribute
• Hourly FX volume data
• Hourly FX flow data for price takers
• 30 minute lag – historical data since late 2012
32. Create fund FX flow index
• Use fund FX flow data – tends to be more directional, with positive correlation to spot
• Create fund FX flow index
• Buy spot when very positive
• Sell spot when very negative
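The "buy when very positive, sell when very negative" rule above can be sketched as a threshold on a standardised flow index (the numbers and the ±1 band are synthetic, not CLS data or the talk's calibration):

```python
# Hedged sketch of a threshold rule on a fund FX flow index: z-score the
# flows, buy spot above +1 standard deviation, sell below -1.
import statistics

flows = [0.2, 1.5, -0.3, -2.1, 0.4, 2.2, -1.8, 0.1]  # synthetic flow index
mu, sd = statistics.mean(flows), statistics.pstdev(flows)

def signal(flow: float, band: float = 1.0) -> int:
    z = (flow - mu) / sd
    if z > band:
        return 1    # very positive flow -> buy spot
    if z < -band:
        return -1   # very negative flow -> sell spot
    return 0        # otherwise stay flat

signals = [signal(f) for f in flows]
```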
35. Geospatial Insights: RetailWatch
• It is well known that satellite photography can be used to help forecast earnings per share for retail stocks
• Has been used extensively in US markets (Orbital Insight), but not
as much for European firms
• Uses car counts as a proxy for retail activity
• RetailWatch covers a number of European retailers (both publicly
traded and private companies)
• Relatively new dataset
36. Using car counts to estimate EPS
• Created a car count score based upon the 6 months of activity
related to the earnings period
• Compare against Bloomberg’s consensus and actual EPS
• Present results for Marks & Spencer
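One simple way to relate a car count score to EPS is an ordinary least-squares fit; a minimal sketch follows (all numbers are synthetic, not RetailWatch or Bloomberg data, and the real study compares against consensus as well):

```python
# Illustrative least-squares fit of actual EPS on a car count score, giving a
# car-count-implied EPS estimate. All figures below are made up.
car_score = [0.8, 1.1, 0.9, 1.3, 1.0]      # synthetic car count scores
actual_eps = [0.40, 0.55, 0.45, 0.62, 0.50]  # synthetic reported EPS

n = len(car_score)
mx = sum(car_score) / n
my = sum(actual_eps) / n
beta = sum((x - mx) * (y - my) for x, y in zip(car_score, actual_eps)) / \
       sum((x - mx) ** 2 for x in car_score)
alpha = my - beta * mx
predicted = [alpha + beta * x for x in car_score]  # car-count-implied EPS
```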
37. Case Study: Saving “alpha”
with transaction cost
analysis
38. TCA to “save” alpha
• Big Data and alternative data aren't just for generating alpha
• It can also be used to “save” alpha, to reduce our transaction
costs
• How much is each LP charging?
• Is one algo better than another?
• tcapy is a Python based library by Cuemacro which does
transaction cost analysis to identify how much traders are paying
for their liquidity
• Needs high frequency market tick data and also trade data from
the client
• Will do a quick demo if there’s time
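The core arithmetic behind such cost measurement can be sketched independently of tcapy's actual API (the function name and prices below are illustrative): slippage of an execution versus the mid price at arrival, in basis points.

```python
# Basic TCA metric: execution slippage vs. arrival mid, in basis points.
# This is generic arithmetic, not tcapy's API.
def slippage_bp(exec_price: float, arrival_mid: float, side: int) -> float:
    # side: +1 for a buy, -1 for a sell; positive result = cost paid
    return side * (exec_price - arrival_mid) / arrival_mid * 1e4

# A buy filled slightly above the arrival mid costs ~1.36bp
cost = slippage_bp(exec_price=1.10015, arrival_mid=1.10000, side=1)
```

Aggregating this per broker, algo or liquidity provider is what lets a trader answer "how much is each LP charging?".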
39. Detailed screen
• Plot a specific currency pair over a period of time, breaking down results by broker, algo etc.
40. Plotting of trades/orders
• We can plot the trades/orders in the web app alongside market data
45. Unstructured & structured news data
• Unstructured news data
• Read news articles, blogs etc. in their raw text form, then clean and then
directly apply text based analysis to add tagging and other fields
• Very time consuming as we need to handle large amounts of data and
also need to do natural language processing, which is non trivial
• Structured news data
• Vendors process a large amount of news from numerous sources into a more manageable dataset for us to explore
• Data more easily accessible with additional fields (eg. tagging topics)
• Traders can concentrate on creating effective trading rules and running
risk, rather than spending that time dealing with cleaning up massive
quantities of unstructured news
@saeedamenfx / Copyright Cuemacro
46. Automating news filtering
• Using news to trade markets is not a new idea
• A trader essentially “filters” news into the “signal and the noise”
• But there is simply too much news for humans to read!
• How can we read news in an automated fashion?
• Easier to use structured news datasets
• However, what news filters do we use?
• News related to unemployment?
• Buy/sell signals?
[Chart: US Jobless Claims vs. smoothed Bloomberg "NI UNEMPLOY" news count, 2002–2014]
47. General approach to news filtering
• Several approaches
• Pick words or sectors which are relatively generic (and also intuitive) like “job cuts”
• The approach to this “picking” depends on our data source, each one is different
• Fit the best words according to a backtest!
• “Fitting” words which are not obviously related is data mining
• Resulting model will likely be unstable when run live
• Also caution when using hindsight to pick words
• For example, “Greek debt crisis” was obvious
• But only after the event!
• NT<GO> is a nice way to visualise news
• Bloomberg has machine readable news
• Use natural language processing
[Chart: Bloomberg news count for "Greek Debt Crisis" keywords, 2008–2014]
48. Specific steps for text datasets
• We can formulate a few generic steps that are used when dealing with a text based
dataset for trading purposes
• Raw data collection – web scraping and accessing internal databases
• Cleaning dataset – removing HTML tags and invalid observations
• Structuring dataset – adding tags (eg. sentiment) and compress into single database record
• Filtering dataset – choose most relevant entities/topics to prune search space
• Create an indicator – aggregate records to create indicators
• Apply a trading rule to the indicator – how to convert into buy/sell signals directly or added to other
trading factors (eg. carry)
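The generic steps above can be sketched end-to-end on toy records (regex cleaning and a naive keyword sentiment tag stand in for real NLP; the field names are illustrative):

```python
# Toy version of the pipeline: clean -> structure/tag -> aggregate -> rule.
import re

raw = [
    {"ts": "2017-03-01", "body": "<p>Fed signals rate rise, markets gain</p>"},
    {"ts": "2017-03-01", "body": "<p>Growth fears weigh, stocks fall</p>"},
]

def clean(html: str) -> str:
    return re.sub(r"<[^>]+>", "", html).strip()   # strip HTML tags

def tag(text: str) -> int:
    # naive sentiment tag: +1 if "gain" appears, -1 if "fall" appears
    return (1 if "gain" in text else 0) - (1 if "fall" in text else 0)

records = [{"ts": r["ts"], "text": clean(r["body"]),
            "sent": tag(clean(r["body"]))} for r in raw]  # structured records
daily = sum(r["sent"] for r in records)                   # aggregate indicator
signal = 1 if daily > 0 else -1 if daily < 0 else 0       # trading rule
```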
49. Using Bloomberg News dataset
• We shall use a dataset consisting of Bloomberg News articles from 2009-2017
• It is a structured dataset, which saves time (eg. we avoid the time consuming raw
data collection step)
• Bloomberg News is written in a consistent style, so easier to process than general
web content
• Each news article has a number of fields tagged including:
• Timestamp of news article
• Title of news article
• Text body of the news article
• Tagging for tradable tickers related to the news (eg. %EUR for EURUSD)
• Tagging for the topic related to the news (eg. FED for articles related to Federal Reserve)
• Company specific news also has additional news analytics fields such as sentiment,
readership statistics etc.
• Topics we choose will depend on underlying dataset
50. Generate news signals for FX
• We want to use news to inform FX trading strategies
• Want to develop longer term strategies (ie. not high frequency headline trading)
• Hence, focus will be on macro specific news to trade FX in particular
• Tickers: %EUR, %GBP, %AUD, %NZD, %USD, %CAD, %NOK, %SEK and %JPY
• Topics: FED and ECB
• Could have chosen many other relevant macro topics
• Helps us prune the search space to most relevant news
• Steps we shall do
• Clean body text slightly (eg. remove start of article)
• Ignore very short articles as difficult to gauge sentiment
• Apply sentiment analysis for each article (shall use open source Python based libraries)
• Aggregate data into daily observations (careful about holidays!)
• Create indices for each currency/topic (Z scores for comparability)
• Also generate a news volume score (Z score for comparability)
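The Z-score step in the list above, which makes indices comparable across currencies and topics, can be sketched for one ticker (the daily sentiment values are synthetic, not Bloomberg News data):

```python
# Turn daily sentiment for one ticker into a comparable Z-score index.
import statistics

daily_sent = {"2017-01-02": 0.1, "2017-01-03": -0.4, "2017-01-04": 0.6,
              "2017-01-05": 0.3, "2017-01-06": -0.2}  # synthetic values
vals = list(daily_sent.values())
mu, sd = statistics.mean(vals), statistics.pstdev(vals)
zscores = {d: (v - mu) / sd for d, v in daily_sent.items()}
```

The same transform applied to article counts gives the news volume score.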
51. Currency pair sentiment score
• Currency pair score = base score – terms score
• When eg. USD/JPY score is positive buy, otherwise sell
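The pair-score rule above is just the difference of the two currency indices; a minimal sketch (scores are made up):

```python
# Pair score = base currency score - terms currency score; buy when positive.
scores = {"USD": 0.5, "JPY": -0.2, "EUR": 0.1}  # illustrative sentiment scores

def pair_signal(base: str, terms: str) -> int:
    return 1 if scores[base] - scores[terms] > 0 else -1

usdjpy = pair_signal("USD", "JPY")  # 0.5 - (-0.2) > 0, so buy USD/JPY
```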
52. News trading rule by currency pair
• Present risk adjusted returns and compare to a generic trend following strategy
• Apply vol targeting in each instance
• News based trading rule outperforms trend significantly in our sample
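Vol targeting, as applied in each instance above, can be sketched as scaling position size inversely to realised volatility (the 10% target and vols below are illustrative, not the talk's calibration):

```python
# Vol targeting sketch: scale the position so expected strategy vol matches
# a target; low-vol signals get levered up, high-vol signals scaled down.
def vol_target_weight(realised_vol: float, target_vol: float = 0.10) -> float:
    return target_vol / realised_vol

low = vol_target_weight(0.08)   # quiet market: leverage up
high = vol_target_weight(0.20)  # volatile market: scale down
```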
54. What about news volume?
• News volume on a currency pair is heavily correlated with its implied volatility, which
seems intuitive!
• T statistics show a statistically significant relationship in nearly every currency pair in
our sample
• News volume can be used to help us model FX volatility – is FX volatility in line with
what we could expect based on newsflow?
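The correlation claim above can be checked with a plain Pearson correlation; a sketch on synthetic series (the slide's evidence is t-statistics on real data, not these numbers):

```python
# Pearson correlation of daily news volume with implied vol, computed by hand.
news_vol = [120, 80, 200, 150, 90, 220]       # synthetic article counts
implied_vol = [9.5, 8.0, 12.0, 10.5, 8.5, 12.5]  # synthetic implied vols (%)

n = len(news_vol)
mx = sum(news_vol) / n
my = sum(implied_vol) / n
cov = sum((x - mx) * (y - my) for x, y in zip(news_vol, implied_vol)) / n
sx = (sum((x - mx) ** 2 for x in news_vol) / n) ** 0.5
sy = (sum((y - my) ** 2 for y in implied_vol) / n) ** 0.5
corr = cov / (sx * sy)
```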
55. Scheduled events
• Before scheduled events, FX vol market makers will mark up vol curve
• Known as event volatility add-on
• LHS shows EUR/USD ON vol on Fed days, and RHS for ECB days (ignores all other days)
• Have a model for estimating the add-on (assumes only one big event per day)
• Typically, realized underperforms on these days… Sell vol*!
• *within reason…
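One simple version of such an add-on model, assuming a single big event and additive variances (an illustrative decomposition, not necessarily the exact model used in the talk):

```python
# One-event add-on model: ON implied variance = quiet-day variance + event
# variance, so event add-on = sqrt(implied^2 - quiet^2). Vols are annualised %.
def event_addon(on_implied_vol: float, quiet_vol: float) -> float:
    event_var = max(on_implied_vol ** 2 - quiet_vol ** 2, 0.0)
    return event_var ** 0.5

addon = event_addon(on_implied_vol=15.0, quiet_vol=7.0)  # illustrative levels
```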
56. News for scheduled events
• Can we use news around scheduled events (eg. FED and ECB topics in our case) to inform where the add-on is?
• And also to give us an idea of where realized vol would be subsequently? Gamma
traders are taking a view on where implied – realized will be
• There does seem to be a relationship between EUR/USD vol and news before FOMC
and ECB meetings
57. EUR/USD vol and news on FOMC days
• Showing news volume versus add-on, implied and realized ON in EUR/USD on FOMC
days
58. EUR/USD vol and news on ECB days
• Showing news volume versus add-on, implied and realized ON in EUR/USD on ECB
days
59. Conclusion
• Alternative data primer, introducing the topic
• Talked about where to find data
• Showed examples of how to generate (and save!) alpha using
alternative data examining
• CLS FX flow data to generate FX trading signals
• Text based datasets for Fed communications
• Geospatial Insights satellite imagery to estimate EPS
• tcapy to reduce trading costs for FX
60. Any questions?
• Drop me an e-mail at saeed@cuemacro.com, ring me or tweet to @saeedamenfx (or even talk
to me now, the old school way!)
67. Thank you!
Sri Krishnamurthy, CFA, CAP
Founder and CEO
QuantUniversity LLC.
srikrishnamurthy
www.QuantUniversity.com
Contact
Information, data and drawings embodied in this presentation are strictly a property of QuantUniversity LLC. and shall not be
distributed or used in any other publication without the prior written consent of QuantUniversity LLC.