The Central Bank of the Republic of Turkey is primarily responsible for steering the monetary and exchange rate policies in Turkey.
One of the core functions of the Bank is market operations. In this context, analyzing and interpreting real-time tick data related to money market instruments has become not only a requirement but also a challenge.
For this use case, an API provided by one of the financial data vendors is used to gather real-time tick data, and data routing is orchestrated by Apache NiFi.
The gathered data is transferred to Kafka topics and then handed off to Druid for real-time indexing.
Indicators such as effective cost, bid-ask spread, price impact measures, and return reversal are calculated using Apache Spark and finally visualized by means of Apache Superset in order to provide decision-makers with a new set of tools.
Observing Intraday Indicators Using Real-Time Tick Data on Apache Superset and Druid
1. Developing High Frequency Indicators Using Real-Time Tick Data on Apache Superset and Druid
CBRT Big Data Team
Emre Tokel, Kerem Başol, M. Yağmur Şahin
Zekeriya Besiroglu / Komtas Bilgi Yonetimi
21 March 2019, Barcelona
2. Agenda
1. WHO WE ARE: CBRT & Our Team
2. HIGH FREQUENCY INDICATORS: Importance & Goals
3. PROJECT DETAILS: Before, Test Cluster, Phase 1-2-3, Prod Migration
4. CURRENT ARCHITECTURE: Apache Kafka, Spark, Druid & Superset
5. WORK IN PROGRESS: Further analyses
6. FUTURE PLANS
4. Our Solutions
Data Management
• Data Governance Solutions
• Next Generation Analytics
• 360 Engagement
• Data Security
Analytics
• Data Warehouse Solutions
• Customer Journey Analytics
• Advanced Marketing Analytics Solutions
• Industry-specific analytic use cases
• Online Customer Data Platform
• IoT Analytics
• Analytic Lab Solution
Big Data & AI
• Big Data & AI Advisory Services
• Big Data & AI Accelerators
• Data Lake Foundation
• EDW Optimization / Offloading
• Big Data Ingestion and Governance
• AI Implementation – Chatbot
• AI Implementation – Image Recognition
Security Analytics
• Security Analytic Advisory Services
• Integrated Law Enforcement Solutions
• Cyber Security Solutions
• Fraud Analytics Solutions
• Governance, Risk & Compliance Solutions
5. Zekeriya Besiroglu
• 20+ years in IT, 18+ years in DB & DWH
• 7+ years in big data
• Lead Architect & Big Data/Analytics @KOMTAS
• Instructor & Consultant
• Big data instructor at ITU, MEF and Şehir University
• Certified R programmer
• Certified Hadoop Administrator
6. Our Organization
The Central Bank of the Republic of Turkey is primarily responsible for steering the monetary and exchange rate policies in Turkey.
o Price stability
o Financial stability
o Exchange rate regime
o The privilege of printing and issuing banknotes
o Payment systems
7. CBRT Big Data Team
• Emre Tokel: Big Data Team Leader
• Kerem Başol: Big Data Engineer
• M. Yağmur Şahin: Big Data Engineer
Founded in September 2017 with experienced software engineers
o Members have academic backgrounds in finance and big data
o First task was to set up a big data platform
o PoC work was done to demonstrate the capabilities of a big data platform
o Payment system data was analyzed
9. Importance and Goals
To observe foreign exchange markets in real-time
o Are there any patterns related to specific time intervals during the day?
o Is there anything to observe before/after local working hours throughout the whole day?
o What does the difference between bid/ask prices tell us?
To be able to detect risks and take necessary policy measures in a timely manner
o Developing liquidity and risk indicators based on real-time tick data
o Visualizing observations for decision makers in real-time
o Finally, discovering possible intraday seasonality
Wouldn’t it be great to be able to correlate with news flow as well?
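As a reference point for these indicators, the quantities involved can be written down explicitly. The formulas below are a sketch using standard market microstructure definitions; the deck does not spell out the Bank's exact calculations. With best bid b_t, best ask a_t, trade price p_t and trade direction q_t (+1 for buyer-initiated, -1 for seller-initiated trades):

```latex
m_t = \frac{a_t + b_t}{2}, \qquad
\mathrm{QuotedSpread}_t = a_t - b_t, \qquad
\mathrm{RelativeSpread}_t = \frac{a_t - b_t}{m_t}, \qquad
\mathrm{EffectiveCost}_t = \frac{2\,q_t\,(p_t - m_t)}{m_t}
```

A widening relative spread or a rising effective cost at particular times of day is exactly the kind of intraday seasonality the dashboards aim to surface.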
11. Development of High Frequency Indicators Using Real-Time Tick Data on Apache Superset and Druid
[Project timeline: Test Cluster → Phase 1 → Phase 2 → Phase 3 → Prod migration → Next phases]
12. Test Cluster
Our first big data studies started on very humble servers
o 5 servers with 32 GB RAM each
o 3 TB storage
HDP 2.6.0.3 installed
o Not the latest version back then
Technical difficulties
o Performance problems
o Apache Druid indexing
o Apache Superset maturity
13. Development of High Frequency Indicators Using Real-Time Tick Data on Apache Superset and Druid
[Project timeline: Test Cluster → Phase 1 → Phase 2 → Phase 3 → Prod migration → Next phases]
15. Thomson Reuters Enterprise Platform (TREP)
Thomson Reuters provides its subscribers with an enterprise platform through which they can collect market data as it is generated
Each financial instrument on TREP has a unique code called a RIC (Reuters Instrument Code)
The event queue implemented by the platform can be consumed with the provided Java SDK
We developed a Java application that consumes this event queue to collect tick data for the required RICs
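A minimal sketch of that consumer is shown below. The actual Thomson Reuters SDK types are not reproduced here: TrepSession, TickEvent and TickListener are hypothetical stand-ins for the SDK's event-queue abstractions, and the RIC list is illustrative.

```java
import java.util.Arrays;
import java.util.List;

public class TickCollector {

  /** Hypothetical stand-in for a tick delivered by the platform's event queue. */
  interface TickEvent {
    String ric();      // Reuters Instrument Code
    double bid();
    double ask();
    long timestamp();
  }

  /** Hypothetical callback registered on the event queue. */
  interface TickListener {
    void onTick(TickEvent event);
  }

  /** Hypothetical stand-in for the SDK session that pumps the event queue. */
  interface TrepSession {
    void subscribe(List<String> rics, TickListener listener);
    void dispatchEvents(); // blocks, delivering events to listeners
  }

  public static void run(TrepSession session) {
    // Subscribe only to the RICs required by the analysts (illustrative codes).
    List<String> rics = Arrays.asList("EURTRY=", "USDTRY=");
    session.subscribe(rics, event ->
        // In the real application each tick is serialized to JSON and
        // handed to the Kafka publisher described on the next slide.
        System.out.printf("%s bid=%.4f ask=%.4f%n",
            event.ric(), event.bid(), event.ask()));
    session.dispatchEvents();
  }
}
```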
17. Apache Kafka
The data flow is very fast and quite dense
o We published the messages containing tick data collected by our Java application to a message queue
o Twofold analysis: Batch and real-time
We decided to use Apache Kafka residing on our test big data cluster
We created a topic for each RIC on Apache Kafka and published data to related topics
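The publishing side can be sketched with the standard Kafka Java producer API; the topic-per-RIC convention follows the slide, while the broker address and the choice of the RIC as message key are assumptions.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TickPublisher {
  private final KafkaProducer<String, String> producer;

  public TickPublisher(String bootstrapServers) {
    Properties props = new Properties();
    props.put("bootstrap.servers", bootstrapServers);
    props.put("key.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");
    this.producer = new KafkaProducer<>(props);
  }

  /** Publishes one tick (already serialized as JSON) to the topic named after its RIC. */
  public void publish(String ric, String tickJson) {
    producer.send(new ProducerRecord<>(ric, ric, tickJson));
  }

  public void close() {
    producer.flush();
    producer.close();
  }
}
```

Keying messages by RIC keeps ticks for the same instrument ordered within a partition, which matters for the window-based indicators downstream.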
19. Apache NiFi
In order to manage the flow, we decided to use Apache NiFi
We used the ConsumeKafka processor to consume messages from Kafka topics
The NiFi flow was designed to persist the data to MongoDB
22. MongoDB
We had prepared data in JSON format with our Java application
Since we have MongoDB installed on our enterprise systems, we decided to persist this data to MongoDB
Although MongoDB is not part of HDP, it seemed like a good choice for our researchers to use this data in their analyses
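In the flow described above, persistence is handled by NiFi itself; purely as an illustration, the equivalent insert using the MongoDB Java driver (mongodb-driver-sync) looks like this, with the connection string, database and collection names, and the sample tick all invented for the example.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class TickArchiver {
  public static void main(String[] args) {
    try (MongoClient client = MongoClients.create("mongodb://mongo-host:27017")) {
      MongoCollection<Document> ticks =
          client.getDatabase("marketdata").getCollection("ticks");

      // Each message arrives as a JSON document prepared by the Java application.
      String tickJson =
          "{\"ric\":\"EURTRY=\",\"bid\":6.10,\"ask\":6.11,"
          + "\"ts\":\"2019-03-21T09:00:00Z\"}";
      ticks.insertOne(Document.parse(tickJson));
    }
  }
}
```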
24. Apache Zeppelin
We provided our researchers with access to Apache Zeppelin and a connection to MongoDB via Python
By doing so, we offered an alternative to the tools on their local computers and provided a unified interface for financial analysis
25. Business Intelligence on Client Side
Our users had to download daily tick-data manually from their Thomson Reuters Terminals and work in Excel
Users were then able to access tick-data using Power BI, which has a MongoDB connector
o We also provided our users with a news timeline along with the tick-data
26. We needed more!
We had to visualize the data in real-time
o Analysis of persisted data using MongoDB, Power BI and Apache Zeppelin was not enough
28. Development of High Frequency Indicators Using Real-Time Tick Data on Apache Superset and Druid
[Project timeline: Test Cluster → Phase 1 → Phase 2 → Phase 3 → Prod migration → Next phases]
30. Apache Druid
We needed a database which was able to:
o Answer ad-hoc queries (slice/dice) for a limited window efficiently
o Store historic data and seamlessly integrate current and historic data
o Provide native integration with possible real-time visualization frameworks (preferably from the Apache stack)
o Provide native integration with Apache Kafka
Apache Druid addressed all the aforementioned requirements
The indexing task was achieved using Tranquility
32. Apache Superset
Apache Superset was the obvious alternative for real-time visualization since tick-data
was stored on Apache Druid
o Native integration with Apache Druid
o Freely available on Hortonworks service stack
We prepared real-time dashboards including:
o Transaction Count
o Bid / Ask Prices
o Contributor Distribution
o Bid - Ask Spread
All dashboards included min/max/average values
33. We needed more, again!
Reliability issues with Druid
Performance issues
Enterprise integration requirements
34. Development of High Frequency Indicators Using Real-Time Tick Data on Apache Superset and Druid
[Project timeline: Test Cluster → Phase 1 → Phase 2 → Phase 3 → Prod migration → Next phases]
36. Development of High Frequency Indicators Using Real-Time Tick Data on Apache Superset and Druid
[Project timeline: Test Cluster → Phase 1 → Phase 2 → Phase 3 → Prod migration → Next phases]
38. Apache Hive + Druid Integration
After setting up our production environment (HDP 3.0.1.0) and starting to feed data, we realized that the data were scattered and we were missing the option to co-utilize these different data sources
We then realized that Apache Hive already provided Kafka and Druid indexing in the form of a simple table creation, together with a facility for querying Druid from Hive (sketched below)
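A sketch of what that table-creation facility looks like, issued here through the standard Hive JDBC driver. The storage handler class is the documented org.apache.hadoop.hive.druid.DruidStorageHandler; the schema, topic name and connection URL are illustrative, and the exact TBLPROPERTIES keys may vary between Hive/HDP versions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class DruidTableSetup {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://hive-server:10000/default", "hive", "");
         Statement stmt = conn.createStatement()) {

      // A Druid-backed table fed from a Kafka topic, created from Hive.
      stmt.execute(
          "CREATE EXTERNAL TABLE ticks_druid ("
          + " `__time` TIMESTAMP, ric STRING, bid DOUBLE, ask DOUBLE) "
          + "STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler' "
          + "TBLPROPERTIES ("
          + " 'kafka.bootstrap.servers' = 'broker:9092',"
          + " 'kafka.topic' = 'EURTRY',"
          + " 'druid.kafka.ingestion.useEarliestOffset' = 'true')");

      // Start the continuous Kafka-to-Druid indexing that Hive manages.
      stmt.execute(
          "ALTER TABLE ticks_druid "
          + "SET TBLPROPERTIES ('druid.kafka.ingestion' = 'START')");
    }
  }
}
```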
40. Apache Spark
Due to additional calculation requirements of our users, some of which could not be handled declaratively, we decided to utilize Apache Spark
With Apache Spark 2.4, we used Spark Streaming and Spark SQL contexts together in
the same application
In our Spark application
o Every 5 seconds, a 30-second window is created (sketched after this list)
o On each window, outlier boundaries are calculated
o Outlier data points are detected
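That windowing logic can be sketched as follows. The deck describes Spark Streaming and Spark SQL contexts used together; this version expresses the same sliding window in Structured Streaming on Spark 2.4, with the topic, the JSON field names, and the mean ± 3 standard deviations outlier rule all assumptions.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.StructType;
import static org.apache.spark.sql.functions.*;

public class TickOutlierJob {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder()
        .appName("TickOutlierDetection")
        .getOrCreate();

    // Assumed schema of the tick JSON payload.
    StructType tickSchema = new StructType()
        .add("ts", "timestamp")
        .add("bid", "double")
        .add("ask", "double");

    // Read one per-RIC topic from Kafka and parse the JSON value.
    Dataset<Row> ticks = spark.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092") // illustrative
        .option("subscribe", "EURTRY")                    // one topic per RIC
        .load()
        .selectExpr("CAST(value AS STRING) AS json")
        .select(from_json(col("json"), tickSchema).as("t"))
        .select("t.*");

    // Every 5 seconds, form a 30-second window and compute outlier
    // boundaries; mean +/- 3 standard deviations is an assumed rule.
    Dataset<Row> bounds = ticks
        .withWatermark("ts", "1 minute")
        .groupBy(window(col("ts"), "30 seconds", "5 seconds"))
        .agg(avg("bid").alias("mean"), stddev("bid").alias("std"))
        .withColumn("lower", col("mean").minus(col("std").multiply(3)))
        .withColumn("upper", col("mean").plus(col("std").multiply(3)));

    // Ticks falling outside [lower, upper] are then flagged as outliers.
    bounds.writeStream()
        .outputMode("update")
        .format("console")
        .start()
        .awaitTermination();
  }
}
```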
51. To-Do List
Matching data subscription
Bringing historical tick data into real-time analysis
Possible use of machine learning for intraday indicators
Emre Tokel - Big Data Team Leader
Emre has 15+ years of experience in software development. He has worked as a developer and project manager in various projects. For 2 years now, he has been involved in big data and data intelligence studies within the Bank. Emre has been leading the big data team since last year and is responsible for the architecture of the Big Data Platform, which is based on Hortonworks technologies. He has an MBA degree and is pursuing his Ph.D. in finance. Besides IT, he is a divemaster and teaches SCUBA diving.

Kerem Başol - Big Data Engineer
Kerem has 10+ years of experience in software development including mobile, back-end and front-end. For the past two years, he has focused on big data technologies and is currently working as a big data engineer. Kerem is responsible for data ingestion and building custom solution stacks for business needs using the Big Data Platform, which is based on Hortonworks technologies. He holds an MS degree in CIS from UPenn.

M. Yağmur Şahin - Big Data Engineer
Yağmur has been developing software for 10 years. He completed his master's degree in 2016 on distributed stream processing, where he was first introduced to big data technologies. For the last 2 years, he has been designing and implementing big data solutions for the Bank using Hortonworks Data Platform. Yağmur is also pursuing his Ph.D. at the Medical Informatics department of METU. He loves running and hopes to complete a marathon in the coming years.