Developing High Frequency Indicators
Using Real-Time Tick Data
on Apache Superset and Druid
CBRT Big Data Team
Emre Tokel, Kerem Başol, M. Yağmur Şahin
Zekeriya Besiroglu / Komtas Bilgi Yonetimi
21 March 2019 Barcelona
Agenda
1. WHO WE ARE: CBRT & Our Team
2. HIGH FREQUENCY INDICATORS: Importance & Goals
3. PROJECT DETAILS: Before, Test Cluster, Phase 1-2-3, Prod Migration
4. CURRENT ARCHITECTURE: Apache Kafka, Spark, Druid & Superset
5. WORK IN PROGRESS: Further analyses
6. FUTURE PLANS
Who We Are
Our Solutions
Data Management
• Data Governance Solutions
• Next Generation Analytics
• 360 Engagement
• Data Security
Analytics
• Data Warehouse Solutions
• Customer Journey Analytics
• Advanced Marketing Analytics Solutions
• Industry-specific analytic use cases
• Online Customer Data Platform
• IoT Analytics
• Analytic Lab Solution
Big Data & AI
• Big Data & AI Advisory Services
• Big Data & AI Accelerators
• Data Lake Foundation
• EDW Optimization / Offloading
• Big Data Ingestion and Governance
• AI Implementation – Chatbot
• AI Implementation – Image Recognition
Security Analytics
• Security Analytic Advisory Services
• Integrated Law Enforcement Solutions
• Cyber Security Solutions
• Fraud Analytics Solutions
• Governance, Risk & Compliance Solutions
• 20+ years in IT, 18+ years in DB & DWH
• 7+ years in Big Data
• Lead Architect, Big Data / Analytics @KOMTAS
• Instructor & Consultant
• Big Data instructor at ITU, MEF and Şehir University
• Certified R programmer
• Certified Hadoop Administrator
Our Organization
 The Central Bank of the Republic of Turkey is primarily responsible for steering the
monetary and exchange rate policies in Turkey.
o Price stability
o Financial stability
o Exchange rate regime
o The privilege of printing and issuing banknotes
o Payment systems
M. Yağmur Şahin (Big Data Engineer), Emre Tokel (Big Data Engineer), Kerem Başol (Big Data Team Leader)
High Frequency
Indicators
Importance and Goals
 To observe foreign exchange markets in real-time
o Are there any patterns regarding specific time intervals during the day?
o Is there anything to observe before/after local working hours throughout the whole day?
o What does the difference between bid/ask prices tell us?
 To be able to detect risks and take necessary policy measures in a timely manner
o Developing liquidity and risk indicators based on real-time tick data
o Visualizing observations for decision makers in real-time
o Finally, discovering possible intraday seasonality
 Wouldn’t it be great to be able to correlate with news flow as well?
Project Details
Development of High Frequency Indicators Using Real-Time Tick Data on Apache Superset and Druid
Project timeline: Test Cluster → Phase 1 → Phase 2 → Phase 3 → Prod migration → Next phases
Test Cluster
 Our first big data studies started on very humble servers
o 5 servers with 32 GB RAM each
o 3 TB storage
 HDP 2.6.0.3 installed
o Not the latest version back then
 Technical difficulties
o Performance problems
o Apache Druid indexing
o Apache Superset maturity
Phase 1 pipeline: TREP API → Apache Kafka → Apache NiFi → MongoDB → Apache Zeppelin & Power BI
Thomson Reuters Enterprise Platform (TREP)
 Thomson Reuters provides its subscribers with an enterprise platform through which they can
collect market data as it is generated
 Each financial instrument on TREP has a unique code called RIC
 The event queue implemented by the platform can be consumed with the provided
Java SDK
 We developed a Java application that consumes this event queue and collects tick data
for the required RICs
Apache Kafka
 The data flow is very fast and quite dense
o We published the messages containing tick data collected by our Java application to a message
queue
o Twofold analysis: Batch and real-time
 We decided to use Apache Kafka residing on our test big data cluster
 We created a topic for each RIC on Apache Kafka and published data to related topics
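The topic-per-RIC layout above needs a small naming rule, since RIC codes contain characters such as `=` that are not legal in Kafka topic names. A minimal sketch of such a mapping (the `tick.` prefix and the sanitization rule are illustrative assumptions, not the team's documented convention):

```python
def topic_for_ric(ric: str) -> str:
    """Map a Reuters RIC to a Kafka topic name.

    Kafka topic names may only contain [a-zA-Z0-9._-], so characters
    like '=' that appear in RICs are replaced. The 'tick.' prefix and
    the replacement rule are illustrative assumptions.
    """
    safe = "".join(c if c.isalnum() else "-" for c in ric).strip("-")
    return f"tick.{safe}"
```

The resulting name (e.g. `tick.USDTRY` for the RIC `USDTRY=`) would then be used when publishing each message to its instrument's topic.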
Apache NiFi
 In order to manage the flow, we decided to use Apache NiFi
 We used KafkaConsumer processor to consume messages from Kafka queues
 The NiFi flow was designed to persist the incoming messages to MongoDB
Our NiFi Flow
MongoDB
 We had prepared data in JSON format with our Java application
 Since we had MongoDB installed on our enterprise systems, we decided to persist
this data to MongoDB
 Although MongoDB is not part of HDP, it seemed a good choice for our
researchers to use this data in their analyses
Apache Zeppelin
 We provided our researchers with access to Apache Zeppelin and a connection to
MongoDB via Python
 By doing so, we offered an alternative to the tools on local computers and provided a
unified interface for financial analysis
Business Intelligence on Client Side
 Our users had to download daily tick-data manually from their Thomson Reuters
Terminals and work on Excel
 Users were then able to access tick-data using Power BI
o We also provided our users with a news timeline along with the tick-data
We needed more!
 We had to visualize the data in real-time
o Analysis of persisted data using MongoDB, Power BI and Apache Zeppelin was not enough
Phase 2 pipeline: TREP API → Apache Kafka → Apache Druid → Apache Superset
Apache Druid
 We needed a database which was able to:
o Answer ad-hoc queries (slice/dice) for a limited window efficiently
o Store historic data and seamlessly integrate current and historic data
o Provide native integration with possible real-time visualization frameworks (preferably from
Apache stack)
o Provide native integration with Apache Kafka
 Apache Druid addressed all the aforementioned requirements
 Indexing was handled using Tranquility
Apache Superset
 Apache Superset was the obvious choice for real-time visualization since tick-data
was stored on Apache Druid
o Native integration with Apache Druid
o Freely available on Hortonworks service stack
 We prepared real-time dashboards including:
o Transaction Count
o Bid / Ask Prices
o Contributor Distribution
o Bid - Ask Spread
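The bid-ask spread metric behind the last dashboard is typically normalized by the mid price so instruments with different price levels are comparable. A minimal sketch (expressing the spread in basis points is an assumption; the deck does not state the exact formula used):

```python
def bid_ask_spread_bps(bid: float, ask: float) -> float:
    """Quoted bid-ask spread relative to the mid price, in basis points.

    Normalizing by the mid price is a common convention; the deck does
    not specify which normalization the dashboard actually uses.
    """
    mid = (bid + ask) / 2.0
    return (ask - bid) / mid * 1e4
```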
We needed more, again!
 Reliability issues with Druid
 Performance issues
 Enterprise integration requirements
Architecture
Data sources (Internet Data, Enterprise Content, Social Media/Media, Micro Level Data, Commercial Data Vendors) flow through the Ingestion layer into the Big Data Platform, which serves Data Science workloads under a common Governance layer.
Phase 3 pipeline: TREP API → Apache Kafka → Apache Hive + Druid Integration → Apache Spark → Apache Superset
Apache Hive + Druid Integration
 After setting up our production environment (on HDP 3.0.1.0) and starting to
feed data, we realized that the data were scattered and we were missing the option to
co-utilize these different data sources
 We then realized that Apache Hive already provided Kafka & Druid indexing
in the form of a simple table creation, along with a facility for querying Druid from
Hive
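The "simple table creation" mentioned above refers to Hive's storage handlers. A hedged sketch of what such DDL might look like (table names, columns, topic and server addresses are all illustrative; consult the Hive documentation for the exact properties in your HDP version):

```sql
-- Expose a Kafka topic as a Hive table (illustrative columns/topic).
CREATE EXTERNAL TABLE tick_stream (ric STRING, bid DOUBLE, ask DOUBLE)
STORED BY 'org.apache.hadoop.hive.kafka.KafkaStorageHandler'
TBLPROPERTIES (
  "kafka.topic" = "tick.USDTRY",
  "kafka.bootstrap.servers" = "broker1:9092"
);

-- Create a Druid-backed Hive table; Druid requires a `__time` column.
CREATE TABLE tick_druid (`__time` TIMESTAMP, ric STRING, bid DOUBLE, ask DOUBLE)
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.segment.granularity" = "HOUR");
```

With both tables in place, Hive can join streaming and historical data, and an `INSERT INTO tick_druid SELECT ...` can move data into Druid.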
Apache Spark
 Due to additional calculation requirements of our users, we decided to utilize Apache
Spark
 With Apache Spark 2.4, we used Spark Streaming and Spark SQL contexts together in
the same application
 In our Spark application
o Every 5 seconds, a 30-second window is created
o On each window, outlier boundaries are calculated
o Outlier data points are detected
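The per-window logic above can be sketched in plain Python (the deck only says "outlier boundaries are calculated", so the mean ± k·stddev rule below is an illustrative assumption, as are the function names):

```python
from statistics import mean, stdev

def in_window(ticks, now, length=30.0):
    """Prices whose timestamp falls within the last `length` seconds.

    `ticks` is a list of (timestamp, price) pairs; in the actual Spark
    application this selection is done by Spark Streaming's windowing.
    """
    return [price for ts, price in ticks if now - length < ts <= now]

def outlier_bounds(prices, k=3.0):
    """Outlier boundaries for one window: mean +/- k standard deviations.

    The boundary rule is an assumption; the deck does not state which
    statistic the Spark application uses.
    """
    m, s = mean(prices), stdev(prices)
    return m - k * s, m + k * s

def detect_outliers(prices, k=3.0):
    """Data points outside the computed boundaries."""
    lo, hi = outlier_bounds(prices, k)
    return [p for p in prices if p < lo or p > hi]
```

In the production pipeline this logic runs inside a Spark Streaming application, with the detected outliers published to the windowed Kafka topic.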
Current Architecture
Current Architecture & Progress So Far
 The Java application consumes the TREP event queue and publishes to the real-time Kafka topic
 The Spark application consumes the real-time topic and publishes windowed results to a second Kafka topic
 Each Kafka topic feeds its own Druid datasource (real-time and windowed)
 The Druid datasources back two Superset dashboards: tick data and outlier
[Screenshots: TREP Data Flow, Windowed Spark Streaming, Tick-Data Dashboard, Outlier Dashboard]
Work in Progress
Implementing…
 Moving average calculation (20-day window)
 Volatility Indicator
 Average True Range Indicator (moving average of the true range)
o max(t) - min(t)
o | max(t) - close(t-1) |
o | min(t) - close(t-1) |
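The three quantities above are the candidate spans of the true range; the ATR averages the largest of them over a trailing window. A minimal sketch in plain Python (a simple moving average is assumed as the averaging scheme; Wilder's original ATR uses an exponential smoothing instead):

```python
def true_range(high, low, prev_close):
    """True range for one interval: the largest of the three spans."""
    return max(high - low, abs(high - prev_close), abs(low - prev_close))

def atr(bars, n=3):
    """Average True Range over the last n bars.

    `bars` is a list of (high, low, close) tuples. A plain simple
    moving average is used here for illustration; the deck mentions a
    20-day moving average window for its indicators.
    """
    trs = [true_range(h, l, bars[i - 1][2])
           for i, (h, l, _c) in enumerate(bars) if i > 0]
    if len(trs) < n:
        raise ValueError("not enough bars for the requested window")
    return sum(trs[-n:]) / n
```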
Future Plans
To-Do List
 Matching data subscription
 Bringing historical tick data into real-time analysis
 Possible use of machine learning for intraday indicators
Thank you!
Q & A
Más contenido relacionado

La actualidad más candente

Feature Store as a Data Foundation for Machine Learning
Feature Store as a Data Foundation for Machine LearningFeature Store as a Data Foundation for Machine Learning
Feature Store as a Data Foundation for Machine Learning
Provectus
 
Apresentacao microsoft project
Apresentacao microsoft projectApresentacao microsoft project
Apresentacao microsoft project
Franklin G. Mendes
 

La actualidad más candente (20)

ServiceNow Paris Release - Our favorite new features
ServiceNow Paris Release - Our favorite new featuresServiceNow Paris Release - Our favorite new features
ServiceNow Paris Release - Our favorite new features
 
Story Points Estimation And Planning Poker
Story Points Estimation And Planning PokerStory Points Estimation And Planning Poker
Story Points Estimation And Planning Poker
 
Kerzner gerenciamento de projetos uma abordagem sistêmica para o planejamen...
Kerzner gerenciamento de projetos   uma abordagem sistêmica para o planejamen...Kerzner gerenciamento de projetos   uma abordagem sistêmica para o planejamen...
Kerzner gerenciamento de projetos uma abordagem sistêmica para o planejamen...
 
Conquer 6 workforce planning and optimization challenges | Anaplan
Conquer 6 workforce planning and optimization challenges | AnaplanConquer 6 workforce planning and optimization challenges | Anaplan
Conquer 6 workforce planning and optimization challenges | Anaplan
 
Go Observability (in practice)
Go Observability (in practice)Go Observability (in practice)
Go Observability (in practice)
 
Accenture and Workday: Look to the Cloud for your Global Payroll Strategy
Accenture and Workday: Look to the Cloud for your Global Payroll Strategy       Accenture and Workday: Look to the Cloud for your Global Payroll Strategy
Accenture and Workday: Look to the Cloud for your Global Payroll Strategy
 
Feature Store as a Data Foundation for Machine Learning
Feature Store as a Data Foundation for Machine LearningFeature Store as a Data Foundation for Machine Learning
Feature Store as a Data Foundation for Machine Learning
 
Gestão de Prazos e Custos do Projeto
Gestão de Prazos e Custos do ProjetoGestão de Prazos e Custos do Projeto
Gestão de Prazos e Custos do Projeto
 
Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)
 
Odoo Implementation Methodology
Odoo Implementation MethodologyOdoo Implementation Methodology
Odoo Implementation Methodology
 
Managing Product Development Chaos with Jira Software and Confluence
Managing Product Development Chaos with Jira Software and ConfluenceManaging Product Development Chaos with Jira Software and Confluence
Managing Product Development Chaos with Jira Software and Confluence
 
MLflow: A Platform for Production Machine Learning
MLflow: A Platform for Production Machine LearningMLflow: A Platform for Production Machine Learning
MLflow: A Platform for Production Machine Learning
 
Lessons Learned From Running Spark On Docker
Lessons Learned From Running Spark On DockerLessons Learned From Running Spark On Docker
Lessons Learned From Running Spark On Docker
 
Building a Replicated Logging System with Apache Kafka
Building a Replicated Logging System with Apache KafkaBuilding a Replicated Logging System with Apache Kafka
Building a Replicated Logging System with Apache Kafka
 
Proje yonetimi
Proje yonetimiProje yonetimi
Proje yonetimi
 
Gestão de Projetos - Exemplo de Documentação de Projeto
Gestão de Projetos - Exemplo de Documentação de ProjetoGestão de Projetos - Exemplo de Documentação de Projeto
Gestão de Projetos - Exemplo de Documentação de Projeto
 
Agile-overview: Agile Manifesto, Agile principles and Agile Methodologies
Agile-overview: Agile Manifesto, Agile principles and Agile MethodologiesAgile-overview: Agile Manifesto, Agile principles and Agile Methodologies
Agile-overview: Agile Manifesto, Agile principles and Agile Methodologies
 
Apresentacao microsoft project
Apresentacao microsoft projectApresentacao microsoft project
Apresentacao microsoft project
 
Software Project Scheduling Diagrams
Software Project Scheduling DiagramsSoftware Project Scheduling Diagrams
Software Project Scheduling Diagrams
 
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
 

Similar a Observing Intraday Indicators Using Real-Time Tick Data on Apache Superset and Druid

Similar a Observing Intraday Indicators Using Real-Time Tick Data on Apache Superset and Druid (20)

Developing high frequency indicators using real time tick data on apache supe...
Developing high frequency indicators using real time tick data on apache supe...Developing high frequency indicators using real time tick data on apache supe...
Developing high frequency indicators using real time tick data on apache supe...
 
SnappyData Toronto Meetup Nov 2017
SnappyData Toronto Meetup Nov 2017SnappyData Toronto Meetup Nov 2017
SnappyData Toronto Meetup Nov 2017
 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
 
Present and future of unified, portable, and efficient data processing with A...
Present and future of unified, portable, and efficient data processing with A...Present and future of unified, portable, and efficient data processing with A...
Present and future of unified, portable, and efficient data processing with A...
 
Confluent kafka meetupseattle jan2017
Confluent kafka meetupseattle jan2017Confluent kafka meetupseattle jan2017
Confluent kafka meetupseattle jan2017
 
Pivotal Real Time Data Stream Analytics
Pivotal Real Time Data Stream AnalyticsPivotal Real Time Data Stream Analytics
Pivotal Real Time Data Stream Analytics
 
Etl is Dead; Long Live Streams
Etl is Dead; Long Live StreamsEtl is Dead; Long Live Streams
Etl is Dead; Long Live Streams
 
Spark Streaming + Kafka 0.10: an integration story by Joan Viladrosa Riera at...
Spark Streaming + Kafka 0.10: an integration story by Joan Viladrosa Riera at...Spark Streaming + Kafka 0.10: an integration story by Joan Viladrosa Riera at...
Spark Streaming + Kafka 0.10: an integration story by Joan Viladrosa Riera at...
 
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...
 
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringApache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
 
Data Streaming with Apache Kafka & MongoDB - EMEA
Data Streaming with Apache Kafka & MongoDB - EMEAData Streaming with Apache Kafka & MongoDB - EMEA
Data Streaming with Apache Kafka & MongoDB - EMEA
 
Webinar: Data Streaming with Apache Kafka & MongoDB
Webinar: Data Streaming with Apache Kafka & MongoDBWebinar: Data Streaming with Apache Kafka & MongoDB
Webinar: Data Streaming with Apache Kafka & MongoDB
 
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10 an integration story
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story[Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10 an integration story
 
Apache Spark Streaming + Kafka 0.10 with Joan Viladrosariera
Apache Spark Streaming + Kafka 0.10 with Joan ViladrosarieraApache Spark Streaming + Kafka 0.10 with Joan Viladrosariera
Apache Spark Streaming + Kafka 0.10 with Joan Viladrosariera
 
What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017 What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017
 
PNDA - Platform for Network Data Analytics
PNDA - Platform for Network Data AnalyticsPNDA - Platform for Network Data Analytics
PNDA - Platform for Network Data Analytics
 
Open Security Operations Center - OpenSOC
Open Security Operations Center - OpenSOCOpen Security Operations Center - OpenSOC
Open Security Operations Center - OpenSOC
 
Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi
Apache-Flink-What-How-Why-Who-Where-by-Slim-BaltagiApache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi
Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi
 
Trend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopTrend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache Bigtop
 
Streaming Data and Stream Processing with Apache Kafka
Streaming Data and Stream Processing with Apache KafkaStreaming Data and Stream Processing with Apache Kafka
Streaming Data and Stream Processing with Apache Kafka
 

Más de DataWorks Summit

HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 

Más de DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Último

Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Último (20)

Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 

Observing Intraday Indicators Using Real-Time Tick Data on Apache Superset and Druid

  • 1. Developing High Frequency Indicators Using Real-Time Tick Data on Apache Superset and Druid CBRT Big Data Team Emre Tokel, Kerem Başol, M. Yağmur Şahin Zekeriya Besiroglu / Komtas Bilgi Yonetimi 21 March 2019 Barcelona
  • 2. Agenda WHO WE ARE CBRT & Our Team PROJECT DETAILS Before, Test Cluster, Phase 1-2-3, Prod Migration HIGH FREQUENCY INDICATORS Importance & Goals CURRENT ARCHITECTURE Apache Kafka, Spark, Druid & Superset WORK IN PROGRESS Further analyses FUTURE PLANS 6 5 4 3 2 1
  • 4. Our Solutions Data Management • Data Governance Solutions • Next Generation Analytics • 360 Engagement • Data Security Analytics • Data Warehouse Solutions • Customer Journey Analytics • Advanced Marketing Analytics Solutions • Industry-specific analytic use cases • Online Customer Data Platform • IoT Analytics • Analytic Lab Solution Big Data & AI • Big Data & AI Advisory Services • Big Data & AI Accelerators • Data Lake Foundation • EDW Optimization / Offloading • Big Data Ingestion and Governance • AI Implementation – Chatbot • AI Implementation – Image Recognition Security Analytics • Security Analytic Advisory Services • Integrated Law Enforcement Solutions • Cyber Security Solutions • Fraud Analytics Solutions • Governance, Risk & Compliance Solutions
  • 5. • +20 IT , +18 DB&DWH • +7 BIG DATA • Lead Archtitect &Big Data /Analytics @KOMTAS • Instructor&Consultant • ITU,MEF,Şehir Uni. BigData Instr. • Certified R programmer • Certified Hadoop Administrator
  • 6. Our Organization  The Central Bank of the Republic of Turkey is primarily responsible for steering the monetary and exchange rate policies in Turkey. o Price stability o Financial stability o Exchange rate regime o The privilege of printing and issuing banknotes o Payment systems
  • 7. • Big Data Engineer• Big Data Engineer M. Yağmur Şahin Emre Tokel Kerem Başol • Big Data Team Leader
  • 9. Importance and Goals  To observe foreign exchange markets in real-time o Are there any patterns regarding to specific time intervals during the day? o Is there anything to observe before/after local working hours throughout the whole day? o What does the difference between bid/ask prices tell us?  To be able to detect risks and take necessary policy measures in a timely manner o Developing liquidity and risk indicators based real-time tick data o Visualizing observations for decision makers in real-time o Finally, discovering possible intraday seasonality  Wouldn’t it be great to be able to correlate with news flow as well?
  • 11. Development of High Frequency Indicators Using Real-Time Tick Data on Apache Superset and Druid Phase 1 Prod migration Next phases Test Cluster Phase 2 Phase 3
  • 12. Test Cluster
     Our first big data studies started on very humble servers
    o 5 servers with 32 GB RAM each
    o 3 TB of storage
     HDP 2.6.0.3 installed
    o Not the latest version even back then
     Technical difficulties
    o Performance problems
    o Apache Druid indexing
    o Apache Superset maturity
  • 14. TREP API → Apache Kafka → Apache NiFi → MongoDB → Apache Zeppelin & Power BI
  • 15. Thomson Reuters Enterprise Platform (TREP)
     Thomson Reuters provides its subscribers with an enterprise platform from which they can collect market data as it is generated
     Each financial instrument on TREP has a unique code called a RIC
     The event queue implemented by the platform can be consumed with the provided Java SDK
     We developed a Java application that consumes this event queue to collect tick data for the required RICs
  • 17. Apache Kafka
     The data flow is very fast and quite dense
    o Our Java application published the messages containing tick data to a message queue
    o Twofold analysis: batch and real-time
     We decided to use Apache Kafka, residing on our test big data cluster
     We created a topic for each RIC on Apache Kafka and published data to the related topics
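As a rough illustration of the topic-per-RIC layout described above, a tick message might be built and routed like this; the JSON field names and the topic-naming convention are assumptions for the sketch, not the team's actual schema (their real publisher was a Java application):

```python
import json
from datetime import datetime, timezone

def tick_to_message(ric, bid, ask, timestamp=None):
    """Build a (topic, payload) pair for one tick.

    One Kafka topic per RIC, as in the talk; field names here
    are illustrative only.
    """
    ts = (timestamp or datetime.now(timezone.utc)).isoformat()
    # RICs can contain '='; sanitize it for the topic name (assumption)
    topic = "ticks." + ric.replace("=", "_")
    payload = json.dumps(
        {"ric": ric, "bid": bid, "ask": ask, "ts": ts}
    ).encode("utf-8")
    return topic, payload

# A real producer would then publish with kafka-python, e.g.:
#   from kafka import KafkaProducer
#   producer = KafkaProducer(bootstrap_servers="broker:9092")
#   producer.send(topic, payload)
topic, payload = tick_to_message("USDTRY=", bid=5.4321, ask=5.4355)
print(topic)                       # ticks.USDTRY_
print(json.loads(payload)["bid"])  # 5.4321
```

Keeping the payload as plain JSON is what later lets the same messages feed both MongoDB and the Druid indexing path without conversion.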
  • 19. Apache NiFi
     To manage the flow, we decided to use Apache NiFi
     We used the KafkaConsumer processor to consume messages from the Kafka topics
     The NiFi flow was designed to persist the data to MongoDB
  • 22. MongoDB
     Our Java application had already prepared the data in JSON format
     Since MongoDB is installed on our enterprise systems, we decided to persist this data to MongoDB
     Although MongoDB is not part of HDP, it seemed a good choice for our researchers to use this data in their analyses
  • 24. Apache Zeppelin
     We gave our researchers access to Apache Zeppelin and a connection to MongoDB via Python
     By doing so, we offered an alternative to the tools on local computers and provided a unified interface for financial analysis
  • 25. Business Intelligence on the Client Side
     Our users previously had to download daily tick data manually from their Thomson Reuters terminals and work in Excel
     Users were then able to access tick data using Power BI
    o We also provided our users with a news timeline alongside the tick data
  • 26. We needed more!
     We had to visualize the data in real time
    o Analysis on persisted data using MongoDB, Power BI and Apache Zeppelin was not enough
  • 30. Apache Druid
     We needed a database that could:
    o Answer ad-hoc (slice/dice) queries over a limited window efficiently
    o Store historic data and seamlessly integrate current and historic data
    o Natively integrate with possible real-time visualization frameworks (preferably from the Apache stack)
    o Natively integrate with Apache Kafka
     Apache Druid addressed all of these requirements
     The indexing task was handled using Tranquility
  • 32. Apache Superset
     Apache Superset was the obvious choice for real-time visualization since the tick data was stored in Apache Druid
    o Native integration with Apache Druid
    o Freely available in the Hortonworks service stack
     We prepared real-time dashboards including:
    o Transaction Count
    o Bid / Ask Prices
    o Contributor Distribution
    o Bid - Ask Spread
  • 33. We needed more, again!
     Reliability issues with Druid
     Performance issues
     Enterprise integration requirements
  • 35. Architecture
    Data sources (internet data, enterprise content, social media/media, micro-level data, commercial data vendors) → Ingestion → Big Data Platform → Data Science, with Governance across the platform
  • 37. TREP API → Apache Kafka → Apache Hive + Druid Integration → Apache Spark → Apache Superset
  • 38. Apache Hive + Druid Integration
     After setting up our production environment (HDP 3.0.1.0) and starting to feed data into it, we realized that the data were scattered and we lacked the ability to co-utilize these different data sources
     We then found that Apache Hive already provides Kafka and Druid indexing in the form of simple table creation, along with a facility for querying Druid from Hive
  • 40. Apache Spark
     Due to additional calculation requirements from our users, we decided to utilize Apache Spark
     With Apache Spark 2.4, we used the Spark Streaming and Spark SQL contexts together in the same application
     In our Spark application:
    o Every 5 seconds, a 30-second window is created
    o On each window, outlier boundaries are calculated
    o Outlier data points are detected
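The windowed logic above can be sketched outside Spark in plain Python. The slides do not say which boundary rule the team used, so a simple mean ± 3σ band stands in for it here; the 30-second window recomputed every 5 seconds follows the slide:

```python
from statistics import mean, stdev

def outlier_bounds(prices, k=3.0):
    """Mean ± k*sigma band over one window (assumed rule; the talk
    does not specify how the boundaries were computed)."""
    m, s = mean(prices), stdev(prices)
    return m - k * s, m + k * s

def detect_outliers(ticks, window=30, slide=5, k=3.0):
    """ticks: list of (second, price) pairs.

    Emulates the Spark job: a 30-second window is evaluated every
    5 seconds, and points outside the band are flagged.
    """
    if not ticks:
        return []
    outliers = set()
    last = max(t for t, _ in ticks)
    for end in range(slide, last + slide + 1, slide):
        win = [(t, p) for t, p in ticks if end - window <= t < end]
        if len(win) < 2:        # stdev needs at least two points
            continue
        lo, hi = outlier_bounds([p for _, p in win], k)
        outliers.update((t, p) for t, p in win if p < lo or p > hi)
    return sorted(outliers)
```

In the real pipeline this computation runs in Spark Streaming and the flagged points are published back to a separate Kafka topic for the outlier dashboard.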
  • 43. Current Architecture & Progress So Far
    TREP Event Queue → (consume) Java Application → (publish) Kafka Topic (real-time) → Druid Datasource (real-time) → Superset Dashboard (tick data)
    Kafka Topic (real-time) → (consume) Spark Application → (publish) Kafka Topic (windowed) → Druid Datasource (windowed) → Superset Dashboard (outlier)
  • 49. Implementing…
     Moving average calculation (20-day window)
     Volatility Indicator
     Average True Range Indicator: a moving average of the true range, the greatest of
    o [ max(t) - min(t) ]
    o [ |max(t) - close(t-1)| ]
    o [ |min(t) - close(t-1)| ]
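The two indicators above can be sketched in a few lines of Python. Note the third true-range term is stated with min(t), matching the standard definition (the deck repeats the max(t) term); the classic ATR uses Wilder smoothing, but the slide calls for a plain moving average, so that is what this sketch does:

```python
def true_ranges(highs, lows, closes):
    """True range per period: the greatest of high-low,
    |high - previous close|, |low - previous close|."""
    trs = [highs[0] - lows[0]]  # first bar has no previous close
    for i in range(1, len(highs)):
        trs.append(max(highs[i] - lows[i],
                       abs(highs[i] - closes[i - 1]),
                       abs(lows[i] - closes[i - 1])))
    return trs

def sma(values, window):
    """Simple moving average; the talk uses a 20-day window."""
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]
    # e.g. sma([1, 2, 3, 4, 5], window=3) -> [2.0, 3.0, 4.0]

def atr(highs, lows, closes, window=14):
    """ATR as a plain moving average of the true range."""
    return sma(true_ranges(highs, lows, closes), window)
```

A volatility indicator of the kind listed above could reuse `sma` over squared returns in the same style.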
  • 51. To-Do List
     Matching data subscription
     Bringing historical tick data into real-time analysis
     Possible use of machine learning for intraday indicators

Editor's notes

  1. Founded in September 2017 with experienced software engineers. Members have an academic background in finance and big data. PoC work was done to demonstrate the capabilities of a big data platform, and payment system data was analyzed. The first task was to set up a big data platform.
  Emre Tokel - Big Data Team Leader: Emre has 15+ years of experience in software development. He has taken roles as developer and project manager in various projects. For 2 years now, he has been involved in big data and data intelligence studies within the Bank. Emre has been leading the big data team since last year and is responsible for the architecture of the Big Data Platform, which is based on Hortonworks technologies. He has an MBA degree and is pursuing his Ph.D. in finance. Besides IT, he is a divemaster and teaches SCUBA.
  Kerem Başol - Big Data Engineer: Kerem has 10+ years of experience in software development, including mobile, back-end and front-end. For the past two years he has focused on big data technologies and currently works as a big data engineer. Kerem is responsible for data ingestion and building custom solution stacks for business needs using the Big Data Platform, which is based on Hortonworks technologies. He holds an MS degree in CIS from UPenn.
  M. Yağmur Şahin - Big Data Engineer: Yağmur has been developing software for 10 years. He completed his master's degree in 2016 on distributed stream processing, where he was first introduced to big data technologies. For the last 2 years, he has been designing and implementing big data solutions for the Bank using the Hortonworks Data Platform. Yağmur is also pursuing his Ph.D. at the Medical Informatics department of METU. He loves running and hopes to complete a marathon in the coming years.
  2. Power BI has a MongoDB connector
  3. (All dashboards included min/max/average values)
  4. There were some tasks that could not be handled declaratively