SPARKTA
A real-time analytics platform
based on Apache Spark
London, May 2015
  • [Strata] Sparkta

    1. SPARKTA. A real-time analytics platform based on Apache Spark. London, May 2015
    2. FIRST SPARK PLATFORM, APR 2014. 20+ INTERNATIONAL PROJECTS WITH SPARK
    3. Section 1: PLATFORM OVERVIEW
    4. [Platform diagram] Modules: STRATIO INGESTION, STRATIO STREAMING, STRATIO QUANTUM, STRATIO DEEP, STRATIO CROSSDATA (ODBC, JDBC, REST API), STRATIO DATAVIS. Other labels: Customer lake; CRM, ERP, Call Center, BI, Internal Data, External Data; BI, AD HOC, APP. Data stores: HDFS, S3, Elasticsearch, MongoDB, Cassandra, Redis, Oracle, DB2, other databases.
    5. [Same diagram, with captions] Ingests, transforms. Analyzes and processes real-time streaming. A unified SQL interface. Machine learning and algorithms. STRATIO DEEP: processes & combines with Spark. STRATIO DATAVIS: creates and designs dashboards and reports.
    6. [Same diagram, with technologies] Captions as in slide 5, plus: Streaming, Apache Kite, Apache Flume, MLlib. Combines with Spark data from any source.
    7. [Same diagram] Consult & analyze: SQL interface. Data combination through time: real-time (ephemeral tables), past (stored tables), future (Quantum tables).
    8. [Same diagram] Consult & analyze: SQL interface. INFORMATIONAL + OPERATIONAL WITHOUT NEED TO REPLICATE DATA. Operational stores: Oracle, DB2, other databases, MongoDB, Teradata.
    9. Section 2: REAL-TIME: Beyond cool dashboards
    10. The time is NOW. We all know this story already. Social media and networking sites are part of the fabric of everyday life, changing the way the world shares and accesses information. The overwhelming amount of information gathered not only from messages, updates and images but also from sensor readings, GPS signals and many other sources was the origin of a (big) technological revolution. Remember? VOLUME, VARIETY & VELOCITY
    11. Look at these sexy infographics! We all love data visualization. Insights from this vast amount of data allow us to learn from users and explore our own world. We can follow the evolution of a topic, an event or even an incident in real time just by exploring aggregated data.
    12. Delivering real-time business on the Internet. Beyond cool visualizations, some core services are delivered in real time, using aggregated data to answer common questions as fast as possible. These services are the heart of the business behind the nice logos: site traffic, user engagement monitoring, service health, APIs, internal monitoring platforms, real-time dashboards… Aggregated data feeds directly to end users, publishers, and advertisers, among others.
    13. Pushing business processes to perform faster. Digital companies, born to deliver their services in real time, have changed the expectations of many other businesses. Real-time information makes a company much more agile than its competitors, improving its business answers and giving it insight into its own performance.
    14. Listen to your data… [Diagram labels: client, TPV (point-of-sale terminal), accounts, loans and credits, insurance, broker, mortgages, cards, deposits, ATM, online gateway, application logs, social networks, transactions, geolocation, CRM] Whereas business intelligence is data gathered to analyze trends over time, operational intelligence provides a picture of what is currently happening within a process. And we can listen to almost everything! Orders, transactions, clicks, calls, bookings, internal services...
    15. …and start delivering real-time services. Real-time monitoring is nice, but your company needs to work the way digital companies do:
       • Rethinking existing processes to deliver them faster and better.
       • Creating new opportunities for competitive advantage.
    16. Section 2: REAL-TIME Challenges at Stratio
    17. Real-time fraud monitoring. [Diagram labels: DATA RECEIVER, REAL-TIME AGGREGATION, CONSOLIDATION, Dashboarding, Reporting, FRAUD DETECTION] Leveraging the power of Spark Streaming, we have developed fraud detection solutions that aggregate data in real time so that it works better with machine learning algorithms.
    18. Extract, Transform and Aggregate. By combining Apache Flume and Spark Streaming we have deployed complex topologies to deal with data coming from heterogeneous sources. The full solution allows us to transform and aggregate data on the fly (data cleaning, normalization and enrichment). [Diagram labels: REAL-TIME AGGREGATION, Dashboarding, Reporting]
    19. Custom data sources and storage. Each project requires specific inputs and data stores and deals with different kinds of events, from clickstream activity to bank transactions... [Diagram labels: DATA STREAM, LOADING, TRANSFORM, CUSTOM LOGS]
    20. Towards a generic real-time aggregation platform. At Stratio, we have implemented several real-time analytics projects based on Apache Spark, Kafka, Flume, Cassandra, or MongoDB. These technologies were always a perfect fit, but we soon found ourselves writing the same pieces of integration code over and over again. This is how SPARKTA was born.
    21. Section 3: ELSEWHERE
    22. #1 RainBird from Twitter. Some folks from Twitter shared their thoughts about their real-time needs at Strata (2011). They worked on a "generic" platform to deal with pre-calculated data from a huge number of events. It allows them to deal with:
       • Data structures
       • Hierarchical aggregation
       • Temporal aggregation
       • Multiple formulas
       CURRENT STATE: still not open source. http://goo.gl/ykvQa
    23. #2 Countandra. Countandra is a hierarchical distributed counting engine that exploits the excellent write and read performance of Cassandra. It supports:
       • Geographically distributed counting.
       • An easy HTTP-based interface to insert counts.
       • Hierarchical counting, such as com.mywebsite.music.
       • Retrieval of counts, sums and squares in near real time.
       • Simple HTTP queries that return the desired output in JSON format.
       • Queries sliced by period, such as lasthour, lastyear and so on, for minutely, hourly, daily and monthly values.
       CURRENT STATE: rather deprecated. https://github.com/milindparikh/Countandra
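To make "hierarchical counting" concrete, here is a minimal sketch of one common reading of the idea: an event recorded under a dotted key also increments every ancestor prefix. The function name and the in-memory `Counter` are illustrative assumptions; this is not Countandra's code or its HTTP API, and a real deployment would write to a distributed store such as Cassandra.

```python
from collections import Counter


def hierarchical_increment(counts: Counter, key: str, amount: int = 1) -> None:
    """Add `amount` to `key` and to each of its dotted ancestor prefixes."""
    parts = key.split(".")
    for depth in range(1, len(parts) + 1):
        counts[".".join(parts[:depth])] += amount


# Hypothetical events; the second key is invented for illustration.
counts = Counter()
hierarchical_increment(counts, "com.mywebsite.music")
hierarchical_increment(counts, "com.mywebsite.video", 2)
```

Reading `counts["com.mywebsite"]` then returns the total for everything beneath that prefix without scanning the leaf keys.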
    24. #3 Thunderain from Intel. Thunderain is a Real-Time Analytical Processing (RTAP) example using Spark and Shark, best characterized by four salient properties:
       • Data continuously streamed in and processed in near real time
       • Real-time data queried and presented in an online fashion
       • Real-time and historical data combined and mined interactively
       • Predominantly RAM-based processing
       CURRENT STATE: rather deprecated. https://github.com/thunderain-project/thunderain
    25. #4 TSAR from Twitter. TSAR (the TimeSeries AggregatoR) is a flexible, reusable, end-to-end service architecture on top of Summingbird. Twitter needs a truly robust real-time aggregation service given its scaling and evolving needs. They realized that many time-series applications call for essentially the same architecture, with only slight variations in the data model. CURRENT STATE: still not open source. https://blog.twitter.com/2014/tsar-a-timeseries-aggregator
    26. Towards a generic real-time aggregation platform. Some initiatives have tried to solve this problem, but until now most of them were complex or obsolete, while others were not open source. For this reason, Stratio created SPARKTA: an open source, full-featured platform for real-time analytics, based on Apache Spark.
    27. Section 4: THIS IS SPARKTA
    28. Distributed, high-volume & pluggable analytics framework. Since Aryabhatta invented zero, mathematicians such as John von Neumann have been in pursuit of efficient counting, and architects have constantly built systems that compute counts quicker. In this age of social media, where hundreds of thousands of events take place every second, we designed an aggregation engine to deliver real-time services. Our goals:
       • Pure Spark!
       • No need to code, only declarative aggregation workflows
       • Data continuously streamed in and processed in near real time
       • Ready to use out of the box
       • Plug & play: flexible workflows (inputs, outputs, parsers, etc.)
       • High performance
       • Scalable and fault tolerant
    29. Sparkta: A first look. [Diagram labels: DRIVER - SUPERVISOR, AGGREGATION POLICY, AGGREGATION WORKFLOW, QUERY SERVICES, others] The aggregation policy definition is sent to the engine. The engine allows multiple applications to be defined, each bound to a context that executes the aggregation workflow.
    30. Sparkta: Deploy any number of real-time aggregation policies. [Diagram label: DRIVER - SUPERVISOR] You can start several workflows at any time, and also stop or monitor them.
    31. Sparkta: Key Technologies. PROCESSING: + Apache Kite SDK. INPUTS: RabbitMQ, ZeroMQ, Twitter, Flume, Kafka, .... OUTPUTS: .. ..
    32. Sparkta: Define your real-time needs. AGGREGATION POLICY. Remember: no need to code anything. Define your workflow in a JSON document, including:
       • INPUT: Where is the data coming from?
       • OUTPUT(s): Where should aggregated data be stored?
       • DIMENSION(s): Which fields will you need for your real-time needs?
       • ROLLUP(s): How do you want to aggregate the dimensions?
       • TRANSFORMATION(s): Which functions should be applied before aggregation?
       • SAVE RAW DATA: Do you want to save raw events?
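As an illustration only, a policy answering the six questions above might look like the sketch below. The slides do not show the real schema, so every key (`input`, `outputs`, `dimensions`, `rollups`, `transformations`, `saveRawData`) and every value is a hypothetical stand-in that mirrors the slide's headings; check the Sparkta repository for the actual format.

```python
import json

# Hypothetical policy layout: keys mirror the slide's headings,
# not Sparkta's real JSON schema.
policy = {
    "name": "strata-tweets",
    "input": {"type": "twitter"},
    "outputs": [{"type": "elasticsearch"}],
    "dimensions": ["hashtag", "user", "geo"],
    "rollups": [
        {"dimensions": ["hashtag"], "granularity": "minute", "operators": ["count"]},
        {"dimensions": ["user"], "granularity": "hour", "operators": ["count"]},
    ],
    "transformations": [{"type": "lowercase", "field": "hashtag"}],
    "saveRawData": False,
}

REQUIRED_KEYS = {
    "input", "outputs", "dimensions", "rollups", "transformations", "saveRawData",
}


def validate(candidate: dict) -> list:
    """Return a list of problems; an empty list means the policy looks complete."""
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - candidate.keys())]
    declared = set(candidate.get("dimensions", []))
    for i, rollup in enumerate(candidate.get("rollups", [])):
        for dim in rollup.get("dimensions", []):
            if dim not in declared:
                problems.append(f"rollup {i} uses undeclared dimension: {dim}")
    return problems


# The document that would be handed to the engine.
policy_json = json.dumps(policy, indent=2)
```

Validating a declarative document before submitting it is a design choice of this sketch: a typo in a dimension name is reported up front instead of producing an empty rollup at runtime.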
    33. Sparkta: Key Technologies.
       ROLLUPS
       • Pass-through
       • Time-based: secondly, minutely, hourly, daily, monthly, yearly...
       • Hierarchical
       • GeoRange: areas with different sizes (rectangles)
       OPERATORS
       • Max, min, count, sum
       • Average, median
       • Stdev, variance, count distinct
       • Last value
       • Full-text search
       [Logo: Kite SDK]
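The interplay between a time-based rollup and its operators can be sketched in plain Python. This is a batch, in-memory illustration of the concept, not Sparkta's Spark Streaming implementation; the function `rollup`, the event fields and the sample amounts are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Truncation formats for a few of the time granularities named on the slide.
GRANULARITY_FORMATS = {
    "minute": "%Y-%m-%d %H:%M",
    "hour": "%Y-%m-%d %H",
    "day": "%Y-%m-%d",
}

# A subset of the operators listed on the slide.
OPERATORS = {"count": len, "sum": sum, "min": min, "max": max, "avg": mean}


def rollup(events, dimensions, granularity, field, operators):
    """Group events by (dimension values..., time bucket) and apply each operator."""
    fmt = GRANULARITY_FORMATS[granularity]
    groups = defaultdict(list)
    for event in events:
        bucket = event["timestamp"].strftime(fmt)
        key = tuple(event[d] for d in dimensions) + (bucket,)
        groups[key].append(event[field])
    return {
        key: {name: OPERATORS[name](values) for name in operators}
        for key, values in groups.items()
    }


# Hypothetical events, invented for illustration.
events = [
    {"timestamp": datetime(2015, 5, 6, 10, 1), "city": "London", "amount": 1200},
    {"timestamp": datetime(2015, 5, 6, 10, 40), "city": "London", "amount": 2100},
    {"timestamp": datetime(2015, 5, 6, 11, 5), "city": "London", "amount": 300},
    {"timestamp": datetime(2015, 5, 6, 10, 15), "city": "Madrid", "amount": 500},
]

hourly = rollup(events, ["city"], "hour", "amount", ["count", "sum", "max"])
```

Here `hourly[("London", "2015-05-06 10")]` holds the count, sum and maximum of the two London events that fall in the 10:00 bucket.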
    34. Sparkta SDK. [Diagram labels: INPUT, OUTPUT(s), DIMENSION(s), OPERATORS, TRANSFORMATION(s)] Sparkta has been conceived as an SDK. You can extend several points of the platform to fit your needs, such as adding new inputs, outputs, operators and dimension types. You can also add new functions to Apache Kite to extend the data cleaning, enrichment and normalization capabilities.
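One way to picture an extension point is a registry that maps operator names to functions, so a new operator can be added without touching the engine. The sketch below is a generic plug-in pattern under that assumption; `register_operator`, `apply_operator` and the `range` operator are hypothetical names, and the slides do not show Sparkta's actual SDK interfaces.

```python
from typing import Callable, Dict, Sequence

# Hypothetical registry; Sparkta's real extension points are not shown in the slides.
_OPERATORS: Dict[str, Callable[[Sequence[float]], float]] = {}


def register_operator(name: str):
    """Decorator that makes a function available to workflows under `name`."""
    def decorator(func):
        if name in _OPERATORS:
            raise ValueError(f"operator already registered: {name}")
        _OPERATORS[name] = func
        return func
    return decorator


@register_operator("sum")
def op_sum(values: Sequence[float]) -> float:
    return float(sum(values))


@register_operator("range")
def op_range(values: Sequence[float]) -> float:
    # A user-defined operator: spread between the largest and smallest value.
    return float(max(values) - min(values))


def apply_operator(name: str, values: Sequence[float]) -> float:
    """Look up an operator by the name used in a policy and apply it."""
    return _OPERATORS[name](values)
```

With this shape, a policy only needs to reference the operator by name, for example `apply_operator("range", [3, 10, 4])`.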
    35. Section 5: NEXT STEPS. Source: mydisguises.com
    36. Next steps in our roadmap (I). Sparkta is a work in progress, so we still have some nice features to develop… [Labels: QUERY SERVICES, ALARMS] Creating a REST services layer to query the aggregated data allows us to isolate the final consumer from the specific data storage. Features:
       • Time ranges
       • Aggregation on time ranges
       • Best rollup selection
       For example, I want to know if we have earned over $3000 in London in the last hour... Remember operational intelligence!
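The "best rollup selection" idea and the London example can be sketched as follows. Since the slide describes a planned feature, everything here is an assumption: the selection rule (pick the coarsest stored granularity whose buckets align with the requested range), the function names, and the in-memory `store` with its sample revenue figures.

```python
from datetime import datetime, timedelta

# Stored granularities from coarsest to finest, with their bucket width.
GRANULARITIES = [("hour", timedelta(hours=1)), ("minute", timedelta(minutes=1))]
EPOCH = datetime(1970, 1, 1)


def best_rollup(start: datetime, end: datetime):
    """Pick the coarsest granularity whose buckets align with [start, end)."""
    for name, width in GRANULARITIES:
        aligned_start = (start - EPOCH) % width == timedelta(0)
        aligned_end = (end - EPOCH) % width == timedelta(0)
        if aligned_start and aligned_end:
            return name, width
    raise ValueError("range is not aligned to any stored granularity")


def query_sum(store, dimension_value, start, end):
    """Sum pre-aggregated buckets for one dimension value over [start, end)."""
    name, width = best_rollup(start, end)
    total, bucket = 0, start
    while bucket < end:
        total += store[name].get((dimension_value, bucket), 0)
        bucket += width
    return name, total


# Hypothetical pre-aggregated revenue, keyed by (city, bucket start).
store = {
    "hour": {("London", datetime(2015, 5, 6, 10)): 3300},
    "minute": {
        ("London", datetime(2015, 5, 6, 10, 1)): 1200,
        ("London", datetime(2015, 5, 6, 10, 40)): 2100,
    },
}

granularity, revenue = query_sum(
    store, "London", datetime(2015, 5, 6, 10), datetime(2015, 5, 6, 11)
)
alarm = revenue > 3000  # the "over $3000 in London in the last hour" check
```

A full hour is answered from a single hourly bucket, while a half-hour range falls back to minute buckets; that trade-off between read cost and precision is what a rollup selector has to weigh.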
    37. Next steps in our roadmap (II).
       WEB APPLICATION: How about a nice web interface to create and manage policies? Forget the JSON file and use your mouse to define the workflow :)
       DEPLOYING & MONITORING: We have been working with Spark jobServer & YARN, but it would be nice to support Mesos, for example.
       MORE AWESOMENESS: Hey, did we miss something? Do you have a great idea? Let us know!
    38. Section 6: OPEN SOURCE & COMMUNITY
    39. OPEN TO YOUR IDEAS. SPARKTA is fully open source under the Apache 2 License. We are open to contributors & ideas. www.stratio.com, @StratioBD, https://github.com/stratio/sparkta
    40. Section 7: DEMO TIME
    41. Do you want to try SPARKTA? Use a full-featured sandbox to start trying SPARKTA. Just open a shell and type:
           vagrant init "stratio/sparkta"
           vagrant up
    42. Do you want to try SPARKTA? Getting some real-time stats from #StrataHadoop. Our real-time policy defines some rollups to find chatty users, hot hashtags, and heatmaps from StrataConf tweets. We are using the standard Twitter input from Spark Streaming, an Elasticsearch output, and Kibana to display the results.
    43. BIG DATA, CHILD'S PLAY
