BI, Reporting and Analytics on
Apache Cassandra
27/10/2015
Victor Coustenoble Solutions Engineer
victor.coustenoble@datastax.com
@vizanalytics
Agenda
• DataStax & Apache Cassandra
• Data Modeling and CQL
• Data Access
• Reporting and Analytics
• DataStax Enterprise Analytics
• Architectures
• Hadoop + Cassandra use cases
©2014 DataStax Confidential. Do not distribute without consent.
DataStax & Apache Cassandra
Elevator Pitch
“DataStax delivers Apache Cassandra in a database platform
purpose-built for the performance and availability demands
of Web, Mobile, and IoT applications, giving enterprises a
secure always-on database that remains operationally simple
when scaled in a single datacenter or across multiple
datacenters and clouds.”
No Vertical Market Concentration
Functional use cases
• Messaging
• Collections / Playlists
• Fraud detection
• Recommendation / Personalization
• Internet of Things / Sensor data
Apache Cassandra™
• Massively scalable, open source, NoSQL distributed database built for modern, mission-critical online applications
• Written in Java; a hybrid of Amazon Dynamo and Google Bigtable
• Masterless with no single point of failure
• Distributed and data center aware
• 100% uptime
• Predictable scaling
• High Performance
• Multi Data Center
• Time Series
• Tunable Consistency
• Simple to Operate
• CQL language
• OpsCenter / DevCenter
Bigtable: http://research.google.com/archive/bigtable-osdi06.pdf
Dynamo: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
Data Modeling and CQL
Data Modeling
Cassandra is not like well-known RDBMSs:
• Not a relational model
• No foreign keys, no joins, no aggregations
• Modeling is driven by the queries to be supported: the data access
paths and the operations needed (filtering, grouping, ordering)
Denormalisation
• Combine columns from different tables into a single table (“materialized
view”), no joins!
• Better performance, less data traffic
• Don’t be afraid to duplicate data and to write it more than once
• Avoid joins at client level
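The denormalisation idea above can be sketched in a few lines of Python, with plain dictionaries standing in for tables. This is only an illustration and does not touch Cassandra: the `teams` and `players` data, the `players_by_city` table and the `build_players_by_city` helper are all hypothetical names.

```python
# Two normalised "tables", as an RDBMS might store them (hypothetical data).
teams = {"PSG": {"city": "Paris"}, "OM": {"city": "Marseille"}}
players = [
    {"player_name": "Zlatan", "team_name": "PSG", "jersey": 10},
    {"player_name": "Verratti", "team_name": "PSG", "jersey": 6},
    {"player_name": "Mandanda", "team_name": "OM", "jersey": 30},
]


def build_players_by_city(teams, players):
    """Do the join once, at write time, into a table shaped for one query."""
    table = {}
    for p in players:
        city = teams[p["team_name"]]["city"]
        # The team name is deliberately duplicated into every row.
        table.setdefault(city, []).append(
            {
                "team_name": p["team_name"],
                "player_name": p["player_name"],
                "jersey": p["jersey"],
            }
        )
    return table


players_by_city = build_players_by_city(teams, players)

# The read is now a single key lookup, with no join at query time.
paris_rows = players_by_city["Paris"]
print([row["player_name"] for row in paris_rows])
```

The cost is paid on the write path (more rows, duplicated columns) so that the query the application actually needs becomes a lookup by key.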
Cassandra Data Model
• Based on Google Bigtable
• Row-oriented column family
• De-normalised
CREATE TABLE sporty_league (
    team_name varchar,
    player_name varchar,
    jersey int,
    PRIMARY KEY (team_name, player_name)
);
SELECT * FROM sporty_league;
The primary key uniquely identifies a row.
A composite primary key consists of:
• A partition key
• One or more clustering columns
e.g. PRIMARY KEY (partition key, clustering columns, ...)
• The partition key determines on which node the partition resides
• Within a partition, rows are ordered by the clustering columns
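As a rough illustration of the two rules above, the following Python sketch places each partition on a node by hashing the partition key and returns rows in clustering-column order. It is a toy model under stated assumptions: the three-node list, the MD5 hash and the modulo placement are stand-ins chosen for brevity (Cassandra uses its own partitioner and token ranges), and the function names are hypothetical.

```python
import hashlib

NODES = ["node1", "node2", "node3"]  # hypothetical three-node cluster


def node_for(partition_key):
    """Map a partition key to a node via a hash (stand-in for the real partitioner)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    token = int.from_bytes(digest[:8], "big")
    return NODES[token % len(NODES)]


# node -> partition key (team_name) -> clustering key (player_name) -> columns
storage = {node: {} for node in NODES}


def insert(team_name, player_name, jersey):
    partition = storage[node_for(team_name)].setdefault(team_name, {})
    # The primary key identifies the row, so a second insert overwrites it.
    partition[player_name] = {"jersey": jersey}


def select_partition(team_name):
    """Read one partition from the single node that owns it."""
    partition = storage[node_for(team_name)].get(team_name, {})
    # Sorted at read time here for brevity; this models clustering order only.
    return [(name, partition[name]["jersey"]) for name in sorted(partition)]


insert("PSG", "Zlatan", 10)
insert("PSG", "Cavani", 9)
print(node_for("PSG"), select_partition("PSG"))
```

All rows of a team live in one partition on one node, which is why a query that filters on `team_name` can be answered without contacting the whole cluster.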
CQL – Cassandra Query Language
• Data types: BLOB, UUID, TIMEUUID, User Defined Types …
• User Defined Functions, User Defined Aggregates
• Collections : Map, List, Set
• TTL (Time-To-Live) at column level
• Counters
• Lightweight Transactions (LWT): resolve race conditions with
conditional writes such as IF NOT EXISTS
• Batch statements
• Secondary Index
• Very similar to RDBMS SQL syntax
• Core DML and DDL commands supported: INSERT, UPDATE, DELETE, SELECT, CREATE, GRANT …
INSERT INTO sporty_league (team_name, player_name, jersey) VALUES ('PSG', 'Zlatan', 10);
SELECT player_name AS nom_joueur FROM sporty_league WHERE team_name = 'PSG';
DevCenter
Data Access
Cassandra Data Access
CQL via cqlsh (command line), DevCenter (development environment) or drivers
• Drivers on the Cassandra native protocol
• CQL COPY command
• Import/export tools for bulk loading
• Connectors in ETL solutions (Talend, Informatica)
• Via analytics layers Spark and Hadoop
• Via ODBC/JDBC drivers
Cassandra Clients - Native Driver
DataStax drivers available and supported: Java, Python, C#, C++, Ruby, Node.js,
PHP (more to come, such as Scala, Go…)
Driver features include:
• Load Balancing
• Data Centre Aware
• Latency Aware
• Token Aware
• Reconnection policies
• Retry policies
• Downgrading Consistency
• Plus others…
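To make one of the policies above concrete, here is a minimal Python sketch of an exponential reconnection schedule, in which the wait before each attempt to reach a down node doubles up to a cap. It illustrates the general idea only: `exponential_delays` and its parameters are hypothetical names, not the API of any DataStax driver.

```python
def exponential_delays(base_delay, max_delay, attempts):
    """Return the wait (in seconds) before each reconnection attempt."""
    delays = []
    for attempt in range(attempts):
        # Double the delay on every attempt, but never exceed max_delay.
        delays.append(min(base_delay * (2 ** attempt), max_delay))
    return delays


print(exponential_delays(1.0, 60.0, 8))
# prints [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

Capping the delay keeps a recovered node from waiting too long to be used again, while the doubling avoids hammering a node that is still down.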
ODBC / JDBC Connections
ODBC drivers
• For SparkSQL (SQL engine on Spark), via JDBC/ODBC SparkSQL thrift server
• For Hive (Hadoop SQL engine)
• For Cassandra directly (ANSI SQL or CQL requests)
JDBC drivers
• For SparkSQL (SQL engine on Spark), via JDBC/ODBC SparkSQL thrift server
• For Cassandra directly (in progress)
• Community JDBC drivers exist but are not officially supported
Reporting & Analytics
Real-Time / Operational Analytics Use Cases
Recommendation Engine
Internet of Things
Fraud Detection
Risk Analysis
Buyer Behaviour Analytics
Telematics, Logistics
Business Intelligence
Infrastructure Monitoring
…
How to do analytics on Cassandra data?
Remember…
Cassandra = NO JOIN, NO GROUP BY, filter on primary key only
2 solutions:
• CQL with predictable queries
• Joins and aggregations on the fly:
  Server level => needs a distributed processing framework: Hadoop or Spark
  Client level => possible but risky!
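A small Python sketch may help show what a client-level aggregation involves. It assumes a hypothetical `fetch_all_rows` function standing in for a full read of the `sporty_league` table; no real driver call is used.

```python
from collections import defaultdict


def fetch_all_rows():
    """Stand-in for a full-table read; a real client would page through every partition."""
    return [
        ("PSG", "Zlatan", 10),
        ("PSG", "Cavani", 9),
        ("OM", "Mandanda", 30),
    ]


def count_players_by_team(rows):
    """A GROUP BY team_name / COUNT(*) computed on the client."""
    counts = defaultdict(int)
    for team_name, _player_name, _jersey in rows:
        counts[team_name] += 1
    return dict(counts)


print(count_players_by_team(fetch_all_rows()))
```

The grouping itself is trivial; the likely risk is that every row has to be shipped to the client first, so the cost grows with the size of the table rather than with the size of the result.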
Reporting and Dashboard
• Static and operational dashboards and reports created for a
specific Cassandra application.
• CQL, Solr queries and DataStax drivers
• KPIs and aggregations are pre-calculated by scheduled batch jobs or
on the fly at insert time.
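Pre-calculating a KPI at insert time can be sketched as follows in Python. The `events` list and `events_per_day` counter are hypothetical in-memory stand-ins for a detail table and a pre-aggregated table; in Cassandra the second could plausibly be a counter table, but that is an assumption, not something the slides specify.

```python
from collections import Counter

events = []                 # stand-in for the detail table
events_per_day = Counter()  # stand-in for a pre-aggregated KPI table


def record_event(day, sensor_id, value):
    """Write the detail row and update the aggregate in the same step."""
    events.append((day, sensor_id, value))
    events_per_day[day] += 1


def kpi_events_for(day):
    """The dashboard query is a single key lookup, with no scan or GROUP BY."""
    return events_per_day[day]


record_event("2015-10-27", "sensor-1", 21.5)
record_event("2015-10-27", "sensor-2", 19.0)
record_event("2015-10-28", "sensor-1", 22.1)
print(kpi_events_for("2015-10-27"))
```

The report then reads one pre-computed value per key, which fits the "predictable queries" style that CQL favours.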
BI & Data Visualization tools
For BI and data visualization tools such as Tableau Software,
Power BI, QlikView, Excel…
• DataStax ODBC driver
  SQL joins and aggregations are executed at client level!
• Spark ODBC driver (from Databricks or Microsoft)
  SQL is translated into Spark jobs and executed at server level
Tableau Software
Databricks Spark ODBC Driver for SparkSQL
Live SQL queries to Spark, or data extracts to the local client
Power BI Desktop
Support for On-Prem Spark distributions
“The new data source in this month’s release is support for On-Prem Spark distributions. Last
month, we added support for Microsoft Azure HDInsight Spark, and this month we’re expanding
to other Spark distributions.
This new connector can be found under the “Other” category in the “Get Data” dialog.”
http://blogs.msdn.com/b/powerbi/archive/2015/09/23/44-new-features-in-the-power-bi-desktop-september-update.aspx
Microsoft Spark ODBC Driver
Notebook
Run code (Spark or CQL) from a Web browser
Notebooks like Zeppelin, Spark Notebook, Jupyter
For example Zeppelin:
• Examples available for Cassandra
• CQL language interpreter
• https://github.com/doanduyhai/incubator-zeppelin
DataStax Enterprise Analytics
Analytics with DataStax Enterprise
There are 4 ways to do Analytics on Cassandra data:
• Reporting with CQL queries
• Integrated Search (Solr)
• Integrated Batch Analytics (Hadoop integrated) on Cassandra
• Integrated Near Real-Time Analytics (Spark)
• Virtual data centers optimised as required: different workloads, hardware, availability, etc.
• Cassandra will replicate the data for you – no ETL is necessary
• Cassandra node started with Solr, Hadoop or Spark
[Diagram: Cassandra replication between transactional and analytics nodes]
Enterprise Search & Powerful Secondary Index
• Built-in enterprise search on Cassandra data via a strong Apache Solr and Lucene
integration
• Facets, Filtering, Geospatial search, Text Analysis, Joins, etc.
• Real-time indexing process and search operations
• Search queries from CQL and REST/Solr
• Solr shortcomings addressed:
  • No bottleneck: clients can read from and write to any Solr node.
  • Search index partitioning and replication for scalability and availability.
  • Multi-DC support
  • Data durability (standalone Solr lacks a write-ahead log, so data can be lost)
[Diagram: Cassandra replication between customer-facing nodes and Search nodes]
Batch Analytics - Hadoop
• Integrated Hadoop 1.0.4
• CFS (Cassandra File System) instead of HDFS
• No Single Point of failure
• No Hadoop complexity – every node is built the same
• Hive / Pig / Sqoop / Mahout
[Diagram: Cassandra replication between customer-facing nodes and Hadoop nodes]
Real-Time Analytics - Spark
• Tight integration between Apache Spark and Cassandra
• Distributed processing: “in-memory Map/Reduce”, multi-threaded, well suited to iterative workloads
• GraphX, MLLib (Machine learning), SparkSQL, Spark Streaming (Real-time processing)
• Thrift JDBC/ODBC Spark server – Spark Job server
• Apache Solr integration
• DataStax / Databricks partnership
• Up to 10x–100x faster than MapReduce
[Diagram: Cassandra replication between customer-facing nodes and Spark nodes]
“Big Data” SDK
Real-time or Batch Analytics
Data Enrichment
Batch Processing
Machine Learning
Pre-computed aggregates
Data
NO ETL
Spark Use Cases
• Load data from various sources
• Analytics (join, aggregate, transform, …)
• Sanitize, validate, normalize data
• Schema migration, data conversion
Architectures
Workloads Isolation
No ETL
Hot / Cold Data in a DataStax architecture
• Hot Data: Online Operational Application
• Cold Data: Offline Application
DataStax Cassandra Enterprise
DataStax Enterprise + Datawarehouse / Hadoop
• Write Intensive: Internet of Things, activity logs for fraud and recommendation, messages
• Read Intensive: Catalogue, Playlist, Recommendation, Fraud Alert, Personalization
• Operational Search, Dashboard and Reporting
• Offline Applications: Historical Analysis, OLAP, Complex Analytics, Self-Service BI
• Data Warehouse, Hadoop cluster, Computation Engine, Multidimensional Cube
Cassandra + Hadoop Use Cases
Ooyala Use Case: Hadoop + Cassandra
By leveraging data stored in Apache Cassandra, Ooyala is helping its customers take a more strategic
approach when delivering a digital video experience, so they can get ahead in this fast-evolving space.
http://www.datastax.com/resources/casestudies/ooyala
San Francisco-based video services company Ooyala provides a suite of technologies and services that support content
owners in managing, analyzing and monetizing the digital video they publish online, on mobile devices, and through the
over-the-top distribution platform for delivering Internet video to television.
Spotify Use Case: Hadoop + Cassandra
https://labs.spotify.com/2015/01/09/personalization-at-spotify-using-cassandra/
Personalization at Spotify using Cassandra
Thanks
We power the big data apps
that transform business.

BI, Reporting and Analytics on Apache Cassandra

  • 1. BI, Reporting and Analytics on Apache Cassandra 27/10/2015 Victor Coustenoble Solutions Engineer victor.coustenoble@datastax.com @vizanalytics
  • 2. Agenda • DataStax & Apache Cassandra • Data Modeling and CQL • Data Access • Reporting and Analytics • DataStax Enterprise Analytics • Architectures • Hadoop + Cassandra use cases ©2014 DataStax Confidential. Do not distribute without consent. 2
  • 4. © 2014 DataStax Confidential. Do not distribute without consent. DataStax delivers Apache Cassandra in a database platform purpose-built for the performance and availability demands of Web, Mobile, and IOT applications, giving enterprises a secure always-on database that remains operationally simple when scaled in a single datacenter or across multiple datacenters and clouds. “ “ Elevator Pitch
  • 5.
  • 6. No Vertical Market Concentration
  • 8. Apache Cassandra™ • Massively scalable, Open Source, NoSQL, distributed database built for modern, mission- critical online applications • Written in Java and is a hybrid of Amazon Dynamo and Google BigTable • Masterless with no single point of failure • Distributed and data center aware • 100% uptime • Predictable scaling • High Performance • Multi Data Center • Time Series • Tunable Consistency • Simple to Operate • CQL language • OpsCenter / DevCenter Dynamo BigTable BigTable: http://research.google.com/archive/bigtable-osdi06.pdf Dynamo: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
  • 10. Data Modeling Cassandra is not like well known RDBMS systems: • No a relational model • No foreign keys, no joins, no agregations • Modeling guided by requests to be supported, by data access and by actions (filters, grouping and order needs) Denormalisation • Combine columns from different tables in a unique table (“materialized view”), no joins! • Better performances, less data trafic • Don’t be afraid to duplicate data, to write data • Avoid joins at client level ©2014 DataStax Confidential. Do not distribute without consent. 10
  • 11. Cassandra Data Model ©2014 DataStax Confidential. Do not distribute without consent. 11 • Based on Google Bigtable • Row-oriented column family • De-normalised CREATE TABLE sporty_league ( team_name varchar, player_name varchar, jersey int, PRIMARY KEY (team_name, player_name) ); SELECT * FROM sporty_league; The primary key uniquely identifies a row. A composite primary key consists of: • A partition key • One or more clustering columns e.g. PRIMARY KEY (partition key, cluster columns, ...) • The partition key determines on which node the partition resides • Data is ordered in cluster column order within the partition
  • 12. CQL – Cassandra Query Language ©2014 DataStax Confidential. Do not distribute without consent. • Data type : BLOB, UUID, TIMEUUID, User Defined Type … • User Defined Functions, User Defined Aggregates • Collections : Map, List, Set • TTL (Time-To-Live) at column level • Counters • Lightweight Transactions (LWT) : race condition problem solving with IF NOT EXISTS • Batch statements • Secondary Index • Very similar to RDBMS SQL syntax • Core DML and DDL commands supported: INSERT, UPDATE, DELETE, SELECT, CREATE, GRANT … INSERT INTO sporty_league (team_name, player_name, jersey) VALUES (’PSG',’Zlatan’,10); SELECT player_name as nom_joueur FROM sporty_league WHERE team_name = ‘PSG’; DevCenter
  • 14. Cassandra Data Access CQL language via cqlsh (command line) or DevCenter (development environnement) or drivers • Drivers on Cassandra native protocol • Command CQL COPY • Import/Export tools for massive bulk loader • Connectors in ETL solutions (Talend, Informatica) • Via analytics layers Spark and Hadoop • Via ODBC/JDBC drivers
  • 15. Cassandra Clients - Native Driver DataStax drivers available and supported: Java, Python, C#, C++, Ruby, Node.js, PHP (much more to come like Scala, Go…) This includes: • Load Balancing • Data Centre Aware • Latency Aware • Token Aware • Reconnection policies • Retry policies • Downgrading Consistency • Plus others… ©2014 DataStax Confidential. Do not distribute without consent. 15
  • 16. ODBC / JDBC Connections
      ODBC drivers
      • For SparkSQL (SQL engine on Spark), via the JDBC/ODBC SparkSQL Thrift server
      • For Hive (Hadoop SQL engine)
      • For Cassandra directly (ANSI SQL or CQL queries)
      JDBC drivers
      • For SparkSQL (SQL engine on Spark), via the JDBC/ODBC SparkSQL Thrift server
      • For Cassandra directly (in progress)
      • Community JDBC drivers exist but are not officially supported
  • 18. Real-Time / Operational Analytics Use Cases
      • Recommendation Engine
      • Internet of Things
      • Fraud Detection
      • Risk Analysis
      • Buyer Behaviour Analytics
      • Telematics, Logistics
      • Business Intelligence
      • Infrastructure Monitoring
      • …
  • 19. How to do analytics on Cassandra data?
      Remember: Cassandra = NO JOIN, NO GROUP BY, filter on primary key only.
      Two solutions:
      • CQL with predictable queries
      • Joins and aggregations on the fly:
          • Server level => needs a distributed processing framework: Hadoop or Spark
          • Client level => possible but risky!
  • 20. Reporting and Dashboard
      • Static and operational dashboards and reports created for a specific Cassandra application
      • CQL, Solr queries and DataStax drivers
      • KPIs and aggregations pre-calculated by a scheduled batch or on the fly during insert
  • 21. BI & Data Visualization tools
      For BI and data visualization tools like Tableau Software, Power BI, QlikView, Excel …
      • DataStax ODBC driver: SQL joins and aggregations are executed at client level!
      • Spark ODBC driver (from Databricks or Microsoft): SQL is translated into Spark jobs and executed at server level
  • 22. Tableau Software
      • Databricks Spark ODBC Driver for SparkSQL
      • Live SQL queries to Spark, or data extracted to the local client
  • 23. Power BI Desktop
      Support for On-Prem Spark distributions, via the Microsoft Spark ODBC Driver.
      “The new data source in this month’s release is support for On-Prem Spark distributions. Last month, we added support for Microsoft Azure HDInsight Spark, and this month we’re expanding to other Spark distributions. This new connector can be found under the “Other” category in the “Get Data” dialog.”
      http://blogs.msdn.com/b/powerbi/archive/2015/09/23/44-new-features-in-the-power-bi-desktop-september-update.aspx
  • 24. Notebook
      Run code (Spark or CQL) from a Web browser, using notebooks like Zeppelin, Spark Notebook or Jupyter.
      For example, Zeppelin:
      • Examples available for Cassandra
      • CQL language interpreter
      • https://github.com/doanduyhai/incubator-zeppelin
  • 26. Analytics with DataStax Enterprise
      • There are 4 ways to do analytics on Cassandra data:
          • Reporting with CQL queries
          • Integrated Search (Solr)
          • Integrated Batch Analytics (Hadoop) on Cassandra
          • Integrated Near Real-Time Analytics (Spark)
      • Virtual multi data centers optimised as required: different workloads, hardware, availability, etc.
      • Cassandra will replicate the data for you: no ETL is necessary
      • Cassandra node started with Solr, Hadoop or Spark
      [Diagram labels: Cassandra Replication, Transactions, Analytics]
  • 27. Enterprise Search & Powerful Secondary Index
      • Built-in enterprise search on Cassandra data via a strong Apache Solr and Lucene integration
      • Facets, Filtering, Geospatial search, Text Analysis, Joins, etc.
      • Real-time indexing process and search operations
      • Search queries from CQL and REST/Solr
      • Solr shortcomings addressed:
          • No bottleneck: a client can read from or write to any Solr node
          • Search index partitioning and replication for scalability and availability
          • Multi-DC support
          • Data durability (Solr lacks a write-ahead log, so data can be lost)
      [Diagram labels: Cassandra Replication, Customer Facing, Search Nodes]
  • 28. Batch Analytics - Hadoop
      • Integrated Hadoop 1.0.4
      • CFS (Cassandra File System), no HDFS
      • No single point of failure
      • No Hadoop complexity: every node is built the same
      • Hive / Pig / Sqoop / Mahout
      [Diagram labels: Cassandra Replication, Customer Facing, Hadoop Nodes]
  • 29. Real-Time Analytics - Spark
      • Tight integration between Apache Spark and Cassandra
      • Distributed processing: “in-memory Map/Reduce”, multi-threaded, best for iterative workloads
      • GraphX, MLlib (machine learning), SparkSQL, Spark Streaming (real-time processing)
      • Thrift JDBC/ODBC Spark server, Spark Job server
      • Apache Solr integration
      • DataStax / Databricks partnership
      • 10x – 100x the speed of MapReduce
      [Diagram labels: Cassandra Replication, Customer Facing, Spark Nodes, “Big Data” SDK]
  • 30. Real-time or Batch Analytics
      [Diagram labels: Data Enrichment, Batch Processing, Machine Learning, Pre-computed aggregates, Data, NO ETL]
  • 31. Spark Use Cases
      • Load data from various sources
      • Analytics (join, aggregate, transform, …)
      • Sanitize, validate, normalize data
      • Schema migration, data conversion
  • 33. Workload Isolation
      [Diagram label: No ETL]
  • 34. Hot / Cold Data in a DataStax architecture
      [Diagram labels: Hot Data: Online Operational Application; Cold Data: Offline Application; DataStax Enterprise, Cassandra]
  • 35. DataStax Enterprise + Data Warehouse / Hadoop
      • Write intensive: Internet of Things – Activity logs for fraud and recommendation – Messages
      • Read intensive: Catalogue – Playlist – Recommendation – Fraud Alert – Personalization
      • Operational Search, Dashboard and Reporting
      • Offline applications: Historical Analysis – OLAP – Complex Analytics – Self Service BI
      [Diagram labels: Data Warehouse, Hadoop cluster, Computation Engine, Multidimensional Cube]
  • 36. Cassandra + Hadoop Use Cases
  • 37. Ooyala Use Case: Hadoop + Cassandra
      By leveraging data stored in Apache Cassandra, Ooyala is helping their customers take a more strategic approach when delivering a digital video experience, so they can get ahead in this fast-evolving space.
      http://www.datastax.com/resources/casestudies/ooyala
      San Francisco-based video services company Ooyala provides a suite of technologies and services that support content owners in managing, analyzing and monetizing the digital video they publish online, on mobile devices, and through the over-the-top distribution platform for delivering Internet video to television.
  • 38. Spotify Use Case: Hadoop + Cassandra
      Personalization at Spotify using Cassandra
      https://labs.spotify.com/2015/01/09/personalization-at-spotify-using-cassandra/
  • 39. Thanks
      We power the big data apps that transform business.
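
The primary-key rules on slide 11 (the partition key picks the node, the clustering column orders rows inside the partition) can be illustrated with a small in-memory sketch. This is a toy model, not how Cassandra is implemented: the node names, the `ToyTable` class and the MD5-based placement are assumptions made for illustration, with MD5 standing in for Cassandra's real partitioner.

```python
import hashlib
from bisect import insort
from collections import defaultdict

NODES = ["node-a", "node-b", "node-c"]  # hypothetical three-node cluster


def node_for(partition_key: str) -> str:
    """Map a partition key to a node.

    MD5 is only a deterministic stand-in for Cassandra's partitioner.
    """
    token = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    return NODES[token % len(NODES)]


class ToyTable:
    """Stand-in for a table with PRIMARY KEY (team_name, player_name)."""

    def __init__(self):
        # node -> partition key -> rows kept sorted by clustering column
        self.storage = defaultdict(lambda: defaultdict(list))

    def insert(self, team_name, player_name, jersey):
        partition = self.storage[node_for(team_name)][team_name]
        # Writing an existing primary key replaces the row (an upsert).
        partition[:] = [row for row in partition if row[0] != player_name]
        insort(partition, (player_name, jersey))

    def select(self, team_name):
        """Read one partition: one node is consulted, rows come back ordered."""
        return list(self.storage[node_for(team_name)][team_name])
```

Inserting ("PSG", "Zlatan", 10) and then ("PSG", "Cavani", 9) and calling `select("PSG")` returns the Cavani row first, because rows are kept in clustering-column order regardless of insertion order. A query that does not name the partition key would have to visit every node, which is why filtering on the primary key is the cheap path.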

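Slides 19 and 20 contrast aggregating at client level with KPIs pre-calculated on the fly during insert. The sketch below shows the difference in plain Python, under stated assumptions: `EventStore`, the sensor/day key and the method names are hypothetical, and the dictionaries stand in for a raw table and an aggregate table keyed by (sensor_id, day).

```python
from collections import defaultdict


class EventStore:
    """Toy model: raw events plus an aggregate table keyed by (sensor_id, day)."""

    def __init__(self):
        self.events = []  # raw rows: (sensor_id, day, value)
        self.daily = defaultdict(lambda: {"count": 0, "total": 0.0})

    def insert(self, sensor_id, day, value):
        """Store the raw row and update the pre-computed aggregate in one step."""
        self.events.append((sensor_id, day, value))
        aggregate = self.daily[(sensor_id, day)]
        aggregate["count"] += 1
        aggregate["total"] += value

    def daily_average(self, sensor_id, day):
        """Reporting query: a single lookup by key, no scan and no GROUP BY."""
        aggregate = self.daily.get((sensor_id, day))
        if aggregate is None:
            return None
        return aggregate["total"] / aggregate["count"]

    def scan_average(self, sensor_id, day):
        """Client-level alternative: pull every row and aggregate locally."""
        values = [v for s, d, v in self.events if s == sensor_id and d == day]
        if not values:
            return None
        return sum(values) / len(values)
```

Both methods give the same answer, but `scan_average` reads every stored event while `daily_average` reads one entry. That cost gap is the reason the deck calls client-level aggregation "possible but risky" and recommends either pre-computed KPIs or a server-side engine such as Spark.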
Editor's notes

  1. Cassandra is designed to handle big data workloads across multiple data centers with no single point of failure, providing enterprises with continuous availability without compromising performance. It uses aspects of Dynamo's partitioning and replication and a log-structured data model similar to Bigtable's: it takes its distribution algorithm from Dynamo and its data model from Bigtable. Cassandra is a reinvented database that is lightning fast and always on, ideal for today's online applications where relational databases like Oracle can't keep up. In today's world, this means that Cassandra stores and processes real-time information with fast, predictable performance and built-in fault tolerance.
  2. Predictive analytics. Does this simple architecture look familiar to you? Lambda architecture (Nathan Marz).
  3. DUYHAI