An Efficient Data Mining Solution with Cassandra and Spark


This document discusses efficient data mining solutions using Hadoop, Cassandra, and Spark. It describes Cassandra as a fast, robust, and efficient key-value database but notes that it is poorly suited to queries the schema was not designed for. Spark is presented as an alternative to Hadoop MapReduce that may be up to 100 times faster for iterative algorithms and interactive data mining. The document demonstrates how Spark can integrate with Cassandra to allow distributed data processing over Cassandra data without cloning the data or using other databases. Proposed future extensions are direct access to Cassandra's SSTable files from Spark and an extension of CQL3 that leverages Spark.


Stratio Deep
An efficient data mining solution
“Two and two are four?
Sometimes… Sometimes they are five.”

G. Orwell

#StratioBD

Goals
• Why do you need Cassandra?
• What is the problem?
• Why do you need Spark?
• How do they work together?

#StratioBD

Cassandra
• Based on Amazon Dynamo…
• Replication, key/value, P2P
• And based on Bigtable…
• Column oriented

#StratioBD

NO BOTTLENECK

DECENTRALIZED

REPLICATED
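
The decentralized, replicated design can be illustrated with a toy hash ring. This is a minimal sketch, not Cassandra's actual partitioner: the node names, the MD5-based `token` function, and the replication factor of 3 are all assumptions made for illustration. The point it shows is that any node can compute where a key lives, so no central coordinator is needed.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring (illustrative hash only)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy peer-to-peer ring: every node can locate the replicas for any key."""

    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = min(replication_factor, len(nodes))
        # Sort nodes by token so a key's owner can be found by binary search.
        self.entries = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.entries]

    def replicas(self, key: str):
        """Return the nodes holding `key`: the owner plus the next nodes clockwise."""
        start = bisect.bisect_right(self.tokens, token(key)) % len(self.entries)
        return [
            self.entries[(start + i) % len(self.entries)][1]
            for i in range(self.replication_factor)
        ]


# Hypothetical four-node cluster and a hypothetical row key.
ring = Ring(["node-a", "node-b", "node-c", "node-d"])
owners = ring.replicas("http://acme.example/index.html")
```

Because the placement is a pure function of the key and the node list, losing one node still leaves two replicas reachable in this sketch.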

Case A
One user – lots of data

#StratioBD

Case B
Many users – little data

#StratioBD

Case C
Many users – lots of data

#StratioBD

Crawler app
100M indexed pages
3k reads
Query time: < 1s

Cassandra, I choose you

#StratioBD

New query
“I need to find all the references to the domain ACME.
I need the answer by Friday.”

#StratioBD

Problem
Cassandra is not well suited to resolving this type of query.
You need to design the schema with the query in mind.

#StratioBD
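
The problem of designing the schema around the query can be sketched with plain dictionaries standing in for tables. The rows, URLs, and table names below are hypothetical; the sketch only contrasts a lookup by key with a scan over every partition.

```python
from collections import defaultdict

# Hypothetical crawler rows: (page_url, domain of an outgoing link).
rows = [
    ("http://a.example/1", "acme.com"),
    ("http://a.example/2", "other.org"),
    ("http://b.example/1", "acme.com"),
]

# Table designed for the original query: "which links does this page contain?"
links_by_page = defaultdict(list)
for page, domain in rows:
    links_by_page[page].append(domain)

# The original query is a single lookup by key.
links_of_first_page = links_by_page["http://a.example/1"]


def pages_referencing_scan(domain):
    """The new query has no matching key, so every partition is scanned."""
    return sorted(p for p, ds in links_by_page.items() if domain in ds)


# Serving the new query by key needs a second table keyed by the queried column.
pages_by_domain = defaultdict(list)
for page, domain in rows:
    pages_by_domain[domain].append(page)
```

With 100M pages the scan is the expensive path, which is why a query that was not planned for at design time is a problem.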

What options do we have?
• Run a Hive query on top of C*
• Write an ETL script and load the data into another DB
• Clone the cluster

#StratioBD

And now… what can we do?

“We can't solve problems by using the same kind
of thinking we used when we created them”

Albert Einstein

#StratioBD

Spark
• Alternative to MapReduce
• A low-latency cluster computing system
• For very large datasets
• Created by UC Berkeley's AMPLab in 2010
• May be up to 100 times faster than MapReduce for:
   • Iterative algorithms
   • Interactive data mining

#StratioBD
Logistic regression in Spark vs Hadoop

SOURCE | http://spark.incubator.apache.org/
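
Logistic regression is a useful benchmark because it is iterative: every step of gradient descent makes a full pass over the same dataset. The following is a minimal single-machine sketch in plain Python, not Spark code; the dataset, learning rate, and iteration count are made up for illustration. It shows why keeping the data in memory between passes matters when each pass would otherwise reread it from disk.

```python
import math

# Tiny, made-up 1-D dataset: (feature, label). Positive features are labeled 1.
points = [(-2.0, 0), (-1.0, 0), (-0.5, 0), (0.5, 1), (1.0, 1), (2.0, 1)]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train(data, iterations=200, learning_rate=0.5):
    """Batch gradient descent: each iteration is one full pass over `data`."""
    w, b = 0.0, 0.0
    passes = 0
    for _ in range(iterations):
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in data)
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in data)
        w -= learning_rate * grad_w / len(data)
        b -= learning_rate * grad_b / len(data)
        passes += 1
    return w, b, passes


w, b, passes = train(points)
predictions = [1 if sigmoid(w * x + b) >= 0.5 else 0 for x, _ in points]
```

Here `passes` equals the number of iterations, so a 200-iteration run touches the whole dataset 200 times.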

#StratioBD

Spark and Cassandra

Integration points

#StratioBD

Cassandra’s HDFS abstraction layer
Advantages:
• Easily integrates with legacy systems.

Drawbacks:
• Very high-level: no access to Cassandra’s low-level features.
• Questionable performance.

INTEGRATION POINTS: HDFS OVER CASSANDRA

#StratioBD

Cassandra’s Hadoop Interface
• Thrift protocol
• CQL3 (our implementation)
   • Uses Cassandra’s new CqlPagingInputFormat

INTEGRATION POINTS: HDFS OVER CASSANDRA
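
As its name suggests, CqlPagingInputFormat reads rows a page at a time rather than all at once. The general idea of a paged scan can be sketched as follows. This is not Cassandra's implementation: `fetch_page` and `scan` are hypothetical helpers, and a sorted Python list stands in for one node's share of a table.

```python
def fetch_page(table, after_key, page_size):
    """Return up to `page_size` rows whose key is greater than `after_key`.

    `table` is a list of (key, value) pairs sorted by key.
    """
    remaining = [row for row in table if after_key is None or row[0] > after_key]
    return remaining[:page_size]


def scan(table, page_size=2):
    """Yield every row by requesting one page at a time."""
    last_key = None
    while True:
        page = fetch_page(table, last_key, page_size)
        if not page:
            return
        yield from page
        # Resume the next request after the last key already seen.
        last_key = page[-1][0]


table = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")]
rows = list(scan(table))
```

Paging bounds how much data is in flight per request, at the cost of one round trip per page.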

#StratioBD

CQL3 Integration
• Supports CQL3 features
• Respects data locality
• Good compromise between performance and implementation complexity

INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE – CQL3
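
Respecting data locality means running each task on a machine that already stores the data it reads. A minimal sketch of that scheduling idea follows, assuming a hypothetical mapping from token ranges to replica nodes; the `assign_tasks` helper is illustrative and is not part of Spark or Cassandra.

```python
# Hypothetical layout: which nodes hold a replica of each token range.
replicas_by_range = {
    "range-1": ["node-a", "node-b"],
    "range-2": ["node-b", "node-c"],
    "range-3": ["node-c", "node-a"],
}


def assign_tasks(replicas_by_range, workers):
    """Assign each range to the least-loaded worker that holds a replica of it.

    Falls back to the least-loaded worker overall when no replica is local.
    """
    load = {w: 0 for w in workers}
    assignment = {}
    for token_range, replicas in sorted(replicas_by_range.items()):
        local = [w for w in replicas if w in load]
        candidates = local or list(workers)
        chosen = min(candidates, key=lambda w: (load[w], w))
        assignment[token_range] = chosen
        load[chosen] += 1
    return assignment


assignment = assign_tasks(replicas_by_range, ["node-a", "node-b", "node-c"])
```

In this sketch every range is processed by a node that stores it, so no rows cross the network before the computation starts.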

#StratioBD

CQL3 Integration (II)
Provides a Java-friendly API:
• Developers map column families to custom serializable POJOs.
• StratioDeep wraps the complexity of performing Spark calculations directly over the user-provided POJOs.

INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE – CQL3
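
The idea of mapping rows to plain objects and computing over them can be sketched in Python, with a dataclass playing the role of a POJO. This is not the StratioDeep API: the `Page` entity, the `to_page` mapper, and the sample rows are hypothetical, and ordinary list comprehensions stand in for distributed Spark operations.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class Page:
    """Hypothetical entity mapped from one row of a crawler table."""
    url: str
    links: tuple


def to_page(row: dict) -> Page:
    """Map a raw row (column name -> value) to a typed entity."""
    return Page(url=row["url"], links=tuple(row["links"]))


def references_domain(page: Page, domain: str) -> bool:
    """True if any outgoing link of `page` points at `domain`."""
    return any(urlparse(link).hostname == domain for link in page.links)


raw_rows = [
    {"url": "http://a.example/1", "links": ["http://acme.com/products"]},
    {"url": "http://a.example/2", "links": ["http://other.org/"]},
    {"url": "http://b.example/1", "links": ["http://acme.com/", "http://other.org/x"]},
]

# With Spark this would be a distributed map and filter; plain Python stands in here.
pages = [to_page(r) for r in raw_rows]
acme_pages = [p.url for p in pages if references_domain(p, "acme.com")]
```

Once rows are typed objects, a query such as "find all references to the domain ACME" becomes an ordinary filter over those objects.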

#StratioBD

CQL3 Integration (III)
Drawbacks:
• Still not performing as well as we’d like:
   • Uses Cassandra’s Hadoop Interface
• No analyst-friendly interface:
   • No SQL-like query features

#StratioBD

INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE – CQL3

Future extensions
What are we currently working on?

Bring the integration to another level:
• Drop Cassandra’s Hadoop Interface
• Direct access to Cassandra’s SSTable files
• Extend Cassandra’s CQL3 to make use of Spark’s distributed data processing power

#StratioBD


An Efficient Data Mining Solution with Cassandra and Spark

1. AN EFFICIENT DATA MINING SOLUTION

2. Hadoop?

3. Cassandra?

4. Spark?

5. Stratio Deep: An efficient data mining solution. “Two and two are four? Sometimes… Sometimes they are five.” G. Orwell

7. Goals: Why do you need Cassandra? What is the problem? Why do you need Spark? How do they work together?

8. Cassandra: Based on Amazon Dynamo… Replication, key/value, P2P. And based on Bigtable… Column oriented

9. ROBUST FAST EFFICIENT

10. NO BOTTLENECK DECENTRALIZED REPLICATED

11. Another Database?

12. Why?

13. Case A: One user – lots of data

14. Case B: Many users – little data

15. Case C: Many users – lots of data

16. Crawler app: 100M indexed pages, 3k reads, query time < 1s. Cassandra, I choose you

17. But…

18. Marketing walks in

19. New query: “I need to find all the references to the domain ACME. I need the answer by Friday.”

20. Problem: Cassandra is not well suited to resolving this type of query. You need to design the schema with the query in mind.

21. Challenge Accepted

22. What options do we have? Run a Hive query on top of C*. Write an ETL script and load the data into another DB. Clone the cluster.

24. And now… what can we do? “We can't solve problems by using the same kind of thinking we used when we created them” Albert Einstein

25. Spark: Alternative to MapReduce. A low-latency cluster computing system. For very large datasets. Created by UC Berkeley's AMPLab in 2010. May be up to 100 times faster than MapReduce for iterative algorithms and interactive data mining.

26. Logistic regression in Spark vs Hadoop. SOURCE | http://spark.incubator.apache.org/

27. WHO USES SPARK?

28. Spark and Cassandra: Integration points

29. Cassandra’s HDFS abstraction layer. Advantages: Easily integrates with legacy systems. Drawbacks: Very high-level, with no access to Cassandra’s low-level features; questionable performance. INTEGRATION POINTS: HDFS OVER CASSANDRA

30. Cassandra’s Hadoop Interface: Thrift protocol; CQL3 (our implementation), which uses Cassandra’s new CqlPagingInputFormat. INTEGRATION POINTS: HDFS OVER CASSANDRA

31. CQL3 Integration: Supports CQL3 features. Respects data locality. Good compromise between performance and implementation complexity. INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE – CQL3

32. CQL3 Integration (II): Provides a Java-friendly API. Developers map column families to custom serializable POJOs. StratioDeep wraps the complexity of performing Spark calculations directly over the user-provided POJOs. INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE – CQL3

33. Demo

34. CQL3 Integration (III). Drawbacks: Still not performing as well as we’d like (uses Cassandra’s Hadoop Interface). No analyst-friendly interface (no SQL-like query features). INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE – CQL3

35. Future extensions: What are we currently working on? Bring the integration to another level: drop Cassandra’s Hadoop Interface; direct access to Cassandra’s SSTable files; extend Cassandra’s CQL3 to make use of Spark’s distributed data processing power.

36. Conclusion

37. THANKS

Editor's notes

Good afternoon. By now, everybody should know Stratio, the big data company. First, I need to know whether some of these concepts are familiar to you.
