4. RTB
- How often?
10-30M auctions per day for mobile devices in Russia,
i.e. 10-30 GB of data per day
- What do we need?
To be effective at showing ads
- They call it "big data & data mining"
We decided to use Cassandra & Hadoop for that
5. 1. Cassandra tokens
∆ = [T0, T4] - the token range
e.g. ∆ = [-2^63, +2^63]
Every time one writes (K,V) into Cassandra:
- e.g. token(K) falls in [T2, T3]
- (K,V) will be put onto node 3 (with replication factor 1)
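The placement rule above can be sketched as a lookup over the ring: find the first node whose upper token bound covers token(K). A minimal sketch, with purely illustrative token boundaries and node names (real clusters use vnodes and replication):

```python
import bisect

# Toy token ring: node i owns tokens in (T[i-1], T[i]].
# Token boundaries and node names are illustrative, not from a real cluster.
RING_MIN, RING_MAX = -2**63, 2**63 - 1
tokens = [-2**62, 0, 2**62, RING_MAX]   # upper bounds T1..T4 of each node's range
nodes  = ["node1", "node2", "node3", "node4"]

def node_for(token):
    """Pick the first node whose upper token bound covers the token
    (single replica, i.e. replication factor 1)."""
    assert RING_MIN <= token <= RING_MAX
    return nodes[bisect.bisect_left(tokens, token)]
```

So a key whose token lands in (0, 2^62] goes to node3, matching the slide's [T2, T3] example.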
6. 2. Cassandra Load Balancing
The partitioner generates tokens for your keys,
i.e. it computes token(K)
Cassandra offers the following partitioners:
● Murmur3Partitioner (default): Uniformly distributes data
across the cluster based on MurmurHash hash values.
● RandomPartitioner: Uniformly distributes data across the
cluster based on MD5 hash values.
● ByteOrderedPartitioner: Keeps an ordered distribution of data
lexically by key bytes.
The Murmur3Partitioner is the default partitioning strategy for new Cassandra clusters and the
right choice for new clusters in almost all cases.
http://www.datastax.com/docs/1.2/cluster_architecture/partitioners
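To make the hash-based partitioners concrete: RandomPartitioner derives a token from the MD5 hash of the key, which is why data spreads uniformly regardless of key shape. A minimal sketch (the mapping into the token range is simplified; real Cassandra treats the 128-bit MD5 as a BigInteger):

```python
import hashlib

def md5_token(key: bytes) -> int:
    """Roughly how RandomPartitioner derives a token: hash the key with MD5
    and map it into the token range [0, 2**127). Simplified for illustration."""
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest, "big") % 2**127
```

Because the token depends only on the hash, similar keys ("user:1", "user:2") land on unrelated nodes; ByteOrderedPartitioner, by contrast, would keep them adjacent.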
7. 3. Cassandra indexes and knows
- Cassandra supports common data types
E.g. byte, string, long, double
- Cassandra supports secondary indexes
E.g. you can select your data selectively instead of in bulk
- Cassandra knows how much data (how many records) is in every
token range
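What a secondary index buys you can be shown with a toy in-memory model: a reverse map from a column value to the primary keys holding it, so a query touches only matching rows instead of scanning everything. Column names and rows are invented for illustration:

```python
from collections import defaultdict

# Toy rows keyed by primary key; "country" plays the role of an indexed column.
rows = {
    "u1": {"country": "RU", "bid": 10},
    "u2": {"country": "US", "bid": 7},
    "u3": {"country": "RU", "bid": 3},
}

# Secondary index: column value -> set of primary keys with that value.
index = defaultdict(set)
for pk, row in rows.items():
    index[row["country"]].add(pk)

def select_where(country):
    """Fetch only the matching rows via the index instead of a full scan."""
    return {pk: rows[pk] for pk in index[country]}
```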
8. Cassandra & Map-Reduce
Google says:
1. Cassandra is integrated with Map-Reduce
http://wiki.apache.org/cassandra/HadoopSupport
2. The integration docs are outdated
3. They target Hadoop 1.0.3 or some similarly old version
This means: please install the Hadoop + MapReduce cluster yourself
9. Cassandra & Map-Reduce (we want)
1. Cloudera Hadoop Distribution (CDH4)
Cloudera Manager installs your cluster in a couple of clicks
2. Up to date (Cassandra 1.1.x - 1.2.x)
Solution:
A) Take Cassandra sources from
http://cassandra.apache.org/download/
B) Take the package org.apache.cassandra.hadoop and recompile it with
dependencies from CDH4 & Cassandra 1.x
And the jar is ready to go for your map-reduce jobs
10. 1. Allocate your cluster
DataStax says:
To configure a Cassandra cluster for Hadoop integration, overlay a Hadoop cluster over your Cassandra nodes.
This involves installing a TaskTracker on each Cassandra node, and setting up a JobTracker and HDFS data node.
Why?
12. 2. Number of map tasks
Job control parameter: InputSplitSize (default 65536)
It controls roughly how many rows one mapper will receive
Every map task has its own token range to read data from: [-2^63, +2^63] / number of map tasks
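The arithmetic above can be sketched directly: the task count follows from total rows / InputSplitSize (Cassandra can supply the row estimates per token range), and the full token range is carved into that many contiguous subranges. A simplified sketch with hypothetical numbers; real Cassandra computes splits per node from its own estimates:

```python
def split_ranges(total_rows, input_split_size=65536,
                 ring_min=-2**63, ring_max=2**63 - 1):
    """Carve the full token range into one subrange per map task (a sketch;
    real Cassandra sizes splits per node using token-range row estimates)."""
    n_tasks = max(1, -(-total_rows // input_split_size))  # ceiling division
    width = (ring_max - ring_min) // n_tasks
    ranges, start = [], ring_min
    for i in range(n_tasks):
        end = ring_max if i == n_tasks - 1 else start + width
        ranges.append((start, end))
        start = end
    return ranges
```

E.g. 200,000 rows with the default InputSplitSize of 65536 yields 4 map tasks, each owning a quarter of the ring.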
13. 3. How job reads the data
JobControlParameter: RangeBatchSize (default: 4096)
The batch size for bulk reads, with your filters (primary & secondary indexes) applied
Cassandra does the filtering on the server side
( [-2^63, +2^63] / number of map tasks )
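Within its token range, a mapper pulls rows in pages of at most RangeBatchSize, resuming from the last token it saw. A toy paging loop over an in-memory store; the function names and the store are invented for illustration (filters would run server-side in real Cassandra):

```python
def read_in_batches(fetch_page, start, end, batch_size=4096):
    """Page through a token range, asking the server for at most
    batch_size rows at a time and resuming after the last token seen."""
    token = start
    while True:
        page = fetch_page(token, end, batch_size)  # list of (token, row) pairs
        if not page:
            return
        yield from (row for _, row in page)
        last_token = page[-1][0]
        if last_token >= end or len(page) < batch_size:
            return
        token = last_token + 1

# Toy backing store for the sketch: tokens 0..9, one row each.
store = [(t, f"row{t}") for t in range(10)]

def fetch_page(start, end, limit):
    """Stand-in for a server-side range read with a row limit."""
    return [(t, r) for t, r in store if start <= t <= end][:limit]
```

With batch_size=4, reading the range [0, 9] issues three round trips (4 + 4 + 2 rows), which is why a small RangeBatchSize makes big reads slow.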
14. Pros:
1. Easy to manage (Cassandra cluster & Cloudera Manager)
2. Easy to index
3. Query language & data type support
Cons:
1. Scaling is extremely expensive (every node should run Cassandra + Hadoop)
2. Reading is very slow
3. Reading big amounts of data is impractical
Note: Netflix also uses Cassandra to manage their data.
But their map-reduce jobs read sstable files directly, bypassing Cassandra!
http://www.datastax.com/dev/blog/2012-in-review-performance
Conclusion