SlideShare una empresa de Scribd logo
1 de 14
Descargar para leer sin conexión
Cassandra & MR
and how we used it in
Agenda
- Why do we need Cassandra & MapReduce
- 3 notes about Cassandra
- 3 notes Cassandra + MapReduce
Real Time Bidding (RTB)
RTB
- How often?
10 - 30M auctions per day for mobile devices in Russia
e.g. 10-30Gb of data /day
- What do we need?
Be effective in showing Ads
- They call it "big data & data mining"
We decided to use Cassandra&Hadoop for that
1. Cassandra tokens
∆ = [T0, T4] - token range
ex. ∆ = [-2^63, +2^63]
Every time one writes (K,V) into Cassandra:
- ex. token(K) in [T2, T3]
- (K,V) will be put into node 3 (if replica 1)
2. Cassandra Load Balancing
Partitioner generates tokens for your keys
E.g. it creates token(K)
Cassandra offers the following partitioners:
● Murmur3Partitioner (default): Uniformly distributes data
across the cluster based on MurmurHash hash values.
● RandomPartitioner: Uniformly distributes data across the
cluster based on MD5 hash values.
● ByteOrderedPartitioner: Keeps an ordered distribution of data
lexically by key bytes
The Murmur3Partitioner is the default partitioning strategy for new Cassandra clusters and the
right choice for new clusters in almost all cases.
http://www.datastax.com/docs/1.2/cluster_architecture/partitioners
3. Cassandra indexes and knows
- Cassandra support common data formats
E.g. byte, string, long, double
- Cassandra support secondary indexes
E.g. you can select your data not bulky
- Cassandra knows how much data (records) in every
token range
Cassandra & Map-Reduce
Google says:
1. Cassandra is integrated with Map-Reduce
http://wiki.apache.org/cassandra/HadoopSupport
2. It is outdated
3. It is used for Hadoop 1.0.3 or whatever version
This means: Please install hadoop+mr cluster yourself
Cassandra & Map-Reduce (we want)
1. Cloudera Hadoop Distribution (CDH4)
Cloudera manager installs your cluster in couple of clicks
2. Up to date (Cassandra 1.1.x - 1.2.x)
Solution:
A) Take Cassandra sources from
http://cassandra.apache.org/download/
B) Take package org.apache.cassandra.hadoop and recompile it, having
dependencies from CDH4&Cassadnra[1.x]
And Jar is ready to go for your map-reduce jobs
1. Allocate your cluster
DataStax says:
To configure a Cassandra cluster for Hadoop integration, overlay a Hadoop cluster over your Cassandra nodes.
This involves installing a TaskTracker on each Cassandra node, and setting up a JobTracker and HDFS data node.
Why?
Because this:
works 100 times faster than this:
2. Number of map tasks
Job control parameter: InputSplitSize (default 65536)
Estimates how much data one mapper will receive
Every map task has it's own token range to read data from: [-2^63, +2^63] / number of map tasks
3. How job reads the data
JobControlParameter: RangeBatchSize (default: 4096)
Bulk volume to read including your filters (primary & secondary indexes)
Cassandra does filtering job on server side
( [-2^63, +2^63] / number of map tasks )
Pros:
1. Easy to manage (Cassandra cluster & cloudera manager is
2. Easy to index
3. Supports query language & data types support
Cons:
1. Scalable extremely expensive (every node should run cassandra + hadoop)
2. Reading is very slow
3. Reading big amount is impossible
Note: Netflix reading using cassandra to manage the data.
But their map-reduce jobs are reading sstable-files directly, avoiding Cassandra!
http://www.datastax.com/dev/blog/2012-in-review-performance
Conclusion

Más contenido relacionado

La actualidad más candente

Anomaly Detection with Apache Spark
Anomaly Detection with Apache SparkAnomaly Detection with Apache Spark
Anomaly Detection with Apache SparkCloudera, Inc.
 
Hands on MapR -- Viadea
Hands on MapR -- ViadeaHands on MapR -- Viadea
Hands on MapR -- Viadeaviadea
 
Using Spark for Timeseries Graph Analytics ved
Using Spark for Timeseries Graph Analytics vedUsing Spark for Timeseries Graph Analytics ved
Using Spark for Timeseries Graph Analytics vedVed Mulkalwar
 
Using Spark for Timeseries Graph Analytics ved
Using Spark for Timeseries Graph Analytics vedUsing Spark for Timeseries Graph Analytics ved
Using Spark for Timeseries Graph Analytics vedVed Mulkalwar
 
Gnocchi v4 - past and present
Gnocchi v4 - past and presentGnocchi v4 - past and present
Gnocchi v4 - past and presentGordon Chung
 
Cassandra synergy
Cassandra synergyCassandra synergy
Cassandra synergyniallmilton
 
spark stream - kafka - the right way
spark stream - kafka - the right way spark stream - kafka - the right way
spark stream - kafka - the right way Dori Waldman
 
Network simulator 2
Network simulator 2Network simulator 2
Network simulator 2AAKASH S
 
Network simulator 2
Network simulator 2Network simulator 2
Network simulator 2AAKASH S
 
Druid meetup @walkme
Druid meetup @walkmeDruid meetup @walkme
Druid meetup @walkmeDori Waldman
 
Using Spark over Cassandra
Using Spark over CassandraUsing Spark over Cassandra
Using Spark over CassandraNoam Barkai
 
Apache Spark Internals - Part 2
Apache Spark Internals - Part 2Apache Spark Internals - Part 2
Apache Spark Internals - Part 2Jéferson Machado
 
Cloud burst 소개
Cloud burst 소개Cloud burst 소개
Cloud burst 소개주영 송
 
Gnocchi v3 brownbag
Gnocchi v3 brownbagGnocchi v3 brownbag
Gnocchi v3 brownbagGordon Chung
 
Gnocchi v4 (preview)
Gnocchi v4 (preview)Gnocchi v4 (preview)
Gnocchi v4 (preview)Gordon Chung
 
MapReduce: Distributed Computing for Machine Learning
MapReduce: Distributed Computing for Machine LearningMapReduce: Distributed Computing for Machine Learning
MapReduce: Distributed Computing for Machine Learningbutest
 

La actualidad más candente (19)

Anomaly Detection with Apache Spark
Anomaly Detection with Apache SparkAnomaly Detection with Apache Spark
Anomaly Detection with Apache Spark
 
Gnocchi v3
Gnocchi v3Gnocchi v3
Gnocchi v3
 
Hands on MapR -- Viadea
Hands on MapR -- ViadeaHands on MapR -- Viadea
Hands on MapR -- Viadea
 
Using Spark for Timeseries Graph Analytics ved
Using Spark for Timeseries Graph Analytics vedUsing Spark for Timeseries Graph Analytics ved
Using Spark for Timeseries Graph Analytics ved
 
Apache Nemo
Apache NemoApache Nemo
Apache Nemo
 
Using Spark for Timeseries Graph Analytics ved
Using Spark for Timeseries Graph Analytics vedUsing Spark for Timeseries Graph Analytics ved
Using Spark for Timeseries Graph Analytics ved
 
Gnocchi v4 - past and present
Gnocchi v4 - past and presentGnocchi v4 - past and present
Gnocchi v4 - past and present
 
Cassandra synergy
Cassandra synergyCassandra synergy
Cassandra synergy
 
spark stream - kafka - the right way
spark stream - kafka - the right way spark stream - kafka - the right way
spark stream - kafka - the right way
 
Network simulator 2
Network simulator 2Network simulator 2
Network simulator 2
 
Network simulator 2
Network simulator 2Network simulator 2
Network simulator 2
 
Druid meetup @walkme
Druid meetup @walkmeDruid meetup @walkme
Druid meetup @walkme
 
Using Spark over Cassandra
Using Spark over CassandraUsing Spark over Cassandra
Using Spark over Cassandra
 
Apache Spark Internals - Part 2
Apache Spark Internals - Part 2Apache Spark Internals - Part 2
Apache Spark Internals - Part 2
 
Cloud burst 소개
Cloud burst 소개Cloud burst 소개
Cloud burst 소개
 
Gnocchi v3 brownbag
Gnocchi v3 brownbagGnocchi v3 brownbag
Gnocchi v3 brownbag
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
Gnocchi v4 (preview)
Gnocchi v4 (preview)Gnocchi v4 (preview)
Gnocchi v4 (preview)
 
MapReduce: Distributed Computing for Machine Learning
MapReduce: Distributed Computing for Machine LearningMapReduce: Distributed Computing for Machine Learning
MapReduce: Distributed Computing for Machine Learning
 

Destacado

Predictable Big Data Performance in Real-time
Predictable Big Data Performance in Real-timePredictable Big Data Performance in Real-time
Predictable Big Data Performance in Real-timeAerospike, Inc.
 
What enterprises can learn from Real Time Bidding (RTB)
What enterprises can learn from Real Time Bidding (RTB)What enterprises can learn from Real Time Bidding (RTB)
What enterprises can learn from Real Time Bidding (RTB)bigdatagurus_meetup
 
How We Use MongoDB in Our Advertising System
How We Use MongoDB in Our Advertising SystemHow We Use MongoDB in Our Advertising System
How We Use MongoDB in Our Advertising SystemMongoDB
 
Algorithmic marketplace
Algorithmic marketplaceAlgorithmic marketplace
Algorithmic marketplacereducedata
 
Monitoring Cassandra with graphite using Yammer Coda-Hale Library
Monitoring Cassandra with graphite using Yammer Coda-Hale LibraryMonitoring Cassandra with graphite using Yammer Coda-Hale Library
Monitoring Cassandra with graphite using Yammer Coda-Hale LibraryNader Ganayem
 
Get Started with Data Science by Analyzing Traffic Data from California Highways
Get Started with Data Science by Analyzing Traffic Data from California HighwaysGet Started with Data Science by Analyzing Traffic Data from California Highways
Get Started with Data Science by Analyzing Traffic Data from California HighwaysAerospike, Inc.
 
There are 250 Database products, are you running the right one?
There are 250 Database products, are you running the right one?There are 250 Database products, are you running the right one?
There are 250 Database products, are you running the right one?Aerospike, Inc.
 
WEBINAR: Architectures for Digital Transformation and Next-Generation Systems...
WEBINAR: Architectures for Digital Transformation and Next-Generation Systems...WEBINAR: Architectures for Digital Transformation and Next-Generation Systems...
WEBINAR: Architectures for Digital Transformation and Next-Generation Systems...Aerospike, Inc.
 
Streaming Big Data & Analytics For Scale
Streaming Big Data & Analytics For ScaleStreaming Big Data & Analytics For Scale
Streaming Big Data & Analytics For ScaleHelena Edelson
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Helena Edelson
 
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch AnalysisNoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch AnalysisHelena Edelson
 
2017 DB Trends for Powering Real-Time Systems of Engagement
2017 DB Trends for Powering Real-Time Systems of Engagement2017 DB Trends for Powering Real-Time Systems of Engagement
2017 DB Trends for Powering Real-Time Systems of EngagementAerospike, Inc.
 
What the Spark!? Intro and Use Cases
What the Spark!? Intro and Use CasesWhat the Spark!? Intro and Use Cases
What the Spark!? Intro and Use CasesAerospike, Inc.
 
Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...
Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...
Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...Helena Edelson
 
Building Reactive Distributed Systems For Streaming Big Data, Analytics & Mac...
Building Reactive Distributed Systems For Streaming Big Data, Analytics & Mac...Building Reactive Distributed Systems For Streaming Big Data, Analytics & Mac...
Building Reactive Distributed Systems For Streaming Big Data, Analytics & Mac...Helena Edelson
 
All about Programmatic buying(RTB), DSP,SSP, DMP & DCT - A complete digital ...
All about Programmatic buying(RTB), DSP,SSP, DMP & DCT -  A complete digital ...All about Programmatic buying(RTB), DSP,SSP, DMP & DCT -  A complete digital ...
All about Programmatic buying(RTB), DSP,SSP, DMP & DCT - A complete digital ...Karunakar Ravirala
 
Nibiru: Building your own NoSQL store
Nibiru: Building your own NoSQL storeNibiru: Building your own NoSQL store
Nibiru: Building your own NoSQL storeEdward Capriolo
 

Destacado (19)

Predictable Big Data Performance in Real-time
Predictable Big Data Performance in Real-timePredictable Big Data Performance in Real-time
Predictable Big Data Performance in Real-time
 
What enterprises can learn from Real Time Bidding (RTB)
What enterprises can learn from Real Time Bidding (RTB)What enterprises can learn from Real Time Bidding (RTB)
What enterprises can learn from Real Time Bidding (RTB)
 
How We Use MongoDB in Our Advertising System
How We Use MongoDB in Our Advertising SystemHow We Use MongoDB in Our Advertising System
How We Use MongoDB in Our Advertising System
 
Algorithmic marketplace
Algorithmic marketplaceAlgorithmic marketplace
Algorithmic marketplace
 
Monitoring Cassandra with graphite using Yammer Coda-Hale Library
Monitoring Cassandra with graphite using Yammer Coda-Hale LibraryMonitoring Cassandra with graphite using Yammer Coda-Hale Library
Monitoring Cassandra with graphite using Yammer Coda-Hale Library
 
Get Started with Data Science by Analyzing Traffic Data from California Highways
Get Started with Data Science by Analyzing Traffic Data from California HighwaysGet Started with Data Science by Analyzing Traffic Data from California Highways
Get Started with Data Science by Analyzing Traffic Data from California Highways
 
Hugfr SPARK & RIAK -20160114_hug_france
Hugfr  SPARK & RIAK -20160114_hug_franceHugfr  SPARK & RIAK -20160114_hug_france
Hugfr SPARK & RIAK -20160114_hug_france
 
There are 250 Database products, are you running the right one?
There are 250 Database products, are you running the right one?There are 250 Database products, are you running the right one?
There are 250 Database products, are you running the right one?
 
WEBINAR: Architectures for Digital Transformation and Next-Generation Systems...
WEBINAR: Architectures for Digital Transformation and Next-Generation Systems...WEBINAR: Architectures for Digital Transformation and Next-Generation Systems...
WEBINAR: Architectures for Digital Transformation and Next-Generation Systems...
 
Streaming Big Data & Analytics For Scale
Streaming Big Data & Analytics For ScaleStreaming Big Data & Analytics For Scale
Streaming Big Data & Analytics For Scale
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
 
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch AnalysisNoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
 
2017 DB Trends for Powering Real-Time Systems of Engagement
2017 DB Trends for Powering Real-Time Systems of Engagement2017 DB Trends for Powering Real-Time Systems of Engagement
2017 DB Trends for Powering Real-Time Systems of Engagement
 
What the Spark!? Intro and Use Cases
What the Spark!? Intro and Use CasesWhat the Spark!? Intro and Use Cases
What the Spark!? Intro and Use Cases
 
Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...
Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...
Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...
 
Building Reactive Distributed Systems For Streaming Big Data, Analytics & Mac...
Building Reactive Distributed Systems For Streaming Big Data, Analytics & Mac...Building Reactive Distributed Systems For Streaming Big Data, Analytics & Mac...
Building Reactive Distributed Systems For Streaming Big Data, Analytics & Mac...
 
All about Programmatic buying(RTB), DSP,SSP, DMP & DCT - A complete digital ...
All about Programmatic buying(RTB), DSP,SSP, DMP & DCT -  A complete digital ...All about Programmatic buying(RTB), DSP,SSP, DMP & DCT -  A complete digital ...
All about Programmatic buying(RTB), DSP,SSP, DMP & DCT - A complete digital ...
 
M6d cassandra summit
M6d cassandra summitM6d cassandra summit
M6d cassandra summit
 
Nibiru: Building your own NoSQL store
Nibiru: Building your own NoSQL storeNibiru: Building your own NoSQL store
Nibiru: Building your own NoSQL store
 

Similar a Cassandra&map reduce

Spark 计算模型
Spark 计算模型Spark 计算模型
Spark 计算模型wang xing
 
GumGum: Multi-Region Cassandra in AWS
GumGum: Multi-Region Cassandra in AWSGumGum: Multi-Region Cassandra in AWS
GumGum: Multi-Region Cassandra in AWSDataStax Academy
 
The Apache Cassandra ecosystem
The Apache Cassandra ecosystemThe Apache Cassandra ecosystem
The Apache Cassandra ecosystemAlex Thompson
 
Apache cassandra and spark. you got the the lighter, let's start the fire
Apache cassandra and spark. you got the the lighter, let's start the fireApache cassandra and spark. you got the the lighter, let's start the fire
Apache cassandra and spark. you got the the lighter, let's start the firePatrick McFadin
 
Apache Cassandra at the Geek2Geek Berlin
Apache Cassandra at the Geek2Geek BerlinApache Cassandra at the Geek2Geek Berlin
Apache Cassandra at the Geek2Geek BerlinChristian Johannsen
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkDatio Big Data
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkDatabricks
 
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATLSandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATLMLconf
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
 
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016StampedeCon
 
Apache cassandra & apache spark for time series data
Apache cassandra & apache spark for time series dataApache cassandra & apache spark for time series data
Apache cassandra & apache spark for time series dataPatrick McFadin
 
Tweaking perfomance on high-load projects_Думанский Дмитрий
Tweaking perfomance on high-load projects_Думанский ДмитрийTweaking perfomance on high-load projects_Думанский Дмитрий
Tweaking perfomance on high-load projects_Думанский ДмитрийGeeksLab Odessa
 
Parallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAParallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAprithan
 

Similar a Cassandra&map reduce (20)

Spark 计算模型
Spark 计算模型Spark 计算模型
Spark 计算模型
 
GumGum: Multi-Region Cassandra in AWS
GumGum: Multi-Region Cassandra in AWSGumGum: Multi-Region Cassandra in AWS
GumGum: Multi-Region Cassandra in AWS
 
The Apache Cassandra ecosystem
The Apache Cassandra ecosystemThe Apache Cassandra ecosystem
The Apache Cassandra ecosystem
 
Apache cassandra and spark. you got the the lighter, let's start the fire
Apache cassandra and spark. you got the the lighter, let's start the fireApache cassandra and spark. you got the the lighter, let's start the fire
Apache cassandra and spark. you got the the lighter, let's start the fire
 
Apache Cassandra at the Geek2Geek Berlin
Apache Cassandra at the Geek2Geek BerlinApache Cassandra at the Geek2Geek Berlin
Apache Cassandra at the Geek2Geek Berlin
 
Unit 2
Unit 2Unit 2
Unit 2
 
ClusterAnalysis
ClusterAnalysisClusterAnalysis
ClusterAnalysis
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATLSandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
Big data analytics_beyond_hadoop_public_18_july_2013
Big data analytics_beyond_hadoop_public_18_july_2013Big data analytics_beyond_hadoop_public_18_july_2013
Big data analytics_beyond_hadoop_public_18_july_2013
 
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Apache cassandra & apache spark for time series data
Apache cassandra & apache spark for time series dataApache cassandra & apache spark for time series data
Apache cassandra & apache spark for time series data
 
Tweaking perfomance on high-load projects_Думанский Дмитрий
Tweaking perfomance on high-load projects_Думанский ДмитрийTweaking perfomance on high-load projects_Думанский Дмитрий
Tweaking perfomance on high-load projects_Думанский Дмитрий
 
Parallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAParallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDA
 
Hadoop - Introduction to HDFS
Hadoop - Introduction to HDFSHadoop - Introduction to HDFS
Hadoop - Introduction to HDFS
 
Scala+data
Scala+dataScala+data
Scala+data
 
Hadoop
HadoopHadoop
Hadoop
 

Último

The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 

Último (20)

The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 

Cassandra&map reduce

  • 1. Cassandra & MR and how we used it in
  • 2. Agenda - Why do we need Cassandra & MapReduce - 3 notes about Cassandra - 3 notes Cassandra + MapReduce
  • 4. RTB - How often? 10 - 30M auctions per day for mobile devices in Russia e.g. 10-30Gb of data /day - What do we need? Be effective in showing Ads - They call it "big data & data mining" We decided to use Cassandra&Hadoop for that
  • 5. 1. Cassandra tokens ∆ = [T0, T4] - token range ex. ∆ = [-2^63, +2^63] Every time one writes (K,V) into Cassandra: - ex. token(K) in [T2, T3] - (K,V) will be put into node 3 (if replica 1)
  • 6. 2. Cassandra Load Balancing Partitioner generates tokens for your keys E.g. it creates token(K) Cassandra offers the following partitioners: ● Murmur3Partitioner (default): Uniformly distributes data across the cluster based on MurmurHash hash values. ● RandomPartitioner: Uniformly distributes data across the cluster based on MD5 hash values. ● ByteOrderedPartitioner: Keeps an ordered distribution of data lexically by key bytes The Murmur3Partitioner is the default partitioning strategy for new Cassandra clusters and the right choice for new clusters in almost all cases. http://www.datastax.com/docs/1.2/cluster_architecture/partitioners
  • 7. 3. Cassandra indexes and knows - Cassandra support common data formats E.g. byte, string, long, double - Cassandra support secondary indexes E.g. you can select your data not bulky - Cassandra knows how much data (records) in every token range
  • 8. Cassandra & Map-Reduce Google says: 1. Cassandra is integrated with Map-Reduce http://wiki.apache.org/cassandra/HadoopSupport 2. It is outdated 3. It is used for Hadoop 1.0.3 or whatever version This means: Please install hadoop+mr cluster yourself
  • 9. Cassandra & Map-Reduce (we want) 1. Cloudera Hadoop Distribution (CDH4) Cloudera manager installs your cluster in couple of clicks 2. Up to date (Cassandra 1.1.x - 1.2.x) Solution: A) Take Cassandra sources from http://cassandra.apache.org/download/ B) Take package org.apache.cassandra.hadoop and recompile it, having dependencies from CDH4&Cassadnra[1.x] And Jar is ready to go for your map-reduce jobs
  • 10. 1. Allocate your cluster DataStax says: To configure a Cassandra cluster for Hadoop integration, overlay a Hadoop cluster over your Cassandra nodes. This involves installing a TaskTracker on each Cassandra node, and setting up a JobTracker and HDFS data node. Why?
  • 11. Because this: works 100 times faster than this:
  • 12. 2. Number of map tasks Job control parameter: InputSplitSize (default 65536) Estimates how much data one mapper will receive Every map task has it's own token range to read data from: [-2^63, +2^63] / number of map tasks
  • 13. 3. How job reads the data JobControlParameter: RangeBatchSize (default: 4096) Bulk volume to read including your filters (primary & secondary indexes) Cassandra does filtering job on server side ( [-2^63, +2^63] / number of map tasks )
  • 14. Pros: 1. Easy to manage (Cassandra cluster & cloudera manager is 2. Easy to index 3. Supports query language & data types support Cons: 1. Scalable extremely expensive (every node should run cassandra + hadoop) 2. Reading is very slow 3. Reading big amount is impossible Note: Netflix reading using cassandra to manage the data. But their map-reduce jobs are reading sstable-files directly, avoiding Cassandra! http://www.datastax.com/dev/blog/2012-in-review-performance Conclusion