Cassandra + Hadoop = Brisk

An introduction to DataStax's Brisk (a distribution of Cassandra, Hadoop and Hive). Includes a back story of my own experience with Cassandra plus a demo of Brisk built around a very simple ad-network-type application.

Welcome Brisk!

Can split cluster for OLAP and OLTP workloads, scaling up either as required

Get buckets a user belongs to

Jairam has been bug-fixing some minor issues

Our pixel: http://wehaveyourkidneys.com/add.php?segment=<alphaNumericCode>&expire=<numberOfSeconds>
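A minimal sketch of the parameter handling a pixel endpoint like add.php might perform, written in Python rather than the demo's PHP. The function name `parse_pixel_request`, the validation rules, and the choice to treat `expire` as optional are assumptions for illustration; the demo code on GitHub is the authority on what add.php actually does.

```python
import re
from urllib.parse import urlparse, parse_qs


def parse_pixel_request(url):
    """Return (segment, expire_seconds) parsed from a pixel URL.

    expire_seconds is None when no expiry is given (an assumption: the
    slides only say an expiry time can be defined). Raises ValueError
    on malformed input.
    """
    query = parse_qs(urlparse(url).query)

    # The slide describes the segment as an alphanumeric code.
    segment = query.get("segment", [""])[0]
    if not re.fullmatch(r"[A-Za-z0-9]+", segment):
        raise ValueError("segment must be a non-empty alphanumeric code")

    # The slide describes expire as a number of seconds.
    expire_raw = query.get("expire", [None])[0]
    if expire_raw is None:
        return segment, None
    if not re.fullmatch(r"[0-9]+", expire_raw) or int(expire_raw) == 0:
        raise ValueError("expire must be a positive number of seconds")
    return segment, int(expire_raw)
```

For example, `parse_pixel_request("http://wehaveyourkidneys.com/add.php?segment=abc123&expire=3600")` yields the segment code and the expiry in seconds, ready to be written with a TTL.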

1. London Our sponsors: Acunu

2. But first, a short back story…

3. [Diagram: items numbered 1 to 12]

4. [Diagram: items numbered 13 to 24, labelled "GC HELL!"]

5. [Diagram: items numbered 25 to 36]

7. Please volunteer if you would like to give a talk, Internet fame awaits

9. Analytics is more difficult than it could be

10.

11.

12.

13. Demonstrating Brisk… Building an Ad Network!

14.

15. System to put users in buckets via a pixel

16. Real-time queries

17.

18. API provides:

19. Add user to a bucket (including ability to define expiry time)

20.

21. Ubuntu 10.10 image with RAID 0 ephemeral disks

22.

23. Data model
CF = users: [userUUID][segmentID] = 1
CF = segments: [segmentID][userUUID] = 1

24. Data model
create keyspace whyk
... with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'
... and strategy_options = [{replication_factor:1}];
create column family users
... with comparator = 'AsciiType'
... and rows_cached = 5000;
create column family segments
... with comparator = 'AsciiType'
... and rows_cached = 5000;

26.

27. Real-time access
http://wehaveyourkidneys.com/show.php
$pool = new ConnectionPool('whyk', array('localhost'));
$users = new ColumnFamily($pool, 'users');
// @todo this only gets first 100!
$segments = $users->get($userUuid);
header('Content-Type: application/json');
echo json_encode(array_keys($segments));

28. Analytics
How many users in each segment?
Launch Hive (very easy!)
root@brisk-01:~# brisk hive

29. CREATE EXTERNAL TABLE whyk.users (userUuid string, segmentId string, value string)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES ("cassandra.columns.mapping" = ":key,:column,:value");
select segmentId, count(1) as total from whyk.users group by segmentId order by total desc;

30. Summary http://www.flickr.com/photos/sovietuk/2956044892/sizes/o/in/photostream/

31. Real-time access + batch analytics

32. Easy: easy to set up, easy to deploy mixed-mode clusters, easy to query (Hive)

33. No single point of failure

34. Further reading…
Installing the Brisk AMI: http://www.datastax.com/docs/0.8/brisk/install_brisk_ami
Key advantages of Brisk (from Jonathan Ellis): http://hackerne.ws/item?id=2528271
Why I’m very excited about DataStax’s Brisk (by Nathan Milford): http://blog.milford.io/2011/04/why-i-am-very-excited-about-datastaxs-brisk/
The demo code on GitHub: https://github.com/davegardnerisme/we-have-your-kidneys
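The data model and API from the slides above can be sketched as an in-memory Python model. This is an illustration under stated assumptions, not the demo's implementation: `SegmentStore` and its method names are hypothetical, plain dictionaries stand in for the two column families, and a stored expiry timestamp stands in for Cassandra's expiring columns (where the slides store the value 1).

```python
import time


class SegmentStore:
    """In-memory stand-in for the users and segments column families."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self.users = {}     # users[userUUID][segmentID] = expiry timestamp or None
        self.segments = {}  # segments[segmentID][userUUID] = expiry timestamp or None

    def add(self, user_uuid, segment_id, expire_seconds=None):
        """Add a user to a bucket, writing both rows so either lookup is one read."""
        expires_at = None
        if expire_seconds is not None:
            expires_at = self._clock() + expire_seconds
        self.users.setdefault(user_uuid, {})[segment_id] = expires_at
        self.segments.setdefault(segment_id, {})[user_uuid] = expires_at

    def _live_columns(self, row):
        # Skip expired entries and return names sorted, mirroring how
        # columns come back ordered by an ASCII comparator.
        now = self._clock()
        return sorted(name for name, exp in row.items() if exp is None or exp > now)

    def segments_for(self, user_uuid):
        """Get the buckets a user belongs to."""
        return self._live_columns(self.users.get(user_uuid, {}))

    def users_in(self, segment_id):
        """Get the users currently in a bucket."""
        return self._live_columns(self.segments.get(segment_id, {}))
```

Writing each membership twice trades extra writes for cheap reads in both directions, which suits a store with high write throughput.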

Editor's notes

Started at Imagini; May 2010. New ad-targeting product! Lots of users. MySQL DB for profiles, MySQL-based server for events reporting. Profile DB cannot update rows so we only insert; this means clients have to merge together all rows for a user on every read. MySQL DB has a habit of dying, requiring a repair and downtime; having 2 DBs managed to put off total death but not for long.
Choosing Cassandra after some research; no single point of failure attractive, high write throughput attractive, linear scaling attractive. Welcome to GC hell! Start Cassandra London: like Alcoholics Anonymous; a support network.
Batch analytics; how? No Hive support, no support for streaming jar. Pig input reader. No output reader; require HDFS.
Keep up the meetups. Acunu generous at providing speakers; downside is hearing sales pitch! 0.7 comes along; downside is not compatible with 0.6; Thrift interface changes. 0.8 comes along; CQL, counters. Brisk!
A summary.
Some points about “distribution”. Some points about Cloudera and reaction.
Realtime + batch analytics combined. No single point of failure; we don’t need Hadoop’s namenode anymore. Cross-DC clusters.
No ads. No network. No publishers. Cool domain name.
User segments / buckets; have some ID, can expire (we’ll use Cassandra’s expiring columns). Real-time updates via a simple script (via CQL?). Real-time queries (is user in this segment? What segments is user X in?). Analytics (how many users in each segment? How many segments are users in on average? Std dev?).
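The analytics questions in these notes can be sketched in plain Python over (user, segment) pairs, the same shape as the rows the Hive table exposes. The function names are hypothetical, the tie-break by segment name is an addition for deterministic output, and using the population standard deviation is an assumption, since the notes do not say which variant is wanted.

```python
from collections import Counter
from statistics import mean, pstdev


def users_per_segment(rows):
    """Count users in each segment, largest first.

    rows is an iterable of (user_uuid, segment_id) pairs. Ties are
    broken by segment name (an assumption for stable ordering).
    """
    counts = Counter(segment for _user, segment in rows)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def segments_per_user_stats(rows):
    """Return (mean, population std dev) of segments per user."""
    per_user = Counter(user for user, _segment in rows)
    totals = list(per_user.values())
    if not totals:
        return 0.0, 0.0
    return mean(totals), pstdev(totals)
```

On a real cluster these would run as Hive queries over the Cassandra-backed table; the sketch only shows the aggregation logic.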
