Big data and noSQL in real time


Explains the challenge of running real-time analytics in big data and NoSQL applications, using Facebook and Twitter as examples.


Big Data and NoSQL in REAL TIME
Facebook and Twitter Examples
Ron Zavner

Agenda
• Our real-time world…
• Flavors of Big Data
• Facebook messaging and real-time analytics system
• Twitter analytics system
• Winning architecture?
® Copyright 2011 Gigaspaces Ltd. All Rights Reserved

What is Real Time?

We’re Living in a Real Time World…
Homeland Security
Real Time Search
Social
eCommerce
User Tracking & Engagement
Financial Services

Big Data Predictions
“Over the next few years we'll see the adoption of scalable frameworks and platforms for handling streaming, or near real-time, analysis and processing. In the same way that Hadoop has been borne out of large-scale web applications, these platforms will be driven by the needs of large-scale location-aware mobile, social and sensor use.”
Edd Dumbill, O’Reilly

The Two Vs of Big Data
• Velocity
• Volume

The Flavors of Big Data Analytics
• Counting
• Correlating
• Research

Analytics – Counting
• How many signups, tweets, retweets for a topic?
• What’s the average latency?
• Demographics
  • Countries and cities
  • Gender
  • Age groups
  • Device types
  • …

Analytics – Correlating
• What devices fail at the same time?
• What features get users hooked?
• What places on the globe are “happening”?

Analytics – Research
• Sentiment analysis
  • “Obama is popular”
• Trends
  • “People like to tweet after watching American Idol”
• Spam patterns
  • How can you tell when a user spams?

It’s All about Timing
• Event driven / stream processing
• High resolution – every tweet gets counted
• Ad-hoc querying
• Medium resolution
• Long running batch jobs (ETL, map/reduce)
• Low resolution (trends & patterns)
This is what we’re here to discuss :)

FACEBOOK REAL-TIME ANALYTICS SYSTEM

Store 135+ Billion Messages A Month

The actual analytics..
• Like button analytics
• Comments box analytics

Goals
• Show why plugins are valuable
• Make the data more actionable
• Make the data more timely
• Remove points of failure
• Handle massive load: 200K events per second

Technology Evaluation
• MySQL DB Counters
• In-Memory Counters
• MapReduce
• Cassandra
• HBase

The solution..
Data flow (from the slide diagram): Facebook logs → Scribe → PTail → Puma → HBase
• Real time: batches of about 1.5 sec
• HBase: 10,000 writes/sec per server
• Long term / batch: HDFS

Keep Things In Memory
• Facebook keeps 80% of its data in memory (Stanford research)
• RAM is 100-1000x faster than disk (random seek)
  • Disk: 5-10 ms
  • RAM: ~0.001 ms

TWITTER REAL-TIME ANALYTICS SYSTEM

Twitter Reach – Here’s One Use Case

Let’s start with some statistics…

Twitter in Numbers (March 2011)
Source: http://blog.twitter.com/2011/03/numbers.html
• It takes a week for users to send 1 billion tweets.
• On average, 140 million tweets get sent every day.
• The highest throughput to date is 6,939 tweets/sec.
• 460,000 new accounts are created daily.

Twitter in Numbers
Source: http://www.sysomos.com/insidetwitter/
• 5% of the users generate 75% of the content.

Challenge – Word Count
Diagram labels: Tweets, Count, Word:Count
• Hottest topics
• URL mentions
• etc.

Word Count - Analyze the Problem
• (Tens of) thousands of tweets per second to process
• Assumption: need to process in near real time
• Aggregate counters for each word
  • A few tens of thousands of words (or hundreds of thousands if we include URLs)
• System needs to scale linearly
• System needs to be fault tolerant

Use EDA (Event Driven Architecture)
Raw → Tokenizer → Tokenized → Filterer → Filtered → Counter

Sharding (Partitioning)
Tokenizer 1    Filterer 1    Counter Updater 1
Tokenizer 2    Filterer 2    Counter Updater 2
Tokenizer 3    Filterer 3    Counter Updater 3
Tokenizer n    Filterer n    Counter Updater n
Computing Reach with Event Streams

Twitter Storm

Storm Overview

Storm Cluster

Streaming word count with Storm

Storm Limitation
• Storage
• Data persistency
• Querying
Spouts, Bolts, Topologies

Winner is… Storm & in-memory data grids
• Event driven / flow
• Reliable
• Storage
• Data persistency
• Querying

References
• Facebook messages: http://highscalability.com/blog/2010/11/16/facebooks-new-real-time-messaging-system-hbase-to-store-135.html
• Facebook real-time analytics: http://highscalability.com/blog/2011/3/22/facebooks-new-realtime-analytics-system-hbase-to-process-20.html
• Learn and fork the code on GitHub: https://github.com/Gigaspaces/rt-analytics
• Detailed blog post: http://bit.ly/gs-bigdata-analytics
• Twitter in numbers: http://blog.twitter.com/2011/03/numbers.html
• Twitter Storm: http://bit.ly/twitter-storm

Q&A
RonZ@gigaspaces.com


Editor's notes

Real time is ideally less than a second, not 30 seconds, not 5 seconds
We live almost every aspect of our lives in a real-time world. Think about our social communications: we update our friends online via social networks and micro-blogging, we text from our mobiles, or message from our laptops. But it's not just our social lives; we shop online whenever we want, we search the web for immediate answers to our questions, we trade stocks online, we pay our bills, and do our banking. All online and all in real time.
Real time doesn't just affect our personal lives. Enterprises and government agencies need real-time insights to be successful, whether they are investment firms that need fast access to market views and risk analysis, or retailers that need to adjust their online campaigns and recommendations. Even homeland security has come to increasingly rely on real-time monitoring.
The amount of data that flows in these systems is huge. Major social networking platforms like Facebook and Twitter have developed their own architectures for handling the need for real-time analytics on huge amounts of data. However, not every company has the need or resources to build their own Twitter-like solution.
Big data is definitely expected to grow and expand. The amount of data is growing, and so is the demand for it. Real-time analytics is becoming a must.
The Two Vs of Big Data are velocity and volume. As said before, the volume of data we need to handle is huge, and at the same time we need to handle it fast. We are required to make very complex calculations in real time, and we need to perform them over a very large amount of data. The data is usually distributed across many servers; each server performs its own calculation and the results are then aggregated (map/reduce). This is a very common pattern for real-time analytics. Having said that, sometimes the latency requirement is more challenging and we need to reduce the time these calculations take. You can't go straight to a relational DB, which is not designed to handle the speed and volumes we're talking about; that's why we look at NoSQL or a cache. NoSQL can go further: I don't have the constraints of a relational DB and I can store the data as it is (in JSON, the format used by Twitter). Still, processing the sheer amount of data in the timeframes we need is incredibly challenging.
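The distribute-then-aggregate pattern mentioned in this note can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three "servers" are plain lists, the device-type data is made up, and a thread pool stands in for real machines.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_locally(shard):
    """Map step: each server counts only the records it holds."""
    return Counter(shard)

def aggregate(partials):
    """Reduce step: merge the per-server partial counts."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Hypothetical data: device types seen by three servers.
shards = [
    ["mobile", "desktop", "mobile"],
    ["desktop", "tablet"],
    ["mobile"],
]

# Each shard is counted independently, then the results are merged.
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(count_locally, shards))

totals = aggregate(partials)
print(totals["mobile"])  # 3
```

The merge step only ever sees small partial results, which is why the pattern copes with volume; the latency concern raised in the note comes from waiting for the slowest shard.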
I think analytics, when we're talking about Big Data and something like Twitter, can be split into three categories, or buckets.
The first bucket is “Counting”: how many signups, tweets or retweets are there for a topic? I might also be interested in counting in relation to demographic information; for example, how many people are tweeting right now at this event, and on what types of devices?
The “Correlating” bucket might contain questions like how many Twitter users are using desktop vs. mobile, and what's the trend within the week or within the month?
Our third bucket, “Research”, is similar to the second but looks in more depth at the past; here we require a lot of processing of historic data.
Counting calculations: we expect to see results in real time. The challenge is reliability. It is not that we lose money, but if we lose events the accuracy of the system is damaged and the report loses its value. Counting requires a very high resolution: every tweet counts, and we don't know which one will be important.
Correlating: we expect to see most results in real time as well. These are the interactive queries where we expect a result that can be laid out in a browser or a BI tool.
Research calculations are historical, and Hadoop (for example) is a very popular framework for doing batch analytics. We don't expect a real-time response here, but you never know what's next :)
It's all about timing. We expect to see real-time results for lots of our calculations. We also need to make sure that our architecture allows us to scale: today we might need to work with 100K TPS, and that can easily grow to 200K TPS. We need to be highly available as well, with zero downtime. For these we can use event-driven and stream-processing architectures. Correlation and research calculations are very interesting topics, and for them we can accept longer response times; here, however, we are going to examine the real-time challenge.
We are going to talk about Facebook's real-time analytics system, and also about how they chose to store 135+ billion messages a month.
http://highscalability.com/blog/2010/11/16/facebooks-new-real-time-messaging-system-hbase-to-store-135.html
You may have read somewhere that Facebook has introduced a new Social Inbox integrating email, IM, SMS, text messages, on-site Facebook messages. All-in-all they need to store over 135 billion messages a month. Where do they store all that stuff? One of the posts gave the surprise answer - HBase beat out MySQL, Cassandra, and a few others.
Why a surprise? Facebook created Cassandra and it was purpose built for an inbox type application, but they found Cassandra's eventual consistency model wasn't a good match for their new real-time Messages product. Facebook also has an extensive MySQL infrastructure, but they found performance suffered as data set and indexes grew larger. And they could have built their own, but they chose HBase.
HBase is a scaleout table store supporting very high rates of row-level updates over massive amounts of data. Exactly what is needed for a Messaging system. HBase is also a column based key-value store built on the BigTable model. It's good at fetching rows by key or scanning ranges of rows and filtering. Also what is needed for a Messaging system. Complex queries are not supported however. Queries are generally given over to an analytics tool like Hive, which Facebook created to make sense of their multi-petabyte data warehouse, and Hive is based on Hadoop's file system, HDFS, which is also used by HBase.
Over the past year, social plugins have become an important and growing source of traffic for millions of websites. Today we're releasing a new version of Insights for Websites to give you better analytics on how people interact with your content and to help you optimize your website in real time.
Like button analytics: For the first time, you can now access real-time analytics to optimize Like buttons across both your site and on Facebook. We use anonymized data to show you the number of times people saw Like buttons, clicked Like buttons, saw Like stories on Facebook, and clicked Like stories to visit your website.
Plugins are valuable
• Social plugins have become an important and growing source of traffic for millions of websites over the past year. We released a new version of Insights for Websites last week to give site owners better analytics on how people interact with their content and to help them optimize their websites in real time. To accomplish this, we had to engineer a system that could process over 20 billion events per day (200,000 events per second) with a lag of less than 30 seconds.
Data actionable
• Help users take action to make their content more valuable.
• How many people see a plugin, how many people take action on it, and how many are converted to traffic back on your site.
Make the data more timely
• Went from a 48-hour turnaround to 30 seconds.
• Multiple points of failure were removed to make this goal.
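As a quick sanity check on the figures quoted in this note, the daily volume can be converted to an average per-second rate. The sketch assumes exactly 20 billion events spread evenly over a day, which real traffic would not be.

```python
# Figures quoted in the note; an even spread over the day is an assumption.
EVENTS_PER_DAY = 20_000_000_000
SECONDS_PER_DAY = 24 * 60 * 60

average_rate = EVENTS_PER_DAY / SECONDS_PER_DAY
print(round(average_rate))  # 231481
```

So "200,000 events per second" is a rounded average; peaks would presumably sit above it.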
http://highscalability.com/blog/2011/3/22/facebooks-new-realtime-analytics-system-hbase-to-process-20.html
MySQL DB Counters
• Have a row with a key and a counter.
• Results in lots of database activity.
• Stats are kept at a day bucket granularity. Every day at midnight the stats would roll over. When the rollover period is reached this resulted in a lot of writes to the database, which caused a lot of lock contention.
• Tried to spread the work by taking into account time zones. Tried to shard things differently.
• The high write rate led to lock contention, it was easy to overload the databases, had to constantly monitor the databases, and had to rethink their sharding strategy.
• Solution not well tailored to the problem.
In-Memory Counters
• If you are worried about bottlenecks in IO then throw it all in-memory.
• No scale issues. Counters are stored in memory so writes are fast and the counters are easy to shard.
• Felt in-memory counters, for reasons not explained, weren't as accurate as other approaches. Even a 1% failure rate would be unacceptable. Analytics drive money so the counters have to be highly accurate.
• They didn't implement this system. It was a thought experiment and the accuracy issue caused them to move on.
MapReduce
• Used Hadoop/Hive for previous solution.
• Flexible. Easy to get running. Can handle IO, both massive writes and reads. Don't have to know how they will query ahead of time. The data can be stored and then queried.
• Not realtime. Many dependencies. Lots of points of failure. Complicated system. Not dependable enough to hit realtime goals.
Cassandra
• HBase seemed a better solution based on availability and the write rate.
• Write rate was the huge bottleneck being solved.
http://highscalability.com/blog/2011/3/22/facebooks-new-realtime-analytics-system-hbase-to-process-20.html
The Winner: HBase + Scribe + Ptail + Puma
At a high level:
• HBase stores data across distributed machines.
• Use a tailing architecture: new events are stored in log files, and the logs are tailed.
• A system rolls the events up and writes them into storage.
• A UI pulls the data out and displays it to users.
Data flow
• User clicks Like on a web page.
• Fires AJAX request to Facebook.
• Request is written to a log file using Scribe. Scribe handles issues like file rollover.
• Scribe is built on the same HDFS file store Hadoop is built on.
• Write extremely lean log lines. The more compact the log lines the more can be stored in memory.
Ptail
• Data is read from the log files using Ptail. Ptail is an internal tool built to aggregate data from multiple Scribe stores. It tails the log files and pulls data out.
• Ptail data is separated out into three streams so they can eventually be sent to their own clusters in different datacenters: plugin impressions, news feed impressions, and actions (plugin + news feed).
Puma
• Batch data to lessen the impact of hot keys. Even though HBase can handle a lot of writes per second they still want to batch data. A hot article will generate a lot of impressions and news feed impressions, which will cause huge data skews, which will cause IO issues. The more batching the better.
• Batch for 1.5 seconds on average. Would like to batch longer but they have so many URLs that they run out of memory when creating a hashtable.
• Wait for the last flush to complete before starting a new batch to avoid lock contention issues.
UI renders data
• Frontends are all written in PHP.
• The backend is written in Java and Thrift is used as the messaging format so PHP programs can query Java services.
• Caching solutions are used to make the web pages display more quickly.
• Performance varies by the statistic. A counter can come back quickly. Finding the top URL in a domain can take longer. Ranges from 0.5 to a few seconds.
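The batching idea described for Puma above (collect increments in memory, then write each key once per batch) can be illustrated with a small Python sketch. It is not Puma's code: the class name, the dictionary standing in for HBase, and the explicit `now` timestamps are all assumptions made so the example is deterministic.

```python
from collections import Counter

class BatchingCounter:
    """Accumulates increments in memory and writes them out in batches."""

    def __init__(self, store, flush_interval=1.5):
        self.store = store                    # stand-in for the backing table
        self.flush_interval = flush_interval  # seconds between flushes
        self.pending = Counter()
        self.last_flush = 0.0
        self.writes = 0                       # store writes issued so far

    def increment(self, key, now):
        self.pending[key] += 1
        if now - self.last_flush >= self.flush_interval:
            self.flush(now)

    def flush(self, now):
        # One write per distinct key, however many events hit that key.
        for key, delta in self.pending.items():
            self.store[key] = self.store.get(key, 0) + delta
            self.writes += 1
        self.pending.clear()
        self.last_flush = now

store = {}
counter = BatchingCounter(store)
for i in range(1000):                        # a "hot" URL: 1,000 events in one second
    counter.increment("hot-url", now=i * 0.001)
counter.increment("quiet-url", now=1.6)      # interval elapsed, triggers the flush

print(store["hot-url"], store["quiet-url"], counter.writes)  # 1000 1 2
```

A thousand events on the hot key cost a single write, which is the effect the note attributes to batching; the memory held in `pending` grows with the number of distinct keys, matching the out-of-memory limit mentioned above.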
• The more and longer data is cached the less realtime it is.
• Set different caching TTLs in memcache.
MapReduce
• The data is then sent to MapReduce servers so it can be queried via Hive.
• This also serves as a backup plan as the data can be recovered from Hive.
• Raw logs are removed after a period of time.
HBase
• HBase is a distributed column store and a database interface to Hadoop. Facebook has people working internally on HBase.
• Unlike a relational database you don't create mappings between tables.
• You don't create indexes. The only index you have is a primary row key.
• From the row key you can have millions of sparse columns of storage. It's very flexible. You don't have to specify the schema. You define column families to which you can add keys at any time.
• A key feature for scalability and reliability is the WAL (write-ahead log), which is a log of the operations that are supposed to occur.
  • Based on the key, data is sharded to a region server.
  • Written to the WAL first.
  • Data is put into memory. At some point in time, or if enough data has been accumulated, the data is flushed to disk.
  • If the machine goes down you can recreate the data from the WAL. So there's no permanent data loss.
  • Using a combination of the log and in-memory storage they can handle an extremely high rate of IO reliably.
• HBase handles failure detection and automatically routes across failures.
• Currently HBase resharding is done manually.
  • Automatic hot spot detection and resharding is on the roadmap for HBase, but it's not there yet.
  • Every Tuesday someone looks at the keys and decides what changes to make in the sharding plan.
Schema
• Store on a per URL basis a bunch of counters.
• A row key, which is the only lookup key, is the MD5 hash of the reverse domain.
  • Selecting the proper key structure helps with scanning and sharding.
  • A problem they have is sharding data properly onto different machines. Using an MD5 hash makes it easier to say this range goes here and that range goes there.
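The write-ahead-log sequence described in these notes (log first, then memory, flush to disk later, replay the log after a crash) can be sketched as a toy store. This is not HBase's implementation; `TinyStore`, `recover`, and the lists and dictionaries standing in for files are hypothetical names chosen for illustration.

```python
import json

class TinyStore:
    """Toy key-value store: log first, then memory, flush to 'disk' in bulk."""

    def __init__(self, flush_threshold=3):
        self.wal = []        # stand-in for a durable, append-only log file
        self.memstore = {}   # recent writes held in memory
        self.disk = {}       # stand-in for flushed, on-disk data
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.wal.append(json.dumps([key, value]))  # 1. record the operation
        self.memstore[key] = value                 # 2. apply it in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        self.disk.update(self.memstore)
        self.memstore.clear()
        self.wal.clear()     # flushed entries no longer need replaying

    def get(self, key):
        if key in self.memstore:
            return self.memstore[key]
        return self.disk.get(key)

def recover(wal, disk):
    """Rebuild a store after a crash from what survived: the log and the disk."""
    store = TinyStore()
    store.disk = dict(disk)
    for entry in wal:
        key, value = json.loads(entry)
        store.memstore[key] = value
    store.wal = list(wal)
    return store

store = TinyStore()
store.put("url-1", 10)
store.put("url-2", 20)   # both writes are still only in memory and in the log

recovered = recover(store.wal, store.disk)  # simulate losing the memstore
print(recovered.get("url-1"), recovered.get("url-2"))  # 10 20
```

Appending to a log is sequential and cheap, which is why this arrangement can absorb a high write rate while still surviving the loss of whatever was only in memory.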
• For URLs they do something similar, plus they add an ID on top of that. Every URL in Facebook is represented by a unique ID, which is used to help with sharding.
• A reverse domain, com.facebook/ for example, is used so that the data is clustered together. HBase is really good at scanning clustered data, so if they store the data so it's clustered together they can efficiently calculate stats across domains.
• Think of every row as a URL and every cell as a counter; you are able to set different TTLs (time to live) for each cell. So if keeping an hourly count there's no reason to keep that around for every URL forever, so they set a TTL of two weeks. Typically TTLs are set on a per column family basis.
• Per server they can handle 10,000 writes per second.
• Checkpointing is used to prevent data loss when reading data from log files.
  • Tailers save log stream checkpoints in HBase.
  • Replayed on startup so they won't lose data.
• Useful for detecting click fraud, but it doesn't have fraud detection built in.
Tailer hot spots
• In a distributed system there's a chance one part of the system can be hotter than another.
• One example is region servers that can be hot because more keys are being directed that way.
• One tailer can lag behind another too.
• If one tailer is an hour behind and the others are up to date, what numbers do you display in the UI?
• For example, impressions have a way higher volume than actions, so CTR rates were way higher in the last hour.
• Solution is to figure out the least up-to-date tailer and use that when querying metrics.
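Going back to the schema notes, the row key built from a reversed domain and an MD5 hash can be sketched as follows. The exact key layout Facebook used is not given in these notes, so the function names and the choice to hash only the reversed host name are assumptions for illustration.

```python
import hashlib
from urllib.parse import urlsplit

def reverse_domain(url):
    """Turn www.facebook.com into com.facebook.www."""
    host = urlsplit(url).hostname or ""
    return ".".join(reversed(host.split(".")))

def row_key(url):
    """Hypothetical row key: MD5 hex digest of the reversed domain."""
    return hashlib.md5(reverse_domain(url).encode("utf-8")).hexdigest()

print(reverse_domain("http://www.facebook.com/page"))  # com.facebook.www
```

Two URLs on the same host produce the same key under this scheme, and the fixed-length hex digest spreads keys evenly over the key space, which is what makes range-based shard assignment simple.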
In Twitter, the primary relationship between entities is many-to-many. Every post is sent to numerous followers of the user who sent the post; at the same time, each user can follow many other users. This causes Twitter to behave like a living organism, growing unexpectedly in many different directions.
Let me give you an example. One analytic where I need to process tweets is Twitter Reach: how many unique Twitter accounts received tweets about my topic. So, how do I compute my reach? There are several stages in the processing:
1. First, I need to record every tweet.
2. Then I can count how many followers got that tweet.
3. Then I need to compute the distinct reach, meaning that for each follower I need to look at each of their followers and remove the duplicates.
Try to imagine what it takes to produce that number. If my tweet is retweeted by 100 users, each of whom has 100 followers, it starts to take a fair bit of number crunching.
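The three stages above amount to taking the union of follower sets, so that an account following several tweeters is counted once. A minimal sketch with a made-up follower graph:

```python
def reach(tweeters, followers_of):
    """Number of distinct accounts that received a tweet from any tweeter."""
    reached = set()
    for user in tweeters:
        reached |= followers_of.get(user, set())  # set union removes duplicates
    return len(reached)

# Hypothetical follower graph: "carol" follows both tweeters.
followers_of = {
    "alice": {"bob", "carol", "dave"},
    "bob": {"carol", "erin"},
}

print(reach(["alice", "bob"], followers_of))  # 4
```

Simply adding follower counts would give 5 here; the deduplication step is what makes reach expensive at Twitter's scale, because the follower sets have to be fetched and merged rather than just counted.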
Read mostly – duplicate the data so you can optimize the read.
Let's analyze the problems that a simple Twitter word count presents. The challenge here seems straightforward: tens of thousands of tweets need to be stored and parsed every second, and word counters need to be aggregated continuously. Even though tweets are limited to 140 characters, we are dealing with hundreds of thousands of words per second. This is big.
In many ways this is the benchmark for other systems because it does stretch the limits:
• There is a huge amount of activity to analyze; the scale is enormous.
• We want to grab a lot of information out of it, and this is the challenge: how do we grab the stream in real time without affecting latency?
• How do we deal with that stream in real time?
• How do we handle the write scalability in real time?
• How do we make the system bullet-proof and easily scalable?
• How do we begin to do analytics on this?
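One way to picture the tokenizer, filterer and counter stages from the slides, together with sharding by word, is the sketch below. It is not the code from the rt-analytics repository; the stop-word list, the partition count and the use of CRC32 as the routing hash are assumptions.

```python
import zlib
from collections import Counter

STOP_WORDS = {"the", "a", "is", "to"}  # hypothetical filter list
NUM_PARTITIONS = 4                     # hypothetical number of counter updaters

def tokenize(tweet):
    """Tokenizer stage: raw tweet -> lower-cased words."""
    return tweet.lower().split()

def keep(word):
    """Filterer stage: drop words that are not worth counting."""
    return word not in STOP_WORDS

def partition_for(word):
    # Stable hash so the same word always lands on the same counter.
    return zlib.crc32(word.encode("utf-8")) % NUM_PARTITIONS

counters = [Counter() for _ in range(NUM_PARTITIONS)]

def process(tweet):
    for word in tokenize(tweet):
        if keep(word):
            counters[partition_for(word)][word] += 1

for tweet in ["Storm is fast", "storm to the rescue"]:
    process(tweet)

total = sum(counters, Counter())  # merge partitions for a global view
print(total["storm"])  # 2
```

Because a given word is always routed to the same partition, each counter can be updated without coordinating with the others, and adding partitions is how the design is meant to scale.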
Storm is a real-time, open source data streaming framework that functions entirely in memory. Storm is designed to be run on several machines to provide parallelism. Real-time processing is becoming very popular, and Storm is a popular open source framework and runtime used by Twitter for processing real-time data streams. Storm addresses the complexity of running real-time streams through a compute cluster by providing an elegant set of abstractions that make it easier to reason about your problem domain by letting you focus on data flows rather than on implementation details.
Storm constructs a processing graph that feeds data from an input source through processing nodes. The processing graph is called a "topology". The input data sources are called "spouts", and the processing nodes are called "bolts". The data model consists of tuples. Tuples flow from spouts to bolts, which execute user code. Besides simply being locations where data is transformed or accumulated, bolts can also join streams and branch streams. Storm topologies are deployed in a manner somewhat similar to a webapp: a jar file is presented to a deployer, which distributes it around the cluster, where it is loaded and executed. A topology runs until it is killed.
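The spout, bolt and tuple vocabulary can be mimicked with plain Python generators. This sketch does not use Storm's API and runs in a single process; the function names and sample sentences are made up, and it only shows how tuples flow from a source through chained processing steps.

```python
from collections import Counter

def sentence_spout():
    """Spout: the source of tuples."""
    for sentence in ["the cow jumped", "the moon"]:
        yield (sentence,)

def split_bolt(stream):
    """Bolt: turns each sentence tuple into one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)

def count_bolt(stream):
    """Bolt: keeps a running count and emits (word, count) tuples."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])

# The "topology" is just the wiring of spout to bolts.
topology = count_bolt(split_bolt(sentence_spout()))
results = list(topology)
print(results[-2])  # ('the', 2)
```

In a real cluster each stage would run as many parallel tasks on different machines and the stream would be unbounded; the generator chain only captures the shape of the data flow.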
• zookeeper: Storm uses ZooKeeper to communicate between the "Nimbus" (master) and the "Supervisors" (workers), as well as to store its current state. ZooKeeper coordinates activity in the cluster, and provides operational state storage.
• storm-nimbus: The topology execution coordinator for the cluster. The Nimbus is a singleton in the cluster (i.e. not elastic). It is stateless, however (due to storing state in ZooKeeper), and therefore can fail and be restarted without consequence, even to running jobs.
• storm-supervisor: The supervisors actually run the topology code. There can/should be many of these (i.e. elastic). The parallelism attributes of a given topology are specified in the topology itself.
Data grids are more event driven, while Storm is used for flow/streaming. Storm has more capabilities. Storm is very specifically directed at the streaming problem, and is optimized for that use case. In order to produce extremely high throughput, it pushes responsibility for reliability outside of its own framework. Also because of its streaming focus, it provides higher level abstractions that make reasoning about streaming easier than in XAP.
Reliable: XAP's architecture is oriented to making data in memory nearly as reliable as that on disk. Thus, writing into XAP involves some level of serialization and perhaps a network hop as well. Storm doesn't aspire to this level of reliability; instead it provides the means for the suppliers and consumers of data to provide it. Storm is "optimistic" in roughly the same sense that an optimistic lock in a database is optimistic: it assumes success is far more likely than failure, and so is willing to take big hits to performance when failures occur because they are so rare. XAP is more pessimistic in this sense. XAP is designed to be a source of truth for the data it holds, and goes to great lengths to achieve it.
For the reasons cited above, there is no way, even in principle, for XAP to have a throughput comparable to Storm's, at least when there is no persistence. This caveat is critical, however, since real-world systems almost always need persistence, and ultra-fast in-memory persistence is one of XAP's main strengths. I also mentioned that Storm has higher level abstractions for streaming, which make programming it more straightforward for streaming applications. Whereas in XAP you could implement streaming as a series of event-driven processing stages, there is no concept of a "stream" or any kind of "flow" at the API level.
Storm with XAP: Basically, spouts provide the source of tuples for Storm processing.
For spouts to be maximally performant and reliable, they need to provide tuples in batches, and be able to replay failed batches when necessary. Of course, in order to have batches, you need storage, and to be able to replay batches, you need reliable storage. XAP is about the highest performing, reliable source of data out there, so a spout that serves tuples from XAP is a natural combination. Recall that Storm is a stream processing framework and runtime, and this presupposes the existence of a stream for it to read from. So there are really two artifacts needed for XAP to provide a spout to Storm: a "stream" in XAP, and of course the spout that reads from it. Realizing this, I wrote a simple service for XAP that leverages XAP's FIFO capabilities called XAPStream. It is a standalone (Storm independent) service that lets clients dynamically create, destroy, and of course read and write from streams in both batch and non-batch modes.
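The requirement described here, a FIFO source that hands out batches and can replay a batch that failed, can be sketched generically. This is not XAPStream's API, which these notes do not show; the class and method names are hypothetical and an in-memory deque stands in for reliable storage.

```python
from collections import deque

class ReplayableStream:
    """FIFO source that serves batches and replays the ones that fail."""

    def __init__(self, items):
        self.queue = deque(items)  # backlog, oldest item first
        self.in_flight = {}        # batch_id -> items awaiting ack or fail
        self.next_id = 0

    def read_batch(self, size):
        count = min(size, len(self.queue))
        batch = [self.queue.popleft() for _ in range(count)]
        batch_id = self.next_id
        self.next_id += 1
        self.in_flight[batch_id] = batch
        return batch_id, batch

    def ack(self, batch_id):
        """Processing succeeded: the batch can be forgotten."""
        self.in_flight.pop(batch_id)

    def fail(self, batch_id):
        """Processing failed: put the batch back at the front, in order."""
        self.queue.extendleft(reversed(self.in_flight.pop(batch_id)))

stream = ReplayableStream([1, 2, 3, 4, 5])
first_id, first = stream.read_batch(2)
stream.fail(first_id)                    # pretend downstream processing failed
retry_id, retry = stream.read_batch(2)   # the same items come back
stream.ack(retry_id)
_, following = stream.read_batch(2)
print(first, retry, following)  # [1, 2] [1, 2] [3, 4]
```

The in-flight map is the part that needs durable storage in practice: if it is lost along with the process, unacknowledged batches cannot be replayed, which is the gap the note suggests a data grid can fill.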
