Description:
HBase is a NoSQL column store. What does that mean functionally to a software developer?
-A conceptual view of HBase
-How to use HBase
-What features HBase has
-Benefits of HBase
How are we using HBase here at MyLife? I will describe three MyLife projects that currently use HBase in production and that I was or am involved with.
-Email content storage
-Connection-Identity mappings
-User stream cache backing
Each of these projects uses HBase in a different way.
3. HBase: In brief
I could talk about…
ZooKeeper quorums
Source: aazk.org
4. HBase: In brief
I could talk about…
Compaction
Source: www.wasteprousa.com
5. HBase: In brief
I could talk about…
How HBase is Implemented
HDFS
Blocks
Regions
META table
Etc…
6. HBase: In brief
I could talk about…
HBase VS
Cassandra
Redis
MySQL
Etc…
7. HBase: In brief
However, none of those is my
primary view as a developer.
As a developer, I want to talk about
what HBase can do for me: how it
can make MyLife (pun intended)
easier.
9. HBase: In brief
“I choose a lazy person to do a hard
job. Because a lazy person will find
an easy way to do it.”
–Bill Gates
10. HBase: In brief
So what does HBase do for me, the
developer?
TL;DR
IT STORES DATA!
16. A Data Structures Interlude
Key == Last Name, First Name,
Middle Initial
Value == Extension
E.g.
Example,Dude,X x555
17. A Data Structures Interlude
So now that we know what a map is,
what would a map of maps look
like? An HBase-like analogy.
18. A Data Structures Interlude
An analogy to HBase (a dated one; if someone can
think of a current one, please let me
know) is a library's index file organized by
ISBN. You look up a book by its ISBN. The
ISBN is your key. The value in this case is a
book that contains a list of books!
Key == ISBN
Value == Book that lists other books!
0786704810 Author, Title, Publisher, Year
21. HBase: In brief
Some quick facts:
Column families are defined ahead of time and
require the table to be disabled before they can be altered.
Only column families are fixed. Everything
under that level of maps is flexible.
Qualifiers can be added or removed on the fly,
along with their versions.
“The Map” itself is also defined ahead of time.
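To make the "map of maps" shape concrete, here is a minimal sketch in plain Python. It does not use the HBase client; the class name `ToyTable` and its methods are made up for illustration, and it ignores sorting, persistence and distribution. It only shows the nesting (row key, column family, qualifier, version) and the rule that column families are fixed while qualifiers are not.

```python
import time


class ToyTable:
    """Toy model: row key -> column family -> qualifier -> {timestamp: value}."""

    def __init__(self, families):
        # Column families are fixed when the table is defined.
        self.families = frozenset(families)
        self.rows = {}

    def put(self, row_key, family, qualifier, value, timestamp=None):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        ts = timestamp if timestamp is not None else time.time_ns()
        # Qualifiers and versions are created on the fly.
        versions = (
            self.rows.setdefault(row_key, {})
            .setdefault(family, {})
            .setdefault(qualifier, {})
        )
        versions[ts] = value

    def get(self, row_key, family, qualifier):
        """Return the newest version of a cell, or None if it is absent."""
        versions = self.rows.get(row_key, {}).get(family, {}).get(qualifier)
        if not versions:
            return None
        return versions[max(versions)]


table = ToyTable(families=["profile"])
table.put("user1", "profile", "first_name", "Dude", timestamp=1)
table.put("user1", "profile", "first_name", "Duder", timestamp=2)
table.put("user1", "profile", "nickname", "El Duderino", timestamp=1)
```

Reading `first_name` returns the newest version, a qualifier that was never written simply is not there, and a put to a family that was not declared up front is rejected.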
25. HBase: The Test Case
One of the services we provide to our users is a
message stream. This stream can include
email, in which case it works like an email client (e.g.
Outlook, Mail.app, or the one on your phone), storing
your email messages so you can get them
quickly.
We found ourselves storing hundreds of gigabytes
of email content in our Oracle RAC database.
26. HBase: The Test Case
Since this data is only accessed by key, it made
sense to move it out of Oracle and into HBase.
28. HBase: The Test Case
Key ==
accountId_providerAccountId_messageId_bodyId
This is a nice key because all the messages for a
particular user are grouped together by prefix.
Since HBase keeps the keys sorted, we can use
a Scan to grab them all quickly in one pass.
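Why sorted keys make the prefix trick work can be sketched in plain Python. This is not the HBase client: `SortedKeyStore`, `scan_prefix` and `make_key` are hypothetical names, and the ids are invented sample values. The idea is just that in a sorted key list, every key sharing a prefix sits in one contiguous run.

```python
from bisect import bisect_left, insort


class SortedKeyStore:
    """Toy store that keeps row keys sorted, as HBase does."""

    def __init__(self):
        self._keys = []
        self._values = {}

    def put(self, key, value):
        if key not in self._values:
            insort(self._keys, key)
        self._values[key] = value

    def scan_prefix(self, prefix):
        """Yield (key, value) pairs whose key starts with prefix, in key order."""
        start = bisect_left(self._keys, prefix)
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break  # past the contiguous run of matching keys
            yield key, self._values[key]


def make_key(account_id, provider_account_id, message_id, body_id):
    return f"{account_id}_{provider_account_id}_{message_id}_{body_id}"


store = SortedKeyStore()
store.put(make_key("1001", "7", "m1", "b1"), "Hello")
store.put(make_key("1002", "3", "m9", "b1"), "Other user")
store.put(make_key("1001", "7", "m2", "b1"), "World")

user_messages = list(store.scan_prefix("1001_"))
```

Note the trailing underscore in the prefix: scanning for `"1001"` alone would also match an account such as `10010`.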
30. HBase: The Test Case
Advantages vs. the previous solution:
Faster
Cheaper
Less DB load
31. HBase: The ideal case
Another service we offer our users is the ability
to import their social and email connections so
they can have one unified view of all their
connections across providers. This allows users to
manage data by person rather than by
account.
32. HBase: The ideal case
This has two main pieces of data:
1. The social profile information
2. The relationship between that profile and an
Identity
33. HBase: The ideal case
What makes this ideal for HBase?
1. The profile is sparse data that is only
accessed by key!
34. HBase: The ideal case
What makes this ideal for HBase?
2. The relationship between a profile and its
identity is only a key-value pair and its reverse!
35. A Data Structures Interlude
Key == Last Name, First Name,
Middle Initial
Value == Extension
E.g.
Example,Dude,X x555
36. A Data Structures Interlude
Key == Extension
Value == Last Name, First Name,
Middle Initial
E.g.
x555 Example,Dude,X
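In code, the directory and its reverse are just two maps holding the same pairs with key and value swapped. A tiny Python sketch using the example entry from the slides (this assumes each extension belongs to exactly one person; otherwise the reverse map would need to hold a list):

```python
# Forward map: name -> extension
directory = {("Example", "Dude", "X"): "x555"}

# Reverse map: extension -> name, built from the same pairs
reverse_directory = {extension: name for name, extension in directory.items()}

caller = reverse_directory["x555"]
```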
37. HBase: The ideal case
Dataflow
1. Get profile from provider
2. Check if the profile maps to an existing Identity
in HBase
   1. If it doesn’t exist, store a version of the profile in
   HBase with providerId as key and profile
   information as values
3. Associate profile with identity
   1. Create row in HBase with identityId_providerId as
   key
4. Update profile with the identity it is associated
with
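The four steps can be sketched with two plain Python dicts standing in for the HBase tables. Everything here is illustrative: the function name, the `identityId` field, and the `resolve_identity` callback (how a profile gets matched to a person is not covered in this talk) are assumptions, and versioning of the profile is left out.

```python
profiles = {}      # providerId -> profile fields (stand-in for the profile rows)
associations = {}  # "identityId_providerId" -> marker (stand-in for the mapping rows)


def import_profile(provider_id, fields, resolve_identity):
    # Step 1: the profile has already been fetched from the provider (fields).
    # Step 2: check whether we already hold this profile; store it if not.
    record = profiles.get(provider_id)
    if record is None:
        record = dict(fields)
        profiles[provider_id] = record
    identity_id = record.get("identityId")
    if identity_id is None:
        # Hypothetical callback that decides which Identity this profile is.
        identity_id = resolve_identity(fields)
    # Step 3: associate the profile with the identity.
    associations[f"{identity_id}_{provider_id}"] = True
    # Step 4: record the identity on the profile (the reverse lookup).
    record["identityId"] = identity_id
    return identity_id


first = import_profile("li:42", {"first_name": "Dude"}, lambda f: "id-1")
again = import_profile("li:42", {"first_name": "Dude"}, lambda f: "id-2")
```

The second import finds the identity already stored on the profile, so the resolver is never consulted and the mapping stays stable.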
38. HBase: The ideal case
Coprocessors!
What are Coprocessors?
Another feature of HBase, which works like
triggers.
A coprocessor is a piece of logic attached to an
HBase put that is executed on the HBase
cluster.
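The trigger idea can be sketched in plain Python as a hook that runs after every put. Real coprocessors are Java classes that run on the HBase region servers; `ObservedTable`, `register_post_put` and `write_reverse_index` below are hypothetical names for illustration only, and the sketch assumes identity ids contain no underscore.

```python
class ObservedTable:
    """Toy table that runs registered hooks after every put."""

    def __init__(self):
        self.rows = {}
        self._post_put_hooks = []

    def register_post_put(self, hook):
        self._post_put_hooks.append(hook)

    def put(self, row_key, columns):
        self.rows.setdefault(row_key, {}).update(columns)
        for hook in self._post_put_hooks:
            hook(row_key, columns)


profiles = ObservedTable()      # providerId -> profile columns
associations = ObservedTable()  # identityId_providerId -> columns


def write_reverse_index(row_key, columns):
    # Runs as part of the put itself: record the identity on the profile row.
    identity_id, provider_id = row_key.split("_", 1)
    profiles.put(provider_id, {"identityId": identity_id})


associations.register_post_put(write_reverse_index)

profiles.put("li:42", {"first_name": "Dude"})
associations.put("id-1_li:42", {"linked": "1"})
```

The caller only writes the association row; the profile row picks up its identity without the caller doing anything, whichever application issued the put.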
40. HBase: The Awesome Case
Originally this system used local caching to store
user stream data, but as the streams grew this
became impractical.
The solution here was a distributed cache. Great!
41. HBase: The Awesome Case
A distributed cache allows us to scale, but unless we
have a huge grid some user streams will still get
evicted from the cache. That means when the
user visits again we have to fetch their streams
from the source, which is slow…
42. HBase: The Awesome Case
Enter HBase: from great to awesome!
To fix the latency associated with eviction we
added HBase as a backing store to our distributed
cache. This means that records in our cache are
periodically written to HBase and are written
to HBase before being evicted from the cache.
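A minimal sketch of the pattern in plain Python, with a dict standing in for HBase. `BackedCache` is a made-up class, and the least-recently-used eviction policy is an assumption; the talk does not say which policy the real cache uses. It shows the three behaviors described: write before eviction, periodic flush, and read-through on a miss.

```python
from collections import OrderedDict


class BackedCache:
    """Toy LRU cache that writes entries to a backing store before evicting them."""

    def __init__(self, capacity, backing_store):
        self.capacity = capacity
        self.backing_store = backing_store  # a dict standing in for HBase
        self._entries = OrderedDict()

    def put(self, key, value):
        self._entries[key] = value
        self._entries.move_to_end(key)
        while len(self._entries) > self.capacity:
            old_key, old_value = self._entries.popitem(last=False)
            # Write to the backing store before the entry leaves the cache.
            self.backing_store[old_key] = old_value

    def get(self, key):
        if key in self._entries:
            self._entries.move_to_end(key)
            return self._entries[key]
        # Read-through: a cache miss falls back to the backing store.
        if key in self.backing_store:
            value = self.backing_store[key]
            self.put(key, value)
            return value
        return None

    def flush(self):
        """Periodic write: copy every cached entry to the backing store."""
        self.backing_store.update(self._entries)


hbase = {}
cache = BackedCache(capacity=2, backing_store=hbase)
cache.put("alice", ["post1"])
cache.put("bob", ["post2"])
cache.put("carol", ["post3"])  # over capacity: "alice" is written out, then evicted
```

From the application's side there is only `get` and `put` on the cache; whether a stream came from memory or from the backing store is invisible.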
43. HBase: The Awesome Case
Distributed cache + HBase == Awesome!
Why?
Persistence – user streams now live in HBase for
as long as we want them to.
Speed – read-throughs from HBase are fast
Transparency – as far as the application is concerned
everything is just in the cache
44. HBase: The Awesome Case
Distributed cache + HBase == Awesome!
Why?
Reliability – HBase has been solid and all the data is
stored redundantly
Like an RDBMS or a file, it stores data, but in a different way.
Again from a functional POV
That’s it. Remember that. The rest of the terminology just tells you where you are in that nest of maps.
Before we get too far: since HBase stores data in maps, let’s take a brief step back and let me describe a map as quickly as I can, since it is fundamental to HBase.
A map is a way of storing data so it can be retrieved by a key. This is a concept most people are familiar with, like finding a co-worker’s extension in the company directory. Here we have a key (last name, first name, middle initial) which MAPS to the extension. BTW we are going to talk about keys A LOT!
To be precise, it would be a map of maps only if that other book also listed other books, but you get the idea. BTW that is a real example; you can look it up. Also, I think you all just passed CS 201.
SOOO… back to HBase.
This is HBase in a nutshell.
To quote from the HBase documentation: “All other map returning methods make use of this map internally.”
So they even say this is pretty much all there is to it functionally. But this structure allows for some very cool things.
I get some freebies here: quick lookups by key or key prefix (more on that later), flexibility, and versioning. These are things we have used and will be looking at in our implementations later.
From this map structure we get flexibility, since we can add or remove items from the Map, with one caveat: column families are fixed. “The Map” in HBase terms is the table.
We currently have three production solutions implemented using HBase
The test case: our first production use of HBase
The ideal case, which is an almost perfect match for HBase
And finally the awesome case, where we added HBase to something great to make it awesome
This was (to say the least) not ideal for several reasons, including cost and scalability.
accountId is the mylife.com account id
providerAccountId is the id we give to the relation between a MyLife account and a provider account, e.g. this user’s Gmail account
messageId is the unique id each email message is given
bodyId is a reverse timestamp given to each body (HTML or text)
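One common way to build a reverse timestamp is to subtract the time from the largest 64-bit value, so that newer entries sort first. The sketch below assumes that convention and a zero-padded decimal encoding; the talk does not say how bodyId is actually computed, and the function names and ids are made up.

```python
MAX_LONG = 2**63 - 1  # largest signed 64-bit value


def reverse_timestamp(epoch_millis):
    # Zero-pad to a fixed width (19 digits) so string order matches numeric order.
    return f"{MAX_LONG - epoch_millis:019d}"


def make_key(account_id, provider_account_id, message_id, epoch_millis):
    body_id = reverse_timestamp(epoch_millis)
    return f"{account_id}_{provider_account_id}_{message_id}_{body_id}"


older = make_key("1001", "7", "m1", 1_400_000_000_000)
newer = make_key("1001", "7", "m1", 1_400_000_060_000)
```

Because keys are kept sorted, the newer body now comes before the older one within the same message prefix.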
Like our example of a company directory, where you can easily find everyone with the same last name.
Our first use of HBase was very straightforward, but it works! And it works well.
This implementation is faster and cheaper, and it saves precious DB resources for where they are needed most: things that need query and transaction capability.
What we call an Identity here is really a person. One person probably has many social profiles, like a LinkedIn and a Facebook profile.
What is sparse data? It is when the record you store consists mostly of empty fields. The contact page in your phone, for example, has a name and phone number but probably not much else, even though there is a place for home address, company name, birthday, anniversary and a bunch of other stuff. That is sparse data.
Remember I said HBase is flexible? Well this is how you use that flexibility.
Social profiles are similarly sparse. There is a lot of potential data in social profiles, but usually only a few items will actually be present, and the potential fields vary from provider to provider.
For example, first name is almost always in a social profile but middle name probably is not. HBase is great for this since it only stores the data that is there: no wasting space storing empty cells or time transferring them over the network. It also allows us to store fields for different social providers together, or add new fields as we add providers, without having to update the storage, just the code that needs the data.
The “only accessed by key” bit is also important, but we have already covered that.
Exciting, no? It’s all fitting together. So we know about key-value pairs, but what is the reverse part about?
Time for another data structures interlude!
Last time we had this.
The reverse index is simply the same data REVERSED!
So when you get a call from an extension you don’t know, you go look up the name it belongs to.
This has been another data structures interlude!
A simplified data flow for social connections.
In step 2 we are using HBase’s versioning to keep versions of the social profile, so we get a history of changes.
Step 4 is where we build our reverse index, so we can find the identity of a social profile.
So how did we implement number 4 and make this part of the ideal HBase use case?
A coprocessor is an HBase feature we have not touched on till now (have to save a few surprises)
In our case we built a coprocessor to update the profile record when we are associating it with its identity.
This has several advantages:
The reverse index is built at the same time as the primary index
The reverse index gets created no matter the source of the put
Any application can rely on the primary and reverse indexes always existing together
I mentioned message streams briefly in our first case; here is another part of that same system that uses HBase. Once we have a user’s provider streams and have homogenized them, we need them available to build the user’s personal aggregated stream.
Persistence – in this case we are using another HBase feature, TTLs, so that streams that have not been updated in 4 weeks get removed automatically.
Speed – read-throughs (times when a stream is gone from the in-memory cache and has to be fetched from HBase) are basically as fast as the network, since we are getting by key.
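The TTL behavior from the persistence note can be sketched in plain Python. In HBase the TTL is a setting on the column family and expired cells are dropped by HBase itself; the `TtlStore` class below is a hypothetical stand-in that expires entries lazily on read, and it takes the current time as an argument so the behavior is easy to follow.

```python
class TtlStore:
    """Toy store whose entries disappear once they are older than ttl_seconds."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self._cells = {}  # key -> (written_at, value)

    def put(self, key, value, now):
        # Every write resets the clock for that key.
        self._cells[key] = (now, value)

    def get(self, key, now):
        cell = self._cells.get(key)
        if cell is None:
            return None
        written_at, value = cell
        if now - written_at >= self.ttl_seconds:
            del self._cells[key]  # expired: drop it lazily on read
            return None
        return value


FOUR_WEEKS = 4 * 7 * 24 * 60 * 60  # seconds
streams = TtlStore(ttl_seconds=FOUR_WEEKS)
streams.put("alice", ["post1"], now=0)
```

A stream that keeps getting updated never expires, because each put records a fresh write time; one left untouched for four weeks is gone the next time anyone asks for it.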