Speaker: Jonathan Natkins (WibiData)
Many companies aspire to have 360-degree views of their data. Whether they're concerned about customers, users, accounts, or more abstract things like sensors, organizations are focused on developing capabilities for analyzing all the data they have about these entities. This talk will introduce the concept of entity-centric storage, discuss what it means, what it enables for businesses, and how to develop an entity-centric system using the open-source Kiji framework and HBase. It will also compare and contrast traditional methods of building a 360-degree view on a relational database versus building against a distributed key-value store, and why HBase is a good choice for implementing an entity-centric system.
4. What is a 360º View For?
Past
What interactions has a customer had in the past?
Present
What is the customer doing right now?
Future
What is the customer likely to do next?
Past and present inform the future
10. Challenges With Star Schemas
How do we answer the original question?
Full table scan + joins
OLTP systems will likely fall over from the
volume
OLAP systems are usually not optimized for
single-row lookups
13. Why HBase?
HBase rows can store both static and
event-oriented data
Cell versions are key
Single-row lookups are extremely fast
14. Kiji is for Building
Entity-Centric Systems
Often used for:
Building recommendation systems
Personalized search
Real-time HBase applications
Underlying technologies:
15. Designing an Entity-Centric
Datastore
Ask yourself this: what is the entity?
Determine your entity by determining how
you want to analyze the data
It’s ok to have data organized in multiple
ways
16. Schema Management with Kiji
Sometimes you actually want a schema layer
Defining a schema allows for data discoverability
17. Column Families in Kiji
Kiji has two types of column families
Group families are similar to relational
tables
Predefined set of columns
Each column has its own data type
Map families specify columns at runtime
Every column has the same data type
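The difference between the two family types can be sketched in a few lines of Python. This is an illustrative in-memory model only, not Kiji's API: the class names GroupFamily and MapFamily, the put method, and the sample columns are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class GroupFamily:
    """Hypothetical model: a fixed set of columns, each with its own type."""
    columns: Dict[str, type]
    cells: Dict[str, Any] = field(default_factory=dict)

    def put(self, qualifier: str, value: Any) -> None:
        # Columns must be declared up front, like columns of a relational table.
        if qualifier not in self.columns:
            raise KeyError(f"unknown column {qualifier!r}")
        if not isinstance(value, self.columns[qualifier]):
            raise TypeError(f"{qualifier!r} expects {self.columns[qualifier].__name__}")
        self.cells[qualifier] = value


@dataclass
class MapFamily:
    """Hypothetical model: columns named at write time, one shared value type."""
    value_type: type
    cells: Dict[str, Any] = field(default_factory=dict)

    def put(self, qualifier: str, value: Any) -> None:
        # Any qualifier is accepted, but every value must have the same type.
        if not isinstance(value, self.value_type):
            raise TypeError(f"values must be {self.value_type.__name__}")
        self.cells[qualifier] = value


info = GroupFamily(columns={"name": str, "age": int})
info.put("name", "Ada")
info.put("age", 36)

clicks = MapFamily(value_type=int)
clicks.put("product-17", 3)   # qualifier chosen at runtime
clicks.put("product-99", 1)
```

A group family suits attributes known at design time; a map family suits keys that only appear in the data, such as product IDs.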
19. Choosing a Row Key
Row keys in Kiji are componentized
[ 'component1', 'component2', 1234 ]
More efficient than byte arrays
Consider '1234567890' versus [ 1234567890 ]
Good for scanning areas of the keyspace
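The efficiency point can be made concrete with a small sketch. The encoding below (UTF-8 strings with a NUL terminator, 8-byte big-endian integers) is an assumption made for illustration and is not Kiji's actual on-disk key format; encode_key is a hypothetical helper.

```python
import struct


def encode_key(components):
    """Hypothetical encoding: strings as UTF-8 plus NUL, ints as 8-byte big-endian."""
    out = bytearray()
    for c in components:
        if isinstance(c, str):
            out += c.encode("utf-8") + b"\x00"
        elif isinstance(c, int):
            out += struct.pack(">q", c)
        else:
            raise TypeError(f"unsupported component type: {type(c).__name__}")
    return bytes(out)


# The same number stored as text versus as a typed 4-byte integer.
as_text = "1234567890".encode("utf-8")
as_int = struct.pack(">i", 1234567890)

# Text sorts lexicographically, so "10" comes before "9".
text_sorted = sorted([b"9", b"10"])
# Big-endian non-negative integers sort in numeric order byte by byte,
# which keeps range scans over a numeric component meaningful.
int_sorted = sorted(struct.pack(">i", n) for n in (9, 10))

key = encode_key(["component1", "component2", 1234])
```

Here the text form takes 10 bytes and the integer form takes 4. The ordering argument holds only for non-negative integers in this simple encoding, because the sign bit would otherwise sort negative values last.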
20. A Common Use for
Components
Known user IDs versus unknown IDs
On a website, how do you differentiate
between a logged-in or cookie’d user and a
brand-new visitor?
[ 'K', 'user1234' ] or [ 'U', 'unknown2345' ]
Physically and logically separate rows
Run jobs over all known or unknown users
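Because the first component sorts first, all known users and all unknown users occupy separate, contiguous parts of the keyspace. A minimal sketch of scanning one of those ranges, using Python tuples as stand-ins for componentized keys (scan_prefix and the sample keys are hypothetical, and a real job would use a table scan rather than an in-memory list):

```python
from bisect import bisect_left

# Row keys kept in sorted order, as a key-value store would keep them.
row_keys = sorted([
    ("K", "user1234"),
    ("U", "unknown2345"),
    ("K", "user0007"),
    ("U", "unknown0001"),
])


def scan_prefix(sorted_keys, prefix):
    """Return every key whose first component equals `prefix`."""
    # (prefix,) sorts before any (prefix, x); the upper bound uses a
    # high sentinel character so it sorts after every (prefix, x).
    start = bisect_left(sorted_keys, (prefix,))
    end = bisect_left(sorted_keys, (prefix + "\uffff",))
    return sorted_keys[start:end]


known = scan_prefix(row_keys, "K")
unknown = scan_prefix(row_keys, "U")
```

Each call touches only its own slice of the sorted keys, which is what makes a job over "all known users" cheap to express.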
21. Identifying Known Users
Problem: Users have many cookies over
time.
Challenge: Ideally, we would have a single
row for each user. How do we ensure that
new data goes to the right row?
22. Finding Known Users With
Lookup Tables
HBase get operations are fast
It’s easy enough to create a table that
contains a mapping of cookies to known
user IDs
When data is loaded, check the lookup
table to determine if you should write data
to an existing row or a new one
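The load-time check described above might look like the following sketch. Plain dictionaries stand in for the lookup table and the main table; cookie_to_user, row_key_for, load_event, and the sample cookies are hypothetical names chosen for illustration.

```python
# Lookup table: cookie ID -> known user ID (a fast get in HBase terms).
cookie_to_user = {}
# Main table: row key -> list of events.
rows = {}


def row_key_for(cookie):
    """Route a cookie to the known user's row if one exists, else to a new row."""
    user_id = cookie_to_user.get(cookie)
    if user_id is not None:
        return ("K", user_id)
    return ("U", cookie)


def load_event(cookie, event):
    rows.setdefault(row_key_for(cookie), []).append(event)


# Two cookies that belong to the same known user, plus one unknown cookie.
cookie_to_user["cookie-a"] = "user1234"
cookie_to_user["cookie-b"] = "user1234"

load_event("cookie-a", "view:home")
load_event("cookie-b", "view:cart")
load_event("cookie-z", "view:home")
```

Both cookies for user1234 end up in a single row, while the unrecognized cookie gets its own row under the unknown prefix.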
24. Unhashed Row Keys
[Diagram: regions with contiguous key ranges A-B, B-C, D-E, F-G, H-I, J-K spread across Node 1, Node 2, and Node 3]
25. Hash-Prefixed Row Keys
[Diagram: regions with hash-prefixed key ranges 00A-0fK, 10A-1fK, 20A-2fK, 30A-3fK, 40A-4fK, 50A-5fK spread across Node 1, Node 2, and Node 3]
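The effect of a hash prefix can be sketched as follows. This assumes an MD5 digest and a two-hex-character prefix purely for illustration; the hash function and prefix size Kiji actually uses are not stated in these slides, and hash_prefixed is a hypothetical helper.

```python
import hashlib
from collections import Counter


def hash_prefixed(key, prefix_bytes=1):
    """Prepend the first `prefix_bytes` bytes (as hex) of the key's MD5 digest."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return digest[: prefix_bytes * 2] + key


keys = [f"user{n:04d}" for n in range(1000)]

# Unhashed: sequential keys share a leading character, so they all fall
# into the same part of the keyspace (and likely the same region).
plain_buckets = Counter(k[0] for k in keys)

# Hash-prefixed: the leading hex character varies, spreading the same
# keys across many key ranges.
hashed_buckets = Counter(hash_prefixed(k)[0] for k in keys)
```

The prefix is derived from the key itself, so a reader can recompute it and still do a direct single-row lookup. The trade-off is that scans in the original key order are no longer contiguous.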
26. Storing Event Series
360º views need easy access to all the
transactions and events for a user
HBase cells may contain more than one
version
Kiji leverages this to store event series
data like clicks or purchases
[Diagram: a row with columns info:name, info:email, info:purchases, sessions:1234, and sessions:2345; the info:purchases and sessions columns each appear several times]
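A rough model of how cell versions hold an event series: each column maps timestamps to values, and a read asks for the newest N versions. The Row class, its methods, and the sample data are hypothetical and only mimic the idea; they are not the HBase or Kiji client API.

```python
from collections import defaultdict


class Row:
    """Hypothetical in-memory row: column -> {timestamp: value}."""

    def __init__(self):
        self._cells = defaultdict(dict)

    def put(self, column, timestamp, value):
        # Writing at a new timestamp adds a version instead of overwriting.
        self._cells[column][timestamp] = value

    def get(self, column, max_versions=1):
        # Newest versions first, as a versioned read would return them.
        versions = sorted(self._cells.get(column, {}).items(), reverse=True)
        return versions[:max_versions]


row = Row()
row.put("info:name", 1000, "Ada")            # static data: one version
row.put("info:purchases", 1000, "book")      # event series: many versions
row.put("info:purchases", 2000, "lamp")
row.put("info:purchases", 3000, "desk")

latest_name = row.get("info:name")
recent_purchases = row.get("info:purchases", max_versions=2)
```

Static attributes and event history live in the same row, so one lookup returns both.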
27. How Many Events is Too Many?
The HBase book warns that too many
versions of a cell can cause StoreFile
bloat
HBase will never split a row
Common tactic is to add a timestamp
range to the row key
Kiji makes this easy with componentized row
keys
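One way to picture the timestamp-range tactic is a row key with a time bucket as an extra component, so that each row holds a bounded slice of a user's events. The monthly granularity, the bucketed_row_key helper, and the sample user ID are assumptions for illustration.

```python
from datetime import datetime, timezone


def bucketed_row_key(user_id, timestamp_ms):
    """Return (user_id, 'YYYY-MM') so each row holds one month of events."""
    when = datetime.fromtimestamp(timestamp_ms // 1000, tz=timezone.utc)
    return (user_id, when.strftime("%Y-%m"))


# Events one millisecond apart land in different rows at a month boundary.
end_of_june = bucketed_row_key("user1234", 1372636799999)
start_of_july = bucketed_row_key("user1234", 1372636800000)
```

Reading a user's full history then becomes a short scan over that user's buckets instead of a single get, which is the price paid for keeping any one row from growing without bound.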
28. Beware of Timestamp Misuse
A major reason the HBase book warns
against mucking with timestamps is that
they can be dangerous
What happens if you use a sequence number
as a timestamp? Think about TTLs
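The TTL question can be worked through with a simplified model of expiry, in which a cell is considered expired once its timestamp is older than the TTL relative to the current time. The is_expired function and the numbers below are illustrative assumptions, not HBase code.

```python
def is_expired(cell_timestamp_ms, ttl_seconds, now_ms):
    """Simplified TTL rule: expired if the cell is older than the TTL."""
    return now_ms - cell_timestamp_ms > ttl_seconds * 1000


now_ms = 1_372_636_800_000          # 2013-07-01 00:00:00 UTC, in milliseconds
thirty_days = 30 * 24 * 3600        # TTL in seconds

# A cell written one second ago with a real wall-clock timestamp.
real_cell_expired = is_expired(now_ms - 1000, thirty_days, now_ms)

# A cell written just now, but stamped with sequence number 42.
# Read as milliseconds since the epoch, 42 is early 1970.
sequence_cell_expired = is_expired(42, thirty_days, now_ms)
```

Under this model the freshly written cell with a sequence-number timestamp looks decades old, so any finite TTL would treat it as already expired.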
30. Why is Evolution Necessary?
No entity-centric system will be the end-all,
be-all the first time around
Data sources in large enterprises are
usually heavily siloed
Start small
Incorporate new data sources over time
31. Putting it Together
Kiji includes a shell to use DDL to create
tables
Many of the features that have been
discussed are declarative via the DDL
32. Users Table
CREATE TABLE 'user_events' WITH DESCRIPTION 'Events table for online users.'
ROW KEY FORMAT (type STRING, user_id STRING NOT NULL, HASH(THROUGH user_id))
PROPERTIES (NUMREGIONS = 32)
WITH LOCALITY GROUP default WITH DESCRIPTION 'main storage' (
  MAXVERSIONS = INFINITY,
  TTL = FOREVER,
  INMEMORY = false,
  MAP TYPE FAMILY events CLASS com.kiji.avro.Event WITH DESCRIPTION 'events'
),
LOCALITY GROUP memory WITH DESCRIPTION 'recs storage' (
  MAXVERSIONS = 10,
  TTL = FOREVER,
  INMEMORY = true,
  FAMILY recs (
    recommended CLASS com.kiji.avro.ProductRecList
      WITH DESCRIPTION 'Recommended products.'
  )
);
36. In Summary…
Designing applications in an entity-centric
fashion can make them easier to build and
more efficient
Kiji can speed up the development
process of 360º views