HBase brings interactivity to Hadoop and allows users to collect, manage and process data in real time. Lily wraps HBase and Solr in a comprehensive Big Data platform, with HBase-native secondary indexing complementing ad-hoc structured search. By using spare write cycles during read operations, Lily transforms HBase into a scalable data management engine that provides interactive analytics, profile harvesting and real-time recommendations. This talk highlights the architecture of Lily, shows how it completes HBase, and explains some of its implementation use cases.
HBaseCon 2012 | Getting Real about Interactive Big Data Management with Lily & HBase - ngdata
1. Making Sense of Data
Lily goes shopping –
real-time recommendations with HBase
HBaseCon, May 2012
Steven Noels – VP Product – @stevenn
WWW.NGDATA.COM
2. Lily Core 2’ recap
• HBase-backed data repository, with batteries included
• Data model:
• high-level data model on top of HBase’s byte[]’s
• schema
• versioning (schema and data)
• links, variants
• Java & REST APIs
• Indexing:
• through configuration, not implementation
• incremental and batch index maintenance
• RowLog: distributed, durable queue for secondary actions
• Open Source: www.lilyproject.org (Apache License)
[Diagram labels: client app, Lily, RowLog, HBase, Solr et al.]
3. Why HBase?
• BigTable model
• sparseness
• atomic row updates aka consistency
• auto-partitioning
• Apache license
• A great community led by a Saint :-)
4. Portfolio Overview
Real-time AI
Recommendations
Industry algorithms and rules
commercial availability
Trend Analytics
Pattern Detection
Profile Development
Context and Activity Tracking
open source
Social Stream Ingestion
Schema and Data Management
Total Data Aggregation
Real-time Index and Retrieval
Security and Enterprise Connectors
5. Lily (=HBase) In Use
Some of the larger Lily deployments
• media
• aggregation, database publishing and online archives
• finance
• real-time identity fraud detection
• retail banking
• contextualized (time+loc+person) mobile coupons
• retail
• e-commerce platform: product catalog, consumer data store, real-time indexing
6. Collaborative Filtering?
Recommend items similar to a user’s highly-preferred items
7. Collaborative Filtering is … Matrices
Sean likes “Scarface” a lot (123,654,5.0)
Robin likes “Scarface” somewhat (789,654,3.0)
Grant likes “The Notebook” not at all (345,876,1.0)
…
(Magic)
Grant may like “Scarface” quite a bit (345,654,4.5)
…
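The "(Magic)" step can be illustrated with a minimal item-based sketch in Python. The (user ID, item ID, preference) triples for Sean, Robin and Grant are taken from the slide; the other triples, the item ID 111 and all function names are hypothetical, added only so that the users overlap enough for an estimate to exist. This is not Lily's implementation, and it is not expected to reproduce the 4.5 shown on the slide.

```python
from collections import defaultdict
from math import sqrt

# (user_id, item_id, preference) triples. The first three are from the slide;
# the rest are hypothetical filler so that users overlap on some items.
PREFS = [
    (123, 654, 5.0),  # Sean, "Scarface"
    (789, 654, 3.0),  # Robin, "Scarface"
    (345, 876, 1.0),  # Grant, "The Notebook"
    (123, 876, 1.5),  # hypothetical
    (123, 111, 4.5),  # hypothetical item 111
    (789, 111, 3.5),  # hypothetical
    (345, 111, 4.0),  # hypothetical
]


def by_item(prefs):
    """Turn the triples into a sparse matrix {item_id: {user_id: preference}}."""
    matrix = defaultdict(dict)
    for user, item, value in prefs:
        matrix[item][user] = value
    return matrix


def cosine(a, b):
    """Cosine similarity between two sparse item vectors keyed by user id."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[u] * b[u] for u in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def estimate(prefs, user, target_item):
    """Similarity-weighted average of the preferences the user already has."""
    items = by_item(prefs)
    num = den = 0.0
    for item, ratings in items.items():
        if item == target_item or user not in ratings:
            continue
        sim = cosine(items[target_item], ratings)
        num += sim * ratings[user]
        den += sim
    return num / den if den else None


# Estimated preference of Grant (345) for "Scarface" (654).
grant_scarface = estimate(PREFS, 345, 654)
```

The estimate is a weighted average of Grant's existing preferences, so it always falls between his lowest and highest known preference; a user with no preferences at all yields `None`.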
9. Fitting Recommendations into the Lily Architecture
[Architecture diagram; labels:]
• LILY CRUD API
• Lily/HBase Secondary Indexes
• read/write demultiplexer
• co-occurrence lookup matrix
• rowlog
• activity store
• LILY recommender engine
• data store, profile store
• data, activity, profile indexes
• scoring
• algorithm support: propensity, k-means, ALS, custom ...
Contact:
Steven Noels
stevenn@ngdata.com
www.ngdata.com
telephone: +32 9 33 88 220
Gent (Belgium)
Makers of Lily Core Repository
10. Preferencing aka Feeding the Matrix
• Transaction-based preferencing
• Pluggable preference strategies, using Lily-based data (HBase & Solr) for decision making
• e.g. credit card statement = transactions between users and product families
• Preference weighting
• Ingest: REST API, bulk support
• Real-time updating of the recommendation model
• Profile Store
• Profile activities can be preferenced
• Support for Profile behavior analysis
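Transaction-based preferencing with pluggable strategies and weighting can be sketched as follows. This is an illustration in plain Python, not the Lily API: the transaction layout (user ID, product family, amount), the two strategies, the 5.0 cap and the per-family weights are all assumptions made for the example.

```python
from collections import defaultdict


def count_strategy(total, count):
    """Hypothetical strategy: prefer what is bought often, capped at 5.0."""
    return min(5.0, float(count))


def spend_strategy(total, count):
    """Hypothetical strategy: one point per 50 currency units, capped at 5.0."""
    return min(5.0, total / 50.0)


def preferences(transactions, strategy, family_weights=None):
    """Aggregate transactions into {(user, product_family): preference}.

    `strategy` is the pluggable part: any function (total, count) -> float.
    `family_weights` optionally scales the result per product family.
    """
    family_weights = family_weights or {}
    totals = defaultdict(float)
    counts = defaultdict(int)
    for user, family, amount in transactions:
        totals[(user, family)] += amount
        counts[(user, family)] += 1
    return {
        (user, family): strategy(totals[(user, family)], counts[(user, family)])
        * family_weights.get(family, 1.0)
        for (user, family) in totals
    }


# A hypothetical credit card statement: (user_id, product_family, amount).
STATEMENT = [
    (123, "coffee", 4.0),
    (123, "coffee", 3.5),
    (123, "books", 120.0),
]

by_count = preferences(STATEMENT, count_strategy)
by_spend = preferences(STATEMENT, spend_strategy)
weighted = preferences(STATEMENT, count_strategy, {"coffee": 2.0})
```

Swapping the strategy changes which product family comes out on top: counting favours coffee, spending favours books. The resulting (user, item, preference) triples are the kind of input a preference matrix is fed with.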
11. Making recommendations
• Recommender
• Pluggable recommender strategies, using Lily-based data (HBase & Solr) for decision making
• Multi-model support: user-item & item-user recommendations
• Estimation of both preferenced and non-preferenced items
• Geolocation-based recommendations
• Re-scoring
• REST API
• (Planned)
• Support for Classifications (scenario: recommend me all (possible) coffee drinkers)
• Matrix / recommendation indexing
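How re-scoring and geolocation could combine is sketched below, under stated assumptions: recommendations arrive as (item, score) pairs, item locations are (latitude, longitude) tuples, and the linear distance penalty and the 10 km radius are invented for illustration. None of the names come from Lily.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (a[0], a[1], b[0], b[1]))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def rescore(recommendations, item_locations, user_location, radius_km=10.0):
    """Re-score (item, score) pairs by proximity to the user.

    Items farther away than radius_km are dropped, items inside the radius
    lose up to half their score with distance, and items without a known
    location keep their original score.
    """
    rescored = []
    for item, score in recommendations:
        location = item_locations.get(item)
        if location is None:
            rescored.append((item, score))
            continue
        distance = haversine_km(user_location, location)
        if distance <= radius_km:
            rescored.append((item, score * (1.0 - distance / (2 * radius_km))))
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Hypothetical data: a user in Gent, one offer nearby, one in Brussels.
GENT = (51.05, 3.72)
BRUSSELS = (50.85, 4.35)
recs = [("coupon-a", 3.0), ("coupon-b", 5.0), ("coupon-c", 2.0)]
locations = {"coupon-a": GENT, "coupon-b": BRUSSELS}
nearby = rescore(recs, locations, GENT)
```

Here the highest-scored coupon is dropped because it is roughly 50 km away, which is the kind of decision a re-scoring step is there to make after the recommender has produced its raw scores.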
12. Other upcoming Lily Features
• Secondary indexes (= Lily Core!)
• indexes are defined through configuration
• single or multi-field indexes
• range queries and prefix queries
• asc or desc sorted results
• can read huge, sorted lists
• synchronously updated: index updates are applied by rowlog secondary actions
• online building of new indexes (no table locks)
• MapReduce integration
• SolrCloud integration
• Index shards and configuration managed through ZooKeeper
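Range queries, prefix queries and asc/desc results all follow from keeping index entries sorted by field value, which is what a sorted row-key store such as HBase provides. The sketch below imitates that with an in-memory sorted list; the class `SortedIndex` and its methods are hypothetical and are not the Lily index API.

```python
import bisect


class SortedIndex:
    """In-memory stand-in for an index table whose keys sort by field value."""

    def __init__(self):
        self._entries = []  # kept sorted: (field_value, record_id)

    def put(self, value, record_id):
        """Add an index entry, keeping the list sorted."""
        bisect.insort(self._entries, (value, record_id))

    def range(self, low, high, descending=False):
        """Record ids with low <= value < high, in ascending or descending order."""
        # A 1-tuple sorts before every 2-tuple that starts with the same value,
        # so (low,) finds the first entry whose value is >= low.
        start = bisect.bisect_left(self._entries, (low,))
        stop = bisect.bisect_left(self._entries, (high,))
        ids = [record_id for _, record_id in self._entries[start:stop]]
        return ids[::-1] if descending else ids

    def prefix(self, prefix):
        """Record ids whose string value starts with prefix.

        Simplification: assumes values contain no characters at or above U+FFFF.
        """
        return self.range(prefix, prefix + "\uffff")


index = SortedIndex()
for value, record_id in [("smith", "r1"), ("smart", "r2"), ("jones", "r3"), ("smythe", "r4")]:
    index.put(value, record_id)

sm_records = index.prefix("sm")
all_desc = index.range("a", "z", descending=True)
```

Because a query is a binary search followed by a sequential read of one contiguous slice, long sorted result lists can be read without touching unrelated entries, which matches the "can read huge, sorted lists" point above.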