HBase schema design case studies

Organized by Evan/Qingyan Liu
qingyan123 (AT) gmail.com
2009.7.13

The Tao is ... de-normalization.

Case 1: locations
● China
    ● Beijing
    ● Shanghai
    ● Guangzhou
    ● Shandong
        – Jinan
        – Qingdao
    ● Sichuan
        – Chengdu
In RDBMS
loc_id (PK)  loc_name    parent_id   child_id
1            China                   2,3,4,5,6
2            Beijing     1
3            Shanghai    1
4            Guangzhou   1
5            Shandong    1           7,8
6            Sichuan     1           9
7            Jinan       1,5
8            Qingdao     1,5
9            Chengdu     1,6
In HBase
row        column families
           name:       parent:            child:
<loc_id>               parent:<loc_id>    child:<loc_id>
1          China                          child:2=state
                                          child:3=state
                                          child:4=state
                                          child:5=state
                                          child:6=state
5          Shandong    parent:1=nation    child:7=city
                                          child:8=city
8          Qingdao     parent:1=nation
                       parent:5=state
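
A minimal sketch of this layout, using nested Python dicts as a stand-in for an HBase table (this is not the HBase client API). Each row maps a column family to its qualifier/value pairs. The row for Jinan (7) is filled in by analogy with Qingdao and is an assumption, as are the helper names `ancestors` and `children`. The point it tries to show: because every ancestor is copied into the row, one row read answers "where does this location sit in the hierarchy?" without walking parent links.

```python
# Rows keyed by loc_id; each row holds column families as nested dicts.
# An empty qualifier ("") under "name" holds the location name.
locations = {
    "5": {
        "name": {"": "Shandong"},
        "parent": {"1": "nation"},
        "child": {"7": "city", "8": "city"},
    },
    "7": {  # assumed by analogy with row 8
        "name": {"": "Jinan"},
        "parent": {"1": "nation", "5": "state"},
        "child": {},
    },
    "8": {
        "name": {"": "Qingdao"},
        "parent": {"1": "nation", "5": "state"},
        "child": {},
    },
}


def ancestors(table, loc_id):
    """Return {ancestor_id: level} for a location from a single row read."""
    return dict(table[loc_id]["parent"])


def children(table, loc_id):
    """Return {child_id: level} for the direct children of a location."""
    return dict(table[loc_id]["child"])


print(ancestors(locations, "8"))  # {'1': 'nation', '5': 'state'}
print(children(locations, "5"))   # {'7': 'city', '8': 'city'}
```

The price of the cheap read is paid on write: adding a city means writing its own row and also adding a child: column to its parent's row.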
Case 2: student-course
● Student
    ● 1 student (S) ~ many courses (C)
● Course
    ● 1 course (C) ~ many students (S)
In RDBMS

Students     SCs           Courses
id PK        student_id    id PK
name         course_id     title
sex          type          introduction
age                        teacher_id
In HBase
row                           column families
               info:               course:
<student_id>   info:name           course:<course_id>=type
               info:sex
               info:age

row                           column families
               info:               student:
<course_id>    info:title          student:<student_id>=type
               info:introduction
               info:teacher_id
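
A minimal sketch of how the two tables might be kept in step, again with plain dicts standing in for HBase tables rather than the real client API. The function name `enroll`, the sample ids, and the sample values are all hypothetical. It shows that one logical fact (a student takes a course) is written twice, once per table, so that either side can be listed with a single row read.

```python
# students table: row key = student_id, families info: and course:
students = {
    "s1": {"info": {"name": "Li Lei", "sex": "M", "age": "20"}, "course": {}},
}

# courses table: row key = course_id, families info: and student:
courses = {
    "c1": {
        "info": {"title": "Databases", "introduction": "Intro course", "teacher_id": "t9"},
        "student": {},
    },
}


def enroll(students, courses, student_id, course_id, enroll_type):
    """Record the enrollment on both sides of the many-to-many relation."""
    students[student_id]["course"][course_id] = enroll_type
    courses[course_id]["student"][student_id] = enroll_type


enroll(students, courses, "s1", "c1", "elective")
print(students["s1"]["course"])  # {'c1': 'elective'}
print(courses["c1"]["student"])  # {'s1': 'elective'}
```

HBase guarantees atomicity within a single row only, so in a real deployment the two writes can be observed half-applied; an application would need to tolerate or repair that.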
Case 3: user-action
● users perform actions now and then
    ● store every event
    ● query the recent events of a user
In RDBMS

Actions
id PK
user_id IDX
name
time

● For a fast SELECT id, user_id, name, time FROM Actions
    WHERE user_id=XXX ORDER BY time DESC LIMIT 10
    OFFSET 20, we must create an index on user_id.
    However, the index slows inserts down considerably,
    because it has to be updated on every insert.
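
As a rough illustration of the relational version, here is a sketch using Python's built-in `sqlite3` module with an in-memory database. The index name `idx_actions_user`, the helper `recent_actions`, and the sample rows are made up for the example; the table and query follow the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Actions ("
    " id INTEGER PRIMARY KEY, user_id INTEGER, name TEXT, time INTEGER)"
)
# The index that makes the per-user lookup fast (and every insert dearer).
conn.execute("CREATE INDEX idx_actions_user ON Actions (user_id)")

# 50 events each for users 7 and 8, with increasing timestamps 0..49.
rows = [(uid, f"action-{t}", t) for t in range(50) for uid in (7, 8)]
conn.executemany(
    "INSERT INTO Actions (user_id, name, time) VALUES (?, ?, ?)", rows
)


def recent_actions(conn, user_id, limit=10, offset=20):
    """Return one page of a user's actions, newest first."""
    cur = conn.execute(
        "SELECT id, user_id, name, time FROM Actions"
        " WHERE user_id = ? ORDER BY time DESC LIMIT ? OFFSET ?",
        (user_id, limit, offset),
    )
    return cur.fetchall()


page = recent_actions(conn, 7)
print([row[3] for row in page])  # times 29 down to 20
```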
In HBase
row                                        column families
                                           name:
<user><Long.MAX_VALUE -
System.currentTimeMillis()><event id>

Rows are stored sorted by row key, so the reversed timestamp
puts each user's newest events first.
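
A minimal sketch of the reversed-timestamp row key, in Python rather than Java and with a dict standing in for the table. The byte encoding is an assumption the slide does not specify: the user name is followed by a zero byte as a delimiter, and the reversed timestamp and event id are packed as 8-byte big-endian integers so that byte order matches numeric order. The names `action_row_key`, `put_action`, and `recent_events` are hypothetical.

```python
import struct

LONG_MAX = 2**63 - 1  # the value of Java's Long.MAX_VALUE


def action_row_key(user, time_ms, event_id):
    """Build <user><Long.MAX_VALUE - time><event id> as sortable bytes."""
    reversed_ts = LONG_MAX - time_ms  # newer events get smaller values
    return (
        user.encode("utf-8") + b"\x00"  # assumed delimiter after the user
        + struct.pack(">q", reversed_ts)
        + struct.pack(">q", event_id)
    )


def put_action(table, user, time_ms, event_id, name):
    table[action_row_key(user, time_ms, event_id)] = {"name": name}


def recent_events(table, user, limit):
    """Emulate a prefix scan: the first `limit` rows in key order."""
    prefix = user.encode("utf-8") + b"\x00"
    keys = sorted(k for k in table if k.startswith(prefix))
    return [table[k]["name"] for k in keys[:limit]]


table = {}
put_action(table, "alice", 1000, 1, "login")
put_action(table, "alice", 2000, 2, "view")
put_action(table, "alice", 3000, 3, "logout")
put_action(table, "bob", 2500, 4, "login")

print(recent_events(table, "alice", 2))  # ['logout', 'view']
```

No secondary index is involved: the "ORDER BY time DESC" of the relational query is baked into the key order.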
Case 4: user-friends
● 1 user has 1+ friends
● we will look up all friends of a user
In RDBMS

Users        Friendships
id IDX       user_id IDX
name         friend_id
sex          type
age

● SELECT * FROM friendships WHERE user_id='XXX';
In HBase

row                           column families
                 info:            friend:
<user_id>        info:name        friend:<user_id>=type
                 info:sex
                 info:age

 ● actually, this is a graph, which can be represented as a
   sparse matrix.
 ● then you can use M/R (MapReduce) to find something interesting,
   e.g. the shortest path from user A to user B.
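
To make the graph view concrete, here is a small sketch that treats each row's friend: family as an adjacency list and finds a shortest path with breadth-first search. This runs on a single machine and is not a MapReduce job; it only illustrates the kind of question the slide has in mind. The user ids, relation types, and the function name `shortest_path` are hypothetical.

```python
from collections import deque

# row key -> contents of the friend: column family (friend_id -> type)
friends = {
    "A": {"B": "classmate", "C": "colleague"},
    "B": {"A": "classmate", "D": "family"},
    "C": {"A": "colleague", "D": "classmate"},
    "D": {"B": "family", "C": "classmate", "E": "colleague"},
    "E": {"D": "colleague"},
}


def shortest_path(graph, start, goal):
    """Return a shortest list of users from start to goal, or None."""
    if start == goal:
        return [start]
    previous = {start: None}  # visited users and how we reached them
    queue = deque([start])
    while queue:
        user = queue.popleft()
        for friend in graph.get(user, {}):
            if friend in previous:
                continue
            previous[friend] = user
            if friend == goal:
                # Walk the back-pointers to rebuild the path.
                path = [friend]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(friend)
    return None


print(shortest_path(friends, "A", "E"))  # ['A', 'B', 'D', 'E']
```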
Case 5: access log
● each log line contains time, ip, domain, url,
    referer, browser_cookie, login_id, etc.
● logs will be analyzed every 5 minutes, every hour,
    daily, weekly, and monthly
In RDBMS

Accesslog
time
ip IDX
domain
url
referer
browser_cookie IDX
login_id IDX
In HBase

row                    column families
                       http:            user:
<time><INC_COUNTER>    http:ip          user:browser_cookie
                       http:domain      user:login_id
                       http:url
                       http:referer

INC_COUNTER distinguishes adjacent rows that have the same time value.
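
A minimal sketch of the `<time><INC_COUNTER>` key and of a time-window read, with an in-memory sorted list standing in for HBase's key-ordered storage. The slide does not say how the time or the counter is encoded; the 10-digit epoch seconds and 6-digit per-timestamp counter used here are assumptions, as are the names `log_row_key` and `AccessLog`.

```python
import bisect
from collections import defaultdict


def log_row_key(time_s, counter):
    """Zero-padded <time><INC_COUNTER>, so string order equals time order."""
    return f"{time_s:010d}{counter:06d}"


class AccessLog:
    def __init__(self):
        self._keys = []                    # row keys, kept sorted
        self._rows = {}                    # row key -> column families
        self._counters = defaultdict(int)  # next counter per timestamp

    def append(self, time_s, http, user):
        counter = self._counters[time_s]
        self._counters[time_s] += 1
        key = log_row_key(time_s, counter)
        bisect.insort(self._keys, key)
        self._rows[key] = {"http": http, "user": user}
        return key

    def scan(self, start_s, stop_s):
        """Return rows with start_s <= time < stop_s, like a range scan."""
        lo = bisect.bisect_left(self._keys, log_row_key(start_s, 0))
        hi = bisect.bisect_left(self._keys, log_row_key(stop_s, 0))
        return [self._rows[k] for k in self._keys[lo:hi]]


log = AccessLog()
t0 = 1247443200  # 2009-07-13 00:00:00 UTC
log.append(t0, {"ip": "10.0.0.1", "url": "/"}, {"login_id": "u1"})
log.append(t0, {"ip": "10.0.0.2", "url": "/a"}, {"login_id": "u2"})
log.append(t0 + 300, {"ip": "10.0.0.1", "url": "/b"}, {"login_id": "u1"})

print(len(log.scan(t0, t0 + 300)))  # 2 rows in the first 5-minute window
```

Each periodic analysis (5 minutes, hourly, daily, and so on) then becomes a scan over one contiguous key range.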
