SlideShare una empresa de Scribd logo
1 de 34
Descargar para leer sin conexión
Apache HBase 
For Architects 
Nick Dimiduk, Hortonworks 
Strata/Hadoop World Barcelona, 2014-11-21 
Licensed under a Creative Commons 
Attribution-ShareAlike 3.0 Unported License. 
Page 1
Licensed under a Creative Commons 
Attribution-ShareAlike 3.0 Unported License. Page 2
Licensed under a Creative Commons 
Attribution-ShareAlike 3.0 Unported License. 
Agenda 
• Background 
– (how did we get here?) 
• TL;DR 
– (don’t waste my time!) 
• High-level Architecture 
– (where are we?) 
• Anatomy of a RegionServer 
– (how does this thing work?) 
• By Example 
– (how do I use it?) 
• Resources 
– (where do we go from here?) 
Page 3
Licensed under a Creative Commons 
Attribution-ShareAlike 3.0 Unported License. 
Background 
Page 4
So what is HBase anyway? 
• BigTable paper from Google, 2006, Dean et al. 
– “Bigtable is a sparse, distributed, persistent multi-dimensional sorted map.” 
– http://research.google.com/archive/bigtable.html 
Licensed under a Creative Commons 
Attribution-ShareAlike 3.0 Unported License. 
• Key Features: 
– Distributed storage across cluster of machines 
– Random, online read and write data access 
– Schemaless data model (“NoSQL”) 
– Self-managed data partitions 
Page 5
Apache Hadoop Dependencies 
• Apache Hadoop Distributed Filesystem (HDFS) 
– Distributed, fault-tolerant, throughput-optimized data storage 
– The Google File System, 2003, Ghemawat et al. 
– http://research.google.com/archive/gfs.html 
Licensed under a Creative Commons 
Attribution-ShareAlike 3.0 Unported License. 
• Apache Zookeeper (ZK) 
– Distributed, available, reliable coordination system 
– The Chubby Lock Service …, 2006, Burrows 
– http://research.google.com/archive/chubby.html 
• Apache Hadoop MapReduce (MR) 
– Distributed, fault-tolerant, batch-oriented data processing 
– MapReduce: …, 2004, Dean and Ghemawat 
– http://research.google.com/archive/mapreduce.html 
Page 6
Licensed under a Creative Commons 
Attribution-ShareAlike 3.0 Unported License. 
TL;DR 
Page 7
So what is HBase anyway? 
Licensed under a Creative Commons 
Attribution-ShareAlike 3.0 Unported License. 
Page 8 
C1 tree C0 tree 
Disk Memory 
Figure 2.1. Schematic picture of an LSM-tree of two components 
Figure 2.1 reproduced from O’Neil, Patrick, et al. "The log-structured 
merge-tree (LSM-tree)." Acta Informatica 33.4 (1996): 351-385.
So what is HBase anyway? 
C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory Cache 
Licensed under a Creative Commons 
Attribution-ShareAlike 3.0 Unported License. 
Page 9 
DataNode RegionServer C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory
primarily random reads and writes. In other deployments, is also a part of the workloads, TaskTrackers, Servers can run together. 
primarily random primarily reads random and writes. reads and In other writes. is also a part is also of the a part workloads, of the workloads, TaskTrackers, Servers can Servers run together. 
can run together. 
So what is HBase anyway? 
DataNode RegionServer DataNode RegionServer Figure 3.7 HBase RegionServer and HDFS DataNode processes C1 DataNode RegionServer DataNode RegionServer DataNode C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory Cache 
C1 C1 Figure 3.7 HBase Figure RegionServer 3.7 HBase and RegionServer HDFS DataNode and DataNode RegionServer DataNode RegionServer C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory Cache 
C1 Licensed under a Creative Commons 
Attribution-ShareAlike 3.0 Unported License. 
Page 10 
C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory Cache 
tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory 
tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory Cache 
tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory Cache 
tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory 
DataNode RegionServer DataNode C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory 
DataNode RegionServer C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory Cache 
C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory
RegionServers, and each RegionServer typically hosts multiple regions. 
RegionServers, and each RegionServer typically hosts multiple regions. 
RegionServers, and each RegionServer typically hosts multiple regions. 
RegionServers, and each RegionServer typically hosts multiple regions. 
RegionServers, RegionServers, and each RegionServer and each RegionServer typically hosts typically multiple hosts Given that the Given underlying that the underlying data is stored data in is HDFS, stored which in HDFS, is available a single namespace, a single namespace, all RegionServers all RegionServers have access have to the access same to system and system can therefore and can host therefore any region host any (figure region 3.8). (figure By physically 3.8). Nodes and Nodes RegionServers, and RegionServers, you can use you the can data use locality the data property; locality can theoretically can theoretically read and write read to and write local to DataNode local as DataNode the You may wonder You may where wonder the TaskTrackers where the TaskTrackers are in this are scheme HBase deployments, HBase deployments, the MapReduce the MapReduce framework framework isn’t deployed isn’t primarily random primarily reads random and writes. reads and In other writes. deployments, In other deployments, where is also a part is also of the a part workloads, of the workloads, TaskTrackers, TaskTrackers, DataNodes, Servers can Servers run together. 
can run together. 
Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers 
Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers 
Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers 
Nodes and Nodes RegionServers, and Nodes RegionServers, you and can RegionServers, use you the can data use locality you the can data property; use locality the data that property; is, locality can theoretically can theoretically read and can write theoretically read to and write local read to DataNode and write local to as DataNode the primary local as DataNode the You may wonder You may where wonder You the may TaskTrackers where wonder the TaskTrackers where are in the this TaskTrackers are scheme in this of are scheme things. HBase deployments, HBase deployments, the HBase MapReduce deployments, the MapReduce framework the MapReduce framework isn’t deployed framework isn’t at deployed all if the isn’t primarily random primarily reads random primarily and writes. reads random and In other writes. reads deployments, and In other writes. deployments, In where other the deployments, MapReduce where is also a part is also of the a part workloads, is also of the a part workloads, TaskTrackers, of the workloads, TaskTrackers, DataNodes, TaskTrackers, DataNodes, and HBase Servers can Servers run together. 
can Servers run together. 
can run together. 
primarily random reads and writes. In other deployments, where the MapReduce pro-cessing 
primarily random reads and writes. In other deployments, where the MapReduce pro-cessing 
primarily random primarily reads random primarily and writes. reads random and In other writes. reads deployments, and In other writes. deployments, In where other the deployments, MapReduce where is also a part is also of the a part workloads, is also of the a part workloads, TaskTrackers, of the workloads, TaskTrackers, DataNodes, TaskTrackers, DataNodes, and HBase Servers can Servers run together. 
can Servers run together. 
can run together. 
Figure 3.6 A table consists of multiple smaller chunks called regions. 
Figure 3.6 A table consists of multiple smaller chunks called regions. 
Figure 3.6 A table consists of multiple smaller chunks called regions. 
Figure 3.6 A table consists of multiple smaller chunks called regions. 
Figure 3.6 A table consists of multiple smaller chunks called regions. 
Figure 3.6 A table consists of multiple smaller chunks called DataNode RegionServer DataNode RegionServer Figure 3.7 HBase RegionServer and HDFS DataNode processes C1 Given that the underlying data is stored in HDFS, which is available to all clients as 
Given that the underlying data is stored in HDFS, which is available to all clients as 
Given that the underlying data is stored in HDFS, which is available to all clients as 
Given that the underlying data is stored in HDFS, which is available to a single namespace, all RegionServers have access to the same persisted files system and can therefore host any region (figure 3.8). By physically collocating Nodes and RegionServers, you can use the data locality property; that is, can theoretically read and write to local DataNode as the primary You may wonder where the TaskTrackers are in this scheme of things. HBase deployments, the MapReduce framework isn’t deployed at all if the primarily random reads and writes. In other deployments, where the MapReduce is also a part of the workloads, TaskTrackers, DataNodes, and HBase Servers can run together. 
can theoretically read and write to the local DataNode as the primary DataNode. 
You may wonder where the TaskTrackers are in this scheme of things. In some 
can theoretically read and write to the local DataNode as the primary DataNode. 
You may wonder where the TaskTrackers are in this scheme of things. In some 
can theoretically read and write to local DataNode as the primary DataNode. 
You may wonder where the TaskTrackers are in this scheme of things. In some 
So what is HBase anyway? 
is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- 
is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- 
a single namespace, all RegionServers have access to the same persisted files in the file 
system and can therefore host any region (figure 3.8). By physically collocating Data- 
Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers 
a single namespace, all RegionServers have access to the same persisted files in the file 
system and can therefore host any region (figure 3.8). By physically collocating Data- 
Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers 
a single namespace, all RegionServers have access to the same persisted files in the file 
system and can therefore host any region (figure 3.8). By physically collocating Data- 
Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers 
Servers can run together. 
Servers can run together. 
store/access data on HDFS. The master process does the distribution of regions among 
RegionServers, and each RegionServer typically hosts multiple regions. 
store/access data on HDFS. The master process does the distribution of regions among 
RegionServers, and each RegionServer typically hosts multiple regions. 
store/access data on HDFS. The master process does the distribution of regions among 
RegionServers, and each RegionServer typically hosts multiple regions. 
store/access data on HDFS. The master process does the distribution of regions RegionServers, and each RegionServer typically hosts multiple regions. 
store/access store/data access on HDFS. data The on master HDFS. process The master does process the distribution does RegionServers, RegionServers, and each RegionServer and each RegionServer typically hosts typically multiple hosts Given that the Given underlying that the underlying data is stored data in is HDFS, stored which in HDFS, is available a single namespace, a single namespace, all RegionServers all RegionServers have access have to the access same to system and system can therefore and can host therefore any region host any (figure region 3.8). (figure By physically 3.8). Nodes and Nodes RegionServers, and RegionServers, you can use you the can data use locality the data property; locality can theoretically can theoretically read and write read to and write local to DataNode local as DataNode the You may wonder You may where wonder the TaskTrackers where the TaskTrackers are in this are scheme HBase deployments, HBase deployments, the MapReduce the MapReduce framework framework isn’t deployed isn’t primarily random primarily reads random and writes. reads and In other writes. deployments, In other deployments, where is also a part is also of the a part workloads, of the workloads, TaskTrackers, TaskTrackers, DataNodes, Servers can Servers run together. 
can run together. 
HBase deployments, the MapReduce framework isn’t deployed at all if the workload is 
primarily random reads and writes. In other deployments, where the MapReduce pro-cessing 
HBase deployments, the MapReduce framework isn’t deployed at all if the workload is 
primarily random reads and writes. In other deployments, where the MapReduce pro-cessing 
HBase deployments, the MapReduce framework isn’t deployed at all if the workload is 
primarily random reads and writes. In other deployments, where the MapReduce pro-cessing 
Given that the underlying data is stored in HDFS, which is available to all clients as 
Given that the underlying data is stored in HDFS, which is available to all clients as 
Given that the underlying data is stored in HDFS, which is available to all clients as 
Given that the underlying data is stored in HDFS, which is available to a single namespace, all RegionServers have access to the same persisted files system and can therefore host any region (figure 3.8). By physically collocating Nodes and RegionServers, you can use the data locality property; that is, can theoretically read and write to local DataNode as the primary You may wonder where the TaskTrackers are in this scheme of things. HBase deployments, the MapReduce framework isn’t deployed at all if the primarily random reads and writes. In other deployments, where the MapReduce is also a part of the workloads, TaskTrackers, DataNodes, and HBase Servers can run together. 
can theoretically read and write to the local DataNode as the primary DataNode. 
You may wonder where the TaskTrackers are in this scheme of things. In some 
can theoretically read and write to the local DataNode as the primary DataNode. 
You may wonder where the TaskTrackers are in this scheme of things. In some 
can theoretically read and write to local DataNode as the primary DataNode. 
You may wonder where the TaskTrackers are in this scheme of things. In some 
is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- 
is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- 
is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- 
DataNode RegionServer DataNode RegionServer DataNode RegionServer 
a single namespace, all RegionServers have access to the same persisted files in the file 
system and can therefore host any region (figure 3.8). By physically collocating Data- 
Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers 
a single namespace, all RegionServers have access to the same persisted files in the file 
system and can therefore host any region (figure 3.8). By physically collocating Data- 
Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers 
a single namespace, all RegionServers have access to the same persisted files in the file 
system and can therefore host any region (figure 3.8). By physically collocating Data- 
Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers 
Servers can run together. 
DataNode RegionServer DataNode RegionServer DataNode RegionServer 
DataNode RegionServer DataNode RegionServer Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically Licensed to Nick Dimiduk <ndimiduk@gmail.com> 
Servers can run together. 
HBase deployments, the MapReduce framework isn’t deployed at all if the workload is 
primarily random reads and writes. In other deployments, where the MapReduce pro-cessing 
HBase deployments, the MapReduce framework isn’t deployed at all if the workload is 
primarily random reads and writes. In other deployments, where the MapReduce pro-cessing 
HBase deployments, the MapReduce framework isn’t deployed at all if the workload is 
C1 tree C0 tree 
primarily Disk Memory 
C1 tree C0 tree 
random reads and writes. In other deployments, where the MapReduce pro-cessing 
Disk Memory Cache 
C1 tree C0 tree 
can theoretically read and write to the local DataNode as the primary DataNode. 
You may wonder where the TaskTrackers are in this scheme of things. In some 
can theoretically read and write to the local DataNode as the primary DataNode. 
You may wonder where the TaskTrackers are in this scheme of things. In some 
can theoretically read and write to local DataNode as the primary DataNode. 
You may wonder where the TaskTrackers are in this scheme of things. In some 
is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- 
is also Disk Memory 
C1 tree C0 tree 
a part of the workloads, TaskTrackers, DataNodes, and HBase Region- 
C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory Cache 
C1 tree C0 tree 
is also a part of Disk C1 tree the Memory 
C0 tree 
workloads, TaskTrackers, DataNodes, and HBase Region- 
DataNode RegionServer DataNode RegionServer C1 tree C0 tree 
C1 tree C0 tree 
Disk Memory 
Disk Memory 
C1 tree C0 tree 
C1 tree C0 tree 
Disk Memory Cache 
Disk Memory Cache 
C1 tree C0 tree 
C1 tree C0 tree 
Disk Memory 
Disk Memory 
C1 tree C0 tree 
C1 tree C0 tree 
Figure 3.7 HBase RegionServer and HDFS DataNode processes Licensed to Nick Dimiduk <ndimiduk@DataNode RegionServer DataNode RegionServer DataNode RegionServer 
Disk Memory 
DataNode RegionServer DataNode RegionServer DataNode RegionServer 
C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory Cache 
C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
DataNode RegionServer DataNode RegionServer DataNode RegionServer 
Disk Memory 
Servers can run together. 
HBase deployments, the MapReduce framework isn’t deployed at all if the workload is 
primarily random reads and writes. In other deployments, where the MapReduce pro-cessing 
HBase deployments, the MapReduce framework isn’t deployed at all if the workload is 
C1 tree C0 tree 
primarily Disk Memory 
C1 tree C0 tree 
random reads and writes. In other deployments, where the MapReduce pro-cessing 
Disk Memory Cache 
Servers can run together. 
HBase deployments, the MapReduce framework isn’t deployed at all if the workload is 
C1 tree C0 tree 
C1 tree C0 tree 
primarily Disk Memory 
C1 tree C0 tree 
random Disk Memory 
C1 tree C0 tree 
reads and writes. In other deployments, where the MapReduce pro-cessing 
Disk Memory Cache 
Disk Memory Cache 
DataNode RegionServer DataNode RegionServer DataNode RegionServer 
DataNode RegionServer DataNode RegionServer Disk Memory 
Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically Licensed to Nick Dimiduk <ndimiduk@gmail.com> 
DataNode RegionServer DataNode RegionServer Disk Memory 
Disk Memory 
C1 Figure 3.7 HBase RegionServer and HDFS DataNode processes Licensed to Nick Dimiduk <ndimiduk@C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory Cache 
C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory Cache 
C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory Cache 
Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. 
Figure C1 tree C0 tree 
3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. 
is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- 
is also Disk Memory 
C1 tree C0 tree 
a part of the workloads, TaskTrackers, DataNodes, and HBase Region- 
is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- 
DataNode RegionServer DataNode RegionServer DataNode RegionServer 
Disk Memory 
Figure C1 tree C0 tree 
3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on Disk Memory 
C1 tree C0 tree 
Licensed to Nick Dimiduk <ndimiduk@gmail.com> 
DataNode RegionServer DataNode RegionServer DataNode RegionServer 
Disk Memory 
C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
DataNode RegionServer DataNode RegionServer DataNode RegionServer 
DataNode RegionServer DataNode RegionServer Disk Memory 
C1 Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
DataNode RegionServer DataNode RegionServer Disk Memory 
C1 Figure 3.7 HBase RegionServer and HDFS DataNode processes tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory 
Servers can run together. 
Servers can run together. 
C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory Cache 
Servers can run together. 
C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory Cache 
C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory Cache 
C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory Cache 
C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory Cache 
C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory Cache 
Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. 
Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. 
Figure C1 tree C0 tree 
3.7 HBase C1 tree C0 tree 
RegionServer tree C0 tree 
and HDFS tree DataNode C0 tree 
processes are typically collocated on Disk Memory 
Disk Memory 
Disk Memory 
Disk Memory 
C1 tree C0 tree 
C1 tree C0 tree 
C1 tree C0 tree 
C1 tree C0 tree 
Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on Licensed to Nick Dimiduk <ndimiduk@gmail.com> 
DataNode RegionServer DataNode RegionServer DataNode RegionServer 
DataNode RegionServer DataNode RegionServer DataNode RegionServer 
DataNode RegionServer DataNode RegionServer DataNode RegionServer 
Disk Memory 
C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory Cache 
DataNode RegionServer DataNode RegionServer C1 tree C0 tree 
C1 tree C0 tree 
C1 tree C0 tree 
Disk Memory 
Disk Memory 
Disk Memory 
C1 tree C0 tree 
C1 tree C0 tree 
C1 tree C0 tree 
Disk Memory Cache 
Disk Memory Cache 
Disk Memory Cache 
C1 Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically C1 tree C0 tree 
C1 Disk Memory 
C1 tree C0 tree 
Disk Memory 
Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on DataNode RegionServer DataNode RegionServer DataNode RegionServer 
Disk Memory 
Disk Memory 
tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory 
tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory 
Licensed under a Creative Commons 
Attribution-ShareAlike 3.0 Unported License. 
Page 11 
primarily random reads and writes. In other deployments, where the MapReduce pro-cessing 
is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- 
Servers can run together. 
DataNode RegionServer DataNode RegionServer DataNode RegionServer 
C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory Cache 
C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory 
Figure 3.7 HBase C1 tree C0 tree 
RegionServer and HDFS DataNode processes are typically collocated on the same host. 
Disk Memory 
C1 tree C0 tree 
Disk Memory 
tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory 
Licensed to Nick Dimiduk <ndimiduk@gmail.com> 
Disk Memory 
Licensed to Nick Dimiduk <ndimiduk@gmail.com> 
Licensed to Nick Dimiduk <ndimiduk@gmail.com> 
Servers can run together. 
DataNode RegionServer DataNode RegionServer DataNode RegionServer 
Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. 
C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory Cache 
tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory Cache 
C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory 
Licensed to Nick Dimiduk <ndimiduk@gmail.com> 
tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory 
Licensed to Nick Dimiduk <ndimiduk@gmail.com> 
Licensed to Nick Dimiduk <ndimiduk@gmail.com> 
Servers can run together. 
DataNode RegionServer DataNode RegionServer DataNode RegionServer 
C1 Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. 
Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. 
Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. 
C1 Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. 
Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. 
Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host.
RegionServers, and each RegionServer typically hosts multiple regions. 
RegionServers, and each RegionServer typically hosts multiple regions. 
RegionServers, and each RegionServer typically hosts multiple regions. 
RegionServers, and each RegionServer typically hosts multiple regions. 
RegionServers, RegionServers, and each RegionServer and each RegionServer typically hosts typically multiple hosts Given that the Given underlying that the underlying data is stored data in is HDFS, stored which in HDFS, is available a single namespace, a single namespace, all RegionServers all RegionServers have access have to the access same to system and system can therefore and can host therefore any region host any (figure region 3.8). (figure By physically 3.8). Nodes and Nodes RegionServers, and RegionServers, you can use you the can data use locality the data property; locality can theoretically can theoretically read and write read to and write local to DataNode local as DataNode the You may wonder You may where wonder the TaskTrackers where the TaskTrackers are in this are scheme HBase deployments, HBase deployments, the MapReduce the MapReduce framework framework isn’t deployed isn’t primarily random primarily reads random and writes. reads and In other writes. deployments, In other deployments, where is also a part is also of the a part workloads, of the workloads, TaskTrackers, TaskTrackers, DataNodes, Servers can Servers run together. 
can run together. 
Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers 
Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers 
Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers 
Nodes and Nodes RegionServers, and Nodes RegionServers, you and can RegionServers, use you the can data use locality you the can data property; use locality the data that property; is, locality can theoretically can theoretically read and can write theoretically read to and write local read to DataNode and write local to as DataNode the primary local as DataNode the You may wonder You may where wonder You the may TaskTrackers where wonder the TaskTrackers where are in the this TaskTrackers are scheme in this of are scheme things. HBase deployments, HBase deployments, the HBase MapReduce deployments, the MapReduce framework the MapReduce framework isn’t deployed framework isn’t at deployed all if the isn’t primarily random primarily reads random primarily and writes. reads random and In other writes. reads deployments, and In other writes. deployments, In where other the deployments, MapReduce where is also a part is also of the a part workloads, is also of the a part workloads, TaskTrackers, of the workloads, TaskTrackers, DataNodes, TaskTrackers, DataNodes, and HBase Servers can Servers run together. 
can Servers run together. 
can run together. 
primarily random reads and writes. In other deployments, where the MapReduce pro-cessing 
primarily random reads and writes. In other deployments, where the MapReduce pro-cessing 
primarily random primarily reads random primarily and writes. reads random and In other writes. reads deployments, and In other writes. deployments, In where other the deployments, MapReduce where is also a part is also of the a part workloads, is also of the a part workloads, TaskTrackers, of the workloads, TaskTrackers, DataNodes, TaskTrackers, DataNodes, and HBase Servers can Servers run together. 
can Servers run together. 
can run together. 
Figure 3.6 A table consists of multiple smaller chunks called regions. 
Figure 3.6 A table consists of multiple smaller chunks called regions. 
Figure 3.6 A table consists of multiple smaller chunks called regions. 
Figure 3.6 A table consists of multiple smaller chunks called regions. 
Figure 3.6 A table consists of multiple smaller chunks called regions. 
Figure 3.6 A table consists of multiple smaller chunks called DataNode RegionServer DataNode RegionServer Figure 3.7 HBase RegionServer and HDFS DataNode processes C1 Given that the underlying data is stored in HDFS, which is available to all clients as 
Given that the underlying data is stored in HDFS, which is available to all clients as 
Given that the underlying data is stored in HDFS, which is available to all clients as 
Given that the underlying data is stored in HDFS, which is available to a single namespace, all RegionServers have access to the same persisted files system and can therefore host any region (figure 3.8). By physically collocating Nodes and RegionServers, you can use the data locality property; that is, can theoretically read and write to local DataNode as the primary You may wonder where the TaskTrackers are in this scheme of things. HBase deployments, the MapReduce framework isn’t deployed at all if the primarily random reads and writes. In other deployments, where the MapReduce is also a part of the workloads, TaskTrackers, DataNodes, and HBase Servers can run together. 
can theoretically read and write to the local DataNode as the primary DataNode. 
You may wonder where the TaskTrackers are in this scheme of things. In some 
can theoretically read and write to the local DataNode as the primary DataNode. 
You may wonder where the TaskTrackers are in this scheme of things. In some 
can theoretically read and write to local DataNode as the primary DataNode. 
You may wonder where the TaskTrackers are in this scheme of things. In some 
So what is HBase anyway? 
is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- 
is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- 
a single namespace, all RegionServers have access to the same persisted files in the file 
system and can therefore host any region (figure 3.8). By physically collocating Data- 
Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers 
a single namespace, all RegionServers have access to the same persisted files in the file 
system and can therefore host any region (figure 3.8). By physically collocating Data- 
Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers 
a single namespace, all RegionServers have access to the same persisted files in the file 
system and can therefore host any region (figure 3.8). By physically collocating Data- 
Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers 
Servers can run together. 
Servers can run together. 
store/access data on HDFS. The master process does the distribution of regions among 
RegionServers, and each RegionServer typically hosts multiple regions. 
store/access data on HDFS. The master process does the distribution of regions among 
RegionServers, and each RegionServer typically hosts multiple regions. 
store/access data on HDFS. The master process does the distribution of regions among 
RegionServers, and each RegionServer typically hosts multiple regions. 
store/access data on HDFS. The master process does the distribution of regions RegionServers, and each RegionServer typically hosts multiple regions. 
store/access store/data access on HDFS. data The on master HDFS. process The master does process the distribution does RegionServers, RegionServers, and each RegionServer and each RegionServer typically hosts typically multiple hosts Given that the Given underlying that the underlying data is stored data in is HDFS, stored which in HDFS, is available a single namespace, a single namespace, all RegionServers all RegionServers have access have to the access same to system and system can therefore and can host therefore any region host any (figure region 3.8). (figure By physically 3.8). Nodes and Nodes RegionServers, and RegionServers, you can use you the can data use locality the data property; locality can theoretically can theoretically read and write read to and write local to DataNode local as DataNode the You may wonder You may where wonder the TaskTrackers where the TaskTrackers are in this are scheme HBase deployments, HBase deployments, the MapReduce the MapReduce framework framework isn’t deployed isn’t primarily random primarily reads random and writes. reads and In other writes. deployments, In other deployments, where is also a part is also of the a part workloads, of the workloads, TaskTrackers, TaskTrackers, DataNodes, Servers can Servers run together. 
can run together. 
HBase deployments, the MapReduce framework isn’t deployed at all if the workload is 
primarily random reads and writes. In other deployments, where the MapReduce pro-cessing 
HBase deployments, the MapReduce framework isn’t deployed at all if the workload is 
primarily random reads and writes. In other deployments, where the MapReduce pro-cessing 
HBase deployments, the MapReduce framework isn’t deployed at all if the workload is 
primarily random reads and writes. In other deployments, where the MapReduce pro-cessing 
Given that the underlying data is stored in HDFS, which is available to all clients as 
Given that the underlying data is stored in HDFS, which is available to all clients as 
Given that the underlying data is stored in HDFS, which is available to all clients as 
Given that the underlying data is stored in HDFS, which is available to a single namespace, all RegionServers have access to the same persisted files system and can therefore host any region (figure 3.8). By physically collocating Nodes and RegionServers, you can use the data locality property; that is, can theoretically read and write to local DataNode as the primary You may wonder where the TaskTrackers are in this scheme of things. HBase deployments, the MapReduce framework isn’t deployed at all if the primarily random reads and writes. In other deployments, where the MapReduce is also a part of the workloads, TaskTrackers, DataNodes, and HBase Servers can run together. 
can theoretically read and write to the local DataNode as the primary DataNode. 
You may wonder where the TaskTrackers are in this scheme of things. In some 
can theoretically read and write to the local DataNode as the primary DataNode. 
You may wonder where the TaskTrackers are in this scheme of things. In some 
can theoretically read and write to local DataNode as the primary DataNode. 
You may wonder where the TaskTrackers are in this scheme of things. In some 
is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- 
is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- 
is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- 
DataNode RegionServer DataNode RegionServer DataNode RegionServer 
a single namespace, all RegionServers have access to the same persisted files in the file 
system and can therefore host any region (figure 3.8). By physically collocating Data- 
Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers 
a single namespace, all RegionServers have access to the same persisted files in the file 
system and can therefore host any region (figure 3.8). By physically collocating Data- 
Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers 
a single namespace, all RegionServers have access to the same persisted files in the file 
system and can therefore host any region (figure 3.8). By physically collocating Data- 
Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers 
Servers can run together. 
DataNode RegionServer DataNode RegionServer DataNode RegionServer 
DataNode RegionServer DataNode RegionServer Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically Licensed to Nick Dimiduk <ndimiduk@gmail.com> 
Servers can run together. 
HBase deployments, the MapReduce framework isn’t deployed at all if the workload is 
primarily random reads and writes. In other deployments, where the MapReduce pro-cessing 
HBase deployments, the MapReduce framework isn’t deployed at all if the workload is 
primarily random reads and writes. In other deployments, where the MapReduce pro-cessing 
HBase deployments, the MapReduce framework isn’t deployed at all if the workload is 
C1 tree C0 tree 
primarily Disk Memory 
C1 tree C0 tree 
random reads and writes. In other deployments, where the MapReduce pro-cessing 
Disk Memory Cache 
C1 tree C0 tree 
can theoretically read and write to the local DataNode as the primary DataNode. 
You may wonder where the TaskTrackers are in this scheme of things. In some 
can theoretically read and write to the local DataNode as the primary DataNode. 
You may wonder where the TaskTrackers are in this scheme of things. In some 
can theoretically read and write to local DataNode as the primary DataNode. 
You may wonder where the TaskTrackers are in this scheme of things. In some 
is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- 
is also Disk Memory 
C1 tree C0 tree 
a part of the workloads, TaskTrackers, DataNodes, and HBase Region- 
C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory Cache 
C1 tree C0 tree 
is also a part of Disk C1 tree the Memory 
C0 tree 
workloads, TaskTrackers, DataNodes, and HBase Region- 
DataNode RegionServer DataNode RegionServer C1 tree C0 tree 
C1 tree C0 tree 
Disk Memory 
Disk Memory 
C1 tree C0 tree 
C1 tree C0 tree 
Disk Memory Cache 
Disk Memory Cache 
C1 tree C0 tree 
C1 tree C0 tree 
Disk Memory 
Disk Memory 
C1 tree C0 tree 
C1 tree C0 tree 
Figure 3.7 HBase RegionServer and HDFS DataNode processes Licensed to Nick Dimiduk <ndimiduk@DataNode RegionServer DataNode RegionServer DataNode RegionServer 
Disk Memory 
DataNode RegionServer DataNode RegionServer DataNode RegionServer 
C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory Cache 
C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
DataNode RegionServer DataNode RegionServer DataNode RegionServer 
Disk Memory 
Servers can run together. 
HBase deployments, the MapReduce framework isn’t deployed at all if the workload is 
primarily random reads and writes. In other deployments, where the MapReduce pro-cessing 
HBase deployments, the MapReduce framework isn’t deployed at all if the workload is 
C1 tree C0 tree 
primarily Disk Memory 
C1 tree C0 tree 
random reads and writes. In other deployments, where the MapReduce pro-cessing 
Disk Memory Cache 
Servers can run together. 
HBase deployments, the MapReduce framework isn’t deployed at all if the workload is 
C1 tree C0 tree 
C1 tree C0 tree 
primarily Disk Memory 
C1 tree C0 tree 
random Disk Memory 
C1 tree C0 tree 
reads and writes. In other deployments, where the MapReduce pro-cessing 
Disk Memory Cache 
Disk Memory Cache 
DataNode RegionServer DataNode RegionServer DataNode RegionServer 
DataNode RegionServer DataNode RegionServer Disk Memory 
Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically Licensed to Nick Dimiduk <ndimiduk@gmail.com> 
DataNode RegionServer DataNode RegionServer Disk Memory 
Disk Memory 
C1 Figure 3.7 HBase RegionServer and HDFS DataNode processes Licensed to Nick Dimiduk <ndimiduk@C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory Cache 
C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory Cache 
C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory Cache 
Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. 
Figure C1 tree C0 tree 
3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. 
is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- 
is also Disk Memory 
C1 tree C0 tree 
a part of the workloads, TaskTrackers, DataNodes, and HBase Region- 
is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- 
DataNode RegionServer DataNode RegionServer DataNode RegionServer 
Disk Memory 
Figure C1 tree C0 tree 
3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on Disk Memory 
C1 tree C0 tree 
Licensed to Nick Dimiduk <ndimiduk@gmail.com> 
DataNode RegionServer DataNode RegionServer DataNode RegionServer 
Disk Memory 
C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
DataNode RegionServer DataNode RegionServer DataNode RegionServer 
DataNode RegionServer DataNode RegionServer Disk Memory 
C1 Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
DataNode RegionServer DataNode RegionServer Disk Memory 
C1 Figure 3.7 HBase RegionServer and HDFS DataNode processes tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory 
Servers can run together. 
Servers can run together. 
C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory Cache 
Servers can run together. 
C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory Cache 
C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory Cache 
C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory Cache 
C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory Cache 
C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory Cache 
Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. 
Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. 
Figure C1 tree C0 tree 
3.7 HBase C1 RegionServer and HDFS DataNode processes are typically collocated on Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on Licensed to Nick Dimiduk <ndimiduk@gmail.com> 
"2011-07-04" Disk Memory 
C1 tree C0 tree 
1368396302 "fourth of July" 
tree C0 tree 
Disk Memory 
C1 tree C0 tree 
tree C0 tree 
Disk Memory 
C1 tree C0 tree 
tree C0 tree 
Disk Memory 
C1 tree C0 tree 
DataNode RegionServer DataNode RegionServer DataNode RegionServer 
DataNode RegionServer DataNode RegionServer DataNode RegionServer 
DataNode RegionServer DataNode RegionServer DataNode RegionServer 
Disk Memory 
C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory Cache 
DataNode RegionServer DataNode RegionServer C1 tree C0 tree 
C1 tree C0 tree 
C1 tree C0 tree 
Disk Memory 
Disk Memory 
Disk Memory 
C1 tree C0 tree 
C1 tree C0 tree 
C1 tree C0 tree 
Disk Memory Cache 
Disk Memory Cache 
Disk Memory Cache 
C1 Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically C1 tree C0 tree 
C1 Disk Memory 
C1 tree C0 tree 
Disk Memory 
Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on DataNode RegionServer DataNode RegionServer DataNode RegionServer 
Disk Memory 
Disk Memory 
tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory 
tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory 
Licensed under a Creative Commons 
Attribution-ShareAlike 3.0 Unported License. 
Page 12 
primarily random reads and writes. In other deployments, where the MapReduce pro-cessing 
is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- 
Servers can run together. 
DataNode RegionServer DataNode RegionServer DataNode RegionServer 
C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory Cache 
C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory 
Figure 3.7 HBase C1 tree C0 tree 
RegionServer and HDFS DataNode processes are typically collocated on the same host. 
Disk Memory 
C1 tree C0 tree 
Disk Memory 
tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory 
Licensed to Nick Dimiduk <ndimiduk@gmail.com> 
Disk Memory 
Licensed to Nick Dimiduk <ndimiduk@gmail.com> 
Licensed to Nick Dimiduk <ndimiduk@gmail.com> 
Servers can run together. 
DataNode RegionServer DataNode RegionServer DataNode RegionServer 
Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. 
C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory Cache 
tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory Cache 
C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory 
Licensed to Nick Dimiduk <ndimiduk@gmail.com> 
tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory 
Licensed to Nick Dimiduk <ndimiduk@gmail.com> 
Licensed to Nick Dimiduk <ndimiduk@gmail.com> 
Servers can run together. 
DataNode RegionServer DataNode RegionServer DataNode RegionServer 
C1 Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. 
Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. 
Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. 
C1 Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. 
Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. 
Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. 
a 
cf1 
1368394583 7 
1368394261 "hello" 
"bar" 
1368394583 22 
1368394925 13.6 
1368393847 "world" 
"foo" 
cf2 
1.0001 1368387684 "almost the loneliest number"
High-level Architecture 
Licensed under a Creative Commons 
Attribution-ShareAlike 3.0 Unported License. 
Page 13
Logical Data Model 
1368394583 7 
1368394261 "hello" 
"bar" 
1368394583 22 
1368394925 13.6 
1368393847 "world" 
"foo" 
"2011-07-04" 1368396302 "fourth of July" 
1.0001 1368387684 "almost the loneliest number" 
Licensed under a Creative Commons 
Attribution-ShareAlike 3.0 Unported License. 
Page 14 
a 
cf1 
cf2 
b cf2 "thumb" 1368387247 [3.6 kb png data] 
Table A 
rowkey 
column 
family 
column 
qualifier 
timestamp value 
Rows 
Column Families
Logical Architecture 
Licensed under a Creative Commons 
Attribution-ShareAlike 3.0 Unported License. 
Page 15 
Table A 
a 
b 
c 
d 
e 
f 
g 
h 
i 
j 
k 
l 
m 
n 
o 
p 
Region 1 
Region 2 
Region 3 
Region 4 
Region Server 7 
Table A, Region 1 
Table A, Region 2 
Table G, Region 1070 
Table L, Region 25 
Region Server 86 
Table A, Region 3 
Table C, Region 30 
Table F, Region 160 
Table F, Region 776 
Region Server 367 
Table A, Region 4 
Table C, Region 17 
Table E, Region 52 
Table P, Region 1116
Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers 
and RegionServers, you can use the data locality property; that can theoretically read and write to the local DataNode as the primary You may wonder where the TaskTrackers are in this scheme of HBase deployments, the MapReduce framework isn’t deployed at all primarily random reads and writes. In other deployments, where the is also a part of the workloads, TaskTrackers, DataNodes, and Servers can run together. 
Nodes Nodes DataNode RegionServer DataNode DataNode RegionServer RegionServer DataNode DataNode RegionServer Figure 3.7 HBase RegionServer Figure 3.7 and HDFS HBase DataNode RegionServer processes and HDFS are typically DataNode collocated processes and RegionServers, you can use the data locality property; can theoretically read and write to the local DataNode You may wonder where the TaskTrackers are in this HBase deployments, the MapReduce framework isn’t deployed primarily random reads and writes. In other deployments, is also a part of the workloads, TaskTrackers, DataNodes, Servers can run together. 
Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same Nodes and RegionServers, you can use the data can theoretically read and write to the local You may wonder where the TaskTrackers HBase deployments, the MapReduce framework primarily random reads and writes. In other deployments, is also a part of the workloads, TaskTrackers, Servers can run together. 
system and can therefore host any region (figure 3.8). By physically collocating Data- 
Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers 
Physical Architecture 
can theoretically read and write to the local DataNode as the primary DataNode. 
You may wonder where the TaskTrackers are in this scheme of things. HBase deployments, the MapReduce framework isn’t deployed at all if the workload primarily random reads and writes. In other deployments, where the MapReduce is also a part of the workloads, TaskTrackers, DataNodes, and HBase Servers can run together. 
can theoretically read and write to the local DataNode as the primary DataNode. 
You may wonder where the TaskTrackers are in this scheme of things. In some 
HBase deployments, the MapReduce framework isn’t deployed at all if the workload is 
primarily random reads and writes. In other deployments, where the MapReduce pro-cessing 
is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- 
Zoo 
Keeper 
Zoo 
Keeper 
DataNode RegionServer DataNode RegionServer DataNode RegionServer 
DataNode RegionServer DataNode RegionServer Region 
Region 
Region 
Server 
Server 
Server 
Data 
Data 
Data 
Node 
Node 
Node 
Figure Licensed Attribution-3.7 under a ShareAlike HBase Creative Commons 
3.0 Unported RegionServer License. 
and HDFS DataNode processes Page 16 
are typically Region 
Server 
Data 
Node 
... 
can theoretically read and write You may wonder where the TaskTrackers HBase deployments, the MapReduce primarily random reads and writes. is also a part of the workloads, Servers can run together. 
DataNode RegionServer Master 
Master 
Figure 3.7 HBase RegionServer and HDFS Servers can run together. 
DataNode RegionServer DataNode RegionServer DataNode RegionServer 
Name 
Node 
DataNode RegionServer DataNode RegionServer HBase 
Client 
Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically Licensed to Nick Dimiduk <ndimiduk@gmail.HDFS 
HBase
User API 
• {rowkey => {family => {qualifier => {version => value}}}} 
– Think: nested TreeMap (Java), OrderedDictionary (C#), OrderedDict (Python) 
• Basic data operations: GET, PUT, DELETE 
• SCAN over range of key-values 
– benefit of the sorted rowkey business 
– this is how you implement any kind of "complex query” * 
Licensed under a Creative Commons 
Attribution-ShareAlike 3.0 Unported License. 
• GET, SCAN support Filters 
– Push application logic to RegionServers 
• INCREMENT, APPEND, CheckAnd{Put,Delete} 
– Server-side, atomic data operations, can be contentious! 
Page 17 
* This is also a foundational component in what we refer to as 
“schema design” in this “schemaless” database.
Anatomy of a RegionServer 
Licensed under a Creative Commons 
Attribution-ShareAlike 3.0 Unported License. 
Page 18
So what is HBase anyway? 
C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory Cache 
Licensed under a Creative Commons 
Attribution-ShareAlike 3.0 Unported License. 
Page 19 
DataNode RegionServer C1 tree C0 tree 
Disk Memory 
C1 tree C0 tree 
Disk Memory
Storage Machinery 
Licensed under a Creative Commons 
Attribution-ShareAlike 3.0 Unported License. 
Page 20 
RegionServer (HBase) 
DataNode (Hadoop DFS) 
HLog 
(WAL) 
HRegion 
HStore 
StoreFile 
HFile 
StoreFile 
HFile 
MemStore 
... 
... 
HStore 
BlockCache 
HRegion 
HStore HStore 
... 
...
Storage Machinery 
Licensed under a Creative Commons 
Attribution-ShareAlike 3.0 Unported License. 
Page 21 
RegionServer (HBase) 
DataNode (Hadoop DFS) 
HLog 
(WAL) 
HRegion 
HStore 
StoreFile 
HFile 
StoreFile 
HFile 
MemStore 
... 
... 
HStore 
BlockCache 
HRegion 
HStore HStore 
... 
... 
C1 
C0 
C1 
C0 
C1 
C0 
C1 
C0 
Cache
Storage Machinery: Write Path 
Licensed under a Creative Commons 
Attribution-ShareAlike 3.0 Unported License. 
Page 22 
RegionServer (HBase) 
DataNode (Hadoop DFS) 
HLog 
(WAL) 
HRegion 
HStore 
StoreFile 
HFile 
StoreFile 
HFile 
MemStore 
... 
... 
HStore 
BlockCache 
HRegion 
HStore HStore 
... 
... 
1 
2 
3 
4 
5
Storage Machinery: Read Path 
Licensed under a Creative Commons 
Attribution-ShareAlike 3.0 Unported License. 
Page 23 
RegionServer (HBase) 
DataNode (Hadoop DFS) 
HLog 
(WAL) 
HRegion 
HStore 
StoreFile 
HFile 
StoreFile 
HFile 
MemStore 
... 
... 
HStore 
BlockCache 
HRegion 
HStore HStore 
... 
... 
1 5 
2 
3 
3 
2 
4
Licensed under a Creative Commons 
Attribution-ShareAlike 3.0 Unported License. 
By Example 
Page 24
Database Dichotomy 
Latency 
Predicate 
Pushdown 
Licensed under a Creative Commons 
Attribution-ShareAlike 3.0 Unported License. 
Page 25 
Row+Column 
Bloomfilter 
Lossy 
WAL Durability 
Smart Rowkey 
Design 
Compression 
Write Read 
Throughput 
Larger 
Block Size 
Bulkload 
HFiles 
Compression 
Compression 
Compression 
Smaller 
Block Size 
Larger 
BlockCache 
Increase 
Blocking Storefiles 
Increased 
Scanner Caching 
Larger 
Flush Size 
Smart Rowkey 
Design 
Smart Rowkey 
Design 
Smart Rowkey 
Design
Web-scale Database 
Licensed under a Creative Commons 
Attribution-ShareAlike 3.0 Unported License. 
Page 26 
App 
server 
App 
server 
App 
server 
App 
server 
App 
server 
Counters 
Sessions 
User profiles 
Social Media 
Application 
Data 
Latency 
Write Read 
Throughput
Search Search 
Licensed under a Creative Commons 
Attribution-ShareAlike 3.0 Unported License. 
“BigIndex” 
Page 27 
"BigIndex" 
Document 
store 
Search Search 
Search 
App 
server 
App 
server 
App 
server 
App 
server 
App 
server 
Latency 
Write Read 
Throughput
Materialized View 
Licensed under a Creative Commons 
Attribution-ShareAlike 3.0 Unported License. 
Page 28 
App 
server 
App 
server 
App 
server 
App 
server 
App 
server 
Latency 
Write Read 
Throughput
Licensed under a Creative Commons 
Attribution-ShareAlike 3.0 Unported License. 
ETL Assist 
Write Read 
Page 29 
Latency 
Throughput
Lambda Architecture 
Licensed under a Creative Commons 
Attribution-ShareAlike 3.0 Unported License. 
Page 30 
App 
server 
App 
server 
App 
server 
App 
server 
App 
server
Licensed under a Creative Commons 
Attribution-ShareAlike 3.0 Unported License. 
Resources 
Page 31
Join the Community! 
• hbase.apache.org 
– hbase.apache.org/book.html 
– blogs.apache.org/hbase/ 
– hbase.apache.org/mail-lists.html 
• IRC: irc.freenode.net #hbase 
• JIRA: issues.apache.org/jira/browse/HBASE 
• Source: git clone git://git.apache.org/hbase.git 
• In person 
– HBaseCon, hbasecon.com 
– Hadoop Summit, hadoopsummit.org 
– Strata / Hadoop World, strataconf.com 
– Local meetup near you! 
Licensed under a Creative Commons 
Attribution-ShareAlike 3.0 Unported License. 
Page 32
Licensed under a Creative Commons 
Attribution-ShareAlike 3.0 Unported License. 
HBase 0.98/1.0 
• Hardening 
– Stability, Reliability, Availability, Performance 
• Horizontal Scalability 
– 1000’s of machines 
• Availability 
– Speed of Recovery (MTTR), Region Replicas 
• Improved Multi-tenancy 
– RPC priorities/QoS, namespace management 
• Cell-level security 
• Semantic Versioning 
• Client API cleanup 
Page 33
Thanks! 
Licensed under a Creative Commons 
Attribution-ShareAlike 3.0 Unported License. 
Page 34 
M A N N I N G 
Nick Dimiduk 
Amandeep Khurana 
FOREWORD BY 
Michael Stack 
hbaseinaction.com 
Nick Dimiduk 
github.com/ndimiduk 
@xefyr 
n10k.com 
strataeucftw 
slideshare.net/xefyr/hbase-for-architects

Más contenido relacionado

La actualidad más candente

Hadoop World 2011: Advanced HBase Schema Design - Lars George, Cloudera
Hadoop World 2011: Advanced HBase Schema Design - Lars George, ClouderaHadoop World 2011: Advanced HBase Schema Design - Lars George, Cloudera
Hadoop World 2011: Advanced HBase Schema Design - Lars George, ClouderaCloudera, Inc.
 
Hadoop and HBase in the Real World
Hadoop and HBase in the Real WorldHadoop and HBase in the Real World
Hadoop and HBase in the Real WorldCloudera, Inc.
 
Facebook keynote-nicolas-qcon
Facebook keynote-nicolas-qconFacebook keynote-nicolas-qcon
Facebook keynote-nicolas-qconYiwei Ma
 
Apache HBase - Just the Basics
Apache HBase - Just the BasicsApache HBase - Just the Basics
Apache HBase - Just the BasicsHBaseCon
 
HBase for Architects
HBase for ArchitectsHBase for Architects
HBase for ArchitectsNick Dimiduk
 
HBase Advanced Schema Design - Berlin Buzzwords - June 2012
HBase Advanced Schema Design - Berlin Buzzwords - June 2012HBase Advanced Schema Design - Berlin Buzzwords - June 2012
HBase Advanced Schema Design - Berlin Buzzwords - June 2012larsgeorge
 
HBase Read High Availability Using Timeline Consistent Region Replicas
HBase  Read High Availability Using Timeline Consistent Region ReplicasHBase  Read High Availability Using Timeline Consistent Region Replicas
HBase Read High Availability Using Timeline Consistent Region Replicasenissoz
 
HBase: Just the Basics
HBase: Just the BasicsHBase: Just the Basics
HBase: Just the BasicsHBaseCon
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseenissoz
 
Geo-based content processing using hbase
Geo-based content processing using hbaseGeo-based content processing using hbase
Geo-based content processing using hbaseRavi Veeramachaneni
 
Apache HBase Performance Tuning
Apache HBase Performance TuningApache HBase Performance Tuning
Apache HBase Performance TuningLars Hofhansl
 
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, ClouderaHBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, ClouderaCloudera, Inc.
 
Session 01 - Into to Hadoop
Session 01 - Into to HadoopSession 01 - Into to Hadoop
Session 01 - Into to HadoopAnandMHadoop
 
Apache Hadoop YARN, NameNode HA, HDFS Federation
Apache Hadoop YARN, NameNode HA, HDFS FederationApache Hadoop YARN, NameNode HA, HDFS Federation
Apache Hadoop YARN, NameNode HA, HDFS FederationAdam Kawa
 
HBaseCon 2013: Compaction Improvements in Apache HBase
HBaseCon 2013: Compaction Improvements in Apache HBaseHBaseCon 2013: Compaction Improvements in Apache HBase
HBaseCon 2013: Compaction Improvements in Apache HBaseCloudera, Inc.
 

La actualidad más candente (20)

Apache hadoop hbase
Apache hadoop hbaseApache hadoop hbase
Apache hadoop hbase
 
Hadoop World 2011: Advanced HBase Schema Design - Lars George, Cloudera
Hadoop World 2011: Advanced HBase Schema Design - Lars George, ClouderaHadoop World 2011: Advanced HBase Schema Design - Lars George, Cloudera
Hadoop World 2011: Advanced HBase Schema Design - Lars George, Cloudera
 
Hadoop and HBase in the Real World
Hadoop and HBase in the Real WorldHadoop and HBase in the Real World
Hadoop and HBase in the Real World
 
NoSQL: Cassadra vs. HBase
NoSQL: Cassadra vs. HBaseNoSQL: Cassadra vs. HBase
NoSQL: Cassadra vs. HBase
 
Facebook keynote-nicolas-qcon
Facebook keynote-nicolas-qconFacebook keynote-nicolas-qcon
Facebook keynote-nicolas-qcon
 
Hbase: an introduction
Hbase: an introductionHbase: an introduction
Hbase: an introduction
 
Apache HBase - Just the Basics
Apache HBase - Just the BasicsApache HBase - Just the Basics
Apache HBase - Just the Basics
 
Hadoop HDFS
Hadoop HDFSHadoop HDFS
Hadoop HDFS
 
HBase for Architects
HBase for ArchitectsHBase for Architects
HBase for Architects
 
Apache HBase™
Apache HBase™Apache HBase™
Apache HBase™
 
HBase Advanced Schema Design - Berlin Buzzwords - June 2012
HBase Advanced Schema Design - Berlin Buzzwords - June 2012HBase Advanced Schema Design - Berlin Buzzwords - June 2012
HBase Advanced Schema Design - Berlin Buzzwords - June 2012
 
HBase Read High Availability Using Timeline Consistent Region Replicas
HBase  Read High Availability Using Timeline Consistent Region ReplicasHBase  Read High Availability Using Timeline Consistent Region Replicas
HBase Read High Availability Using Timeline Consistent Region Replicas
 
HBase: Just the Basics
HBase: Just the BasicsHBase: Just the Basics
HBase: Just the Basics
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBase
 
Geo-based content processing using hbase
Geo-based content processing using hbaseGeo-based content processing using hbase
Geo-based content processing using hbase
 
Apache HBase Performance Tuning
Apache HBase Performance TuningApache HBase Performance Tuning
Apache HBase Performance Tuning
 
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, ClouderaHBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
 
Session 01 - Into to Hadoop
Session 01 - Into to HadoopSession 01 - Into to Hadoop
Session 01 - Into to Hadoop
 
Apache Hadoop YARN, NameNode HA, HDFS Federation
Apache Hadoop YARN, NameNode HA, HDFS FederationApache Hadoop YARN, NameNode HA, HDFS Federation
Apache Hadoop YARN, NameNode HA, HDFS Federation
 
HBaseCon 2013: Compaction Improvements in Apache HBase
HBaseCon 2013: Compaction Improvements in Apache HBaseHBaseCon 2013: Compaction Improvements in Apache HBase
HBaseCon 2013: Compaction Improvements in Apache HBase
 

Destacado

HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, SalesforceHBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, SalesforceCloudera, Inc.
 
HBase Client APIs (for webapps?)
HBase Client APIs (for webapps?)HBase Client APIs (for webapps?)
HBase Client APIs (for webapps?)Nick Dimiduk
 
Hortonworks Technical Workshop: HBase For Mission Critical Applications
Hortonworks Technical Workshop: HBase For Mission Critical ApplicationsHortonworks Technical Workshop: HBase For Mission Critical Applications
Hortonworks Technical Workshop: HBase For Mission Critical ApplicationsHortonworks
 
HBase Data Types (WIP)
HBase Data Types (WIP)HBase Data Types (WIP)
HBase Data Types (WIP)Nick Dimiduk
 
Introduction to Hadoop, HBase, and NoSQL
Introduction to Hadoop, HBase, and NoSQLIntroduction to Hadoop, HBase, and NoSQL
Introduction to Hadoop, HBase, and NoSQLNick Dimiduk
 
Sept 17 2013 - THUG - HBase a Technical Introduction
Sept 17 2013 - THUG - HBase a Technical IntroductionSept 17 2013 - THUG - HBase a Technical Introduction
Sept 17 2013 - THUG - HBase a Technical IntroductionAdam Muise
 
Tokyo HBase Meetup - Realtime Big Data at Facebook with Hadoop and HBase (ja)
Tokyo HBase Meetup - Realtime Big Data at Facebook with Hadoop and HBase (ja)Tokyo HBase Meetup - Realtime Big Data at Facebook with Hadoop and HBase (ja)
Tokyo HBase Meetup - Realtime Big Data at Facebook with Hadoop and HBase (ja)tatsuya6502
 
Apache HBase Low Latency
Apache HBase Low LatencyApache HBase Low Latency
Apache HBase Low LatencyNick Dimiduk
 
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL databaseHBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL databaseEdureka!
 
Integration of Hive and HBase
Integration of Hive and HBaseIntegration of Hive and HBase
Integration of Hive and HBaseHortonworks
 
NoSQL with Hadoop and HBase
NoSQL with Hadoop and HBaseNoSQL with Hadoop and HBase
NoSQL with Hadoop and HBaseNGDATA
 
Web intelligence and big data
Web intelligence and big dataWeb intelligence and big data
Web intelligence and big dataRafael Mendes
 
The Evolution of a Relational Database Layer over HBase
The Evolution of a Relational Database Layer over HBaseThe Evolution of a Relational Database Layer over HBase
The Evolution of a Relational Database Layer over HBaseDataWorks Summit
 
Getting Started with HBase
Getting Started with HBaseGetting Started with HBase
Getting Started with HBaseCarol McDonald
 
Bring Cartography to the Cloud
Bring Cartography to the CloudBring Cartography to the Cloud
Bring Cartography to the CloudNick Dimiduk
 

Destacado (20)

HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, SalesforceHBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
 
HBase Client APIs (for webapps?)
HBase Client APIs (for webapps?)HBase Client APIs (for webapps?)
HBase Client APIs (for webapps?)
 
Hortonworks Technical Workshop: HBase For Mission Critical Applications
Hortonworks Technical Workshop: HBase For Mission Critical ApplicationsHortonworks Technical Workshop: HBase For Mission Critical Applications
Hortonworks Technical Workshop: HBase For Mission Critical Applications
 
HBase Data Types (WIP)
HBase Data Types (WIP)HBase Data Types (WIP)
HBase Data Types (WIP)
 
HBase Data Types
HBase Data TypesHBase Data Types
HBase Data Types
 
Introduction to Hadoop, HBase, and NoSQL
Introduction to Hadoop, HBase, and NoSQLIntroduction to Hadoop, HBase, and NoSQL
Introduction to Hadoop, HBase, and NoSQL
 
Sept 17 2013 - THUG - HBase a Technical Introduction
Sept 17 2013 - THUG - HBase a Technical IntroductionSept 17 2013 - THUG - HBase a Technical Introduction
Sept 17 2013 - THUG - HBase a Technical Introduction
 
Tokyo HBase Meetup - Realtime Big Data at Facebook with Hadoop and HBase (ja)
Tokyo HBase Meetup - Realtime Big Data at Facebook with Hadoop and HBase (ja)Tokyo HBase Meetup - Realtime Big Data at Facebook with Hadoop and HBase (ja)
Tokyo HBase Meetup - Realtime Big Data at Facebook with Hadoop and HBase (ja)
 
Apache HBase Low Latency
Apache HBase Low LatencyApache HBase Low Latency
Apache HBase Low Latency
 
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL databaseHBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
 
Integration of Hive and HBase
Integration of Hive and HBaseIntegration of Hive and HBase
Integration of Hive and HBase
 
Elasticsearch Workshop
Elasticsearch WorkshopElasticsearch Workshop
Elasticsearch Workshop
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
NoSQL with Hadoop and HBase
NoSQL with Hadoop and HBaseNoSQL with Hadoop and HBase
NoSQL with Hadoop and HBase
 
The analytics edge
The analytics edgeThe analytics edge
The analytics edge
 
Web intelligence and big data
Web intelligence and big dataWeb intelligence and big data
Web intelligence and big data
 
Hadoop 2.0 handout 5.0
Hadoop 2.0 handout 5.0Hadoop 2.0 handout 5.0
Hadoop 2.0 handout 5.0
 
The Evolution of a Relational Database Layer over HBase
The Evolution of a Relational Database Layer over HBaseThe Evolution of a Relational Database Layer over HBase
The Evolution of a Relational Database Layer over HBase
 
Getting Started with HBase
Getting Started with HBaseGetting Started with HBase
Getting Started with HBase
 
Bring Cartography to the Cloud
Bring Cartography to the CloudBring Cartography to the Cloud
Bring Cartography to the Cloud
 

Similar a Apache HBase for Architects

Hadoop cluster configuration
Hadoop cluster configurationHadoop cluster configuration
Hadoop cluster configurationprabakaranbrick
 
Hypertable Distilled by edydkim.github.com
Hypertable Distilled by edydkim.github.comHypertable Distilled by edydkim.github.com
Hypertable Distilled by edydkim.github.comEdward D. Kim
 
Big data with HDFS and Mapreduce
Big data  with HDFS and MapreduceBig data  with HDFS and Mapreduce
Big data with HDFS and Mapreducesenthil0809
 
Design and Research of Hadoop Distributed Cluster Based on Raspberry
Design and Research of Hadoop Distributed Cluster Based on RaspberryDesign and Research of Hadoop Distributed Cluster Based on Raspberry
Design and Research of Hadoop Distributed Cluster Based on RaspberryIJRESJOURNAL
 
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hari Shankar Sreekumar
 
Hadoop interview quations1
Hadoop interview quations1Hadoop interview quations1
Hadoop interview quations1Vemula Ravi
 
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...Cognizant
 
Big Data Reverse Knowledge Transfer.pptx
Big Data Reverse Knowledge Transfer.pptxBig Data Reverse Knowledge Transfer.pptx
Big Data Reverse Knowledge Transfer.pptxssuser8c3ea7
 
Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop Ecosystemrohitraj268
 
Hadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapaHadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapakapa rohit
 
Data Storage and Management project Report
Data Storage and Management project ReportData Storage and Management project Report
Data Storage and Management project ReportTushar Dalvi
 
HBase Mongo_DB Project
HBase Mongo_DB ProjectHBase Mongo_DB Project
HBase Mongo_DB ProjectSonali Gupta
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewNisanth Simon
 
Hadoop – Architecture.pptx
Hadoop – Architecture.pptxHadoop – Architecture.pptx
Hadoop – Architecture.pptxSakthiVinoth78
 
Introduction to hadoop ecosystem
Introduction to hadoop ecosystem Introduction to hadoop ecosystem
Introduction to hadoop ecosystem Rupak Roy
 

Similar a Apache HBase for Architects (20)

Hbase
HbaseHbase
Hbase
 
Hadoop cluster configuration
Hadoop cluster configurationHadoop cluster configuration
Hadoop cluster configuration
 
Hypertable Distilled by edydkim.github.com
Hypertable Distilled by edydkim.github.comHypertable Distilled by edydkim.github.com
Hypertable Distilled by edydkim.github.com
 
Big data with HDFS and Mapreduce
Big data  with HDFS and MapreduceBig data  with HDFS and Mapreduce
Big data with HDFS and Mapreduce
 
Lecture 2 part 1
Lecture 2 part 1Lecture 2 part 1
Lecture 2 part 1
 
Design and Research of Hadoop Distributed Cluster Based on Raspberry
Design and Research of Hadoop Distributed Cluster Based on RaspberryDesign and Research of Hadoop Distributed Cluster Based on Raspberry
Design and Research of Hadoop Distributed Cluster Based on Raspberry
 
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
 
Hadoop Architecture
Hadoop ArchitectureHadoop Architecture
Hadoop Architecture
 
Hadoop interview quations1
Hadoop interview quations1Hadoop interview quations1
Hadoop interview quations1
 
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
 
Big Data Reverse Knowledge Transfer.pptx
Big Data Reverse Knowledge Transfer.pptxBig Data Reverse Knowledge Transfer.pptx
Big Data Reverse Knowledge Transfer.pptx
 
Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop Ecosystem
 
hadoop
hadoophadoop
hadoop
 
hadoop
hadoophadoop
hadoop
 
Hadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapaHadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapa
 
Data Storage and Management project Report
Data Storage and Management project ReportData Storage and Management project Report
Data Storage and Management project Report
 
HBase Mongo_DB Project
HBase Mongo_DB ProjectHBase Mongo_DB Project
HBase Mongo_DB Project
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce Overview
 
Hadoop – Architecture.pptx
Hadoop – Architecture.pptxHadoop – Architecture.pptx
Hadoop – Architecture.pptx
 
Introduction to hadoop ecosystem
Introduction to hadoop ecosystem Introduction to hadoop ecosystem
Introduction to hadoop ecosystem
 

Último

Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 

Último (20)

Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 

Apache HBase for Architects

  • 1. Apache HBase For Architects Nick Dimiduk, Hortonworks Strata/Hadoop World Barcelona, 2014-11-21 Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Page 1
  • 2. Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Page 2
  • 3. Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Agenda • Background – (how did we get here?) • TL;DR – (don’t waste my time!) • High-level Architecture – (where are we?) • Anatomy of a RegionServer – (how does this thing work?) • By Example – (how do I use it?) • Resources – (where do we go from here?) Page 3
  • 4. Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Background Page 4
  • 5. So what is HBase anyway? • BigTable paper from Google, 2006, Dean et al. – “Bigtable is a sparse, distributed, persistent multi-dimensional sorted map.” – http://research.google.com/archive/bigtable.html Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. • Key Features: – Distributed storage across cluster of machines – Random, online read and write data access – Schemaless data model (“NoSQL”) – Self-managed data partitions Page 5
  • 6. Apache Hadoop Dependencies • Apache Hadoop Distributed Filesystem (HDFS) – Distributed, fault-tolerant, throughput-optimized data storage – The Google File System, 2003, Ghemawat et al. – http://research.google.com/archive/gfs.html Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. • Apache Zookeeper (ZK) – Distributed, available, reliable coordination system – The Chubby Lock Service …, 2006, Burrows – http://research.google.com/archive/chubby.html • Apache Hadoop MapReduce (MR) – Distributed, fault-tolerant, batch-oriented data processing – MapReduce: …, 2004, Dean and Ghemawat – http://research.google.com/archive/mapreduce.html Page 6
  • 7. Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. TL;DR Page 7
  • 8. So what is HBase anyway? Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Page 8 C1 tree C0 tree Disk Memory Figure 2.1. Schematic picture of an LSM-tree of two components Figure 2.1 reproduced from O’Neil, Patrick, et al. "The log-structured merge-tree (LSM-tree)." Acta Informatica 33.4 (1996): 351-385.
  • 9. So what is HBase anyway? C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Page 9 DataNode RegionServer C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory
  • 10. primarily random reads and writes. In other deployments, is also a part of the workloads, TaskTrackers, Servers can run together. primarily random primarily reads random and writes. reads and In other writes. is also a part is also of the a part workloads, of the workloads, TaskTrackers, Servers can Servers run together. can run together. So what is HBase anyway? DataNode RegionServer DataNode RegionServer Figure 3.7 HBase RegionServer and HDFS DataNode processes C1 DataNode RegionServer DataNode RegionServer DataNode C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache C1 C1 Figure 3.7 HBase Figure RegionServer 3.7 HBase and RegionServer HDFS DataNode and DataNode RegionServer DataNode RegionServer C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache C1 Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Page 10 C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache tree C0 tree Disk Memory C1 tree C0 tree Disk Memory tree C0 tree Disk Memory C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache tree C0 tree Disk Memory C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache tree C0 tree Disk Memory C1 tree C0 tree Disk Memory DataNode RegionServer DataNode C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory DataNode RegionServer C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory
  • 11. RegionServers, and each RegionServer typically hosts multiple regions. RegionServers, and each RegionServer typically hosts multiple regions. RegionServers, and each RegionServer typically hosts multiple regions. RegionServers, and each RegionServer typically hosts multiple regions. RegionServers, RegionServers, and each RegionServer and each RegionServer typically hosts typically multiple hosts Given that the Given underlying that the underlying data is stored data in is HDFS, stored which in HDFS, is available a single namespace, a single namespace, all RegionServers all RegionServers have access have to the access same to system and system can therefore and can host therefore any region host any (figure region 3.8). (figure By physically 3.8). Nodes and Nodes RegionServers, and RegionServers, you can use you the can data use locality the data property; locality can theoretically can theoretically read and write read to and write local to DataNode local as DataNode the You may wonder You may where wonder the TaskTrackers where the TaskTrackers are in this are scheme HBase deployments, HBase deployments, the MapReduce the MapReduce framework framework isn’t deployed isn’t primarily random primarily reads random and writes. reads and In other writes. deployments, In other deployments, where is also a part is also of the a part workloads, of the workloads, TaskTrackers, TaskTrackers, DataNodes, Servers can Servers run together. can run together. Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers Nodes and Nodes RegionServers, and Nodes RegionServers, you and can RegionServers, use you the can data use locality you the can data property; use locality the data that property; is, locality can theoretically can theoretically read and can write theoretically read to and write local read to DataNode and write local to as DataNode the primary local as DataNode the You may wonder You may where wonder You the may TaskTrackers where wonder the TaskTrackers where are in the this TaskTrackers are scheme in this of are scheme things. HBase deployments, HBase deployments, the HBase MapReduce deployments, the MapReduce framework the MapReduce framework isn’t deployed framework isn’t at deployed all if the isn’t primarily random primarily reads random primarily and writes. reads random and In other writes. reads deployments, and In other writes. deployments, In where other the deployments, MapReduce where is also a part is also of the a part workloads, is also of the a part workloads, TaskTrackers, of the workloads, TaskTrackers, DataNodes, TaskTrackers, DataNodes, and HBase Servers can Servers run together. can Servers run together. can run together. primarily random reads and writes. In other deployments, where the MapReduce pro-cessing primarily random reads and writes. In other deployments, where the MapReduce pro-cessing primarily random primarily reads random primarily and writes. reads random and In other writes. reads deployments, and In other writes. deployments, In where other the deployments, MapReduce where is also a part is also of the a part workloads, is also of the a part workloads, TaskTrackers, of the workloads, TaskTrackers, DataNodes, TaskTrackers, DataNodes, and HBase Servers can Servers run together. can Servers run together. can run together. Figure 3.6 A table consists of multiple smaller chunks called regions. Figure 3.6 A table consists of multiple smaller chunks called regions. Figure 3.6 A table consists of multiple smaller chunks called regions. Figure 3.6 A table consists of multiple smaller chunks called regions. Figure 3.6 A table consists of multiple smaller chunks called regions. Figure 3.6 A table consists of multiple smaller chunks called DataNode RegionServer DataNode RegionServer Figure 3.7 HBase RegionServer and HDFS DataNode processes C1 Given that the underlying data is stored in HDFS, which is available to all clients as Given that the underlying data is stored in HDFS, which is available to all clients as Given that the underlying data is stored in HDFS, which is available to all clients as Given that the underlying data is stored in HDFS, which is available to a single namespace, all RegionServers have access to the same persisted files system and can therefore host any region (figure 3.8). By physically collocating Nodes and RegionServers, you can use the data locality property; that is, can theoretically read and write to local DataNode as the primary You may wonder where the TaskTrackers are in this scheme of things. HBase deployments, the MapReduce framework isn’t deployed at all if the primarily random reads and writes. In other deployments, where the MapReduce is also a part of the workloads, TaskTrackers, DataNodes, and HBase Servers can run together. can theoretically read and write to the local DataNode as the primary DataNode. You may wonder where the TaskTrackers are in this scheme of things. In some can theoretically read and write to the local DataNode as the primary DataNode. You may wonder where the TaskTrackers are in this scheme of things. In some can theoretically read and write to local DataNode as the primary DataNode. You may wonder where the TaskTrackers are in this scheme of things. In some So what is HBase anyway? is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- a single namespace, all RegionServers have access to the same persisted files in the file system and can therefore host any region (figure 3.8). By physically collocating Data- Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers a single namespace, all RegionServers have access to the same persisted files in the file system and can therefore host any region (figure 3.8). By physically collocating Data- Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers a single namespace, all RegionServers have access to the same persisted files in the file system and can therefore host any region (figure 3.8). By physically collocating Data- Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers Servers can run together. Servers can run together. store/access data on HDFS. The master process does the distribution of regions among RegionServers, and each RegionServer typically hosts multiple regions. store/access data on HDFS. The master process does the distribution of regions among RegionServers, and each RegionServer typically hosts multiple regions. store/access data on HDFS. The master process does the distribution of regions among RegionServers, and each RegionServer typically hosts multiple regions. store/access data on HDFS. The master process does the distribution of regions RegionServers, and each RegionServer typically hosts multiple regions. store/access store/data access on HDFS. data The on master HDFS. process The master does process the distribution does RegionServers, RegionServers, and each RegionServer and each RegionServer typically hosts typically multiple hosts Given that the Given underlying that the underlying data is stored data in is HDFS, stored which in HDFS, is available a single namespace, a single namespace, all RegionServers all RegionServers have access have to the access same to system and system can therefore and can host therefore any region host any (figure region 3.8). (figure By physically 3.8). Nodes and Nodes RegionServers, and RegionServers, you can use you the can data use locality the data property; locality can theoretically can theoretically read and write read to and write local to DataNode local as DataNode the You may wonder You may where wonder the TaskTrackers where the TaskTrackers are in this are scheme HBase deployments, HBase deployments, the MapReduce the MapReduce framework framework isn’t deployed isn’t primarily random primarily reads random and writes. reads and In other writes. deployments, In other deployments, where is also a part is also of the a part workloads, of the workloads, TaskTrackers, TaskTrackers, DataNodes, Servers can Servers run together. can run together. HBase deployments, the MapReduce framework isn’t deployed at all if the workload is primarily random reads and writes. In other deployments, where the MapReduce pro-cessing HBase deployments, the MapReduce framework isn’t deployed at all if the workload is primarily random reads and writes. In other deployments, where the MapReduce pro-cessing HBase deployments, the MapReduce framework isn’t deployed at all if the workload is primarily random reads and writes. In other deployments, where the MapReduce pro-cessing Given that the underlying data is stored in HDFS, which is available to all clients as Given that the underlying data is stored in HDFS, which is available to all clients as Given that the underlying data is stored in HDFS, which is available to all clients as Given that the underlying data is stored in HDFS, which is available to a single namespace, all RegionServers have access to the same persisted files system and can therefore host any region (figure 3.8). By physically collocating Nodes and RegionServers, you can use the data locality property; that is, can theoretically read and write to local DataNode as the primary You may wonder where the TaskTrackers are in this scheme of things. HBase deployments, the MapReduce framework isn’t deployed at all if the primarily random reads and writes. In other deployments, where the MapReduce is also a part of the workloads, TaskTrackers, DataNodes, and HBase Servers can run together. can theoretically read and write to the local DataNode as the primary DataNode. You may wonder where the TaskTrackers are in this scheme of things. In some can theoretically read and write to the local DataNode as the primary DataNode. You may wonder where the TaskTrackers are in this scheme of things. In some can theoretically read and write to local DataNode as the primary DataNode. You may wonder where the TaskTrackers are in this scheme of things. In some is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- DataNode RegionServer DataNode RegionServer DataNode RegionServer a single namespace, all RegionServers have access to the same persisted files in the file system and can therefore host any region (figure 3.8). By physically collocating Data- Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers a single namespace, all RegionServers have access to the same persisted files in the file system and can therefore host any region (figure 3.8). By physically collocating Data- Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers a single namespace, all RegionServers have access to the same persisted files in the file system and can therefore host any region (figure 3.8). By physically collocating Data- Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers Servers can run together. DataNode RegionServer DataNode RegionServer DataNode RegionServer DataNode RegionServer DataNode RegionServer Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically Licensed to Nick Dimiduk <ndimiduk@gmail.com> Servers can run together. HBase deployments, the MapReduce framework isn’t deployed at all if the workload is primarily random reads and writes. In other deployments, where the MapReduce pro-cessing HBase deployments, the MapReduce framework isn’t deployed at all if the workload is primarily random reads and writes. In other deployments, where the MapReduce pro-cessing HBase deployments, the MapReduce framework isn’t deployed at all if the workload is C1 tree C0 tree primarily Disk Memory C1 tree C0 tree random reads and writes. In other deployments, where the MapReduce pro-cessing Disk Memory Cache C1 tree C0 tree can theoretically read and write to the local DataNode as the primary DataNode. You may wonder where the TaskTrackers are in this scheme of things. In some can theoretically read and write to the local DataNode as the primary DataNode. You may wonder where the TaskTrackers are in this scheme of things. In some can theoretically read and write to local DataNode as the primary DataNode. You may wonder where the TaskTrackers are in this scheme of things. In some is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- is also Disk Memory C1 tree C0 tree a part of the workloads, TaskTrackers, DataNodes, and HBase Region- C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache C1 tree C0 tree is also a part of Disk C1 tree the Memory C0 tree workloads, TaskTrackers, DataNodes, and HBase Region- DataNode RegionServer DataNode RegionServer C1 tree C0 tree C1 tree C0 tree Disk Memory Disk Memory C1 tree C0 tree C1 tree C0 tree Disk Memory Cache Disk Memory Cache C1 tree C0 tree C1 tree C0 tree Disk Memory Disk Memory C1 tree C0 tree C1 tree C0 tree Figure 3.7 HBase RegionServer and HDFS DataNode processes Licensed to Nick Dimiduk <ndimiduk@DataNode RegionServer DataNode RegionServer DataNode RegionServer Disk Memory DataNode RegionServer DataNode RegionServer DataNode RegionServer C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache C1 tree C0 tree Disk Memory C1 tree C0 tree DataNode RegionServer DataNode RegionServer DataNode RegionServer Disk Memory Servers can run together. HBase deployments, the MapReduce framework isn’t deployed at all if the workload is primarily random reads and writes. In other deployments, where the MapReduce pro-cessing HBase deployments, the MapReduce framework isn’t deployed at all if the workload is C1 tree C0 tree primarily Disk Memory C1 tree C0 tree random reads and writes. In other deployments, where the MapReduce pro-cessing Disk Memory Cache Servers can run together. HBase deployments, the MapReduce framework isn’t deployed at all if the workload is C1 tree C0 tree C1 tree C0 tree primarily Disk Memory C1 tree C0 tree random Disk Memory C1 tree C0 tree reads and writes. In other deployments, where the MapReduce pro-cessing Disk Memory Cache Disk Memory Cache DataNode RegionServer DataNode RegionServer DataNode RegionServer DataNode RegionServer DataNode RegionServer Disk Memory Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically Licensed to Nick Dimiduk <ndimiduk@gmail.com> DataNode RegionServer DataNode RegionServer Disk Memory Disk Memory C1 Figure 3.7 HBase RegionServer and HDFS DataNode processes Licensed to Nick Dimiduk <ndimiduk@C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. Figure C1 tree C0 tree 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- is also Disk Memory C1 tree C0 tree a part of the workloads, TaskTrackers, DataNodes, and HBase Region- is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- DataNode RegionServer DataNode RegionServer DataNode RegionServer Disk Memory Figure C1 tree C0 tree 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on Disk Memory C1 tree C0 tree Licensed to Nick Dimiduk <ndimiduk@gmail.com> DataNode RegionServer DataNode RegionServer DataNode RegionServer Disk Memory C1 tree C0 tree Disk Memory C1 tree C0 tree DataNode RegionServer DataNode RegionServer DataNode RegionServer DataNode RegionServer DataNode RegionServer Disk Memory C1 Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically C1 tree C0 tree Disk Memory C1 tree C0 tree DataNode RegionServer DataNode RegionServer Disk Memory C1 Figure 3.7 HBase RegionServer and HDFS DataNode processes tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Servers can run together. Servers can run together. C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache Servers can run together. C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. Figure C1 tree C0 tree 3.7 HBase C1 tree C0 tree RegionServer tree C0 tree and HDFS tree DataNode C0 tree processes are typically collocated on Disk Memory Disk Memory Disk Memory Disk Memory C1 tree C0 tree C1 tree C0 tree C1 tree C0 tree C1 tree C0 tree Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on Licensed to Nick Dimiduk <ndimiduk@gmail.com> DataNode RegionServer DataNode RegionServer DataNode RegionServer DataNode RegionServer DataNode RegionServer DataNode RegionServer DataNode RegionServer DataNode RegionServer DataNode RegionServer Disk Memory C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache DataNode RegionServer DataNode RegionServer C1 tree C0 tree C1 tree C0 tree C1 tree C0 tree Disk Memory Disk Memory Disk Memory C1 tree C0 tree C1 tree C0 tree C1 tree C0 tree Disk Memory Cache Disk Memory Cache Disk Memory Cache C1 Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically C1 tree C0 tree C1 Disk Memory C1 tree C0 tree Disk Memory Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on DataNode RegionServer DataNode RegionServer DataNode RegionServer Disk Memory Disk Memory tree C0 tree Disk Memory C1 tree C0 tree Disk Memory tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Page 11 primarily random reads and writes. In other deployments, where the MapReduce pro-cessing is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- Servers can run together. DataNode RegionServer DataNode RegionServer DataNode RegionServer C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Figure 3.7 HBase C1 tree C0 tree RegionServer and HDFS DataNode processes are typically collocated on the same host. Disk Memory C1 tree C0 tree Disk Memory tree C0 tree Disk Memory C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Licensed to Nick Dimiduk <ndimiduk@gmail.com> Disk Memory Licensed to Nick Dimiduk <ndimiduk@gmail.com> Licensed to Nick Dimiduk <ndimiduk@gmail.com> Servers can run together. DataNode RegionServer DataNode RegionServer DataNode RegionServer Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache tree C0 tree Disk Memory C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Licensed to Nick Dimiduk <ndimiduk@gmail.com> tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Licensed to Nick Dimiduk <ndimiduk@gmail.com> Licensed to Nick Dimiduk <ndimiduk@gmail.com> Servers can run together. DataNode RegionServer DataNode RegionServer DataNode RegionServer C1 Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. C1 Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host.
  • 12. RegionServers, and each RegionServer typically hosts multiple regions. RegionServers, and each RegionServer typically hosts multiple regions. RegionServers, and each RegionServer typically hosts multiple regions. RegionServers, and each RegionServer typically hosts multiple regions. RegionServers, RegionServers, and each RegionServer and each RegionServer typically hosts typically multiple hosts Given that the Given underlying that the underlying data is stored data in is HDFS, stored which in HDFS, is available a single namespace, a single namespace, all RegionServers all RegionServers have access have to the access same to system and system can therefore and can host therefore any region host any (figure region 3.8). (figure By physically 3.8). Nodes and Nodes RegionServers, and RegionServers, you can use you the can data use locality the data property; locality can theoretically can theoretically read and write read to and write local to DataNode local as DataNode the You may wonder You may where wonder the TaskTrackers where the TaskTrackers are in this are scheme HBase deployments, HBase deployments, the MapReduce the MapReduce framework framework isn’t deployed isn’t primarily random primarily reads random and writes. reads and In other writes. deployments, In other deployments, where is also a part is also of the a part workloads, of the workloads, TaskTrackers, TaskTrackers, DataNodes, Servers can Servers run together. can run together. Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers Nodes and Nodes RegionServers, and Nodes RegionServers, you and can RegionServers, use you the can data use locality you the can data property; use locality the data that property; is, locality can theoretically can theoretically read and can write theoretically read to and write local read to DataNode and write local to as DataNode the primary local as DataNode the You may wonder You may where wonder You the may TaskTrackers where wonder the TaskTrackers where are in the this TaskTrackers are scheme in this of are scheme things. HBase deployments, HBase deployments, the HBase MapReduce deployments, the MapReduce framework the MapReduce framework isn’t deployed framework isn’t at deployed all if the isn’t primarily random primarily reads random primarily and writes. reads random and In other writes. reads deployments, and In other writes. deployments, In where other the deployments, MapReduce where is also a part is also of the a part workloads, is also of the a part workloads, TaskTrackers, of the workloads, TaskTrackers, DataNodes, TaskTrackers, DataNodes, and HBase Servers can Servers run together. can Servers run together. can run together. primarily random reads and writes. In other deployments, where the MapReduce pro-cessing primarily random reads and writes. In other deployments, where the MapReduce pro-cessing primarily random primarily reads random primarily and writes. reads random and In other writes. reads deployments, and In other writes. deployments, In where other the deployments, MapReduce where is also a part is also of the a part workloads, is also of the a part workloads, TaskTrackers, of the workloads, TaskTrackers, DataNodes, TaskTrackers, DataNodes, and HBase Servers can Servers run together. can Servers run together. can run together. Figure 3.6 A table consists of multiple smaller chunks called regions. Figure 3.6 A table consists of multiple smaller chunks called regions. Figure 3.6 A table consists of multiple smaller chunks called regions. Figure 3.6 A table consists of multiple smaller chunks called regions. Figure 3.6 A table consists of multiple smaller chunks called regions. Figure 3.6 A table consists of multiple smaller chunks called DataNode RegionServer DataNode RegionServer Figure 3.7 HBase RegionServer and HDFS DataNode processes C1 Given that the underlying data is stored in HDFS, which is available to all clients as Given that the underlying data is stored in HDFS, which is available to all clients as Given that the underlying data is stored in HDFS, which is available to all clients as Given that the underlying data is stored in HDFS, which is available to a single namespace, all RegionServers have access to the same persisted files system and can therefore host any region (figure 3.8). By physically collocating Nodes and RegionServers, you can use the data locality property; that is, can theoretically read and write to local DataNode as the primary You may wonder where the TaskTrackers are in this scheme of things. HBase deployments, the MapReduce framework isn’t deployed at all if the primarily random reads and writes. In other deployments, where the MapReduce is also a part of the workloads, TaskTrackers, DataNodes, and HBase Servers can run together. can theoretically read and write to the local DataNode as the primary DataNode. You may wonder where the TaskTrackers are in this scheme of things. In some can theoretically read and write to the local DataNode as the primary DataNode. You may wonder where the TaskTrackers are in this scheme of things. In some can theoretically read and write to local DataNode as the primary DataNode. You may wonder where the TaskTrackers are in this scheme of things. In some So what is HBase anyway? is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- a single namespace, all RegionServers have access to the same persisted files in the file system and can therefore host any region (figure 3.8). By physically collocating Data- Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers a single namespace, all RegionServers have access to the same persisted files in the file system and can therefore host any region (figure 3.8). By physically collocating Data- Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers a single namespace, all RegionServers have access to the same persisted files in the file system and can therefore host any region (figure 3.8). By physically collocating Data- Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers Servers can run together. Servers can run together. store/access data on HDFS. The master process does the distribution of regions among RegionServers, and each RegionServer typically hosts multiple regions. store/access data on HDFS. The master process does the distribution of regions among RegionServers, and each RegionServer typically hosts multiple regions. store/access data on HDFS. The master process does the distribution of regions among RegionServers, and each RegionServer typically hosts multiple regions. store/access data on HDFS. The master process does the distribution of regions RegionServers, and each RegionServer typically hosts multiple regions. store/access store/data access on HDFS. data The on master HDFS. process The master does process the distribution does RegionServers, RegionServers, and each RegionServer and each RegionServer typically hosts typically multiple hosts Given that the Given underlying that the underlying data is stored data in is HDFS, stored which in HDFS, is available a single namespace, a single namespace, all RegionServers all RegionServers have access have to the access same to system and system can therefore and can host therefore any region host any (figure region 3.8). (figure By physically 3.8). Nodes and Nodes RegionServers, and RegionServers, you can use you the can data use locality the data property; locality can theoretically can theoretically read and write read to and write local to DataNode local as DataNode the You may wonder You may where wonder the TaskTrackers where the TaskTrackers are in this are scheme HBase deployments, HBase deployments, the MapReduce the MapReduce framework framework isn’t deployed isn’t primarily random primarily reads random and writes. reads and In other writes. deployments, In other deployments, where is also a part is also of the a part workloads, of the workloads, TaskTrackers, TaskTrackers, DataNodes, Servers can Servers run together. can run together. HBase deployments, the MapReduce framework isn’t deployed at all if the workload is primarily random reads and writes. In other deployments, where the MapReduce pro-cessing HBase deployments, the MapReduce framework isn’t deployed at all if the workload is primarily random reads and writes. In other deployments, where the MapReduce pro-cessing HBase deployments, the MapReduce framework isn’t deployed at all if the workload is primarily random reads and writes. In other deployments, where the MapReduce pro-cessing Given that the underlying data is stored in HDFS, which is available to all clients as Given that the underlying data is stored in HDFS, which is available to all clients as Given that the underlying data is stored in HDFS, which is available to all clients as Given that the underlying data is stored in HDFS, which is available to a single namespace, all RegionServers have access to the same persisted files system and can therefore host any region (figure 3.8). By physically collocating Nodes and RegionServers, you can use the data locality property; that is, can theoretically read and write to local DataNode as the primary You may wonder where the TaskTrackers are in this scheme of things. HBase deployments, the MapReduce framework isn’t deployed at all if the primarily random reads and writes. In other deployments, where the MapReduce is also a part of the workloads, TaskTrackers, DataNodes, and HBase Servers can run together. can theoretically read and write to the local DataNode as the primary DataNode. You may wonder where the TaskTrackers are in this scheme of things. In some can theoretically read and write to the local DataNode as the primary DataNode. You may wonder where the TaskTrackers are in this scheme of things. In some can theoretically read and write to local DataNode as the primary DataNode. You may wonder where the TaskTrackers are in this scheme of things. In some is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- DataNode RegionServer DataNode RegionServer DataNode RegionServer a single namespace, all RegionServers have access to the same persisted files in the file system and can therefore host any region (figure 3.8). By physically collocating Data- Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers a single namespace, all RegionServers have access to the same persisted files in the file system and can therefore host any region (figure 3.8). By physically collocating Data- Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers a single namespace, all RegionServers have access to the same persisted files in the file system and can therefore host any region (figure 3.8). By physically collocating Data- Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers Servers can run together. DataNode RegionServer DataNode RegionServer DataNode RegionServer DataNode RegionServer DataNode RegionServer Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically Licensed to Nick Dimiduk <ndimiduk@gmail.com> Servers can run together. HBase deployments, the MapReduce framework isn’t deployed at all if the workload is primarily random reads and writes. In other deployments, where the MapReduce pro-cessing HBase deployments, the MapReduce framework isn’t deployed at all if the workload is primarily random reads and writes. In other deployments, where the MapReduce pro-cessing HBase deployments, the MapReduce framework isn’t deployed at all if the workload is C1 tree C0 tree primarily Disk Memory C1 tree C0 tree random reads and writes. In other deployments, where the MapReduce pro-cessing Disk Memory Cache C1 tree C0 tree can theoretically read and write to the local DataNode as the primary DataNode. You may wonder where the TaskTrackers are in this scheme of things. In some can theoretically read and write to the local DataNode as the primary DataNode. You may wonder where the TaskTrackers are in this scheme of things. In some can theoretically read and write to local DataNode as the primary DataNode. You may wonder where the TaskTrackers are in this scheme of things. In some is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- is also Disk Memory C1 tree C0 tree a part of the workloads, TaskTrackers, DataNodes, and HBase Region- C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache C1 tree C0 tree is also a part of Disk C1 tree the Memory C0 tree workloads, TaskTrackers, DataNodes, and HBase Region- DataNode RegionServer DataNode RegionServer C1 tree C0 tree C1 tree C0 tree Disk Memory Disk Memory C1 tree C0 tree C1 tree C0 tree Disk Memory Cache Disk Memory Cache C1 tree C0 tree C1 tree C0 tree Disk Memory Disk Memory C1 tree C0 tree C1 tree C0 tree Figure 3.7 HBase RegionServer and HDFS DataNode processes Licensed to Nick Dimiduk <ndimiduk@DataNode RegionServer DataNode RegionServer DataNode RegionServer Disk Memory DataNode RegionServer DataNode RegionServer DataNode RegionServer C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache C1 tree C0 tree Disk Memory C1 tree C0 tree DataNode RegionServer DataNode RegionServer DataNode RegionServer Disk Memory Servers can run together. HBase deployments, the MapReduce framework isn’t deployed at all if the workload is primarily random reads and writes. In other deployments, where the MapReduce pro-cessing HBase deployments, the MapReduce framework isn’t deployed at all if the workload is C1 tree C0 tree primarily Disk Memory C1 tree C0 tree random reads and writes. In other deployments, where the MapReduce pro-cessing Disk Memory Cache Servers can run together. HBase deployments, the MapReduce framework isn’t deployed at all if the workload is C1 tree C0 tree C1 tree C0 tree primarily Disk Memory C1 tree C0 tree random Disk Memory C1 tree C0 tree reads and writes. In other deployments, where the MapReduce pro-cessing Disk Memory Cache Disk Memory Cache DataNode RegionServer DataNode RegionServer DataNode RegionServer DataNode RegionServer DataNode RegionServer Disk Memory Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically Licensed to Nick Dimiduk <ndimiduk@gmail.com> DataNode RegionServer DataNode RegionServer Disk Memory Disk Memory C1 Figure 3.7 HBase RegionServer and HDFS DataNode processes Licensed to Nick Dimiduk <ndimiduk@C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. Figure C1 tree C0 tree 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- is also Disk Memory C1 tree C0 tree a part of the workloads, TaskTrackers, DataNodes, and HBase Region- is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- DataNode RegionServer DataNode RegionServer DataNode RegionServer Disk Memory Figure C1 tree C0 tree 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on Disk Memory C1 tree C0 tree Licensed to Nick Dimiduk <ndimiduk@gmail.com> DataNode RegionServer DataNode RegionServer DataNode RegionServer Disk Memory C1 tree C0 tree Disk Memory C1 tree C0 tree DataNode RegionServer DataNode RegionServer DataNode RegionServer DataNode RegionServer DataNode RegionServer Disk Memory C1 Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically C1 tree C0 tree Disk Memory C1 tree C0 tree DataNode RegionServer DataNode RegionServer Disk Memory C1 Figure 3.7 HBase RegionServer and HDFS DataNode processes tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Servers can run together. Servers can run together. C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache Servers can run together. C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. Figure C1 tree C0 tree 3.7 HBase C1 RegionServer and HDFS DataNode processes are typically collocated on Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on Licensed to Nick Dimiduk <ndimiduk@gmail.com> "2011-07-04" Disk Memory C1 tree C0 tree 1368396302 "fourth of July" tree C0 tree Disk Memory C1 tree C0 tree tree C0 tree Disk Memory C1 tree C0 tree tree C0 tree Disk Memory C1 tree C0 tree DataNode RegionServer DataNode RegionServer DataNode RegionServer DataNode RegionServer DataNode RegionServer DataNode RegionServer DataNode RegionServer DataNode RegionServer DataNode RegionServer Disk Memory C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache DataNode RegionServer DataNode RegionServer C1 tree C0 tree C1 tree C0 tree C1 tree C0 tree Disk Memory Disk Memory Disk Memory C1 tree C0 tree C1 tree C0 tree C1 tree C0 tree Disk Memory Cache Disk Memory Cache Disk Memory Cache C1 Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically C1 tree C0 tree C1 Disk Memory C1 tree C0 tree Disk Memory Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on DataNode RegionServer DataNode RegionServer DataNode RegionServer Disk Memory Disk Memory tree C0 tree Disk Memory C1 tree C0 tree Disk Memory tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Page 12 primarily random reads and writes. In other deployments, where the MapReduce pro-cessing is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- Servers can run together. DataNode RegionServer DataNode RegionServer DataNode RegionServer C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Figure 3.7 HBase C1 tree C0 tree RegionServer and HDFS DataNode processes are typically collocated on the same host. Disk Memory C1 tree C0 tree Disk Memory tree C0 tree Disk Memory C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Licensed to Nick Dimiduk <ndimiduk@gmail.com> Disk Memory Licensed to Nick Dimiduk <ndimiduk@gmail.com> Licensed to Nick Dimiduk <ndimiduk@gmail.com> Servers can run together. DataNode RegionServer DataNode RegionServer DataNode RegionServer Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache tree C0 tree Disk Memory C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Licensed to Nick Dimiduk <ndimiduk@gmail.com> tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Licensed to Nick Dimiduk <ndimiduk@gmail.com> Licensed to Nick Dimiduk <ndimiduk@gmail.com> Servers can run together. DataNode RegionServer DataNode RegionServer DataNode RegionServer C1 Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. C1 Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. a cf1 1368394583 7 1368394261 "hello" "bar" 1368394583 22 1368394925 13.6 1368393847 "world" "foo" cf2 1.0001 1368387684 "almost the loneliest number"
  • 13. High-level Architecture Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Page 13
  • 14. Logical Data Model 1368394583 7 1368394261 "hello" "bar" 1368394583 22 1368394925 13.6 1368393847 "world" "foo" "2011-07-04" 1368396302 "fourth of July" 1.0001 1368387684 "almost the loneliest number" Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Page 14 a cf1 cf2 b cf2 "thumb" 1368387247 [3.6 kb png data] Table A rowkey column family column qualifier timestamp value Rows Column Families
  • 15. Logical Architecture Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Page 15 Table A a b c d e f g h i j k l m n o p Region 1 Region 2 Region 3 Region 4 Region Server 7 Table A, Region 1 Table A, Region 2 Table G, Region 1070 Table L, Region 25 Region Server 86 Table A, Region 3 Table C, Region 30 Table F, Region 160 Table F, Region 776 Region Server 367 Table A, Region 4 Table C, Region 17 Table E, Region 52 Table P, Region 1116
  • 16. Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers and RegionServers, you can use the data locality property; that can theoretically read and write to the local DataNode as the primary You may wonder where the TaskTrackers are in this scheme of HBase deployments, the MapReduce framework isn’t deployed at all primarily random reads and writes. In other deployments, where the is also a part of the workloads, TaskTrackers, DataNodes, and Servers can run together. Nodes Nodes DataNode RegionServer DataNode DataNode RegionServer RegionServer DataNode DataNode RegionServer Figure 3.7 HBase RegionServer Figure 3.7 and HDFS HBase DataNode RegionServer processes and HDFS are typically DataNode collocated processes and RegionServers, you can use the data locality property; can theoretically read and write to the local DataNode You may wonder where the TaskTrackers are in this HBase deployments, the MapReduce framework isn’t deployed primarily random reads and writes. In other deployments, is also a part of the workloads, TaskTrackers, DataNodes, Servers can run together. Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same Nodes and RegionServers, you can use the data can theoretically read and write to the local You may wonder where the TaskTrackers HBase deployments, the MapReduce framework primarily random reads and writes. In other deployments, is also a part of the workloads, TaskTrackers, Servers can run together. system and can therefore host any region (figure 3.8). By physically collocating Data- Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers Physical Architecture can theoretically read and write to the local DataNode as the primary DataNode. You may wonder where the TaskTrackers are in this scheme of things. HBase deployments, the MapReduce framework isn’t deployed at all if the workload primarily random reads and writes. In other deployments, where the MapReduce is also a part of the workloads, TaskTrackers, DataNodes, and HBase Servers can run together. can theoretically read and write to the local DataNode as the primary DataNode. You may wonder where the TaskTrackers are in this scheme of things. In some HBase deployments, the MapReduce framework isn’t deployed at all if the workload is primarily random reads and writes. In other deployments, where the MapReduce pro-cessing is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- Zoo Keeper Zoo Keeper DataNode RegionServer DataNode RegionServer DataNode RegionServer DataNode RegionServer DataNode RegionServer Region Region Region Server Server Server Data Data Data Node Node Node Figure Licensed Attribution-3.7 under a ShareAlike HBase Creative Commons 3.0 Unported RegionServer License. and HDFS DataNode processes Page 16 are typically Region Server Data Node ... can theoretically read and write You may wonder where the TaskTrackers HBase deployments, the MapReduce primarily random reads and writes. is also a part of the workloads, Servers can run together. DataNode RegionServer Master Master Figure 3.7 HBase RegionServer and HDFS Servers can run together. DataNode RegionServer DataNode RegionServer DataNode RegionServer Name Node DataNode RegionServer DataNode RegionServer HBase Client Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically Licensed to Nick Dimiduk <ndimiduk@gmail.HDFS HBase
  • 17. User API • {rowkey => {family => {qualifier => {version => value}}}} – Think: nested TreeMap (Java), OrderedDictionary (C#), OrderedDict (Python) • Basic data operations: GET, PUT, DELETE • SCAN over range of key-values – benefit of the sorted rowkey business – this is how you implement any kind of "complex query” * Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. • GET, SCAN support Filters – Push application logic to RegionServers • INCREMENT, APPEND, CheckAnd{Put,Delete} – Server-side, atomic data operations, can be contentious! Page 17 * This is also a foundational component in what we refer to as “schema design” in this “schemaless” database.
  • 18. Anatomy of a RegionServer Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Page 18
  • 19. So what is HBase anyway? C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Page 19 DataNode RegionServer C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory
  • 20. Storage Machinery Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Page 20 RegionServer (HBase) DataNode (Hadoop DFS) HLog (WAL) HRegion HStore StoreFile HFile StoreFile HFile MemStore ... ... HStore BlockCache HRegion HStore HStore ... ...
  • 21. Storage Machinery Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Page 21 RegionServer (HBase) DataNode (Hadoop DFS) HLog (WAL) HRegion HStore StoreFile HFile StoreFile HFile MemStore ... ... HStore BlockCache HRegion HStore HStore ... ... C1 C0 C1 C0 C1 C0 C1 C0 Cache
  • 22. Storage Machinery: Write Path Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Page 22 RegionServer (HBase) DataNode (Hadoop DFS) HLog (WAL) HRegion HStore StoreFile HFile StoreFile HFile MemStore ... ... HStore BlockCache HRegion HStore HStore ... ... 1 2 3 4 5
  • 23. Storage Machinery: Read Path Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Page 23 RegionServer (HBase) DataNode (Hadoop DFS) HLog (WAL) HRegion HStore StoreFile HFile StoreFile HFile MemStore ... ... HStore BlockCache HRegion HStore HStore ... ... 1 5 2 3 3 2 4
  • 24. Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. By Example Page 24
  • 25. Database Dichotomy Latency Predicate Pushdown Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Page 25 Row+Column Bloomfilter Lossy WAL Durability Smart Rowkey Design Compression Write Read Throughput Larger Block Size Bulkload HFiles Compression Compression Compression Smaller Block Size Larger BlockCache Increase Blocking Storefiles Increased Scanner Caching Larger Flush Size Smart Rowkey Design Smart Rowkey Design Smart Rowkey Design
  • 26. Web-scale Database Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Page 26 App server App server App server App server App server Counters Sessions User profiles Social Media Application Data Latency Write Read Throughput
  • 27. Search Search Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. “BigIndex” Page 27 "BigIndex" Document store Search Search Search App server App server App server App server App server Latency Write Read Throughput
  • 28. Materialized View Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Page 28 App server App server App server App server App server Latency Write Read Throughput
  • 29. Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. ETL Assist Write Read Page 29 Latency Throughput
  • 30. Lambda Architecture Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Page 30 App server App server App server App server App server
  • 31. Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Resources Page 31
  • 32. Join the Community! • hbase.apache.org – hbase.apache.org/book.html – blogs.apache.org/hbase/ – hbase.apache.org/mail-lists.html • IRC: irc.freenode.net #hbase • JIRA: issues.apache.org/jira/browse/HBASE • Source: git clone git://git.apache.org/hbase.git • In person – HBaseCon, hbasecon.com – Hadoop Summit, hadoopsummit.org – Strata / Hadoop World, strataconf.com – Local meetup near you! Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Page 32
  • 33. Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. HBase 0.98/1.0 • Hardening – Stability, Reliability, Availability, Performance • Horizontal Scalability – 1000’s of machines • Availability – Speed of Recovery (MTTR), Region Replicas • Improved Multi-tenancy – RPC priorities/QoS, namespace management • Cell-level security • Semantic Versioning • Client API cleanup Page 33
  • 34. Thanks! Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Page 34 M A N N I N G Nick Dimiduk Amandeep Khurana FOREWORD BY Michael Stack hbaseinaction.com Nick Dimiduk github.com/ndimiduk @xefyr n10k.com strataeucftw slideshare.net/xefyr/hbase-for-architects