Hadoop, HBase and other H… (25.12.2009)
Mom, what is this?
SPARSE
MULTIDIMENSIONAL / GLIST
Lists {
  danny.todolist1: {
    items: { Avatar: 17/12/2009, cleanup: 25/12/2009, … },
    attributes: { private: true, expiration: 10/10/2010, notify: true, … }
  },
  danny.wishlist1: {
    items: { AppleMacbook: http://applestore.com, Gucci pour hommes: http://guccistore.com, … },
    attributes: { private: false, … }
  },
  …
}
sorted: INSTEAD OF INDEXES, AS IN AN RDBMS! IN STRICT LEXICOGRAPHIC ORDER! DOES NOT SUPPORT A FULL RELATIONAL DATA MODEL
Conceptual / physical data storage
THE REAL BOSS: BIGTABLE. Google Earth, Google Analytics, Last.fm, Facebook, Adobe. WHO IS THE REAL BOSS? BIGTABLE / HBASE
WELCOME TO THE DISTRIBUTED WORLD
Pig on Twitter
Hive on Facebook
WHEN HBASE SHINES / STINKS
HBASE DATA MODEL RULES
CLUSTER STRUCTURE
PERFORMANCE
PERFORMANCE
Editor's Notes
  1. persistent: This adjective just means "permanent". In this context it only says that the system does not depend on the applications that use it, and that the data lives on persistent storage devices rather than in RAM.

persistent: The distributed nature of these systems can be looked at from two angles. First, both HBase and BigTable themselves run across a large number of servers, which fall into two broad categories: master and slave. The slave servers do all the actual work with the data, while the master only coordinates them and manages the process as a whole. This provides a high degree of fault tolerance (in HBase, admittedly, there is only one master server, a single point of failure for the whole system, but that is a temporary limitation that will surely be removed in future versions) and also makes the system much easier to scale, since adding servers (and with them capacity and throughput) is trivial, painless, and does not interfere with normal operation. On top of that, each of these systems normally keeps its data in a clustered file system (HDFS for HBase, GFS for BigTable), which are themselves distributed and work on a similar principle, adding extra durability by replicating the data in several copies (usually three) across several servers.

sorted: HBase and BigTable build no indexes to speed up data retrieval. The only rule they follow is this: every slave server is responsible for a particular range of keys (from one value up to another) and keeps all records in strict lexicographic order by key (note: no ordering of the values is guaranteed!). Continuing the JSON example, it would look roughly like this: { "123": "map", "abc": "sample", "mnbvcxz": "looks like this", "qqqq": "some", "zz": "JSON" }. You can exploit this fact when planning how to use the system: for example, if you intend to use domain names as keys, it makes sense to store them "reversed", e.g. "com.example.www" instead of "www.example.com". That almost guarantees that all subdomains of one domain land on the same slave server, and also groups domains by zone.
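To make the "reversed domain" trick concrete, here is a tiny, self-contained Java sketch of such a row-key helper. The class and method names are made up for illustration only and are not part of HBase.

// A small helper (hypothetical, not part of HBase) that reverses the host
// components of a hostname so that rows for the same domain sort next to
// each other in a lexicographically ordered key space.
public class RowKeys {
    /** "www.example.com" -> "com.example.www" */
    public static String reverseDomain(String host) {
        String[] parts = host.split("\\.");
        StringBuilder sb = new StringBuilder(host.length());
        for (int i = parts.length - 1; i >= 0; i--) {
            sb.append(parts[i]);
            if (i > 0) sb.append('.');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Keys for all example.com subdomains now share the prefix "com.example."
        System.out.println(reverseDomain("www.example.com"));   // com.example.www
        System.out.println(reverseDomain("mail.example.com"));  // com.example.mail
    }
}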
  2. RDBMS: null values are a must! Every row has the same length (the same fixed set of columns). HBASE: stuff any columns you can think of, under any names you like, into any row - rows are elastic! Column names carry info themselves!
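A minimal sketch of what "column names carry info themselves" looks like from the HBase Java client, assuming the older HTable/Put style of the client API; the table name "lists" and column family "items" are hypothetical.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class WishlistWriter {
    public static void main(String[] args) throws Exception {
        // Hypothetical table "lists" with a single column family "items".
        HTable table = new HTable(HBaseConfiguration.create(), "lists");

        Put put = new Put(Bytes.toBytes("danny.wishlist1"));
        // The column *qualifiers* are data: one column per wishlist item,
        // created on the fly, with no schema change needed.
        put.add(Bytes.toBytes("items"), Bytes.toBytes("AppleMacbook"),
                Bytes.toBytes("http://applestore.com"));
        put.add(Bytes.toBytes("items"), Bytes.toBytes("Gucci pour hommes"),
                Bytes.toBytes("http://guccistore.com"));

        table.put(put);
        table.close();
    }
}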
  3. In fact it is three-dimensional. The main parts of the structure are: row, column family, column, timestamp. A Bigtable is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes: (row:string, column:string, time:int64) → string. A value can hold ASCII text, any binary data, or a serialized object (e.g. Thrift, protobuf).
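Purely as an illustration of the data model above (not of HBase internals), the map can be sketched in Java as nested sorted maps, with timestamps kept in descending order so the newest version of a cell is found first. All names here are made up.

import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy in-memory model: row -> column -> timestamp -> uninterpreted bytes.
public class SortedMultiMap {
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, byte[]>>> rows =
            new TreeMap<>();   // rows in lexicographic order

    public void put(String row, String column, long ts, byte[] value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(column, c -> new TreeMap<>(Comparator.reverseOrder()))
            .put(ts, value);   // newest timestamp sorts first
    }

    /** Latest version of a cell, or null if the cell was never written. */
    public byte[] getLatest(String row, String column) {
        NavigableMap<String, NavigableMap<Long, byte[]>> cols = rows.get(row);
        if (cols == null) return null;
        NavigableMap<Long, byte[]> versions = cols.get(column);
        return versions == null ? null : versions.firstEntry().getValue();
    }
}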
  4. sorted: HBase and BigTable build no indexes to speed up data retrieval, but in return: Bigtable maintains data in lexicographic order by row key. The row range for a table is dynamically partitioned across servers. Each row range is called a tablet, which is the unit of distribution and load balancing. As a result, reads of short row ranges are efficient and typically require communication with only a small number of machines. Clients can exploit this property by selecting their row keys so that they get good locality for their data accesses. For example, in Webtable, pages in the same domain are grouped together into contiguous rows by reversing the hostname components of the URLs: data for maps.google.com/index.html is stored under the key com.google.maps/index.html. Storing pages from the same domain near each other makes some host and domain analyses more efficient.
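In client terms, this locality is what makes a range scan over one domain cheap. A hedged sketch with the older HTable/Scan client API against a hypothetical "webtable" keyed by reversed hostnames; the stop row is simply the start prefix with its last byte incremented.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class DomainScan {
    public static void main(String[] args) throws Exception {
        // Hypothetical "webtable" keyed by reversed hostnames (com.google.maps/...).
        HTable table = new HTable(HBaseConfiguration.create(), "webtable");

        // Rows are sorted lexicographically, so all pages of *.google.com form
        // one contiguous row range and can be read with a single scan.
        Scan scan = new Scan(Bytes.toBytes("com.google."),
                             Bytes.toBytes("com.google/"));   // stop row is exclusive; '/' = '.' + 1
        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result row : scanner) {
                System.out.println(Bytes.toString(row.getRow()));
            }
        } finally {
            scanner.close();
            table.close();
        }
    }
}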
  5. Conceptual View: Conceptually, a table may be thought of as a collection of rows that are located by a row key (and optional timestamp), and where any column may not have a value for a particular row key (sparse). The following example is a slightly modified form of the one on page 2 of the Bigtable paper (it adds a new column family "mime:").

Physical Storage View: Although at a conceptual level tables may be viewed as a sparse set of rows, physically they are stored on a per-column-family basis. This is an important consideration for schema and application designers to keep in mind. Pictorially, the table shown in the conceptual view above would be stored accordingly. It is important to note that the empty cells shown in the conceptual view are not stored, since they need not be in a column-oriented storage format. Thus a request for the value of the "contents:" column at timestamp t8 would return no value. Similarly, a request for an "anchor:my.look.ca" value at timestamp t9 would return no value.

However, if no timestamp is supplied, the most recent value for a particular column is returned, and it is also the first one found, since timestamps are stored in descending order. Thus a request for the values of all columns in the row "com.cnn.www" with no timestamp specified would return: the value of "contents:" from timestamp t6, the value of "anchor:cnnsi.com" from timestamp t9, the value of "anchor:my.look.ca" from timestamp t8, and the value of "mime:" from timestamp t6.
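The timestamp behaviour described above maps directly onto the client API. A sketch assuming the 0.90-era HBase Java client (Get / setMaxVersions / Result.raw), reusing the row and column names from the Bigtable paper example; the table name is assumed.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class VersionedRead {
    public static void main(String[] args) throws Exception {
        // Hypothetical "webtable" as in the example above.
        HTable table = new HTable(HBaseConfiguration.create(), "webtable");

        Get get = new Get(Bytes.toBytes("com.cnn.www"));
        get.addColumn(Bytes.toBytes("anchor"), Bytes.toBytes("cnnsi.com"));
        get.setMaxVersions(3);          // ask for up to three stored versions

        Result result = table.get(get);
        // Versions come back newest first, mirroring the descending timestamp order.
        for (KeyValue kv : result.raw()) {
            System.out.println(kv.getTimestamp() + " -> " + Bytes.toString(kv.getValue()));
        }
        table.close();
    }
}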
  6. You heard the term BigTable come up there. That is no accident. Now is the time to start properly, from the origins.

Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Bigtable does not support a full relational data model; instead, it provides clients with a simple data model that supports dynamic control over data layout and format, and allows clients to reason about the locality properties of the data represented in the underlying storage. Data is indexed using row and column names that can be arbitrary strings. Bigtable also treats data as uninterpreted strings, although clients often serialize various forms of structured and semi-structured data into these strings.

There are a few different terms used in either system describing the same thing. The most prominent is what HBase calls "regions" while Google refers to them as "tablets". These are the partitions of subsequent rows spread across many "region servers" - or "tablet servers" respectively.

HBase is the open source answer to BigTable, Google's highly scalable distributed database. It is built on top of Hadoop, which implements functionality similar to Google's GFS and MapReduce systems. Both Google's GFS and Hadoop's HDFS provide a mechanism to reliably store large amounts of data. However, there is not really a mechanism for organizing the data and accessing only the parts that are of interest to a particular application. Bigtable (and HBase) provide a means for organizing and efficiently accessing these large data sets. HBase is still not ready for production, but it is a glimpse into the power that will soon be available to your average website builder. Google is of course still way ahead of the game. They have huge core competencies in data center roll-out and they will continually improve their stack. It will be interesting to see how these sorts of tools, along with Software as a Service, can be leveraged to create the next generation of systems. So the real grey eminence behind it all is Google BigTable.

Google Earth: Google operates a collection of services that provide users with access to high-resolution satellite imagery of the world's surface, both through the web-based Google Maps interface (maps.google.com) and through the Google Earth (earth.google.com) custom client software. These products allow users to navigate across the world's surface: they can pan, view, and annotate satellite imagery at many different levels of resolution. This system uses one table to preprocess data, and a different set of tables for serving client data. The preprocessing pipeline uses one table to store raw imagery. During preprocessing, the imagery is cleaned and consolidated into final serving data. This table contains approximately 70 terabytes of data and therefore is served from disk. The images are efficiently compressed already, so Bigtable compression is disabled. Each row in the imagery table corresponds to a single geographic segment. Rows are named to ensure that adjacent geographic segments are stored near each other. The table contains a column family to keep track of the sources of data for each segment. This column family has a large number of columns: essentially one for each raw data image. Since each segment is only built from a few images, this column family is very sparse. The preprocessing pipeline relies heavily on MapReduce over Bigtable to transform data. The overall system processes over 1 MB/sec of data per tablet server during some of these MapReduce jobs. The serving system uses one table to index data stored in GFS. This table is relatively small (~500 GB), but it must serve tens of thousands of queries per second per datacenter with low latency. As a result, this table is hosted across hundreds of tablet servers and contains in-memory column families.

Adobe: uses Hadoop and HBase in several areas, from social services to structured data storage and processing for internal use. Currently about 30 nodes running HDFS, Hadoop and HBase in clusters ranging from 5 to 14 nodes, on both production and development; a deployment on an 80-node cluster is planned.

Facebook: uses Hadoop to store copies of internal log and dimension data sources and uses it as a source for reporting/analytics and machine learning.

Yahoo!: more than 100,000 CPUs in >25,000 computers running Hadoop. Their biggest cluster: 4000 nodes (2*4-CPU boxes with 4*1TB disk and 16GB RAM), used to support research for Ad Systems and Web Search, and also for scaling tests to support development of Hadoop on larger clusters. More than 40% of Hadoop jobs within Yahoo are Pig jobs.

Last.fm: 31 nodes, dual quad-core Xeon L5520 (Nehalem) @ 2.27GHz, 16GB RAM, 4TB/node storage. Used for charts calculation, log analysis, A/B testing.
  7. Working with HBase, you will run into the following concepts one way or another!

HADOOP: The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. Hadoop includes these subprojects:
* Hadoop Common: The common utilities that support the other Hadoop subprojects.
* Avro: A data serialization system that provides dynamic integration with scripting languages.
* Chukwa: A data collection system for managing large distributed systems.
* HBase: A scalable, distributed database that supports structured data storage for large tables.
* HDFS: A distributed file system that provides high-throughput access to application data.
* Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
* MapReduce: A software framework for distributed processing of large data sets on compute clusters.
* Pig: A high-level data-flow language and execution framework for parallel computation.
* ZooKeeper: A high-performance coordination service for distributed applications.

ZOOKEEPER is a lock service. From the BigTable paper: Bigtable uses Chubby for a variety of tasks: to ensure that there is at most one active master at any time; to store the bootstrap location of Bigtable data (see Section 5.1); to discover tablet servers and finalize tablet server deaths (see Section 5.2); to store Bigtable schema information (the column family information for each table); and to store access control lists. If Chubby becomes unavailable for an extended period of time, Bigtable becomes unavailable. There is a lot of overlap compared to how HBase uses ZooKeeper. What is different is that schema information is not stored in ZooKeeper (yet; see http://wiki.apache.org/hadoop/ZooKeeper/HBaseUseCases for details). What is important here, though, is the same reliance on the lock service being available. From my own experience and from reading the threads on the HBase mailing list, it is often underestimated what can happen when ZooKeeper does not get the resources it needs to react in a timely manner. It is better to have a small ZooKeeper cluster on older machines not doing anything else, as opposed to having ZooKeeper nodes running next to the already heavy Hadoop or HBase processes. Once you starve ZooKeeper, you will see a domino effect of HBase nodes going down with it - including the master(s). Another important difference is that ZooKeeper is no lock service like Chubby - and I do think it does not have to be, as far as HBase is concerned. So where Chubby creates a lock file to indicate a tablet server is up and running, HBase in turn uses ephemeral nodes that exist as long as the session between the RegionServer that creates the node and ZooKeeper is active. This also causes differences in semantics: in BigTable the master can delete a tablet server's lock file to indicate that it has lost its lease on tablets; in HBase this has to be handled differently because of the slightly less restrictive architecture of ZooKeeper. These are only semantics, as mentioned, and do not mean one is better than the other - just different.

MAPREDUCE is a software framework for easily writing applications that process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. Typically the compute nodes and the storage nodes are the same, that is, the Map/Reduce framework and the Hadoop Distributed File System (see HDFS Architecture) run on the same set of nodes. This configuration allows the framework to effectively schedule tasks on the nodes where data is already present, resulting in very high aggregate bandwidth across the cluster. Thus it takes advantage of pairing with HDFS, executing work on the HDFS node that has the data needed for the work. A Map/Reduce job usually splits the input data set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.

HDFS is a distributed file system. Compared with GFS:
- Master server: NameNode (HDFS) / Master (GFS)
- Slave servers: DataNode servers (HDFS) / Chunk servers (GFS)
- Append and Snapshot operations: no (HDFS) / yes (GFS)
- Automatic recovery after master failure: no (HDFS) / yes (GFS)
- Implementation language: Java (HDFS) / C++ (GFS)

Replication mechanism: when the NameNode detects the failure of one of the DataNode servers (absence of heartbeat messages from it), the data replication mechanism kicks in: new DataNode servers are chosen for new replicas, and data placement is rebalanced across the DataNode servers. The same actions are taken if replicas become corrupted or if the number of replicas per block is increased.

Replica placement strategy: data is stored as a sequence of fixed-size blocks. Copies of each block (replicas) are stored on several servers, three by default, placed as follows: the first replica goes on the local node, the second replica on another node in the same rack, the third replica on an arbitrary node of another rack, and any remaining replicas are placed arbitrarily. When reading data, the client picks the closest DataNode server holding a replica.

Data integrity: the relaxed consistency model implemented in the file system does not guarantee that replicas are identical, so HDFS shifts data-integrity checking onto the clients. When creating a file, the client computes checksums for every 512 bytes, which are then stored on a DataNode server. When reading a file, the client fetches both the data and the checksums; if they do not match, it turns to another replica.

Writing data: when writing to HDFS, an approach is used that achieves high throughput. The application writes in streaming mode, while the HDFS client caches the data being written in a temporary local file. Once a full HDFS block of data has accumulated in that file, the client contacts the NameNode, which registers the new file, allocates a block, and returns to the client a list of DataNode servers for storing the block replicas. The client starts transferring the block data from the temporary file to the first DataNode server in the list. That DataNode stores the data on disk and forwards it to the next DataNode in the list. In this way the data is transferred in a pipelined fashion and replicated on the required number of servers. When the write is finished, the client notifies the NameNode, which commits the file-creation transaction, after which the file becomes visible in the system.
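From the client's point of view, the whole pipeline above hides behind a single write call. A minimal sketch assuming the standard org.apache.hadoop.fs.FileSystem API; the path, replication factor, and block size below are illustrative values, not requirements.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        // create(path, overwrite, bufferSize, replication, blockSize):
        // ask for 3 replicas and 64 MB blocks, matching the defaults described above.
        Path path = new Path("/user/demo/events.log");   // hypothetical path
        FSDataOutputStream out = fs.create(path, true, 4096, (short) 3, 64L * 1024 * 1024);
        out.write("one line of log data\n".getBytes("UTF-8"));
        out.close();   // the buffered block is flushed through the DataNode pipeline

        fs.close();
    }
}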
  8. Challenge: how many tweets per user? http://www.slideshare.net/kevinweil/hadoop-pig-and-twitter-nosql-east-2009?src=related_normal&rel=1715825
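The linked deck solves this with a couple of lines of Pig. Purely as a reference point, here is the same per-user count as a bare Hadoop MapReduce job in Java; the tab-separated user_id<TAB>tweet_text input layout is an assumption, and the Hadoop 2.x mapreduce API is used.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TweetsPerUser {

    // Map: one input line per tweet, "user_id<TAB>tweet_text" -> (user_id, 1)
    public static class TweetMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text user = new Text();

        @Override
        protected void map(LongWritable key, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\t", 2);
            if (fields.length == 2) {
                user.set(fields[0]);
                ctx.write(user, ONE);
            }
        }
    }

    // Reduce (also used as combiner): sum the ones per user
    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text user, Iterable<LongWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            long total = 0;
            for (LongWritable c : counts) total += c.get();
            ctx.write(user, new LongWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "tweets-per-user");
        job.setJarByClass(TweetsPerUser.class);
        job.setMapperClass(TweetMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}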
  9. SQL-like language for the distributed database. http://www.slideshare.net/cloudera/hw09-hadoop-development-at-facebook-hive-and-hdfs
  10. Many recent projects have tackled the problem of providing distributed storage or higher-level services over wide area networks, often at "Internet scale." This includes work on distributed hash tables that began with projects such as CAN [29], Chord [32], Tapestry [37], and Pastry [30]. These systems address concerns that do not arise for Bigtable, such as highly variable bandwidth, untrusted participants, or frequent reconfiguration; decentralized control and Byzantine fault tolerance are not Bigtable goals.

When HBase Shines: One place where HBase really does well is when you have records that are very sparse. This might mean un- or semi-structured data. In any case, unlike row-oriented RDBMSs, HBase is column-oriented, meaning that nulls are stored for free. If you have a row that only has one out of dozens of possible columns, literally only that single column is stored. This can mean huge savings in both disk space and IO read time.

Another way that HBase matches well to un- or semi-structured data is in its treatment of column families. In HBase, individual records of data are called cells. Cells are addressed with a row key/column family/cell qualifier/timestamp tuple. However, when you define your schema, you only specify what column families you want, with the qualifier portion determined dynamically by consumers of the table at runtime. This means that you can store pretty much anything in a column family without having to know what it will be in advance. This also allows you to essentially store one-to-many relationships in a single row! Note that this is not denormalization in the traditional sense, as you aren't storing one row per parent-child tuple. This can be very powerful - if your child entities are truly subordinate, they can be stored with their parent, eliminating all join operations.

In addition to handling sparse data well, HBase is also great for versioned data. As mentioned, the timestamp is a part of the cell "coordinates". This is handy, because HBase stores a configurable number of versions of each cell you write, and then allows you to query what the state of that cell is at different points in time. Imagine, for instance, a record of a person with a column for location. Over time, that location might change. HBase's schema would allow you to easily store a person's location history along with when it changed, all in the same logical place.

Finally, of course, there's the scaling. HBase is designed to partition horizontally across tens to hundreds of commodity PCs. This is how HBase deals with the problem of adding more CPUs, RAM and disks. I don't feel like I need to go far down the road of discussing this idea, because it seems to be the one thing everyone gets about HBase. (If you need more convincing, read the original Bigtable paper. It's got graphs!) However, data denormalization for the sake of performance is a technique that is in use in standard RDBMSs as well.

When HBase Isn't Right: I'll just go ahead and say it: HBase isn't right for every purpose. Sure, you could go ahead and take your problem domain and squeeze it into HBase in one way or another, but then you'd be committing the same error we're trying to avoid by moving away from RDBMSs in the first place.

Firstly, if your data fits into a standard RDBMS without too much squeezing, chances are you don't need HBase. That is, if a modestly expensive server loaded with MySQL fits your needs, then that's probably what you want. Don't make the mistake of assuming you need massive scale right off the bat.

Next, if your data model is pretty simple, you probably want to use an RDBMS. If your entities are all homogeneous, you'll probably have an easy time of mapping your objects to tables. You also get some nice flexibility in terms of your ability to add indexes, query on non-primary-key values, do aggregations, and so on without much additional work. This is where RDBMSs shine - for decades they've been doing this sort of thing and doing it well, at least at lower scale. HBase, on the other hand, doesn't allow for querying on non-primary-key values, at least directly. HBase allows get operations by primary key and scans (think: cursor) over row ranges. (If you have both scale and a need for secondary indexes, don't worry - Lucene to the rescue! But that's another post.)

Finally, another thing you shouldn't do with HBase (or an RDBMS, for that matter) is store large amounts of binary data. When I say large amounts, I mean tens to hundreds of megabytes. Certainly both RDBMSs and HBase have the capability to store large amounts of binary data. However, again, we have an impedance mismatch. RDBMSs are built to be fast metadata stores; HBase is designed to have lots of rows and cells, but functions best when the rows are (relatively) small. HBase splits the virtual table space into regions that can be spread out across many servers. The default size of individual files in a region is 256MB. The closer to the region limit you make each row, the more overhead you are paying to host those rows. If you have to store a lot of big files, then you're best off storing them in the local filesystem, or, if you have LOTS of data, HDFS. You can still keep the metadata in an RDBMS or HBase - but do us all a favor and just keep the path in the metadata.
  11. Cluster structure. The Bigtable implementation has three major components: a library that is linked into every client, one master server, and many tablet servers. Tablet servers can be dynamically added (or removed) from a cluster to accommodate changes in workloads. The master is responsible for assigning tablets to tablet servers, detecting the addition and expiration of tablet servers, balancing tablet-server load, and garbage collection of files in GFS. In addition, it handles schema changes such as table and column family creations. Each tablet server manages a set of tablets (typically somewhere between ten and a thousand tablets per tablet server). The tablet server handles read and write requests to the tablets that it has loaded, and also splits tablets that have grown too large. As with many single-master distributed storage systems [17, 21], client data does not move through the master: clients communicate directly with tablet servers for reads and writes. Because Bigtable clients do not rely on the master for tablet location information, most clients never communicate with the master. As a result, the master is lightly loaded in practice.

A Bigtable cluster stores a number of tables. Each table consists of a set of tablets, and each tablet contains all data associated with a row range. Initially, each table consists of just one tablet. As a table grows, it is automatically split into multiple tablets, each approximately 100-200 MB in size by default.

Tablet location: a three-level hierarchy analogous to that of a B+-tree [10] is used to store tablet location information. The first level is a file stored in Chubby that contains the location of the root tablet. The root tablet contains the location of all tablets in a special METADATA table. Each METADATA tablet contains the location of a set of user tablets. The root tablet is just the first tablet in the METADATA table, but is treated specially - it is never split - to ensure that the tablet location hierarchy has no more than three levels. The METADATA table stores the location of a tablet under a row key that is an encoding of the tablet's table identifier and its end row.

As mentioned above, in HBase the root region is its own table with a single region. Whether that makes any real difference compared to having it as the first (non-splittable) region of the meta table I strongly doubt; it is the same feature, implemented differently. HBase also has a different layout for the meta table: it stores the start and end row with each region, where the end row is exclusive and denotes the first (start) row of the next region. Again, these are minor differences, and I am not sure one solution is better or worse - it is just done differently.

A REGIONSERVER holds a number of regions.

ZOOKEEPER: from my own experience and from reading the threads on the HBase mailing list, it is often underestimated what can happen when ZooKeeper does not get the resources it needs to react in a timely manner. It is better to have a small ZooKeeper cluster on older machines not doing anything else, as opposed to having ZooKeeper nodes running next to the already heavy Hadoop or HBase processes.

As of August 2006, there were 388 non-test Bigtable clusters running in various Google machine clusters, with a combined total of about 24,500 tablet servers.
  12. Performance testing. Figure 6 shows two views on the performance of our benchmarks when reading and writing 1000-byte values to Bigtable. The table shows the number of operations per second per tablet server; the graph shows the aggregate number of operations per second.

Single tablet-server performance: Let us first consider performance with just one tablet server. Random reads are slower than all other operations by an order of magnitude or more. Each random read involves the transfer of a 64 KB SSTable block over the network from GFS to a tablet server, out of which only a single 1000-byte value is used. The tablet server executes approximately 1200 reads per second, which translates into approximately 75 MB/s of data read from GFS. This bandwidth is enough to saturate the tablet server CPUs because of overheads in our networking stack, SSTable parsing, and Bigtable code, and is also almost enough to saturate the network links used in our system. Most Bigtable applications with this type of access pattern reduce the block size to a smaller value, typically 8 KB. Random reads from memory are much faster since each 1000-byte read is satisfied from the tablet server's local memory without fetching a large 64 KB block from GFS. Random and sequential writes perform better than random reads since each tablet server appends all incoming writes to a single commit log and uses group commit to stream these writes efficiently to GFS. There is no significant difference between the performance of random writes and sequential writes; in both cases, all writes to the tablet server are recorded in the same commit log. Sequential reads perform better than random reads since every 64 KB SSTable block that is fetched from GFS is stored into our block cache, where it is used to serve the next 64 read requests. Scans are even faster since the tablet server can return a large number of values in response to a single client RPC, and therefore RPC overhead is amortized over a large number of values.

Scaling: Aggregate throughput increases dramatically, by over a factor of a hundred, as we increase the number of tablet servers in the system from 1 to 500. For example, the performance of random reads from memory increases by almost a factor of 300 as the number of tablet servers increases by a factor of 500. This behavior occurs because the bottleneck on performance for this benchmark is the individual tablet server CPU. However, performance does not increase linearly. For most benchmarks, there is a significant drop in per-server throughput when going from 1 to 50 tablet servers. This drop is caused by imbalance in load in multiple-server configurations, often due to other processes contending for CPU and network.
  13. HBase Performance Numbers
• 1M rows, 1 column per row (~16 bytes):
  – Sequential insert: 24 s, 0.024 ms/row
  – Random read: 1.42 ms/row (avg)
  – Full scan: 11 s (117 ms / 10k rows)
• Performance under cache is very high:
  – 1 ms to get a single row
  – 20 ms to read 550 rows
  – 75 ms to get 5500 rows