3. Managing and analysing massive data
◦ Provides high performance
◦ Scales over clusters of thousands of heterogeneous
machines
◦ Versatile: adapts to analytical
queries of varying complexity
How does one build real-world applications with
HadoopDB?
4. Database Connector - connects Hadoop with
the single-node database systems.
Data Loader - partitions data and manages
parallel loading of data into the database
systems.
Catalog - tracks locations of different data
chunks, including those replicated across
multiple nodes.
SQL-MapReduce-SQL (SMS) planner - extends
Hive to provide a SQL interface to HadoopDB
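The roles of the Data Loader and the Catalog can be illustrated with a minimal sketch. It is not HadoopDB's actual code: the function names (`partition`, `build_catalog`), the CRC32 hash, the round-robin replica placement, and the node names are all assumptions made for illustration.

```python
import zlib
from collections import defaultdict


def partition(rows, key, num_chunks):
    """Hash-partition rows into num_chunks chunks on the given key column."""
    chunks = defaultdict(list)
    for row in rows:
        # CRC32 is used only because it is stable across runs (assumption).
        chunk_id = zlib.crc32(str(row[key]).encode()) % num_chunks
        chunks[chunk_id].append(row)
    return chunks


def build_catalog(chunk_ids, nodes, replicas=2):
    """Map each chunk id to the nodes that hold a copy of it."""
    catalog = {}
    for i, chunk_id in enumerate(sorted(chunk_ids)):
        # Round-robin placement: consecutive nodes hold the replicas.
        catalog[chunk_id] = [nodes[(i + r) % len(nodes)] for r in range(replicas)]
    return catalog


# Hypothetical input data and node names.
rows = [{"id": i, "name": f"row{i}"} for i in range(10)]
nodes = ["node-a", "node-b", "node-c"]

chunks = partition(rows, "id", num_chunks=4)
catalog = build_catalog(range(4), nodes, replicas=2)

for chunk_id, holders in catalog.items():
    print(chunk_id, len(chunks[chunk_id]), holders)
```

A query planner would consult a structure like `catalog` to decide which node should process each chunk, and to pick a replica when a node is unavailable.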
5.
6. Supports any JDBC-compliant database server
as an underlying DBMS layer
Applications built on top of HadoopDB
generally use the 3-tier architecture
◦ data tier
◦ business logic tier
◦ presentation tier
HadoopDB is a black box (from the application's
perspective)
7. A semantic web/biological data analysis
application.
A business data warehousing application.
8. Semantic web is an effort by the W3C to
enable integration and sharing of data across
different applications
RDF: a directed, labeled graph data format
for representing information in the Web
SPARQL: an RDF query language
9. Find all proteins whose existence in the
'Human' organism is uncertain
SPARQL query:
10.
11. Demonstrates
◦ how the data administrator should prepare the
dataset.
Analyst: shielded from the complexity of
the actual implementation of the RDF storage
layer.
12. Natural target application for HadoopDB.
Common business data warehousing
workloads are read-mostly and involve
analytical queries over a complex schema
To achieve good query performance, the
dataset requires significant preparation
through data partitioning and replication to
optimize for join queries
Data & Queries: TPC-H benchmark
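Why partitioning helps join queries can be shown with a small sketch. If two tables are partitioned on their join key with the same hash function, matching rows land on the same node and each node can join its own partitions locally. The column names `o_orderkey`, `l_orderkey` and `l_linenumber` come from the TPC-H schema. The helper names, the hash function, the node count and the toy rows are assumptions, and this is not HadoopDB's actual partitioning scheme.

```python
import zlib


def node_for(key, num_nodes):
    """Pick a node for a join-key value (stable hash, chosen for illustration)."""
    return zlib.crc32(str(key).encode()) % num_nodes


def co_partition(rows, key, num_nodes):
    """Split rows into one partition per node, based on the join key."""
    parts = [[] for _ in range(num_nodes)]
    for row in rows:
        parts[node_for(row[key], num_nodes)].append(row)
    return parts


def local_join(orders, lineitems):
    """Join ORDERS and LINEITEM rows held on a single node."""
    order_keys = {o["o_orderkey"] for o in orders}
    return [
        (item["l_orderkey"], item["l_linenumber"])
        for item in lineitems
        if item["l_orderkey"] in order_keys
    ]


# Toy rows, made up for the example.
orders = [{"o_orderkey": k} for k in range(1, 6)]
lineitems = [
    {"l_orderkey": k, "l_linenumber": n} for k in range(1, 6) for n in (1, 2)
]

NUM_NODES = 3
order_parts = co_partition(orders, "o_orderkey", NUM_NODES)
line_parts = co_partition(lineitems, "l_orderkey", NUM_NODES)

# Each node joins only its own partitions; no rows cross node boundaries.
distributed = sorted(
    pair
    for o_part, l_part in zip(order_parts, line_parts)
    for pair in local_join(o_part, l_part)
)
central = sorted(local_join(orders, lineitems))
print(distributed == central)
```

The union of the per-node joins equals the join over the whole dataset, which is what lets the join run inside each node's database without shuffling data between nodes.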
14. Audience is invited to query both data sets
through HadoopDB
Data sets are located in a remote cluster
Multi-user interaction: two client
machines that connect to the clusters.
15. User selects a dataset
Semantic Web (Biological Data Analysis)
- An animation of the behind-the-scenes data
preparation & loading is presented
- Details are given on the tools used for data conversion from
RDF to relational form.
Business Data Warehousing: the animation provides
details on the partitioning scheme, the interaction
between the loader and catalog components, and a
summary of the configuration parameters
User selects and parametrizes a query to execute
- User can then monitor the progress of query
execution
16. In addition, the demonstration shows HadoopDB's
fault tolerance by introducing a node
failure.
For a subset of the predefined queries, as the
query executes in the background, an animation
of the flow of data and control through the
HadoopDB system is simultaneously presented,
highlighting which parts of the query execution
are run in parallel.
HadoopDB therefore pushes computation closer to data (into the data tier) to achieve maximum parallelization in a multi-node cluster.
The complexity of the data tier and its parallel nature are hidden from the application developer.
Universal Protein Resource.
The presentation layer consists of a web-based interface where analysts specify queries and view results.
The logic layer consists of a SPARQL-to-SQL conversion tool.
The logic and data layers communicate through JDBC.
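To give a feel for what a SPARQL-to-SQL conversion step involves, the sketch below translates a list of triple patterns into a SQL self-join over a single three-column table. Everything here is an assumption for illustration: the slides do not describe the tool's relational schema or algorithm, the `triples(s, p, o)` layout is just one common way to store RDF relationally, the `ex:` identifiers are invented, and Python's `sqlite3` module stands in for the JDBC connection to a single-node database.

```python
import sqlite3


def bgp_to_sql(patterns):
    """Translate triple patterns into one SQL self-join over triples(s, p, o).

    Terms starting with '?' are variables; all other terms are constants.
    """
    select, where, params, seen = [], [], [], {}
    for i, pattern in enumerate(patterns):
        for col, term in zip(("s", "p", "o"), pattern):
            ref = f"t{i}.{col}"
            if term.startswith("?"):
                if term in seen:
                    # Repeated variable: becomes a join condition.
                    where.append(f"{ref} = {seen[term]}")
                else:
                    seen[term] = ref
                    select.append(f"{ref} AS {term[1:]}")
            else:
                # Constant: bound as a query parameter.
                where.append(f"{ref} = ?")
                params.append(term)
    tables = ", ".join(f"triples t{i}" for i in range(len(patterns)))
    sql = f"SELECT {', '.join(select) or '1'} FROM {tables}"
    if where:
        sql += " WHERE " + " AND ".join(where)
    return sql, params


# In-memory database with invented example triples.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("ex:p1", "ex:organism", "ex:human"),
        ("ex:p1", "ex:existence", "ex:uncertain"),
        ("ex:p2", "ex:organism", "ex:human"),
        ("ex:p2", "ex:existence", "ex:confirmed"),
        ("ex:p3", "ex:organism", "ex:mouse"),
        ("ex:p3", "ex:existence", "ex:uncertain"),
    ],
)

# Two patterns sharing ?protein, loosely shaped like the slide 9 question.
patterns = [
    ("?protein", "ex:organism", "ex:human"),
    ("?protein", "ex:existence", "ex:uncertain"),
]
sql, params = bgp_to_sql(patterns)
rows = conn.execute(sql, params).fetchall()
print(sql)
print(rows)
```

Each triple pattern becomes one alias of the table, and a variable shared between patterns becomes an equality between aliases, so the graph query turns into a relational join that the underlying database can execute.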
The presentations give our audience an idea of the effort required for data preparation in HadoopDB.