NoSQL Needs SomeSQL

© 2015 IBM Corporation. Hadoop Summit – San Jose 2015
NoSQL Needs SomeSQL
Scott C. Gray (sgray@us.ibm.com)
Senior Architect and STSM, Big SQL, Big Data Open Source

© 2015 IBM Corporation. Hadoop Summit – San Jose, CA – June 2015
Agenda
 SQL Overview
 History
 Pros and Cons
 Challenges of SQL on Hadoop
 NoSQL Overview
 History
 Solving the Challenges
 Advantages and Tradeoffs
 Conclusion and Questions

Structured Query Language
Quick History on SQL (for NoSQL comparison later on)
 Developed in the 1970s by IBM
 Multiple commercial offerings by 1980
 Standardization began in 1986 and continues today
 SQL:2011 is the most recent standard
 Defining characteristics:
 Tabular (row/column storage)
 Strict schema
 Highly encourages a relational design

What’s to Like? The obvious:
 A well known language
 Ubiquitous use by IT and business
 Standardization makes skills (and applications)
easily transferable
 Many, many tools available due to a relatively
simple and common data model
 Relational model allows you to easily explore
data relationships
 Sales by part #
 Sales by region
 Sales by customer
 …

What’s to Like? The not-so-obvious
 Formal and strict modelling allows for very smart optimizations based upon
 Data distribution (statistics)
 Data size (bytes per row, rows per page, etc.)
 Data type domains (value ranges, nullability, etc.)
 Declared domains (CHECK constraints)
 Formal relationships (referential constraints)
 The database engine can make very smart query strategy decisions

What’s NOT to Like?
 Typically not so efficient with sparse data
 This is changing with modern columnar stores – but they have tradeoffs too
 Very rigid, simple, data model makes modeling complex objects tedious
 May take dozens of tables to model one “object” (e.g. XML document)
 Fetching one “object” now requires significant work to reconstruct (many joins)
 Evolving the data model can be non-trivial
 E.g. changing a column’s type may require a table rebuild (and all dependent tables!)
 The relational model can make it difficult to be agile!
 The structure of all data must be defined up front

What about Apache Drill?
 All of this talk about schema inflexibility…but what about projects
like Apache Drill??
 Apache Drill allows for efficient SQL queries against data without a schema*
 *It at least needs to know how the data is encoded (e.g. JSON, XML, etc.)
 It re-evaluates the structure of each “row” of data as it runs
 Supports a number of NoSQL platforms (HBase, MongoDB, etc.)
 But this only addresses the flexibility of the query language, and still suffers from:
 Difficult to make optimization decisions (they are making some strides here…)
 Still pay a cost for joins (more on this coming up…)
 You still may not be able to ask a “table” what its schema is
• Lots of tooling relies upon this

SQL on Hadoop
The Great Promise
 In many ways, the architecture of Hadoop runs against the grain of relational processing
 Most DW’s rely heavily on controlled data placement
 Data is explicitly partitioned across the cluster
 A particular node “owns” a known subset of data
 Partitioning tables on the same key(s) and on the same
nodes allows for co-located processing
 The fundamental design of HDFS explicitly implements
“random” data placement
 No matter which node writes a block there is no
guarantee a copy will live on that node
 Rebalancing HDFS can move blocks around
 So, no co-located processing without bending over
backwards
See my other session:
Challenges of SQL on Hadoop
Thursday, 3:10pm – Grand Ballroom 220C
[Figure: a Query Coordinator over Partitions A, B, and C, each holding T1 and T2; HDFS]
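The co-location idea can be sketched as follows, assuming a simple hash-partitioning scheme. The node count, the `owner` function, and the sample tables are assumptions for illustration, not a description of any particular warehouse.

```python
import zlib

NODES = 3  # hypothetical cluster size

def owner(key):
    # Deterministic hash: every table maps the same key to the same node.
    return zlib.crc32(str(key).encode()) % NODES

def partition(rows, key_index=0):
    """Place each row on the node that owns its partitioning key."""
    nodes = {n: [] for n in range(NODES)}
    for row in rows:
        nodes[owner(row[key_index])].append(row)
    return nodes

t1 = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
t2 = [(1, "x"), (3, "y"), (4, "z")]
t1_parts = partition(t1)
t2_parts = partition(t2)

def local_join(node):
    # Each node joins only rows it owns; nothing crosses the network.
    return [
        (k1, v1, v2)
        for (k1, v1) in t1_parts[node]
        for (k2, v2) in t2_parts[node]
        if k1 == k2
    ]

joined = sorted(row for n in range(NODES) for row in local_join(n))
```

Because both tables use the same `owner` function, matching keys always land together. HDFS block placement gives no such guarantee, which is the problem the slide describes.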

SQL on Hadoop
Query Processing Without Data Placement
 Without co-location the options for join processing are limited
 Redistribution join
 DB engines read and filter “local” blocks for each table
 Records with the same key are shipped to the same
node to be joined
 In the worst case both joined tables are moved in their entirety!
 Doesn’t really work well for non-equijoins (!=, <, >, etc.)
 Hash Join
 Smaller, or heavily filtered, tables are shipped to all
other nodes
 An in memory hash table is used for very fast joins
 Can still lead to a lot of network traffic to move the small table
[Figure: DB Engine nodes holding blocks of T1 and T2, in two panels labeled “Broadcast Join” and “Hash Join”]
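The two strategies above can be simulated in a few lines to compare how many rows cross the network. This is a toy model under stated assumptions: each "node" is a list of `(key, value)` rows placed arbitrarily, and `node_for`, `redistribution_join`, and `broadcast_hash_join` are hypothetical names, not functions of any real engine.

```python
import zlib

def node_for(key, n_nodes):
    return zlib.crc32(str(key).encode()) % n_nodes

def redistribution_join(t1_blocks, t2_blocks):
    """Ship every row to the node chosen by its key, then join locally."""
    n = len(t1_blocks)
    inbox1 = [[] for _ in range(n)]
    inbox2 = [[] for _ in range(n)]
    moved = 0
    for blocks, inbox in ((t1_blocks, inbox1), (t2_blocks, inbox2)):
        for src, rows in enumerate(blocks):
            for row in rows:
                dst = node_for(row[0], n)
                if dst != src:
                    moved += 1  # row had to travel to another node
                inbox[dst].append(row)
    joined = [
        (k, a, b)
        for i in range(n)
        for (k, a) in inbox1[i]
        for (k2, b) in inbox2[i]
        if k == k2
    ]
    return sorted(joined), moved

def broadcast_hash_join(big_blocks, small_blocks):
    """Ship the whole small table to every node and probe an in-memory hash."""
    small = [row for rows in small_blocks for row in rows]
    moved = len(small) * (len(big_blocks) - 1)  # each row goes to all other nodes
    table = {}
    for k, v in small:
        table.setdefault(k, []).append(v)
    joined = [
        (k, a, b) for rows in big_blocks for (k, a) in rows for b in table.get(k, [])
    ]
    return sorted(joined), moved

t1_blocks = [[(1, "a"), (2, "b")], [(3, "c")], [(4, "d")]]
t2_blocks = [[(3, "y")], [(1, "x")], []]
redis_rows, redis_moved = redistribution_join(t1_blocks, t2_blocks)
bcast_rows, bcast_moved = broadcast_hash_join(t1_blocks, t2_blocks)
```

Both strategies return the same rows; they differ only in what has to move. The broadcast cost grows with the number of nodes, while the redistribution cost grows with the size of both tables.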

Enter: NoSQL (“Not Only” SQL!)
History of NoSQL
 It’s older than SQL!
 First database created in 1965 by TRW
 IBM’s IMS (hierarchical database) created for NASA and the Apollo space program in 1966
 Advanced on Hadoop by Google’s BigTable papers
 Defining characteristics:
 No pre-defined schema (a.k.a. late-binding, schema-on-read)
 Designed for horizontal scale-out
 Related data tends to be physically co-located or nested
 Strongly encourages non-relational designs
 Typically API-accessed (or path expressions)

Solving the Relational on Hadoop Challenge
 We saw the challenges of relational joins on distributed data
 There isn't time to explore each NoSQL technology
 Let's focus on one popular technology (HBase) and explore
how it can solve our relational woes, and the tradeoffs…

HBase In One Slide
 HBase is a popular key-value store for Hadoop
 Client/server database
 A table has no schema, just a name
 All HBase tables are ordered and
accessed by primary key
 Each row can have zero or more
name-value stores (“column family”)
 Each column family can have zero or
more name-value pairs
 Names and values are just binary data;
there are no data types!
MyTable (example rows)
Row Key 123412
  Col Family userinfo: fname=Scott, lname=Gray, age=45, mobile=609-555-1212
  Col Family changehistory: 20140721: fname=Scot; 20141103: age=44
Row Key 123746
  Col Family userinfo: fname=Mary, lname=Swanson, age=28, home=123-555-1212
  Col Family changehistory: (empty)
Row Key 139442
  Col Family userinfo: fname=Kimi, lname=Räikkönen, age=34, team=Ferrari
  Col Family changehistory: 20130911: team=Lotus; 20131007: age=33
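As a rough mental model (not the HBase client API), the structure described on this slide can be pictured as nested maps of bytes: row key to column family to name to value. The `get` and `scan` helpers below are hypothetical stand-ins.

```python
# Toy model only: everything is raw bytes, as on the slide.
my_table = {
    b"123746": {
        b"userinfo": {b"fname": b"Mary", b"lname": b"Swanson", b"age": b"28"},
    },
    b"123412": {
        b"userinfo": {b"fname": b"Scott", b"lname": b"Gray", b"age": b"45"},
        b"changehistory": {b"20140721": b"fname=Scot", b"20141103": b"age=44"},
    },
}

def get(table, row_key, family, name):
    """Look up one cell; missing rows, families, or names yield None."""
    return table.get(row_key, {}).get(family, {}).get(name)

def scan(table):
    """Yield rows in row-key order, which is how the table is accessed."""
    for key in sorted(table):
        yield key, table[key]

age_bytes = get(my_table, b"123412", b"userinfo", b"age")
age = int(age_bytes)  # the reader, not the store, decides this is a number
row_order = [key for key, _ in scan(my_table)]
```

Note that nothing stops one row from having a `mobile` entry and another a `home` entry, and that interpreting `b"45"` as an integer is entirely the caller's decision.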

Describing an HBase Table Relationally
 Different database engines provide different
mechanisms for describing HBase tables
 Describe how data is encoded in the table
 Map the column family:column to relational
column(s)
 But some common HBase design patterns are
difficult/impossible to describe relationally…
CREATE HBASE TABLE MY_TABLE
(
  C1 INT NOT NULL,
  C2 INT NOT NULL,
  C3 INT NOT NULL,
  C4 VARCHAR(10),
  C5 DECIMAL(5,2),
  C6 SMALLINT NOT NULL,
  CONSTRAINT PK1 PRIMARY KEY (C1, C2)
)
COLUMN MAPPING
(
  -- The HBase row key holds C1 and C2
  KEY MAPPED BY (C1, C2) ENCODING BINARY,
  -- One HBase column holds C3 and C4 as '|'-separated text
  CF:COL1 MAPPED BY (C3, C4)
    SEPARATOR '|' ENCODING STRING,
  -- One HBase column holds C5 and C6, decoded by a custom SerDe
  CF:COL2 MAPPED BY (C5, C6)
    ENCODING SERDE 'com.myco.MyJSONSerDe'
)
Big SQL Example
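To make the `CF:COL1` mapping concrete, here is a hedged sketch of what "two relational columns stored in one HBase value with a '|' separator" could look like on the wire. The helper names are hypothetical, and representing NULL as an empty string is an assumption of this sketch; the slide does not say how the engine actually encodes NULLs.

```python
def encode_col1(c3, c4):
    """Pack C3 (INT) and C4 (nullable VARCHAR) into one string-encoded cell."""
    c4_text = "" if c4 is None else c4  # assumption: NULL stored as empty text
    return f"{c3}|{c4_text}".encode("utf-8")

def decode_col1(cell):
    """Unpack one cell back into the (C3, C4) relational columns."""
    c3_text, c4_text = cell.decode("utf-8").split("|", 1)
    return int(c3_text), (c4_text or None)

cell = encode_col1(7, "abc")
```

The engine needs exactly this kind of description (separator, encoding, column order) before it can present the raw bytes as typed columns.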

HBase Design Patterns
Getting Rid of the Join
 One common HBase design pattern is to physically nest related data within its parent row
 Take the typical department/employee relationship
 Each employee may be in its own column family within the dept
 Reading the dept automatically reads the employees with it
 No need for joins!
DepartmentEmployees (example rows)
Row Key 0001
  Col Fam dept_info: Name=Finance, Manager=Bob Smith, Address=451 St. Claire…, Phone=609-555-1212
  Col Fam employees:
    287: { fname: Glen, lname: Hanks, … }
    934: { fname: Scott, lname: Anderson, … }
    16: { fname: Brian, lname: Applebaum, … }
    1023: { fname: Jim, lname: Demes, … }
Row Key 0002
  Col Fam dept_info: Name=Sales, Manager=Jane McClaren, Address=555 Bailey …, Phone=408-314-8234
  Col Fam employees:
    287: { fname: Tom, lname: Donohue, … }
    934: { fname: Mary, lname: Swanson, … }
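A small sketch of the nesting pattern, assuming each employee is stored as a JSON document in the department's `employees` column family. The dict layout and `read_department` are illustrative stand-ins, not HBase API calls.

```python
import json

# One department row; the employee documents are nested inside it.
dept_row = {
    "dept_info": {"Name": "Finance", "Manager": "Bob Smith"},
    "employees": {
        "287": json.dumps({"fname": "Glen", "lname": "Hanks"}),
        "934": json.dumps({"fname": "Scott", "lname": "Anderson"}),
    },
}

def read_department(row):
    """One row read returns the department and all of its employees."""
    staff = {emp_id: json.loads(doc) for emp_id, doc in row["employees"].items()}
    return row["dept_info"]["Name"], staff

dept_name, staff = read_department(dept_row)
```

The "join" between department and employees was done at write time, so a single read is enough.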

HBase Design Patterns
Getting Rid of the Join
 Another approach is to use the row key to force child data to be adjacent to the parent record
 Asking for row key 0001 gives just the dept
 Asking for keys >= 0001 and < 0002 gives dept + employees
 Odds are very good dept + employees are physically adjacent on the same server
DepartmentEmployees (example rows)
Row Key 0001 (dept_id): Name: Finance, Manager: …
Row Key 0001/287 (dept_id/emp_id): fname=Glen, lname=Hanks
Row Key 0001/934 (dept_id/emp_id): fname=Scott, lname=Anderson
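The range-scan behavior can be sketched with a sorted list and `bisect`, assuming string row keys of the form `dept_id` and `dept_id/emp_id`. The `scan` helper is a hypothetical stand-in for a start/stop-key scan; the department 0002 rows are added here only to show where the scan stops.

```python
from bisect import bisect_left

# Rows kept in row-key order, as an ordered key-value store would keep them.
rows = sorted(
    [
        ("0002", {"Name": "Sales"}),
        ("0001/934", {"fname": "Scott", "lname": "Anderson"}),
        ("0001", {"Name": "Finance"}),
        ("0001/287", {"fname": "Glen", "lname": "Hanks"}),
        ("0002/287", {"fname": "Tom", "lname": "Donohue"}),
    ],
    key=lambda row: row[0],
)
keys = [key for key, _ in rows]

def scan(start, stop):
    """Return rows with start <= key < stop, like a start/stop-key scan."""
    return rows[bisect_left(keys, start):bisect_left(keys, stop)]

dept_with_employees = [key for key, _ in scan("0001", "0002")]
```

Because `"0001/..."` sorts after `"0001"` and before `"0002"`, the department and its employees form one contiguous slice of the key space.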

NoSQL Design Tradeoffs
 There are many other similar design approaches!
 What are the tradeoffs for such designs vs. relational?
 Advantages
 Related data is always co-located, no network hop for a join
 As data "shards", related data automatically stays together
 Schema can trivially be extended in the future
• Add new name/value pairs
• Add new column families
• Add new adjacent rows…

NoSQL Design Tradeoffs
 Disadvantages
 Relationships tend to be one-way
• What if I want to find the department a given employee is in?
• May need to maintain multiple copies of the data
• Cannot easily (efficiently) explore ad-hoc relationships
 Difficult to model
• Describing these data models to a relational engine is very difficult
• Hive has limited/restrictive support for ad-hoc data in column families
• Making the wrong choice can make SQL access impossible or limited
 Query optimization
• The developer is the query optimizer
• The data model dramatically limits available optimizations
 What's the schema??
• Database schema cannot be determined from the database!
• Tooling (data exploration/management) tends to need to be custom built

Why Not Just Model Relationally?
 You can, of course, just model your data relationally
 But, there is a good chance your data will not be co-located!
 Every joined row may require a network hop to fetch
 You’re back to most of the problems you were trying to solve!
 Modelling complex objects is difficult
 Re-assembling complex objects is expensive
 Changing the data model is still a pain
Department table, Row Key 0001: Name: Finance, Manager: …
Employee table, Row Key 287: fname=Glen, lname=Hanks, dept_id=0001
Region Server: Department 0001-0486, Employee 1-300
Region Server: Employee 301-999
Region Server: Department 0487-0923

So, All Is Lost Then?
 All is not lost!
 You can expose limited portions of your data model through SQL
 Co-processors/batch jobs can maintain relational views of non-relational data
 Some SQL solutions can model certain design patterns
 Hive can capture an entire column family into a MAP
 Big SQL allows for custom column decoders to map arbitrary data structures relationally
 Drill can dig into certain complex column types
 Mix-and-match relational design with what your SQL engine can do
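As one illustration of maintaining a relational view of non-relational data, a batch job could flatten nested department rows into plain tuples that a SQL engine can query. This is a sketch under assumed data shapes; `flatten` and the output column order are hypothetical.

```python
# Nested rows as an application might store them (illustrative shape).
dept_rows = {
    "0001": {
        "dept_info": {"Name": "Finance"},
        "employees": {
            "287": {"fname": "Glen", "lname": "Hanks"},
            "934": {"fname": "Scott", "lname": "Anderson"},
        },
    },
    "0002": {
        "dept_info": {"Name": "Sales"},
        "employees": {"301": {"fname": "Tom", "lname": "Donohue"}},
    },
}

def flatten(rows):
    """Yield (dept_id, dept_name, emp_id, fname, lname) relational rows."""
    for dept_id, row in sorted(rows.items()):
        for emp_id, emp in sorted(row["employees"].items()):
            yield (dept_id, row["dept_info"]["Name"], emp_id, emp["fname"], emp["lname"])

relational_view = list(flatten(dept_rows))
```

The flattened output has a fixed, describable schema, so ordinary SQL tooling can work with it even though the source rows do not have one.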

Conclusion
 Not all NoSQL solutions have the same limitations as HBase!
 But invariably they all pose some challenge to traditional relational querying
 NoSQL fundamentally encourages nested relationships
 You have to plan for SQL access in advance
 It is important to understand the NoSQL capabilities of your SQL solution thoroughly
 There are more challenges than I have described here!

Thank You!
 Thanks for putting up with me
 Questions?
