
HBase and Drill: How loosely typed SQL is ideal for NoSQL

  1. © 2014 MapR Technologies
  2. Contact Information
     Ted Dunning
     Chief Applications Architect at MapR Technologies
     Committer & PMC for Apache’s Drill, ZooKeeper & Mahout
     VP of Incubator at Apache Foundation
     Email: tdunning@apache.org, tdunning@maprtech.com
     Twitter: @ted_dunning
     Hashtag today: #BDE2015
  3. Agenda
     • What does good mean?
     • What do we mean by loose typing?
     • Examples of what you can do
     • Real database with 10-20x fewer tables
     • Looking forward
     • Questions
  4. What Does Good Mean (for a DB)?
     • Expressive
       – Must express the concepts we need
     • Efficient
       – Must run fast enough on cheap enough hardware
  5. What Does Good Mean (for a DB)?
     • Expressive
       – Must express the concepts we need
     • Efficient
       – Must run fast enough on cheap enough hardware
     • Introspectable
       – Must be able to inspect the data and schema and gain understanding
  6. What is New Here
     • Introspection is better when
       – A minimum of data entities are used to describe our model
       – No name overflow
       – Referential scoping helps narrow our focus to a simpler problem
       – Many-to-one relations can be in-lined
     • Introspection was not a goal for the design of the relational model
     • Introspection was therefore not a result either
  7. Older than Dirt
     • Relational theory is old (1970)
       – Predates data structures
       – Predates mainstream recursive procedures
       – Predates lexical scoping
       – Predates logic programming
       – Predates real functional programming (Church, McCarthy, Iverson and Backus notwithstanding)
     • Some updates are in order to enhance introspection
  8. Contrast Relational and HBase-Style NoSQL
     Relational
     • Rows containing fields
     • Fields contain primitive types
     • Structure is fixed and uniform
     • Structure is pre-defined
     • Referential integrity (optional)
     • Expressions over sets of rows
     HBase / MapR DB
     • Rows contain fields
     • Fields are bytes
     • Structure is flexible
     • No pre-defined structure
     • Single key
     • Column families
     • Timestamps
     • Versions
  9. Contrast Relational and HBase with Structuring
     Relational
     • Rows containing fields
     • Fields contain primitive types
     • Structure is fixed and uniform
     • Structure is pre-defined
     • Referential integrity (optional)
     • Expressions over sets of rows
     HBase + Structuring
     • Rows contain fields
     • Fields contain primitive types
       – Or objects, or lists
     • Structure is flexible, ragged
     • No pre-defined structure
     • Single key
  10. Turtle Models for Databases
      • Allows complex objects in field values
        – JSON style lists and objects
      • Allows references to objects via join
        – Includes references localized within lists
      • Lists of objects and objects of lists are isomorphic to tables so …
      • Complex data in tables,
      • But also tables in complex data,
      • Even tables containing complex data containing tables
  11. Proviso and Warning
      • This is not your father’s BLOB
      • And not the same as arrays with lateral view joins
      • Rationale to come as we talk about idioms
  12. A Catalog of NoSQL Idioms
  13. Tables as Objects, Objects as Tables
      Row-wise form (columns c1, c2, c3), written as a list of objects:
        [ { c1:v1, c2:v2, c3:v3 }, { c1:v1, c2:v2, c3:v3 }, { c1:v1, c2:v2, c3:v3 } ]
      Column-wise form (columns c1, c2, c3), written as an object containing lists:
        { c1:[v1, v2, v3], c2:[v1, v2, v3], c3:[v1, v2, v3] }
  14. Micro Columnar Formats
      An entire table stored in columnar form can be a first-class value using these techniques.
      This is very powerful for in-lining one-to-many relations.
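
The equivalence on slide 13 between a list of objects and an object containing lists can be sketched in a few lines of Python. This is only an illustration of the idea, not code from the talk: the function names `rows_to_columns` and `columns_to_rows` are hypothetical, and filling missing fields with `None` is an assumption made here so that ragged rows still line up.

```python
from typing import Any


def rows_to_columns(rows: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Convert a list of objects (row-wise) into an object of lists (column-wise)."""
    columns: dict[str, list[Any]] = {}
    for i, row in enumerate(rows):
        for name in row:
            # A column first seen at row i is back-filled with None for rows 0..i-1.
            columns.setdefault(name, [None] * i)
        for name, values in columns.items():
            # Every known column grows by one entry per row; absent fields become None.
            values.append(row.get(name))
    return columns


def columns_to_rows(columns: dict[str, list[Any]]) -> list[dict[str, Any]]:
    """Convert an object of equal-length lists back into a list of objects."""
    n = max((len(values) for values in columns.values()), default=0)
    return [{name: values[i] for name, values in columns.items()} for i in range(n)]


rows = [
    {"c1": 1, "c2": "a", "c3": True},
    {"c1": 2, "c2": "b", "c3": False},
]
cols = rows_to_columns(rows)
print(cols)                   # column-wise form of the same table
print(columns_to_rows(cols))  # back to the row-wise form
```

For rows that all share the same fields the round trip is exact. With ragged rows, the column-wise form makes the gaps explicit as `None` entries.
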
  15. Note
      • If embedded tables are first-class, schema becomes data
      • If schema is data-driven when embedded, constructs that elevate tables to top-level are impossible
      • Thus, embedded first-class objects imply late discovery of schema information
  16. A first example: Time-series data
  17. Column names as data
      • When column names are not pre-defined, they can convey information
      • Examples
        – Time offsets within a window for time series
        – Top-level domains for web crawlers
        – Vendor IDs for customer purchase profiles
      • Predefined schema is impossible for this idiom
  18. Relational Model for Time-series
  19. Table Design: Point-by-Point
  20. Table Design: Hybrid Point-by-Point + Sub-table
      After the window closes, the data in the row is restated as a column-oriented tabular value in a different column family.
  21. Compression Results
      • Each sample is a 64-bit time plus a 16-bit value
      • Samples are taken at 10 kHz
      • Sample time jitter makes it important to keep the original time-stamp
      • How much overhead is needed to retain the time-stamp?
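
The question on slide 21 can be made concrete with a little arithmetic. Stored point by point, each sample costs 8 bytes of time plus 2 bytes of value, so the time-stamp is 80% of the raw data. The sketch below is not the compression scheme measured in the talk (those results are not in this transcript). It assumes a simple alternative: store the window start once and keep each time-stamp as a 16-bit residual from its nominal 10 kHz position, which preserves jittered times exactly as long as the jitter fits in 16 bits. The jitter range, the layout, and all function names are assumptions made for illustration.

```python
import random
import struct

SAMPLE_RATE_HZ = 10_000
PERIOD_NS = 1_000_000_000 // SAMPLE_RATE_HZ  # nominal spacing: 100,000 ns


def make_window(n, start_ns=0, jitter_ns=500, seed=42):
    """Generate n jittered time-stamps and 16-bit sample values (synthetic data)."""
    rng = random.Random(seed)
    times = [start_ns + i * PERIOD_NS + rng.randint(-jitter_ns, jitter_ns) for i in range(n)]
    values = [rng.randint(-2**15, 2**15 - 1) for _ in range(n)]
    return times, values


def pack_point_by_point(times, values):
    """One 64-bit time plus one 16-bit value per sample: 10 bytes each."""
    return b"".join(struct.pack("<qh", t, v) for t, v in zip(times, values))


def pack_columnar(times, values):
    """Start time and count once, then a column of residuals and a column of values."""
    start = times[0]
    residuals = [t - (start + i * PERIOD_NS) for i, t in enumerate(times)]
    header = struct.pack("<qI", start, len(times))
    n = len(times)
    return header + struct.pack(f"<{n}h", *residuals) + struct.pack(f"<{n}h", *values)


def unpack_columnar(blob):
    """Recover the exact original time-stamps and values."""
    start, n = struct.unpack_from("<qI", blob, 0)
    residuals = struct.unpack_from(f"<{n}h", blob, 12)
    values = struct.unpack_from(f"<{n}h", blob, 12 + 2 * n)
    times = [start + i * PERIOD_NS + r for i, r in enumerate(residuals)]
    return times, list(values)


times, values = make_window(1000)
raw = pack_point_by_point(times, values)
blob = pack_columnar(times, values)
print(len(raw), len(blob))
```

For a 1,000-sample window this layout takes 4,012 bytes against 10,000 bytes point by point, and the original time-stamps are still recoverable. A general-purpose compressor applied to the residual column could shrink it further, but how far depends on the jitter distribution.
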
  22. A second example: Music meta-data
  23. MusicBrainz on NoSQL
      • Artists, albums, tracks and labels are key objects
      • Reality check:
        – Add works (compositions), recordings, release, release group
      • 7 tables for artist alone
      • 12 for place, 7 for label, 17 for release/group, 8 for work
        – (but only 4 for recording!)
        – Total of 12 + 7 + 17 + 8 + 4 = 48 tables
      • But wait, there’s more!
        – 10 annotation tables, 10 edit tables, 19 tag tables, 5 rating tables, 86 link tables, 5 cover art tables and 3 tables for CD timing info (138 total)
        – And 50 more tables that aren’t documented yet
  25. 180 tables not shown
  26. 236 tables to describe 7 kinds of things
  27. Can we do better?
  28. artist: id, gid, name, sort_name, begin_date, end_date, ended, type, gender, area, begin_area, end_area, comment, list<ipi>, list<isni>, list<alias>, list<release_id>, list<recording_id>
      artist: id, gid, name, sort_name, begin_date, end_date, ended, type, gender, area, begin_area, end_area, comment, list<ipi>, list<isni>, list<alias>
  29. artist: id, gid, name, sort_name, begin_date, end_date, ended, type, gender, area, begin_area, end_area, comment, list<ipi>, list<isni>, list<alias>, list<release_id>, list<recording_id>
  30. artist: id, gid, name, sort_name, begin_date, end_date, ended, type, gender, area, begin_area, end_area, comment, list<ipi>, list<isni>, list<alias>, list<release_id>, list<recording_id>
      { name, begin_date, end_date }
  32. {id, recording_id, name, list<credit>, length}
      recording: id, gid, list<credit>, name, list<track_ref>
      release: id, gid, release_group_id, list<credit>, name, barcode, status, packaging, language, script, list<medium>
      {id, format, name, list<track>}
      release_group: id, gid, name, list<credit>, type, list<release_id>
  33. 27 tables reduce to 4
  34. 27 tables reduce to 4 so far
  35. Further Reductions
      • All 86 link tables become properties on artists, releases and other entities
      • All 44 tag, rating and annotation tables become list properties
      • All 5 cover art tables become lists of file references
      • Current score: 162 tables become 4
      • You get the idea
  36. Is This Good?
      • Expressivity
        – The JSON data model is at least as expressive as the original relational model
          • Many cases are easier to describe in nested data
          • No cases are harder
      • Efficiency
        – Inlining can increase data size; locality improves, however
        – Sessionizing can substantially decrease data size
        – Inlining back-references is more efficient than ordinary indexes
        – Inlined columnar data allows 1000x speedup for time series
      • Introspection (you decide)
  37. But How Can We Query This?
      • Can’t use SQL
        – SQL is strongly typed
        – SQL is heavily tied into the original relational model
        – SQL-generating tools require the relational model
      • Must use SQL
        – Vast numbers of tools and people understand how to write SQL
        – SQL is the lingua franca of databases
  38. Squaring the Circle
      • Enter Apache Drill
      • Drill is SQL compliant
        – Uses standard syntax and semantics
      • Drill extends SQL
        – First class treatment of objects, lists
        – Full support for destructuring, flattening
        – Full power of relational model can be applied to complex data
  39. Drill Provides Scalable and Extended SQL
  40. Sample Query
      • Find Elvis
        select distinct id, name, alias
        from (
          select id, name, flatten(alias.name) alias
          from artist
        )
        where alias like 'Elvis%Presley'
  41. Example Query
      • Find discs where Elvis was credited
        select distinct album_id, name
        from (
          select id album_id, name, flatten(credit)
          from release
        ) albums
        join (
          select distinct artist_id
          from (
            select id artist_id, flatten(alias)
            from artist
            where name like 'Elvis%Presley'
          )
        ) artists
        using (artist_id)
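
The key operation in the queries on slides 40 and 41 is flattening: a row that holds a list is expanded into one row per list element, with the other fields repeated. The Python sketch below emulates that behavior on in-memory data to show what the "Find Elvis" query is meant to compute. It is not Drill code and says nothing about how Drill executes the query. The sample artists, the `like` helper, and the function names are all hypothetical.

```python
import re
from typing import Any, Iterable, Iterator

Row = dict[str, Any]


def flatten(rows: Iterable[Row], field: str) -> Iterator[Row]:
    """Yield one output row per element of rows[field], repeating the other fields."""
    for row in rows:
        for item in row.get(field) or []:
            yield {**row, field: item}


def like(pattern: str) -> "re.Pattern[str]":
    """Translate a SQL LIKE pattern (% and _ wildcards) into a compiled regex."""
    parts = []
    for ch in pattern:
        parts.append(".*" if ch == "%" else "." if ch == "_" else re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def find_artists_by_alias(rows: Iterable[Row], pattern: str) -> list[tuple]:
    """Distinct (id, name, alias name) for artists with an alias matching the pattern."""
    matcher = like(pattern)
    hits = {
        (row["id"], row["name"], row["alias"]["name"])
        for row in flatten(rows, "alias")
        if matcher.fullmatch(row["alias"]["name"])
    }
    return sorted(hits)


# Hypothetical sample data shaped like the nested artist record on slide 30.
artists = [
    {"id": 1, "name": "Elvis Presley",
     "alias": [{"name": "Elvis Aaron Presley"}, {"name": "The King"}]},
    {"id": 2, "name": "Elvis Costello",
     "alias": [{"name": "Declan MacManus"}]},
]

print(find_artists_by_alias(artists, "Elvis%Presley"))
```

Because aliases live inside the artist record, no join against a separate alias table is needed. The flatten step plays the role that the join would play in the fully relational schema.
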
  42. Summary
      • Extended relational model allows massive simplification
        – On a real example, we see >20x reduction in number of tables
      • Simplification drives improved introspection
        – This is good
      • Apache Drill gives very high performance execution for extended relational problems
      • You can try this out today
  44. Short Books by Ted Dunning & Ellen Friedman
      • Published by O’Reilly in 2014 and 2015
      • For sale from Amazon or O’Reilly
      • Free e-books currently available courtesy of MapR
        http://bit.ly/ebook-real-world-hadoop
        http://bit.ly/mapr-tsdb-ebook
        http://bit.ly/ebook-anomaly
        http://bit.ly/recommendation-ebook
  45. Real World Hadoop by Ted Dunning and Ellen Friedman, © Feb 2015 (published by O’Reilly)
      Free copies at book signing today
  46. Thank You!
  47. Q&A
      @mapr, maprtech, tdunning@mapr.tech.com
      Engage with us! MapR, maprtech, mapr-technologies

Editor's notes

  1. Key ideas: each row has a unique row key based on an ID for its time series (looked up from a separate look-up table). An important part of the design's efficiency is that each column is a time offset from the start time shown in the row key. Note that data is stored point by point in this wide-table design. Ted’s notes from his original slide: One technique for increasing the rate at which data can be retrieved from a time series database is to store many values in each row. Doing this allows data points to be retrieved at a higher speed. Because both HBase and MapR-DB store data ordered by the primary key, this design will cause rows containing data from a single time series to wind up near one another on disk. Retrieving data from a particular time series for a time range will involve largely sequential disk operations and therefore will be much faster than would be the case if the rows were widely scattered. Typically, the time window is adjusted so that 100–1,000 samples are in each row.
  2. Ted’s notes from original slide: The table design is improved by collapsing all of the data for a row into a single data structure known as a blob. This blob can be highly compressed so that less data needs to be read from disk. Also, having a single column per row decreases the per-column overhead incurred by the on-disk format that HBase uses, which further increases performance. Data can be progressively converted to the compressed format as soon as it is known that little or no new data is likely to arrive for that time series and time window. Commonly, once the time window ends, new data will only arrive for a few more seconds, and the compression of the data can begin. Since compressed and uncompressed data can coexist in the same row, if a few samples arrive after the row is compressed, the row can simply be compressed again to merge the blob and the late-arriving samples.
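
The two notes above describe a wide-row design followed by a compressed blob. A minimal in-memory sketch of that idea, using plain Python dictionaries in place of a real table, might look as follows. Nothing here is the HBase or MapR-DB API: the row-key format, the `blob` column name, the JSON-plus-zlib encoding, and the 1,000 ms window are all assumptions chosen for illustration.

```python
import json
import zlib

BLOB = "blob"  # hypothetical column that holds the compressed window


def row_key(series_id: str, t_ms: int, window_ms: int) -> str:
    """Row key = series id + window start, zero-padded so keys sort in time order."""
    window_start = t_ms - t_ms % window_ms
    return f"{series_id}:{window_start:013d}"


def put_sample(table: dict, series_id: str, t_ms: int, value: float, window_ms: int = 1000) -> None:
    """Point-by-point write: the column name is the offset from the window start."""
    row = table.setdefault(row_key(series_id, t_ms, window_ms), {})
    row[t_ms % window_ms] = value


def read_row(row: dict) -> dict:
    """Merge the blob (if any) with loose samples; loose samples win on conflict."""
    samples = {}
    if BLOB in row:
        samples.update((offset, value) for offset, value in json.loads(zlib.decompress(row[BLOB])))
    samples.update((k, v) for k, v in row.items() if k != BLOB)
    return samples


def compress_row(row: dict) -> None:
    """Restate the whole row as a single compressed blob column."""
    samples = read_row(row)
    blob = zlib.compress(json.dumps(sorted(samples.items())).encode())
    row.clear()
    row[BLOB] = blob


table = {}
put_sample(table, "s1", 1000, 1.5)
put_sample(table, "s1", 1250, 2.5)
put_sample(table, "s1", 2100, 3.5)

row = table["s1:0000000001000"]
compress_row(row)  # the window has closed
row[700] = 9.0     # a late sample arrives after compression
compress_row(row)  # compress again to merge blob and late sample
print(read_row(row))
```

Compressing a second time simply reads the existing blob together with any loose columns and rewrites them as one value, which is the merge behavior the second note describes for late-arriving samples.
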