SlideShare una empresa de Scribd logo
1 de 33
Dremel: Interactive Analysis of Web-
Scale Datasets
 Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer,
 Shiva Shivakumar, Matt Tolton, Theo Vassilakis
 Google, Inc.
 VLDB, 2010

 Presentation by ensky (enskylin@gmail.com)
Outline
 Introduction
 Nested columnar storage
 Query processing
 Experiments
 Conclusion




                            2
Introduction
   Dremel is an query system
    For analysis of read-only nested data

   Use case

        Interactive                 Trends
           Tools                   Detection


             Web     Spam         Network
          Dashboards             Optimization
                                            3
DocId: 10
                                           Links


           Features
                                             Forward: 20
                                           Name
                                             Language
                                               Code: 'en-us'
                                               Country: 'us'
             Multi-level structure          Url: 'http://A'
                                           Name

             SQL-like query language
                                             Url: 'http://B'



             Fast: Trillion-row tables in second level
             Big scale: thousands of CPUs and
              petabytes of data
             Widely used by google: in production
              since 2006
SELECT A, COUNT(B) FROM T
GROUP BY A
T = {/gfs/1, /gfs/2, …, /gfs/100000}
                                                               4
Contribution
   This paper presented two major
    technique in Dremel:

   Nested columnar storage
    ◦ store / split into columns
    ◦ Assembly to record
   Query
    ◦ Language
    ◦ Execution
                                     5
Outline
 Introduction
 Nested columnar storage
 Query processing
 Experiments
 Conclusion




                            6
Why nested?
   Flexible
    ◦ Data in web and scientific is often non-
      relational
   Reduce cost
    ◦ Normalizing and recombining nested data
      is often prohibited in TB, PB scale of data.




                                                     7
Why column?
                  DocId: 10
                  Links
                                           column-oriented
                    Forward: 20
                  Name                              A
                                                  *    *
                                                    ...
                    Language
                      Code: 'en-us'
                      Country: 'us'             B          E
                    Url: 'http://A'                   *
                                               C           D
                                                               r1
                  Name
                    Url: 'http://B'

                                          r1
 r1                                                   r1
                                          r2                   r2
 r2                                   Read less,      r2
                                      cheaper
 Record-oriented                      decompression

Challenge: preserve structure, reconstruct from a subset of fields
                                                                 8
DocId: 10
                    Nested data model
Links
  Forward: 20        message Document {
  Forward: 40
  Forward: 60          required int64 DocId;           [1,1]
Name                   optional group Links {
  Language
    Code: 'en-us'
                         repeated int64 Backward;      [0,*]
    Country: 'us'        repeated int64 Forward;
  Language             }
    Code: 'en'
  Url: 'http://A'      repeated group Name {
Name                     repeated group Language {
  Url: 'http://B'
Name                       required string Code;
  Language                 optional string Country;    [0,1]
    Code: 'en-gb'
    Country: 'gb'
                         }
                         optional string Url;
DocId: 20              }
Links                }
  Backward: 10
  Backward: 30
  Forward: 80
Name
                     http://code.google.com/apis/protocolbuffers
  Url: 'http://C'                                                  9
Column-striped representation
       DocId            Name.Url                Links.Forward        Links.Backward
       value    r   d   value       r   d       value   r   d        value    r   d
        10     0    0   http://A    0   2        20     0   2        NULL     0   1
        20     0    0   http://B    1   2        40     1   2            10   0   2

DocId: 10
                         NULL       1   1        60     1   2            30   1   2
Links                   http://C    0   2        80     0   2
  Forward: 20
  Forward: 40
  Forward: 60
Name
  Language                 Name.Language.Code               Name.Language.Country
    Code: 'en-us'
    Country: 'us'           value       r   d               value    r    d
  Language                  en-us       0   2                   us   0    3
    Code: 'en'
  Url: 'http://A'            en         2   2                NULL    2    2
Name
  Url: 'http://B'           NULL        1   1                NULL    1    1
Name
  Language                  en-gb       1   2                   gb   1    3
    Code: 'en-gb'           NULL        0   1                NULL    0    1
    Country: 'gb'                                                                     10
Column-oriented problem
   When query sub-tree, there is no way for
    you to know the whole structure(even the
    position of yourself)             A
                                           *        *
                                       B        ...          E
   To solve this problem,                 *
    this paper presents            C            D
    repetition & definition
                                                        r1
                              r1
                                           r1
                              r2                        r2
           Where am I?
                                           r2
                                                                 11
Repetition and definition levels
                                                                        r
                                                        DocId: 10
                                                        Links             1
                                                          Forward: 20
                                                          Forward: 40
    Name.Language.Code                                    Forward: 60
     value    r   d                                     Name
                             First time, repeat=0         Language
     en-us    0   2                                         Code: 'en-us'
                          Language repeat, level = 2        Country: 'us'
       en     2   2                                       Language
                                                            Code: 'en'
     NULL     1   1                                       Url: 'http://A'
     en-gb    1   2                                     Name
                                                          Url: 'http://B'
     NULL     0   1                                     Name
                                                          Language
                                                            Code: 'en-gb'
                               None repeat, level = 0       Country: 'gb'


                                                        DocId: 20
                                                                         2  r
                                                        Links
                                                          Backward: 10
                                                          Backward: 30
    r: At what repeated field in the field's path         Forward: 80
       the value has repeated                           Name
                                                          Url: 'http://C'12
Repetition and definition levels
                                                                       r
                                                       DocId: 10
                                                       Links             1
                                                         Forward: 20
                                                         Forward: 40
    Name.Language.Country                                Forward: 60
     value    r   d                                    Name
                                                         Language
       us     0   3                                        Code: 'en-us'
                                                           Country: 'us'
      NULL    2   2                                      Language
                                                           Code: 'en'
      NULL    1   1                                      Url: 'http://A'
       gb     1   3                                    Name
                                                         Url: 'http://B'
      NULL    0   1                                    Name
                                                         Language
                                                           Code: 'en-gb'
                                                           Country: 'gb'


                                                       DocId: 20
                                                                        2  r
                                                       Links
                                                         Backward: 10
                                                         Backward: 30
    d: How many fields in paths that could be            Forward: 80
                                                       Name
       undefined (opt. or rep.) are actually present     Url: 'http://C'13
Column-oriented is all right,
but what if I still need record-
oriented query?

                  (e.g., MapReduce)


                                      14
Record assembly FSM
                                   Transitions labeled with
                                   repetition levels
                         DocId
                    0
                            0                     1
     1     Links.Backward        Links.Forward
                            0

                         0,1,2
    Name.Language.Code           Name.Language.Country
                            2
                         Name.Ur         0,1
                1
                             l
                           0



For record-oriented data processing (e.g., MapReduce)
                                                              15
Reading two fields

                                DocId: 10      s 1
                                Name
                                  Language
                DocId               Country: 'us'
                                  Language
                0               Name
                                Name
 1,2    Name.Language.Country     Language
                0                   Country: 'gb'

                                DocId: 20      s2
                                Name



Structure of parent fields is preserved.
Useful for queries like /Name[3]/Language[1]/Country

                                                       16
Outline
 Introduction
 Nested columnar storage
 Query processing
 Experiments
 Conclusion




                            17
Query language
 Based on SQL
 Performs
    ◦   Projection
    ◦   Selection
    ◦   Nested subqueries
    ◦   inner and intra-record aggregation
    ◦   Top-k
    ◦   Joins
    ◦   User-defined functions

                                             18
DocId: 10
                                                  Links
                                                    Forward: 20
                                                    Forward: 40

          Example usage                             Forward: 60
                                                  Name
                                                    Language
                                                      Code: 'en-us'
                                                      Country: 'us'
SELECT DocId AS Id,                                 Language
                                                      Code: 'en'
  COUNT(Name.Language.Code) WITHIN Name AS Cnt,     Url: 'http://A'
                                                  Name
  Name.Url + ',' + Name.Language.Code AS Str        Url: 'http://B'
                                                  Name
FROM t                                              Language
WHERE REGEXP(Name.Url, '^http') AND DocId < 20;       Code: 'en-gb'
                                                      Country: 'gb'



Output table                 Output schema
Id: 10                 t1    message QueryResult {
Name                           required int64 Id;
  Cnt: 2                       repeated group Name {
  Language                       optional uint64 Cnt;
    Str: 'http://A,en-us'        repeated group Language {
    Str: 'http://A,en'             optional string Str;
Name                             }
  Cnt: 0                       }
                             }
                                                                19
Query execution
                   client
                                     • Parallelizes scheduling
root server                            and aggregation
                                     • Fault tolerance
intermediate
                            ...      • query dispatcher can
servers
                                       dispatch servers

leaf servers       ...
(with local                  ...
 storage)


         storage layer (e.g., GFS)                               20
Example: count()
      SELECT A, COUNT(B) FROM T GROUP          SELECT A, SUM(c)
0     BY A                                     FROM (R11 UNION ALL R110)
      T = {/gfs/1, /gfs/2, …, /gfs/100000}     GROUP BY A




      R11                                R12
      SELECT A, COUNT(B) AS c            SELECT A, COUNT(B) AS c
1     FROM T11 GROUP BY A                FROM T12 GROUP BY A
      T11 = {/gfs/1, …, /gfs/10000}      T12 = {/gfs/10001, …, /gfs/20000}

...
      SELECT A, COUNT(B) AS c
3     FROM T31 GROUP BY A             ...
      T31 = {/gfs/1}

                Data access ops                                              21
Outline
 Introduction
 Nested columnar storage
 Query processing
 Experiments
 Conclusion




                            22
Experiment Data
• 1 PB of real data
  (uncompressed, non-replicated)
• 100K-800K tablets per table
• Experiments run during business hours

Table   Number of       Size (unrepl.,   Number       Data     Repl.
name    records         compressed)      of fields    center   factor
  T1      85 billion             87 TB          270     A        3×
  T2      24 billion             13 TB          530     A        3×
  T3        4 billion            70 TB         1200     A        3×
  T4      1+ trillion          105 TB            50     B        3×
  T5      1+ trillion            20 TB           30     B        2×
                                                                        23
Column v.s. Record"cold" time on local disk,
                                  time (sec)                            averaged over 30 runs
                                                                                 (e) parse as
                   from records                                                      C++ objects
10x speedup
using columnar                                       objects
storage                                                                          (d) read +
                                                                                     decompress
                                                               records
                                                                                 (c) parse as
                   from columns




                                           columns
                                                                                     C++ objects
                                                                                 (b) assemble
2-4x overhead of                                                                      records
using records                                                                    (a) read +
                                                                                     decompress

                                                     number of fields


            Table partition: 375 MB (compressed), 300K rows, 125 columns 24
MR and Dremel execution
 Avg # of terms in txtField in 85 billion record table T1
     execution time (sec) on 3000 nodes
                                             Sawzall program ran on MR:

                                             num_recs: table sum of int;
                                             num_words: table sum of int;
                                             emit num_recs <- 1;
                                             emit num_words <-

                                             count_words(input.txtField);

          87 TB        0.5 TB       0.5 TB



Q1: SELECT SUM(count_words(txtField)) / COUNT(*)
    FROM T1

MR overheads: launch jobs, schedule 0.5M tasks, assemble record
                                                           25
Impact of serving tree depth
      execution time (sec)




       (returns 100s of records)   (returns 1M records)


Q2:   SELECT country, SUM(item.amount) FROM T2
      GROUP BY country

Q3:   SELECT domain, SUM(item.amount) FROM T2
      WHERE domain CONTAINS ’.net’
      GROUP BY domain
                                   40 billion nested items26
Scalability
     execution time (sec)




                                        number of
                                        leaf servers




Q5 on a trillion-row table T4:
     SELECT TOP(aid, 20), COUNT(*) FROM T4
                                                 27
Outline
 Introduction
 Nested columnar storage
 Query processing
 Experiments
 Conclusion




                            28
Observation
                          Monthly query workload
                          of one 3000-node
percentage of queries     Dremel instance




                                          execution
                                          time (sec)




     Most queries complete under 10 sec                29
Conclusion
 Dremel is an query system
  For analysis of read-only nested data
 Main feature is fast(interactive response
  time), column-oriented, SQL-like Query
  language
 Introduced two major method:
    ◦ Nested columnar storage – in order to solve
      partial query problem.
    ◦ Query processing – parallel processing &
      distributing, decompressing queries


                                                    30
Comments
 Google is awesome!
 Pros
    ◦ Nested storage gives us more flexibility.
    ◦ repetition & definition is an novel idea
      and it can solve the locality problem
      easily.
    ◦ Distributed serving tree is awesome,
      faster than MR.



                                                  31
Comments
   Cons
    ◦ Read-only may not fit every requirement
    ◦ Dremel is not a database, so you’ll need
      to convert your real data into dremel
      when analyzing
      Converting may cost lost of time and space
      Google doesn’t care this problem, they have
       GFS and many servers.




                                                     32
Thank you for listening.




                           33

Más contenido relacionado

La actualidad más candente

Neo4j graphs in government
Neo4j graphs in governmentNeo4j graphs in government
Neo4j graphs in governmentNeo4j
 
Compiler Construction | Lecture 15 | Memory Management
Compiler Construction | Lecture 15 | Memory ManagementCompiler Construction | Lecture 15 | Memory Management
Compiler Construction | Lecture 15 | Memory ManagementEelco Visser
 
Mongodb - Scaling write performance
Mongodb - Scaling write performanceMongodb - Scaling write performance
Mongodb - Scaling write performanceDaum DNA
 
Delta Lake: Optimizing Merge
Delta Lake: Optimizing MergeDelta Lake: Optimizing Merge
Delta Lake: Optimizing MergeDatabricks
 
Make your PySpark Data Fly with Arrow!
Make your PySpark Data Fly with Arrow!Make your PySpark Data Fly with Arrow!
Make your PySpark Data Fly with Arrow!Databricks
 
Apache Arrow Flight Overview
Apache Arrow Flight OverviewApache Arrow Flight Overview
Apache Arrow Flight OverviewJacques Nadeau
 
Evaluation of Expression in Query Processing
Evaluation of Expression in Query ProcessingEvaluation of Expression in Query Processing
Evaluation of Expression in Query ProcessingNeel Shah
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Cloudera, Inc.
 
Delta Lake, un vernis pour parquet
Delta Lake, un vernis pour parquetDelta Lake, un vernis pour parquet
Delta Lake, un vernis pour parquetAlban Phélip
 
Spatial Indexing
Spatial IndexingSpatial Indexing
Spatial Indexingtorp42
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
6. Integrity and Security in DBMS
6. Integrity and Security in DBMS6. Integrity and Security in DBMS
6. Integrity and Security in DBMSkoolkampus
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangDatabricks
 
NOSQL and MongoDB Database
NOSQL and MongoDB DatabaseNOSQL and MongoDB Database
NOSQL and MongoDB DatabaseTariqul islam
 
2 PHASE COMMIT PROTOCOL
2 PHASE COMMIT PROTOCOL2 PHASE COMMIT PROTOCOL
2 PHASE COMMIT PROTOCOLKABILESH RAMAR
 
Apache Spark 3 Dynamic Partition Pruning
Apache Spark 3 Dynamic Partition PruningApache Spark 3 Dynamic Partition Pruning
Apache Spark 3 Dynamic Partition PruningAparup Chatterjee
 
Query processing-and-optimization
Query processing-and-optimizationQuery processing-and-optimization
Query processing-and-optimizationWBUTTUTORIALS
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesNishith Agarwal
 

La actualidad más candente (20)

RDD
RDDRDD
RDD
 
Neo4j graphs in government
Neo4j graphs in governmentNeo4j graphs in government
Neo4j graphs in government
 
Compiler Construction | Lecture 15 | Memory Management
Compiler Construction | Lecture 15 | Memory ManagementCompiler Construction | Lecture 15 | Memory Management
Compiler Construction | Lecture 15 | Memory Management
 
Mongodb - Scaling write performance
Mongodb - Scaling write performanceMongodb - Scaling write performance
Mongodb - Scaling write performance
 
Delta Lake: Optimizing Merge
Delta Lake: Optimizing MergeDelta Lake: Optimizing Merge
Delta Lake: Optimizing Merge
 
Make your PySpark Data Fly with Arrow!
Make your PySpark Data Fly with Arrow!Make your PySpark Data Fly with Arrow!
Make your PySpark Data Fly with Arrow!
 
Apache Arrow Flight Overview
Apache Arrow Flight OverviewApache Arrow Flight Overview
Apache Arrow Flight Overview
 
Evaluation of Expression in Query Processing
Evaluation of Expression in Query ProcessingEvaluation of Expression in Query Processing
Evaluation of Expression in Query Processing
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
 
Delta Lake, un vernis pour parquet
Delta Lake, un vernis pour parquetDelta Lake, un vernis pour parquet
Delta Lake, un vernis pour parquet
 
Spatial Indexing
Spatial IndexingSpatial Indexing
Spatial Indexing
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
6. Integrity and Security in DBMS
6. Integrity and Security in DBMS6. Integrity and Security in DBMS
6. Integrity and Security in DBMS
 
Running Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in ProductionRunning Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in Production
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
NOSQL and MongoDB Database
NOSQL and MongoDB DatabaseNOSQL and MongoDB Database
NOSQL and MongoDB Database
 
2 PHASE COMMIT PROTOCOL
2 PHASE COMMIT PROTOCOL2 PHASE COMMIT PROTOCOL
2 PHASE COMMIT PROTOCOL
 
Apache Spark 3 Dynamic Partition Pruning
Apache Spark 3 Dynamic Partition PruningApache Spark 3 Dynamic Partition Pruning
Apache Spark 3 Dynamic Partition Pruning
 
Query processing-and-optimization
Query processing-and-optimizationQuery processing-and-optimization
Query processing-and-optimization
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
 

Similar a Dremel: interactive analysis of web-scale datasets

Apache parquet - Apache big data North America 2017
Apache parquet - Apache big data North America 2017Apache parquet - Apache big data North America 2017
Apache parquet - Apache big data North America 2017techmaddy
 
Code as Data workshop: Using source{d} Engine to extract insights from git re...
Code as Data workshop: Using source{d} Engine to extract insights from git re...Code as Data workshop: Using source{d} Engine to extract insights from git re...
Code as Data workshop: Using source{d} Engine to extract insights from git re...source{d}
 
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBaseHBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBaseHBaseCon
 
Polygot persistence for Java Developers - August 2011 / @Oakjug
Polygot persistence for Java Developers - August 2011 / @OakjugPolygot persistence for Java Developers - August 2011 / @Oakjug
Polygot persistence for Java Developers - August 2011 / @OakjugChris Richardson
 
Programming Design Guidelines
Programming Design GuidelinesProgramming Design Guidelines
Programming Design Guidelinesintuitiv.de
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Using RDA for Archives and Manuscripts
Using RDA for Archives and ManuscriptsUsing RDA for Archives and Manuscripts
Using RDA for Archives and ManuscriptsAdrienne Pruitt
 
Practical Implementation of Space-Efficient Dynamic Keyword Dictionaries
Practical Implementation of Space-Efficient Dynamic Keyword DictionariesPractical Implementation of Space-Efficient Dynamic Keyword Dictionaries
Practical Implementation of Space-Efficient Dynamic Keyword DictionariesShunsuke Kanda
 
Apache thrift-RPC service cross languages
Apache thrift-RPC service cross languagesApache thrift-RPC service cross languages
Apache thrift-RPC service cross languagesJimmy Lai
 
Keg.io
Keg.ioKeg.io
Keg.ioHeroku
 
Polyglot persistence for Java developers - moving out of the relational comfo...
Polyglot persistence for Java developers - moving out of the relational comfo...Polyglot persistence for Java developers - moving out of the relational comfo...
Polyglot persistence for Java developers - moving out of the relational comfo...Chris Richardson
 
Hadoop architecture meetup
Hadoop architecture meetupHadoop architecture meetup
Hadoop architecture meetupvmoorthy
 

Similar a Dremel: interactive analysis of web-scale datasets (20)

Apache parquet - Apache big data North America 2017
Apache parquet - Apache big data North America 2017Apache parquet - Apache big data North America 2017
Apache parquet - Apache big data North America 2017
 
Understanding hdfs
Understanding hdfsUnderstanding hdfs
Understanding hdfs
 
Code as Data workshop: Using source{d} Engine to extract insights from git re...
Code as Data workshop: Using source{d} Engine to extract insights from git re...Code as Data workshop: Using source{d} Engine to extract insights from git re...
Code as Data workshop: Using source{d} Engine to extract insights from git re...
 
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBaseHBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase
 
Seminar_Final
Seminar_FinalSeminar_Final
Seminar_Final
 
Polygot persistence for Java Developers - August 2011 / @Oakjug
Polygot persistence for Java Developers - August 2011 / @OakjugPolygot persistence for Java Developers - August 2011 / @Oakjug
Polygot persistence for Java Developers - August 2011 / @Oakjug
 
Programming Design Guidelines
Programming Design GuidelinesProgramming Design Guidelines
Programming Design Guidelines
 
Network and DNS Vulnerabilities
Network and DNS VulnerabilitiesNetwork and DNS Vulnerabilities
Network and DNS Vulnerabilities
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Using RDA for Archives and Manuscripts
Using RDA for Archives and ManuscriptsUsing RDA for Archives and Manuscripts
Using RDA for Archives and Manuscripts
 
Practical Implementation of Space-Efficient Dynamic Keyword Dictionaries
Practical Implementation of Space-Efficient Dynamic Keyword DictionariesPractical Implementation of Space-Efficient Dynamic Keyword Dictionaries
Practical Implementation of Space-Efficient Dynamic Keyword Dictionaries
 
Apache thrift-RPC service cross languages
Apache thrift-RPC service cross languagesApache thrift-RPC service cross languages
Apache thrift-RPC service cross languages
 
Keg.io
Keg.ioKeg.io
Keg.io
 
Chado-XML
Chado-XMLChado-XML
Chado-XML
 
Polyglot persistence for Java developers - moving out of the relational comfo...
Polyglot persistence for Java developers - moving out of the relational comfo...Polyglot persistence for Java developers - moving out of the relational comfo...
Polyglot persistence for Java developers - moving out of the relational comfo...
 
Hadoop architecture meetup
Hadoop architecture meetupHadoop architecture meetup
Hadoop architecture meetup
 
C++primer
C++primerC++primer
C++primer
 
Avro
AvroAvro
Avro
 
2011 09-pdfjs
2011 09-pdfjs2011 09-pdfjs
2011 09-pdfjs
 
Ruby DSL
Ruby DSLRuby DSL
Ruby DSL
 

Más de Hung-yu Lin

2014 database - course 2 - php
2014 database - course 2 - php2014 database - course 2 - php
2014 database - course 2 - phpHung-yu Lin
 
2014 database - course 3 - PHP and MySQL
2014 database - course 3 - PHP and MySQL2014 database - course 3 - PHP and MySQL
2014 database - course 3 - PHP and MySQLHung-yu Lin
 
2014 database - course 1 - www introduction
2014 database - course 1 - www introduction2014 database - course 1 - www introduction
2014 database - course 1 - www introductionHung-yu Lin
 
OpenWebSchool - 11 - CodeIgniter
OpenWebSchool - 11 - CodeIgniterOpenWebSchool - 11 - CodeIgniter
OpenWebSchool - 11 - CodeIgniterHung-yu Lin
 
OpenWebSchool - 06 - PHP + MySQL
OpenWebSchool - 06 - PHP + MySQLOpenWebSchool - 06 - PHP + MySQL
OpenWebSchool - 06 - PHP + MySQLHung-yu Lin
 
OpenWebSchool - 05 - MySQL
OpenWebSchool - 05 - MySQLOpenWebSchool - 05 - MySQL
OpenWebSchool - 05 - MySQLHung-yu Lin
 
OpenWebSchool - 02 - PHP Part I
OpenWebSchool - 02 - PHP Part IOpenWebSchool - 02 - PHP Part I
OpenWebSchool - 02 - PHP Part IHung-yu Lin
 
OpenWebSchool - 01 - WWW Intro
OpenWebSchool - 01 - WWW IntroOpenWebSchool - 01 - WWW Intro
OpenWebSchool - 01 - WWW IntroHung-yu Lin
 
OpenWebSchool - 03 - PHP Part II
OpenWebSchool - 03 - PHP Part IIOpenWebSchool - 03 - PHP Part II
OpenWebSchool - 03 - PHP Part IIHung-yu Lin
 
Google App Engine
Google App EngineGoogle App Engine
Google App EngineHung-yu Lin
 

Más de Hung-yu Lin (11)

2014 database - course 2 - php
2014 database - course 2 - php2014 database - course 2 - php
2014 database - course 2 - php
 
2014 database - course 3 - PHP and MySQL
2014 database - course 3 - PHP and MySQL2014 database - course 3 - PHP and MySQL
2014 database - course 3 - PHP and MySQL
 
2014 database - course 1 - www introduction
2014 database - course 1 - www introduction2014 database - course 1 - www introduction
2014 database - course 1 - www introduction
 
OpenWebSchool - 11 - CodeIgniter
OpenWebSchool - 11 - CodeIgniterOpenWebSchool - 11 - CodeIgniter
OpenWebSchool - 11 - CodeIgniter
 
OpenWebSchool - 06 - PHP + MySQL
OpenWebSchool - 06 - PHP + MySQLOpenWebSchool - 06 - PHP + MySQL
OpenWebSchool - 06 - PHP + MySQL
 
OpenWebSchool - 05 - MySQL
OpenWebSchool - 05 - MySQLOpenWebSchool - 05 - MySQL
OpenWebSchool - 05 - MySQL
 
OpenWebSchool - 02 - PHP Part I
OpenWebSchool - 02 - PHP Part IOpenWebSchool - 02 - PHP Part I
OpenWebSchool - 02 - PHP Part I
 
OpenWebSchool - 01 - WWW Intro
OpenWebSchool - 01 - WWW IntroOpenWebSchool - 01 - WWW Intro
OpenWebSchool - 01 - WWW Intro
 
OpenWebSchool - 03 - PHP Part II
OpenWebSchool - 03 - PHP Part IIOpenWebSchool - 03 - PHP Part II
OpenWebSchool - 03 - PHP Part II
 
Google App Engine
Google App EngineGoogle App Engine
Google App Engine
 
Redis
RedisRedis
Redis
 

Último

What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 

Último (20)

What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 

Dremel: interactive analysis of web-scale datasets

  • 1. Dremel: Interactive Analysis of Web- Scale Datasets Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis Google, Inc. VLDB, 2010 Presentation by ensky (enskylin@gmail.com)
  • 2. Outline  Introduction  Nested columnar storage  Query processing  Experiments  Conclusion 2
  • 3. Introduction  Dremel is an query system For analysis of read-only nested data  Use case Interactive Trends Tools Detection Web Spam Network Dashboards Optimization 3
  • 4. DocId: 10 Links Features Forward: 20 Name Language Code: 'en-us' Country: 'us'  Multi-level structure Url: 'http://A' Name  SQL-like query language Url: 'http://B'  Fast: Trillion-row tables in second level  Big scale: thousands of CPUs and petabytes of data  Widely used by google: in production since 2006 SELECT A, COUNT(B) FROM T GROUP BY A T = {/gfs/1, /gfs/2, …, /gfs/100000} 4
  • 5. Contribution  This paper presented two major technique in Dremel:  Nested columnar storage ◦ store / split into columns ◦ Assembly to record  Query ◦ Language ◦ Execution 5
  • 6. Outline  Introduction  Nested columnar storage  Query processing  Experiments  Conclusion 6
  • 7. Why nested?  Flexible ◦ Data in web and scientific is often non- relational  Reduce cost ◦ Normalizing and recombining nested data is often prohibited in TB, PB scale of data. 7
  • 8. Why column? DocId: 10 Links column-oriented Forward: 20 Name A * * ... Language Code: 'en-us' Country: 'us' B E Url: 'http://A' * C D r1 Name Url: 'http://B' r1 r1 r1 r2 r2 r2 Read less, r2 cheaper Record-oriented decompression Challenge: preserve structure, reconstruct from a subset of fields 8
  • 9. DocId: 10 Nested data model Links Forward: 20 message Document { Forward: 40 Forward: 60 required int64 DocId; [1,1] Name optional group Links { Language Code: 'en-us' repeated int64 Backward; [0,*] Country: 'us' repeated int64 Forward; Language } Code: 'en' Url: 'http://A' repeated group Name { Name repeated group Language { Url: 'http://B' Name required string Code; Language optional string Country; [0,1] Code: 'en-gb' Country: 'gb' } optional string Url; DocId: 20 } Links } Backward: 10 Backward: 30 Forward: 80 Name http://code.google.com/apis/protocolbuffers Url: 'http://C' 9
  • 10. Column-striped representation DocId Name.Url Links.Forward Links.Backward value r d value r d value r d value r d 10 0 0 http://A 0 2 20 0 2 NULL 0 1 20 0 0 http://B 1 2 40 1 2 10 0 2 DocId: 10 NULL 1 1 60 1 2 30 1 2 Links http://C 0 2 80 0 2 Forward: 20 Forward: 40 Forward: 60 Name Language Name.Language.Code Name.Language.Country Code: 'en-us' Country: 'us' value r d value r d Language en-us 0 2 us 0 3 Code: 'en' Url: 'http://A' en 2 2 NULL 2 2 Name Url: 'http://B' NULL 1 1 NULL 1 1 Name Language en-gb 1 2 gb 1 3 Code: 'en-gb' NULL 0 1 NULL 0 1 Country: 'gb' 10
  • 11. Column-oriented problem  When query sub-tree, there is no way for you to know the whole structure(even the position of yourself) A * * B ... E  To solve this problem, * this paper presents C D repetition & definition r1 r1 r1 r2 r2 Where am I? r2 11
  • 12. Repetition and definition levels r DocId: 10 Links 1 Forward: 20 Forward: 40 Name.Language.Code Forward: 60 value r d Name First time, repeat=0 Language en-us 0 2 Code: 'en-us' Language repeat, level = 2 Country: 'us' en 2 2 Language Code: 'en' NULL 1 1 Url: 'http://A' en-gb 1 2 Name Url: 'http://B' NULL 0 1 Name Language Code: 'en-gb' None repeat, level = 0 Country: 'gb' DocId: 20 2 r Links Backward: 10 Backward: 30 r: At what repeated field in the field's path Forward: 80 the value has repeated Name Url: 'http://C'12
  • 13. Repetition and definition levels r DocId: 10 Links 1 Forward: 20 Forward: 40 Name.Language.Country Forward: 60 value r d Name Language us 0 3 Code: 'en-us' Country: 'us' NULL 2 2 Language Code: 'en' NULL 1 1 Url: 'http://A' gb 1 3 Name Url: 'http://B' NULL 0 1 Name Language Code: 'en-gb' Country: 'gb' DocId: 20 2 r Links Backward: 10 Backward: 30 d: How many fields in paths that could be Forward: 80 Name undefined (opt. or rep.) are actually present Url: 'http://C'13
  • 14. Column-oriented is all right, but what if I still need record- oriented query? (e.g., MapReduce) 14
  • 15. Record assembly FSM Transitions labeled with repetition levels DocId 0 0 1 1 Links.Backward Links.Forward 0 0,1,2 Name.Language.Code Name.Language.Country 2 Name.Ur 0,1 1 l 0 For record-oriented data processing (e.g., MapReduce) 15
  • 16. Reading two fields DocId: 10 s 1 Name Language DocId Country: 'us' Language 0 Name Name 1,2 Name.Language.Country Language 0 Country: 'gb' DocId: 20 s2 Name Structure of parent fields is preserved. Useful for queries like /Name[3]/Language[1]/Country 16
  • 17. Outline  Introduction  Nested columnar storage  Query processing  Experiments  Conclusion 17
  • 18. Query language  Based on SQL  Performs ◦ Projection ◦ Selection ◦ Nested subqueries ◦ inner and intra-record aggregation ◦ Top-k ◦ Joins ◦ User-defined functions 18
  • 19. DocId: 10 Links Forward: 20 Forward: 40 Example usage Forward: 60 Name Language Code: 'en-us' Country: 'us' SELECT DocId AS Id, Language Code: 'en' COUNT(Name.Language.Code) WITHIN Name AS Cnt, Url: 'http://A' Name Name.Url + ',' + Name.Language.Code AS Str Url: 'http://B' Name FROM t Language WHERE REGEXP(Name.Url, '^http') AND DocId < 20; Code: 'en-gb' Country: 'gb' Output table Output schema Id: 10 t1 message QueryResult { Name required int64 Id; Cnt: 2 repeated group Name { Language optional uint64 Cnt; Str: 'http://A,en-us' repeated group Language { Str: 'http://A,en' optional string Str; Name } Cnt: 0 } } 19
  • 20. Query execution client • Parallelizes scheduling root server and aggregation • Fault tolerance intermediate ... • query dispatcher can servers dispatch servers leaf servers ... (with local ... storage) storage layer (e.g., GFS) 20
  • 21. Example: count() SELECT A, COUNT(B) FROM T GROUP SELECT A, SUM(c) 0 BY A FROM (R11 UNION ALL R110) T = {/gfs/1, /gfs/2, …, /gfs/100000} GROUP BY A R11 R12 SELECT A, COUNT(B) AS c SELECT A, COUNT(B) AS c 1 FROM T11 GROUP BY A FROM T12 GROUP BY A T11 = {/gfs/1, …, /gfs/10000} T12 = {/gfs/10001, …, /gfs/20000} ... SELECT A, COUNT(B) AS c 3 FROM T31 GROUP BY A ... T31 = {/gfs/1} Data access ops 21
  • 22. Outline  Introduction  Nested columnar storage  Query processing  Experiments  Conclusion 22
  • 23. Experiment Data • 1 PB of real data (uncompressed, non-replicated) • 100K-800K tablets per table • Experiments run during business hours Table Number of Size (unrepl., Number Data Repl. name records compressed) of fields center factor T1 85 billion 87 TB 270 A 3× T2 24 billion 13 TB 530 A 3× T3 4 billion 70 TB 1200 A 3× T4 1+ trillion 105 TB 50 B 3× T5 1+ trillion 20 TB 30 B 2× 23
  • 24. Column v.s. Record"cold" time on local disk, time (sec) averaged over 30 runs (e) parse as from records C++ objects 10x speedup using columnar objects storage (d) read + decompress records (c) parse as from columns columns C++ objects (b) assemble 2-4x overhead of records using records (a) read + decompress number of fields Table partition: 375 MB (compressed), 300K rows, 125 columns 24
  • 25. MR and Dremel execution Avg # of terms in txtField in 85 billion record table T1 execution time (sec) on 3000 nodes Sawzall program ran on MR: num_recs: table sum of int; num_words: table sum of int; emit num_recs <- 1; emit num_words <- count_words(input.txtField); 87 TB 0.5 TB 0.5 TB Q1: SELECT SUM(count_words(txtField)) / COUNT(*) FROM T1 MR overheads: launch jobs, schedule 0.5M tasks, assemble record 25
  • 26. Impact of serving tree depth execution time (sec) (returns 100s of records) (returns 1M records) Q2: SELECT country, SUM(item.amount) FROM T2 GROUP BY country Q3: SELECT domain, SUM(item.amount) FROM T2 WHERE domain CONTAINS ’.net’ GROUP BY domain 40 billion nested items26
  • 27. Scalability execution time (sec) number of leaf servers Q5 on a trillion-row table T4: SELECT TOP(aid, 20), COUNT(*) FROM T4 27
  • 28. Outline  Introduction  Nested columnar storage  Query processing  Experiments  Conclusion 28
  • 29. Observation Monthly query workload of one 3000-node percentage of queries Dremel instance execution time (sec) Most queries complete under 10 sec 29
  • 30. Conclusion  Dremel is an query system For analysis of read-only nested data  Main feature is fast(interactive response time), column-oriented, SQL-like Query language  Introduced two major method: ◦ Nested columnar storage – in order to solve partial query problem. ◦ Query processing – parallel processing & distributing, decompressing queries 30
  • 31. Comments  Google is awesome!  Pros ◦ Nested storage gives us more flexibility. ◦ repetition & definition is an novel idea and it can solve the locality problem easily. ◦ Distributed serving tree is awesome, faster than MR. 31
  • 32. Comments  Cons ◦ Read-only may not fit every requirement ◦ Dremel is not a database, so you’ll need to convert your real data into dremel when analyzing  Converting may cost lost of time and space  Google doesn’t care this problem, they have GFS and many servers. 32
  • 33. Thank you for listening. 33