Flexible Indexing in Hadoop
         Dmitriy Ryaboy @squarecog
        Analytics Infrastructure @ Twitter
    Hadoop Summit, San Jose, CA June 2012
@JoinTheFlock | Hadoop Summit, June 14 2012   2
Hadoop is great at plowing
through data


And we do plow
   10s of Thousands of Jobs per day

100 TB (uncompressed) ingested daily

Many users and diverse use cases




Looking for needles in
haystacks.




With snowplows.

A Pig Script
 event_logs = load '/logs/lots_of_data'
                     using ThriftPigLoader('thrift.gen.LogEvent');
 filtered_logs = filter event_logs by event == 'something_rare';


 -- Then do stuff.




90% of the mappers in this job output no data.
We can do better...


Find smaller haystacks.




Use subpartitions!
• tablename/year/month/day/hour/bucket
• Only so many things you can partition by
• Up-front planning required
• Rewrite or duplicate for different query patterns




Keep the data sorted!
• Painful to maintain
• Only one sort order at a time
• Rewrite or duplicate for different query patterns




Trojan Layouts*
• Identify interesting column groupings
• Use different column groupings per HDFS block replica
• Requires changes to the NameNode (NN)
• ... and increases load on NN




                             * http://infosys.uni-saarland.de/publications/JQD11.pdf




HBase!
• Good solution in many cases!
• Maintenance overhead
• All data must live in HBase
• Full table scans slower than MR
• Again with the up-front design
  • Secondary Indexes can help




Hive!
• That kind of works, actually.




Hive
Generic Interface for defining indexing behavior.


Reference implementation: “compact” index
 value -> list of HDFS blocks; drop unneeded blocks.


Other indexes available (bitmap in 0.8)


It’ll even update indexes as you add partitions.




WIN!
Done, Right?




Hive
Good news if your data is in Hive!


Bad news if your world is a little bigger.


Indexing is tightly coupled to Hive.


No interoperability with the rest of the Hadoop stack.




Democracy of Tools
• Pig
• Raw Map-Reduce
• Cascading DSLs (Scalding, Cascalog, Py-Cascading)
• Mahout
• Maybe even Hive



Design Goals

• Minimal Job/Script modification required
• As low in the stack as possible
 • In fact, pretty sure we could get Hive to use this...
• No unnecessary copies of data
• Allow post-factum indexing
• Graceful degradation
• Flexible on-disk representation


Elephant-Twin
Twitter’s library for creating indexes in Hadoop
https://github.com/twitter/elephant-twin
https://github.com/twitter/elephant-twin-lzo




Block-Level Indexes
For each value, record the block it occurs in


“Block” can be HDFS block (100s of MBs)
Or LZO block (100s of KBs)
Or SequenceFile block
Or RCFile block ...


Ignore irrelevant blocks
Scan relevant blocks using original InputFormat
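
The idea above can be sketched in a few lines of Java. This is a minimal in-memory illustration, not Elephant-Twin's implementation: the class `BlockIndexSketch`, its methods, and the sample offsets are all hypothetical, and a real index would live on disk (for example in a MapFile) rather than in a `HashMap`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical in-memory stand-in for a block-level index: value -> block offsets.
final class BlockIndexSketch {
    private final Map<String, TreeSet<Long>> blocksByValue = new HashMap<>();

    // Record that `value` occurs somewhere inside the block starting at `blockOffset`.
    void add(String value, long blockOffset) {
        blocksByValue.computeIfAbsent(value, k -> new TreeSet<>()).add(blockOffset);
    }

    // Blocks that may contain `value`, in file order; every other block can be ignored.
    List<Long> relevantBlocks(String value) {
        return new ArrayList<>(blocksByValue.getOrDefault(value, new TreeSet<>()));
    }

    public static void main(String[] args) {
        BlockIndexSketch index = new BlockIndexSketch();
        index.add("something_common", 0L);
        index.add("something_common", 262144L);
        index.add("something_rare", 262144L);
        index.add("something_common", 524288L);
        // Only the block at offset 262144 needs to be scanned for the rare value.
        System.out.println(index.relevantBlocks("something_rare"));
    }
}
```

The blocks returned by the lookup are then read with the original InputFormat, so the data itself is never copied or rewritten.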




Record-Level Indexes
For each value, record some representation of the record


Can be value + offset, as in bitmap indexes
Can be transformed projection of records, as in Lucene indexes


Some queries can be answered directly from index.
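
As a rough illustration of the last point, here is a hypothetical value + offset index in Java. The class `RecordIndexSketch` and its methods are assumptions made for this sketch, not part of Elephant-Twin. A count query is answered from the index alone, while fetching the records themselves would still require seeking to the stored offsets.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical record-level index: value -> offsets of the records that contain it.
final class RecordIndexSketch {
    private final Map<String, List<Long>> offsetsByValue = new HashMap<>();

    // Record that the record at `recordOffset` contains `value`.
    void add(String value, long recordOffset) {
        offsetsByValue.computeIfAbsent(value, k -> new ArrayList<>()).add(recordOffset);
    }

    // "How many records have this value?" is answered without touching the data.
    int count(String value) {
        return offsetsByValue.getOrDefault(value, Collections.emptyList()).size();
    }

    // Offsets to seek to when the records themselves are needed.
    List<Long> offsets(String value) {
        return new ArrayList<>(offsetsByValue.getOrDefault(value, Collections.emptyList()));
    }

    public static void main(String[] args) {
        RecordIndexSketch index = new RecordIndexSketch();
        index.add("something_rare", 1024L);
        index.add("something_rare", 90210L);
        index.add("something_common", 2048L);
        System.out.println(index.count("something_rare"));
    }
}
```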




Indexing: Data -> InputFormat -> MR job -> Index



Creating an Index
public abstract class AbstractBlockIndexingJob {
    protected abstract List<String> getInput();
    protected abstract String getIndex();
    protected abstract String getInputFormat();
    protected abstract String getValueClass();
    protected abstract String getColumnName();
    protected abstract Job setMapper(Job job);
}

public abstract class AbstractLuceneIndexingJob {
  // Similar.
}




Creating an Index
Mapper transforms the records: emit <DocId, Value>
    Key             Value
    Block Offset    Column Value
    Tweet Id        Text


Block helper:
public abstract class BlockIndexingMapper<KIN, VIN>
    extends Mapper<KIN, VIN, TextLongPairWritable, LongPairWritable> {}


Lucene helper:
public abstract class AbstractIndexingMapper<KIN, VIN, KOUT, VOUT>
    extends Mapper<KIN, VIN, KOUT, VOUT> {
  abstract protected boolean filter(KIN k, VIN v);
  abstract protected KOUT buildOutputKey(KIN k, VIN v);
}

Creating an Index
Reducer writes appropriately processed indexes and metadata.


MapFile block index:
public class MapFileIndexingReducer
    extends Reducer<TextLongPairWritable, LongPairWritable,
                    Text, ListLongPair>

Lucene index:
public abstract class AbstractLuceneIndexingReducer<KIN, VIN>
    extends Reducer<KIN, VIN, NullWritable, NullWritable> {
  protected abstract Document buildDocument(KIN k, VIN v);
}




Creating an Index: Metadata
struct FileIndexDescriptor {
    1: DocType docType
    2: IndexType indexType
    3: i32 indexVersion
    4: string sourcePath
    5: FileChecksum checksum
    6: list<IndexedField> indexedFields
}
struct ETwinIndexDescriptor {
    1: list<FileIndexDescriptor> fileIndexDescriptors
    2: i32 indexPart
    3: optional map<string, string> options
}
Retrieval: MR job + searchKey -> IndexedInputFormat -> Index, Data



InputFormat
public class BlockIndexedFileInputFormat<K, V>
    extends FileInputFormat<K, V> {

    // Indexing jobs call this function to set up
    // indexing job related parameters.
    public static void setIndexOptions(Job job,
      String inputformatClass, String valueClass,
      String indexDir, String columnName)

    // Searching jobs call this function to set up
    // searching job related parameters.
    public static void setSearchOptions(Job job,
      String inputformatClass, String valueClass,
      String indexDir, BinaryExpression filter)
}




BinaryExpression
public BinaryExpression(
    Expression lhs, Expression rhs, OpType opType)

public static enum OpType {
    OP_PLUS (" + "),
    OP_MINUS(" - "),
    ...
    OP_EQ(" == "),
    OP_NE(" != "),
    ...
    OP_AND(" and "),
    OP_OR(" or "),
    ...
    TERM_COL(" Column "),
    TERM_CONST(" Constant ");
}



Pig Integration
    event_logs = load '/logs/lots_of_data'
        using ThriftPigLoader('thrift.gen.LogEvent');

    filtered_logs = filter event_logs by event == 'something_rare';
    -- Then do stuff.




Pig Integration
    register elephant-twin-1.0.jar
    event_logs = load '/logs/lots_of_data'
        using IndexedLZOPigLoader(
            'ThriftPigLoader',
            'thrift.gen.LogEvent',
            '/user/dmitriy/etwin');

    -- Pig will automatically push this down into the Loader and InputFormat
    filtered_logs = filter event_logs by event == 'something_rare';




Optimization: merge neighbors
           HDFS Block 1                       HDFS Block 2




Merge neighbors, share the scan.
(Limit expansion to size of HDFS block)
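
A minimal sketch of this merge step in Java follows. It assumes the matching blocks arrive as sorted, non-overlapping `[start, end)` byte ranges; the class `MergeNeighborsSketch` and the `maxSpanBytes` parameter are hypothetical names chosen for illustration, not the library's API. Only ranges that touch are merged, and a merged span is never allowed to grow past the cap.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class MergeNeighborsSketch {
    // Merge sorted, non-overlapping [start, end) ranges that touch each other,
    // as long as the merged span stays within maxSpanBytes.
    static List<long[]> merge(List<long[]> sortedBlocks, long maxSpanBytes) {
        List<long[]> spans = new ArrayList<>();
        for (long[] block : sortedBlocks) {
            long[] last = spans.isEmpty() ? null : spans.get(spans.size() - 1);
            boolean adjacent = last != null && last[1] == block[0];
            if (adjacent && block[1] - last[0] <= maxSpanBytes) {
                last[1] = block[1]; // extend the current span: one scan covers both
            } else {
                spans.add(new long[] {block[0], block[1]}); // copy, so input is untouched
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        List<long[]> matches = Arrays.asList(
            new long[] {0, 100}, new long[] {100, 200}, new long[] {300, 400});
        for (long[] span : merge(matches, 1000)) {
            System.out.println(span[0] + "-" + span[1]);
        }
    }
}
```

With the sample input, the first two ranges share a boundary and are merged into one span, while the third stays separate because it is not adjacent.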


Optimization: merge neighbors
            HDFS Block 1                           HDFS Block 2




Scans are faster than random reads... so allow gaps?
Turns out, not that much faster. Better to jump.


Optimization: combine small splits
(Diagram: three small "match" spans across HDFS Block 1 and HDFS Block 2, grouped into one generated split.)



Combine small relevant spans into single splits.
Try to take locality into account.
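
One way to picture the combining step is a greedy packer, sketched below in Java under stated assumptions: spans are `[start, end)` byte ranges in file order, `targetSplitBytes` is a hypothetical tuning knob, and locality is ignored entirely, which the real implementation tries not to do. The class name `CombineSplitsSketch` is made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CombineSplitsSketch {
    // Greedily pack small relevant spans into splits of at most targetSplitBytes.
    // A single span larger than the target still gets a split of its own.
    static List<List<long[]>> combine(List<long[]> spans, long targetSplitBytes) {
        List<List<long[]>> splits = new ArrayList<>();
        List<long[]> current = new ArrayList<>();
        long currentBytes = 0;
        for (long[] span : spans) {
            long length = span[1] - span[0];
            if (!current.isEmpty() && currentBytes + length > targetSplitBytes) {
                splits.add(current); // current split is full: start a new one
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(span);
            currentBytes += length;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        List<long[]> spans = Arrays.asList(
            new long[] {0, 40}, new long[] {500, 540}, new long[] {900, 940});
        System.out.println(combine(spans, 100).size() + " splits");
    }
}
```

Fewer, larger splits mean fewer map tasks, each doing a useful amount of work instead of starting up to read a few hundred kilobytes.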



Applicability
Most keys occur in very few blocks!
The most frequent key occurs in only half the blocks.




Results
Applicable Jobs take 5-10x fewer resources


Ad-hoc jobs particularly likely to benefit


“Real” indexes are still faster...
 -- but can be represented using the same abstraction




Future Work


  • Regex matching on keys
  • Better Pig pushdown support
  • MultiIndexInputFormat
  • Traditional indexes under ETwin
  • Index maintenance (via HCatalog?)




Questions?
@squarecog


Sounds like fun? We are hiring.



ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 

Flexible In-Situ Indexing for Hadoop via Elephant Twin

  • 1. Flexible Indexing in Hadoop Dmitriy Ryaboy @squarecog Analytics Infrastructure @ Twitter Hadoop Summit, San Jose, CA June 2012
  • 2. @JoinTheFlock | Hadoop Summit, June 14 2012 2
  • 4. Hadoop is great at plowing through data. (Image source: http://en.wikipedia.org/wiki/File:Snowplow_in_the_morning.jpg)
  • 5. And we do plow
      • Tens of thousands of jobs per day
      • 100 TB (uncompressed) ingested daily
      • Many users and diverse use cases
  • 7. Looking for needles in haystacks. With snowplows. (Image source: http://en.wikipedia.org/wiki/File:July_1903_-_on_the_Gaisberg,_nr_Salzburg.JPG)
  • 8. A Pig Script
        event_logs = load '/logs/lots_of_data'
            using ThriftPigLoader('thrift.gen.LogEvent');
        filtered_logs = filter event_logs by event == 'something_rare';
        -- Then do stuff.
      90% of the mappers in this job output no data. We can do better...
  • 9. Find smaller haystacks. (Image source: http://en.wikipedia.org/wiki/File:July_1903_-_on_the_Gaisberg,_nr_Salzburg.JPG)
  • 14. Use subpartitions!
      • tablename/year/month/day/hour/bucket
      • Only so many things you can partition by
      • Up-front planning required
      • Rewrite or duplicate for different query patterns
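
The partition layout above can be illustrated with a short sketch. It only shows how a time-range query narrows to a handful of directories under a `tablename/year/month/day/hour/bucket` layout. The class name `PartitionPaths`, the zero-padded path components, and the integer bucket are assumptions made for the example; the talk does not specify them.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

public class PartitionPaths {
    // Assumed layout: year/month/day/hour, zero-padded (not stated in the slides).
    private static final DateTimeFormatter HOURLY =
            DateTimeFormatter.ofPattern("yyyy/MM/dd/HH");

    /** Lists the hourly partition directories covering [start, endExclusive) for one bucket. */
    public static List<String> hourlyPaths(String table, LocalDateTime start,
                                           LocalDateTime endExclusive, int bucket) {
        List<String> paths = new ArrayList<>();
        for (LocalDateTime t = start; t.isBefore(endExclusive); t = t.plusHours(1)) {
            paths.add(table + "/" + t.format(HOURLY) + "/" + bucket);
        }
        return paths;
    }
}
```

Pruning works only along the dimensions baked into the path, which is the limitation the bullets above point out.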
  • 18. Keep the data sorted!
      • Painful to maintain
      • Only one sort order at a time
      • Rewrite or duplicate for different query patterns
  • 23. Trojan Layouts*
      • Identify interesting column groupings
      • Use different column groupings per HDFS block replica
      • Requires changes to the NameNode (NN)
      • ... and increases load on the NN
      * http://infosys.uni-saarland.de/publications/JQD11.pdf
  • 30. HBase!
      • Good solution in many cases!
      • Maintenance overhead
      • All data must live in HBase
      • Full table scans slower than MR
      • Again with the up-front design
      • Secondary indexes can help
  • 32. Hive!
      • That kind of works, actually.
  • 33. Hive
      • Generic interface for defining indexing behavior.
      • Reference implementation: "compact" index (value -> list of HDFS blocks); drop unneeded blocks.
      • Other indexes available (bitmap in 0.8).
      • It'll even update indexes as you add partitions.
  • 34. WIN! Done, right?
  • 35. Hive
      • Good news if your data is in Hive!
      • Bad news if your world is a little bigger.
      • Indexing is tightly coupled to Hive.
      • No interoperability with the rest of the Hadoop stack.
  • 41. Democracy of Tools (Image source: http://en.wikipedia.org/wiki/File:20070124_sejm_sala_plenarna.jpg)
      • Pig
      • Raw Map-Reduce
      • Cascading DSLs (Scalding, Cascalog, Py-Cascading)
      • Mahout
      • Maybe even Hive
  • 50. Design Goals
      • Minimal Job/Script modification required
      • As low in the stack as possible
      • In fact, pretty sure we could get Hive to use this...
      • No unnecessary copies of data
      • Allow post-factum indexing
      • Graceful degradation
      • Flexible on-disk representation
  • 51. Elephant-Twin: Twitter's library for creating indexes in Hadoop
      https://github.com/twitter/elephant-twin
      https://github.com/twitter/elephant-twin-lzo
  • 52. Block-Level Indexes
      • For each value, record the blocks it occurs in.
      • "Block" can be an HDFS block (100s of MBs), or an LZO block (100s of KBs), or a SequenceFile block, or an RCFile block ...
      • Ignore irrelevant blocks.
      • Scan relevant blocks using the original InputFormat.
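
To make the block-level idea concrete, here is a minimal in-memory sketch: an inverted map from each value to the set of block ids that contain it. `BlockIndexSketch` and its methods are hypothetical names for illustration and are not Elephant-Twin classes; the real library writes its index to HDFS rather than keeping it in a `HashMap`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

public class BlockIndexSketch {
    // value -> sorted ids of the blocks holding at least one record with that value
    private final Map<String, SortedSet<Integer>> index = new HashMap<>();

    /** Indexing pass: note that `value` occurs somewhere in block `blockId`. */
    public void add(String value, int blockId) {
        index.computeIfAbsent(value, v -> new TreeSet<>()).add(blockId);
    }

    /** Retrieval: the blocks worth scanning for `value`; all other blocks are skipped. */
    public List<Integer> blocksFor(String value) {
        List<Integer> result = new ArrayList<>();
        SortedSet<Integer> blocks = index.get(value);
        if (blocks != null) {
            result.addAll(blocks);
        }
        return result;
    }
}
```

A lookup returns only block ids, so each selected block still has to be scanned with the original InputFormat to find the matching records.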
  • 53. Record-Level Indexes
      • For each value, record some representation of the record.
      • Can be value + offset, as in bitmap indexes.
      • Can be a transformed projection of records, as in Lucene indexes.
      • Some queries can be answered directly from the index.
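
A record-level index of the "value + offset" kind can be sketched as follows. The names are hypothetical and the structure is deliberately simplified (a plain list of offsets rather than a bitmap); the point is only that a query such as a count can be served from the index without touching the data.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RecordIndexSketch {
    // value -> offsets of the records carrying that value
    private final Map<String, List<Long>> offsetsByValue = new HashMap<>();

    /** Indexing pass: remember where a record with `value` starts. */
    public void add(String value, long recordOffset) {
        offsetsByValue.computeIfAbsent(value, v -> new ArrayList<>()).add(recordOffset);
    }

    /** Answered from the index alone: how many records carry `value`? */
    public int count(String value) {
        return offsets(value).size();
    }

    /** Offsets to seek to when the records themselves are needed. */
    public List<Long> offsets(String value) {
        List<Long> offsets = offsetsByValue.get(value);
        if (offsets == null) {
            return Collections.emptyList();
        }
        return Collections.unmodifiableList(offsets);
    }
}
```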
  • 54. Indexing [diagram; labels: MR Index job, InputFormat, Data]
  • 55. Creating an Index
        public abstract class AbstractBlockIndexingJob {
          protected abstract List<String> getInput();
          protected abstract String getIndex();
          protected abstract String getInputFormat();
          protected abstract String getValueClass();
          protected abstract String getColumnName();
          protected abstract Job setMapper(Job job);
        }
        public abstract class AbstractLuceneIndexingJob {
          // Similar.
        }
  • 56. Creating an Index
      Mapper transforms the records: emit <DocId, Value>
        Key            Value
        Block Offset   Column Value
        Tweet Id       Text
      Block helper:
        public abstract class BlockIndexingMapper<KIN, VIN>
            extends Mapper<KIN, VIN, TextLongPairWritable, LongPairWritable> {}
      Lucene helper:
        public abstract class AbstractIndexingMapper<KIN, VIN, KOUT, VOUT>
            extends Mapper<KIN, VIN, KOUT, VOUT> {
          abstract protected boolean filter(KIN k, VIN v);
          abstract protected KOUT buildOutputKey(KIN k, VIN v);
        }
  • 57. Creating an Index
      Reducer writes appropriately processed indexes and metadata.
      MapFile block index:
        public class MapFileIndexingReducer
            extends Reducer<TextLongPairWritable, LongPairWritable, Text, ListLongPair>
      Lucene index:
        public abstract class AbstractLuceneIndexingReducer<KIN, VIN>
            extends Reducer<KIN, VIN, NullWritable, NullWritable> {
          protected abstract Document buildDocument(KIN k, VIN v);
        }
  • 58. Creating an Index: Metadata
        struct FileIndexDescriptor {
          1: DocType docType
          2: IndexType indexType
          3: i32 indexVersion
          4: string sourcePath
          5: FileChecksum checksum
          6: list<IndexedField> indexedFields
        }
        struct ETwinIndexDescriptor {
          1: list<FileIndexDescriptor> fileIndexDescriptors
          2: i32 indexPart
          3: optional map<string, string> options
        }
  • 59. Retrieval [diagram; labels: MR job, searchKey, IndexedInputFormat, Index, Data]
  • 60. InputFormat
        public class BlockIndexedFileInputFormat<K, V> extends FileInputFormat<K, V> {
          // Indexing jobs call this function to set up indexing job related parameters.
          public static void setIndexOptions(Job job, String inputformatClass,
              String valueClass, String indexDir, String columnName)
          // Searching jobs call this function to set up searching job related parameters.
          public static void setSearchOptions(Job job, String inputformatClass,
              String valueClass, String indexDir, BinaryExpression filter)
        }
  • 61. BinaryExpression
        public BinaryExpression(Expression lhs, Expression rhs, OpType opType)
        public static enum OpType {
          OP_PLUS (" + "), OP_MINUS(" - "), ...
          OP_EQ(" == "), OP_NE(" != "), ...
          OP_AND(" and "), OP_OR(" or "), ...
          TERM_COL(" Column "), TERM_CONST(" Constant ");
        }
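
One way to picture how a filter expression tree could be resolved against a block index is sketched below. This is not the Elephant-Twin `BinaryExpression` API: `Expr`, `eq`, `and`, and `or` are hypothetical stand-ins, and the index is a plain map from a column's values to block ids. An equality term looks up its blocks, `and` intersects, and `or` unions.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class FilterEvalSketch {
    /** Hypothetical expression node; resolves to the set of candidate block ids. */
    interface Expr {
        Set<Integer> candidateBlocks(Map<String, Set<Integer>> valueToBlocks);
    }

    /** column == constant: the blocks recorded for that constant (copied, so callers may mutate). */
    static Expr eq(String constant) {
        return index -> {
            Set<Integer> hit = index.get(constant);
            return hit == null ? new TreeSet<Integer>() : new TreeSet<Integer>(hit);
        };
    }

    /** lhs and rhs: only blocks that both sides consider candidates. */
    static Expr and(Expr lhs, Expr rhs) {
        return index -> {
            Set<Integer> out = lhs.candidateBlocks(index);
            out.retainAll(rhs.candidateBlocks(index));
            return out;
        };
    }

    /** lhs or rhs: blocks that either side considers a candidate. */
    static Expr or(Expr lhs, Expr rhs) {
        return index -> {
            Set<Integer> out = lhs.candidateBlocks(index);
            out.addAll(rhs.candidateBlocks(index));
            return out;
        };
    }
}
```

The result is a superset of the blocks that actually hold matching records, so the sketch assumes the job still applies the original filter to every record it scans.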
  • 62. Pig Integration
        event_logs = load '/logs/lots_of_data'
            using ThriftPigLoader('thrift.gen.LogEvent');
        filtered_logs = filter event_logs by event == 'something_rare';
        -- Then do stuff.
  • 63. Pig Integration
        register elephant-twin-1.0.jar
        event_logs = load '/logs/lots_of_data'
            using IndexedLZOPigLoader('ThriftPigLoader',
                                      'thrift.gen.LogEvent',
                                      '/user/dmitriy/etwin');
        -- Pig will automatically push this down into the Loader and InputFormat
        filtered_logs = filter event_logs by event == 'something_rare';
  • 66. Optimization: merge neighbors [diagram: HDFS Block 1, HDFS Block 2]
      • Merge neighbors, share the scan. (Limit expansion to the size of an HDFS block.)
      • Scans are faster than random reads, so allow gaps? Turns out, not that much faster. Better to jump.
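
The neighbor-merging step can be sketched as a single pass over matched spans sorted by start offset. Spans are `[start, end)` byte ranges held in two-element `long` arrays; `maxGap` and `maxLength` are parameters invented for this example. Passing `maxGap = 0` corresponds to the "better to jump" conclusion, and `maxLength` stands in for the HDFS block size limit.

```java
import java.util.ArrayList;
import java.util.List;

public class MergeNeighborsSketch {
    /**
     * Merges spans that touch (or lie within maxGap bytes of each other),
     * as long as the merged span stays within maxLength bytes.
     * Input must be sorted by start offset; each span is {start, endExclusive}.
     */
    public static List<long[]> merge(List<long[]> sortedSpans, long maxGap, long maxLength) {
        List<long[]> merged = new ArrayList<>();
        long[] current = null;
        for (long[] span : sortedSpans) {
            boolean closeEnough = current != null && span[0] - current[1] <= maxGap;
            boolean smallEnough = current != null && span[1] - current[0] <= maxLength;
            if (closeEnough && smallEnough) {
                // Extend the current span so one scan covers both matches.
                current = new long[] {current[0], Math.max(current[1], span[1])};
            } else {
                if (current != null) {
                    merged.add(current);
                }
                current = span;
            }
        }
        if (current != null) {
            merged.add(current);
        }
        return merged;
    }
}
```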
  • 67. Optimization: combine small splits [diagram: HDFS Block 1, HDFS Block 2; three matches packed into one generated split]
      • Combine small relevant spans into single splits.
      • Try to take locality into account.
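
Combining small spans can be sketched as greedy packing: keep adding spans to the current split until the next one would push it past a target size. The `targetBytes` parameter and the class name are assumptions for the example, and the sketch ignores locality entirely, which the slide says a real implementation should try to respect.

```java
import java.util.ArrayList;
import java.util.List;

public class CombineSplitsSketch {
    /**
     * Packs spans ({start, endExclusive} byte ranges) into splits of roughly
     * targetBytes each. A span larger than targetBytes gets a split of its own.
     */
    public static List<List<long[]>> combine(List<long[]> spans, long targetBytes) {
        List<List<long[]>> splits = new ArrayList<>();
        List<long[]> current = new ArrayList<>();
        long currentBytes = 0;
        for (long[] span : spans) {
            long length = span[1] - span[0];
            if (!current.isEmpty() && currentBytes + length > targetBytes) {
                // The current split is full; start a new one.
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(span);
            currentBytes += length;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }
}
```

Fewer, fuller splits mean fewer map tasks that each do a meaningful amount of work, instead of one short-lived task per matched span.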
  • 68. Applicability
      • Most keys occur in very few blocks!
      • Most frequent key only occurs in half the blocks.
  • 69. Results
      • Applicable jobs take 5-10x fewer resources.
      • Ad-hoc jobs particularly likely to benefit.
      • "Real" indexes still faster, but can be represented using the same abstraction.
  • 75. Future Work (Image source: http://en.wikipedia.org/wiki/File:Shasta_dam_under_construction_new_edit.jpg)
      • Regex matching on keys
      • Better Pig pushdown support
      • MultiIndexInputFormat
      • Traditional indexes under ETwin
      • Index maintenance (via HCatalog?)
  • 76. Questions? @squarecog. Sounds like fun? We are hiring.
