SlideShare una empresa de Scribd logo
1 de 74
Descargar para leer sin conexión
Understanding Indexing
Without Needing to Understand Data Structures

 Boston MySQL Meetup - March 14, 2011

          Zardosht Kasheff
What’s a Table?
    A dictionary is a set of (key, value) pairs.
     • We’ll assume you can update the dictionary (insertions,
       deletions, updates) and query the dictionary (point
       queries, range queries)
     • B-Trees and Fractal Trees are examples of dictionaries
     • Hashes are not (range queries are not supported)




2
What’s a Table?

    A table is a set of
                              a      b     c
    dictionaries.
                             100     5    45
    Example:                 101    92     2
    create table foo (a      156    56    45
    int, b int, c int,       165     6     2
    primary key(a));         198   202    56
                             206    23   252
    Then we insert a bunch   256    56     2
    of data and get...       412    43    45



3
What’s a Table?
    The sort order of a dictionary is defined by the key
     • For the data structures/storage engines we’ll think about,
       range queries on the sort order are FAST
     • Range queries on any other order require a table scan =
       SLOW
     • Point queries -- retrieving the value for one particular key --
       is SLOW
       ‣ A single point query is fast, but reading a bunch of rows this way is going
        to be 2 orders of magnitude slower than reading the same number of
        rows in range query order




4
What’s an Index?
    A Index I on table T is itself a dictionary
     • We need to define the (key, value) pairs.
     • The key in index I is a subset of fields in the primary
       dictionary T.
     • The value in index I is the primary key in T.
       ‣ There are other ways to define the value, but we’re sticking with this.

    Example:
    alter table foo add key(b);
    Then we get...



5
What’s an Index?

          Primary         key(b)
     a        b       c     b    a
    100       5      45     5   100
    101      92       2     6   165
    156      56      45    23   206
    165       6       2    43   412
    198     202      56    56   156
    206      23     252    56   256
    256      56       2    92   101
    412      43      45   202   198


6
Q: count(*) where a<120;


     a      b     c     b    a
    100     5    45     5   100
    101    92     2     6   165
    156    56    45    23   206
    165     6     2    43   412
    198   202    56    56   156
    206    23   252    56   256
    256    56     2    92   101
    412    43    45   202   198


7
Q: count(*) where a<120;


     a      b     c     b    a
    100     5    45     5   100   100    5   45
    101    92     2     6   165   101   92    2
    156    56    45    23   206
    165     6     2    43   412
    198   202    56    56   156
    206    23   252    56   256         2
    256    56     2    92   101
    412    43    45   202   198


7
Q: count(*) where b>50;


     a      b     c     b    a
    100     5    45     5   100
    101    92     2     6   165
    156    56    45    23   206
    165     6     2    43   412
    198   202    56    56   156
    206    23   252    56   256
    256    56     2    92   101
    412    43    45   202   198


8
Q: count(*) where b>50;


     a      b     c     b    a
                                   56       156
    100     5    45     5   100
                                   56       256
    101    92     2     6   165
                                   92       101
    156    56    45    23   206
                                  202       198
    165     6     2    43   412
    198   202    56    56   156
    206    23   252    56   256
                                        4
    256    56     2    92   101
    412    43    45   202   198


8
Q: sum(c) where b>50;


     a      b     c     b    a
    100     5    45     5   100
    101    92     2     6   165
    156    56    45    23   206
    165     6     2    43   412
    198   202    56    56   156
    206    23   252    56   256
    256    56     2    92   101
    412    43    45   202   198


9
Q: sum(c) where b>50;
                                     56   156
                                     56   256




                                                     Fast
     a      b     c     b    a       92   101
    100     5    45     5   100     202   198
    101    92     2     6   165
    156    56    45    23   206   156    56     45
    165     6     2    43   412   256    56      2




                                                     Slow
    198   202    56    56   156   101    92      2
    206    23   252    56   256   198   202     56
    256    56     2    92   101
    412    43    45   202   198
                                        105

9
What are indexes good for?
     Indexes make queries go fast
      • Each index will speed up some subset of queries

     Design indexes with queries in mind
      • Pick most important queries and set up indexes for those
      • Consider the cost of maintaining the indexes




10
What are indexes good for?
     Indexes make queries go fast
      • Each index will speed up some subset of queries

     Design indexes with queries in mind
      • Pick most important queries and set up indexes for those
      • Consider the cost of maintaining the indexes




10
Goals of Talk
     3 simple rules for designing good indexes


     Avoid details of any data structure
      • B-trees and Fractal Trees are interesting and fun for the
        algorithmically minded computer scientist, but the 3 rules
        will apply equally well to either data structure.
      • All we need to care about is that range queries are fast
        (per row) and that point queries are much slower (per
        row).




11
The Rule to Rule Them All
     There is no absolute rule
      • Indexing is like a math problem
      • Rules help, but each scenario is its own problem, which
        requires problem-solving analysis


     That said, rules help a lot




12
Three Basic Rules
     1. Retrieve less data
      • Less bandwidth, less processing, ...

     2. Avoid point queries
      • Not all data access cost is the same
      • Sequential access is MUCH faster than random access

     3. Avoid Sorting
      • GROUP BY and ORDER BY queries do post-retrieval
        work
      • Indexing can help get rid of this work



13
Rule 1
Retrieve less data
Rule 1: Example of slow query
Example TABLE (1B rows, no indexes):
 • create table foo (a int, b int, c int);


Query (1000 rows match):
 • select sum(c) from foo where b=10 and a<150;


Query Plan:
 • Rows where b=10 and a<150 can be anywhere in the table
 • Without an index, entire table is scanned
Slow execution:
 • Scan 1B rows just to count 1000 rows
Rule 1: How to add an index
     What should we do?
      • Reduce the data retrieved
      • Analyze much less than 1B rows
     How (for a simple select)?
      • Design index by focusing on the WHERE clause
        ‣ This defines what rows the query is interested in
        ‣ Other rows are not important for this query




16
Rule 1: How to add an index
     What should we do?
      • Reduce the data retrieved
      • Analyze much less than 1B rows
     How (for a simple select)?
      • Design index by focusing on the WHERE clause
        ‣ This defines what rows the query is interested in
        ‣ Other rows are not important for this query


     select sum(c) from foo where b=10 and a<150;




16
Rule 1: Which index?
     Option 1: key(a)
     Option 2: key(b)
     Which is better? Depends on selectivity:
      • If there are fewer rows where a<150, then key(a) is better
      • If there are fewer rows where b=10, then key(b) is better



     Option 3: key(a) AND key(b), then MERGE
      • We’ll come to this later




17
Rule 1: Picking the best key
     Neither key(a) nor key(b) is optimal
     Suppose:
      • 200,000 rows exist where a<150
      • 100,000 rows exist where b=10
      • 1000 rows exist where b=10 and a<150
     Then either index retrieves too much data


     For better performance, indexes should try to
     optimize over as many pieces of the where clause
     as possible
      • We need a composite index


18
Composite indexes reduce data retrieved
     Where clause: b=5 and a<150
      • Option 1: key(a,b)
      • Option 2: key(b,a)
     Which one is better?
      • key(b,a)!
     KEY RULE:
      • When making a composite index, place equality checking
        columns first. Condition on b is equality, but not on a.




19
Q: where b=5 and a>150;


      a    b     c    b,a     a
     100   5    45   5,100   100
     101   6     2   5,156   156
     156   5    45   5,206   206
     165   6     2   5,256   256
     198   6    56   6,101   101
     206   5   252   6,165   165
     256   5     2   6,198   198
     412   6    45   6,412   412


20
Q: where b=5 and a>150;


      a    b     c    b,a     a
     100   5    45   5,100   100
     101   6     2   5,156   156
     156   5    45   5,206   206
     165   6     2   5,256   256
     198   6    56   6,101   101
     206   5   252   6,165   165
     256   5     2   6,198   198
     412   6    45   6,412   412


20
Composite Indexes: No equality clause
     What if where clause is:
      • where a>100 and       a<200 and b>100;
     Which is better?
      • key(a), key(b), key(a,b), key(b,a)?


     KEY RULE:
      • As soon as a column on a composite index is NOT used for
        equality, the rest of the composite index no longer reduces
        data retrieved.
        ‣ key(a,b) is no better* than key(a)
        ‣ key(b,a) is no better* than key(b)



21
Composite Indexes: No equality clause
     What if where clause is:
      • where a>100 and                 a<200 and b>100;
     Which is better?
      • key(a), key(b), key(a,b), key(b,a)?


     KEY RULE:
      • As soon as a column on a composite index is NOT used for
        equality, the rest of the composite index no longer reduces
        data retrieved.
        ‣ key(a,b) is no better* than key(a)
        ‣ key(b,a) is no better* than key(b)
        * Are there corner cases where it helps? Yes, but rare.




21
Q: where b>=5 and a>150;


      a    b     c    b,a     a
     100   5    45   5,100   100
     101   6     2   5,156   156
     156   5    45   5,206   206
     165   6     2   5,256   256
     198   6    56   6,101   101
     206   5   252   6,165   165
     256   5     2   6,198   198
     412   6    45   6,412   412


22
Q: where b>=5 and a>150;


      a    b     c    b,a     a
     100   5    45   5,100   100
     101   6     2   5,156   156
     156   5    45   5,206   206
     165   6     2   5,256   256
     198   6    56   6,101   101
     206   5   252   6,165   165
     256   5     2   6,198   198
     412   6    45   6,412   412


22
Q: where b>=5 and a>150;


      a    b     c    b,a     a
     100   5    45   5,100   100
     101   6     2   5,156   156   5,156   156
     156   5    45   5,206   206   5,206   206
     165   6     2   5,256   256   5,256   256
     198   6    56   6,101   101   6,101   101
     206   5   252   6,165   165   6,165   165
     256   5     2   6,198   198   6,198   198
     412   6    45   6,412   412   6,412   412


22
Composite Indexes: Another example
       WHERE clause: b=5 and c=100
           • key(b,a,c) is as good as key(b), because a is not used
             in clause, so having c in index doesn’t help. key(b,c,a)
             would be much better.

      a      b      c         b,a,c    a
     100     5     100     5,100,100 100
     101     6     200     5,156,200 156
     156     5     200     5,206,200 206
     165     6     100     5,256,100 256
     198     6     100     6,101,200 101
     206     5     200     6,165,100 165
     256     5     100     6,198,100 6,198
     412     6     100     6,412,100 412
23
Composite Indexes: Another example
       WHERE clause: b=5 and c=100
           • key(b,a,c) is as good as key(b), because a is not used
             in clause, so having c in index doesn’t help. key(b,c,a)
             would be much better.

      a      b      c         b,a,c    a
     100     5     100     5,100,100 100           5,100,100   100
     101     6     200     5,156,200 156           5,156,200   156
     156     5     200     5,206,200 206           5,206,200   206
     165     6     100     5,256,100 256           5,256,100   256
     198     6     100     6,101,200 101
     206     5     200     6,165,100 165
     256     5     100     6,198,100 6,198
     412     6     100     6,412,100 412
23
Rule 1: Recap
     Goal is to reduce rows retrieved
      • Create composite indexes based on where clause
      • Place equality-comparison columns at the beginning
      • Make first non-equality column in index as selective as
        possible
      • Once first column in a composite index is not used for
        equality, or not used in where clause, the rest of the
        composite index does not help reduce rows retrieved
        ‣ Does that mean they aren’t helpful?
        ‣ They might be very helpful... on to Rule 2.




24
Rule 2
Avoid Point Queries
Rule 2: Avoid point queries
     Table:
      • create table foo (a int, b int, c int,
        primary key(a), key(b));
     Query:
      • select sum(c) from foo where b>50;
     Query plan: use key(b)
      • retrieval cost of each row is high because random point
        queries are done




26
Q: sum(c) where b>50;


      a      b     c     b    a
     100     5    45     5   100
     101    92     2     6   165
     156    56    45    23   206
     165     6     2    43   412
     198   202    56    56   156
     206    23   252    56   256
     256    56     2    92   101
     412    43    45   202   198


27
Q: sum(c) where b>50;
                                      56   156
                                      56   256
      a      b     c     b    a       92   101
     100     5    45     5   100     202   198
     101    92     2     6   165
     156    56    45    23   206   156    56     45
     165     6     2    43   412   256    56      2
     198   202    56    56   156   101    92      2
     206    23   252    56   256   198   202     56
     256    56     2    92   101
     412    43    45   202   198
                                         105

27
Rule 2: Avoid Point Queries
     Table:
      • create table foo (a int, b int, c int,
        primary key(a), key(b));
     Query:
      • select sum(c) from foo where b>50;
     Query plan: scan primary table
      • retrieval cost of each row is CHEAP!
      • But you retrieve too many rows




28
Q: sum(c) where b>50;


      a      b     c     b    a
     100     5    45     5   100
     101    92     2     6   165
     156    56    45    23   206
     165     6     2    43   412
     198   202    56    56   156
     206    23   252    56   256
     256    56     2    92   101
     412    43    45   202   198


29
Q: sum(c) where b>50;


                                   100     5    45
      a      b     c     b    a
                                   101    92     2
     100     5    45     5   100
                                   156    56    45
     101    92     2     6   165
                                   165     6     2
     156    56    45    23   206
                                   198   202    56
     165     6     2    43   412
                                   206    23   252
     198   202    56    56   156
                                   256    56     2
     206    23   252    56   256
                                   412    43    45
     256    56     2    92   101
     412    43    45   202   198
                                         105

29
Rule 2: Avoid Point Queries
     Table:
      • create table foo (a int, b int, c int,
        primary key(a), key(b));
     Query:
      • select sum(c) from foo where b>50;
     What if we add another index?
      • What about key(b,c)?
      • Since we index on b, we retrieve only the rows we need.
      • Since the index has information about c, we don’t need to
        go to the main table. No point queries!




30
Q: sum(c) where b>50;


      a      b     c     b,c     a
     100     5    45    5,45    100
     101    92     2     6,2    165
     156    56    45   23,252   206
     165     6     2   43,45    412
     198   202    56    56,2    256
     206    23   252   56,45    156
     256    56     2    92,2    101
     412    43    45   202,56   198


31
Q: sum(c) where b>50;


      a      b     c     b,c     a
     100     5    45    5,45    100
     101    92     2     6,2    165    56,2     256
     156    56    45   23,252   206   56,45     156
     165     6     2   43,45    412    92,2     101
     198   202    56    56,2    256   202,56    198
     206    23   252   56,45    156
     256    56     2    92,2    101
     412    43    45   202,56   198       105


31
Covering Index
     An index covers a query if the index has enough
     information to answer the query.
     Examples:
     Q: select sum(c) from foo where b<100;
     Q: select sum(d) from foo where b<100;
     Indexes:
      • key(b,c) -- covering index for first query
      • key(b,d) -- covering index for second query
      • key(b,c,d) -- covering index for both

32
How to build a covering index
     Add every field from the select
      • Not just where clause

     Q: select c,d from foo where a=10 and
     b=100;
     Mistake: add index(a,b);
      • This doesn’t cover the query. You still need point queries
        to retrieve c and d values.

     Correct: add index(a,b,c,d);
      • Includes all referenced fields
      • Place a and b at beginning by Rule 1

33
What if Primary Key matches where?
     Q: select sum(c) from foo where b>100
     and b<200;
     Schema: create table foo (a int, b
     int, c int, ai int auto_increment,
     primary key (b,ai));
      • Query does a range query on primary dictionary
      • Only one dictionary is accessed, in sequential order
      • This is fast

     Primary key covers all queries
      • If sort order matches where clause, problems solved


34
What’s a Clustering Index
     What if primary key doesn’t match the where
     clause?
      • Ideally, you should be able to declare secondary indexes
        that carry all fields
      • AFAIK, storage engines don’t let you do this
      • With one exception... TokuDB
      • TokuDB allows you to declare any index to be
        CLUSTERING
      • A CLUSTERING index covers all queries, just like the
        primary key



35
Clustering Indexes in Action
     Q: select sum(c) from foo where b<100;
     Q: select sum(d) from foo where b>200;
     Q: select c,e from foo where b=1000;
     Indexes:
      • key(b,c) covers first query
      • key(b,d) covers second query
      • key(b,c,e) covers first and third queries
      • key(b,c,d,e) covers all three queries

     Indexes require a lot of analysis of queries

36
Clustering Indexes in Action
     Q: select sum(c) from foo where b<100;
     Q: select sum(d) from foo where b>200;
     Q: select c,e from foo where b=1000;
     What covers all queries?
      • clustering key(b)


     Clustering keys let you focus on the where
     clause
      • They eliminate point queries and make queries fast

37
More on clustering: Index Merge
     We had example:
      • create table foo(a int, b int, c int);
      • select sum(c) from foo where b=10 and
        a<150;

     Suppose
      • 200,000 rows have a<150
      • 100,000 rows have b=10
      • 1000 rows have b=10 and a<150

     What if we use key(a) and key(b) and merge
     results?

38
Query Plans
     Merge plan:
      • Scan 200,000 rows in key(a) where a<150
      • Scan 100,000 rows in key(b) where b=10
      • Merge the results and find 1000 row identifiers that
        match query
      • Do point queries with 1000 row identifiers to retrieve c

     Better than no indexes
      • Reduces number of rows scanned compared to no index
      • Reduces number of point queries compared to not
        merging


39
Does Clustering Help Merging?
     Suppose key(a) is clustering
     Query plan:
      • Scan key(a) for 200,000 rows where a<150
      • Scan resulting rows for ones where b=10
      • Retrieve c values from 1000 remaining rows

     Once again, no point queries
     What’s even better?
      • clustering key(b,a)!



40
Rule 2 Recap
     Avoid Point Queries
     Make sure index covers query
      • By mentioning all fields in the query, not just those in the
        where clause

     Use clustering indexes
      • Clustering indexes cover all queries
      • Allows user to focus on where clause
      • Speeds up more (and unforeseen) queries -- simplifies
        database design



41
Rule 3
Avoid Sorting
Rule 3: Avoid Sorting
     Simple selects require no post-processing
      • select * from foo where b=100;

     Just get the data and return to user
     More complex queries do post-processing
      • GROUP BY and ORDER BY sort the data

     Index selection can avoid this sorting step




43
Avoid Sorting
     Q1: select count(c) from foo;
     Q2: select count(c) from foo group by
     b, order by b;
     Q1 plan:
      • While doing a table scan, count rows with c

     Q2 plan
      • Scan table and write data to a temporary file
      • Sort tmp file data by b
      • Rescan sorted data, counting rows with c, for each b


44
Avoid Sorting
     Q2: select count(c) from foo group by
     b, order by b;
     Q2: what if we use key(b,c)?
      • By adding all needed fields, we cover query. FAST!
      • By sorting first by b, we avoid sort. FAST!

     Take home:
      • Sort index on group by or order by fields to avoid
        sorting




45
Summing Up
     Use indexes to pre-sort for order by and
     group by queries




46
Putting it all together
     Sample Queries
Example




48
Example
     select count(*) from foo where c=5,
     group by b;




48
Example
     select count(*) from foo where c=5,
     group by b;
     key(c,b):
      • First have c to limit rows retrieved (R1)
      • Then have remaining rows sorted by b to avoid sort (R3)
      • Remaining rows will be sorted by b because of equality
        test on c




48
Example




49
Example
     select sum(d) from foo where c=100,
     group by b;




49
Example
     select sum(d) from foo where c=100,
     group by b;
     key(c,b,d):
      • First have c to limit rows retrieved (R1)
      • Then have remaining rows sorted by b to avoid sort (R3)
      • Make sure index covers query, to avoid point queries (R2)




49
Example
     Sometimes, there is no clear answer
      • Best index is data dependent

     Q: select count(*) from foo where
     c<100, group by b;
     Indexes:
      • key(c,b)
      • key(b,c)




50
Example
     Q: select count(*) from foo where
     c<100, group by b;
     Query plan for key(c,b):
      • Will filter rows where c<100
      • Still need to sort by b
       ‣ Rows retrieved will not be sorted by b
       ‣ where clause does not do an equality check on c, so b values are scattered in clumps for
         different c values




51
Example
     Q: select count(*) from foo where
     c<100, group by b;
     Query plan for key(b,c):
      • Sorted by b, so R3 is covered
      • Rows where c>=100 are also processed, so not taking
        advantage of R1




52
Example
     Which is better?
      • Answer depends on the data
      • If there are many rows where c>=100, saving time by not
        retrieving these useless rows helps. Use key(c,b).
      • If there aren’t so many rows where c>=100, the time to
        execute the query is dominated by sorting. Use
        key(b,c).

     The point is, in general, rules of thumb help, but
     often they help us think about queries and
     indexes, rather than giving a recipe.


53
Why not just load up
   with indexes?
Need to keep up with insertion load
 More indexes = smaller max load
Indexing cost
     Space
      • Issue
        ‣ Each index adds storage requirements
      • Options
        ‣ Use compression (i.e., aggressive compression of 5-15x always on for
         TokuDB)

     Performance
      • Issue
        ‣ B-trees while fast for certain indexing tasks (in memory, sequential keys),
         are over 20x slower for other types of indexing
      • Options
        ‣ Fractal Tree indexes (TokuDB’s data structure) is fast at all kinds of
         indexing (i.e., random keys, large tables, wide keys, etc...)
         ‣ No need to worry about what type of index you are creating.
         ‣ Fractal trees enable customers to index early, index often

55
Last Caveat
     Range Query Performance
      • Issue
        ‣ Rule #2 (range query performance over point query performance)
          depends on range queries being fast
        ‣ However B-trees can get fragmented
         ‣ from deletions, from random insertions, ...
         ‣ Fragmented B-trees get slow for range queries
      • Options
        ‣ For B-trees, optimize tables, dump and reload, (ie, time consuming and
          offline maintenance) ...
        ‣ For Fractal Tree indexing (TokuDB), not an issue
         ‣ Fractal trees don’t fragment




56
Thanks!
     For more information...
      • Please contact me at zardosht@tokutek.com for any
        thoughts or feedback
      • I will be presenting on indexing in April at both
        Collaborate 11 in FL (10:30 am on 4/11) and the O’Reilly
        Conference in CA (11:55 am on 4/12)
      • Please visit Tokutek.com for a copy of this presentation
        (goo.gl/S2LBe) to learn more about the power of
        indexing, read about Fractal Tree indexes, or to download
        a free eval copy of TokuDB




57

Más contenido relacionado

Último

"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 

Último (20)

"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 

Destacado

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by HubspotMarius Sescu
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTExpeed Software
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsPixeldarts
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthThinkNow
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfmarketingartwork
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024Neil Kimberley
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)contently
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024Albert Qian
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsKurio // The Social Media Age(ncy)
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summarySpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best PracticesVit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project managementMindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...RachelPearson36
 

Destacado (20)

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPT
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 

MySQL understand Indexes

  • 1. Understanding Indexing Without Needing to Understand Data Structures Boston MySQL Meetup - March 14, 2011 Zardosht Kasheff
  • 2. What’s a Table? A dictionary is a set of (key, value) pairs. • We’ll assume you can update the dictionary (insertions, deletions, updates) and query the dictionary (point queries, range queries) • B-Trees and Fractal Trees are examples of dictionaries • Hashes are not (range queries are not supported) 2
  • 3. What’s a Table? A table is a set of a b c dictionaries. 100 5 45 Example: 101 92 2 create table foo (a 156 56 45 int, b int, c int, 165 6 2 primary key(a)); 198 202 56 206 23 252 Then we insert a bunch 256 56 2 of data and get... 412 43 45 3
  • 4. What’s a Table? The sort order of a dictionary is defined by the key • For the data structures/storage engines we’ll think about, range queries on the sort order are FAST • Range queries on any other order require a table scan = SLOW • Point queries -- retrieving the value for one particular key -- is SLOW ‣ A single point query is fast, but reading a bunch of rows this way is going to be 2 orders of magnitude slower than reading the same number of rows in range query order 4
  • 5. What’s an Index? A Index I on table T is itself a dictionary • We need to define the (key, value) pairs. • The key in index I is a subset of fields in the primary dictionary T. • The value in index I is the primary key in T. ‣ There are other ways to define the value, but we’re sticking with this. Example: alter table foo add key(b); Then we get... 5
  • 6. What’s an Index? Primary key(b) a b c b a 100 5 45 5 100 101 92 2 6 165 156 56 45 23 206 165 6 2 43 412 198 202 56 56 156 206 23 252 56 256 256 56 2 92 101 412 43 45 202 198 6
  • 7. Q: count(*) where a<120; a b c b a 100 5 45 5 100 101 92 2 6 165 156 56 45 23 206 165 6 2 43 412 198 202 56 56 156 206 23 252 56 256 256 56 2 92 101 412 43 45 202 198 7
  • 8. Q: count(*) where a<120; a b c b a 100 5 45 5 100 100 5 45 101 92 2 6 165 101 92 2 156 56 45 23 206 165 6 2 43 412 198 202 56 56 156 206 23 252 56 256 2 256 56 2 92 101 412 43 45 202 198 7
  • 9. Q: count(*) where b>50; a b c b a 100 5 45 5 100 101 92 2 6 165 156 56 45 23 206 165 6 2 43 412 198 202 56 56 156 206 23 252 56 256 256 56 2 92 101 412 43 45 202 198 8
  • 10. Q: count(*) where b>50; a b c b a 56 156 100 5 45 5 100 56 256 101 92 2 6 165 92 101 156 56 45 23 206 202 198 165 6 2 43 412 198 202 56 56 156 206 23 252 56 256 4 256 56 2 92 101 412 43 45 202 198 8
  • 11. Q: sum(c) where b>50; a b c b a 100 5 45 5 100 101 92 2 6 165 156 56 45 23 206 165 6 2 43 412 198 202 56 56 156 206 23 252 56 256 256 56 2 92 101 412 43 45 202 198 9
  • 12. Q: sum(c) where b>50; 56 156 56 256 Fast a b c b a 92 101 100 5 45 5 100 202 198 101 92 2 6 165 156 56 45 23 206 156 56 45 165 6 2 43 412 256 56 2 Slow 198 202 56 56 156 101 92 2 206 23 252 56 256 198 202 56 256 56 2 92 101 412 43 45 202 198 105 9
  • 13. What are indexes good for? Indexes make queries go fast • Each index will speed up some subset of queries Design indexes with queries in mind • Pick most important queries and set up indexes for those • Consider the cost of maintaining the indexes 10
  • 14. What are indexes good for? Indexes make queries go fast • Each index will speed up some subset of queries Design indexes with queries in mind • Pick most important queries and set up indexes for those • Consider the cost of maintaining the indexes 10
  • 15. Goals of Talk 3 simple rules for designing good indexes Avoid details of any data structure • B-trees and Fractal Trees are interesting and fun for the algorithmically minded computer scientist, but the 3 rules will apply equally well to either data structure. • All we need to care about is that range queries are fast (per row) and that point queries are much slower (per row). 11
  • 16. The Rule to Rule Them All There is no absolute rule • Indexing is like a math problem • Rules help, but each scenario is its own problem, which requires problem-solving analysis That said, rules help a lot 12
  • 17. Three Basic Rules 1. Retrieve less data • Less bandwidth, less processing, ... 2. Avoid point queries • Not all data access cost is the same • Sequential access is MUCH faster than random access 3. Avoid Sorting • GROUP BY and ORDER BY queries do post-retrieval work • Indexing can help get rid of this work 13
  • 19. Rule 1: Example of slow query Example TABLE (1B rows, no indexes): • create table foo (a int, b int, c int); Query (1000 rows match): • select sum(c) from foo where b=10 and a<150; Query Plan: • Rows where b=10 and a<150 can be anywhere in the table • Without an index, entire table is scanned Slow execution: • Scan 1B rows just to count 1000 rows
  • 20. Rule 1: How to add an index What should we do? • Reduce the data retrieved • Analyze much less than 1B rows How (for a simple select)? • Design index by focusing on the WHERE clause ‣ This defines what rows the query is interested in ‣ Other rows are not important for this query 16
  • 21. Rule 1: How to add an index What should we do? • Reduce the data retrieved • Analyze much less than 1B rows How (for a simple select)? • Design index by focusing on the WHERE clause ‣ This defines what rows the query is interested in ‣ Other rows are not important for this query select sum(c) from foo where b=10 and a<150; 16
  • 22. Rule 1: Which index? Option 1: key(a) Option 2: key(b) Which is better? Depends on selectivity: • If there are fewer rows where a<150, then key(a) is better • If there are fewer rows where b=10, then key(b) is better Option 3: key(a) AND key(b), then MERGE • We’ll come to this later 17
  • 23. Rule 1: Picking the best key Neither key(a) nor key(b) is optimal Suppose: • 200,000 rows exist where a<150 • 100,000 rows exist where b=10 • 1000 rows exist where b=10 and a<150 Then either index retrieves too much data For better performance, indexes should try to optimize over as many pieces of the where clause as possible • We need a composite index 18
  • 24. Composite indexes reduce data retrieved Where clause: b=5 and a<150 • Option 1: key(a,b) • Option 2: key(b,a) Which one is better? • key(b,a)! KEY RULE: • When making a composite index, place equality checking columns first. Condition on b is equality, but not on a. 19
  • 25. Q: where b=5 and a>150; a b c b,a a 100 5 45 5,100 100 101 6 2 5,156 156 156 5 45 5,206 206 165 6 2 5,256 256 198 6 56 6,101 101 206 5 252 6,165 165 256 5 2 6,198 198 412 6 45 6,412 412 20
  • 26. Q: where b=5 and a>150; a b c b,a a 100 5 45 5,100 100 101 6 2 5,156 156 156 5 45 5,206 206 165 6 2 5,256 256 198 6 56 6,101 101 206 5 252 6,165 165 256 5 2 6,198 198 412 6 45 6,412 412 20
  • 27. Composite Indexes: No equality clause What if where clause is: • where a>100 and a<200 and b>100; Which is better? • key(a), key(b), key(a,b), key(b,a)? KEY RULE: • As soon as a column on a composite index is NOT used for equality, the rest of the composite index no longer reduces data retrieved. ‣ key(a,b) is no better* than key(a) ‣ key(b,a) is no better* than key(b) 21
  • 28. Composite Indexes: No equality clause What if where clause is: • where a>100 and a<200 and b>100; Which is better? • key(a), key(b), key(a,b), key(b,a)? KEY RULE: • As soon as a column on a composite index is NOT used for equality, the rest of the composite index no longer reduces data retrieved. ‣ key(a,b) is no better* than key(a) ‣ key(b,a) is no better* than key(b) * Are there corner cases where it helps? Yes, but rare. 21
  • 29. Q: where b>=5 and a>150; a b c b,a a 100 5 45 5,100 100 101 6 2 5,156 156 156 5 45 5,206 206 165 6 2 5,256 256 198 6 56 6,101 101 206 5 252 6,165 165 256 5 2 6,198 198 412 6 45 6,412 412 22
  • 30. Q: where b>=5 and a>150; a b c b,a a 100 5 45 5,100 100 101 6 2 5,156 156 156 5 45 5,206 206 165 6 2 5,256 256 198 6 56 6,101 101 206 5 252 6,165 165 256 5 2 6,198 198 412 6 45 6,412 412 22
  • 31. Q: where b>=5 and a>150; a b c b,a a 100 5 45 5,100 100 101 6 2 5,156 156 5,156 156 156 5 45 5,206 206 5,206 206 165 6 2 5,256 256 5,256 256 198 6 56 6,101 101 6,101 101 206 5 252 6,165 165 6,165 165 256 5 2 6,198 198 6,198 198 412 6 45 6,412 412 6,412 412 22
  • 32. Composite Indexes: Another example WHERE clause: b=5 and c=100 • key(b,a,c) is as good as key(b), because a is not used in clause, so having c in index doesn’t help. key(b,c,a) would be much better. a b c b,a,c a 100 5 100 5,100,100 100 101 6 200 5,156,200 156 156 5 200 5,206,200 206 165 6 100 5,256,100 256 198 6 100 6,101,200 101 206 5 200 6,165,100 165 256 5 100 6,198,100 6,198 412 6 100 6,412,100 412 23
  • 33. Composite Indexes: Another example WHERE clause: b=5 and c=100 • key(b,a,c) is as good as key(b), because a is not used in clause, so having c in index doesn’t help. key(b,c,a) would be much better. a b c b,a,c a 100 5 100 5,100,100 100 5,100,100 100 101 6 200 5,156,200 156 5,156,200 156 156 5 200 5,206,200 206 5,206,200 206 165 6 100 5,256,100 256 5,256,100 256 198 6 100 6,101,200 101 206 5 200 6,165,100 165 256 5 100 6,198,100 6,198 412 6 100 6,412,100 412 23
  • 34. Rule 1: Recap Goal is to reduce rows retrieved • Create composite indexes based on where clause • Place equality-comparison columns at the beginning • Make first non-equality column in index as selective as possible • Once first column in a composite index is not used for equality, or not used in where clause, the rest of the composite index does not help reduce rows retrieved ‣ Does that mean they aren’t helpful? ‣ They might be very helpful... on to Rule 2. 24
  • 36. Rule 2: Avoid point queries Table: • create table foo (a int, b int, c int, primary key(a), key(b)); Query: • select sum(c) from foo where b>50; Query plan: use key(b) • retrieval cost of each row is high because random point queries are done 26
  • 37. Q: sum(c) where b>50; a b c b a 100 5 45 5 100 101 92 2 6 165 156 56 45 23 206 165 6 2 43 412 198 202 56 56 156 206 23 252 56 256 256 56 2 92 101 412 43 45 202 198 27
  • 38. Q: sum(c) where b>50; 56 156 56 256 a b c b a 92 101 100 5 45 5 100 202 198 101 92 2 6 165 156 56 45 23 206 156 56 45 165 6 2 43 412 256 56 2 198 202 56 56 156 101 92 2 206 23 252 56 256 198 202 56 256 56 2 92 101 412 43 45 202 198 105 27
  • 39. Rule 2: Avoid Point Queries Table: • create table foo (a int, b int, c int, primary key(a), key(b)); Query: • select sum(c) from foo where b>50; Query plan: scan primary table • retrieval cost of each row is CHEAP! • But you retrieve too many rows 28
  • 40. Q: sum(c) where b>50; a b c b a 100 5 45 5 100 101 92 2 6 165 156 56 45 23 206 165 6 2 43 412 198 202 56 56 156 206 23 252 56 256 256 56 2 92 101 412 43 45 202 198 29
  • 41. Q: sum(c) where b>50; 100 5 45 a b c b a 101 92 2 100 5 45 5 100 156 56 45 101 92 2 6 165 165 6 2 156 56 45 23 206 198 202 56 165 6 2 43 412 206 23 252 198 202 56 56 156 256 56 2 206 23 252 56 256 412 43 45 256 56 2 92 101 412 43 45 202 198 105 29
  • 42. Rule 2: Avoid Point Queries Table: • create table foo (a int, b int, c int, primary key(a), key(b)); Query: • select sum(c) from foo where b>50; What if we add another index? • What about key(b,c)? • Since we index on b, we retrieve only the rows we need. • Since the index has information about c, we don’t need to go to the main table. No point queries! 30
  • 43. Q: sum(c) where b>50; a b c b,c a 100 5 45 5,45 100 101 92 2 6,2 165 156 56 45 23,252 206 165 6 2 43,45 412 198 202 56 56,2 256 206 23 252 56,45 156 256 56 2 92,2 101 412 43 45 202,56 198 31
  • 44. Q: sum(c) where b>50; a b c b,c a 100 5 45 5,45 100 101 92 2 6,2 165 56,2 256 156 56 45 23,252 206 56,45 156 165 6 2 43,45 412 92,2 101 198 202 56 56,2 256 202,56 198 206 23 252 56,45 156 256 56 2 92,2 101 412 43 45 202,56 198 105 31
  • 45. Covering Index An index covers a query if the index has enough information to answer the query. Examples: Q: select sum(c) from foo where b<100; Q: select sum(d) from foo where b<100; Indexes: • key(b,c) -- covering index for first query • key(b,d) -- covering index for second query • key(b,c,d) -- covering index for both 32
  • 46. How to build a covering index Add every field from the select • Not just where clause Q: select c,d from foo where a=10 and b=100; Mistake: add index(a,b); • This doesn’t cover the query. You still need point queries to retrieve c and d values. Correct: add index(a,b,c,d); • Includes all referenced fields • Place a and b at beginning by Rule 1 33
  • 47. What if Primary Key matches where? Q: select sum(c) from foo where b>100 and b<200; Schema: create table foo (a int, b int, c int, ai int auto_increment, primary key (b,ai)); • Query does a range query on primary dictionary • Only one dictionary is accessed, in sequential order • This is fast Primary key covers all queries • If sort order matches where clause, problems solved 34
  • 48. What’s a Clustering Index What if primary key doesn’t match the where clause? • Ideally, you should be able to declare secondary indexes that carry all fields • AFAIK, storage engines don’t let you do this • With one exception... TokuDB • TokuDB allows you to declare any index to be CLUSTERING • A CLUSTERING index covers all queries, just like the primary key 35
  • 49. Clustering Indexes in Action Q: select sum(c) from foo where b<100; Q: select sum(d) from foo where b>200; Q: select c,e from foo where b=1000; Indexes: • key(b,c) covers first query • key(b,d) covers second query • key(b,c,e) covers first and third queries • key(b,c,d,e) covers all three queries Indexes require a lot of analysis of queries 36
  • 50. Clustering Indexes in Action Q: select sum(c) from foo where b<100; Q: select sum(d) from foo where b>200; Q: select c,e from foo where b=1000; What covers all queries? • clustering key(b) Clustering keys let you focus on the where clause • They eliminate point queries and make queries fast 37
  • 51. More on clustering: Index Merge We had example: • create table foo(a int, b int, c int); • select sum(c) from foo where b=10 and a<150; Suppose • 200,000 rows have a<150 • 100,000 rows have b=10 • 1000 rows have b=10 and a<150 What if we use key(a) and key(b) and merge results? 38
  • 52. Query Plans Merge plan: • Scan 200,000 rows in key(a) where a<150 • Scan 100,000 rows in key(b) where b=10 • Merge the results and find 1000 row identifiers that match query • Do point queries with 1000 row identifiers to retrieve c Better than no indexes • Reduces number of rows scanned compared to no index • Reduces number of point queries compared to not merging 39
  • 53. Does Clustering Help Merging? Suppose key(a) is clustering Query plan: • Scan key(a) for 200,000 rows where a<150 • Scan resulting rows for ones where b=10 • Retrieve c values from 1000 remaining rows Once again, no point queries What’s even better? • clustering key(b,a)! 40
  • 54. Rule 2 Recap Avoid Point Queries Make sure index covers query • By mentioning all fields in the query, not just those in the where clause Use clustering indexes • Clustering indexes cover all queries • Allows user to focus on where clause • Speeds up more (and unforeseen) queries -- simplifies database design 41
  • 56. Rule 3: Avoid Sorting Simple selects require no post-processing • select * from foo where b=100; Just get the data and return to user More complex queries do post-processing • GROUP BY and ORDER BY sort the data Index selection can avoid this sorting step 43
  • 57. Avoid Sorting Q1: select count(c) from foo; Q2: select count(c) from foo group by b, order by b; Q1 plan: • While doing a table scan, count rows with c Q2 plan • Scan table and write data to a temporary file • Sort tmp file data by b • Rescan sorted data, counting rows with c, for each b 44
  • 58. Avoid Sorting Q2: select count(c) from foo group by b, order by b; Q2: what if we use key(b,c)? • By adding all needed fields, we cover query. FAST! • By sorting first by b, we avoid sort. FAST! Take home: • Sort index on group by or order by fields to avoid sorting 45
  • 59. Summing Up Use indexes to pre-sort for order by and group by queries 46
  • 60. Putting it all together Sample Queries
  • 62. Example select count(*) from foo where c=5, group by b; 48
  • 63. Example select count(*) from foo where c=5, group by b; key(c,b): • First have c to limit rows retrieved (R1) • Then have remaining rows sorted by b to avoid sort (R3) • Remaining rows will be sorted by b because of equality test on c 48
  • 65. Example select sum(d) from foo where c=100, group by b; 49
  • 66. Example select sum(d) from foo where c=100, group by b; key(c,b,d): • First have c to limit rows retrieved (R1) • Then have remaining rows sorted by b to avoid sort (R3) • Make sure index covers query, to avoid point queries (R2) 49
  • 67. Example Sometimes, there is no clear answer • Best index is data dependent Q: select count(*) from foo where c<100, group by b; Indexes: • key(c,b) • key(b,c) 50
  • 68. Example Q: select count(*) from foo where c<100, group by b; Query plan for key(c,b): • Will filter rows where c<100 • Still need to sort by b ‣ Rows retrieved will not be sorted by b ‣ where clause does not do an equality check on c, so b values are scattered in clumps for different c values 51
  • 69. Example Q: select count(*) from foo where c<100, group by b; Query plan for key(b,c): • Sorted by b, so R3 is covered • Rows where c>=100 are also processed, so not taking advantage of R1 52
  • 70. Example Which is better? • Answer depends on the data • If there are many rows where c>=100, saving time by not retrieving these useless rows helps. Use key(c,b). • If there aren’t so many rows where c>=100, the time to execute the query is dominated by sorting. Use key(b,c). The point is, in general, rules of thumb help, but often they help us think about queries and indexes, rather than giving a recipe. 53
  • 71. Why not just load up with indexes? Need to keep up with insertion load More indexes = smaller max load
  • 72. Indexing cost Space • Issue ‣ Each index adds storage requirements • Options ‣ Use compression (i.e., aggressive compression of 5-15x always on for TokuDB) Performance • Issue ‣ B-trees while fast for certain indexing tasks (in memory, sequential keys), are over 20x slower for other types of indexing • Options ‣ Fractal Tree indexes (TokuDB’s data structure) is fast at all kinds of indexing (i.e., random keys, large tables, wide keys, etc...) ‣ No need to worry about what type of index you are creating. ‣ Fractal trees enable customers to index early, index often 55
  • 73. Last Caveat Range Query Performance • Issue ‣ Rule #2 (range query performance over point query performance) depends on range queries being fast ‣ However B-trees can get fragmented ‣ from deletions, from random insertions, ... ‣ Fragmented B-trees get slow for range queries • Options ‣ For B-trees, optimize tables, dump and reload, (ie, time consuming and offline maintenance) ... ‣ For Fractal Tree indexing (TokuDB), not an issue ‣ Fractal trees don’t fragment 56
  • 74. Thanks! For more information... • Please contact me at zardosht@tokutek.com for any thoughts or feedback • I will be presenting on indexing in April at both Collaborate 11 in FL (10:30 am on 4/11) and the O’Reilly Conference in CA (11:55 am on 4/12) • Please visit Tokutek.com for a copy of this presentation (goo.gl/S2LBe) to learn more about the power of indexing, read about Fractal Tree indexes, or to download a free eval copy of TokuDB 57