MySQL Performance
   Optimization
                Part II
“Indexing Data Structures and Algorithms”

            Abhijit Mondal

  Software Engineer at HolidayIQ
Contents
Hash Indexes


B-Trees and B+ Trees Indexes


Indexing Strategies for High Performance


Full Text Searching
Hash Indexes
●   A hash index is built on a hash table and is useful only for exact lookups that use
    every column in the index. For each row, the storage engine computes a hash code
    of the indexed columns, which is a small value that will probably differ from the
    hash codes computed for other rows with different key values. It stores the hash
    codes in the index and stores a pointer to each row in a hash table.
●   CREATE TABLE user_info (user_id int not null primary key auto_increment,
    username varchar(50), password char(32), KEY USING HASH(username,
    password)) ENGINE=MEMORY;
●   Suppose the hash function is f(), i.e. f : (username, password) -> Integer. Each row
    then gets a hash value, e.g. f('john','abc123') = 2789. The index's data
    structure will have a pointer from slot 2789 to the row which has username 'john'
    and password 'abc123'.
●   If the function f() is very selective, i.e. it gives a different integer for each
    combination of username and password, then lookups take O(1) (constant) time.
    For a query such as SELECT * from user_info where
    username='john' and password='abc123', MySQL will not scan the table but compute
    f('john','abc123')=2789 and directly pick up the row from slot 2789.
Hash Indexes
●   ORDER BY queries on Memory engine will not take advantage of hash indexes as
    rows are not stored in sorted order.
●   Queries such as SELECT * from user_info where username='john'; will not use
    hash index because to compute the function f() it needs both username and
    password.
●   Range queries don't use hash indexes because computing f() needs exact values
    for the parameters.
●   If the function f() is not selective, i.e. it returns the same integer for more than
    one (username, password) pair, e.g. f('john','abc123')=2789 and
    f('mary','25qwer')=2789 and so on for 5 other pairs, then slot 2789 points to a
    linked list of row pointers, one for each row whose (username, password) pair
    hashes to 2789. This technique is termed chaining.
●   With hash collisions, the worst-case performance for a query like SELECT *
    from user_info where username='mary' and password='25qwer'; can be
    equivalent to a full table scan, if all (username, password) pairs in the table have the
    same hash value.
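The behaviour described on the two slides above can be illustrated with a toy hash index. This is only a sketch: Python's built-in hash() stands in for the engine's f(), and the class name HashIndex and the tiny slot count are assumptions made for the example, not how the MEMORY engine is implemented.

```python
from collections import defaultdict


class HashIndex:
    """Toy hash index over (username, password) with chaining."""

    def __init__(self, slots=8):
        self.slots = slots
        self.table = defaultdict(list)   # slot number -> chain of row ids

    def _slot(self, username, password):
        # Python's built-in hash() stands in for the engine's f().
        return hash((username, password)) % self.slots

    def add(self, row_id, username, password):
        self.table[self._slot(username, password)].append(row_id)

    def lookup(self, rows, username, password):
        # Walk the chain and re-check the key: other keys may share the slot.
        chain = self.table[self._slot(username, password)]
        return [r for r in chain if rows[r] == (username, password)]


rows = {1: ("john", "abc123"), 2: ("mary", "25qwer"), 3: ("johnny", "derp123")}
index = HashIndex(slots=2)          # only 2 slots, so 3 rows must collide
for row_id, (u, p) in rows.items():
    index.add(row_id, u, p)

found = index.lookup(rows, "mary", "25qwer")
longest_chain = max(len(chain) for chain in index.table.values())
```

Note that lookup() needs both username and password to compute the slot, which is why a query on username alone, a range query, or an ORDER BY cannot use such an index.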
Hash Indexes
●   Analysis of hashing with chaining :
        1. How long does it take to return the output of the query SELECT * from
    user_info where username='johnny' and password='derp123' ?
         2. Assuming simple uniform hashing, if there are 'm' slots in the index and a
    total of 'n' rows, then the expected number of rows each slot points to is a=n/m (the
    average length of the linked list for each slot, also called the load factor).
        3. For the query above, the average number of lookups is Θ(1+a).
         Proof : Suppose the username-password combination we are searching for does
    not exist. MySQL computes f('johnny','derp123') = x and then searches the
    linked list of pointers in slot 'x'. Since the pair is not there, it has to search to the end
    of the list, whose expected length is a. Including the time to compute f(), the cost is Θ(1+a).
         If the particular username-password combination is present then the number of
    lookups is equal to 1+ #(row pointers before ('johnny','derp123') in the linked list).
         For large values of n (number of rows in the table) the
    expected number of row pointers before ('johnny','derp123') in its linked list is about a/2.
    Thus the average number of lookups is about 1+a/2 = Θ(1+a).
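The analysis can be checked numerically with a small sketch. It assumes perfectly uniform chains and the illustrative values m = 100 and n = 1000; the function names are made up for this example.

```python
def average_probes_successful(chain_lengths):
    """Average chain entries examined when every stored key is searched once."""
    n = sum(chain_lengths)
    # A key at position i (1-based) in its chain costs i comparisons.
    total = sum(length * (length + 1) // 2 for length in chain_lengths)
    return total / n


def average_probes_unsuccessful(chain_lengths):
    """A miss scans a whole chain; each slot is equally likely to be hit."""
    return sum(chain_lengths) / len(chain_lengths)


m, n = 100, 1000                      # slots and rows (illustrative values)
chains = [n // m] * m                 # perfectly uniform hashing: 10 rows per slot
load_factor = n / m                   # a = n/m
hit_cost = average_probes_successful(chains)
miss_cost = average_probes_unsuccessful(chains)
```

With a = 10, a miss examines 10 entries and a hit about half of that, so both costs grow linearly with the load factor, as Θ(1+a) predicts.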
Hash Indexes
●   Hash Indexes for InnoDB engine : The InnoDB storage engine has a special feature
    called adaptive hash indexes. When InnoDB notices that some index values are
    being accessed very frequently, it builds a hash index for them in memory on top of
    B-Tree indexes.
●   A 'Good' Hash function f() : Each row is equally likely to hash to any of the 'm' slots,
    independently of where any other row has hashed to, i.e. f('john','abc123') should be
    independent of f('johnny','derp123').
●   InnoDB has no built-in hash function that we can take advantage of for
    “explicit” hash indexing. Instead we can maintain a column in the table for our hash
    values and index it: ALTER TABLE user_info ADD COLUMN hash CHAR(32), ADD KEY (hash);
●   Collision analysis using the 16-byte (32 hexadecimal digits) MD5() hash function :
        1. MD5() hash lookups are relatively slow: the algorithm takes time to
    compute the value, and comparing 32-character hexadecimal strings
    also takes time.
        2. SELECT * from user_info where username='johnny' and password='derp123'
    and hash='690cdca9655043e9d087a1d50cd74e02'; we still need the check on the username
    and password fields so that only the matching row is returned in case of collisions.
Hash Indexes
●   Method 2 : Using CRC32(), another built-in hash function, is a better choice than
    MD5() since it returns a 32-bit unsigned integer (at most 10 decimal digits), which
    makes comparisons much faster.
    SELECT * from user_info where username='johnny' and password='derp123' and
    hash=3682452828;
●   Method 3 : Using column prefixes as the hash. We can use fixed-length prefixes
    of our username and password values. E.g. for username 'johnny' and
    password 'derp123' we can choose our hash to be the (4+3)-character string 'johnder'.
        1. SELECT * from user_info where username='johnny' and password='derp123'
    and hash='johnder';
        2. Less comparison overhead compared to indexing the whole username and
    password values.
        3. Lower selectivity. Define selectivity s1 = (# of distinct username-password pairs)/
    (# of rows in user_info) and s2 = (# of distinct hash values)/(# of rows in user_info).
    Choose the shortest hash length L for which s2 ≈ s1; this keeps the number of
    collisions small.
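The three explicit hash columns discussed so far can be compared side by side in a short sketch. It assumes the hash is computed in the application over the plain concatenation username + password (the slides do not say how the two values are combined), and the sample pairs are invented for illustration.

```python
import hashlib
import zlib


def md5_hash(username, password):
    # 16-byte digest rendered as 32 hexadecimal characters.
    return hashlib.md5((username + password).encode("utf-8")).hexdigest()


def crc32_hash(username, password):
    # Unsigned 32-bit checksum, so at most 10 decimal digits.
    return zlib.crc32((username + password).encode("utf-8")) & 0xFFFFFFFF


def prefix_hash(username, password, user_len=4, pass_len=3):
    # Method 3: fixed-length prefixes glued together.
    return username[:user_len] + password[:pass_len]


def selectivity(values):
    # Distinct values divided by the total number of rows.
    return len(set(values)) / len(values)


pairs = [("johnny", "derp123"), ("john", "derby"),
         ("johnson", "der"), ("mary", "25qwer")]
s1 = selectivity(pairs)                                    # full pairs
s2 = selectivity([prefix_hash(u, p) for u, p in pairs])    # (4+3)-char prefixes
```

In this sample three of the four pairs share the prefix hash 'johnder', so s2 falls well below s1; a longer prefix would be needed to bring s2 close to s1.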
Hash Indexes
●   Method 4 : Using a universal class of hash functions. Convert the username and
    password strings to an integer by summing their ASCII character values,
    assuming the following :
         1. The ASCII character values for usernames and passwords lie between 0 and
    255.
         2. The maximum length of a username is 10 and of a password is 10. Thus the maximum
    integer value for a username is 255*10 and for a password is 255*10; adding them gives the
    maximum integer value for our key = 5100.
         3. Assuming there are 1000 distinct username-password pairs in our database,
    choose a prime p > 5100, say p=5101, and choose 2 integers at random with
    1 <= a <= p-1 and 0 <= b <= p-1; let a=19 and b=21.
●   Let the sum of the ASCII values of username and password be k. Then our universal
    hash function becomes f(k) = ((ak+b) mod p) mod m, where p=5101, m = number of
    slots (1000 in our case, one per distinct username-password pair), a=19 and b=21.
    So f(k) = ((19k+21) mod 5101) mod 1000.
●   For username 'johnny' and password 'derp123', k = 106 + 111 + 104 + 110 + 110 +
    121 + 100 + 101 + 114 + 112 + 49 + 50 + 51 = 1239. Thus f(1239) = (23562 mod 5101)
    mod 1000 = 3158 mod 1000 = 158. Thus our
    hash value for ('johnny','derp123') is 158.
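The worked example can be reproduced with a few lines of Python. This is a minimal sketch using the constants from the slide (a=19, b=21, p=5101, m=1000); the function names are chosen here for illustration.

```python
def ascii_sum(username, password):
    # k: sum of the character codes of both strings.
    return sum(ord(c) for c in username + password)


def universal_hash(k, a=19, b=21, p=5101, m=1000):
    # f(k) = ((a*k + b) mod p) mod m with the constants from the slide.
    return ((a * k + b) % p) % m


k = ascii_sum("johnny", "derp123")
slot = universal_hash(k)
```

One caveat of summing character codes is that any two pairs made of the same characters in a different order produce the same k, so they collide no matter how a and b are chosen.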
Hash Indexes
●   With a universal class of hash functions, for any two distinct keys k ≠ l,
    Pr(f(k)=f(l)) <= 1/m. Hence in our case the probability that f(k)=f(l) is at most
    1/1000 = 0.001.
●   Proof:
    Let r = (ak+b) mod p and s = (al+b) mod p, then r-s ≡ a(k-l) (mod p).
    Since 1 <= a < p, k ≠ l with |k-l| < p, and p is prime, a(k-l) is not divisible by p,
    hence r ≠ s. There are p(p-1) pairs (a,b) and p(p-1) pairs (r,s) with r ≠ s, and there is
    a one-to-one correspondence between (a,b) and (r,s).
    Thus a collision occurs only when r ≡ s (mod m) for some r ≠ s.
    For a given value of s, 0 <= s < p, the number of values r ≠ s for which r ≡ s (mod
    m) is at most (p-1)/m. Since r is equally likely to be any of the p-1 values other than s,
    the probability that r ≡ s (mod m) is at most ((p-1)/m)/(p-1) = 1/m.
●   Thus computing f(k) in the application and then using the query
    SELECT * from user_info where username='johnny' and password='derp123' and
    hash=158; keeps the indexed value a small integer while keeping collisions rare.
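The 1/m bound can also be checked by brute force for a small case. The sketch below enumerates every (a, b) pair for an illustrative prime p = 17 and m = 6 slots (these values and the two keys are assumptions chosen to keep the loop tiny) and counts how many of the resulting functions send both keys to the same slot.

```python
def collision_count(k, l, p, m):
    """Count the (a, b) pairs for which keys k and l land in the same slot."""
    count = 0
    for a in range(1, p):          # 1 <= a <= p-1
        for b in range(p):         # 0 <= b <= p-1
            if ((a * k + b) % p) % m == ((a * l + b) % p) % m:
                count += 1
    return count


p, m = 17, 6            # small illustrative prime and slot count
k, l = 8, 3             # two distinct keys below p
total_functions = p * (p - 1)
collisions = collision_count(k, l, p, m)
collision_probability = collisions / total_functions
```

The fraction of functions that collide on this pair stays at or below 1/m, which is what the proof above guarantees for a hash function picked at random from the family.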
B-Tree Indexes
●   B-trees are balanced search trees: height = O(log n) in the worst case.
●   They were designed to work well on Direct Access secondary storage devices
    (magnetic disks).
●   B-trees (and variants like B+ and B* trees ) are widely used in database systems.
B-Tree Indexes
●   A B-tree T is a rooted tree (with root root[T]) with properties:
       Every node x has four fields:
        1. The number of keys currently stored in node x, n[x].
        2. The n[x] keys themselves, stored in nondecreasing order:
            key_1[x] ≤ key_2[x] ≤ · · · ≤ key_{n[x]}[x] .

        3. leaf[x] = “True” if x is a leaf else “False”
        4. n[x] + 1 pointers, c_1[x], c_2[x], . . . , c_{n[x]+1}[x] to its children.

●   The keys key_i[x] separate the ranges of keys stored in each subtree: if k_i is any key
    stored in the subtree with root c_i[x], then:
         k_1 ≤ key_1[x] ≤ k_2 ≤ key_2[x] ≤ . . . ≤ key_{n[x]}[x] ≤ k_{n[x]+1} .
B-Tree Indexes
●   All leaves have the same depth, which is the tree’s height h.
●   There are upper and lower bounds on the number of keys in a node. To specify these
    bounds we use a fixed integer t ≥ 2, the minimum degree of the B-tree:
         lower bound: every node other than the root must have at least t − 1 keys, i.e.
    every internal node other than the root has at least t children.
         upper bound: every node can contain at most 2t − 1 keys, i.e. every internal node
    has at most 2t children.
B-Tree Indexes
●   SELECT * from user_info where firstname='johnny' and lastname='derp' and
    dob='1981-08-14'; (InnoDB engine; index on (firstname,lastname,dob));
●   Search Algorithm :       (x : pointer to the root node of a subtree; k = (firstname,lastname,dob))
        BTree-MySQL-Search (x, k)
           i = 1;
           while ( i ≤ n[x] and k > key_i[x] )
                 i = i+1;
           if ( i ≤ n[x] and k = key_i[x] ) then
                 return key_i[x] -> rows;
           else if ( leaf[x] ) then
                 return null;
           else
                 Disk-Read(c_i[x]);
                 return BTree-MySQL-Search(c_i[x], k);
●   Number of disk pages accessed by BTree-MySQL-Search is O(h) = O(log_t n), where n
    is the number of keys in the index.
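The search procedure can be sketched in Python as follows. The Node class, its rows payload and the hand-built three-node tree are assumptions made for this illustration; composite keys are modelled as tuples, which Python compares column by column from the left, like a multicolumn index.

```python
from bisect import bisect_left


class Node:
    def __init__(self, keys, children=None, rows=None):
        self.keys = keys                  # sorted list of key tuples
        self.children = children or []    # empty list means this is a leaf
        self.rows = rows or {}            # key -> row ids (hypothetical payload)


def btree_search(node, key):
    """Return the row ids stored under key, or None if the key is absent."""
    while node is not None:
        i = bisect_left(node.keys, key)   # first position with keys[i] >= key
        if i < len(node.keys) and node.keys[i] == key:
            return node.rows.get(key)
        if not node.children:             # reached a leaf without a match
            return None
        node = node.children[i]           # one page read per level
    return None


left = Node([("anna", "lee", "1975-01-02")],
            rows={("anna", "lee", "1975-01-02"): [7]})
right = Node([("zoe", "kim", "1990-05-05")],
             rows={("zoe", "kim", "1990-05-05"): [9]})
root = Node([("johnny", "derp", "1981-08-14")], [left, right],
            rows={("johnny", "derp", "1981-08-14"): [3]})

hit = btree_search(root, ("johnny", "derp", "1981-08-14"))
leaf_hit = btree_search(root, ("zoe", "kim", "1990-05-05"))
miss = btree_search(root, ("bob", "ray", "1980-01-01"))
```

The sketch uses a binary search inside each node instead of the linear scan in the pseudocode; the number of nodes visited is the same, at most one per level.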
B-Tree Indexes
●   INSERT, DELETE and UPDATE queries are much more involved. Let's briefly discuss
    only INSERT.
●   INSERT into user_info (firstname,lastname,dob)
    values ('johnny', 'derp', '1981-08-14');
●   Insert algorithm :
         1. Let k = (firstname,lastname,dob). First find the leaf node x where k
    should be inserted.
            a. If x is not full, then insert k into x at the appropriate position (in
    ascending order of keys ).
             b. If x is full then find the median of the keys in x and split
    the node into 2 nodes about the median. k is then inserted into one of the two new
    nodes at the appropriate position. The median key is in turn inserted into
    the parent node of x, and the process is repeated recursively up the tree. If
    the root node itself needs to be split, it is split into 2
    and the new root is a single-key node holding the median from the last split.
B-Tree Indexes
●   B-Tree insertion demonstration : (figure not reproduced)
●   The key is always inserted in a leaf node.
●   Requires O(h) = O(log_t n) disk accesses.
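Insertion with node splitting can be sketched as below. Note that this sketch uses the single-pass variant from CLRS, which splits any full node it meets on the way down, rather than the bottom-up splitting described on the previous slide; both move the median key into the parent and both grow the tree only at the root. The class names and the integer keys are illustrative.

```python
class BTreeNode:
    def __init__(self, leaf=True):
        self.keys = []
        self.children = []
        self.leaf = leaf


class BTree:
    def __init__(self, t=2):
        self.t = t                        # minimum degree: at most 2t - 1 keys per node
        self.root = BTreeNode()

    def _split_child(self, parent, i):
        t = self.t
        full = parent.children[i]
        right = BTreeNode(leaf=full.leaf)
        median = full.keys[t - 1]
        right.keys = full.keys[t:]        # keys above the median
        full.keys = full.keys[:t - 1]     # keys below the median
        if not full.leaf:
            right.children = full.children[t:]
            full.children = full.children[:t]
        parent.keys.insert(i, median)     # median moves up into the parent
        parent.children.insert(i + 1, right)

    def insert(self, key):
        if len(self.root.keys) == 2 * self.t - 1:
            new_root = BTreeNode(leaf=False)   # the tree grows in height here
            new_root.children.append(self.root)
            self._split_child(new_root, 0)
            self.root = new_root
        node = self.root
        while not node.leaf:
            i = sum(1 for k in node.keys if k < key)   # child that covers key
            if len(node.children[i].keys) == 2 * self.t - 1:
                self._split_child(node, i)
                if key > node.keys[i]:
                    i += 1
            node = node.children[i]
        node.keys.append(key)
        node.keys.sort()                  # keep leaf keys in ascending order

    def in_order(self, node=None):
        node = node or self.root
        if node.leaf:
            return list(node.keys)
        out = []
        for i, child in enumerate(node.children):
            out += self.in_order(child)
            if i < len(node.keys):
                out.append(node.keys[i])
        return out

    def height(self):
        h, node = 0, self.root
        while not node.leaf:
            h, node = h + 1, node.children[0]
        return h


tree = BTree(t=2)
for key in [10, 20, 5, 6, 12, 30, 7, 17]:
    tree.insert(key)
ordered = tree.in_order()
```

With t = 2 each node holds at most 3 keys, so the fourth insert splits the root and the eighth splits a leaf, pushing its median into the root.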
B-Tree Indexes
●   B+ Trees are B-Trees with the modification that internal nodes store only the keys
    used for navigation, while the leaf nodes contain both the keys and the rows
    corresponding to each key.
●   Types of queries that can use a B-Tree index (here on (firstname,lastname,dob)) :
        1. Match the full value – SELECT * from user_info where firstname='johnny'
    and lastname='derp' and dob='1981-08-14';
         2. Match a leftmost prefix – SELECT * from user_info where
    firstname='johnny';
         3. Match a column prefix - SELECT * from user_info where firstname like
    'john%';
         4. Match a range of values - SELECT * from user_info where firstname
    between 'john' and 'johnny';
        5. Match one part exactly and match a range on another part – SELECT * from
    user_info where firstname='johnny' and lastname like 'de%';
        6. Index-only queries – InnoDB uses B+Tree indexes, so a query that selects only
    indexed columns can be answered directly from the index without reading the rows -
    SELECT firstname, lastname from user_info where firstname like 'john%';
B-Tree Indexes
●   Types of queries that can't use, or can only partly use, a B-Tree index :
        1. The index is not useful if the lookup does not start from the leftmost
    indexed column, or from the leftmost characters of a column –
    SELECT * from user_info where lastname='derp';
    SELECT * from user_info where firstname like '%p';
         2. You can’t skip columns in the index; here only the firstname part is used –
    SELECT * from user_info where firstname='johnny' and dob='1981-08-14';
         3. The storage engine can’t optimize accesses with any columns to the right of
    the first range condition; here dob is not used for the lookup –
    SELECT * from user_info where firstname='johnny' and
    lastname like 'de%' and dob='1981-08-14';
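The leftmost-prefix rules can be condensed into a small model. This is a simplified sketch, not the MySQL optimizer: the function name and the "eq"/"range" labels are assumptions, and real servers have additional strategies (such as index condition pushdown) that it ignores.

```python
def usable_index_columns(index_columns, predicates):
    """Return the index columns a B-Tree lookup can use for these predicates.

    predicates maps a column name to "eq" (equality) or "range"
    (LIKE 'x%', BETWEEN, <, >). A leading-wildcard LIKE '%x' is not usable
    and should simply be left out of the mapping.
    """
    used = []
    for column in index_columns:
        kind = predicates.get(column)
        if kind is None:          # a skipped column ends the usable prefix
            break
        used.append(column)
        if kind == "range":       # nothing to the right of the first range is usable
            break
    return used


index = ["firstname", "lastname", "dob"]
full_match = usable_index_columns(
    index, {"firstname": "eq", "lastname": "eq", "dob": "eq"})
no_leftmost = usable_index_columns(index, {"lastname": "eq"})
skipped = usable_index_columns(index, {"firstname": "eq", "dob": "eq"})
after_range = usable_index_columns(
    index, {"firstname": "eq", "lastname": "range", "dob": "eq"})
```

The four results correspond to the full-match query and to the three problem cases listed on this slide.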
Indexing Strategies for High
                   Performance
●   Isolating the Column : “Isolating” the column means it should not be part of an
    expression or be inside a function in the query.
    SELECT * from user_info where user_id + 1 = 5; or
    SELECT * from user_info where TO_DAYS(CURRENT_DATE) -
    TO_DAYS(dob) <= 365; cannot use an index on user_id or dob. Rewrite the condition so
    that the column stands alone, e.g. where user_id = 4.
●   Prefix Indexes and Index Selectivity : For BLOB, TEXT and long VARCHAR columns,
    instead of indexing a very long string, an alternative is to index a prefix of the string.
    Index selectivity must be taken into account, though. Index selectivity is the ratio of the
    number of distinct indexed values to the total number of rows. The
    right prefix length depends on index selectivity.
         E.g. if there are 1000 rows in our user_info table and 435 distinct cities,
    then the selectivity of the full city column is 435/1000 = 0.435. If
    we choose a prefix length of 3, the number of distinct prefixes
    is smaller than 435, since many cities share the same 3-character prefix.
    Increasing the prefix length never reduces selectivity, but it can never exceed 0.435,
    so we choose the shortest length whose selectivity is close to 0.435. Suppose in our
    case a prefix length of 7 gives nearly 435 distinct prefixes. Thus we
    choose 7 as the prefix length.
         ALTER TABLE user_info ADD KEY (city(7));
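Choosing the prefix length can be sketched as a search for the shortest prefix whose selectivity is close to that of the full column. The city list and the tolerance of 0.05 below are assumptions for illustration only.

```python
def prefix_selectivity(values, length):
    # Distinct prefixes of the given length divided by the number of rows.
    return len({v[:length] for v in values}) / len(values)


def shortest_good_prefix(values, tolerance=0.05):
    """Smallest prefix length whose selectivity is within tolerance of the full column's."""
    full = len(set(values)) / len(values)
    longest = max(len(v) for v in values)
    for length in range(1, longest + 1):
        if full - prefix_selectivity(values, length) <= tolerance:
            return length
    return longest


cities = ["kolkata", "kolhapur", "kochi", "kota",
          "delhi", "dehradun", "kolkata", "delhi"]
full_selectivity = len(set(cities)) / len(cities)   # 6 distinct cities in 8 rows
best_length = shortest_good_prefix(cities)
```

On real data the same comparison would be run against the table itself, counting distinct prefixes of increasing length until the ratio levels off.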
Indexing Strategies for High
                  Performance
●   Choosing a good column order (for multicolumn indexes) :
          1. If ORDER BY or GROUP BY is not required then index the columns from
    left to right in decreasing order of selectivity, i.e. the most selective column should be
    the leftmost so that the leftmost column filters out as many rows as possible. E.g.
    the index on the columns country and city should be (city, country)
    because more users share the same country than the same city, i.e. the
    selectivity of city is higher than that of country. Note that it is the order of columns in
    the index that matters; the order of the conditions in the WHERE clause (“where
    city='kolkata' and country='india' ” versus “where country='india' and city='kolkata' ”)
    makes no difference.
         2. With ORDER BY or GROUP BY, put the columns used in equality conditions of
    the WHERE clause first, then the GROUP BY columns, then the ORDER BY columns.
    E.g. for “where firstname='johnny' GROUP BY city,country ORDER BY
    country” the index order should be (firstname,city,country).
●   Clustered Indexes : InnoDB’s clustered indexes store a B-Tree index and
    the rows together in the same structure. When a table has a clustered index, its rows
    are stored in the index’s leaf pages. The term “clustered” refers to the fact
    that rows with adjacent key values are stored close to each other.
Indexing Strategies for High
                  Performance
●   Clustered Indexes : (contd.) InnoDB clusters the data by the primary key. If you
    don’t define a primary key, InnoDB will try to use a unique non-nullable index
    instead. If there’s no such index, InnoDB will define a hidden primary key for you
    and then cluster on that. InnoDB clusters records together only within a page. Pages
    with adjacent key values might be distant from each other.
Indexing Strategies for High
                  Performance
●   Clustered Indexes : (contd.) Example : SELECT * from user_info ORDER BY
    username. If our primary key is username then this query is very fast:
    all the columns are returned straight from the leaf pages of the clustered index,
    and since rows are clustered on username they are already
    stored in alphabetical order of username within a page, so ORDER BY does not
    need a separate sort.
●   If clustering on a column value is not needed, i.e. we do not order by the primary
    key while returning almost all the columns, it is better not to define a primary key
    derived from any of the column values. E.g. if we do not require queries like the one
    above, then instead of defining the primary key on username, define it on
    an auto_increment user_id. With a username primary key there will be a lot
    of random I/O on insertions (since rows do not arrive in username order),
    which is inefficient; with auto_increment, insertions are sequential,
    avoiding random I/O.
●   MyISAM engine does not use clustering.
Indexing Strategies for High
                  Performance
●   Covering Indexes : An index that contains all the data needed to satisfy a query is
    called a covering index. Consider the query :
    SELECT firstname, lastname from user_info where firstname='johnny' and lastname
    like 'de%'; The query is index covered since all the columns it uses are part of
    the index (firstname, lastname, dob ).
●   Index-covered queries are very fast since no row lookups (random I/O on disk) are
    required; all values are returned from the index.
●   Hash, spatial, and full-text indexes don’t store the indexed column values, so MySQL
    can use only B-Tree indexes to cover queries.
●   When you issue a query that is covered by an index (an index-covered query), you’ll
    see “Using index” in the Extra column in EXPLAIN.
●   Because InnoDB secondary indexes store the
    primary key in their leaf nodes, a query that fetches the primary
    key column along with the secondary-indexed columns is also index covered.
    E.g. SELECT user_id, firstname, lastname from user_info where firstname='johnny'
    and lastname like 'de%'; here user_id is not part of the index (firstname, lastname,
    dob ) but it is the primary key, so the query is still index covered.
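The covering rule for InnoDB secondary indexes amounts to a set comparison, sketched below. The function name and the column lists are illustrative; the real decision is made by the optimizer and reported as “Using index” in EXPLAIN.

```python
def is_covered(index_columns, primary_key, needed_columns):
    """True if every column the query touches is available in the secondary index.

    InnoDB secondary index entries also carry the primary key columns.
    """
    available = set(index_columns) | set(primary_key)
    return set(needed_columns) <= available


index_columns = ["firstname", "lastname", "dob"]
primary_key = ["user_id"]

names_only = is_covered(index_columns, primary_key, ["firstname", "lastname"])
with_pk = is_covered(index_columns, primary_key,
                     ["user_id", "firstname", "lastname"])
with_city = is_covered(index_columns, primary_key,
                       ["firstname", "lastname", "city"])
```

A query that also needs city (or uses SELECT *) falls outside the index and has to read the row itself.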
Full-Text Searching
●   Most of the queries you’ll write will probably have WHERE clauses that compare
    values for equality, filter out ranges of rows, and so on. However, you might also
    need to perform keyword searches, which are based on relevance instead of
    comparing values to each other. Full-text search systems are designed for this
    purpose.
●   Full-Text search is based on finding words (terms) in documents instead of patterns.
●   For example, we want to find all rows in the reviews table whose
    review contains some or all of the words “good excellent exciting”.
    ALTER TABLE reviews add FULLTEXT KEY(review);
    SELECT review, MATCH(review) AGAINST('good excellent exciting') as
    relevancy from reviews where MATCH(review) AGAINST('good excellent
    exciting');
●   Boolean-mode searches can run without a full-text index (on MyISAM), at the cost of
    scanning the whole table.
●   There are two different modes for Full-Text Searching : Natural Language Mode
    and Boolean Mode.
●   MyISAM supports Full-Text searching and indexing; InnoDB supports it only from
    MySQL 5.6 onwards.
Full-Text Searching
●   Natural Language Mode of Full-Text Searching : The relevancy of a query to a
    particular row in the table is calculated as follows -
         1. Compute the weight of each word/term in the fulltext indexed columns of
    each row. The weight of a word in a row increases with the number of times it
    occurs in that row and decreases with the number of rows it occurs in,
    i.e. a query word that exists in only a few rows
    contributes more to the ordering of the search results. For example, in
    the query “we had an exciting adventure” words such as “we”, “had” and “an” are
    very common in holiday reviews, so they exist in more than 75% of the rows in
    our database, but words such as “exciting” and “adventure” are less common and
    occur in less than 10% of the rows. So “naturally we are looking for” rows
    which contain words like “exciting” and “adventure”, and those rows should be
    ranked higher. In fact words such as “we” , “an” , “the” etc. are called stopwords and
    they are not even considered while calculating weights.
        2. Mathematically the weight of a term ti in a given row is given as :

             w[ti]= (log(dtf[ti])+1)/sumdtf * U/(1+0.0115*U) * log((N-nf[ti])/nf[ti])
Full-Text Searching
●   Natural Language Mode of Full-Text Searching : contd.
        2. w[ti]= (log(dtf[ti])+1)/sumdtf * U/(1+0.0115*U) * log((N-nf[ti])/nf[ti])
        where dtf[ti] : number of times term ti appears in the row.
                  sumdtf : sum of (log(dtf)+1)'s for all terms in the same row.
                  U : number of unique terms in the row.
                  N : Total number of rows.
                  nf[ti] : number of rows that contain the term ti .
        The middle factor signifies that if the indexed text in the row is
    shorter than the average length (= number of unique words) then the weight for that
    row increases (i.e. a “short and sweet” row, so to say).
         3. Then the Rank of that row is computed as R = ∑i w[ti]*qnf[ti], where qnf[ti]
    is the number of times the term ti occurs in the query. This is the value returned by
    MATCH() AGAINST() in the SELECT list.
●   Structure of the index : The index is a B-Tree structure with 2 levels. In the first
    level the nodes store the terms as keys and in the second level, for each first level
    term, pointers to the rows that contain the term. This is similar to an inverted index.
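The weight and rank formulas above translate directly into code. The sketch below follows the formula as written on the slide, using natural logarithms; the sample row, the row counts and the function names are assumptions for illustration, and stopword removal is left out.

```python
import math


def term_weight(dtf, sumdtf, unique_terms, total_rows, rows_with_term):
    """Weight of one term in one row, following the formula on the slide."""
    local = (math.log(dtf) + 1) / sumdtf
    length_norm = unique_terms / (1 + 0.0115 * unique_terms)
    rarity = math.log((total_rows - rows_with_term) / rows_with_term)
    return local * length_norm * rarity


def row_rank(row_terms, query_terms, total_rows, rows_with_term):
    """R = sum over query terms of w[t] * qnf[t] for a single row."""
    counts = {t: row_terms.count(t) for t in set(row_terms)}
    sumdtf = sum(math.log(c) + 1 for c in counts.values())
    rank = 0.0
    for term in set(query_terms):
        if term in counts:
            w = term_weight(counts[term], sumdtf, len(counts),
                            total_rows, rows_with_term[term])
            rank += w * query_terms.count(term)   # qnf: occurrences in the query
    return rank


row = ["exciting", "adventure", "exciting", "trip"]
rank = row_rank(row, ["exciting", "adventure"], total_rows=100,
                rows_with_term={"exciting": 8, "adventure": 5})
rare = term_weight(1, 1.0, 1, 100, 5)      # term found in 5% of rows
common = term_weight(1, 1.0, 1, 100, 75)   # term found in 75% of rows
```

The last factor is what makes rare terms count more: a term found in 5% of the rows gets a positive weight, while one found in 75% of the rows gets a negative one.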
Full-Text Searching
●   The MATCH AGAINST clause can't be told to treat words from a particular column
    as more important than words from other columns. For example, you might want
    results to appear first when the keywords appear in a review's title.
●   A workaround that gives the title of the review twice the importance of the
    review itself :
        ALTER TABLE reviews ADD FULLTEXT KEY(title, review);
        ALTER TABLE reviews ADD FULLTEXT KEY(title);
        SELECT title, review, ROUND(MATCH(title, review) AGAINST('good
    excellent exciting'), 3) AS full_rel, ROUND(MATCH(title) AGAINST('good
    excellent exciting'), 3) AS title_rel FROM reviews WHERE MATCH(title, review)
    AGAINST('good excellent exciting') ORDER BY (2 * MATCH(title)
    AGAINST('good excellent exciting')) + MATCH(title, review) AGAINST('good
    excellent exciting') DESC;
Full-Text Searching
●   Boolean Mode of Full-Text Searching : In Boolean searches, the query itself
    specifies the relative relevance of each word in a match. When constructing a
    Boolean search query, you can use prefixes to modify the relative ranking of each
    keyword in the search string.
●   Examples :
        1. SELECT * from reviews where MATCH(title,review) AGAINST ('good
    ~bad +adventure' in BOOLEAN MODE); i.e. Rows must contain the word
    'adventure' and rows with the word 'good' should be ranked higher and rows with
    the word 'bad' should be ranked lower.
        2. SELECT * from reviews where MATCH(title,review) AGAINST ('good
    -bad +adventure' in BOOLEAN MODE); i.e. Rows must contain the word
    'adventure' and rows with the word 'good' should be ranked higher and the rows
    should not contain the word 'bad'.
        3. SELECT * from reviews where MATCH(title,review) AGAINST ('"good
    adventure"' in BOOLEAN MODE); i.e. Rows should contain the exact phrase “good
    adventure”.
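The effect of the +, - and ~ prefixes can be mimicked with a few lines of Python. This is only a sketch of the filtering logic: the scoring (+1 for a plain or required word, -1 for a ~ word) is invented for illustration and is not MySQL's actual ranking, and phrase searches are not handled.

```python
def boolean_match(text, query):
    """Return a score for text, or None if a +required or -excluded rule fails."""
    words = set(text.lower().split())
    score = 0
    for token in query.split():
        # Split an optional operator prefix from the word itself.
        op, word = (token[0], token[1:]) if token[0] in "+-~" else ("", token)
        present = word in words
        if op == "+" and not present:
            return None               # required word is missing
        if op == "-" and present:
            return None               # excluded word is present
        if present:
            score += -1 if op == "~" else 1   # illustrative scoring only
    return score


best = boolean_match("a good adventure", "good ~bad +adventure")
demoted = boolean_match("good bad adventure", "good ~bad +adventure")
excluded = boolean_match("good bad adventure", "good -bad +adventure")
missing = boolean_match("good trip", "good +adventure")
```

A row containing 'bad' is only ranked lower under '~bad' but is dropped entirely under '-bad', which is the difference between examples 1 and 2 above.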
Full-Text Searching
●   Phrase searches tend to be quite slow. The full-text index alone can’t answer such
    queries, because it doesn’t record where words are located relative to each other in
    the original full-text collection. Consequently, the server actually has to look inside
    the rows to do a phrase search.
    To execute such a search, the server finds all documents that contain both
    “good” and “adventure”. It then fetches the rows from which the documents were
    built, and checks them for the exact phrase.
●   Disadvantages of Full-Text Indexing and Searching :
        1. The index doesn’t record the indexed word’s position in the string, so
    proximity doesn’t contribute to relevance.
         2. MySQL’s full-text indexing performs well when the index fits in memory,
    but if the index is not in memory it can be very slow, especially when the fields are
    large.
        3. Modifying a piece of text with 100 words requires not 1 but up to 100 index
    operations.
Full-Text Searching
●   Disadvantages of Full-Text Indexing and Searching :        (contd.)
         4. The field length doesn’t usually affect other index types much, but with full-
    text indexing, text with 3 words and text with 10,000 words will have performance
    profiles that differ by orders of magnitude.
         5. If there’s a full-text index and the query has a MATCH AGAINST clause
    that can use it, MySQL will use the full-text index to process the query. It will not
    compare the full-text index to the other indexes that might be used for the query.
         6. The full-text search index can perform only full-text matches. Any other
    criteria in the query, such as WHERE clauses, must be applied after MySQL reads
    the row from the table.
        7. Full-text indexes don’t store the actual text they index. Thus, you can never
    use a full-text index as a covering index.
         8. Full-text indexes cannot be used for any type of sorting, other than sorting by
    relevance in natural-language mode. If you need to sort by something other than
    relevance, MySQL will use a filesort.
Full-Text Searching
●   Disadvantages of Full-Text Indexing and Searching :       (contd.)
        SELECT * from reviews where MATCH(title, review) AGAINST ('good
    exciting adventure') and review_id= 879;
        The query will not use the index on review_id since the fulltext
    index takes precedence. MySQL looks into the fulltext index, finds all matching
    rows, and only then applies the review_id condition from the WHERE clause.
        Solution:
         Index the review_id value in the fulltext index as well, stored in a string form
    such as 'review_id_is_879' (full-text indexes can only be built on CHAR, VARCHAR
    or TEXT columns).
        ALTER TABLE reviews add FULLTEXT KEY(review_id, title, review);
         SELECT * from reviews where MATCH(review_id, title, review) AGAINST
    ('+review_id_is_879 good exciting adventure' in BOOLEAN MODE);
References
●   Introduction to Algorithms, CLRS, 3rd Edition.
●   High Performance MySQL by Baron Schwartz, Peter Zaitsev and Vadim
    Tkachenko.



                              Thank You

Implementing transparency and open government projects in GreeceImplementing transparency and open government projects in Greece
Implementing transparency and open government projects in Greece
 

Similar a Mysql Performance Optimization Indexing Algorithms and Data Structures

14-Intermediate code generation - Variants of Syntax trees - Three Address Co...
14-Intermediate code generation - Variants of Syntax trees - Three Address Co...14-Intermediate code generation - Variants of Syntax trees - Three Address Co...
14-Intermediate code generation - Variants of Syntax trees - Three Address Co...venkatapranaykumarGa
 
Python programming workshop
Python programming workshopPython programming workshop
Python programming workshopBAINIDA
 
Improve Your Edge on Machine Learning - Day 1.pptx
Improve Your Edge on Machine Learning - Day 1.pptxImprove Your Edge on Machine Learning - Day 1.pptx
Improve Your Edge on Machine Learning - Day 1.pptxCatherineVania1
 
Unitii classnotes
Unitii classnotesUnitii classnotes
Unitii classnotesSowri Rajan
 
Data Analysis with R (combined slides)
Data Analysis with R (combined slides)Data Analysis with R (combined slides)
Data Analysis with R (combined slides)Guy Lebanon
 
Array assignment
Array assignmentArray assignment
Array assignmentAhmad Kamal
 
Homework Assignment – Array Technical DocumentWrite a technical .pdf
Homework Assignment – Array Technical DocumentWrite a technical .pdfHomework Assignment – Array Technical DocumentWrite a technical .pdf
Homework Assignment – Array Technical DocumentWrite a technical .pdfaroraopticals15
 
Shad_Cryptography_PracticalFile_IT_4th_Year (1).docx
Shad_Cryptography_PracticalFile_IT_4th_Year (1).docxShad_Cryptography_PracticalFile_IT_4th_Year (1).docx
Shad_Cryptography_PracticalFile_IT_4th_Year (1).docxSonu62614
 
presentation on important DAG,TRIE,Hashing.pptx
presentation on important DAG,TRIE,Hashing.pptxpresentation on important DAG,TRIE,Hashing.pptx
presentation on important DAG,TRIE,Hashing.pptxjainaaru59
 
Lecture 12 intermediate code generation
Lecture 12 intermediate code generationLecture 12 intermediate code generation
Lecture 12 intermediate code generationIffat Anjum
 
1-Object and Data Structures.pptx
1-Object and Data Structures.pptx1-Object and Data Structures.pptx
1-Object and Data Structures.pptxRobNieves1
 
Simple Queriebhjjnhhbbbbnnnnjjs In SQL.pdf
Simple Queriebhjjnhhbbbbnnnnjjs In SQL.pdfSimple Queriebhjjnhhbbbbnnnnjjs In SQL.pdf
Simple Queriebhjjnhhbbbbnnnnjjs In SQL.pdfManojVishwakarma91
 

Similar a Mysql Performance Optimization Indexing Algorithms and Data Structures (20)

Dictionary
DictionaryDictionary
Dictionary
 
14-Intermediate code generation - Variants of Syntax trees - Three Address Co...
14-Intermediate code generation - Variants of Syntax trees - Three Address Co...14-Intermediate code generation - Variants of Syntax trees - Three Address Co...
14-Intermediate code generation - Variants of Syntax trees - Three Address Co...
 
Python lecture 05
Python lecture 05Python lecture 05
Python lecture 05
 
Python programming workshop
Python programming workshopPython programming workshop
Python programming workshop
 
Improve Your Edge on Machine Learning - Day 1.pptx
Improve Your Edge on Machine Learning - Day 1.pptxImprove Your Edge on Machine Learning - Day 1.pptx
Improve Your Edge on Machine Learning - Day 1.pptx
 
Unitii classnotes
Unitii classnotesUnitii classnotes
Unitii classnotes
 
Compiler notes--unit-iii
Compiler notes--unit-iiiCompiler notes--unit-iii
Compiler notes--unit-iii
 
Python
PythonPython
Python
 
Data Analysis with R (combined slides)
Data Analysis with R (combined slides)Data Analysis with R (combined slides)
Data Analysis with R (combined slides)
 
1.2 matlab numerical data
1.2  matlab numerical data1.2  matlab numerical data
1.2 matlab numerical data
 
Session 4
Session 4Session 4
Session 4
 
Array assignment
Array assignmentArray assignment
Array assignment
 
Homework Assignment – Array Technical DocumentWrite a technical .pdf
Homework Assignment – Array Technical DocumentWrite a technical .pdfHomework Assignment – Array Technical DocumentWrite a technical .pdf
Homework Assignment – Array Technical DocumentWrite a technical .pdf
 
PHP Web Programming
PHP Web ProgrammingPHP Web Programming
PHP Web Programming
 
Hashing 1
Hashing 1Hashing 1
Hashing 1
 
Shad_Cryptography_PracticalFile_IT_4th_Year (1).docx
Shad_Cryptography_PracticalFile_IT_4th_Year (1).docxShad_Cryptography_PracticalFile_IT_4th_Year (1).docx
Shad_Cryptography_PracticalFile_IT_4th_Year (1).docx
 
presentation on important DAG,TRIE,Hashing.pptx
presentation on important DAG,TRIE,Hashing.pptxpresentation on important DAG,TRIE,Hashing.pptx
presentation on important DAG,TRIE,Hashing.pptx
 
Lecture 12 intermediate code generation
Lecture 12 intermediate code generationLecture 12 intermediate code generation
Lecture 12 intermediate code generation
 
1-Object and Data Structures.pptx
1-Object and Data Structures.pptx1-Object and Data Structures.pptx
1-Object and Data Structures.pptx
 
Simple Queriebhjjnhhbbbbnnnnjjs In SQL.pdf
Simple Queriebhjjnhhbbbbnnnnjjs In SQL.pdfSimple Queriebhjjnhhbbbbnnnnjjs In SQL.pdf
Simple Queriebhjjnhhbbbbnnnnjjs In SQL.pdf
 

Más de Abhijit Mondal

Más de Abhijit Mondal (8)

Pagerank
PagerankPagerank
Pagerank
 
Poster Presentation
Poster PresentationPoster Presentation
Poster Presentation
 
MySQL Performance Optimization
MySQL Performance OptimizationMySQL Performance Optimization
MySQL Performance Optimization
 
My MSc. Project
My MSc. ProjectMy MSc. Project
My MSc. Project
 
Security protocols
Security protocolsSecurity protocols
Security protocols
 
Public Key Cryptography
Public Key CryptographyPublic Key Cryptography
Public Key Cryptography
 
Number Theory for Security
Number Theory for SecurityNumber Theory for Security
Number Theory for Security
 
Quantum games
Quantum gamesQuantum games
Quantum games
 

Último

Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 

Último (20)

Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 

Mysql Performance Optimization Indexing Algorithms and Data Structures

  • 1. MySQL Performance Optimization Part II “Indexing Data Structures and Algorithms” Abhijit Mondal Software Engineer at HolidayIQ
  • 2. Contents
      ● Hash Indexes
      ● B-Tree and B+ Tree Indexes
      ● Indexing Strategies for High Performance
      ● Full Text Searching
  • 3. Hash Indexes
      ● A hash index is built on a hash table and is useful only for exact lookups that use every column in the index. For each row, the storage engine computes a hash code of the indexed columns: a small value that will probably differ from the hash codes computed for rows with different key values. The index stores the hash codes in a hash table, together with a pointer to each row.
      ● CREATE TABLE user_info (user_id INT NOT NULL PRIMARY KEY AUTO_INCREMENT, username VARCHAR(50), password CHAR(32), KEY USING HASH (username, password)) ENGINE=MEMORY;
      ● Suppose the hash function is f(), i.e. f : (username, password) -> Integer, so that for example f('john','abc123') = 2789. The index then holds a pointer from slot 2789 to the row with username 'john' and password 'abc123'.
      ● If f() is very selective, i.e. it gives a different integer for each combination of username and password, then a lookup takes O(1) (constant) time. For a query such as SELECT * FROM user_info WHERE username='john' AND password='abc123'; MySQL does not scan the table: it computes f('john','abc123') = 2789 and picks up the row directly from slot 2789.
  • 4. Hash Indexes
      ● ORDER BY queries on the MEMORY engine cannot take advantage of hash indexes, because the index entries are not stored in sorted order.
      ● Queries such as SELECT * FROM user_info WHERE username='john'; will not use the hash index, because computing f() needs both username and password.
      ● Range queries do not use hash indexes, because computing f() needs exact values for its parameters.
      ● If f() is not selective, i.e. it returns the same integer for more than one (username, password) pair, e.g. f('john','abc123') = 2789 and f('mary','25qwer') = 2789 and likewise for 5 other pairs, then slot 2789 points to a linked list of row pointers, one for each row whose (username, password) pair hashes to 2789. This technique is called chaining.
      ● With hash collisions, the worst-case performance of a query like SELECT * FROM user_info WHERE username='mary' AND password='25qwer'; is equivalent to a full table scan. This happens when all (username, password) pairs in the table have the same hash value.
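The exact-match lookup and the chaining behaviour described on the two slides above can be sketched in a few lines of Python. This is a toy model, not how the MEMORY engine is implemented: the class name `ChainedHashIndex`, the slot count, and the deliberately weak hash (sum of character codes) are all assumptions chosen so that collisions are easy to see.

```python
class ChainedHashIndex:
    """Toy hash index on (username, password) that resolves collisions by chaining."""

    def __init__(self, num_slots=8):
        self.num_slots = num_slots
        self.slots = {}  # slot number -> chain (list) of (key, row_id)

    def _slot(self, key):
        # Hypothetical weak hash: sum of character codes modulo the slot count.
        return sum(ord(ch) for part in key for ch in part) % self.num_slots

    def insert(self, key, row_id):
        self.slots.setdefault(self._slot(key), []).append((key, row_id))

    def lookup(self, key):
        # Exact match only: walk one chain and compare the full key each time.
        chain = self.slots.get(self._slot(key), [])
        return [row_id for stored, row_id in chain if stored == key]


index = ChainedHashIndex()
index.insert(("john", "abc123"), 1)
index.insert(("mary", "25qwer"), 2)
index.insert(("nhoj", "abc123"), 3)  # same characters as row 1, so it lands in the same slot

found = index.lookup(("john", "abc123"))
missing = index.lookup(("john", "wrong"))
```

A lookup needs the complete key to compute the slot, which is why a query on username alone, or a range condition, cannot use an index of this kind.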
  • 5. Hash Indexes
      ● Analysis of hashing with chaining:
          1. How long does it take to return the output of the query SELECT * FROM user_info WHERE username='johnny' AND password='derp123'; ?
          2. Assuming simple uniform hashing, if there are m slots in the index and a total of n rows, then the expected number of rows each slot points to is a = n/m (the average length of the linked list in each slot).
          3. For such a query the average number of lookups is Θ(1+a).
             Proof: MySQL computes f('johnny','derp123') = x and then searches the linked list of row pointers in slot x. If the username-password combination does not exist, it has to search to the end of the list, whose average length is a, so the cost is one hash computation plus a comparisons on average, i.e. Θ(1+a). If the combination is present, the number of lookups is 1 + #(row pointers before ('johnny','derp123') in the linked list). For large n (number of rows in the table) we can assume that the expected number of row pointers before it is a/2, so the average number of lookups is 1 + a/2 = Θ(1+a).
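The Θ(1+a) estimate can be checked with a small simulation. The sketch below assumes simple uniform hashing by drawing each row's slot from a seeded random generator; the function name `simulate_chaining` and the sizes n = 10000, m = 1000 are illustrative choices, not values from the slides.

```python
import random


def simulate_chaining(n_rows, m_slots, seed=0):
    """Return (load factor, average miss cost, average hit cost) for a chained hash table."""
    rng = random.Random(seed)
    chain_len = [0] * m_slots
    hit_cost_total = 0
    for _ in range(n_rows):
        slot = rng.randrange(m_slots)      # simple uniform hashing assumption
        chain_len[slot] += 1
        hit_cost_total += chain_len[slot]  # entries examined to find this row later
    load_factor = n_rows / m_slots
    avg_miss = sum(chain_len) / m_slots    # a miss walks one whole chain
    avg_hit = hit_cost_total / n_rows
    return load_factor, avg_miss, avg_hit


a, avg_miss, avg_hit = simulate_chaining(n_rows=10_000, m_slots=1_000)
```

With these sizes the load factor is a = 10, a miss examines a entries on average, and a hit examines roughly 1 + a/2 entries, in line with the analysis above.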
  • 6. Hash Indexes
      ● Hash indexes for the InnoDB engine: The InnoDB storage engine has a special feature called adaptive hash indexes. When InnoDB notices that some index values are being accessed very frequently, it builds a hash index for them in memory on top of the B-Tree indexes.
      ● A 'good' hash function f(): each row is equally likely to hash to any of the m slots, independently of where any other row has hashed to, i.e. f('john','abc123') should be independent of f('johnny','derp123').
      ● InnoDB does not let us create a hash index explicitly, so we can instead maintain one column in the table for our own hash values and index that column: ALTER TABLE user_info ADD COLUMN hash CHAR(32), ADD KEY (hash);
      ● Method 1: using the 16-byte (32 hexadecimal digit) MD5() hash function:
          1. MD5() hash lookups are time consuming: the algorithm takes time to compute the value, and since the value is a 32-digit hexadecimal string, comparisons also take time.
          2. SELECT * FROM user_info WHERE username='johnny' AND password='derp123' AND hash='690cdca9655043e9d087a1d50cd74e02'; We still need the conditions on the username and password fields so that only the matching row is returned when hash values collide.
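Computing the value for such a hash column on the application side could look like the sketch below. The function name `md5_hash_column` and the choice to join the two fields with a NUL separator are assumptions; the slide does not say how its example value was derived, so no expected digest is shown here.

```python
import hashlib


def md5_hash_column(username, password):
    """Return a 32-character hexadecimal MD5 digest for the (username, password) pair."""
    # Assumption: a separator keeps ("ab", "c") and ("a", "bc") from hashing identically.
    data = f"{username}\x00{password}".encode("utf-8")
    return hashlib.md5(data).hexdigest()


digest = md5_hash_column("johnny", "derp123")
```

The digest always has 32 hexadecimal characters, which is why the column is declared CHAR(32) and why string comparison on it is comparatively slow.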
  • 7. Hash Indexes
      ● Method 2: Using CRC32(), another built-in hash function, is a better choice than MD5(), since it returns a 32-bit unsigned integer (at most 10 decimal digits), which makes comparisons much faster. SELECT * FROM user_info WHERE username='johnny' AND password='derp123' AND hash=3682452828;
      ● Method 3: Using column prefixes as the hash index. We can use fixed-length prefixes of our username and password values. For example, for username 'johnny' and password 'derp123' we can choose our hash to be the (4+3)-character string 'johnder'.
          1. SELECT * FROM user_info WHERE username='johnny' AND password='derp123' AND hash='johnder';
          2. Less comparison overhead than indexing the whole username and password values.
          3. Lower selectivity. Define the selectivities s1 = (# of distinct username-password pairs)/(# of rows in user_info) and s2 = (# of distinct hash values)/(# of rows in user_info). Choose the shortest prefix length L for which s2 ≈ s1; the number of collisions is then close to its minimum.
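Methods 2 and 3 can be tried out with Python's standard library. In this sketch the helper names, the plain concatenation fed to CRC-32, and the three sample rows are all hypothetical; only the 'johnder' prefix example comes from the slide.

```python
import zlib


def crc32_hash(username, password):
    """32-bit unsigned CRC of the concatenated fields (concatenation is an assumption)."""
    return zlib.crc32(f"{username}{password}".encode("utf-8"))


def prefix_hash(username, password, user_len=4, pass_len=3):
    """Fixed-length prefix hash, e.g. ('johnny', 'derp123') -> 'johnder'."""
    return username[:user_len] + password[:pass_len]


def selectivity(values):
    """Number of distinct values divided by the number of rows."""
    values = list(values)
    return len(set(values)) / len(values)


rows = [("johnny", "derp123"), ("johnson", "derby1"), ("mary", "25qwer")]  # hypothetical data
s1 = selectivity(rows)
s2 = selectivity(prefix_hash(u, p) for u, p in rows)
```

Here s1 is 1.0 while s2 is only 2/3, because the first two rows share the prefix hash 'johnder'. A longer prefix would bring s2 back towards s1 at the cost of longer comparisons.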
  • 8. Hash Indexes
      ● Method 4: Using a universal class of hash functions. Convert the username and password strings to an integer by summing their ASCII character values, under the following assumptions:
          1. The character values for usernames and passwords lie between 0 and 255.
          2. The maximum length of a username is 10 and of a password is 10. Thus the maximum integer value for a username is 255*10 = 2550, and likewise for a password; adding them gives the maximum integer value for our key, 5100.
          3. Assuming there are 1000 distinct username-password pairs in our database, choose a prime p > 5100, say p = 5101, and two integers a and b with 1 <= a <= p-1 and 0 <= b <= p-1, say a = 19 and b = 21.
      ● Let k be the sum of the ASCII values of the username and password. Our universal hash function is then f(k) = ((ak+b) mod p) mod m, where p = 5101, m = number of slots (1000 in our case, one per distinct username-password pair), a = 19 and b = 21. So f(k) = ((19k+21) mod 5101) mod 1000.
      ● For username 'johnny' and password 'derp123', k = 106 + 111 + 104 + 110 + 110 + 121 + 100 + 101 + 114 + 112 + 49 + 50 + 51 = 1239. Then (19*1239 + 21) mod 5101 = 23562 mod 5101 = 3158, and 3158 mod 1000 = 158. Thus our hash value for ('johnny','derp123') is 158.
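The worked example on this slide can be reproduced directly. The constants below are the ones given on the slide; the function names `key_sum` and `universal_hash` are chosen here for illustration.

```python
P, M, A, B = 5101, 1000, 19, 21  # prime p, slot count m, and the multipliers a, b from the slide


def key_sum(username, password):
    """Sum of the character codes of both strings (the integer key k)."""
    return sum(ord(ch) for ch in username + password)


def universal_hash(k, a=A, b=B, p=P, m=M):
    """f(k) = ((a*k + b) mod p) mod m."""
    return ((a * k + b) % p) % m


k = key_sum("johnny", "derp123")  # 1239
h = universal_hash(k)             # 158
```

The application would compute `h` before issuing the query and pass it as the value of the hash column.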
  • 9. Hash Indexes
      ● With a universal class of hash functions, for any two distinct keys k ≠ l, Pr(f(k) = f(l)) <= 1/m, where the probability is taken over the random choice of a and b. Hence in our case the probability that f(k) = f(l) is at most 1/1000 = 0.001.
      ● Proof: Let r = (ak+b) mod p and s = (al+b) mod p. Then r - s ≡ a(k-l) (mod p). Since p is prime, 1 <= a < p and 0 < |k-l| < p, neither factor is divisible by p, hence r ≠ s. There are p(p-1) possible pairs (a,b) and p(p-1) pairs (r,s) with r ≠ s, and there is a one-to-one correspondence between the pairs (a,b) and the pairs (r,s). A collision therefore occurs only when r ≡ s (mod m) for some r ≠ s. For a given value of s with 0 <= s < p, the number of values r ≠ s for which r ≡ s (mod m) is at most (p-1)/m. Thus the probability that r ≡ s (mod m) for a particular value of s is at most ((p-1)/m)/(p-1) = 1/m.
      ● Thus computing f(k) programmatically for lookups and then using the query SELECT * FROM user_info WHERE username='johnny' AND password='derp123' AND hash=158; can give a significant performance benefit.
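The 1/m bound can also be checked by brute force for a small prime. The sketch below uses p = 101 and m = 10 rather than the slide's p = 5101 and m = 1000, purely to keep the enumeration of all (a, b) pairs fast; the function name and the sample keys are illustrative.

```python
def collision_fraction(k, l, p, m):
    """Fraction of all (a, b) pairs, 1 <= a < p and 0 <= b < p, for which keys k and l collide."""
    collisions = 0
    total = 0
    for a in range(1, p):
        for b in range(p):
            total += 1
            if ((a * k + b) % p) % m == ((a * l + b) % p) % m:
                collisions += 1
    return collisions / total


p, m = 101, 10
fractions = [collision_fraction(k, l, p, m) for k, l in [(3, 57), (0, 100), (12, 13)]]
```

Every fraction stays at or below 1/m, as the proof predicts for any two distinct keys smaller than p.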
  • 10. B-Tree Indexes
      ● B-trees are balanced search trees: height = O(log n) in the worst case.
      ● They were designed to work well on direct-access secondary storage devices (magnetic disks).
      ● B-trees (and variants like B+ and B* trees) are widely used in database systems.
  • 11. B-Tree Indexes
      ● A B-tree T is a rooted tree (with root root[T]) in which every node x has four fields:
          1. n[x], the number of keys currently stored in node x.
          2. The n[x] keys themselves, stored in nondecreasing order: key_1[x] ≤ key_2[x] ≤ ... ≤ key_{n[x]}[x].
          3. leaf[x], which is True if x is a leaf and False otherwise.
          4. n[x] + 1 pointers c_1[x], c_2[x], ..., c_{n[x]+1}[x] to its children.
      ● The keys key_i[x] separate the ranges of keys stored in each subtree: if k_i is any key stored in the subtree with root c_i[x], then k_1 ≤ key_1[x] ≤ k_2 ≤ key_2[x] ≤ ... ≤ key_{n[x]}[x] ≤ k_{n[x]+1}.
  • 12. B-Tree Indexes
      ● All leaves have the same depth, which is the tree's height h.
      ● There are upper and lower bounds on the number of keys in a node. To specify these bounds we use a fixed integer t ≥ 2, the minimum degree of the B-tree:
          lower bound: every node other than the root must have at least t − 1 keys, i.e. every internal node other than the root has at least t children.
          upper bound: every node can contain at most 2t − 1 keys, i.e. every internal node has at most 2t children.
  • 13. B-Tree Indexes
      ● SELECT * FROM user_info WHERE firstname='johnny' AND lastname='derp' AND dob='1981-08-14'; (InnoDB engine; index on (firstname, lastname, dob))
      ● Search algorithm (x is a pointer to the root node of a subtree):
          BTree-MySQL-Search(x, firstname, lastname, dob)
              k = (firstname, lastname, dob)
              i = 1
              while i ≤ n[x] and k > key_i[x]
                  i = i + 1
              if i ≤ n[x] and k = key_i[x]
                  then return key_i[x] -> rows
              else if leaf[x]
                  then return null
              else
                  Disk-Read(c_i[x])
                  return BTree-MySQL-Search(c_i[x], firstname, lastname, dob)
      ● The number of disk pages accessed by BTree-MySQL-Search is O(h) = O(log_t n), where n is the number of keys in the index.
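A runnable version of the search on this slide might look as follows. It is a minimal in-memory sketch: the `BTreeNode` class, the hand-built three-node tree and its sample rows are hypothetical, indices are 0-based, and following a child pointer stands in for Disk-Read.

```python
from dataclasses import dataclass, field


@dataclass
class BTreeNode:
    keys: list = field(default_factory=list)      # sorted composite keys
    rows: list = field(default_factory=list)      # rows[i] belongs to keys[i]
    children: list = field(default_factory=list)  # len(keys) + 1 children if internal

    @property
    def leaf(self):
        return not self.children


def btree_search(node, key):
    """Return the row stored under key, or None if the key is absent."""
    i = 0
    while i < len(node.keys) and key > node.keys[i]:
        i += 1
    if i < len(node.keys) and key == node.keys[i]:
        return node.rows[i]
    if node.leaf:
        return None
    return btree_search(node.children[i], key)  # one "disk read" per level


# Hypothetical index on (firstname, lastname, dob); row ids stand in for rows.
left = BTreeNode(keys=[("alice", "smith", "1975-02-01")], rows=[1])
right = BTreeNode(
    keys=[("johnny", "derp", "1981-08-14"), ("zoe", "young", "1990-12-31")],
    rows=[3, 4],
)
root = BTreeNode(keys=[("bob", "jones", "1980-05-05")], rows=[2], children=[left, right])

hit = btree_search(root, ("johnny", "derp", "1981-08-14"))
miss = btree_search(root, ("johnny", "derp", "1999-01-01"))
```

Python compares tuples column by column from the left, which mirrors how a composite key on (firstname, lastname, dob) is ordered in the index.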
  • 14. B-Tree Indexes
      ● INSERT, DELETE and UPDATE queries are much more involved. Let's briefly discuss INSERT only.
      ● INSERT INTO user_info (firstname, lastname, dob) VALUES ('johnny', 'derp', '1981-08-14');
      ● Insert algorithm:
          1. Let k = (firstname, lastname, dob). First find the leaf node x into which k should be inserted.
             a. If x is not full, insert k into x at the appropriate position (keeping the keys in ascending order).
             b. If x is full, compute the median of the keys in x and split the node into two nodes about the median. Then insert k into one of the two split nodes at the appropriate position. The median key is moved up into the parent node of x, which may itself be full, so the same process is applied recursively. If, moving up the tree, the current root node needs to be split, it is split into two and the new root is a single-key node holding the median from that last split.
  • 15. B-Tree Indexes
      ● [Figure: B-Tree insertion demonstration]
      ● The key is always inserted in a leaf node.
      ● Insertion requires O(h) = O(log_t n) disk accesses.
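The insert procedure from the previous two slides can be sketched as a short recursive function. This version is a simplification under stated assumptions: it inserts the key first and splits a node only once it overflows past 2t − 1 keys (rather than splitting a full node before inserting), it stores keys only, and the names `Node`, `insert`, `inorder` and `leaf_depths` are illustrative.

```python
import bisect

T = 2  # minimum degree t: every node holds at most 2*T - 1 = 3 keys


class Node:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []  # empty list for a leaf


def _insert(node, key):
    """Insert key below node; return (median, right_node) if node had to split, else None."""
    i = bisect.bisect_left(node.keys, key)
    if not node.children:
        node.keys.insert(i, key)            # keys always enter the tree at a leaf
    else:
        split = _insert(node.children[i], key)
        if split:                           # child split: pull its median up into this node
            median, right = split
            node.keys.insert(i, median)
            node.children.insert(i + 1, right)
    if len(node.keys) > 2 * T - 1:          # overflow: split about the median key
        mid = len(node.keys) // 2
        median = node.keys[mid]
        right = Node(node.keys[mid + 1:], node.children[mid + 1:])
        node.keys = node.keys[:mid]
        node.children = node.children[:mid + 1]
        return median, right
    return None


def insert(root, key):
    """Insert key and return the (possibly new) root."""
    split = _insert(root, key)
    if split:
        median, right = split
        return Node([median], [root, right])  # the tree grows in height only at the root
    return root


def inorder(node):
    """All keys in ascending order."""
    if not node.children:
        return list(node.keys)
    out = []
    for i, child in enumerate(node.children):
        out += inorder(child)
        if i < len(node.keys):
            out.append(node.keys[i])
    return out


def leaf_depths(node, depth=0):
    """Set of depths at which leaves occur (a single value in a valid B-tree)."""
    if not node.children:
        return {depth}
    return set().union(*(leaf_depths(c, depth + 1) for c in node.children))


keys = [10, 20, 5, 6, 12, 30, 7, 17, 3, 25]
root = Node()
for key in keys:
    root = insert(root, key)
```

Integer keys are used for brevity; composite tuples such as (firstname, lastname, dob) work the same way because they compare in index order.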
  • 16. B-Tree Indexes
      ● B+ Trees are B-Trees with the modification that internal nodes store only the keys used for indexing, while the leaf nodes contain both the keys and the rows corresponding to those keys.
      ● Types of queries that can use a B-Tree index (here on (firstname, lastname, dob)):
          1. Match the full value: SELECT * FROM user_info WHERE firstname='johnny' AND lastname='derp' AND dob='1981-08-14';
          2. Match a leftmost prefix: SELECT * FROM user_info WHERE firstname='johnny';
          3. Match a column prefix: SELECT * FROM user_info WHERE firstname LIKE 'john%';
          4. Match a range of values: SELECT * FROM user_info WHERE firstname BETWEEN 'john' AND 'johnny';
          5. Match one part exactly and match a range on another part: SELECT * FROM user_info WHERE firstname='johnny' AND lastname LIKE 'de%';
          6. InnoDB uses B+Tree indexes, so a query that selects only indexed columns can be answered directly from the index (an index-only query): SELECT firstname, lastname FROM user_info WHERE firstname LIKE 'john%';
  • 17. B-Tree Indexes
      ● Types of queries that can't (fully) use a B-Tree index:
          1. The index is not useful if the lookup does not start from the leftmost side of the indexed columns: SELECT * FROM user_info WHERE lastname='derp'; SELECT * FROM user_info WHERE firstname LIKE '%p';
          2. You can't skip columns in the index: SELECT * FROM user_info WHERE firstname='johnny' AND dob='1981-08-14'; can use only the firstname part of the index.
          3. The storage engine can't optimize accesses on any columns to the right of the first range condition: SELECT * FROM user_info WHERE firstname='johnny' AND lastname LIKE 'de%' AND dob='1981-08-14'; can use only the firstname and lastname parts of the index.
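The leftmost-prefix rule follows from the sort order of the composite keys, which a sorted list of tuples can imitate. In this sketch the sample entries and the function `leftmost_prefix` are hypothetical, and a binary search over a list stands in for the descent through the B-Tree.

```python
import bisect

# Hypothetical index entries on (firstname, lastname, dob), kept in key order.
index = sorted([
    ("alice", "smith", "1975-02-01"),
    ("john", "adams", "1979-03-09"),
    ("johnny", "brown", "1970-06-15"),
    ("johnny", "derp", "1981-08-14"),
    ("mary", "derp", "1985-11-02"),
])


def leftmost_prefix(index, prefix):
    """All entries whose first len(prefix) columns equal prefix, found by binary search."""
    lo = bisect.bisect_left(index, prefix)
    hi = lo
    while hi < len(index) and index[hi][:len(prefix)] == prefix:
        hi += 1
    return index[lo:hi]


by_firstname = leftmost_prefix(index, ("johnny",))
by_first_and_last = leftmost_prefix(index, ("johnny", "derp"))

# A condition on lastname alone cannot seek: matching entries are scattered through
# the key order, so every entry has to be examined.
by_lastname_scan = [entry for entry in index if entry[1] == "derp"]
```

Entries for a given firstname are contiguous, so the search can jump to them, whereas the two 'derp' entries are separated by other keys and can only be found by scanning.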
  • 18. Indexing Strategies for High Performance
      ● Isolating the column: "Isolating" the column means it should not be part of an expression or be inside a function in the query. Queries such as SELECT * FROM user_info WHERE user_id + 1 = 5; or SELECT * FROM user_info WHERE TO_DAYS(CURRENT_DATE) - TO_DAYS(dob) <= 365; do not use indexes in MySQL.
      ● Prefix indexes and index selectivity: For BLOB and TEXT columns, instead of indexing a very long string, the alternative is to index a prefix of the string. Index selectivity must also be taken into account: it is the ratio of the number of distinct indexed values to the total number of rows, and the prefix length should be chosen based on it. For example, if there are 1000 rows in our user_info table and 435 distinct values of city, then the selectivity of city is 435/1000 = 0.435. With a prefix length of 3, the number of distinct prefixes is lower than 435, because many cities share the same first three characters, so the selectivity drops. Increasing the prefix length never lowers selectivity, but what matters is choosing an optimal value: a selectivity as close as possible to 0.435 with a length that is not too high. In our case a prefix length of 7 brings the number of distinct prefixes close to 435, so we choose 7 as the prefix length. ALTER TABLE user_info ADD KEY (city(7));
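Choosing the prefix length can be automated along the lines described above. The sketch below uses a small hypothetical list of city values and a hypothetical tolerance of 0.05 on the selectivity loss; on a real table the same ratios would be computed with COUNT(DISTINCT ...) queries.

```python
def prefix_selectivity(values, length=None):
    """Distinct values (or distinct prefixes of the given length) divided by the row count."""
    keys = values if length is None else [v[:length] for v in values]
    return len(set(keys)) / len(values)


def shortest_good_prefix(values, tolerance=0.05):
    """Smallest prefix length whose selectivity is within tolerance of the full column's."""
    full = prefix_selectivity(values)
    for length in range(1, max(len(v) for v in values) + 1):
        if full - prefix_selectivity(values, length) <= tolerance:
            return length


# Hypothetical city column with 8 rows and 7 distinct values.
cities = ["kolkata", "kolkata", "kolhapur", "kochi", "kollam", "delhi", "dehradun", "mumbai"]

full = prefix_selectivity(cities)
at_three = prefix_selectivity(cities, 3)
best = shortest_good_prefix(cities)
```

For this sample a three-character prefix merges 'kolkata', 'kolhapur' and 'kollam' into one value, while four characters already separate every city, so the shortest adequate prefix is 4.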
  • 19. Indexing Strategies for High Performance ● Choosing a good column order (For multicolumn indexes) : 1. If ORDER BY or GROUP BY is not required, then index the columns from left to right in decreasing order of selectivity, i.e. the most selective column should be the leftmost so that the leftmost column filters out as many rows as possible. For e.g. the indexing order for the columns country and city should be (city, country), because more users share a country than share a city, i.e. the selectivity of city is higher than that of country. The order in which the conditions are written in the WHERE clause does not matter; “where city='kolkata' and country='india'” and “where country='india' and city='kolkata'” are the same query, and it is the column order of the index that counts. 2. When ORDER BY or GROUP BY is involved, the index should start with the columns used in the ordinary where conditions, followed by the GROUP BY columns, with the ORDER BY columns rightmost. For e.g. for “where firstname='johnny' GROUP BY city,country ORDER BY country” the index order should be (firstname,city,country). ● Clustered Indexes : InnoDB’s clustered indexes actually store a B-Tree index and the rows together in the same structure. When a table has a clustered index, its rows are actually stored in the index’s leaf pages. The term “clustered” refers to the fact that rows with adjacent key values are stored close to each other.
  • 20. Indexing Strategies for High Performance ● Clustered Indexes : (contd.) InnoDB clusters the data by the primary key. If you don’t define a primary key, InnoDB will try to use a unique non-nullable index instead. If there’s no such index, InnoDB will define a hidden primary key for you and then cluster on that. InnoDB clusters records together only within a page. Pages with adjacent key values might be distant from each other.
  • 21. Indexing Strategies for High Performance ● Clustered Indexes : (contd.) Example : SELECT * from user_info ORDER BY username. If our primary key is username, this query is very fast: all the columns are returned from the leaf nodes of the clustered B-Tree index without a separate row lookup, and since the rows within a page are stored in alphabetical order of username, ORDER BY needs no sort within a page. ● If clustering on a column value is not needed, i.e. we do not ORDER BY the primary key while returning almost all the columns, it is better not to derive the primary key from any of the column values. For e.g. if we do not need queries like the one above, then instead of defining the primary key on username, define it as an auto_increment user_id. With a username primary key, insertions arrive in no particular order of username and cause a lot of random I/O, which is inefficient; with auto_increment, insertions follow sequential order and avoid that random I/O. ● MyISAM engine does not use clustering.
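The difference between sequential and random primary keys can be pictured by counting how many inserts land somewhere other than the end of the key order. This is a simplified sketch: a sorted Python list stands in for the clustered index, the generated usernames are made up, and a "mid insert" is only a rough proxy for the page splits and random I/O a real storage engine would incur.

```python
import bisect
import random


def count_mid_inserts(keys):
    """Count inserts that do not land at the end of the sorted key order."""
    ordered = []
    mid_inserts = 0
    for key in keys:
        position = bisect.bisect_left(ordered, key)
        if position != len(ordered):
            mid_inserts += 1  # row has to be placed between existing rows
        ordered.insert(position, key)
    return mid_inserts


# auto_increment style keys always arrive in increasing order
auto_increment_ids = list(range(1, 1001))

# made-up usernames arriving in no particular order
rng = random.Random(42)
usernames = ["user%06d" % rng.randrange(10**6) for _ in range(1000)]

print(count_mid_inserts(auto_increment_ids))  # every insert appends at the end
print(count_mid_inserts(usernames))           # most inserts land in the middle
```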
  • 22. Indexing Strategies for High Performance ● Covering Indexes : An index that contains all the data needed to satisfy a query is called a covering index. Consider the query : SELECT firstname, lastname from user_info where firstname='johnny' and lastname like 'de%'; The query is index covered since all the columns it uses are part of the index (firstname, lastname, dob). ● Index covered queries are very fast since no row lookups (random I/O on disk) are required; all values are returned from the index. ● Hash, spatial, and full-text indexes don’t store the indexed column values, so MySQL can use only B-Tree indexes to cover queries. ● When you issue a query that is covered by an index (an index-covered query), you’ll see “Using index” in the Extra column in EXPLAIN. ● Because InnoDB secondary indexes store the primary key values in their leaf nodes, a query that fetches only the primary key column and the secondary indexed columns is also an index covered query. For e.g. SELECT user_id, firstname, lastname from user_info where firstname='johnny' and lastname like 'de%'; here user_id is not part of the index (firstname, lastname, dob) but it is the primary key, so the query is still index covered.
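The covering rule for an InnoDB secondary index reduces to a set check: every column the query touches must be either in the index or in the primary key. The sketch below only models that rule; the function name is an assumption, and the real decision is made by the optimizer and reported by EXPLAIN.

```python
def is_index_covered(query_columns, index_columns, primary_key_columns=()):
    """True if every column the query touches is available in the index.

    For an InnoDB secondary index, pass the primary key columns as well,
    since they are stored in the secondary index's leaf nodes.
    """
    available = set(index_columns) | set(primary_key_columns)
    return set(query_columns) <= available


index = ("firstname", "lastname", "dob")

# SELECT firstname, lastname ... where firstname=... and lastname like ...
print(is_index_covered(["firstname", "lastname"], index))
# SELECT user_id, firstname, lastname ... with user_id as primary key
print(is_index_covered(["user_id", "firstname", "lastname"], index, ["user_id"]))
# SELECT * also needs columns such as city, which the index lacks
print(is_index_covered(["user_id", "firstname", "lastname", "city"], index, ["user_id"]))
```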
  • 23. Full-Text Searching ● Most of the queries you’ll write will probably have WHERE clauses that compare values for equality, filter out ranges of rows, and so on. However, you might also need to perform keyword searches, which are based on relevance instead of comparing values to each other. Full-text search systems are designed for this purpose. ● Full-Text search is based on finding words (terms) in documents instead of patterns. ● For example, we want to find all rows in the reviews table whose review contains some or all of the words “good excellent exciting”. ALTER TABLE reviews add FULLTEXT KEY(review); SELECT review, MATCH(review) AGAINST('good excellent exciting') as relevancy from reviews where MATCH(review) AGAINST('good excellent exciting'); ● In Boolean mode, MyISAM can also run a full-text search without a full-text index, although it is slower. ● There are two different modes for Full-Text Searching : Natural Language Mode and Boolean Mode. ● Before MySQL 5.6 only the MyISAM engine supported Full-Text searching and indexing; InnoDB supports FULLTEXT indexes from MySQL 5.6 onwards. The details on the following slides describe MyISAM.
  • 24. Full-Text Searching ● Natural Language Mode of Full-Text Searching : The relevancy of a query with a particular row in the table is calculated as follows - 1. Compute the weight of each word/term in the fulltext indexed columns in each row. The weight of a word in a row increases with the number of times it occurs in that row and decreases with the number of rows it occurs in, i.e. a query word that exists in only a few rows matters more for ordering the search results than a word that exists in most rows. For example, in the query “we had an exciting adventure” words such as “we”, “had” and “an” are pretty common terms in holiday reviews, so they exist in more than 75% of the rows in our database, but words such as “exciting” and “adventure” are less common and occur in less than 10% of the rows. So “naturally we are looking for” rows which contain words like “exciting” and “adventure”, and those rows should be ranked higher. In fact words such as “we”, “an”, “the” etc. are called stopwords and they are not even considered while calculating weights. 2. Mathematically the formula for the weight of a term ti in a given row is : w[ti]= (log(dtf[ti])+1)/sumdtf * U/(1+0.0115*U) * log((N-nf[ti])/nf[ti])
  • 25. Full-Text Searching ● Natural Language Mode of Full-Text Searching : contd. 2. w[ti]= (log(dtf[ti])+1)/sumdtf * U/(1+0.0115*U) * log((N-nf[ti])/nf[ti]) where dtf[ti] : number of times term ti appears in the row. sumdtf : sum of (log(dtf)+1)'s for all terms in the same row. U : number of unique terms in the row. N : total number of rows. nf[ti] : number of rows that contain the term ti. The middle term is a length normalization: if the indexed columns in the row are shorter than the average length (measured in number of unique words) then the weight for that row increases (i.e. a “short and sweet” row, so to say). 3. Then the rank of that row is computed as R = ∑i w[ti]*qnf[ti], where qnf[ti] is the number of times the term ti occurs in the query. This is the value returned by MATCH() AGAINST() in the SELECT list. ● Structure of the index : The index is a B-Tree structure with 2 levels. In the first level the nodes store the terms as keys, and in the second level, for each first level term, there are pointers to the rows that contain the term. This is similar to an inverted index.
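The weight and rank formulas above can be evaluated directly. The sketch below is a literal transcription of the formula with the symbols defined on the slide, using natural logarithms; it is not MySQL's code, the sample counts are made up, stopword removal is left out, and it assumes 0 < nf < N for every term so that the logarithm is defined.

```python
import math
from collections import Counter


def term_weights(row_terms, rows_containing, total_rows):
    """Weight w[ti] of every term in one row, following the slide's formula."""
    counts = Counter(row_terms)                              # dtf per term
    sumdtf = sum(math.log(dtf) + 1 for dtf in counts.values())
    unique = len(counts)                                     # U
    weights = {}
    for term, dtf in counts.items():
        nf = rows_containing[term]                           # rows with the term
        local = (math.log(dtf) + 1) / sumdtf
        normalization = unique / (1 + 0.0115 * unique)
        # negative when the term occurs in more than half of the rows
        rarity = math.log((total_rows - nf) / nf)
        weights[term] = local * normalization * rarity
    return weights


def rank(query_terms, weights):
    """R = sum of w[ti] * qnf[ti] over the terms of the query."""
    qnf = Counter(query_terms)
    return sum(weights.get(term, 0.0) * count for term, count in qnf.items())


# Made-up numbers: 100 rows in total, "trip" is common, "adventure" is rare.
row = ["exciting", "adventure", "exciting", "trip"]
rows_containing = {"exciting": 10, "adventure": 5, "trip": 60}
weights = term_weights(row, rows_containing, total_rows=100)

print(weights)
print(rank(["exciting", "adventure"], weights))
```

With these numbers the rare terms get positive weights while "trip", present in 60 of 100 rows, gets a negative one, which matches the idea that very common words should not raise the rank.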
  • 26. Full-Text Searching ● The MATCH AGAINST clause can't be told to regard words from a particular column as more important than words from other columns. For example, you might want rows to appear first in the search results when the keywords appear in a review's title. ● Alternative solution to give the title of the review twice the importance of the review itself : ALTER TABLE reviews ADD FULLTEXT KEY(title, review); ALTER TABLE reviews ADD FULLTEXT KEY(title); SELECT title, review, ROUND(MATCH(title, review) AGAINST('good excellent exciting'), 3) AS full_rel, ROUND(MATCH(title) AGAINST('good excellent exciting'), 3) AS title_rel FROM reviews WHERE MATCH(title, review) AGAINST('good excellent exciting') ORDER BY (2 * MATCH(title) AGAINST('good excellent exciting')) + MATCH(title, review) AGAINST('good excellent exciting') DESC;
  • 27. Full-Text Searching ● Boolean Mode of Full-Text Searching : In Boolean searches, the query itself specifies the relative relevance of each word in a match. When constructing a Boolean search query, you can use prefixes to modify the relative ranking of each keyword in the search string. ● Examples : 1. SELECT * from reviews where MATCH(title,review) AGAINST ('good ~bad +adventure' in BOOLEAN MODE); i.e. rows must contain the word 'adventure', rows with the word 'good' should be ranked higher and rows with the word 'bad' should be ranked lower. 2. SELECT * from reviews where MATCH(title,review) AGAINST ('good -bad +adventure' in BOOLEAN MODE); i.e. rows must contain the word 'adventure', rows with the word 'good' should be ranked higher and rows must not contain the word 'bad'. 3. SELECT * from reviews where MATCH(title,review) AGAINST ('"good adventure"' in BOOLEAN MODE); i.e. rows should contain the phrase “good adventure”.
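A phrase search like example 3 can be pictured as two steps: use an index that maps each term to the rows containing it to find candidate rows, then read those rows to check that the words really are adjacent. The sketch below models this with a plain dictionary as the inverted index; the sample reviews and function names are made up, and tokenization is a simple whitespace split.

```python
from collections import defaultdict


def build_inverted_index(rows):
    """Map each term to the set of row ids that contain it (no positions)."""
    index = defaultdict(set)
    for row_id, text in rows.items():
        for term in text.lower().split():
            index[term].add(row_id)
    return index


def phrase_search(rows, index, phrase):
    """Find rows containing the exact phrase, in two steps."""
    terms = phrase.lower().split()
    if not terms:
        return []
    # Step 1: the index only tells us which rows contain every term.
    candidates = set.intersection(*(index.get(term, set()) for term in terms))
    # Step 2: the rows themselves must be read to check word adjacency.
    matches = []
    for row_id in sorted(candidates):
        tokens = rows[row_id].lower().split()
        windows = (tokens[i:i + len(terms)] for i in range(len(tokens)))
        if any(window == terms for window in windows):
            matches.append(row_id)
    return matches


reviews = {
    1: "a good adventure in goa",
    2: "the adventure was good",
    3: "good food",
}
index = build_inverted_index(reviews)
print(phrase_search(reviews, index, "good adventure"))
```

Row 2 contains both words but not the phrase, so it survives step 1 and is only rejected after the row text is inspected in step 2.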
  • 28. Full-Text Searching ● Phrase searches tend to be quite slow. The full-text index alone can’t answer such queries, because it doesn’t record where words are located relative to each other in the original full-text collection. Consequently, the server actually has to look inside the rows to do a phrase search. To execute such a search, the server will find all documents that contain both “good” and “adventure”. It will then fetch the rows from which the documents were built, and check for the exact phrase in the collection. ● Disadvantages of Full-Text Indexing and Searching : 1. The index doesn’t record the indexed word’s position in the string, so proximity doesn’t contribute to relevance. 2. MySQL’s full-text indexing performs well when the index fits in memory, but if the index is not in memory it can be very slow, especially when the fields are large. 3. Modifying a piece of text with 100 words requires not 1 but up to 100 index operations.
  • 29. Full-Text Searching ● Disadvantages of Full-Text Indexing and Searching : (contd.) 4. The field length doesn’t usually affect other index types much, but with full-text indexing, text with 3 words and text with 10,000 words will have performance profiles that differ by orders of magnitude. 5. If there’s a full-text index and the query has a MATCH AGAINST clause that can use it, MySQL will use the full-text index to process the query. It will not compare the full-text index to the other indexes that might be used for the query. 6. The full-text search index can perform only full-text matches. Any other criteria in the query, such as WHERE clauses, must be applied after MySQL reads the row from the table. 7. Full-text indexes don’t store the actual text they index. Thus, you can never use a full-text index as a covering index. 8. Full-text indexes cannot be used for any type of sorting, other than sorting by relevance in natural-language mode. If you need to sort by something other than relevance, MySQL will use a filesort.
  • 30. Full-Text Searching ● Disadvantages of Full-Text Indexing and Searching : (contd.) SELECT * from reviews where MATCH(title, review) AGAINST ('good exciting adventure') and review_id= 879; The query will not use the index on review_id since the full-text index takes preference. It will look into the fulltext index to find all matching rows and only then apply the review_id condition from the WHERE clause. Solution: Add the review_id to the fulltext index in a string form such as 'review_id_is_879'. A FULLTEXT index can only contain CHAR, VARCHAR or TEXT columns, so the string form has to be stored in its own text column (named review_id_token here purely for illustration). ALTER TABLE reviews add FULLTEXT KEY(review_id_token, title, review); SELECT * from reviews where MATCH(review_id_token, title, review) AGAINST ('+review_id_is_879 good exciting adventure' in BOOLEAN MODE);
  • 31. References ● Introduction to Algorithms, CLRS, 3rd Edition. ● High Performance MySQL by Baron Schwartz, Peter Zaitsev and Vadim Tkachenko. Thank You