
Xadoop - new approaches to data analytics


Overview of our data analytics work, presented to the Microsoft SQL Server team during their visit to the Systems Group, ETH Zurich

Published in: Technology

  1. Xadoop - new approaches to data analytics
     Systems Group, Dept. of Computer Science, ETH Zurich, Switzerland
     Lukas Blunschi, Maxim Grinev, Maria Grineva, Donald Kossmann, Georg Polzer, Kurt Stockinger (Credit Suisse)
  2. Credit Suisse Project
       • Task: analyze Oracle query logs for audit purposes
           • Log size: 6 TB of new data every 6 months
           • Typical query: who queried column A in table B in the second quarter of 2009?
           • Only a few queries like this are run, twice a year
       • Issues:
           • Storing logs in Oracle tables is slow, so they are stored in XML files instead
           • Queries are scan-intensive because of complex log processing (SQL parsing)
  3. Possible Solutions
       • Build a warehouse
           • Not cost-effective for a few queries twice a year
       • Use Hadoop
           • Open source but proven software
           • Logs are already in files
           • Easy to implement the queries and to deploy
  4. Hadoop Solution 1: Using Pig
       • Pig: a high-level data processing language compiled to MapReduce
       • Advantages:
           • It is easy to develop in Pig
           • Extensible via User Defined Functions (UDFs) written in Java
           • Widely used by Web companies (Twitter, etc.)
       • Disadvantages:
           • A format-specific data loader has to be written to parse XML
           • Restricted support for nested queries
  5. Hadoop Solution 1: Pig Example
       • Get the users who queried table "LOGON_INFO" after a given date, sorted by number of requests:

           register ./pigxml.jar;
           define DATECOMP ch.ethz.xadoop.udf.DATECOMP();
           define XMLLoader ch.ethz.xadoop.loader.XMLLoader();

           A = load 'audit.xml' using XMLLoader() as (action, audit_type, comment_text,
                   db_user, entry_id, instance_number, object_name, object_schema,
                   os_process, os_user, return_code, scn, session_id, sql_bind,
                   sql_text, terminal, user_host, extended_timestamp);
           B = filter A by sql_text matches '.*LOGON_INFO.*'
                   and DATECOMP((chararray)extended_timestamp, '2010-03-04T10:00:43.775225') > 0;
           B1 = group B by db_user;
           B2 = foreach B1 generate group, COUNT(B.sql_text) as num_of_queries;
           B3 = order B2 by num_of_queries desc;
           dump B3;
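For readers unfamiliar with Pig, the dataflow of the script on slide 5 (filter, group, count, order) can be sketched in plain Python. This is only an illustration, not the authors' implementation: the in-memory sample records and the function name `top_users` are hypothetical, and timestamps are compared as ISO 8601 strings instead of through the `DATECOMP` UDF, whose exact semantics the slides do not show.

```python
from collections import Counter

# Hypothetical sample records; only the three fields the script uses are kept.
RECORDS = [
    {"db_user": "alice", "sql_text": "SELECT * FROM LOGON_INFO",
     "extended_timestamp": "2010-03-05T09:00:00.000000"},
    {"db_user": "bob", "sql_text": "SELECT id FROM LOGON_INFO",
     "extended_timestamp": "2010-03-06T11:30:00.000000"},
    {"db_user": "alice", "sql_text": "DELETE FROM LOGON_INFO",
     "extended_timestamp": "2010-04-01T08:15:00.000000"},
    {"db_user": "carol", "sql_text": "SELECT * FROM ACCOUNTS",
     "extended_timestamp": "2010-03-07T10:00:00.000000"},
    {"db_user": "bob", "sql_text": "SELECT * FROM LOGON_INFO",
     "extended_timestamp": "2010-01-01T00:00:00.000000"},
]

# Cutoff timestamp taken from the Pig script on the slide.
CUTOFF = "2010-03-04T10:00:43.775225"


def top_users(records, table, cutoff):
    """Count matching queries per user, most active user first."""
    # filter + group + count: fixed-width ISO 8601 timestamps in the same
    # format order correctly when compared as plain strings.
    counts = Counter(
        r["db_user"]
        for r in records
        if table in r["sql_text"] and r["extended_timestamp"] > cutoff
    )
    # order ... desc: most_common() sorts by count, highest first.
    return counts.most_common()


result = top_users(RECORDS, "LOGON_INFO", CUTOFF)
print(result)
```

Each Pig statement corresponds to one step here: `filter` to the generator condition, `group` plus `COUNT` to the `Counter`, and `order ... desc` to `most_common()`.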
  6. Hadoop Solution 1: Experiments

                      30 GB      60 GB      90 GB
       3 workers      19m 00s    40m 30s    59m 20s
       5 workers      11m 05s    26m 20s    38m 10s
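The runtimes on the experiments slide can be turned into speedup and throughput figures with a few lines of Python. This sketch rests on two assumptions. First, the flattened figures are read as a grid of 3 and 5 workers against 30, 60 and 90 GB inputs, with runtime growing with input size. Second, "Gb" is read as gigabytes, with 1 GB taken as 1024 MB. The function names are illustrative.

```python
def to_seconds(minutes, seconds):
    """Convert a 'XXm YYs' runtime into seconds."""
    return minutes * 60 + seconds


# Runtimes from the slide, keyed by (workers, input size in GB).
# The cell assignment is an assumption about the flattened table.
RUNTIMES = {
    (3, 30): to_seconds(19, 0),
    (3, 60): to_seconds(40, 30),
    (3, 90): to_seconds(59, 20),
    (5, 30): to_seconds(11, 5),
    (5, 60): to_seconds(26, 20),
    (5, 90): to_seconds(38, 10),
}


def speedup(size_gb):
    """Speedup obtained by going from 3 to 5 workers for one input size."""
    return RUNTIMES[(3, size_gb)] / RUNTIMES[(5, size_gb)]


def throughput_mb_per_s(workers, size_gb):
    """Aggregate scan throughput, assuming 1 GB = 1024 MB."""
    return size_gb * 1024 / RUNTIMES[(workers, size_gb)]


IDEAL_SPEEDUP = 5 / 3  # linear scaling from 3 to 5 workers

for size in (30, 60, 90):
    print(
        f"{size} GB: speedup {speedup(size):.2f}x "
        f"(ideal {IDEAL_SPEEDUP:.2f}x), "
        f"{throughput_mb_per_s(5, size):.1f} MB/s on 5 workers"
    )
```

Under this reading, the speedup from 3 to 5 workers is between roughly 1.5x and 1.7x, close to the linear ideal of about 1.67x.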
  7. Hadoop Solution 2: Using XQuery
       • Xadoop is an integration of XQuery (Zorba) and Hadoop:
           • Map and Reduce are implemented in XQuery
       • Advantages:
           • No need to write a loader for XML input
           • XQuery is a powerful data processing and transformation language with support for UDFs
       • Disadvantages:
           • You have to think in terms of two programming models, MapReduce and XQuery, although this is quite natural and useful in practice
  8. Hadoop Solution 2: Using XQuery

           declare function xadoop:map($record) {
             for $r in $record
             where fn:contains($r/sql_text, "LOGON_INFO")
               and xs:date($r/extended_timestamp) > xs:date("2000-03-04")
             return (<key>{$r/db_user}</key>, <value>1</value>)
           };

           declare function xadoop:reduce($key, $num) {
             ($key, <value>{fn:count($num/value)}</value>)
           };
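The division of labour between the `xadoop:map` and `xadoop:reduce` functions on slide 8 can be illustrated with a small in-memory Python sketch. It is not the Xadoop or Zorba API: the names `map_record`, `reduce_counts` and `run_job`, the sample records, and the dictionary-based shuffle step are all hypothetical stand-ins for what Hadoop does between the two phases.

```python
from collections import defaultdict
from datetime import date

# Hypothetical audit records; only the fields used by the query are present.
RECORDS = [
    {"db_user": "alice", "sql_text": "SELECT * FROM LOGON_INFO",
     "extended_timestamp": "2010-03-05T09:00:00"},
    {"db_user": "bob", "sql_text": "SELECT id FROM LOGON_INFO",
     "extended_timestamp": "2010-03-06T11:30:00"},
    {"db_user": "alice", "sql_text": "DELETE FROM LOGON_INFO",
     "extended_timestamp": "2010-04-01T08:15:00"},
    {"db_user": "carol", "sql_text": "SELECT * FROM ACCOUNTS",
     "extended_timestamp": "2010-03-07T10:00:00"},
    {"db_user": "bob", "sql_text": "SELECT * FROM LOGON_INFO",
     "extended_timestamp": "1999-12-31T23:59:59"},
]


def map_record(record, table="LOGON_INFO", cutoff=date(2000, 3, 4)):
    """Map phase: emit (db_user, 1) for every record that passes the filter."""
    # Keep only the date part of the timestamp (first 10 characters).
    record_date = date.fromisoformat(record["extended_timestamp"][:10])
    if table in record["sql_text"] and record_date > cutoff:
        yield record["db_user"], 1


def reduce_counts(key, values):
    """Reduce phase: count the values emitted for one key."""
    return key, len(values)


def run_job(records):
    """Run map, an in-memory shuffle (group by key), then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_record(record):
            groups[key].append(value)
    return dict(reduce_counts(key, values) for key, values in groups.items())


counts = run_job(RECORDS)
print(counts)
```

The map function only filters and emits key/value pairs. The grouping by key happens in the framework, so the reduce function sees all values for one user at once and only has to count them.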
  9. Future Work: Vision
       • You cannot merge traditional OLAP and OLTP systems:
           • OLAP: pre-aggregated data with redundancy
           • OLTP: data tends to be normalized
       • There are two trends on the Web:
           • Hadoop is often used for analytic processing instead of warehouses
           • Key-value stores are used for OLTP
       • MapReduce and key-value stores are a good match:
           • MapReduce takes raw operational data and aggregates it on the fly
           • A key-value store is a natural input for MapReduce
  10. Future Work: Issues
       • Running Hadoop MapReduce over the Cassandra key-value store:
           • "SQL/XQuery" over the Cassandra/BigTable data model, compiled to MapReduce
           • How to share resources (CPU, I/O) to support both transactional and analytical workloads over the same store
       • Real-time analytics:
           • From pull (batch) to push (online) processing models
           • Hadoop is slow but can be optimized (e.g. by checkpointing into the main memory of another cloud machine instead of to local disk)
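The difference between the pull (batch) and push (online) processing models named on the last slide can be made concrete with a toy Python example. This is a conceptual sketch only. The event format and the names `batch_counts` and `OnlineCounts` are hypothetical, and nothing here reflects how Hadoop or Cassandra would implement either model.

```python
from collections import Counter


def batch_counts(events):
    """Pull model: rescan the full event log every time a result is needed."""
    return Counter(user for user, _query in events)


class OnlineCounts:
    """Push model: keep the aggregate up to date as each event arrives."""

    def __init__(self):
        self.counts = Counter()

    def push(self, event):
        user, _query = event
        self.counts[user] += 1
        # The current answer is available immediately, without a rescan.
        return self.counts[user]


# Hypothetical (user, query) events.
events = [("alice", "q1"), ("bob", "q2"), ("alice", "q3")]

online = OnlineCounts()
for event in events:
    online.push(event)

# Both models arrive at the same aggregate; they differ in when the work is done.
print(batch_counts(events) == online.counts)
```

In the batch version the cost of answering a query grows with the size of the log. In the online version each event costs a constant amount of work and the answer is always current.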