Se ha denunciado esta presentación.
Utilizamos tu perfil de LinkedIn y tus datos de actividad para personalizar los anuncios y mostrarte publicidad más relevante. Puedes cambiar tus preferencias de publicidad en cualquier momento.

Data Skipping Technology

60 visualizaciones

Publicado el

FinTech and InsuranceTech case studies digitally transforming Europe's future with BigData and AI
The new data-driven industrial revolution highlights the need for big data technologies to unlock the potential in various application domains. The insurance and finance services industry is rapidly transformed by data-intensive operations and applications. FinTech and InsuranceTech combine very large datasets from legacy banking systems with other data sources such as financial markets data, regulatory datasets, real-time retail transactions, and more, improving financial services and activities for customers.

Publicado en: Tecnología
  • Sé el primero en comentar

  • Sé el primero en recomendar esto

Data Skipping Technology

  1. 1. Data Skipping technology Yosef Moatti IBM Haifa Research Labs moatti@il.ibm.com May the 14th – 20
  2. 2. BigDataStack & INFINITECH Webinar 2 1. Big (semi) structured data 2. Object Storage 3. Apache Spark: one of the most popular Analytics engines for big data 4. More specifically Apache Spark SQL Data Skipping: technology context 14.05.2020
  3. 3. Modern BigData architectures decouple storage from compute Hence: Big data analytics implies a (big) data transfer Therefore: Data Skipping is important for BigDataStack Proof: Data Skipping is already incorporated into IBM products (see status slide) BigDataStack & INFINITECH Webinar 3 Contribution to BigDataStack requirements 14.05.2020
  4. 4. BigDataStack & INFINITECH Webinar 4 1. Big (semi) structured data 2. Object Storage 3. Apache Spark: one of the most popular Analytics engines for big data 4. More specifically Apache Spark SQL Data Skipping goal is to minimize the data transfer Data transfer Data Skipping: goal 14.05.2020
  5. 5. Example: Look for data in violent storm conditions SELECT vessel_code, datetime, longitude, latitude, wind_speed FROM cos://us-south/…/danaos stored as parquet WHERE wind_speed > 30 Determine which objects are NOT relevant for a SQL query using a data skipping index Stores summary index metadata for each object No change to Apache Spark base code “just” take advantage of a Spark external API which permits to refine the set of objects relevant to a given query. Skip over irrelevant objects Skipped objects are not touched at all IBM Data skipping is state of the art. It comprises: UDFs support Data Skipping for LIKE and general regular expressions Multiple indexes including Bloom Filter, etc… Plug-in user indexes Etc… è Saves time and $ Index of All Objects QueryData Skipping Index Candidate Objects WHERE Clause Save Time and $ Data Set Objects Min/max index on wind_speed column Possibly relevant object are ingested Data Skipping: how does it work? BigDataStack & INFINITECH Webinar14.05.2020
  6. 6. Data Skipping within the maritime use case 6 Danaos vessel dataset Typical SQL queries: SELECT vessel_code, datetime, longitude, latitude, wind_speed FROM cos://us-south/…/danaos stored as parquet WHERE wind_speed > 30 The Object Storage: either remote IBM Cloud Object Storage (COS) (as shown) or local Depending on query and data partitioning we get up to 99% reduction in data transfer BigDataStack & INFINITECH Webinar14.05.2020
  7. 7. Data Skipping for the BigDataStack maritime dataset demonstrated at THINK ’19 (joint work with Danaos: the BigDatastack maritime use-case partner) Data Skipping: status BigDataStack & INFINITECH Webinar14.05.2020 IBM Products Database Catalog Support added to IBM Cloud SQL Query Blog published here Released as open beta Data Skipping integrated within IBM Cloud SQL Query Blog published here Released as closed beta (as of now) IBM Analytics Engine (IAE) Data Skipping feature available as open beta That’s all … for now Scientific paper soon to be submitted
  8. 8. Data partitioning Push for additional adoption Decouple Data Skipping from Apache Spark … BigDataStack & INFINITECH Webinar 8 Data Skipping: next steps 14.05.2020
  9. 9. The team Team: Oshrit Feder Guy Khazma Gal Lushi Yosef Moatti Paula Ta-Shma Technical contact point: PAULA@il.ibm.com 07.05.2020 BigDataStack & INFINITECH Webinar 9

×