SlideShare una empresa de Scribd logo
1 de 9
Descargar para leer sin conexión
Data Skipping technology
Yosef Moatti
IBM Haifa Research Labs
moatti@il.ibm.com
May the 14th – 20
BigDataStack & INFINITECH Webinar 2
1. Big (semi) structured data
2. Object Storage
3. Apache Spark: one of the most popular Analytics
engines for big data
4. More specifically Apache Spark SQL
Data Skipping: technology context
14.05.2020
Modern BigData architectures decouple storage from
compute
Hence:
Big data analytics implies a (big) data transfer
Therefore:
Data Skipping is important for BigDataStack
Proof: Data Skipping is already incorporated into IBM
products (see status slide)
BigDataStack & INFINITECH Webinar 3
Contribution to BigDataStack requirements
14.05.2020
BigDataStack & INFINITECH Webinar 4
1. Big (semi) structured data
2. Object Storage
3. Apache Spark: one of the most popular Analytics
engines for big data
4. More specifically Apache Spark SQL
Data Skipping goal is to
minimize the data transfer
Data transfer
Data Skipping: goal
14.05.2020
Example: Look for data in violent storm conditions
SELECT vessel_code, datetime, longitude, latitude, wind_speed
FROM cos://us-south/…/danaos stored as parquet
WHERE wind_speed > 30
Determine which objects are NOT relevant for a
SQL query using a data skipping index
Stores summary index metadata for each object
No change to Apache Spark base code
“just” take advantage of a Spark external API which
permits to refine the set of objects relevant to a given
query.
Skip over irrelevant objects
Skipped objects are not touched at all
IBM Data skipping is state of the art.
It comprises:
UDFs support
Data Skipping for LIKE and general regular expressions
Multiple indexes including Bloom Filter, etc…
Plug-in user indexes
Etc…
è Saves time and $
Index of All
Objects
QueryData Skipping
Index
Candidate
Objects
WHERE
Clause
Save Time
and $
Data Set Objects
Min/max index on
wind_speed column
Possibly relevant
object are ingested
Data Skipping: how does it work?
BigDataStack & INFINITECH Webinar14.05.2020
Data Skipping within the maritime use case
6
Danaos vessel dataset
Typical SQL queries: SELECT vessel_code, datetime, longitude, latitude, wind_speed
FROM cos://us-south/…/danaos stored as parquet WHERE wind_speed > 30
The Object Storage: either remote IBM Cloud Object Storage (COS) (as shown) or local
Depending on query and data partitioning we get up to 99% reduction in data transfer
BigDataStack & INFINITECH Webinar14.05.2020
Data Skipping for the BigDataStack maritime dataset demonstrated at THINK ’19
(joint work with Danaos: the BigDatastack maritime use-case partner)
Data Skipping: status
BigDataStack & INFINITECH Webinar14.05.2020
IBM Products
Database Catalog Support added to IBM Cloud SQL Query
Blog published here
Released as open beta
Data Skipping integrated within IBM Cloud SQL Query
Blog published here
Released as closed beta (as of now)
IBM Analytics Engine (IAE)
Data Skipping feature available as open beta
That’s all … for now
Scientific paper soon to be submitted
Data partitioning
Push for additional adoption
Decouple Data Skipping from Apache Spark
…
BigDataStack & INFINITECH Webinar 8
Data Skipping: next steps
14.05.2020
The team
Team:
Oshrit Feder
Guy Khazma
Gal Lushi
Yosef Moatti
Paula Ta-Shma
Technical contact point: PAULA@il.ibm.com
07.05.2020 BigDataStack & INFINITECH Webinar 9

Más contenido relacionado

La actualidad más candente

T6.6 Sensitive Data Activities
T6.6 Sensitive Data ActivitiesT6.6 Sensitive Data Activities
T6.6 Sensitive Data ActivitiesOpenAIRE
 
Data analytics and downscaling for climate research in a big data world
Data analytics and downscaling for climate research in a big data worldData analytics and downscaling for climate research in a big data world
Data analytics and downscaling for climate research in a big data worldBigData_Europe
 
SC1 Workshop 2 Pilot instantiations
SC1 Workshop 2 Pilot instantiationsSC1 Workshop 2 Pilot instantiations
SC1 Workshop 2 Pilot instantiationsBigData_Europe
 
BDE-BDVA Webinar: BigDataEurope Overview & Synergies with BDVA
BDE-BDVA Webinar: BigDataEurope Overview & Synergies with BDVABDE-BDVA Webinar: BigDataEurope Overview & Synergies with BDVA
BDE-BDVA Webinar: BigDataEurope Overview & Synergies with BDVABigData_Europe
 
PhD Projects in Green Cloud Computing Research Guidance
PhD Projects in Green Cloud Computing Research GuidancePhD Projects in Green Cloud Computing Research Guidance
PhD Projects in Green Cloud Computing Research GuidancePhD Services
 
Tor Hovland: Taking a swim in the big data lake
Tor Hovland: Taking a swim in the big data lakeTor Hovland: Taking a swim in the big data lake
Tor Hovland: Taking a swim in the big data lakeAnalyticsConf
 
Neo4j GraphTour New York_Thomson Reuters SS
Neo4j GraphTour New York_Thomson Reuters SSNeo4j GraphTour New York_Thomson Reuters SS
Neo4j GraphTour New York_Thomson Reuters SSNeo4j
 
Mundi Presentation - A Space of New Opportunities
Mundi Presentation - A Space of New OpportunitiesMundi Presentation - A Space of New Opportunities
Mundi Presentation - A Space of New Opportunitiesplan4all
 
Extraction of significant patterns from heart disease warehouses for heart at...
Extraction of significant patterns from heart disease warehouses for heart at...Extraction of significant patterns from heart disease warehouses for heart at...
Extraction of significant patterns from heart disease warehouses for heart at...Alyaa Muhi
 
Journées ABES 2014 - Projet CIB - Uwe Rich
Journées ABES 2014 - Projet CIB - Uwe Rich Journées ABES 2014 - Projet CIB - Uwe Rich
Journées ABES 2014 - Projet CIB - Uwe Rich ABES
 
Hack reduce introduction
Hack reduce introductionHack reduce introduction
Hack reduce introductionmontrealouvert
 
Let's downscale the semantic web !
Let's downscale the semantic web !Let's downscale the semantic web !
Let's downscale the semantic web !Christophe Guéret
 
Apache Big_Data Europe event: "Integrators at work! Real-life applications of...
Apache Big_Data Europe event: "Integrators at work! Real-life applications of...Apache Big_Data Europe event: "Integrators at work! Real-life applications of...
Apache Big_Data Europe event: "Integrators at work! Real-life applications of...BigData_Europe
 
Open source for customer analytics
Open source for customer analyticsOpen source for customer analytics
Open source for customer analyticsMatthias Funke
 
ESCAPE Kick-off meeting - HL-LHC ESFRI Landmark (Feb 2019)
ESCAPE Kick-off meeting - HL-LHC ESFRI Landmark (Feb 2019)ESCAPE Kick-off meeting - HL-LHC ESFRI Landmark (Feb 2019)
ESCAPE Kick-off meeting - HL-LHC ESFRI Landmark (Feb 2019)ESCAPE EU
 
Exposing the data from NARCIS with VIVO
Exposing the data from NARCIS with VIVOExposing the data from NARCIS with VIVO
Exposing the data from NARCIS with VIVOChristophe Guéret
 
BDE SC3.3 Workshop - Agenda
 BDE SC3.3 Workshop - Agenda BDE SC3.3 Workshop - Agenda
BDE SC3.3 Workshop - AgendaBigData_Europe
 

La actualidad más candente (20)

T6.6 Sensitive Data Activities
T6.6 Sensitive Data ActivitiesT6.6 Sensitive Data Activities
T6.6 Sensitive Data Activities
 
Data analytics and downscaling for climate research in a big data world
Data analytics and downscaling for climate research in a big data worldData analytics and downscaling for climate research in a big data world
Data analytics and downscaling for climate research in a big data world
 
SC1 Workshop 2 Pilot instantiations
SC1 Workshop 2 Pilot instantiationsSC1 Workshop 2 Pilot instantiations
SC1 Workshop 2 Pilot instantiations
 
BDE-BDVA Webinar: BigDataEurope Overview & Synergies with BDVA
BDE-BDVA Webinar: BigDataEurope Overview & Synergies with BDVABDE-BDVA Webinar: BigDataEurope Overview & Synergies with BDVA
BDE-BDVA Webinar: BigDataEurope Overview & Synergies with BDVA
 
Big data hadoop
Big data hadoopBig data hadoop
Big data hadoop
 
Helix Nebula Initiative
Helix Nebula InitiativeHelix Nebula Initiative
Helix Nebula Initiative
 
PhD Projects in Green Cloud Computing Research Guidance
PhD Projects in Green Cloud Computing Research GuidancePhD Projects in Green Cloud Computing Research Guidance
PhD Projects in Green Cloud Computing Research Guidance
 
Tor Hovland: Taking a swim in the big data lake
Tor Hovland: Taking a swim in the big data lakeTor Hovland: Taking a swim in the big data lake
Tor Hovland: Taking a swim in the big data lake
 
Neo4j GraphTour New York_Thomson Reuters SS
Neo4j GraphTour New York_Thomson Reuters SSNeo4j GraphTour New York_Thomson Reuters SS
Neo4j GraphTour New York_Thomson Reuters SS
 
Mundi Presentation - A Space of New Opportunities
Mundi Presentation - A Space of New OpportunitiesMundi Presentation - A Space of New Opportunities
Mundi Presentation - A Space of New Opportunities
 
Extraction of significant patterns from heart disease warehouses for heart at...
Extraction of significant patterns from heart disease warehouses for heart at...Extraction of significant patterns from heart disease warehouses for heart at...
Extraction of significant patterns from heart disease warehouses for heart at...
 
Journées ABES 2014 - Projet CIB - Uwe Rich
Journées ABES 2014 - Projet CIB - Uwe Rich Journées ABES 2014 - Projet CIB - Uwe Rich
Journées ABES 2014 - Projet CIB - Uwe Rich
 
Hack reduce introduction
Hack reduce introductionHack reduce introduction
Hack reduce introduction
 
Let's downscale the semantic web !
Let's downscale the semantic web !Let's downscale the semantic web !
Let's downscale the semantic web !
 
Apache Big_Data Europe event: "Integrators at work! Real-life applications of...
Apache Big_Data Europe event: "Integrators at work! Real-life applications of...Apache Big_Data Europe event: "Integrators at work! Real-life applications of...
Apache Big_Data Europe event: "Integrators at work! Real-life applications of...
 
20110922 owf
20110922 owf20110922 owf
20110922 owf
 
Open source for customer analytics
Open source for customer analyticsOpen source for customer analytics
Open source for customer analytics
 
ESCAPE Kick-off meeting - HL-LHC ESFRI Landmark (Feb 2019)
ESCAPE Kick-off meeting - HL-LHC ESFRI Landmark (Feb 2019)ESCAPE Kick-off meeting - HL-LHC ESFRI Landmark (Feb 2019)
ESCAPE Kick-off meeting - HL-LHC ESFRI Landmark (Feb 2019)
 
Exposing the data from NARCIS with VIVO
Exposing the data from NARCIS with VIVOExposing the data from NARCIS with VIVO
Exposing the data from NARCIS with VIVO
 
BDE SC3.3 Workshop - Agenda
 BDE SC3.3 Workshop - Agenda BDE SC3.3 Workshop - Agenda
BDE SC3.3 Workshop - Agenda
 

Similar a Data Skipping Technology

IBM THINK 2018 - IBM Cloud SQL Query Introduction
IBM THINK 2018 - IBM Cloud SQL Query IntroductionIBM THINK 2018 - IBM Cloud SQL Query Introduction
IBM THINK 2018 - IBM Cloud SQL Query IntroductionTorsten Steinbach
 
Research data management 1.5
Research data management 1.5Research data management 1.5
Research data management 1.5John Martin
 
Big data presentation, explanations and use cases in industrial sector
Big data presentation, explanations and use cases in industrial sectorBig data presentation, explanations and use cases in industrial sector
Big data presentation, explanations and use cases in industrial sectorNicolas Sarramagna
 
Logical Data Fabric: Architectural Components
Logical Data Fabric: Architectural ComponentsLogical Data Fabric: Architectural Components
Logical Data Fabric: Architectural ComponentsDenodo
 
Designing a Scalable Twitter - Patterns for Designing Scalable Real-Time Web ...
Designing a Scalable Twitter - Patterns for Designing Scalable Real-Time Web ...Designing a Scalable Twitter - Patterns for Designing Scalable Real-Time Web ...
Designing a Scalable Twitter - Patterns for Designing Scalable Real-Time Web ...Nati Shalom
 
OCC Overview OMG Clouds Meeting 07-13-09 v3
OCC Overview OMG Clouds Meeting 07-13-09 v3OCC Overview OMG Clouds Meeting 07-13-09 v3
OCC Overview OMG Clouds Meeting 07-13-09 v3Robert Grossman
 
Apache Storm
Apache StormApache Storm
Apache StormEdureka!
 
Linking HPC to Data Management - EUDAT Summer School (Giuseppe Fiameni, CINECA)
Linking HPC to Data Management - EUDAT Summer School (Giuseppe Fiameni, CINECA)Linking HPC to Data Management - EUDAT Summer School (Giuseppe Fiameni, CINECA)
Linking HPC to Data Management - EUDAT Summer School (Giuseppe Fiameni, CINECA)EUDAT
 
ESA and the Cloud
ESA and the CloudESA and the Cloud
ESA and the CloudNetcetera
 
Snowflake’s Cloud Data Platform and Modern Analytics
Snowflake’s Cloud Data Platform and Modern AnalyticsSnowflake’s Cloud Data Platform and Modern Analytics
Snowflake’s Cloud Data Platform and Modern AnalyticsSenturus
 
IoT to Cloud: Middle Layer (e.g Gateway, Hubs, Fog, Edge Computing)
IoT to Cloud: Middle Layer (e.g Gateway, Hubs, Fog, Edge Computing)IoT to Cloud: Middle Layer (e.g Gateway, Hubs, Fog, Edge Computing)
IoT to Cloud: Middle Layer (e.g Gateway, Hubs, Fog, Edge Computing)Bob Marcus
 
BUILDING BETTER PREDICTIVE MODELS WITH COGNITIVE ASSISTANCE IN A DATA SCIENCE...
BUILDING BETTER PREDICTIVE MODELS WITH COGNITIVE ASSISTANCE IN A DATA SCIENCE...BUILDING BETTER PREDICTIVE MODELS WITH COGNITIVE ASSISTANCE IN A DATA SCIENCE...
BUILDING BETTER PREDICTIVE MODELS WITH COGNITIVE ASSISTANCE IN A DATA SCIENCE...Alex Liu
 
Rank | Analyse | Lead | Search
Rank | Analyse | Lead | SearchRank | Analyse | Lead | Search
Rank | Analyse | Lead | Searchsopekmir
 
CSIT 534 Presentation Cherri_edmond
CSIT 534 Presentation Cherri_edmondCSIT 534 Presentation Cherri_edmond
CSIT 534 Presentation Cherri_edmondlowedmond
 

Similar a Data Skipping Technology (20)

IBM THINK 2018 - IBM Cloud SQL Query Introduction
IBM THINK 2018 - IBM Cloud SQL Query IntroductionIBM THINK 2018 - IBM Cloud SQL Query Introduction
IBM THINK 2018 - IBM Cloud SQL Query Introduction
 
Research data management 1.5
Research data management 1.5Research data management 1.5
Research data management 1.5
 
Big data presentation, explanations and use cases in industrial sector
Big data presentation, explanations and use cases in industrial sectorBig data presentation, explanations and use cases in industrial sector
Big data presentation, explanations and use cases in industrial sector
 
Logical Data Fabric: Architectural Components
Logical Data Fabric: Architectural ComponentsLogical Data Fabric: Architectural Components
Logical Data Fabric: Architectural Components
 
Designing a Scalable Twitter - Patterns for Designing Scalable Real-Time Web ...
Designing a Scalable Twitter - Patterns for Designing Scalable Real-Time Web ...Designing a Scalable Twitter - Patterns for Designing Scalable Real-Time Web ...
Designing a Scalable Twitter - Patterns for Designing Scalable Real-Time Web ...
 
OCC Overview OMG Clouds Meeting 07-13-09 v3
OCC Overview OMG Clouds Meeting 07-13-09 v3OCC Overview OMG Clouds Meeting 07-13-09 v3
OCC Overview OMG Clouds Meeting 07-13-09 v3
 
Final Presentation
Final PresentationFinal Presentation
Final Presentation
 
Solving Big Data Problems
Solving Big Data ProblemsSolving Big Data Problems
Solving Big Data Problems
 
Apache Storm
Apache StormApache Storm
Apache Storm
 
Linking HPC to Data Management - EUDAT Summer School (Giuseppe Fiameni, CINECA)
Linking HPC to Data Management - EUDAT Summer School (Giuseppe Fiameni, CINECA)Linking HPC to Data Management - EUDAT Summer School (Giuseppe Fiameni, CINECA)
Linking HPC to Data Management - EUDAT Summer School (Giuseppe Fiameni, CINECA)
 
Cloud Storage for all
Cloud Storage for allCloud Storage for all
Cloud Storage for all
 
ESA and the Cloud
ESA and the CloudESA and the Cloud
ESA and the Cloud
 
IoT - module 4
IoT - module 4IoT - module 4
IoT - module 4
 
Snowflake’s Cloud Data Platform and Modern Analytics
Snowflake’s Cloud Data Platform and Modern AnalyticsSnowflake’s Cloud Data Platform and Modern Analytics
Snowflake’s Cloud Data Platform and Modern Analytics
 
IoT to Cloud: Middle Layer (e.g Gateway, Hubs, Fog, Edge Computing)
IoT to Cloud: Middle Layer (e.g Gateway, Hubs, Fog, Edge Computing)IoT to Cloud: Middle Layer (e.g Gateway, Hubs, Fog, Edge Computing)
IoT to Cloud: Middle Layer (e.g Gateway, Hubs, Fog, Edge Computing)
 
BUILDING BETTER PREDICTIVE MODELS WITH COGNITIVE ASSISTANCE IN A DATA SCIENCE...
BUILDING BETTER PREDICTIVE MODELS WITH COGNITIVE ASSISTANCE IN A DATA SCIENCE...BUILDING BETTER PREDICTIVE MODELS WITH COGNITIVE ASSISTANCE IN A DATA SCIENCE...
BUILDING BETTER PREDICTIVE MODELS WITH COGNITIVE ASSISTANCE IN A DATA SCIENCE...
 
Rank | Analyse | Lead | Search
Rank | Analyse | Lead | SearchRank | Analyse | Lead | Search
Rank | Analyse | Lead | Search
 
Webinar Data Mesh - Part 3
Webinar Data Mesh - Part 3Webinar Data Mesh - Part 3
Webinar Data Mesh - Part 3
 
CSIT 534 Presentation Cherri_edmond
CSIT 534 Presentation Cherri_edmondCSIT 534 Presentation Cherri_edmond
CSIT 534 Presentation Cherri_edmond
 
Big Data Seervices in Danaos Use Case
Big Data Seervices in Danaos Use CaseBig Data Seervices in Danaos Use Case
Big Data Seervices in Danaos Use Case
 

Más de Big Data Value Association

Data Privacy, Security in personal data sharing
Data Privacy, Security in personal data sharingData Privacy, Security in personal data sharing
Data Privacy, Security in personal data sharingBig Data Value Association
 
Key Modules for a trsuted and privacy preserving personal data marketplace
Key Modules for a trsuted and privacy preserving personal data marketplaceKey Modules for a trsuted and privacy preserving personal data marketplace
Key Modules for a trsuted and privacy preserving personal data marketplaceBig Data Value Association
 
GDPR and Data Ethics considerations in personal data sharing
GDPR and Data Ethics considerations in personal data sharingGDPR and Data Ethics considerations in personal data sharing
GDPR and Data Ethics considerations in personal data sharingBig Data Value Association
 
Intro - Three pillars for building a Smart Data Ecosystem: Trust, Security an...
Intro - Three pillars for building a Smart Data Ecosystem: Trust, Security an...Intro - Three pillars for building a Smart Data Ecosystem: Trust, Security an...
Intro - Three pillars for building a Smart Data Ecosystem: Trust, Security an...Big Data Value Association
 
Three pillars for building a Smart Data Ecosystem: Trust, Security and Privacy
Three pillars for building a Smart Data Ecosystem: Trust, Security and PrivacyThree pillars for building a Smart Data Ecosystem: Trust, Security and Privacy
Three pillars for building a Smart Data Ecosystem: Trust, Security and PrivacyBig Data Value Association
 
Market into context - Three pillars for building a Smart Data Ecosystem: Trus...
Market into context - Three pillars for building a Smart Data Ecosystem: Trus...Market into context - Three pillars for building a Smart Data Ecosystem: Trus...
Market into context - Three pillars for building a Smart Data Ecosystem: Trus...Big Data Value Association
 
BDV Skills Accreditation - Future of digital skills in Europe reskilling and ...
BDV Skills Accreditation - Future of digital skills in Europe reskilling and ...BDV Skills Accreditation - Future of digital skills in Europe reskilling and ...
BDV Skills Accreditation - Future of digital skills in Europe reskilling and ...Big Data Value Association
 
BDV Skills Accreditation - Big Data skilling in Emilia-Romagna
BDV Skills Accreditation - Big Data skilling in Emilia-Romagna BDV Skills Accreditation - Big Data skilling in Emilia-Romagna
BDV Skills Accreditation - Big Data skilling in Emilia-Romagna Big Data Value Association
 
BDV Skills Accreditation - EIT labels for professionals
BDV Skills Accreditation - EIT labels for professionalsBDV Skills Accreditation - EIT labels for professionals
BDV Skills Accreditation - EIT labels for professionalsBig Data Value Association
 
BDV Skills Accreditation - Recognizing Data Science Skills with BDV Data Scie...
BDV Skills Accreditation - Recognizing Data Science Skills with BDV Data Scie...BDV Skills Accreditation - Recognizing Data Science Skills with BDV Data Scie...
BDV Skills Accreditation - Recognizing Data Science Skills with BDV Data Scie...Big Data Value Association
 
BDV Skills Accreditation - Objectives of the workshop
BDV Skills Accreditation - Objectives of the workshopBDV Skills Accreditation - Objectives of the workshop
BDV Skills Accreditation - Objectives of the workshopBig Data Value Association
 
BDV Skills Accreditation - Welcome introduction to the workshop
BDV Skills Accreditation - Welcome introduction to the workshopBDV Skills Accreditation - Welcome introduction to the workshop
BDV Skills Accreditation - Welcome introduction to the workshopBig Data Value Association
 
BDV Skills Accreditation - Definition and ensuring of digital roles and compe...
BDV Skills Accreditation - Definition and ensuring of digital roles and compe...BDV Skills Accreditation - Definition and ensuring of digital roles and compe...
BDV Skills Accreditation - Definition and ensuring of digital roles and compe...Big Data Value Association
 
BigDataPilotDemoDays - I BiDaaS Application to the Manufacturing Sector Webinar
BigDataPilotDemoDays - I BiDaaS Application to the Manufacturing Sector WebinarBigDataPilotDemoDays - I BiDaaS Application to the Manufacturing Sector Webinar
BigDataPilotDemoDays - I BiDaaS Application to the Manufacturing Sector WebinarBig Data Value Association
 
BigDataPilotDemoDays - I-BiDaaS Application to the Financial Sector Webinar
BigDataPilotDemoDays - I-BiDaaS Application to the Financial Sector WebinarBigDataPilotDemoDays - I-BiDaaS Application to the Financial Sector Webinar
BigDataPilotDemoDays - I-BiDaaS Application to the Financial Sector WebinarBig Data Value Association
 
Virtual BenchLearning - DeepHealth - Needs & Requirements for Benchmarking
Virtual BenchLearning - DeepHealth - Needs & Requirements for BenchmarkingVirtual BenchLearning - DeepHealth - Needs & Requirements for Benchmarking
Virtual BenchLearning - DeepHealth - Needs & Requirements for BenchmarkingBig Data Value Association
 
Virtual BenchLearning - I-BiDaaS - Industrial-Driven Big Data as a Self-Servi...
Virtual BenchLearning - I-BiDaaS - Industrial-Driven Big Data as a Self-Servi...Virtual BenchLearning - I-BiDaaS - Industrial-Driven Big Data as a Self-Servi...
Virtual BenchLearning - I-BiDaaS - Industrial-Driven Big Data as a Self-Servi...Big Data Value Association
 
Policy Cloud Data Driven Policies against Radicalisation - Technical Overview
Policy Cloud Data Driven Policies against Radicalisation - Technical OverviewPolicy Cloud Data Driven Policies against Radicalisation - Technical Overview
Policy Cloud Data Driven Policies against Radicalisation - Technical OverviewBig Data Value Association
 
Policy Cloud Data Driven Policies against Radicalisation - Participatory poli...
Policy Cloud Data Driven Policies against Radicalisation - Participatory poli...Policy Cloud Data Driven Policies against Radicalisation - Participatory poli...
Policy Cloud Data Driven Policies against Radicalisation - Participatory poli...Big Data Value Association
 

Más de Big Data Value Association (20)

Data Privacy, Security in personal data sharing
Data Privacy, Security in personal data sharingData Privacy, Security in personal data sharing
Data Privacy, Security in personal data sharing
 
Key Modules for a trsuted and privacy preserving personal data marketplace
Key Modules for a trsuted and privacy preserving personal data marketplaceKey Modules for a trsuted and privacy preserving personal data marketplace
Key Modules for a trsuted and privacy preserving personal data marketplace
 
GDPR and Data Ethics considerations in personal data sharing
GDPR and Data Ethics considerations in personal data sharingGDPR and Data Ethics considerations in personal data sharing
GDPR and Data Ethics considerations in personal data sharing
 
Intro - Three pillars for building a Smart Data Ecosystem: Trust, Security an...
Intro - Three pillars for building a Smart Data Ecosystem: Trust, Security an...Intro - Three pillars for building a Smart Data Ecosystem: Trust, Security an...
Intro - Three pillars for building a Smart Data Ecosystem: Trust, Security an...
 
Three pillars for building a Smart Data Ecosystem: Trust, Security and Privacy
Three pillars for building a Smart Data Ecosystem: Trust, Security and PrivacyThree pillars for building a Smart Data Ecosystem: Trust, Security and Privacy
Three pillars for building a Smart Data Ecosystem: Trust, Security and Privacy
 
Market into context - Three pillars for building a Smart Data Ecosystem: Trus...
Market into context - Three pillars for building a Smart Data Ecosystem: Trus...Market into context - Three pillars for building a Smart Data Ecosystem: Trus...
Market into context - Three pillars for building a Smart Data Ecosystem: Trus...
 
BDV Skills Accreditation - Future of digital skills in Europe reskilling and ...
BDV Skills Accreditation - Future of digital skills in Europe reskilling and ...BDV Skills Accreditation - Future of digital skills in Europe reskilling and ...
BDV Skills Accreditation - Future of digital skills in Europe reskilling and ...
 
BDV Skills Accreditation - Big Data skilling in Emilia-Romagna
BDV Skills Accreditation - Big Data skilling in Emilia-Romagna BDV Skills Accreditation - Big Data skilling in Emilia-Romagna
BDV Skills Accreditation - Big Data skilling in Emilia-Romagna
 
BDV Skills Accreditation - EIT labels for professionals
BDV Skills Accreditation - EIT labels for professionalsBDV Skills Accreditation - EIT labels for professionals
BDV Skills Accreditation - EIT labels for professionals
 
BDV Skills Accreditation - Recognizing Data Science Skills with BDV Data Scie...
BDV Skills Accreditation - Recognizing Data Science Skills with BDV Data Scie...BDV Skills Accreditation - Recognizing Data Science Skills with BDV Data Scie...
BDV Skills Accreditation - Recognizing Data Science Skills with BDV Data Scie...
 
BDV Skills Accreditation - Objectives of the workshop
BDV Skills Accreditation - Objectives of the workshopBDV Skills Accreditation - Objectives of the workshop
BDV Skills Accreditation - Objectives of the workshop
 
BDV Skills Accreditation - Welcome introduction to the workshop
BDV Skills Accreditation - Welcome introduction to the workshopBDV Skills Accreditation - Welcome introduction to the workshop
BDV Skills Accreditation - Welcome introduction to the workshop
 
BDV Skills Accreditation - Definition and ensuring of digital roles and compe...
BDV Skills Accreditation - Definition and ensuring of digital roles and compe...BDV Skills Accreditation - Definition and ensuring of digital roles and compe...
BDV Skills Accreditation - Definition and ensuring of digital roles and compe...
 
BigDataPilotDemoDays - I BiDaaS Application to the Manufacturing Sector Webinar
BigDataPilotDemoDays - I BiDaaS Application to the Manufacturing Sector WebinarBigDataPilotDemoDays - I BiDaaS Application to the Manufacturing Sector Webinar
BigDataPilotDemoDays - I BiDaaS Application to the Manufacturing Sector Webinar
 
BigDataPilotDemoDays - I-BiDaaS Application to the Financial Sector Webinar
BigDataPilotDemoDays - I-BiDaaS Application to the Financial Sector WebinarBigDataPilotDemoDays - I-BiDaaS Application to the Financial Sector Webinar
BigDataPilotDemoDays - I-BiDaaS Application to the Financial Sector Webinar
 
Virtual BenchLearning - Data Bench Framework
Virtual BenchLearning - Data Bench FrameworkVirtual BenchLearning - Data Bench Framework
Virtual BenchLearning - Data Bench Framework
 
Virtual BenchLearning - DeepHealth - Needs & Requirements for Benchmarking
Virtual BenchLearning - DeepHealth - Needs & Requirements for BenchmarkingVirtual BenchLearning - DeepHealth - Needs & Requirements for Benchmarking
Virtual BenchLearning - DeepHealth - Needs & Requirements for Benchmarking
 
Virtual BenchLearning - I-BiDaaS - Industrial-Driven Big Data as a Self-Servi...
Virtual BenchLearning - I-BiDaaS - Industrial-Driven Big Data as a Self-Servi...Virtual BenchLearning - I-BiDaaS - Industrial-Driven Big Data as a Self-Servi...
Virtual BenchLearning - I-BiDaaS - Industrial-Driven Big Data as a Self-Servi...
 
Policy Cloud Data Driven Policies against Radicalisation - Technical Overview
Policy Cloud Data Driven Policies against Radicalisation - Technical OverviewPolicy Cloud Data Driven Policies against Radicalisation - Technical Overview
Policy Cloud Data Driven Policies against Radicalisation - Technical Overview
 
Policy Cloud Data Driven Policies against Radicalisation - Participatory poli...
Policy Cloud Data Driven Policies against Radicalisation - Participatory poli...Policy Cloud Data Driven Policies against Radicalisation - Participatory poli...
Policy Cloud Data Driven Policies against Radicalisation - Participatory poli...
 

Último

Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 

Último (20)

Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 

Data Skipping Technology

  • 1. Data Skipping technology Yosef Moatti IBM Haifa Research Labs moatti@il.ibm.com May the 14th – 20
  • 2. BigDataStack & INFINITECH Webinar 2 1. Big (semi) structured data 2. Object Storage 3. Apache Spark: one of the most popular Analytics engines for big data 4. More specifically Apache Spark SQL Data Skipping: technology context 14.05.2020
  • 3. Modern BigData architectures decouple storage from compute Hence: Big data analytics implies a (big) data transfer Therefore: Data Skipping is important for BigDataStack Proof: Data Skipping is already incorporated into IBM products (see status slide) BigDataStack & INFINITECH Webinar 3 Contribution to BigDataStack requirements 14.05.2020
  • 4. BigDataStack & INFINITECH Webinar 4 1. Big (semi) structured data 2. Object Storage 3. Apache Spark: one of the most popular Analytics engines for big data 4. More specifically Apache Spark SQL Data Skipping goal is to minimize the data transfer Data transfer Data Skipping: goal 14.05.2020
  • 5. Example: Look for data in violent storm conditions SELECT vessel_code, datetime, longitude, latitude, wind_speed FROM cos://us-south/…/danaos stored as parquet WHERE wind_speed > 30 Determine which objects are NOT relevant for a SQL query using a data skipping index Stores summary index metadata for each object No change to Apache Spark base code “just” take advantage of a Spark external API which permits to refine the set of objects relevant to a given query. Skip over irrelevant objects Skipped objects are not touched at all IBM Data skipping is state of the art. It comprises: UDFs support Data Skipping for LIKE and general regular expressions Multiple indexes including Bloom Filter, etc… Plug-in user indexes Etc… è Saves time and $ Index of All Objects QueryData Skipping Index Candidate Objects WHERE Clause Save Time and $ Data Set Objects Min/max index on wind_speed column Possibly relevant object are ingested Data Skipping: how does it work? BigDataStack & INFINITECH Webinar14.05.2020
  • 6. Data Skipping within the maritime use case 6 Danaos vessel dataset Typical SQL queries: SELECT vessel_code, datetime, longitude, latitude, wind_speed FROM cos://us-south/…/danaos stored as parquet WHERE wind_speed > 30 The Object Storage: either remote IBM Cloud Object Storage (COS) (as shown) or local Depending on query and data partitioning we get up to 99% reduction in data transfer BigDataStack & INFINITECH Webinar14.05.2020
  • 7. Data Skipping for the BigDataStack maritime dataset demonstrated at THINK ’19 (joint work with Danaos: the BigDatastack maritime use-case partner) Data Skipping: status BigDataStack & INFINITECH Webinar14.05.2020 IBM Products Database Catalog Support added to IBM Cloud SQL Query Blog published here Released as open beta Data Skipping integrated within IBM Cloud SQL Query Blog published here Released as closed beta (as of now) IBM Analytics Engine (IAE) Data Skipping feature available as open beta That’s all … for now Scientific paper soon to be submitted
  • 8. Data partitioning Push for additional adoption Decouple Data Skipping from Apache Spark … BigDataStack & INFINITECH Webinar 8 Data Skipping: next steps 14.05.2020
  • 9. The team Team: Oshrit Feder Guy Khazma Gal Lushi Yosef Moatti Paula Ta-Shma Technical contact point: PAULA@il.ibm.com 07.05.2020 BigDataStack & INFINITECH Webinar 9