Integrate Apache Flink with Apache Hive
Xuefu Zhang,
-- Senior Staff Engineer, Alibaba
-- Hive PMC, Apache Member
Bowen Li
-- Senior Engineer, Alibaba
Agenda
● Background
● Goals
● Technical Overview
● Current Progress
● Demo
● Q&A
Background
● Flink has achieved impressive success in stream processing
● Its scalability and potential have been proven and pushed further by Blink, now part of Flink
● At Alibaba, Flink is used to process extremely large amounts of data at an unprecedented scale
1.7B events/sec | EB total | PB every day | 1T events/day
Streaming SQL
● The majority of stream analytics can be expressed in SQL
● Streaming SQL gives users a non-programming way of writing and deploying streaming jobs
● SQL needs metadata: sources, sinks, UDFs, views, etc.
● The metadata needs a store
Streaming SQL (cont’d)
● Currently, Flink stores metadata in memory
● The metadata is ill-organized, scattered across different components
● Poor usability, interoperability, productivity, and manageability
● Problem #1: Flink lacks a well-organized, persistent store for its metadata
Batch and SQL
● Stream analytics users usually also have offline, batch analytics
● ETL is still an important use case for big data
● AI/ML is a major driving force behind both real-time and batch analytics
○ Gathering data to train and test a model, deploying it in stream processing
● SQL is the main tool for batch processing of big data
● Unfortunately, users need a separate engine for non-stream processing
Batch and SQL (cont’d)
● Flink has shown clear advantages over other solutions for heavy-volume stream processing
● In Blink, we systematically explored Flink’s capabilities in batch processing, and it shows great potential
Flink is the fastest due to its pipelined execution
Tez and Spark do not overlap 1st and 2nd stages
MapReduce is slow despite overlapping stages
A Comparative Performance Evaluation of Flink, Dongwon Kim, POSTECH, Flink Forward 2015
Batch and SQL (cont’d)
● Batch requires more SQL capability
● It demands even stronger metadata management
● Hive is the de facto standard for big data/batch processing on Hadoop
● The center of the big data ecosystem is the Hive metadata store
● Problem #2: Flink lacks seamless access to Hive’s metadata and data
Heterogeneous Sources/Sinks
● Whether batch or streaming, Flink usually needs to access many data systems
○ Hive
○ MySQL
○ Key-Value stores
○ Kafka stream
● Different data catalogs
● Problem #3: Flink needs a unified interface to interact with different data catalogs
Beyond Flink
● Batch has more use cases than streaming
● Many Hive users are not Flink users
● We would like Hive users to benefit from Flink’s batch capabilities
● Problem #4: Flink needs a story for Hive users
Four Goals
● Define Unified catalog API
● Implement In-Memory catalog and persistent catalog for Flink metadata
● Implement Hive catalog, enabling deep integration with Hive
● Provide Flink as Hive’s new execution engine (long-term)
Technical Overview
● Define unified catalog APIs (FLIP-30)
● Three implementations
○ Generic in-memory catalog
○ Generic persistent catalog (based on Hive metastore)
○ Hive catalog
● Hive data access
● Hive on Flink is not yet planned
Architecture
[Diagram: layers shown are SQL Client/Zeppelin, Table API and SQL, Catalog APIs, Query processing & optimization, Flink Runtime, Flink Deployment]
Catalog APIs and Implementations
[Diagram: inheritance and reference relationships among ReadableCatalog, ReadableWritableCatalog, HiveCatalogBase, GenericInMemoryCatalog, GenericHiveMetastoreCatalog, HiveCatalog, CatalogManager, TableEnvironment, SQL Client, the shim layer (HiveMetastoreClient), and Hive Metastore]
Hive Data Connector
[Diagram: read and write paths to Hive data. Labels shown: BatchTableFactory / HiveTableFactory; BatchTableSource / HiveTableSource; InputFormat / HiveTableInputFormat (read); BatchTableSink / HiveTableSink; OutputFormat / HiveTableOutputFormat (write)]
Current Progress, Development Plan, and Demo
Bowen Li
Integrating Flink with Hive
This is a major change, so the work needs to be broken into parts
Part 1. Unified Catalog APIs (FLIP-30, FLINK-11275)
Part 2. Integrate Flink with Hive (FLINK-10556)
● for metadata through Hive Metastore (FLINK-10744)
● for data (FLINK-10729)
Part 3. Support a complete set of SQL DDL/DML in Flink (FLINK-10232)
1 - Unified Catalog APIs
Flink current status:
○ Barely any catalog support
○ Has separate function catalog
Our highlighted improvements:
○ Introduced new catalog APIs and framework and connected to Calcite
● ReadableCatalog and ReadableWritableCatalog
● Meta-Objects: Database, Table, View, Partition, Functions, Stats, etc
● Operations: Create/Alter/Rename/Drop/Get/List/Exist/
○ Unified function catalog with new catalog APIs and supported persisting functions
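The read-only / read-write split can be illustrated with a small Python sketch. Only the names ReadableCatalog and ReadableWritableCatalog come from the slides; the method names, the dictionary-based schema, and the InMemoryCatalog class are assumptions made for illustration and are not Flink's actual API.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class ReadableCatalog(ABC):
    """Read-only operations (Get/List/Exist). Method names are illustrative."""

    @abstractmethod
    def list_tables(self, db: str) -> List[str]: ...

    @abstractmethod
    def get_table(self, db: str, table: str) -> Dict[str, str]: ...

    @abstractmethod
    def table_exists(self, db: str, table: str) -> bool: ...


class ReadableWritableCatalog(ReadableCatalog):
    """Adds mutating operations (Create/Drop) on top of the read-only ones."""

    @abstractmethod
    def create_table(self, db: str, table: str, schema: Dict[str, str]) -> None: ...

    @abstractmethod
    def drop_table(self, db: str, table: str) -> None: ...


class InMemoryCatalog(ReadableWritableCatalog):
    """Hypothetical non-persistent catalog: metadata lives in a nested dict."""

    def __init__(self) -> None:
        self._dbs: Dict[str, Dict[str, Dict[str, str]]] = {"default": {}}

    def list_tables(self, db: str) -> List[str]:
        return sorted(self._dbs[db])

    def get_table(self, db: str, table: str) -> Dict[str, str]:
        return self._dbs[db][table]

    def table_exists(self, db: str, table: str) -> bool:
        return table in self._dbs.get(db, {})

    def create_table(self, db: str, table: str, schema: Dict[str, str]) -> None:
        if self.table_exists(db, table):
            raise ValueError(f"table {db}.{table} already exists")
        self._dbs[db][table] = dict(schema)

    def drop_table(self, db: str, table: str) -> None:
        del self._dbs[db][table]
```

A catalog backed by an external store would implement the same interface, so callers that only read metadata can depend on ReadableCatalog alone.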
1 - Unified Catalog APIs
Flink current status:
○ No well-structured hierarchy yet to manage metadata
○ Needs better SQL user experience when referencing metadata
Our highlighted improvements:
● Introduced two-level management structure: <catalog>.<db>.<meta-object>
● Added CatalogManager to resolve object name
select * from defaultCatalog.defaultDb.Tbl => select * from Tbl
● Made Flink case-insensitive to object names, similar to Hive, MySQL, Oracle
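The name resolution described above can be sketched as follows. This is a minimal illustration, not Flink's CatalogManager: the resolve method, the constructor arguments, and the choice to lowercase names as the way of being case-insensitive are all assumptions.

```python
from typing import Tuple


class CatalogManager:
    """Hypothetical resolver for <catalog>.<db>.<meta-object> names."""

    def __init__(self, current_catalog: str, current_db: str) -> None:
        # Lowercasing is one simple way to compare names case-insensitively.
        self.current_catalog = current_catalog.lower()
        self.current_db = current_db.lower()

    def resolve(self, name: str) -> Tuple[str, str, str]:
        """Expand a partial object name using the current catalog and database."""
        parts = [p.lower() for p in name.split(".")]
        if len(parts) == 1:
            return (self.current_catalog, self.current_db, parts[0])
        if len(parts) == 2:
            return (self.current_catalog, parts[0], parts[1])
        if len(parts) == 3:
            return (parts[0], parts[1], parts[2])
        raise ValueError(f"invalid object name: {name}")


manager = CatalogManager("defaultCatalog", "defaultDb")
# Both spellings from the slide resolve to the same fully qualified name.
short_name = manager.resolve("Tbl")
full_name = manager.resolve("defaultCatalog.defaultDb.Tbl")
```

With this scheme a query only needs the fully qualified form when it reaches outside the current catalog or database.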
1 - Unified Catalog APIs
Flink current status:
No production-ready catalogs
Our highlighted improvements:
Developed three production-ready catalogs
■ GenericInMemoryCatalog - in-memory non-persistent, per session
■ HiveCatalog - compatible with Hive, read/write Hive meta-objects
■ GenericHiveMetastoreCatalog - persist Flink streaming and batch meta-objects
1 - Unified Catalog APIs
Catalogs are pluggable, which opens opportunities to build catalogs for
○ Streams and MQ
● Kafka (Confluent Schema Registry), Kinesis, RabbitMQ, Pulsar, etc
○ Structured Data
● RDBMS like MySQL, etc
○ Semi-Structured Data
● ElasticSearch, HBase, Cassandra, etc
○ Your other favorite data management systems
● …...
2 - Flink-Hive Integration - Metadata - HiveCatalog
Our highlighted improvements:
Developed HiveCatalog, via which Flink can
● read Hive meta-objects, like tables, views, functions, stats
● create and write Hive meta-objects to Hive Metastore so that Hive can consume them
Flink can read and write Hive metadata through HiveCatalog
2 - Flink-Hive Integration - Metadata - GenericHiveMetastoreCatalog
Our highlighted improvements:
● Persisted Flink’s metadata (both streaming and batch) by using Hive Metastore purely
as storage
HiveCatalog vs. GenericHiveMetastoreCatalog
HiveCatalog
● for Hive batch metadata
● Hive can understand
GenericHiveMetastoreCatalog
● for any streaming and batch metadata
● Hive may not understand
Both are backed by Hive Metastore
2 - Flink-Hive Integration - Data
Our highlighted improvements:
Connector:
○ Developed source and sink to read/write partition/non-partition tables and views
○ Supported partition-pruning
Data Types:
○ Supported all Hive simple and complex (array, map, struct) data types
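Partition pruning means skipping partitions that a query's filter cannot match, so their files are never read. A minimal sketch, assuming partitions are described by their column values plus a hypothetical `path` entry, and that the filter is a set of equality conditions on partition columns (the connector's real data structures are not shown in the slides):

```python
from typing import Dict, List

Partition = Dict[str, str]


def prune_partitions(partitions: List[Partition], filters: Dict[str, str]) -> List[Partition]:
    """Keep only partitions whose column values match every equality filter."""
    return [
        p for p in partitions
        if all(p.get(column) == value for column, value in filters.items())
    ]


# Illustrative table partitioned by date and country.
partitions = [
    {"dt": "2019-04-01", "country": "US", "path": "/warehouse/t/dt=2019-04-01/country=US"},
    {"dt": "2019-04-01", "country": "CN", "path": "/warehouse/t/dt=2019-04-01/country=CN"},
    {"dt": "2019-04-02", "country": "US", "path": "/warehouse/t/dt=2019-04-02/country=US"},
]

# A filter such as WHERE dt = '2019-04-01' leaves two of the three partitions to scan.
to_scan = prune_partitions(partitions, {"dt": "2019-04-01"})
```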
2 - Flink-Hive Integration - User-Defined Functions and Version Compatibility
● Hive user defined functions
■ Supported Hive UDF
■ Working on supporting Hive GenericUDF, UDTF, UDAF
● Hive versions
■ Currently supports Hive 2.3.4 and 1.2.2 via shimming
■ Relies on Hive’s backward compatibility for 2.x and 1.x
● Working on direct support for more Hive versions, e.g. 2.1.1, 1.2.1
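The shimming approach can be pictured as choosing one version-specific adapter per Hive major line and relying on backward compatibility within that line. The sketch below is an assumption about the general pattern, not the project's code: HiveShim, SHIMS, and load_shim are hypothetical names, and only the version numbers come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HiveShim:
    """Hypothetical adapter hiding version-specific metastore client calls."""
    built_against: str


# One shim per major line, built against the versions named on the slide.
SHIMS = {
    "1": HiveShim(built_against="1.2.2"),
    "2": HiveShim(built_against="2.3.4"),
}


def load_shim(hive_version: str) -> HiveShim:
    """Pick the shim for the major line of the given Hive version."""
    major = hive_version.split(".")[0]
    try:
        return SHIMS[major]
    except KeyError:
        raise ValueError(f"no shim available for Hive {hive_version}") from None
```

Adding direct support for another version would then amount to registering a more specific shim rather than changing callers.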
Timeline
First Targeted Flink release - 1.9.0, June 2019
Demo with Flink SQL CLI
• Query Hive Metadata
• Create Hive Source/Sink with HiveCatalog to read/write data
• Create CSV Source/Sink with GenericHiveMetastoreCatalog to read/write data
This tremendous amount of work could not happen without help and support
Shout out to everyone in the community and our team
who has been helping us with designs, code, feedback, etc.!
Conclusions
● Flink is good at stream processing, but batch processing is equally important
● Flink has shown its potential in batch processing
● Flink/Hive integration benefits both communities
● This is a big effort
● We are taking a phased approach
● Your contribution is greatly welcome and appreciated!
Call for sponsors
Flink Forward China, Beijing, Dec 2019!
All major Chinese tech companies will attend.
Expected Attendees: 3,000+
Reach out to flink-forward-china@list.alibaba-inc.com for details!
Thanks!

Más contenido relacionado

La actualidad más candente

Exploring Oracle Multitenant in Oracle Database 12c
Exploring Oracle Multitenant in Oracle Database 12cExploring Oracle Multitenant in Oracle Database 12c
Exploring Oracle Multitenant in Oracle Database 12cZohar Elkayam
 
Amit Kumar_Resume
Amit Kumar_ResumeAmit Kumar_Resume
Amit Kumar_ResumeAmit Kumar
 
Bquery Reporting & Analytics Architecture
Bquery Reporting & Analytics ArchitectureBquery Reporting & Analytics Architecture
Bquery Reporting & Analytics ArchitectureCarst Vaartjes
 
Things Every Oracle DBA Needs To Know About The Hadoop Ecosystem
Things Every Oracle DBA Needs To Know About The Hadoop EcosystemThings Every Oracle DBA Needs To Know About The Hadoop Ecosystem
Things Every Oracle DBA Needs To Know About The Hadoop EcosystemZohar Elkayam
 
Adding real time reporting to your database oracle db in memory
Adding real time reporting to your database oracle db in memoryAdding real time reporting to your database oracle db in memory
Adding real time reporting to your database oracle db in memoryZohar Elkayam
 
Metadata Synchronization in MySQL NDB Cluster 8.0
Metadata Synchronization in MySQL NDB Cluster 8.0Metadata Synchronization in MySQL NDB Cluster 8.0
Metadata Synchronization in MySQL NDB Cluster 8.0Arnab Ray
 
What's New in DITA 1.3 (Tekom, Nov 2014)
What's New in DITA 1.3 (Tekom, Nov 2014)What's New in DITA 1.3 (Tekom, Nov 2014)
What's New in DITA 1.3 (Tekom, Nov 2014)Contrext Solutions
 
Big data for cio 2015
Big data for cio 2015Big data for cio 2015
Big data for cio 2015Zohar Elkayam
 
Directory Structure Changes in Laravel 5.3
Directory Structure Changes in Laravel 5.3Directory Structure Changes in Laravel 5.3
Directory Structure Changes in Laravel 5.3DHRUV NATH
 
Where the &amp;$%! did this come from e resources in alma%2-f_primo a teachi...
Where the &amp;$%! did this come from  e resources in alma%2-f_primo a teachi...Where the &amp;$%! did this come from  e resources in alma%2-f_primo a teachi...
Where the &amp;$%! did this come from e resources in alma%2-f_primo a teachi...Martin Patrick
 
Informatica Online Training
Informatica Online TrainingInformatica Online Training
Informatica Online TrainingRao Rao
 
Evolutionary database design
Evolutionary database designEvolutionary database design
Evolutionary database designSalehein Syed
 
Free Libre Open Source Software at FFZG library
Free Libre Open Source Software at FFZG libraryFree Libre Open Source Software at FFZG library
Free Libre Open Source Software at FFZG libraryDobrica Pavlinušić
 

La actualidad más candente (14)

Exploring Oracle Multitenant in Oracle Database 12c
Exploring Oracle Multitenant in Oracle Database 12cExploring Oracle Multitenant in Oracle Database 12c
Exploring Oracle Multitenant in Oracle Database 12c
 
Amit Kumar_Resume
Amit Kumar_ResumeAmit Kumar_Resume
Amit Kumar_Resume
 
Bquery Reporting & Analytics Architecture
Bquery Reporting & Analytics ArchitectureBquery Reporting & Analytics Architecture
Bquery Reporting & Analytics Architecture
 
Things Every Oracle DBA Needs To Know About The Hadoop Ecosystem
Things Every Oracle DBA Needs To Know About The Hadoop EcosystemThings Every Oracle DBA Needs To Know About The Hadoop Ecosystem
Things Every Oracle DBA Needs To Know About The Hadoop Ecosystem
 
Adding real time reporting to your database oracle db in memory
Adding real time reporting to your database oracle db in memoryAdding real time reporting to your database oracle db in memory
Adding real time reporting to your database oracle db in memory
 
Metadata Synchronization in MySQL NDB Cluster 8.0
Metadata Synchronization in MySQL NDB Cluster 8.0Metadata Synchronization in MySQL NDB Cluster 8.0
Metadata Synchronization in MySQL NDB Cluster 8.0
 
What's New in DITA 1.3 (Tekom, Nov 2014)
What's New in DITA 1.3 (Tekom, Nov 2014)What's New in DITA 1.3 (Tekom, Nov 2014)
What's New in DITA 1.3 (Tekom, Nov 2014)
 
Big data for cio 2015
Big data for cio 2015Big data for cio 2015
Big data for cio 2015
 
Directory Structure Changes in Laravel 5.3
Directory Structure Changes in Laravel 5.3Directory Structure Changes in Laravel 5.3
Directory Structure Changes in Laravel 5.3
 
Where the &amp;$%! did this come from e resources in alma%2-f_primo a teachi...
Where the &amp;$%! did this come from  e resources in alma%2-f_primo a teachi...Where the &amp;$%! did this come from  e resources in alma%2-f_primo a teachi...
Where the &amp;$%! did this come from e resources in alma%2-f_primo a teachi...
 
Informatica Online Training
Informatica Online TrainingInformatica Online Training
Informatica Online Training
 
Oracle OpenWo2014 review part 03 three_paa_s_database
Oracle OpenWo2014 review part 03 three_paa_s_databaseOracle OpenWo2014 review part 03 three_paa_s_database
Oracle OpenWo2014 review part 03 three_paa_s_database
 
Evolutionary database design
Evolutionary database designEvolutionary database design
Evolutionary database design
 
Free Libre Open Source Software at FFZG library
Free Libre Open Source Software at FFZG libraryFree Libre Open Source Software at FFZG library
Free Libre Open Source Software at FFZG library
 

Similar a Integrating Flink with Hive - Flink Forward SF 2019

Integrating Flink with Hive, Seattle Flink Meetup, Feb 2019
Integrating Flink with Hive, Seattle Flink Meetup, Feb 2019Integrating Flink with Hive, Seattle Flink Meetup, Feb 2019
Integrating Flink with Hive, Seattle Flink Meetup, Feb 2019Bowen Li
 
Flink and Hive integration - unifying enterprise data processing systems
Flink and Hive integration - unifying enterprise data processing systemsFlink and Hive integration - unifying enterprise data processing systems
Flink and Hive integration - unifying enterprise data processing systemsBowen Li
 
Unify Enterprise Data Processing System Platform Level Integration of Flink a...
Unify Enterprise Data Processing System Platform Level Integration of Flink a...Unify Enterprise Data Processing System Platform Level Integration of Flink a...
Unify Enterprise Data Processing System Platform Level Integration of Flink a...Flink Forward
 
Virtual Flink Forward 2020: Production-Ready Flink and Hive Integration - wha...
Virtual Flink Forward 2020: Production-Ready Flink and Hive Integration - wha...Virtual Flink Forward 2020: Production-Ready Flink and Hive Integration - wha...
Virtual Flink Forward 2020: Production-Ready Flink and Hive Integration - wha...Flink Forward
 
Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Flink Forward
 
Why Wait? Realtime Ingestion With Chen Qin and Heng Zhang | Current 2022
Why Wait? Realtime Ingestion With Chen Qin and Heng Zhang | Current 2022Why Wait? Realtime Ingestion With Chen Qin and Heng Zhang | Current 2022
Why Wait? Realtime Ingestion With Chen Qin and Heng Zhang | Current 2022HostedbyConfluent
 
Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2datamantra
 
Stinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksStinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksData Con LA
 
OpenLineage for Stream Processing | Kafka Summit London
OpenLineage for Stream Processing | Kafka Summit LondonOpenLineage for Stream Processing | Kafka Summit London
OpenLineage for Stream Processing | Kafka Summit LondonHostedbyConfluent
 
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUp
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUpStrimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUp
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUpJosé Román Martín Gil
 
Improving ad hoc and production workflows at Stitch Fix
Improving ad hoc and production workflows at Stitch FixImproving ad hoc and production workflows at Stitch Fix
Improving ad hoc and production workflows at Stitch FixStitch Fix Algorithms
 
G.Bs Presentation Of Guru Nanak Univ. National Conf.2009
G.Bs Presentation Of Guru Nanak Univ. National Conf.2009G.Bs Presentation Of Guru Nanak Univ. National Conf.2009
G.Bs Presentation Of Guru Nanak Univ. National Conf.2009Goutam Biswas
 
Apache Flink Training Workshop @ HadoopCon2016 - #1 System Overview
Apache Flink Training Workshop @ HadoopCon2016 - #1 System OverviewApache Flink Training Workshop @ HadoopCon2016 - #1 System Overview
Apache Flink Training Workshop @ HadoopCon2016 - #1 System OverviewApache Flink Taiwan User Group
 
Kettle: Pentaho Data Integration tool
Kettle: Pentaho Data Integration toolKettle: Pentaho Data Integration tool
Kettle: Pentaho Data Integration toolAlex Rayón Jerez
 
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterpriseUsing Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterpriseDataWorks Summit
 
Curing the Kafka blindness—Streams Messaging Manager
Curing the Kafka blindness—Streams Messaging ManagerCuring the Kafka blindness—Streams Messaging Manager
Curing the Kafka blindness—Streams Messaging ManagerDataWorks Summit
 
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & AlluxioAlluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & AlluxioAlluxio, Inc.
 

Similar a Integrating Flink with Hive - Flink Forward SF 2019 (20)

Integrating Flink with Hive, Seattle Flink Meetup, Feb 2019
Integrating Flink with Hive, Seattle Flink Meetup, Feb 2019Integrating Flink with Hive, Seattle Flink Meetup, Feb 2019
Integrating Flink with Hive, Seattle Flink Meetup, Feb 2019
 
Flink and Hive integration - unifying enterprise data processing systems
Flink and Hive integration - unifying enterprise data processing systemsFlink and Hive integration - unifying enterprise data processing systems
Flink and Hive integration - unifying enterprise data processing systems
 
Unify Enterprise Data Processing System Platform Level Integration of Flink a...
Unify Enterprise Data Processing System Platform Level Integration of Flink a...Unify Enterprise Data Processing System Platform Level Integration of Flink a...
Unify Enterprise Data Processing System Platform Level Integration of Flink a...
 
Virtual Flink Forward 2020: Production-Ready Flink and Hive Integration - wha...
Virtual Flink Forward 2020: Production-Ready Flink and Hive Integration - wha...Virtual Flink Forward 2020: Production-Ready Flink and Hive Integration - wha...
Virtual Flink Forward 2020: Production-Ready Flink and Hive Integration - wha...
 
Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...
 
Why Wait? Realtime Ingestion With Chen Qin and Heng Zhang | Current 2022
Why Wait? Realtime Ingestion With Chen Qin and Heng Zhang | Current 2022Why Wait? Realtime Ingestion With Chen Qin and Heng Zhang | Current 2022
Why Wait? Realtime Ingestion With Chen Qin and Heng Zhang | Current 2022
 
Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2
 
Stinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksStinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of Hortonworks
 
OpenLineage for Stream Processing | Kafka Summit London
OpenLineage for Stream Processing | Kafka Summit LondonOpenLineage for Stream Processing | Kafka Summit London
OpenLineage for Stream Processing | Kafka Summit London
 
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUp
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUpStrimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUp
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUp
 
Improving ad hoc and production workflows at Stitch Fix
Improving ad hoc and production workflows at Stitch FixImproving ad hoc and production workflows at Stitch Fix
Improving ad hoc and production workflows at Stitch Fix
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
G.Bs Presentation Of Guru Nanak Univ. National Conf.2009
G.Bs Presentation Of Guru Nanak Univ. National Conf.2009G.Bs Presentation Of Guru Nanak Univ. National Conf.2009
G.Bs Presentation Of Guru Nanak Univ. National Conf.2009
 
Apache Flink Training Workshop @ HadoopCon2016 - #1 System Overview
Apache Flink Training Workshop @ HadoopCon2016 - #1 System OverviewApache Flink Training Workshop @ HadoopCon2016 - #1 System Overview
Apache Flink Training Workshop @ HadoopCon2016 - #1 System Overview
 
Kettle: Pentaho Data Integration tool
Kettle: Pentaho Data Integration toolKettle: Pentaho Data Integration tool
Kettle: Pentaho Data Integration tool
 
Apache flink
Apache flinkApache flink
Apache flink
 
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterpriseUsing Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
 
Curing the Kafka blindness—Streams Messaging Manager
Curing the Kafka blindness—Streams Messaging ManagerCuring the Kafka blindness—Streams Messaging Manager
Curing the Kafka blindness—Streams Messaging Manager
 
Update on HDF5 1.8
Update on HDF5 1.8Update on HDF5 1.8
Update on HDF5 1.8
 
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & AlluxioAlluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
 

Más de Bowen Li

Apache Flink 101 - the rise of stream processing and beyond
Apache Flink 101 - the rise of stream processing and beyondApache Flink 101 - the rise of stream processing and beyond
Apache Flink 101 - the rise of stream processing and beyondBowen Li
 
Towards Apache Flink 2.0 - Unified Data Processing and Beyond, Bowen Li
Towards Apache Flink 2.0 - Unified Data Processing and Beyond, Bowen LiTowards Apache Flink 2.0 - Unified Data Processing and Beyond, Bowen Li
Towards Apache Flink 2.0 - Unified Data Processing and Beyond, Bowen LiBowen Li
 
How to contribute to Apache Flink @ Seattle Flink meetup
How to contribute to Apache Flink @ Seattle Flink meetupHow to contribute to Apache Flink @ Seattle Flink meetup
How to contribute to Apache Flink @ Seattle Flink meetupBowen Li
 
Community update on flink 1.9 and How to Contribute to Flink
Community update on flink 1.9 and How to Contribute to FlinkCommunity update on flink 1.9 and How to Contribute to Flink
Community update on flink 1.9 and How to Contribute to FlinkBowen Li
 
Tensorflow data preparation on Apache Beam using Portable Flink Runner, Ankur...
Tensorflow data preparation on Apache Beam using Portable Flink Runner, Ankur...Tensorflow data preparation on Apache Beam using Portable Flink Runner, Ankur...
Tensorflow data preparation on Apache Beam using Portable Flink Runner, Ankur...Bowen Li
 
AthenaX - Unified Stream & Batch Processing using SQL at Uber, Zhenqiu Huang,...
AthenaX - Unified Stream & Batch Processing using SQL at Uber, Zhenqiu Huang,...AthenaX - Unified Stream & Batch Processing using SQL at Uber, Zhenqiu Huang,...
AthenaX - Unified Stream & Batch Processing using SQL at Uber, Zhenqiu Huang,...Bowen Li
 
Community and Meetup Update, Seattle Flink Meetup, Feb 2019
Community and Meetup Update, Seattle Flink Meetup, Feb 2019Community and Meetup Update, Seattle Flink Meetup, Feb 2019
Community and Meetup Update, Seattle Flink Meetup, Feb 2019Bowen Li
 
Status Update of Seattle Flink Meetup, Jun 2018
Status Update of Seattle Flink Meetup, Jun 2018Status Update of Seattle Flink Meetup, Jun 2018
Status Update of Seattle Flink Meetup, Jun 2018Bowen Li
 
Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018
Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018
Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018Bowen Li
 
Approximate queries and graph streams on Flink, theodore vasiloudis, seattle...
Approximate queries and graph streams on Flink, theodore vasiloudis,  seattle...Approximate queries and graph streams on Flink, theodore vasiloudis,  seattle...
Approximate queries and graph streams on Flink, theodore vasiloudis, seattle...Bowen Li
 
Stream processing with Apache Flink @ OfferUp
Stream processing with Apache Flink @ OfferUpStream processing with Apache Flink @ OfferUp
Stream processing with Apache Flink @ OfferUpBowen Li
 
Apache Flink @ Alibaba - Seattle Apache Flink Meetup
Apache Flink @ Alibaba - Seattle Apache Flink MeetupApache Flink @ Alibaba - Seattle Apache Flink Meetup
Apache Flink @ Alibaba - Seattle Apache Flink MeetupBowen Li
 
Opening - Seattle Apache Flink Meetup
Opening - Seattle Apache Flink MeetupOpening - Seattle Apache Flink Meetup
Opening - Seattle Apache Flink MeetupBowen Li
 

Más de Bowen Li (13)

Apache Flink 101 - the rise of stream processing and beyond
Apache Flink 101 - the rise of stream processing and beyondApache Flink 101 - the rise of stream processing and beyond
Apache Flink 101 - the rise of stream processing and beyond
 
Towards Apache Flink 2.0 - Unified Data Processing and Beyond, Bowen Li
Towards Apache Flink 2.0 - Unified Data Processing and Beyond, Bowen LiTowards Apache Flink 2.0 - Unified Data Processing and Beyond, Bowen Li
Towards Apache Flink 2.0 - Unified Data Processing and Beyond, Bowen Li
 
How to contribute to Apache Flink @ Seattle Flink meetup
How to contribute to Apache Flink @ Seattle Flink meetupHow to contribute to Apache Flink @ Seattle Flink meetup
How to contribute to Apache Flink @ Seattle Flink meetup
 
Community update on flink 1.9 and How to Contribute to Flink
Community update on flink 1.9 and How to Contribute to FlinkCommunity update on flink 1.9 and How to Contribute to Flink
Community update on flink 1.9 and How to Contribute to Flink
 
Tensorflow data preparation on Apache Beam using Portable Flink Runner, Ankur...
Tensorflow data preparation on Apache Beam using Portable Flink Runner, Ankur...Tensorflow data preparation on Apache Beam using Portable Flink Runner, Ankur...
Tensorflow data preparation on Apache Beam using Portable Flink Runner, Ankur...
 
AthenaX - Unified Stream & Batch Processing using SQL at Uber, Zhenqiu Huang,...
AthenaX - Unified Stream & Batch Processing using SQL at Uber, Zhenqiu Huang,...AthenaX - Unified Stream & Batch Processing using SQL at Uber, Zhenqiu Huang,...
AthenaX - Unified Stream & Batch Processing using SQL at Uber, Zhenqiu Huang,...
 
Community and Meetup Update, Seattle Flink Meetup, Feb 2019
Community and Meetup Update, Seattle Flink Meetup, Feb 2019Community and Meetup Update, Seattle Flink Meetup, Feb 2019
Community and Meetup Update, Seattle Flink Meetup, Feb 2019
 
Status Update of Seattle Flink Meetup, Jun 2018
Status Update of Seattle Flink Meetup, Jun 2018Status Update of Seattle Flink Meetup, Jun 2018
Status Update of Seattle Flink Meetup, Jun 2018
 
Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018
Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018
Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018
 
Approximate queries and graph streams on Flink, theodore vasiloudis, seattle...
Approximate queries and graph streams on Flink, theodore vasiloudis,  seattle...Approximate queries and graph streams on Flink, theodore vasiloudis,  seattle...
Approximate queries and graph streams on Flink, theodore vasiloudis, seattle...
 
Stream processing with Apache Flink @ OfferUp
Stream processing with Apache Flink @ OfferUpStream processing with Apache Flink @ OfferUp
Stream processing with Apache Flink @ OfferUp
 
Apache Flink @ Alibaba - Seattle Apache Flink Meetup
Apache Flink @ Alibaba - Seattle Apache Flink MeetupApache Flink @ Alibaba - Seattle Apache Flink Meetup
Apache Flink @ Alibaba - Seattle Apache Flink Meetup
 
Opening - Seattle Apache Flink Meetup
Opening - Seattle Apache Flink MeetupOpening - Seattle Apache Flink Meetup
Opening - Seattle Apache Flink Meetup
 

Último

Online food ordering system project report.pdf
Online food ordering system project report.pdfOnline food ordering system project report.pdf
Online food ordering system project report.pdfKamal Acharya
 
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptxA CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptxmaisarahman1
 
Standard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayStandard vs Custom Battery Packs - Decoding the Power Play
Integrating Flink with Hive - Flink Forward SF 2019

  • 1. Integrate Apache Flink with Apache Hive Xuefu Zhang, -- Senior Staff Engineer, Alibaba -- Hive PMC, Apache Member Bowen Li -- Senior Engineer, Alibaba
  • 2. ● Background ● Goals ● Technical Overview ● Current Progress ● Demo ● Q&A Agenda
  • 3. Background ● Flink has achieved impressive success in stream processing ● Its scalability and potential have been proven and pushed further by Blink, now part of Flink ● At Alibaba, Flink is used to process extremely large amounts of data at an unprecedented scale
  • 4. 1.7B events/sec, EB total, PB every day, 1T events/day
  • 5. Streaming SQL ● The majority of stream analytics can be expressed in SQL ● Streaming SQL gives users a non-programming way of writing and deploying streaming jobs ● SQL needs metadata: sources, sinks, UDFs, views, etc. ● The metadata needs a store
  • 6. Streaming SQL (cont’d) ● Currently, Flink stores metadata in memory ● The metadata is ill-organized and scattered across different components ● Poor usability, interoperability, productivity, and manageability ● Problem #1: Flink lacks a well-organized, persistent store for its metadata
  • 7. Batch and SQL ● Stream analytics users usually also have offline, batch analytics ● ETL is still an important use case for big data ● AI/ML is a major driving force behind both real-time and batch analytics ○ Gathering data to train and test a model, deploying it in stream processing ● SQL is the main tool for batch processing of big data ● Unfortunately, users need a different engine for non-stream processing
  • 8. Batch and SQL (cont’d) ● Flink has shown clear advantages over other solutions for heavy-volume stream processing ● In Blink, we systematically explored Flink’s capabilities in batch processing, and it shows great potential
  • 9. Flink is the fastest due to its pipelined execution. Tez and Spark do not overlap 1st and 2nd stages. MapReduce is slow despite overlapping stages. Source: A Comparative Performance Evaluation of Flink, Dongwon Kim, POSTECH, Flink Forward 2015
  • 10. Batch and SQL (cont’d) ● Batch demands more SQL capability ● It demands even stronger metadata management ● Hive is the de facto standard for big data/batch processing on Hadoop ● The center of the big data ecosystem is the Hive metadata store ● Problem #2: Flink lacks seamless access to Hive’s metadata and data
  • 11. Heterogeneous Sources/Sinks ● Whether batch or streaming, Flink usually needs to access many data systems ○ Hive ○ MySQL ○ Key-Value stores ○ Kafka stream ● Different data catalogs ● Problem #3: Flink needs a unified interface to interact with different data catalogs
  • 12. Beyond Flink ● Batch has a larger use case than streaming ● Many Hive users are not Flink users ● We would like Hive users to benefit from Flink’s batch capabilities ● Problem #4: Flink needs a story for Hive users
  • 13. Four Goals ● Define a unified catalog API ● Implement an in-memory catalog and a persistent catalog for Flink metadata ● Implement a Hive catalog, enabling deep integration with Hive ● Provide Flink as Hive’s new execution engine (long-term)
  • 14. Technical Overview ● Define unified catalog APIs (FLIP-30) ● Three implementations ○ Generic in-memory catalog ○ Generic persistent catalog (based on Hive metastore) ○ Hive catalog ● Hive data access ● Hive on Flink is not yet planned
  • 15. Architecture (diagram labels): Flink Deployment; Flink Runtime; Query processing & optimization; Table API and SQL; SQL Client/Zeppelin; Catalog APIs
  • 16. Catalog APIs and Implementations (diagram labels): GenericInMemoryCatalog; GenericHiveMetastoreCatalog; ReadableCatalog; ReadableWritableCatalog; HiveCatalog; Shim Layer: HiveMetastoreClient; CatalogManager; TableEnvironment; SQL Client; HiveCatalogBase; Hive Metastore; Catalog APIs. Arrow legend: inheritance, reference
  • 18. Current Progress, Development Plan, and Demo Bowen Li
  • 19. Integrating Flink with Hive This is a major change, so the work needs to be broken into parts Part 1. Unified Catalog APIs (FLIP-30, FLINK-11275) Part 2. Integrate Flink with Hive (FLINK-10556) ● for metadata through Hive Metastore (FLINK-10744) ● for data (FLINK-10729) Part 3. Support a complete set of SQL DDL/DML in Flink (FLINK-10232)
  • 20. 1 - Unified Catalog APIs Flink current status: ○ Barely any catalog support ○ Has a separate function catalog Our highlighted improvements: ○ Introduced new catalog APIs and framework, and connected them to Calcite ● ReadableCatalog and ReadableWritableCatalog ● Meta-Objects: Database, Table, View, Partition, Functions, Stats, etc. ● Operations: Create/Alter/Rename/Drop/Get/List/Exist ○ Unified the function catalog with the new catalog APIs and supported persisting functions
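
The read-only/read-write split described on this slide can be illustrated with a small sketch. This is not Flink's actual API: only the names ReadableCatalog and ReadableWritableCatalog come from the slide, while the method names, the `InMemoryCatalog` class, and the dict-based table schema are assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class ReadableCatalog(ABC):
    """Read-only operations (Get/List/Exist). Method names are hypothetical."""

    @abstractmethod
    def list_tables(self, db: str) -> List[str]: ...

    @abstractmethod
    def get_table(self, db: str, table: str) -> Dict[str, str]: ...

    @abstractmethod
    def table_exists(self, db: str, table: str) -> bool: ...


class ReadableWritableCatalog(ReadableCatalog):
    """Adds mutating operations (Create/Drop) on top of the read-only ones."""

    @abstractmethod
    def create_table(self, db: str, table: str, schema: Dict[str, str]) -> None: ...

    @abstractmethod
    def drop_table(self, db: str, table: str) -> None: ...


class InMemoryCatalog(ReadableWritableCatalog):
    """Non-persistent catalog: metadata lives in a dict for the session only."""

    def __init__(self) -> None:
        # database name -> {table name -> schema}
        self._dbs: Dict[str, Dict[str, Dict[str, str]]] = {"default": {}}

    def list_tables(self, db):
        return sorted(self._dbs[db])

    def get_table(self, db, table):
        return self._dbs[db][table]

    def table_exists(self, db, table):
        return table in self._dbs.get(db, {})

    def create_table(self, db, table, schema):
        if self.table_exists(db, table):
            raise ValueError(f"table {db}.{table} already exists")
        self._dbs.setdefault(db, {})[table] = dict(schema)

    def drop_table(self, db, table):
        del self._dbs[db][table]
```

Separating the two interfaces lets a catalog backed by a read-only external system implement only the first one, while session-local or metastore-backed catalogs implement both.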
  • 21. 1 - Unified Catalog APIs Flink current status: ○ No well-structured hierarchy yet to manage metadata ○ Needs better SQL user experience when referencing metadata Our highlighted improvements: ● Introduced two-level management structure: <catalog>.<db>.<meta-object> ● Added CatalogManager to resolve object names: select * from defaultCatalog.defaultDb.Tbl => select * from Tbl ● Made Flink case-insensitive to object names, similar to Hive, MySQL, Oracle
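
The name resolution described on this slide can be sketched as follows. It is a minimal illustration under stated assumptions, not the CatalogManager implementation: the `resolve` method, its return shape, and lower-casing as the way to achieve case-insensitivity are all hypothetical.

```python
from typing import Tuple


class CatalogManager:
    """Toy resolver for <catalog>.<db>.<meta-object> names (hypothetical API)."""

    def __init__(self, current_catalog: str, current_db: str) -> None:
        # Names are lower-cased so lookups are case-insensitive (an assumption
        # about how the case-insensitivity on the slide could be realized).
        self.current_catalog = current_catalog.lower()
        self.current_db = current_db.lower()

    def resolve(self, name: str) -> Tuple[str, str, str]:
        parts = [p.lower() for p in name.split(".")]
        if len(parts) == 1:
            # Bare object name: fill in the current catalog and database.
            return (self.current_catalog, self.current_db, parts[0])
        if len(parts) == 2:
            # <db>.<object>: fill in the current catalog only.
            return (self.current_catalog, parts[0], parts[1])
        if len(parts) == 3:
            # Fully qualified name: nothing to fill in.
            return (parts[0], parts[1], parts[2])
        raise ValueError(f"cannot resolve object name: {name!r}")
```

With `defaultCatalog` and `defaultDb` as the current catalog and database, `Tbl` and `defaultCatalog.defaultDb.Tbl` resolve to the same triple, which is the equivalence the slide's two queries express.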
  • 22. 1 - Unified Catalog APIs Flink current status: No production-ready catalogs Our highlighted improvements: Developed three production-ready catalogs ■ GenericInMemoryCatalog - in-memory non-persistent, per session ■ HiveCatalog - compatible with Hive, read/write Hive meta-objects ■ GenericHiveMetastoreCatalog - persist Flink streaming and batch meta-objects
  • 23. 1 - Unified Catalog APIs Catalogs are pluggable and open opportunities to build catalogs for ○ Streams and MQ ● Kafka (Confluent Schema Registry), Kinesis, RabbitMQ, Pulsar, etc. ○ Structured Data ● RDBMS like MySQL, etc. ○ Semi-Structured Data ● ElasticSearch, HBase, Cassandra, etc. ○ Your other favorite data management systems ● ...
  • 24. 2 - Flink-Hive Integration - Metadata - HiveCatalog Our highlighted improvements: Developed HiveCatalog, via which Flink can ● read Hive meta-objects, like tables, views, functions, stats ● create and write Hive meta-objects to Hive Metastore so that Hive can consume them Flink can read and write Hive metadata through HiveCatalog
  • 25. 2 - Flink-Hive Integration - Metadata - GenericHiveMetastoreCatalog Our highlighted improvements: ● Persisted Flink’s metadata (both streaming and batch) by using Hive Metastore purely as storage
  • 26. HiveCatalog vs. GenericHiveMetastoreCatalog ● HiveCatalog: for Hive batch metadata; Hive can understand it ● GenericHiveMetastoreCatalog: for any streaming and batch metadata; Hive may not understand it ● Both are backed by Hive Metastore
  • 27. 2. Flink-Hive Integration - Data Our highlighted improvements: Connector: ○ Developed source and sink to read/write partitioned and non-partitioned tables and views ○ Supported partition pruning Data Types: ○ Supported all Hive simple and complex (array, map, struct) data types
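
Partition pruning, mentioned on this slide, means skipping partitions whose partition-key values cannot match the query filter, so their files are never read. The sketch below shows the general idea for equality filters only; the `prune_partitions` helper and the `spec`/`path` record layout are assumptions for illustration and do not reflect the Flink Hive connector's code.

```python
from typing import Dict, List


def prune_partitions(partitions: List[Dict], wanted: Dict[str, str]) -> List[Dict]:
    """Keep only partitions whose spec matches every equality filter.

    Each partition is a dict with a "spec" (partition key -> value) and a
    "path"; both field names are hypothetical.
    """
    return [
        p for p in partitions
        if all(p["spec"].get(key) == value for key, value in wanted.items())
    ]


# Example partition listing with made-up paths.
PARTITIONS = [
    {"spec": {"dt": "2019-04-01", "country": "us"}, "path": "/t/dt=2019-04-01/country=us"},
    {"spec": {"dt": "2019-04-01", "country": "cn"}, "path": "/t/dt=2019-04-01/country=cn"},
    {"spec": {"dt": "2019-04-02", "country": "us"}, "path": "/t/dt=2019-04-02/country=us"},
]
```

A filter such as `dt = '2019-04-01'` would leave two of the three partitions above to scan; an empty filter keeps all of them.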
  • 28. 2. Flink-Hive Integration - User defined functions and Version Compatibility ● Hive user defined functions ■ Supported Hive UDF ■ Working on supporting Hive GenericUDF, UDTF, UDAF ● Hive versions ■ Currently supports Hive 2.3.4 and 1.2.2 via shimming ■ Relies on Hive’s backward compatibility for 2.x and 1.x ● Working on direct support for more Hive versions, e.g. 2.1.1, 1.2.1
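
The shimming approach on this slide can be pictured as choosing one version-specific adapter per Hive major line and relying on backward compatibility within that line. The sketch below is only an illustration of that idea: `HiveShimV1`, `HiveShimV2`, and `load_shim` are hypothetical names, and only the versions 1.2.2 and 2.3.4 come from the slide.

```python
class HiveShimV1:
    """Adapter built against Hive 1.2.2 (version taken from the slide)."""
    hive_version = "1.2.2"


class HiveShimV2:
    """Adapter built against Hive 2.3.4 (version taken from the slide)."""
    hive_version = "2.3.4"


# Major version -> shim class. Other 1.x/2.x releases reuse the same shim,
# relying on Hive's backward compatibility within a major line.
_SHIMS = {"1": HiveShimV1, "2": HiveShimV2}


def load_shim(version: str):
    """Return the shim instance for a Hive version string such as "2.1.1"."""
    major = version.split(".")[0]
    try:
        return _SHIMS[major]()
    except KeyError:
        raise ValueError(f"unsupported Hive version: {version}") from None
```

Direct support for more versions, as the slide plans, would correspond to adding entries keyed on more than the major version.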
  • 29. Timeline First Targeted Flink release - 1.9.0, June 2019
  • 30. Demo with Flink SQL CLI • Query Hive Metadata • Create Hive Source/Sink with HiveCatalog to read/write data • Create CSV Source/Sink with GenericHiveMetastoreCatalog to read/write data
  • 31. This tremendous amount of work could not happen without help and support. Shout out to everyone in the community and on our team who has been helping us with designs, code, feedback, etc.!
  • 32. Conclusions ● Flink is good at stream processing, but batch processing is equally important ● Flink has shown its potential in batch processing ● Flink/Hive integration benefits both communities ● This is a big effort ● We are taking a phased approach ● Your contribution is greatly welcome and appreciated!
  • 33. Call for sponsors: Flink Forward China, Beijing, Dec 2019! All major Chinese tech companies will attend. Expected Attendees: 3,000+ Reach out to flink-forward-china@list.alibaba-inc.com for details!