3. Background
● Flink has achieved impressive success in stream processing
● Its scalability and potential have been proven and pushed further by Blink, now part of Flink
● At Alibaba, Flink is used to process extremely large amounts of data at an unprecedented scale
5. Streaming SQL
● The majority of stream analytics can be expressed in SQL
● Streaming SQL gives users a way to write and deploy streaming jobs without programming
● SQL needs metadata: sources, sinks, UDFs, views, etc.
● This metadata needs a store
6. Streaming SQL (cont’d)
● Currently, Flink stores metadata in memory
● The metadata is ill-organized and scattered across different components
● Poor usability, interoperability, productivity, and manageability
● Problem #1: Flink lacks a well-organized, persistent store for its metadata
7. Batch and SQL
● Stream analytics users usually also have offline, batch analytics
● ETL is still an important use case for big data
● AI/ML is a major driving force behind both real-time and batch analytics
○ Gathering data to train and test a model, then deploying it in stream processing
● SQL is the main tool for batch processing of big data
● Unfortunately, users have to use a different engine for non-stream processing
8. Batch and SQL (cont’d)
● Flink has shown clear advantages over other solutions for heavy-volume stream processing
● In Blink, we systematically explored Flink’s capabilities in batch processing, and it shows great potential
9. Flink is the fastest due to its pipelined execution
● Tez and Spark do not overlap the 1st and 2nd stages
● MapReduce is slow despite overlapping stages
Source: “A Comparative Performance Evaluation of Flink”, Dongwon Kim, POSTECH, Flink Forward 2015
10. Batch and SQL (cont’d)
● Batch requires more SQL capability
● It demands even stronger metadata management
● Hive is the de facto standard for big data/batch processing on Hadoop
● The Hive metastore is the center of the big data ecosystem
● Problem #2: Flink lacks seamless access to Hive’s metadata and data
11. Heterogeneous Sources/Sinks
● Whether batch or streaming, Flink usually needs to access many data systems
○ Hive
○ MySQL
○ Key-Value stores
○ Kafka stream
● Each has a different data catalog
● Problem #3: Flink needs a unified interface to interact with different data catalogs
12. Beyond Flink
● Batch has a larger use case than streaming
● Many Hive users are not Flink users
● We would like Hive users to benefit from Flink’s batch capabilities
● Problem #4: Flink needs a story for Hive users
13. Four Goals
● Define a unified catalog API
● Implement an in-memory catalog and a persistent catalog for Flink metadata
● Implement a Hive catalog, enabling deep integration with Hive
● Provide Flink as Hive’s new execution engine (long-term)
14. Technical Overview
● Define unified catalog APIs (FLIP-30)
● Three implementations
○ Generic in-memory catalog
○ Generic persistent catalog (based on Hive metastore)
○ Hive catalog
● Hive data access
● Hive on Flink is not yet planned
19. Integrating Flink with Hive
This is a major change; the work needs to be broken into parts
Part 1. Unified Catalog APIs (FLIP-30, FLINK-11275)
Part 2. Integrate Flink with Hive (FLINK-10556)
● for metadata through Hive Metastore (FLINK-10744)
● for data (FLINK-10729)
Part 3. Support a complete set of SQL DDL/DML in Flink (FLINK-10232)
20. 1 - Unified Catalog APIs
Flink current status:
○ Barely any catalog support
○ Has a separate function catalog
Our highlighted improvements:
○ Introduced new catalog APIs and a framework, and connected them to Calcite (simplified sketch after this list)
● ReadableCatalog and ReadableWritableCatalog
● Meta-Objects: Database, Table, View, Partition, Functions, Stats, etc
● Operations: Create/Alter/Rename/Drop/Get/List/Exists
○ Unified the function catalog with the new catalog APIs and added support for persisting functions
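To make the API shape concrete, here is a minimal Java sketch of a readable/writable catalog pair in the spirit of FLIP-30; the type and method names below are illustrative assumptions, not the exact Flink interfaces.

    // Illustrative sketch only; the real FLIP-30 interfaces differ in details.
    import java.util.List;

    // Path of a meta-object within a catalog: <db>.<meta-object> (placeholder type).
    final class ObjectPath {
        final String databaseName;
        final String objectName;
        ObjectPath(String databaseName, String objectName) {
            this.databaseName = databaseName;
            this.objectName = objectName;
        }
    }

    // Placeholder for a table's schema, properties, and comment.
    interface CatalogTable {}

    // Read-only access to meta-objects.
    interface ReadableCatalog {
        List<String> listDatabases();
        List<String> listTables(String databaseName);
        CatalogTable getTable(ObjectPath tablePath);
        boolean tableExists(ObjectPath tablePath);
    }

    // Adds Create/Alter/Rename/Drop operations on top of read access.
    interface ReadableWritableCatalog extends ReadableCatalog {
        void createTable(ObjectPath tablePath, CatalogTable table, boolean ignoreIfExists);
        void alterTable(ObjectPath tablePath, CatalogTable newTable, boolean ignoreIfNotExists);
        void renameTable(ObjectPath tablePath, String newName, boolean ignoreIfNotExists);
        void dropTable(ObjectPath tablePath, boolean ignoreIfNotExists);
    }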
21. 1 - Unified Catalog APIs
Flink current status:
○ No well-structured hierarchy yet to manage metadata
○ Needs a better SQL user experience when referencing metadata
Our highlighted improvements:
● Introduced a two-level management structure: <catalog>.<db>.<meta-object>
● Added CatalogManager to resolve object names (see the example after this list)
select * from defaultCatalog.defaultDb.Tbl => select * from Tbl
● Made Flink case-insensitive to object names, similar to Hive, MySQL, Oracle
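A minimal sketch of how the current catalog and database shorten table references, assuming a catalog named "defaultCatalog" has already been registered with the TableEnvironment (the names follow the example above).

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.Table;
    import org.apache.flink.table.api.TableEnvironment;

    public class CatalogResolutionExample {
        public static void main(String[] args) {
            TableEnvironment tableEnv =
                TableEnvironment.create(EnvironmentSettings.newInstance().build());
            tableEnv.useCatalog("defaultCatalog");   // set the current catalog (assumed registered)
            tableEnv.useDatabase("defaultDb");       // set the current database
            // Both queries resolve to the same object: defaultCatalog.defaultDb.Tbl
            Table fullyQualified = tableEnv.sqlQuery("SELECT * FROM defaultCatalog.defaultDb.Tbl");
            Table shortName      = tableEnv.sqlQuery("SELECT * FROM Tbl");
        }
    }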
22. 1 - Unified Catalog APIs
Flink current status:
○ No production-ready catalogs
Our highlighted improvements:
○ Developed three production-ready catalogs (registration sketch below)
■ GenericInMemoryCatalog - in-memory non-persistent, per session
■ HiveCatalog - compatible with Hive, read/write Hive meta-objects
■ GenericHiveMetastoreCatalog - persist Flink streaming and batch meta-objects
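For instance, a session-scoped in-memory catalog could be registered roughly like this (a sketch reusing tableEnv from the earlier example; the constructor argument is assumed to be the catalog name).

    import org.apache.flink.table.catalog.GenericInMemoryCatalog;

    // Register a per-session, non-persistent catalog and make it the current one.
    tableEnv.registerCatalog("my_inmem", new GenericInMemoryCatalog("my_inmem"));
    tableEnv.useCatalog("my_inmem");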
23. 1 - Unified Catalog APIs
Catalogs are pluggable, which opens up opportunities to build catalogs for
○ Streams and MQ
● Kafka (Confluent Schema Registry), Kinesis, RabbitMQ, Pulsar, etc
○ Structured Data
● RDBMS like MySQL, etc.
○ Semi-Structured Data
● ElasticSearch, HBase, Cassandra, etc
○ Your other favorite data management systems
24. 2 - Flink-Hive Integration - Metadata - HiveCatalog
Our highlighted improvements:
Developed HiveCatalog, via which Flink can
● read Hive meta-objects, like tables, views, functions, stats
● create and write Hive meta-objects to Hive Metastore such that Hive can consume them
Flink can read and write Hive metadata through HiveCatalog
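A minimal sketch of wiring Flink to an existing Hive Metastore through HiveCatalog (reusing tableEnv from the earlier example); the constructor arguments shown here, catalog name, default database, Hive conf directory, and Hive version, are assumptions and may differ across Flink versions.

    import org.apache.flink.table.catalog.hive.HiveCatalog;

    // hive-site.xml is assumed to live in /opt/hive-conf and to point at the Metastore.
    HiveCatalog hive = new HiveCatalog("myhive", "default", "/opt/hive-conf", "2.3.4");
    tableEnv.registerCatalog("myhive", hive);
    tableEnv.useCatalog("myhive");

    // Hive meta-objects are now visible to Flink SQL, e.g. (hypothetical table name):
    Table t = tableEnv.sqlQuery("SELECT * FROM some_hive_table");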
25. 2 - Flink-Hive Integration - Metadata - GenericHiveMetastoreCatalog
Our highlighted improvements:
● Persisted Flink’s metadata (both streaming and batch) by using Hive Metastore purely
as storage
26. HiveCatalog vs. GenericHiveMetastoreCatalog
● HiveCatalog: for Hive batch metadata, stored in a format Hive can understand
● GenericHiveMetastoreCatalog: for any streaming and batch metadata, which Hive may not understand
● Both are backed by Hive Metastore
27. 2 - Flink-Hive Integration - Data
Our highlighted improvements:
Connector:
○ Developed source and sink to read/write partitioned/non-partitioned tables and views
○ Supported partition pruning (see the example after this list)
Data Types:
○ Supported all Hive simple and complex (array, map, struct) data types
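For illustration, a query over a hypothetical Hive table orders partitioned by dt; the filter on the partition column is where partition pruning applies (table and column names are made up, tableEnv as in the earlier example).

    // Only partitions matching dt = '2019-04-01' need to be scanned.
    Table recentOrders = tableEnv.sqlQuery(
        "SELECT order_id, amount FROM orders WHERE dt = '2019-04-01'");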
28. 2 - Flink-Hive Integration - User-Defined Functions and Version Compatibility
● Hive user defined functions
■ Supported Hive UDF
■ Working on supporting Hive GenericUDF, UDTF, UDAF
● Hive versions
■ Currently supports Hive 2.3.4 and 1.2.2 via shimming
■ Relies on Hive’s backward compatibility for 2.x and 1.x
● Working on direct support for more Hive versions, e.g. 2.1.1, 1.2.1
30. Demo with Flink SQL CLI
• Query Hive Metadata
• Create Hive Source/Sink with HiveCatalog to read/write data
• Create CSV Source/Sink with GenericHiveMetastoreCatalog to read/write data
31. This tremendous amount of work could not happen without help and support
Shout-out to everyone in the community and on our team
who has been helping us with designs, code, feedback, etc.!
32. Conclusions
● Flink is good at stream processing, but batch processing is equally important
● Flink has shown its potential in batch processing
● Flink/Hive integration benefits both communities
● This is a big effort
● We are taking a phased approach
● Your contribution is greatly welcome and appreciated!
33. Flink Forward China, Beijing, Dec 2019!
All major Chinese tech companies will attend.
Expected Attendees: 3,000+
Reach out to flink-forward-china@list.alibaba-inc.com for details!
Call for sponsors