Covers how to use Avro to save records to disk, which can later be applied when using Avro with the Kafka Schema Registry. This provides background on Avro, which is used with both Hadoop and Kafka.
Avro Tutorial - Records with Schema for Kafka and Hadoop
Avro
Apache Avro Data Serialization
Apache Avro
❖ Data serialization system
❖ Data structures
❖ Binary data format
❖ Container file format to store persistent data
❖ RPC capabilities
❖ Does not require code generation to use
Avro Schemas
❖ Supports schemas for defining data structure
❖ Uses the schema when serializing and deserializing data
❖ File schema
❖ Avro files store data with its schema
❖ RPC Schema
❖ RPC protocol exchanges schemas as part of the handshake
❖ Schemas written in JSON
Avro compared to…
❖ Similar to Thrift, Protocol Buffers, JSON, etc.
❖ Does not require code generation
❖ Avro needs less encoding as part of the data since it stores names and types in the schema
❖ It supports evolution of schemas.
Avro Schema
Avro schemas are stored in src/main/avro by default.
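For example, an Employee schema file (src/main/avro/Employee.avsc) might look like the following sketch; the namespace and field names are illustrative assumptions, not the exact schema from the slide:

{
  "namespace": "com.example.avro",
  "type": "record",
  "name": "Employee",
  "fields": [
    {"name": "firstName", "type": "string"},
    {"name": "lastName", "type": "string"},
    {"name": "age", "type": "int"},
    {"name": "phoneNumber", "type": "string"}
  ]
}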
Code Generation
Employee Code Generation
Using Generated Avro class
Writing Employees to an Avro File
Reading Employees from a File
Using GenericRecord
Writing Generic Records
Reading Using Generic Records
Avro Schema Validation
Avro supported types
❖ Records
❖ Arrays
❖ Enums
❖ Unions
❖ Maps
❖ Strings, Int, Boolean, Decimal, Timestamp, Date
Fuller example Avro Schema
Avro
❖ Fast data serialization
❖ Supports data structures
❖ Supports Records, Maps, Arrays, and basic types
❖ You can use it directly or with code generation
Editor's notes
Apache Avro™ is a data serialization system.
Avro provides data structures, a binary data format, a container file format for storing persistent data, and RPC capabilities. Avro does not require code generation to use.
It integrates well with JavaScript, Python, Ruby, and Java.
The Avro data format is defined by Avro schemas. The schema is used when serializing and deserializing data: data is serialized based on the schema, and the schema is sent along with the data, so Avro data plus its schema is fully self-describing.
Avro files store data together with its schema. Avro RPC is also schema-based, and part of the RPC protocol exchanges schemas as part of the handshake.
When Avro is used in RPC, the client and server exchange schemas in the connection handshake.
Avro schemas are written in JSON.
Avro is similar to Thrift, Protocol Buffers, JSON, etc. Avro does not require code generation. Avro needs less encoding as part of the data since it stores names and types in the schema. It supports evolution of schemas.
There are plugins for Maven and Gradle to generate code based on Avro schemas.
The gradle-avro-plugin is a Gradle plugin that uses the Avro tools to do Java code generation for Apache Avro. The plugin supports Avro schema files (.avsc) and Avro RPC IDL (.avdl). For Kafka you only need .avsc. Notice that we did not generate setter methods; this makes the instances somewhat immutable.
The plugin generates the files and puts them under build/generated-main-avro-java.
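A minimal build.gradle sketch follows. The plugin id and version numbers shown here are assumptions; check the gradle-avro-plugin documentation for the coordinates that match your Gradle and Avro versions.

plugins {
    id "java"
    // Plugin id/version are assumptions; consult the gradle-avro-plugin docs for the current coordinates.
    id "com.github.davidmc24.gradle.plugin.avro" version "1.9.1"
}

repositories {
    mavenCentral()
}

dependencies {
    // Avro runtime library; the version is an assumption.
    implementation "org.apache.avro:avro:1.11.3"
}

// With this plugin, schemas in src/main/avro are compiled to Java classes under build/generated-main-avro-java.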
The generated Employee class has a constructor and a builder.
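For example, with the Employee schema sketched earlier, an instance can be constructed either way; the field names here are assumptions taken from that sketch.

// Generated all-args constructor
Employee bob = new Employee("Bob", "Smith", 35, "555-555-1212");

// Generated builder
Employee carl = Employee.newBuilder()
        .setFirstName("Carl")
        .setLastName("Jones")
        .setAge(42)
        .setPhoneNumber("555-555-2121")
        .build();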
The following shows serializing an Employee list to disk. In Kafka, we will not be writing to disk directly.
We are just showing this so you have a way to test Avro serialization, which is helpful when debugging schema incompatibilities.
Note that we create a DatumWriter, which converts a Java instance into an in-memory serialized format. SpecificDatumWriter is used with generated classes like Employee.
DataFileWriter writes the serialized records to the employees.avro file.
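A sketch of that write path, assuming the generated Employee class and the employees.avro file name used in these notes:

import org.apache.avro.file.DataFileWriter;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.specific.SpecificDatumWriter;
import java.io.File;
import java.io.IOException;
import java.util.List;

public class EmployeeFileWriter {
    public static void write(List<Employee> employeeList) throws IOException {
        // SpecificDatumWriter serializes instances of the generated Employee class.
        final DatumWriter<Employee> datumWriter = new SpecificDatumWriter<>(Employee.class);
        try (DataFileWriter<Employee> dataFileWriter = new DataFileWriter<>(datumWriter)) {
            // Create the container file with the Employee schema, then append each record.
            dataFileWriter.create(Employee.getClassSchema(), new File("employees.avro"));
            for (Employee employee : employeeList) {
                dataFileWriter.append(employee);
            }
        }
    }
}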
Next we deserialize the employees from the employees.avro file.
Deserializing is similar to serializing but in reverse. We create a SpecificDatumReader to convert the in-memory serialized items into instances of our generated Employee class.
Records are read from the file by calling next on the DataFileReader.
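A sketch of that read path; the employees.avro file name and Employee class match the writing example above:

import org.apache.avro.file.DataFileReader;
import org.apache.avro.io.DatumReader;
import org.apache.avro.specific.SpecificDatumReader;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class EmployeeFileReader {
    public static List<Employee> read() throws IOException {
        final File file = new File("employees.avro");
        final List<Employee> employeeList = new ArrayList<>();
        // SpecificDatumReader deserializes records into instances of the generated Employee class.
        final DatumReader<Employee> empReader = new SpecificDatumReader<>(Employee.class);
        try (DataFileReader<Employee> dataFileReader = new DataFileReader<>(file, empReader)) {
            while (dataFileReader.hasNext()) {
                employeeList.add(dataFileReader.next());
            }
        }
        return employeeList;
    }
}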
Another way to read is using forEach as follows:
final DataFileReader<Employee> dataFileReader = new DataFileReader<>(file, empReader);
dataFileReader.forEach(employeeList::add);
You can use a generic record instead of using generated code.
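A sketch of building a record that way, parsing the schema file directly; the path and field names come from the Employee schema sketched earlier and are assumptions:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import java.io.File;
import java.io.IOException;

public class GenericEmployeeExample {
    public static GenericRecord buildBob() throws IOException {
        // Parse the schema at runtime instead of relying on a generated class.
        Schema schema = new Schema.Parser().parse(new File("src/main/avro/Employee.avsc"));
        GenericRecord bob = new GenericData.Record(schema);
        bob.put("firstName", "Bob");
        bob.put("lastName", "Smith");
        bob.put("age", 35);
        bob.put("phoneNumber", "555-555-1212");
        return bob;
    }
}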
You can write to Avro files using Generic records as well.
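Writing generic records looks much like the earlier example, except a GenericDatumWriter built from the schema replaces the SpecificDatumWriter; the file name here is an assumption:

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import java.io.File;
import java.io.IOException;
import java.util.List;

public class GenericEmployeeWriter {
    public static void write(Schema schema, List<GenericRecord> records) throws IOException {
        // GenericDatumWriter serializes GenericRecords against the supplied schema.
        final GenericDatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
        try (DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<>(datumWriter)) {
            dataFileWriter.create(schema, new File("employees-generic.avro"));
            for (GenericRecord record : records) {
                dataFileWriter.append(record);
            }
        }
    }
}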
You can read from Avro files using generic records as well.
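And a sketch of reading them back with a GenericDatumReader; field values come back by name via get, and the writer's schema is read from the file itself:

import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import java.io.File;
import java.io.IOException;

public class GenericEmployeeReader {
    public static void read() throws IOException {
        final File file = new File("employees-generic.avro");
        final GenericDatumReader<GenericRecord> datumReader = new GenericDatumReader<>();
        try (DataFileReader<GenericRecord> dataFileReader = new DataFileReader<>(file, datumReader)) {
            while (dataFileReader.hasNext()) {
                GenericRecord record = dataFileReader.next();
                // Fields are looked up by name on a GenericRecord.
                System.out.println(record.get("firstName") + " " + record.get("lastName"));
            }
        }
    }
}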
Avro will validate the data types when it serializes and deserializes the data.
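As a sketch of what that means in practice, using the GenericRecord approach and the Employee schema assumed above: a value of the wrong type is accepted by put, but the record fails validation and would also fail at serialization time.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import java.io.File;
import java.io.IOException;

public class SchemaValidationExample {
    public static void main(String[] args) throws IOException {
        Schema schema = new Schema.Parser().parse(new File("src/main/avro/Employee.avsc"));
        GenericRecord bob = new GenericData.Record(schema);
        bob.put("firstName", "Bob");
        bob.put("lastName", "Smith");
        bob.put("phoneNumber", "555-555-1212");
        bob.put("age", "not a number"); // wrong type: the schema declares age as an int

        // validate() reports that the record does not conform to the schema.
        System.out.println(GenericData.get().validate(schema, bob)); // prints false

        // Serializing this record with a DatumWriter would also fail with a runtime
        // exception (the exact exception type depends on the Avro version).
    }
}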
The document https://avro.apache.org/docs/current/spec.html#Protocol+Declaration describes all of the supported types.
A fuller schema, with examples of default values, arrays, primitive types, records within records, enums, and more, is sketched below.
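The names and fields in this sketch are illustrative assumptions, not the exact schema from the original slide:

{
  "namespace": "com.example.avro",
  "type": "record",
  "name": "Employee",
  "fields": [
    {"name": "firstName", "type": "string"},
    {"name": "lastName", "type": "string"},
    {"name": "age", "type": "int", "default": 0},
    {"name": "status", "type": {"type": "enum", "name": "Status",
        "symbols": ["SALARY", "HOURLY", "PART_TIME"]}, "default": "SALARY"},
    {"name": "emails", "type": {"type": "array", "items": "string"}, "default": []},
    {"name": "phoneNumber", "type": ["null", {
        "type": "record",
        "name": "PhoneNumber",
        "fields": [
          {"name": "areaCode", "type": "string"},
          {"name": "number", "type": "string"}
        ]
      }], "default": null}
  ]
}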