Covers how to use Avro to save records to disk, which can later be applied when using Avro with the Kafka Schema Registry. This provides background on Avro, which is used with both Hadoop and Kafka.
Avro Tutorial - Records with Schema for Kafka and Hadoop
Avro
Apache Avro Data Serialization
Apache Avro
❖ Data serialization system
❖ Data structures
❖ Binary data format
❖ Container file format to store persistent data
❖ RPC capabilities
❖ Does not require code generation to use
Avro Schemas
❖ Supports schemas for defining data structure
❖ Uses the schema when serializing and deserializing data
❖ File schema
❖ Avro files store data with its schema
❖ RPC Schema
❖ RPC protocol exchanges schemas as part of the handshake
❖ Schemas written in JSON
Avro compared to…
❖ Similar to Thrift, Protocol Buffers, JSON, etc.
❖ Does not require code generation
❖ Avro needs less encoding as part of the data since it stores names and types in the schema
❖ It supports evolution of schemas.
Avro Schema
Avro schemas are stored in src/main/avro by default.
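For example, an Employee schema file (src/main/avro/Employee.avsc) might look like the following sketch; the namespace and field names are illustrative assumptions, not the exact schema from the slide:

{
  "namespace": "com.example.avro",
  "type": "record",
  "name": "Employee",
  "fields": [
    {"name": "firstName", "type": "string"},
    {"name": "lastName", "type": "string"},
    {"name": "age", "type": "int"},
    {"name": "phoneNumber", "type": "string"}
  ]
}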
Code Generation
Employee Code Generation
Using Generated Avro class
Writing Employees to an Avro File
Reading Employees from a File
Using GenericRecord
Writing Generic Records
Reading Using Generic Records
Avro Schema Validation
Avro supported types
❖ Records
❖ Arrays
❖ Enums
❖ Unions
❖ Maps
❖ Strings, Int, Boolean, Decimal, Timestamp, Date
Fuller example Avro Schema
Avro
❖ Fast data serialization
❖ Supports data structures
❖ Supports Records, Maps, Arrays, and basic types
❖ You can use it directly or with code generation
Editor's notes
Apache Avro™ is a data serialization system.
Avro provides data structures, a binary data format, a container file format for storing persistent data, and RPC capabilities. Avro does not require code generation to use.
It integrates well with JavaScript, Python, Ruby, and Java.
The Avro data format is defined by Avro schemas. The schema is used when serializing and deserializing data: data is serialized based on the schema, and the schema is sent along with the data, so Avro data plus its schema is fully self-describing.
Avro files store data together with its schema. Avro RPC is also schema-based, and part of the RPC protocol exchanges schemas as part of the handshake.
When Avro is used in RPC, the client and server exchange schemas in the connection handshake.
Avro schemas are written in JSON.
Avro is similar to Thrift, Protocol Buffers, JSON, etc. Avro does not require code generation. Avro needs less encoding as part of the data since it stores names and types in the schema. It supports evolution of schemas.
There are plugins for Maven and Gradle to generate code based on Avro schemas.
The gradle-avro-plugin is a Gradle plugin that uses the Avro tools to do Java code generation for Apache Avro. The plugin supports Avro schema files (.avsc) and Avro RPC IDL (.avdl). For Kafka you only need .avsc. Notice that we did not generate setter methods; this makes the instances somewhat immutable.
The plugin generates the files and puts them under build/generated-main-avro-java.
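A minimal build.gradle sketch follows. The plugin id and version numbers shown here are assumptions; check the gradle-avro-plugin documentation for the coordinates that match your Gradle and Avro versions.

plugins {
    id "java"
    // Plugin id/version are assumptions; consult the gradle-avro-plugin docs for the current coordinates.
    id "com.github.davidmc24.gradle.plugin.avro" version "1.9.1"
}

repositories {
    mavenCentral()
}

dependencies {
    // Avro runtime library; the version is an assumption.
    implementation "org.apache.avro:avro:1.11.3"
}

// With this plugin, schemas in src/main/avro are compiled to Java classes under build/generated-main-avro-java.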
The generated Employee class has a constructor and a builder.
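For example, with the Employee schema sketched earlier, an instance can be constructed either way; the field names here are assumptions taken from that sketch.

// Generated all-args constructor
Employee bob = new Employee("Bob", "Smith", 35, "555-555-1212");

// Generated builder
Employee carl = Employee.newBuilder()
        .setFirstName("Carl")
        .setLastName("Jones")
        .setAge(42)
        .setPhoneNumber("555-555-2121")
        .build();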
The following shows serializing an Employee list to disk. In Kafka, we will not be writing to disk directly.
We are just showing this so you have a way to test Avro serialization, which is helpful when debugging schema incompatibilities.
Note that we create a DatumWriter, which converts a Java instance into an in-memory serialized format. SpecificDatumWriter is used with generated classes like Employee.
DataFileWriter writes the serialized records to the employees.avro file.
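A sketch of that write path, assuming the generated Employee class and the employees.avro file name used in these notes:

import org.apache.avro.file.DataFileWriter;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.specific.SpecificDatumWriter;
import java.io.File;
import java.io.IOException;
import java.util.List;

public class EmployeeFileWriter {
    public static void write(List<Employee> employeeList) throws IOException {
        // SpecificDatumWriter serializes instances of the generated Employee class.
        final DatumWriter<Employee> datumWriter = new SpecificDatumWriter<>(Employee.class);
        try (DataFileWriter<Employee> dataFileWriter = new DataFileWriter<>(datumWriter)) {
            // Create the container file with the Employee schema, then append each record.
            dataFileWriter.create(Employee.getClassSchema(), new File("employees.avro"));
            for (Employee employee : employeeList) {
                dataFileWriter.append(employee);
            }
        }
    }
}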
Next we deserialize the employees from the employees.avro file.
Deserializing is similar to serializing but in reverse. We create a SpecificDatumReader to convert the in-memory serialized items into instances of our generated Employee class.
Records are read from the file by calling next on the DataFileReader.
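A sketch of that read path; the employees.avro file name and Employee class match the writing example above:

import org.apache.avro.file.DataFileReader;
import org.apache.avro.io.DatumReader;
import org.apache.avro.specific.SpecificDatumReader;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class EmployeeFileReader {
    public static List<Employee> read() throws IOException {
        final File file = new File("employees.avro");
        final List<Employee> employeeList = new ArrayList<>();
        // SpecificDatumReader deserializes records into instances of the generated Employee class.
        final DatumReader<Employee> empReader = new SpecificDatumReader<>(Employee.class);
        try (DataFileReader<Employee> dataFileReader = new DataFileReader<>(file, empReader)) {
            while (dataFileReader.hasNext()) {
                employeeList.add(dataFileReader.next());
            }
        }
        return employeeList;
    }
}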
Another way to read is using forEach as follows:
final DataFileReader<Employee> dataFileReader = new DataFileReader<>(file, empReader);
dataFileReader.forEach(employeeList::add);
You can use a generic record instead of using generated code.
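A sketch of building a record that way, parsing the schema file directly; the path and field names come from the Employee schema sketched earlier and are assumptions:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import java.io.File;
import java.io.IOException;

public class GenericEmployeeExample {
    public static GenericRecord buildBob() throws IOException {
        // Parse the schema at runtime instead of relying on a generated class.
        Schema schema = new Schema.Parser().parse(new File("src/main/avro/Employee.avsc"));
        GenericRecord bob = new GenericData.Record(schema);
        bob.put("firstName", "Bob");
        bob.put("lastName", "Smith");
        bob.put("age", 35);
        bob.put("phoneNumber", "555-555-1212");
        return bob;
    }
}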
You can write to Avro files using Generic records as well.
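Writing generic records looks much like the earlier example, except a GenericDatumWriter built from the schema replaces the SpecificDatumWriter; the file name here is an assumption:

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import java.io.File;
import java.io.IOException;
import java.util.List;

public class GenericEmployeeWriter {
    public static void write(Schema schema, List<GenericRecord> records) throws IOException {
        // GenericDatumWriter serializes GenericRecords against the supplied schema.
        final GenericDatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
        try (DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<>(datumWriter)) {
            dataFileWriter.create(schema, new File("employees-generic.avro"));
            for (GenericRecord record : records) {
                dataFileWriter.append(record);
            }
        }
    }
}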
You can read from Avro files using generic records as well.
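And a sketch of reading them back with a GenericDatumReader; field values come back by name via get, and the writer's schema is read from the file itself:

import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import java.io.File;
import java.io.IOException;

public class GenericEmployeeReader {
    public static void read() throws IOException {
        final File file = new File("employees-generic.avro");
        final GenericDatumReader<GenericRecord> datumReader = new GenericDatumReader<>();
        try (DataFileReader<GenericRecord> dataFileReader = new DataFileReader<>(file, datumReader)) {
            while (dataFileReader.hasNext()) {
                GenericRecord record = dataFileReader.next();
                // Fields are looked up by name on a GenericRecord.
                System.out.println(record.get("firstName") + " " + record.get("lastName"));
            }
        }
    }
}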
Avro will validate the data types when it serializes and deserializes the data.
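As a sketch of what that means in practice, using the GenericRecord approach and the Employee schema assumed above: a value of the wrong type is accepted by put, but the record fails validation and would also fail at serialization time.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import java.io.File;
import java.io.IOException;

public class SchemaValidationExample {
    public static void main(String[] args) throws IOException {
        Schema schema = new Schema.Parser().parse(new File("src/main/avro/Employee.avsc"));
        GenericRecord bob = new GenericData.Record(schema);
        bob.put("firstName", "Bob");
        bob.put("lastName", "Smith");
        bob.put("phoneNumber", "555-555-1212");
        bob.put("age", "not a number"); // wrong type: the schema declares age as an int

        // validate() reports that the record does not conform to the schema.
        System.out.println(GenericData.get().validate(schema, bob)); // prints false

        // Serializing this record with a DatumWriter would also fail with a runtime
        // exception (the exact exception type depends on the Avro version).
    }
}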
The document https://avro.apache.org/docs/current/spec.html#Protocol+Declaration describes all of the supported types.
A fuller schema, with examples of default values, arrays, primitive types, records within records, enums, and more, is sketched below.
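The names and fields in this sketch are illustrative assumptions, not the exact schema from the original slide:

{
  "namespace": "com.example.avro",
  "type": "record",
  "name": "Employee",
  "fields": [
    {"name": "firstName", "type": "string"},
    {"name": "lastName", "type": "string"},
    {"name": "age", "type": "int", "default": 0},
    {"name": "status", "type": {"type": "enum", "name": "Status",
        "symbols": ["SALARY", "HOURLY", "PART_TIME"]}, "default": "SALARY"},
    {"name": "emails", "type": {"type": "array", "items": "string"}, "default": []},
    {"name": "phoneNumber", "type": ["null", {
        "type": "record",
        "name": "PhoneNumber",
        "fields": [
          {"name": "areaCode", "type": "string"},
          {"name": "number", "type": "string"}
        ]
      }], "default": null}
  ]
}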