Apache AVRO (Boston HUG, Jan 19, 2010)
1. Apache AVRO
What's new?
Philip Zeyliger, Cloudera
(AVRO committer)
Boston HUG
January 19, 2010
2. What's AVRO?
A data serialization system
Includes:
A schema language
A compact serialized form
An RPC framework
A handful of APIs, in a handful of languages
Goals:
Cross-language
Support for dynamic access
Simple but expressive schema evolution
Same "space" as Apache Thrift, Google Protocol Buffers,
Binary JSON, and XDR. Subtle differences with all of them.
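To make the "schema language" and "compact serialized form" bullets concrete, here is a minimal sketch, not the Avro library itself. It hand-encodes one record the way the Avro binary encoding does for these types: longs as zigzag varints, strings as a length prefix plus UTF-8 bytes, and record fields concatenated in schema order with no field names or tags. The User schema and the helper names (encode_long, encode_string, encode_record) are assumptions made up for this illustration; only "long" and "string" are handled.

```python
import json

# Hypothetical schema for illustration; it does not come from the talk.
SCHEMA = json.loads("""
{"type": "record", "name": "User",
 "fields": [{"name": "name", "type": "string"},
            {"name": "age",  "type": "long"}]}
""")

def encode_long(n: int) -> bytes:
    """Zigzag-encode a signed 64-bit integer, then write it as a base-128 varint."""
    z = (n << 1) ^ (n >> 63)  # zigzag: small magnitudes map to small unsigned values
    out = bytearray()
    while z & ~0x7F:          # more than 7 bits left: emit 7 bits and set the continuation bit
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)

def encode_string(s: str) -> bytes:
    """A string is its UTF-8 byte length (as a long) followed by the bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data

# Only the two primitive types used by the sketch schema are supported.
ENCODERS = {"long": encode_long, "string": encode_string}

def encode_record(schema: dict, datum: dict) -> bytes:
    """A record is its fields' encodings concatenated in schema order."""
    return b"".join(
        ENCODERS[field["type"]](datum[field["name"]])
        for field in schema["fields"]
    )

if __name__ == "__main__":
    payload = encode_record(SCHEMA, {"name": "Ada", "age": 36})
    print(payload.hex(), len(payload), "bytes")
```

Because the reader is expected to have the schema, the payload carries only values: the record above comes to 5 bytes, against roughly 25 for the equivalent JSON text.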
10. Hadoop Integration
Users
AvroInputFormat/AvroOutputFormat (MR-815)
Using AVRO in the shuffle (MR-1126)
Note that AVRO schemas let you specify sort order;
hand-written binary comparators are a thing of the past
Many Writables can be AVRO+Reflection instead
AVRO sort order leaves hand-writing RawComparators in
the past; for Streaming, you now get fast comparators for
free!
Framework
AVRO for Hadoop RPC (e.g., HDFS-982)
Goals
Open up protocols for cross-language use
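The sort-order point on this slide can be sketched as follows. In an Avro schema each record field may carry an "order" attribute ("ascending" by default, "descending", or "ignore"), and records compare field by field in schema order. The sketch below applies that rule to already-decoded Python dicts purely for illustration; the real benefit described in the talk comes from comparing the serialized bytes directly, which this sketch does not attempt. The Visit schema and the function names are hypothetical.

```python
import json
from functools import cmp_to_key

# Hypothetical schema for illustration; "order" is the per-field sort attribute.
SCHEMA = json.loads("""
{"type": "record", "name": "Visit",
 "fields": [{"name": "country", "type": "string"},
            {"name": "hits",    "type": "long",   "order": "descending"},
            {"name": "note",    "type": "string", "order": "ignore"}]}
""")

def compare_records(schema: dict, a: dict, b: dict) -> int:
    """Compare two decoded records field by field, in schema order."""
    for field in schema["fields"]:
        order = field.get("order", "ascending")
        if order == "ignore":
            continue  # ignored fields never influence the ordering
        x, y = a[field["name"]], b[field["name"]]
        if x != y:
            result = -1 if x < y else 1
            return -result if order == "descending" else result
    return 0  # all compared fields are equal

def sort_records(schema: dict, records: list) -> list:
    """Sort records using the ordering declared in the schema."""
    return sorted(records, key=cmp_to_key(lambda a, b: compare_records(schema, a, b)))

if __name__ == "__main__":
    rows = [
        {"country": "US", "hits": 1, "note": "a"},
        {"country": "FR", "hits": 5, "note": "z"},
        {"country": "US", "hits": 9, "note": "b"},
    ]
    for row in sort_records(SCHEMA, rows):
        print(row)
```

The ordering lives in the schema rather than in code, which is why a job (including a Streaming job) would not need its own comparator class.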
11. avro-tools
Available tools:
compile     Generates Java code for the given schema.
fragtojson  Renders a binary-encoded Avro datum as JSON.
fromjson    Reads JSON records and writes an Avro data file.
genavro     Generates a JSON schema from a GenAvro file.
getschema   Prints out schema of an Avro data file.
induce      Induce a schema/protocol from Java class/interface.
jsontofrag  Renders a JSON-encoded Avro datum as binary.
rpcreceive  Opens an HTTP RPC Server and listens for one message.
rpcsend     Sends a single RPC message.
tojson      Dumps an Avro data file as JSON, one record per line.
12. AVRO 1.3 to be released soon...
Good time to try it out!
What's evolving?
Trying not to evolve the serialized format.
APIs are evolving.
Transports are evolving.