2. Avro is...
● data serialization
● file format
● RPC format
3. Existing Serialization Systems:
Protocol Buffers & Thrift
● expressive
● efficient (small & fast)
● but not very dynamic
● cannot browse arbitrary data
● viewing a new datatype
– requires code generation & load
● writing a new datatype
– requires generating schema text
– plus code generation & load
4. Avro Serialization
● specifies a serialization format
● schema language is in JSON
● each language already has a JSON parser
● each language implements a data reader & writer
● in normal code
● code generation is optional
● sometimes useful in statically typed languages
● data is untagged
● schema required to read/write
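To make "untagged" concrete, here is a hypothetical Python sketch of the idea: fields are written in schema order with no names or tags in the byte stream, so the same schema is required to read the data back. (Real Avro uses zig-zag varint encoding; plain `struct` packing is used here only for brevity, and `SCHEMA`, `write_record`, and `read_record` are illustrative names, not Avro APIs.)

```python
import struct

# Illustrative (name, type) schema for the Block record used later in the talk.
SCHEMA = [("id", "string"), ("length", "int")]

def write_record(record, schema):
    # No field names or type tags are emitted -- only the values,
    # in schema order. The schema is the sole source of structure.
    out = b""
    for name, typ in schema:
        if typ == "int":
            out += struct.pack(">q", record[name])
        elif typ == "string":
            data = record[name].encode("utf-8")
            out += struct.pack(">q", len(data)) + data
    return out

def read_record(buf, schema):
    # Reading walks the same schema; without it the bytes are opaque.
    record, pos = {}, 0
    for name, typ in schema:
        if typ == "int":
            (record[name],) = struct.unpack_from(">q", buf, pos)
            pos += 8
        elif typ == "string":
            (n,) = struct.unpack_from(">q", buf, pos)
            pos += 8
            record[name] = buf[pos:pos + n].decode("utf-8")
            pos += n
    return record

encoded = write_record({"id": "blk_1", "length": 64}, SCHEMA)
print(read_record(encoded, SCHEMA))  # -> {'id': 'blk_1', 'length': 64}
```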
5. Avro Schema Evolution
● writer's schema always provided to reader
● so reader can compare:
● the schema used to write with
● the schema expected by the application
● fields that match (name & type) are read
● fields written that don't match are skipped
● expected fields not written can be identified
● same features as provided by numeric field ids
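The resolution rules above can be sketched in Python (a hypothetical illustration using plain dicts, not the real Avro resolver):

```python
def resolve(writer_fields, reader_fields, record):
    """Match writer's and reader's schemas field by field, by name."""
    writer = {f["name"]: f["type"] for f in writer_fields}
    reader = {f["name"]: f["type"] for f in reader_fields}
    result = {}
    for name, typ in reader.items():
        if name in writer and writer[name] == typ:
            result[name] = record[name]  # name & type match: read it
        else:
            result[name] = None          # expected but not written: identified
    # fields in `record` that the reader doesn't expect are simply skipped
    return result

writer = [{"name": "id", "type": "string"}, {"name": "length", "type": "int"}]
reader = [{"name": "id", "type": "string"}, {"name": "hosts", "type": "array"}]
print(resolve(writer, reader, {"id": "blk_1", "length": 64}))
# -> {'id': 'blk_1', 'hosts': None}
```

Because matching is by name, no numeric field ids need to be assigned or kept stable across versions.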
6. Avro JSON Schemas
// a simple three-element record
{"name": "Block", "type": "record",
 "fields": [
  {"name": "id", "type": "string"},
  {"name": "length", "type": "int"},
  {"name": "hosts", "type":
   {"type": "array", "items": "string"}}
 ]
}
// a linked list of strings or ints
{"name": "MyList", "type": "record",
 "fields": [
  {"name": "value", "type": ["string", "int"]},
  {"name": "next", "type": ["MyList", "null"]}
 ]
}
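Since the schema language is plain JSON, any language's stock JSON parser can load a schema with no Avro-specific tooling. A quick Python illustration (using the Block schema with valid Avro type names):

```python
import json

# Parse the Block schema with Python's built-in JSON parser.
block_schema = json.loads("""
{"name": "Block", "type": "record",
 "fields": [
  {"name": "id", "type": "string"},
  {"name": "length", "type": "int"},
  {"name": "hosts", "type": {"type": "array", "items": "string"}}
 ]
}
""")
print([f["name"] for f in block_schema["fields"]])
# -> ['id', 'length', 'hosts']
```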
7. Avro IDL Schemas
// a simple three-element record
record Block {
string id;
int length;
array<string> hosts;
}
// a linked list of strings or ints
record MyList {
union {string, int} value;
union {MyList, null} next;
}
8. Hadoop Data Formats
● Today, primarily
● text
– pro: interoperable
– con: not expressive, inefficient
● Java Writable
– pro: expressive, efficient
– con: platform-specific, fragile
9. Avro Data
● expressive
● small & fast
● dynamic
● schema stored with data
– but factored out of instances
● APIs permit reading & creating
– new datatypes without generating & loading code
10. Avro Data
● includes a file format
● replacement for SequenceFile
● includes a textual encoding
● handles versioning
● if schema changes
● can still process data
● hope Hadoop apps will
● upgrade from text; standardize on Avro for data
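The "schema stored with data, but factored out of instances" point can be sketched as a toy container file: the schema appears once in a header, records carry no per-instance schema, and any reader can recover the writer's schema from the file. (A hypothetical illustration only; the real Avro container format uses binary encoding, sync markers, and block compression.)

```python
import io
import json

def write_container(schema, records):
    # Header: 4-byte length + schema JSON, written exactly once.
    buf = io.BytesIO()
    header = json.dumps(schema).encode("utf-8")
    buf.write(len(header).to_bytes(4, "big"))
    buf.write(header)
    # Body: length-prefixed records; JSON stands in for Avro binary here.
    for rec in records:
        body = json.dumps(rec).encode("utf-8")
        buf.write(len(body).to_bytes(4, "big"))
        buf.write(body)
    return buf.getvalue()

def read_container(data):
    buf = io.BytesIO(data)
    n = int.from_bytes(buf.read(4), "big")
    schema = json.loads(buf.read(n))  # writer's schema, always available
    records = []
    while True:
        size = buf.read(4)
        if not size:
            break
        records.append(json.loads(buf.read(int.from_bytes(size, "big"))))
    return schema, records
```

Because the reader always gets the writer's schema from the header, the versioning story from slide 5 applies to files too: data written with an older schema remains processable.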
11. Avro MapReduce API
● Single-valued inputs and outputs
● key/value pairs required only for intermediate data
● map(IN, Collector<OUT>)
● map-only jobs never need to create k/v pairs
● map(IN, Collector<Pair<K,V>>)
● reduce(K, Iterable<V>, Collector<OUT>)
● if IN and OUT are pairs, default is sort
● In Avro trunk today, built on Hadoop 0.20 APIs.
● in the Avro 1.4.0 release next month
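The single-valued vs. pair-based distinction above can be sketched in miniature (Python standing in for the Java API on this slide; `run_map_only` and `run_map_reduce` are hypothetical driver names, not Avro APIs):

```python
def run_map_only(records, map_fn):
    # map(IN, Collector<OUT>): map-only jobs never build k/v pairs.
    out = []
    for rec in records:
        map_fn(rec, out.append)
    return out

def run_map_reduce(records, map_fn, reduce_fn):
    # map(IN, Collector<Pair<K,V>>): pairs exist only for the shuffle.
    grouped = {}
    for rec in records:
        map_fn(rec, lambda kv: grouped.setdefault(kv[0], []).append(kv[1]))
    # reduce(K, Iterable<V>, Collector<OUT>), keys visited in sorted order.
    out = []
    for key in sorted(grouped):
        reduce_fn(key, grouped[key], out.append)
    return out

# Word count as a usage example.
def wc_map(line, collect):
    for word in line.split():
        collect((word, 1))

def wc_reduce(word, counts, collect):
    collect((word, sum(counts)))

print(run_map_only(["a b"], lambda rec, collect: collect(rec.upper())))
print(run_map_reduce(["a b a"], wc_map, wc_reduce))
# -> ['A B']
# -> [('a', 2), ('b', 1)]
```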
13. Avro RPC
● leverage versioning support
● permit different versions of services to interoperate
● for Hadoop, will
● let apps talk to clusters running different versions
● provide cross-language access
15. Avro Status
● Current
● C, C++, Java, Python & Ruby APIs
● Interoperable RPC and data
● MapReduce API for Java
● Upcoming
● MapReduce APIs for other languages
– efficient, rich data
● RPC used in Flume, HBase, Cassandra, Hadoop, etc.
– inter-version compatibility
– non-Java clients