Avro is presented as a common data format that is expressive, efficient, dynamic, and standalone. It addresses the exploding combination of tools, data formats, and programming languages, which can otherwise require a new adapter for each pairing. Avro uses JSON schemas for interchange and an IDL for authoring static data types, supports schema evolution, and currently offers a MapReduce API in Java, with APIs for more languages planned. It is integrated with projects such as Pig, Hive, and Flume, and has conversion tools for formats like SequenceFile and Thrift.
Avro Data Format Provides Expressive, Efficient and Dynamic Solution
1. Avro Data
Doug Cutting
Cloudera & Apache
Avro, Nutch, Hadoop, Pig, Hive, HBase, Zookeeper, Whirr, Cassandra and Mahout are trademarks of the Apache Software Foundation
9. Today
● face an exploding combination of
  ● tools
  ● data formats
  ● programming languages
● may require a new adapter for each combination
● more tools and languages are good
  ● but more formats might not be
● Google claims benefits of a common format
10. Data Format Properties
● expressive
  ● supports complex, nested data structures
● efficient
  ● fast and small
● dynamic
  ● programs can process & define new datatypes
● file format
  ● standalone
  ● splittable, compressed, sortable
11. Data Format Comparison
                      CSV   XML/JSON   SequenceFile   Thrift & PB   Avro
language independent  yes   yes        no             yes           yes
expressive            no    yes        yes            yes           yes
efficient             no    no         yes            yes           yes
dynamic               yes   yes        no             no            yes
standalone            ?     yes        no             no            yes
splittable            ?     ?          yes            ?             yes
sortable              yes   ?          yes            no            yes
12. Avro
● specification-based design
  ● permits independent implementations
  ● schema in JSON to simplify impls
● dynamic implementations the norm
  ● static, codegen-based implementations too
● file format specified
  ● standalone, splittable, compressed
● efficient binary encoding
  ● factors schema out of instances
● sortable
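The "efficient binary encoding" rests in part on how Avro serializes integers: per the Avro specification, ints and longs are zig-zag mapped and then written as variable-length bytes, so small values of either sign take a single byte. A minimal stdlib-only sketch of that int encoding (my own illustration, not Avro's actual implementation class):

```java
import java.io.ByteArrayOutputStream;

public class ZigZagVarint {
    // Encode an int the way Avro's binary format does:
    // zig-zag map the sign bit into the low bit, then emit 7 bits
    // per byte, high bit set on every byte except the last.
    static byte[] encodeInt(int n) {
        // zig-zag: 0, -1, 1, -2, 2, ... -> 0, 1, 2, 3, 4, ...
        long z = ((long) n << 1) ^ (n >> 31);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((z & ~0x7FL) != 0) {        // more than 7 bits remain
            out.write((int) ((z & 0x7F) | 0x80));
            z >>>= 7;
        }
        out.write((int) z);
        return out.toByteArray();
    }
}
```

For example, 1 encodes as the single byte 0x02 and -1 as 0x01, which is why the format stays small for typical values.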
13. IDL Schemas
for authoring static datatypes
// a simple three-element record
record Block {
  string id;
  int length;
  array<string> hosts;
}

// a linked list of ints
record IntList {
  int value;
  union { null, IntList } next;
}
14. JSON Schemas
for interchange
// a simple three-element record
{"name": "Block", "type": "record",
 "fields": [
   {"name": "id", "type": "string"},
   {"name": "length", "type": "int"},
   {"name": "hosts",
    "type": {"type": "array", "items": "string"}}
 ]
}

// a linked list of ints
{"name": "IntList", "type": "record",
 "fields": [
   {"name": "value", "type": "int"},
   {"name": "next", "type": ["null", "IntList"]}
 ]
}
15. Dynamic Schemas
e.g., in Java
Schema block = Schema.createRecord("Block", "a block", null, false);
List<Field> fields = new ArrayList<Field>();
fields.add(new Field("id", Schema.create(Type.STRING), null, null));
fields.add(new Field("length", Schema.create(Type.INT), null, null));
fields.add(new Field("hosts",
    Schema.createArray(Schema.create(Type.STRING)), null, null));
block.setFields(fields);

Schema list = Schema.createRecord("IntList", "a list", null, false);
fields = new ArrayList<Field>();
fields.add(new Field("value", Schema.create(Type.INT), null, null));
fields.add(new Field("next",
    Schema.createUnion(Arrays.asList(Schema.create(Type.NULL), list)),
    null, null));
list.setFields(fields);
16. Avro Schema Evolution
● writer's schema always provided to reader
● so reader can compare:
  ● the schema used to write with
  ● the schema expected by the application
● fields that match (name & type) are read
● fields written that don't match are skipped
● expected fields not written can be identified
● same features as provided by numeric field ids
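The matching rules above can be sketched without the Avro library. This toy resolver takes writer and reader record schemas as field-name-to-type-name maps (my own simplification; real Avro resolution also handles type promotions, aliases, and default values) and reports which fields are read, skipped, or expected-but-missing:

```java
import java.util.*;

public class Resolve {
    // Hypothetical sketch of Avro-style schema resolution between a
    // writer's record schema and a reader's (not the real Avro API).
    static Map<String, List<String>> resolve(Map<String, String> writer,
                                             Map<String, String> reader) {
        List<String> read = new ArrayList<>();
        List<String> skipped = new ArrayList<>();
        List<String> missing = new ArrayList<>();
        for (Map.Entry<String, String> w : writer.entrySet()) {
            // a field matches when the reader declares the same name & type
            if (w.getValue().equals(reader.get(w.getKey())))
                read.add(w.getKey());
            else
                skipped.add(w.getKey());   // written but not expected
        }
        for (String r : reader.keySet())
            if (!writer.containsKey(r))
                missing.add(r);            // expected but never written
        Map<String, List<String>> result = new LinkedHashMap<>();
        result.put("read", read);
        result.put("skipped", skipped);
        result.put("missing", missing);
        return result;
    }
}
```

Because the writer's schema travels with the data, this comparison needs no numeric field ids: names and types alone tell the reader what to consume, what to skip, and what to default.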
17. Avro MapReduce API
● single-valued inputs and outputs
  ● key/value pairs only required for intermediate data
● map(IN, Collector<OUT>)
  ● map-only jobs never need to create k/v pairs
● map(IN, Collector<Pair<K,V>>)
● reduce(K, Iterable<V>, Collector<OUT>)
  ● if IN and OUT are pairs, default is sort
18. Avro Java MapReduce Example
public void map(String text, AvroCollector<Pair<String,Long>> c,
Reporter r) throws IOException {
StringTokenizer i = new StringTokenizer(text);
while (i.hasMoreTokens())
c.collect(new Pair<String,Long>(i.nextToken(), 1L));
}
public void reduce(String word, Iterable<Long> counts,
AvroCollector<Pair<String,Long>> c,
Reporter r) throws IOException {
long sum = 0;
for (long count : counts)
sum += count;
c.collect(new Pair<String,Long>(word, sum));
}
19. Avro Status
● Current
  ● APIs: C, C++, C#, Java, Python, PHP, Ruby
    – interoperable data & RPC
  ● Integration: Pig, Hive, Flume, Crunch, etc.
  ● Conversion: SequenceFile, Thrift, Protobuf
  ● Java MapReduce API
● Upcoming
  ● MapReduce APIs for more languages
    – efficient, rich data
20. Summary
● Ecosystem needs a common data format
  ● that's expressive, efficient, dynamic, etc.
● Avro meets this need
  ● but switching data formats is a slow process