Avro is presented as a common data format that is expressive, efficient, dynamic, and standalone. It addresses the exploding combination of tools, data formats, and programming languages, which can otherwise require a new adapter for each pairing. Avro uses JSON schemas for interchange and an IDL for authoring static data types, supports schema evolution, and currently offers a MapReduce API in Java, with APIs for more languages planned. It is integrated with projects such as Pig, Hive, and Flume, and has conversion tools for formats like SequenceFile and Thrift.
Avro Data Format Provides Expressive, Efficient and Dynamic Solution
1. Avro Data
Doug Cutting
Cloudera & Apache
Avro, Nutch, Hadoop, Pig, Hive, HBase, Zookeeper, Whirr, Cassandra and Mahout are trademarks of the Apache Software Foundation
9. Today
● face an exploding combination of
  ● tools
  ● data formats
  ● programming languages
● may require a new adapter for each combination
● more tools and languages are good
  ● but more formats might not be
● Google claims benefits of a common format
10. Data Format Properties
● expressive
  ● supports complex, nested data structures
● efficient
  ● fast and small
● dynamic
  ● programs can process & define new datatypes
● file format
  ● standalone
  ● splittable, compressed, sortable
11. Data Format Comparison
                      CSV   XML/JSON   SequenceFile   Thrift & PB   Avro
language independent  yes   yes        no             yes           yes
expressive            no    yes        yes            yes           yes
efficient             no    no         yes            yes           yes
dynamic               yes   yes        no             no            yes
standalone            ?     yes        no             no            yes
splittable            ?     ?          yes            ?             yes
sortable              yes   ?          yes            no            yes
12. Avro
● specification-based design
  ● permits independent implementations
  ● schema in JSON to simplify impls
● dynamic implementations the norm
  ● static, codegen-based implementations too
● file format specified
  ● standalone, splittable, compressed
● efficient binary encoding
  ● factors schema out of instances
● sortable
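The "efficient binary encoding" rests in part on how Avro serializes integers: per the Avro specification, ints and longs are zig-zag mapped and then written as variable-length bytes, so small values of either sign take a single byte. A minimal stdlib-only sketch of that int encoding (my own illustration, not Avro's actual implementation class):

```java
import java.io.ByteArrayOutputStream;

public class ZigZagVarint {
    // Encode an int the way Avro's binary format does:
    // zig-zag map the sign bit into the low bit, then emit 7 bits
    // per byte, high bit set on every byte except the last.
    static byte[] encodeInt(int n) {
        // zig-zag: 0, -1, 1, -2, 2, ... -> 0, 1, 2, 3, 4, ...
        long z = ((long) n << 1) ^ (n >> 31);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((z & ~0x7FL) != 0) {        // more than 7 bits remain
            out.write((int) ((z & 0x7F) | 0x80));
            z >>>= 7;
        }
        out.write((int) z);
        return out.toByteArray();
    }
}
```

For example, 1 encodes as the single byte 0x02 and -1 as 0x01, which is why the format stays small for typical values.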
13. IDL Schemas
for authoring static datatypes
// a simple three-element record
record Block {
  string id;
  int length;
  array<string> hosts;
}

// a linked list of ints
record IntList {
  int value;
  union { null, IntList } next;
}
14. JSON Schemas
for interchange
// a simple three-element record
{"name": "Block", "type": "record",
 "fields": [
   {"name": "id", "type": "string"},
   {"name": "length", "type": "int"},
   {"name": "hosts",
    "type": {"type": "array", "items": "string"}}
 ]
}

// a linked list of ints
{"name": "IntList", "type": "record",
 "fields": [
   {"name": "value", "type": "int"},
   {"name": "next", "type": ["null", "IntList"]}
 ]
}
15. Dynamic Schemas
e.g., in Java
Schema block = Schema.createRecord("Block", "a block", null, false);
List<Field> fields = new ArrayList<Field>();
fields.add(new Field("id", Schema.create(Type.STRING), null, null));
fields.add(new Field("length", Schema.create(Type.INT), null, null));
fields.add(new Field("hosts",
    Schema.createArray(Schema.create(Type.STRING)), null, null));
block.setFields(fields);

Schema list = Schema.createRecord("IntList", "a list", null, false);
fields = new ArrayList<Field>();
fields.add(new Field("value", Schema.create(Type.INT), null, null));
fields.add(new Field("next",
    Schema.createUnion(Arrays.asList(Schema.create(Type.NULL), list)),
    null, null));
list.setFields(fields);
16. Avro Schema Evolution
● writer's schema always provided to reader
● so reader can compare:
  ● the schema used to write with
  ● the schema expected by the application
● fields that match (name & type) are read
● fields written that don't match are skipped
● expected fields not written can be identified
● same features as provided by numeric field ids
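The matching rules above can be sketched without the Avro library. This toy resolver takes writer and reader record schemas as field-name-to-type-name maps (my own simplification; real Avro resolution also handles type promotions, aliases, and default values) and reports which fields are read, skipped, or expected-but-missing:

```java
import java.util.*;

public class Resolve {
    // Hypothetical sketch of Avro-style schema resolution between a
    // writer's record schema and a reader's (not the real Avro API).
    static Map<String, List<String>> resolve(Map<String, String> writer,
                                             Map<String, String> reader) {
        List<String> read = new ArrayList<>();
        List<String> skipped = new ArrayList<>();
        List<String> missing = new ArrayList<>();
        for (Map.Entry<String, String> w : writer.entrySet()) {
            // a field matches when the reader declares the same name & type
            if (w.getValue().equals(reader.get(w.getKey())))
                read.add(w.getKey());
            else
                skipped.add(w.getKey());   // written but not expected
        }
        for (String r : reader.keySet())
            if (!writer.containsKey(r))
                missing.add(r);            // expected but never written
        Map<String, List<String>> result = new LinkedHashMap<>();
        result.put("read", read);
        result.put("skipped", skipped);
        result.put("missing", missing);
        return result;
    }
}
```

Because the writer's schema travels with the data, this comparison needs no numeric field ids: names and types alone tell the reader what to consume, what to skip, and what to default.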
17. Avro MapReduce API
● single-valued inputs and outputs
  ● key/value pairs only required for intermediate data
● map(IN, Collector<OUT>)
  ● map-only jobs never need to create k/v pairs
● map(IN, Collector<Pair<K,V>>)
● reduce(K, Iterable<V>, Collector<OUT>)
  ● if IN and OUT are pairs, default is sort
18. Avro Java MapReduce Example
public void map(String text, AvroCollector<Pair<String,Long>> c,
Reporter r) throws IOException {
StringTokenizer i = new StringTokenizer(text);
while (i.hasMoreTokens())
c.collect(new Pair<String,Long>(i.nextToken(), 1L));
}
public void reduce(String word, Iterable<Long> counts,
AvroCollector<Pair<String,Long>> c,
Reporter r) throws IOException {
long sum = 0;
for (long count : counts)
sum += count;
c.collect(new Pair<String,Long>(word, sum));
}
19. Avro Status
● Current
  ● APIs: C, C++, C#, Java, Python, PHP, Ruby
    – interoperable data & RPC
  ● Integration: Pig, Hive, Flume, Crunch, etc.
  ● Conversion: SequenceFile, Thrift, Protobuf
  ● Java MapReduce API
● Upcoming
  ● MapReduce APIs for more languages
    – efficient, rich data
20. Summary
● Ecosystem needs a common data format
  ● that's expressive, efficient, dynamic, etc.
● Avro meets this need
  ● but switching data formats is a slow process