Avro
 Etymology & History
 Sexy Tractors
 Project Drivers & Overview
 Serialization
 RPC
 Hadoop Support
Etymology
 British aircraft manufacturer
 1910-1963
History
 Doug Cutting – Cloudera, Hadoop project founder
 2002 – Nutch
 2004 – Google GFS, MapReduce whitepapers
 2005 – NDFS & MR, Writable & SequenceFile
 2006 – Hadoop split from Nutch, renamed NDFS to HDFS
 2007 – Yahoo gets involved, HBase, Pig, ZooKeeper
 2008 – TeraSort contest winner, Hive, Mahout, Cassandra
 2009 – Oozie, Flume, Hue
History
 Underlying serialization system basically unchanged
 Additional language support and data formats
 Language, data format combinatorial explosion
    C++ JSON to Java BSON
    Python Smile to PHP CSV
 Apr 2009 – Avro proposal
 May 2010 – Top-level project
Sexy Tractors
 Data serialization tools, like tractors, aren’t sexy
 They should be!
 Dollar for dollar, storage capacity has increased exponentially, doubling every 1.5-2 years
 Throughput of magnetic storage and network has not maintained this pace
 Distributed systems are the norm
 Efficient data serialization techniques and tools are vital
Project Drivers
 Common data format for serialization and RPC
 Dynamic
 Expressive
 Efficient
 File format
    Well defined
    Standalone
    Splittable & compressed
Biased Comparison
                       CSV   XML/JSON   SequenceFile   Thrift & PB   Avro
Language independent   Yes   Yes        No             Yes           Yes
Expressive             No    Yes        Yes            Yes           Yes
Efficient              No    No         Yes            Yes           Yes
Dynamic                Yes   Yes        No             No            Yes
Standalone             ?     Yes        No             No            Yes
Splittable             ?     ?          Yes            ?             Yes
Project Overview
 Specification based design
 Dynamic implementations
 File format
 Schemas
    Must support JSON implementation
    IDL often supported
    Evolvable
 First class Hadoop support
Specification Based Design
 Schemas
 Encoding
 Sort order
 Object container files
 Codecs
 Protocol
 Protocol wire format
 Schema resolution
Specification Based Design
 Schemas
    Primitive types
        Null, boolean, int, long, float, double, bytes, string
    Complex types
      Records, enums, arrays, maps, unions and fixed

    Named types
      Records, enums, fixed
      Name & namespace

    Aliases
    http://avro.apache.org/docs/current/spec.html#schemas
Schema Example
log-message.avpr

{
    "namespace": "com.emoney",
    "name": "LogMessage",
    "type": "record",
    "fields": [
       {"name": "level", "type": "string", "comment" : "this is ignored"},
       {"name": "message", "type": "string", "description" : "this is the message"},
       {"name": "dateTime", "type": "long"},
       {"name": "exceptionMessage", "type": ["null", "string"]}
    ]
}
Specification Based Design
 Encodings
    JSON – for debugging
    Binary
 Sort order
    Efficient sorting by system other than writer
    Sorting binary-encoded data without deserialization
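To make the binary encoding concrete: the Avro specification writes int and long values as variable-length zig-zag integers, so small magnitudes (positive or negative) take a single byte. The following is a minimal sketch of that scheme using only the JDK. The class name `ZigZagVarint` and its methods are hypothetical names chosen for illustration and are not part of the Avro API.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper for illustration; not an Avro class.
final class ZigZagVarint {

    /** Encodes a long as zig-zag, then as a base-128 varint (low 7 bits first). */
    static byte[] encodeLong(long n) {
        // Zig-zag maps 0, -1, 1, -2, ... to 0, 1, 2, 3, ... so the sign lands in the low bit.
        long z = (n << 1) ^ (n >> 63);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80)); // high bit set: more bytes follow
            z >>>= 7;
        }
        out.write((int) z); // final byte has the high bit clear
        return out.toByteArray();
    }

    /** Decodes a single zig-zag varint from the start of buf. */
    static long decodeLong(byte[] buf) {
        long z = 0;
        int shift = 0;
        for (byte b : buf) {
            z |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break; // high bit clear marks the last byte
            }
            shift += 7;
        }
        return (z >>> 1) ^ -(z & 1); // undo the zig-zag mapping
    }
}
```

With this sketch, `ZigZagVarint.encodeLong(64)` yields the two bytes `80 01`, which matches the example table in the specification.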
Specification Based Design
 Object container files
    Schema
    Serialized data written to binary-encoded blocks
    Blocks may be compressed
    Synchronization markers
 Codecs
    Null
    Deflate
    Snappy (optional)
    LZO (future)
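As an illustration of the deflate codec listed above: the specification describes it as raw deflate (RFC 1951) applied to each data block, without the zlib header or checksum. The sketch below round-trips one block with the JDK's `java.util.zip` classes. `DeflateBlock` is a hypothetical name, and the real Avro codec classes handle buffering and levels in their own way.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper for illustration; not an Avro class.
final class DeflateBlock {

    /** Compresses one serialized block with raw deflate at the given level (1-9). */
    static byte[] compress(byte[] block, int level) {
        Deflater deflater = new Deflater(level, true); // true = no zlib header or checksum
        deflater.setInput(block);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Restores the original block bytes from raw deflate data. */
    static byte[] decompress(byte[] compressed) {
        Inflater inflater = new Inflater(true); // must match the raw format used above
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && inflater.needsInput()) {
                    break; // input exhausted; nothing more to produce
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid deflate data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }
}
```

Because each block is compressed on its own, a reader can skip to any block boundary and decompress from there, which is what keeps compressed files splittable.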
Specification Based Design
 Protocol
    Protocol name
    Namespace
    Types
        Named types used in messages
    Messages
        Uniquely named message
        Request
        Response
        Errors
 Wire format
   Transports
   Framing
   Handshake
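The framing layer named above is simple enough to sketch. Per the specification, a framed message is a sequence of buffers, each prefixed with a four-byte big-endian length, and the message ends with a zero-length buffer. The class `AvroFraming` below is a hypothetical, JDK-only illustration of that layout and is not the transport code shipped with Avro.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not an Avro class.
final class AvroFraming {

    /** Frames the buffers as [length][bytes]... followed by a zero length. */
    static byte[] frame(List<byte[]> buffers) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] b : buffers) {
            if (b.length == 0) {
                continue; // a zero length is reserved for the terminator
            }
            writeInt(out, b.length);
            out.write(b, 0, b.length);
        }
        writeInt(out, 0); // terminator
        return out.toByteArray();
    }

    /** Splits one framed message back into its buffers. */
    static List<byte[]> unframe(byte[] framed) {
        ByteBuffer in = ByteBuffer.wrap(framed); // ByteBuffer is big-endian by default
        List<byte[]> buffers = new ArrayList<>();
        while (true) {
            int length = in.getInt();
            if (length == 0) {
                return buffers;
            }
            byte[] b = new byte[length];
            in.get(b);
            buffers.add(b);
        }
    }

    private static void writeInt(ByteArrayOutputStream out, int v) {
        out.write(v >>> 24);
        out.write(v >>> 16);
        out.write(v >>> 8);
        out.write(v);
    }
}
```

Framing the single buffer `"hi"` this way produces ten bytes: `00 00 00 02 68 69 00 00 00 00`.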
Protocol
{
    "namespace": "com.acme",
    "protocol": "HelloWorld",
    "doc": "Protocol Greetings",

    "types": [
       {"name": "Greeting", "type": "record", "fields": [ {"name": "message", "type": "string"}]},
       {"name": "Curse", "type": "error", "fields": [ {"name": "message", "type": "string"}]} ],

    "messages": {
      "hello": {
        "doc": "Say hello.",
        "request": [{"name": "greeting", "type": "Greeting" }],
        "response": "Greeting",
        "errors": ["Curse"]
      }
    }
}
Schema Resolution & Evolution
 Writer's schema always provided to reader
 Compare schema used by writer & schema expected by reader
 Fields that match name & type are read
 Fields written that don’t match are skipped
 Expected fields not written can be identified
    Error or provide default value
 Same features as provided by numeric field IDs
    Keeps fields symbolic, no index IDs written in data
 Allows for projections
    Very efficient at skipping fields
 Aliases
    Allows projections from 2 different types using aliases
      User transaction
        Count, date
      Batch
        Count, date
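The resolution rules above can be sketched as a small model. This is a simplification under stated assumptions: records are plain maps keyed by field name, type matching and aliases are left out, and `ResolutionSketch` and `ReaderField` are hypothetical names. Avro itself resolves the two schemas while decoding rather than on materialized maps.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model for illustration; not how Avro represents schemas.
final class ResolutionSketch {

    /** A field the reader expects, with an optional default value. */
    static final class ReaderField {
        final String name;
        final boolean hasDefault;
        final Object defaultValue;

        private ReaderField(String name, boolean hasDefault, Object defaultValue) {
            this.name = name;
            this.hasDefault = hasDefault;
            this.defaultValue = defaultValue;
        }

        static ReaderField required(String name) {
            return new ReaderField(name, false, null);
        }

        static ReaderField withDefault(String name, Object value) {
            return new ReaderField(name, true, value);
        }
    }

    /** Projects a written record onto the fields the reader expects. */
    static Map<String, Object> resolve(Map<String, Object> written, List<ReaderField> expected) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (ReaderField field : expected) {
            if (written.containsKey(field.name)) {
                result.put(field.name, written.get(field.name)); // matched by name: read it
            } else if (field.hasDefault) {
                result.put(field.name, field.defaultValue);      // not written: use the default
            } else {
                throw new IllegalStateException("missing field with no default: " + field.name);
            }
        }
        return result; // written fields the reader did not ask for are skipped
    }
}
```

Reading a written `{level, message, dateTime}` record with a reader that expects `message` plus a defaulted `host` field would return only `message` and the default `host`, which is the projection behavior described on the slide.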
Implementations
   Core – parse schemas, read & write binary data for a schema
   Data file – read & write Avro data files
   Codec – supported codecs
   RPC/HTTP – make and receive calls over HTTP
Implementation   Core   Data file   Codec             RPC/HTTP
C                Yes    Yes         Deflate           Yes
C++              Yes    Yes         ?                 Yes
C#               Yes    No          N/A               No
Java             Yes    Yes         Deflate, Snappy   Yes
Python           Yes    Yes         Deflate           Yes
Ruby             Yes    Yes         Deflate           Yes
PHP              Yes    Yes         ?                 No
API
 Generic
    Generic attribute/value data structure
    Best suited for dynamic processing
 Specific
    Each record corresponds to a different kind of object in the programming language
    RPC systems typically use this
 Reflect
    Schemas generated via reflection
    Converting an existing codebase to use Avro
API
 Low-level
    Schema
    Encoders
    DatumWriter
    DatumReader
 High-level
    DataFileWriter
    DataFileReader
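Java Example
import java.io.FileOutputStream;
import java.io.OutputStream;
import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumWriter;

// Message is not defined on the slide; it is assumed to be a record class
// that GenericDatumWriter can write (for example, one generated from the schema).
Schema schema = new Schema.Parser().parse(getClass().getResourceAsStream("schema.avpr"));

OutputStream outputStream = new FileOutputStream("data.avro");

DataFileWriter<Message> writer =
        new DataFileWriter<Message>(new GenericDatumWriter<Message>(schema));

// Compress blocks with deflate at level 1; the codec must be set before create()
writer.setCodec(CodecFactory.deflateCodec(1));
writer.create(schema, outputStream);

writer.append(new Message());

// Flushes the last block and closes the underlying stream
writer.close();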
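Java Example
// The writer's schema is read from the file header, so none is passed here.
// Note: GenericDatumReader yields generic records; if Message is a generated
// class, SpecificDatumReader<Message> is likely the reader to use instead.
DataFileReader<Message> reader = new DataFileReader<Message>(
         new File("data.avro"),
         new GenericDatumReader<Message>());

try {
  for (Message next : reader) {
    System.out.println("next: " + next);
  }
} finally {
  reader.close();
}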
RPC
 Server
    SocketServer (non-standard)
    SaslSocketServer
    HttpServer
    NettyServer
    DatagramServer (non-standard)
 Responder
    Generic
    Reflect
    Specific
 Client
    Corresponding Transceiver
    LocalTransceiver
 Requestor
RPC
 Client
    Corresponding Transceiver for each server
    LocalTransceiver
 Requestor
RPC Server
Protocol protocol = Protocol.parse(new File("protocol.avpr"));

InetSocketAddress address = new InetSocketAddress("localhost", 33333);

GenericResponder responder = new GenericResponder(protocol) {
   @Override
   public Object respond(Protocol.Message message, Object request)
   throws Exception {
     ...
   }
};

new SocketServer(responder, address).join();
Hadoop Support
 File writers and readers
 Replacing RPC with Avro
    In Flume already
 Pig support is in
 Splittable
    Set block size when writing
 Tether jobs
    Connector framework for other languages
    Hadoop Pipes
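Splittability rests on the synchronization markers in the container file: a task handed an arbitrary byte offset can scan forward to the next marker and start reading whole blocks from there. The sketch below shows that scan over an in-memory byte array. `SyncSeek` and `nextBlockStart` are hypothetical names; in the Java library, positioning of this kind is exposed through `DataFileReader.sync(long)`.

```java
// Hypothetical helper for illustration; not an Avro class.
final class SyncSeek {

    /** Avro container files use a 16-byte sync marker between blocks. */
    static final int SYNC_SIZE = 16;

    /**
     * Returns the offset just past the first sync marker found at or after
     * 'from', or -1 if no marker follows. A split would begin reading there.
     */
    static int nextBlockStart(byte[] file, byte[] sync, int from) {
        for (int i = Math.max(from, 0); i + sync.length <= file.length; i++) {
            int j = 0;
            while (j < sync.length && file[i + j] == sync[j]) {
                j++;
            }
            if (j == sync.length) {
                return i + sync.length; // first byte after the marker
            }
        }
        return -1;
    }
}
```

A smaller block size places markers closer together, so splits waste less time scanning, at the cost of slightly more overhead per block.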
Future
 RPC
    HBase, Cassandra, Hadoop core
 Hive in progress
 Tether jobs
    Actual MapReduce implementations in other languages
Avro
 Dynamic
 Expressive
 Efficient
 Specification based design
 Language implementations are fairly solid
 Serialization or RPC or both
 First class Hadoop support
 Currently 1.5.1
 Sexy tractors
