Dabbling with Avro

Avro is a data serialization system with an impressive set of data structures. Avro relies on schemas: the schema is stored with the data when it is written, so the same schema is always available when the data is read or de-serialized. The serialized data is therefore completely self-describing, which is a really cool feature. In addition to this, serialization can be very fast. There are a few good examples and discussions on using Avro with Ruby. Check them out.

X.commerce uses Avro for defining message contracts, which makes it possible to describe and validate messages easily. However, the message contracts are defined as an Avro protocol and not directly as a schema. Avro supports RPC, where client and server exchange schemas through a handshake. But we don't want to do that. We want to parse the X.commerce contracts as Avro schemas.
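As a rough sketch of that idea: an Avro protocol file is JSON whose "types" array holds the named record definitions, so those definitions can be pulled out and re-serialized as a union schema. The protocol below (ProductEvents, com.example.contracts) and the helper name are hypothetical stand-ins, not an actual X.commerce contract, and only Ruby's standard json library is used.

```ruby
require 'json'

# Hypothetical protocol document, standing in for a real contract.
PROTOCOL = <<-JSON
{
  "protocol": "ProductEvents",
  "namespace": "com.example.contracts",
  "types": [
    {"type": "record", "name": "Product",
     "fields": [{"name": "id", "type": "string"}]},
    {"type": "record", "name": "Review",
     "fields": [{"name": "id", "type": "string"}]}
  ],
  "messages": {}
}
JSON

# Returns a JSON string holding a union (a JSON array) of the protocol's
# named types. The protocol namespace is copied onto each type that lacks
# one, so full names stay the same outside the protocol wrapper.
def protocol_types_as_schema_json(protocol_json)
  protocol = JSON.parse(protocol_json)
  namespace = protocol['namespace']
  types = protocol.fetch('types', []).map do |type|
    if namespace && !type.key?('namespace')
      type.merge('namespace' => namespace)
    else
      type
    end
  end
  JSON.generate(types)
end

schema_json = protocol_types_as_schema_json(PROTOCOL)
puts schema_json
```

With the avro gem loaded, the resulting string could then be handed to Avro::Schema.parse in place of a hand-written schema.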

I was dabbling with the Avro Ruby gem (https://github.com/apache/avro/tree/trunk/lang/ruby) to understand how it operates and how I can use it directly to serialize/de-serialize messages. I started with a sample schema:

SCHEMA = <<-JSON
[
  { "type": "record",
    "name": "Product",
    "fields": [
      {"name": "id", "type": "string"},
      {"name": "product_url", "type": "string"},
      {"name": "product_purchased", "type": "boolean", "default": false}
    ]},
  { "type": "record",
    "name": "Review",
    "fields": [
      {"name": "id", "type": "string"},
      {"name": "review_url", "type": "string"},
      {"name": "review_verified", "type": "boolean", "default": false}
    ]}
]
JSON

To serialize data with this schema, we can do the following:

require 'avro'

file = File.open('data.avr', 'wb')
schema = Avro::Schema.parse(SCHEMA)
writer = Avro::IO::DatumWriter.new(schema)

# The data file writer stores the schema in the file header and handles the
# binary encoding itself, so no separate BinaryEncoder is needed here.
dw = Avro::DataFile::Writer.new(file, writer, schema)
dw << {"id" => "product123", "product_url" => "http://ebay.com/some_product", "product_purchased" => true}
dw << {"id" => "review123", "review_url" => "http://yelp.com/some_review", "review_verified" => false}
dw.close

If you look at the schema, I have intentionally used "id" as a field name in both records (Product and Review). This is to illustrate what I think is a bad practice. While the fields are scoped to their particular record, it might be better to give each one a distinct identifier, as this helps us easily correlate the data during de-serialization, which we might do the following way.

# Open in binary mode; the writer's schema is read from the file header.
file = File.open('data.avr', 'rb')
reader = Avro::IO::DatumReader.new(nil, Avro::Schema.parse(SCHEMA))
dr = Avro::DataFile::Reader.new(file, reader)
dr.each { |record| p record }
dr.close

The output is -

$output -> 
{"id"=>"product123", "product_url"=>"http://ebay.com/some_product", "product_purchased"=>true}
{"id"=>"review123", "review_url"=>"http://yelp.com/some_review", "review_verified"=>false}
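
Since the records come back as plain hashes, nothing in them says whether a given one was a Product or a Review, and the shared "id" key does not help. A minimal sketch of telling them apart by a field that only one record defines follows; the record_kind helper and its symbols are hypothetical and not part of the Avro gem.

```ruby
# Hypothetical helper: guess which record of the union a decoded hash
# belongs to by looking for a field that only one of the records defines.
def record_kind(record)
  if record.key?('product_url')
    :product
  elsif record.key?('review_url')
    :review
  else
    :unknown
  end
end

# The same two hashes that were printed above.
records = [
  {"id" => "product123", "product_url" => "http://ebay.com/some_product", "product_purchased" => true},
  {"id" => "review123", "review_url" => "http://yelp.com/some_review", "review_verified" => false}
]

records.each { |record| puts "#{record_kind(record)}: #{record['id']}" }
```

Naming the fields product_id and review_id in the schema would make this kind of lookup unnecessary.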

If you try to serialize data that does not match the schema, it gets flagged immediately. This is the error dump. A better way to handle it is to catch the exception and raise a more descriptive error. Again, this is more for illustrative purposes:

Avro::IO::AvroTypeError (The datum {"id"=>"foo", "value"=>"1001010"} is not an example of schema'

With X.commerce contracts, read them as schemas and not as RPC protocols. I will post a working Rails example of a message console that sends and receives Avro messages using the above approach.