There are only three data serialization formats you should use: TSV, JSON, or Avro.
For data exploration, wherever reasonable use tab-separated values (TSV). Emit one record per line, ASCII-only, with no quoting, and html enttity escaping where necessary. The advantage of TSV:
-
you only have two control characters, NL and tab.
-
It’s restartable: a newline unambiguously signals the start of a record, so a corrupted record can’t cause major damage.
-
Ubiquitous: pig, Hadoop streaming, every database bulk import tool, and excel will all import it directly.
-
Ad-hoc: Most importantly, it works with the standard unix toolkit of cut, wc, etc. can use without decoding all fields
A stray quote mark can have the computer absorb the next MB of data I into one poor little field. Don’t do any quoting — just escaping
Disadvantages
-
Tabular data only with uniform schema
-
Second extraction step to get text field contents
I will cite as a disadvantage, but want to stress how small it is: data size. Compression takes care of,
Exercise: compare compression:
-
Format: TSV, JSO, Avro, packed binary strings
-
Compression: gz, bz2, LZO, snappy, native (block)
-
Measure: map input, map output, reduce sort, reduce output, replicate, final storage size.
best practice
-
Restartable - If you have a corrupt record, you only have to look for the next un-escaped newline
-
Keep regexes simple - Quotes and nested parens are hard,
Don’t use CSV — you’re sure to have some data with free form text, and then find yourself doing quoting acrobatics or using a different format.
Use TSV when
-
You’re
Don’t use it when
-
Structured data
-
Bulk documents
Individual JSON records stored one-per-line are your best bet for denormalized objects,
{"text":"","user":{"screen_name":"infochimps",...},...}
for documents,
{"title":"Gettysburg Address","body":"Four score and seven years ago...",...}
and for schema-free records:
{"_ts":"","":"","data":{"_type":"ev.hadoop_job",...}} {"_ts":"","":"","data":{"_type":"ev.web_form",...}}
TODO: receive should take an owner field
Receiver:
class Message def receive_user end end
class User def receive_ Edn
class Factory Def convert(obj) Correct type return it Look up factory Factory.receive obj End
def factory_for If factory then return it If symbol then get from identity map If string then Constantize it If hash then create model class, model_klass If array then union type end
Important that receeiver is not setter - floppy inassertive model with no sense of its own identity. Like the security line at the airport: horrible inexplicable things, more complicated then it seems like it should, but once past the security line you can g back to being a human again.
Do not make filedls that represent objects and keys — if there’s a user fields put it in user
, its id in user_id, and use virtual accessors to mask the difference.
that there is no essential difference among
File Format Schema API RPC (Remote Procedure Call) Definition JPG CREATE TABLE Twitter API HTML DTD db defn.
Avro says these are imperfect reflections for the same thing: a method to send data in space and time, to yourself or others. This is a very big idea [^1].
[1]
Not restartable.
Can’t represent a document with a null
XML is disastrous as a data transport format. It’s also widely used in enterprise systems and on the web, and so you will have to learn how to work with it. Wherever possible, implement decoupled code whose only job is to translate the XML into a sane format, and write all downstream code to work with that.
If you have to emit XML for downstream consumption, yet have any control over its structure, follow these best practices:
.Well-formed XML [This has been split across multiple lines, but in production eliminate the whitespace as well] -------------------- <post> <author><name>William Carlos William</name><id>88</id></author> <id>12345</id> <title>This is Just to Say</title> <dtstart>2012-04-26T12:34:56 CST</dtstart> <text>I have eaten the plums that were in the icebox and which you were probably saving for breakfast Forgive me they were delicious so sweet and so cold</text> <comments> <comment><commenter-name>Holly</commenter-name><id>98765</id><text>Your poem made up for it... <em>barely</em></text></comment> </comments> <replies></replies> </post> --------------------
The example to the side is pretty-printed for clarity; in production, you should eliminate the whitespace as well. Otherwise it does things correctly:
* Tags hold only values or other tags (not both).
* Values only appear in the contents of tags, and not the tag attributes.
* Text contents are fully encoded (<em>barely</em>
, not <em>barely</em>
), including whitespace (&10;
in the post text, not a literal newline). All the XML tags you see belong to the record.
* The nesting comments
tag makes clear that it is an array-of-length-one, in contrast to a singular property like author
. The replies
tag is present, representing an empty array, rather than being omitted.
* It has a predictable structure, making it easily `grep’able
If you have to write XML against a specific format, consider using a template language like erubis, moustache or the like. Before I learned this trick, I’d end up with a whole bunch of over-wrough soupy code just for the purpose of putting open and close tags in the right place. When 90% of the complexity is writing the XML and 10% is stuffing the values in there, you should put the code in the content and not the other way around.
.Well-formed XML [This has been split across multiple lines, but in production eliminate the whitespace as well] -------------------- <post> <author><name><%= record.author.name.to_xml %></name><id><%= record.author.id.to_xml %></id></author> <id><%= record.id.to_xml %></id> <title><%= record.title.to_xml %></title> <dtstart><%= record.created_at.to_xml %></dtstart> <text><%= record.text.to_xml %></text> <comments> <% record.comments.each do |comment| -%> <comment> <commenter-name><%= comment.commenter_name.to_xml %></commenter-name> <id><%= comment.id.to_xml %></id> <text><%= comment.text.to_xml %></text> </comment> <% end -%> </comments> <replies></replies> </post> --------------------
This template generates XML with a consistent structure. The <%= %>
blocks interpolate data — an equals-sign <%= %>
causes output, a plain <% %>
block is for control statements. When you’re inside a funny-braces block, you’re in ruby; everything else is literal content. Control blocks (like the <% record.comments.each do |comment| %>
) stamp out their contents as you’d expect.
XML is like the English Measurement System — just ubiquitous enough, and just barely useful enough, that it’s near-impossible to weed out. Neither, however, is anymore justifiable for use by the professional practitioner. XML is both too extensible and too limited to map smoothly to and from the data structures languages use in practive. In the case that you need to make the case against XML to a colleague, I arm you with the following List of Grievances:
-
Does not preserve simple types: the only primitive data type is a string; there’s no standard way to distinguish an integer, a floating-point number, or a date without external hints.
-
Does not preserve complex types: You will find data stored with
-
mixed attributes and data:
-
<post date="2012-01-02T12:34:56 CST" author="William Carlos Williams"> <body>I have eaten the plums that were in...</body> </post>
-
mixed data and text:
<post> <title>This is Just to Say</title> I have eaten the plums that ... you were probably saving for <span dtstart="2012-04-27T08:00:00 CST">breakfast</span>. Forgive me ... </post>
-
Inconsistent cardinality: In this example, there’s no way to distinguish a singular property like
title
from a list-of-length-one likecomment
; simple XML readers will return`{"title":"…", "comment":{"text":"…"}}` when there is one, and{"title":"…", "comment":[{"text":"…"},{"text":"…"}]}
when there are many.
<post> <title>This is just to say</title> <comment><text>Your poem made up for it</text></comment> </post>
-
Not restartable.: you can only properly understand an XML file by reading it from beginning to end. CDATA blocks are especially treacherous; they can in principle hold nearly anything, including out-of-band XML.
-
Not unique: even are multiple ways to represent the same final context. An apostrophe might be represented directly (
'
), hex-encoded ('
), decimal-encoded ('
), or as an SGML [2] entity ('
). (You may even find people using SGML entities in the absence of the DTD [3] that is technically required to interpret them.) -
Complex: the technical standard for XML is fiendishly complex, and even mature libraries in widespread use still report bugs parsing complex or ill-formed input.
Attributes, CDATA, model boundaries, document text
If you do it, consider emitting not with a serde but with a template engine. Pretty-print fields so can use cmdline tools
Like most Semantic-Web developed technology, N3 is antagonistic to thought and action.
If you must deal with this, pretty-print the fields and ensure delimiters are clean.
WALKTHROUGH: converting the weather fields.
Flat formats are surprisingly innocuous; it’s the contortions they force upon their tender that hurts.
Straightforward to build a regexp. Wukong gives you a flatpack stringifier. Specify a format string as follows:
"%4d%3.2f\"%r{([^\"]+)}\""
It returns a MatchData object (same as a regexp does).
9999 as null (or other out-of-band): Override the receive_xxx method to knock those out, call super.
To handle the elevation fields, override the receive method:
Note that we call super first here , because we want an int to divide; in the previous case, we want to catch 9999 before it goes in for conversion. Wukong has some helpers for unit conversion too.
WALKTHROUGH: apache web logs of course. - Regexp to tuple. Just capture substructure
All of the following examples could be ambiguously referred to as "encoding":
-
Compression: gz, LZO, Snappy, etc
-
Serialization: the actual record container
-
Stringifying: conversion to JSON, TSV, etc.
-
Glyphing: binary stream (for example UTF8-encoded Unicode) to characters (TODO: make this correct)
-
Structured to model[4]
Our best advice is
-
Never let anything into your system unless it is UTF8, UTF-16, or ASCII.
-
Either:
-
Only transmit 7-bit ASCII characters in the range 0x20 (space) to 0x126 (~), along with 0x0a (newline) and 0x09 (tab) but only when used as record (newline) or field (tab) separators. URL encoding, JSON encoding, and HTML entity encoding are all reasonable. HTML entity encoding has the charm of leaving simple international text largely readable: "café\;" or "Möaut\;torhëaut\;ad" are more easily scannable than "caf\XX". Be warned that unless you exercise care all three can be ambiguous: é\;, (that in decimal) and (that in hex) are all the same.to make life grep’able, force your converter to emit exactly one string for any given glyph — That is, it will not ship "0x32" for "a", and it will not ship "é" for "\XX"
-
Use unix-style newlines only.
-
Even With unique glyph coding, Unicode is still not unique: edge cases involving something something diacritic modifiers.
-
However complex you think Unicode is, it’s slightly more hairy than that.
-
URL encoding only makes sense when you’re shipping urls anyway.
-
TODO: check those character strings for correctness. Also, that I’m using "glyph" correctly
-