Organizing Data

There are only three data serialization formats you should use: TSV, JSON, or Avro.

Good Format 1: TSV (It’s simple)

For data exploration, use tab-separated values (TSV) wherever reasonable. Emit one record per line, ASCII-only, with no quoting, and with HTML entity escaping where necessary. The advantages of TSV:

  • Simple: you only have two control characters, newline and tab.

  • It’s restartable: a newline unambiguously signals the start of a record, so a corrupted record can’t cause major damage.

  • Ubiquitous: Pig, Hadoop streaming, every database bulk import tool, and Excel will all import it directly.

  • Ad-hoc: Most importantly, it works with the standard unix toolkit of cut, wc, etc., which you can use without decoding all the fields.

In a quoted format, a stray quote mark can have the computer absorb the next MB of data into one poor little field. Don’t do any quoting, just escaping.
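The "no quoting, just escaping" rule can be sketched in a few lines of Ruby. The helper names here (`tsv_escape`, `to_tsv_line`, `from_tsv_line`) are hypothetical, and escaping only tab, newline, carriage return, and ampersand as HTML entities is an assumption about what "where necessary" means for your data.

```ruby
# Hypothetical helpers: escape only the characters that would break a TSV record.
TSV_ESCAPES = { "&" => "&amp;", "\t" => "&#9;", "\n" => "&#10;", "\r" => "&#13;" }.freeze
TSV_UNESCAPES = TSV_ESCAPES.invert.freeze

def tsv_escape(value)
  value.to_s.gsub(/[&\t\n\r]/, TSV_ESCAPES)
end

def tsv_unescape(field)
  # Single pass, so an escaped ampersand is never decoded twice.
  field.gsub(/&amp;|&#9;|&#10;|&#13;/, TSV_UNESCAPES)
end

# One record per line: fields joined by tabs, nothing quoted.
def to_tsv_line(fields)
  fields.map { |field| tsv_escape(field) }.join("\t")
end

# The -1 limit keeps trailing empty fields.
def from_tsv_line(line)
  line.chomp.split("\t", -1).map { |field| tsv_unescape(field) }
end
```

Because a literal tab or newline can never appear inside a field, `cut`, `wc -l`, and friends keep working on the escaped file without decoding anything.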

Disadvantages

  • Tabular data only with uniform schema

  • Second extraction step to get text field contents

I will cite one more disadvantage, but want to stress how small it is: data size. Compression takes care of most of the difference.

Exercise: compare compression:

  • Format: TSV, JSON, Avro, packed binary strings

  • Compression: gz, bz2, LZO, snappy, native (block)

  • Measure: map input, map output, reduce sort, reduce output, replicate, final storage size.
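As a starting point for the exercise, here is a small Ruby sketch that covers only one corner of it: TSV versus one-per-line JSON, raw versus deflate-compressed (the algorithm behind gz), using the standard `zlib` and `json` libraries. The sample records and helper names are made up for illustration; real measurements should use your own data and the other formats and codecs listed above.

```ruby
require "json"
require "zlib"

# Made-up sample records, purely for illustration.
def sample_records(count)
  (1..count).map { |i| { "id" => i, "name" => "user_#{i}", "score" => (i * 37) % 1000 } }
end

def as_tsv(records)
  records.map { |record| record.values.join("\t") }.join("\n") + "\n"
end

def as_json_lines(records)
  records.map { |record| JSON.generate(record) }.join("\n") + "\n"
end

# Raw and deflate-compressed byte counts for one serialized string.
def size_report(serialized)
  { raw: serialized.bytesize, compressed: Zlib::Deflate.deflate(serialized).bytesize }
end

records = sample_records(10_000)
{ "TSV" => as_tsv(records), "JSON" => as_json_lines(records) }.each do |name, text|
  report = size_report(text)
  puts format("%-5s raw=%8d  compressed=%8d", name, report[:raw], report[:compressed])
end
```

The thing to look at is how much of the raw size gap between the two formats survives compression.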

Best practices

  • Restartable - If you have a corrupt record, you only have to look for the next un-escaped newline

  • Keep regexes simple - Quotes and nested parens are hard to match reliably

Don’t use CSV — you’re sure to have some data with free form text, and then find yourself doing quoting acrobatics or using a different format.

Use TSV when

  • You’re

Don’t use it when

  • Structured data

  • Bulk documents

Good Format 2: JSON (It’s Generic and Ubiquitous)

Individual JSON records stored one-per-line are your best bet for denormalized objects,

{"text":"","user":{"screen_name":"infochimps",...},...}

for documents,

{"title":"Gettysburg Address","body":"Four score and seven years ago...",...}

and for schema-free records:

{"_ts":"","":"","data":{"_type":"ev.hadoop_job",...}}
{"_ts":"","":"","data":{"_type":"ev.web_form",...}}
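Reading one-per-line JSON can be sketched as follows, using only Ruby's standard `json` library. The method name `read_json_lines` is hypothetical. Since each record sits on its own line, a corrupt record costs you that one line and nothing more.

```ruby
require "json"
require "stringio"

# Parse newline-delimited JSON from any IO-like object.
# Returns [records, bad_line_numbers].
def read_json_lines(io)
  records = []
  bad_lines = []
  io.each_line.with_index(1) do |line, lineno|
    next if line.strip.empty?
    begin
      records << JSON.parse(line)
    rescue JSON::ParserError
      bad_lines << lineno   # skip the corrupt record and carry on
    end
  end
  [records, bad_lines]
end

input = StringIO.new(<<~LINES)
  {"title":"Gettysburg Address","body":"Four score and seven years ago"}
  {"title": this line is corrupt
  {"data":{"_type":"ev.web_form"}}
LINES
records, bad_lines = read_json_lines(input)
puts "parsed #{records.length} records, skipped lines #{bad_lines.inspect}"
```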

structured to model.

TODO: receive should take an owner field

Receiver:

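class Message
  def receive_user
  end
end

class User
  def receive_
  end
end

class Factory
  # Turn a raw object into the type this factory produces.
  def convert(obj)
    # if obj is already the correct type, return it
    # otherwise look up the factory for it
    # and hand the object over: factory.receive(obj)
  end

  # Find the factory for a type description.
  def factory_for
    # if it is already a factory, return it
    # if a Symbol, get it from the identity map
    # if a String, constantize it
    # if a Hash, create a model class (model_klass)
    # if an Array, build a union type
  end
end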

It is important that the receiver is not a setter: a model built from bare setters is floppy and unassertive, with no sense of its own identity. Receiving is like the security line at the airport: horrible, inexplicable things happen there, and it is more complicated than it seems it should be, but once past the security line you can go back to being a human again.

Do not make fields that represent both objects and keys. If there is a user field, put the object in user and its id in user_id, and use virtual accessors to mask the difference.
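Here is one way the receiver idea and the user / user_id advice could look in plain Ruby. This is a sketch under assumptions, not the book's implementation: the `receive_<field>` dispatch, the constructor signatures, and the choice to store only the `User` and derive `user_id` from it are all illustrative.

```ruby
class User
  attr_reader :id, :screen_name

  def initialize(id:, screen_name: nil)
    @id = id
    @screen_name = screen_name
  end
end

class Message
  attr_reader :text, :user

  # The "security line": raw attributes come in, each one is routed through
  # a receive_<field> method, and unknown fields are ignored.
  def self.receive(attrs)
    message = new
    attrs.each do |key, value|
      handler = "receive_#{key}"
      message.send(handler, value) if message.respond_to?(handler, true)
    end
    message
  end

  def receive_text(value)
    @text = value.to_s
  end

  # Accept either a User or a hash describing one; always store a User.
  def receive_user(value)
    @user = value.is_a?(User) ? value : User.new(id: value["id"], screen_name: value["screen_name"])
  end

  # A bare key arrives: keep it as a User stub so there is one source of truth.
  def receive_user_id(value)
    @user = User.new(id: value) unless @user && @user.id == value
  end

  # Virtual accessor: user_id is derived from the user object.
  def user_id
    user && user.id
  end
end
```

Downstream code can then ask for `message.user` or `message.user_id` without caring which of the two the raw record happened to carry.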

Good Format 3: Avro (It does everything right)

Avro’s premise is that there is no essential difference among

File Format    Schema          API            RPC (Remote Procedure Call) Definition
JPG            CREATE TABLE    Twitter API
HTML DTD       db defn.
Avro says these are imperfect reflections of the same thing: a method to send data in space and time, to yourself or others. This is a very big idea [1].

Other reasonable choices: tagged net strings and null-delimited documents

Not restartable.

Can’t represent a document with a null

Crap Format #1: XML

XML is disastrous as a data transport format. It’s also widely used in enterprise systems and on the web, and so you will have to learn how to work with it. Wherever possible, implement decoupled code whose only job is to translate the XML into a sane format, and write all downstream code to work with that.

Writing XML

If you have to emit XML for downstream consumption, yet have any control over its structure, follow these best practices:

.Well-formed XML
[This has been split across multiple lines, but in production eliminate the whitespace as well]
--------------------
<post>
  <author><name>William Carlos Williams</name><id>88</id></author>
  <id>12345</id>
  <title>This is Just to Say</title>
  <dtstart>2012-04-26T12:34:56 CST</dtstart>
  <text>I have eaten&#10;the plums&#10;that were in&#10;the icebox&#10;&#10;and which&#10;you were probably&#10;saving&#10;for breakfast&#10;&#10;Forgive me&#10;they were delicious&#10;so sweet&#10;and so cold</text>
  <comments>
    <comment><commenter-name>Holly</commenter-name><id>98765</id><text>Your poem made up for it...  &lt;em&gt;barely&lt;/em&gt;</text></comment>
  </comments>
  <replies></replies>
</post>
--------------------

The example above is pretty-printed for clarity; in production, you should eliminate the whitespace as well. Otherwise it does things correctly:

  • Tags hold only values or other tags (not both).

  • Values only appear in the contents of tags, and not in tag attributes.

  • Text contents are fully encoded (&lt;em&gt;barely&lt;/em&gt;, not <em>barely</em>), including whitespace (&#10; in the post text, not a literal newline). All the XML tags you see belong to the record.

  • The nesting comments tag makes clear that it is an array-of-length-one, in contrast to a singular property like author. The replies tag is present, representing an empty array, rather than being omitted.

  • It has a predictable structure, making it easily grep’able.
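The "fully encoded" rule for text contents can be sketched in Ruby as below. The names `xml_text` and `xml_tag` are hypothetical, and the set of characters escaped (the three markup characters plus newline, carriage return, and tab) is an assumption that covers element contents only, not attribute values.

```ruby
# Characters that must not appear literally inside element text, if every
# record is to stay on one line and every tag is to belong to the record.
XML_TEXT_ESCAPES = {
  "&" => "&amp;", "<" => "&lt;", ">" => "&gt;",
  "\n" => "&#10;", "\r" => "&#13;", "\t" => "&#9;"
}.freeze

def xml_text(value)
  value.to_s.gsub(/[&<>\n\r\t]/, XML_TEXT_ESCAPES)
end

# Wrap an already-trusted tag name around escaped contents.
def xml_tag(name, value)
  "<#{name}>#{xml_text(value)}</#{name}>"
end

puts xml_tag("text", "I have eaten\nthe plums")
puts xml_tag("text", "Your poem made up for it...  <em>barely</em>")
```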

If you have to write XML against a specific format, consider using a template language like erubis, mustache or the like. Before I learned this trick, I’d end up with a whole bunch of overwrought, soupy code just for the purpose of putting open and close tags in the right place. When 90% of the complexity is writing the XML and 10% is stuffing the values in there, you should put the code in the content and not the other way around.

.XML template
[This has been split across multiple lines, but in production eliminate the whitespace as well]
--------------------
<post>
  <author><name><%= record.author.name.to_xml %></name><id><%= record.author.id.to_xml %></id></author>
  <id><%=      record.id.to_xml         %></id>
  <title><%=   record.title.to_xml      %></title>
  <dtstart><%= record.created_at.to_xml %></dtstart>
  <text><%=    record.text.to_xml       %></text>
  <comments>
    <% record.comments.each do |comment| -%>
    <comment>
      <commenter-name><%= comment.commenter_name.to_xml %></commenter-name>
      <id><%=   comment.id.to_xml %></id>
      <text><%= comment.text.to_xml %></text>
    </comment>
    <% end -%>
  </comments>
  <replies></replies>
</post>
--------------------

This template generates XML with a consistent structure. The <%= %> blocks interpolate data — an equals-sign <%= %> causes output, a plain <% %> block is for control statements. When you’re inside a funny-braces block, you’re in ruby; everything else is literal content. Control blocks (like the <% record.comments.each do |comment| %>) stamp out their contents as you’d expect.
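To see such a template actually run, here is a cut-down sketch using `ERB` from Ruby's standard library rather than erubis. The escaping helper `x` stands in for the `to_xml` calls above and is hypothetical, as is the record layout (a plain hash with `:title` and `:comments`).

```ruby
require "erb"

# Hypothetical stand-in for the to_xml calls in the template above.
def x(value)
  value.to_s.gsub(/[&<>\n]/, "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", "\n" => "&#10;")
end

# trim_mode "-" lets <%- ... -%> control lines vanish from the output.
POST_TEMPLATE = ERB.new(<<~TEMPLATE, trim_mode: "-")
  <post>
    <title><%= x(record[:title]) %></title>
    <comments>
      <%- record[:comments].each do |comment| -%>
      <comment><text><%= x(comment[:text]) %></text></comment>
      <%- end -%>
    </comments>
  </post>
TEMPLATE

def render_post(record)
  POST_TEMPLATE.result_with_hash(record: record)
end

puts render_post({ title: "This is Just to Say",
                   comments: [{ text: "Your poem made up for it...  <em>barely</em>" }] })
```

Note that an empty comments list still renders the enclosing comments tag, which is the empty-array behavior recommended earlier.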

Airing of Grievances

XML is like the English Measurement System: just ubiquitous enough, and just barely useful enough, that it’s near-impossible to weed out. Neither, however, is any longer justifiable for use by the professional practitioner. XML is both too extensible and too limited to map smoothly to and from the data structures languages use in practice. In the case that you need to make the case against XML to a colleague, I arm you with the following List of Grievances:

  • Does not preserve simple types: the only primitive data type is a string; there’s no standard way to distinguish an integer, a floating-point number, or a date without external hints.

  • Does not preserve complex types: You will find data stored with

    • mixed attributes and data:

<post date="2012-01-02T12:34:56 CST" author="William Carlos Williams">
<body>I have eaten&#10;the plums&#10;that were in...</body>
</post>
    • mixed data and text:

<post>
<title>This is Just to Say</title>
I have eaten the plums that ... you were probably saving for <span dtstart="2012-04-27T08:00:00 CST">breakfast</span>. Forgive me ...
</post>
  • Inconsistent cardinality: In the following example, there’s no way to distinguish a singular property like title from a list-of-length-one like comment; simple XML readers will return {"title":"…​", "comment":{"text":"…​"}} when there is one comment, and {"title":"…​", "comment":[{"text":"…​"},{"text":"…​"}]} when there are many.

<post>
  <title>This is just to say</title>
  <comment><text>Your poem made up for it</text></comment>
</post>
  • Not restartable: you can only properly understand an XML file by reading it from beginning to end. CDATA blocks are especially treacherous; they can in principle hold nearly anything, including out-of-band XML.

  • Not unique: there are multiple ways to represent the same final content. An apostrophe might be represented directly ('), hex-encoded (&#x0027;), decimal-encoded (&#39;), or as an SGML [2] entity (&apos;). (You may even find people using SGML entities in the absence of the DTD [3] that is technically required to interpret them.)

  • Complex: the technical standard for XML is fiendishly complex, and even mature libraries in widespread use still report bugs parsing complex or ill-formed input.
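The cardinality grievance is the one you can most easily repair in your translation layer. A minimal sketch in Ruby, assuming your XML reader hands back plain hashes like the ones shown above, and assuming you know from the schema which fields are lists (the `LIST_FIELDS` constant here is hypothetical):

```ruby
# Fields that are lists by definition, even when only one item is present.
LIST_FIELDS = %w[comment].freeze   # hypothetical: depends on your schema

def wrap_list(value)
  case value
  when nil   then []        # tag omitted entirely
  when Array then value     # reader saw several
  else            [value]   # reader saw exactly one
  end
end

def normalize_cardinality(record, list_fields = LIST_FIELDS)
  normalized = record.dup
  list_fields.each { |field| normalized[field] = wrap_list(record[field]) }
  normalized
end

p normalize_cardinality({ "title" => "This is just to say",
                          "comment" => { "text" => "Your poem made up for it" } })
```

`Array(value)` would be the tempting shortcut, but it turns a single hash into an array of key-value pairs, so the explicit `case` is safer here.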

Attributes, CDATA, model boundaries, document text

If you must emit XML, consider doing so with a template engine rather than a serde. Pretty-print the fields so that you can use command-line tools on the output.

Crap Format #2: N3 triples

Like most Semantic-Web developed technology, N3 is antagonistic to thought and action.

If you must deal with this, pretty-print the fields and ensure delimiters are clean.

Crap Format #3: Flat format

WALKTHROUGH: converting the weather fields.

Flat formats are surprisingly innocuous; it’s the contortions they force upon their tender that hurt.

It is straightforward to build a regexp. Wukong gives you a flatpack stringifier. Specify a format string as follows:

"%4d%3.2f\"%r{([^\"]+)}\""

It returns a MatchData object (same as a regexp does).

9999 as null (or other out-of-band values): override the receive_xxx method to knock those out, then call super.

To handle the elevation fields, override the receive method:

Note that we call super first here, because we want an int to divide; in the previous case, we want to catch 9999 before it goes in for conversion. Wukong has some helpers for unit conversion too.
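The two orderings (catch the sentinel before conversion, convert before scaling) can be illustrated without Wukong, using a plain regexp. Everything about the record layout below is hypothetical: a 6-character station id, a 5-character signed elevation in decimeters, and a 4-character temperature where 9999 means missing. It is not the actual weather format and not Wukong's flatpack API.

```ruby
# Hypothetical fixed-width layout, for illustration only.
FLAT_RECORD = /\A(?<station>.{6})(?<elevation>[+-]\d{4})(?<temp>\d{4})\z/

# Knock out the out-of-band value while the field is still a string.
def missing_to_nil(raw, sentinel = "9999")
  raw == sentinel ? nil : raw
end

def parse_flat(line)
  match = FLAT_RECORD.match(line.chomp)
  return nil unless match

  temp = missing_to_nil(match[:temp])   # sentinel check comes before conversion
  {
    station: match[:station].strip,
    elevation_m: Integer(match[:elevation], 10) / 10.0,   # convert first, then scale
    temp: temp && Integer(temp, 10)
  }
end

p parse_flat("KAUS  +01519999")
p parse_flat("KSFO  -00040021")
```

Passing base 10 to `Integer` matters: without it, a zero-padded field such as "0021" would be read as octal.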

Web log and Regexpable

WALKTHROUGH: apache web logs of course. - Regexp to tuple. Just capture substructure
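As a sketch of "regexp to tuple", here is a Ruby parser for the Apache Common Log Format. The `LogLine` struct and `parse_log_line` name are hypothetical, and the regexp handles only the plain common format, not the combined format or custom LogFormat strings.

```ruby
# Apache Common Log Format: host ident user [time] "request" status bytes
COMMON_LOG = /\A(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\d+|-)\z/

LogLine = Struct.new(:host, :ident, :user, :time, :verb, :path, :protocol, :status, :bytes)

def parse_log_line(line)
  match = COMMON_LOG.match(line.chomp)
  return nil unless match

  host, ident, user, time, request, status, bytes = match.captures
  verb, path, protocol = request.split(" ", 3)   # capture the substructure
  LogLine.new(host, ident, user, time, verb, path, protocol,
              status.to_i, bytes == "-" ? nil : bytes.to_i)
end

entry = parse_log_line('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326')
puts entry.to_a.join("\t")
```

Joining the tuple with tabs, as in the last line, gets you from a web log to TSV in one step.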

Glyphing (string encoding), Unicode, UTF-8

All of the following examples could be ambiguously referred to as "encoding":

  • Compression: gz, LZO, Snappy, etc

  • Serialization: the actual record container

  • Stringifying: conversion to JSON, TSV, etc.

  • Glyphing: binary stream (for example UTF8-encoded Unicode) to characters (TODO: make this correct)

  • Structured to model[4]

Our best advice is

  • Never let anything into your system unless it is UTF-8, UTF-16, or ASCII.

  • Either:

    • Only transmit 7-bit ASCII characters in the range 0x20 (space) to 0x7e (~), along with 0x0a (newline) and 0x09 (tab), and those two only when used as record (newline) or field (tab) separators. URL encoding, JSON encoding, and HTML entity encoding are all reasonable. HTML entity encoding has the charm of leaving simple international text largely readable: "caf&eacute;" or "M&ouml;torh&euml;ad" are more easily scannable than "caf\XX". Be warned that unless you exercise care all three can be ambiguous: &eacute;, &#233; (that in decimal) and &#xE9; (that in hex) are all the same. To make life grep’able, force your converter to emit exactly one string for any given glyph. That is, it will not ship "0x61" for "a", and it will not ship "é" for "\XX".

    • Use unix-style newlines only.

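    • Even with unique glyph coding, Unicode is still not unique: there are edge cases involving combining diacritical marks, where the same visible character can be either one precomposed code point or a base letter followed by a combining mark.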

    • However complex you think Unicode is, it’s slightly more hairy than that.

    • URL encoding only makes sense when you’re shipping urls anyway.

    • TODO: check those character strings for correctness. Also, that I’m using "glyph" correctly
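One way to get "exactly one string for any given glyph" is sketched below in Ruby. It makes two assumptions that go beyond the advice above: it normalizes to Unicode NFC first (so precomposed and combining forms agree), and it always emits decimal character references rather than named entities, trading some readability for uniqueness. The method name is hypothetical.

```ruby
# Emit 7-bit ASCII only. Every character outside 0x20..0x7e, plus the
# ampersand itself, becomes a decimal character reference.
def to_canonical_ascii(text)
  text.encode("UTF-8").unicode_normalize(:nfc).gsub(/[^\x20-\x7e]|&/) do |char|
    format("&#%d;", char.ord)
  end
end

composed   = "caf\u00e9"    # e-acute as one code point
decomposed = "cafe\u0301"   # plain e followed by a combining acute accent
puts to_canonical_ascii(composed)
puts to_canonical_ascii(decomposed)
```

Both calls print the same string, so a grep for one finds the other. Tabs and newlines inside a field are encoded too, which leaves the literal ones free to act as separators.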


1. To the people of the future: this might seem totally obvious. Trust that it is not. There are virtually no shared patterns or idioms across the systems listed here.
2. SGML = Standard Generalized Markup Language, the highly complex document format that inspired HTML.
3. DTD = Document Type Definition; an over-enthusiastic DTD can make XML mutable to the point of incomprehensibility.
4. The worst example is "node": a LAN node configured by a chef node running a Flume logical node handling graph nodes in a Node.js decorator.