Skip to content

Document Design

Chris Anderson edited this page Jul 30, 2013 · 2 revisions

NoSQL document design is a lot less cut-and-dried than relational schema design, both because it's a newer topic, and because the flexibility gives you many options. If you're used to SQL databases you may be perplexed. Here are some hints.

Denormalization

If you read much about SQL databases you're learn about the importance of normalization, splitting apart data to eliminate redundancy and make rows more independent. By contrast, in document-oriented databases it’s pretty common to denormalize data, storing the equivalent of multiple relational rows in a single document. There are several reasons for this:

  • In JSON document databases you don’t have transactions, so about the only way to update multiple items atomically is to put them in the same document.
  • The JSON structure of documents makes it more convenient to store collections of data in one place.
  • It’s faster to fetch smaller numbers of documents.
  • Map/reduce can pick lots of values out of a single document and index them separately, so you don’t need each one to be in its own row as you do in a relational db.

But the downsides of putting lots of stuff in one document include:

  • The possibility of conflicts if there are multiple parts that might be modified separately on different clients.
  • Access controls are more difficult to enforce, if a user should be able to edit one part of a document but not another.
  • Replication sends entire documents (not minimal deltas/patches) so large documents are less efficient to replicate if only small bits change.
  • There’s no straightforward way to “point to” a sub-item of a document, whereas if it’s in its own document it will have a document-ID.

As you’d expect it often comes down to informed intuition. For example, I would definitely use a single document for an address card, storing all of the person’s phone numbers, email addresses etc. in arrays or dictionaries. But I would store comments on a blog post as individual documents that point back to the post (i.e. include its doc-ID.)

References

A related issue is which direction to-many references should point. Relational databases use what I call “backward links”, where the children contain the ID of the parent and you run a query to find them, while in-memory data structures use “forward links” where the parent owns a collection that points to all of the children.

You can use both techniques in document databases:

  • A backward link is a property of a child object that contains the document ID of the parent. There will be a view whose map function does something like emit(doc.parent_id, null), and you query this to find the children of a document by setting the start and end of the range to the document's ID.
  • A forward link is simply a property of the parent whose value is a collection (usually an array) containing the document IDs of the children.

I’ve found that forward references are easier to code and perform better in lookup (since you don’t need to query, just get child documents by their IDs) but can be more expensive to update in a shared environment (the parent can act as a bottleneck and source of conflicts if multiple clients are trying to update its list of children.)

Forward references can also be more fragile: if a child document is deleted, the parent needs to be updated to remove its reference to the child. Otherwise the next time you enumerate its children you'll get a missing-document error trying to resolve that document ID. This isn't as fatal as a bad-pointer dereference, but it can be awkward. On the other hand, with backward references when you delete a child there are no dangly bits left over: the next query will return only the remaining children. (You can create a dangling reference by deleting the parent, but that tends to just result in unreachable child objects that don't cause problems other than taking up space.)

Further Reading