Merge pull request #179 from analyst-collective/cherry-picked-about-docs-update

Cherry picked about docs update
drewbanin authored Oct 19, 2016
2 parents 4fb0f91 + b99f037 commit 8a3d29d
Showing 3 changed files with 71 additions and 32 deletions.
42 changes: 42 additions & 0 deletions docs/about/contributing.md
@@ -12,3 +12,45 @@ We welcome PRs! We recommend that you log any feature requests as issues and dis
## Docs

We welcome PRs with updated documentation! All documentation for dbt is written in markdown using [mkdocs](http://www.mkdocs.org/). Please follow installation instructions there to set up mkdocs on your local environment.

## Design Constraints

All contributions to dbt must adhere to the following design constraints:

- All data models are a single `SELECT` statement. All decisions about how the results of that statement are materialized in the database must be user-controlled via configuration (see the sketch after this list).
- The target schema must always be able to be regenerated from scratch—i.e. if a user performs a `DROP SCHEMA [target] CASCADE` and then runs `dbt run --target [target]`, all data will be re-built exactly as before.
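
As a minimal sketch of the first constraint (the file and column names below are hypothetical), a model file contains nothing but a `SELECT` statement; how the result is materialized is left entirely to user-supplied configuration:

```sql
-- models/customer_orders.sql (hypothetical model file)
-- The entire model is a single SELECT statement. Whether dbt builds this as a
-- view or a table is decided by user configuration, not by the model itself.
select
    customer_id,
    count(*)         as order_count,
    sum(order_total) as lifetime_value
from raw_orders
group by customer_id
```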

## Design Principles

When contributing to dbt, please keep the core project goal in mind:

**dbt (data build tool) is a productivity tool that helps analysts get more done and produce higher quality results.**

This goal has been carefully selected, and flows directly from the [viewpoint](about/viewpoint/).

### Why do we call dbt a “productivity tool”? Doesn't this minimize its impact?

This is a deliberate choice of words that forces us to remember what dbt actually is and what its goals are: *dbt is a user experience wrapper around an analytic database.* All design decisions should be made with the goal of creating a better workflow / user experience for analysts.

### Why are we focused on speed and quality as opposed to capability?

Most analytics tools that exist today were designed to maximize user capability. If an analyst wanted to build a line chart, the tool needed to make sure he/she could build that line chart. Exactly how that line chart was produced was less important.

This perspective made sense in the past. It used to be hard to make a line chart. Today it is easy: using matplotlib, ggplot2, Tableau, or the countless other charting tools creates functionally the same result. Today, the hard part is not making the line chart, but making the line chart fast, with accurate data, in a collaborative environment. While analysts today can create stunning visualizations, they struggle to produce accurate and timely data in a collaborative environment.

Analysts don’t need more new capabilities; they need a workflow that allows them to use the ones they have faster, with higher quality, and in teams.

### Why are we focused on analysts instead of data engineers?

Two reasons:

1. Analysts are closer to the business and the business users, and therefore have the information they need to actually build data models.
1. There are far more analysts than data engineers in the world. To truly solve this problem, we need to make the solution accessible for analysts.

### Why is dbt such a technical tool if its target users are analysts?

Most analysts today don't spend their time in text files and on the command line, but dbt forces the user to do both. This choice was made intentionally, based on three beliefs:

1. Analysts are already becoming more technical, and this trend will accelerate in coming years.
1. Working in this way allows dbt to hook into a much larger ecosystem of developer productivity tools like git, vim/emacs, etc. This ecosystem has a large part to play in the overall productivity gains to be had from dbt.
1. Core analytics workflows should not be locked away into a particular UI.
54 changes: 28 additions & 26 deletions docs/about/overview.md
@@ -1,48 +1,50 @@
# Overview #

## What is dbt?
dbt [data build tool] is a tool for creating analytical data models. dbt facilitates an analytical workflow that closely mirrors software development, including source control, testing, and deployment. dbt makes it possible to produce reliable, modular analytic code as an individual or in teams.

For more information on the thinking that led to dbt, see [this article]( https://medium.com/analyst-collective/building-a-mature-analytics-workflow-the-analyst-collective-viewpoint-7653473ef05b).
dbt (data build tool) is a productivity tool that helps analysts get more done and produce higher quality results.

## Who should use dbt?
dbt is built for data consumers who want to model data in SQL to support production analytics use cases. Familiarity with tools like text editors, git, and the command line is helpful—while you do not need to be an expert with any of these tools, some basic familiarity is important.
Analysts commonly spend 50-80% of their time modeling raw data—cleaning, reshaping, and applying fundamental business logic to it. dbt empowers analysts to do this work better and faster.

## Why do I need to model my data?
With the advent of MPP analytic databases like Amazon Redshift and Google BigQuery, it is now common for companies to load and analyze large amounts of raw data in SQL-based environments. Raw data is often not suited for direct analysis and needs to be restructured first. Some common use cases include:
dbt's primary interface is its CLI. Using dbt is a combination of editing code in a text editor and running that code from the command line with `dbt [command] [options]`.

- sessionizing raw web clickstream data
- amortizing multi-month financial transactions
## How does dbt work?

Modeling data transforms raw data into data that can be more easily consumed by business users and BI platforms. It also encodes business rules that can then be relied on by all subsequent analysis, establishing a "single source of truth".
dbt has two core workflows: building data models and testing data models. (We call any transformed view of raw data a data model.)

## What exactly is a "data model" in this context?
A dbt data model is a SQL `SELECT` statement with templating and dbt-specific extensions.
To create a data model, an analyst simply writes a SQL `SELECT` statement. dbt then takes that statement and builds it in the database, materializing it as either a view or a table. This model can then be queried by other models or by other analytics tools.

## How does dbt work?
To test a data model, an analyst asserts something to be true about the underlying data. For example, an analyst can assert that a certain field should never be null, should always hold unique values, or should always map to a field in another table. Analysts can also write assertions that express much more customized logic, such as “debits and credits should always be equal within a given journal entry”. dbt then tests all assertions against the database and returns success or failure responses.
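
As a sketch of what the last assertion above could look like when written in SQL (table and column names are hypothetical), a test can be expressed as a query that selects the rows violating the rule, so an empty result means the assertion holds:

```sql
-- Hypothetical assertion: debits and credits must balance within each journal entry.
-- The query returns the entries that violate the rule; no rows means the test passes.
select
    journal_entry_id,
    sum(debit_amount)  as total_debits,
    sum(credit_amount) as total_credits
from journal_lines
group by journal_entry_id
having sum(debit_amount) != sum(credit_amount)
```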

## Does dbt really help me get more done?

dbt has a small number of core functions. It:
One dbt user has this to say: *“At this point when I have a new question, I can answer it 10-100x faster than I could before.”* Here’s how:

- takes a set of data models and compiles them into raw SQL,
- materializes them into your database as views and tables, and
- runs automated tests on top of them to ensure their integrity.
- dbt allows analysts to avoid writing boilerplate DML and DDL: managing transactions, dropping tables, and handling schema changes. All business logic is expressed in SQL `SELECT` statements, and dbt takes care of materialization.
- dbt creates leverage. Instead of starting at the raw data with every analysis, analysts instead build up reusable data models that can be referenced in subsequent work.
- dbt includes optimizations for data model materialization, allowing analysts to dramatically reduce the time their queries take to run.

Once your data models have been materialized into your database, you can write analytic queries on top of them in any SQL-enabled tool.
There are many other optimizations in dbt to help you work quickly: macros, hooks, and package management are all accelerators.

Conceptually, this is very simple. Practically, dbt solves some big headaches in exactly *how* it accomplishes these tasks:
## Does dbt really help me produce more reliable analysis?

- dbt interpolates schema and table names in your data models. This allows you to do things like deploy models to test and production environments seamlessly.
- dbt automatically infers a directed acyclic graph of the dependencies between your data models and uses this graph to manage the deployment to your schema. This graph is powerful, and allows for features like partial deployment and safe multi-threading.
- dbt's opinionated design lets you focus on writing your business logic instead of writing configuration and boilerplate code.
It does. Here’s how:

## Why model data in SQL?
- Writing SQL frequently involves a lot of copy-paste, which leads to errors when logic changes. With dbt, analysts don’t need to copy-paste. Instead, they build reusable data models that then get pulled into subsequent models and analysis (see the sketch after this list). Change a model once and everything that’s built on it reflects that change.
- dbt allows subject matter experts to publish the canonical version of a particular data model, encapsulating all complex business logic. All analysis on top of this model will incorporate the same business logic without needing to understand it.
- dbt plays nicely with source control. Using dbt, analysts can use mature source control processes like branching, pull requests, and code reviews.
- dbt makes it easy and fast to write functional tests on the underlying data. Many analytic errors are caused by edge cases in the data: testing helps analysts find and handle those edge cases.
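
As a sketch of the reuse described in the first point (the model names here are hypothetical), a downstream model selects from an upstream model rather than from the raw tables, so a change to the upstream definition flows through to everything built on it:

```sql
-- models/revenue_by_month.sql (hypothetical)
-- Selects from the canonical revenue model instead of re-deriving it from raw
-- data; dbt's ref() resolves the dependency and builds models in the right order.
select
    date_trunc('month', revenue_date) as revenue_month,
    sum(revenue_amount)               as total_revenue
from {{ ref('fct_revenue') }}
group by 1
```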

Historically, most analytical data modeling has been done prior to loading data into a SQL-based analytic database. Today, however, it's often preferable to model data within an analytic database using SQL. There are two primary reasons for this:
## Why SQL?

1. SQL is a very widely-known language for working with data. Providing SQL-based modeling tools gives the largest-possible group of users access.
1. Modern analytic databases are extremely performant and have sophisticated optimizers. Writing data transformations in SQL allows users to describe transformations on their data but leave the execution plan to the underlying technology. In practice, this provides excellent results with far less work on the part of the author.
While there are a large number of great languages for manipulating data, we’ve chosen SQL as the primary data transformation language at the heart of dbt. There are two reasons for this:

Of course, SQL will inevitably not be suitable for 100% of potential use cases. dbt may be extended in the future to take advantage of support for non-SQL languages in platforms like Redshift and BigQuery. We have found, though, that modern SQL has a higher degree of coverage than we had originally expected. To users of languages like Python, solving a challenging problem in SQL often requires a different type of thinking, but the advantages of staying "in-database" and allowing the optimizer to work for you are very significant.
1. SQL is a very widely-known language for working with data. Using SQL gives the largest-possible group of users access.
1. Modern analytic databases are extremely performant and have sophisticated optimizers. Writing data transformations in SQL allows users to describe transformations on their data but leave the execution plan to the underlying database technology. In practice, this provides excellent results with far less work on the part of the author.

## What databases does dbt currently support?
Currently, dbt supports PostgreSQL and Amazon Redshift. We anticipate building support for additional databases in the future.

## How do I get started?

dbt is open source and completely free to download and use. See our [setup instructions](guide/setup/) for more.
7 changes: 1 addition & 6 deletions docs/about/viewpoint.md
@@ -23,8 +23,6 @@ Analytics doesn’t have to be this way. In fact, the playbook for solving these
The same techniques that software engineering teams use to collaborate on the rapid creation of quality applications can apply to analytics. We believe it’s time to build an open set of tools and processes to make that happen.

## Analytics is collaborative
Most of the problems with the current analytics workflow aren’t so bad if you’re working alone. You know about all of the data available to you, you know what it means, and you know how it was created. But you don’t scale. As soon as your analytics needs grow beyond a single analyst, these problems begin to manifest.

We believe a mature analytics team’s techniques and workflow should have the following collaboration features:

### Version Control
@@ -35,15 +33,12 @@ Bad data can lead to bad analyses, and bad analyses can lead to bad decisions. A

### Documentation
Your analysis is a software application, and, like every other software application, people are going to have questions about how to use it. Even though it might seem simple, in reality the “Revenue” line you’re showing could mean dozens of things. Your code should come packaged with a basic description of how it should be interpreted, and your team should be able to add to that documentation as additional questions arise.
Further, your analysis may need to be extended or modified by another member of your team, so document any portions of the code that may benefit from clarification.

### Modularity
If you build a series of analyses about your company’s revenue, and your colleague does as well, you should use the same input data. Copy-paste is not a good approach here — if the definition of the underlying set changes, it will need to be updated everywhere it was used. Instead, think of the schema of a data set as its public interface. Create tables, views, or other data sets that expose a consistent schema and can be modified if business logic changes.
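
A small sketch of that idea in plain SQL (the names are hypothetical): consumers select from a view that exposes a consistent schema, so the underlying definition can change without breaking anything built on top of it:

```sql
-- Hypothetical "public interface" for revenue: downstream work always selects
-- from analytics.revenue, even if the business logic underneath changes.
create view analytics.revenue as
select
    order_id,
    order_date,
    gross_amount - coalesce(discount_amount, 0) as revenue_amount
from raw_orders;
```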

## Analytic code is an asset
Data collection, processing, and analysis have all grown exponentially in capability in the past decades. As a result, analytics as a practice provides more value than ever to organizations. Today, the success of an organization can be directly linked to its ability to make effective, fast decisions based on data.

If analytics is core to the success of an organization, the code, processes, and tooling required to produce that analysis are core organizational investments. We believe a mature analytics organization’s workflow should have the following characteristics so as to protect and grow that investment:
The code, processes, and tooling required to produce that analysis are core organizational investments. We believe a mature analytics organization’s workflow should have the following characteristics so as to protect and grow that investment:

### Environments
Analytics requires multiple environments. Analysts need the freedom to work without impacting users, while users need service level guarantees so that they can trust the data they rely on to do their jobs.
