Skip to content

2 Working with table designs

Tom Vogels edited this page Aug 14, 2017 · 1 revision

Table designs

The Redshift ETL supports the notion of "table designs" which describe

  • how to create the table (including how to distribute it)
  • what column attributes and constraints exist,
  • how to retrieve column values from upstream source in case casts are needed,
  • what table attributes and constraints exist.

This is true for tables and CTAS but in case of views they simply list the columns.

The extension of the file must be '.yaml' and the format is YAML. (We encourage the use of a "commented" JSON as a subset of YAML, which makes the syntax of these files easier to debug.)

Upstream sources

Basic structure

The basic table design for a table from an upstream data source looks like this:

{
    "name": "harryswww.orders",
    "source_name": "harryswww.public.orders",
    "columns": [
        ...
    ]
}

Tables can (and probably should) have a description field.

Column definitions

Each column must be an object that contains at least the name, type and sql_type information. The "type" here is a generic information that can be readily used when data is in flight, e.g. in Avro data files or Spark Dataframes. The "SQL type" is the type of the column which will be used when the table is created.

Here's a full toy example:

{
    "name": "example.toys",
    "description": "An example table that stores information about toys",
    "source_name": "example.public.toys",
    "columns": [
        {
            "name": "id",
            "sql_type": "bigint",
            "type": "int"
        },
     ]
}

The exception here is that there is support for "skipped" columns -- these are columns which should not be retrieved from upstream sources but should be tracked so that nobody assumes that they were left off by accident. These columns show up like this:

    ... {
            "name": "encrypted_password",
            "skipped": true
        } ...

Additional fields for column definitions are:

  • the type that the column has in the upstream database: source_sql_type
  • the flag whether column is nullable: is_null
  • the SQL expression which will cast upstream data into the correct column type: expression
  • the compression: encoding

Table attributes and constraints

Attributes describe how the table is distributed over compute nodes: either on all nodes, evenly distributed or distributed by a column.

Constraints describe how a column or a set of columns needs to be unique or references another table.

CTAS -- Tables created using SQL queries

In most ways, CTAS are just like tables except they do not have an upstream source and are created by running a SQL query instead. Thus all of the information above applies except for "source_name" which now must be set to the literal "CTAS". And note that you also have to provide a file with the query. That file must have the same base name as the table design and have an extension of .sql.

{
    "name": "sales.fact_order",
    "source_name": "CTAS",
    "description": "Fact table for our order star schema, based on upstream 'www' data",
    "columns": [
        ...
    ]
}

Additionally, CTAS may have an "identity" column which is most useful for dimension tables.

Views

For views, there's no need (or opportunity) to specify anything other than the names of the columns. So the table design for a view looks like this:

{
    "name": "staging.users_without_pii",
    "source_name": "VIEW",
    "description": "View over users information but without their names and addresses",
    "columns": [
        { "name": "id" },
        { "name": "last_visit" },
        { "name": "is_subscriber" },
        { "name": "state_us" }
    ]
}

Note that "source_name" must be set to the literal "VIEW".