Skip to content

Latest commit

 

History

History
35 lines (25 loc) · 4.28 KB

README.md

File metadata and controls

35 lines (25 loc) · 4.28 KB

Manipulating Data in Style with SQL

An introduction to SQL, the interface language to most of the world’s structured data, and practices for readable and reusable SQL code

This repository contains materials for my talk at the Polyglot Programming DC meetup on October 14, 2014, which is loosely based on material from my talks at the Data Wranglers DC meetups on June 4, 2014 (materials at nihonjinrxs/dwdc-june2014) and August 6, 2014 (materials at nihonjinrxs/dwdc-august2014).

Contents

The talk consists of two major directions:

  • An introduction to SQL and relational databases, and techniques for writing readable and reusable code in SQL
  • Using SQL on Data Frame objects in R and Python

Folders are as follows:

  • A slide deck (./slides) in Apple Keynote, PDF and HTML formats
  • A set of SQL scripts (./sql) that demonstrate the queries discussed, as well as the creation of database objects such as views and functions
  • An RMarkdown document (./R), published on RPubs, that demonstrates using sqldf in R to perform SQL queries on data frames as if they are tables
  • An IPython notebook (./python), available at IPython nbviewer, that demonstrates using sqldf from the pandasql package to perform SQL queries on Pandas DataFrame objects as if they are tables

Where do I start?

I recommend that anyone wishing to understand what I've done should tackle these pieces in order, starting with the slide deck. In order to work with the code examples, you'll need PostgreSQL (I used version 9.3 on an Apple Mac to create these examples) and the Chinook Database (MS-PL licensed). You may also want to review the Chinook data model and schema.

Importing the Chinook data into PostgreSQL

Once you download the Chinook Database and unzip the file, you'll see a collection of scripts. You're looking for the Chinook_PostgreSql.sql file. If you just run the SQL file from the psql command prompt, PostgreSQL will happily dump the database objects into your public schema. I prefer to have a named schema, and the examples here assume one. In order to do that, you'll need to do a few more steps:

psql> CREATE SCHEMA chinook;
psql> SET search_path TO chinook;
psql> \i /path/to/script/Chinook_PostgreSql.sql

Note that many of the SQL examples using the Chinook database depend on a faulty import of Chinook data that doesn't take into account character set issues, and thus leaves out many records in tables. Although we would normally not want this to happen, doing it intentionally here provides some interesting real-world data integrity problems to inspect, so I've done this on purpose. A log of the import process with errors and warnings (chinook_import.log) is provided in the .sql directory, so that one can manually achieve an identical result if desired (although the examples should work with any non-zero number of import errors in the specified tables, so long as some data is actually imported).

Disclaimer

This work and the opinions expressed here are my own, and do not purport to represent the views of my current or former employers.

The MIT license on this repo covers all contents of the repo, but does not supercede the existing licenses for products used for this work, including PostgreSQL (covered by the PostgreSQL License) and the Chinook Database (covered by the Microsoft Public License).