Simon Willison’s Weblog

Subscribe

Datasette: instantly create and publish an API for your SQLite databases

13th November 2017

I just shipped the first public version of datasette, a new tool for creating and publishing JSON APIs for SQLite databases.

You can try out out right now at fivethirtyeight.datasettes.com, where you can explore SQLite databases I built from Creative Commons licensed CSV files published by FiveThirtyEight. Or you can check out parlgov.datasettes.com, derived from the parlgov.org database of world political parties which illustrates some advanced features such as SQLite views.

Common surnames from fivethirtyeight

Or you can try it out on your own machine. If you run OS X and use Google Chrome, try running the following:

pip3 install datasette
datasette ~/Library/Application\ Support/Google/Chrome/Default/History

This will start a web server on http://127.0.0.1:8001/ displaying an interface that will let you browse your Chrome browser history, which is conveniently stored in a SQLite database.

Got a SQLite database you want to share with the world? Provided you have Zeit Now set up on your machine, you can publish one or more databases with a single command:

datasette publish now my-database.db

The above command will whir away for about a minute and then spit out a URL to a hosted version of datasette with your database (or databases) ready to go. This is how I’m hosting the fivethirtyeight and parlgov example datasets, albeit on a custom domain behind a Cloudflare cache.

The datasette API

Everything datasette can do is driven by URLs. Queries can produce responsive HTML pages (I’m using a variant of this responsive tables pattern for smaller screens) or with the .json or .jsono extension can produce JSON. All JSON responses are served with an Access-Control-Allow-Origin: * HTTP header, meaning you can query them from any page.

You can try that right now in your browser’s developer console. Navigate to http://www.example.com/ and enter the following in the console:

fetch(
    "https://fivethirtyeight.datasettes.com/fivethirtyeight-2628db9/avengers%2Favengers.jsono"
).then(
    r => r.json()
).then(data => console.log(
    JSON.stringify(data.rows[0], null, '  ')
))

You’ll see the following:

{
  "rowid": 1,
  "URL": "http://marvel.wikia.com/Henry_Pym_(Earth-616)",
  "Name/Alias": "Henry Jonathan \"Hank\" Pym",
  "Appearances": 1269,
  "Gender": "MALE",
  "Full/Reserve Avengers Intro": "Sep-63",
  "Year": 1963,
  "Years since joining": 52,
  ...
}

Since the API sits behind Cloudflare with a year-long cache expiry header, responses to any query like this should be lightning-fast.

Datasette supports a limited form of filtering based on URL parameters, inspired by Django’s ORM. Here’s an example: by appending ?CLOUDS=1&MOUNTAINS=1&BUSHES=1 to the FiveThirtyEight dataset of episodes of Bob Ross’ The Joy of Painting we can see every episode in which Bob paints clouds, bushes AND mountains:

https://fivethirtyeight.datasettes.com/fivethirtyeight-2628db9/bob-ross%2Felements-by-episode?CLOUDS=1&MOUNTAINS=1&BUSHES=1

And here’s the same episode list as JSON.

Arbitrary SQL

The most exciting feature of datasette is that it allows users to execute arbitrary SQL queries against the database. Here’s a convoluted Bob Ross example, returning a count for each of the items that can appear in a painting.

Datasette has a number of limitations in place here: it cuts off any SQL queries that take longer than a threshold (defaulting to 1000ms) and it refuses to return more than 1,000 rows at a time—partly to avoid too much JSON serialization overhead.

Datasette also blocks queries containing the string PRAGMA, since these statements could be used to modify database settings at runtime. If you need to include PRAGMA in an argument to a query you can do so by constructing a prepared statement:

select * from [twitter-ratio/senators] where "text" like :q

You can then construct a URL that incorporates both the SQL and provides a value for that named argument, like this: https://fivethirtyeight.datasettes.com/fivethirtyeight-2628db9?sql=select+rowid%2C+*+from+[twitter-ratio%2Fsenators]+where+“text”+like+%3Aq&q=%25pragmatic%25—which returns tweets by US senators that include the word “pragmatic”.

Why an immutable API?

A key feature of datasette is that the API it provides is very deliberately read-only. This provides a number of interesting benefits:

Implementation notes

Datasette is built on top of the Sanic asynchronous Python web framework (see my previous notes), and makes extensive use of Python 3’s async/await statements. Since SQLite doesn’t yet have an async Python module all interactions with SQLite are handled inside a thread pool managed by a concurrent.futures.ThreadPoolExecutor.

The CLI is implemented using the Click framework. This is the first time I’ve used Click and it was an absolute joy to work with. I enjoyed it so much I turned one of my Jupyter notebooks into a Click script called csvs-to-sqlite and published it to PyPI.

This post is being discussed on a Hacker News.