DuckDB

Official Guides

Installation

It comes as a single binary, we just have to download and unpack:

wget https://github.com/duckdb/duckdb/releases/download/v0.9.2/duckdb_cli-linux-amd64.zip
unzip duckdb_cli-linux-amd64.zip

Running Queries

Open Shell

./duckdb
D select tx_id, item_price from read_csv('./data/compras_tiny.csv', AUTO_DETECT=TRUE) where tx_id = 'YMEVOKU194';
┌────────────┬────────────┐
│   tx_id    │ item_price │
│  varchar   │   double   │
├────────────┼────────────┤
│ YMEVOKU194 │      28.85 │
│ YMEVOKU194 │       4.12 │
└────────────┴────────────┘

Run Query and Exit

./duckdb -c "select * from read_csv('data/compras_tiny.csv', AUTO_DETECT=TRUE)"

Creating a database

Just append the filename when running duckdb. For instance, this will create a table named compras in a file at data/compras_tiny.duckdb:

duckdb data/compras_tiny.duckdb -c "CREATE TABLE compras AS SELECT * FROM read_csv('data/compras_tiny.csv', header=true, auto_detect=true)"

After that, you can open the file and query the compras table:

./duckdb data/compras_tiny.duckdb
v0.9.2 3c695d7ba9
Enter ".help" for usage hints.
D describe compras;
┌────────────────────┬─────────────┬─────────┬─────────┬─────────┬───────┐
│    column_name     │ column_type │  null   │   key   │ default │ extra │
│      varchar       │   varchar   │ varchar │ varchar │ varchar │ int32 │
├────────────────────┼─────────────┼─────────┼─────────┼─────────┼───────┤
│ tx_id              │ VARCHAR     │ YES     │         │         │       │
│ tx_time            │ TIMESTAMP   │ YES     │         │         │       │
│ buyer              │ VARCHAR     │ YES     │         │         │       │
│ currency_code      │ VARCHAR     │ YES     │         │         │       │
│ payment_type       │ VARCHAR     │ YES     │         │         │       │
│ credit_card_number │ VARCHAR     │ YES     │         │         │       │
│ country            │ VARCHAR     │ YES     │         │         │       │
│ department         │ VARCHAR     │ YES     │         │         │       │
│ product            │ VARCHAR     │ YES     │         │         │       │
│ item_price         │ DOUBLE      │ YES     │         │         │       │
│ coupon_code        │ VARCHAR     │ YES     │         │         │       │
│ was_returned       │ VARCHAR     │ YES     │         │         │       │
├────────────────────┴─────────────┴─────────┴─────────┴─────────┴───────┤
│ 12 rows                                                      6 columns │
└────────────────────────────────────────────────────────────────────────┘
D .mode line
D select count(*) as con_descuento from compras where coupon_code is not null;
con_descuento = 414

Change output format

Use the .mode csv, also -csv or -json command line options. You can check the available output formats in the documentation.

D .mode csv
D select tx_id, item_price from read_csv('./data/compras_tiny.csv', AUTO_DETECT=TRUE) where tx_id = 'YMEVOKU194';
tx_id,item_price
YMEVOKU194,28.85
YMEVOKU194,4.12

Remember you can always save the data to a different file:

./duckdb -jsonlines -c "select tx_id, item_price from read_csv('./data/compras_tiny.csv', AUTO_DETECT=TRUE) where tx_id = 'YMEVOKU194';" > ./data/compras_tiny.jsonl

Use >> instead of > to append to the file.

Converting between CSV and Parquet

For instance, using compras_tiny.csv:

./duckdb -c "copy (select * from read_csv('data/compras_tiny.csv', AUTO_DETECT=TRUE)) to 'data/compras_tiny.parquet' (format 'parquet')"

Running from Python

You need first to install the library:

pip install duckdb

Plain SQL Queries

Use the method .sql to run SQL queries as you'll normally do in the console:

df = duckdb.sql("select * from read_csv('./data/prices.csv', AUTO_DETECT=TRUE)")

Which by default will print something like this in the console or Jupyter Lab:

┌─────────┬────────────┬────────────────────┬───┬───────────────────┬───────────────────┬─────────────┐
│ Symbol  │    Date    │     Adj Close      │ … │        Low        │       Open        │   Volume    │
│ varchar │    date    │       double       │   │      double       │      double       │   double    │
├─────────┼────────────┼────────────────────┼───┼───────────────────┼───────────────────┼─────────────┤
│ aapl    │ 2010-01-04 │  6.535084247589111 │ … │ 7.585000038146973 │ 7.622499942779541 │ 493729600.0 │
│ msft    │ 2010-01-04 │ 23.800220489501953 │ … │ 30.59000015258789 │  30.6200008392334 │  38409100.0 │
│ goog    │ 2010-01-04 │ 312.20477294921875 │ … │ 310.9544677734375 │ 312.3044128417969 │   3927065.0 │
├─────────┴────────────┴────────────────────┴───┴───────────────────┴───────────────────┴─────────────┤
│ 3 rows                                                                          8 columns (6 shown) │
└─────────────────────────────────────────────────────────────────────────────────────────────────────┘

Import data into a Panda's DataFrame

You can use the method .df() to convert any result into a Pandas Dataframe:

df = duckdb.sql("select * from read_csv('./data/prices.csv', AUTO_DETECT=TRUE)").df()
df[df['Symbol'] == 'msft']
df.set_index(["Symbol", "Date"])

Read a DataFrame using DuckDB queries

All Dataframes you create will be directly accesible from DuckDB as well:

import pandas as pd
goog = pd.read_csv("./data/goog.csv")
duckdb.sql("select * from goog")