Secrets in DuckDB

It’s been a while since I’ve written a post and that bums me out. Some of my recent projects have been on the ambitious side which can make stopping to blog feel like a distraction

In my downtime I have received a couple emails from folks asking clarifying questions about some of my posts and that was really nice, so I’m going to try some snack-sized posts to get rolling again 😀

Background

One of the great features of DuckDB is the httpfs extension. This allows reading files from HTTP and reading / writing / globbing files from AWS S3

This is one of DuckDB’s secret (no pun intended, I swear!) weapons and enables us to run analytical workloads from data stored on S3 sort of like a mini (or maybe not so mini) data warehouse / data lake

Let’s run a query on S3 using DuckDB in Python 3.11

Tell me your secrets

AWS setup

If we’re going to pull data from S3 we’ll probably need some sort of authentication / authorization. In the case of AWS S3, we’ll use access keys generated by IAM

In the interest of minimizing distractions, I’ll leave as an exercise for the reader to prepare an S3 bucket, an IAM user, and generate access keys 😇

Handling of secrets is a whole topic in itself, so let’s just set them as environment variables so they aren’t in our code. We can set them in our shell with:

export ACCESS_KEY=<your access key>
export SECRET_KEY=<your secret key>

Python setup

First we’ll import:

logging to log
os to read enviroment variables
duckdb to query parquet files in S3

import logging
import os
import duckdb

First, let’s set the AWS region, bucket name, and parquet filename prefix:

s3_region = "<your aws region>"

# example bucket name and prefix:
# "s3://<your bucket name>/<your file prefix>.parquet"
s3_bucket = "<your bucket name>"
s3_prefix = "<your file prefix>"

Let’s add a logger with some boilerplate I’ve been using recently:

logger = logging.getLogger(__name__)
logging.basicConfig(
    format="%(asctime)s %(levelname)-8s %(message)s",
    level=logging.DEBUG,
    datefmt="%Y-%m-%d %H:%M:%S",
)

We can test it by adding a welcome message:

logger.info(f"Welcome to the DuckDB secrets demo")

Let’s pull in the ACCESS_KEY and SECRET_KEY environment variables to store as Python variables. We’ll check to see if either variable is None which tells us the variable was not set, in which case we exit the script with: exit(1)

logger.info(f"Setting Python variables from ACCESS_KEY and SECRET_KEY environment variables")

env_var = "ACCESS_KEY"
env_value = os.getenv(env_var)
if env_value is None:
    logger.error(f"Error: {env_var} is set to None")
    exit(1)
else:
    logger.info("S3 access key set")
    s3_access_key = env_value

env_var = "SECRET_KEY"
env_value = os.getenv(env_var)
if env_value is None:
    logger.error(f"Error: {env_var} is set to None")
    exit(1)
else:
    s3_secret_key = env_value
    logger.info("S3 secret key set")

The httpfs extension is described as autoloadable which means we don’t have to explicitly install and load it:

# httpfs is autoloadable so we don't need to explicitly install and load it
# duckdb.sql("INSTALL httpfs;")
# duckdb.sql("LOAD httpfs;")

Creating the secret

Now, we’ll give the secret a name and log that we’re going to create it:

secret_name = "s3_secret"

logger.info(f"Creating duckdb S3 secret: {secret_name}")

Here’s how to create a secret in DuckDB:

duckdb.sql(f"""
CREATE SECRET {secret_name} (
    TYPE S3,
    KEY_ID '{s3_access_key}',
    SECRET '{s3_secret_key}',
    REGION '{s3_region}'
);
""")

Note: the main motivation for this post is to share that DuckDB only allows creating a secret with a given name once

If we try to create a secret with the same name a second time we’ll get a runtime error. This can be annoying (especially in a Jupyter notebook) so it sounds like we might need to wrap it a try / except block:

try:
    duckdb.sql(f"""
    CREATE SECRET {secret_name} (
        TYPE S3,
        KEY_ID '{s3_access_key}',
        SECRET '{s3_secret_key}',
        REGION '{s3_region}'
    );
    """)
except Exception as e:
    if str(e) == f"Invalid Input Error: Temporary secret with name '{secret_name}' already exists!":
        logger.info(f"Secret: {secret_name} already exists, skipping")
    else:
        logger.error("Unknown error when creating secret in DuckDB")
        exit(1)
else:
    logger.info(f"Successfully created secret: {secret_name}")

First, we try and run the DuckDB CREATE SECRET command

If there’s no Exception, we log info that we successfully created the secret and continue the script
If there is an Exception, we assign it to e. We cast e to a string so we can check if it indicates that the secret already exists
1. If so, we are OK with this Exception and continue the script
2. If not, the Exception is something other than the secret existing and we log error and exit(1)

Testing the secret

Let’s do a basic query to test out our secret:

logger.info("Run a test query")

query_result = duckdb.sql(f"""
select
    *
from
    's3://{s3_bucket}/{s3_prefix}/*.parquet'
limit
    1
""")

logger.info(f"Query result:\n{query_result}")

That worked for me, hopefully it worked for you!

Here is a link to the full script

Hurray, now you can set a secret in DuckDB and run a basic query on a parquet file on S3. As mentioned, this is an incredibly powerful toolkit that I’ll go over in more depth another day

Thanks for reading,

Nick