Secrets in DuckDB
Table of Contents
It’s been a while since I’ve written a post and that bums me out. Some of my recent projects have been on the ambitious side which can make stopping to blog feel like a distraction
In my downtime I have received a couple emails from folks asking clarifying questions about some of my posts and that was really nice, so I’m going to try some snack-sized posts to get rolling again 😀
Background
One of the great features of DuckDB is the httpfs extension. This allows reading files from HTTP and reading / writing / globbing files from AWS S3
This is one of DuckDB’s secret (no pun intended, I swear!) weapons and enables us to run analytical workloads from data stored on S3 sort of like a mini (or maybe not so mini) data warehouse / data lake
Let’s run a query on S3 using DuckDB in Python 3.11
Tell me your secrets
AWS setup
If we’re going to pull data from S3 we’ll probably need some sort of authentication / authorization. In the case of AWS S3, we’ll use access keys generated by IAM
In the interest of minimizing distractions, I’ll leave as an exercise for the reader to prepare an S3 bucket, an IAM user, and generate access keys 😇
Handling of secrets is a whole topic in itself, so let’s just set them as environment variables so they aren’t in our code. We can set them in our shell with:
export ACCESS_KEY=<your access key>
export SECRET_KEY=<your secret key>
Python setup
First we’ll import:
logging
to logos
to read enviroment variablesduckdb
to query parquet files in S3
import logging
import os
import duckdb
First, let’s set the AWS region, bucket name, and parquet filename prefix:
s3_region = "<your aws region>"
# example bucket name and prefix:
# "s3://<your bucket name>/<your file prefix>.parquet"
s3_bucket = "<your bucket name>"
s3_prefix = "<your file prefix>"
Let’s add a logger with some boilerplate I’ve been using recently:
logger = logging.getLogger(__name__)
logging.basicConfig(
format="%(asctime)s %(levelname)-8s %(message)s",
level=logging.DEBUG,
datefmt="%Y-%m-%d %H:%M:%S",
)
We can test it by adding a welcome message:
logger.info(f"Welcome to the DuckDB secrets demo")
Let’s pull in the ACCESS_KEY
and SECRET_KEY
environment variables to store as Python variables. We’ll check to see if either variable is None
which tells us the variable was not set, in which case we exit the script with: exit(1)
logger.info(f"Setting Python variables from ACCESS_KEY and SECRET_KEY environment variables")
env_var = "ACCESS_KEY"
env_value = os.getenv(env_var)
if env_value is None:
logger.error(f"Error: {env_var} is set to None")
exit(1)
else:
logger.info("S3 access key set")
s3_access_key = env_value
env_var = "SECRET_KEY"
env_value = os.getenv(env_var)
if env_value is None:
logger.error(f"Error: {env_var} is set to None")
exit(1)
else:
s3_secret_key = env_value
logger.info("S3 secret key set")
The httpfs
extension is described as autoloadable which means we don’t have to explicitly install
and load
it:
# httpfs is autoloadable so we don't need to explicitly install and load it
# duckdb.sql("INSTALL httpfs;")
# duckdb.sql("LOAD httpfs;")
Creating the secret
Now, we’ll give the secret a name and log that we’re going to create it:
secret_name = "s3_secret"
logger.info(f"Creating duckdb S3 secret: {secret_name}")
Here’s how to create a secret in DuckDB:
duckdb.sql(f"""
CREATE SECRET {secret_name} (
TYPE S3,
KEY_ID '{s3_access_key}',
SECRET '{s3_secret_key}',
REGION '{s3_region}'
);
""")
Note: the main motivation for this post is to share that DuckDB only allows creating a secret with a given name once
If we try to create a secret with the same name a second time we’ll get a runtime error. This can be annoying (especially in a Jupyter notebook) so it sounds like we might need to wrap it a try / except
block:
try:
duckdb.sql(f"""
CREATE SECRET {secret_name} (
TYPE S3,
KEY_ID '{s3_access_key}',
SECRET '{s3_secret_key}',
REGION '{s3_region}'
);
""")
except Exception as e:
if str(e) == f"Invalid Input Error: Temporary secret with name '{secret_name}' already exists!":
logger.info(f"Secret: {secret_name} already exists, skipping")
else:
logger.error("Unknown error when creating secret in DuckDB")
exit(1)
else:
logger.info(f"Successfully created secret: {secret_name}")
First, we try and run the DuckDB CREATE SECRET
command
- If there’s no Exception, we log
info
that we successfully created the secret and continue the script - If there is an Exception, we assign it to
e
. We caste
to a string so we can check if it indicates that the secret already exists- If so, we are OK with this Exception and continue the script
- If not, the Exception is something other than the secret existing and we log
error
andexit(1)
Testing the secret
Let’s do a basic query to test out our secret:
logger.info("Run a test query")
query_result = duckdb.sql(f"""
select
*
from
's3://{s3_bucket}/{s3_prefix}/*.parquet'
limit
1
""")
logger.info(f"Query result:\n{query_result}")
That worked for me, hopefully it worked for you!
Here is a link to the full script
Hurray, now you can set a secret in DuckDB and run a basic query on a parquet file on S3. As mentioned, this is an incredibly powerful toolkit that I’ll go over in more depth another day
Thanks for reading,
Nick