DuckDB in PyCharm
Table of Contents
My first exposure to DuckDB was doing a query on a Parquet file. The performance of predict pushdown, where only the relevant columns are read from disk, qualified as a legitimate gamechanger for doing local data analysis
More recently, I find myself reaching for DuckDB to provide a SQL interface to the mentioned Parquet files, .csv files, and even pandas dataframes. The continuity of using the same language to query a remote database like Snowflake and locally to manipulate query results for visualization is a lovely productivity boost
However, an IDE like PyCharm can’t quite flex its autocomplete muscles when querying a .csv file – or can it?!
Note: you can find the Jupyter notebook used to create this tutorial on GitHub
The problem
A simple DuckDB query in Python looks like this:
duckdb.query("""
select
from 'Tesla-2024-03-10 13-34-28-2024-03-10 14-25-11.csv'
""").df()
At present, PyCharm is not able to pull the column names from the .csv file to aid in autocompletion or other intelligent features, so typing l
after select
and looking at the auto complete list does not reveal the column names Latitude
and Longitude
that I was expecting:
This is a little disappointing and makes the case to use DuckDB a little bit weaker
Let’s get introspective
PyCharm has a feature called introspection that is:
the process of loading the metadata of the database objects (for example, object names and types of columns) and source code
Introspection is the fuel for the autocomplete and other intelligent features. But how can we get PyCharm to run introspection on the .csv file that we are using in DuckDB?
This solution is a little bit of a hack, but it does the trick:
- Create a DuckDB database on disk
- We often default to using DuckDB in in-memory mode where nothing is persisted to disk
- Create a database
VIEW
that points to the .csv file- As a reminder, a view takes very little space to store
- Bonus: we can define new column names when defining the
VIEW
, allowing us to change to names that are a little SQL friendlier!
Step by step
Prerequisites
Before we get started:
- We’ll need to create a new Python project in PyCharm
- We’ll also need to install
duckdb
andpandas
:
pip install duckdb pandas
The Python side of things
We’ll need to do some configuration and create the DuckDB database before we can run introspection:
- Import duckdb
import duckdb
- Configure the csv filename, the filename of the database we’ll create, and the name of the
VIEW
that we’ll create
# raw .csv filename
csv_filename = "Tesla-2024-03-10 13-34-28-2024-03-10 14-25-11.csv"
# duckdb filename (to create)
db_filename = "file.db"
# name of view (to create)
data_view_name = "data_v"
- Write SQL query that creates a
VIEW
, optionally renaming all the columns- Note: the first and last lines of the query are the most important, containing f-string variables that we defined above
# query to create a view that aliases each column name in the .csv to
# something more SQL query friendly
create_data_view_query = f"""create or replace view data_v as
select "Timestamp (PDT)" as ts,
"State" as state,
"Charging State" as charging_state,
"Shift State" as shift_state,
"Battery Level (%)" as battery_level_pct,
"Usable Battery Level (%)" as usable_battery_level_pct,
"Battery Range (mi)" as battery_range_mi,
"Est. Battery Range (mi)" as estimated_battery_range_mi,
"Ideal Battery Range (mi)" as ideal_battery_range_mi,
"Latitude" as latitude,
"Longitude" as longitude,
"Elevation (m)" as elevation_m,
"Speed (mph)" as speed_mph,
"Power (kW)" as power_kw,
"Odometer (mi)" as odometer_mi,
"Charge Rate (mph)" as charge_rate_mph,
"Current (A)" as current_a,
"Charger Power (kW)" as charger_power_kw,
"Charger Voltage (V)" as charger_voltage_v,
"Charger Phases" as charger_phases,
"Energy Added (kWh)" as energy_added_kwh,
"Rated Range Added (mi)" as rated_range_added_mi,
"Ideal Range Added (mi)" as ideal_range_added_mi,
"Climate" as climate,
"Battery Heater" as battery_heater,
"Inside Temp (°F)" as inside_temperature_f,
"Outside Temp (°F)" as outside_temperature_f,
"Locked" as locked,
"Sentry Mode" as sentry_mode
from '{csv_filename}'"""
- Create a function to check if the
VIEW
already exists- In general, we want to avoid opening the database in read / write mode (more on that later), so we’ll only create the
VIEW
if it doesn’t exist already
- In general, we want to avoid opening the database in read / write mode (more on that later), so we’ll only create the
def check_view_exists(duck_conn: duckdb.DuckDBPyConnection, check_view_name: str) -> bool:
"""Return True if check_view_name exists in duck_conn or False if not"""
try:
# use DuckDB Python API, not SQL query
duck_conn.view(check_view_name)
except Exception as local_e:
if f"Table with name {check_view_name} does not exist" in str(local_e):
return False
return True
- Create a function to create the
VIEW
from the query we wrote above
def create_view(duck_conn: duckdb.DuckDBPyConnection, duck_filename: str, new_view_query: str) -> duckdb.DuckDBPyConnection:
"""Open database connection to duck_filename in read / write mode,
run the new_view_query SQL query,
re-open the database connection in read only mode,
and return the read only DuckDB connection"""
duck_conn.close()
duck_conn = duckdb.connect(duck_filename, read_only=False)
duck_conn.query(new_view_query)
duck_conn.close()
duck_conn = duckdb.connect(duck_filename, read_only=True)
return duck_conn
- Connect to the database if it exists, create it if not
# connect to db in read mode
# if db does not exist, create it in write mode and reconnect in read mode
# close connection if it exists
if "conn" in globals():
# noinspection PyUnresolvedReferences
conn.close()
try:
conn = duckdb.connect(db_filename, read_only=True)
except Exception as e:
if "database does not exist" in str(e):
conn = duckdb.connect(db_filename, read_only=False)
conn.close()
conn = duckdb.connect(db_filename, read_only=True)
else:
raise e
- Check if the
VIEW
exists, create it if not
# see if our view already exists in the db, if view does not exist, create it
view_exists = check_view_exists(conn, data_view_name)
if not view_exists:
conn = create_view(conn, db_filename, create_data_view_query)
- Confirm the
VIEW
exists, assert if not
# confirm that view exists, if not assert because something went wrong
view_exists = check_view_exists(conn, data_view_name)
assert view_exists, "Error: view not created"
- Ensure that the Jupyter notebook is no longer connected to the DuckDB database. The easiest way to do this is to restart the Jupyter kernel
Charming PyCharm
Now we need to configure this DuckDB database as a Data Source in PyCharm. Here’s the PyCharm documentation for the Database tool and here’s a step by step for doing configuring it to target DuckDB:
- Click the Database icon on the right
- Click the + icon in the Database menu
- Click Data Source
- Scroll to the bottom and select Other
- Select DuckDB
- Set File to the value of the
db_filename
variable in Python (file.db
in our case) - Set Authentication to No auth
- Click Test Connection
- You may be prompted to install database drivers for DuckDB
- Click on the Schemas tab
- Open the
file
expander - Check the box for
main
underfile
- Click the Advanced tab
- Add a setting with Name
duckdb.read_only
- Note: this and the following step are CRITICAL
- Set the Value to
true
- Click OK
- We can now see introspection of the .csv file in the Database window!
- The Deactivate button disconnects PyCharm from the database
- Note: Any time you connect to
db_filename
in read / write mode in Python, you will need to click Deactivate - Similarly, to disconnect from the database in Python, use the method:
conn.close()
- Note: Any time you connect to
- The Refresh button connects PyCharm to the database
DuckDB concurrency
Concurrency in DuckDB can be either:
- Multiple read only connections to a database
- A single read / write connection to a database
- A read only connection cannot be made when there is an active read / write connection
Results
We can now use PyCharm’s syntax highlighting and autocompletion features when we query DuckDB. Here’s a template to start with:
conn = duckdb.connect(db_filename, read_only=True)
conn.query(f"""
select
from data_v -- data_v cannot be replaced with a Python variable
""").df()
Here’s what it looks like in action where I’ve just typed l
and now the autocomplete engine shows us latitude
and longitude
like I was hoping for. Joy!
Note: as a reminder, you can find the Jupyter notebook used to create this tutorial on GitHub
Next steps
If I were to dig a little deeper on this subject, I may write a Python class as a DuckDB wrapper to make things a little more object-oriented