[DuckDB](https://duckdb.org/) is a database that can be run in-process with a variety of programming languages including [[Python]] and [[R]]. DuckDB provides the power of a database in-memory (i.e., no persistent files on [[disk]]). It can also be run as a [[command line application]].
DuckDB is the best way to efficiently run SQL on [[pandas]] DataFrames. In fact, DuckDB (due in part to the power of [[SQL]]) is faster than pandas for many joining, filtering, and aggregation functions.[^1]
## Installation
For Python, install DuckDB using [[conda]].
```
conda install python -duckdb -c conda-forge
```
## Basic Use
My primary use case for DuckDB is to load an [[Excel]] file, use queries to create views ("relations"), and process the data with [[pandas]]. This workflow allows me to store data efficiently in Excel or [[Google Sheets]] and analyze it easily.
```python
import pandas as pd import duckdb mydf = pd.DataFrame({'a' : [1, 2, 3]}) print(duckdb.query("SELECT SUM(a) FROM mydf").to_df())
```
ALternatives
- SQLite with Pandas
- Pandasql
- Dask (for large files)
- Apache Spark (for large files)
-
## Use with Excel
```python
# install and load the spatial extension
duckdb.install_extension('spatial')
duckdb.load_extension('spatial')
```
#rough
[^1]: [Efficient SQL on Pandas with DuckDB](https://duckdb.org/2021/05/14/sql-on-pandas.html)