DuckDB - Obsidian Publish

[DuckDB](https://duckdb.org/) is a database that can be run in-process with a variety of programming languages including [[Python]] and [[R]]. DuckDB provides the power of a database in-memory (i.e., no persistent files on [[disk]]). It can also be run as a [[command line application]]. DuckDB is the best way to efficiently run SQL on [[pandas]] DataFrames. In fact, DuckDB (due in part to the power of [[SQL]]) is faster than pandas for many joining, filtering, and aggregation functions.[^1] ## Installation For Python, install DuckDB using [[conda]]. ``` conda install python -duckdb -c conda-forge ``` ## Basic Use My primary use case for DuckDB is to load an [[Excel]] file, use queries to create views ("relations"), and process the data with [[pandas]]. This workflow allows me to store data efficiently in Excel or [[Google Sheets]] and analyze it easily. ```python import pandas as pd import duckdb mydf = pd.DataFrame({'a' : [1, 2, 3]}) print(duckdb.query("SELECT SUM(a) FROM mydf").to_df()) ``` ALternatives - SQLite with Pandas - Pandasql - Dask (for large files) - Apache Spark (for large files) - ## Use with Excel ```python # install and load the spatial extension duckdb.install_extension('spatial') duckdb.load_extension('spatial') ``` #rough [^1]: [Efficient SQL on Pandas with DuckDB](https://duckdb.org/2021/05/14/sql-on-pandas.html)