2026-05-03
Technology

Exploring DuckLake 1.0: A SQL-Centric Data Lake Format

DuckLake 1.0 is a novel data lake format from DuckDB Labs that stores table metadata in a SQL database instead of object store files, offering faster small updates, improved sorting/partitioning, and Iceberg compatibility.

DuckDB Labs has introduced DuckLake 1.0, a new data lake format that shifts table metadata storage from distributed object store files to a centralized SQL database. This approach promises better performance for small updates, improved sorting and partitioning strategies, and interoperability with Iceberg-style features. The format debuts as a DuckDB extension. Below, we answer key questions about DuckLake 1.0.

What is DuckLake 1.0?

DuckLake 1.0 is an open-source data lake format developed by DuckDB Labs. Unlike traditional table formats such as Apache Iceberg or Delta Lake, which store table metadata in many manifest files within an object store (e.g., Amazon S3), DuckLake uses a SQL database as its metadata repository. The initial release ships as a DuckDB extension, allowing users to manage table schemas, partitions, and small updates directly through SQL queries. This design simplifies metadata management, speeds up catalog operations, and reduces the overhead of scanning numerous small files. DuckLake 1.0 is also compatible with Iceberg-style data features, meaning it can work with existing Iceberg tables and leverage Iceberg’s partitioning and sorting improvements.

Source: www.infoq.com

How does DuckLake differ from traditional data lake formats?

Traditional data lake formats—such as Apache Iceberg, Delta Lake, or Hudi—store metadata in object storage as a collection of JSON or Avro files. These files track table snapshots, partition information, and file-level statistics. DuckLake 1.0 flips this model: it keeps the data files in object storage (e.g., Parquet files) but moves the metadata into a SQL database. This SQL catalog can be any standard relational database (like PostgreSQL or DuckDB’s own embedded engine). The key advantages are faster metadata queries (SQL databases are optimized for lookups) and the ability to perform atomic, transactional metadata updates—something that is complex with file-based catalogs. Additionally, DuckLake avoids the need to list and read many small metadata files, which can be slow and expensive in object stores.
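The difference can be sketched with a toy catalog. Here a single indexed SQL table stands in for DuckLake's metadata store (the table and column names are illustrative, not DuckLake's actual schema): finding the data files for one partition becomes a single indexed query, with no object-store LIST calls or manifest parsing.

```python
import sqlite3

# In-memory stand-in for the SQL catalog; DuckLake's real schema differs.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE data_file (
        table_name    TEXT,
        partition_key TEXT,
        path          TEXT
    )
""")
con.execute("CREATE INDEX idx_part ON data_file (table_name, partition_key)")

# Register a few data files, as a writer would after flushing Parquet files.
files = [
    ("orders", "2026-05-01", "s3://lake/orders/d=2026-05-01/a.parquet"),
    ("orders", "2026-05-01", "s3://lake/orders/d=2026-05-01/b.parquet"),
    ("orders", "2026-05-02", "s3://lake/orders/d=2026-05-02/c.parquet"),
]
con.executemany("INSERT INTO data_file VALUES (?, ?, ?)", files)

# Planning a scan of one partition is one indexed lookup.
rows = con.execute(
    "SELECT path FROM data_file WHERE table_name = ? AND partition_key = ?",
    ("orders", "2026-05-01"),
).fetchall()
print([r[0] for r in rows])
```

A file-based catalog would instead have to fetch and parse one or more metadata files from object storage to answer the same question.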

What are the key features of DuckLake 1.0?

DuckLake 1.0 introduces several notable features:

  • Catalog-stored small updates: Instead of rewriting entire manifest files for tiny changes, DuckLake directly updates rows in the SQL metadata tables. This makes small INSERT, UPDATE, or DELETE operations much cheaper.
  • Improved sorting and partitioning: Users can define sorting orders and partition schemes at the table level, and DuckLake automatically maintains these when writing data. This leads to better query performance, especially for columnar scans.
  • Iceberg interoperability: DuckLake keeps its data in standard Parquet files, the same open file format Iceberg tables use, and is designed to be compatible with Iceberg-style data features. This eases sharing data between DuckLake and Iceberg engines without rewriting the data files.
  • DuckDB extension: The format is immediately usable via a DuckDB extension, requiring no external services or complex setup.
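The first bullet can be illustrated with the same toy-catalog idea (a hypothetical `data_file` table, not DuckLake's real schema): appending one file to a table is a single transactional row insert, rather than rewriting and re-uploading a manifest file.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data_file (table_name TEXT, path TEXT)")
con.execute("INSERT INTO data_file VALUES ('orders', 's3://lake/orders/a.parquet')")
con.commit()

# A small append touches exactly one metadata row, committed atomically;
# a manifest-based format would rewrite a whole manifest file instead.
with con:  # opens a transaction; commits on success, rolls back on error
    con.execute(
        "INSERT INTO data_file VALUES ('orders', 's3://lake/orders/b.parquet')"
    )

count = con.execute("SELECT COUNT(*) FROM data_file").fetchone()[0]
print(count)  # 2
```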

How is DuckLake implemented in DuckDB?

DuckLake 1.0 is delivered as a DuckDB extension that integrates with DuckDB’s existing storage and catalog APIs. When a user creates a DuckLake table, DuckDB stores the table’s metadata—such as column names, data types, partition boundaries, and file locations—in a SQL database (DuckDB itself or another transactional SQL backend). The data itself remains in Parquet files, placed either in object storage (like S3) or on a local filesystem. The extension handles all DDL and DML operations, translating ordinary SQL commands (CREATE TABLE, INSERT, SELECT) into metadata updates in the SQL catalog and data writes in storage. This architecture lets DuckDB leverage its built-in query optimizer and execution engine while offloading metadata management to a fast, transactional SQL layer.
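A minimal sketch of that split, with a local directory standing in for object storage, CSV standing in for Parquet, and sqlite3 standing in for the catalog (all names here are illustrative; the actual extension is implemented inside DuckDB): an insert writes a new data file, then records its location in the catalog inside one transaction.

```python
import csv
import sqlite3
import tempfile
import uuid
from pathlib import Path

store = Path(tempfile.mkdtemp())       # stands in for the object store
catalog = sqlite3.connect(":memory:")  # stands in for the SQL catalog
catalog.execute(
    "CREATE TABLE data_file (table_name TEXT, path TEXT, row_count INTEGER)"
)

def insert_rows(table_name, rows):
    """Write rows to a new immutable data file, then register it in the catalog."""
    path = store / f"{table_name}-{uuid.uuid4().hex}.csv"  # Parquet in real life
    with path.open("w", newline="") as f:
        csv.writer(f).writerows(rows)
    with catalog:  # the metadata update is a single atomic transaction
        catalog.execute(
            "INSERT INTO data_file VALUES (?, ?, ?)",
            (table_name, str(path), len(rows)),
        )

insert_rows("orders", [(1, "widget"), (2, "gadget")])
total = catalog.execute(
    "SELECT SUM(row_count) FROM data_file WHERE table_name = 'orders'"
).fetchone()[0]
print(total)  # 2
```

The query planner would consult only the catalog to decide which data files a SELECT needs to open.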

What is the role of the SQL catalog in DuckLake?

The SQL catalog is the central brain of DuckLake. It stores all table metadata in normalized relational tables: one table tracks schemas, another maps partitions to data files, and a third records version snapshots. This catalog replaces the manifest files that Iceberg or Delta Lake traditionally use. Because the catalog is a live SQL database, DuckLake can perform atomic, concurrent metadata operations (e.g., two simultaneous INSERTs) without the need for locking files or complex optimistic concurrency control. The catalog also enables instant point queries—for example, finding all files belonging to a specific partition becomes a simple indexed lookup. Furthermore, the SQL catalog can be replicated or scaled independently from the data lake, offering greater flexibility in multi-cluster or cloud-native deployments.
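The three-table layout described above can be sketched like this (a toy schema for illustration only; DuckLake's published catalog schema has more tables and columns): a point query such as "which files belong to partition X as of snapshot N" becomes a plain indexed lookup.

```python
import sqlite3

cat = sqlite3.connect(":memory:")
cat.executescript("""
    CREATE TABLE table_schema   (table_id INTEGER, col_name TEXT, col_type TEXT);
    CREATE TABLE partition_file (table_id INTEGER, partition_key TEXT, path TEXT,
                                 added_snapshot INTEGER);
    CREATE TABLE snapshot       (snapshot_id INTEGER, created_at TEXT);
    CREATE INDEX idx_pf ON partition_file (table_id, partition_key);
""")
cat.execute("INSERT INTO table_schema VALUES (1, 'order_id', 'BIGINT')")
cat.execute("INSERT INTO snapshot VALUES (1, '2026-05-03')")
cat.executemany("INSERT INTO partition_file VALUES (?, ?, ?, ?)", [
    (1, "2026-05-01", "s3://lake/orders/d=2026-05-01/a.parquet", 1),
    (1, "2026-05-02", "s3://lake/orders/d=2026-05-02/b.parquet", 1),
])

# Point query: files in one partition visible at snapshot 1.
paths = [r[0] for r in cat.execute(
    "SELECT path FROM partition_file "
    "WHERE table_id = 1 AND partition_key = '2026-05-02' AND added_snapshot <= 1"
)]
print(paths)
```

Because all of this lives in ordinary relational tables, standard database replication and backup apply to it unchanged.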


How does DuckLake compare to Iceberg?

Both DuckLake and Iceberg are open data lake formats with a focus on reliability and performance. The core difference lies in metadata storage: Iceberg uses file-based manifests stored in object storage, while DuckLake uses a SQL database. This gives DuckLake advantages for small metadata updates and low-latency catalog operations. However, Iceberg has broader ecosystem adoption (it is supported by Spark, Flink, Trino, and other engines), while DuckLake is currently limited to DuckDB. DuckLake 1.0 is designed to be compatible with Iceberg-style data features, meaning it can read Iceberg’s file layout and partitioning schemes. This allows users to transition gradually: they can keep existing Iceberg tables while using DuckLake’s SQL catalog for metadata queries. In terms of performance, DuckLake may be faster for metadata-heavy workloads, but Iceberg offers more mature tooling and wider engine support.

What benefits does storing metadata in SQL bring?

Storing metadata in a SQL database instead of object store files provides several practical benefits:

  1. Faster metadata queries: SQL databases (like DuckDB or PostgreSQL) are optimized for indexed lookups, aggregations, and joins. Listing all files for a partition or finding the latest snapshot becomes a millisecond operation, compared to scanning multiple JSON files.
  2. Atomic, transactional updates: Small changes (e.g., adding a single file) can be done with a SQL transaction rather than rewriting a manifest file. This eliminates write amplification and reduces the risk of partial updates.
  3. Simpler concurrency: Multiple writers can update metadata concurrently using standard database locking mechanisms, without custom conflict resolution code.
  4. Easy integration: Existing SQL tools, monitoring, and backup solutions can directly access the metadata catalog, simplifying governance and auditing.
  5. Cost efficiency: Object store LIST operations are expensive; DuckLake avoids them for metadata access, potentially lowering cloud storage costs.
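Point 2 can be demonstrated directly with a stand-in SQL catalog (a toy table, not DuckLake's schema): if anything fails mid-update, the transaction rolls back and no partial metadata is ever visible.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data_file (table_name TEXT, path TEXT)")

try:
    with con:  # one transaction for a multi-step metadata change
        con.execute(
            "INSERT INTO data_file VALUES ('orders', 's3://lake/a.parquet')"
        )
        raise RuntimeError("writer crashed mid-update")  # simulate a failure
except RuntimeError:
    pass

# The half-finished update was rolled back: the catalog is still empty.
count = con.execute("SELECT COUNT(*) FROM data_file").fetchone()[0]
print(count)  # 0
```

With file-based manifests, the same crash could leave a dangling or truncated metadata file that readers must detect and recover from.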

How does DuckLake improve sorting and partitioning?

DuckLake 1.0 introduces enhanced sorting and partitioning options that are managed through the SQL catalog. Users can define a table’s sort order (e.g., by date then by customer_id) at creation time. When new data is written, DuckLake automatically sorts the records before writing them into Parquet files, ensuring data within each file is physically ordered. This improves the efficiency of range scans and filter pushdown. For partitioning, DuckLake allows hierarchical partitioning schemes (e.g., year/month/day) and supports dynamic partition evolution—adding new partitions without rewriting existing files. The SQL catalog stores partition boundaries as indexed rows, making partition pruning extremely fast. Compared to Iceberg’s manifest-based partitioning, DuckLake’s catalog approach eliminates the need to read multiple manifest files to identify which partitions to scan. As a result, queries with high-selectivity filters can skip entire partitions instantly.
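Both behaviors can be sketched in miniature (illustrative names only; DuckLake performs this internally when writing Parquet files): records are sorted on the declared sort key before being written out, and partition pruning reduces to a WHERE clause on indexed partition rows.

```python
import sqlite3
from operator import itemgetter

# Sort incoming records on the declared sort key (date, then customer_id)
# before "writing a file", so each file is physically ordered for range scans.
records = [("2026-05-02", 7), ("2026-05-01", 3), ("2026-05-01", 9)]
sorted_records = sorted(records, key=itemgetter(0, 1))

# Partition boundaries live as indexed rows in the catalog.
cat = sqlite3.connect(":memory:")
cat.execute("CREATE TABLE partition_file (partition_key TEXT, path TEXT)")
cat.execute("CREATE INDEX idx_pk ON partition_file (partition_key)")
cat.executemany("INSERT INTO partition_file VALUES (?, ?)", [
    ("2026-05-01", "s3://lake/orders/d=2026-05-01/a.parquet"),
    ("2026-05-02", "s3://lake/orders/d=2026-05-02/b.parquet"),
    ("2026-05-03", "s3://lake/orders/d=2026-05-03/c.parquet"),
])

# A high-selectivity filter touches one indexed row; other partitions are skipped.
survivors = [r[0] for r in cat.execute(
    "SELECT path FROM partition_file WHERE partition_key = '2026-05-02'"
)]
print(sorted_records[0], survivors)
```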