Project Nessie

Project Nessie is an open-source, transactional data catalog for managing tables in an open data lakehouse. It primarily targets Apache Iceberg tables; support for other formats such as Delta Lake has been more limited.

Git-Like Data Management

Nessie introduces version control concepts to data engineering pipelines, allowing users to interact with their data lakehouse in a manner similar to Git:

  • Branches: Users can create branches of their catalog (e.g., a dev branch) to test new ingestions or transformations in isolation. Creating a branch is a metadata-only operation and does not copy the underlying data files.
  • Commits: Catalog operations are bundled into atomic commits, so concurrent readers always see a consistent state of the tables and never a partially applied change.
  • Merges: Once the tests on an isolated branch succeed, its changes can be merged back into the main branch atomically, so half-written states are never exposed.
  • Tags & Time-Travel: Users can tag specific catalog states (e.g., q4_finance_close) and query exactly what the data looked like at that commit or tag.
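The four concepts above can be illustrated with a small in-memory model. This is only a sketch under stated assumptions, not Nessie's actual API or storage model: the `ToyCatalog` class, its method names, and the `s3://lake/...` paths are hypothetical, and the merge shown is a simple fast-forward rather than a full merge of diverged histories. It shows why branching and tagging are cheap: a reference is just a name pointing at an immutable commit, and a commit only maps table names to metadata file locations.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True, eq=False)
class Commit:
    """Immutable catalog state: table name -> metadata file pointer."""
    tables: Dict[str, str]
    parent: Optional["Commit"] = None
    message: str = ""


class ToyCatalog:
    """Hypothetical in-memory catalog for illustration; not Nessie's API."""

    def __init__(self) -> None:
        self.branches: Dict[str, Commit] = {"main": Commit(tables={}, message="root")}
        self.tags: Dict[str, Commit] = {}

    def _resolve(self, ref: str) -> Commit:
        # A reference is either a branch name or a tag name.
        return self.branches[ref] if ref in self.branches else self.tags[ref]

    def create_branch(self, name: str, source: str = "main") -> None:
        # Metadata-only: the new branch points at an existing commit.
        self.branches[name] = self._resolve(source)

    def commit(self, branch: str, changes: Dict[str, str], message: str) -> Commit:
        head = self.branches[branch]
        new = Commit(tables={**head.tables, **changes}, parent=head, message=message)
        self.branches[branch] = new  # one pointer swap publishes all changes
        return new

    def merge(self, source: str, target: str = "main") -> None:
        # Simplification: fast-forward only. The target head must be an
        # ancestor of the source head, otherwise the histories have diverged.
        source_head = self.branches[source]
        target_head = self.branches[target]
        node: Optional[Commit] = source_head
        while node is not None and node is not target_head:
            node = node.parent
        if node is None:
            raise ValueError("target has diverged; this sketch only fast-forwards")
        self.branches[target] = source_head

    def tag(self, name: str, ref: str) -> None:
        self.tags[name] = self._resolve(ref)

    def read(self, ref: str, table: str) -> Optional[str]:
        return self._resolve(ref).tables.get(table)


catalog = ToyCatalog()
catalog.commit("main", {"sales": "s3://lake/sales/v1.metadata.json"}, "initial load")
catalog.tag("q4_finance_close", "main")

catalog.create_branch("dev")
catalog.commit("dev", {"sales": "s3://lake/sales/v2.metadata.json"}, "reprocess sales")

before_merge = catalog.read("main", "sales")           # main is unaffected by dev
catalog.merge("dev", "main")
after_merge = catalog.read("main", "sales")            # main now sees the dev change
at_tag = catalog.read("q4_finance_close", "sales")     # the tag still sees the old state

print(before_merge)  # s3://lake/sales/v1.metadata.json
print(after_merge)   # s3://lake/sales/v2.metadata.json
print(at_tag)        # s3://lake/sales/v1.metadata.json
```

No data files are touched at any point in this model; only pointers move, which is the sense in which branch, merge, and tag operations are metadata-only.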

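Because a single commit can update several tables at once, Nessie provides multi-table transactional guarantees across entire namespaces. Because branches, tags, and merges operate on metadata only, these operations are zero-copy.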
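One common way to obtain this kind of all-or-nothing, multi-table guarantee is an optimistic check on the branch head at commit time. The sketch below assumes that approach for illustration only; it does not claim to describe Nessie's internals. `ToyBranch`, `ConflictError`, and the table locations are hypothetical names.

```python
import threading
from typing import Dict, Tuple


class ConflictError(Exception):
    """Raised when the branch moved since the writer took its snapshot."""


class ToyBranch:
    """Hypothetical branch head guarded by a version check; not Nessie's API."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._version = 0
        self._tables: Dict[str, str] = {}

    def snapshot(self) -> Tuple[int, Dict[str, str]]:
        # Readers get a version number and a consistent copy of the table map.
        with self._lock:
            return self._version, dict(self._tables)

    def commit(self, expected_version: int, changes: Dict[str, str]) -> int:
        # All changes become visible together, or none of them do.
        with self._lock:
            if self._version != expected_version:
                raise ConflictError(
                    f"expected version {expected_version}, found {self._version}"
                )
            self._tables = {**self._tables, **changes}
            self._version += 1
            return self._version


branch = ToyBranch()

# Two writers start from the same snapshot.
version_a, _ = branch.snapshot()
version_b, _ = branch.snapshot()

# Writer A updates two tables in one commit.
branch.commit(version_a, {"orders": "orders/v1", "customers": "customers/v1"})

# Writer B is now stale: its whole commit is rejected, not half-applied.
try:
    branch.commit(version_b, {"orders": "orders/v2", "invoices": "invoices/v1"})
    conflict = False
except ConflictError:
    conflict = True

final_version, final_tables = branch.snapshot()
print(conflict)       # True
print(final_version)  # 1
print(final_tables)   # {'orders': 'orders/v1', 'customers': 'customers/v1'}
```

In this model a rejected writer would take a fresh snapshot and retry. Neither of its two table changes ever becomes visible on its own, which is the property the multi-table guarantee describes.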


Part of the Data & AI Terms glossary.

This page is mirrored from the GitHub Wiki.