Apache Iceberg: Building a Modern Data Lakehouse
Data lakes promised cheap, scalable storage for all your data. The reality was often a “data swamp” — no schema enforcement, no ACID transactions, and queries that scanned entire directories. Apache Iceberg fixes this by adding a table format layer on top of data lake storage that brings database-like features: schema evolution, time travel, partition evolution, and ACID transactions — all on files stored in S3 or HDFS. This guide covers how Iceberg works and how to build a production data lakehouse.
Why Data Lakes Need a Table Format
Traditional Hive-partitioned data lakes organize data by directory structure: s3://data/events/year=2026/month=03/day=09/. The query engine must list directories and files to figure out what to read. This breaks down at scale: listing millions of files in S3 is slow, schema changes require rewriting all data, and there’s no way to read a consistent snapshot while new data is being written.
Iceberg introduces a metadata layer that tracks exactly which files belong to each table, their schemas, partition information, and column-level statistics. Instead of listing directories, the query engine reads a compact metadata file that tells it exactly which data files to read. Moreover, each write creates a new snapshot — readers always see a consistent view of the table, even during concurrent writes. This is the same isolation guarantee you’d expect from a traditional database, but operating on files in object storage.
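To make the metadata layer concrete, here is a minimal Python sketch of the idea, not Iceberg's actual implementation: a snapshot's metadata lists every data file along with column statistics, so the planner can skip files without ever listing directories. The `DataFile` structure and the single-column stats are simplifying assumptions.

```python
from dataclasses import dataclass

@dataclass
class DataFile:
    """Simplified stand-in for an Iceberg manifest entry:
    a file path plus min/max statistics for one column."""
    path: str
    min_ts: str  # smallest timestamp value in this file
    max_ts: str  # largest timestamp value in this file

# A snapshot's metadata enumerates exactly the files in the table --
# no S3 directory listing required.
snapshot = [
    DataFile("s3://data/events/f1.parquet", "2026-01-01", "2026-01-31"),
    DataFile("s3://data/events/f2.parquet", "2026-02-01", "2026-02-28"),
    DataFile("s3://data/events/f3.parquet", "2026-03-01", "2026-03-31"),
]

def plan_scan(files, lower_bound):
    """Prune files whose max timestamp falls below the filter --
    the same idea Iceberg applies with per-column min/max stats."""
    return [f.path for f in files if f.max_ts >= lower_bound]

print(plan_scan(snapshot, "2026-02-15"))
# only f2 and f3 survive pruning
```

A real manifest carries stats for every column plus partition data, but the planning principle is the same: decide what to read from metadata, not from storage listings.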
The Apache Iceberg table format is engine-agnostic: Spark, Trino, Flink, Dremio, Snowflake, and BigQuery can all read and write Iceberg tables. This engine independence means you’re not locked into a single query engine — use Spark for batch processing, Trino for interactive queries, and Flink for streaming, all operating on the same tables.
Schema Evolution Without Rewriting Data
One of Iceberg’s most valuable features is safe schema evolution. You can add columns, rename columns, reorder columns, widen types (int to long), and make required columns optional — all without rewriting existing data files. The metadata layer tracks schema versions, and the query engine maps old data files to the current schema automatically.
-- Create an Iceberg table (Spark SQL)
CREATE TABLE lakehouse.events (
event_id STRING,
user_id STRING,
event_type STRING,
timestamp TIMESTAMP,
properties MAP<STRING, STRING>
)
USING iceberg
PARTITIONED BY (days(timestamp), event_type)
TBLPROPERTIES (
'write.format.default' = 'parquet',
'write.parquet.compression-codec' = 'zstd'
);
-- Schema evolution — no data rewrite needed
ALTER TABLE lakehouse.events ADD COLUMNS (
session_id STRING,
device_type STRING
);
ALTER TABLE lakehouse.events RENAME COLUMN properties TO metadata;
-- Old data files still work — missing columns return NULL
SELECT event_id, session_id -- session_id is NULL for old rows
FROM lakehouse.events
WHERE timestamp > '2026-01-01';

In contrast, Hive tables require you to either rewrite all existing data files with the new schema (expensive and slow for terabyte-scale tables) or maintain complex schema-on-read logic in every query. Additionally, Iceberg enforces schema validation on writes — you can’t accidentally write data with the wrong column types, a common source of data quality issues in traditional data lakes.
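The mechanism behind safe renames and adds is that Iceberg tracks every column by a stable numeric field ID; names are just labels. A minimal Python sketch of that resolution, with hypothetical field IDs and row data:

```python
# Sketch of Iceberg's field-ID-based column resolution (simplified).
# Each column has a stable numeric ID, so renaming a column or adding
# a new one is a metadata-only change -- data files are untouched.

current_schema = {1: "event_id", 2: "metadata", 3: "session_id"}  # id -> name
# field 2 was renamed from "properties"; field 3 was added later

# An old data file stores values keyed by the field IDs that existed
# when it was written.
old_file_row = {1: "evt-001", 2: {"k": "v"}}

def project(row, schema):
    """Map a stored row onto the current schema by field ID.
    Fields added after the file was written read as None (NULL)."""
    return {name: row.get(fid) for fid, name in schema.items()}

print(project(old_file_row, current_schema))
# {'event_id': 'evt-001', 'metadata': {'k': 'v'}, 'session_id': None}
```

This is why the `SELECT event_id, session_id` query above returns NULL for old rows: the reader resolves IDs, finds no field 3 in the old file, and fills in NULL.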
Time Travel and Snapshot Isolation
Every write to an Iceberg table creates a new snapshot. You can query any historical snapshot by timestamp or snapshot ID — essential for debugging data issues, reproducing ML training datasets, and auditing changes. Concurrent readers always see a consistent snapshot, even while writers are adding new data.
-- Query data as it existed at a specific time
SELECT * FROM lakehouse.events
TIMESTAMP AS OF '2026-03-01 00:00:00';
-- Query a specific snapshot
SELECT * FROM lakehouse.events
VERSION AS OF 5847293048;
-- View snapshot history
SELECT * FROM lakehouse.events.snapshots;
-- View row-level changes between snapshots (Spark procedure)
CALL lakehouse.system.create_changelog_view(
  table => 'events',
  options => map('start-snapshot-id', '5847293048',
                 'end-snapshot-id', '5847293055')
);
-- Rollback to a previous snapshot (undo bad writes)
CALL lakehouse.system.rollback_to_snapshot(
'events', 5847293048
);
-- Cherry-pick: apply changes from a specific snapshot
CALL lakehouse.system.cherrypick_snapshot(
'events', 5847293055
);

Time travel is also powerful for regulatory compliance. If an auditor asks “what did the customer data look like on January 15th?”, you query that exact timestamp without maintaining separate audit tables. Furthermore, ML teams use time travel to create reproducible training datasets — freeze a snapshot ID in your training config, and you can always reproduce the exact same dataset months later.
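Under the hood, `TIMESTAMP AS OF` resolves against the table's snapshot log: a list of (commit time, snapshot ID) pairs. A small Python sketch of that resolution, using the made-up snapshot IDs from the examples above:

```python
import bisect

# Iceberg keeps a snapshot log of (commit-time, snapshot-id) pairs;
# time travel picks the latest snapshot committed at or before the
# requested timestamp. The IDs below are illustrative.
snapshot_log = [
    ("2026-01-10 08:00:00", 5847293048),
    ("2026-02-20 08:00:00", 5847293052),
    ("2026-03-05 08:00:00", 5847293055),
]

def snapshot_as_of(log, ts):
    """Binary-search the log for the last snapshot at or before ts."""
    times = [t for t, _ in log]
    i = bisect.bisect_right(times, ts)
    if i == 0:
        raise ValueError(f"no snapshot exists at or before {ts}")
    return log[i - 1][1]

print(snapshot_as_of(snapshot_log, "2026-03-01 00:00:00"))
# resolves to 5847293052, the last snapshot committed before March 1
```

This also explains why expiring snapshots (covered below) limits how far back you can time travel: entries removed from the log are no longer resolvable.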
Partition Evolution: Change Partitioning Without Rewriting
Partition evolution is uniquely powerful in Apache Iceberg. Traditional Hive tables encode partition values in directory paths — changing the partitioning scheme requires rewriting all data. Iceberg decouples partitioning from the physical file layout, allowing you to change partition strategies without touching existing data.
-- Start with daily partitioning
CREATE TABLE lakehouse.logs (
log_id STRING,
service STRING,
level STRING,
message STRING,
timestamp TIMESTAMP
)
USING iceberg
PARTITIONED BY (days(timestamp));
-- Data grows — switch to hourly partitioning for recent data
ALTER TABLE lakehouse.logs
ADD PARTITION FIELD hours(timestamp);
-- Drop the old daily partitioning
ALTER TABLE lakehouse.logs
DROP PARTITION FIELD days(timestamp);
-- New writes use hourly partitions
-- Old data (daily partitions) is still readable
-- Both partition schemes coexist in the metadata

This is transformative for growing datasets. You might start with monthly partitions when your table has 100GB, switch to daily at 10TB, and hourly at 100TB — all without downtime or data rewriting. The query engine uses metadata to apply the correct partition filter regardless of which scheme was active when the data was written.
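How can two partition schemes coexist? Each data file records the ID of the partition spec it was written with, and the planner prunes each file under its own spec. A simplified Python sketch of that per-file resolution (the spec table and file entries are illustrative, not Iceberg's real structures):

```python
from datetime import datetime

# Sketch of partition-spec evolution: each data file carries the ID of
# the partition spec that was active when it was written.
SPECS = {
    0: lambda ts: ("day", ts.strftime("%Y-%m-%d")),      # original daily spec
    1: lambda ts: ("hour", ts.strftime("%Y-%m-%d-%H")),  # evolved hourly spec
}

files = [
    {"path": "f_old.parquet", "spec_id": 0, "partition": ("day", "2026-03-01")},
    {"path": "f_new.parquet", "spec_id": 1, "partition": ("hour", "2026-03-09-14")},
]

def matches(file, ts):
    """Transform the filter value with the file's own spec, so each
    file is pruned under the scheme it was written with."""
    return SPECS[file["spec_id"]](ts) == file["partition"]

query_ts = datetime(2026, 3, 9, 14, 30)
print([f["path"] for f in files if matches(f, query_ts)])
```

Because the transform is applied at plan time from metadata, no directory layout has to change when the spec evolves.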
Production Integration with Spark and Trino
A practical lakehouse uses multiple engines: Spark for heavy ETL and batch processing, Trino for fast interactive queries and dashboards, and Flink for real-time streaming ingestion. All three work with the same Iceberg tables through a shared catalog.
# PySpark — batch ETL with Iceberg
from pyspark.sql import SparkSession
spark = (
    SparkSession.builder
    .config("spark.sql.catalog.lakehouse",
            "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lakehouse.type", "rest")
    .config("spark.sql.catalog.lakehouse.uri",
            "http://iceberg-rest-catalog:8181")
    .config("spark.sql.catalog.lakehouse.warehouse",
            "s3://my-lakehouse/warehouse")
    .getOrCreate()
)
# Read Iceberg table
events = spark.table("lakehouse.events")
# Incremental read — only new snapshots since last run
new_events = (
    spark.read.format("iceberg")
    .option("start-snapshot-id", last_processed_snapshot)
    .table("lakehouse.events")
)
# Merge (upsert) — ACID transaction
spark.sql("""
MERGE INTO lakehouse.customers target
USING updates source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
""")
# Maintenance — compact small files
spark.sql("""
CALL lakehouse.system.rewrite_data_files(
table => 'events',
options => map('target-file-size-bytes', '134217728')
)
""")
# Expire old snapshots (free storage)
spark.sql("""
CALL lakehouse.system.expire_snapshots(
table => 'events',
older_than => TIMESTAMP '2026-02-01 00:00:00',
retain_last => 10
)
""")Table maintenance is essential for production Iceberg deployments. Small files accumulate from streaming writes and need periodic compaction. Old snapshots consume storage and should be expired after your retention period. Orphan files from failed writes need cleanup. Consequently, schedule these maintenance operations as regular batch jobs — weekly compaction and daily snapshot expiration work well for most workloads.
In conclusion, Apache Iceberg transforms data lakes into reliable data lakehouses with database-grade features. Schema evolution, time travel, and partition evolution eliminate the most painful aspects of data lake management. Start with a single table, use your existing Spark or Trino infrastructure, and expand as you see the operational benefits over traditional Hive-partitioned tables.