ajayc
Employee

Think Iceberg is only for big data or streaming pipelines? Think again. It turns your data lake into a reliable, database-like platform — flexible, future-proof, and ready for any engine you choose.

The modern data world is buzzing with statements like:

  • “Iceberg is the future of the lakehouse.”
  • “Iceberg is basically a data warehouse on S3.”
  • “Iceberg is just Delta Lake but open.”
  • “If you’re doing CDC, you need Iceberg.”
  • ...

The reality is: Iceberg is an extremely powerful technology — but it’s also one of the most misunderstood.

And those misunderstandings matter. They lead teams to build the wrong architecture, expect warehouse-like performance overnight, or avoid Iceberg entirely because they assume it’s “only for big data companies.”

So let’s clear the air.

Here are the biggest myths about Apache Iceberg and the lakehouse — and what’s actually true.

Myth #1: Iceberg is just a file format (or “just Parquet with metadata”)

This is probably the most common misunderstanding.

Iceberg is not a storage format like Parquet or ORC. Parquet defines how data is stored inside files. Iceberg defines how those files are managed as a table.

Iceberg provides a full table abstraction including:

  • snapshot-based versioning
  • atomic commits
  • partition evolution
  • schema evolution
  • metadata-driven planning
  • support for deletes, updates, and merges

In other words, Iceberg isn’t “a better Parquet.” It’s the layer that makes object storage behave more like a database/data warehouse.
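To make the "table layer" idea concrete, here is a minimal toy model in Python. It is not Iceberg's API: `ToyTable`, `Snapshot`, and `CommitConflict` are hypothetical names invented for illustration, and real Iceberg keeps this state in metadata and manifest files behind a catalog. The sketch only shows the shape of the idea: data files are immutable, a snapshot is a list of files, and a commit is a single pointer swap that fails if another writer got there first.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    data_files: frozenset  # immutable set of data file paths


class CommitConflict(Exception):
    """Raised when another writer committed first."""


@dataclass
class ToyTable:
    snapshots: dict = field(default_factory=dict)
    current_id: int = 0

    def __post_init__(self):
        # Snapshot 0 represents the empty table.
        self.snapshots[0] = Snapshot(0, frozenset())

    def current(self) -> Snapshot:
        return self.snapshots[self.current_id]

    def commit_append(self, base_id: int, new_files: set) -> Snapshot:
        # Optimistic concurrency: succeed only if the table still points
        # at the snapshot this writer started from.
        if base_id != self.current_id:
            raise CommitConflict(f"table moved from {base_id} to {self.current_id}")
        base = self.snapshots[base_id]
        snap = Snapshot(max(self.snapshots) + 1, base.data_files | frozenset(new_files))
        self.snapshots[snap.snapshot_id] = snap
        self.current_id = snap.snapshot_id  # the single pointer swap
        return snap

    def rollback_to(self, snapshot_id: int) -> None:
        # Rollback re-points the table at an older snapshot; no files are rewritten.
        if snapshot_id not in self.snapshots:
            raise KeyError(snapshot_id)
        self.current_id = snapshot_id


table = ToyTable()
s1 = table.commit_append(0, {"part-0001.parquet"})
reader_view = table.current()  # a reader pins snapshot 1
s2 = table.commit_append(s1.snapshot_id, {"part-0002.parquet"})
```

The reader that pinned `reader_view` keeps seeing one file even after the second commit, which is the consistent-read behavior that plain Parquet files in a folder cannot give you.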

Iceberg is a database-like table abstraction for a data lake. That’s why it’s such a powerful building block for lakehouse architecture.

 

Myth #2: Iceberg replaces your data warehouse

This is probably one of the most debated topics in the lakehouse world. Some people hear “Iceberg lakehouse” and assume the conclusion is obvious: “So Iceberg replaces Snowflake / Redshift / BigQuery?”

Reality: Iceberg is increasingly replacing the warehouse as the storage layer, but not always replacing the warehouse as a query engine.

Iceberg enables organizations to store data in open object storage while still gaining database-like capabilities such as transactional updates, schema evolution, atomic commits and consistent reads.

On its own, Iceberg is “just” a table format. But the ecosystem around Iceberg has evolved rapidly. Managed lakehouse solutions like Qlik Open Lakehouse and Snowflake-managed Iceberg tables (along with catalogs like Glue, Polaris, and others) now provide many of the features that teams historically depended on data warehouses for:

  • managed compute and scaling
  • automated optimization and compaction
  • governance via centralized catalogs
  • access control and auditing
  • growing BI ecosystem compatibility

This is why the modern architecture trend is shifting toward a new model:

  • Iceberg becomes the system of record (storage layer)
  • multiple compute engines (including warehouses) query Iceberg directly
  • workloads gradually shift away from warehouse-managed storage to the open lake storage
  • data duplication across systems becomes optional, not mandatory
  • warehouses increasingly act as one of many query engines

So Iceberg doesn’t always eliminate warehouses — but it does change their role dramatically.

Iceberg isn’t just competing with warehouses. It’s redefining them. “Iceberg is replacing the warehouse as the storage layer, while warehouses increasingly become just another compute engine.”

 

Myth #3: Iceberg automatically solves performance

Iceberg enables performance. It doesn’t magically deliver it. 

Yes, Iceberg introduces powerful capabilities such as:

  • metadata pruning
  • partition pruning without hard-coded directory structures
  • snapshot planning
  • table statistics
  • faster planning for large datasets
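As a rough illustration of metadata pruning, the sketch below skips data files using per-file minimum and maximum column values. This is a simplified toy, not Iceberg's planner: `DataFileStats` and `plan_files` are hypothetical names, and the file names and dates are made up. The point is only that the engine can decide which files to open from metadata alone, without listing directories or reading the files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFileStats:
    path: str
    min_order_date: str  # ISO dates compare correctly as plain strings
    max_order_date: str


def plan_files(files, lo, hi):
    """Keep only files whose [min, max] range overlaps the query range [lo, hi]."""
    return [f for f in files if f.max_order_date >= lo and f.min_order_date <= hi]


# Hypothetical per-file statistics, as a manifest might record them.
manifest = [
    DataFileStats("a.parquet", "2024-01-01", "2024-01-31"),
    DataFileStats("b.parquet", "2024-02-01", "2024-02-29"),
    DataFileStats("c.parquet", "2024-03-01", "2024-03-31"),
]

selected = plan_files(manifest, "2024-02-10", "2024-03-05")
print([f.path for f in selected])  # ['b.parquet', 'c.parquet']
```

Pruning like this only helps when files have narrow, mostly non-overlapping value ranges, which is one reason sorting and compaction matter so much in practice.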

But Iceberg is not a “set it and forget it” system. Performance still depends heavily on operational practices:

  • compaction (small file management)
  • partitioning strategy and partition evolution
  • clustering / sorting
  • snapshot expiration and metadata cleanup
  • compute engine choice (Spark vs Trino vs Flink, etc.)

This is where many teams get surprised. They adopt Iceberg expecting instant warehouse-like performance, but forget that warehouses continuously optimize tables behind the scenes.

In the Iceberg world, that optimization work must be done either:

  • by your platform team, or
  • by a managed lakehouse solution (such as Qlik Open Lakehouse, or other managed Iceberg platforms)

Without this, even an Iceberg-based lakehouse can degrade over time into something that looks like the old data lake problem: too many files, slow queries, rising compute cost, and unpredictable performance.
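To give a feel for what "small file management" involves, here is a minimal sketch of compaction planning: grouping small files into rewrite batches that approach a target size. The function name `plan_compaction`, the thresholds, and the file sizes are all assumptions for illustration. Real engines and managed platforms use more sophisticated strategies and also have to commit the rewrite safely.

```python
def plan_compaction(file_sizes_mb, target_mb=512, small_threshold_mb=128):
    """Group small files into rewrite batches of at most target_mb each.

    file_sizes_mb maps file path -> size in MB. Files at or above
    small_threshold_mb are left alone.
    """
    small = sorted(
        ((path, size) for path, size in file_sizes_mb.items() if size < small_threshold_mb),
        key=lambda item: item[1],
        reverse=True,
    )
    groups, current, current_size = [], [], 0
    for path, size in small:
        # Close the current batch when adding this file would exceed the target.
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        groups.append(current)
    # Rewriting a single file on its own gains nothing, so drop singleton batches.
    return [g for g in groups if len(g) > 1]


# Hypothetical table: one healthy 600 MB file and six small ones.
files = {"f1": 600, "f2": 100, "f3": 100, "f4": 100, "f5": 100, "f6": 100, "f7": 20}
groups = plan_compaction(files, target_mb=256)
print(groups)  # [['f2', 'f3'], ['f4', 'f5'], ['f6', 'f7']]
```

In this example six small files become three larger ones, and the already healthy `f1` is untouched. Someone still has to schedule this work and run it repeatedly, which is the operational discipline this section is about.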

Iceberg isn’t a magic speed button.
It’s a system that makes optimization possible and sustainable, but only with operational discipline.

 

Myth #4: Iceberg is only for huge data volumes

This myth is surprisingly common, especially among teams who associate Iceberg with “big data” platforms.

But Iceberg is not only valuable at petabyte scale.

Even smaller organizations benefit because Iceberg solves painful operational problems that show up early:

  • schema changes breaking pipelines
  • inconsistent reads across jobs
  • unreliable partitioning logic
  • lack of rollback when data is corrupted
  • messy S3 folder conventions
  • hard-to-manage incremental loads
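The first item, schema changes breaking pipelines, is a good example of a problem that shows up at any scale. Iceberg tracks columns by stable field IDs rather than by name, so a rename or an added column is a metadata change and older files stay readable. The toy sketch below imitates that idea with plain dictionaries. The data, the schemas, and the `read` helper are hypothetical, and this is not how Iceberg stores rows physically.

```python
# Each data file stores values keyed by a stable field ID, not by column name.
old_file_rows = [{1: 101, 2: "alice"}]         # written under schema v1
new_file_rows = [{1: 102, 2: "bob", 3: "DE"}]  # written under schema v2

schema_v1 = {1: "id", 2: "name"}
schema_v2 = {1: "id", 2: "customer_name", 3: "country"}  # rename plus added column


def read(rows, schema):
    """Project stored rows onto a schema; fields missing from a file read as None."""
    return [{name: row.get(fid) for fid, name in schema.items()} for row in rows]


# Old and new files are read together under the latest schema.
result = read(old_file_rows + new_file_rows, schema_v2)
print(result)
```

The old file was never rewritten, yet it reads cleanly under the new column name, with the added column coming back as `None`.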

But the biggest reason Iceberg matters early isn’t scale. It’s interoperability.

Interoperability = future-proofing

Iceberg lets you store your data once in object storage and query it from multiple engines such as Spark, Trino, Flink, and Athena, and even from modern warehouse platforms such as Snowflake, Databricks, and Redshift.

That means you don’t have to copy and duplicate datasets across multiple warehouses, marts, or analytics platforms just to support different teams and tools. You avoid building an architecture where every new use case requires another data copy.

This becomes even more important as you grow. Most organizations start with one analytics tool — but over time they add BI workloads, ML pipelines, real-time ingestion, governance requirements, and multiple compute engines.

If your data foundation is closed or warehouse-specific, growth often forces a painful redesign. Iceberg helps you avoid that. It’s about building a data architecture that won’t force you to rebuild everything when you grow.

Iceberg isn’t about big data. It’s about reliable tables on object storage.

And yes — cost matters

Iceberg doesn’t just reduce storage cost. It reduces the bigger hidden cost in data platforms:

duplicating the same datasets across multiple systems.

Even at smaller scale, fewer copies means less ETL, less operational overhead, and fewer expensive warehouse storage footprints.

 

Myth #5: Iceberg is only for CDC (streaming ingestion)

Iceberg is often discussed alongside Debezium, Kafka, Flink, and incremental ingestion pipelines. That leads many people to believe:

“Iceberg is mainly for CDC pipelines.”

CDC is a great use case — but it’s not the whole story.

Iceberg works extremely well for traditional workloads too:

  • batch ETL pipelines
  • curated data marts
  • BI reporting datasets
  • ML feature tables
  • audit-friendly datasets requiring reproducibility
  • GDPR deletes and corrections
  • incremental batch pipelines (even without Kafka)
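The last item, incremental batch pipelines without Kafka, follows naturally from snapshots: a nightly job can ask which files were added since the snapshot it last processed. The sketch below is a toy under stated assumptions. It assumes an append-only history, the snapshot IDs and file names are made up, and `files_added_between` is a hypothetical helper rather than an Iceberg call.

```python
def files_added_between(snapshots, from_id, to_id):
    """Return data files present in snapshot to_id but not in snapshot from_id.

    snapshots maps snapshot ID -> set of data file paths.
    Assumes an append-only history (no deletes or rewrites in between).
    """
    return sorted(snapshots[to_id] - snapshots[from_id])


# Hypothetical snapshot history produced by three daily batch loads.
snapshots = {
    1: {"day1.parquet"},
    2: {"day1.parquet", "day2.parquet"},
    3: {"day1.parquet", "day2.parquet", "day3.parquet"},
}

last_processed = 1
new_files = files_added_between(snapshots, last_processed, 3)
print(new_files)  # ['day2.parquet', 'day3.parquet']
```

The job reads only the two new files and then records snapshot 3 as its new high-water mark, with no streaming infrastructure involved.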

Streaming ingestion is one reason Iceberg is popular, but Iceberg’s real strength is broader:

it brings transactional table management and consistent reads to the data lake.

CDC is a use case. Iceberg is the table layer.

The Real Story: What Iceberg Actually Gives You

Iceberg is best understood as a modern table system designed for object storage. It provides three major outcomes:

  1. Reliability

Atomic commits, consistent snapshots, and rollback capability.

  2. Flexibility

Multi-engine access across Spark, Trino, Flink, Athena, and more.

  3. Long-term scalability

Metadata-driven planning, partition evolution, and manageable table growth.

This is why Iceberg has become so central to lakehouse architecture. It’s not just a format. It’s not just for streaming. It’s not only for “big data.”

It’s a way to make your data lake behave like a real platform — a buffet-style platform where you can pick the tools and engines that work best for each use case.

If you’re exploring Iceberg adoption, table maintenance strategies, or lakehouse architecture patterns, feel free to reach out or connect — I’d love to compare notes.

 
