This blog post details what happens under the hood when interacting with Apache Iceberg Tables. It explains how the different components in Apache Iceberg Architecture work with a simple example in Apache Spark with Python.
What is Apache Iceberg?
Apache Iceberg is an Open Table Format (OTF) created in 2017 at Netflix by Ryan Blue and Daniel Weeks. The project was open-sourced and donated to the Apache Software Foundation in 2018.
OTFs, also known as modern data lake table formats, define a table as a canonical list of files rather than as a set of directories, which is how the Hive table format does it. The table metadata tells the engine exactly which files make up the table. This granular way of defining a table enables features like ACID transactions, consistent reads, safe concurrent reads and writes by multiple readers and writers, time travel, easy schema evolution without rewriting the entire table, and more.
These features also come with performance benefits: the metadata lets engines avoid excessive file listing, which makes query planning quicker.
Alrighty, now to the main discussion of this blog.
Apache Iceberg Architecture
An Apache Iceberg table has three layers: the Catalog Layer, the Metadata Layer, and the Data Layer.
Let's take a peek inside these different layers.
Data Layer
This is the layer where the actual data for the table is stored, and it is primarily made up of data files. Apache Iceberg is file-format agnostic and currently supports Apache Parquet, Apache ORC, and Apache Avro, with Parquet as the default. Being file-format agnostic lets a user choose the underlying file format based on the use case. For example, Parquet might be used for a large-scale OLAP analytics table, whereas Avro might be used for a low-latency streaming analytics table.
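To make the file-format choice a bit more concrete, here is a minimal sketch that builds the Spark SQL statement for changing a table's default write format. It relies on Iceberg's `write.format.default` table property. The table name `local.db.sales` and the helper name `set_write_format_sql` are assumptions made purely for illustration.

```python
# Sketch: build the SQL that sets the default write format of an Iceberg table.
SUPPORTED_FORMATS = {"parquet", "orc", "avro"}


def set_write_format_sql(table: str, file_format: str) -> str:
    """Return an ALTER TABLE statement that sets write.format.default."""
    fmt = file_format.lower()
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported file format: {file_format!r}")
    return (
        f"ALTER TABLE {table} "
        f"SET TBLPROPERTIES ('write.format.default' = '{fmt}')"
    )


# Hypothetical table name. With a live SparkSession you would run
# spark.sql(statement) to apply the change.
statement = set_write_format_sql("local.db.sales", "Avro")
print(statement)
```

Changing this property should only affect files written afterwards, since existing data files are tracked individually in the metadata.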
The data layer is backed by a distributed file system like HDFS or cloud object storage like AWS S3. This enables building data lakehouse architectures that benefit from these extremely scalable and low-cost storage systems.
Metadata Layer
This layer contains all of the metadata files for an Iceberg table. It has a tree structure that tracks the data files and metadata about them, along with the details of the operation that created them.
The files in this layer are immutable, so every time an insert, merge, upsert, or delete operation happens on the table, a new set of files is written.
This layer contains three file types:
Manifest Files
Manifest files keep track of files in the data layer, along with additional details and statistics about each file. They store all this information in Avro file format.
Manifest Lists
Manifest lists keep track of manifest files, including their location, the partitions they belong to, and the upper and lower bounds of the partition columns for the data they track. All this information is stored in Avro file format.
A manifest list file represents a snapshot of an Iceberg table: it contains the details of the snapshot, along with the ID of the snapshot that added each manifest file.
Metadata Files
Metadata files keep track of manifest lists. These files hold the metadata of the Iceberg table at a certain point in time, i.e. the table's schema, partition information, snapshots, and which snapshot is the current one. All this information is stored in a JSON file.
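As a rough illustration of how the metadata file leads to a manifest list, the sketch below looks up the current snapshot in a metadata document and returns its manifest list path. The field names (`current-snapshot-id`, `snapshots`, `snapshot-id`, `manifest-list`) follow the Iceberg table spec. The sample document is hand-written for this example, not real output, and its IDs and paths are made up.

```python
import json
from typing import Optional


def current_manifest_list(metadata: dict) -> Optional[str]:
    """Return the manifest list path of the current snapshot, if any."""
    current_id = metadata.get("current-snapshot-id")
    for snapshot in metadata.get("snapshots", []):
        if snapshot["snapshot-id"] == current_id:
            return snapshot["manifest-list"]
    return None  # no matching snapshot, e.g. a freshly created table


# Hand-written sample, NOT real output: IDs and paths are invented.
sample = json.loads("""
{
  "format-version": 2,
  "current-snapshot-id": 2,
  "snapshots": [
    {"snapshot-id": 1, "manifest-list": "metadata/snap-1.avro"},
    {"snapshot-id": 2, "manifest-list": "metadata/snap-2.avro"}
  ]
}
""")

print(current_manifest_list(sample))
```

From the manifest list, an engine would then read the manifest files, and from those the data files. That walk is the tree structure described above.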
Catalog Layer
Within the Catalog layer, there is a reference, or pointer, to the current metadata file for each table.
A catalog is an interface, and the only requirement for an Iceberg catalog is that it stores the current metadata pointer and provides atomic guarantees when updating it. Different backends can therefore serve as the Iceberg catalog, like Hadoop, AWS S3, Hive, AWS Glue Catalog, and more. These implementations store the current metadata pointer differently.
Alrighty, that's enough of the theoretical part. Let's dive deeper into what all these files are and when they are created under the hood, with an easy-to-follow example in Apache Spark.
Diving Deeper with an example
In this example, we will create a simple Iceberg table and insert some records into it to understand how the different architectural components are created and what each of them stores.
In addition, we will look at the Iceberg metadata tables, which can be queried to get metadata-related information instead of reading the metadata JSON or Avro files directly.
Defining Catalog
Let's start by configuring an Iceberg catalog and understanding how we can use different catalog implementations.
Understanding the Spark Configuration for an Iceberg Catalog
An Iceberg catalog is defined by setting spark.sql.catalog.<catalog-name> to org.apache.iceberg.spark.SparkCatalog or org.apache.iceberg.spark.SparkSessionCatalog.
Once the Iceberg catalog is defined, its type needs to be set. The type determines which catalog implementation (catalog-impl) and IO implementation (io-impl) will be used for reading and writing files. Some of the types supported by Iceberg are hadoop and hive.
To use a custom implementation of the Iceberg catalog, e.g. AWS Glue Catalog, the catalog-impl class needs to be specified. For example, set spark.sql.catalog.<catalog-name>.catalog-impl to org.apache.iceberg.aws.glue.GlueCatalog.
It's mandatory to define either a type or a catalog-impl for a catalog, as this determines how the current metadata pointer will be stored.
warehouse is a required catalog property that determines the root path of the catalog in storage. By default, the locations of all Iceberg tables created within this catalog use it as the root path.
warehouse is configured using spark.sql.catalog.<catalog-name>.warehouse.
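The configuration rules above can be sketched as a small helper that assembles the Spark properties for one catalog. This is only an illustration: the helper name, the catalog names `hive_cat` and `glue_cat`, and the warehouse paths are hypothetical. Requiring exactly one of `type` or `catalog-impl` is this sketch's way of enforcing the "either a type or a catalog-impl" rule.

```python
from typing import Dict, Optional


def iceberg_catalog_conf(
    name: str,
    warehouse: str,
    catalog_type: Optional[str] = None,
    catalog_impl: Optional[str] = None,
) -> Dict[str, str]:
    """Build the Spark properties that define one Iceberg catalog."""
    # Exactly one of the two must be given (the sketch's reading of the rule).
    if (catalog_type is None) == (catalog_impl is None):
        raise ValueError("set exactly one of catalog_type or catalog_impl")
    prefix = f"spark.sql.catalog.{name}"
    conf = {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.warehouse": warehouse,
    }
    if catalog_type is not None:
        conf[f"{prefix}.type"] = catalog_type
    else:
        conf[f"{prefix}.catalog-impl"] = catalog_impl
    return conf


# Hypothetical catalog names and warehouse locations.
hive_conf = iceberg_catalog_conf(
    "hive_cat", "hdfs:///warehouse", catalog_type="hive"
)
glue_conf = iceberg_catalog_conf(
    "glue_cat",
    "s3://example-bucket/warehouse",
    catalog_impl="org.apache.iceberg.aws.glue.GlueCatalog",
)
for key, value in {**hive_conf, **glue_conf}.items():
    print(f"{key}={value}")
```

Each resulting key/value pair can be passed to Spark, for example through `--conf` on `spark-submit` or `SparkSession.builder.config(key, value)`. A real Hive or Glue setup will typically need further properties (such as a metastore URI or AWS settings) that are left out here.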
Depending on which catalog implementation is being used, warehouse location can also be changed during runtime. More details on this can be seen here.
This is just scratching the surface of the Iceberg Catalog.
To keep the explanation simple, I will create a catalog named local that is of type hadoop.
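A minimal sketch of what such a session setup might look like in PySpark is shown below. The warehouse path `warehouse/local` is an assumption inferred from the metadata path used later in this post, and the app name is made up. It also assumes the Iceberg Spark runtime jar matching your Spark version is already on the classpath.

```python
# Spark properties for a Hadoop-type Iceberg catalog named "local".
LOCAL_CATALOG_CONF = {
    # Iceberg's SQL extensions for Spark.
    "spark.sql.extensions":
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "spark.sql.catalog.local": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.local.type": "hadoop",
    # Assumed warehouse root; adjust to your environment.
    "spark.sql.catalog.local.warehouse": "warehouse/local",
}


def build_spark_session(app_name: str = "iceberg-under-the-hood"):
    """Create a SparkSession with the local Iceberg catalog configured."""
    # Imported lazily so the configuration can be inspected without PySpark.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in LOCAL_CATALOG_CONF.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Calling `spark = build_spark_session()` would then give a session in which tables can be addressed as `local.<namespace>.<table>`.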
Creating an Iceberg Table
Let's create an Iceberg table called sales.
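The table definition could look roughly like the sketch below. Only the partition columns `year_id` and `month_id` and the table name `sales` come from this post. The namespace `db` is inferred from the metadata path used later, and the remaining columns (`order_id`, `amount`) are hypothetical placeholders.

```python
# Only year_id and month_id (the partition columns) come from the post;
# order_id and amount are hypothetical placeholder columns.
CREATE_SALES_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS local.db.sales (
    order_id BIGINT,
    amount   DECIMAL(10, 2),
    year_id  INT,
    month_id INT
)
USING iceberg
PARTITIONED BY (year_id, month_id)
"""


def create_sales_table(spark) -> None:
    """Run the DDL on an existing SparkSession supplied by the caller."""
    spark.sql(CREATE_SALES_TABLE_SQL)
```

`create_sales_table(spark)` expects a session that already has the `local` Iceberg catalog configured.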
On successful creation of the table, as we are using a catalog of type hadoop, it creates:
version-hint.text file: This file is used by engines to identify the latest metadata file version.
Metadata file called v1.metadata.json: As the table has just been created, this file stores the table schema details like the fields in the schema, partition columns, current-schema-id, properties, and more.
As this table has no data written into it yet, the current-snapshot-id is set to -1. Also, as there are no data files, no manifest lists or manifest files are created for the table.
All the metadata-related details can be seen in the metadata_log_entries metadata table.
As there is no data in the table yet, all the other details in this metadata table are NULL.
Loading data in sales Iceberg Table
Let's load some data into the sales table and see how the table metadata tree evolves on the addition of data files in the table.
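As a sketch of what the load could look like, the helper below builds an `INSERT INTO` statement from a few rows. The rows and the column order `(order_id, amount, year_id, month_id)` are hypothetical and assume a table with exactly those four columns. Only `year_id` and `month_id` come from this post. The values are written as trusted literals, so this is not meant for untrusted input.

```python
from typing import Iterable, Tuple

# (order_id, amount, year_id, month_id); amount is kept as a string so the
# decimal literal is written to the SQL exactly as given.
Row = Tuple[int, str, int, int]


def insert_sales_sql(rows: Iterable[Row], table: str = "local.db.sales") -> str:
    """Build an INSERT INTO statement for a handful of literal rows."""
    values = ",\n    ".join(
        f"({order_id}, {amount}, {year_id}, {month_id})"
        for order_id, amount, year_id, month_id in rows
    )
    return f"INSERT INTO {table} VALUES\n    {values}"


# Hypothetical sample rows spanning two partitions.
sample_rows = [
    (1, "120.50", 2023, 11),
    (2, "75.00", 2023, 12),
]
insert_sql = insert_sales_sql(sample_rows)
print(insert_sql)
```

With a live session, `spark.sql(insert_sql)` would run the write. Because the two rows fall into different `(year_id, month_id)` combinations, they would be expected to end up in separate data files.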
While writing data into the table, the query engine (in this case, Spark):
Gets the location of the current metadata file by reading the catalog's version-hint.text file (as the catalog type is hadoop). In this case, the content of version-hint.text is 1.
This gives the current metadata file version, so the engine reads the warehouse/local/db/sales/metadata/v1.metadata.json file. This lets Spark understand the table schema and the partitioning strategy of the table.
Spark first writes the records as Parquet data files based on the partitioning, in this case year_id and month_id.
After the data files are written, it writes a manifest file containing the details and statistics of those data files.
Next, the engine creates a manifest list to keep track of the manifest file. This file includes information such as the manifest file's path, the number of data files/rows added or deleted, and statistics about partitions.
Finally, a new metadata file, v2.metadata.json, is created with a new snapshot, and version-hint.text is updated to 2.
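The first step above, resolving the current metadata file from the version hint, can be sketched in plain Python. This only mimics the file layout described in this post and is not how Iceberg itself is implemented. The file name `version-hint.text` is, to the best of my knowledge, the name the Hadoop catalog uses, and the directory is a throwaway one created just for the demonstration.

```python
import tempfile
from pathlib import Path


def current_metadata_file(table_location: Path) -> Path:
    """Resolve the current metadata file of a Hadoop-catalog table."""
    metadata_dir = table_location / "metadata"
    # The hint file holds just the current version number, e.g. "2".
    version = int((metadata_dir / "version-hint.text").read_text().strip())
    return metadata_dir / f"v{version}.metadata.json"


# Simulate the layout in a temporary directory (no Spark or Iceberg involved).
with tempfile.TemporaryDirectory() as tmp:
    table = Path(tmp) / "db" / "sales"
    (table / "metadata").mkdir(parents=True)
    (table / "metadata" / "version-hint.text").write_text("2")
    resolved = current_metadata_file(table)
    print(resolved.name)
```

After the write described above, the hint contains 2, so the sketch resolves to `v2.metadata.json`.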
Let's look at how the architecture for this table evolved using Iceberg metadata tables.
Metadata Tables
Apache Iceberg metadata tables can be used to better understand your Iceberg tables and how they have evolved over time.
metadata_log_entries: Keeps track of the evolution of the table by logging the metadata files generated during table updates.
snapshots: Maintains metadata about every snapshot of a given table, each representing a consistent view of the dataset at a specific time. It mainly keeps the details of manifest list files, as these define the snapshots of an Iceberg table.
manifests: Details each of the table's current manifest files.
files: Shows the details of the current data files in the table. It keeps detailed information about each data file, from its location and format to its content and partitioning specifics.
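Each of these metadata tables can be queried like a regular table by appending its name to the table identifier. Here is a minimal sketch, assuming the table lives at `local.db.sales` (the namespace `db` is an assumption) and that a configured SparkSession is passed in. The helper names are made up for this example.

```python
# The four metadata tables discussed above.
METADATA_TABLES = ("metadata_log_entries", "snapshots", "manifests", "files")


def metadata_table_sql(table: str, metadata_table: str) -> str:
    """Build a query against one of the table's metadata tables."""
    if metadata_table not in METADATA_TABLES:
        raise ValueError(f"unknown metadata table: {metadata_table!r}")
    return f"SELECT * FROM {table}.{metadata_table}"


def show_metadata_tables(spark, table: str = "local.db.sales") -> None:
    """Print every metadata table using an existing SparkSession."""
    for name in METADATA_TABLES:
        print(f"--- {name} ---")
        spark.sql(metadata_table_sql(table, name)).show(truncate=False)


print(metadata_table_sql("local.db.sales", "snapshots"))
```

Running `show_metadata_tables(spark)` before and after loading data is a convenient way to see which metadata files, snapshots, manifests, and data files each operation added.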
That's it for this one folks!!