data engineering Shuffle-less Join, a.k.a Storage Partition Join in Apache Spark - Why, How and Where? A deep dive into shuffle-less joins (Storage Partitioned Joins) in Apache Spark and how they improve join performance when using V2 data sources.
optimization-techniques Enhancing Spark Job Performance with Multithreading A Spark job optimization technique that uses multithreading in PySpark to speed up queries that run independently of one another.
AWS EMRFS S3 Optimized Committer and Committer Protocol for Improving Spark Write Performance - Why and How? What the EMRFS S3 Optimized Committer and the EMRFS S3 Optimized Commit Protocol are, how to use them, and how to tell whether they are actually in effect for your Spark jobs to improve write performance.
data engineering Copy-on-Write or Merge-on-Read? What, When, and How? Optimizing row-level updates in Apache Iceberg tables: how the two approaches work, when to use each, how each affects the table's read and write speed, and how to identify which one is in use through the Iceberg metadata tables on AWS.
data engineering Apache Iceberg - Architecture Demystified A detailed explanation of Apache Iceberg Architecture and how it evolves when data is inserted into the table.
AWS How to Implement Write-Audit-Publish Pattern with Apache Iceberg on AWS using WAP id A detailed implementation of the Write-Audit-Publish (WAP) data quality pattern on AWS using the Apache Iceberg WAP ID, i.e. for Apache Iceberg < 1.2.0.
AWS How to Implement Write-Audit-Publish Pattern with Apache Iceberg on AWS using Branches A detailed implementation of the Write-Audit-Publish (WAP) data quality pattern on AWS using Apache Iceberg branches, i.e. for Apache Iceberg > 1.2.0. It also covers the gotchas of this pattern and of using Athena as the query engine.
data engineering PyDeequ - Testing Data Quality at Scale How to use PyDeequ to test your data quality on AWS.