S3 Buckets: Setting Up Queryable Iceberg Tables in AWS

aws

iceberg

Vengat

Posted 6 months, 1 week ago

If you’re storing large amounts of data in an Amazon S3 bucket, managing and finding specific files can become challenging. Amazon S3 offers a powerful feature to automatically capture and query metadata using Iceberg tables. In this article, we will see how you can set this up step by step for a bucket named comp-manufacturing-us.

Why Use Metadata in S3?

Metadata helps you store information about your files, like size, creation date, tags, and more. Querying metadata allows you to quickly find files or analyze their properties without manually searching through thousands of objects.

Step 1: Prerequisites

Before starting, make sure:

You have an AWS account with the necessary permissions.
The AWS CLI (Command Line Interface) is installed and configured.
You have an S3 bucket that holds your data, in this example called “comp-manufacturing-us”.

Step 2: Create a Table Bucket for Metadata

The table bucket stores the metadata tables in Iceberg format.

1. In your terminal, create a new table bucket for metadata storage. Table buckets are managed through the s3tables commands rather than aws s3:

aws s3tables create-table-bucket --name comp-manufacturing-tables --region us-east-1

2. Note the ARN (Amazon Resource Name) for this bucket. The create command prints it, and you can look it up again by listing your table buckets:

aws s3tables list-table-buckets --region us-east-1
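If you script this step, the ARN can be picked out of the JSON that `aws s3tables list-table-buckets` prints. The following is a minimal sketch, assuming the output contains a `tableBuckets` list whose entries carry `name` and `arn` fields; the sample output and the account ID 111122223333 are placeholders, so compare against what your own CLI prints.

```python
import json
from typing import Optional


def find_table_bucket_arn(list_output: str, name: str) -> Optional[str]:
    """Return the ARN of the table bucket called `name`, or None if it is absent."""
    buckets = json.loads(list_output).get("tableBuckets", [])
    for bucket in buckets:
        if bucket.get("name") == name:
            return bucket.get("arn")
    return None


# Hypothetical sample of the CLI output; the account ID is a placeholder.
sample_output = json.dumps({
    "tableBuckets": [
        {
            "name": "comp-manufacturing-tables",
            "arn": "arn:aws:s3tables:us-east-1:111122223333:bucket/comp-manufacturing-tables",
        }
    ]
})

print(find_table_bucket_arn(sample_output, "comp-manufacturing-tables"))
```

In practice you would pass the real CLI output (for example, captured with `subprocess.run`) instead of `sample_output`.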

Step 3: Link Metadata Tables to Your Data Bucket

Now we will link the metadata table bucket (comp-manufacturing-tables) to your main data bucket (comp-manufacturing-us).

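1. Create a configuration file called metadata-config.json. Use the table bucket ARN you noted in Step 2; table bucket ARNs include the region and account ID, and 111122223333 below is a placeholder for your own account ID:

{
  "S3TablesDestination": {
    "TableBucketArn": "arn:aws:s3tables:us-east-1:111122223333:bucket/comp-manufacturing-tables",
    "TableName": "comp_manufacturing_metadata"
  }
}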

2. Run the following command to apply this configuration:

aws s3api create-bucket-metadata-table-configuration \
  --bucket comp-manufacturing-us \
  --metadata-table-configuration file://metadata-config.json \
  --region us-east-1

3. Confirm the setup by checking the configuration:

aws s3api get-bucket-metadata-table-configuration \
  --bucket comp-manufacturing-us \
  --region us-east-1
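The configuration file and the confirmation check from this step can also be scripted. The sketch below builds the same JSON from its parts and condenses the response of `get-bucket-metadata-table-configuration` into one line. The helper names are hypothetical, the account ID is a placeholder, and the response keys (`GetBucketMetadataTableConfigurationResult`, `Status`, `Error`) are an assumption about the output shape that you should verify against your own CLI output.

```python
import json


def build_metadata_config(region: str, account_id: str,
                          table_bucket: str, table_name: str) -> dict:
    """Build the metadata table configuration for a given table bucket."""
    arn = f"arn:aws:s3tables:{region}:{account_id}:bucket/{table_bucket}"
    return {
        "S3TablesDestination": {
            "TableBucketArn": arn,
            "TableName": table_name,
        }
    }


def summarize_status(response: dict) -> str:
    """Condense a get-bucket-metadata-table-configuration response (assumed shape)."""
    result = response.get("GetBucketMetadataTableConfigurationResult", {})
    status = result.get("Status", "UNKNOWN")
    error = result.get("Error") or {}
    if error:
        return f"{status}: {error.get('ErrorCode')} {error.get('ErrorMessage')}"
    return status


# 111122223333 is a placeholder account ID; substitute your own.
config = build_metadata_config(
    "us-east-1", "111122223333",
    "comp-manufacturing-tables", "comp_manufacturing_metadata",
)
config_json = json.dumps(config, indent=2)
print(config_json)
```

Redirecting the printed JSON into metadata-config.json gives you the file used by the create command above.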

Step 4: Upload or Modify Data

Now, add or modify files in your comp-manufacturing-us bucket. The metadata for these objects will be automatically captured.

For example, upload a file:

aws s3 cp manufacturing_report.csv s3://comp-manufacturing-us/reports/manufacturing_report.csv

Step 5: Query Metadata with Amazon Athena

Amazon Athena allows you to query the metadata table to find information about the files in your bucket.

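1. Set up Athena for querying metadata: go to the Athena console in the AWS Management Console.

2. Locate the metadata table in Athena:

S3 creates and maintains the metadata table itself, so there is no need to define it with a CREATE TABLE statement (Athena also does not accept STORED AS ICEBERG on CREATE EXTERNAL TABLE). Once the table bucket is integrated with AWS analytics services, the table should appear under the aws_s3_metadata namespace of the s3tablescatalog/comp-manufacturing-tables catalog. Run this query to preview some of its columns:

SELECT key, size, last_modified_date, storage_class, encryption_status
FROM "s3tablescatalog/comp-manufacturing-tables"."aws_s3_metadata"."comp_manufacturing_metadata"
LIMIT 10;

3. Run queries to explore your metadata:

Find the 10 largest files in your bucket:

SELECT key, size
FROM "s3tablescatalog/comp-manufacturing-tables"."aws_s3_metadata"."comp_manufacturing_metadata"
ORDER BY size DESC
LIMIT 10;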
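The same query can be run from a script instead of the console. Below is a minimal sketch that submits a query through an Athena client, waits for it to finish, and returns the rows. It assumes a boto3 Athena client is passed in; `run_athena_query` is a hypothetical helper, the fully qualified table name assumes the metadata table is exposed under the `s3tablescatalog/comp-manufacturing-tables` catalog and `aws_s3_metadata` namespace, and the results bucket in the usage comment is a made-up name.

```python
import time

# Assumption: adjust the catalog and namespace to wherever your metadata table appears.
LARGEST_FILES_SQL = (
    "SELECT key, size "
    'FROM "s3tablescatalog/comp-manufacturing-tables"."aws_s3_metadata"."comp_manufacturing_metadata" '
    "ORDER BY size DESC "
    "LIMIT 10"
)


def run_athena_query(athena, sql, output_location, poll_seconds=1.0):
    """Run `sql` on Athena and return the first page of result rows as lists of strings."""
    start = athena.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = start["QueryExecutionId"]

    # Poll until the query reaches a terminal state.
    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            break
        if state in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"Query {query_id} ended in state {state}")
        time.sleep(poll_seconds)

    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    # For SELECT queries the first row holds the column headers, so skip it.
    return [[cell.get("VarCharValue") for cell in row["Data"]] for row in rows[1:]]


# Usage (requires boto3 and AWS credentials; the results bucket name is hypothetical):
#   import boto3
#   athena = boto3.client("athena", region_name="us-east-1")
#   for key, size in run_athena_query(
#           athena, LARGEST_FILES_SQL, "s3://comp-manufacturing-athena-results/"):
#       print(key, size)
```

Only the first page of results is read here, which is enough for a LIMIT 10 query; larger result sets would need pagination.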

Conclusion

By setting up queryable metadata in S3 with Iceberg tables, you can make managing and analyzing your files much easier. Whether you’re dealing with manufacturing data or any other large datasets, this solution helps you find and understand your data quickly.
