
Bidon Self Hosted: S3 Export Guide

Overview

This solution sets up a full Kafka-based event pipeline using Docker Compose, with automatic export of Kafka topics (such as ad-events) to AWS S3 using Kafka Connect and the Confluent S3 Sink Connector.

Key Components

  • Kafka & Zookeeper – Core messaging infrastructure.
  • Kafka Connect – Worker that runs the S3 Sink Connector.
  • S3 Sink Connector – Streams Kafka topic data into structured files in S3.
  • .env Support – Credentials and bucket details are injected via environment variables.
  • Time-based Partitioning – Data is organized in S3 by year, month, day, hour, and minute.

This makes it easy to stream, store, and later query Kafka events in cloud storage, with minimal configuration.

Prerequisites: Initial Setup

We assume you've already completed the initial system setup, including starting core services like Kafka, Zookeeper, PostgreSQL, Redis, and Bidon components.

👉 If not, follow the Server Setup Guide before continuing.


Required Configuration for S3 Export

The system includes a pre-configured Kafka Connect + S3 Sink Connector setup that automatically exports Kafka messages (e.g., from the ad-events topic) to your S3 bucket.

To make this work, you must define the following environment variables in a .env file in the project root:

# Required for S3 Sink Connector
AWS_REGION=your-region # e.g., eu-west-1
S3_BUCKET_NAME=your-bucket-name # Your S3 bucket name
AWS_ACCESS_KEY_ID=your-access-key
AWS_SECRET_ACCESS_KEY=your-secret-key

These values are substituted into connector-config.template.json at runtime to produce the final connector configuration.
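
For illustration, here is a minimal sketch of what that substitution could look like, assuming the template uses shell-style ${VAR} placeholders and that envsubst is available; the actual kafka-connect-setup script may use a different mechanism:

# Hypothetical rendering step: replace ${AWS_REGION}, ${S3_BUCKET_NAME}, etc.
# with the values exported in the environment (requires gettext's envsubst).
envsubst < docker/kafka-connect/connector-config.template.json > connector-config.json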

✅ The connector is created automatically by the kafka-connect-setup container when you run docker-compose up.
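
Once the stack is up, you can confirm that the connector was registered by querying the Kafka Connect REST API. A quick check, assuming Connect is exposed on localhost:8083 and that the connector name (shown here as s3-sink-connector) matches the one defined in your connector config:

# List all registered connectors (assumes the Kafka Connect REST API on localhost:8083)
curl -s http://localhost:8083/connectors

# Show the state of the S3 sink connector and its tasks
# (replace s3-sink-connector with the name from your connector config)
curl -s http://localhost:8083/connectors/s3-sink-connector/status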


Advanced Configuration & Scaling Tips

The default connector configuration is defined in:

📄 docker/kafka-connect/connector-config.template.json

Here’s a breakdown of important settings and how to customize them:

➤ Topic & Storage Location

"topics": "ad-events",
"topics.dir": "bidon",
  • topics: Kafka topics to export. Separate multiple with commas.
  • topics.dir: Folder prefix in S3 under which files are stored.

✅ To export multiple topics, adjust this:

"topics": "ad-events,notification-events"

➤ File Rotation & Partitioning

"flush.size": "200000",
"partition.duration.ms": "300000",
"rotate.interval.ms": "60000",
"partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
"path.format": "'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH/'minute'=mm",
"timestamp.extractor": "Wallclock"

These settings control when files are flushed and how data is partitioned in S3:

  • flush.size: Number of records written to a file before it is flushed (committed) to S3
  • rotate.interval.ms: Max interval (in ms) before rotating a file (e.g., 60s)
  • partition.duration.ms: Duration of each time-based partitioning window
  • path.format: Folder structure inside the bucket (see the example key below)
  • timestamp.extractor: Can be Wallclock, Record, or RecordField
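
With the defaults above, an exported object key combines topics.dir, the topic name, the time-based path, and the connector's usual <topic>+<partition>+<start offset> file naming. An illustrative key (date and offset are made up):

bidon/ad-events/year=2025/month=01/day=15/hour=09/minute=30/ad-events+0+0000001000.json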

💡 Tip: To reduce file sizes and export latency (useful for faster testing), lower these values:

"flush.size": "1000",
"rotate.interval.ms": "10000",
"partition.duration.ms": "60000"

➤ Scaling the Connector

"tasks.max": "1"

Increase this to let the connector process topic partitions in parallel:

"tasks.max": "3"

Make sure your Kafka topic has at least as many partitions as tasks (tasks beyond the partition count will sit idle), and keep in mind that more tasks means more concurrent writes to S3. The commands below show how to check and increase the partition count.
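
A sketch using the standard Kafka CLI, assuming the broker container is named kafka and is reachable at localhost:9092 inside the container (the container name, listener address, and script name, kafka-topics vs. kafka-topics.sh, depend on your image and docker-compose setup):

# Inspect the current partition count of the ad-events topic
docker exec kafka kafka-topics --bootstrap-server localhost:9092 --describe --topic ad-events

# Increase the partition count (partitions can only be increased, never decreased)
docker exec kafka kafka-topics --bootstrap-server localhost:9092 --alter --topic ad-events --partitions 3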


Summary

| Area | Default | How to Change |
| --- | --- | --- |
| Exported Topics | ad-events | Edit "topics" in the config |
| Output Format | JSON | Only json is supported for now |
| Flush & Rotate Settings | Every 200k records or 60s | Tune "flush.size", "rotate.interval.ms" |
| S3 Bucket Path Format | Time-based (UTC) | Change "path.format" or use a different partitioner |
| Number of Connector Tasks | 1 | Increase "tasks.max" |

For questions or customization help, feel free to reach out or contribute updates via PRs.