Bidon Self Hosted: S3 Export Guide
Overview
This solution sets up a full Kafka-based event pipeline using Docker Compose, with automatic export of Kafka topics (such as `ad-events`) to AWS S3 storage using Kafka Connect and the Confluent S3 Sink Connector.
Key Components
- Kafka & Zookeeper – Core messaging infrastructure.
- Kafka Connect – Worker that runs the S3 Sink Connector.
- S3 Sink Connector – Streams Kafka topic data into structured files in S3.
- .env Support – Credentials and bucket details are injected via environment variables.
- Time-based Partitioning – Data is organized in S3 by year, month, day, hour, and minute.
This makes it easy to stream, store, and later query Kafka events in cloud storage, with minimal configuration.
Prerequisites: Initial Setup
We assume you've already completed the initial system setup, including starting core services like Kafka, Zookeeper, PostgreSQL, Redis, and Bidon components.
👉 If not, follow the Server Setup Guide before continuing.
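As a quick sanity check before continuing, you can confirm the core services from the Server Setup Guide are up. A minimal sketch, assuming the stack is managed with Docker Compose as described there:

```bash
# List the project's services and confirm Kafka, Zookeeper, PostgreSQL,
# Redis, and the Bidon components are running
docker-compose ps
```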
Required Configuration for S3 Export
The system includes a pre-configured Kafka Connect + S3 Sink Connector setup that automatically exports Kafka messages (e.g., from the `ad-events` topic) to your S3 bucket.
To make this work, you must define the following environment variables in a `.env` file in the project root:
# Required for S3 Sink Connector
AWS_REGION=your-region # e.g., eu-west-1
S3_BUCKET_NAME=your-bucket-name # Your S3 bucket name
AWS_ACCESS_KEY_ID=your-access-key
AWS_SECRET_ACCESS_KEY=your-secret-key
These values are injected into the `connector-config.template.json` at runtime to configure the connector.
✅ The connector is created automatically by the `kafka-connect-setup` container when you run `docker-compose up`.
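To confirm the connector was registered, you can query the Kafka Connect REST API. This is a sketch assuming Kafka Connect is reachable on its default port 8083 on localhost; the connector name is a placeholder and should match the `name` field in `connector-config.template.json`:

```bash
# Placeholder: use the "name" field from connector-config.template.json
CONNECTOR_NAME=your-connector-name

# List all registered connectors (the S3 sink should appear here)
curl -s http://localhost:8083/connectors

# Check the connector's status and task state
curl -s "http://localhost:8083/connectors/${CONNECTOR_NAME}/status"
```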
Advanced Configuration & Scaling Tips
The default connector configuration is defined in:
📄 docker/kafka-connect/connector-config.template.json
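If you want to inspect the configuration the connector is actually running with (i.e., the template after environment-variable substitution), you can read it back from Kafka Connect. A sketch, again assuming the default REST port 8083 and a placeholder connector name:

```bash
# Placeholder: use the "name" field from connector-config.template.json
CONNECTOR_NAME=your-connector-name

# Show the live configuration (the template with env vars substituted)
curl -s "http://localhost:8083/connectors/${CONNECTOR_NAME}/config"
```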
Here’s a breakdown of important settings and how to customize them:
➤ Topic & Storage Location
"topics": "ad-events",
"topics.dir": "bidon",
- `topics`: Kafka topics to export. Separate multiple with commas.
- `topics.dir`: Folder prefix in S3 under which files are stored.
✅ To export multiple topics, adjust this:
"topics": "ad-events,notification-events"
➤ File Rotation & Partitioning
"flush.size": "200000",
"partition.duration.ms": "300000",
"rotate.interval.ms": "60000",
"partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
"path.format": "'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH/'minute'=mm",
"timestamp.extractor": "Wallclock"
These settings control when files are flushed and how data is partitioned in S3:
- `flush.size`: Number of records written before a file is flushed
- `rotate.interval.ms`: Max interval (in ms) before rotating a file (e.g., 60s)
- `partition.duration.ms`: Duration of the time-based partitioning window
- `path.format`: Folder structure inside the bucket
- `timestamp.extractor`: Can be `Wallclock`, `Record`, or `RecordField`
💡 Tip: To reduce file sizes and latency (useful for faster testing), you can lower these values:
"flush.size": "1000",
"rotate.interval.ms": "10000",
"partition.duration.ms": "60000"
➤ Scaling the Connector
"tasks.max": "1"
Increase this if you want the connector to process data in parallel (per partition or topic):
"tasks.max": "3"
Make sure your Kafka topic has enough partitions and that your S3 bucket can handle concurrent writes.
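To see how many partitions the topic actually has (and therefore how many tasks can usefully run in parallel), you can describe it from inside the Kafka container. A sketch, assuming the Kafka service is named `kafka`, the broker listens on `kafka:9092`, and the image ships the Confluent `kafka-topics` CLI (it may be `kafka-topics.sh` in other images):

```bash
# Show the partition count for ad-events; tasks.max beyond this number adds no extra parallelism
docker-compose exec kafka kafka-topics --bootstrap-server kafka:9092 --describe --topic ad-events
```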
Summary
| Area | Default | How to Change |
| --- | --- | --- |
| Exported Topics | `ad-events` | Edit `"topics"` in the config |
| Output Format | JSON | Only `json` is supported for now |
| Flush & Rotate Settings | Every 200k records or 60s | Tune `"flush.size"`, `"rotate.interval.ms"` |
| S3 Bucket Path Format | Time-based (UTC) | Change `"path.format"` or use a different partitioner |
| Number of Connector Tasks | 1 | Increase `"tasks.max"` |
For questions or customization help, feel free to reach out or contribute updates via PRs.