Table of Contents

Introduction

Modern enterprise data migrations are still surprisingly manual.

Even with cloud-native tooling, engineers often spend significant time writing DDL scripts, mapping schemas manually, coordinating ingestion pipelines, handling orchestration dependencies, and maintaining infrastructure synchronization across cloud providers.

As datasets grow larger and architectures become increasingly distributed, this operational complexity compounds quickly.

We wanted to explore a different approach:

What if AI could operate as an autonomous infrastructure engineer inside a real analytics pipeline?

To test this idea, we built an end-to-end cross-cloud analytics platform that migrates StackOverflow data from Google BigQuery into ClickHouse using AWS as the intermediary cloud layer.

The system combines:

Google BigQuery for source storage
AWS Glue for cross-cloud extraction
Amazon S3 for Parquet staging
Apache Airflow for orchestration
Anthropic Claude + MCP for autonomous infrastructure generation
ClickHouse Cloud for analytics storage
dbt for Medallion transformations

The most interesting part of the architecture is the infrastructure automation layer.

Instead of manually creating ClickHouse tables, queues, and materialized views, the pipeline dynamically asks Claude to inspect staged Parquet metadata and generate the required ClickHouse infrastructure automatically.

Claude then executes the generated DDL directly inside ClickHouse through an MCP-enabled workflow.

The final result is an autonomous analytics pipeline capable of:

migrating data across clouds,
creating ingestion infrastructure dynamically,
continuously ingesting Parquet files through S3Queue,
and exposing optimized analytical marts through dbt.

The pipeline processed over 23 million StackOverflow records using a fully orchestrated workflow.

Architecture Overview

The overall architecture spans multiple execution environments and cloud providers.

At a high level, the system performs the following steps:

Extract StackOverflow data from Google BigQuery
Stage compressed Parquet files inside Amazon S3
Trigger AI-driven infrastructure generation
Dynamically create ClickHouse ingestion components
Stream records continuously through S3Queue and Materialized Views
Transform the raw data into analytical marts using dbt

The complete workflow looks like this:

Google BigQuery (GCP)
↓
AWS Glue Extraction Job
↓
Amazon S3 (Parquet Staging)
↓
Apache Airflow Orchestration
↓
Claude + MCP Infrastructure Automation
↓
ClickHouse:
– MergeTree Table
– S3Queue Engine
– Materialized View
↓
dbt Medallion Models:
– Bronze
– Silver
– Gold
↓
Analytical Queries

Picture1blog

Why ClickHouse

The target analytics engine needed to support:

high-speed analytical aggregation,
scalable ingestion from object storage,
efficient columnar storage,
and real-time streaming behavior.

ClickHouse was selected because it combines high-performance OLAP execution with storage-engine optimizations specifically designed for analytical workloads at scale.

What differentiated ClickHouse for this project was not only raw query speed, but how its storage architecture minimizes scan overhead through sparse indexing, ORDER BY-driven data skipping, and columnar compression.

MergeTree Storage Engine and Sparse Primary Indexing

The project used the MergeTree engine as the primary analytical storage layer.

Unlike traditional OLTP databases, ClickHouse does not rely on row-level B-tree indexing for analytical workloads. Instead, MergeTree stores data in sorted parts and builds sparse primary indexes on the ORDER BY keys.

This allows ClickHouse to skip large portions of data during scans instead of reading every row.

As a result, table ordering strategies become extremely important because they directly influence data-skipping efficiency and analytical query performance.

This architecture was especially effective for workloads involving:

grouped aggregations,
time-series analytics,
rolling window calculations,
and large-scale analytical scans.

By physically organizing related data together on disk, ClickHouse minimized scan overhead and delivered sub-second analytical performance across millions of records.

Columnar Compression and Scan Efficiency

ClickHouse stores each column independently in compressed format.

For analytical workloads like StackOverflow analysis — where queries often access only a subset of columns — this significantly reduces disk I/O and memory overhead compared to row-oriented systems.

This architecture was especially effective for:

aggregation-heavy workloads,
wide analytical scans,
and time-series analytics.

S3Queue Streaming Ingestion

One of the most powerful features used in the project was the S3Queue engine.

Instead of building polling services or custom ingestion workers, ClickHouse continuously monitored Amazon S3 for new Parquet objects and streamed them automatically into MergeTree tables through Materialized Views.

This eliminated the need for additional stream-processing infrastructure while dramatically simplifying ingestion logic.

Materialized Views

Materialized Views connected the ingestion queue directly into the analytical landing tables.

As new Parquet files appeared inside S3, ClickHouse automatically pushed records into the destination MergeTree table.

This created a lightweight event-driven ingestion architecture without requiring Kafka, Spark Streaming, or external orchestration layers.

Cost-Performance Advantages

Another major reason for selecting ClickHouse was its strong cost-performance profile for analytical workloads.

To evaluate analytical execution efficiency, equivalent rolling-window analytical workloads were executed against the fct_tag_analysis mart in both ClickHouse and BigQuery using identical query logic and equivalent transformed datasets.

The benchmark workload included:

time-series aggregation
rolling-window calculations
analytical scans across approximately 19,209,554 transformed analytical rows

For this workload:

ClickHouse completed execution in approximately 156 ms with an estimated compute cost of approximately $0.000017
BigQuery completed execution in approximately 958 ms with an estimated compute cost of approximately $0.0009

The ClickHouse query scanned approximately 192 MB of analytical data, while BigQuery consumed approximately 36.48 slot-seconds during distributed execution.

The benchmark demonstrated ClickHouse’s strong performance characteristics for interactive OLAP workloads involving:

ordered analytical scans
window functions
time-series analytics
aggregation-heavy queries

Even at multi-million-row scale, ClickHouse consistently maintained highly responsive sub-second analytical query performance.

Cross-Cloud Extraction Pipeline

The source dataset originated inside Google BigQuery.

To move the data into AWS, the pipeline used AWS Glue as the extraction layer.

Airflow triggered a managed Glue job that:

Connected securely to BigQuery
Extracted the StackOverflow dataset
Wrote partitioned Parquet files into Amazon S3

The extracted data consisted of:

~23 million rows
76 Parquet files
each file averaging 230–260 MB

The final staging structure inside S3 looked like this:

Using Parquet as the staging format significantly improved downstream ingestion efficiency because ClickHouse can read Parquet natively.

Orchestrating the Workflow with Airflow

Apache Airflow coordinated the entire lifecycle of the platform.
The DAG contained six major stages:

Trigger AWS Glue extraction
Run AI-driven ClickHouse infrastructure automation
Wait for S3Queue ingestion to complete before downstream transformations
Execute dbt staging models
Execute dbt data quality tests
Execute dbt marts

The orchestration graph looked like this:

The pipeline was intentionally designed to be modular.

Each stage executes independently while still preserving deterministic execution order.

The DAG structure:

trigger_glue_migration >> automate_clickhouse_setup >> wait_for_ingest >> run_dbt_staging >> run_dbt_tests >> run_dbt_marts

This ensured that:

infrastructure only initializes after extraction completes,
dbt staging models only execute after ingestion is active,
data quality validation occurs before Gold marts are built,
and retries remain isolated to individual stages.

The DAG also included:

retry logic,
idempotent execution patterns,
modular task separation,
and transformation-layer validation gates.

Adding dbt tests directly into the orchestration layer helped establish a controlled reliability boundary between ingestion and downstream analytics.

AI-Driven Infrastructure Automation with Claude and MCP

The most experimental component of the platform was the infrastructure automation layer.

Instead of manually writing ClickHouse DDL statements, the pipeline dynamically delegated infrastructure generation to Claude.

Airflow executed a PythonOperator that:

Connected to Anthropic Claude
Passed a structured infrastructure prompt
Asked Claude to inspect Parquet metadata
Generated ClickHouse SQL dynamically
Executed the generated statements directly in ClickHouse

The orchestration logic looked like this:

response = client.messages.create(
model=”claude-opus-4-7″,
max_tokens=4000,
system=system_prompt,
messages=[{“role”: “user”, “content”: user_prompt}],
thinking={
“type”: “adaptive”
}
)

The prompt instructed Claude to:

inspect Parquet schema metadata,
infer ClickHouse data types,
create MergeTree tables,
initialize S3Queue engines,
create Materialized Views,
and verify ingestion flow.

The generated SQL was then executed programmatically:

statements = [s.strip() for s in sql_commands.split(‘;’) if s.strip()]

for statement in statements:
ch_client.command(statement)

This transformed Claude from a passive assistant into an active infrastructure automation layer.

Importantly, the system still maintained controlled execution boundaries.

Claude operated only within:

predefined prompts,
constrained execution targets,
and orchestrated Airflow tasks.

This approach avoided unrestricted autonomous execution while still removing repetitive infrastructure initialization work.

Improving AI-Generated Infrastructure with ClickHouse Agent Skills

One important observation during the project was that general-purpose LLMs can generate syntactically correct SQL while still missing workload-specific optimization patterns.

This became particularly visible during MergeTree schema generation and ORDER BY selection.

For example, the initial AI-generated schema proposed:

ORDER BY (id, creation_date)

While technically valid, this ordering strategy was not ideal for the analytical workload because downstream queries primarily filtered and aggregated by tags and time-series patterns rather than unique post IDs.

After reviewing the query access patterns, the schema was refined to:

ORDER BY (tag, creation_date)

which aligned significantly better with ClickHouse sparse primary indexing and data-skipping behavior.

This challenge closely aligns with the motivation behind ClickHouse Agent Skills, an open-source collection of ClickHouse-specific best practices designed for AI coding agents and LLM-driven infrastructure generation.

ClickHouse Agent Skills provide workload-aware guidance for:

MergeTree schema design
ORDER BY selection
partitioning strategies
ingestion patterns
materialized view design
query optimization
sparse indexing and data-skipping strategies

These skills help AI agents move beyond generic SQL generation and produce infrastructure decisions that better reflect real ClickHouse analytical workloads.

In future iterations, integrating ClickHouse Agent Skills directly into the MCP orchestration layer could further improve autonomous schema generation by making workload-aware optimization part of the infrastructure automation process itself.

Autonomous ClickHouse Infrastructure Creation

The generated infrastructure included:

MergeTree Target Table

Claude generated the final analytical landing table using MergeTree.

The schema was inferred dynamically from the Parquet metadata.

S3Queue Engine

The pipeline then created an S3Queue ingestion layer pointing directly to the staged Parquet files.

Materialized View

A Materialized View connected the S3Queue engine to the MergeTree destination table.

This established continuous ingestion automatically.

The resulting ClickHouse structure looked like this:

This included:

stackoverflow_queue
stackoverflow_mv
stackoverflow_final
stg_posts
fct_tag_analysis

The infrastructure was generated dynamically without manually writing static DDL files.

Medallion Modeling with dbt

Once ingestion completed, dbt transformed the raw records into structured analytical marts.

The project followed a Medallion-style architecture:

Bronze Layer

The Bronze layer stored raw ingested StackOverflow records exactly as they arrived from S3.

Silver Layer

The Silver layer normalized and cleaned the records.

One major transformation involved parsing raw StackOverflow tags.

Original records looked like:

The transformation layer converted these strings into analytical structures suitable for aggregation and filtering.

The main staging model was:

stg_posts

Gold Layer

The Gold layer created analytical marts optimized for query workloads.

The final mart:

fct_tag_analysis

aggregated:

technology adoption,
tag popularity,
post volume,
scoring trends,
and temporal analytics.

The dbt lineage graph looked like this:

This provided a clean transformation dependency structure:

Raw Source → stg_posts → fct_tag_analysis

Data Quality Validation

To improve transformation reliability, dbt tests were added directly into the orchestration workflow.

The stg_posts model validates that:

post_id is unique
post_id is non-null

This helps detect:

duplicate ingestion,
malformed transformations,
and incomplete records

before invalid data propagates into downstream analytical marts.

The validation layer establishes an important reliability boundary between ingestion and analytics.

The validation task was also integrated directly into the Airflow DAG:

run_dbt_tests = BashOperator(
task_id=’run_dbt_tests’,
bash_command=f’cd {DBT_PROJECT_DIR} && dbt test –select staging’,
)

This ensured that downstream marts only execute after transformation-level validation succeeds.

The transformation dependency structure remained:

Raw Source → stg_posts → fct_tag_analysis

Security and Credential Design

One of the design goals of the platform was reducing static credential exposure.

Instead of embedding cloud credentials directly into ingestion logic, the platform used AWS IAM Role ARNs for controlled access.

This enabled ClickHouse to access S3 objects securely without exposing long-lived access keys.

Benefits included:

reduced credential leakage risk,
simplified rotation management,
and alignment with zero-trust infrastructure patterns.

The S3 access layer was configured using inline IAM role authorization.

Analytical Query Workloads

After the Gold marts were generated, ClickHouse handled several analytical query patterns efficiently.

Top Technology Tags

One of the queries aggregated the most active StackOverflow technology tags.

Despite scanning over 1.5 million transformed rows, the query completed in:

0.169 seconds

Time-Series Aggregation

The pipeline also supported time-based analytical workloads.

Rolling Window Analysis

Window functions were used to compute rolling post activity trends.

The query used:

ROWS BETWEEN 6 PRECEDING AND CURRENT ROW

to calculate rolling 7-day post metrics.

Trend Detection Queries

The Gold layer also enabled trend-detection analysis using interval-based comparisons.

These workloads demonstrated ClickHouse’s efficiency for:

OLAP aggregation,
window functions,
time-series analytics,
and large-scale grouped calculations.

Lessons Learned

Building the platform exposed several interesting engineering challenges.

AI Infrastructure Automation Requires Strong Constraints

Allowing AI to generate infrastructure dynamically can be powerful, but prompt boundaries are critical.

The execution layer worked reliably because:

prompts were narrowly scoped,
outputs were constrained to SQL,
and execution targets were predefined.

AI-Generated Schemas Still Require Workload Awareness

Claude successfully generated the initial ClickHouse ingestion infrastructure dynamically from Parquet metadata.

However, optimizing analytical performance still required engineering validation of workload patterns.

The experience highlighted an important lesson:

AI can dramatically accelerate infrastructure initialization, but workload-aware optimization decisions still benefit from human engineering review.

S3Queue Simplified Streaming Significantly

The S3Queue engine eliminated a large amount of operational complexity.

Without it, the architecture would likely require:

polling jobs,
stream-processing infrastructure,
or custom ingestion services.

dbt Added Important Transformation Discipline

The lineage tracking and modular transformation structure made the platform significantly easier to maintain.

The separation between:

raw ingestion,
normalized transformations,
and analytical marts

kept the project organized even as query complexity increased.

Future Improvements

Several areas could further extend the platform.

Schema Evolution Handling

Future versions could automatically detect upstream schema drift and evolve ClickHouse tables dynamically.

Real-Time CDC

The current architecture processes staged Parquet files.

A future iteration could integrate real-time CDC streams.

Kubernetes Deployment

The orchestration stack could be containerized further for scalable distributed execution.

OpenTelemetry Integration

Adding tracing and telemetry would improve operational visibility across orchestration and ingestion layers.

Conclusion

This project explored what happens when AI is integrated directly into infrastructure orchestration instead of being limited to assistant-style interactions.

By combining:

Airflow,
AWS Glue,
Amazon S3,
Claude,
MCP,
ClickHouse,
and dbt,

it became possible to build a fully orchestrated analytics platform capable of:

cross-cloud migration,
autonomous infrastructure generation,
continuous ingestion,
and high-performance analytical querying.

The final platform successfully processed more than 23 million StackOverflow records while dynamically generating ingestion infrastructure during execution.

Most importantly, the project demonstrated that AI can operate effectively as a controlled infrastructure automation layer inside modern data engineering workflows.

Not by replacing engineers,

but by automating repetitive operational initialization tasks inside well-defined execution boundaries.

Repository

GitHub Repository:

https://github.com/AniketBShinde/stackoverflow-migration

Building an Autonomous Cross-Cloud Migration and Analytics Platform with Clickhouse