How It Works
Core thesis
Extraction is not a static task — it is a closed-loop control problem.
Most extraction tools treat parallelism as a config value: set workers=8, run, tune by hand, run again. ixtract treats it as a control loop: it observes every run, learns from it, and converges toward an optimal worker count without manual intervention.
The four components
Profiler
Before planning, ixtract profiles the source table:
- Estimated row count
- Primary key type, range, and distribution (coefficient of variation)
- Source latency (p50 round-trip)
- Skew detection: if CV > 1.0, work-stealing is flagged
The profiler runs a lightweight query against the source — no full table scan.
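As an illustration, the skew check above can be sketched as follows. Treating CV as stdev/mean over sampled key gaps is an assumption here, and `profile_skew` is a hypothetical helper, not ixtract's API:

```python
import statistics

def profile_skew(key_gaps: list[float], cv_threshold: float = 1.0) -> dict:
    """Flag work-stealing when the coefficient of variation (stdev / mean)
    of the sampled key-gap distribution exceeds the threshold."""
    mean = statistics.fmean(key_gaps)
    cv = statistics.pstdev(key_gaps) / mean if mean else float("inf")
    return {"cv": cv, "work_stealing": cv > cv_threshold}
```

A uniform gap distribution has CV near 0 and stays on plain range chunking; a heavy-tailed one crosses 1.0 and flags work-stealing.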
Planner
Takes the source profile, run history, and optional RuntimeContext, and produces an ExecutionPlan:
- Worker count (from the controller, benchmarker, or profiler, in that priority order)
- Chunk count and boundaries (range chunking by default; density-aware planned for Phase 5)
- Chunking strategy
- Verdict: SAFE TO RUN | SAFE WITH WARNINGS | NOT RECOMMENDED
The plan is deterministic: same inputs produce the same plan, every time.
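A minimal sketch of deterministic range chunking, assuming near-equal-width, half-open ranges over the primary-key span (`range_chunks` is illustrative, not ixtract's actual planner):

```python
def range_chunks(pk_min: int, pk_max: int, chunk_count: int) -> list[tuple[int, int]]:
    """Split [pk_min, pk_max] into contiguous half-open chunks [lo, hi).
    A pure function of its inputs, so the same profile always yields
    the same boundaries -- the determinism property described above."""
    span = pk_max - pk_min + 1
    base, extra = divmod(span, chunk_count)
    chunks, lo = [], pk_min
    for i in range(chunk_count):
        width = base + (1 if i < extra else 0)  # spread the remainder
        chunks.append((lo, lo + width))
        lo += width
    return chunks
```

Density-aware chunking (planned for Phase 5) would vary the widths by estimated row density instead of splitting the key span evenly.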
Controller
The statistical window-based optimizer. After each run it updates a rolling window (default: 5 runs) and decides whether to adjust worker count.
Behavior:
- Reduces workers when it detects sustained throughput degradation (≥15% drop over 3 consecutive runs, total ≥20%)
- Does not discover that fewer workers would be better when the current count appears stable — that requires the benchmarker
- Uses direction-aware deviation: only flags degradation, not noise in the upward direction
- Escape mode: severe misconfiguration (3 consecutive drops ≥15%, total ≥20%) triggers a hard cut
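The escape-mode condition above can be sketched as a pure function over the rolling window; `should_cut_workers` is a hypothetical name and the window shape is assumed:

```python
def should_cut_workers(throughputs: list[float],
                       drop_per_run: float = 0.15,
                       total_drop: float = 0.20) -> bool:
    """Hard-cut check: three consecutive run-over-run drops of at least
    15%, and at least 20% total, over the last four runs in the window
    (oldest first). Upward noise never triggers it: direction-aware."""
    if len(throughputs) < 4:
        return False
    last4 = throughputs[-4:]
    consecutive = all(
        last4[i + 1] <= last4[i] * (1 - drop_per_run) for i in range(3)
    )
    total = last4[-1] <= last4[0] * (1 - total_drop)
    return consecutive and total
```

Note the asymmetry: a stable-but-overprovisioned worker count never trips this check, which is why discovering that *fewer* workers would be better is left to the benchmarker.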
Estimator
Blends two sources of throughput estimates:
- EWMA — exponentially weighted moving average of historical runs for this table
- Context similarity — weighted 5-dimension similarity search over past runs (source load, network quality, time of day, etc.)
The blend is continuous (not a hard switch). Time decay: weight = score × exp(-λ × age_days), λ=0.05, half-life ~14 days. Recent runs matter more.
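A sketch of the continuous blend, assuming the similarity estimate is a decay-weighted mean and an illustrative mixing rule that trusts similarity more as total evidence grows (the real blend rule may differ):

```python
import math

def blended_estimate(ewma: float,
                     similar_runs: list[tuple[float, float, float]],
                     lam: float = 0.05) -> float:
    """Each past run contributes (throughput, similarity_score, age_days);
    its weight is score * exp(-lam * age_days), the decay from the text
    (half-life ~14 days at lam=0.05)."""
    total_w = sum(s * math.exp(-lam * a) for _, s, a in similar_runs)
    if total_w == 0:
        return ewma  # no comparable history: fall back to EWMA alone
    sim = sum(t * s * math.exp(-lam * a) for t, s, a in similar_runs) / total_w
    # Assumed mixing rule: more accumulated evidence -> lean on similarity.
    alpha = total_w / (total_w + 1.0)
    return alpha * sim + (1 - alpha) * ewma
```

Because alpha varies smoothly with the evidence weight, the estimate slides between the two sources rather than switching hard, matching the "continuous blend" behavior described above.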
Epistemic boundary
ixtract maintains a hard distinction between two types of runtime information:
| Concept | What it is | Stored | Feeds learning |
|---|---|---|---|
| ExecutionContext | System-measured reality | Yes | Yes |
| RuntimeContext | User-declared beliefs | Yes | No |
ExecutionContext is what ixtract measures: actual throughput, chunk durations, worker efficiency, source latency. This feeds the controller and estimator.
RuntimeContext is what you tell ixtract: “the source is under heavy load right now” or “I want to use at most 3 workers.” This constrains the plan but never pollutes the learned baseline.
A run with RuntimeContext constraints is excluded from controller learning. It won’t bias future unconstrained runs.
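This exclusion rule can be sketched as a simple filter over run history; the `RunRecord` shape is assumed for illustration:

```python
from dataclasses import dataclass

@dataclass
class RunRecord:
    table: str
    throughput: float
    constrained: bool  # True when the run carried RuntimeContext constraints

def learning_window(history: list[RunRecord], table: str, size: int = 5) -> list[RunRecord]:
    """The controller's rolling window keeps only unconstrained runs for the
    table, so user-declared beliefs never bias the learned baseline."""
    eligible = [r for r in history if r.table == table and not r.constrained]
    return eligible[-size:]
```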
Execution engine
The engine dispatches chunks to workers using Longest Processing Time (LPT) scheduling: the chunk estimated to take longest goes to the first available worker. For skewed tables, this keeps total elapsed time close to the minimum.
Work-stealing: when CV > 1.0 is detected by the profiler, the engine sorts the dispatch queue in descending estimated size order. This ensures slow chunks don’t become stragglers.
Adaptive backoff: if a SOURCE_LATENCY_SPIKE is detected mid-run (source round-trip increases >3× baseline), the engine backs off with exponential sleep to avoid overwhelming the source.
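The LPT dispatch described above can be sketched with a heap of worker free-times; `lpt_assign` is illustrative, not the engine's actual scheduler:

```python
import heapq

def lpt_assign(chunk_est: list[float], workers: int) -> tuple[list[list[int]], float]:
    """Sort chunks by descending estimated duration, then always hand the
    next chunk to the worker that frees up first. Returns per-worker chunk
    indices and the estimated makespan (longest worker finish time)."""
    order = sorted(range(len(chunk_est)), key=lambda i: -chunk_est[i])
    free_at = [(0.0, w) for w in range(workers)]  # (busy-until, worker id)
    heapq.heapify(free_at)
    assignment = [[] for _ in range(workers)]
    for i in order:
        busy, w = heapq.heappop(free_at)
        assignment[w].append(i)
        heapq.heappush(free_at, (busy + chunk_est[i], w))
    makespan = max(t for t, _ in free_at)
    return assignment, makespan
```

Dispatching large chunks first is what prevents stragglers: a big chunk picked up last would extend the run past every other worker's finish time.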
State store
SQLite, stored locally at ~/.ixtract/state.db. Tracks:
- All runs (metadata, metrics, plan fingerprint)
- Chunk-level results
- Controller state per table
- Plan persistence for replay
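For illustration, a minimal sketch of such a store using Python's built-in sqlite3; the table and column names here are assumptions, not ixtract's actual schema:

```python
import sqlite3

def open_state_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a local state database with one table per
    tracked concern: runs, chunk results, and controller state."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS runs (
            run_id TEXT PRIMARY KEY,
            table_name TEXT,
            plan_fingerprint TEXT,   -- ties the run to a replayable plan
            rows_per_sec REAL,
            constrained INTEGER      -- RuntimeContext runs excluded from learning
        );
        CREATE TABLE IF NOT EXISTS chunks (
            run_id TEXT,
            chunk_index INTEGER,
            duration_sec REAL,
            PRIMARY KEY (run_id, chunk_index)
        );
        CREATE TABLE IF NOT EXISTS controller_state (
            table_name TEXT PRIMARY KEY,
            worker_count INTEGER
        );
    """)
    return conn
```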
Plan fingerprint
Every plan is serialized to canonical JSON (sorted keys, no whitespace, floats rounded to 6 decimals), then SHA-256 hashed. This fingerprint is stored with each run and used to verify replay integrity.
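The fingerprinting scheme can be sketched directly from that description; the rounding and canonicalization follow the text, while the helper name is hypothetical:

```python
import hashlib
import json

def plan_fingerprint(plan: dict) -> str:
    """Canonicalize a plan (floats rounded to 6 decimals), serialize to
    JSON with sorted keys and no whitespace, and SHA-256 the result."""
    def canon(v):
        if isinstance(v, float):
            return round(v, 6)
        if isinstance(v, dict):
            return {k: canon(x) for k, x in v.items()}
        if isinstance(v, list):
            return [canon(x) for x in v]
        return v
    payload = json.dumps(canon(plan), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode()).hexdigest()
```

Key ordering and sub-rounding-error float noise never change the hash, so two runs of the planner with the same inputs produce the same fingerprint, which is what makes replay verification possible.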
Data flow
```
ExtractionIntent
  ↓
Profiler → SourceProfile
  ↓
Estimator → ThroughputEstimate
  ↓
RuntimeContext (optional constraints)
  ↓
Planner → ExecutionPlan (fingerprinted)
  ↓
Engine → chunks dispatched (LPT)
  ↓
Writer → Parquet / CSV / S3 / GCS
  ↓
Manifest → _manifest.json
  ↓
Controller ← ExecutionContext (metrics fed back)
```