
How It Works

Extraction is not a static task — it is a closed-loop control problem.

Most extraction tools treat parallelism as a config value. You set workers=8, run, tune manually, run again. ixtract solves this differently: it observes every run, learns from it, and converges toward optimal without intervention.

Before planning, ixtract profiles the source table:

  • Estimated row count
  • Primary key type, range, and distribution (coefficient of variation)
  • Source latency (p50 round-trip)
  • Skew detection: if CV > 1.0, work-stealing is flagged

The profiler runs a lightweight query against the source — no full table scan.
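Skew detection boils down to the coefficient of variation over sampled chunk or key-gap sizes. A minimal sketch of the CV check (function names are hypothetical, not ixtract's API):

```python
import math

def coefficient_of_variation(samples):
    """CV = stddev / mean of the sampled sizes."""
    mean = sum(samples) / len(samples)
    variance = sum((s - mean) ** 2 for s in samples) / len(samples)
    return math.sqrt(variance) / mean if mean else 0.0

def needs_work_stealing(samples, threshold=1.0):
    # CV > 1.0 indicates heavy skew: flag work-stealing.
    return coefficient_of_variation(samples) > threshold
```

A uniform table gives CV ≈ 0 and stays on plain range chunking; one oversized region pushes CV past 1.0 and flips the flag.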

The planner takes the source profile, run history, and optional RuntimeContext, and produces an ExecutionPlan:

  • Worker count (from controller, benchmarker, or profiler — in that priority order)
  • Chunk count and boundaries
  • Chunking strategy (range chunking by default; density-aware chunking planned for Phase 5)
  • Verdict: SAFE TO RUN | SAFE WITH WARNINGS | NOT RECOMMENDED

The plan is deterministic: same inputs produce the same plan, every time.
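Range chunking itself is simple to illustrate: split the primary-key range into contiguous, near-equal slices. A deterministic toy version (names hypothetical):

```python
def range_chunks(min_pk, max_pk, n_chunks):
    """Split [min_pk, max_pk] into n contiguous half-open ranges."""
    span = max_pk - min_pk + 1
    base, rem = divmod(span, n_chunks)
    bounds = []
    start = min_pk
    for i in range(n_chunks):
        size = base + (1 if i < rem else 0)  # spread the remainder evenly
        bounds.append((start, start + size))  # [start, end)
        start += size
    return bounds
```

Because the function is pure, the same profile always yields the same boundaries, which is what makes the plan (and its fingerprint) reproducible.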

The controller is a statistical window-based optimizer. After each run it updates a rolling window (default: 5 runs) and decides whether to adjust the worker count.

Behavior:

  • Reduces workers when it detects sustained throughput degradation
  • Does not discover that fewer workers would be better when the current count appears stable; that requires the benchmarker
  • Uses direction-aware deviation: it flags only downward degradation, not upward noise
  • Escape mode: severe misconfiguration (3 consecutive drops of ≥15%, with a total drop of ≥20%) triggers a hard cut
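One plausible reading of the escape-mode trigger, sketched in Python (the exact windowing ixtract uses may differ):

```python
def sustained_degradation(history, baseline, per_run=0.15, total=0.20, runs=3):
    """True when the last `runs` throughput samples each sit at least
    `per_run` below baseline and the newest sits at least `total` below."""
    if len(history) < runs:
        return False
    recent = history[-runs:]
    drops = [(baseline - t) / baseline for t in recent]
    return all(d >= per_run for d in drops) and drops[-1] >= total
```

With a baseline of 100 rows/s, three consecutive runs at 80, 78, and 75 would trip the hard cut; a single noisy dip would not.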

The estimator blends two sources of throughput estimates:

  1. EWMA — exponentially weighted moving average of historical runs for this table
  2. Context similarity — weighted 5-dimension similarity search over past runs (source load, network quality, time of day, etc.)

The blend is continuous (not a hard switch). Time decay: weight = score × exp(-λ × age_days), λ=0.05, half-life ~14 days. Recent runs matter more.
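The decay formula comes straight from the text; the final blending rule below (`alpha`) is a hypothetical stand-in, since the document only specifies that the blend is continuous:

```python
import math

DECAY_LAMBDA = 0.05  # half-life = ln(2)/0.05 ≈ 13.9 days

def decayed_weight(similarity_score, age_days, lam=DECAY_LAMBDA):
    """weight = score × exp(-λ × age_days): recent runs matter more."""
    return similarity_score * math.exp(-lam * age_days)

def blended_estimate(ewma, similar_runs):
    """Blend EWMA with similarity-weighted past runs.
    similar_runs: list of (throughput, similarity_score, age_days)."""
    weights = [decayed_weight(s, a) for _, s, a in similar_runs]
    total = sum(weights)
    if total == 0:
        return ewma  # no similar history: fall back to EWMA alone
    similar = sum(t * w for (t, _, _), w in zip(similar_runs, weights)) / total
    # Hypothetical rule: trust similarity in proportion to total evidence.
    alpha = total / (total + 1.0)
    return alpha * similar + (1 - alpha) * ewma
```

A run 14 days old keeps roughly half its similarity weight, so the estimate drifts smoothly toward recent conditions rather than switching hard.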

ixtract maintains a hard distinction between two types of runtime information:

  Concept            What it is                Stored   Feeds learning
  ExecutionContext   System-measured reality   Yes      Yes
  RuntimeContext     User-declared beliefs     Yes      No

ExecutionContext is what ixtract measures: actual throughput, chunk durations, worker efficiency, source latency. This feeds the controller and estimator.

RuntimeContext is what you tell ixtract: “the source is under heavy load right now” or “I want to use at most 3 workers.” This constrains the plan but never pollutes the learned baseline.

A run with RuntimeContext constraints is excluded from controller learning. It won’t bias future unconstrained runs.
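The exclusion rule amounts to a filter over stored runs; a minimal sketch (the `runtime_context` field name is illustrative):

```python
def learning_runs(runs):
    """Controller learning uses only unconstrained runs: any run that
    carried user-declared RuntimeContext constraints is excluded so it
    cannot bias the learned baseline."""
    return [r for r in runs if not r.get("runtime_context")]
```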

The engine dispatches chunks to workers using Longest Processing Time (LPT) scheduling — the chunk estimated to take longest goes to the first available worker. For skewed tables, this minimizes total elapsed time.
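Classic LPT assigns chunks in descending estimated duration, each to the worker that frees up first. A self-contained sketch (not ixtract's engine, just the scheduling rule):

```python
import heapq

def lpt_schedule(chunk_estimates, n_workers):
    """Longest Processing Time: dispatch chunks in descending estimated
    duration, each to the least-loaded worker."""
    # Min-heap of (accumulated_load, worker_id): worker that frees first on top.
    workers = [(0.0, w) for w in range(n_workers)]
    heapq.heapify(workers)
    assignment = {w: [] for w in range(n_workers)}
    for i, est in sorted(enumerate(chunk_estimates), key=lambda x: -x[1]):
        load, w = heapq.heappop(workers)
        assignment[w].append(i)
        heapq.heappush(workers, (load + est, w))
    return assignment
```

For estimates [10, 2, 3, 7] on two workers, the big chunk starts first and the small ones fill the other worker, keeping total elapsed time near the theoretical minimum.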

Work-stealing: when CV > 1.0 is detected by the profiler, the engine sorts the dispatch queue in descending estimated size order. This ensures slow chunks don’t become stragglers.

Adaptive backoff: if a SOURCE_LATENCY_SPIKE is detected mid-run (source round-trip increases >3× baseline), the engine backs off with exponential sleep to avoid overwhelming the source.
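The 3× spike threshold is from the text; the sleep base and cap below are illustrative defaults:

```python
import time

def backoff_if_spiking(latency_ms, baseline_ms, attempt,
                       spike_factor=3.0, base_sleep=0.5, cap=30.0):
    """Sleep exponentially longer per attempt while the source round-trip
    exceeds spike_factor × baseline; return the seconds slept."""
    if latency_ms <= spike_factor * baseline_ms:
        return 0.0                        # no spike, no sleep
    sleep_s = min(base_sleep * (2 ** attempt), cap)
    time.sleep(sleep_s)
    return sleep_s
```

The cap keeps a long-lived spike from stalling the run indefinitely while still easing pressure on the source.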

State lives in SQLite, stored locally at ~/.ixtract/state.db. It tracks:

  • All runs (metadata, metrics, plan fingerprint)
  • Chunk-level results
  • Controller state per table
  • Plan persistence for replay

Every plan is serialized to canonical JSON (sorted keys, no whitespace, floats rounded to 6 decimals), then SHA-256 hashed. This fingerprint is stored with each run and used to verify replay integrity.
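The canonicalization rules above map directly onto the standard library; the `round_floats` helper name is ours, but the steps follow the text:

```python
import hashlib
import json

def round_floats(obj, ndigits=6):
    """Recursively round floats so numerically-equal plans hash equally."""
    if isinstance(obj, float):
        return round(obj, ndigits)
    if isinstance(obj, dict):
        return {k: round_floats(v, ndigits) for k, v in obj.items()}
    if isinstance(obj, list):
        return [round_floats(v, ndigits) for v in obj]
    return obj

def plan_fingerprint(plan: dict) -> str:
    """Canonical JSON (sorted keys, no whitespace, floats rounded to
    6 decimals), then SHA-256."""
    canonical = json.dumps(round_floats(plan), sort_keys=True,
                           separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Sorting keys and stripping whitespace make the hash independent of dict ordering and formatting, so a replayed plan can be verified byte-for-byte.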

ExtractionIntent
  ↓
Profiler → SourceProfile
  ↓
Estimator → ThroughputEstimate
  ↓
RuntimeContext (optional constraints)
  ↓
Planner → ExecutionPlan (fingerprinted)
  ↓
Engine → chunks dispatched (LPT)
  ↓
Writer → Parquet / CSV / S3 / GCS
  ↓
Manifest → _manifest.json
  ↓
Controller ← ExecutionContext (metrics fed back)