How It Works
Core thesis
Extraction is not a static task — it is a closed-loop control problem.
Most extraction tools treat parallelism as a config value: set workers=8, run, tune by hand, run again. ixtract treats it as a control loop: it observes every run, learns from it, and converges toward an optimal worker count without manual intervention.
The four components
Profiler
Before planning, ixtract profiles the source table:
- Estimated row count
- Primary key type, range, and distribution (coefficient of variation)
- Source latency (p50 round-trip)
- Skew detection: if CV > 1.0, work-stealing is flagged
The profiler runs a lightweight query against the source — no full table scan.
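As an illustration, the skew check above can be sketched as follows. Treating CV as stdev/mean over sampled key gaps is an assumption here, and `profile_skew` is a hypothetical helper, not ixtract's API:

```python
import statistics

def profile_skew(key_gaps: list[float], cv_threshold: float = 1.0) -> dict:
    """Flag work-stealing when the coefficient of variation (stdev / mean)
    of the sampled key-gap distribution exceeds the threshold."""
    mean = statistics.fmean(key_gaps)
    cv = statistics.pstdev(key_gaps) / mean if mean else float("inf")
    return {"cv": cv, "work_stealing": cv > cv_threshold}
```

A uniform gap distribution has CV near 0 and stays on plain range chunking; a heavy-tailed one crosses 1.0 and flags work-stealing.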
Planner
Takes the source profile, run history, and optional RuntimeContext, and produces an ExecutionPlan:
- Worker count (from the controller, benchmarker, or profiler, in that priority order)
- Chunk count and boundaries (range chunking by default; density-aware planned for Phase 5)
- Chunking strategy
- Verdict: SAFE TO RUN | SAFE WITH WARNINGS | NOT RECOMMENDED
The plan is deterministic: same inputs produce the same plan, every time.
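A minimal sketch of deterministic range chunking, assuming near-equal-width, half-open ranges over the primary-key span (`range_chunks` is illustrative, not ixtract's actual planner):

```python
def range_chunks(pk_min: int, pk_max: int, chunk_count: int) -> list[tuple[int, int]]:
    """Split [pk_min, pk_max] into contiguous half-open chunks [lo, hi).
    A pure function of its inputs, so the same profile always yields
    the same boundaries -- the determinism property described above."""
    span = pk_max - pk_min + 1
    base, extra = divmod(span, chunk_count)
    chunks, lo = [], pk_min
    for i in range(chunk_count):
        width = base + (1 if i < extra else 0)  # spread the remainder
        chunks.append((lo, lo + width))
        lo += width
    return chunks
```

Density-aware chunking (planned for Phase 5) would vary the widths by estimated row density instead of splitting the key span evenly.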
Controller
The statistical window-based optimizer. After each run it updates a rolling window (default: 5 runs) and decides whether to adjust worker count.
Behavior:
- Reduces workers when it detects sustained throughput degradation (≥15% drop over 3 consecutive runs, total ≥20%)
- Does not discover that fewer workers would be better when the current count appears stable — that requires the benchmarker
- Uses direction-aware deviation: only flags degradation, not noise in the upward direction
- Escape mode: severe misconfiguration (3 consecutive drops ≥15%, total ≥20%) triggers a hard cut
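The escape-mode condition above can be sketched as a pure function over the rolling window; `should_cut_workers` is a hypothetical name and the window shape is assumed:

```python
def should_cut_workers(throughputs: list[float],
                       drop_per_run: float = 0.15,
                       total_drop: float = 0.20) -> bool:
    """Hard-cut check: three consecutive run-over-run drops of at least
    15%, and at least 20% total, over the last four runs in the window
    (oldest first). Upward noise never triggers it: direction-aware."""
    if len(throughputs) < 4:
        return False
    last4 = throughputs[-4:]
    consecutive = all(
        last4[i + 1] <= last4[i] * (1 - drop_per_run) for i in range(3)
    )
    total = last4[-1] <= last4[0] * (1 - total_drop)
    return consecutive and total
```

Note the asymmetry: a stable-but-overprovisioned worker count never trips this check, which is why discovering that *fewer* workers would be better is left to the benchmarker.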
Estimator
Blends two sources of throughput estimates:
- EWMA — exponentially weighted moving average of historical runs for this table
- Context similarity — weighted 5-dimension similarity search over past runs (source load, network quality, time of day, etc.)
The blend is continuous (not a hard switch). Time decay: weight = score × exp(-λ × age_days), λ=0.05, half-life ~14 days. Recent runs matter more.
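A sketch of the continuous blend, assuming the similarity estimate is a decay-weighted mean and an illustrative mixing rule that trusts similarity more as total evidence grows (the real blend rule may differ):

```python
import math

def blended_estimate(ewma: float,
                     similar_runs: list[tuple[float, float, float]],
                     lam: float = 0.05) -> float:
    """Each past run contributes (throughput, similarity_score, age_days);
    its weight is score * exp(-lam * age_days), the decay from the text
    (half-life ~14 days at lam=0.05)."""
    total_w = sum(s * math.exp(-lam * a) for _, s, a in similar_runs)
    if total_w == 0:
        return ewma  # no comparable history: fall back to EWMA alone
    sim = sum(t * s * math.exp(-lam * a) for t, s, a in similar_runs) / total_w
    # Assumed mixing rule: more accumulated evidence -> lean on similarity.
    alpha = total_w / (total_w + 1.0)
    return alpha * sim + (1 - alpha) * ewma
```

Because alpha varies smoothly with the evidence weight, the estimate slides between the two sources rather than switching hard, matching the "continuous blend" behavior described above.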
Epistemic boundary
ixtract maintains a hard distinction between two types of runtime information:
| Concept | What it is | Stored | Feeds learning |
|---|---|---|---|
| ExecutionContext | System-measured reality | Yes | Yes |
| RuntimeContext | User-declared beliefs | Yes | No |
ExecutionContext is what ixtract measures: actual throughput, chunk durations, worker efficiency, source latency. This feeds the controller and estimator.
RuntimeContext is what you tell ixtract: “the source is under heavy load right now” or “I want to use at most 3 workers.” This constrains the plan but never pollutes the learned baseline.
A run with RuntimeContext constraints is excluded from controller learning. It won’t bias future unconstrained runs.
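This exclusion rule can be sketched as a simple filter over run history; the `RunRecord` shape is assumed for illustration:

```python
from dataclasses import dataclass

@dataclass
class RunRecord:
    table: str
    throughput: float
    constrained: bool  # True when the run carried RuntimeContext constraints

def learning_window(history: list[RunRecord], table: str, size: int = 5) -> list[RunRecord]:
    """The controller's rolling window keeps only unconstrained runs for the
    table, so user-declared beliefs never bias the learned baseline."""
    eligible = [r for r in history if r.table == table and not r.constrained]
    return eligible[-size:]
```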
Execution engine
The engine dispatches chunks to workers using Longest Processing Time (LPT) scheduling: the chunk estimated to take longest goes to the first available worker. For skewed tables, this keeps total elapsed time close to the minimum.
Work-stealing: when CV > 1.0 is detected by the profiler, the engine sorts the dispatch queue in descending estimated size order. This ensures slow chunks don’t become stragglers.
Adaptive backoff: if a SOURCE_LATENCY_SPIKE is detected mid-run (source round-trip increases >3× baseline), the engine backs off with exponential sleep to avoid overwhelming the source.
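The LPT dispatch described above can be sketched with a heap of worker free-times; `lpt_assign` is illustrative, not the engine's actual scheduler:

```python
import heapq

def lpt_assign(chunk_est: list[float], workers: int) -> tuple[list[list[int]], float]:
    """Sort chunks by descending estimated duration, then always hand the
    next chunk to the worker that frees up first. Returns per-worker chunk
    indices and the estimated makespan (longest worker finish time)."""
    order = sorted(range(len(chunk_est)), key=lambda i: -chunk_est[i])
    free_at = [(0.0, w) for w in range(workers)]  # (busy-until, worker id)
    heapq.heapify(free_at)
    assignment = [[] for _ in range(workers)]
    for i in order:
        busy, w = heapq.heappop(free_at)
        assignment[w].append(i)
        heapq.heappush(free_at, (busy + chunk_est[i], w))
    makespan = max(t for t, _ in free_at)
    return assignment, makespan
```

Dispatching large chunks first is what prevents stragglers: a big chunk picked up last would extend the run past every other worker's finish time.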
State store
SQLite, stored locally at ~/.ixtract/state.db. Tracks:
- All runs (metadata, metrics, plan fingerprint)
- Chunk-level results
- Controller state per table
- Plan persistence for replay
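For illustration, a minimal sketch of such a store using Python's built-in sqlite3; the table and column names here are assumptions, not ixtract's actual schema:

```python
import sqlite3

def open_state_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a local state database with one table per
    tracked concern: runs, chunk results, and controller state."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS runs (
            run_id TEXT PRIMARY KEY,
            table_name TEXT,
            plan_fingerprint TEXT,   -- ties the run to a replayable plan
            rows_per_sec REAL,
            constrained INTEGER      -- RuntimeContext runs excluded from learning
        );
        CREATE TABLE IF NOT EXISTS chunks (
            run_id TEXT,
            chunk_index INTEGER,
            duration_sec REAL,
            PRIMARY KEY (run_id, chunk_index)
        );
        CREATE TABLE IF NOT EXISTS controller_state (
            table_name TEXT PRIMARY KEY,
            worker_count INTEGER
        );
    """)
    return conn
```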
Plan fingerprint
Every plan is serialized to canonical JSON (sorted keys, no whitespace, floats rounded to 6 decimals), then SHA-256 hashed. This fingerprint is stored with each run and used to verify replay integrity.
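The fingerprinting scheme can be sketched directly from that description; the rounding and canonicalization follow the text, while the helper name is hypothetical:

```python
import hashlib
import json

def plan_fingerprint(plan: dict) -> str:
    """Canonicalize a plan (floats rounded to 6 decimals), serialize to
    JSON with sorted keys and no whitespace, and SHA-256 the result."""
    def canon(v):
        if isinstance(v, float):
            return round(v, 6)
        if isinstance(v, dict):
            return {k: canon(x) for k, x in v.items()}
        if isinstance(v, list):
            return [canon(x) for x in v]
        return v
    payload = json.dumps(canon(plan), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode()).hexdigest()
```

Key ordering and sub-rounding-error float noise never change the hash, so two runs of the planner with the same inputs produce the same fingerprint, which is what makes replay verification possible.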
Data flow
```
ExtractionIntent
  ↓
Profiler → SourceProfile
  ↓
Estimator → ThroughputEstimate
  ↓
RuntimeContext (optional constraints)
  ↓
Planner → ExecutionPlan (fingerprinted)
  ↓
Engine → chunks dispatched (LPT)
  ↓
Writer → Parquet / CSV / S3 / GCS
  ↓
Manifest → _manifest.json
  ↓
Controller ← ExecutionContext (metrics fed back)
```