v0.9.2 · MIT · 518 tests

Why did this extraction take three times longer today?

ixtract answers that question. Every time. With evidence.

Deterministic adaptive extraction runtime for PostgreSQL, MySQL, and SQL Server.

$ pip install ixtract

Step 1 of 4 — Profile
$ ixtract profile orders --database mydb --user app

Profile — orders
  rows_estimated:     10,241,847
  pk_range:           1 → 10,241,847
  skew_coefficient:   0.12  (low — balanced)
  latency_p50_ms:     8.1
  recommended_chunks: 20

Step 2 of 4 — Plan
$ ixtract plan orders

Execution Plan
  workers:    8
  chunks:     20
  strategy:   range_chunking
  basis:      controller (run 4 of window)
  estimated:  12.1s @ 846K rows/sec
  plan_hash:  f6b8048a
  verdict:    ✓ SAFE TO RUN

Step 3 of 4 — Execute
$ ixtract execute orders --output ./data/

  [████████████████████] 20/20 chunks

Summary
  rows_extracted:  10,241,847
  duration:        11.7s
  throughput:      875,371 rows/sec
  anomalies:       none
  output:          orders_20260413.parquet

Step 4 of 4 — History
$ ixtract history orders

Run History — orders
  run_017  8w  856K/s  stable
  run_018  8w  847K/s  stable
  run_019  8w  831K/s  stable
  run_020  8w  824K/s  stable
  run_021  8w  806K/s  converged ✓

  Controller: converged at 8 workers (±3.8% drift)

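The convergence call above can be sketched as a sliding-window check over recent run throughputs. This is an illustrative reconstruction, not ixtract's internals; the window size and 5% drift threshold are assumptions.

```python
# Sketch of a convergence check over a window of run throughputs.
# Window size and drift threshold are illustrative assumptions.
def has_converged(throughputs, window=5, drift_threshold=0.05):
    """Converged when every run in the window stays within
    ±drift_threshold of the window mean."""
    if len(throughputs) < window:
        return False
    recent = throughputs[-window:]
    mean = sum(recent) / window
    drift = max(abs(t - mean) / mean for t in recent)
    return drift <= drift_threshold

# The five runs shown above: drift stays within a few percent.
history = [856_000, 847_000, 831_000, 824_000, 806_000]
print(has_converged(history))  # → True
```

With these five runs the maximum drift from the window mean is about 3%, which is why the controller can declare convergence rather than keep probing.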
Questions you ask every week.
That no tool has ever answered.

Not because the data doesn't exist. Because no tool was built to surface it.

  • Why is this job slower today than yesterday?
  • Why did runtime double without any code change?
  • Why does this table always take longer?
  • Why is throughput fluctuating mid-run?
  • Why are some chunks fast and others take 10x longer?
  • How many workers should I actually use?
  • Why does adding more workers make it worse?
  • Am I overloading the source database?
  • Is it safe to run this during business hours?
  • Can I trust this to run unattended?

ixtract was built to answer all of these.

"Why did this get slower today?"

Most tools show you that throughput dropped. ixtract tells you why — with a reasoning chain, not a label.

Every deviation from expected performance is classified. Not with a single word, but with a structured reasoning chain: the root cause, the evidence that supports it, and a recommendation.

When source latency spikes, ixtract knows. When a table is skewed and one chunk is doing 95% of the work, ixtract names it. When throughput drops 50% between runs, you don't have to guess — the diagnosis is waiting for you.

Evidence-based diagnosis. Not guess-based labels.

$ ixtract diagnose --object events

Diagnosis — events (run_021 vs baseline)
────────────────────────────────────────────────────
Deviation      THROUGHPUT_DROP_SEVERE
Confidence     HIGH
Root Cause     DATA_SKEW

Evidence
  chunk_001     1,502,847 rows  2.07s   ← 97% of work
  chunks 002–006  ~10,000 rows    0.03s each
  skew_ratio:     43.2x max/median
  cv:             2.05  (threshold: 1.0)

Work Stealing    ACTIVE — LPT dispatch engaged
Effective Workers  1.1 / 3 planned

Recommendation
  Data distribution is highly non-uniform.
  Range chunking distributes key space, not work.
  Current mitigation: work_stealing active.
────────────────────────────────────────────────────
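The two skew metrics in that diagnosis can be sketched from per-chunk row counts. The chunk counts below are illustrative (chunks 002–006 are only "~10,000" in the output above), so the numbers will not exactly reproduce the recorded 43.2x, which ixtract computes from its own internal data.

```python
import statistics

# Sketch of the skew metrics shown above, computed from per-chunk
# row counts. The sample data is illustrative, not the recorded run.
def skew_metrics(chunk_rows):
    mean = statistics.mean(chunk_rows)
    stdev = statistics.pstdev(chunk_rows)
    cv = stdev / mean                                  # coefficient of variation
    skew_ratio = max(chunk_rows) / statistics.median(chunk_rows)
    return skew_ratio, cv

# One hot chunk holding almost all rows, five near-empty chunks.
rows = [1_502_847, 10_000, 10_000, 10_000, 10_000, 10_000]
ratio, cv = skew_metrics(rows)
print(f"skew_ratio={ratio:.1f}x  cv={cv:.2f}")
```

A cv above 1.0 is the threshold the diagnosis cites: at that point chunk sizes vary more than their mean, and range chunking alone cannot balance the work.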
$ ixtract plan orders \
    --source-load high \
    --network-quality degraded \
    --priority low

RuntimeContext
  source_load:      high       (multiplier: 0.50)
  network_quality:  degraded   (multiplier: 0.75)
  priority:         low

Worker Resolution
  base (controller):       8
  after env multipliers:   3   (×0.38 combined)
  after priority (low):    2
  final:                   2

Cost Comparison
  workers  duration  cost
  2        11.4s    $0.13  ← planned
  3        11.9s    $0.13
  8        14.2s    $0.14  (over-parallelized)

Verdict:  ✅ SAFE TO RUN
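The worker resolution above can be sketched as a multiplier pipeline. The environment multipliers come from the CLI output; the priority rule (drop one worker at low priority, floor at 1) is an illustrative assumption, not ixtract's documented behavior.

```python
# Sketch of the worker-resolution pipeline shown above.
# The low-priority rule is an assumption for illustration.
ENV_MULTIPLIERS = {
    ("source_load", "high"): 0.50,
    ("network_quality", "degraded"): 0.75,
}

def resolve_workers(base, context, priority="normal"):
    combined = 1.0
    for key, value in context.items():
        combined *= ENV_MULTIPLIERS.get((key, value), 1.0)
    workers = max(1, round(base * combined))   # 8 × 0.375 → 3
    if priority == "low":
        workers = max(1, workers - 1)          # shed one more worker
    return workers

ctx = {"source_load": "high", "network_quality": "degraded"}
print(resolve_workers(8, ctx, priority="low"))  # → 2
```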

Stop guessing worker counts.

The right number of workers is not 8. It depends on your source, your table, your network, and your history. ixtract calculates it — and explains why.

Adding workers doesn't always help. Sometimes it makes things worse. ixtract's direction-aware controller tracks whether the last adjustment helped or hurt — not just whether throughput went up.

When you're running against a heavily-loaded source, fewer workers can outperform more. The controller discovers this through feedback, not configuration.
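One way to picture a direction-aware step: keep moving in the last direction while it helps, reverse when it hurts, and never move more than one bounded step. This is a sketch of the idea, not ixtract's controller; the names, step bound, and limits are assumptions.

```python
# Sketch of a direction-aware feedback step: the controller tracks
# whether the last adjustment helped, not just raw throughput.
# Step bound and worker limits are illustrative assumptions.
def next_workers(workers, last_direction, last_throughput, throughput,
                 max_step=1, lo=1, hi=16):
    improved = throughput >= last_throughput
    direction = last_direction if improved else -last_direction
    step = max(-max_step, min(max_step, direction))   # bounded adjustment
    return max(lo, min(hi, workers + step)), direction

# Throughput fell after scaling up, so the controller backs off.
workers, direction = next_workers(8, +1, 856_000, 820_000)
print(workers, direction)  # → 7 -1
```

Because the sign flips on regression, a source where fewer workers perform better pulls the controller downward run after run, which is exactly the high-load behavior described above.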

Real finding from testing:
2 workers on a high-load source: 920,000 rows/sec
8 workers on the same source: 856,000 rows/sec

Over-parallelization confirmed. The controller learned this in 3 runs without a single config change.

Never accidentally overload your source again.

ixtract maintains a conservative bias. Under uncertainty, it uses fewer workers. It will never let a misconfigured extraction kill a production database.

Conservative by default

When ixtract doesn't have enough history to be confident, it starts conservatively. It scales up as evidence accumulates — never the other way around.

Bounded adaptation

No single adjustment exceeds the configured step limits. The controller cannot oscillate and it cannot run away: every move is bounded.

Source load awareness

Declare --source-load high and ixtract automatically constrains parallelism. No manual cap calculation. No guessing what "safe" means for your source.

In testing against Azure SQL Server (30ms p50 latency, 100× slower than local), ixtract flagged the anomaly at 44.3 standard deviations below the local baseline — and correctly constrained its own behavior without being told.
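That sigma figure is a distance-from-baseline score. A minimal sketch, using the local throughput history shown earlier as the baseline; these samples are illustrative rather than the exact recorded runs, so the score will not reproduce 44.3σ precisely.

```python
import statistics

# Sketch of distance-from-baseline anomaly scoring: how many standard
# deviations an observed throughput sits from the recorded baseline.
def sigma_deviation(baseline_samples, observed):
    mean = statistics.mean(baseline_samples)
    stdev = statistics.pstdev(baseline_samples)
    return abs(observed - mean) / stdev

local_baseline = [856_000, 847_000, 831_000, 824_000, 806_000]  # rows/sec
score = sigma_deviation(local_baseline, 8_700)  # cloud run throughput
print(f"{score:.1f}σ below local baseline")
```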

Real runs. Real numbers.

Five test runs across local PostgreSQL and Azure SQL Server. These are the actual results.

Run 1 — Baseline
  Config:  pgbench_accounts (10M rows), 8 workers, default
  Result:  856K rows/sec, 11.7s
  Proves:  clean cold-start with profiler

Run 2 — Source load
  Config:  same table, --source-load high, --network-quality degraded
  Result:  920K rows/sec at 2 workers
  Proves:  fewer workers outperformed 8 at high load

Run 3 — Skewed table
  Config:  skewed_events (1.55M rows, CV=2.05), work stealing active
  Result:  43× skew detected, LPT dispatch engaged
  Proves:  skew detection and mitigation working

Run 4 — Cloud SQL Server
  Config:  cloud_extraction_test (1M rows, Azure, p50=30ms)
  Result:  8.7K rows/sec, anomaly flagged at 44.3σ
  Proves:  cross-environment anomaly detection

Run 5 — Replay
  Config:  pgbench_accounts (Run 1 replayed), --run-id run_001
  Result:  plan hash ✓ identical, +0.3% throughput delta
  Proves:  deterministic replay verified

Test environment: Ubuntu, local PostgreSQL (port 5432), Azure SQL Server (ixtract-db-server-46). 518 simulation tests passing; 12 integration tests passing. No cherry-picked runs — this is the full test sequence.

Every decision is recorded.
Every run can be replayed exactly.

ixtract is not probabilistic. It does not guess. Every plan is produced by the same deterministic rules from the same inputs — and can be reproduced six months later on different hardware.

  • Same inputs → same plan. Always.
  • Every decision has a structured justification you can inspect.
  • Every run stores a plan fingerprint (SHA-256).
  • Replay re-executes against the stored plan — not a reconstruction.
  • Deviation from expected behavior is explained, not hidden.
  • No probabilistic drift. No unsupervised learning. No black box.

"Replay guarantees identical decisions, not identical results."
Timing varies. Hardware varies. The plan does not.
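The SHA-256 plan fingerprint idea can be sketched as hashing a canonical JSON form of the plan, so field order never changes the hash. The field names here are illustrative, not ixtract's stored schema.

```python
import hashlib
import json

# Sketch of a deterministic plan fingerprint: hash the canonical JSON
# form of the plan so identical inputs always yield the same hash,
# regardless of field order. Field names are illustrative.
def plan_fingerprint(plan: dict) -> str:
    canonical = json.dumps(plan, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:8]

a = plan_fingerprint({"workers": 8, "chunks": 20, "strategy": "range_chunking"})
b = plan_fingerprint({"strategy": "range_chunking", "workers": 8, "chunks": 20})
print(a == b)  # same plan, same fingerprint → True
```

Canonicalizing before hashing is what makes the fingerprint reproducible six months later on different hardware: the hash depends only on the plan's contents, never on serialization order.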

$ ixtract replay --run-id run_001

Replaying run_001 (pgbench_accounts, 2026-04-08)

Plan Integrity
  fingerprint:  f6b8048a4d2e...  ✔ verified
  version:      1.0              ✔ supported

Decision Check
──────────────────────────────────────────────────
              Original        Replay
──────────────────────────────────────────────────
Workers       8               8
Chunks        20              20
Strategy      range_chunking  range_chunking
Plan Hash     f6b8048a...     f6b8048a...  
──────────────────────────────────────────────────

Outcome Delta
  rows:       10,241,847 → 10,241,847  
  throughput: 856,341/s  → 858,284/s  (+0.3%)
  duration:   11.7s      → 11.6s     (-0.1s)

Determinism: ✔ Verified  (plan_fingerprint match)

Up and running in five minutes.

1. Install
$ pip install ixtract
2. Write extract.py
from ixtract import plan, execute, ExtractionIntent

intent = ExtractionIntent(
    source_type="postgresql",
    source_config={
        "host":     "localhost",
        "database": "mydb",
        "user":     "app",
    },
    object_name="orders",
)

result = plan(intent)
if result.is_safe:
    execution = execute(result)
    print(f"{execution.rows_extracted:,} rows in {execution.duration_seconds:.1f}s")
3. Run
10,241,847 rows in 11.7s

Run stored. Diagnosis available. Controller learning.
Next run will be faster.

Built for the full extraction lifecycle.

Each tool does one thing. None of them do each other's job.

ixtract MIT Open Source

Extraction runtime. Self-tuning, deterministic, explainable. Converges to optimal parallelism. Explains every decision.

→ You are here

iPoxy MIT Open Source — Coming Soon

Pipeline reinforcement layer. Pre: gate extractions before they run. Watch: monitor pipelines in production. Gate: CI/CD checks for data pipelines.

ixora Commercial — Coming Soon

Fleet intelligence platform. SLA tracking, cost dashboards, multi-team visibility. Built on ixtract data, scaled to the enterprise.

Single engineer  →  ixtract
Team reliability →  iPoxy
Platform scale   →  ixora