Performance: streaming vs batch SQL

Session provides two ways to get query results:

| API | Returns | When data is ready |
| --- | --- | --- |
| Session.sql(query) | pyarrow.Table (fully materialized) | After the full result is collected |
| Session.sql_reader(query) | pyarrow.RecordBatchReader (streaming) | First batch arrives as soon as the engine produces it |

For small to medium results the difference is negligible. For large results, especially when you want to process rows as they arrive or when memory is a concern, sql_reader(...) delivers the first batch much sooner and keeps peak memory lower.


Benchmark results

Measured on a generated dataset of ~10.5 million rows (Linux, local SSD, single process). Each number is a median over 3 runs after 1 warmup.

Time to first batch

"First batch available" measures how long until your code can start working with data.

| Query | sql_reader(...) | Session.sql(...) | Improvement |
| --- | --- | --- | --- |
| SELECT * FROM prices | 370.7 ms | 2,312 ms | 84% earlier |
| SELECT * FROM prices ORDER BY ts | 2,489 ms | 13,182 ms | 81% earlier |

With sql_reader(...), the engine hands you batches incrementally as they are produced. Session.sql(...) must collect the entire result before returning — for a sort query this means waiting for the full sort to complete.

Peak RSS (process memory)

The numbers below use a "process-as-you-go" pattern: each batch is processed immediately as it arrives, rather than holding the full table in memory.

| Query | sql_reader(...) | Session.sql(...) + iterate | Reduction |
| --- | --- | --- | --- |
| SELECT * FROM prices | 2.30 GiB | 3.60 GiB | 36% lower |
| SELECT * FROM prices ORDER BY ts | 3.66 GiB | 4.84 GiB | 24% lower |

Memory savings come from the fact that sql_reader(...) never holds a materialized copy of the full result — batches can be processed and discarded one at a time.

sql_reader(...).read_all() vs Session.sql(...)

If you call reader.read_all() (materializing everything eagerly), performance is in the same range as Session.sql(...). The streaming API does not add overhead when you ultimately collect all rows.


When to use each

Use Session.sql(...) when:

  • Your result fits comfortably in memory.
  • You want a pyarrow.Table directly (e.g. to pass to Polars, pandas, etc.).
  • Simplicity matters more than latency-to-first-row.

Use Session.sql_reader(...) when:

  • Your result is large and you want to process batches as they arrive (e.g. writing to disk, aggregating incrementally, piping to another system).
  • You want lower peak memory usage.
  • You care about time-to-first-result (e.g. streaming to a UI or a downstream pipeline).

Usage

import timeseries_table_format as ttf

sess = ttf.Session()
sess.register_tstable("prices", "./prices_table")

# Batch (eager) — returns a fully materialized pyarrow.Table
table = sess.sql("SELECT * FROM prices WHERE ts > '2024-05-01'")

# Streaming — process each batch as it arrives
reader = sess.sql_reader("SELECT * FROM prices WHERE ts > '2024-05-01'")
try:
    for batch in reader:
        # process batch (pyarrow.RecordBatch) here
        print(batch.num_rows)
finally:
    reader.close()

# Or collect everything at once via the reader (same performance as Session.sql)
reader2 = sess.sql_reader("SELECT * FROM prices WHERE ts > '2024-05-01'")
try:
    table2 = reader2.read_all()
finally:
    reader2.close()

Reproducing the benchmark

cd python
uv pip install -p .venv/bin/python numpy
uv run -p .venv/bin/python maturin develop --features test-utils
.venv/bin/python bench/sql_conversion.py \
    --target-ipc-gb 2 \
    --warmups 1 \
    --runs 3 \
    --include-streaming \
    --summary

On machines with more memory, increase --target-ipc-gb to 3 or 6 to benchmark larger datasets. Results are printed to stderr; use --json path/to/out.json to save the raw data.