Thchere

mssql-python Delivers Direct Apache Arrow Support, Slashing Data Fetch Overhead

Published: 2026-05-07 03:56:00 | Category: Data Science

Breaking: mssql-python Now Fetches SQL Server Data as Apache Arrow

mssql-python, the official Python driver for SQL Server, now supports fetching query results directly as Apache Arrow structures. This replaces the previous row-by-row Python object construction with a zero-copy columnar path that eliminates per-row overhead and garbage-collector pressure.

mssql-python Delivers Direct Apache Arrow Support, Slashing Data Fetch Overhead
Source: devblogs.microsoft.com

“This is a game-changer for anyone working with SQL Server and modern DataFrame libraries,” said Felix Graßl, the community developer who contributed the feature. “Previously, fetching a million rows meant a million Python objects. Now, the entire fetch loop runs in C++ and writes directly into Arrow buffers—no per-row allocations, no GC pauses.”

The update means data engineers can load millions of rows into Polars, Pandas (via ArrowDtype), DuckDB, or any Arrow-native tool without intermediate Python object marshalling. Benchmarks show dramatic speedups, especially for temporal types like DATETIME and DATETIMEOFFSET, where per-value conversions are eliminated.

Background: The Apache Arrow Advantage

Apache Arrow is an open-standard columnar in-memory format that enables zero-copy data exchange between languages. Instead of representing a table as a list of row objects, Arrow stores each column’s values contiguously in typed buffers, with nulls tracked in a compact bitmap. The Arrow C Data Interface is a cross-language ABI that lets a C++ database driver and a Python DataFrame library share the exact same memory without serialization, copies, or re-parsing.

For mssql-python, this means the entire fetch loop can run in native code and write values directly into Arrow arrays. The Python side receives only a pointer to that memory and can begin processing immediately. Crucial for performance: subsequent operations—filters, joins, aggregations—also work in-place on those same buffers, so a Polars pipeline reading from mssql-python never needs to materialize intermediate Python objects.

What This Means for Data Engineers

The Arrow integration delivers four concrete benefits:

mssql-python Delivers Direct Apache Arrow Support, Slashing Data Fetch Overhead
Source: devblogs.microsoft.com
  • Speed: Columnar fetch avoids per-row Python object creation, making queries noticeably faster—especially for date/time and other conversion-heavy types.
  • Lower memory usage: A column of one million integers is a single contiguous C array, not a million individual Python objects plus overhead.
  • Seamless interoperability: Data flows directly into Polars, DuckDB, Hugging Face datasets, and any library supporting the Arrow C Data Interface—without copying or conversion.
  • Reduced GC pressure: Python’s garbage collector is off the critical path, leading to more predictable performance in long-running data pipelines.

“This isn’t just an incremental improvement,” said Sumit Sarabhai, who reviewed the feature. “It fundamentally changes how Python applications interact with SQL Server for analytics work. The zero-copy path is the right foundation for high-throughput data processing.”

Key Technical Concepts

API (Application Programming Interface): a source-code contract that defines how to call a function or library.
ABI (Application Binary Interface): a binary-level contract that specifies how compiled code is laid out in memory. Two programs built in different languages can share an ABI and exchange data directly—no serialization needed.
Arrow C Data Interface: Apache Arrow’s ABI specification, enabling zero-copy data exchange between languages.

For more on getting started with mssql-python and Arrow, see the official repository.