Thoughts on Python's DataFrame

Posted on Aug 19, 2023

Note

This article is a slightly longer version of my tweet.

I have to say that the author of the tweet isn't seeing the whole picture. Python is not the only language that can process real-world data. And isn't it weird to treat Pandas’ API as the “standard” just because it’s everywhere?

PHP is everywhere, Java is everywhere, so why do we need Python?

In my view, Pandas should not be considered the “standard,” as it is neither language-agnostic nor efficient (it would be a different story if Pandas were as performant as NumPy for most use cases).

For now, I contend that Apache Arrow is the only real standard, since it is what lets Pandas and Polars work together.

For instance, we can build an efficient data processing pipeline in Polars, then convert the results into a Pandas DataFrame for the scenarios where Pandas excels. Here, Arrow serves as the connector.


2023-11-11 Update: I just found “Apache Arrow and the ‘10 Things I Hate About pandas’” by Wes McKinney, the author of Pandas.

I strongly feel that Arrow is a key technology for the next generation of data science tools.


2023-12-02 Update: I learned about a new project, dataframe-api, which is trying to define a standard API for DataFrames. I found it via “Using Polars in a Pandas world” by Itamar Turner-Trauring of Python⇒Speed.