Thoughts on Python’s DataFrames
Note
This article is a slightly longer version of my tweet.
One great thing about Python is API standardization: know scikit? Well the API is everywhere so you’re good to go. This is why I detest polars and heartily hope the project fails: is pandas perfect? No. But its API is the standard and we should keep it this way.
— Paul Rietschka (@paul_rietschka) August 17, 2023
I have to say the author of the tweet is not seeing the whole picture. Python is not the only language that can process real-world data. Isn’t it weird to treat Pandas’ API as the “standard” just because it’s everywhere?
PHP is everywhere, Java is everywhere, so why do we need Python?
In my view, Pandas should not be considered the “standard,” as it is neither language-agnostic nor efficient (it would be a different story if Pandas were as performant as NumPy for most use cases).
Currently, I contend that Apache Arrow is the only real standard, because it lets Pandas and Polars work together.
For instance, we can build an efficient data processing pipeline in Polars, then convert the results into a Pandas DataFrame for scenarios where Pandas excels. Here, Arrow serves as the connector.
2023-11-11 Update: I just found the post “Apache Arrow and the ‘10 Things I Hate About pandas’” by Pandas author Wes McKinney:
I strongly feel that Arrow is a key technology for the next generation of data science tools.
2023-12-02 Update: I learned about a new project, dataframe-api, which is trying to define a standard API for DataFrames, via “Using Polars in a Pandas world” by Itamar Turner-Trauring from Python⇒Speed.