Polars: How I Learned to Stop Worrying about Pandas

Mycchaka Kleinbort
5 min read · Nov 17, 2022


Polars: The fastest bear-themed dataframe library

If you are tired of learning “pandas-killers” then I hear you. I’ve myself spent way too many evenings trying to push past the limits of pandas by reaching for Dask/Modin/Vaex/cuDF, etc… but hear me out — I think this might be it!

Why should I learn Polars?

There are three main reasons for this:

  • Because you don’t need access to foot guns 💣
  • Because it’s FAST! Very fast! 🚀
  • Functional dataframe manipulations 🏄‍♂️

Ok, tell me more about foot guns…

Polars uses the Arrow format under the hood. This makes it very hard to “hit” a type error.

Oh, sorry, you tried to add an int to a string… Did you try adding a pd.Timedelta to a NaN?
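To make that concrete, here is a minimal sketch (the exact exception class and message vary by Polars version): pandas happily builds a mixed-type object column and defers the pain, while Polars refuses the mix at construction time.

import pandas as pd
import polars as pl

# pandas accepts the mix: dtype becomes object, and errors (or silent
# coercions) only surface later, at use time
pd.Series([1, "two", 3])

# Polars is strict by default, so the same mix fails immediately
try:
    pl.Series([1, "two", 3])
except Exception as e:  # exact exception class depends on the Polars version
    print(e)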

Black Panther on arithmetic with mixed types

Maybe this is a personal learning more than anything, but I abused pandas’ type flexibility a bit too much, and that made integration with other tools hard. (What do you mean I can’t have callables in my cells in Snowflake?)

https://imgflip.com/i/6wdvic

It also means work you do in Polars is highly compatible with other tools — databases, Parquet, or anything else that expects you to have type-homogeneous columns.
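As a quick illustration (a minimal sketch; the file name is made up), a Polars frame round-trips through Parquet with its schema intact:

import polars as pl

df = pl.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
df.write_parquet("example.parquet")  # Arrow-backed types map cleanly onto Parquet
print(pl.read_parquet("example.parquet").schema)
# e.g. {'id': Int64, 'name': String}  (Utf8 on older Polars versions)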

Ok, tell me more about speed…

50 to 150x speedup — depends on your code.

You know what sucks? Cache invalidation. And let’s be honest, caching is how we data scientists solve 99% of our problems. Does that dataset take a long time to build? Cache it to disk. Does that SQL query take a long time to run? Cache it to disk. Etc…

But caching is hard! Not only are you still waiting for hours while your data gets “ready” the first time — you also have to re-run everything every time your code or data changes, which makes experimentation a pain.

XKCD Wisdom

It also makes interactivity a pain! Unless you have many users, chances are each way of inspecting your data will be some form of cache miss — which sucks.

Tell you what doesn’t suck: being able to rebuild your entire dataset in a few milliseconds.

That’s what a 150x speedup looks like in practice: did that take 2h to build? Well, now it’s ready in about 50 seconds (7,200s ÷ 150 ≈ 48s).

Was that user interaction taking 2–5 seconds? Well, now it’s faster than most monitors’ refresh rates (2s ÷ 150 ≈ 13ms, comfortably under the ~16.7ms frame time of a 60Hz display)!

Speed is a feature (as the Rust folks would say).

And at least for me, that feature means fewer up-front decisions, less maintenance, and more experimentation.

Ok, tell me more about dataframe manipulation…

Let me start with this: the Polars syntax is not the pandas syntax. Any similarity is merely due to reasonable people overlapping in how they think things should be done, and learning the Polars API takes some work.

I am a big user of the pandas functional API, and I got used to writing this kind of code:

df = (df
    .pipe(add_normalized_name)
    .pipe(filter_on_criteria, CRITERIA)
    .assign(target=lambda x: x['sales'] > x['sales'].median())
)
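Happily, the pattern ports over: Polars DataFrames also have a .pipe method. Here is a rough sketch of the same pipeline, assuming the same (hypothetical) helper functions:

import polars as pl

df = (df
    .pipe(add_normalized_name)  # hypothetical helpers, as above
    .pipe(filter_on_criteria, CRITERIA)
    .with_columns(
        (pl.col('sales') > pl.col('sales').median()).alias('target')
    )
)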

But now that I’m learning Polars, I can see it has made some good choices.

I’ll illustrate with this dummy data (sales by customer by store, a pseudo-transactions table):

import numpy as np
import pandas as pd
import polars as pl

data_size = 1_000_000

# seed the RNG for reproducibility
np.random.seed(1)
saleValue = np.random.randint(0, 100, data_size)
storeId = np.random.choice([f'Store: {i}' for i in range(200)], replace=True, size=data_size)
customerId = np.random.choice([f'Customer: {i}' for i in range(10_000)], replace=True, size=data_size)

# build in pandas, then hand over to Polars
df = pd.DataFrame(
    dict(storeId=storeId, customerId=customerId, saleValue=saleValue)
).pipe(pl.from_pandas)

Choice 1: Better creation of new columns

Compare this pandas code:

df = (df
    .assign(**{
        'Mean Sales': lambda x: x['saleValue'].mean(),
        'Median Sales': lambda x: x['saleValue'].median()
    })
)

To this Polars code:

df = (df
    .with_columns([
        pl.col('saleValue').mean().alias('Mean Sales'),
        pl.col('saleValue').median().alias('Median Sales')
    ])
)

or alternatively:

df = (df
    .with_columns(**{
        'Mean Sales': pl.col('saleValue').mean(),
        'Median Sales': pl.col('saleValue').median()
    })
)

Either results in the desired additional columns:

Choice 2: Window functions

It is very easy to write a window function. For example, consider this addition of the median sale by store:

df.with_columns(**{
    'Median Sale by Store': pl.col('saleValue').median().over('storeId')
})

This adds a new column with the median sales by store:

And I should say, at 23ms for a 1m row table, it’s also delightfully fast.
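For comparison, the closest pandas idiom is a groupby/transform. A sketch, assuming a pandas copy of the same data in df_pandas:

# broadcast the per-store median back onto every row
df_pandas['Median Sale by Store'] = (
    df_pandas.groupby('storeId')['saleValue'].transform('median')
)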

Choice 3: Good support for lists and structs

Polars also supports arrays inside its cells. To illustrate, let’s get a list of the top 5 spenders per store.

(df
    .groupby(['storeId','customerId'])
    .agg(pl.col('saleValue').sum().alias('totalSales'))  # total spend per customer per store
    .sort('totalSales', reverse=True)                    # biggest spenders first
    .groupby('storeId')
    .agg(pl.col('customerId').head(5).list().alias('customerIds'))  # top 5 as a list per store
)

Once again, at 112ms this is also delightfully fast! About 4x faster than the equivalent pandas code.
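Structs get similar first-class treatment. As a minimal sketch (not part of the example above), pl.struct packs several columns into a single struct column, and struct.field pulls one back out:

(df
    .with_columns(pl.struct(['storeId', 'customerId']).alias('key'))  # pack two columns into one struct column
    .select(pl.col('key').struct.field('storeId'))                    # unpack a single field again
)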

Choice 4: Did I mention fast?

I recently wrote a small article on one-hot encoding strings that express group membership. See the full article here.

To build on that illustration, let’s turn the previous table into a one-hot-encoded dataframe, where the rows are the storeIds, the columns are the customerIds, and a 1 indicates that the customer is one of the top 5 spenders at that store.

x = (df
    .groupby(['storeId','customerId'])
    .agg(pl.col('saleValue').sum().alias('totalSales'))
    .sort('totalSales', reverse=True)
    .groupby('storeId')
    .agg(pl.col('customerId').head(5).list().alias('customerIds'))
)

We can one-hot-encode the customerIds column with:

(x
    .explode('customerIds')                    # one row per (store, top customer)
    .with_columns(pl.lit(1).alias('__one__'))  # indicator column
    .pivot(index='storeId', columns='customerIds', values='__one__')
    .fill_null(0)                              # absent pairs become 0
)

And as I said, at 5ms this is delightfully fast.

For comparison, the equivalent pandas code took 10x longer (63ms).

If we push the analysis from:

  • 1m transactions
  • across 200 stores
  • and 10k customers

to

  • 10m transactions
  • across 2k stores
  • and 100k customers

we get the following comparison:

Pandas — 10s:

# Top customers by store
x_pandas = (df_pandas
    .groupby(['storeId','customerId'])
    ['saleValue'].sum().reset_index()
    .sort_values('saleValue', ascending=False)
    .groupby('storeId')
    ['customerId'].apply(lambda x: x.head(5).values)
    .reset_index()
)

# One-hot-encoded top customers
df_ans = (x_pandas
    .explode('customerId')
    .assign(__one__=1)
    .pivot_table(index='storeId',
                 columns='customerId',
                 values='__one__',
                 fill_value=0)
)

Polars — 2s:

# Top customers by store
x = (df
    .groupby(['storeId','customerId'])
    .agg(pl.col('saleValue').sum().alias('totalSales'))
    .sort('totalSales', reverse=True)
    .groupby('storeId')
    .agg(pl.col('customerId').head(5).list().alias('customerIds'))
)

# One-hot-encoded top customers
df_ans = (x
    .explode('customerIds')
    .with_columns(pl.lit(1).alias('__one__'))
    .pivot(index='storeId', columns='customerIds', values='__one__')
    .fill_null(0)
)

The development of Polars is progressing rapidly, and I follow it with excitement.
