You are just one line of code away from speeding up your functions with simple caching

Not so long ago, I built an ETL-pipeline that was running daily, pulling data from an external service to enrich my input data and then loading the result into a database.

As my input data grew, my ETL-process became slower and slower because waiting for the responses from the external service was time-consuming. After some investigation, I realized I didn’t have that many different input values (~500) compared to the number of total records (~500k).

So, in other words, I was calling the external service with each distinct argument roughly 1,000 times (about 500,000 records divided by roughly 500 distinct input values).

A situation like this is a prime use case for caching. Caching a function means that whenever we calculate the return value of the function for the first time, we put the input and the result into a dictionary.

For every subsequent function call, we first check if the result has already been calculated by looking into our cache. If we find it there, perfect, no calculation needed! If not, we calculate the result and store the input and result in the cache, so that the next function call can look it up.
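To make the idea concrete, here is a minimal hand-rolled sketch of that dictionary-based cache. slow_lookup is just a hypothetical stand-in for the expensive external call, not the actual service from my pipeline.

import time

def slow_lookup(value):
    # pretend this is the slow external service
    time.sleep(1)
    return value * 2

_cache = {}

def cached_lookup(value):
    if value in _cache:          # cache hit: no calculation needed
        return _cache[value]
    result = slow_lookup(value)  # cache miss: do the expensive work once
    _cache[value] = result       # store input and result for next time
    return result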

The Python standard library comes with many lesser-known but powerful packages. For our example at hand, we will be using lru_cache from functools. (LRU stands for Least Recently Used and means exactly that: the cache keeps the most recently used input/result pairs and, once it is full, discards the least recently used entries first.)
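To see that eviction in action, here is a toy example with a deliberately tiny maxsize of 2 (square is just a made-up function for illustration):

from functools import lru_cache

@lru_cache(maxsize=2)           # keep only the two most recently used results
def square(n):
    print(f'computing {n}')
    return n * n

square(1)   # computed
square(2)   # computed
square(1)   # served from the cache
square(3)   # computed; 2 is now the least recently used entry and gets evicted
square(2)   # computed again, because it was discarded
print(square.cache_info())      # shows 1 hit and 4 misses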

From Fun(c)tools Import lru_cache

The c in parentheses is a bit of a dad-joke, because then functools becomes fun tools and using caching is certainly fun!

from functools import lru_cache
from datetime import datetime

@lru_cache(maxsize=None)        # unbounded cache: nothing ever gets evicted
def fib_cache(n):
    if n < 2:
        return n
    return fib_cache(n-1) + fib_cache(n-2)

def fib_no_cache(n):
    if n < 2:
        return n
    return fib_no_cache(n-1) + fib_no_cache(n-2)

def timeit(func, samples):
    # measure how long a single call to func(samples) takes
    start = datetime.now()
    func(samples)
    end = datetime.now()
    return end - start

Not much magic here. We import lru_cache and decorate a function that generates Fibonacci numbers with it.

Decorating a function means wrapping it in the caching logic: whenever we subsequently call fib_cache, we are actually calling the cached wrapper around the original function.
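If you have never seen a decorator before, the @lru_cache line is just shorthand for wrapping the function by hand and rebinding its name, roughly like this:

from functools import lru_cache

def fib_plain(n):
    if n < 2:
        return n
    return fib_plain(n-1) + fib_plain(n-2)

# equivalent to decorating fib_plain with @lru_cache(maxsize=None);
# the recursive calls go through the cached wrapper too,
# because the name fib_plain is looked up at call time
fib_plain = lru_cache(maxsize=None)(fib_plain)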

The Race Is On

UPPER = 40

cached = []
for i in range(0, UPPER):
    cached.append(timeit(fib_cache, i))

not_cached = []
for i in range(0, UPPER):
    not_cached.append(timeit(fib_no_cache, i))

We run an experiment where we measure the time it takes to calculate each of the first 40 Fibonacci numbers for both the cached and the uncached version of the function and put the results into their respective lists.

And the Winner Is

For the lower Fibonacci numbers, it doesn’t make a big difference, but the efficiency gains for the cached function start to add up once we reach about 30 samples.

I didn’t have the patience to let the uncached version run much past 40 samples, because its running time grows exponentially with n, whereas for the cached version it grows only linearly.
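The cache statistics that lru_cache exposes show why: computing fib(40) from an empty cache evaluates each Fibonacci number exactly once, so the amount of real work grows with n instead of exploding.

fib_cache.cache_clear()          # start from an empty cache
fib_cache(40)
print(fib_cache.cache_info())    # 41 misses (one per value 0-40), the rest are hits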

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# collect both timing series in one DataFrame and convert the timedeltas to seconds
df = pd.DataFrame({'cached': cached, 'not_cached': not_cached})
df = df.applymap(lambda x: x.total_seconds())

sns.set()
g = sns.lineplot(data=df)
g.set_ylabel('time in seconds')
g.set_xlabel('samples')
plt.show()

There you have it! You are just one line away from caching in Python. It isn’t so scary after all.
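To circle back to the ETL story from the beginning, that one line was essentially all it took there as well. The snippet below is a hypothetical sketch (the URL and the enrich_value function are made up, and keep in mind that lru_cache only works with hashable arguments):

from functools import lru_cache

import requests

@lru_cache(maxsize=None)
def enrich_value(value):
    # each distinct value hits the external service only once;
    # every repeated call is answered from the cache
    response = requests.get('https://example.com/enrich', params={'value': value})
    return response.json()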

I hope you liked it and will find some use for it. Feel free to fork the code, report bugs, or ask for new features on GitHub!
