🧶 Local-first development: pandarallel
Scraping links off the internet is a thankless job, but someone's gotta do it 🤷🏻‍♂️
Hey, it’s Vlad — here’s a short story for you: I’ve recently tried to scrape about 5k links off the internet with Python, for a personal project (yup, it's about retrieval augmented generation 🙈).
I didn't worry too much about optimizing this since it was pretty much a one-off thing, but after about half an hour of running the script and with only 300 links retrieved, I was getting impatient.
So, I did what every red-blooded engineer would do: I avoided doing all the other things I should have been doing, and started looking for a *simple* way to speed up the process.
This is how I found pandarallel.
pandarallel is a simple and efficient tool to parallelize Pandas operations on all available CPUs. With a one-line code change, it allows any Pandas user to take advantage of their multi-core computer, while pandas itself uses only one core.
Thing is, I was already using Pandas, and it was just a simple change from `df.apply` to `df.parallel_apply`, so I installed it and gave it a try. The results were quite impressive.
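For the curious, here's a minimal sketch of what that change looks like in context. The `fetch` helper, the timeout, and the placeholder URLs are illustrative stand-ins, not code from my actual script:

```python
# pip install pandarallel
import requests
import pandas as pd
from pandarallel import pandarallel

# One-time setup: starts one worker per CPU core.
# progress_bar=True gives you the per-worker progress bars.
pandarallel.initialize(progress_bar=True)

def fetch(url: str) -> str:
    """Fetch one page; return an empty string on failure."""
    try:
        return requests.get(url, timeout=10).text
    except requests.RequestException:
        return ""

# Stand-in for the ~5k scraped links.
df = pd.DataFrame({"url": ["https://example.com"] * 8})

# The one-line change: parallel_apply instead of apply.
df["html"] = df["url"].parallel_apply(fetch)
```

Since scraping is I/O-bound rather than CPU-bound, spreading the requests across worker processes mostly just means more downloads happening at once, which is exactly what the job needed.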
While my initial script handled about 200 links in 20 minutes (😱), the parallelized version got through almost all 5k links in the same amount of time (🥳). Plus, it included some nice progress bars (see the screenshot above).
Simple things should be simple.
— Vlad