🧶 Local-first development: pandarallel
Scraping links off the internet is a thankless job, but someone's gotta do it 🤷🏻‍♂️
Hey, it’s Vlad — here’s a short story for you: I’ve recently tried to scrape about 5k links off the internet with Python, for a personal project (yup, it's about retrieval augmented generation 🙈).
I didn't worry too much about optimizing this since it was pretty much a one-off thing, but after about half an hour of running the script and with only 300 links retrieved, I was getting impatient.
So, I did what every red-blooded engineer would do: I avoided doing all the other things I should have been doing, and started looking for a *simple* way to speed up the process.
This is how I found pandarallel.
pandarallel is a simple and efficient tool to parallelize Pandas operations on all available CPUs. With a one-line code change, it allows any Pandas user to take advantage of their multi-core computer, while pandas itself uses only one core.
Thing is, I was already using Pandas, and it was just a simple change from `df.apply` to `df.parallel_apply`, so I installed it and gave it a try. The results were quite impressive.
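For the curious, here's a minimal sketch of what that change looks like in context. The `fetch` helper, the timeout, and the placeholder URLs are illustrative stand-ins, not code from my actual script:

```python
# pip install pandarallel
import requests
import pandas as pd
from pandarallel import pandarallel

# One-time setup: starts one worker per CPU core.
# progress_bar=True gives you the per-worker progress bars.
pandarallel.initialize(progress_bar=True)

def fetch(url: str) -> str:
    """Fetch one page; return an empty string on failure."""
    try:
        return requests.get(url, timeout=10).text
    except requests.RequestException:
        return ""

# Stand-in for the ~5k scraped links.
df = pd.DataFrame({"url": ["https://example.com"] * 8})

# The one-line change: parallel_apply instead of apply.
df["html"] = df["url"].parallel_apply(fetch)
```

Since scraping is I/O-bound rather than CPU-bound, spreading the requests across worker processes mostly just means more downloads happening at once, which is exactly what the job needed.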
While my initial script handled about 200 links in 20 minutes (😱), the parallelized version got through almost all 5k links in the same amount of time (🥳). Plus, it included some nice progress bars (see the screenshot above).
Simple things should be simple.
— Vlad