🧶 Local-first development: pandarallel
Scraping links off the internet is a thankless job, but someone's gotta do it 🤷🏻‍♂️
Hey, it's Vlad. Here's a short story for you: I've recently tried to scrape about 5k links off the internet with Python, for a personal project (yup, it's about retrieval augmented generation 🙂).
I didn't worry too much about optimizing this since it was pretty much a one-off thing, but after about half an hour of running the script and with only 300 links retrieved, I was getting impatient.
So, I did what every red-blooded engineer would do: I avoided doing all the other things I should have been doing, and started looking for a *simple* way to speed up the process.
This is how I found pandarallel.
pandarallel is a simple and efficient tool to parallelize Pandas operations across all available CPUs. With a one-line code change, it lets any Pandas user take advantage of their multi-core machine, whereas pandas itself uses only one core.
Thing is, I was using Pandas, and it was just a simple change from `df.apply` to `df.parallel_apply`, so I installed it and gave it a try. The results were quite impressive.
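If you're curious what that change looks like in practice, here's a minimal sketch. The `fetch` helper, its timeout, and the example URLs are made up for illustration; the real script did the actual scraping of my 5k links (the package installs with `pip install pandarallel`):

```python
import pandas as pd
import requests
from pandarallel import pandarallel

# One-time setup: starts workers on all available CPUs and enables
# the per-worker progress bars.
pandarallel.initialize(progress_bar=True)

def fetch(url: str) -> str:
    # Placeholder scraping function: download a page, return "" on failure.
    try:
        return requests.get(url, timeout=10).text
    except requests.RequestException:
        return ""

# Stand-in for the real DataFrame of ~5k links.
df = pd.DataFrame({"url": ["https://example.com"] * 100})

# Single-core version:
#   df["html"] = df["url"].apply(fetch)
# Parallel version -- the one-line change:
df["html"] = df["url"].parallel_apply(fetch)
```

One caveat worth knowing: the workers are separate processes, so there's some startup and data-transfer overhead; for a network-bound job like this one, it still paid off handsomely.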
While my initial script handled about 200 links in 20 minutes (😱), the parallelized version handled almost all 5k links in the same period (🥳). Plus, it included some nice progress bars (see the screenshot above).
Simple things should be simple.
— Vlad