The simplest Internet-connected search assistant ever, new difficult dataset for evaluating LLMs, GPT-3.5 Turbo fine-tuning pricing
Hi friends,
It just struck me — Christmas is coming, time to put up the Christmas tree and listen to Sia’s Everyday Is Christmas on repeat until the world makes no sense anymore.
Me, I’m planning to rewatch Violent Night soon. It’s just that good. (I’m also excited about The Holdovers🤩!).
Now, on to today’s issue 🎄🚀
I’ve talked about building a poor-man’s version of BingChat at Global AI Notes, so if you want to go through a speedrun of how LLMs work, how they’re trained, fine-tuned, and prompted, and how they can be used to kick ass in general, watch this.
Or you can just go ahead and run the code I’ve used for the demo. I’ve updated it to use openai==1.3.5 after running into a lot of “APIConnectionError: Error communicating with OpenAI: No connection adapters were found for …” errors when trying to call Azure OpenAI instances with openai==0.28.1 😒.
In other news, GAIA was released a few days ago.
GAIA is a benchmark that aims to evaluate next-generation LLMs (LLMs with augmented capabilities thanks to added tooling, efficient prompting, access to search, etc.).
GAIA is made of more than 450 non-trivial questions with unambiguous answers, requiring different levels of tooling and autonomy to solve. It is therefore divided into 3 levels, where level 1 should be breakable by very good LLMs, and level 3 indicates a strong jump in model capabilities. Each level is divided into a fully public dev set for validation, and a test set with private answers and metadata.
I like that it's a **difficult** dataset, as evidenced by the not-so-great test results: as of 30 Nov 2023, GPT-4 Turbo had an average score of 9.7%. Out of 100% 🥶. It's also new, so it probably hasn't been included in any model's training/fine-tuning data.
Yet.
So it's a good way to evaluate model fine-tunes and the like without worrying about data contamination.
For now.
Plus, Yann LeCun is one of the authors so it must be good. To be honest, I'm quite excited about it.
Dataset: https://huggingface.co/datasets/gaia-benchmark/GAIA
Leaderboard: https://huggingface.co/spaces/gaia-benchmark/leaderboard
Maybe you’ve missed this with all the Game of OpenAI Thrones drama, but OpenAI has recently reduced their fine-tuning prices for GPT-3.5 Turbo by ~75% 😱.
I've updated my cost analysis to reflect it, and I think this is a bit of a big deal -- it makes OpenAI more cost-effective for fine-tuning than Azure OpenAI, even at crazy volumes like 1.1 billion tokens per month!
For these volumes, it looks like Azure OpenAI is the clear choice for Davinci-002, a somewhat clear choice for Babbage-002, and a muddy choice at best for GPT-3.5 Turbo.
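If you want to run the numbers for your own volumes, the comparison boils down to a toy model like this — every price below is a made-up placeholder, not an actual rate (both providers change pricing often, and Azure also bills an hourly hosting fee for fine-tuned deployments), so plug in the current numbers from the pricing pages linked below:

```python
# Placeholder prices in $/1K tokens (and $/hour for hosting) -- NOT real rates.
OPENAI_PRICES = {"training": 0.008, "input": 0.003, "output": 0.006}
AZURE_PRICES = {"training": 0.008, "input": 0.0015, "output": 0.002,
                "hosting_per_hour": 1.70}

def monthly_cost(prices, training_tokens, input_tokens, output_tokens, hours=730):
    """Rough monthly cost: one-off training + inference + any hosting fee."""
    cost = (training_tokens / 1000) * prices.get("training", 0)
    cost += (input_tokens / 1000) * prices["input"]
    cost += (output_tokens / 1000) * prices["output"]
    # Azure charges for hosting the fine-tuned deployment by the hour;
    # OpenAI doesn't, which is why low volumes favor OpenAI.
    cost += prices.get("hosting_per_hour", 0) * hours
    return cost
```

The hosting fee is the crux: it's a fixed monthly floor for Azure, so OpenAI's pure per-token pricing wins until your token volume is large enough for Azure's cheaper per-token rates to claw that floor back.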
Cost analysis: https://vladiliescu.net/finetuning-costs-openai-vs-azure-openai/
OpenAI pricing: https://openai.com/pricing
Azure OpenAI pricing: https://azure.microsoft.com/en-gb/pricing/details/cognitive-services/openai-service/
That’s it for now, see you in two weeks!