A challenge when developing software that makes use of large language models is writing automated tests for it.
With traditional software, you tend to know exactly what output you expect to receive for a given input.
When large language models are involved, however, it’s common to expect a chunk of text as the output, and that text can vary whilst still being ‘correct’. Imagine, for example, some software that is supposed to summarise an input document: two very differently-worded summaries could be equally valid.
This poses a challenge – how do you confirm, in your test, whether a given output is correct or not?
One way to tackle this is to use a large language model again – this time, ask it to assess whether the output text meets the requirements.
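A minimal sketch of that idea, in Python. Note that `call_llm` is a hypothetical stand-in for whichever model client you actually use (OpenAI, Anthropic, a local model, and so on); here it’s stubbed with a naive keyword check so the example is self-contained and runnable.

```python
def call_llm(prompt: str) -> str:
    # Hypothetical stub: a real implementation would call your model's API.
    # This stub naively "judges" by checking the candidate text is short
    # and mentions the required topic, then answers PASS or FAIL.
    candidate = prompt.split("CANDIDATE:", 1)[1]
    ok = len(candidate.split()) < 50 and "refund" in candidate.lower()
    return "PASS" if ok else "FAIL"

def judge_output(requirements: str, candidate: str) -> bool:
    """Ask a judge model whether `candidate` meets `requirements`."""
    prompt = (
        "You are checking the output of a test. Requirements:\n"
        f"{requirements}\n"
        "Reply with exactly PASS or FAIL.\n"
        f"CANDIDATE:{candidate}"
    )
    return call_llm(prompt).strip() == "PASS"

# Two differently-worded summaries can both pass the same test:
req = "A one-sentence summary that mentions the refund policy."
assert judge_output(req, "Customers may request a refund within 30 days.")
assert judge_output(req, "The refund window is 30 days from purchase.")
assert not judge_output(req, "This document covers shipping times.")
```

In a real test suite, the asserts at the bottom would live in your test framework, and `call_llm` would hit an actual model – ideally with a constrained response format (such as a single PASS/FAIL token) so the judge’s answer is easy to parse reliably.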
Now, the next time you want to tweak your prompts or switch to a new language model, you can run these tests and be more confident that the change won’t break existing functionality.