Revisiting the Question: Can AI Judge AI Output?
Revisiting the key themes of The Data Score's deep dive into AI Checker Agents with the help of a NotebookLM podcast.
I recently wrote a detailed case study on how bias can affect the use of AI Checker agents in an Agent of Agents approach to quality control.
It was a long article, but that didn't deter readers who opened it. The Substack data showed that many readers reopened the article multiple times, and it had among the highest number of views relative to unique readers across all Data Score articles. This data suggested the article was valuable but very long (even for a Data Score article), which may have prevented some readers from fully benefiting from it.
So, I’m trying something new in this article. I used NotebookLM to create a podcast about the Data Score article on AI Agents and Bias. The AI-generated podcast provides a useful summary of the article's key points.
The video below, created with the help of NotebookLM, covers the main points of the article in 23 minutes (or less if you speed up the audio).
This episode gets a bit meta: an AI-generated podcast discussing the use of AI checker agents to assess the creative output of other AI agents, based on a case study in which I set up the most popular AI agents to complete a creative task and then assess which model did best.
As the AI-generated hosts discuss the key points of my article, they add creative ideas not covered in the original to help explain the concepts. The conversation explains that AI models rely on pattern matching and do not exhibit human-like characteristics; even so, the hosts occasionally assign human qualities to the LLMs as they talk.
We also discover whether NotebookLM can pronounce my last name correctly—spoiler: it can, but seems to choose not to for the sake of comedy!
Here are the key points from the original article:
Takeaways for the agent-of-agents approach to validating model output when leveraging the creativity of generative AI.
Without clear benchmarks for evaluating creative AI outputs, it remains challenging to transition from “human-in-the-loop” oversight, where humans are directly involved, to “human-on-the-loop” oversight, where humans monitor but intervene only when needed. Until then, it is important to have the generative AI models explain their rationale step by step, so that a human, especially the end user of the application in production, can assess on the spot whether the quality assessment can be trusted.
Further research is necessary to scientifically validate this theory and refine the agent-of-agents approach for more consistent quality checks. It appears that agents testing the quality of other agents' output are biased by their own RLHF alignment training, which can undermine the effectiveness of the quality check.
Selecting the right model for each specific testing task is essential when designing an agent-of-agents framework to ensure unbiased quality assessments. Designers should consider how each model's potential bias from its RLHF alignment training could affect the accuracy of the testing agent.
While not strictly necessary, masking model identities in an agent-of-agents setup may enhance confidence in the independence of one generative AI model's assessment of another's output.
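To make these takeaways a bit more concrete, here is a minimal, illustrative sketch in Python of what a masked agent-of-agents check could look like. This is not the setup from the original case study: the call_model helper and the model names are hypothetical placeholders for whichever LLM APIs you use. The checker agent sees shuffled, relabeled submissions (so it cannot tell which model, including one from its own family, wrote which) and is asked to explain its rationale step by step before giving a ranking.

```python
# Minimal sketch of an agent-of-agents quality check with masked model identities.
# Assumes a hypothetical call_model(model_name, prompt) helper wrapping whichever
# LLM API you use; the model names below are placeholders, not recommendations.
import random

CREATOR_MODELS = ["model_a", "model_b", "model_c"]   # agents producing the creative output
CHECKER_MODEL = "model_d"                            # agent assessing the output

def call_model(model_name: str, prompt: str) -> str:
    """Placeholder for your LLM client (wire this up to your own API)."""
    raise NotImplementedError("Connect this to the LLM provider of your choice.")

def run_creative_task(task_prompt: str) -> dict[str, str]:
    """Collect one creative response per creator model."""
    return {model: call_model(model, task_prompt) for model in CREATOR_MODELS}

def check_outputs(task_prompt: str, outputs: dict[str, str]) -> str:
    """Ask the checker agent to rank outputs without revealing which model wrote which."""
    # Mask identities: shuffle and relabel so the checker cannot favor a particular model.
    items = list(outputs.items())
    random.shuffle(items)
    masked = {f"Submission {i + 1}": text for i, (_, text) in enumerate(items)}

    review_prompt = (
        f"Task given to the writers: {task_prompt}\n\n"
        + "\n\n".join(f"{label}:\n{text}" for label, text in masked.items())
        + "\n\nRank the submissions on creativity and quality. "
          "Explain your rationale step by step before giving the final ranking, "
          "so a human reviewer can judge whether to trust your assessment."
    )
    return call_model(CHECKER_MODEL, review_prompt)
```

In a production setting, the step-by-step rationale returned by the checker would be surfaced to the human on the loop, so they can judge on the spot whether the quality assessment is trustworthy.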
And here’s the link to the article if you want to go deeper into the details:
Welcome to the Data Score newsletter, composed by DataChorus LLC. The newsletter is your source for insights into the world of data-driven decision-making. Whether you're an insight seeker, a unique data company, a software-as-a-service provider, or an investor, this newsletter is for you. I'm Jason DeRise, a seasoned expert in the field of data-driven insights. As one of the first 10 members of UBS Evidence Lab, I was at the forefront of pioneering new ways to generate actionable insights from alternative data. Before that, I successfully built a sell-side equity research franchise based on proprietary data and non-consensus insights. After moving on from UBS Evidence Lab, I’ve remained active in the intersection of data, technology, and financial insights. Through my extensive experience as a purchaser, user, and creator of data, I have gained a unique perspective, which I am sharing through the newsletter.