ChatGPT is recommending insurance, and it knows it shouldn't

Large language models (LLMs), and ChatGPT in particular, have become remarkably good at plausibly answering consumer questions. Ask which travel insurer covers epilepsy, which life insurer is best for a family, or which business insurance policy is most suitable for a software company, and it will produce a confident, conversational answer.

The problem is that confidence and competence are not the same thing, as ChatGPT admitted to me.

A Simple Question

Consider a seemingly straightforward question:

"Which travel insurance will cover epilepsy for my trip to Spain?"

ChatGPT confidently responded with:

"I'd start with AllClear."

This was followed by paraphrased marketing information about AllClear, who have a very comprehensive landing page for epilepsy-related insurance. At first glance, this appears helpful. The insurer mentioned does claim to cover epilepsy. The statement may not be factually incorrect.

But there's a deeper question.

On what basis was that recommendation made?

Did the model compare policy wordings? Analyse pricing? Review acceptance criteria? Consider the user's age, seizure history, medication, destination, trip duration and claims history? The answer is no.

The model is often generating a plausible recommendation based on marketing materials, rather than one based on needs and evidence.

The Recommendation Illusion

The challenge is not necessarily factual accuracy. The recommended insurer's extensive marketing material may reflect a superior product that happens to be a better fit than its competitors. The LLM might get lucky. That highlights the bigger problem: the impression of reasoning. Consumers assume that recommendations arise from analysis, especially when delivered confidently and concisely. When a human says "I'd start with AllClear", the listener may infer that the speaker has compared alternatives and reached a conclusion.

LLMs produce the same linguistic signals. The phrase "I'd start with..." sounds modest and conversational. Yet it implicitly conveys preference, judgment and prioritisation. Users may reasonably interpret it as the result of expertise rather than statistical language generation. This creates what might be called a recommendation illusion: language that sounds like advice without being grounded in a robust advisory process.

The Scale Argument

Many discussions about AI regulation focus on individual conversations.

Was this specific statement advice?
Was this specific recommendation personalised?
Was this specific interaction regulated?

These are valid questions, but they may miss the more important issue. LLMs do not operate once. They operate millions of times. A human adviser making an unsupported recommendation affects one customer at a time. An LLM can generate similar recommendation patterns across millions of interactions.

This raises an uncomfortable question:

Should regulators evaluate individual outputs, or should they evaluate the behaviour of the system as a whole?

A recommendation that appears insignificant in isolation becomes far more significant when repeated at industrial scale, especially when it's baseless. When pressed, ChatGPT will admit this.

[Image: Argument with an LLM over insurance regulation]

The Human Comparison

Imagine a call centre employee who repeatedly told customers:

"I'd start with Insurer A."

When challenged, the employee admits:

They have not compared policy terms.
They have not assessed customer suitability.
They have not reviewed competing products.
They have no evidence that Insurer A is preferable.
They have seen a marketing campaign paid for by Insurer A that says Insurer A is strong - and that's what they've based the recommendation on.

Most observers would question whether such behaviour was appropriate. Yet ChatGPT can generate equivalent recommendations because recommendation-like language is deeply embedded in natural conversation.

This creates a difficult regulatory question. If a human repeatedly steers consumers towards products, many would regard that activity as advice or at least as something requiring oversight. If an AI system performs the same function, should the standard be different?

Information Versus Advice

The insurance industry has long relied on a distinction between information and advice.

For example:

"This insurer covers epilepsy."
"These insurers commonly insure travellers with epilepsy."
"These are the typical exclusions."
"These are the price ranges."

These statements provide information. However:

"I'd start with AllClear."
"This is probably your best option."
"You should choose this insurer."

These statements look very much like recommendations.

The challenge is that LLMs are optimised to be helpful - probably over-optimised. Helpful conversation naturally drifts toward making recommendations.

Consumers rarely ask for information alone. They ask:

"What's the best option?"

The desire to appear helpful and answer directly pushes models across the information-advice boundary.

The Governance Challenge

This creates a governance problem that existing frameworks may struggle to address.

Traditional regulation assumes identifiable actors:

advisers
brokers
insurers
comparison sites.

LLMs do not fit neatly into those categories, yet they influence consumer decisions in similar ways. The result is a growing tension:

Models are capable of influencing purchasing decisions.
Models lack the evidential basis expected of regulated advice, but give the impression of having it when rehashing marketing materials.
Users understandably interpret model outputs as advice.
The scale of deployment magnifies the potential impact.

A New Regulatory Question

Regulators need to start asking a different version of an old question:

When an AI system systematically generates recommendation-like outputs about regulated products, what standards of evidence, accountability and oversight should apply?

This is not simply an AI question. It is a consumer protection question. The insurance industry has spent decades building frameworks around suitability, disclosure, accountability and advice standards. Large language models challenge many of those assumptions because they can produce the language of advice without necessarily performing the process behind it. That gap may become one of the defining regulatory challenges of AI in financial services.