
AI Sycophancy: Why Models Change Answers on Demand

Chen says AI quality requires ongoing experimentation, iteration and improvement


The push for AI that can handle layered, real‑world queries has exposed a gap between impressive benchmarks and the messy reality of decision‑makers who need fully formed answers. While large‑scale tests can showcase fluency, they often stop short of confirming whether a model can stitch together disparate facts into a coherent whole. Chen’s group has taken a different tack: instead of relying solely on headline scores, they built a suite of evaluation tools aimed at measuring how often an answer leaves critical pieces out.

“Getting ‘complete’ answers to multi‑faceted questions” became the yardstick, prompting the team to track gaps, contradictions and omissions across dozens of test cases. Their methodology now feeds back into the development loop, forcing engineers to revisit prompts, tweak training data and re‑run diagnostics. The result is a growing inventory of edge cases that keep the conversation alive.
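As an illustration of what that kind of gap-tracking can look like, the sketch below checks whether an answer covers every required element of a multi-part test question. The test case, answer, and check are hypothetical; the article does not describe the team's actual harness.

```python
# Hypothetical sketch of an omission check over a test suite; not the team's
# actual tooling, just the general pattern the article describes.
test_cases = [
    {
        "question": "What are the filing deadline and required fee for a small-claims appeal?",
        "required_elements": ["deadline", "fee"],
    },
]

def find_omissions(answer: str, required_elements: list[str]) -> list[str]:
    """Return the required elements the answer never mentions (a crude gap check)."""
    return [e for e in required_elements if e.lower() not in answer.lower()]

answer = "The filing deadline is 30 days after the judgment is entered."
gaps = find_omissions(answer, test_cases[0]["required_elements"])
print(gaps)  # ['fee'] -- an edge case that goes back into the development loop
```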

It’s this relentless focus on measurement that frames Chen’s view of what really matters for AI performance.

---

"At the end of the day, what matters most for us is the quality of the AI outcome, and that is a continuous journey of experimentation, iteration and improvement," Chen said.

"At the end of the day, what matters most for us is the quality of the AI outcome, and that is a continuous journey of experimentation, iteration and improvement," Chen said. Getting 'complete' answers to multi-faceted questions To evaluate models and their outputs, Chen's team has established more than a half-dozen "sub metrics" to measure "usefulness" based on several factors -- authority, citation accuracy, hallucination rates -- as well as "comprehensiveness." This particular metric is designed to evaluate whether a gen AI response fully addressed all aspects of a users' legal questions.

Chen’s remarks underscore a simple truth: AI in law can’t rely on accuracy alone. Relevance, authority, citation precision, and low hallucination rates have become non‑negotiable benchmarks. LexisNexis has responded by moving past vanilla retrieval‑augmented generation, experimenting with graph‑based RAG and agentic graphs that promise richer context.
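For readers unfamiliar with the term, graph-based RAG retrieves context by walking relationships between documents rather than matching passages in isolation. The sketch below is a toy version built on a citation graph; it illustrates the general technique under assumed data and is not a description of LexisNexis's system.

```python
# Toy graph-based retrieval: documents are nodes, edges link citing to cited
# cases. Retrieval starts from keyword-matched nodes, then pulls in cited
# neighbors to give the model richer context than isolated passages.
import networkx as nx

def build_citation_graph(docs: dict[str, str], citations: list[tuple[str, str]]) -> nx.DiGraph:
    g = nx.DiGraph()
    for doc_id, text in docs.items():
        g.add_node(doc_id, text=text)
    g.add_edges_from(citations)  # (citing_case, cited_case)
    return g

def graph_retrieve(g: nx.DiGraph, query: str, hops: int = 1) -> list[str]:
    """Return texts of documents matching the query plus their cited neighbors."""
    seeds = [n for n, data in g.nodes(data=True) if query.lower() in data["text"].lower()]
    selected = set(seeds)
    frontier = set(seeds)
    for _ in range(hops):
        frontier = {m for n in frontier for m in g.successors(n)} - selected
        selected |= frontier
    return [g.nodes[n]["text"] for n in selected]

docs = {
    "case_a": "Breach of contract; damages limited by clause 7.",
    "case_b": "Clause 7 interpretation follows case_c on liquidated damages.",
    "case_c": "Liquidated damages must be a genuine pre-estimate of loss.",
}
g = build_citation_graph(docs, [("case_b", "case_c")])
print(graph_retrieve(g, "clause 7"))  # includes case_c via the citation edge
```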

The company also says it has expanded its evaluation framework, though the details remain sketchy. This ongoing experimentation feels more like a marathon than a sprint, and the continuous-iteration model suggests no single release will ever be "complete." Yet whether graph RAG will consistently deliver the nuanced answers lawyers need is still unclear. The approach reflects cautious optimism: progress measured in incremental improvements rather than bold promises.

As the team keeps tweaking models, the real test will be how often the outputs meet the high stakes of legal work without slipping into misinformation. Until that balance is demonstrably achieved, the journey described by Chen remains a work in progress.


Common Questions Answered

What is the FActScore methodology for evaluating language model factuality?

The FActScore paper ([arxiv.org](https://arxiv.org/abs/2305.14251)) introduces FActScore as an evaluation technique that breaks generated text into atomic facts and calculates the percentage of facts supported by a reliable knowledge source. The method allows for a more nuanced assessment of factuality than binary quality judgments, revealing that models like ChatGPT achieve only around 58% factual precision on biographies.

How does FActScore address the challenge of evaluating long-form text generation?

FActScore addresses the complexity of evaluating long-form text by breaking generations into individual atomic facts and systematically checking each fact's support from a reliable knowledge base. The researchers developed both a human evaluation method and an automated model that can estimate the factuality score with less than a 2% error rate, making it possible to evaluate large numbers of generations that would be prohibitively expensive to assess manually.
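The core calculation is simple once the atomic facts and a verifier exist: score = supported facts / total facts. The toy sketch below assumes both are already available; the real method uses a language model to split generations into facts and verify them against a knowledge source such as Wikipedia.

```python
# Toy FActScore-style computation: fraction of atomic facts supported by a
# knowledge source. The fact list and stand-in knowledge base are illustrative.
def fact_score(atomic_facts: list[str], is_supported) -> float:
    """Fraction of atomic facts the knowledge source supports."""
    if not atomic_facts:
        return 0.0
    return sum(1 for fact in atomic_facts if is_supported(fact)) / len(atomic_facts)

# Stand-in knowledge base of known-true statements.
knowledge_base = {
    "Marie Curie was born in Warsaw.",
    "Marie Curie won two Nobel Prizes.",
}
facts = [
    "Marie Curie was born in Warsaw.",
    "Marie Curie won two Nobel Prizes.",
    "Marie Curie was born in 1900.",   # unsupported: she was born in 1867
]
print(fact_score(facts, lambda f: f in knowledge_base))  # ~0.667
```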

What key insights did the FActScore research reveal about different language models?

The research evaluated biographies generated by several state-of-the-art commercial language models, finding significant variations in factuality across different systems. Notably, the study revealed that GPT-4 and ChatGPT are more factual than public models, while Vicuna and Alpaca emerged as some of the best-performing public models in terms of factual precision.