Visual model exploits similarity of 打, 拍, 拉; text model...

By AI Daily Post • Edited by Brian Petersen, Editor-in-Chief

June 12, 2026 • Updated: July 15, 2026 • 3 min read

Being able to see is not the same as being able to read. A new comparison of training methods for Chinese characters shows why.

The visual model gets a head start. It recognizes that the characters 打, 拍, and 拉 all share the same hand radical, 扌. Before it reads a single sentence, it knows they’re probably related.

It begins with this structural clue baked in. The text-based model starts with nothing. Its initial embeddings are random, a blank slate.

It has to find these connections through brute statistical force, parsing millions of word pairings.

The numbers show the gap. Before training even starts, characters sharing a radical cluster tightly together in the visual model’s space. Their cosine similarity is about 0.27.

For the untrained text model, that measure is 0.002. Basically zero. One model begins with a hunch.

The other begins in the dark.
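To make the two starting points concrete, here is a minimal sketch in Python with NumPy. It is not the experiment's code. The text model is represented by independent random vectors, and the visual model by a toy stand-in in which each character's vector is a shared "radical" component plus a character-specific component. The embedding width, the `shared_weight` value, and all names are assumptions chosen for illustration; the weight was picked so the toy lands near the reported figure, not derived from the experiment.

```python
import numpy as np


def mean_pairwise_cosine(vectors: np.ndarray) -> float:
    """Average cosine similarity over all distinct pairs of rows."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T
    upper = np.triu_indices(len(vectors), k=1)  # each pair once, no diagonal
    return float(sims[upper].mean())


rng = np.random.default_rng(0)
dim = 4096  # hypothetical embedding width

# Text-model stand-in: three unrelated random vectors (for 打, 拍, 拉).
random_init = rng.standard_normal((3, dim))

# Visual-model stand-in: a shared component (the common radical) plus a
# per-character component. With weight w, the expected cosine between two
# such vectors is roughly w**2 / (w**2 + 1).
shared_weight = 0.6  # hypothetical, for illustration only
shared = rng.standard_normal(dim)
unique = rng.standard_normal((3, dim))
visual_init = shared_weight * shared + unique

random_sim = mean_pairwise_cosine(random_init)
visual_sim = mean_pairwise_cosine(visual_init)
print(f"random init:      {random_sim:.3f}")
print(f"shared component: {visual_sim:.3f}")
```

In high dimensions, independent random vectors are nearly orthogonal, which is why a randomly initialized embedding table scores close to zero on this measure. Any shared structure, however it is obtained, pushes the number up.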

“The visual model arrives at training already knowing something useful: that 打, 拍, and 拉 look similar, and probably behave similarly. The text-based model starts with random embeddings and has to figure this out from scratch.”

Source: “Is Language Visual? An Experiment with Chinese Characters,” Towards Data Science

Here’s the twist. That visual head start only covers appearance. It knows the characters look alike.

It has no idea if they are used alike. The radical hints at a family resemblance, not a functional one. To the text model, grinding through context after context, these characters are strangers at first.

It must learn that a hand that strikes, a hand that claps, and a hand that pulls operate in completely different verbal neighborhoods.

And yet, after all the training, the two models perform the same. The race ends in a tie. The visual shortcut gets you off the starting line faster.

It doesn’t change the finish. Language is not about how a word looks on the page. It’s about the company it keeps.

Syntax and semantics live in the patterns of use, patterns no radical can predict. The visual prior is just that, a prior. Not a prophecy.
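The idea that a character is defined by the company it keeps can be sketched with plain co-occurrence counts. The example below is an illustration only: the six-phrase corpus is invented for this sketch, and `context_counts` and `cosine` are hypothetical helper names. A real text model learns dense embeddings from far more data rather than counting neighbors directly.

```python
from collections import Counter
import math

# Hypothetical toy corpus, invented for illustration:
# 打电话 (make a call), 打球 (play ball), 拍手 (clap hands),
# 拍照 (take a photo), 拉门 (pull a door), 拉车 (pull a cart).
corpus = ["打电话", "打球", "拍手", "拍照", "拉门", "拉车"]


def context_counts(char: str, texts: list[str], window: int = 1) -> Counter:
    """Count the characters that appear within `window` positions of `char`."""
    counts: Counter = Counter()
    for text in texts:
        for i, current in enumerate(text):
            if current != char:
                continue
            start = max(0, i - window)
            stop = min(len(text), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    counts[text[j]] += 1
    return counts


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


da = context_counts("打", corpus)
pai = context_counts("拍", corpus)
la = context_counts("拉", corpus)

print("打 vs 拍:", cosine(da, pai))
print("打 vs 拉:", cosine(da, la))
```

In this toy corpus the three characters never share a neighbor, so their context vectors do not overlap at all, even though all three carry the same radical. Real text is messier and some contexts would overlap, but the point stands: the similarity comes from usage, not from the glyph.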

Common Questions Answered

Why does the visual model have an advantage when learning Chinese characters like 打, 拍, and 拉?

The visual model recognizes that these three characters all share the same hand-shaped radical, giving it a structural clue about their relationship before processing any text. This visual similarity provides a head start by allowing the model to understand that these characters are probably related based on their appearance alone.

How does the text-based model's initial state differ from the visual model when learning Chinese characters?

The text-based model starts with random embeddings as a blank slate, with no prior knowledge of character relationships or visual similarities. Unlike the visual model, it must discover connections between characters entirely through context and usage patterns in training data.

What is the key limitation of the visual model's understanding of Chinese character radicals?

While the visual model can recognize that characters like 打, 拍, and 拉 share a hand radical and look alike, it cannot determine how these characters are actually used in language. The radical only hints at visual family resemblance, not functional similarity, so the model lacks understanding of their different semantic and contextual meanings.

How do the text model and visual model differ in learning that 打, 拍, and 拉 operate in different verbal contexts?

The text model must learn through extensive context exposure that a hand that strikes, a hand that claps, and a hand that pulls operate in completely different verbal neighborhoods. The visual model's initial advantage of recognizing the shared radical does not help it understand these functional differences, which only emerge through analyzing actual language usage patterns.
