Visual model exploits similarity of 打, 拍, 拉; text model starts from embeddings

Three renditions of 人工智能 (full, 80% retained, 50% retained) appear side by side. You can read each instantly, even though the latter two show only a slice of the original image. The percentages refer to how much of the picture remains after a horizontal cut at a fixed height, not to individual characters. That simple trick raises a bigger question: does Chinese function as a visual system at its core?
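To make the percentages concrete, here is a minimal sketch of a horizontal cut on an image array. It assumes the top rows are the ones kept (the article does not say which side of the cut survives), and `keep_top_fraction` and the blank 40×160 array are hypothetical stand-ins for a real rendered line of text.

```python
import numpy as np

def keep_top_fraction(image: np.ndarray, fraction: float) -> np.ndarray:
    """Keep the top `fraction` of pixel rows and discard everything below the cut."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    height = image.shape[0]
    rows_to_keep = max(1, round(height * fraction))
    return image[:rows_to_keep]

# Stand-in for a rendered line of text: 40 rows by 160 columns of grayscale pixels.
rendered = np.zeros((40, 160), dtype=np.uint8)

for fraction in (1.0, 0.8, 0.5):
    cropped = keep_top_fraction(rendered, fraction)
    print(f"{fraction:.0%} retained: {cropped.shape[0]} of {rendered.shape[0]} rows")
```

The cut is applied to the whole picture at once, so every character in the line loses the same band of pixels.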

Language models start by tokenizing text, turning every character into an arbitrary ID: “你” becomes 100, “好” becomes 3, and so on. In that process the glyphs’ shapes disappear.

Yet Chinese characters carry stroke patterns, radicals and spatial layouts that convey meaning. Take 打, 拍 and 拉: all share the hand radical 扌, but once they are reduced to IDs 423, 1089 and 2341 the visual relationship vanishes.
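A toy version of that lookup shows how little the IDs retain. This is a sketch only: the `vocab` table reuses the illustrative IDs quoted above, not the output of any real tokenizer, and `encode` is a hypothetical helper.

```python
# Toy vocabulary. The IDs are the illustrative values from the text,
# not those of any real tokenizer.
vocab = {"你": 100, "好": 3, "打": 423, "拍": 1089, "拉": 2341}

def encode(text: str) -> list[int]:
    """Map each character to its integer ID."""
    return [vocab[ch] for ch in text]

print(encode("你好"))    # [100, 3]
print(encode("打拍拉"))  # [423, 1089, 2341]

# The IDs are only row indices into an embedding table. Nothing in
# 423, 1089 and 2341 records that the three glyphs share 扌.
```

Any other assignment of integers would serve the model equally well, which is why the shared radical has to be relearned from context.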

The experiment that follows feeds pixels directly into a model and asks it to output tokens, testing whether preserving visual information changes how a model learns Chinese.

The visual model arrives at training already knowing something useful: that 打, 拍, and 拉 look similar, and probably behave similarly. The text-based model starts with random embeddings and has to figure this out from scratch. The embedding space at initialization, before any training, shows this directly: characters that share a radical already cluster together.
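The contrast at initialization can be sketched with synthetic data. Everything here is an assumption made for illustration: the glyphs are random 32×32 bitmaps that share a left-hand block standing in for a radical, and the "visual embedding" is a fixed, untrained random projection of centred pixels. The article does not describe the actual architecture, so the numbers printed will not match the reported ones.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 1024         # embedding width (assumed)
SIZE = 32          # synthetic glyphs are SIZE x SIZE bitmaps (assumed)
SHARED_COLS = 12   # columns taken up by the shared left-hand component (assumed)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def random_glyph(left=None) -> np.ndarray:
    """Random binary bitmap; optionally reuse a given left-hand component."""
    glyph = (rng.random((SIZE, SIZE)) < 0.5).astype(float)
    if left is not None:
        glyph[:, :SHARED_COLS] = left
    return glyph

def visual_embedding(glyph: np.ndarray, projection: np.ndarray) -> np.ndarray:
    """Centre the pixels, flatten them, and apply a fixed untrained projection."""
    pixels = glyph.ravel() - glyph.mean()
    return pixels @ projection

# Two glyphs that share a "radical", plus one unrelated glyph.
left = (rng.random((SIZE, SHARED_COLS)) < 0.5).astype(float)
glyph_a, glyph_b = random_glyph(left), random_glyph(left)
glyph_c = random_glyph()

projection = rng.standard_normal((SIZE * SIZE, DIM)) / np.sqrt(SIZE * SIZE)
emb_a, emb_b, emb_c = (visual_embedding(g, projection) for g in (glyph_a, glyph_b, glyph_c))

# Text model: every character starts as an independent random vector.
sim_token = cosine(rng.standard_normal(DIM), rng.standard_normal(DIM))
sim_shared = cosine(emb_a, emb_b)
sim_unrelated = cosine(emb_a, emb_c)

print(f"random token embeddings:       {sim_token:+.3f}")
print(f"visual, shared component:      {sim_shared:+.3f}")
print(f"visual, no shared component:   {sim_unrelated:+.3f}")
```

Independent random vectors in a high-dimensional space are nearly orthogonal, so the token pair sits close to zero, while the pair with a shared component keeps a clearly positive similarity because the projection roughly preserves the overlap in pixel space.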

Cosine similarity for radical-sharing pairs: ~0.27 for visual embeddings, ~0.002 for random token embeddings.

Why the Race Ends in a Tie

Here's the key thing: the visual prior encodes visual similarity, but not linguistic co-occurrence.

Why this matters

Can a visual front‑end give AI a shortcut? The experiment shows a visual model already “knowing” that 打, 拍 and 拉 share form, while a text‑only model starts from random embeddings and must discover that similarity itself. We see that even heavily cropped versions of 人工智能 remain legible, suggesting the Chinese script encodes recognisable patterns that survive image loss.

For developers, this hints that visual tokenisation might reduce the learning burden for character‑rich languages, but we have no evidence yet that the head start translates into better performance on downstream tasks. Founders should weigh the added complexity of processing images against the potential gain in early convergence; the trade‑off remains unclear. Researchers are left with a concrete data point: visual similarity can be baked in at initialization, yet it is uncertain whether this advantage persists once models are fine‑tuned on diverse corpora.

Ultimately, the finding invites a cautious look at multimodal approaches for languages where glyph shape carries semantic weight, without assuming it will automatically outweigh established text‑only pipelines.
