

Qwen 3.6-35B-A3B Demo Implements Multimodal Inference, Thinking Control and RAG


Why does this matter? Because the Qwen 3.6‑35B‑A3B demo isn’t just another language-model showcase: it stitches together multimodal inference, thinking control, tool calling, MoE routing, RAG and session persistence into a single pipeline. While the codebase is openly available, the real question is how the model’s “thinking budget” is enforced during generation.

Here’s the thing: the demo caps internal deliberation at 150 tokens, then hands the remainder of the output to a standard generation step that can emit up to 1,200 new tokens. The snippet below prints the actual token count used for thinking and displays the final answer, or notes that the output was truncated. Seeing the budget in action clarifies how developers can balance compute constraints against response length, especially when the model is toggling between visual inputs and tool calls.

The following excerpt illustrates that mechanism in practice.

```python
# …Explain.")  (excerpt begins mid-script)
budget = ThinkingBudget(processor.tokenizer, budget=150)
think, ans = c.generate(enable_thinking=True, max_new_tokens=1200,
                        stopping_criteria=StoppingCriteriaList([budget]))
print(f"Thinking ~{len(processor.tokenizer.encode(think))} tok | Answer:\n{ans or '(truncated)'}")

print("\n" + "="*20, "§5 streaming split", "="*20)
c = QwenChat(model, processor)
c.user("Explain why transformers scale better than RNNs, in two short paragraphs.")
print("[THINKING >>] ", end="", flush=True)
first = [True]
def _ot(x):
    print(x, end="", flush=True)
def _oa(x):
    if first[0]:
        print("\n\n[ANSWER >>] ", end="", flush=True)
        first[0] = False
    print(x, end="", flush=True)
c.stream(enable_thinking=True, preset="thinking_general", max_new_tokens=700,
         on_thinking=_ot, on_answer=_oa)
print()

print("\n" + "="*20, "§6 vision", "="*20)
IMG = "https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3.5/demo/CI_Demo/mathv-1327.jpg"
c = QwenChat(model, processor)
c.history.append({"role": "user", "content": [
    {"type": "image", "image": IMG},
    {"type": "text", "text": "Describe this figure in one sentence, then state what it's asking."}]})
_, ans = c.generate(enable_thinking=False, preset="instruct_general", max_new_tokens=300)
print("Describe:", ans)

GRD = "https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3.6/demo/RealWorld/RealWorld-04.png"
c = QwenChat(model, processor)
c.history.append({"role": "user", "content": [
    {"type": "image", "image": GRD},
    {"type": "text", "text": "Locate every distinct object.
```
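The `ThinkingBudget` class itself is not shown in the excerpt. Below is a minimal sketch of how such a budget could work, mirroring the call convention of a `transformers` `StoppingCriteria` without the dependency; the class shape and the `</think>` delimiter check are assumptions, not the demo’s actual code:

```python
# Hypothetical sketch of a thinking-budget stopping criterion. The real demo
# presumably subclasses transformers.StoppingCriteria; this stand-in keeps the
# same __call__ convention so the budgeting logic is visible on its own.
class ThinkingBudget:
    """Signal a stop once `budget` tokens have been generated past the
    prompt while the model is still inside its reasoning block."""

    def __init__(self, tokenizer, budget=150):
        self.tokenizer = tokenizer
        self.budget = budget
        self.prompt_len = None  # recorded on the first call

    def __call__(self, input_ids, scores=None, **kwargs):
        # input_ids: token ids for the prompt plus everything generated so far
        if self.prompt_len is None:
            self.prompt_len = len(input_ids)
            return False
        generated = input_ids[self.prompt_len:]
        text = self.tokenizer.decode(generated)
        if "</think>" in text:  # model closed its reasoning voluntarily
            return False
        return len(generated) >= self.budget
```

When the criterion fires, generation halts at the budget boundary and the caller can fall back to a plain answer pass, which is one plausible reading of the `(truncated)` fallback in the excerpt’s print statement.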

The demo stitches together environment setup, adaptive model loading and a chat wrapper that can emit both raw replies and explicit reasoning traces. By probing the tokenizer, the script caps the thinking budget at 150 tokens, then lets the model generate up to 1,200 new tokens while streaming the thought and answer segments separately. The printed output shows a token count for the reasoning segment and the final answer, or a truncation notice if the limit is hit.
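The split between the two streams can be reproduced independently of the model. The sketch below is illustrative only: it assumes the reasoning trace ends with a literal `</think>` marker, which the article does not confirm, and routes incoming text chunks to separate callbacks:

```python
# Illustrative stream splitter: forwards chunks to on_thinking until a
# "</think>" delimiter appears, then forwards the remainder to on_answer.
def split_stream(chunks, on_thinking, on_answer, delim="</think>"):
    buf, in_thinking = "", True
    for chunk in chunks:
        if not in_thinking:
            on_answer(chunk)
            continue
        buf += chunk
        if delim in buf:
            head, tail = buf.split(delim, 1)
            if head:
                on_thinking(head)
            if tail:
                on_answer(tail)
            buf, in_thinking = "", False
        elif len(buf) > len(delim):
            # flush text that can no longer be a prefix of the delimiter,
            # keeping a tail in case the marker is split across chunks
            safe, buf = buf[:-len(delim)], buf[-len(delim):]
            on_thinking(safe)
    if buf:
        (on_thinking if in_thinking else on_answer)(buf)
```

Buffering the last `len(delim)` characters handles a delimiter that arrives split across two chunks, the usual hazard with token-level streaming.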

This hands‑on walk‑through demonstrates that Qwen 3.6‑35B‑A3B can be wired into a workflow that respects GPU memory constraints and enforces a budget on internal deliberation. Yet the tutorial stops short of benchmarking latency, accuracy or robustness across diverse multimodal inputs. It remains unclear whether the same approach scales cleanly to larger deployments or more complex tool‑calling scenarios.

The code offers a reproducible baseline, but further testing will be needed to gauge how the model’s MoE routing and RAG components behave under real‑world loads.


Common Questions Answered

How does the Qwen 3.6-35B-A3B demo implement a 'thinking budget' during model generation?

The demo uses a ThinkingBudget stopping criterion that caps internal deliberation at 150 tokens before the final answer is produced. The model performs its reasoning within that token limit, then generates a response of up to 1,200 new tokens, with the reasoning trace and the answer streamed separately.

What key features are integrated into the Qwen 3.6-35B-A3B demo's pipeline?

The demo combines multiple advanced AI techniques including multimodal inference, thinking-control, tool calling, MoE (Mixture of Experts) routing, retrieval-augmented generation (RAG), and session persistence. This integrated approach allows for more sophisticated and context-aware language model interactions.
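The article does not show the demo’s retrieval internals, so the following is a generic, dependency-free RAG sketch: bag-of-words cosine similarity stands in for a real embedding model, and `retrieve` and `build_prompt` are hypothetical names, not the demo’s API:

```python
# Generic RAG sketch: score documents against the query with bag-of-words
# cosine similarity, then prepend the best match as context for the model.
import math
from collections import Counter

def cosine(a, b):
    num = sum(a[t] * b[t] for t in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def retrieve(query, docs, k=1):
    q = Counter(query.lower().split())
    ranked = sorted(docs, key=lambda d: cosine(q, Counter(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_prompt(query, docs):
    context = "\n".join(retrieve(query, docs))
    return f"Context:\n{context}\n\nQuestion: {query}"
```

A production pipeline would swap the word-count vectors for dense embeddings, but the control flow, retrieval followed by prompt assembly, stays the same.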

How does the demo handle streaming output during model generation?

The demo routes the model's reasoning trace and final answer into separate streams and tracks the token count for each segment. A custom stopping criterion lets the script capture the model's internal thinking before the main response is generated, providing transparency into its reasoning.