Activation Functions: Why ReLU Outperforms Sigmoid in LLMs
Sigmoid plateaus at 0.28 by epoch 400 while ReLU keeps improving
Why does the choice of activation function still matter when training large language models? While the hype often centers on model size or data volume, the math that sits inside each neuron can dictate how far a network will travel during optimization. In a recent experiment comparing two classic nonlinearities, researchers tracked loss curves over hundreds of epochs to see how each behaved when the model tried to extract signal from the same dataset.
The test measured loss at regular checkpoints, noting where progress stalled and where it kept sliding. The results highlight a practical trade‑off: one function gives a quick early boost but soon runs out of useful gradient information, while the other keeps chipping away at error well beyond the midway point. Understanding these dynamics matters for anyone balancing training time, compute budget, and final model quality.
The following observation captures the stark difference between the two approaches.
Sigmoid improves initially but plateaus around ~0.28 by epoch 400, showing almost no progress afterward -- a sign that the network has exhausted the useful signal it can extract. ReLU, in contrast, continues to steadily reduce loss throughout training, dropping from ~0.15 to ~0.03 by epoch 800. This isn't just faster convergence; it reflects a deeper issue: Sigmoid's compression is limiting the flow of meaningful information, causing the model to stall, while ReLU preserves that signal, allowing the network to keep refining its decision boundary.
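The stall has a simple mechanical explanation in the gradients themselves. A minimal NumPy sketch (the function definitions below are standard textbook forms, not code from the experiment) shows why: Sigmoid's derivative peaks at 0.25 and decays quickly away from zero, so each layer can only shrink the backpropagated signal, while ReLU passes gradients through active units at full strength.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # maximum value is 0.25, reached at x == 0

def relu_grad(x):
    return (x > 0).astype(float)  # exactly 1 for any positive input

x = np.array([-4.0, 0.0, 4.0])
print(sigmoid_grad(x))  # gradient shrinks fast away from zero
print(relu_grad(x))

# Backprop multiplies one such factor per layer, so a 10-layer Sigmoid
# chain contributes at most 0.25**10 (about 1e-6) of the upstream
# gradient, while ReLU's active paths pass it through unchanged.
print(0.25 ** 10)
```

This per-layer attenuation compounds with depth, which is consistent with the plateau the experiment observed: once activations drift into Sigmoid's flat regions, almost no gradient survives the trip back through the network.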
The Sigmoid network learns a nearly linear boundary, failing to capture the curved structure of the two-moons dataset, which results in lower accuracy (~79%). This is a direct consequence of its compressed internal representations -- the network simply doesn't have enough geometric signal to construct a complex boundary. In contrast, the ReLU network learns a highly non-linear, well-adapted boundary that closely follows the data distribution, achieving much higher accuracy (~96%).
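The two-moons comparison is easy to reproduce in outline. The article does not specify its exact architecture or optimizer, so the sketch below is an assumption: a small two-hidden-layer MLP from scikit-learn, trained on the same dataset with each activation (scikit-learn names the sigmoid activation `"logistic"`). The exact accuracy gap will depend on depth, optimizer, and training length, so expect numbers in the same spirit as the article's rather than a match.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

# Hypothetical setup: two hidden layers of 16 units; the article does
# not state its architecture, so these choices are illustrative only.
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)

def fit_and_score(activation):
    clf = MLPClassifier(hidden_layer_sizes=(16, 16),
                        activation=activation,
                        solver="lbfgs", max_iter=2000, random_state=0)
    clf.fit(X, y)
    return clf.score(X, y)  # training-set accuracy

acc_sigmoid = fit_and_score("logistic")
acc_relu = fit_and_score("relu")
print(f"sigmoid: {acc_sigmoid:.2f}  relu: {acc_relu:.2f}")
```

Plotting each classifier's predictions over a grid of the input plane makes the boundary shapes the article describes directly visible.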
Both networks start with similar pre-activation magnitudes at the first layer (~2.0), but Sigmoid immediately compresses the signal to ~0.3, while ReLU retains a higher value. As we move deeper, Sigmoid continues to squash the signal into a narrow band (0.5-0.6), effectively erasing meaningful differences between inputs. Because ReLU preserves magnitude across layers, it enables the network to progressively bend and refine the decision surface, turning depth into actual expressive power rather than wasted capacity.
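This layer-by-layer compression can be simulated directly. The sketch below pushes the same random input through a stack of random linear layers, once with Sigmoid and once with ReLU; the depth, width, and He-style initialization are assumptions for illustration, not the article's measured setup. The spread (standard deviation) of Sigmoid activations collapses into a narrow band while ReLU's stays wide.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative depth-8 stack of width-64 random linear layers.
x = rng.normal(size=(256, 64))
h_sig, h_relu = x, x
for layer in range(8):
    W = rng.normal(scale=np.sqrt(2.0 / 64), size=(64, 64))  # He-style init
    h_sig = sigmoid(h_sig @ W)
    h_relu = np.maximum(0.0, h_relu @ W)
    print(f"layer {layer}: sigmoid spread = {h_sig.std():.3f}   "
          f"relu spread = {h_relu.std():.3f}")
```

Sigmoid's outputs are pinned inside (0, 1) with a small spread around 0.5, mirroring the narrow 0.5-0.6 band the article reports, whereas ReLU's activations keep a substantially larger spread at every depth.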
Sigmoid improves initially but stalls. By epoch 400 its loss sits at roughly 0.28 and shows almost no further progress. The article attributes this to the activation’s compression of inputs into the 0‑to‑1 interval, which erodes the geometric information that deeper layers rely on.
ReLU, by contrast, keeps the signal range unbounded on the positive side, preserving distance cues. Consequently, loss continues to fall, dropping from about 0.15 early on to 0.03 by epoch 800, with no sign of the saturation that halts the Sigmoid network.
The data suggest that maintaining spatial context is crucial for sustained learning. Yet the report doesn’t explain whether adjustments to training hyper‑parameters could rescue sigmoid’s performance. It remains unclear if the plateau reflects an inherent limitation of the function or a symptom of the specific network setup.
Overall, the evidence presented favors ReLU for tasks where long‑term loss reduction is needed, while sigmoid’s early gains appear transient. Further experiments would be needed to confirm these observations across different architectures. Such findings could inform activation‑choice guidelines for future model design, though broader validation remains necessary.
Common Questions Answered
How do sigmoid and ReLU activation functions differ in their performance during model training?
Sigmoid activation improves loss initially but quickly plateaus around 0.28 by epoch 400, showing almost no further progress. In contrast, ReLU continues to steadily reduce loss throughout training, dropping from approximately 0.15 to 0.03 by epoch 800, demonstrating more effective signal preservation throughout optimization.
Why does sigmoid activation limit the model's ability to extract meaningful information?
Sigmoid compresses inputs into a 0-to-1 interval, which erodes the geometric information that deeper neural network layers rely on for learning. This compression effectively limits the flow of meaningful information, causing the model to stall and preventing further improvements in model performance.
What advantage does ReLU have over sigmoid in neural network training?
ReLU preserves signal range by remaining unbounded on the positive side, which helps maintain distance cues and geometric information during training. This characteristic allows ReLU to continue reducing loss throughout the training process, unlike sigmoid which quickly reaches a performance plateau.