Activation Functions: Why ReLU Outperforms Sigmoid in LLMs
Sigmoid plateaus at 0.28 by epoch 400 while ReLU keeps improving
Why does the choice of activation function still matter when training large language models? While the hype often centers on model size or data volume, the math that sits inside each neuron can dictate how far a network will travel during optimization. In a recent experiment comparing two classic nonlinearities, researchers tracked loss curves over hundreds of epochs to see how each behaved when the model tried to extract signal from the same dataset.
The test measured loss at regular checkpoints, noting where progress stalled and where it kept sliding. The results highlight a practical trade‑off: one function gives a quick early boost but soon runs out of useful gradient information, while the other keeps chipping away at error well beyond the midway point. Understanding these dynamics matters for anyone balancing training time, compute budget, and final model quality.
The following observation captures the stark difference between the two approaches.
Sigmoid improves initially but plateaus around ~0.28 by epoch 400, showing almost no progress afterward -- a sign that the network has exhausted the useful signal it can extract. ReLU, in contrast, continues to steadily reduce loss throughout training, dropping from ~0.15 to ~0.03 by epoch 800. This isn't just faster convergence; it reflects a deeper issue: Sigmoid's compression is limiting the flow of meaningful information, causing the model to stall, while ReLU preserves that signal, allowing the network to keep refining its decision boundary.
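The stall has a simple mechanical explanation in the gradients themselves. A minimal NumPy sketch (the function definitions below are standard textbook forms, not code from the experiment) shows why: Sigmoid's derivative peaks at 0.25 and decays quickly away from zero, so each layer can only shrink the backpropagated signal, while ReLU passes gradients through active units at full strength.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # maximum value is 0.25, reached at x == 0

def relu_grad(x):
    return (x > 0).astype(float)  # exactly 1 for any positive input

x = np.array([-4.0, 0.0, 4.0])
print(sigmoid_grad(x))  # gradient shrinks fast away from zero
print(relu_grad(x))

# Backprop multiplies one such factor per layer, so a 10-layer Sigmoid
# chain contributes at most 0.25**10 (about 1e-6) of the upstream
# gradient, while ReLU's active paths pass it through unchanged.
print(0.25 ** 10)
```

This per-layer attenuation compounds with depth, which is consistent with the plateau the experiment observed: once activations drift into Sigmoid's flat regions, almost no gradient survives the trip back through the network.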
The Sigmoid network learns a nearly linear boundary, failing to capture the curved structure of the two-moons dataset, which results in lower accuracy (~79%). This is a direct consequence of its compressed internal representations -- the network simply doesn't have enough geometric signal to construct a complex boundary. In contrast, the ReLU network learns a highly non-linear, well-adapted boundary that closely follows the data distribution, achieving much higher accuracy (~96%).
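The two-moons comparison is easy to reproduce in outline. The article does not specify its exact architecture or optimizer, so the sketch below is an assumption: a small two-hidden-layer MLP from scikit-learn, trained on the same dataset with each activation (scikit-learn names the sigmoid activation `"logistic"`). The exact accuracy gap will depend on depth, optimizer, and training length, so expect numbers in the same spirit as the article's rather than a match.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

# Hypothetical setup: two hidden layers of 16 units; the article does
# not state its architecture, so these choices are illustrative only.
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)

def fit_and_score(activation):
    clf = MLPClassifier(hidden_layer_sizes=(16, 16),
                        activation=activation,
                        solver="lbfgs", max_iter=2000, random_state=0)
    clf.fit(X, y)
    return clf.score(X, y)  # training-set accuracy

acc_sigmoid = fit_and_score("logistic")
acc_relu = fit_and_score("relu")
print(f"sigmoid: {acc_sigmoid:.2f}  relu: {acc_relu:.2f}")
```

Plotting each classifier's predictions over a grid of the input plane makes the boundary shapes the article describes directly visible.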
Both networks start with similar pre-activation magnitudes at the first layer (~2.0), but Sigmoid immediately compresses the signal to ~0.3, while ReLU retains a higher value. As we move deeper, Sigmoid continues to squash the signal into a narrow band (0.5-0.6), effectively erasing meaningful differences between inputs. Because ReLU preserves magnitude across layers, it enables the network to progressively bend and refine the decision surface, turning depth into actual expressive power rather than wasted capacity.
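This layer-by-layer compression can be simulated directly. The sketch below pushes the same random input through a stack of random linear layers, once with Sigmoid and once with ReLU; the depth, width, and He-style initialization are assumptions for illustration, not the article's measured setup. The spread (standard deviation) of Sigmoid activations collapses into a narrow band while ReLU's stays wide.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative depth-8 stack of width-64 random linear layers.
x = rng.normal(size=(256, 64))
h_sig, h_relu = x, x
for layer in range(8):
    W = rng.normal(scale=np.sqrt(2.0 / 64), size=(64, 64))  # He-style init
    h_sig = sigmoid(h_sig @ W)
    h_relu = np.maximum(0.0, h_relu @ W)
    print(f"layer {layer}: sigmoid spread = {h_sig.std():.3f}   "
          f"relu spread = {h_relu.std():.3f}")
```

Sigmoid's outputs are pinned inside (0, 1) with a small spread around 0.5, mirroring the narrow 0.5-0.6 band the article reports, whereas ReLU's activations keep a substantially larger spread at every depth.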
Sigmoid improves initially but stalls. By epoch 400 its loss sits at roughly 0.28 and shows almost no further progress. The article attributes this to the activation’s compression of inputs into the 0‑to‑1 interval, which erodes the geometric information that deeper layers rely on.
ReLU, by contrast, keeps the signal range unbounded on the positive side, preserving distance cues. Consequently, loss continues to fall, dropping from about 0.15 early on to 0.03 by epoch 800, with no sign of the saturation that halts the Sigmoid network.
The data suggest that maintaining spatial context is crucial for sustained learning. Yet the report doesn’t explain whether adjustments to training hyper‑parameters could rescue sigmoid’s performance. It remains unclear if the plateau reflects an inherent limitation of the function or a symptom of the specific network setup.
Overall, the evidence presented favors ReLU for tasks where long‑term loss reduction is needed, while sigmoid’s early gains appear transient. Further experiments would be needed to confirm these observations across different architectures. Such findings could inform activation‑choice guidelines for future model design, though broader validation remains necessary.
Common Questions Answered
How do sigmoid and ReLU activation functions differ in their performance during model training?
Sigmoid activation improves loss initially but quickly plateaus around 0.28 by epoch 400, showing almost no further progress. In contrast, ReLU continues to steadily reduce loss throughout training, dropping from approximately 0.15 to 0.03 by epoch 800, demonstrating more effective signal preservation throughout optimization.
Why does sigmoid activation limit the model's ability to extract meaningful information?
Sigmoid compresses inputs into a 0-to-1 interval, which erodes the geometric information that deeper neural network layers rely on for learning. This compression effectively limits the flow of meaningful information, causing the model to stall and preventing further improvements in model performance.
What advantage does ReLU have over sigmoid in neural network training?
ReLU preserves signal range by remaining unbounded on the positive side, which helps maintain distance cues and geometric information during training. This characteristic allows ReLU to continue reducing loss throughout the training process, unlike sigmoid which quickly reaches a performance plateau.