TensorFlow Emotion Dataset with 54,263 Texts Shows Class Imbalance
For a recent project we needed to read emotions in online media, but we also wanted an open‑weight model, a permissive license and full transparency. The team leaned toward European‑origin models, yet Hugging Face didn’t list a Mistral version with a proper model card. That gap pushed us to look at the most detailed public set for emotion work: the GoEmotions corpus.
It contains about 58,000 Reddit comments, each tagged with one or more of 27 emotion labels or “neutral,” and it is notoriously class-imbalanced. Fine-tuning a small language model on such skewed data isn't straightforward; it demands more than a simple train-test split. We tackled the imbalance in three steps: we undersampled the dominant category, we synthetically boosted the rare classes using Nature's 2025 ISMOTE algorithm, and we applied loss-function weighting.
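As a rough illustration of the loss-weighting step, the sketch below computes inverse-frequency class weights and applies them to a cross-entropy term. It assumes one label per text and uses the common "balanced" heuristic (samples divided by classes times class count); the article does not state which weighting scheme was actually used, and the toy labels and function names are hypothetical.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def weighted_cross_entropy(probs, true_label, weights):
    """Cross-entropy for one example, scaled by the weight of its true class."""
    return -weights[true_label] * math.log(probs[true_label])


# Toy labels, made up for illustration (not the real GoEmotions counts).
toy_labels = ["neutral"] * 6 + ["joy"] * 3 + ["fear"] * 1
weights = inverse_frequency_weights(toy_labels)

# The same predicted probability costs more when the true class is rare.
probs = {"neutral": 0.5, "joy": 0.3, "fear": 0.2}
loss_if_neutral = weighted_cross_entropy({"neutral": 0.5}, "neutral", weights)
loss_if_fear = weighted_cross_entropy({"fear": 0.5}, "fear", weights)
```

With weights like these, a mistake on a rare emotion contributes more to the loss than the same mistake on the dominant category.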
The result is MistralSmall‑3.1.GoEmotions, now on Hugging Face, which reaches an F1 score above 0.7 on most target emotions. This piece walks through the preprocessing, the ISMOTE‑based augmentation, and the practicalities of adapting a small language model for nuanced emotion recognition.
The dataset was released on TensorFlow Datasets under the Apache 2.0 License and contains 54,263 labeled texts. A quick check shows a strong class imbalance in the data, with the neutral category dominating.
3. Training set preprocessing
Our goal is to develop a classifier to identify 15 emotions in general-language texts.
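The quick imbalance check mentioned above can be sketched as a simple label count. This is a minimal example that assumes each record carries one boolean field per emotion; the toy rows, field names, and the `label_distribution` helper are hypothetical stand-ins, not the actual dataset or the authors' code.

```python
from collections import Counter


def label_distribution(rows, label_names):
    """Return (label, count, share of rows) tuples, most frequent first."""
    counts = Counter()
    for row in rows:
        for name in label_names:
            if row.get(name):
                counts[name] += 1
    total = len(rows)
    return [(name, n, n / total) for name, n in counts.most_common()]


# Hypothetical rows: one text field plus one boolean per emotion.
toy_rows = (
    [{"comment_text": "ok", "neutral": True}] * 4
    + [{"comment_text": "great!", "joy": True}] * 2
    + [{"comment_text": "scary", "fear": True}] * 1
)

distribution = label_distribution(toy_rows, ["neutral", "joy", "fear"])
for name, count, share in distribution:
    print(f"{name:<10} {count:>3} {share:.1%}")
```

Running the same count on the training split before and after preprocessing makes it easy to see how far the balance has shifted.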
Training on class-imbalanced data can introduce bias: the fine-tuned model tends to favor the majority class and perform worse on the minority ones, so preprocessing is essential. To address the imbalance and maximize performance on the target emotions (fear, sadness, disgust, disapproval, annoyance, anger, disappointment, optimism, amusement, surprise, admiration, excitement, confusion, joy, love), we applied a combination of methods to the training set only; the validation and test sets remained unchanged:
- We thinned the data by randomly filtering out "neutral" rows.
- We generated synthetic samples for the least-represented emotional categories using ISMOTE (Improved Synthetic Minority Over-sampling Technique).
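The two steps above can be sketched roughly as follows. Note the assumptions: the oversampling function implements plain SMOTE-style interpolation between nearest minority neighbours, not the ISMOTE variant itself, and it presumes the texts have already been turned into numeric feature vectors (the article does not say how ISMOTE was applied to text). All names, the keep fraction, and the toy data are hypothetical.

```python
import random


def undersample_majority(rows, label_key, majority_label, keep_fraction, seed=0):
    """Keep a random fraction of the majority-label rows and all other rows."""
    rng = random.Random(seed)
    majority = [r for r in rows if r[label_key] == majority_label]
    others = [r for r in rows if r[label_key] != majority_label]
    n_keep = round(len(majority) * keep_fraction)
    result = others + rng.sample(majority, n_keep)
    rng.shuffle(result)
    return result


def smote_like(vectors, n_new, k=3, seed=0):
    """Create n_new synthetic points between minority vectors (classic SMOTE idea).

    Each new point lies on the segment between a random minority vector
    and one of its k nearest minority neighbours. Needs at least 2 vectors.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(vectors)
        neighbours = [v for v in vectors if v is not base]
        # Sort by squared Euclidean distance to the base vector.
        neighbours.sort(key=lambda v: sum((a - b) ** 2 for a, b in zip(base, v)))
        neighbour = rng.choice(neighbours[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy training rows: "neutral" dominates (illustrative data only).
toy_rows = (
    [{"text": f"n{i}", "label": "neutral"} for i in range(10)]
    + [{"text": f"j{i}", "label": "joy"} for i in range(3)]
    + [{"text": "f0", "label": "fear"}]
)
thinned = undersample_majority(toy_rows, "label", "neutral", keep_fraction=0.3)

# Toy 2-D feature vectors standing in for embeddings of a rare class.
minority_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = smote_like(minority_vectors, n_new=5, k=2)
```

Both functions would be applied to the training split only, in line with the text above, so that validation and test scores still reflect the original distribution.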
Why this matters
We see a concrete step toward accessible emotion AI: a 54,263-text dataset released under Apache 2.0, ready for TensorFlow users. Its open license and transparent provenance align with our push for lower-cost, open-weight models. Yet the class distribution is heavily skewed, with neutral dominating.
That imbalance could bias any fine‑tuned system, especially when developers expect balanced emotional coverage. Our team notes the lack of a European‑sourced Mistral model on Hugging Face, which may limit options for those preferring regional compliance. The dataset’s size is respectable, but without clear mitigation strategies for the imbalance, performance on under‑represented emotions remains uncertain.
Researchers will need to weigh the trade‑off between open access and potential bias. In short, the release adds a useful building block, but its practical impact hinges on how the community addresses the skewed label distribution and the current gap in ready‑made European model cards.