Model 5 tops penalized PR-AUC, recall and F1-score in scoring model training

All the code for this section is on GitHub in src/selection/logit_model_selection.py, with the accompanying analysis in 08_logistic_model_selection.qmd. A scoring model is more than an algorithm that performs well on a training set. In a professional credit‑risk setting it must also be statistically sound, stable over time, interpretable, aligned with business expectations, and easy to monitor once deployed.

Earlier pieces in the series walked through dataset construction, exploratory analysis, variable preparation, predictor pre‑selection, and stability testing. This article covers what many consider the most critical phase: training candidate models and picking the final one. The methodology blends statistical rigor with business and operational criteria, aiming for a model that meets all of these requirements rather than performance metrics alone.

Tools like ChatGPT, Codex, and GitHub Copilot can generate code, automate loops, run statistical tests, and format results; in this work we lean on Codex to see how well it handles each task. The piece is split into three parts, each building toward a reproducible, transparent scoring‑model pipeline.

A meaningful model must achieve a PR-AUC substantially above this threshold. Model 5 achieves the best penalized PR-AUC, the best penalized recall, and the best penalized F1-score. If the primary objective is the operational detection of defaults using a classification threshold, Model 5 is a compelling option.
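To make the three metrics concrete, here is a minimal pure-Python sketch; it is not the repository's implementation. It computes precision, recall and F1 for the default class at a fixed cut-off, estimates PR-AUC as average precision, and compares it with the default rate, which is the PR-AUC a no-skill classifier would be expected to reach. Reading "this threshold" as that default rate is an assumption, and the function names and toy data are hypothetical.

```python
def classification_metrics(y_true, scores, threshold=0.5):
    """Precision, recall and F1 for defaults (label 1) at a fixed cut-off."""
    pairs = list(zip(y_true, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def average_precision(y_true, scores):
    """PR-AUC estimated as average precision (assumes no tied scores)."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at this default's rank
    return total / hits if hits else 0.0


def default_rate(y_true):
    """Share of defaults: the assumed no-skill reference for PR-AUC."""
    return sum(y_true) / len(y_true)


# Toy data for illustration only, not taken from the article.
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.6, 0.2, 0.1]

precision, recall, f1 = classification_metrics(y_true, scores, threshold=0.5)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(f"PR-AUC={average_precision(y_true, scores):.3f} "
      f"baseline={default_rate(y_true):.3f}")
```

Recall and F1 depend on the chosen cut-off, whereas average precision summarizes the whole ranking, so the two kinds of metric can favor different models.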

However, for a scoring model, the main criterion remains the ability to rank risk (that is, the Gini index), particularly on the test and out-of-time datasets, and, in our case, the penalized Gini. Model 4 offers the best overall trade-off for the following reasons:

- It achieves the highest penalized Gini at 56.01%, reflecting strong and stable discriminatory power across datasets.
- It improves marginally on Model 3 by incorporating the variable cb_person_default_on_file, which adds meaningful risk information.
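As a rough illustration of the ranking criterion, the Gini index can be derived from the ROC AUC through the standard identity Gini = 2 × AUC − 1. The sketch below computes it in pure Python for several samples; the helper names, the sample labels and the toy data are assumptions for illustration, and the pairwise AUC is written for clarity rather than speed.

```python
def auc(y_true, scores):
    """ROC AUC as the share of (default, non-default) pairs ranked correctly.

    Ties count as half a correct pair. Quadratic in sample size, so this
    is meant for small illustrative inputs only.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def gini(y_true, scores):
    """Gini index from the identity Gini = 2 * AUC - 1."""
    return 2 * auc(y_true, scores) - 1


def gini_by_sample(samples):
    """Gini for each named sample, e.g. train, test and out-of-time."""
    return {name: gini(y, s) for name, (y, s) in samples.items()}


# Toy samples for illustration only, not taken from the article.
samples = {
    "train": ([1, 0, 1, 0, 0, 1, 0, 0], [0.9, 0.8, 0.7, 0.4, 0.3, 0.6, 0.2, 0.1]),
    "test": ([1, 0, 0, 1, 0, 0], [0.8, 0.6, 0.3, 0.5, 0.7, 0.1]),
}

for name, value in gini_by_sample(samples).items():
    print(f"{name}: Gini = {value:.2%}")
```

Comparing the values across samples is what reveals whether discriminatory power holds up outside the training data.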

Why this matters

Model 5’s lead on penalized PR‑AUC, recall and F1‑score is noteworthy, yet it raises practical questions. The repository exposes the full pipeline in src/selection/logit_model_selection.py and the accompanying analysis in 08_logistic_model_selection.qmd, which supports reproducibility. However, the article warns that “speed creates a risk,” implying that rapid iteration may sacrifice robustness.

Topping the metrics on a training sample does not automatically translate into reliable default detection in production. We must ask whether the reported PR‑AUC is sufficiently above the unspecified threshold to be meaningful. If our goal is operational detection of defaults at a fixed classification threshold, Model 5 appears compelling, but the trade‑off between speed and stability remains unclear.

Developers should scrutinize the penalization scheme and test against out‑of‑sample data before committing. For founders, the takeaway is that strong metric performance alone does not guarantee business‑critical reliability. Researchers can use the open code to probe these uncertainties further.
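One way to scrutinize a penalization scheme is to check whether the model ranking changes as the penalty gets stricter. The article does not describe its scheme, so the sketch below uses a purely hypothetical one: the mean of the held-out scores minus a weight times the largest gap between the training score and any held-out score. The candidate names and values are toy numbers, not results from the article.

```python
def penalized_score(by_sample, weight):
    """Hypothetical penalty, NOT the article's scheme.

    Mean of the held-out (test, out-of-time) scores, minus `weight`
    times the largest absolute gap between train and a held-out score.
    """
    held_out = [by_sample["test"], by_sample["oot"]]
    mean_held_out = sum(held_out) / len(held_out)
    worst_gap = max(abs(by_sample["train"] - value) for value in held_out)
    return mean_held_out - weight * worst_gap


def rank_candidates(candidates, weight):
    """Candidate names ordered from best to worst penalized score."""
    return sorted(
        candidates,
        key=lambda name: penalized_score(candidates[name], weight),
        reverse=True,
    )


# Toy Gini values for illustration only.
candidates = {
    "candidate_a": {"train": 0.62, "test": 0.58, "oot": 0.52},
    "candidate_b": {"train": 0.57, "test": 0.56, "oot": 0.53},
}

for weight in (0.0, 0.5, 1.0):
    print(weight, rank_candidates(candidates, weight))
```

If the preferred model flips as the weight moves within a plausible range, the selection depends on the penalty choice more than on the models themselves, which is worth knowing before committing.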
