
2025-09-03 19:14:40
Here's an odd effect (stumbled on by accident). The blue loss curve is from a well-tuned BERT baseline (from the "cramming" paper).
The only thing I changed for the orange curve was to put a residual connection around each transformer block and multiply the block's output by a scalar parameter initialized to 0.
I'm surprised that has such a substantial impact, not just on the performance but on the shape of the loss curve.
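
For readers who want the change spelled out, here is a minimal sketch of one reading of it: each block f is wrapped so the output is x + alpha * f(x), with the scalar alpha starting at 0. The names `GatedResidual` and `toy_block` are hypothetical and not taken from the cramming codebase, and plain Python lists stand in for tensors. In a real model alpha would be a learnable parameter.

```python
from typing import Callable, List

Vector = List[float]


class GatedResidual:
    """Wrap a block f so the output is x + alpha * f(x).

    Hypothetical illustration: alpha is a plain float here, whereas in a
    trained model it would be a learnable scalar initialized to 0.
    """

    def __init__(self, block: Callable[[Vector], Vector], alpha: float = 0.0):
        self.block = block
        self.alpha = alpha

    def __call__(self, x: Vector) -> Vector:
        fx = self.block(x)
        # Skip path plus the block output scaled by the gate.
        return [xi + self.alpha * fi for xi, fi in zip(x, fx)]


def toy_block(x: Vector) -> Vector:
    # Stand-in for a transformer block (assumption: any shape-preserving map).
    return [2.0 * xi + 1.0 for xi in x]


layer = GatedResidual(toy_block)
x = [1.0, -2.0, 0.5]

y_init = layer(x)    # alpha == 0, so the wrapped block is the identity
layer.alpha = 0.5    # pretend training has moved the gate away from 0
y_later = layer(x)   # the block now contributes half of its output
```

With alpha at 0, every wrapped block passes its input through unchanged, so at initialization the whole stack behaves like the identity and the blocks only start to contribute as training moves each alpha away from 0.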
Probing the Black Hole Interior with Holographic Entanglement Entropy and the Role of AdS/BCFT Correspondence
Fabiano F. Santos
https://arxiv.org/abs/2508.21224
A Digital Twin-Based Simulation Framework for Safe Curve Speed Estimation Using Unity
Araf Rahman (Clemson University), M. Sabbir Salek (Clemson University), Mashrur Chowdhury (Clemson University), Wayne A. Sarasua (Clemson University)
https://arxiv.org/abs/2508.14046
Diversity in Hydrogen-rich Envelope Mass of Type II Supernovae. (III). The mass-loss and evolutionary pathways of the red supergiant progenitors
Qiliang Fang, Takashi J. Moriya, Keiichi Maeda, Andris Dorozsmai, Javier Silva-Farfán
https://arxiv.org/abs/2507.14665
AC Magnetometry Loop Tracer Compatible with Magnetic Calorimetry for Power Loss Analysis
Thomas Veile, Michael Harmel, Mathias Zambach, Philip Holm, Frederik L. Durhuus, Cathrine Frandsen
https://arxiv.org/abs/2508.07929