SemiAnalysis launches InferenceMAX, an open-source benchmark that automatically tracks LLM inference performance across AI models and frameworks every night (Kimbo Chen/SemiAnalysis)
https://newsletter.semianalysis.com/p/inferencemax-open-source-inference
KV Cache Compression for Inference Efficiency in LLMs: A Review
Yanyu Liu (Shandong University of Science and Technology), Jingying Fu (Shandong University of Science and Technology), Sixiang Liu (Shandong University of Science and Technology), Yitian Zou (Shandong University of Science and Technology), You Fu (Shandong University of Science and Technology), Jiehan Zhou (Shandong University of Science and Technology), Shouhua Zhang (University of Oulu)
DMFI: Dual-Modality Fine-Tuning and Inference Framework for LLM-Based Insider Threat Detection
Kaichuan Kong, Dongjie Liu, Xiaobo Jin, Guanggang Geng, Zhiying Li, Jian Weng
https://arxiv.org/abs/2508.05694
LIMFAST. IV. Learning High-Redshift Galaxy Formation from Multiline Intensity Mapping with Implicit Likelihood Inference
Guochao Sun, Tri Nguyen, Claude-André Faucher-Giguère, Adam Lidz, Tjitske Starkenburg, Bryan R. Scott, Tzu-Ching Chang, Steven R. Furlanetto
https://arxiv.org/abs/2509.07060
ReNiL: Relative Neural Inertial Locator with Any-Scale Bayesian Inference
Kaixuan Wu (School of Computer Science, Wuhan University, Wuhan, China, School of Cyber Science and Engineering, Wuhan University, Wuhan, China), Yuanzhuo Xu (School of Computer Science, Wuhan University, Wuhan, China), Zejun Zhang (University of Southern California, Los Angeles, United States), Weiping Zhu (School of Computer Science, Wuhan University, Wuhan, China), Steve Drew (Department of Electrical and Soft…
Active Membership Inference Test (aMINT): Enhancing Model Auditability with Multi-Task Learning
Daniel DeAlcala, Aythami Morales, Julian Fierrez, Gonzalo Mancera, Ruben Tolosana, Javier Ortega-Garcia
https://arxiv.org/abs/2509.07879
Handling Open-Vocabulary Constructs in Formalizing Specifications: Retrieval-Augmented Parsing with Expert Knowledge
Mohammad Saqib Hasan, Sayontan Ghosh, Dhruv Verma, Geoff Kuenning, Erez Zadok, Scott A. Smolka, Niranjan Balasubramanian
https://arxiv.org/abs/2509.08808
"[Chain of reasoning] reports are untrustworthy on principle: they are plausible explanations for plausible responses, and since the inferences involved are more complex, they burn more compute and carbon per query as well as introducing more mistakes"
This is a particularly offensive point about #LLMs: we actually do have a class of systems, inference engines, which do reason and can…
Comparison of Fully Homomorphic Encryption and Garbled Circuit Techniques in Privacy-Preserving Machine Learning Inference
Kalyan Cheerla (University of North Texas), Lotfi Ben Othmane (University of North Texas), Kirill Morozov (University of North Texas)
https://arxiv.org/abs/2510.07457
NanoCodec: Towards High-Quality Ultra Fast Speech LLM Inference
Edresson Casanova, Paarth Neekhara, Ryan Langman, Shehzeen Hussain, Subhankar Ghosh, Xuesong Yang, Ante Jukić, Jason Li, Boris Ginsburg
https://arxiv.org/abs/2508.05835
Towards Generalized Routing: Model and Agent Orchestration for Adaptive and Efficient Inference
Xiyu Guo, Shan Wang, Chunfang Ji, Xuefeng Zhao, Wenhao Xi, Yaoyao Liu, Qinglan Li, Chao Deng, Junlan Feng
https://arxiv.org/abs/2509.07571
MoE-Compression: How the Compression Error of Experts Affects the Inference Accuracy of MoE Model?
Songkai Ma, Zhaorui Zhang, Sheng Di, Benben Liu, Xiaodong Yu, Xiaoyi Lu, Dan Wang
https://arxiv.org/abs/2509.07727
Video Parallel Scaling: Aggregating Diverse Frame Subsets for VideoLLMs
Hyungjin Chung, Hyelin Nam, Jiyeon Kim, Hyojun Go, Byeongjun Park, Junho Kim, Joonseok Lee, Seongsu Ha, Byung-Hoon Kim
https://arxiv.org/abs/2509.08016
Taking the Weight Off: Mitigating Parameter Bias from Catastrophic Outliers in 3×2pt Analysis
Carolyn McDonald Mill, C. Danielle Leonard, Markus Michael Rau, Cora Uhlemann, Shahab Joudaki
https://arxiv.org/abs/2509.08052
Dynamic Features Adaptation in Networking: Toward Flexible training and Explainable inference
Yannis Belkhiter, Seshu Tirupathi, Giulio Zizzo, Merim Dzaferagic, John D. Kelleher
https://arxiv.org/abs/2510.08303
Automatic Failure Attribution and Critical Step Prediction Method for Multi-Agent Systems Based on Causal Inference
Guoqing Ma, Jia Zhu, Hanghui Guo, Weijie Shi, Jiawei Shen, Jingjiang Liu, Yidan Liang
https://arxiv.org/abs/2509.08682
Baseten, which helps companies launch open-source or custom AI models, raised a $150M Series D led by Bond at a $2.15B valuation, up from $825M in February (Allie Garfinkle/Fortune)
https://fortune.com/2025/09/05/exclusive-b…
Fisher Random Walk: Automatic Debiasing Contextual Preference Inference for Large Language Model Evaluation
Yichi Zhang, Alexander Belloni, Ethan X. Fang, Junwei Lu, Xiaoan Xu
https://arxiv.org/abs/2509.05852
Hess-MC2: Sequential Monte Carlo Squared using Hessian Information and Second Order Proposals
Joshua Murphy, Conor Rosato, Andrew Millard, Lee Devlin, Paul Horridge, Simon Maskell
https://arxiv.org/abs/2507.07461
Staircase Streaming for Low-Latency Multi-Agent Inference
Junlin Wang, Jue Wang, Zhen Xu, Ben Athiwaratkun, Bhuwan Dhingra, Ce Zhang, James Zou
https://arxiv.org/abs/2510.05059
TinierHAR: Towards Ultra-Lightweight Deep Learning Models for Efficient Human Activity Recognition on Edge Devices
Sizhen Bian, Mengxi Liu, Vitor Fortes Rey, Daniel Geissler, Paul Lukowicz
https://arxiv.org/abs/2507.07949
DACIP-RC: Domain Adaptive Continual Instruction Pre-Training via Reading Comprehension on Business Conversations
Elena Khasanova, Harsh Saini, Md Tahmid Rahman Laskar, Xue-Yong Fu, Cheng Chen, Shashi Bhushan TN
https://arxiv.org/abs/2510.08152
Unleashing the True Potential of LLMs: A Feedback-Triggered Self-Correction with Long-Term Multipath Decoding
Jipeng Li, Zeyu Gao, Yubin Qi, Hande Dong, Weijian Chen, Qiang Lin
https://arxiv.org/abs/2509.07676
ResAD: Normalized Residual Trajectory Modeling for End-to-End Autonomous Driving
Zhiyu Zheng, Shaoyu Chen, Haoran Yin, Xinbang Zhang, Jialv Zou, Xinggang Wang, Qian Zhang, Lefei Zhang
https://arxiv.org/abs/2510.08562
Talking with Oompa Loompas: A novel framework for evaluating linguistic acquisition of LLM agents
Sankalp Tattwadarshi Swain, Anshika Krishnatray, Dhruv Kumar, Jagat Sesh Challa
https://arxiv.org/abs/2509.07389
Enhancing Reasoning for Diffusion LLMs via Distribution Matching Policy Optimization
Yuchen Zhu, Wei Guo, Jaemoo Choi, Petr Molodyk, Bo Yuan, Molei Tao, Yongxin Chen
https://arxiv.org/abs/2510.08233
ARTDECO: Towards Efficient and High-Fidelity On-the-Fly 3D Reconstruction with Structured Scene Representation
Guanghao Li, Kerui Ren, Linning Xu, Zhewen Zheng, Changjian Jiang, Xin Gao, Bo Dai, Jian Pu, Mulin Yu, Jiangmiao Pang
https://arxiv.org/abs/2510.08551
Sharing is Caring: Efficient LM Post-Training with Collective RL Experience Sharing
Jeffrey Amico, Gabriel Passamani Andrade, John Donaghy, Ben Fielding, Tristin Forbus, Harry Grieve, Semih Kara, Jari Kolehmainen, Yihua Lou, Christopher Nies, Edward Phillip Flores Nuño, Diogo Ortega, Shikhar Rastogi, Austin Virts, Matthew J. Wright
https://
Skip a Layer or Loop it? Test-Time Depth Adaptation of Pretrained LLMs
Ziyue Li, Yang Li, Tianyi Zhou
https://arxiv.org/abs/2507.07996 https://arxiv.org/pdf/2507.07996 https://arxiv.org/html/2507.07996
arXiv:2507.07996v1 Announce Type: new
Abstract: Can a pretrained neural network adapt its architecture to different inputs without any finetuning? Do we need all layers for simple tasks, and are they adequate for challenging tasks? We found that the layers of a pretrained large language model (LLM) can be manipulated as separate modules to build a better and even shallower model customized for each test sample. In particular, each layer from the pretrained model can be skipped/pruned or repeated multiple times as recurrent neural networks (RNN), and stacked with others in arbitrary orders, yielding a chain-of-layers (CoLa) per sample. This compositional space greatly expands the scope of existing works on looped/recurrent pretrained modules, layer pruning, or early-exit networks. We develop a Monte Carlo Tree Search (MCTS) protocol to explore and identify the optimal CoLa for each sample from math and commonsense reasoning benchmarks. Compared to a static model of a fixed depth, CoLa allows shortcut paths (fast thinking), recurrence of the same layer(s) (slow thinking), and combining both, offering more flexible, dynamic architectures for different inputs. We conduct an extensive analysis of the MCTS-optimized CoLa, which leads to two key findings: (1) For >75% of samples with correct predictions by the original LLM, we can find shorter CoLa, suggesting a large space for improving inference efficiency; (2) For >60% of samples with originally incorrect predictions, we can identify CoLa achieving correct predictions, suggesting a large space of performance enhancement. Our results highlight the shortcomings of using a fixed architecture of pre-trained LLMs for inference on different samples and pave the way to unlock the generalization power of test-time depth adaptation.
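The core idea above — treating a pretrained model's layers as reusable modules and searching for a per-sample chain that may skip or repeat layers — can be sketched in miniature. This is a hedged toy illustration, not the paper's method: the "layers" are stand-in scalar functions and the brute-force search below replaces the paper's MCTS protocol.

```python
# Toy chain-of-layers (CoLa) sketch: a "model" is a list of layer modules,
# and a per-sample path is a sequence of layer indices that may skip some
# layers entirely or repeat others. The exhaustive search is a stand-in
# for the paper's MCTS; the layers themselves are illustrative.
from itertools import product

# Stand-in "layers": simple functions on a scalar hidden state.
LAYERS = [
    lambda h: h + 1.0,   # layer 0
    lambda h: h * 2.0,   # layer 1
    lambda h: h - 0.5,   # layer 2
]

def run_cola(path, h0):
    """Apply layers in the order given by `path` (indices may be
    absent = skipped, or repeated = recurrent reuse)."""
    h = h0
    for i in path:
        h = LAYERS[i](h)
    return h

def best_cola(h0, target, max_len=4):
    """Brute-force stand-in for MCTS: among all paths up to `max_len`,
    return the first one whose output is closest to `target`."""
    best, best_err = (), abs(run_cola((), h0) - target)
    for length in range(1, max_len + 1):
        for path in product(range(len(LAYERS)), repeat=length):
            err = abs(run_cola(path, h0) - target)
            if err < best_err:
                best, best_err = path, err
    return best, best_err

path, err = best_cola(h0=0.0, target=3.0, max_len=3)
print(path, err)  # → (0, 0, 0) 0.0 — layer 0 repeated three times
```

Note how the selected path repeats a single layer, the scalar analogue of the "slow thinking" recurrence the abstract describes; a shorter-than-depth path would be the "fast thinking" shortcut.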