SysOM-AI: Continuous Cross-Layer Performance Diagnosis for Production AI Training
Yusheng Zheng, Wenan Mao, Shuyi Cheng, Fuqiu Feng, Guangshui Li, Zhaoyan Liao, Yongzhuo Huang, Zhenwei Xiao, Yuqing Li, Andi Quinn, Tao Ma
https://arxiv.org/abs/2603.29235 https://arxiv.org/pdf/2603.29235 https://arxiv.org/html/2603.29235
arXiv:2603.29235v1 Announce Type: new
Abstract: Performance diagnosis in production-scale AI training is challenging because subtle OS-level issues can trigger cascading GPU delays and network slowdowns, degrading training efficiency across thousands of GPUs. Existing profiling tools are limited to single system layers, incur prohibitive overhead (10--30%), or lack continuous deployment capabilities, resulting in manual analyses spanning days. We argue that continuous, cross-layer observability enabled by OS-level instrumentation and layered differential diagnosis is necessary to address this gap. We introduce SysOM-AI, a production observability system that continuously integrates CPU stack profiling, GPU kernel tracing, and NCCL event instrumentation via adaptive hybrid stack unwinding and eBPF-based tracing, incurring less than 0.4% overhead. Deployed at Alibaba across over 80,000 GPUs for more than one year, SysOM-AI helped diagnose 94 confirmed production issues, reducing median diagnosis time from days to approximately 10 minutes.
toXiv_bot_toot
GraphWalker: Agentic Knowledge Graph Question Answering via Synthetic Trajectory Curriculum
Shuwen Xu, Yao Xu, Jiaxiang Liu, Chenhao Yuan, Wenshuo Peng, Jun Zhao, Kang Liu
https://arxiv.org/abs/2603.28533 https://arxiv.org/pdf/2603.28533 https://arxiv.org/html/2603.28533
arXiv:2603.28533v1 Announce Type: new
Abstract: Agentic knowledge graph question answering (KGQA) requires an agent to iteratively interact with knowledge graphs (KGs), posing challenges in both training data scarcity and reasoning generalization. Specifically, existing approaches often restrict agent exploration: prompting-based methods lack autonomous navigation training, while current training pipelines usually confine reasoning to predefined trajectories. To this end, this paper proposes \textit{GraphWalker}, a novel agentic KGQA framework that addresses these challenges through \textit{Automated Trajectory Synthesis} and \textit{Stage-wise Fine-tuning}. GraphWalker adopts a two-stage SFT training paradigm: First, the agent is trained on structurally diverse trajectories synthesized from constrained random-walk paths, establishing a broad exploration prior over the KG; Second, the agent is further fine-tuned on a small set of expert trajectories to develop reflection and error recovery capabilities. Extensive experiments demonstrate that our stage-wise SFT paradigm unlocks a higher performance ceiling for a lightweight reinforcement learning (RL) stage, enabling GraphWalker to achieve state-of-the-art performance on CWQ and WebQSP. Additional results on GrailQA and our constructed GraphWalkerBench confirm that GraphWalker enhances generalization to out-of-distribution reasoning paths. The code is publicly available at https://github.com/XuShuwenn/GraphWalker
toXiv_bot_toot
Sparse Bayesian Deep Functional Learning with Structured Region Selection
Xiaoxian Zhu, Yingmeng Li, Shuangge Ma, Mengyun Wu
https://arxiv.org/abs/2602.20651 https://arxiv.org/pdf/2602.20651 https://arxiv.org/html/2602.20651
arXiv:2602.20651v1 Announce Type: new
Abstract: In modern applications such as ECG monitoring, neuroimaging, wearable sensing, and industrial equipment diagnostics, complex and continuously structured data are ubiquitous, presenting both challenges and opportunities for functional data analysis. However, existing methods face a critical trade-off: conventional functional models are limited by linearity, whereas deep learning approaches lack interpretable region selection for sparse effects. To bridge these gaps, we propose a sparse Bayesian functional deep neural network (sBayFDNN). It learns adaptive functional embeddings through a deep Bayesian architecture to capture complex nonlinear relationships, while a structured prior enables interpretable, region-wise selection of influential domains with quantified uncertainty. Theoretically, we establish rigorous approximation error bounds, posterior consistency, and region selection consistency. These results provide the first theoretical guarantees for a Bayesian deep functional model, ensuring its reliability and statistical rigor. Empirically, comprehensive simulations and real-world studies confirm the effectiveness and superiority of sBayFDNN. Crucially, sBayFDNN excels in recognizing intricate dependencies for accurate predictions and more precisely identifies functionally meaningful regions, capabilities fundamentally beyond existing approaches.
toXiv_bot_toot
Proc3D: Procedural 3D Generation and Parametric Editing of 3D Shapes with Large Language Models
Fadlullah Raji, Stefano Petrangeli, Matheus Gadelha, Yu Shen, Uttaran Bhattacharya, Gang Wu
https://arxiv.org/abs/2601.12234 https://arxiv.org/pdf/2601.12234 https://arxiv.org/html/2601.12234
arXiv:2601.12234v1 Announce Type: new
Abstract: Generating 3D models has traditionally been a complex task requiring specialized expertise. While recent advances in generative AI have sought to automate this process, existing methods produce non-editable representation, such as meshes or point clouds, limiting their adaptability for iterative design. In this paper, we introduce Proc3D, a system designed to generate editable 3D models while enabling real-time modifications. At its core, Proc3D introduces procedural compact graph (PCG), a graph representation of 3D models, that encodes the algorithmic rules and structures necessary for generating the model. This representation exposes key parameters, allowing intuitive manual adjustments via sliders and checkboxes, as well as real-time, automated modifications through natural language prompts using Large Language Models (LLMs). We demonstrate Proc3D's capabilities using two generative approaches: GPT-4o with in-context learning (ICL) and a fine-tuned LLAMA-3 model. Experimental results show that Proc3D outperforms existing methods in editing efficiency, achieving more than 400x speedup over conventional approaches that require full regeneration for each modification. Additionally, Proc3D improves ULIP scores by 28%, a metric that evaluates the alignment between generated 3D models and text prompts. By enabling text-aligned 3D model generation along with precise, real-time parametric edits, Proc3D facilitates highly accurate text-based image editing applications.
toXiv_bot_toot
Replaced article(s) found for cs.LG. https://arxiv.org/list/cs.LG/new
[3/6]:
- Towards Scalable Oversight via Partitioned Human Supervision
Ren Yin, Takashi Ishida, Masashi Sugiyama
https://arxiv.org/abs/2510.22500 https://mastoxiv.page/@arXiv_csLG_bot/115451787490434401
- ContextPilot: Fast Long-Context Inference via Context Reuse
Yinsicheng Jiang, Yeqi Huang, Liang Cheng, Cheng Deng, Xuan Sun, Luo Mai
https://arxiv.org/abs/2511.03475 https://mastoxiv.page/@arXiv_csLG_bot/115502245581974540
- Metabolomic Biomarker Discovery for ADHD Diagnosis Using Interpretable Machine Learning
Nabil Belacel, Mohamed Rachid Boulassel
https://arxiv.org/abs/2601.11283 https://mastoxiv.page/@arXiv_csLG_bot/115921183182326799
- PhysE-Inv: A Physics-Encoded Inverse Modeling approach for Arctic Snow Depth Prediction
Akila Sampath, Vandana Janeja, Jianwu Wang
https://arxiv.org/abs/2601.17074
- SAGE-5GC: Security-Aware Guidelines for Evaluating Anomaly Detection in the 5G Core Network
Cristian Manca, Christian Scano, Giorgio Piras, Fabio Brau, Maura Pintor, Battista Biggio
https://arxiv.org/abs/2602.03596
- LORE: Jointly Learning the Intrinsic Dimensionality and Relative Similarity Structure From Ordina...
Anand, Helbling, Davenport, Berman, Alagapan, Rozell
https://arxiv.org/abs/2602.04192
- Towards Robust Scaling Laws for Optimizers
Alexandra Volkova, Mher Safaryan, Christoph H. Lampert, Dan Alistarh
https://arxiv.org/abs/2602.07712 https://mastoxiv.page/@arXiv_csLG_bot/116046369672796465
- Do We Need Adam? Surprisingly Strong and Sparse Reinforcement Learning with SGD in LLMs
Sagnik Mukherjee, Lifan Yuan, Pavan Jayasinha, Dilek Hakkani-T\"ur, Hao Peng
https://arxiv.org/abs/2602.07729 https://mastoxiv.page/@arXiv_csLG_bot/116046377539155485
- AceGRPO: Adaptive Curriculum Enhanced Group Relative Policy Optimization for Autonomous Machine L...
Yuzhu Cai, Zexi Liu, Xinyu Zhu, Cheng Wang, Siheng Chen
https://arxiv.org/abs/2602.07906 https://mastoxiv.page/@arXiv_csLG_bot/116046423413650658
- VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training
Guobin Shen, Chenxiao Zhao, Xiang Cheng, Lei Huang, Xing Yu
https://arxiv.org/abs/2602.10693 https://mastoxiv.page/@arXiv_csLG_bot/116057229834947730
- KBVQ-MoE: KLT-guided SVD with Bias-Corrected Vector Quantization for MoE Large Language Models
Zukang Xu, Zhixiong Zhao, Xing Hu, Zhixuan Chen, Dawei Yang
https://arxiv.org/abs/2602.11184 https://mastoxiv.page/@arXiv_csLG_bot/116062537528208461
- MUSE: Multi-Tenant Model Serving With Seamless Model Updates
Correia, Ferreira, Martins, Bento, Guerreiro, Pereira, Gomes, Bono, Ferreira, Bizarro
https://arxiv.org/abs/2602.11776 https://mastoxiv.page/@arXiv_csLG_bot/116062952355379801
- Pawsterior: Variational Flow Matching for Structured Simulation-Based Inference
Jorge Carrasco-Pollo, Floor Eijkelboom, Jan-Willem van de Meent
https://arxiv.org/abs/2602.13813 https://mastoxiv.page/@arXiv_csLG_bot/116085828112928218
- Silent Inconsistency in Data-Parallel Full Fine-Tuning: Diagnosing Worker-Level Optimization Misa...
Hong Li, Zhen Zhou, Honggang Zhang, Yuping Luo, Xinyue Wang, Han Gong, Zhiyuan Liu
https://arxiv.org/abs/2602.14462 https://mastoxiv.page/@arXiv_csLG_bot/116085997857526328
- Divine Benevolence is an $x^2$: GLUs scale asymptotically faster than MLPs
Alejandro Francisco Queiruga
https://arxiv.org/abs/2602.14495 https://mastoxiv.page/@arXiv_csLG_bot/116086011618741857
- \"UberWeb: Insights from Multilingual Curation for a 20-Trillion-Token Dataset
DatologyAI, et al.
https://arxiv.org/abs/2602.15210 https://mastoxiv.page/@arXiv_csLG_bot/116090912256712568
- GLM-5: from Vibe Coding to Agentic Engineering
GLM-5-Team, et al.
https://arxiv.org/abs/2602.15763 https://mastoxiv.page/@arXiv_csLG_bot/116091080686771018
- Anatomy of Capability Emergence: Scale-Invariant Representation Collapse and Top-Down Reorganizat...
Jayadev Billa
https://arxiv.org/abs/2602.15997 https://mastoxiv.page/@arXiv_csLG_bot/116096541546306333
- AI-CARE: Carbon-Aware Reporting Evaluation Metric for AI Models
KC Santosh, Srikanth Baride, Rodrigue Rizk
https://arxiv.org/abs/2602.16042 https://mastoxiv.page/@arXiv_csLG_bot/116096581524696028
- Beyond Message Passing: A Symbolic Alternative for Expressive and Interpretable Graph Learning
Chuqin Geng, Li Zhang, Haolin Ye, Ziyu Zhao, Yuhe Jiang, Tara Saba, Xinyu Wang, Xujie Si
https://arxiv.org/abs/2602.16947 https://mastoxiv.page/@arXiv_csLG_bot/116102426238903124
toXiv_bot_toot