Zen-Attention: A Compiler Framework for Dynamic Attention Folding on AMD NPUs
Aadesh Deshmukh, Venkata Yaswanth Raparti, Samuel Hsu
https://arxiv.org/abs/2508.17593
H2SGEMM: Emulating FP32 GEMM on Ascend NPUs using FP16 Units with Precision Recovery and Cache-Aware Optimization
Weicheng Xue, Baisong Xu, Kai Yang, Yongxiang Liu, Dengdeng Fan, Pengxiang Xu, Yonghong Tian
https://arxiv.org/abs/2507.23387
Forecasting LLM Inference Performance via Hardware-Agnostic Analytical Modeling
Rajeev Patwari, Ashish Sirasao, Devleena Das
https://arxiv.org/abs/2508.00904
NPUEval: Optimizing NPU Kernels with LLMs and Open Source Compilers
Sarunas Kalade, Graham Schelle
https://arxiv.org/abs/2507.14403
Replaced article(s) found for cs.DC. https://arxiv.org/list/cs.DC/new
[1/1]:
- SGEMM-cube: Emulating FP32 GEMM on Ascend NPUs Using FP16 Cube Units with Precision Recovery
Weicheng Xue, Baisong Xu, Kai Yang, Yongxiang Liu, Dengdeng Fan, Pengxiang Xu, Yonghong Tian
Flexible Vector Integration in Embedded RISC-V SoCs for End to End CNN Inference Acceleration
Dmitri Lyalikov
https://arxiv.org/abs/2507.17771