đź’° Supports multiple dimensions and quantization options - binary 512d version outperforms OpenAI-v3-large while reducing vector database costs by 99.48%
🔍 Processes entire documents in a single pass to generate chunk embeddings enriched with document-level context
🎯 Less sensitive to chunking strategies compared to traditional context-agnostic embedding models
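The 99.48% cost figure follows from simple storage arithmetic, assuming the usual float32 baseline for OpenAI's 3072-dimension text-embedding-3-large vectors versus a 1-bit-per-dimension binary 512d vector:

```python
# Storage cost of one vector: float32 baseline vs. binary-quantized 512d.
FLOAT32_BYTES = 4
baseline_bytes = 3072 * FLOAT32_BYTES   # text-embedding-3-large -> 12288 bytes
binary_bytes = 512 // 8                 # 1 bit per dimension -> 64 bytes

savings = 1 - binary_bytes / baseline_bytes
print(f"{savings:.2%}")                 # -> 99.48%
```

This counts raw vector storage only; index overhead in a real vector database would shift the ratio somewhat.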
At Berlin Buzzwords, Alessio Vertemati and Andrea Ponti discussed how different document parsing and chunking strategies affect the performance of RAG pipelines.
Watch the full session: https://youtu.be/OMyEklV0G0E?si=P1GSTVShZYEKr7A5
--
Save the Date - Berlin Buzzword…
Can LLMs Replace Humans During Code Chunking?
Christopher Glasz, Emily Escamilla, Eric O. Scott, Anand Patel, Jacob Zimmer, Colin Diggs, Michael Doyle, Scott Rosen, Nitin Naik, Justin F. Brunelle, Samruddhi Thaker, Parthav Poudel, Arun Sridharan, Amit Madan, Doug Wendt, William Macke, Thomas Schill
https://arxiv.org/abs/2506.198…
cAST: Enhancing Code Retrieval-Augmented Generation with Structural Chunking via Abstract Syntax Tree
Yilin Zhang, Xinran Zhao, Zora Zhiruo Wang, Chenyang Yang, Jiayi Wei, Tongshuang Wu
https://arxiv.org/abs/2506.15655
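As a rough sketch of the structural idea (not the paper's cAST algorithm, which works recursively over AST nodes), one can chunk Python source at top-level statement boundaries with the standard `ast` module; the `max_chars` budget here is purely illustrative:

```python
import ast

def chunk_by_ast(source: str, max_chars: int = 800) -> list[str]:
    """Split Python source at top-level statement boundaries.

    Chunks follow syntactic units instead of fixed character windows,
    so a function or class body is never cut mid-definition.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks, current = [], []
    for node in tree.body:
        # AST nodes carry 1-based line spans for the full statement.
        segment = "\n".join(lines[node.lineno - 1 : node.end_lineno])
        if current and sum(len(s) for s in current) + len(segment) > max_chars:
            chunks.append("\n".join(current))
            current = []
        current.append(segment)
    if current:
        chunks.append("\n".join(current))
    return chunks
```

Compared with fixed-size windows, every chunk boundary here coincides with a complete statement, which is the property the paper argues improves code retrieval.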
Reinforcement Learning with Action Chunking
Qiyang Li, Zhiyuan Zhou, Sergey Levine
https://arxiv.org/abs/2507.07969
arXiv:2507.07969v1 Announce Type: new
Abstract: We present Q-chunking, a simple yet effective recipe for improving reinforcement learning (RL) algorithms for long-horizon, sparse-reward tasks. Our recipe is designed for the offline-to-online RL setting, where the goal is to leverage an offline prior dataset to maximize the sample-efficiency of online learning. Effective exploration and sample-efficient learning remain central challenges in this setting, as it is not obvious how the offline data should be utilized to acquire a good exploratory policy. Our key insight is that action chunking, a technique popularized in imitation learning where sequences of future actions are predicted rather than a single action at each timestep, can be applied to temporal difference (TD)-based RL methods to mitigate the exploration challenge. Q-chunking adopts action chunking by directly running RL in a 'chunked' action space, enabling the agent to (1) leverage temporally consistent behaviors from offline data for more effective online exploration and (2) use unbiased $n$-step backups for more stable and efficient TD learning. Our experimental results demonstrate that Q-chunking exhibits strong offline performance and online sample efficiency, outperforming prior best offline-to-online methods on a range of long-horizon, sparse-reward manipulation tasks.
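A toy illustration of the second point in the abstract: when the agent commits to a whole chunk of h actions as one "macro action", the h-step return within the chunk needs no off-policy correction, so the TD target is an unbiased h-step backup. The function below is a hypothetical sketch, not code from the paper:

```python
def chunked_td_target(rewards: list[float], next_value: float,
                      gamma: float = 0.99) -> float:
    """h-step TD target for an action chunk of length h = len(rewards).

    Accumulates the discounted in-chunk rewards and bootstraps from the
    critic's value estimate at the state reached after the chunk.
    """
    target = next_value
    for r in reversed(rewards):
        target = r + gamma * target
    return target

# Toy usage: a chunk of 4 steps with a sparse terminal reward,
# bootstrapping from a critic estimate of 10.0.
print(chunked_td_target([0.0, 0.0, 0.0, 1.0], next_value=10.0))
```

With chunk length h = 1 this reduces to the ordinary one-step TD target.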
Knowledge Compression via Question Generation: Enhancing Multihop Document Retrieval without Fine-tuning
Anvi Alex Eponon, Moein Shahiki-Tash, Ildar Batyrshin, Christian E. Maldonado-Sifuentes, Grigori Sidorov, Alexander Gelbukh
https://arxiv.org/abs/2506.13778
Many chatbots and search engines use Retrieval-Augmented Generation, but they often underperform compared to ChatGPT, frustrating users. At Berlin Buzzwords, Lewin von Saldern discussed how poor chunking strategies contribute to this issue and showcased improved techniques for building more reliable RAG systems.
Watch the full session: https://
Actor-Critic for Continuous Action Chunks: A Reinforcement Learning Framework for Long-Horizon Robotic Manipulation with Sparse Reward
Jiarui Yang, Bin Zhu, Jingjing Chen, Yu-Gang Jiang
https://arxiv.org/abs/2508.11143
Dynamic Chunking for End-to-End Hierarchical Sequence Modeling
Sukjun Hwang, Brandon Wang, Albert Gu
https://arxiv.org/abs/2507.07955
arXiv:2507.07955v1 Announce Type: new
Abstract: Despite incredible progress in language models (LMs) in recent years, largely resulting from moving away from specialized models designed for specific tasks to general models based on powerful architectures (e.g. the Transformer) that learn everything from raw data, pre-processing steps such as tokenization remain a barrier to true end-to-end foundation models. We introduce a collection of new techniques that enable a dynamic chunking mechanism which automatically learns content -- and context -- dependent segmentation strategies learned jointly with the rest of the model. Incorporating this into an explicit hierarchical network (H-Net) allows replacing the (implicitly hierarchical) tokenization-LM-detokenization pipeline with a single model learned fully end-to-end. When compute- and data- matched, an H-Net with one stage of hierarchy operating at the byte level outperforms a strong Transformer language model operating over BPE tokens. Iterating the hierarchy to multiple stages further increases its performance by modeling multiple levels of abstraction, demonstrating significantly better scaling with data and matching a token-based Transformer of twice its size. H-Nets pretrained on English show significantly increased character-level robustness, and qualitatively learn meaningful data-dependent chunking strategies without any heuristics or explicit supervision. Finally, the H-Net's improvement over tokenized pipelines is further increased in languages and modalities with weaker tokenization heuristics, such as Chinese and code, or DNA sequences (nearly 4x improvement in data efficiency over baselines), showing the potential of true end-to-end models that learn and scale better from unprocessed data.
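The core mechanism can be caricatured in a few lines: a per-byte boundary score decides where chunks start, and byte embeddings are pooled into one vector per chunk for the inner network. In H-Net the scorer is learned jointly with the rest of the model; the fixed probabilities and threshold below are stand-ins for illustration only:

```python
import numpy as np

def pool_chunks(byte_embs: np.ndarray, boundary_probs: np.ndarray,
                threshold: float = 0.5) -> np.ndarray:
    """Mean-pool byte embeddings into dynamically sized chunks.

    A boundary score above `threshold` marks the start of a new chunk;
    each chunk's byte embeddings are averaged into a single vector.
    """
    boundaries = boundary_probs > threshold      # True marks a chunk start
    boundaries[0] = True                         # first byte always opens a chunk
    chunk_ids = np.cumsum(boundaries) - 1        # 0-based chunk index per byte
    n_chunks = int(chunk_ids[-1]) + 1
    pooled = np.zeros((n_chunks, byte_embs.shape[1]))
    for c in range(n_chunks):
        pooled[c] = byte_embs[chunk_ids == c].mean(axis=0)
    return pooled

# 6 "bytes" with 2-dim embeddings; boundaries predicted at positions 0 and 3.
embs = np.arange(12, dtype=float).reshape(6, 2)
probs = np.array([0.9, 0.1, 0.1, 0.8, 0.2, 0.3])
print(pool_chunks(embs, probs).shape)  # -> (2, 2)
```

The pooling is content-dependent: changing the boundary scores changes both the number and the extent of the chunks, which is what distinguishes this from a fixed tokenizer.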
How can you determine if your RAG system is functioning properly? At #bbuzz 2025, Roman Grebennikov presented a real-world case study on assessing the effectiveness of RAG in production. He discussed challenges such as messy data, chunking errors, and unexpected chatbot behaviour, while also sharing tools to confidently measure quality.
Watch the full session:
CO-RFT: Efficient Fine-Tuning of Vision-Language-Action Models through Chunked Offline Reinforcement Learning
Dongchi Huang, Zhirui Fang, Tianle Zhang, Yihang Li, Lin Zhao, Chunhe Xia
https://arxiv.org/abs/2508.02219
Optimizing RAG Pipelines for Arabic: A Systematic Analysis of Core Components
Jumana Alsubhi, Mohammad D. Alahmadi, Ahmed Alhusayni, Ibrahim Aldailami, Israa Hamdine, Ahmad Shabana, Yazeed Iskandar, Suhayb Khayyat
https://arxiv.org/abs/2506.06339