Tootfinder

@mia@hcommons.social
2025-09-19 14:22:23

Some nice examples in the 'use cases' section of AI for Humanists https://aiforhumanists.com/guides/usecases/ - from OCR to annotation to identifying voices and styles

Use Cases
The AI for Humanists project is developing resources to enable DH scholars to explore how large language models and AI technologies can be used in their research and teaching. Find an annotated bibliography of research papers and tools, a glossary of relevant terms, code tutorials, and information about our workshops.

@arXiv_csDL_bot@mastoxiv.page
2025-09-17 07:56:49

Layout-Aware OCR for Black Digital Archives with Unsupervised Evaluation
Fitsum Sileshi Beyene, Christopher L. Dancy
https://arxiv.org/abs/2509.13236 https://

Layout-Aware OCR for Black Digital Archives with Unsupervised Evaluation
Despite their cultural and historical significance, Black digital archives continue to be a structurally underrepresented area in AI research and infrastructure. This is especially evident in efforts to digitize historical Black newspapers, where inconsistent typography, visual degradation, and limited annotated layout data hinder accurate transcription, despite the availability of various systems that claim to handle optical character recognition (OCR) well. In this short paper, we present a l…

@mgorny@social.treehouse.systems
2025-08-14 19:06:21

Paperwork does OCR on everything I scan. I've just scanned a document with my signature on it. It OCR-ed the signature (which is literally a scrawl on "Michał Górny") as "NBA".

@avstockhausen@fedihum.org
2025-07-09 15:35:02

Bookmarked: calfa-co/hye-tesseract: Open OCR model for Armenian #Armenisch_OCR_Tesseract

GitHub - calfa-co/hye-tesseract: Open OCR model for Armenian
Open OCR model for Armenian. Contribute to calfa-co/hye-tesseract development by creating an account on GitHub.

@arXiv_csCV_bot@mastoxiv.page
2025-07-18 10:22:32

VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning
Senqiao Yang, Junyi Li, Xin Lai, Bei Yu, Hengshuang Zhao, Jiaya Jia
https://arxiv.org/abs/2507.13348

VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning
Recent advancements in vision-language models (VLMs) have improved performance by increasing the number of visual tokens, which are often significantly longer than text tokens. However, we observe that most real-world scenarios do not require such an extensive number of visual tokens. While the performance drops significantly in a small subset of OCR-related tasks, models still perform accurately in most other general VQA tasks with only 1/4 resolution. Therefore, we propose to dynamically proc…

@arXiv_csCL_bot@mastoxiv.page
2025-09-15 09:43:41

Benchmarking Vision-Language Models on Chinese Ancient Documents: From OCR to Knowledge Reasoning
Haiyang Yu, Yuchuan Wu, Fan Shi, Lei Liao, Jinghui Lu, Xiaodong Ge, Han Wang, Minghan Zhuo, Xuecheng Wu, Xiang Fei, Hao Feng, Guozhi Tang, An-Lan Wang, Hanshen Zhu, Yangfan He, Quanhuan Liang, Liyuan Meng, Chao Feng, Can Huang, Jingqun Tang, Bin Li
https://

Benchmarking Vision-Language Models on Chinese Ancient Documents: From OCR to Knowledge Reasoning
Chinese ancient documents, invaluable carriers of millennia of Chinese history and culture, hold rich knowledge across diverse fields but face challenges in digitization and understanding, i.e., traditional methods only scan images, while current Vision-Language Models (VLMs) struggle with their visual and linguistic complexity. Existing document benchmarks focus on English printed texts or simplified Chinese, leaving a gap for evaluating VLMs on ancient Chinese documents. To address this, we p…

@arXiv_csCV_bot@mastoxiv.page
2025-07-10 10:17:11

Design and Implementation of an OCR-Powered Pipeline for Table Extraction from Invoices
Parshva Dhilankumar Patel
https://arxiv.org/abs/2507.07029 https://…

Design and Implementation of an OCR-Powered Pipeline for Table Extraction from Invoices
This paper presents the design and development of an OCR-powered pipeline for efficient table extraction from invoices. The system leverages Tesseract OCR for text recognition and custom post-processing logic to detect, align, and extract structured tabular data from scanned invoice documents. Our approach includes dynamic preprocessing, table boundary detection, and row-column mapping, optimized for noisy and non-standard invoice formats. The resulting pipeline significantly improves data extr…

@arXiv_csCL_bot@mastoxiv.page
2025-09-05 09:41:51

E-ARMOR: Edge case Assessment and Review of Multilingual Optical Character Recognition
Aryan Gupta, Anupam Purwar
https://arxiv.org/abs/2509.03615 https://…

E-ARMOR: Edge case Assessment and Review of Multilingual Optical Character Recognition
Optical Character Recognition (OCR) in multilingual, noisy, and diverse real-world images remains a significant challenge. With the rise of Large Vision-Language Models (LVLMs), there is growing interest in their ability to generalize and reason beyond fixed OCR pipelines. In this work, we introduce Sprinklr-Edge-OCR, a novel OCR system optimized specifically for edge deployment in resource-constrained environments. We present a large-scale compar…

@arXiv_csCY_bot@mastoxiv.page
2025-09-04 08:22:51

Integrating Generative AI into Cybersecurity Education: A Study of OCR and Multimodal LLM-assisted Instruction
Karan Patel, Yu-Zheng Lin, Gaurangi Raul, Bono Po-Jen Shih, Matthew W. Redondo, Banafsheh Saber Latibari, Jesus Pacheco, Soheil Salehi, Pratik Satam
https://arxiv.org/abs/2509.02998

Integrating Generative AI into Cybersecurity Education: A Study of OCR and Multimodal LLM-assisted Instruction
This full paper describes an LLM-assisted instruction integrated with a virtual cybersecurity lab platform. The digital transformation of Fourth Industrial Revolution (4IR) systems is reshaping workforce needs, widening skill gaps, especially among older workers. With rising emphasis on robotics, automation, AI, and security, re-skilling and up-skilling are essential. Generative AI can help build this workforce by acting as an instructional assistant to support skill acquisition during experien…

@toxi@mastodon.thi.ng
2025-08-04 15:27:23

Finally found a great ad-free and tracking-free #OpenSource document scanner for iOS, with OCR and multi-page PDF output:
https://openscanner.app/

Open Scanner
Open Scanner is an open-source document scanning app for iPhone

@arXiv_csIR_bot@mastoxiv.page
2025-07-04 07:35:01

Uncertainty-Aware Complex Scientific Table Data Extraction
Kehinde Ajayi, Yi He, Jian Wu
https://arxiv.org/abs/2507.02009 https://arx…

Uncertainty-Aware Complex Scientific Table Data Extraction
Table structure recognition (TSR) and optical character recognition (OCR) play crucial roles in extracting structured data from tables in scientific documents. However, existing extraction frameworks built on top of TSR and OCR methods often fail to quantify the uncertainties of extracted results. To obtain highly accurate data for scientific domains, all extracted data must be manually verified, which can be time-consuming and labor-intensive. We propose a framework that performs uncertainty-a…

@grumpybozo@toad.social
2025-09-03 14:49:04

33k one-page TIFFs is an OCR challenge, but it's not insurmountable. https://fed.brid.gy/r/https://bsky.app/profile/did:plc:gvda6fem6r7selm4gzjjww4a/post/3lxvbrbeabc2a

Leah McElrath (@leahmcelrath.bsky.social)
Looks like they purposefully made the released Epstein documents into a pile of hay to make finding any needles very challenging. They even made each page a separate file.

@vform@openbiblio.social
2025-07-05 12:36:50

With all the AI fuss, I would have thought that perfect, free, lightweight models for autocorrect and OCR would exist by now, as more or less a core competency of LLMs. But what I mostly hear and read about is "chat" usage.

@michabbb@social.vivaldi.net
2025-07-22 20:15:55

#MistralAI Document #AI: Advanced #OCR solution for complex document processing 📄
📺

@mela@zusammenkunft.net
2025-08-27 01:00:26

Is there a decent scanner app for Android without a subscription? It doesn't need OCR, just good multi-page scan-to-PDF.

@arXiv_csCV_bot@mastoxiv.page
2025-09-15 09:58:31

VARCO-VISION-2.0 Technical Report
Young-rok Cha, Jeongho Ju, SunYoung Park, Jong-Hyeon Lee, Younghyun Yu, Youngjune Kim
https://arxiv.org/abs/2509.10105 https://

VARCO-VISION-2.0 Technical Report
We introduce VARCO-VISION-2.0, an open-weight bilingual vision-language model (VLM) for Korean and English with improved capabilities compared to the previous model VARCO-VISION-14B. The model supports multi-image understanding for complex inputs such as documents, charts, and tables, and delivers layout-aware OCR by predicting both textual content and its spatial location. Trained with a four-stage curriculum with memory-efficient techniques, the model achieves enhanced multimodal alignment, wh…

@nelson@tech.lgbt
2025-08-30 01:35:07

Some of my most useful tools these days are ones that take screenshots: Greenshot, a Windows tool with excellent usability, and PowerToys Text Extractor, which lets me OCR bits of text on the screen. Usability is important here: press one button and stuff is copied to the clipboard.

@arXiv_csCL_bot@mastoxiv.page
2025-07-25 09:55:02

Zero-shot OCR Accuracy of Low-Resourced Languages: A Comparative Analysis on Sinhala and Tamil
Nevidu Jayatilleke, Nisansa de Silva
https://arxiv.org/abs/2507.18264 https://

Zero-shot OCR Accuracy of Low-Resourced Languages: A Comparative Analysis on Sinhala and Tamil
Solving the problem of Optical Character Recognition (OCR) on printed text for Latin and its derivative scripts can now be considered settled due to the volumes of research done on English and other High-Resourced Languages (HRL). However, for Low-Resourced Languages (LRL) that use unique scripts, it remains an open problem. This study presents a comparative analysis of the zero-shot performance of six distinct OCR engines on two LRLs: Sinhala and Tamil. The selected engines include both commer…

@arXiv_csCV_bot@mastoxiv.page
2025-09-01 09:52:02

Why Stop at Words? Unveiling the Bigger Picture through Line-Level OCR
Shashank Vempati, Nishit Anand, Gaurav Talebailkar, Arpan Garai, Chetan Arora
https://arxiv.org/abs/2508.21693

Why Stop at Words? Unveiling the Bigger Picture through Line-Level OCR
Conventional optical character recognition (OCR) techniques segmented each character and then recognized it. This made them prone to errors in character segmentation and left them without the context needed to exploit language models. Advances in sequence-to-sequence translation in the last decade led to modern techniques that first detect words and then input one word at a time to a model, which directly outputs full words as sequences of characters. This allowed better utilization of language models and bypass error-prone…

@arXiv_csDL_bot@mastoxiv.page
2025-07-28 07:59:01

Comparing OCR Pipelines for Folkloristic Text Digitization
Octavian M. Machidon, Alina L. Machidon
https://arxiv.org/abs/2507.19092 https://arxiv.org/pdf/2…

Comparing OCR Pipelines for Folkloristic Text Digitization
The digitization of historical folkloristic materials presents unique challenges due to diverse text layouts, varying print and handwriting styles, and linguistic variations. This study explores different optical character recognition (OCR) approaches for Slovene folkloristic and historical text digitization, integrating both traditional methods and large language models (LLMs) to improve text transcription accuracy while maintaining linguistic and structural integrity. We compare single-stage …

@arXiv_csHC_bot@mastoxiv.page
2025-07-01 11:05:23

Email as the Interface to Generative AI Models: Seamless Administrative Automation
Andres Navarro, Carlos de Quinto, José Alberto Hernández
https://arxiv.org/abs/2506.23850

Email as the Interface to Generative AI Models: Seamless Administrative Automation
This paper introduces a novel architectural framework that integrates Large Language Models (LLMs) with email interfaces to automate administrative tasks, specifically targeting accessibility barriers in enterprise environments. The system connects email communication channels with Optical Character Recognition (OCR) and intelligent automation, enabling non-technical administrative staff to delegate complex form-filling and document processing tasks using familiar email interfaces. By treating …

@arXiv_csCY_bot@mastoxiv.page
2025-07-08 11:48:31

Real-Time AI-Driven Pipeline for Automated Medical Study Content Generation in Low-Resource Settings: A Kenyan Case Study
Emmanuel Korir, Eugene Wechuli
https://arxiv.org/abs/2507.05212

Real-Time AI-Driven Pipeline for Automated Medical Study Content Generation in Low-Resource Settings: A Kenyan Case Study
Juvenotes is a real-time AI-driven pipeline that automates the transformation of academic documents into structured exam-style question banks, optimized for low-resource medical education settings in Kenya. The system combines Azure Document Intelligence for OCR and Azure AI Foundry (OpenAI o3-mini) for question and answer generation in a microservices architecture, with a Vue/TypeScript frontend and AdonisJS backend. Mobile-first design, bandwidth-sensitive interfaces, institutional tagging, a…

@arXiv_csCV_bot@mastoxiv.page
2025-08-21 10:04:30

Improving OCR using internal document redundancy
Diego Belzarena, Seginus Mowlavi, Aitor Artola, Camilo Mariño, Marina Gardella, Ignacio Ramírez, Antoine Tadros, Roy He, Natalia Bottaioli, Boshra Rajaei, Gregory Randall, Jean-Michel Morel
https://arxiv.org/abs/2508.14557

Improving OCR using internal document redundancy
Current OCR systems are based on deep learning models trained on large amounts of data. Although they have shown some ability to generalize to unseen data, especially in detection tasks, they can struggle with recognizing low-quality data. This is particularly evident for printed documents, where intra-domain data variability is typically low, but inter-domain data variability is high. In that context, current OCR methods do not fully exploit each document's redundancy. We propose an unsupervis…

@michabbb@social.vivaldi.net
2025-07-22 20:15:56

🔄 Significantly improves #RAG pipeline performance by creating context-rich, high-quality text from documents that enhances #AI application accuracy
💼 Addresses critical business challenges where traditional #OCR

@arXiv_csIR_bot@mastoxiv.page
2025-06-30 09:15:30

Evaluating VisualRAG: Quantifying Cross-Modal Performance in Enterprise Document Understanding
Varun Mannam, Fang Wang, Xin Chen
https://arxiv.org/abs/2506.21604

Evaluating VisualRAG: Quantifying Cross-Modal Performance in Enterprise Document Understanding
Current evaluation frameworks for multimodal generative AI struggle to establish trustworthiness, hindering enterprise adoption where reliability is paramount. We introduce a systematic, quantitative benchmarking framework to measure the trustworthiness of progressively integrating cross-modal inputs such as text, images, captions, and OCR within VisualRAG systems for enterprise document intelligence. Our approach establishes quantitative relationships between technical metrics and user-centric…

@arXiv_csIR_bot@mastoxiv.page
2025-08-27 08:26:03

Extracting Information from Scientific Literature via Visual Table Question Answering Models
Dongyoun Kim, Hyung-do Choi, Youngsun Jang, John Kim
https://arxiv.org/abs/2508.18661

Extracting Information from Scientific Literature via Visual Table Question Answering Models
This study explores three approaches to processing table data in scientific papers to enhance extractive question answering and develop a software tool for the systematic review process. The methods evaluated include: (1) Optical Character Recognition (OCR) for extracting information from documents, (2) Pre-trained models for document visual question answering, and (3) Table detection and structure recognition to extract and merge key information from tables with textual content to answer extra…
