So this guy threw Natural Language Processing at the Voynich Manuscript and concluded that it probably is written in some kind of language and is not just total gibberish. Cool bit of ML research! https://github.com/brianmg/voynich-nlp-analysis
LLM vs. SAST: A Technical Analysis on Detecting Coding Bugs of GPT4-Advanced Data Analysis
Madjid G. Tehrani, Eldar Sultanow, William J. Buchanan, Mahkame Houmani, Christel H. Djaha Fodja
https://arxiv.org/abs/2506.15212
Language Surgery in Multilingual Large Language Models
Joanito Agili Lopo, Muhammad Ravi Shulthan Habibi, Tack Hwa Wong, Muhammad Ilham Ghozali, Fajri Koto, Genta Indra Winata, Peerat Limkonchotiwat, Alham Fikri Aji, Samuel Cahyawijaya
https://arxiv.org/abs/2506.12450
In our #ISE2025 lecture last Wednesday, we learned how n-gram language models use the Markov assumption and maximum likelihood estimation to predict the probability of a word given a specific context (i.e., the preceding n-1 words in the sequence).
#NLP
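The idea from the lecture post can be sketched in a few lines: under the Markov assumption, a bigram model estimates P(w_i | w_{i-1}) as a simple ratio of counts (maximum likelihood estimation). The toy corpus below is invented for illustration.

```python
from collections import Counter

def train_bigram_mle(corpus):
    """Estimate P(w_i | w_{i-1}) by maximum likelihood:
    count(w_{i-1}, w_i) / count(w_{i-1})."""
    tokens = corpus.lower().split()
    # Count each word's occurrences as a *context* (last token never is one).
    context_counts = Counter(tokens[:-1])
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    return {pair: c / context_counts[pair[0]] for pair, c in bigram_counts.items()}

probs = train_bigram_mle("the cat sat on the mat the cat ran")
print(round(probs[("the", "cat")], 3))  # 0.667 -- 2 of 3 "the" are followed by "cat"
```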
Identifying economic narratives in large text corpora -- An integrated approach using Large Language Models
Tobias Schmidt, Kai-Robin Lange, Matthias Reccius, Henrik Müller, Michael Roos, Carsten Jentsch
https://arxiv.org/abs/2506.15041
Evaluating Large Language Models for Phishing Detection, Self-Consistency, Faithfulness, and Explainability
Shova Kuikel, Aritran Piplai, Palvi Aggarwal
https://arxiv.org/abs/2506.13746
Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management
Luis Gasco, Hermenegildo Fabregat, Laura García-Sardiña, Paula Estrella, Daniel Deniz, Alvaro Rodrigo, Rabih Zbib
https://arxiv.org/abs/2507.13275
DSSD: Efficient Edge-Device Deployment and Collaborative Inference via Distributed Split Speculative Decoding
Jiahong Ning, Ce Zheng, Tingting Yang
https://arxiv.org/abs/2507.12000
Integrating Quantized LLMs into Robotics Systems as Edge AI to Leverage their Natural Language Processing Capabilities
Miguel Á. González-Santamarta, Francisco J. Rodríguez-Lera, David Sobrín-Hidalgo, Ángel Manuel Guerrero-Higueras, Vicente Matellán-Olivera
https://arxiv.org/abs/2506.09581
Improving Drug Identification in Overdose Death Surveillance using Large Language Models
Arthur J. Funnell, Panayiotis Petousis, Fabrice Harel-Canada, Ruby Romero, Alex A. T. Bui, Adam Koncsol, Hritika Chaturvedi, Chelsea Shover, David Goodman-Meza
https://arxiv.org/abs/2507.12679
This week, we discussed the central question Can we "predict" a word? as the basis for statistical language models in our #ISE2025 lecture. Of course, I was trying Shakespeare quotes to motivate the (international) students to complete the quotes with the "predicted" missing words ;-)
"All the world's a stage, and all the men and women merely...."
INTERPOS: Interaction Rhythm Guided Positional Morphing for Mobile App Recommender Systems
M. H. Maqbool, Moghis Fereidouni, Umar Farooq, A. B. Siddique, Hassan Foroosh
https://arxiv.org/abs/2506.12661
ElliottAgents: A Natural Language-Driven Multi-Agent System for Stock Market Analysis and Prediction
Jarosław A. Chudziak, Michał Wawer
https://arxiv.org/abs/2507.03435
A Topic Modeling Analysis of Stigma Dimensions, Social, and Related Behavioral Circumstances in Clinical Notes Among Patients with HIV
Ziyi Chen, Yiyang Liu, Mattia Prosperi, Krishna Vaddiparti, Robert L Cook, Jiang Bian, Yi Guo, Yonghui Wu
https://arxiv.org/abs/2506.09279
Next stop on our NLP timeline (as part of the #ISE2025 lecture) was Terry Winograd's SHRDLU, an early natural language understanding system developed in 1968-70 that could manipulate blocks in a virtual world.
Winograd, T. Procedures as a Representation for Data in a Computer Program for Understanding Natural Language. MIT AI Technical Report 235.
Quantum Adiabatic Generation of Human-Like Passwords
Sascha M\"ucke, Raoul Heese, Thore Gerlach, David Biesner, Loong Kuan Lee, Nico Piatkowski
https://arxiv.org/abs/2506.08917
Training-free LLM Merging for Multi-task Learning
Zichuan Fu, Xian Wu, Yejing Wang, Wanyu Wang, Shanshan Ye, Hongzhi Yin, Yi Chang, Yefeng Zheng, Xiangyu Zhao
https://arxiv.org/abs/2506.12379
FinBERT2: A Specialized Bidirectional Encoder for Bridging the Gap in Finance-Specific Deployment of Large Language Models
Xuan Xu, Fufang Wen, Beilin Chu, Zhibing Fu, Qinhong Lin, Jiaqi Liu, Binjie Fei, Zhongliang Yang, Linna Zhou, Yu Li
https://arxiv.org/abs/2506.06335
Replaced article(s) found for q-bio.OT. https://arxiv.org/list/q-bio.OT/new
[1/1]:
English dictionaries, gold and silver standard corpora for biomedical natural language processing...
BMFM-DNA: A SNP-aware DNA foundation model to capture variant effects
Hongyang Li, Sanjoy Dey, Bum Chul Kwon, Michael Danziger, Michal Rosen-Tzvi, Jianying Hu, James Kozloski, Ching-Huei Tsou, Bharath Dandala, Pablo Meyer
https://arxiv.org/abs/2507.05265
The last leg of our brief history of NLP (so far) is the advent of large language models with GPT-3 in 2020 and the introduction of learning from the prompt (aka few-shot learning).
T. B. Brown et al. (2020). Language models are few-shot learners. NeurIPS'20
https://…
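As an illustration of "learning from the prompt": the task is specified entirely by a few demonstrations inside the prompt, with no gradient updates. The translation pairs below are in the spirit of the examples from the GPT-3 paper.

```python
# Few-shot prompt: the model infers the task (English->French translation)
# purely from the in-context demonstrations.
few_shot_prompt = """Translate English to French:
sea otter => loutre de mer
peppermint => menthe poivrée
cheese =>"""

# A large language model completing this prompt is expected to continue
# with the French translation -- no fine-tuning involved.
print(few_shot_prompt)
```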
Investigating Vulnerabilities and Defenses Against Audio-Visual Attacks: A Comprehensive Survey Emphasizing Multimodal Models
Jinming Wen, Xinyi Wu, Shuai Zhao, Yanhao Jia, Yuwen Li
https://arxiv.org/abs/2506.11521
Estimation of An Infinite Dimensional Transition Probability Matrix Using a Generalized Hierarchical Stick-Breaking Process
Agamani Saha, Souvik Roy
https://arxiv.org/abs/2507.07433
FlexiSAGA: A Flexible Systolic Array GEMM Accelerator for Sparse and Dense Processing
Mika Markus M\"uller, Konstantin L\"ubeck, Alexander Louis-Ferdinand Jung, Jannik Steinmetz, Oliver Bringmann
https://arxiv.org/abs/2506.01566
With the advent of ELIZA, Joseph Weizenbaum's chatbot simulating a psychotherapist, NLP took another major step: pattern-based substitution algorithms built on simple regular expressions.
Weizenbaum, Joseph (1966). ELIZA—a computer program for the study of natural language communication between man and machine. Communications of the ACM 9(1): 36–45.
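A minimal sketch of the pattern-based substitution idea (a handful of invented rules, not Weizenbaum's original script): match a keyword pattern, then reflect the user's own words back via backreferences.

```python
import re

# Ordered (pattern, response template) pairs; \1 in the template reflects
# the user's captured words back, in the spirit of ELIZA's keyword rules.
RULES = [
    (r".*\bI need (.*)", r"Why do you need \1?"),
    (r".*\bI am (.*)", r"How long have you been \1?"),
    (r".*\bmy (\w+)", r"Tell me more about your \1."),
    (r".*", "Please go on."),  # catch-all fallback
]

def eliza_reply(utterance):
    for pattern, template in RULES:
        match = re.match(pattern, utterance, re.IGNORECASE)
        if match:
            return match.expand(template)

print(eliza_reply("I am feeling sad"))  # How long have you been feeling sad?
```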
A Language-Driven Framework for Improving Personalized Recommendations: Merging LLMs with Traditional Algorithms
Aaron Goldstein, Ayan Dutta
https://arxiv.org/abs/2507.07251
CSI2Vec: Towards a Universal CSI Feature Representation for Positioning and Channel Charting
Victoria Palhares, Sueda Taner, Christoph Studer
https://arxiv.org/abs/2506.05237
VRAgent-R1: Boosting Video Recommendation with MLLM-based Agents via Reinforcement Learning
Siran Chen, Boyu Chen, Chenyun Yu, Yuxiao Luo, Ouyang Yi, Lei Cheng, Chengxiang Zhuo, Zang Li, Yali Wang
https://arxiv.org/abs/2507.02626
Iterative Augmentation with Summarization Refinement (IASR) Evaluation for Unstructured Survey data Modeling and Analysis
Payal Bhattad, Sai Manoj Pudukotai Dinakarrao, Anju Gupta
https://arxiv.org/abs/2507.12126
Last week, our students learned how to conduct a proper evaluation of an NLP experiment. To this end, we introduced a small text corpus with sentences about Joseph Fourier, who counts as one of the discoverers of the greenhouse effect responsible for global warming.
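Assuming the evaluation covered the usual set-based metrics, here is a minimal sketch of precision, recall, and F1 (the sentence IDs and counts are invented for illustration):

```python
def precision_recall_f1(retrieved, relevant):
    """Textbook set-based evaluation metrics for an NLP/IR experiment."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)  # true positives
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy run: the system returns 4 sentences, 3 of which are truly relevant
# out of 5 relevant sentences overall.
p, r, f = precision_recall_f1({"s1", "s2", "s3", "s7"}, {"s1", "s2", "s3", "s4", "s5"})
print(p, r)  # 0.75 0.6
```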
Unveiling Privacy Policy Complexity: An Exploratory Study Using Graph Mining, Machine Learning, and Natural Language Processing
Vijayalakshmi Ramasamy, Seth Barrett, Gokila Dorai, Jessica Zumbach
https://arxiv.org/abs/2507.02968
An Evaluation of Large Language Models on Text Summarization Tasks Using Prompt Engineering Techniques
Walid Mohamed Aly, Taysir Hassan A. Soliman, Amr Mohamed AbdelAziz
https://arxiv.org/abs/2507.05123
Is Diversity All You Need for Scalable Robotic Manipulation?
Modi Shi, Li Chen, Jin Chen, Yuxiang Lu, Chiming Liu, Guanghui Ren, Ping Luo, Di Huang, Maoqing Yao, Hongyang Li
https://arxiv.org/abs/2507.06219
VEDA: Efficient LLM Generation Through Voting-based KV Cache Eviction and Dataflow-flexible Accelerator
Zhican Wang, Hongxiang Fan, Haroon Waris, Gang Wang, Zhenyu Li, Jianfei Jiang, Yanan Sun, Guanghui He
https://arxiv.org/abs/2507.00797
In the 1990s, statistical n-gram language models, trained on vast text collections, became the backbone of NLP research. They fueled advances in nearly all NLP techniques of the era, laying the groundwork for today's AI.
F. Jelinek (1997), Statistical Methods for Speech Recognition, MIT Press, Cambridge, MA
#NLP
SocioXplorer: An Interactive Tool for Topic and Network Analysis in Social Data
Sandrine Chausson, Youssef Al Hariri, Walid Magdy, Björn Ross
https://arxiv.org/abs/2506.18845
Next stop on our NLP timeline is 2013, the introduction of low-dimensional dense word vectors, so-called "word embeddings", based on distributional semantics, e.g. word2vec by Mikolov et al. from Google, which enabled representation learning on text.
T. Mikolov et al. (2013). Efficient Estimation of Word Representations in Vector Space.
…
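The key property of such embeddings can be illustrated with toy vectors: semantically related words sit close together under cosine similarity. The numbers below are invented for illustration; real word2vec vectors have hundreds of dimensions learned from large corpora.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# Toy 3-dimensional "embeddings" (invented): related words get nearby vectors.
emb = {
    "king":    [0.9, 0.8, 0.1],
    "queen":   [0.85, 0.9, 0.15],
    "cabbage": [0.1, 0.2, 0.95],
}

print(cosine(emb["king"], emb["queen"]) > cosine(emb["king"], emb["cabbage"]))  # True
```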
Searching Clinical Data Using Generative AI
Karan Hanswadkar, Anika Kanchi, Shivani Tripathi, Shi Qiao, Rony Chatterjee, Alekh Jindal
https://arxiv.org/abs/2505.24090
The Dark Side of LLMs: Agent-based Attacks for Complete Computer Takeover
Matteo Lupinacci, Francesco Aurelio Pironti, Francesco Blefari, Francesco Romeo, Luigi Arena, Angelo Furfaro
https://arxiv.org/abs/2507.06850
Propaganda and Information Dissemination in the Russo-Ukrainian War: Natural Language Processing of Russian and Western Twitter Narratives
Zaur Gouliev
https://arxiv.org/abs/2506.01807
Adversarial Text Generation with Dynamic Contextual Perturbation
Hetvi Waghela, Jaydip Sen, Sneha Rakshit, Subhasis Dasgupta
https://arxiv.org/abs/2506.09148
Last week, we continued our #ISE2025 lecture on distributional semantics with the introduction of neural language models (NLMs) and compared them to traditional statistical n-gram models.
Benefits of NLMs:
- Capturing Long-Range Dependencies
- Computational and Statistical Tractability
- Improved Generalisation
- Higher Accuracy
@…
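The generalisation benefit in the list above can be sketched with toy numbers (invented, untrained; the helper `next_word_probs` is hypothetical): an NLM maps context words to dense vectors and scores candidates by vector similarity plus a softmax, so words with nearby embeddings share statistical strength, which count-based n-grams cannot do.

```python
from math import exp

# Toy 2-dimensional embeddings (invented for illustration).
EMB = {"the": [0.2, 0.7], "a": [0.25, 0.65], "cat": [0.9, 0.1], "dog": [0.85, 0.2]}

def next_word_probs(context):
    """Score next-word candidates from averaged context embeddings + softmax."""
    dim = len(next(iter(EMB.values())))
    ctx = [sum(EMB[w][i] for w in context) / len(context) for i in range(dim)]
    scores = {w: sum(c * e for c, e in zip(ctx, vec)) for w, vec in EMB.items()}
    z = sum(exp(s) for s in scores.values())
    return {w: exp(s) / z for w, s in scores.items()}

probs = next_word_probs(["the"])
# "the" and "a" have nearby vectors, so they receive similar probabilities --
# exactly the smoothing-by-similarity that discrete n-gram counts lack.
print({w: round(p, 3) for w, p in probs.items()})
```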
SeisCoDE: 3D Seismic Interpretation Foundation Model with Contrastive Self-Distillation Learning
Goodluck Archibong, Ardiansyah Koeshidayatullah, Umair Waheed, Weichang Li, Dicky Harishidayat, Motaz Alfarraj
https://arxiv.org/abs/2505.20518
Optimizing Storytelling, Improving Audience Retention, and Reducing Waste in the Entertainment Industry
Andrew Cornfeld, Ashley Miller, Mercedes Mora-Figueroa, Kurt Samuels, Anthony Palomba
https://arxiv.org/abs/2506.00076
SoK: Are Watermarks in LLMs Ready for Deployment?
Kieu Dang, Phung Lai, NhatHai Phan, Yelong Shen, Ruoming Jin, Abdallah Khreishah, My Thai
https://arxiv.org/abs/2506.05594
False Alarms, Real Damage: Adversarial Attacks Using LLM-based Models on Text-based Cyber Threat Intelligence Systems
Samaneh Shafee, Alysson Bessani, Pedro M. Ferreira
https://arxiv.org/abs/2507.06252
From Ambiguity to Accuracy: The Transformative Effect of Coreference Resolution on Retrieval-Augmented Generation systems
Youngjoon Jang, Seongtae Hong, Junyoung Son, Sungjin Park, Chanjun Park, Heuiseok Lim
https://arxiv.org/abs/2507.07847
LMPVC and Policy Bank: Adaptive voice control for industrial robots with code generating LLMs and reusable Pythonic policies
Ossi Parikka, Roel Pieters
https://arxiv.org/abs/2506.22028
Towards Fair Rankings: Leveraging LLMs for Gender Bias Detection and Measurement
Maryam Mousavian, Zahra Abbasiantaeb, Mohammad Aliannejadi, Fabio Crestani
https://arxiv.org/abs/2506.22372
Seeing Through Green: Text-Based Classification and the Firm's Returns from Green Patents
Lapo Santarlasci, Armando Rungi, Antonio Zinilli
https://arxiv.org/abs/2507.02287
Retrieval-Augmented Generation in Biomedicine: A Survey of Technologies, Datasets, and Clinical Applications
Jiawei He, Boya Zhang, Hossein Rouhizadeh, Yingjian Chen, Rui Yang, Jin Lu, Xudong Chen, Nan Liu, Irene Li, Douglas Teodoro
https://arxiv.org/abs/2505.01146
Rethinking the Privacy of Text Embeddings: A Reproducibility Study of "Text Embeddings Reveal (Almost) As Much As Text"
Dominykas Seputis, Yongkang Li, Karsten Langerak, Serghei Mihailov
https://arxiv.org/abs/2507.07700
Dialogue-Based Multi-Dimensional Relationship Extraction from Novels
Yuchen Yan, Hanjie Zhao, Senbin Zhu, Hongde Liu, Zhihong Zhang, Yuxiang Jia
https://arxiv.org/abs/2507.04852
PROVSYN: Synthesizing Provenance Graphs for Data Augmentation in Intrusion Detection Systems
Yi Huang, Wajih Ul Hassan, Yao Guo, Xiangqun Chen, Ding Li
https://arxiv.org/abs/2506.06226
LLMs as Architects and Critics for Multi-Source Opinion Summarization
Anuj Attri, Arnav Attri, Pushpak Bhattacharyya, Suman Banerjee, Amey Patil, Muthusamy Chelliah, Nikesh Garera
https://arxiv.org/abs/2507.04751
eSapiens: A Real-World NLP Framework for Multimodal Document Understanding and Enterprise Knowledge Processing
Isaac Shi, Zeyuan Li, Wenli Wang, Lewei He, Yang Yang, Tianyu Shi
https://arxiv.org/abs/2506.16768
MaXIFE: Multilingual and Cross-lingual Instruction Following Evaluation
Yile Liu, Ziwei Ma, Xiu Jiang, Jinglu Hu, Jing Chang, Liang Li
https://arxiv.org/abs/2506.01776
Put Teacher in Student's Shoes: Cross-Distillation for Ultra-compact Model Compression Framework
Maolin Wang, Jun Chu, Sicong Xie, Xiaoling Zang, Yao Zhao, Wenliang Zhong, Xiangyu Zhao
https://arxiv.org/abs/2507.04636
Malware Classification Leveraging NLP & Machine Learning for Enhanced Accuracy
Bishwajit Prasad Gond, Rajneekant, Pushkar Kishore, Durga Prasad Mohapatra
https://arxiv.org/abs/2506.16224
Revisiting Active Learning under (Human) Label Variation
Cornelia Gruber, Helen Alber, Bernd Bischl, Göran Kauermann, Barbara Plank, Matthias Aßenmacher
https://arxiv.org/abs/2507.02593
Perspectives in Play: A Multi-Perspective Approach for More Inclusive NLP Systems
Benedetta Muscato, Lucia Passaro, Gizem Gezici, Fosca Giannotti
https://arxiv.org/abs/2506.20209 …