Tootfinder

@netzschleuder@social.skewed.de
2025-12-20 20:00:03

unicodelang: Languages spoken by country (2015)
A bipartite network of languages and the countries in which they are spoken, as estimated by Unicode. Edges are weighted by the proportion of the given country's population that is literate in a particular language.
This network has 868 nodes and 1255 edges.
Tags: Informational, Relatedness, Weighted

unicodelang: Languages spoken by country (2015). 868 nodes, 1255 edges. https://networks.skewed.de/net/unicodelang

unicodelang — Languages spoken by country (2015)
A bipartite network of languages and the countries in which they are spoken, as estimated by Unicode. Edges are weighted by the proportion of the given country's population that is literate in a particular language.

@cjust@infosec.exchange
2025-11-21 00:12:27

Neuroscientists Studied More Than 80,000 People and Found That Speaking Multiple Languages Might Slow Down Brain Aging
https://www.smithsonianmag.com/smart-news/neurosc…

@deprogrammaticaipsum@mas.to
2025-11-20 16:15:52

"PHP is the lingua franca of affordable web hosting options; or, in other terms, the Toyota Corolla of programming languages: boring, solid, easy, and affordable. You can find, almost anywhere in the world, an affordable web hosting with the saint quadrinity of LAMP: Linux, Apache, MySQL, and PHP; an OS, a web server, a database server, and a scripting language, in an inexpensive package, enabling the masses to go further. Paraphrasing George Clooney, what else?"

The Toyota Corolla Of Programming
In 1995, an otherwise unknown software developer released the first version of a new scripting language whose explicit aim was to make applications for this new platform called "The World Wide Web". After starting as a small project, and thanks to the crazy dot-com years, it grew dramatically to become one of the most widely used programming languages of all time. After some stumbling first steps, it eventually got some sort of standardization in 1997, even reluctantly including some OOP featur…

@sauer_lauwarm@mastodon.social
2025-12-20 21:14:21

https://www.instagram.com/p/DSbtXuCCe9Y/?utm_source=ig_web_copy_link&igsh=NTc4MTIwNjQ2YQ==

Japan Daily on Instagram: "Marty Friedman, an American musician based in Japan, visited a junior high school in Chiba at an event organized by local police to promote coexistence with foreigners. Speaking to first-year students, he said that 99.9% of foreigners who come to Japan are not dangerous and are often eager to communicate with Japanese people. He encouraged students to learn foreign languages and approach others without fear. Friedman also shared a personal story about forgetting his wallet in Japan, highlighting the country’s honesty. The event aimed to help young people understand cultural differences as foreign visitors and workers increase. Learning with empathy can reduce tension and improve everyday interactions. Source:TOKYO NP"
2,062 likes, 113 comments - japandaily_jp on December 18, 2025: "Marty Friedman, an American musician based in Japan, visited a junior high school in Chiba at an event organized by local police to promote coexistence with foreigners. Speaking to first-year students, he said that 99.9% of foreigners who come to Japan are not dangerous and are often eager to communicate with Japanese people. He encouraged students to learn foreign languages and approach others without fear. Friedman also shar…

@thomasfuchs@hachyderm.io
2025-12-19 18:33:35

Someone argued with me that using higher level programming languages is just like vibe-coding because "C has race conditions"

@fortune@social.linux.pizza
2025-10-19 10:00:01

Some programming languages manage to absorb change, but withstand progress.
-- Epigrams in Programming, ACM SIGPLAN Sept. 1982

@cketti@social.int21.dev
2025-11-20 18:57:51

@… So many languages, so little time 🙁

@toxi@mastodon.thi.ng
2025-11-20 09:44:59

Wow, just noticed #ThingUmbrella reached 3700 stars on GitHub — I'm celebrating... 🤩🫠
Heartfelt thanks to all of you who've been helping along the way (in any shape & form) and been supporting this work for all these years and across different programming languages/camps! Merci beaucoup!!! Esp. big Thank You's to fellow fediverse people/supporters from various stages…

thi.ng/umbrella

@newstik@social.heise.de
2025-11-18 13:52:02

How the ~same message can have different lengths in different #languages:
English: a mint
German: eine Münzprägeanstalt
English: that goes without saying
Viennese: eh

@penguin42@mastodon.org.uk
2025-10-20 00:12:15

There's a Ghidra pull request to add hd6303/6301 - this is looking much better for doing Epson HX-20 stuff;
Copy the Processors/MC6800/data/languages/*6303* into a standard Ghidra world and run 'ant' in the data directory, restart - and it works!
https://github.com/NationalSecurityA…

Add support for HD6301 and HD6303 microcontrollers by depili · Pull Request #6314 · NationalSecurityAgency/ghidra
HD6303 is a Hitachi clone of 6803. This implementation has been done based on the Hitachi HD6301V1/HD6303R User's Manual. The ISA differs from 6805 and 6809. The flags are implemented as pseudo...

@fanf@mendeddrum.org
2025-10-12 17:42:03

from my link log —
Let's take esoteric programming languages seriously.
https://arxiv.org/abs/2505.15327
saved 2025-10-11 https://dotat.at/:/XKTKR.…

Let's Take Esoteric Programming Languages Seriously
Esoteric programming languages are challenging to learn, but their unusual features and constraints may serve to improve programming ability. From languages designed to be intentionally obtuse (e.g. INTERCAL) to others targeting artistic expression (e.g. Piet) or exploring the nature of computation (e.g. Fractan), there is rich variety in the realm of esoteric programming languages. This essay examines the counterintuitive appeal of esoteric languages and seeks to analyse reasons for this popul…

@thomasfuchs@hachyderm.io
2025-12-19 14:51:00

What’s really amazing about vibe-coding is how people are replacing programming languages which are strictly deterministic with human speech which is highly ambiguous and expect programming to be faster and better.
“Well only use it when you’re already an expert!”
None of the people starting their careers using this technology are experts yet, nor will they ever be.
And within some finite amount of time neither will you, the expert, be an expert anymore.

@arXiv_csCL_bot@mastoxiv.page
2025-10-15 10:48:41

Cost Analysis of Human-corrected Transcription for Predominately Oral Languages
Yacouba Diarra, Nouhoum Souleymane Coulibaly, Michael Leventhal
https://arxiv.org/abs/2510.12781 …

Cost Analysis of Human-corrected Transcription for Predominately Oral Languages
Creating speech datasets for low-resource languages is a critical yet poorly understood challenge, particularly regarding the actual cost in human labor. This paper investigates the time and complexity required to produce high-quality annotated speech data for a subset of low-resource languages, low literacy Predominately Oral Languages, focusing on Bambara, a Manding language of Mali. Through a one-month field study involving ten transcribers with native proficiency, we analyze the correction …

@cdarwin@c.im
2025-12-19 02:54:38

The engraving depicts the waveforms of the spoken word "water" in 103 different languages
https://science.nasa.gov/mission/europa-clipper/europa-clipper-vault-plate/

Europa Clipper Vault Plate
There's a legacy of NASA spacecraft carrying inspirational messages. Europa Clipper continues that tradition with special messages on its vault plate.

@jorgecandeias@mastodon.social
2025-12-13 21:39:37

This is funny.
Apparently we Portuguese speakers read at an average speed of 181 words per minute.
(but people who read regularly probably read faster than that... I doubt it reaches 200, but it probably gets to 190)
https://irisreading.com/average…

What is the Average Reading Speed in Various Languages?
To find out the average speed people read in their native language, a study took a piece of text and translated it to different languages with surprising results

@netzschleuder@social.skewed.de
2025-10-18 07:00:04

unicodelang: Languages spoken by country (2015)
A bipartite network of languages and the countries in which they are spoken, as estimated by Unicode. Edges are weighted by the proportion of the given country's population that is literate in a particular language.
This network has 868 nodes and 1255 edges.
Tags: Informational, Relatedness, Weighted

unicodelang — Languages spoken by country (2015)
A bipartite network of languages and the countries in which they are spoken, as estimated by Unicode. Edges are weighted by the proportion of the given country's population that is literate in a particular language.

@relcfp@mastodon.social
2025-11-20 06:10:28

International Conference on Globalisation in Languages, Education, Culture, and Communication (GLECC 2026) 28-30 July, 2026, Manchester, UK
https://ift.tt/TKHpszg
updated: Wednesday, November 19, 2025 - 3:08pm
full name / name of organization: GLECC Organising…
via Input 4 RELCFP

@Techmeme@techhub.social
2025-11-06 20:41:03

Amazon is testing an AI tool called Kindle Translate that automatically translates books into other languages, for authors that self-publish on the platform (Lawrence Bonk/Engadget)
https://www.engadget.com/ai/amazon-is-test

Amazon is testing an AI tool that automatically translates books into other languages
Amazon is introducing an AI tool that will automatically translate books into other languages. This should be useful for authors who self publish.

@arXiv_csPL_bot@mastoxiv.page
2025-10-15 08:34:11

Operational methods in semantics
Roberto M. Amadio
https://arxiv.org/abs/2510.12295 https://arxiv.org/pdf/2510.12295

Operational methods in semantics
The focus of these lecture notes is on abstract models and basic ideas and results that relate to the operational semantics of programming languages largely conceived. The approach is to start with an abstract description of the computation steps of programs and then to build on top semantic equivalences, specification languages, and static analyses. While other approaches to the semantics of programming languages are possible, it appears that the operational one is particularly effective in th…

@Mediagazer@mstdn.social
2025-11-06 20:47:19

Amazon is testing an AI tool called Kindle Translate that automatically translates books into other languages, for authors that self-publish on the platform (Lawrence Bonk/Engadget)
https://www.engadget.com/ai/amazon-is-test

Amazon is testing an AI tool that automatically translates books into other languages
Amazon is introducing an AI tool that will automatically translate books into other languages. This should be useful for authors who self publish.

@arXiv_csLO_bot@mastoxiv.page
2025-10-14 08:38:18

Proceedings Twentieth International Workshop on Logical Frameworks and Meta-Languages: Theory and Practice
Kaustuv Chaudhuri (Inria, France), Daniele Nantes-Sobrinho (Imperial College, UK)
https://arxiv.org/abs/2510.11199

Proceedings Twentieth International Workshop on Logical Frameworks and Meta-Languages: Theory and Practice
These are the contributed papers presented at the 20th International Workshop on Logical Frameworks and Meta-Languages: Theory and Practice (LFMTP 2025), at Birmingham, UK on 19 July as a satellite event of the FSCD conference. The program committee for this edition of LFMTP was chaired by Kaustuv Chaudhuri and Daniele Nantes-Sobrinho. More information about LFMTP can be found on https://lfmtp.org.

@kubikpixel@chaos.social
2025-12-12 21:45:03

These are three arguments for web dev serv. APIs, even if you have to take a critical look at them in detail:
»Speed Comparison: Benchmarking programming languages using the Leibniz formula for calculating π«
— 2025-12-12
📊 https://niklas-heer.github.io/speed-comparison/…
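
For context, the Leibniz formula named in the benchmark title is the alternating series π/4 = 1 − 1/3 + 1/5 − 1/7 + … The following is a minimal Python sketch of that series, not the benchmark's own code; the function name `leibniz_pi` and the number of rounds are assumptions chosen for illustration.

```python
import math


def leibniz_pi(rounds: int) -> float:
    """Approximate pi by summing the first `rounds` terms of the Leibniz series."""
    total = 0.0
    sign = 1.0
    for k in range(rounds):
        # k-th term of the series: (-1)^k / (2k + 1)
        total += sign / (2 * k + 1)
        sign = -sign
    # The series converges to pi / 4, so scale the partial sum by 4.
    return 4.0 * total


if __name__ == "__main__":
    approx = leibniz_pi(100_000)
    print(approx, abs(approx - math.pi))
```

The series converges slowly (the error is on the order of 1/rounds), so a tight loop like this mostly exercises loop and floating-point throughput.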

@arXiv_csFL_bot@mastoxiv.page
2025-10-15 07:37:31

Bringing Algebraic Hierarchical Decompositions to Concatenative Functional Languages
Attila Egri-Nagy
https://arxiv.org/abs/2510.12481 https://arxiv.org/pd…

Bringing Algebraic Hierarchical Decompositions to Concatenative Functional Languages
Programming languages tend to evolve over time to use more and more concepts from theoretical computer science. Still, there is a gap between programming and pure mathematics. Not all theoretical results have realized their promising applications. The algebraic decomposition of finite state automata (Krohn-Rhodes Theory) constructs an emulating hierarchical structure from simpler components for any computing device. These decompositions provide ways to understand and control computational proce…

@soundclamp@mastodon.xyz
2025-12-17 21:50:51

@… 👀
https://mastodon.social/@kottke/115735572087447936

kottke.org (@kottke@mastodon.social)
Matt Webb reports on going to algoraves. “There are special browser-based programming languages like strudel where you type code to define the beats and the sound, like mod synth in code, and it plays in a loop even while you’re coding.” https://interconnected.org/home/2025/12/11/live

@grumpybozo@toad.social
2025-10-17 20:45:43

It has occurred to me that a lot of data processing/transformation which is entirely feasible without highly trained LLMs and neural nets and GPUs is being handed over to such monstrosities in part because no one wants to do the app design. Like the current scourge of web-scraper bots, which seem to be doing #NLG with ultra-simple languages constructed by examining working URLs. It's a large project…

@arXiv_csSE_bot@mastoxiv.page
2025-10-10 09:42:29

Building Whitespace-Sensitive Languages Using Whitespace-Insensitive Components
Alexander Hellwig, Nico Jansen, Bernhard Rumpe
https://arxiv.org/abs/2510.08200 https://

Building Whitespace-Sensitive Languages Using Whitespace-Insensitive Components
In Software Language Engineering, there is a trend towards reusability by composing modular language components. However, this reusability is severely inhibited by a gap in integrating whitespace-sensitive and whitespace-insensitive languages. There is currently no consistent procedure for seamlessly reusing such language components in both cases, such that libraries often cannot be reused, and whitespace-sensitive languages are developed from scratch. This paper presents a technique for using m…

@arXiv_csCL_bot@mastoxiv.page
2025-10-14 13:14:58

Invisible Languages of the LLM Universe
Saurabh Khanna, Xinxu Li
https://arxiv.org/abs/2510.11557 https://arxiv.org/pdf/2510.11557

Invisible Languages of the LLM Universe
Large Language Models are trained on massive multilingual corpora, yet this abundance masks a profound crisis: of the world's 7,613 living languages, approximately 2,000 languages with millions of speakers remain effectively invisible in digital ecosystems. We propose a critical framework connecting empirical measurements of language vitality (real world demographic strength) and digitality (online presence) with postcolonial theory and epistemic injustice to explain why linguistic inequality i…

@benb@osintua.eu
2025-10-15 16:36:30

Ukraine's language ombudsman calls for Russian to be stripped from list of protected 'minority' languages over mistranslation: https://benborges.xyz/2025/10/15/ukraines-language-ombudsman-calls-for.html

Tracking information about the Russian War against Ukraine — Support the OSINT Ukraine Archive the 🇷🇺 War against Ukraine 🇺🇦
Tracking information about the Russian War against Ukraine

@jamesthebard@social.linux.pizza
2025-11-17 06:19:38

First annoyance I've run into: standard bit shifting operations in Nim. It's not bad, but it took far too long to track down the right operator. In most languages, you're looking at the `>>` and `<<` operators, in Nim it's `shr` and `shl` which I totally wouldn't have guessed. However, got the initial register idea down.
```nim
type
  Register = object
    low: uint8 = 0
    high: uint8 = 0
    prime: uint16 = 0
proc swa…
```

@kornel@mastodon.social
2025-11-11 01:03:45

"What color is your function?" is a wonderful title. It's so good, the title alone could win the Sundance Festival.
But that post is about a JavaScript-specific limitation (not applicable to other languages), and some wishful bikeshedding about syntax (which turns out to be a leaky abstraction that makes locking ambiguous, very problematic in low-level languages).
But *color* is so catchy. It's not well defined in that post, but you can't have "color"…

@vague@social.linux.pizza
2025-10-15 08:47:50

Looking for a phantomjs alternative, ran a search and ended up on a page with a seemingly good comparison of possibilities, until you realize the page is on the zenrows.com domain. I don't think I'll take YOUR word for your product. Not sure what snakeoil they might be selling but it feels overtly self-aggrandizing at least

a table comparing web scraping tools. The table has six columns: "Tool", "Languages", "Best For", "Popularity", "Ease of Use", and "Speed".

The first row lists "ZenRows" as the tool, with "Python, NodeJs, Java, PHP, Go, Ruby, and any other" as the languages supported, “Web scraping without getting blocked” as its best use, “Rapidly growing” as the popularity, “Beginner-friendly and very quick to implement” as the ease of use, and “Lightweight and fast” as its speed. The second row lists "Pup…

@arXiv_csCV_bot@mastoxiv.page
2025-10-09 10:09:51

A Bridge from Audio to Video: Phoneme-Viseme Alignment Allows Every Face to Speak Multiple Languages
Zibo Su, Kun Wei, Jiahua Li, Xu Yang, Cheng Deng
https://arxiv.org/abs/2510.06612

A Bridge from Audio to Video: Phoneme-Viseme Alignment Allows Every Face to Speak Multiple Languages
Speech-driven talking face synthesis (TFS) focuses on generating lifelike facial animations from audio input. Current TFS models perform well in English but unsatisfactorily in non-English languages, producing wrong mouth shapes and rigid facial expressions. The terrible performance is caused by the English-dominated training datasets and the lack of cross-language generalization abilities. Thus, we propose Multilingual Experts (MuEx), a novel framework featuring a Phoneme-Guided Mixture-of-Exp…

@netzschleuder@social.skewed.de
2025-12-14 22:00:04

unicodelang: Languages spoken by country (2015)
A bipartite network of languages and the countries in which they are spoken, as estimated by Unicode. Edges are weighted by the proportion of the given country's population that is literate in a particular language.
This network has 868 nodes and 1255 edges.
Tags: Informational, Relatedness, Weighted

unicodelang — Languages spoken by country (2015)
A bipartite network of languages and the countries in which they are spoken, as estimated by Unicode. Edges are weighted by the proportion of the given country's population that is literate in a particular language.

@sascha_wolfer@fediscience.org
2025-10-10 06:06:01

Eyeballing Figure 1 of their response actually seems to support this: the three subregions in the Americas contain nearly 80 % of all polysynthetic languages. In each of them, the median population size lies below the global median. However, if we compare within each of these three regions, polysynthetic languages have a higher median L1_population size than non-polysynthetic ones. Might this pattern point towards a classic Simpson's paradox?
A negative global association arises because polysynthetic languages are concentrated in regions with smaller overall populations, even though within regions the relationship is positive. Once we account for that structure—as our mixed logit models do—the supposed "global" negative effect reverses direction.
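
To make the pattern concrete, here is a small Python sketch of such a reversal. All numbers are invented toy values, not data from the paper or the response, and the names `region_a`, `region_b`, `poly` and `non_poly` are hypothetical labels used only for illustration.

```python
from statistics import median

# Invented toy populations (NOT real data): region_a has small populations and
# most of the polysynthetic languages; region_b has large populations.
populations = {
    "region_a": {"poly": [200, 300, 400], "non_poly": [100, 150]},
    "region_b": {"poly": [5000], "non_poly": [3000, 4000, 4500, 4800]},
}


def pooled(kind: str) -> list[int]:
    """Collect all population values of one language type across regions."""
    return [p for region in populations.values() for p in region[kind]]


# Within each region: is the polysynthetic median higher?
within = {
    name: median(groups["poly"]) > median(groups["non_poly"])
    for name, groups in populations.items()
}

# Pooled over regions: is the polysynthetic median higher?
pooled_poly_higher = median(pooled("poly")) > median(pooled("non_poly"))

print(within)
print(pooled_poly_higher)
```

In this toy setup the polysynthetic median is higher inside every region, yet lower once the regions are pooled, which is the kind of reversal a model with regional structure is meant to account for.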

@shriramk@mastodon.social
2025-11-06 19:32:57

Oh good! Someone from the @… community can weigh in on the benefits.
Oh wait…
Anyway, sooner or later the time for computable reals will come. I'm still HODLing stock in continued fractions (and teaching them every year to my first-year students).

Isaac King Advocates Replacing Floating-Point with Exact Arithmetic in High-Level Languages

Software engineer Isaac King initiated a discussion on X on November 5, arguing that high-level languages like Python and JavaScript should default to arbitrary-precision arithmetic, such as rationals, to avoid floating-point precision errors that burden developers. Supporters agree on the need for greater accuracy in non-integer computations, while critics, including program…

@arXiv_mathCT_bot@mastoxiv.page
2025-10-06 07:43:19

Homotopy Languages
César Bardomiano Martínez, Simon Henry
https://arxiv.org/abs/2510.02607 https://arxiv.org/pdf/2510.02607

Homotopy Languages
We attach to each weak model category $\mathcal{M}$ a class of first order formulas about the fibrant objects of $\mathcal{M}$ whose validity is invariant under homotopies and weak equivalences. This is a generalization of the classical Blanc-Freyd Language of categories -- which involves formula avoiding equality on objects and which are invariant under isomorphism and equivalences of categories. In particular, we obtain similar homotopy invariant languages for $2$-categories, bicategories, ch…

@arXiv_csLG_bot@mastoxiv.page
2025-10-15 08:21:22

GAR: Generative Adversarial Reinforcement Learning for Formal Theorem Proving
Ruida Wang, Jiarui Yao, Rui Pan, Shizhe Diao, Tong Zhang
https://arxiv.org/abs/2510.11769 https://

GAR: Generative Adversarial Reinforcement Learning for Formal Theorem Proving
Solving math problems through verifiable languages such as Lean has significantly impacted both the mathematics and computer science communities. Current state-of-the-art models are often trained with expensive online Reinforcement Learning (RL) or expert iteration. However, these approaches rely on fixed problem sets, which causes inefficient training and limits the model to tackle complex problems. To overcome these limitations, we propose GAR: Generative Adversarial Reinforcement learning, a…

@arXiv_csCL_bot@mastoxiv.page
2025-10-15 10:46:11

Which Word Orders Facilitate Length Generalization in LMs? An Investigation with GCG-Based Artificial Languages
Nadine El-Naggar, Tatsuki Kuribayashi, Ted Briscoe
https://arxiv.org/abs/2510.12722

Which Word Orders Facilitate Length Generalization in LMs? An Investigation with GCG-Based Artificial Languages
Whether language models (LMs) have inductive biases that favor typologically frequent grammatical properties over rare, implausible ones has been investigated, typically using artificial languages (ALs) (White and Cotterell, 2021; Kuribayashi et al., 2024). In this paper, we extend these works from two perspectives. First, we extend their context-free AL formalization by adopting Generalized Categorial Grammar (GCG) (Wood, 2014), which allows ALs to cover attested but previously overlooked cons…

@Techmeme@techhub.social
2025-09-26 09:31:52

How inaccurate AI translations of Wikipedia pages, which AI models use for training, may cause a doom spiral that further marginalizes vulnerable languages (Jacob Judah/MIT Technology Review)
https://www.technologyreview.com/2025/09/25/11240…

How AI and Wikipedia have sent vulnerable languages into a doom spiral
Machine translators have made it easier than ever to create error-plagued Wikipedia articles in obscure languages. What happens when AI models get trained on junk pages?

‪@mxp@mastodon.acm.org‬
2025-10-13 20:30:21

@… Regarding quotes, I'd add that in Germany and Austria, guillemets are an alternative to „…“ and are used like this: »…«
In Switzerland, only «…» are used for all national languages, but they are only spaced in French.

@arXiv_csAI_bot@mastoxiv.page
2025-10-15 09:53:21

Tensor Logic: The Language of AI
Pedro Domingos
https://arxiv.org/abs/2510.12269 https://arxiv.org/pdf/2510.12269…

Tensor Logic: The Language of AI
Progress in AI is hindered by the lack of a programming language with all the requisite features. Libraries like PyTorch and TensorFlow provide automatic differentiation and efficient GPU implementation, but are additions to Python, which was never intended for AI. Their lack of support for automated reasoning and knowledge acquisition has led to a long and costly series of hacky attempts to tack them on. On the other hand, AI languages like LISP and Prolog lack scalability and support for learn…

@smurthys@hachyderm.io
2025-11-20 08:24:32

"Like herding cats" in the English world (and maybe elsewhere).
"Like weighing frogs" in Kannada (and maybe other languages in India).
#saying #English #Kannada #India #impossibleThings #coordination

@vrandecic@mas.to
2025-09-25 08:44:30

Mixing languages can be confusing
#linguistics #languages #language

A cookie jar for selling cookies labeled with "American Cookie hell - Stück 1,90€"

@lysander07@sigmoid.social
2025-09-27 11:59:19

Interesting way to represent uncertain or vague information in #knowledgegraphs (e.g. for easier integration of LLM/Deep Learning results into KGs) via "Fuzzy OWL". Paper by Fernando Bobillo & Umberto Straccia: Fuzzy Ontology Representation using OWL 2

Fuzzy Ontology Representation using OWL 2
The need to deal with vague information in Semantic Web languages is rising in importance and, thus, calls for a standard way to represent such information. We may address this issue by either extending current Semantic Web languages to cope with vagueness, or by providing a procedure to represent such information within current standard languages and tools. In this work, we follow the latter approach, by identifying the syntactic differences that a fuzzy ontology language has to cope with, and…

@frankel@mastodon.top
2025-10-12 18:31:04

In #OOP, objects collaborate. The initial idea of collaboration, first found in Smalltalk, was for object A to send a message to object B. Languages designed later use method calling. In both cases, the same question stands: how does an object reference other objects to reach the desired results?
In this post, I tackle the problem of passing

@yaya@jorts.horse
2025-11-06 07:02:13

I gotta step up my Irish learning so we can do some irish anarchism https://todon.eu/@CrimethInc/115501256616016837

CrimethInc. Ex-Workers (@CrimethInc@todon.eu)
Attached: 1 image We've published our first translation in Gallego (Galician), bringing the total number of languages represented on our website to 44. https://crimethinc.com/2025/06/18/epilogo-arredor-do-legalismo If you can help us with translation into any language, please get in touch! You can find a comprehensive list of all our work arranged by language here: https://crimethinc.com/languages

@Mediagazer@mstdn.social
2025-09-26 14:36:09

How inaccurate AI translations of Wikipedia pages, which AI models use for training, may cause a doom spiral that further marginalizes vulnerable languages (Jacob Judah/MIT Technology Review)
https://www.technologyreview.com/2025/09/25/11240…

How AI and Wikipedia have sent vulnerable languages into a doom spiral
Machine translators have made it easier than ever to create error-plagued Wikipedia articles in obscure languages. What happens when AI models get trained on junk pages?

@patrikja@functional.cafe
2025-10-14 07:04:32

@… on stage: Type Universes as Kripke Worlds
Paulette Koronkevich, William J. Bowman @… https://

Slide with stick figure liking "pure functional languages" and "simple mutable state"

@pavelasamsonov@mastodon.social
2025-12-08 18:49:07

"Speak proper English!" is one of the silliest things one could say.
Proper *English*? Of all languages? Come on.

@arXiv_csSE_bot@mastoxiv.page
2025-10-14 11:21:48

Interoperability From OpenTelemetry to Kieker: Demonstrated as Export from the Astronomy Shop
David Georg Reichelt, Shinhyung Yang, Wilhelm Hasselbring
https://arxiv.org/abs/2510.11179

Interoperability From OpenTelemetry to Kieker: Demonstrated as Export from the Astronomy Shop
The observability framework Kieker provides a range of analysis capabilities, but it is currently only able to instrument a smaller selection of languages and technologies, including Java, C, Fortran, and Python. The OpenTelemetry standard aims for providing reference implementations for most programming languages, including C# and JavaScript, that are currently not supported by Kieker. In this work, we describe how to transform OpenTelemetry tracing data into the Kieker framework. Thereby, it …

@arXiv_csPL_bot@mastoxiv.page
2025-10-15 12:25:58

Replaced article(s) found for cs.PL. https://arxiv.org/list/cs.PL/new
[1/1]:
- Incremental Computation: What Is the Essence?
Yanhong A. Liu
https://

@netzschleuder@social.skewed.de
2025-12-11 02:00:04

unicodelang: Languages spoken by country (2015)
A bipartite network of languages and the countries in which they are spoken, as estimated by Unicode. Edges are weighted by the proportion of the given country's population that is literate in a particular language.
This network has 868 nodes and 1255 edges.
Tags: Informational, Relatedness, Weighted

unicodelang — Languages spoken by country (2015)
A bipartite network of languages and the countries in which they are spoken, as estimated by Unicode. Edges are weighted by the proportion of the given country's population that is literate in a particular language.

@fanf@mendeddrum.org
2025-10-14 11:42:03

from my link log —
A C to Brainfuck compiler written in Rust.
https://iacgm.pages.dev/posts/c2bf/
saved 2025-10-13 https://dotat.at/:/OEPJB.html

C? Rewrite it in Brainfuck.
Doing Things Worst. People always seem to want to do things well, and when they fail, they tend to blame their tools. So it should come as no surprise that programmers, being somewhat similar to people (and being generally bad at what they do), have a long tradition of growing near-religious zeal for editors, paradigms, code styles, and, of course, programming languages. The bickering never ends, and whatever one person preaches, another considers harmful.

@sascha_wolfer@fediscience.org
2025-10-10 06:05:01

Out now in PNAS: Statistical errors undermine claims about the evolution of #polysynthetic #languages by Alex Koplenig and me: #linguistics 🧶 coming up...

@cdarwin@c.im
2025-11-29 03:43:11

Both Quikscript and Shavian were essentially the results of a design competition,
sponsored--posthumously--by playwright George Bernard Shaw,
who laid out the terms in his will.
Shaw wanted someone to create an ideal phonetic alphabet for English that trumped Pitman shorthand.
British designer Ronald Kingsley Read, a finalist in the 1960s competition, designed both Quikscript and Shavian, the latter being named in Shaw's honor

Here are 50 Different Written Languages. Can You Tell Which are Fake? - Core77
Below are examples of 50 different written languages. Unless you're Indiana Jones, I doubt you'll recognize more than a handful. However, of these 50 scripts, five of them are contrived 20th Century creations. Two of them are shorthand-style phonetic alphabets designed for English; one of them is an ideographic writing

@arXiv_csCL_bot@mastoxiv.page
2025-10-13 10:41:50

A Comprehensive Evaluation of Multilingual Chain-of-Thought Reasoning: Performance, Consistency, and Faithfulness Across Languages
Raoyuan Zhao, Yihong Liu, Hinrich Schütze, Michael A. Hedderich
https://arxiv.org/abs/2510.09555

A Comprehensive Evaluation of Multilingual Chain-of-Thought Reasoning: Performance, Consistency, and Faithfulness Across Languages
Large reasoning models (LRMs) increasingly rely on step-by-step Chain-of-Thought (CoT) reasoning to improve task performance, particularly in high-resource languages such as English. While recent work has examined final-answer accuracy in multilingual settings, the thinking traces themselves, i.e., the intermediate steps that lead to the final answer, remain underexplored. In this paper, we present the first comprehensive study of multilingual CoT reasoning, evaluating three key dimensions: per…

@arXiv_csFL_bot@mastoxiv.page
2025-10-15 12:23:37

Replaced article(s) found for cs.FL. https://arxiv.org/list/cs.FL/new
[1/1]:
- Can ChatGPT support software verification?
Christian Janßen, Cedric Richter, Heike Wehrheim

@arXiv_csPL_bot@mastoxiv.page
2025-10-15 11:06:49

Crosslisted article(s) found for cs.PL. https://arxiv.org/list/cs.PL/new
[1/1]:
- Tensor Logic: The Language of AI
Pedro Domingos
https://ar…

@sauer_lauwarm@mastodon.social
2025-12-14 10:21:47

*giggles again*
https://www.instagram.com/reel/DSKoD3wiAF6/?utm_source=ig_web_copy_link&igsh=NTc4MTIwNjQ2YQ==

ISTB University of Vienna on Instagram: "We are deeply honoured and delighted that the South Asian, Tibetan, and Buddhist Studies Library has been selected as one of the distinguished institutions to receive the eighty-volume commemorative edition of the Tipitaka, published in Thailand in 2016 to mark the seventieth anniversary of His Majesty King Bhumibol’s accession to the throne. The Thai monarchy has long upheld a well-established tradition of commissioning, presenting, and receiving editions of the Pali Canon. In 1893, King Rama V commissioned the first printed edition of the Tipitaka in Thailand, which was subsequently presented as a gift to institutions in more than twenty-five countries. The 40-volume “King Bhumibol Edition” allows monks and Buddhist laity worldwide to chant the Tipiṭaka in a consistent, rule-based manner. It is accompanied by the 40-volume “Queen Sirikit Edition” which reproduces King Rama V’s use of Syām-Pāli annotation with additional notes. The ISTB library provides an ideal home for this new edition of the Tipitaka. With a collection of more than 70,000 volumes in over ninety Asian languages, it serves as a vital centre for research and teaching in South Asian, Tibetan, and Buddhist Studies at the University of Vienna."
Posted by istb_univienna on December 12, 2025.

@Techmeme@techhub.social
2025-12-12 18:56:24

Google expands Google Translate's live speech translation from Pixel Buds to any headphones, supporting 70 languages, in beta on compatible Android phones (Stevie Bonifield/The Verge)
https://www.theverge.com/news/843483/google-translate-live-sp…

Google Translate brings real-time speech translations to any headphones
Google Translate will now let users hear live speech translations in any headphones on its Android app, expanding a feature that was once only available with Pixel Buds.

@arXiv_csCL_bot@mastoxiv.page
2025-10-07 12:07:42

How I Built ASR for Endangered Languages with a Spoken Dictionary
Christopher Bartley, Anton Ragni
https://arxiv.org/abs/2510.04832 https://arxiv.org/pdf/2…

How I Built ASR for Endangered Languages with a Spoken Dictionary
Nearly half of the world's languages are endangered. Speech technologies such as Automatic Speech Recognition (ASR) are central to revival efforts, yet most languages remain unsupported because standard pipelines expect utterance-level supervised data. Speech data often exist for endangered languages but rarely match these formats. Manx Gaelic ($\sim$2,200 speakers), for example, has had transcribed speech since 1948, yet remains unsupported by modern systems. In this paper, we explore how litt…

@kubikpixel@chaos.social
2025-12-10 06:05:32

»Introduction to CSS if() Statements and Conditional Logic«
After a long wait, CSS will probably gain a way to express conditional logic. It's not a programming language, which makes this all the more exciting.
🖌️ https://markodenic.com/introduction-to

Introduction to CSS if Statements and Conditional Logic
Conditional logic is a familiar concept to anyone who has written a programming language. Languages like JavaScript or Python use if/else statements to evaluate expressions and execute different blocks of code depending on whether the condition is true or false.

@sascha_wolfer@fediscience.org
2025-10-10 06:06:17

Finally, what Xia & Lindell call a "separation problem" is, in our view, a feature of our approach and not a bug.
If, e.g., all languages in a family are polysynthetic (or none are), that's not a statistical artefact; it's the signal. The outcome is well associated with genealogy, showing that family membership captures something genuinely informative about the process. When the model finds that family explains a large share of the variance, that's not a failure; it's evidence that phylogenetic structure dominates the pattern.
So while Xia & Lindell insist that "autocorrelation due to relationships and distance cannot be captured in family or regional-level analyses", we see that as an empirical question – and we treated it as one.
The real test is whether a mixed model that explicitly represents phylogeny and geography performs worse than their alternative, where the entire shared history of languages and environments is effectively collapsed into a single dimension (an eigenvector).
In other words: we model relationships – Xia & Lindell summarise them into one number per language.

@netzschleuder@social.skewed.de
2025-11-07 20:00:03

unicodelang: Languages spoken by country (2015)
A bipartite network of languages and the countries in which they are spoken, as estimated by Unicode. Edges are weighted by the proportion of the given country's population that is literate in a particular language.
This network has 868 nodes and 1255 edges.
Tags: Informational, Relatedness, Weighted

unicodelang — Languages spoken by country (2015)
A bipartite network of languages and the countries in which they are spoken, as estimated by Unicode. Edges are weighted by the proportion of the given country's population that is literate in a particular language.
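
As a rough illustration of the structure described above, a weighted bipartite network can be held as two lookup tables, one per side. This is only a sketch: the language and country codes, the weights, and the helper names `node_count` and `strongest_language` are all hypothetical and are not taken from the unicodelang dataset.

```python
from collections import defaultdict

# Hypothetical sample edges: (language, country, literate share of population).
# The values are made up for illustration and are not from the dataset.
edges = [
    ("en", "US", 0.95),
    ("es", "US", 0.10),
    ("es", "MX", 0.90),
    ("en", "IN", 0.12),
]

# Bipartite structure: every edge joins one language node to one country node.
countries_by_language = defaultdict(dict)
languages_by_country = defaultdict(dict)
for language, country, weight in edges:
    countries_by_language[language][country] = weight
    languages_by_country[country][language] = weight


def node_count():
    """Total number of nodes across both sides of the network."""
    return len(countries_by_language) + len(languages_by_country)


def strongest_language(country):
    """Language with the highest edge weight for the given country."""
    weights = languages_by_country[country]
    return max(weights, key=weights.get)
```

Keeping one table per side makes both directions of lookup cheap, at the cost of storing each weight twice.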

@arXiv_csFL_bot@mastoxiv.page
2025-10-15 10:58:03

Crosslisted article(s) found for cs.FL. https://arxiv.org/list/cs.FL/new
[1/1]:
- Flavors of Quantifiers in Hyperlogics
Marek Chalupa, Thomas A. Henzinger, Ana Oliveira da Costa

@fanf@mendeddrum.org
2025-11-03 21:42:03

from my link log —
Control structures in programming languages: from goto to algebraic effects.
http://xavierleroy.org/control-structures/
saved 2025-11-03

Control structures in programming languages
Xavier Leroy

@arXiv_csCL_bot@mastoxiv.page
2025-10-15 10:24:41

Tracing Multilingual Knowledge Acquisition Dynamics in Domain Adaptation: A Case Study of English-Japanese Biomedical Adaptation
Xin Zhao, Naoki Yoshinaga, Yuma Tsuta, Akiko Aizawa
https://arxiv.org/abs/2510.12115

Tracing Multilingual Knowledge Acquisition Dynamics in Domain Adaptation: A Case Study of English-Japanese Biomedical Adaptation
Multilingual domain adaptation (ML-DA) is widely used to learn new domain knowledge across languages into large language models (LLMs). Although many methods have been proposed to improve domain adaptation, the mechanisms of multilingual knowledge acquisition, how domain knowledge is learned within a language and transferred across languages, remain underexplored. This gap leads to suboptimal performance, particularly in low-resource settings. This work examines the learning dynamics of LLMs du…

@Techmeme@techhub.social
2025-11-03 09:45:36

Researchers find OpenAI's o1 can analyze languages like a human expert, including inferring the phonological rules of made-up languages without prior knowledge (Steve Nadis/Quanta Magazine)
https://www.quantamagazine.org/in-a-first-

In a First, AI Models Analyze Language As Well As a Human Expert | Quanta Magazine
If language is what makes us human, what does it mean now that large language models have gained “metalinguistic” abilities?

@netzschleuder@social.skewed.de
2025-11-06 22:00:04

unicodelang: Languages spoken by country (2015)
A bipartite network of languages and the countries in which they are spoken, as estimated by Unicode. Edges are weighted by the proportion of the given country's population that is literate in a particular language.
This network has 868 nodes and 1255 edges.
Tags: Informational, Relatedness, Weighted

unicodelang — Languages spoken by country (2015)
A bipartite network of languages and the countries in which they are spoken, as estimated by Unicode. Edges are weighted by the proportion of the given country's population that is literate in a particular language.

@arXiv_csSE_bot@mastoxiv.page
2025-10-15 09:58:22

DarTwin made precise by SysMLv2 -- An Experiment
Øystein Haugen, Stefan Klikovits, Martin Arthur Andersen, Jonathan Beaulieu, Francis Bordeleau, Joachim Denil, Joost Mertens
https://arxiv.org/abs/2510.12478

DarTwin made precise by SysMLv2 -- An Experiment
The new SysMLv2 adds mechanisms for the built-in specification of domain-specific concepts and language extensions. This feature promises to facilitate the creation of Domain-Specific Languages (DSLs) and interfacing with existing system descriptions and technical designs. In this paper, we review these features and evaluate SysMLv2's capabilities using concrete use cases. We develop DarTwin DSL, a DSL that formalizes the existing DarTwin notation for Digital Twin (DT) evolution, through SysMLv…

@arXiv_csPL_bot@mastoxiv.page
2025-10-15 07:35:41

[2025-10-15 Wed (UTC), 5 new articles found for cs.PL Programming Languages]

@arXiv_csCL_bot@mastoxiv.page
2025-10-06 10:21:59

Model-Based Ranking of Source Languages for Zero-Shot Cross-Lingual Transfer
Abteen Ebrahimi, Adam Wiemerslage, Katharina von der Wense
https://arxiv.org/abs/2510.03202 https://…

Model-Based Ranking of Source Languages for Zero-Shot Cross-Lingual Transfer
We present NN-Rank, an algorithm for ranking source languages for cross-lingual transfer, which leverages hidden representations from multilingual models and unlabeled target-language data. We experiment with two pretrained multilingual models and two tasks: part-of-speech tagging (POS) and named entity recognition (NER). We consider 51 source languages and evaluate on 56 and 72 target languages for POS and NER, respectively. When using in-domain data, NN-Rank beats state-of-the-art baselines t…

@netzschleuder@social.skewed.de
2025-12-06 09:00:04

unicodelang: Languages spoken by country (2015)
A bipartite network of languages and the countries in which they are spoken, as estimated by Unicode. Edges are weighted by the proportion of the given country's population that is literate in a particular language.
This network has 868 nodes and 1255 edges.
Tags: Informational, Relatedness, Weighted

unicodelang — Languages spoken by country (2015)
A bipartite network of languages and the countries in which they are spoken, as estimated by Unicode. Edges are weighted by the proportion of the given country's population that is literate in a particular language.

@arXiv_csCL_bot@mastoxiv.page
2025-10-09 10:40:01

Revisiting Metric Reliability for Fine-grained Evaluation of Machine Translation and Summarization in Indian Languages
Amir Hossein Yari, Kalmit Kulkarni, Ahmad Raza Khan, Fajri Koto
https://arxiv.org/abs/2510.07061

Revisiting Metric Reliability for Fine-grained Evaluation of Machine Translation and Summarization in Indian Languages
While automatic metrics drive progress in Machine Translation (MT) and Text Summarization (TS), existing metrics have been developed and validated almost exclusively for English and other high-resource languages. This narrow focus leaves Indian languages, spoken by over 1.5 billion people, largely overlooked, casting doubt on the universality of current evaluation practices. To address this gap, we introduce ITEM, a large-scale benchmark that systematically evaluates the alignment of 26 automat…

@arXiv_csFL_bot@mastoxiv.page
2025-10-14 17:33:23

Replaced article(s) found for cs.FL. https://arxiv.org/list/cs.FL/new
[1/1]:
- Mathematical Approach in Automata and Automata Association
Sergio Henrique Maciel
h…

@Techmeme@techhub.social
2025-11-11 13:01:35

Samsung rolls out its Vision AI Companion, a generative AI-powered upgrade to its Bixby assistant, across its 2025 TV lineup, with support for 10 languages (Dominic Preston/The Verge)
https://www.theverge.com/news/818355/samsung-tvs-bixby-generative-ai-con…

Samsung brings a generative AI-powered Bixby to its TVs
Samsung Vision AI Companion will let you ask Bixby questions about what’s on screen in real time, powered by Copilot and Perplexity.

@arXiv_csFL_bot@mastoxiv.page
2025-10-14 13:52:24

Crosslisted article(s) found for cs.FL. https://arxiv.org/list/cs.FL/new
[1/1]:
- Abstract String Domain Defined with Word Equations as a Reduced Product (Extended Version)
Antonina Nepeivoda, Ilya Afanasyev

@Techmeme@techhub.social
2025-11-10 23:45:44

Meta introduces Omnilingual Automatic Speech Recognition, a suite of AI models providing automatic speech recognition capabilities for more than 1,600 languages (Carl Franzen/VentureBeat)
https://venturebeat.com/ai/meta-returns-to-open-source-ai…

@arXiv_csPL_bot@mastoxiv.page
2025-09-25 08:31:12

Macro-embedding Compiler Intermediate Languages in Racket
William J. Bowman
https://arxiv.org/abs/2509.19607 https://arxiv.org/pdf/2509.19607

Macro-embedding Compiler Intermediate Languages in Racket
We present the design and implementation of a macro-embedding of a family of compiler intermediate languages, from a Scheme-like language to x86-64, into Racket. This embedding is used as part of a testing framework for a compilers course to derive interpreters for all the intermediate languages. The embedding implements features including safe, functional abstractions as well as unsafe assembly features, and the interactions between the two at various intermediate stages. This paper aims to …

@arXiv_csCL_bot@mastoxiv.page
2025-10-02 10:48:21

Automatic Speech Recognition (ASR) for African Low-Resource Languages: A Systematic Literature Review
Sukairaj Hafiz Imam, Tadesse Destaw Belay, Kedir Yassin Husse, Ibrahim Said Ahmad, Idris Abdulmumin, Hadiza Ali Umar, Muhammad Yahuza Bello, Joyce Nakatumba-Nabende, Seid Muhie Yimam, Shamsuddeen Hassan Muhammad
https://arxiv.org/abs/2510.…

Automatic Speech Recognition (ASR) for African Low-Resource Languages: A Systematic Literature Review
ASR has achieved remarkable global progress, yet African low-resource languages remain rigorously underrepresented, producing barriers to digital inclusion across the continent with more than 2,000 languages. This systematic literature review (SLR) explores research on ASR for African languages with a focus on datasets, models and training methods, evaluation techniques, challenges, and recommends future directions. We employ the PRISMA 2020 procedures and search DBLP, ACM Digital Library, Goog…

@arXiv_csFL_bot@mastoxiv.page
2025-10-10 07:35:28

Languages of Words of Low Automatic Complexity Are Hard to Compute
Joey Chen, Bjørn Kjos-Hanssen, Ivan Koswara, Linus Richter, Frank Stephan
https://arxiv.org/abs/2510.07696 …

Languages of Words of Low Automatic Complexity Are Hard to Compute
The automatic complexity of a finite word (string) is an analogue for finite automata of Sipser's distinguishing complexity (1983) and was introduced by Shallit and Wang (2001). For a finite alphabet $Σ$ of at least two elements, we consider the non-deterministic automatic complexity given by exactly - yet not necessarily uniquely - accepting automata: a word $x \in Σ^*$ has exact non-deterministic automatic complexity $k \in \mathbb{N}$ if there exists a non-deterministic automaton of $k$ st…

@arXiv_csCL_bot@mastoxiv.page
2025-10-09 10:35:51

Pragyaan: Designing and Curating High-Quality Cultural Post-Training Datasets for Indian Languages
Neel Prabhanjan Rachamalla, Aravind Konakalla, Gautam Rajeev, Ashish Kulkarni, Chandra Khatri, Shubham Agarwal
https://arxiv.org/abs/2510.07000

Pragyaan: Designing and Curating High-Quality Cultural Post-Training Datasets for Indian Languages
The effectiveness of Large Language Models (LLMs) depends heavily on the availability of high-quality post-training data, particularly instruction-tuning and preference-based examples. Existing open-source datasets, however, often lack multilingual coverage, cultural grounding, and suffer from task diversity gaps that are especially pronounced for Indian languages. We introduce a human-in-the-loop pipeline that combines translations with synthetic expansion to produce reliable and diverse Indic…

@netzschleuder@social.skewed.de
2025-09-26 18:00:04

unicodelang: Languages spoken by country (2015)
A bipartite network of languages and the countries in which they are spoken, as estimated by Unicode. Edges are weighted by the proportion of the given country's population that is literate in a particular language.
This network has 868 nodes and 1255 edges.
Tags: Informational, Relatedness, Weighted

unicodelang — Languages spoken by country (2015)
A bipartite network of languages and the countries in which they are spoken, as estimated by Unicode. Edges are weighted by the proportion of the given country's population that is literate in a particular language.

@arXiv_csFL_bot@mastoxiv.page
2025-10-15 07:34:01

[2025-10-15 Wed (UTC), 1 new article found for cs.FL Formal Languages and Automata Theory]

@netzschleuder@social.skewed.de
2025-10-10 08:00:03

word_adjacency: Word Adjacency Networks
Directed Networks of word adjacency in texts of several languages including English, French, Spanish and Japanese.
This network has 11586 nodes and 45129 edges.
Tags: Informational, Language, Unweighted
https://networks.skewed.de/net/word_ad

word_adjacency: Word Adjacency Networks. 11586 nodes, 45129 edges. https://networks.skewed.de/net/word_adjacency#spanish

word_adjacency — Word Adjacency Networks
Directed Networks of word adjacency in texts of several languages including English, French, Spanish and Japanese

@arXiv_csCL_bot@mastoxiv.page
2025-10-02 10:28:51

EuroSpeech: A Multilingual Speech Corpus
Samuel Pfisterer, Florian Grötschla, Luca A. Lanzendörfer, Florian Yan, Roger Wattenhofer
https://arxiv.org/abs/2510.00514

EuroSpeech: A Multilingual Speech Corpus
Recent progress in speech processing has highlighted that high-quality performance across languages requires substantial training data for each individual language. While existing multilingual datasets cover many languages, they often contain insufficient data for most languages. Thus, trained models perform poorly on the majority of the supported languages. Our work addresses this challenge by introducing a scalable pipeline for constructing speech datasets from parliamentary recordings. The p…

@arXiv_csFL_bot@mastoxiv.page
2025-10-13 12:42:54

Replaced article(s) found for cs.FL. https://arxiv.org/list/cs.FL/new
[1/1]:
- Parameterized Verification of Timed Networks with Clock Invariants
Étienne André, Swen Jacobs, Shyam Lal Karra, Ocan Sankur

@netzschleuder@social.skewed.de
2025-11-11 02:00:05

wikipedia_link: Wikipedia links (2016)
Networks of hyperlinks among articles on Wikipedia, for all available languages. A directed edge (i,j) indicates that article i hyperlinks to j.
This network has 25250 nodes and 698864 edges.
Tags: Informational, Web graph, Unweighted
https://networks.skewed.de/net…

wikipedia_link: Wikipedia links (2016). 25250 nodes, 698864 edges. https://networks.skewed.de/net/wikipedia_link#yi

wikipedia_link — Wikipedia links (2016)
Networks of hyperlinks among articles on Wikipedia, for all available languages. A directed edge (i,j) indicates that article i hyperlinks to j.
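
The edge convention above, where a directed edge (i,j) means article i hyperlinks to j, can be sketched with two adjacency maps. The article labels and the helper names `out_degree` and `in_degree` below are hypothetical; the real dataset uses its own node identifiers and file format.

```python
from collections import defaultdict

# Hypothetical articles and links; (i, j) means article i hyperlinks to j.
links = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")]

out_links = defaultdict(set)  # i -> articles that i links to
in_links = defaultdict(set)   # j -> articles that link to j
for i, j in links:
    out_links[i].add(j)
    in_links[j].add(i)


def out_degree(article):
    """Number of distinct articles this article links to."""
    return len(out_links.get(article, ()))


def in_degree(article):
    """Number of distinct articles that link to this article."""
    return len(in_links.get(article, ()))
```

Because the edges are directed, a link from A to B says nothing about a link from B to A, so the two maps generally differ.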

@arXiv_csFL_bot@mastoxiv.page
2025-10-13 10:57:07

Crosslisted article(s) found for cs.FL. https://arxiv.org/list/cs.FL/new
[1/1]:
- Psi-Turing Machines: Bounded Introspection for Complexity Barriers and Oracle Separations
Rafig Huseynzade

@netzschleuder@social.skewed.de
2025-09-22 16:00:04

unicodelang: Languages spoken by country (2015)
A bipartite network of languages and the countries in which they are spoken, as estimated by Unicode. Edges are weighted by the proportion of the given country's population that is literate in a particular language.
This network has 868 nodes and 1255 edges.
Tags: Informational, Relatedness, Weighted

unicodelang — Languages spoken by country (2015)
A bipartite network of languages and the countries in which they are spoken, as estimated by Unicode. Edges are weighted by the proportion of the given country's population that is literate in a particular language.

@arXiv_csCL_bot@mastoxiv.page
2025-10-15 10:37:41

Tokenization Disparities as Infrastructure Bias: How Subword Systems Create Inequities in LLM Access and Efficiency
Hailay Kidu Teklehaymanot, Wolfgang Nejdl
https://arxiv.org/abs/2510.12389

Tokenization Disparities as Infrastructure Bias: How Subword Systems Create Inequities in LLM Access and Efficiency
Tokenization disparities pose a significant barrier to achieving equitable access to artificial intelligence across linguistically diverse populations. This study conducts a large-scale cross-linguistic evaluation of tokenization efficiency in over 200 languages to systematically quantify computational inequities in large language models (LLMs). Using a standardized experimental framework, we applied consistent preprocessing and normalization protocols, followed by uniform tokenization through …

@arXiv_csCL_bot@mastoxiv.page
2025-10-15 10:38:31

Resource-sensitive but language-blind: Community size and not grammatical complexity better predicts the accuracy of Large Language Models in a novel Wug Test
Nikoleta Pantelidou, Evelina Leivada, Paolo Morosi
https://arxiv.org/abs/2510.12463

Resource-sensitive but language-blind: Community size and not grammatical complexity better predicts the accuracy of Large Language Models in a novel Wug Test
The linguistic abilities of Large Language Models are a matter of ongoing debate. This study contributes to this discussion by investigating model performance in a morphological generalization task that involves novel words. Using a multilingual adaptation of the Wug Test, six models were tested across four partially unrelated languages (Catalan, English, Greek, and Spanish) and compared with human speakers. The aim is to determine whether model accuracy approximates human competence and whethe…

@netzschleuder@social.skewed.de
2025-11-08 13:00:04

wikipedia_link: Wikipedia links (2016)
Networks of hyperlinks among articles on Wikipedia, for all available languages. A directed edge (i,j) indicates that article i hyperlinks to j.
This network has 9189 nodes and 176051 edges.
Tags: Informational, Web graph, Unweighted
https://networks.skewed.de/net…

wikipedia_link: Wikipedia links (2016). 9189 nodes, 176051 edges. https://networks.skewed.de/net/wikipedia_link#gan

wikipedia_link — Wikipedia links (2016)
Networks of hyperlinks among articles on Wikipedia, for all available languages. A directed edge (i,j) indicates that article i hyperlinks to j.

@arXiv_csCL_bot@mastoxiv.page
2025-09-26 10:18:11

The role of synthetic data in Multilingual, Multi-cultural AI systems: Lessons from Indic Languages
Pranjal A. Chitale, Varun Gumma, Sanchit Ahuja, Prashant Kodali, Manan Uppadhyay, Deepthi Sudharsan, Sunayana Sitaram
https://arxiv.org/abs/2509.21294

The role of synthetic data in Multilingual, Multi-cultural AI systems: Lessons from Indic Languages
Developing AI systems that operate effectively across languages while remaining culturally grounded is a long-standing challenge, particularly in low-resource settings. Synthetic data provides a promising avenue, yet its effectiveness in multilingual and multicultural contexts remains underexplored. We investigate the creation and impact of synthetic, culturally contextualized datasets for Indian languages through a bottom-up generation strategy that prompts large open-source LLMs (>= 235B para…

@netzschleuder@social.skewed.de
2025-10-30 21:00:04

unicodelang: Languages spoken by country (2015)
A bipartite network of languages and the countries in which they are spoken, as estimated by Unicode. Edges are weighted by the proportion of the given country's population that is literate in a particular language.
This network has 868 nodes and 1255 edges.
Tags: Informational, Relatedness, Weighted

unicodelang — Languages spoken by country (2015)
A bipartite network of languages and the countries in which they are spoken, as estimated by Unicode. Edges are weighted by the proportion of the given country's population that is literate in a particular language.

@arXiv_csCL_bot@mastoxiv.page
2025-10-01 11:37:47

MENLO: From Preferences to Proficiency - Evaluating and Modeling Native-like Quality Across 47 Languages
Chenxi Whitehouse, Sebastian Ruder, Tony Lin, Oksana Kurylo, Haruka Takagi, Janice Lam, Nicolò Busetto, Denise Diaz
https://arxiv.org/abs/2509.26601

MENLO: From Preferences to Proficiency - Evaluating and Modeling Native-like Quality Across 47 Languages
Ensuring native-like quality of large language model (LLM) responses across many languages is challenging. To address this, we introduce MENLO, a framework that operationalizes the evaluation of native-like response quality based on audience design-inspired mechanisms. Using MENLO, we create a dataset of 6,423 human-annotated prompt-response preference pairs covering four quality dimensions with high inter-annotator agreement in 47 language varieties. Our evaluation reveals that zero-shot LLM j…

@arXiv_csCL_bot@mastoxiv.page
2025-09-25 10:38:52

CorIL: Towards Enriching Indian Language to Indian Language Parallel Corpora and Machine Translation Systems
Soham Bhattacharjee, Mukund K Roy, Yathish Poojary, Bhargav Dave, Mihir Raj, Vandan Mujadia, Baban Gain, Pruthwik Mishra, Arafat Ahsan, Parameswari Krishnamurthy, Ashwath Rao, Gurpreet Singh Josan, Preeti Dubey, Aadil Amin Kak, Anna Rao Kulkarni, Narendra VG, Sunita Arora, Rakesh Balbantray, Prasenjit Majumdar, Karunesh K Arora, Asif Ekbal, Dipti Mishra Sharma

CorIL: Towards Enriching Indian Language to Indian Language Parallel Corpora and Machine Translation Systems
India's linguistic landscape is one of the most diverse in the world, comprising over 120 major languages and approximately 1,600 additional languages, with 22 officially recognized as scheduled languages in the Indian Constitution. Despite recent progress in multilingual neural machine translation (NMT), high-quality parallel corpora for Indian languages remain scarce, especially across varied domains. In this paper, we introduce a large-scale, high-quality annotated parallel corpus covering 1…
