
Harnessing LLMs for Document-Guided Fuzzing of OpenCV Library
Bin Duan, Tarek Mahmud, Meiru Che, Yan Yan, Naipeng Dong, Dan Dongseong Kim, Guowei Yang
https://arxiv.org/abs/2507.14558
So, @… is working on using LLMs to process XML. Except, the models can't write legal XML. So he's using the model to generate a sloppy-XML parser: https://lucumr.pocoo.org/202…
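For a sense of what "sloppy-XML" handling means in practice, here's a minimal sketch (my illustration, not the approach from the linked post) using Python's lenient stdlib HTMLParser, which keeps going where a strict XML parser would raise on the first illegal tag:

    from html.parser import HTMLParser

    class SloppyXML(HTMLParser):
        """Collect (tag, text) pairs from input a strict XML parser would reject."""
        def __init__(self):
            super().__init__()
            self.stack, self.items = [], []
        def handle_starttag(self, tag, attrs):
            self.stack.append(tag)
        def handle_endtag(self, tag):
            if tag in self.stack:            # tolerate mismatched or missing close tags
                self.stack.pop()
        def handle_data(self, data):
            if self.stack and data.strip():
                self.items.append((self.stack[-1], data.strip()))

    p = SloppyXML()
    p.feed("<result><name>Ada<name><score>3</result>")  # illegal XML, parses anyway
    print(p.items)  # [('name', 'Ada'), ('score', '3')]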
I randomly bought this book in a quirky bookshop in Copenhagen for the sole reason that it said all the wrong things right on the cover.
(Sales: the single most important profession. NLP™: not natural language processing but neuro-linguistic programming. Meta: the Meta Model™ and Meta Publications™.)
I just started reading it and boy oh boy, I was not disappointed. It's outrageously hilarious.
"Persuasion engineering".
Cactus Flowers. Huntington Library, San Marino, California, USA. June, 2025. #huntingtonlibrary #cactüs #cactusflower
FAU University Press: Now in the top catalogs for open access publications https://ub.fau.de/en/2025/06/17/fau-university-press-now-in-the-top-catalogs-for-open-access-publications/
"How to Become an Integrity Sleuth in the Library"
https://katinamagazine.org/content/article/future-of-work/2025/how-to-become-an-integrity-sleuth-in-the-library
"Open access agreement management c…
#DH2025 Listening to Victoria and Thea on 'Building a FAIR data future at the Journal of Open Humanities' - I'm hoping you'll see a lot more British Library data papers over time, as along with datasheets for datasets it's a big part of making our open collections findable and usable
»Belgian court orders block of the Internet Archive's Open Library:
A Brussels court has issued a very broad website-blocking order. It targets the Open Library as well as shadow libraries such as Z-Library«
An archive, whatever its technical form, is important and has nothing to do with data theft. Unfortunately, though, commerce often views it as such.
🤨
Open Letter to CRL from the academic wing of #CripLib - ACRLog
https://acrlog.org/2025/05/2…
I learned¹ about the Baldwin Library of Historical Children's Literature², which has more than 10,000 books scanned and available online. Just great.
It is hosted by the University of Florida. So let's hope that it stays available, i.e. that the Republicans don't find the old children's books from 1750 too woke.³
__
¹via
A few more of today's most frequently shared #News:
Belgian court orders block of the Internet Archive's Open Library
PGLib-CO2: A Power Grid Library for Computing and Optimizing Carbon Emissions
Young-ho Cho, Min-Seung Ko, Hao Zhu
https://arxiv.org/abs/2506.14662 https://…
EcBot: Data-Driven Energy Consumption Open-Source MATLAB Library for Manipulators
Juan Heredia, Christian Schlette, Mikkel Baun Kjærgaard
https://arxiv.org/abs/2508.06276 ht…
A new Dune grid for scalable dynamic adaptivity based on the p4est software library
Carsten Burstedde, Mikhail Kirilin, Robert Klöfkorn
https://arxiv.org/abs/2507.11386
Your library only has compelling stories on the inside.
#library #libraries
Some of the most frequently shared #News here recently:
Belgian court orders block of the Internet Archive's Open Library
If you get an invite to this generative art software engineering call, note that if you submit something and it gets accepted, as far as I can tell it would cost you $3000 in open access fees... unless you want it to languish behind a paywall (you'd then only be allowed to share an unedited draft, and even then would have to advertise the paywall on it). They don't seem to want to make this clear in their call.
OpenLB-UQ: An Uncertainty Quantification Framework for Incompressible Fluid Flow Simulations
Mingliang Zhong, Adrian Kummerländer, Shota Ito, Mathias J. Krause, Martin Frank, Stephan Simonis
https://arxiv.org/abs/2508.13867
Is anyone looking for good first-timer OSS contributor issues? Crell/Serde has a few tagged "good first issue" if you're interested.
https://github.com/Crell/Serde/issues?q=is:issue state:open label:"good fir…
AI is flooding libraries with generated content just as budgets and staff are at their most precarious. This Thursday at 10am EDT my ASIS&T webinar asks if we need to ban it, label it, absorb it—or rethink the library itself.
https://www.asist.org/meetings-events/webi
FIDESlib: A Fully-Fledged Open-Source FHE Library for Efficient CKKS on GPUs
Carlos Agulló-Domingo (Universidad de Murcia), Óscar Vera-López (Universidad de Murcia), Seyda Guzelhan (Boston University), Lohit Daksha (Boston University), Aymane El Jerari (Northeastern University), Kaustubh Shivdikar (Advanced Micro Devices), Rashmi Agrawal (Boston University), David Kaeli (Northeastern University), Ajay Joshi (Boston University), José L. Abellán (Universidad…
Porous Convection in the Discrete Exterior Calculus with Geometric Multigrid
Luke Morris, George Rauta, Kevin Carlson, James Fairbanks
https://arxiv.org/abs/2508.12501 https://
Cloudflare open sourced an OAuth library mostly written by Claude, showing how AI handles mechanical implementation while humans guide with context and judgment (Max Mitchell)
https://www.maxemitchell.com/writings/i-read-all-of-cloudflares…
"Navigating #openaccess publishing #agreement caps in 2025" https://…
TIL linking likely does not make a program a derivative of a library in the EU, thus making the GPL, LGPL and MPL effectively identical here.
https://interoperable-europe.ec.europa.eu/collection/eupl/news/copyleft-or-reciprocal
OpenSN: An Open Source Library for Emulating LEO Satellite Networks
Wenhao Lu, Zhiyuan Wang, Hefan Zhang, Shan Zhang, Hongbin Luo
https://arxiv.org/abs/2507.03248
To round off the evening, some of today's most frequently shared #News:
Belgian court orders block of the Internet Archive's Open Library
Subtooting since people in the original thread wanted it to be over, but selfishly tagging @… and @… whose opinions I value...
I think that saying "we are not a supply chain" is exactly what open-source maintainers should be doing right now in response to "open source supply chain security" threads.
I can't claim to be an expert and don't maintain any important FOSS stuff, but I do release almost all of my code under open licenses, and I do use many open source libraries, and I have felt the pain of needing to replace an unmaintained library.
There's a certain small-to-mid-scale class of program, including many open-source libraries, which can be built/maintained by a single person, and which to my mind best operates on a "snake growth" model: incremental changes/fixes, punctuated by periodic "skin-shedding" phases where major rewrites or version updates happen. These projects aren't immortal either: as the whole tech landscape around them changes, they become unnecessary and/or people lose interest, so they go unmaintained and eventually break. Each time one of their dependencies breaks (or has a skin-shedding moment) there's a higher probability that they break or shed too, as maintenance needs shoot up at these junctures. Unless you're a company trying to make money from a single long-lived app, it's actually okay that software churns like this, and if you're a company trying to make money, your priorities absolutely should not factor into any decisions people making FOSS software make: we're trying (and to a huge extent succeeding) to make a better world (and/or just have fun with our own hobbies and share that fun with others) that leaves behind the corrosive & planet-destroying plague which is capitalism, and you're trying to personally enrich yourself by embracing that plague. The fact that capitalism is *evil* is not an incidental thing in this discussion.
To make an imperfect analogy, imagine that the peasants of some domain have set up a really-free-market, where they provide each other with free stuff to help each other survive, sometimes doing some barter perhaps but mostly just everyone bringing their surplus. Now imagine the lord of the domain, who is the source of these peasants' immiseration, goes to this market secretly & takes some berries, which he uses as one ingredient in delicious tarts that he then sells for profit. But then the berry-bringer stops showing up to the free market, or starts bringing a different kind of fruit, or even ends up bringing rotten berries by accident. And the lord complains "I have a supply chain problem!" Like, fuck off dude! Your problem is that you *didn't* want to build a supply chain and instead thought you would build your profit-focused business on other people's free stuff. If you were paying the berry-picker, you'd have a supply chain problem, but you weren't, so you really have an "I want more free stuff" problem when you can't be arsed to give away your own stuff for free.
There can be all sorts of problems in the really-free-market, like maybe not enough people bring socks, so the peasants who can't afford socks are going barefoot, and having foot problems, and the peasants put their heads together and see if they can convince someone to start bringing socks, and maybe they can't and things are a bit sad, but the really-free-market was never supposed to solve everyone's problems 100% when they're all still being squeezed dry by their taxes: until they are able to get free of the lord & start building a lovely anarchist society, the really-free-market is a best-effort kind of deal that aims to make things better, and sometimes will fall short. When it becomes the main way goods in society are distributed, and when the people who contribute aren't constantly drained by the feudal yoke, at that point the availability of particular goods is a real problem that needs to be solved, but at that point, it's also much easier to solve. And at *no* point does someone coming into the market to take stuff only to turn around and sell it deserve anything from the market or those contributing to it. They are not a supply chain. They're trying to help each other out, but even then they're doing so freely and without obligation. They might discuss amongst themselves how to better coordinate their mutual aid, but they're not going to end up forcing anyone to bring anything or even expecting that a certain person contribute a certain amount, since the whole point is that the thing is voluntary & free, and they've all got changing life circumstances that affect their contributions. Celebrate whatever shows up at the market, express your desire for things that would be useful, but don't impose a burden on anyone else to bring a specific thing, because otherwise it's fair for them to oppose such a burden on you, and now you two are doing your own barter thing that's outside the parameters of the really-free-market.
I was trying to package #FlexiBLAS for #Gentoo, and to be honest, it doesn't look that good.
The first red flag is the lack of an open bug tracker. Apparently, there is a tracker on GitLab that's limited to "members of their group and selected external contributors", but it doesn't seem to be used much. So it's "send us an email", and you get to wonder how many people sent the same bug report before.
The git repository is currently at something tagged 3.4.80 that seems to be prerelease, and its build system is quite broken. Not exactly the best path to verify that the bugs you are hitting are still there.
Now, upstream seems to insist on either using vendored netlib #LAPACK, or statically linking to the system library (we don't install the static libraries). Apparently I can specify the shared libraries instead, but it doesn't work — and it's unclear to me whether it doesn't work because I'm using the shared libraries, or because it doesn't support my LAPACK version. If I build LAPACK without deprecated symbols, it refuses to load it at runtime because of missing symbols. And if I build it with deprecated symbols, it fails to find some symbols at CMake time.
Honestly, I feel like I've spent too much time on this project already, especially given that its future is entirely unclear to me — the current git is quite broken, I have no clue how many issues were reported already and whether my bug reports will receive any reply. It definitely doesn't bode well for a package that we might start to rely heavily on. We don't want a cathedral there.
https://www.mpi-magdeburg.mpg.de/projects/flexiblas
https://gitlab.mpi-magdeburg.mpg.de/software/flexiblas-release
ATM I don't see any end in sight for me simping for Tailwind. It solves all my problems and doesn't cause any.
Always open to being sold something new, but I've wanted Tailwind since 2017, when I wanted to just use inline CSS instead of whatever CSS library I was using.
I’m trying to help a client pick a good UI framework they can start their product with, but ultimately grow into their own design system and component library. They have started development with React, which isn’t surprising, but they are also open to using a more framework-agnostic approach in the future.
Any suggestions for a really mature and solid, themeable framework as a starting point? Chakra UI? Ark UI? Radix?
Video-Guided Text-to-Music Generation Using Public Domain Movie Collections
Haven Kim, Zachary Novack, Weihan Xu, Julian McAuley, Hao-Wen Dong
https://arxiv.org/abs/2506.12573
Very excited about this! Code to access GRIN will help lots of Google Books partners, and the example might open other doors, as well as the obvious benefits of access to data!
'Institutional Books 1.0: A 242B token dataset from Harvard Library's collections, refined for accuracy and usability' https://arxiv.org/abs/2506…
How Robust are LLM-Generated Library Imports? An Empirical Study using Stack Overflow
Jasmine Latendresse, SayedHassan Khatoonabadi, Emad Shihab
https://arxiv.org/abs/2507.10818
Help wanted: Can we get someone to go through the build/link time dependencies of ngscopeclient, identify every third-party open source library we use, and ensure that they're all credited properly in the documentation, and include/link to the text of the appropriate licenses?
https://github.com/ng…
MRpro - open PyTorch-based MR reconstruction and processing package
Felix Frederik Zimmermann, Patrick Schuenke, Christoph S. Aigner, Bill A. Bernhardt, Mara Guastini, Johannes Hammacher, Noah Jaitner, Andreas Kofler, Leonid Lunin, Stefan Martin, Catarina Redshaw Kranich, Jakob Schattenfroh, David Schote, Yanglei Wu, Christoph Kolbitsch
https://
Fast simulations of continuous-variable circuits using the coherent state decomposition
Olga Solodovnikova, Ulrik L. Andersen, Jonas S. Neergaard-Nielsen
https://arxiv.org/abs/2508.06175
ALPaca: The ALP Automatic Computing Algorithm
Jorge Alda, Marta Fuentes Zamoro, Luca Merlo, Xavier Ponce Díaz, Stefano Rigolin
https://arxiv.org/abs/2508.08354 https://
Met @… at @… event. #CoSocialCa members in the wild.
MultiObjectiveAlgorithms.jl: a Julia package for solving multi-objective optimization problems
Oscar Dowson, Xavier Gandibleux, Gökhan Kof
https://arxiv.org/abs/2507.05501 …
Implementing the finite-volume three-pion scattering formalism across all non-maximal isospins
Athari Alotaibi, Maxwell T. Hansen, Raúl A. Briceño
https://arxiv.org/abs/2508.11627
Spatialize v1.0: A Python/C Library for Ensemble Spatial Interpolation
Alvaro F. Egaña, Alejandro Ehrenfeld, Felipe Garrido, María Jesús Valenzuela, Juan F. Sánchez-Pérez
https://arxiv.org/abs/2507.17867
Cardiotensor: A Python Library for Orientation Analysis and Tractography in 3D Cardiac Imaging
Joseph Brunet, Lisa Chestnutt, Matthieu Chourrout, Hector Dejea, Vaishnavi Sabarigirivasan, Peter D. Lee, Andrew C. Cook
https://arxiv.org/abs/2508.07476
Should we teach vibe coding? Here's why not.
Should AI coding be taught in undergrad CS education?
1/2
I teach undergraduate computer science labs, including for intro and more-advanced core courses. I don't publish (non-negligible) scholarly work in the area, but I've got years of craft expertise in course design, and I do follow the academic literature to some degree. In other words, I'm not the world's leading expert, but I have spent a lot of time thinking about course design, and consider myself competent at it, with plenty of direct experience in what knowledge & skills I can expect from students as they move through the curriculum.
I'm also strongly against most uses of what's called "AI" these days (specifically, generative deep neural networks as supplied by our current cadre of techbros). There are a surprising number of completely orthogonal reasons to oppose the use of these systems, and a very limited number of reasonable exceptions (overcoming accessibility barriers is an example). On the grounds of environmental and digital-commons-pollution costs alone, using specifically the largest/newest models is unethical in most cases.
But as any good teacher should, I constantly question these evaluations, because I worry about the impact on my students should I eschew teaching relevant tech for bad reasons (and even for good reasons). I also want to make my reasoning clear to students, who should absolutely question me on this. That inspired me to ask a simple question: ignoring for one moment the ethical objections (which we shouldn't, of course; they're very stark), at what level in the CS major could I expect to teach a course about programming with AI assistance, and expect students to succeed at a more technically demanding final project than a course at the same level where students were banned from using AI? In other words, at what level would I expect students to actually benefit from AI coding "assistance"?
To be clear, I'm assuming that students aren't using AI in other aspects of coursework: the topic of using AI to "help you study" is a separate one (TL;DR its gross value is not negative, but it's mostly not worth the harm to your metacognitive abilities, which AI-induced changes to the digital commons are making more important than ever).
So what's my answer to this question?
If I'm being incredibly optimistic, senior year. Slightly less optimistic, second year of a masters program. Realistic? Maybe never.
The interesting bit for you-the-reader is: why is this my answer? (Especially given that students would probably self-report significant gains at lower levels.) To start with, [this paper where experienced developers thought that AI assistance sped up their work on real tasks when in fact it slowed it down](https://arxiv.org/abs/2507.09089) is informative. There are a lot of differences in task between experienced devs solving real bugs and students working on a class project, but it's important to understand that we shouldn't have a baseline expectation that AI coding "assistants" will speed things up in the best of circumstances, and we shouldn't trust self-reports of productivity (or the AI hype machine in general).
Now we might imagine that coding assistants will be better at helping with a student project than at helping with fixing bugs in open-source software, since it's a much easier task. For many programming assignments that have a fixed answer, we know that many AI assistants can just spit out a solution based on prompting them with the problem description (there's another elephant in the room here to do with learning outcomes regardless of project success, but we'll ignore that one too; my focus here is on project complexity reach, not learning outcomes). My question is about more open-ended projects, not assignments with an expected answer. Here's a second study (by one of my colleagues) about novices using AI assistance for programming tasks. It showcases how difficult it is to use AI tools well, and some of the stumbling blocks that novices in particular face.
But what about intermediate students? Might there be some level where the AI is helpful because the task is still relatively simple and the students are good enough to handle it? The problem with this is that as task complexity increases, so does the likelihood of the AI generating (or copying) code that uses more complex constructs which a student doesn't understand. Let's say I have second-year students writing interactive websites with JavaScript. Without a lot of care in prompting, which those students don't know how to exercise, the AI is likely to suggest code that depends on several different frameworks, from React to jQuery, without actually setting up or including those frameworks, and of course these students would be way out of their depth trying to do that. This is a general problem: each programming class carefully limits the specific code frameworks and constructs it expects students to know based on the material it covers. There is no feasible way to limit an AI assistant to a fixed set of constructs or frameworks, using current designs. There are alternate designs where this would be possible (like AI search through adaptation from a controlled library of snippets) but those would be entirely different tools.
So what happens on a sizeable class project where the AI has dropped in buggy code, especially if it uses code constructs the students don't understand? Best case, they understand that they don't understand, and quickly ask an instructor or TA for help getting rid of the stuff they don't understand, re-prompting or manually adding stuff they do. Average case: they waste several hours and/or sweep the bugs partly under the rug, resulting in a project with significant defects. Students in their second and even third years of a CS major still have a lot to learn about debugging, and usually have significant gaps in their knowledge of even their most comfortable programming language. I do think regardless of AI we as teachers need to get better at teaching debugging skills, but the knowledge gaps are inevitable because there's just too much to know. In Python, for example, the LLM is going to spit out yields, async functions, try/finally, maybe even something like a while/else, or with recent training data, the walrus operator. I can't expect even a fraction of 3rd year students who have worked with Python since their first year to know about all these things, and based on how students approach projects where they have studied all the relevant constructs but have forgotten some, I'm not optimistic that these things will magically become learning opportunities. Student projects are better off working with a limited subset of full programming languages that the students have actually learned, and using AI coding assistants as currently designed makes this impossible. Beyond that, even when the "assistant" just introduces bugs using syntax the students understand, even through their 4th year many students struggle to understand the operation of moderately complex code they've written themselves, let alone written by someone else. Having access to an AI that will confidently offer incorrect explanations for bugs will make this worse.
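To make the knowledge-gap point concrete, here's a sketch of the kind of idiomatic Python an LLM plausibly emits for "sum the numbers in a file" (my own hypothetical example, not taken from any particular model's output), packing in several of the constructs just mentioned:

    def numbers(path):
        f = open(path)
        try:
            while (line := f.readline()):    # walrus operator in the loop test
                if (line := line.strip()):
                    yield float(line)        # 'yield' quietly makes this a generator
        finally:
            f.close()                        # try/finally instead of a plain 'with'

    # total = sum(numbers("data.txt"))       # usage; "data.txt" is hypothetical

Each line is reasonable on its own, but a student who has only seen for-loops, lists, and return statements now has three unfamiliar constructs to debug at once.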
To be sure a small minority of students will be able to overcome these problems, but that minority is the group that has a good grasp of the fundamentals and has broadened their knowledge through self-study, which earlier AI-reliant classes would make less likely to happen. In any case, I care about the average student, since we already have plenty of stuff about our institutions that makes life easier for a favored few while being worse for the average student (note that our construction of that favored few as the "good" students is a large part of this problem).
To summarize: because AI assistants introduce excess code complexity and difficult-to-debug bugs, they'll slow down rather than speed up project progress for the average student on moderately complex projects. On a fixed deadline, they'll result in worse projects, or necessitate less ambitious project scoping to ensure adequate completion, and I expect this remains broadly true through 4-6 years of study in most programs (don't take this as an endorsement of AI "assistants" for masters students; we've ignored a lot of other problems along the way).
There's a related problem: solving open-ended project assignments well ultimately depends on deeply understanding the problem, and AI "assistants" allow students to put a lot of code in their file without spending much time thinking about the problem or building an understanding of it. This is awful for learning outcomes, but also bad for project success. Getting students to see the value of thinking deeply about a problem is a thorny pedagogical puzzle at the best of times, and allowing the use of AI "assistants" makes the problem much much worse. This is another area I hope to see (or even drive) pedagogical improvement in, for what it's worth.
2/2
rd-spiral: An open-source Python library for learning 2D reaction-diffusion dynamics through pseudo-spectral method
Sandy H. S. Herho, Iwan P. Anwar, Rusmawan Suwarman
https://arxiv.org/abs/2506.20633
"#OpenAccess and #Citation #Impact: Modality, Funding, Publisher, and Disciplinary Trends at the University of Kentucky"
Understanding API Usage and Testing: An Empirical Study of C Libraries
Ahmed Zaki, Cristian Cadar
https://arxiv.org/abs/2506.11598 https://
🇺🇦 #NowPlaying on #KEXP's #Early
The Linda Lindas:
🎵 Racist, Sexist Boy (Live at LA Public Library)
#TheLindaLindas
https://thelindalindas.bandcamp.com/track/racist-sexist-boy
https://open.spotify.com/track/6CSLL3sOgYIMSRj69mkGSI
TokenSmith: Streamlining Data Editing, Search, and Inspection for Large-Scale Language Model Training and Interpretability
Mohammad Aflah Khan, Ameya Godbole, Johnny Tian-Zheng Wei, Ryan Wang, James Flemings, Krishna Gummadi, Willie Neiswanger, Robin Jia
https://arxiv.org/abs/2507.19419
Replaced article(s) found for cs.DS. https://arxiv.org/list/cs.DS/new
[1/1]:
- TGLib: An Open-Source Library for Temporal Graph Analysis
Lutz Oettershagen, Petra Mutzel
Sacred Lotus. Huntington Library, San Marino, California, USA. July, 2025. #huntingtonlibrary #lotus #waterlily
DPLib: A Standard Benchmark Library for Distributed Power System Analysis and Optimization
Milad Hasanzadeh, Amin Kargarian
https://arxiv.org/abs/2506.20819
DefElement: an encyclopedia of finite element definitions
Matthew W. Scroggs, Pablo D. Brubeck, Joseph P. Dean, Jørgen S. Dokken, India Marsden
https://arxiv.org/abs/2506.20188
"Open Access and Citation Impact: Modality, Funding, Publisher, and Disciplinary Trends at the University of Kentucky" #OpenAccess
Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms
Zhiyi Hu, Siyuan Shen, Tommaso Bonato, Sylvain Jeaugey, Cedell Alexander, Eric Spada, Jeff Hammond, Torsten Hoefler
https://arxiv.org/abs/2507.04786
A factorisation-based regularised interior point method using the augmented system
Filippo Zanetti, Jacek Gondzio
https://arxiv.org/abs/2508.04370 https://…
f4ncgb: High Performance Gröbner Basis Computations in Free Algebras
Maximilian Heisinger, Clemens Hofstadler
https://arxiv.org/abs/2505.19304 https…
Just read this post by @… on an optimistic AGI future, and while it had some interesting and worthwhile ideas, it's also in my opinion dangerously misguided, and plays into the current AGI hype in a harmful way.
https://social.coop/@eloquence/114940607434005478
My criticisms include:
- Current LLM technology has many layers, but the biggest, most capable models are all tied to corporate datacenters and require inordinate amounts of energy and water to run. Trying to use these tools to bring about a post-scarcity economy will burn up the planet. We urgently need more-capable but also vastly more efficient AI technologies if we want to use AI for a post-scarcity economy, and we are *not* nearly on the verge of this, despite what the big companies pushing LLMs want us to think.
- I can see that permacommons.org claims that its small level of spending on AI equates to a low climate impact. However, given the deep subsidies currently in place by the big companies to attract users, that isn't a great assumption. The fact that their FAQ dodges the question about which AI systems they use isn't a great look.
- These systems are not free in the same way that Wikipedia or open-source software is. To run your own model you need a data harvesting & cleaning operation that costs millions of dollars minimum, and then you need millions of dollars worth of storage & compute to train & host the models. Right now, big corporations are trying to compete for market share by heavily subsidizing these things, but if you go along with that, you become dependent on them, and you'll be screwed when they jack up the price to a profitable level later. I'd love to see open dataset initiatives and the like, and there are some of these things, but not enough yet, and many of the initiatives focus on one problem while ignoring others (fine for research but not the basis for a society yet).
- Between the environmental impacts, the horrible labor conditions and undercompensation of data workers who filter the big datasets, and the impacts of both AI scrapers and AI commons pollution, the developers of the most popular & effective LLMs have a lot to answer for. This project only really mentions environmental impacts, which makes me think that they're not serious about ethics, which in turn makes me distrustful of the whole enterprise.
- Their language also ends up encouraging AI use broadly while totally ignoring several entire classes of harm, so they're effectively contributing to AI hype, especially with such casual talk of AGI and robotics as if embodied AGI were just around the corner. To be clear about this point: we are several breakthroughs away from AGI under the most optimistic assumptions, and giving the impression that those will happen soon plays directly into the hands of the Sam Altmans of the world who are trying to make money off the impression of impending huge advances in AI capabilities. Adding to the AI hype is irresponsible.
- I've got a more philosophical criticism that I'll post about separately.
I do think that the idea of using AI & other software tools, possibly along with robotics and funded by many local cooperatives, in order to make businesses obsolete before they can do the same to all workers, is a good one. Get your local library to buy a knitting machine alongside their 3D printer.
Lately I've felt too busy criticizing AI to really sit down and think about what I do want the future to look like, even though I'm a big proponent of positive visions for the future as a force multiplier for criticism, and this article is inspiring to me in that regard, even if the specific project doesn't seem like a good one.
At @oapenbooks.bsky.social, we have updated our #Metadata feeds, to better integrate our #OpenAccess #books into #libraries
ProCaliper: functional and structural analysis, visualization, and annotation of proteins
Jordan C. Rozum, Hunter Ufford, Alexandria K. Im, Tong Zhang, David D. Pollock, Doo Nam Kim, Song Feng
https://arxiv.org/abs/2506.19961
This https://arxiv.org/abs/1910.14012 has been replaced.
link: https://scholar.google.com/scholar?q=a
LLM coding is the opposite of DRY
An important principle in software engineering is DRY: Don't Repeat Yourself. We recognize that having the same code copied in more than one place is bad for several reasons:
1. It makes the entire codebase harder to read.
2. It increases maintenance burden, since any problems in the duplicated code need to be solved in more than one place.
3. Because it becomes possible for the copies to drift apart if changes to one aren't transferred to the other (maybe the person making the change has forgotten there was a copy) it makes the code more error-prone and harder to debug.
All modern programming languages make it almost entirely unnecessary to repeat code: we can move the repeated code into a "function" or "module" and then reference it from all the different places it's needed. At a larger scale, someone might write an open-source "library" of such functions or modules and instead of re-implementing that functionality ourselves, we can use their code, with an acknowledgement. Using another person's library this way is complicated, because now you're dependent on them: if they stop maintaining it or introduce bugs, you've inherited a problem, but still, you could always copy their project and maintain your own version, and it would be not much more work than if you had implemented stuff yourself from the start. It's a little more complicated than this, but the basic principle holds, and it's a foundational one for software development in general and the open-source movement in particular. The network of "citations" as open-source software builds on other open-source software and people contribute patches to each others' projects is a lot of what makes the movement into a community, and it can lead to collaborations that drive further development. So the DRY principle is important at both small and large scales.
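To make the principle concrete, here is a minimal sketch of DRY at the function scale (my example, not the post's):

    # Repeated version: the same normalization copied in two places, which can
    # drift apart when someone fixes one copy and forgets the other.
    def clean_username(raw):
        return raw.strip().lower().replace(" ", "_")

    def clean_tag(raw):
        return raw.strip().lower().replace(" ", "_")

    # DRY version: one shared function, referenced from every call site.
    def slugify(raw):
        return raw.strip().lower().replace(" ", "_")

    print(slugify("  Open Library "))  # open_library

A fix to slugify now reaches every caller at once, which is exactly the property the rest of this post argues LLM-generated code erodes.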
Unfortunately, the current crop of hyped-up LLM coding systems from the big players are antithetical to DRY at all scales:
- At the library scale, they train on open source software but then (with some unknown frequency) replicate parts of it line-for-line *without* any citation [1]. The person who was using the LLM has no way of knowing that this happened, or even any way to check for it. In theory the LLM company could build a system for this, but it's not likely to be profitable unless the courts actually start punishing these license violations, which doesn't seem likely based on results so far and the difficulty of finding out that the violations are happening. By creating these copies (and also mash-ups, along with lots of less-problematic stuff), the LLM users (enabled and encouraged by the LLM-peddlers) are directly undermining the DRY principle. If we get what the big AI companies claim to want, which is a massive shift towards machine-authored code, DRY at the library scale will effectively be dead, with each new project simply re-implementing the functionality it needs instead of ever using a library. This might seem to have some upside, since dependency hell is a thing, but the downside in terms of comprehensibility and therefore maintainability, correctness, and security will be massive. The eventual lack of new high-quality DRY-respecting code to train the models on will only make this problem worse.
- At the module & function level, AI is probably prone to re-writing rather than re-using the functions it needs, especially with a workflow where a human prompts it for many independent completions. This part I don't have direct evidence for, since I don't use LLM coding models myself except in very specific circumstances because it's not generally ethical to do so. I do know that when it tries to call existing functions, it often guesses incorrectly about the parameters they need, which I'm sure is a headache and source of bugs for the vibe coders out there. An AI could be designed to take more context into account and use existing lookup tools to get accurate function signatures and use them when generating function calls (see the sketch after this list), but even though that would probably significantly improve output quality, I suspect it's the kind of thing that would be seen as too baroque and thus not a priority. Would love to hear I'm wrong about any of this, but I suspect the consequences are that any medium-or-larger sized codebase written with LLM tools will have significant bloat from duplicate functionality, and will have places where better use of existing libraries would have made the code simpler. At a fundamental level, a principle like DRY is not something that current LLM training techniques are able to learn, and while they can imitate it from their training sets to some degree when asked for large amounts of code, when prompted for many smaller chunks, they're asymptotically likely to violate it.
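As a sketch of that "lookup tools" idea (my illustration; resize is a hypothetical stand-in for a real library function): Python's standard inspect module can already check a generated call against the real signature before anything runs:

    import inspect

    def resize(image, width, height, keep_aspect=True):
        """Hypothetical stand-in for a real library function."""

    sig = inspect.signature(resize)
    sig.bind("img.png", 640, 480)            # OK: matches the real parameters
    try:
        sig.bind("img.png", w=640, h=480)    # hallucinated parameter names
    except TypeError as e:
        print("rejected generated call:", e) # caught before it becomes a runtime bug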
I think this is an important critique in part because it cuts against the argument that "LLMs are the modern compilers; if you reject them you're just like the people who wanted to keep hand-writing assembly code, and you'll be just as obsolete." Compilers actually represented a great win for abstraction, encapsulation, and DRY in general, and they supported and are integral to open source development, whereas LLMs are set to do the opposite.
[1] to see what this looks like in action in prose, see the example on page 30 of the NYTimes copyright complaint against OpenAI. #AI #GenAI #LLMs #VibeCoding
EngiBench: A Framework for Data-Driven Engineering Design Research
Florian Felten, Gabriel Apaza, Gerhard Bräunlich, Cashen Diniz, Xuliang Dong, Arthur Drake, Milad Habibi, Nathaniel J. Hoffman, Matthew Keeler, Soheyl Massoudi, Francis G. VanGessel, Mark Fuge
https://arxiv.org/abs/2508.00831…
SAVANT: Vulnerability Detection in Application Dependencies through Semantic-Guided Reachability Analysis
Wang Lingxiang, Quanzhi Fu, Wenjia Song, Gelei Deng, Yi Liu, Dan Williams, Ying Zhang
https://arxiv.org/abs/2506.17798
It's time to lower your inhibitions towards just asking a human the answer to your question.
In the early nineties, effectively before the internet, that's how you learned a lot of stuff. Your other option was to look it up in a book. I was a kid then, so I asked my parents a lot of questions.
Then by ~2000 or a little later, it started to feel almost rude to do this, because Google was now a thing, along with Wikipedia. "Let me Google that for you" became a joke website used to satirize the poor fool who would waste someone's time answering a random question. There were some upsides to this, as well as downsides. I'm not here to judge them.
At this point, Google doesn't work any more for answering random questions, let alone more serious ones. That era is over. If you don't believe it, try it yourself. Between Google intentionally making their results worse to show you more ads, the SEO cruft that already existed pre-LLMs, and the massive tsunami of SEO slop enabled by LLMs, trustworthy information is hard to find, and hard to distinguish from the slop. (I posted an example earlier.) #AI #LLMs #DigitalCommons #AskAQuestion