2025-10-24 00:54:03
An Empirical Study of Testing Practices in Open Source AI Agent Frameworks and Agentic Applications
Mohammed Mehedi Hasan, Hao Li, Emad Fallahzadeh, Gopi Krishnan Rajbahadur, Bram Adams, Ahmed E. Hassan
https://arxiv.org/abs/2509.19185
Amazon is testing AR glasses for delivery drivers, using AI and computer vision to help them scan packages, follow walking directions, and get proof of delivery (Todd Bishop/GeekWire)
https://www.geekwire.com/2025/amazon-unveils-ai-powe…
The FDA Often Doesn't Test Generic Drugs for Quality Concerns, So ProPublica Did (ProPublica)
https://www.propublica.org/article/fda-generic-drug-testing
http://www.memeorandum.com/251222/p66#a251222p66
LED Lighting: Mini Reviews - Real-world testing! - https://m.earth.org.uk/LED-lighting.html
For weeks earlier this year, the Army’s top uniformed lawyer had been raising legal concerns inside the Pentagon about some of the new policies being rolled out dictating how the military can be used and staffed.
In late January, Lt. Gen. Joe Berger, who had taken the top posting in July 2024, was asked for his advice about the legality of using Texas National Guard soldiers for immigration enforcement.
Berger told Army chief of staff Gen. Randy George that he was skeptical and wa…
The Illusion of Readiness: Stress Testing Large Frontier Models on Multimodal Medical Benchmarks
Yu Gu, Jingjing Fu, Xiaodong Liu, Jeya Maria Jose Valanarasu, Noel Codella, Reuben Tan, Qianchu Liu, Ying Jin, Sheng Zhang, Jinyu Wang, Rui Wang, Lei Song, Guanghui Qin, Naoto Usuyama, Cliff Wong, Cheng Hao, Hohin Lee, Praneeth Sanapathi, Sarah Hilado, Bian Jiang, Javier Alvarez-Valle, Mu Wei, Jianfeng Gao, Eric Horvitz, Matt Lungren, Hoifung Poon, Paul Vozila
I read about a major update of #jellyfin and that one should plan it with some care.
So I sat down, prepared for an update full of hassles, debugging, restoring, ...
> docker down / pull / validate / up / done
Took me about 5-10 min.
Very well done Jellyfin!
----
@…
> I don't think this is normal installation method on linux
Correct. This is an installer I wrote to allow multiple side by side installs, that we use for testing
> and it's not compliant with free desktop standard.
Actually to some extent it is, I am registering with the desktop environment, which is why it has dock …
DRISHTIKON: A Multimodal Multilingual Benchmark for Testing Language Models' Understanding on Indian Culture
Arijit Maji, Raghvendra Kumar, Akash Ghosh, Anushka, Nemil Shah, Abhilekh Borah, Vanshika Shah, Nishant Mishra, Sriparna Saha
https://arxiv.org/abs/2509.19274
Since the last Android release @… allows using a custom server from which to download map files.
This decreases reliance on CoMaps-run infrastructure but is also great for people who live in places with limited internet connectivity, as you can now serve maps from local networks.
To make it easier, I wrote a small CLI tool over the weekend that downloads maps of interest & then serves them locally. A first alpha release for testing is on Codeberg:
#OpenStreetMap
A Sequential Testing Problem with Signal Control
Steven Campbell, Georgy Gaitsgori, Richard Groenewald
https://arxiv.org/abs/2509.18209 https://arxiv.org/p…
Experimentally Testing AI-Powered Content Transformations on Student Learning
Courtney Heldreth, Laura M. Vardoulakis, Nicole E. Miller, Yael Haramaty, Diana Akrong, Lidan Hackmon, Lior Belinsky
https://arxiv.org/abs/2509.18664
Posting this here for others with #LongCovid.
One of the things I've done this year is rule out potential other causes of my symptoms (imaging, blood work, neurological testing). This, in addition to treating symptoms as best I can (insomnia, brain fog, joint pain, double vision, tinnitus, hearing loss, etc.).
This fall, I wrapped up with neurology (I didn't want to, a…
Testing Aurora, the KDE brother of Bluefin, briefly. Switched to the latest branch of course. Interesting and well executed. Too much for me, application wise, nevertheless very nicely done.
PS: Bazaar is a very nice appstore and the Curated section allows Aurora (and thus others) to show their preferred applications. Very neat.
Just added Web Reachability API (at least that’s what I’m calling it) support to https://ip.small-web.org.
It’s for testing the reachability of your Small Web servers (using a domain or, more importantly, an IPv4/IPv6 address). I’m using it to implement Web Numbers¹ support in Auto Encrypt² and Kitten³.
…
X is testing a new way of opening links without fully covering an X post, allowing users to see the Like, Repost, and other buttons, starting on iOS (Cheyenne MacDonald/Engadget)
https://www.engadget.com/x-is-testing-a-new-way-of…
5GC-Bench: A Framework for Stress-Testing and Benchmarking 5G Core VNFs
Ioannis Panitsas, Tolga O. Atalay, Dragoslav Stojadinovic, Angelos Stavrou, Leandros Tassiulas
https://arxiv.org/abs/2509.18443
74 years ago today: On 24 September 1951, the Soviet Union (#Sowjetunion) tested its second atomic bomb (#Atombombe), the RDS-2, at the Semipalatinsk test site. The detonation, also called "Joe-2", had a yield of 38.3 kilotons and was set off on a 30-meter-high tower.
Testing, with the help of a funnel, the theory that cats are actually a liquid. Still having trouble getting his butt through the hole. #CatsOfMastodon
Aligned, Multiple-transient Events in the First #PalomarSkySurvey / Transients in the Palomar Observatory Sky Survey (POSS-I) may be associated with nuclear testing and reports of unidentified anomalous phenomena: https://iopscience.iop.org/article/10.1088/1538-3873/ae0afe / https://www.nature.com/articles/s41598-025-21620-3 -> Unexpected patterns in historical astronomical observations: https://www.su.se/english/news/unexpected-patterns-in-historical-astronomical-observations-1.855042 -> summary of why these papers are doubtful: https://x.com/MickWest/status/1980690132543107204 - links to deep discussions.
is this thing on? testing my new mastodon server :-)
These goofs want to use "drones that would fly up to 150 feet and eventually deliver some of the orders that today are transported by the company’s drivers."
We have a technology that efficiently delivers goods to homes in an urban neighborhood. It's called bicycles. https://
Claude Sonnet 4.5 shows significantly increased situational awareness when testing for alignment, here's a fascinating example from p. 59 of the system card (#anthropology #AIResearch
Benchmarking PDF Accessibility Evaluation: A Dataset and Framework for Assessing Automated and LLM-Based Approaches for Accessibility Testing
Anukriti Kumar, Tanushree Padath, Lucy Lu Wang
https://arxiv.org/abs/2509.18965
With this being the first Raspberry Pi 5 of the cluster, and since those are notorious for running hotter than the Pi 4, which I did extensive testing on last year: https://blog.wyrihaximus.net/2024/12/building-a-kubernetes-homelab-w…
For testing purposes, PixelFed is running on a local machine for now, on the instance https://pixelfed.pixelgalaxy.net
As soon as a suitable VPS is available, I will install the instance on the main domain. The installation process with YunoHost was relatively easy. But the devil is, as…
Crosslisted article(s) found for cs.ET. https://arxiv.org/list/cs.ET/new
[1/1]:
- An Empirical Study of Testing Practices in Open Source AI Agent Frameworks and Agentic Applications
Hasan, Li, Fallahzadeh, Rajbahadur, Adams, Hassan
Do you have a device with an AMD Neural Processing Unit (NPU) that you actively use? I could use your help testing support for monitoring AMD NPUs in Resources!
I'd really appreciate it if you'd check out the amdxdna-support branch of Resources' GitHub repository and check if NPU utilization correctly shows up in Resources.
You can find more info in this GitHub issue:
Fray Detects #Concurrency Issues in #JVM Languages
https://www.infoq.com/news/2025/12/fray-de
PSA about food labeling in the US
We have a gluten detection service dog because many things that should be gluten free/say they’re gluten free are not actually gluten free.
Stuff gets contaminated when growing (e.g. next to wheat field), by shared equipment, in factories, from packaging, during transport and in-store.
Every US consumer should know:
1. The list of ingredients on food isn't exhaustive
2. Allergen labeling:
a) limited to just some allergens
b) manufacturers don't actually have to test
c) "certified" foods are tested—but not continuously
d) testing only works with enough contamination
Some certifications may require batch-testing, but usually they don't.
A "certified gluten free" product may e.g. contain oats which sometimes are contaminated with gluten—but as not every batch is tested it's impossible to know unless you test yourself (hence the service dog).
Even if the product is properly batch-tested, you might get a part of the product that has the allergen in it, whereas the tested part didn't.
Or the threshold was too low (our dog can detect gluten better than any available lab testing equipment; yes, dogs are amazing).
Food products also contain ingredients that do not have to be included on the label when they're "incidental" (included in another ingredient) or if they're considered part of the manufacturing process but not of the final product (e.g. various coatings on factory equipment).
Don't need to list flavors or specific spices either. ¯\_(ツ)_/¯
As for allergens, only those responsible for ~90% of food allergies* have to be specifically declared, and they're not tested for as it's simply based on the ingredients list.
Good luck if you have other allergies.
*milk, egg, fish, Crustacean shellfish, tree nuts, wheat, peanuts, soybeans
Just deleted a bunch of testing boilerplate in this repo after @… added `import.meta.resolve` support 🏆
https://github.com/vitest-dev/vitest/issues/6953
Damn, I missed the release of servo 0.0.2 https://github.com/servo/servo/releases/tag/v0.0.2
Discover the power of property-based testing in R with the #quickcheck package! Seamlessly integrates with #testthat and offers a variety of generators for atomic vectors, lists, and tibbles. Perfect for ensuring your code's reliability. Check it out:
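For readers new to the idea, here is a minimal sketch of property-based testing written in plain Python with only the standard library. It is not the #quickcheck or #testthat API; the names `check_property`, `int_list`, `sort_is_idempotent`, and `reverse_is_identity` are hypothetical helpers invented for illustration. The idea is to generate many random inputs and check that a stated property holds for each of them, reporting a failing input if one turns up.

```python
import random


def check_property(prop, gen, runs=200, seed=0):
    """Run `prop` on `runs` generated inputs.

    Returns the first input for which the property fails, or None
    if every generated input satisfied it.
    """
    rng = random.Random(seed)  # fixed seed keeps runs reproducible
    for _ in range(runs):
        value = gen(rng)
        if not prop(value):
            return value
    return None


def int_list(rng, max_len=20):
    """Hypothetical generator: a random-length list of small integers."""
    return [rng.randint(-100, 100) for _ in range(rng.randint(0, max_len))]


def sort_is_idempotent(xs):
    """Property that should hold: sorting twice equals sorting once."""
    return sorted(sorted(xs)) == sorted(xs)


def reverse_is_identity(xs):
    """Deliberately false property: reversing rarely leaves a list unchanged."""
    return list(reversed(xs)) == xs


holds = check_property(sort_is_idempotent, int_list)
counterexample = check_property(reverse_is_identity, int_list)
print("sort idempotent, failing input:", holds)
print("reverse identity, failing input:", counterexample)
```

Dedicated libraries add what this sketch leaves out, most notably shrinking a failing input down to a small counterexample.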
💥 Bubble wrap bursts enable power-free acoustic testing
#sensors
we are the clapton, always anti-fascist, fuck the terf FA, LET THEM PLAY
#claptoncfc #ccfc #ftfa #fedifc
Maybe this is it: we just need to rebrand Solar as Fusion (alpha testing) 😂 @… https://mastodon.world/@davidho/115400254100213642
Oh goody, a LinkedIn connection request with a message!
Let’s break this down:
> …vc backed…
Focused on quarterly profits / RoI instead of outcomes.
> …invite-only…
NDAs and other gag agreements.
> …openAI browser…
Chromium that begs authors for ARIA to parse content.
> …seeing promising results…
Which aren’t genuine results.
> …goal is 100% WCAG testing…
Ah, snake oil.
> …with high capture rates.
Sales…
Baidu and Swiss Post's PostBus plan to launch Baidu's Apollo Go autonomous vehicle service in Switzerland, testing in December ahead of rollout by Q1 2027 (Reuters)
https://www.reuters.com/technology/baidu-expands-robotaxi-push-swit…
Donation to your mastodon server will be cheaper https://techhub.social/@Techmeme/115737372852805627
"Regulators overlooking toxic Pfas found around Lancashire chemicals plant"
#UK #UnitedKingdom #PFAS #ForeverChemicals
Meta is testing limiting Facebook professional accounts and Pages to posting just two links per month, unless they subscribe to $14.99 /month Meta Verified (Ivan Mehta/TechCrunch)
https://techcrunch.com/2025/12/17/facebook-is-te…
While testing some new misp-modules, such as the OpenAPI interface, I discovered a strange behavior in Firefox when trying to reach TCP port 6666, which is the default port used by misp-modules.
It seems Firefox blocks access to a predefined list of TCP ports, and this has been in place for quite some time, as you can see in the commit log.
If you want to override the blocked port list, there is an obscure setting called network.security.ports.banned.override.
…
Folks using #PhanpySocialDev , there are 2 new features that need a bit of testing:
1. QR code for profiles and shortcuts settings - includes scanner (camera) too
2. Import/export accounts - the export *excludes* access tokens, so need to login again after import
They’re quite hidden, so just a heads-up 🙇♂️
RE: https://mastodon.social/@firefoxwebdevs/115740501470592801
I agree in theory that "assume good faith" is the right way to go and a very pragmatic maxim for communities, but Mozilla has been testing that assumption for quite a while now.
As the system comes up, the component builders will from time to time appear,
bearing hot new versions of their pieces -- faster, smaller, more complete,
or putatively less buggy. The replacement of a working component by a new
version requires the same systematic testing procedure that adding a new
component does, although it should require less time, for more complete and
efficient test cases will usually be available.
-- Frederick Brooks Jr., "The Myt…
" @… noted that authoritarian regimes fear mass organizing and peaceful protest because they reveal a regime’s unpopularity and show that it is losing its grip on power."
"Much as tossing chests of tea into Boston Harbor did about 250 years ago." - @…
Sources: Meta plans to test an AI-powered personalized daily briefing, designed to compete with ChatGPT's Pulse, with some Facebook users in NYC and SF (Naomi Nix/Washington Post)
https://www.washingtonpost.com/technology/2025/11/21/meta-ai-powered-daily…
from my link log —
Mutation testing for librsvg with cargo-mutants.
https://viruta.org/mutation-testing-librsvg.html
saved 2025-12-03 https://…
I’m somewhat exhausted to announce attrs 25.4.0!
The main reason for this release (and why it's published today) is that it ships the first pieces of work for Python 3.14 and PEP 749. There will be more work required and there's going to be a lot more churn once everyone starts testing 3.14 earnestly. We hope to receive more feedback before spending more time on this. #Python
UK Unveils Roadmap to Phase Out Animal Testing, Commits £75M to Develop Alternative Methods https://vegconomist.com/science/uk-unveils-roadmap-phase-out-animal-testing-commiting-develop-alternative-methods/
https://cyberscoop.com/bugcrowd-mayhem-security-acquistion-ai-security-testing/
Omg so happy for the great David Brumley!
From Translink
New bus stop lights roll out to improve safety and visibility
TransLink testing solar-powered lights at 14 bus stops
TransLink is improving safety and comfort with new solar-powered lights at bus stops.
The project will test lights at 14 locations where customers or staff have identified a need for improved lighting. The initiative will improve visibility for both Bus Operators and waiting passengers.
I’m currently working on a piece of software designed more than a decade ago. It offers a plugin architecture: you can develop a plugin whose lifecycle is handled by the software. The tough part, though, is how you access the platform capabilities: via static methods on singletons.
How did I #test my code?
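The post's own answer is not included here. As a rough illustration only, one common way to make such code testable is to hide the singleton behind a small interface and inject it. The sketch below is in Python, and every name in it (`Platform`, `UserSource`, `PlatformUserSource`, `GreetingPlugin`, `FakeUserSource`) is hypothetical; it is not taken from the software in question and may not match the author's approach.

```python
from typing import Protocol


class Platform:
    """Hypothetical stand-in for the host software's singleton."""

    _instance = None

    @staticmethod
    def get_instance():
        if Platform._instance is None:
            Platform._instance = Platform()
        return Platform._instance

    def current_user(self) -> str:
        # Simulates a capability that only works inside the running host.
        raise RuntimeError("only available inside the running host")


class UserSource(Protocol):
    """The narrow interface the plugin actually depends on."""

    def current_user(self) -> str: ...


class PlatformUserSource:
    """Production adapter: the only code that touches the singleton."""

    def current_user(self) -> str:
        return Platform.get_instance().current_user()


class GreetingPlugin:
    """Plugin logic receives its dependency instead of calling statics."""

    def __init__(self, users: UserSource):
        self._users = users

    def greet(self) -> str:
        return f"Hello, {self._users.current_user()}!"


class FakeUserSource:
    """Test double that needs no running host."""

    def current_user(self) -> str:
        return "alice"


plugin = GreetingPlugin(FakeUserSource())
print(plugin.greet())
```

In production the plugin would be built with `PlatformUserSource()`, so the static call lives in one thin adapter while the plugin logic stays testable in isolation.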
I regularly warn that content on Forbes is pay-for-play.
Here an overlay vendor shares prompts to feed an LLM for testing code, demonstrating LLMs on their own can’t do it without extensive coaching _and_ that coaching needs to be correct (the examples have issues):
https://www.
{testthat} is great for automatic testing. Here are some tricks for the heavy user: #rstats
😰 Miami Is Testing a Self-Driving Police Car That Can Launch Drones
https://www.thedrive.com/news/miami-is-testing-a-self-driving-police-car-that-can-launch-drones
With @… you gotta daily-check those repos. Updated KDE liveslak, updated window manager goodies, updated testing (minimal KDE, with homebrew, flatpak and distrobox): on a roll!
#slackware
If I'm gonna do some "pen testing" it's gonna be me sitting at my desk with a stack of nice paper, some music playing, and no computers involved.
Just updated Node Pebble to support latest release version of Let’s Encrypt’s Pebble testing server.
#LetsEncrypt…
PSA for users that regularly test #Fedora Beta as well as proposed updates once the new version was released:
Do not enable updates-testing[1] by modifying /etc/yum.repos.d/fedora-updates-testing.repo; instead do it like this:
$ sudo dnf config-manager setopt updates-testing.enabled=true
Otherwise updates-testing will be disabled shortly before the release of a new version (t…
The LAX Automated People Mover (APM) is a new, free, electric train system that connects airport terminals, parking, and public transportation.
After delays, it is expected to open for public use in January 2026,
with final testing concluding in June 2026.
The 2.25-mile elevated guideway has six stations, and trains will run 24/7, arriving every two minutes during peak hours
Agentic Property-Based Testing: Finding Bugs Across the Python Ecosystem
Muhammad Maaz, Liam DeVoe, Zac Hatfield-Dodds, Nicholas Carlini
https://arxiv.org/abs/2510.09907 https:/…
Agentio, which uses AI to help brands automate and scale campaigns with YouTube creators, raised a $40M Series B led by Forerunner at a $340M valuation (Ivan Mehta/TechCrunch)
https://techcrunch.com/2025/11/18/agen
Function Health, a health tracking tech company, raised a $298M Series B led by Redpoint Ventures at a $2.5B valuation, bringing its total funding to $350M (Kate Park/TechCrunch)
https://techcrunch.com/2025/11/19/funct…
LLMs are All You Need? Improving Fuzz Testing for MOJO with Large Language Models
Linghan Huang, Peizhou Zhao, Huaming Chen
https://arxiv.org/abs/2510.10179 https://
Reddit starts a limited test of verified profiles, an opt-in feature that places a gray checkmark beside the usernames of notable people or businesses (Amanda Silberling/TechCrunch)
https://techcrunch.com/2025/12/10/reddit-is-testing-verification-badges/
Google plans to release Gemini 3 Deep Think to Google AI Ultra subscribers in the coming weeks, once it passes further rounds of safety testing (Russell Brandom/TechCrunch)
https://techcrunch.com/2025/11/18/google-launches-gemini-3-…
Agentic RAG for Software Testing with Hybrid Vector-Graph and Multi-Agent Orchestration
Mohanakrishnan Hariharan, Satish Arvapalli, Seshu Barma, Evangeline Sheela
https://arxiv.org/abs/2510.10824
Windows 11 Copilot AI hands-on: despite Microsoft advertising the feature as able to "understand you", it delivers inconsistent and often incorrect responses (Antonio G. Di Benedetto/The Verge)
https://www.theverge.com/report/822443/mic
Amazon's Twitch begins testing livestream shopping ads allowing users to buy products "in real time", starting with e.l.f. Cosmetics (Peter Adams/Marketing Dive)
https://www.marketingdive.com/news/elf-cosmetics-first-test-twitc…
Google tests AI-powered overviews on some publications' Google News pages; publishers like Der Spiegel, El País, and WaPo in commercial partnerships get paid (Aisha Malik/TechCrunch)
https://techcrunch.com/2025/12/10/goog
Shares of Tesla closed at a 2025 high on Monday after the company confirmed it is testing driverless vehicles in Austin without a human safety operator (Lora Kolodny/CNBC)
https://www.cnbc.com/2025/12/15/tesla-tests-driverless-cars-in-austin…
WhatsApp is testing imposing per-month limits on how many messages individual users and businesses can send to unknown people without getting a response (Ivan Mehta/TechCrunch)
https://techcrunch.com/2025/10/17/what
Automaker Stellantis and Pony.ai sign a non-binding agreement to develop robotaxis for deployment in Europe, with plans to start testing in the coming months (Rebecca Bellan/TechCrunch)
https://techcrunch.com/2025/10/17/stellantis-teams-up-w…
Sources: Apple is in preliminary talks with India's CG Semi, which offers chip assembly and testing services, to assemble and package chips for the iPhone (Dia Rekhi/The Economic Times)
https://economictimes.indiatimes.com/t
DoorDash launches Zesty, a standalone AI-powered social app for users to find nearby restaurants, in public testing in San Francisco and New York (Natalie Lung/Bloomberg)
https://www.bloomberg.com/news/articles/2025-12-…
Amazon is testing an AI tool called Kindle Translate that automatically translates books into other languages, for authors that self-publish on the platform (Lawrence Bonk/Engadget)
https://www.engadget.com/ai/amazon-is-test
ROG Xbox Ally X review: great performance and grips, silent cooling, good 120Hz display, and USB-C, but Windows optimization isn't great and poor battery life (Rebecca Spear/Windows Central)
https://www.windowscentral.com/hardware/asus/asus-rog-xbox-ally-x-review…