«Worse, as the latest Apple papers shows, LLMs may well work on your easy test set (like Hanoi with 4 discs) and seduce you into thinking it has built a proper, generalizable solution when it does not.»
The technology that keeps on giving
https://garymarcus.substack.com/p/a-knockout-blow-for-llms
LAKEGEN: A LLM-based Tabular Corpus Generator for Evaluating Dataset Discovery in Data Lakes
Zhenwei Dai, Chuan Lei, Asterios Katsifodimos, Xiao Qin, Christos Faloutsos, Huzefa Rangwala
https://arxiv.org/abs/2507.04687
Exoplanet Atmospheric Refraction Effects in the #Kepler Sample: https://arxiv.org/abs/2507.02126 -> "We present an analysis on the detection viability of refraction effects in Kepler's exoplanet atmospheres using binning techniques for their light curves in order to compare against simulated refraction effects. We split the Kepler exoplanets into sub-populations according to orbital period and planetary radius, then search for out-of-transit changes in the relative flux associated with atmospheric refraction of starlight. The presence of refraction effects - or lack thereof - may be used to measure and set limits on the bulk properties of an atmosphere, including mean molecular weight or the presence of hazes.
In this work, we use the presence of refraction effects to test whether exoplanets above the period-radius valley have H/He atmospheres, which high levels of stellar radiation could evaporate away, in turn leaving rocky cores below the valley. We find strong observational evidence of refraction effects for exoplanets above the period-radius valley based on Kepler photometry, however those related to optically thin H/He atmospheres are not common in the observed planetary population. This result may be attributed to signal dampening caused by clouds and hazes, consistent with the optically thick and intrinsically hotter atmospheres of Kepler exoplanets caused by relatively close host star proximity."
But that is adjustable via the included Ecobee smart thermostat. If we set the backup heat to start at minus 5°C, our furnace could be running for 8 weeks or less. The key is to test and monitor electricity consumption at lower temperatures, since the heat pump works harder when it's colder, and then compare our gas and electricity bills to determine the right set-up. Over time we should find the right balance point.
hiv_transmission: HIV transmission network (1988-2001)
A set of networks of HIV transmissions between people through sexual, needle-sharing, or social connections, based on combining 8 datasets collected from 1988 to 2001. Metadata includes test results of several diseases, as well as demographic variables such as age, ethnicity, and gender. Networks come in two flavors: egodyads and altdyads. Egodyads are the network among study-participants and their direct partners. Altdyads are the…
One of the goals I've set for further development of #Python eclasses in #Gentoo was to avoid needless complexity. Unfortunately, the subject matter sometimes requires it. However, many of the functions added lately had already been done manually in ebuilds for years.
We started disabling plugin autoloading years ago. First we did that only for individual packages that caused issues. Then, for those where tests ended up being really slow. Finally, pretty much anywhere `python_test()` was declared. Doing it all manually was particularly cumbersome; all I needed for `EPYTEST_PLUGINS` was a good idea of how to generalize it.
Similarly, `EPYTEST_XDIST` was added after we had been manually adding `epytest -p xdist -n "$(makeopts_jobs)" --dist=worksteal`, and while at it, I added `EPYTEST_JOBS` to override the job count.
Perhaps `EPYTEST_TIMEOUT` wasn't that common. However, it was meant to help CI systems that could otherwise get stuck on a hanging test.
Similarly, "standard library" version (like `3.9`) matching in `python_gen_cond_dep` was added after a long period of explicitly stating `python3_9 pypy3`. As an extra benefit, this also resolved the problem that, at the time, `pypy3` could mean different Python versions.
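To make the `EPYTEST_XDIST` / `EPYTEST_JOBS` generalization concrete, here is a minimal bash sketch of the idea: one helper builds the arguments that ebuilds used to spell out by hand. It is not the eclass code. The function names `sketch_makeopts_jobs` and `sketch_epytest_args` are hypothetical, and the `MAKEOPTS` parsing is a simplified stand-in for the real `makeopts_jobs`.

```shell
#!/usr/bin/env bash
# Hypothetical illustration only; the real logic lives in the Gentoo eclasses.

# Simplified stand-in for makeopts_jobs: take N from "-jN" in MAKEOPTS, default 1.
sketch_makeopts_jobs() {
    local jobs=1
    local re='-j[[:space:]]*([0-9]+)'
    if [[ ${MAKEOPTS:-} =~ ${re} ]]; then
        jobs=${BASH_REMATCH[1]}
    fi
    printf '%s\n' "${jobs}"
}

# Build the extra pytest arguments that were previously written manually.
sketch_epytest_args() {
    local args=()
    if [[ -n ${EPYTEST_XDIST:-} ]]; then
        # EPYTEST_JOBS, when set, overrides the job count derived from MAKEOPTS.
        local jobs=${EPYTEST_JOBS:-$(sketch_makeopts_jobs)}
        args+=( -p xdist -n "${jobs}" --dist=worksteal )
    fi
    printf '%s\n' "${args[*]}"
}
```

Calling `MAKEOPTS="-j8" EPYTEST_XDIST=1 sketch_epytest_args` yields the same argument string as the manual invocation quoted above, with the job count filled in; additionally setting `EPYTEST_JOBS=2` replaces that count.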
Test Drive Unlimited: a quieter Horizon Festival
I found TDU in a box in the loft a couple of weeks ago. Since I now have a new Xbox 360 set up (replacing my old one, which didn't have HDMI output) and can play games that are not compatible with the Xbox One, I thought I'd revisit the game. Unfortunately my save wasn't in the cloud, and wasn't transferred while I had both consoles set up, so I needed to start from the beginning.
In-context learning for the classification of manipulation techniques in phishing emails
Antony Dalmiere (LAAS-TRUST, LAAS), Guillaume Auriol (LAAS-TRUST, INSA Toulouse), Vincent Nicomette (LAAS-TSF, LAAS), Pascal Marchand (LERASS)
https://arxiv.org/abs/2506.22515
A note on the properties of the confidence set for the local average treatment effect obtained by inverting the score test
Ezequiel Smucler, Ludovico Lanni, David Masip
https://arxiv.org/abs/2506.10449
Proportional Sensitivity in Generative Adversarial Network (GAN)-Augmented Brain Tumor Classification Using Convolutional Neural Network
Mahin Montasir Afif, Abdullah Al Noman, K. M. Tahsin Kabir, Md. Mortuza Ahmmed, Md. Mostafizur Rahman, Mufti Mahmud, Md. Ashraful Babu
https://arxiv.org/abs/2506.17165
Potemkin Understanding in Large Language Models
Marina Mancoridis, Bec Weeks, Keyon Vafa, Sendhil Mullainathan
https://arxiv.org/abs/2506.21521
Abstract: Large language models (LLMs) are regularly evaluated using benchmark datasets. But what justifies making inferences about an LLM's capabilities based on its answers to a curated set of questions? This paper first introduces a formal framework to address this question. The key is to note that the benchmarks used to test LLMs -- such as AP exams -- are also those used to test people. However, this raises an implication: these benchmarks are only valid tests if LLMs misunderstand concepts in ways that mirror human misunderstandings. Otherwise, success on benchmarks only demonstrates potemkin understanding: the illusion of understanding driven by answers irreconcilable with how any human would interpret a concept. We present two procedures for quantifying the existence of potemkins: one using a specially designed benchmark in three domains, the other using a general procedure that provides a lower-bound on their prevalence. We find that potemkins are ubiquitous across models, tasks, and domains. We also find that these failures reflect not just incorrect understanding, but deeper internal incoherence in concept representations.
Refract ICL: Rethinking Example Selection in the Era of Million-Token Models
Arjun R. Akula, Kazuma Hashimoto, Krishna Srinivasan, Aditi Chaudhary, Karthik Raman, Michael Bendersky
https://arxiv.org/abs/2506.12346