
2025-06-10 20:45:52
Twenty-seven states and DC sue 23andMe to oppose the sale of DNA data from its customers without their direct consent (Rylee Kirk/New York Times)
https://www.nytimes.com/2025/06/10/business/23andme-data-lawsuit.html
Dirty Data in the Newsroom: Comparing Data Preparation in Journalism and Data Science
Stephen Kasica, Charles Berret, Tamara Munzner
https://arxiv.org/abs/2507.07238
Web 3.0 Requires Data Integrity
New integrity-focused standards are necessary to enable the trusted AI services of tomorrow.
🔐 https://cacm.acm.org/opinion/web-3-0-requires-data-integrity/
{nplyr} has helper functions to work on nested dataframes: #rstats #datascience
Wow. https://themarkup.org/pixel-hunt/2025/04/28/how-california-sent-residents-personal-health-data-to-linkedin My opinion of LinkedIn was already vanishingly low (one way to ensure I won't read something is to po…
https://www.theguardian.com/society/2025/jun/11/public-health-bodies-urged-launch-period-tracking-apps-protect-data
Public health bodies urged to launch period tracking apps to protect data
Understanding and Improving Data Repurposing
J. Parsons, R. Lukyanenko, B. Greenwood, C. Cooper
https://arxiv.org/abs/2506.09073 https://
The Duty Comes From the Data: Rethinking Platform Liability in the Age of Algorithmic Harm
https://musictechpolicy.com/2025/07/05
NFL, Genius Sports extend, expand data deal https://www.espn.com/nfl/story/_/id/45493514/nfl-extends-expands-exclusive-data-deal-genius-sports
Analysing semantic data storage in Distributed Ledger Technologies for Data Spaces
Juan Cano-Benito, Andrea Cimmino, Sven Hertling, Heiko Paulheim, Raúl García-Castro
https://arxiv.org/abs/2507.07116
President Trump's spending bill could limit local control over zoning and environmental regulations for AI data centers, worrying state lawmakers (Molly Taft/Wired)
https://www.wired.com/story/a-political-battle-is-brewing-over-data-centers/…
Just scrolling through the "Tree Preservation Orders" MHCLG has published geo-data for.
https://www.planning.data.gov.uk/dataset/tree
arxiv_citation: arXiv citation networks (1993-2003)
Citations among papers posted on arxiv.org under the hep-ph and hep-th categories, between 1993 and 2003. This period begins a few months after arXiv was launched. If a paper i cites a paper j also in this data set, then a directed edge connects i to j. (Papers not in the data set are excluded.) These data were originally released as part of the 2003 KDD Cup.
This network has 27770 nodes and 352807 edges.
Tags: Informational,…
Airlines Don't Want You to Know They Sold Your Flight Data to DHS https://www.404media.co/airlines-dont-want-you-to-know-they-sold-your-flight-data-to-dhs/
Very excited about this! Code to access GRIN will help lots of Google Books partners, and the example might open other doors, as well as the obvious benefits of access to data!
'Institutional Books 1.0: A 242B token dataset from Harvard Library's collections, refined for accuracy and usability' https://arxiv.org/abs/2506…
It's almost like the only way to prevent your entire life from being used against you is for someone to pass a law...
https://www.404media.co/airlines-dont-want-you-to-know-they-sold-your-flight-data-to-dhs/
CFMI: Flow Matching for Missing Data Imputation
Vaidotas Simkus, Michael U. Gutmann
https://arxiv.org/abs/2506.09258 https://arxiv.or…
RADAR: Benchmarking Language Models on Imperfect Tabular Data
Ken Gu, Zhihan Zhang, Kate Lin, Yuwei Zhang, Akshay Paruchuri, Hong Yu, Mehran Kazemi, Kumar Ayush, A. Ali Heydari, Maxwell A. Xu, Girish Narayanswamy, Yun Liu, Ming-Zher Poh, Yuzhe Yang, Mark Malhotra, Shwetak Patel, Hamid Palangi, Xuhai Xu, Daniel McDuff, Tim Althoff, Xin Liu
https://
Trump's DOJ makes its most sweeping demand for election data yet (NPR)
https://www.npr.org/2025/06/11/nx-s1-5426097/trump-justice-department-voter-data-colorado
http://www.memeorandum.com/250611/p142#a250611p142
States sue to block 23andMe from auctioning genetic data in bankruptcy plan | Courthouse News Service
https://www.courthousenews.com/states-sue-to-block-23andme-from-auctioning-genetic-data-in-bankruptcy-plan/
Check out today's Metacurity for a concise round-up of the critical infosec developments you should know, including
--Cambridge researchers warn that private companies are harvesting period tracker data
--United Natural Foods says system restoration likely by June 15,
--Gabbard wants feds to use private sector for intel tech needs,
--States sue to stop sale of 23andMe DNA data,
--Microsoft issues at least 67 patches,
--Microsoft fixes zero day exploited…
Calculating water/energy usage for "AI" per token is a bit problematic: A data center has a massive base load even if nobody uses it just by sheer existence. And since we have no actual data for any of the popular platforms all numbers floating around are problematic and not very useful.
Like how much power does one of those servers' NVIDIA cards really save if its utilization is only 50%? And are the overhead costs actually counted?
Data-Driven Nonlinear Regulation: Gaussian Process Learning
Telema Harry, Martin Guay, Shimin Wang, Richard D. Braatz
https://arxiv.org/abs/2506.09273 http…
Error-Guided Pose Augmentation: Enhancing Rehabilitation Exercise Assessment through Targeted Data Generation
Omar Sherif, Ali Hamdi
https://arxiv.org/abs/2506.09833
Consent for Processing Personal Data in the Age of AI: Key Updates Across Asia-Pacific
https://fpf.org/blog/consent-for-processing-personal-data-in-the-age-of-ai-key-updates-across-asia-pacific/
This thing... it moves!
(This is a technical proof of concept for building an iOS version of ZonePane. There's still a long way to go!)
QT: https://fedibird.com/@takke/114669428324725545
Lightweight Electronic Signatures and Reliable Access Control Included in Sensor Networks to Prevent Cyber Attacks from Modifying Patient Data
Mishall Al-Zubaidie
https://arxiv.org/abs/2506.08828
Gaussian copula correlation network analysis with application to multi-omics data
Ekaterina Tomilina (MaIAGE, GABI), Florence Jaffrézic (GABI), Gildas Mazo (MaIAGE)
https://arxiv.org/abs/2506.08586
A five-month investigation found that data centers in Mexico, Chile, South Africa, the Netherlands, and the U.S. overhyped their economic benefits and downplayed environmental damage, especially their water use, even in drought zones.
https://www.themaybe.org/research/data-cen
Calibrated Lanthanide Atomic Data for Kilonova Radiative Transfer. I. Atomic Structure and Opacities
Andreas Flörs, Ricardo Ferreira da Silva, José P. Marques, Jorge M. Sampaio, Gabriel Martínez-Pinedo
https://arxiv.org/abs/2507.07785
Linking Data Citation to Repository Visibility: An Empirical Study
Fakhri Momeni, Janete Saldanha Bach, Brigitte Mathiak, Peter Mutschke
https://arxiv.org/abs/2506.09530
Data Exfiltration in plain sight.
“Internet dead zones”, of course!
NOT!
https://www.thedailybeast.com/elon-musks-doge-goons-surreptitiously-transmitted-reams-of-white-house-data/
Real-Time Network Traffic Forecasting with Missing Data: A Generative Model Approach
Lei Deng, Wenhan Xu, Jingwei Li, Danny H. K. Tsang
https://arxiv.org/abs/2506.09647
RIVM update on sewage values and percentage positive.
In this 16th wave we seem once again to have landed on a plateau, at around 280. However, because of Pentecost the processing of the data is running about 3 days behind and the daily values vary quite a bit, so keep that in mind.
As a result, there are only 2 new days in the data: 4 and 5 June, with 35% and 25% of the measuring stations respectively.
#qp2t…
Exploring non-cold dark matter in a scenario of dynamical dark energy with DESI DR2 data
Tian-Nuo Li, Peng-Ju Wu, Guo-Hong Du, Yan-Hong Yao, Jing-Fei Zhang, Xin Zhang
https://arxiv.org/abs/2507.07798
Gradual Metaprogramming
Tianyu Chen, Darshal Shetty, Jeremy G. Siek, Chao-Hong Chen, Weixi Ma, Arnaud Venet, Rocky Liu
https://arxiv.org/abs/2506.09043 htt…
Data-driven Kinematic Modeling in Soft Robots: System Identification and Uncertainty Quantification
Zhanhong Jiang, Dylan Shah, Hsin-Jung Yang, Soumik Sarkar
https://arxiv.org/abs/2507.07370
This https://arxiv.org/abs/2506.06155 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csCV_…
Stop battling confusing Google Sheets charts when you've got different types of data! 🙅♀️ There's a much better way to show everything clearly.
My new video dives deep into Combo Charts, making even wildly different scales (like baby names & wind energy! 🌬️👶 – yes, really!) look perfectly clear with dual axes. It’s all about making your data understandable at a glance.
Check out the full tutorial for the how-to:
Neurosymbolic Feature Extraction for Identifying Forced Labor in Supply Chains
Zili Wang, Frank Montabon, Kristin Yvonne Rozier
https://arxiv.org/abs/2507.07217
High Signal: Data Science | Career | AI
Great Australian Pods Podcast Directory: #GreatAusPods
Hi-d maps: An interactive visualization technique for multi-dimensional categorical data
Radi Muhammad Reza, Benjamin A Watson
https://arxiv.org/abs/2507.07890
Virtru, which offers data security services to clients like the US DOD and Salesforce, raised a $50M Series D led by Iconiq Capital at a $500M valuation (Allie Garfinkle/Fortune)
https://fortune.com/2025/07/11/exclusi
Interesting stuff. This touches on a lot of work that was on my todo list. Especially estimating the "interestingness" of data by measuring the maximum compute needed to reach the optimal compression.
AIT is definitely becoming practically relevant.
https://arxiv.org/abs/2507.07995
A very well-made "video of the day" from a vanlifer :) not mine, of course, but this is one of the more authentic channels I follow, and this episode has a new POV format https://youtu.be/VTNR2vFPXcI
#vanlife #yt
GW170817 Viable Einstein-Gauss-Bonnet Inflation Compatible with the Atacama Cosmology Telescope Data
S. D. Odintsov, V. K. Oikonomou
https://arxiv.org/abs/2506.08193
MBTModelGenerator: A software tool for reverse engineering of Model-based Testing (MBT) models from clickstream data of web applications
Sasidhar Matta, Vahid Garousi
https://arxiv.org/abs/2506.08179
An Introduction to Solving the Least-Squares Problem in Variational Data Assimilation
I. Daužickaitė, M. A. Freitag, S. Gürol, A. S. Lawless, A. Ramage, J. A. Scott, J. M. Tabeart
https://arxiv.org/abs/2506.09211
Incentive Mechanism for Mobile Crowd Sensing with Assumed Bid Cost Reverse Auction
Jowa Yangchin, Ningrinla Marchang
https://arxiv.org/abs/2507.07688 https…
I discovered the terms "Wide data" and "Long data": #BuisnessIntelligence
Next stop in our NLP timeline is 2013, the introduction of low-dimensional dense word vectors - so-called "word embeddings" - based on distributional semantics, e.g. word2vec by Mikolov et al. from Google, which enabled representation learning on text.
T. Mikolov et al. (2013). Efficient Estimation of Word Representations in Vector Space.
…
Replaced article(s) found for physics.data-an. https://arxiv.org/list/physics.data-an/new
[1/1]:
- Orthogonal projections of hypercubes
Yoshiaki Horiike, Shin Fujishiro
A Saddle Point Algorithm for Robust Data-Driven Factor Model Problems
Shabnam Khodakaramzadeh, Soroosh Shafiee, Gabriel de Albuquerque Gleizer, Peyman Mohajerin Esfahani
https://arxiv.org/abs/2506.09776
This https://arxiv.org/abs/2407.00976 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_qbi…
Gradient-Weighted, Data-Driven Normalization for Approximate Border Bases -- Concept and Computation
Hiroshi Kera, Achim Kehrein
https://arxiv.org/abs/2506.09529
KramaBench: A Benchmark for AI Systems on Data-to-Insight Pipelines over Data Lakes
Eugenie Lai, Gerardo Vitagliano, Ziyu Zhang, Sivaprasad Sudhir, Om Chabra, Anna Zeng, Anton A. Zabreyko, Chenning Li, Ferdi Kossmann, Jialin Ding, Jun Chen, Markos Markakis, Matthew Russo, Weiyang Wang, Ziniu Wu, Michael J. Cafarella, Lei Cao, Samuel Madden, Tim Kraska
The Allen Institute for AI launches FlexOlmo, an LLM architecture that lets data owners control and remove their training data from a model even after training (Will Knight/Wired)
https://www.wired.com/story/flexolmo-ai-model-lets-data-owners-take-cont…
ErrorEraser: Unlearning Data Bias for Improved Continual Learning
Xuemei Cao, Hanlin Gu, Xin Yang, Bingjun Wei, Haoyang Liang, Xiangkun Wang, Tianrui Li
https://arxiv.org/abs/2506.09347
This https://arxiv.org/abs/2502.07732 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csCY_…
Conditional Unigram Tokenization with Parallel Data
Gianluca Vico, Jindřich Libovický
https://arxiv.org/abs/2507.07824 https://
Towards Provenance-Aware Earth Observation Workflows: the openEO Case Study
H. Omidi, L. Sacco, V. Hutter, G. Irsiegler, M. Claus, M. Schobben, A. Jacob, M. Schramm, S. Fiore
https://arxiv.org/abs/2506.08597
Qantas says 5.7 million affected by breach, leaked info not enough to access frequent flyer accounts https://therecord.media/qantas-airline-data-breach-frequent-flyer-numbers
US customs duties top $100 billion for first time in a fiscal year (David Lawder/Reuters)
https://www.reuters.com/business/trumps-tariff-collections-expected-grow-june-us-budget-data-2025-07-11/
http://www.memeorandum.com/250711/p103#a250711p103
Learning event-triggered controllers for linear parameter-varying systems from data
Renjie Ma, Su Zhang, Wenjie Liu, Zhijian Hu, Peng Shi
https://arxiv.org/abs/2506.08366
This https://arxiv.org/abs/2410.16316 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csCR_…
Integrated Galaxy Light from Stacking $10^5$ Random Pointings in the Dark Energy Survey Data
Jenna E. Moore, Seth H. Cohen, Philip Mauskopf, Evan Scannapieco
https://arxiv.org/abs/2506.08162
This https://arxiv.org/abs/2506.02791 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csSE_…
Replaced article(s) found for physics.data-an. https://arxiv.org/list/physics.data-an/new
[1/1]:
- Fuzzy permutation time irreversibility for nonequilibrium analysis of complex system
Wenpo Yao
Salesforce is restricting third-party companies from long-term indexing and storing of Slack messages, which would hamper rival enterprise AI firms like Glean (The Information)
https://www.theinformation.com/articles/salesforce-blocks-ai-rivals…
Algorithmic Complexity Attacks on All Learned Cardinality Estimators: A Data-centric Approach
Yingze Li, Xianglong Liu, Dong Wang, Zixuan Wang, Hongzhi Wang, Kaixing Zhang, Yiming Guan
https://arxiv.org/abs/2507.07438
(Trying this again, only this time with the right threat group. It's only Monday 🤪)
I had always assumed Salt Typhoon hit Comcast.
https://www.nextgov.com/cybersecurity/2025…
Generalization Error Analysis for Attack-Free and Byzantine-Resilient Decentralized Learning with Data Heterogeneity
Haoxiang Ye, Tao Sun, Qing Ling
https://arxiv.org/abs/2506.09438
Expediting data extraction using a large language model (LLM) and scoping review protocol: a methodological study within a complex scoping review
James Stewart-Evans, Emma Wilson, Tessa Langley, Andrew Prayle, Angela Hands, Karen Exley, Jo Leonardi-Bee
https://arxiv.org/abs/2507.06623
citeseer: CiteSeer citations (2014)
Citations among papers indexed by the CiteSeer digital library. If a paper i cites a paper j also in this data set, then a directed edge connects i to j. (Papers not in the data set are excluded.) Self-loops may be present.
This network has 384413 nodes and 1751463 edges.
Tags: Informational, Citation, Unweighted
FedP3E: Privacy-Preserving Prototype Exchange for Non-IID IoT Malware Detection in Cross-Silo Federated Learning
Rami Darwish, Mahmoud Abdelsalam, Sajad Khorsandroo, Kaushik Roy
https://arxiv.org/abs/2507.07258
Check out today's Metacurity for the critical infosec developments you should know, including
--UK's NCA arrested four people for M&S, Co-Op cyberattacks
--Russian hoops player Kasatkin busted in France in connection with ransomware,
--McDonald's employee chatbot was riddled with absurd flaws,
--Hackers stole $40m from GMX protocol,
--Customer data exposed in Bitcoin Depot breach,
--Hackers run scam messages in old Mt. Gox wallets,
--Ni…
Mycelium: A Transformation-Embedded LSM-Tree
Holly Casaletto, Jeff Lefevre, Aldrin Montana, Peter Alvaro
https://arxiv.org/abs/2506.08923 https://
This https://arxiv.org/abs/2506.04929 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csCL_…
cora: CORA citations (1998)
Citations among papers indexed by CORA, from 1998, an early computer science research paper search engine. If a paper i cites a paper j also in this data set, then a directed edge connects i to j. (Papers not in the data set are excluded.) Self-loops may be present. The dates of these snapshots are uncertain.
This network has 23166 nodes and 91500 edges.
Tags: Informational, Citation, Unweighted
Universal Embeddings of Tabular Data
Astrid Franz, Frederik Hoppe, Marianne Michaelis, Udo Göbel
https://arxiv.org/abs/2507.05904 https://
Israeli data security startup Cyera raised $540M led by Georgian, Greenoaks, and Lightspeed at a $6B valuation, up from $3B in November 2024 after raising $300M (Steven Scheer/Reuters)
https://www.reuters.com/world/middle-east/
Never a let-up in cybersecurity developments, so don't miss today's Metacurity for the most critical infosec developments you should know, including
--US grocery distributor United Natural Foods is the latest retail-related cyber victim
--M&S reopens website to shoppers,
--Google account phone numbers could have been brute-forced,
--TX and IL warn of breach-related data exposure,
--NHS blood supply still short a year after ransomware attack,
--C…
PrivTru: A Privacy-by-Design Data Trustee Minimizing Information Leakage
Lukas Gehring, Florian Tschorsch
https://arxiv.org/abs/2506.06124 https://
product_space: Atlas of Economic Complexity export network
Two networks of economic products, where a pair of products are connected if they are exported at similar rates by the same countries. The data are a projection from a bipartite network of nations and the products they export. Edge weights represent a similarity score (called "proximity"). Data based on UN Comtrade worldwide trade patterns. SITC network based on the Standard International Trade Classification and HS …
LGND, which uses vector embeddings to analyze geospatial data and has an enterprise app to query it, raised a $9M seed led by Javelin Venture Partners (Tim De Chant/TechCrunch)
https://techcrunch.com/2025/07/10/lgnd-wants-to-make-chatgpt-for-the-earth/
This https://arxiv.org/abs/2506.00759 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csCL_…
Learning Obfuscations Of LLM Embedding Sequences: Stained Glass Transform
Jay Roberts, Kyle Mylonakis, Sidhartha Roy, Kaan Kale
https://arxiv.org/abs/2506.09452
Will Cathcart says WhatsApp plans to support Apple in its legal case against the UK Home Office over weakening encryption, which may "set a dangerous precedent" (Zoe Kleinman/BBC)
https://www.bbc.com/news/articles/cgmjrn42wdwo
Sources: JPMorgan Chase told fintech companies it will start charging fees for access to customers' account data, which could drastically reshape the industry (Bloomberg)
https://www.bloomberg.com/news/articles/20
FlexOlmo: Open Language Models for Flexible Data Use
Weijia Shi, Akshita Bhagia, Kevin Farhat, Niklas Muennighoff, Pete Walsh, Jacob Morrison, Dustin Schwenk, Shayne Longpre, Jake Poznanski, Allyson Ettinger, Daogao Liu, Margaret Li, Dirk Groeneveld, Mike Lewis, Wen-tau Yih, Luca Soldaini, Kyle Lo, Noah A. Smith, Luke Zettlemoyer, Pang Wei Koh, Hannaneh Hajishirzi, Ali Farhadi, Sewon Min
Synthetic Tabular Data: Methods, Attacks and Defenses
Graham Cormode, Samuel Maddock, Enayat Ullah, Shripad Gade
https://arxiv.org/abs/2506.06108 https://
sp_colocation: Social co-locations (2018)
Network of colocations between people, based on the information on which RFID readers received information from the RFID tags. Namely, we define two individuals to be in co-presence if the same exact set of readers have received signals from both individuals during a 20s time window.
This network has 81 nodes and 150126 edges.
Tags: Social, Offline, Unweighted, Weighted, Temporal, Metadata
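The co-presence rule in the description above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the dataset authors' code: the input record layout `(timestamp_seconds, person_id, reader_id)`, the function name `copresence_edges`, and the use of fixed, non-overlapping 20-second windows aligned at time zero are all hypothetical choices, since the description does not say how the windows are placed.

```python
from collections import defaultdict
from itertools import combinations

WINDOW_SECONDS = 20  # window length taken from the dataset description


def copresence_edges(sightings, window=WINDOW_SECONDS):
    """Return temporal edges (window_start, person_a, person_b).

    `sightings` is an iterable of (timestamp_seconds, person_id, reader_id)
    tuples; this layout is an assumption made for the sketch.
    """
    # readers_seen[window_index][person] = set of readers that heard the tag
    readers_seen = defaultdict(lambda: defaultdict(set))
    for timestamp, person, reader in sightings:
        readers_seen[int(timestamp // window)][person].add(reader)

    edges = []
    for index in sorted(readers_seen):
        # Group people by the exact set of readers that received their signal.
        groups = defaultdict(list)
        for person, readers in readers_seen[index].items():
            groups[frozenset(readers)].append(person)
        # Everyone sharing an identical reader set is pairwise co-present.
        for members in groups.values():
            for a, b in combinations(sorted(members), 2):
                edges.append((index * window, a, b))
    return edges


example = [
    (0, "a", "r1"), (5, "b", "r1"),   # a and b: both heard only by r1
    (7, "c", "r1"), (8, "c", "r2"),   # c: heard by r1 and r2, so no match
    (25, "a", "r2"), (30, "c", "r2"), # next window: a and c share {r2}
]
print(copresence_edges(example))
```

Grouping by a `frozenset` of readers enforces the "same exact set" condition: a person heard by an extra reader in the same window is not linked to people heard by only a subset of those readers.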
Athena: Enhancing Multimodal Reasoning with Data-efficient Process Reward Models
Shuai Wang, Zhenhua Liu, Jiaheng Wei, Xuanwu Yin, Dong Li, Emad Barsoum
https://arxiv.org/abs/2506.09532
The data industry is consolidating, with Databricks' $1B Neon purchase, Salesforce's $8B Informatica deal, and more, fueled by the need for quality data for AI (Rebecca Szkutak/TechCrunch)
https://techcrunch.com/2025/07/07/ai-i