
2025-07-11 07:42:51
Dirty Data in the Newsroom: Comparing Data Preparation in Journalism and Data Science
Stephen Kasica, Charles Berret, Tamara Munzner
https://arxiv.org/abs/2507.07238
High Signal: Data Science | Career | AI
Great Australian Pods Podcast Directory: #GreatAusPods
cora: CORA citations (1998)
Citations among papers indexed by CORA, from 1998, an early computer science research paper search engine. If a paper i cites a paper j also in this data set, then a directed edge connects i to j. (Papers not in the data set are excluded.) Self-loops may be present. The dates of these snapshots are uncertain.
This network has 23166 nodes and 91500 edges.
Tags: Informational, Citation, Unweighted
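The edge rule described above (a directed edge from i to j only when both papers are in the data set, self-loops allowed) can be sketched in a few lines. This is a minimal illustration, not the data set's actual loader: the `(citing, cited)` pair format, the function name `build_citation_graph`, and the toy paper ids are all assumptions.

```python
def build_citation_graph(citations, indexed_papers):
    """Return adjacency sets for a directed, unweighted citation graph.

    citations: iterable of (citing_id, cited_id) pairs (hypothetical input format).
    indexed_papers: set of paper ids contained in the data set.
    """
    graph = {paper: set() for paper in indexed_papers}
    for citing, cited in citations:
        # Keep an edge only when both endpoints are in the data set.
        if citing in indexed_papers and cited in indexed_papers:
            graph[citing].add(cited)  # self-loops (citing == cited) are kept
    return graph


# Toy example: "z" is not indexed, so the edge a -> z is dropped.
papers = {"a", "b", "c"}
raw = [("a", "b"), ("b", "c"), ("c", "c"), ("a", "z")]
g = build_citation_graph(raw, papers)
num_nodes = len(g)
num_edges = sum(len(targets) for targets in g.values())
print(num_nodes, num_edges)
```

Storing targets in a set makes the graph unweighted: a repeated citation pair counts once.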
KramaBench: A Benchmark for AI Systems on Data-to-Insight Pipelines over Data Lakes
Eugenie Lai, Gerardo Vitagliano, Ziyu Zhang, Sivaprasad Sudhir, Om Chabra, Anna Zeng, Anton A. Zabreyko, Chenning Li, Ferdi Kossmann, Jialin Ding, Jun Chen, Markos Markakis, Matthew Russo, Weiyang Wang, Ziniu Wu, Michael J. Cafarella, Lei Cao, Samuel Madden, Tim Kraska
"Applications of the Critical Incident Technique in Library and Information Science Research: A Literature Review" https://doi.org/10.1515/libri-2024-0065
Effective Training Data Synthesis for Improving MLLM Chart Understanding
Yuwei Yang, Zeyu Zhang, Yunzhong Hou, Zhuowan Li, Gaowen Liu, Ali Payani, Yuan-Sen Ting, Liang Zheng
https://arxiv.org/abs/2508.06492
Science at Risk: The Urgent Need for Institutional Support of Long-Term Ecological and Evolutionary Research in an Era of Data Manipulation and Disinformation
Vincent A. Viblanc (UMR ISEM), Elise Huchard (UMR ISEM), Gilles Pinay (CEFE), Elena Ormeño (CEFE), Céline Teplitsky (CEFE), François Criscuolo (IGE), Dominique Joly (IGE), David Renault (IGE), Cécile Callou (IGE), Françoise Gourmelon (IGE), Sandrine Anquetin (IGE), Bénédicte Augeard (OFB), Fabien…
MetaInfoSci: An Integrated Web Tool for Scholarly Data Analysis
Kiran Sharma, Parul Khurana, Ziya Uddin
https://arxiv.org/abs/2506.09056 https://
Learning the Topic, Not the Language: How LLMs Classify Online Immigration Discourse Across Languages
Andrea Nasuto, Stefano Maria Iacus, Francisco Rowe, Devika Jain
https://arxiv.org/abs/2508.06435
SDSS-V Milky Way Mapper (MWM): ASPCAP Stellar Parameters and Abundances in SDSS-V Data Release 19
Szabolcs Mészáros, Paula Jofré, Jennifer A. Johnson, Jonathan C. Bird, Andrew R. Casey, Katia Cunha, Nathan De Lee, Peter Frinchaboy, Guillaume Guiglion, Viola Hegedűs, Alex P. Ji, Juna A. Kollmeier, Melissa K. Ness, Jonah Otto, Marc H. Pinsonneault, Alexandre Roman-Lopes, Amaya Sinha, Ying-Yi Song, Guy S. Stringfellow, Keivan G. Stassun, Jamie Tayar, Andrew Tkachenko…
A Metrics-Oriented Architectural Model to Characterize Complexity on Machine Learning-Enabled Systems
Renato Cordeiro Ferreira (University of São Paulo, Jheronimus Academy of Data Science, Technical University of Eindhoven, Tilburg University)
https://arxiv.org/abs/2506.08153
The most energetic transients - tidal disruptions of high-mass stars: #ExtremeNuclearTransients (ENTs) are the most energetic transients yet observed.
Automated Visualization Makeovers with LLMs
Siddharth Gangwar, David A. Selby, Sebastian J. Vollmer
https://arxiv.org/abs/2508.05637 https://arxiv.org/pdf/…
KI4Demokratie: An AI-Based Platform for Monitoring and Fostering Democratic Discourse
Rudy Alexandro Garrido Veliz, Till Nikolaus Schaland, Simon Bergmoser, Florian Horwege, Somya Bansal, Ritesh Nahar, Martin Semmann, Jörg Forthmann, Seid Muhie Yimam
https://arxiv.org/abs/2506.09947…
Fine-Tuning Vision-Language Models for Markdown Conversion of Financial Tables in Malaysian Audited Financial Reports
Jin Khye Tan (Faculty of Computer Science and Information Technology, Universiti Malaya), En Jun Choong, Ethan Jeremiah Chitty, Yan Pheng Choo, John Hsin Yang Wong, Chern Eu Cheah
https://arxiv.org/abs/2508.05669
This looks very cool.
'OpenAIRE in collaboration with Area Science Park organizes a hands-on workshop titled “Where LEGO Meets FAIR Data,” designed to introduce the principles of FAIR data through a creative, interactive simulation using LEGO metaphors.'
https://www.
Large Language Model-based Data Science Agent: A Survey
Peiran Wang, Yaoning Yu, Ke Chen, Xianyang Zhan, Haohan Wang
https://arxiv.org/abs/2508.02744 https://
Overview of statistical concepts in data science: could be useful in case you need some pre-formulated text to share with stakeholders... https://towardsdatascience.com/ultimate-guide-to-statistics-for-data-science-a3d8f1fd69a7
dblp_cite: DBLP citations (2014)
Citations among papers contained in the DBLP computer science bibliography. If a paper i cites a paper j also in this data set, then a directed edge connects i to j. (Papers not in the data set are excluded.) Self-loops may be present. This snapshot is from May 2014.
This network has 12590 nodes and 49759 edges.
Tags: Informational, Citation, Unweighted
This https://arxiv.org/abs/2503.17945 has been replaced.
initial toot: https://mastoxiv.page/@a…
This https://arxiv.org/abs/2412.04854 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_hepe…
Stream DaQ: Stream-First Data Quality Monitoring
Vasileios Papastergios, Anastasios Gounaris
https://arxiv.org/abs/2506.06147 https://
Replaced article(s) found for physics.soc-ph. https://arxiv.org/list/physics.soc-ph/new/
[1/1]:
Growth of Science and Women: Methodological Challenges of Using Structured Big Data
Centralized Copy-Paste: Enhanced Data Augmentation Strategy for Wildland Fire Semantic Segmentation
Joon Tai Kim, Tianle Chen, Ziyu Dong, Nishanth Kunchala, Alexander Guller, Daniel Ospina Acero, Roger Williams, Mrinal Kumar
https://arxiv.org/abs/2507.06321
GloBIAS: strengthening the foundations of BioImage Analysis
A. A. Corbat (BioImage Informatics Unit, Science for Life Laboratory and Department of Information Technology, Uppsala University, Sweden), C. G. Walther (German BioImaging, Gesellschaft für Mikroskopie und Bildanalyse e.V., Konstanz, Germany, University of Vienna, Vienna, Austria), L. R. de la Ballina (Centre for Cancer Cell Reprogramming, Institute of Clinical Medicine, Faculty of Medicine, University of Oslo, Montebe…
DELPHYNE: A Pre-Trained Model for General and Financial Time Series
Xueying Ding, Aakriti Mittal, Achintya Gopal
https://arxiv.org/abs/2506.06288 https://
Mapping correlations and coherence: adjacency-based approach to data visualization and regularity discovery
Guang-Xing Li
https://arxiv.org/abs/2506.05758 …
A Blueprint to Design Curriculum and Pedagogy for Introductory Data Science
Elijah Meyer, Mine Çetinkaya-Rundel
https://arxiv.org/abs/2508.03952 https://
MLOps with Microservices: A Case Study on the Maritime Domain
Renato Cordeiro Ferreira (Jheronimus Academy of Data Science, Technical University of Eindhoven, Tilburg University), Rowanne Trapmann (Jheronimus Academy of Data Science, Technical University of Eindhoven, Tilburg University), Willem-Jan van den Heuvel (Jheronimus Academy of Data Science, Technical University of Eindhoven, Tilburg University)
Explainer-guided Targeted Adversarial Attacks against Binary Code Similarity Detection Models
Mingjie Chen (Zhejiang University), Tiancheng Zhu (Huazhong University of Science and Technology), Mingxue Zhang (The State Key Laboratory of Blockchain and Data Security, Zhejiang University,Hangzhou High-Tech Zone), Yiling He (University College London), Minghao Lin (University of Southern California), Penghui Li (Columbia University), Kui Ren (The State Key Laboratory of Blockchain and Data Security, Z…
Two of NASA's historic data-collecting missions, used by scientists and earthbound agriculturalists to track carbon dioxide and crop health, may be permanently grounded as the Trump administration looks to shrink the agency's spending.
When they launched over a decade ago, the satellites known as the "Orbiting Carbon Observatories" (OCOs) revolutionized the collection of carbon data and greenhouse gas science.
To put it simply, the …
What Does Information Science Offer for Data Science Research?: A Review of Data and Information Ethics Literature
Brady D. Lund, Ting Wang
https://arxiv.org/abs/2506.03165
Explainable Hierarchical Deep Learning Neural Networks (Ex-HiDeNN)
Reza T. Batley, Chanwook Park, Wing Kam Liu, Sourav Saha
https://arxiv.org/abs/2507.05498
"The Seven Capital Sins of Open Science"
1. Worshiping the 'age factor'
2. Ignoring the value of data reuse and complexity
3. Disrespecting other disciplines
4. Publishing data without a supplementary paper
5. Creating and maintaining a nightmare for machines
6. Refusing to support investment in general infrastructure
7. Creating data without a FAIR and explicit data stewardship plan.
47 Tuc in Rubin Data Preview 1 - Exploring Early LSST Data and Science Potential: #Rubin Observatory is Just Getting Started: https://www.universetoday.com/articles/globular-clusters-the-vera-rubin-observatory-is-just-getting-started (based on ComCam images just released in bulk).
Please don't call it "politicized science". What #RFKjr and his ilk are doing is not science. Science does not discard data that disagrees with the researchers' desired outcomes.
They are purposefully ignoring and hiding any information that contradicts their beliefs. That is not science, that is censorship and lies in favor of an ideology.
#DH2025, we're delighted to share a Turing's Humanities and Data Science event in Oxford and online on 25th Sept with a panel asking: 'How far can data science and the humanities help to answer each other’s questions?'
Express your interest here:
Nonlinear Causal Discovery for Grouped Data
Konstantin Göbler, Tobias Windisch, Mathias Drton
https://arxiv.org/abs/2506.05120 https://
Inside the Vera C. Rubin Observatory, whose 3.2-gigapixel camera will produce 60PB of space image data over 10 years, to be analyzed using ML and deep learning (New York Times)
https://www.nytimes.com/2025/06/20/science…
The future of gravitational wave science unlocking LIGO potential: AI-driven data analysis and exploration
Yong Xiao, Li, Zin Nandar Win, He Wang, Hla Myo Tun, Win Thu Zar
https://arxiv.org/abs/2506.04584
On Inverse Problems, Parameter Estimation, and Domain Generalization
Deborah Pereg
https://arxiv.org/abs/2506.06024 https://arxiv.org…
Next investigation should be into the councillors themselves
Warrnambool council abandons peer-reviewed flood study, citing 'supposed science' - ABC News
https://www.abc.net.au/news/2025-06-05/regiona…
A Novel, Human-in-the-Loop Computational Grounded Theory Framework for Big Social Data
Lama Alqazlan, Zheng Fang, Michael Castelle, Rob Procter
https://arxiv.org/abs/2506.06083
D-Rex: Heterogeneity-Aware Reliability Framework and Adaptive Algorithms for Distributed Storage
Maxime Gonthier (University of Chicago, Argonne National Laboratory), Dante D. Sanchez-Gallegos (Universidad Carlos III de Madrid), Haochen Pan (University of Chicago), Bogdan Nicolae (Argonne National Laboratory), Sicheng Zhou (Southern University of Science and Technology), Hai Duc Nguyen (University of Chicago, Argonne National Laboratory), Valerie Hayot-Sasson (University of Chicago, Ar…
Bilinear Quadratic Output Systems and Balanced Truncation
Heike Faßbender (Institute for Numerical Analysis, TU Braunschweig), Serkan Gugercin (Department of Mathematics and Division of Computational Modeling and Data Analytics, Academy of Data Science, Virginia Tech), Till Peters (Institute for Numerical Analysis, TU Braunschweig)
https://
The SPHEREx Sky Simulator: Science Data Modeling for the First All-Sky Near-Infrared Spectral Survey
Brendan P. Crill, Yoonsoo P. Bach, Sean A. Bryan, Jean Choppin de Janvry, Ari J. Cukierman, C. Darren Dowell, Spencer W. Everett, Candice Fazar, Tatiana Goldina, Zhaoyu Huai, Howard Hui, Woong-Seob Jeong, Jae Hwan Kang, Phillip M. Korngut, Jae Joon Lee, Daniel C. Masters, Chi H. Nguyen, Jeonghyun Pyo, Teresa Symons, Yujin Yang, Michael Zemcov, Rachel Akeson, Matthew L. N. Ashby, James J…
This https://arxiv.org/abs/2502.06753 has been replaced.
link: https://scholar.google.com/scholar?q=a
Trustworthy Provenance for Big Data Science: a Modular Architecture Leveraging Blockchain in Federated Settings
Nicola Giuseppe Marchioro, Yannis Velegrakis, Valentine Anantharaj, Ian Foster, Sandro Luigi Fiore
https://arxiv.org/abs/2505.24675
Microsoft has once again been named a Leader in the 2025 Gartner® Magic Quadrant™ for Data Science and Machine Learning (DSML) Platforms.
https://azure.microsoft.com/en-us/blog
The Impact of Carbon Targets on Firms' Carbon Performance
Xichen Sun, Xingzhi Jia, Rogelio Oliva
https://arxiv.org/abs/2508.05811 https://arxiv.org/pdf…
#Blakes7 Series B, Episode 06 - Trial
THANIA: We reserve our opening declaration, sir.
SAMOR: Very well. Enter prosecution data. [A clerk presses some buttons.]
https://blake.torpidity.net/m/206/53
Observable Covariance and Principal Observable Analysis for Data on Metric Spaces
Ece Karacam, Washington Mio, Osman Berat Okutan
https://arxiv.org/abs/2506.04003
lol, basically every single example in this post shows how the LLM is just generating context that's not in the actual image. But somehow this is sold as being better than "classical" computer vision.
I don't know folks, if I actually wanted to do "data science", with focus on the "science" bit, I'd be disturbed by that. 🤷♂️
https://fosstodon.org/@Posit/114597245963405210
Online Sparsification of Bipartite-Like Clusters in Graphs
Joyentanuj Das, Suranjan De, He Sun
https://arxiv.org/abs/2508.05437 https://arxiv.org/pdf/2508.…
47 Tuc in Rubin Data Preview 1: Exploring Early LSST Data and Science Potential
Yumi Choi, Knut A. G. Olsen, Jeffrey L. Carlin, Yuankun (David) Wang, Fred Moolekamp, Abi Saha, Ian Sullivan, Colin T. Slater, Douglas L. Tucker, Christina L. Adair, Peter S. Ferguson, Yijung Kang, Karla Peña Ramírez, Markus Rabus
https:…
Pivoting the paradigm: the role of spreadsheets in K-12 data science
Oren Tirschwell, Nicholas Jon Horton
https://arxiv.org/abs/2506.03232 https://
Should we teach vibe coding? Here's why not.
Should AI coding be taught in undergrad CS education?
1/2
I teach undergraduate computer science labs, including for intro and more-advanced core courses. I don't publish (non-negligible) scholarly work in the area, but I've got years of craft expertise in course design, and I do follow the academic literature to some degree. In other words, I'm not the world's leading expert, but I have spent a lot of time thinking about course design, and consider myself competent at it, with plenty of direct experience in what knowledge & skills I can expect from students as they move through the curriculum.
I'm also strongly against most uses of what's called "AI" these days (specifically, generative deep neural networks as supplied by our current cadre of techbros). There are a surprising number of completely orthogonal reasons to oppose the use of these systems, and a very limited number of reasonable exceptions (overcoming accessibility barriers is an example). On the grounds of environmental and digital-commons-pollution costs alone, using specifically the largest/newest models is unethical in most cases.
But as any good teacher should, I constantly question these evaluations, because I worry about the impact on my students should I eschew teaching relevant tech for bad reasons (and even for good reasons). I also want to make my reasoning clear to students, who should absolutely question me on this. That inspired me to ask a simple question: ignoring for one moment the ethical objections (which we shouldn't, of course; they're very stark), at what level in the CS major could I expect to teach a course about programming with AI assistance, and expect students to succeed at a more technically demanding final project than a course at the same level where students were banned from using AI? In other words, at what level would I expect students to actually benefit from AI coding "assistance?"
To be clear, I'm assuming that students aren't using AI in other aspects of coursework: the topic of using AI to "help you study" is a separate one (TL;DR its gross value is not negative, but it's mostly not worth the harm to your metacognitive abilities, which AI-induced changes to the digital commons are making more important than ever).
So what's my answer to this question?
If I'm being incredibly optimistic, senior year. Slightly less optimistic, second year of a masters program. Realistic? Maybe never.
The interesting bit for you-the-reader is: why is this my answer? (Especially given that students would probably self-report significant gains at lower levels.) To start with, [this paper where experienced developers thought that AI assistance sped up their work on real tasks when in fact it slowed it down] (https://arxiv.org/abs/2507.09089) is informative. There are a lot of differences in task between experienced devs solving real bugs and students working on a class project, but it's important to understand that we shouldn't have a baseline expectation that AI coding "assistants" will speed things up in the best of circumstances, and we shouldn't trust self-reports of productivity (or the AI hype machine in general).
Now we might imagine that coding assistants will be better at helping with a student project than at helping with fixing bugs in open-source software, since it's a much easier task. For many programming assignments that have a fixed answer, we know that many AI assistants can just spit out a solution based on prompting them with the problem description (there's another elephant in the room here to do with learning outcomes regardless of project success, but we'll ignore this one too; my focus here is on project complexity reach, not learning outcomes). My question is about more open-ended projects, not assignments with an expected answer. Here's a second study (by one of my colleagues) about novices using AI assistance for programming tasks. It showcases how difficult it is to use AI tools well, and some of the stumbling blocks that novices in particular face.
But what about intermediate students? Might there be some level where the AI is helpful because the task is still relatively simple and the students are good enough to handle it? The problem with this is that as task complexity increases, so does the likelihood of the AI generating (or copying) code that uses more complex constructs which a student doesn't understand. Let's say I have second year students writing interactive websites with JavaScript. Without a level of care that those students don't know how to deploy, the AI is likely to suggest code that depends on several different frameworks, from React to jQuery, without actually setting up or including those frameworks, and of course those students would be way out of their depth trying to do that. This is a general problem: each programming class carefully limits the specific code frameworks and constructs it expects students to know based on the material it covers. There is no feasible way to limit an AI assistant to a fixed set of constructs or frameworks, using current designs. There are alternate designs where this would be possible (like AI search through adaptation from a controlled library of snippets) but those would be entirely different tools.
So what happens on a sizeable class project where the AI has dropped in buggy code, especially if it uses code constructs the students don't understand? Best case, they understand that they don't understand and re-prompt, or quickly ask for help from an instructor or TA, who helps them get rid of the stuff they don't understand and re-prompt or manually add stuff they do. Average case: they waste several hours and/or sweep the bugs partly under the rug, resulting in a project with significant defects. Students in their second and even third years of a CS major still have a lot to learn about debugging, and usually have significant gaps in their knowledge of even their most comfortable programming language. I do think regardless of AI we as teachers need to get better at teaching debugging skills, but the knowledge gaps are inevitable because there's just too much to know. In Python, for example, the LLM is going to spit out yields, async functions, try/finally, maybe even something like a while/else, or with recent training data, the walrus operator. I can't expect even a fraction of 3rd year students who have worked with Python since their first year to know about all these things, and based on how students approach projects where they have studied all the relevant constructs but have forgotten some, I'm not optimistic that seeing these things will magically become learning opportunities. Student projects are better off working with a limited subset of full programming languages that the students have actually learned, and using AI coding assistants as currently designed makes this impossible. Beyond that, even when the "assistant" just introduces bugs using syntax the students understand, even through their 4th year many students struggle to understand the operation of moderately complex code they've written themselves, let alone written by someone else. Having access to an AI that will confidently offer incorrect explanations for bugs will make this worse.
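For readers who haven't met the Python constructs named above, here is a minimal sketch that puts them side by side. It is an illustration written for this note, not LLM output or course material; the function names (`countdown`, `double_later`, `first_multiple`) are made up, and the walrus operator assumes Python 3.8 or later.

```python
import asyncio


def countdown(n):
    # Generator: `yield` hands back one value at a time and pauses the function.
    while n > 0:
        yield n
        n -= 1


async def double_later(x):
    # Coroutine: `await` gives control back to the event loop while waiting.
    await asyncio.sleep(0)
    return 2 * x


def first_multiple(values, k):
    log = []
    it = iter(values)
    try:
        # Walrus operator: assign and test in one expression.
        while (v := next(it, None)) is not None:
            if v % k == 0:
                log.append(f"found {v}")
                break
        else:
            # while/else: runs only if the loop ended without `break`.
            log.append("none found")
    finally:
        # try/finally: runs whether or not an exception was raised.
        log.append("done")
    return log


gen_values = list(countdown(3))
doubled = asyncio.run(double_later(21))
hit = first_multiple([5, 7, 9, 10], 3)
miss = first_multiple([5, 7], 3)
print(gen_values, doubled, hit, miss)
```

Each construct is simple in isolation; the difficulty described above comes from meeting several of them at once, unannounced, inside code the student did not write.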
To be sure, a small minority of students will be able to overcome these problems, but that minority is the group that has a good grasp of the fundamentals and has broadened their knowledge through self-study, which earlier AI-reliant classes would make less likely to happen. In any case, I care about the average student, since we already have plenty of stuff about our institutions that makes life easier for a favored few while being worse for the average student (note that our construction of that favored few as the "good" students is a large part of this problem).
To summarize: because AI assistants introduce excess code complexity and difficult-to-debug bugs, they'll slow down rather than speed up project progress for the average student on moderately complex projects. On a fixed deadline, they'll result in worse projects, or necessitate less ambitious project scoping to ensure adequate completion, and I expect this remains broadly true through 4-6 years of study in most programs (don't take this as an endorsement of AI "assistants" for masters students; we've ignored a lot of other problems along the way).
There's a related problem: solving open-ended project assignments well ultimately depends on deeply understanding the problem, and AI "assistants" allow students to put a lot of code in their file without spending much time thinking about the problem or building an understanding of it. This is awful for learning outcomes, but also bad for project success. Getting students to see the value of thinking deeply about a problem is a thorny pedagogical puzzle at the best of times, and allowing the use of AI "assistants" makes the problem much, much worse. This is another area I hope to see (or even drive) pedagogical improvement in, for what it's worth.
Now settled into low-Earth orbit, #SPHEREx (Spectro-Photometer for the History of the Universe, Epoch of Reionization, and Ices Explorer) has begun delivering its sky survey data to a public archive on a weekly basis, allowing anyone to use the data to probe the secrets of the cosmos: https://science.nasa.gov/open-science/spherex-universe-map/
23andMe's Data Sold to Nonprofit Run by Its Co-Founder - 'And I Still Don't Trust It' - Slashdot
https://science.slashdot.org/story/25/07/19/0252236/23andmes-data-sold-to-nonprofit-run-by-its-co-founder---and-i-still-dont-trust-it
Retrodicting Chaotic Systems: An Algorithmic Information Theory Approach
Kamal Dingle, Boumediene Hamzi, Marcus Hutter, Houman Owhadi
https://arxiv.org/abs/2507.04780
Is Your Training Pipeline Production-Ready? A Case Study in the Healthcare Domain
Daniel Lawand (University of São Paulo), Lucas Quaresma (University of São Paulo), Roberto Bolgheroni (University of São Paulo), Alfredo Goldman (University of São Paulo), Renato Cordeiro Ferreira (University of São Paulo, Jheronimus Academy of Data Science, Technical University of Eindhoven, Tilburg University)
Data Agent: A Holistic Architecture for Orchestrating Data AI Ecosystems
Zhaoyan Sun, Jiayi Wang, Xinyang Zhao, Jiachi Wang, Guoliang Li
https://arxiv.org/abs/2507.01599
A Data Science Approach to Calcutta High Court Judgments: An Efficient LLM and RAG-powered Framework for Summarization and Similar Cases Retrieval
Puspendu Banerjee, Aritra Mazumdar, Wazib Ansar, Saptarsi Goswami, Amlan Chakrabarti
https://arxiv.org/abs/2507.01058
CEMP: a platform unifying high-throughput online calculation, databases and predictive models for clean energy materials
Jifeng Wang, Jiazhe Ju, Ying Wang
https://arxiv.org/abs/2507.04423
This https://arxiv.org/abs/2505.24603 has been replaced.
initial toot: https://mastoxiv.page/@arXiv_csLG_…
Shared on Bluesky from a different #DH2025 'Nwulite Obodo Open Data License — Made for sharing African datasets equitably' https://datasciencelawlab.africa/nwulite-…
Meta debuts a prototype wristband to read electrical signals from forearm muscles, letting users control devices without touch, trained on 10K people's EMG data (Cade Metz/New York Times)
https://www.nytimes.com/2025/07/23/science/meta-computer-wristband-r…
Challenging Spontaneous Quantum Collapse with XENONnT
E. Aprile, J. Aalbers, K. Abe, S. Ahmed Maouloud, L. Althueser, B. Andrieu, E. Angelino, D. Antón Martin, S. R. Armbruster, F. Arneodo, L. Baudis, M. Bazyk, L. Bellagamba, R. Biondi, A. Bismark, K. Boese, A. Brown, G. Bruno, R. Budnik, C. Cai, C. Capelli, J. M. R. Cardoso, A. P. Cimental Chávez, A. P. Colijn, J. Conrad, J. J. Cuenca-García, C. Curceanu, V. D'Andrea, L. C. Daniel Garcia, M. P. Decowski, A. Deist…
Edge interventions can mitigate demographic and prestige disparities in the Computer Science coauthorship network
Kate Barnes, Mia Ellis-Einhorn, Carolina Chávez-Ruelas, Nayera Hasan, Mohammad Fanous, Blair D. Sullivan, Sorelle Friedler, Aaron Clauset
https://arxiv.org/abs/2506.04435…
The Ubiquitous Sparse Matrix-Matrix Products
Aydın Buluç
https://arxiv.org/abs/2508.04077 https://arxiv.org/pdf/2508.04077
cs_department: Aarhus Computer Science department relationships
Multiplex network consisting of 5 edge types corresponding to online and offline relationships (Facebook, leisure, work, co-authorship, lunch) between employees of the Computer Science department at Aarhus. Data hosted by Manlio De Domenico.
This network has 61 nodes and 620 edges.
Tags: Social, Relationships, Multilayer, Unweighted
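A multiplex network like the one described above can be held as one edge set per relationship type. The sketch below is only an illustration: the `(person_u, person_v, layer)` record format, the layer labels, the employee ids, and the choice to treat relationships as undirected are all assumptions, not details taken from the hosted data.

```python
LAYERS = ("facebook", "leisure", "work", "coauthor", "lunch")


def build_multiplex(records):
    """records: iterable of (person_u, person_v, layer) triples (hypothetical format)."""
    layers = {name: set() for name in LAYERS}
    for u, v, layer in records:
        # Assumption: relationships are undirected, so store each pair unordered.
        layers[layer].add(frozenset((u, v)))
    return layers


records = [
    ("e1", "e2", "work"),
    ("e2", "e1", "work"),      # same undirected pair, stored once
    ("e1", "e2", "lunch"),     # same pair, different layer: counted separately
    ("e2", "e3", "facebook"),
]
net = build_multiplex(records)
nodes = {p for edges in net.values() for pair in edges for p in pair}
total_edges = sum(len(edges) for edges in net.values())
print(len(nodes), total_edges)
```

Counting edges per layer and summing is one convention for reporting a multiplex edge total; whether the 620 edges quoted above were counted this way is not stated in the post.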
Analysis of points outcome in ATP Grand Slam Tennis using big data and machine learning
Martin Illum (Department of Applied Mathematics and Computer Science, Technical University of Denmark, Richard Petersens Plads, Denmark), Hans Christian Bechsøfft Mikkelsen (Department of Applied Mathematics and Computer Science, Technical University of Denmark, Richard Petersens Plads, Denmark), Emil Hovad (Department of Applied Mathematics and Computer Science, Technical University of Denmark, …
Airalogy: AI-empowered universal data digitization for research automation
Zijie Yang, Qiji Zhou, Fang Guo, Sijie Zhang, Yexun Xi, Jinglei Nie, Yudian Zhu, Liping Huang, Chou Wu, Yonghe Xia, Xiaoyu Ma, Yingming Pu, Panzhong Lu, Junshu Pan, Mingtao Chen, Tiannan Guo, Yanmei Dou, Hongyu Chen, Anping Zeng, Jiaxing Huang, Tian Xu, Yue Zhang
https://
Poisoning Attacks to Local Differential Privacy for Ranking Estimation
Pei Zhan (School of Cyber Science and Technology, Shandong University, State Key Laboratory of Cryptography and Digital Economy Security, Shandong University, Qingdao, China), Peng Tang (School of Cyber Science and Technology, Shandong University, State Key Laboratory of Cryptography and Digital Economy Security, Shandong University, Qingdao, China), Yangzhuo Li (School of Cyber Science and Technology, Shandong Univ…
Data Cleaning of Data Streams
Valerie Restat, Niklas Rodenhausen, Carina Antonin, Uta Störl
https://arxiv.org/abs/2507.20839 https://arxiv.org/pdf/2…
Millimeter-wave observations of Euclid Deep Field South using the South Pole Telescope: A data release of temperature maps and catalogs
M. Archipley, A. Hryciuk, L. E. Bleem, K. Kornoelje, M. Klein, A. J. Anderson, B. Ansarinejad, M. Aravena, L. Balkenhol, P. S. Barry, K. Benabed, A. N. Bender, B. A. Benson, F. Bianchini, S. Bocquet, F. R. Bouchet, E. Camphuis, M. G. Campitiello, J. E. Carlstrom, J. Cathey, C. L. Chang, S. C. Chapman, P. Chaubal, P. M. Chichura, A. Chokshi, T. -L. Chou…
#DH2025 thanks @flochiff.bsky.social for sharing this link as I wanted to follow up on Pandore! 'Pandore: automating text-processing workflows for humanities researchers' from Sorbonne Université and ObTIC - Observatoire des textes, des idées et des corpus
faculty_hiring: Faculty hiring networks (Comp. Sci., Business, History)
Three networks of faculty hiring in Computer Science Departments, Business Schools, and History Departments. Each node is a PhD-granting institution in the respective field, and a directed edge (i,j) indicates that a person received their PhD from node i and was tenure-track faculty at node j during time of collection (2011-2013). All data collected from faculty public rosters at the sampled institutions.
Thi…
Carbonate formation and fluctuating habitability on Mars: #Mars Science Laboratory Curiosity rover data may explain why planet was likely harsh desert for most of recent past.
Buckaroo: A Direct Manipulation Visual Data Wrangler
Annabelle Warner, Andrew McNutt, Paul Rosen, El Kindi Rezig
https://arxiv.org/abs/2507.16073 https://
Learning Lineage Constraints for Data Science Operations
Jinjin Zhao
https://arxiv.org/abs/2506.18252 https://arxiv.org/pdf/2506.1825…
Towards Next Generation Data Engineering Pipelines
Kevin M. Kramer, Valerie Restat, Sebastian Strasser, Uta Störl, Meike Klettke
https://arxiv.org/abs/2507.13892
sp_infectious: Art exhibit dynamic contacts (2011)
This dataset contains the daily dynamic contact networks collected during the Infectious SocioPatterns event that took place at the Science Gallery in Dublin, Ireland, during the artscience exhibition INFECTIOUS: STAY AWAY. Each file in the downloadable package contains a tab-separated list representing the active contacts during 20-second intervals of one day of data collection. Each line has the form “t i j”, where i and j are the a…
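As a rough illustration of the tab-separated "t i j" layout described above, the following sketch groups contact lines by their interval value t. The function name `parse_contacts`, the sample lines, and the choice to read all three fields as integers are assumptions made for illustration; the dataset description is truncated and does not confirm the field types.

```python
from collections import defaultdict


def parse_contacts(lines):
    """Group tab-separated "t i j" lines by t (hypothetical helper).

    Assumes t, i and j are all integers; the source description
    does not state this explicitly.
    """
    contacts = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        t, i, j = line.split("\t")
        contacts[int(t)].append((int(i), int(j)))
    return dict(contacts)


# Made-up sample lines, not taken from the real dataset.
sample = ["0\t1\t2", "0\t2\t3", "20\t1\t3"]
snapshots = parse_contacts(sample)
```

Each key of `snapshots` is one interval value, and each value is the list of (i, j) pairs recorded as active in that interval.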