WanJuan-CC: A Safe and High-Quality Open-sourced English Webtext Dataset
Jiantao Qiu, Haijun Lv, Zhenjiang Jin, Rui Wang, Wenchang Ning, Jia Yu, ChaoBin Zhang, Pei Chu, Yuan Qu, Runyu Peng, Zhiyuan Zeng, Huanze Tang, Ruiliang Xu, Wei Li, Hang Yan, Conghui He
https://arxiv.org/abs/2402.19282
An open letter on the position of scientists and researchers on the recently proposed changes to the EU’s proposed Child Sexual Abuse Regulation. As of 1 May 2024, the letter has been signed by 254 scientists and researchers from 33 countries. (among them @… )
AI companies to universities: Personalized tutors will make you obsolete
Also AI companies: Thanks for recording your lectures so we can sell them on the open market to train personalized tutors
https://annettevee.substack.com/p/when-student-data-is-the-new-oil
Advanced analysis of single-molecule spectroscopic data
Joshua L. Botha, Bertus van Heerden, Tjaart P. J. Krüger
https://arxiv.org/abs/2404.18945 https://arxiv.org/pdf/2404.18945
arXiv:2404.18945v1 Announce Type: new
Abstract: We present Full SMS, a multipurpose graphical user interface (GUI)-based software package for analysing single-molecule spectroscopy (SMS) data. SMS typically delivers multiparameter data -- such as fluorescence brightness, lifetime, and spectra -- of molecular- or nanometre-scale particles such as single dye molecules, quantum dots, or fluorescently labelled biological macromolecules. Full SMS allows an unbiased statistical analysis of fluorescence brightness through level resolution and clustering, analysis of fluorescence lifetimes through decay fitting, as well as the calculation of second-order correlation functions and the display of fluorescence spectra and raster-scan images. Additional features include extensive data filtering options, a custom HDF5-based file format, and flexible data export options. The software is open source and written in Python but GUI-based so it may be used without any programming knowledge. A multi-process architecture was employed for computational efficiency. The software is also designed to be easily extendable to include additional import data types and analysis capabilities.
Differentiated Security Architecture for Secure and Efficient Infotainment Data Communication in IoV Networks
Jiani Fan, Lwin Khin Shar, Jiale Guo, Wenzhuo Yang, Dusit Niyato, Kwok-Yan Lam
https://arxiv.org/abs/2403.20136
The definition of what constitutes an "open system" has really been sanded down by "AI". Just dropping random weights and a bit of network structure counts as open: No idea how and on what something was trained, which data was filtered out, which earlier training runs were discarded.
It's not really open at all. Maybe "free" as in "free beer" but from learning potential, ability to understand the system and maybe extend it, it's not any …
LLM compression will be the final nail in the coffin of the open web
There is no benefit in creating content just for it to be copied, compressed, and regurgitated without attribution or payment
Arc Browser is pretty cool, but it assumes websites will continue to publish high quality data without anyone ever visiting them directly
PeLLE: Encoder-based language models for Brazilian Portuguese based on open data
Guilherme Lamartine de Mello, Marcelo Finger, Felipe Serras, Miguel de Mello Carpi, Marcos Menon Jose, Pedro Henrique Domingues, Paulo Cavalim
https://arxiv.org/abs/2402.19204
Deep Learning for Educational Data Science
Juan D. Pinto, Luc Paquette
https://arxiv.org/abs/2404.19675 https://arxiv.org/pdf/2404.19675
arXiv:2404.19675v1 Announce Type: new
Abstract: With the ever-growing presence of deep artificial neural networks in every facet of modern life, a growing body of researchers in educational data science -- a field consisting of various interrelated research communities -- have turned their attention to leveraging these powerful algorithms within the domain of education. Use cases range from advanced knowledge tracing models that can leverage open-ended student essays or snippets of code to automatic affect and behavior detectors that can identify when a student is frustrated or aimlessly trying to solve problems unproductively -- and much more. This chapter provides a brief introduction to deep learning, describes some of its advantages and limitations, presents a survey of its many uses in education, and discusses how it may further come to shape the field of educational data science.
"The findings demonstrate that repository shutdown does happen and can result in permanent data loss… Data #reuse & #citation are increasingly promoted by journals, funders and other stakeholders. If these practices become more common, data loss might pose a threat to the permanence of the scholarly…
Handling Open Research Data within the Max Planck Society -- Looking Closer at the Year 2020
Martin Boosen, Michael Franke, Yves Vincent Grossmann, Sy Dat Ho, Larissa Leiminger, Jan Matthiesen
https://arxiv.org/abs/2402.18182
Imagine if game studios open-sourced the models of real-world buildings, sites, etc. that they have researched for their games (Ubisoft).
Even if the data is a decade old, I wonder if such models could help indie dev studios.
#gaming #gamedesign
Bevy 0.13 is out! 🕺
For those who don't know, Bevy is a data-driven game engine built in Rust.
Check out the new features 👇
https://bevyengine.org/news/bevy-0-13/
Processing HSV Colored Medical Images and Adapting Color Thresholds for Computational Image Analysis: a Practical Introduction to an open-source tool
Lie Cai, Andre Pfob
https://arxiv.org/abs/2404.17878 https://arxiv.org/pdf/2404.17878
arXiv:2404.17878v1 Announce Type: new
Abstract: Background: Using artificial intelligence (AI) techniques for computational medical image analysis has shown promising results. However, colored images are often not readily available for AI analysis because of different coloring thresholds used across centers and physicians as well as the removal of clinical annotations. We aimed to develop an open-source tool that can adapt different color thresholds of HSV-colored medical images and remove annotations with a simple click.
Materials and Methods: We built a function using MATLAB and used multi-center international shear wave elastography data (NCT 02638935) to test the function. We provide step-by-step instructions with accompanying code lines.
Results: We demonstrate that the newly developed pre-processing function successfully removed letters and adapted different color thresholds of HSV-colored medical images.
Conclusion: We developed an open-source tool for removing letters and adapting different color thresholds in HSV-colored medical images. We hope this contributes to advancing medical image processing for developing robust computational imaging algorithms using diverse multi-center big data. The open-source Matlab tool is available at https://github.com/cailiemed/image-threshold-adapting.
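As a rough illustration of the kind of HSV thresholding the abstract describes, here is a minimal Python sketch. It is not the authors' MATLAB tool: the function name `hue_mask`, the hue range, and the saturation cutoff are all assumptions made for this example, and it uses only the standard-library `colorsys` module.

```python
import colorsys


def hue_mask(pixels, hue_min, hue_max, min_saturation=0.2):
    """Mark pixels whose hue lies in [hue_min, hue_max] (hue scaled to 0..1).

    `pixels` is a list of rows, each row a list of (r, g, b) tuples in 0..255.
    Pixels with low saturation (greys, white, black) are never selected.
    This is a hypothetical helper, not part of the published tool.
    """
    mask = []
    for row in pixels:
        mask_row = []
        for r, g, b in row:
            # colorsys expects channel values in the range 0..1
            h, s, _v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
            mask_row.append(s >= min_saturation and hue_min <= h <= hue_max)
        mask.append(mask_row)
    return mask


# Tiny 2x2 test image: red, blue / grey, green
image = [
    [(255, 0, 0), (0, 0, 255)],
    [(128, 128, 128), (0, 255, 0)],
]

# Select only the bluish pixels (assumed hue window around 2/3)
blue_mask = hue_mask(image, 0.55, 0.75)
print(blue_mask)
```

Changing `hue_min` and `hue_max` corresponds, loosely, to adapting the color threshold for images coming from different centers.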
PEFSL: A deployment Pipeline for Embedded Few-Shot Learning on a FPGA SoC
Lucas Grativol Ribeiro (IMT Atlantique - MEE, Lab_STICC_BRAIn, Lab-STICC_2AI, LHC), Lubin Gauthier (Lab_STICC_BRAIn, IMT Atlantique - MEE), Mathieu Leonardon (IMT Atlantique - MEE, Lab_STICC_BRAIn), Jérémy Morlier (IMT Atlantique - MEE, Lab_STICC_BRAIn), Antoine Lavrard-Meyer (IMT Atlantique), Guillaume Muller (Mines Saint-Étienne MSE, FAYOL-ENSMSE, FAYOL-ENSMSE), Virginie Fresse (LHC, …
39 years ago today: On 2 May 1985, the #USA detonated "Towanda", the 9th atomic bomb of Operation Grenadier. Grenadier was a series of nuclear weapons tests (#Kernwaffentests) in 1984/85 involving a total of 16 bombs at the test site in
European Peatlands & Policies Open Data Mapathon 2024
Seeking teams (2 to 4 people) from all European countries to register for European Peatlands & Policies Open Data Mapathon 2024 #Mapathon2024
Date: 6th April Venue: Online/University of Galway
First Prize €1,200.
February 22, 2024: "Today, we’re excited to announce that the Bluesky network is federating and opening up in a way that allows you to host your own data."
https://bsky.social/about/blog/02-22-2024-open-social-web
OpenMedLM: Prompt engineering can out-perform fine-tuning in medical question-answering with open-source large language models
Jenish Maharjan, Anurag Garikipati, Navan Preet Singh, Leo Cyrus, Mayank Sharma, Madalina Ciobanu, Gina Barnes, Rahul Thapa, Qingqing Mao, Ritankar Das
https://arxiv.org/abs/2402.19371
Investigating the dissemination of STEM content on social media with computational tools
Oluwamayokun Oshinowo, Priscila Delgado, Meredith Fay, C. Alessandra Luna, Anjana Dissanayaka, Rebecca Jeltuhin, David R. Myers
https://arxiv.org/abs/2404.18944
3D-Speaker-Toolkit: An Open Source Toolkit for Multi-modal Speaker Verification and Diarization
Yafeng Chen, Siqi Zheng, Hui Wang, Luyao Cheng, Tinglong Zhu, Changhe Song, Rongjie Huang, Ziyang Ma, Qian Chen, Shiliang Zhang, Xihao Li
https://arxiv.org/abs/2403.19971
"#Apple points to #WebApps as the open alternative to the App Store, and actions to remove them have created deep concern in the web community.
#iOS demoting Web Apps to shortcuts threaten data loss and undermi…
Given the rather sloppy headlines about what “Wordpress” is planning to do with allowing AI systems to collect user data, I will point out:
- All the references to Wordpress refer to Wordpress.com, the hosting company whose parent company is Automattic. They do not refer to Wordpress the open source software project.
- Wordpress.com has published a blog post outlining their policies so that you don’t need to rely on speculative and vaguely alarming news articles
Now: Marcel Ackermann's talk "Datenintegration aus offenen Quellen in der dblp computer science bibliography" ("Data integration from open sources in the dblp computer science bibliography") at #kimws24. I'm getting a small flashback to my time in the OKFN Working Group on Open Bibliographic Data, as part of which we published Marcel's guest post on the release of the DBLP data in 2011:
Germany has open sourced a lot of data related to public #EV charging infrastructure.
Charging data is anonymised to protect operator interests, but it may still be useful for others to work with.
15k different ad-hoc pricing structures also registered - who wants to do an analysis on that?
In another move[1] to stay up to date with the latest version of Zig (v0.12.0), I've also updated all code (and .zig.zon dependency info) in the still-just-a-baby zig.thi.ng repo:
https://github.com/thi-ng/zig-thing
[1] Related (from yesterday):
Colosseum: The Open RAN Digital Twin
Michele Polese, Leonardo Bonati, Salvatore D'Oro, Pedram Johari, Davide Villa, Sakthivel Velumani, Rajeev Gangula, Maria Tsampazi, Clifton Paul Robinson, Gabriele Gemmi, Andrea Lacava, Stefano Maxenti, Hai Cheng, Tommaso Melodia
https://arxiv.org/abs/2404.17317
Wow! My newsletter has 1800 subscribers now 🤩 💃
I am so happy that so many people are interested 🫶
To celebrate, I am preparing a very special issue this week:
✨ 3 simple rules for creating Open Science Policies ✨
(a collaboration with Sander Bosch)
Interested? You can have it in your inbox on Friday or view it directly on the newsletter page:
Join Hellmar Becker at this year's Berlin Buzzwords to learn how to track data lineage in a real-time, open source analytics pipeline. #bbuzz
https://2024.berlinbuzzwords.de/sessio
"Bei keiner der für 2024 angekündigten deutschen Open-Data-Day-Veranstaltungen findet sich das Thema Gender Data Gap.
„Die Auswirkungen geschlechtsspezifischer Datenlücken werden völlig unterschätzt“, so die Präsidentin des djb, Ursula Matthiessen-Kreuder. „Es besteht ein dringender Bedarf an nach Geschlechtern aufgeschlüsselten Daten.“"
Spent 3 days and 4 different operating systems trying to get an NVIDIA desktop GPU to act like a data center GPU in Linux. Turns out it was one kernel parameter. Woof.
#nvidia #linux #ai
EndToEndML: An Open-Source End-to-End Pipeline for Machine Learning Applications
Nisha Pillai, Athish Ram Das, Moses Ayoola, Ganga Gireesan, Bindu Nanduri, Mahalingam Ramkumar
https://arxiv.org/abs/2403.18203
StarCoder 2 and The Stack v2: The Next Generation
Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, Denis Kocetkov, Arthur Zucker, Younes Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal, Jia Li, Jian Zhu, Terry Yue Zhuo, Evgenii Zheltonozhskii, Nii Osae Osae Dade, Wenhao Yu, Lucas Krauß, Naman Jain, Yixuan S…
39 years ago today: On 2 April 1985, the #USA detonated "Hermosa", the 7th atomic bomb of Operation Grenadier. Grenadier was a series of nuclear weapons tests (#Kernwaffentests) in 1984/85 involving a total of 16 bombs at the test site in
Open Your Ears to Take a Look: A State-of-the-Art Report on the Integration of Sonification and Visualization
Kajetan Enge, Elias Elmquist, Valentina Caiola, Niklas Rönnberg, Alexander Rind, Michael Iber, Sara Lenzi, Fangfei Lan, Robert Höldrich, Wolfgang Aigner
https://arxiv.org/abs/2402.16558
73 years ago today: On 1 April 1952, the #USA detonated the atomic bomb "Able" as part of Operation Tumbler–Snapper. Tumbler–Snapper was a series of nuclear weapons tests (#Kernwaffentests) in 1952 involving a total of 8 bombs at the test site in
Snowflake announces Arctic, an LLM optimized for enterprise tasks such as SQL generation, coding, and instruction following, with an Apache 2.0 license (Shubham Sharma/VentureBeat)
https://venturebeat.com/data-infrastru
Cycling on the Freeway: The Perilous State of Open Source Neuroscience Software
Britta U. Westner, Daniel R. McCloy, Eric Larson, Alexandre Gramfort, Daniel S. Katz, Arfon M. Smith, invited co-signees
https://arxiv.org/abs/2403.19394
Most scientists need software to perform their research (Barker et al., 2020; Carver et al., 2022; Hettrick, 2014; Hettrick et al., 2014; Switters and Osimo, 2019), and neuroscientists are no exception. Whether we work with reaction times, electrophysiological signals, or magnetic resonance imaging data, we rely on software to acquire, analyze, and statistically evaluate the raw data we obtain - or to generate such data if we work with simulations. In recent years there has been a shift toward …
Study on the Temporal Evolution of Literature Bradford Curves in the Context of Library Specialization
Haobai Xue, Xian Liu
https://arxiv.org/abs/2404.19267 https://arxiv.org/pdf/2404.19267
arXiv:2404.19267v1 Announce Type: new
Abstract: Bradford's law of bibliographic scattering is a fundamental law in bibliometrics and can provide valuable guidance to academic libraries in literature search and procurement. However, Bradford curves can take various shapes at different points in time, and a causal explanation is still lacking, so predicting their shape remains an open question. This paper attributes the deviation of the Bradford curve from the theoretical J-shape to the integer constraints on the journal number and article number, and extends Leimkuhler and Egghe's formula to cover the core region of very productive journals, where the theoretical journal number falls below one. The key parameters of the extended formula are identified and studied using the Simon-Yule model. The reasons for the Groos Droop are explained and the critical point for the shape change is studied. Finally, the proposed formulae are validated with empirical data found in the literature. It is found that the proposed method can be used to predict the evolution of Bradford curves and thus guide academic libraries in procuring and using scientific literature.
Radical-Cylon: A Heterogeneous Data Pipeline for Scientific Computing
Arup Kumar Sarker, Aymen Alsaadi, Niranda Perera, Mills Staylor, Gregor von Laszewski, Matteo Turilli, Ozgur Ozan Kilic, Mikhail Titov, Andre Merzky, Shantenu Jha, Geoffrey Fox
https://arxiv.org/abs/2403.15721
Apparently even the European Data Protection Supervisor (#EDPS) can't find the funds for its Mastodon servers, which is bizarre if you think about all the "talk" about not relying on #bigtech, a more independent EU, open standards, security, the importance of reliable information, etc.
Do Large Language Models Understand Conversational Implicature -- A case study with a Chinese sitcom
Shisen Yue, Siyuan Song, Xinyuan Cheng, Hai Hu
https://arxiv.org/abs/2404.19509 https://arxiv.org/pdf/2404.19509
arXiv:2404.19509v1 Announce Type: new
Abstract: Understanding the non-literal meaning of an utterance is critical for large language models (LLMs) to become human-like social communicators. In this work, we introduce SwordsmanImp, the first Chinese multi-turn-dialogue-based dataset aimed at conversational implicature, sourced from dialogues in the Chinese sitcom My Own Swordsman. It includes 200 carefully handcrafted questions, all annotated on which Gricean maxims have been violated. We test eight closed-source and open-source LLMs under two tasks: a multiple-choice question task and an implicature explanation task. Our results show that GPT-4 attains human-level accuracy (94%) on multiple-choice questions. CausalLM demonstrates a 78.5% accuracy following GPT-4. Other models, including GPT-3.5 and several open-source models, demonstrate a lower accuracy ranging from 20% to 60% on multiple-choice questions. Human raters were asked to rate the explanation of the implicatures generated by LLMs on their reasonability, logic and fluency. While all models generate largely fluent and self-consistent text, their explanations score low on reasonability except for GPT-4, suggesting that most LLMs cannot produce satisfactory explanations of the implicatures in the conversation. Moreover, we find LLMs' performance does not vary significantly by Gricean maxims, suggesting that LLMs do not seem to process implicatures derived from different maxims differently. Our data and code are available at https://github.com/sjtu-compling/llm-pragmatics.
Clustering and Ranking: Diversity-preserved Instruction Selection through Expert-aligned Quality Estimation
Yuan Ge, Yilun Liu, Chi Hu, Weibin Meng, Shimin Tao, Xiaofeng Zhao, Hongxia Ma, Li Zhang, Hao Yang, Tong Xiao
https://arxiv.org/abs/2402.18191
An analysis of the effects of sharing research data, code, and preprints on citations
Giovanni Colavizza, Lauren Cadwallader, Marcel LaFlamme, Grégory Dozot, Stéphane Lecorney, Daniel Rappo, Iain Hrynaszkiewicz
https://arxiv.org/abs/2404.16171 https://arxiv.org/pdf/2404.16171
arXiv:2404.16171v1 Announce Type: new
Abstract: Calls to make scientific research more open have gained traction with a range of societal stakeholders. Open Science practices include but are not limited to the early sharing of results via preprints and openly sharing outputs such as data and code to make research more reproducible and extensible. Existing evidence shows that adopting Open Science practices has effects in several domains. In this study, we investigate whether adopting one or more Open Science practices leads to significantly higher citations for an associated publication, which is one form of academic impact. We use a novel dataset known as Open Science Indicators, produced by PLOS and DataSeer, which includes all PLOS publications from 2018 to 2023 as well as a comparison group sampled from the PMC Open Access Subset. In total, we analyze circa 122'000 publications. We calculate publication and author-level citation indicators and use a broad set of control variables to isolate the effect of Open Science Indicators on received citations. We show that Open Science practices are adopted to different degrees across scientific disciplines. We find that the early release of a publication as a preprint correlates with a significant positive citation advantage of about 20.2% on average. We also find that sharing data in an online repository correlates with a smaller yet still positive citation advantage of 4.3% on average. However, we do not find a significant citation advantage for sharing code. Further research is needed on additional or alternative measures of impact beyond citations. Our results are likely to be of interest to researchers, as well as publishers, research funders, and policymakers.
32 years ago today: On 30 April 1992, the #USA detonated "Diamond Fortune", the 3rd atomic bomb of Operation Julin. Julin was a series of nuclear weapons tests (#Kernwaffentests) in 1991/92 involving a total of 9 bombs at the test site in
Worldcoin announces Personal Custody, which saves biometric data captured by the Orb on users' personal devices, and plans to open source the Orb's software (RT Watson/The Block)
https://www.theblock.co/post/284123/worldcoin-to-end-storing-…
55 years ago today: On 30 April 1969, the #USA detonated the atomic bombs "Blenton" & "Thistle" as part of Operation Bowline. Bowline was a series of nuclear weapons tests (#Kernwaffentests) in 68/69 involving a total of 58 bombs at the test site in
49 years ago today: On 30 April 1975, the #USA tested the atomic bomb "Obar". Operation Bedrock was a series of 27 US nuclear weapons tests (#Kernwaffentests) conducted underground at the Nevada Test Site in Nevada in 1974/75.
37 years ago today: On 30 April 1987, the #USA detonated the atomic bomb "Hardin" as part of Operation Musketeer. Musketeer was a series of nuclear weapons tests (#Kernwaffentests) in 1986/87 involving a total of 16 bombs at the test site in