Tootfinder

@arXiv_csDC_bot@mastoxiv.page
2025-09-12 07:42:19

Optimizing the Variant Calling Pipeline Execution on Human Genomes Using GPU-Enabled Machines
Ajay Kumar, Praveen Rao, Peter Sanders
https://arxiv.org/abs/2509.09058 https://

Variant calling is the first step in analyzing a human genome and aims to detect variants in an individual's genome compared to a reference genome. Due to the computationally-intensive nature of variant calling, genomic data are increasingly processed in cloud environments as large amounts of compute and storage resources can be acquired with the pay-as-you-go pricing model. In this paper, we address the problem of efficiently executing a variant calling pipeline for a workload of human genomes…

@UP8@mastodon.social
2025-08-01 21:32:34

⚗️ Secrets of the dark genome could spark new drug discoveries
#drugs

Since the Human Genome Project first produced the genetic instructions for a human being by sequencing DNA 22 years ago, scientists have been focused on the roughly 2% of the genome that produces proteins.

@arXiv_quantph_bot@mastoxiv.page
2025-08-12 11:58:23

Pangenome-guided sequence assembly via binary optimisation
Josh Cudby, James Bonfield, Chenxi Zhou, Richard Durbin, Sergii Strelchuk
https://arxiv.org/abs/2508.08200 https://

De novo genome assembly is challenging in highly repetitive regions; however, reference-guided assemblers often suffer from bias. We propose a framework for pangenome-guided sequence assembly, which can resolve short-read data in complex regions without bias towards a single reference genome. Our method frames assembly as a graph traversal optimisation problem, which can be implemented on quantum computers. The pipeline first annotates pangenome graphs with estimated copy numbers for each node,…

@arXiv_qbioGN_bot@mastoxiv.page
2025-07-14 08:44:01

MicroTrace: A Lightweight R Tool for SNP-Based Pathogen Clustering in Outbreak Detection
Kaitao Lai
https://arxiv.org/abs/2507.08060 https://

MicroTrace is an open-source R tool that performs SNP-based hierarchical clustering to detect potential transmission clusters from pathogen whole-genome sequencing (WGS) data. Designed for epidemiologists, microbiologists, and genomic surveillance teams, it processes SNP distance matrices and outputs dendrograms and cluster tables with optional metadata integration. MicroTrace enables reproducible outbreak detection workflows with minimal setup.
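The clustering step described in this abstract can be illustrated with a minimal sketch. This is not MicroTrace's code or API (the tool itself is written in R); it is a Python illustration that assumes single-linkage clustering cut at a fixed SNP threshold, and the function name, sample names, distances, and threshold are all hypothetical.

```python
def single_linkage_clusters(names, dist, threshold):
    """Group samples linked by a chain of pairwise SNP distances <= threshold.

    Equivalent to cutting a single-linkage dendrogram at `threshold`.
    `dist` is a symmetric matrix given as a list of lists.
    """
    parent = list(range(len(names)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if dist[i][j] <= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i, name in enumerate(names):
        clusters.setdefault(find(i), []).append(name)
    return sorted(clusters.values())


# Hypothetical SNP distance matrix for four isolates.
names = ["A", "B", "C", "D"]
dist = [
    [0, 2, 3, 40],
    [2, 0, 4, 41],
    [3, 4, 0, 39],
    [40, 41, 39, 0],
]
clusters = single_linkage_clusters(names, dist, threshold=5)
print(clusters)  # [['A', 'B', 'C'], ['D']]
```

Other linkage rules (average, complete) would give different groupings at the same threshold; single linkage is used here only because it is the simplest to write without dependencies.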

@arXiv_mathAT_bot@mastoxiv.page
2025-07-09 08:47:42

Topological Sequence Analysis of Genomes: Delta Complex approaches
Jian Liu, Li Shen, Dong Chen, Guo-Wei Wei
https://arxiv.org/abs/2507.05452 https://

Algebraic topology has been widely applied to point cloud data to capture geometric shapes and topological structures. However, its application to genome sequence analysis remains rare. In this work, we propose topological sequence analysis (TSA) techniques by constructing $Δ$-complexes and classifying spaces, leading to persistent homology, and persistent path homology on genome sequences. We also develop $Δ$-complex-based persistent Laplacians to facilitate the topological spectral analysis…

@servelan@newsie.social
2025-07-23 15:11:54

Scientists Decode 1918 Flu Virus Genome From Century-Old Tissue
https://scitechdaily.com/scientists-decode-1918-flu-virus-genome-from-century-old-tissue/

@arXiv_qbioGN_bot@mastoxiv.page
2025-09-09 09:34:32

Minimum-Cost Synthetic Genome Planning: An Algorithmic Framework
Michail Patsakis, Ioannis Mouratidis, Ilias Georgakopoulos-Soares
https://arxiv.org/abs/2509.06234 https://

As synthetic genomics scales toward the construction of increasingly larger genomes, computational strategies are needed to address technical feasibility. We introduce an algorithmic framework for the Minimum-Cost Synthetic Genome Planning problem, aiming to identify the most cost-effective strategy to assemble a target genome from a source genome through a combination of reuse, synthesis, and join operations. By comparing dynamic programming and greedy heuristic strategies under diverse cost r…

@arXiv_eessIV_bot@mastoxiv.page
2025-09-10 11:27:04

Crosslisted article(s) found for eess.IV. https://arxiv.org/list/eess.IV/new
[1/1]:
- The Protocol Genome: A Self-Supervised Learning Framework from DICOM Headers
Jimmy Joseph

@arXiv_csDS_bot@mastoxiv.page
2025-08-06 07:39:20

When is String Reconstruction using de Bruijn Graphs Hard?
Ben Bals, Sebastiaan van Krieken, Solon P. Pissis, Leen Stougie, Hilde Verbeek
https://arxiv.org/abs/2508.03433 https:…

The reduction of the fragment assembly problem to (variations of) the classical Eulerian trail problem [Pevzner et al., PNAS 2001] has led to remarkable progress in genome assembly. This reduction employs the notion of de Bruijn graph $G=(V,E)$ of order $k$ over an alphabet $Σ$. A single Eulerian trail in $G$ represents a candidate genome reconstruction. Bernardini et al. have also introduced the complementary idea in data privacy [ALENEX 2020] based on $z$-anonymity. The pressing question i…
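The reduction mentioned in this abstract can be sketched in a few lines. The example below is an illustration, not code from the paper: it assumes the convention that nodes are (k-1)-mers and every k-mer occurrence is one directed edge, and it uses Hierholzer's algorithm to find one Eulerian trail. The function names are hypothetical.

```python
from collections import defaultdict


def de_bruijn_edges(s, k):
    """One directed edge (prefix, suffix) per k-mer occurrence in s."""
    return [(s[i:i + k - 1], s[i + 1:i + k]) for i in range(len(s) - k + 1)]


def eulerian_trail(edges):
    """Hierholzer's algorithm on a directed multigraph.

    Assumes the edge list admits an Eulerian trail, which holds when it
    was built from a single string as above.
    """
    out = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree
    for u, v in edges:
        out[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # Start at the node with one surplus outgoing edge, else at the first node.
    start = next((n for n, b in balance.items() if b == 1), edges[0][0])
    stack, trail = [start], []
    while stack:
        u = stack[-1]
        if out[u]:
            stack.append(out[u].pop())
        else:
            trail.append(stack.pop())
    return trail[::-1]


def spell(trail):
    """Turn a node sequence back into a string."""
    return trail[0] + "".join(node[-1] for node in trail[1:])


candidate = spell(eulerian_trail(de_bruijn_edges("GATTACA", 3)))
print(candidate)  # GATTACA
```

For repetitive inputs the graph can contain several Eulerian trails, each spelling a different string with the same k-mers, which is why a single trail is only a candidate reconstruction.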

@arXiv_condmatsoft_bot@mastoxiv.page
2025-08-26 09:13:46

Liquid-liquid phase separation enables highly selective viral genome packaging
Layne B. Frechette, Michael F. Hagan
https://arxiv.org/abs/2508.17211 https://

In many viruses, hundreds of proteins assemble an outer shell (capsid) around the viral nucleic acid to form an infectious virion. How the assembly process selects the viral genome amidst a vast excess of diverse cellular nucleic acids is poorly understood. It has recently been discovered that many viruses perform assembly and genome packaging within liquid-liquid phase separated biomolecular condensates inside the host cell. However, the role of condensates in genome packaging is poorly unders…

@arXiv_statME_bot@mastoxiv.page
2025-07-09 09:16:32

FDR controlling procedures with dimension reduction and their application to GWAS with linkage disequilibrium score
Dayeon Jung, Yewon Kim, Junyong Park
https://arxiv.org/abs/2507.06049

Genome-wide association studies (GWAS) have led to the discovery of numerous single nucleotide polymorphisms (SNPs) associated with various phenotypes and complex diseases. However, the identified genetic variants do not fully explain the heritability of complex traits, known as the missing heritability problem. To address this challenge and accurately control false positives while maximizing true associations, we propose two approaches involving linkage disequilibrium (LD) scores as covariates…

@arXiv_physicsbioph_bot@mastoxiv.page
2025-07-04 09:18:51

Modelling transcriptional silencing and its coupling to 3D genome organisation
Massimiliano Semeraro, Giuseppe Negro, Davide Marenduzzo, Giada Forte
https://arxiv.org/abs/2507.02150

Timely up- or down-regulation of gene expression is crucial for cellular differentiation and function. While gene upregulation via transcriptional activators has been extensively investigated, gene silencing remains understudied, especially by modelling. This study employs 3D simulations to study the biophysics of a chromatin fibre where active transcription factors compete with repressors for binding to transcription units along the fibre, and investigates how different silencing mechanisms af…

@arXiv_quantph_bot@mastoxiv.page
2025-08-11 09:56:49

Scalable Quantum State Preparation for Encoding Genomic Data with Matrix Product States
Floyd M. Creevey, Hitham T. Hassan, James McCafferty, Lloyd C. L. Hollenberg, Sergii Strelchuk
https://arxiv.org/abs/2508.06184

As quantum computing hardware advances, the need for algorithms that facilitate the loading of classical data into the quantum states of these devices has become increasingly important. This study presents a method for producing scalable quantum circuits to encode genomic data using the Matrix Product State (MPS) formalism. The method is illustrated by encoding the genome of the bacteriophage $ΦX174$ into a 15-qubit state, and analysing the trade-offs between MPS bond dimension, reconstruction…

@arXiv_statCO_bot@mastoxiv.page
2025-08-08 08:35:02

A near-exact linear mixed model for genome-wide association studies
Zhibin Pu, Shufei Ge, Shijia Wang
https://arxiv.org/abs/2508.05278 https://arxiv.org/pd…

Linear mixed models (LMM) are widely adopted in genome-wide association studies (GWAS) to account for population stratification and cryptic relatedness. However, the parameter estimation of LMMs imposes substantial computational burdens due to large-scale operations on genetic similarity matrices (GSM). We introduced the near-exact linear mixed model (NExt-LMM), a novel LMM framework that overcomes critical computational bottlenecks in GWAS through the following key innovations. Firstly, we exp…

@jorgecandeias@mastodon.social
2025-08-18 17:53:18

A link to leave Chega supporters foaming at the mouth.
tl;dr - there has never been any such thing as "Portuguese blood". It has always been one gigantic mixture, with immigrants from all over the place adding to the stew.
https://link.springer.com/article/10.118…

The genetic history of Portugal over the past 5,000 years - Genome Biology
Background: Recent ancient DNA studies uncovering large-scale demographic events in Iberia have presented very limited data for Portugal, a country located at the westernmost edge of continental Eurasia. Here, we present the most comprehensive collection of Portuguese ancient genome-wide data, from 67 individuals spanning 5000 years of human history, from the Neolithic to the nineteenth century. Results: We identify early admixture between local hunter-gatherers and Anatolian-related farmers in N…

@arXiv_qbioGN_bot@mastoxiv.page
2025-07-08 08:37:30

AuraGenome: An LLM-Powered Framework for On-the-Fly Reusable and Scalable Circular Genome Visualizations
Chi Zhang, Yu Dong, Yang Wang, Yuetong Han, Guihua Shan, Bixia Tang
https://arxiv.org/abs/2507.02877

Circular genome visualizations are essential for exploring structural variants and gene regulation. However, existing tools often require complex scripting and manual configuration, making the process time-consuming, error-prone, and difficult to learn. To address these challenges, we introduce AuraGenome, an LLM-powered framework for rapid, reusable, and scalable generation of multi-layered circular genome visualizations. AuraGenome combines a semantic-driven multi-agent workflow with an inter…

@arXiv_csAR_bot@mastoxiv.page
2025-07-04 12:15:17

Replaced article(s) found for cs.AR. https://arxiv.org/list/cs.AR/new
[1/1]:
- MARS: Processing-In-Memory Acceleration of Raw Signal Genome Analysis Inside the Storage Subsystem
Soysal, Koliogeorgi, Firtina, Ghiasi, Nadig, Mao, Oliveira, Liang, Zambaku, Sadrosadati, Mutlu
…

@arXiv_qbioQM_bot@mastoxiv.page
2025-09-05 08:28:31

Predicting Antimicrobial Resistance (AMR) in Campylobacter, a Foodborne Pathogen, and Cost Burden Analysis Using Machine Learning
Shubham Mishra, The Anh Han, Bruno Silvester Lopes, Shatha Ghareeb, Zia Ush Shamszaman
https://arxiv.org/abs/2509.03551

Antimicrobial resistance (AMR) poses a significant public health and economic challenge, increasing treatment costs and reducing antibiotic effectiveness. This study employs machine learning to analyze genomic and epidemiological data from the public databases for molecular typing and microbial genome diversity (PubMLST), incorporating data from UK government-supported AMR surveillance by the Food Standards Agency and Food Standards Scotland. We identify AMR patterns in Campylobacter jejuni and…

@jby@ecoevo.social
2025-06-18 19:24:35

A new paper projecting Joshua tree habitat under future climate based on incredibly high-resolution distribution data, from Joshua Tree Genome Project collaborators at USGS. They estimate up to 80% loss of suitable habitat by 2100 under the worst-case climate scenario.
#JoshuaTree #science

Map of projected future habitat probabilities for Joshua tree populations based on random forest models of presence and absence, for the years 2071-2100 under SSP3-7.0. Parts of the trees' current range, indicated as outlines, are colored to indicate high probability of presence, but many parts are colored to indicate lower probability

A scatterplot of estimated future suitable habitat area in 2021-2040, 2041-2070, and 2071-2100, under three different future climate scenarios and based on modeling from different baseline time frames. In general, less suitable habitat is projected in the latest time period, and less is projected under more severe climate change

@arXiv_qbioPE_bot@mastoxiv.page
2025-09-01 08:25:23

Suppression of errors in collectively coded information
Martin J. Falk, Leon Zhou, Yoshiya J. Matsubara, Kabir Husain, Jack W. Szostak, Arvind Murugan
https://arxiv.org/abs/2508.21806

Modern life largely transmits genetic information from mother to daughter through the duplication of single physically intact molecules that encode information. However, copying an extended molecule requires highly processive copying machinery and high fidelity that scales with the genome size to avoid the error catastrophe. Here, we explore these fidelity requirements in an alternative architecture, the virtual circular genome, in which no one physical molecule encodes the full genetic informa…

@arXiv_statAP_bot@mastoxiv.page
2025-07-10 08:34:11

A Machine Learning Framework for Breast Cancer Treatment Classification Using a Novel Dataset
Md Nahid Hasan, Md Monzur Murshed, Md Mahadi Hasan, Faysal A. Chowdhury
https://arxiv.org/abs/2507.06243

Breast cancer (BC) remains a significant global health challenge, with personalized treatment selection complicated by the disease's molecular and clinical heterogeneity. BC treatment decisions rely on various patient-specific clinical factors, and machine learning (ML) offers a powerful approach to predicting treatment outcomes. This study utilizes The Cancer Genome Atlas (TCGA) breast cancer clinical dataset to develop ML models for predicting the likelihood of undergoing chemotherapy or horm…

@arXiv_csAI_bot@mastoxiv.page
2025-08-20 10:10:50

A Biased Random Key Genetic Algorithm for Solving the Longest Run Subsequence Problem
Christian Blum, Pedro Pinacho-Davidson
https://arxiv.org/abs/2508.14020 https://

The longest run subsequence (LRS) problem is an NP-hard combinatorial optimization problem belonging to the class of subsequence problems from bioinformatics. In particular, the problem plays a role in genome reassembly. In this paper, we present a solution to the LRS problem using a Biased Random Key Genetic Algorithm (BRKGA). Our approach places particular focus on the computational efficiency of evaluating individuals, which involves converting vectors of gray values into valid solutions to …

@arXiv_qbioGN_bot@mastoxiv.page
2025-07-09 08:27:12

BMFM-DNA: A SNP-aware DNA foundation model to capture variant effects
Hongyang Li, Sanjoy Dey, Bum Chul Kwon, Michael Danziger, Michal Rosen-Tzvi, Jianying Hu, James Kozloski, Ching-Huei Tsou, Bharath Dandala, Pablo Meyer
https://arxiv.org/abs/2507.05265

Large language models (LLMs) trained on text have demonstrated remarkable results on natural language processing (NLP) tasks. These models have been adapted to decipher the language of DNA, where sequences of nucleotides act as "words" that encode genomic functions. However, the genome differs fundamentally from natural language, as it lacks clearly defined words or a consistent grammar. Although DNA language models (DNALMs) such as DNABERT and GENA-LM have achieved a high level of performance on genome-…

@arXiv_qbioQM_bot@mastoxiv.page
2025-09-03 08:12:03

Friend or Foe
Oleksandr Cherendichenko, Josephine Solowiej-Wedderburn, Laura M. Carroll, Eric Libby
https://arxiv.org/abs/2509.00123 https://arxiv.org/pdf/…

A fundamental challenge in microbial ecology is determining whether bacteria compete or cooperate in different environmental conditions. With recent advances in genome-scale metabolic models, we are now capable of simulating interactions between thousands of pairs of bacteria in thousands of different environmental settings at a scale infeasible experimentally. These approaches can generate tremendous amounts of data that can be exploited by state-of-the-art machine learning algorithms to uncov…

@tiotasram@kolektiva.social
2025-07-19 07:51:05

AI, AGI, and learning efficiency
My 4-month-old kid is not DDoSing Wikipedia right now, nor will they ever do so before learning to speak, read, or write. Their entire "training corpus" will not top even 100 million "tokens" before they can speak & understand language, and do so with real intentionality.
Just to emphasize that point: 100 words-per-minute times 60 minutes-per-hour times 12 hours-per-day times 365 days-per-year times 4 years is a mere 105,120,000 words. That's a ludicrously *high* estimate of words-per-minute and hours-per-day, and 4 years old (the age of my other kid) is well after basic speech capabilities are developed in many children, etc. More likely the available "training data" is at least 1 or 2 orders of magnitude less than this.
The point here is that large language models, trained as they are on multiple *billions* of tokens, are not developing their behavioral capabilities in a way that's remotely similar to humans, even if you believe those capabilities are similar (they are by certain very biased ways of measurement; they very much aren't by others). This idea that humans must be naturally good at acquiring language is an old one (see e.g. #AI #LLM #AGI
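The arithmetic in this post is easy to check. The short sketch below just multiplies the post's own figures; the one-billion-token corpus used for comparison is a hypothetical round number chosen as the low end of "multiple billions", not a figure for any particular model.

```python
# Upper-bound estimate of words heard by age 4, using the post's figures.
words_per_minute = 100
minutes_per_hour = 60
hours_per_day = 12
days_per_year = 365
years = 4

total_words = (words_per_minute * minutes_per_hour * hours_per_day
               * days_per_year * years)
print(f"{total_words:,}")  # 105,120,000

# Hypothetical comparison point: a corpus of exactly one billion tokens.
one_billion = 1_000_000_000
print(round(one_billion / total_words, 1))  # 9.5
```

Even against that deliberately small corpus, the generous upper bound for a child is almost ten times smaller.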

@arXiv_statME_bot@mastoxiv.page
2025-07-22 11:11:30

Testing Homogeneity in a heteroscedastic contaminated normal mixture
Xiaoqing Niu, Pengfei Li, Yuejiao Fu
https://arxiv.org/abs/2507.15630 https://

Large-scale simultaneous hypothesis testing appears in many areas such as microarray studies, genome-wide association studies, brain imaging, disease mapping and astronomical surveys. A well-known inference method is to control the false discovery rate. One popular approach is to model the $z$-scores derived from the individual $t$-tests and then use this model to control the false discovery rate. We propose a new class of contaminated normal mixtures for modelling $z$-scores. We further design…

@arXiv_qbioQM_bot@mastoxiv.page
2025-08-29 09:15:41

Artificial Intelligence for CRISPR Guide RNA Design: Explainable Models and Off-Target Safety
Alireza Abbaszadeh, Armita Shahlai
https://arxiv.org/abs/2508.20130 https://…

CRISPR-based genome editing has revolutionized biotechnology, yet optimizing guide RNA (gRNA) design for efficiency and safety remains a critical challenge. Recent advances (2020–2025, updated to reflect current year if needed) demonstrate that artificial intelligence (AI), especially deep learning, can markedly improve the prediction of gRNA on-target activity and identify off-target risks. In parallel, emerging explainable AI (XAI) techniques are beginning to illuminate the black-box nature …

@arXiv_physicsbioph_bot@mastoxiv.page
2025-08-26 13:00:44

Crosslisted article(s) found for physics.bio-ph. https://arxiv.org/list/physics.bio-ph/new
[1/1]:
- Liquid-liquid phase separation enables highly selective viral genome packaging
Layne B. Frechette, Michael F. Hagan

@arXiv_qbioGN_bot@mastoxiv.page
2025-09-09 09:33:52

Investigating DNA words and their distributions across the tree of life
Charalampos Koilakos, Kimonas Provatas, Michail Patsakis, Aris Karatzikos, Ilias Georgakopoulos-Soares
https://arxiv.org/abs/2509.05539

The frequency distributions of DNA k-mers are shaped by fundamental biological processes and offer a window into genome structure and evolution. Inspired by analogies to natural language, prior studies have attempted to model genomic k-mer usage using Zipf's law, a rank-frequency law originally formulated for words in human language. However, the extent to which this law accurately captures the distribution of k-mers across diverse species remains unclear. Here, we systematically analyze k-mer …
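To make the rank-frequency idea concrete, here is a minimal sketch, assuming plain overlapping k-mer counting and an ordinary least-squares fit in log-log space. It is not the paper's pipeline; the function names are hypothetical and the fitting method is a simple stand-in.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def zipf_exponent(frequencies):
    """Estimate s in frequency ~ rank**(-s) by least squares on log-log data.

    Needs at least two frequencies, all positive.
    """
    freqs = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope


counts = kmer_counts("ACGTACGT", 2)
print(counts.most_common())

# Sanity check on synthetic data that follows Zipf's law exactly (s = 1).
ideal = [1000 / rank for rank in range(1, 51)]
print(round(zipf_exponent(ideal), 6))  # 1.0
```

On a real genome one would pass `kmer_counts(genome, k).values()` to `zipf_exponent` and inspect how well a straight line fits, since a slope can be computed even when the data are far from Zipfian.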

@arXiv_qbioOT_bot@mastoxiv.page
2025-08-20 08:35:20

FAIR sharing of Chromatin Tracing datasets using the newly developed 4DN FISH Omics Format
Rahi Navelkar, Andrea Cosolo, Bogdan Bintu, Yubao Cheng, Vincent Gardeux, Silvia Gutnik, Taihei Fujimori, Antonina Hafner, Atishay Jay, Bojing Blair Jia, Adam Paul Jussila, Gerard Llimos, Antonios Lioutas, Nuno MC Martins, William J Moore, Yodai Takei, Frances Wong, Kaifu Yang, Huaiying Zhang, Quan Zhu, Magda Bienko, Lacramioara Bintu, Long Cai, Bart Deplancke, Marcelo Nollmann, Susan E Mango, Bi…

A key output of the NIH Common Fund 4D Nucleome (4DN) project is the open publication of datasets on the structure of the human cell nucleus and genome. In recent years, multiplexed Fluorescence In Situ Hybridization (FISH) and FISH-omics methods have rapidly expanded, enabling quantification of chromatin organization in single cells, sometimes alongside RNA and protein measurements. These approaches have deepened our understanding of how 3D chromosome architecture relates to transcriptional ac…

@arXiv_qbioGN_bot@mastoxiv.page
2025-08-06 08:28:40

A Novel cVAE-Augmented Deep Learning Framework for Pan-Cancer RNA-Seq Classification
Vinil Polepalli
https://arxiv.org/abs/2508.02743 https://arxiv.org/pdf…

Pan-cancer classification using transcriptomic (RNA-Seq) data can inform tumor subtyping and therapy selection, but is challenging due to extremely high dimensionality and limited sample sizes. In this study, we propose a novel deep learning framework that uses a class-conditional variational autoencoder (cVAE) to augment training data for pan-cancer gene expression classification. Using 801 tumor RNA-Seq samples spanning 5 cancer types from The Cancer Genome Atlas (TCGA), we first perform feat…

@arXiv_statAP_bot@mastoxiv.page
2025-07-29 08:23:21

Consistency and Central Limit Results for the Maximum Likelihood Estimator in the Admixture Model
Carola Sophia Heinzel
https://arxiv.org/abs/2507.19564 https://

In the Admixture Model, the probability of an individual having a certain number of alleles at a specific marker depends on the allele frequencies in $K$ ancestral populations and the fraction of the individual's genome originating from these ancestral populations. This study investigates consistency and central limit results of maximum likelihood estimators (MLEs) for the ancestry and the allele frequencies in the Admixture Model, complementing previous work by \cite{pfaff2004information, pf…

@arXiv_qbioGN_bot@mastoxiv.page
2025-06-24 09:01:50

Improving Genomic Models via Task-Specific Self-Pretraining
Sohan Mupparapu, Parameswari Krishnamurthy, Ratish Puduppully
https://arxiv.org/abs/2506.17766 …

Pretraining DNA language models (DNALMs) on the full human genome is resource-intensive, yet often considered necessary for strong downstream performance. Inspired by recent findings in NLP and long-context modeling, we explore an alternative: self-pretraining on task-specific, unlabeled data. Using the BEND benchmark, we show that DNALMs trained with self-pretraining match or exceed the performance of models trained from scratch under identical compute. While genome-scale pretraining may still…

@arXiv_qbioGN_bot@mastoxiv.page
2025-07-04 12:39:55

Replaced article(s) found for q-bio.GN. https://arxiv.org/list/q-bio.GN/new
[1/1]:
- MARS: Processing-In-Memory Acceleration of Raw Signal Genome Analysis Inside the Storage Subsystem
Soysal, Koliogeorgi, Firtina, Ghiasi, Nadig, Mao, Oliveira, Liang, Zambaku, Sadrosadati, Mutl…

@arXiv_qbioGN_bot@mastoxiv.page
2025-06-16 09:41:40

GlobDB: A comprehensive species-dereplicated microbial genome resource
Daan R. Speth (Centre for Microbiology and Environmental Systems Science, University of Vienna, Vienna, Austria), Nick Pullen (Centre for Microbiology and Environmental Systems Science, University of Vienna, Vienna, Austria), Samuel T. N. Aroney (Centre for Microbiome Research School of Biomedical Sciences, Queensland University of Technology, Translational Research Institute, Woolloongabba, Australia), Benjamin L. …

Over the past years, substantial numbers of microbial species' genomes have been deposited outside of conventional INSDC databases. The GlobDB aggregates 14 independent genomic catalogues to provide a comprehensive database of species-dereplicated microbial genomes, with consistent taxonomy, annotations, and additional analysis resources. The GlobDB is available at https://globdb.org/.

@arXiv_statAP_bot@mastoxiv.page
2025-08-18 09:01:40

Functional Analysis of Variance for Association Studies
Olga A. Vsevolozhskaya, Dmitri V. Zaykin, Mark C. Greenwood, Changshuai Wei, Qing Lu
https://arxiv.org/abs/2508.11069 htt…

While progress has been made in identifying common genetic variants associated with human diseases, for most common complex diseases, the identified genetic variants only account for a small proportion of heritability. Challenges remain in finding additional unknown genetic variants predisposing to complex diseases. With the advance in next-generation sequencing technologies, sequencing studies have become commonplace in genetic research. The ongoing exome-sequencing and whole-genome-sequenc…

@arXiv_qbioGN_bot@mastoxiv.page
2025-08-22 08:26:31

AGP: A Novel Arabidopsis thaliana Genomics-Phenomics Dataset and its HyperGraph Baseline Benchmarking
Manuel Serna-Aguilera, Fiona L. Goggin, Aranyak Goswami, Alexander Bucksch, Suxing Liu, Khoa Luu
https://arxiv.org/abs/2508.14934

Understanding which genes control which traits in an organism remains one of the central challenges in biology. Despite significant advances in data collection technology, our ability to map genes to traits is still limited. This genome-to-phenome (G2P) challenge spans several problem domains, including plant breeding, and requires models capable of reasoning over high-dimensional, heterogeneous, and biologically structured data. Currently, however, many datasets solely capture genetic informat…

@arXiv_qbioGN_bot@mastoxiv.page
2025-06-16 09:30:09

Multimodal Modeling of CRISPR-Cas12 Activity Using Foundation Models and Chromatin Accessibility Data
Azim Dehghani Amirabad, Yanfei Zhang, Artem Moskalev, Sowmya Rajesh, Tommaso Mansi, Shuwei Li, Mangal Prakash, Rui Liao
https://arxiv.org/abs/2506.11182

Predicting guide RNA (gRNA) activity is critical for effective CRISPR-Cas12 genome editing but remains challenging due to limited data, variation across protospacer adjacent motifs (PAMs, short sequence requirements for Cas binding), and reliance on large-scale training. We investigate whether a pre-trained biological foundation model originally trained on transcriptomic data can improve gRNA activity estimation even without domain-specific pre-training. Using embeddings from existing RNA foundati…
