Christophe Ambroise, Alia Dehman, Pierre Neuvial, Guillem Rigaill, Nathalie Vialaneix
Motivation: Genomic data analyses such as Genome-Wide Association Studies (GWAS) or Hi-C studies are often faced with the problem of partitioning chromosomes into successive regions based on a similarity matrix of high-resolution, locus-level measurements. An intuitive way of doing this is to perform a modified Hierarchical Agglomerative Clustering (HAC), where only adjacent clusters (according to the ordering of positions within a chromosome) are allowed to be merged. A major practical drawback of this method is its quadratic time and space complexity in the number of loci, which is typically of the order of 10^4 to 10^5 for each chromosome.
Results: By assuming that the similarity between physically distant objects is negligible, we propose an implementation of this adjacency-constrained HAC with quasi-linear complexity. Our illustrations on GWAS and Hi-C datasets demonstrate the relevance of this assumption, and show that this method highlights biologically meaningful signals. Thanks to its small time and memory footprint, the method can be run on a standard laptop in minutes or even seconds.
Availability and Implementation: Software and sample data are available as an R package, adjclust, that can be downloaded from the Comprehensive R Archive Network (CRAN).
Christophe Ambroise, Alia Dehman, Pierre Neuvial, Mark Hoebeke, Nathalie Vialaneix
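To make the adjacency constraint described in the abstract concrete, here is a minimal Python sketch of a naive adjacency-constrained HAC on a similarity matrix. It is not the adjclust implementation and does not reach the quasi-linear complexity reported above: it recomputes block sums at every step and ignores the band assumption. The function names (`block_sum`, `ward_cost`, `adjacency_constrained_hac`) are hypothetical, and the use of Ward's linkage computed from similarities is an assumption made for illustration.

```python
def block_sum(sim, a, b):
    """Sum of sim[i][j] for i in interval a and j in interval b (inclusive bounds)."""
    return sum(sim[i][j]
               for i in range(a[0], a[1] + 1)
               for j in range(b[0], b[1] + 1))


def ward_cost(sim, a, b):
    """Ward merge cost of two clusters, treating sim as a kernel (assumption)."""
    na = a[1] - a[0] + 1
    nb = b[1] - b[0] + 1
    # Squared distance between the two cluster centroids in the implicit feature space.
    d2 = (block_sum(sim, a, a) / na ** 2
          + block_sum(sim, b, b) / nb ** 2
          - 2 * block_sum(sim, a, b) / (na * nb))
    return na * nb / (na + nb) * d2


def adjacency_constrained_hac(sim):
    """Merge only neighbouring clusters; return the list of merges in order.

    Clusters are intervals (start, end) of consecutive positions.
    Each merge is recorded as (left_interval, right_interval, cost).
    """
    clusters = [(i, i) for i in range(len(sim))]
    merges = []
    while len(clusters) > 1:
        # Only adjacent pairs (k, k + 1) are candidates for merging.
        costs = [ward_cost(sim, clusters[k], clusters[k + 1])
                 for k in range(len(clusters) - 1)]
        k = min(range(len(costs)), key=costs.__getitem__)
        merges.append((clusters[k], clusters[k + 1], costs[k]))
        clusters[k:k + 2] = [(clusters[k][0], clusters[k + 1][1])]
    return merges
```

Because every cluster is an interval of consecutive loci, only `len(clusters) - 1` candidate merges exist at each step, instead of all pairs as in unconstrained HAC. Under the assumption that similarities vanish beyond some distance from the diagonal, most entries read by `block_sum` would be zero, which is the structure the quasi-linear implementation exploits.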