Project: Trustworthy multi-scale manifold learning for genomic and transcriptomic data
Basic data
Title:
Trustworthy multi-scale manifold learning for genomic and transcriptomic data
Duration:
01/01/2022 to 01/01/2025
Abstract / short description:
In recent years, large high-dimensional datasets have become
commonplace in biology. For example, single-cell transcriptomics
routinely produces datasets with sample sizes in the hundreds of
thousands of cells and dimensionality in the tens of thousands of genes.
Similarly, genomic datasets can encompass hundreds of thousands of
people’s genomes, profiled using millions of single-nucleotide
polymorphisms. One defining feature of such datasets is their
hierarchical organization, with biologically meaningful structure
present on several levels. Such datasets require adequate
computational methods for data analysis, including unsupervised data
exploration, to allow researchers to compactly represent and make
sense of their data. It is commonplace in single-cell transcriptomics to
generate low-dimensional embeddings of the data, using algorithms
such as t-SNE or UMAP, but existing methods fall short of
representing the hierarchical structure of the data. Whereas they
excel at preserving local structure, they are unable to recapitulate
larger-scale global structure often present in the data, making it
difficult to interpret the embedding correctly. In this project, our first
aim is to develop a dimensionality reduction method able to preserve
crucial properties of high-dimensional data, such as the local cluster
structure, continuous trajectories, and global hierarchical organization.
The second aim is to develop a suite of quality metrics that will allow
us to benchmark existing and novel algorithms on a range of
challenging datasets. Finally, the third aim is to adapt this machinery
to ultra-high-dimensional data from population genomics. On the
technical level, we will rely on k-nearest-neighbour graphs and
graph coarse-graining. Our work will be useful in practical
applications in biology and bioinformatics, while at the same time
being of high interest to the manifold learning community within
machine learning.
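The abstract names k-nearest-neighbour graphs as the technical basis and a suite of quality metrics as the second aim, but it does not specify either. As a minimal sketch of what one local-structure metric could look like, the Python snippet below computes kNN recall: the average fraction of each point's k nearest neighbours in the original space that remain among its k nearest neighbours in the embedding. The function names, the brute-force Euclidean search, and the toy data are assumptions made for illustration only; they are not the project's method.

```python
import math
import random


def knn_indices(points, k):
    """Return, for each point, the set of indices of its k nearest neighbours (brute force)."""
    neighbours = []
    for i, p in enumerate(points):
        # Sort all other points by Euclidean distance to p.
        order = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        neighbours.append(set(order[:k]))
    return neighbours


def knn_recall(high_dim, low_dim, k):
    """Average fraction of each point's k high-dimensional neighbours kept in the embedding."""
    high = knn_indices(high_dim, k)
    low = knn_indices(low_dim, k)
    return sum(len(h & l) for h, l in zip(high, low)) / (k * len(high_dim))


# Toy data (made up for illustration): 60 points in 5 dimensions whose
# variance is concentrated in the first two coordinates.
rng = random.Random(0)
data = [
    [rng.gauss(0, 1), rng.gauss(0, 1)] + [rng.gauss(0, 0.01) for _ in range(3)]
    for _ in range(60)
]

# Two hypothetical "embeddings": keep the first two coordinates,
# or assign the same 2-D positions to the wrong points.
projection = [row[:2] for row in data]
shuffled = projection[:]
rng.shuffle(shuffled)

recall_identity = knn_recall(data, data, k=10)
recall_projection = knn_recall(data, projection, k=10)
recall_shuffled = knn_recall(data, shuffled, k=10)
```

A metric of this kind only measures local neighbourhood preservation. It says nothing about the larger-scale hierarchical structure that the abstract identifies as the weak point of existing methods, which is why a suite of complementary metrics is needed.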
Involved staff
Managers
Hertie Institute for Artificial Intelligence in Brain Health (HIAI)
Non-clinical institutes, Faculty of Medicine
Contact persons
Hertie Institute for Artificial Intelligence in Brain Health (HIAI)
Non-clinical institutes, Faculty of Medicine
Institute for Bioinformatics and Medical Informatics (IBMI)
Interfaculty Institutes
Cluster of Excellence: Machine Learning: New Perspectives for Science (CML)
Centers or interfaculty scientific institutions
Tübingen AI Center
Department of Informatics, Faculty of Science
Local organizational units
University Eye Hospital
Center for Ophthalmology
Hospitals and clinical institutes, Faculty of Medicine
Research Center for Ophthalmology
Center for Ophthalmology
Hospitals and clinical institutes, Faculty of Medicine
Werner Reichardt Center for Integrative Neuroscience (CIN)
Centers or interfaculty scientific institutions
University of Tübingen
Funders
Bonn, Nordrhein-Westfalen, Germany