Bioinformatics & The Curse of Dimensionality | EuanMcDonnell

Speaker: Dr Euan McDonnell, Bioinformatics Data Scientist with the Computational Biology Facility (CBF) based at the University of Liverpool

Bioinformatics as a field has expanded rapidly over the past 25 years. Much of this has been driven by the increasing scale and frequency of large datasets, predominantly from global biological profiling approaches, the so-called “omics” technologies. These encompass a wide range of applications that quantify the abundance, activity, or presence of various biological entities in a top-down and unbiased manner, resulting in datasets with hundreds to millions of features. Much of bioinformatics is concerned with ranking and selecting such features with regard to their relationships to external factors, or their co-relationships within or between data types. However, acquiring and processing biological samples, and generating data from them, is complex, time-consuming, and expensive, so the number of data points (n) is frequently far smaller than the number of features (p). This is termed the “large p, small n” or “p ≫ n” problem, and it is ubiquitous in bioinformatics and health data science. Such high dimensionality in tandem with few degrees of freedom poses a major analytical and computational challenge, because the size of the sampling domain grows explosively as features are added: the so-called “curse of dimensionality”. This results in overfitting and high variance in statistical and machine learning models, and compounds the issues posed by the inherently high variability between biological samples. Bioinformatics has thus seen the application of a suite of methodologies that aim to tackle this issue, commonly including empirical Bayes pooling of information, dimensionality reduction, and regularisation/sparsification procedures. While these approaches have allowed the field to mostly keep up with the increasing scale of the datasets being generated, further developments will be required in order.
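
As a rough illustration of the overfitting described above (not taken from the talk), the sketch below simulates a p ≫ n setting in which the response is pure noise, unrelated to any feature. The function names, the sample and feature counts, and the random seed are all hypothetical choices made for this example. With more features than samples, an unregularised least-squares fit can typically reproduce the training data exactly, yet it has learned nothing that transfers to new samples.

```python
import numpy as np


def fit_min_norm_least_squares(X, y):
    """Return the minimum-norm least-squares coefficients for y ~ X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def mean_squared_error(X, y, coef):
    """Mean squared residual of the linear prediction X @ coef."""
    residuals = y - X @ coef
    return float(np.mean(residuals ** 2))


def simulate(n_samples=20, n_features=200, seed=0):
    """Fit noise with p >> n and return (train_mse, test_mse).

    Illustrative assumption: features and response are independent
    standard normal draws, so there is no real signal to find.
    """
    rng = np.random.default_rng(seed)
    X_train = rng.standard_normal((n_samples, n_features))
    X_test = rng.standard_normal((n_samples, n_features))
    y_train = rng.standard_normal(n_samples)
    y_test = rng.standard_normal(n_samples)

    coef = fit_min_norm_least_squares(X_train, y_train)
    train_mse = mean_squared_error(X_train, y_train, coef)
    test_mse = mean_squared_error(X_test, y_test, coef)
    return train_mse, test_mse


train_mse, test_mse = simulate()
print(f"train MSE: {train_mse:.3g}")
print(f"test MSE:  {test_mse:.3g}")
```

The training error should be zero up to floating-point precision, while the error on fresh samples stays on the order of the noise variance. Regularisation and dimensionality reduction, mentioned in the abstract, are ways of constraining the fit so that it cannot simply interpolate the noise.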

This talk is part of the Liverpool Virtual Seminar Series on Data Intensive Science; more information can be found at https://indico.ph.liv.ac.uk/e/data_sc...

#data #bigdata #datascience #bioinformatics #informatics
