PCA: A Popular Tool with Pitfalls in Bioinformatics
Chapter 1: Understanding PCA
Principal Component Analysis (PCA) is a widely used dimensionality-reduction technique in machine learning that simplifies complex datasets. However, a recent study suggests that its results can be unreliable, potentially contributing to the ongoing reproducibility crisis in scientific research.
The Reproducibility Crisis and PCA
The reproducibility crisis has gained significant attention in the scientific community, with various factors leading to challenges in replicating studies. One major issue stems from the improper application of machine learning techniques like PCA.
Datasets, particularly in cancer research, are often poorly curated and hard to access. A lack of documentation further hinders data sharing, and many researchers withhold data so they can use it in future studies, which complicates reproducibility efforts even more.
Moreover, the integration of data science into biology, especially genetics, has highlighted the gap in formal training for many researchers. Once a project is published, the code may become outdated or poorly documented, rendering it hard to reproduce. Often, the original authors are unavailable for clarification, as they may have moved on from the lab.
Systematic errors such as data leakage are common when algorithms are applied. Biological datasets are inherently complex, suffering from issues like the curse of dimensionality, missing data, and confounding variables. Genetic datasets face additional challenges such as small sample sizes and flawed study designs.
PCA's Popularity in Genetic Studies
PCA is frequently employed in genetic analyses because it facilitates clustering and makes high-dimensional data easy to visualize. Since its introduction to genetics in 1963, it has become a staple of the field's literature.
The primary appeal of PCA for population geneticists is its ability to represent genetic and geographic distances among clusters.
However, while PCA can handle both large and small numerical datasets and consistently provides results, it lacks significance measures and error estimation, making it challenging to assess result quality. The only metric often cited is the proportion of explained variance, and there is no consensus on the number of principal components to analyze.
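To make the "proportion of explained variance" concrete, here is a minimal sketch of how it can be computed with NumPy. The toy data and the helper name `explained_variance_ratio` are assumptions made for illustration; they do not come from the study.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of the total variance captured by each principal component."""
    Xc = X - X.mean(axis=0)                       # center every column
    s = np.linalg.svd(Xc, compute_uv=False)       # singular values, sorted descending
    var = s ** 2                                  # proportional to per-component variance
    return var / var.sum()

# Hypothetical toy data: 50 samples, 3 features with very different spreads
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3)) * np.array([5.0, 1.0, 0.2])

ratios = explained_variance_ratio(X)
print(ratios)   # one value per component; the values sum to 1
```

Note that this ratio only says how much variance each component retains. It carries no significance test or error bar, which is the gap described above.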
Given these shortcomings, researchers from Lund University have questioned whether PCA could be contributing to the reproducibility crisis.
Evaluating PCA's Reliability
The study set out to evaluate PCA's reliability, robustness, and reproducibility. PCA is normally used to describe a structure whose truth is unknown, so testing its accuracy requires a model in which the truth is known in advance.
The authors developed a simplified model in which individuals express only three genes, and each individual is assigned a color according to the resulting three-component vector. This setup simulates SNPs, and because every individual's true color is known, it can be used to assess PCA's accuracy.
Applying PCA to this model reduces the dataset from three dimensions to the two that account for the majority of the variance. Each individual can then be shown in its true color on a PCA scatterplot, and the distances between points in the two-dimensional plot can be compared with their actual distances in 3D.
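The idea of comparing 2D PCA distances with true 3D distances can be sketched in a few lines of NumPy. This is an illustration only, not the authors' code: the four colors, their coordinates, and the helper names `pca_project` and `pairwise_distances` are assumptions chosen to keep the example small.

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X onto the first k principal components."""
    Xc = X - X.mean(axis=0)                          # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                             # coordinates in PC space

def pairwise_distances(P):
    """Euclidean distance between every pair of rows of P."""
    diff = P[:, None, :] - P[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Hypothetical "populations": one point per color, as a 3-component vector
colors = {"red": [1, 0, 0], "green": [0, 1, 0], "blue": [0, 0, 1], "black": [0, 0, 0]}
names = list(colors)
X = np.array([colors[n] for n in names], dtype=float)

d3 = pairwise_distances(X)                  # true distances in 3D
d2 = pairwise_distances(pca_project(X, 2))  # distances in the 2D PCA plot

for i in range(len(names)):
    for j in range(i + 1, len(names)):
        print(f"{names[i]:>5}-{names[j]:<5}  3D={d3[i, j]:.3f}  2D={d2[i, j]:.3f}")
```

In this toy setup the distances among red, green, and blue survive the projection, while black lands closer to each of them in 2D than it is in 3D. A two-dimensional plot can therefore misstate how far apart groups really are, even when the input has only three dimensions.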
The researchers also utilized three real human genotype datasets and executed twelve common tasks in population genetics, adjusting the population proportions and re-running PCA for visualization.
The findings indicated that the distances between clusters changed with the number of individuals sampled from each population. Moreover, excluding certain populations could significantly alter the results.
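The dependence on sample sizes can be reproduced with a small, deterministic sketch. It is not the authors' experiment: the four color "populations", the sample counts, and the function name `centroid_distances_2d` are hypothetical, and every individual in a population is an exact copy of its color vector, which is a deliberate simplification.

```python
import numpy as np

# Hypothetical populations, each defined by one 3-component color vector
NAMES = ["red", "green", "blue", "black"]
BASE = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 0]], dtype=float)

def centroid_distances_2d(counts):
    """Run PCA on a dataset with the given per-population sample counts and
    return the distances between population centroids in PC1-PC2 space."""
    X = np.repeat(BASE, counts, axis=0)                 # counts[i] copies of population i
    labels = np.repeat(np.arange(len(BASE)), counts)
    Xc = X - X.mean(axis=0)                             # centering depends on the counts
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Xc @ Vt[:2].T                                   # keep the first two components
    cents = np.array([P[labels == g].mean(axis=0) for g in range(len(BASE))])
    diff = cents[:, None, :] - cents[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

d_equal = centroid_distances_2d([50, 50, 50, 50])    # balanced sampling
d_skewed = centroid_distances_2d([200, 10, 10, 10])  # red heavily oversampled

red, green, black = NAMES.index("red"), NAMES.index("green"), NAMES.index("black")
print("red-black  :", d_equal[red, black], "->", d_skewed[red, black])
print("green-black:", d_equal[green, black], "->", d_skewed[green, black])
```

The underlying populations are identical in both runs. Only the sampling proportions differ, yet the 2D distances between the same pairs of populations shift, because the principal axes are fitted to whichever groups dominate the sample.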
For instance, in an attempt to replicate a 2009 study suggesting that Indians are genetically distinct from Europeans, Asians, and Africans, the authors found that altering data proportions led to contradictory PCA results, raising questions about its reliability.
The authors argue that PCA can generate conflicting and biologically incorrect scenarios. It is misleading to present a limited number of PCA plots without acknowledging the existence of alternative solutions or the proportion of explained variance.
In essence, modifications in data can lead to drastically different conclusions from PCA.
Parting Thoughts
The ongoing reproducibility crisis necessitates a thorough evaluation of scientific tools and methodologies. Given PCA's central role in population genetics, and its failure to produce consistently accurate results, the study underscores the need for caution in relying solely on PCA.
As genetics research is pivotal for clinical and biomedical advancements, understanding the limitations of PCA is crucial. This study demonstrates that one of the most commonly used methods in the field is not as robust or reproducible as previously thought, which can lead to flawed conclusions.
Researchers should be careful when drawing conclusions from PCA alone, as unreliable clusters could result in erroneous or even absurd outcomes. The authors warn that by varying factors such as the choice of populations, sample sizes, and markers, researchers may inadvertently create conflicting interpretations.
Thus, while PCA serves as an excellent tool for initial data exploration, making definitive conclusions based on its output is not advisable. The authors liken PCA scatterplots to Rorschach tests, where interpretations can vary widely based on individual perspectives.