Concept: Bayesian inference
- Proceedings of the National Academy of Sciences of the United States of America
- Published about 7 years ago
Recent advances in Bayesian hypothesis testing have led to the development of uniformly most powerful Bayesian tests, which represent an objective, default class of Bayesian hypothesis tests that have the same rejection regions as classical significance tests. Based on the correspondence between these two classes of tests, it is possible to equate the size of classical hypothesis tests with evidence thresholds in Bayesian tests, and to equate P values with Bayes factors. An examination of these connections suggests that recent concerns over the lack of reproducibility of scientific studies can be attributed largely to the conduct of significance tests at unjustifiably high levels of significance. To correct this problem, evidence thresholds required for the declaration of a significant finding should be increased to 25-50:1, and to 100-200:1 for the declaration of a highly significant finding. In terms of classical hypothesis tests, these evidence standards mandate the conduct of tests at the 0.005 or 0.001 level of significance.
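The link between test size and evidence threshold can be made concrete for one simple case. The sketch below assumes a one-sided z-test with known variance, for which the uniformly most powerful Bayesian test with evidence threshold gamma is taken to reject when z > sqrt(2 ln gamma), so that gamma = exp(z_alpha^2 / 2). That relation is not stated in the abstract above and is brought in here as an assumption; the function names are illustrative only.

```python
import math
from statistics import NormalDist

# Standard normal distribution, used for quantiles and tail probabilities.
_STD_NORMAL = NormalDist()


def evidence_threshold(alpha: float) -> float:
    """Evidence threshold gamma matched to a one-sided z-test of size alpha.

    Assumes the relation gamma = exp(z_alpha^2 / 2) (see lead-in).
    """
    z_alpha = _STD_NORMAL.inv_cdf(1.0 - alpha)
    return math.exp(z_alpha ** 2 / 2.0)


def size_for_threshold(gamma: float) -> float:
    """Size of the one-sided z-test matched to evidence threshold gamma."""
    z = math.sqrt(2.0 * math.log(gamma))
    return 1.0 - _STD_NORMAL.cdf(z)


if __name__ == "__main__":
    for alpha in (0.05, 0.005, 0.001):
        print(f"alpha = {alpha}: gamma = {evidence_threshold(alpha):.1f}")
```

Under this assumption, a test of size 0.05 corresponds to an evidence threshold of only about 4:1, whereas sizes of 0.005 and 0.001 fall inside the 25-50:1 and 100-200:1 ranges quoted above.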
We present a statistical framework for estimation and application of sample allele frequency spectra from New-Generation Sequencing (NGS) data. In this method, we first estimate the allele frequency spectrum using maximum likelihood. In contrast to previous methods, the likelihood function is calculated using a dynamic programming algorithm and numerically optimized using analytical derivatives. We then use a Bayesian method for estimating the sample allele frequency in a single site, and show how the method can be used for genotype calling and SNP calling. We also show how the method can be extended to various other cases including cases with deviations from Hardy-Weinberg equilibrium. We evaluate the statistical properties of the methods using simulations and by application to a real data set.
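The genotype-calling step can be illustrated with a small sketch. It assumes that genotype likelihoods for the three diploid genotypes (0, 1, or 2 copies of the alternate allele) are already available and that the prior follows Hardy-Weinberg proportions at a given allele frequency. This is a simplification for illustration, not the paper's implementation; the dynamic programming estimate of the allele frequency spectrum is not reproduced, and all names are hypothetical.

```python
from typing import List, Sequence, Tuple


def hwe_prior(freq: float) -> List[float]:
    """Prior genotype probabilities under Hardy-Weinberg equilibrium."""
    return [(1.0 - freq) ** 2, 2.0 * freq * (1.0 - freq), freq ** 2]


def genotype_posterior(likelihoods: Sequence[float], freq: float) -> List[float]:
    """Posterior over genotypes 0, 1, 2: proportional to likelihood * prior."""
    unnormalised = [l * p for l, p in zip(likelihoods, hwe_prior(freq))]
    total = sum(unnormalised)
    if total == 0.0:
        raise ValueError("all genotypes have zero posterior mass")
    return [u / total for u in unnormalised]


def call_genotype(likelihoods: Sequence[float], freq: float) -> Tuple[int, float]:
    """Return the most probable genotype and its posterior probability."""
    posterior = genotype_posterior(likelihoods, freq)
    best = max(range(3), key=lambda g: posterior[g])
    return best, posterior[best]


if __name__ == "__main__":
    # Made-up likelihoods: the same data give different calls at
    # different allele frequencies because the prior changes.
    data = [0.2, 0.5, 0.3]
    print(call_genotype(data, freq=0.1))
    print(call_genotype(data, freq=0.5))
```

With the same likelihoods, a rare allele pulls the call toward the homozygous reference genotype, while a common allele favours the heterozygote, which is why the allele frequency estimate matters for calling.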
Multiple-locus variable-number tandem repeat analysis (MLVA) is useful to establish transmission routes and sources of infections for various microorganisms including Mycobacterium tuberculosis complex (MTC). The recently released SITVITWEB database contains 12-loci Mycobacterial Interspersed Repetitive Units–Variable Number of Tandem DNA Repeats (MIRU-VNTR) profiles and spoligotype patterns for thousands of MTC strains; it uses MIRU International Types (MIT) and Spoligotype International Types (SIT) to designate clustered patterns worldwide. Considering existing doubts on the ability of spoligotyping alone to reveal exact phylogenetic relationships between MTC strains, we developed an MLVA-based classification for MTC genotypic lineages. We studied 6 different subsets of MTC isolates encompassing 7793 strains worldwide. Minimum spanning trees (MST) were constructed to identify major lineages, and the most common representative located as a central node was taken as the prototype defining different phylogenetic groups. A total of 7 major lineages with their respective prototypes were identified: Indo-Oceanic/MIT57, East Asian and African Indian/MIT17, Euro American/MIT116, West African-I/MIT934, West African-II/MIT664, M. bovis/MIT49, M. canettii/MIT60. Further MST subdivision identified an additional 34 sublineage MIT prototypes. The phylogenetic relationships among the 37 newly defined MIRU-VNTR lineages were inferred using a classification algorithm based on a Bayesian approach. This information was used to construct an updated phylogenetic and phylogeographic snapshot of worldwide MTC diversity studied at the regional, sub-regional, and country levels according to the United Nations specifications.
We also looked for IS6110 insertional events that are known to modify the results of the spoligotyping in specific circumstances, and showed that a fair portion of convergence leading to the currently observed bias in phylogenetic classification of strains may be traced back to the presence of IS6110. These results shed new light on the evolutionary history of the pathogen in relation to the history of peopling and human migration.
In a number of applications there is a need to determine the most likely pedigree for a group of persons based on genetic markers. Adequate models are needed to reach this goal. The markers used to perform the statistical calculations can be linked and there may also be linkage disequilibrium (LD) in the population. The purpose of this paper is to present a graphical Bayesian Network framework to deal with such data. Potential LD is normally ignored and it is important to verify that the resulting calculations are not biased. Even if linkage does not influence results for regular paternity cases, it may have substantial impact on likelihood ratios involving other, more extended pedigrees. Models for LD influence likelihoods for all pedigrees to some degree and an initial estimate of the impact of ignoring LD and/or linkage is desirable, going beyond mere rules of thumb based on marker distance. Furthermore, we show how one can readily include a mutation model in the Bayesian Network; extending other programs or formulas to include such models may require considerable amounts of work and will in many cases not be practical. As an example, we consider the two STR markers vWa and D12S391. We estimate probabilities for population haplotypes to account for LD using a method based on data from trios, while an estimate for the degree of linkage is taken from the literature. The results show that accounting for haplotype frequencies is unnecessary in most cases for this specific pair of markers. When doing calculations on regular paternity cases, the markers can be considered statistically independent. In more complex cases of disputed relatedness, for instance cases involving siblings or so-called deficient cases, or when small differences in the LR matter, independence should not be assumed. (The networks are freely available at http://arken.umb.no/~dakl/BayesianNetworks.)
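For the regular paternity case, where the markers are treated as statistically independent, the likelihood ratio can be sketched by direct enumeration. The example below is a deliberately reduced illustration and not the Bayesian Network from the paper: it assumes no mutation, no linkage, and no LD, and the allele labels and frequencies are made up.

```python
from itertools import product
from typing import Dict, Tuple

Genotype = Tuple[str, str]


def transmission(genotype: Genotype) -> Dict[str, float]:
    """Probability that a parent with this genotype passes on each allele."""
    probs: Dict[str, float] = {}
    for allele in genotype:
        probs[allele] = probs.get(allele, 0.0) + 0.5
    return probs


def child_probability(child: Genotype,
                      maternal: Dict[str, float],
                      paternal: Dict[str, float]) -> float:
    """P(child genotype) given maternal and paternal allele distributions."""
    target = tuple(sorted(child))
    total = 0.0
    for (a_m, p_m), (a_p, p_p) in product(maternal.items(), paternal.items()):
        if tuple(sorted((a_m, a_p))) == target:
            total += p_m * p_p
    return total


def paternity_lr(child: Genotype, mother: Genotype, alleged_father: Genotype,
                 allele_freqs: Dict[str, float]) -> float:
    """LR: alleged father is the father vs. an unrelated random man."""
    maternal = transmission(mother)
    numerator = child_probability(child, maternal, transmission(alleged_father))
    # Under the alternative, the paternal allele is a draw from the population.
    denominator = child_probability(child, maternal, allele_freqs)
    if denominator == 0.0:
        raise ValueError("child genotype is incompatible with the mother")
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical alleles and frequencies for two independent markers.
    lr1 = paternity_lr(("A", "C"), ("A", "B"), ("C", "D"),
                       {"A": 0.3, "B": 0.3, "C": 0.1, "D": 0.3})
    lr2 = paternity_lr(("A", "A"), ("A", "A"), ("A", "B"),
                       {"A": 0.2, "B": 0.8})
    # Independence assumption: per-marker LRs multiply.
    print(lr1, lr2, lr1 * lr2)
```

Multiplying per-marker ratios is only valid under the independence assumption; for linked markers or markers in LD, as in the sibling and deficient cases mentioned above, the joint haplotype probabilities would have to be modelled instead.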
‘Omics analysis (transcriptomics, proteomics) quantifies changes in gene/protein expression, providing a snapshot of changes in biochemical pathways over time. Although tools such as modelling that are needed to investigate the relationships between genes/proteins already exist, they are rarely utilised. We consider the potential for using Structural Equation Modelling to investigate protein-protein interactions in a proposed Rubisco protein degradation pathway using previously published data from 2D electrophoresis and mass spectrometry proteome analysis. These informed the development of a prior model that hypothesised a pathway of Rubisco Large Subunit and Small Subunit degradation, producing both primary and secondary degradation products. While some of the putative pathways were confirmed by the modelling approach, the model also demonstrated features that had not been originally hypothesised. We used Bayesian analysis based on Markov Chain Monte Carlo simulation to generate output statistics suggesting that the model had replicated the variation in the observed data due to protein-protein interactions. This study represents an early step in the development of approaches that seek to enable the full utilisation of information regarding the dynamics of biochemical pathways contained within proteomics data. As these approaches gain attention, they will guide the design and conduct of experiments that enable ‘Omics modelling to become a commonplace practice within molecular biology.
Can behavior be unconsciously primed via the activation of attitudes, stereotypes, or other concepts? A number of studies have suggested that such priming effects can occur, and a prominent illustration is the claim that individuals' accuracy in answering general knowledge questions can be influenced by activating intelligence-related concepts such as professor or soccer hooligan. In 9 experiments with 475 participants we employed the procedures used in these studies, as well as a number of variants of those procedures, in an attempt to obtain this intelligence priming effect. None of the experiments obtained the effect, although financial incentives did boost performance. A Bayesian analysis reveals considerable evidential support for the null hypothesis. The results conform to the pattern typically obtained in word priming experiments in which priming is very narrow in its generalization and unconscious (subliminal) influences, if they occur at all, are extremely short-lived. We encourage others to explore the circumstances in which this phenomenon might be obtained.
We revisit the results of the recent Reproducibility Project: Psychology by the Open Science Collaboration. We compute Bayes factors (a quantity that can be used to express comparative evidence for a hypothesis but also for the null hypothesis) for a large subset (N = 72) of the original papers and their corresponding replication attempts. In our computation, we take into account the likely scenario that publication bias had distorted the originally published results. Overall, 75% of studies gave qualitatively similar results in terms of the amount of evidence provided. However, the evidence was often weak (i.e., Bayes factor < 10). The majority of the studies (64%) did not provide strong evidence for either the null or the alternative hypothesis in either the original or the replication, and no replication attempts provided strong evidence in favor of the null. In all cases where the original paper provided strong evidence but the replication did not (15%), the sample size in the replication was smaller than the original. Where the replication provided strong evidence but the original did not (10%), the replication sample size was larger. We conclude that the apparent failure of the Reproducibility Project to replicate many target effects can be adequately explained by overestimation of effect sizes (or overestimation of evidence against the null hypothesis) due to small sample sizes and publication bias in the psychological literature. We further conclude that traditional sample sizes are insufficient and that a more widespread adoption of Bayesian methods is desirable.
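How a Bayes factor can express evidence for the null as well as against it is easiest to see in a toy model. The sketch below assumes binomial data, a point null H0: theta = 0.5, and a uniform prior on theta under H1; it is not the publication-bias-corrected computation used in the paper. The threshold of 10 follows the abstract's cut-off for strong evidence, and the function names are illustrative.

```python
import math


def bayes_factor_01(successes: int, trials: int) -> float:
    """BF01 for H0: theta = 0.5 against H1: theta ~ Uniform(0, 1).

    Under the uniform prior, the marginal likelihood of any count
    is 1 / (trials + 1).
    """
    marginal_h0 = math.comb(trials, successes) * 0.5 ** trials
    marginal_h1 = 1.0 / (trials + 1)
    return marginal_h0 / marginal_h1


def evidence_label(bf01: float, threshold: float = 10.0) -> str:
    """Classify the evidence using a symmetric strength threshold."""
    if bf01 > threshold:
        return "strong evidence for H0"
    if bf01 < 1.0 / threshold:
        return "strong evidence for H1"
    return "no strong evidence either way"


if __name__ == "__main__":
    for k, n in ((10, 10), (50, 100), (100, 200)):
        bf = bayes_factor_01(k, n)
        print(f"{k}/{n}: BF01 = {bf:.3g} ({evidence_label(bf)})")
```

In this toy setting, a perfectly balanced outcome of 50 out of 100 gives a Bayes factor of about 8 in favour of the null, still short of the threshold, while 100 out of 200 crosses it. This mirrors the abstract's point that small samples often cannot provide strong evidence in either direction.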
This article explains the foundational concepts of Bayesian data analysis using virtually no mathematical notation. Bayesian ideas already match your intuitions from everyday reasoning and from traditional data analysis. Simple examples of Bayesian data analysis are presented that illustrate how the information delivered by a Bayesian analysis can be directly interpreted. Bayesian approaches to null-value assessment are discussed. The article clarifies misconceptions about Bayesian methods that newcomers might have acquired elsewhere. We discuss prior distributions and explain how they are not a liability but an important asset. We discuss the relation of Bayesian data analysis to Bayesian models of mind, and we briefly discuss what methodological problems Bayesian data analysis is not meant to solve. After you have read this article, you should have a clear sense of how Bayesian data analysis works and the sort of information it delivers, and why that information is so intuitive and useful for drawing conclusions from data.
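The role of the prior can be shown with a very small worked example. The sketch below assumes a success/failure outcome with a Beta prior on the success probability, a standard conjugate model chosen here for illustration and not taken from the article; the data and prior settings are made up.

```python
from typing import Tuple


def update_beta(prior_a: float, prior_b: float,
                successes: int, failures: int) -> Tuple[float, float]:
    """Posterior Beta parameters after observing binomial data."""
    return prior_a + successes, prior_b + failures


def beta_mean(a: float, b: float) -> float:
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


if __name__ == "__main__":
    successes, failures = 7, 3  # hypothetical data

    # Vague prior: Beta(1, 1) is uniform on [0, 1].
    vague = update_beta(1.0, 1.0, successes, failures)
    # Informative prior: equivalent to 20 earlier observations centred on 0.5.
    informed = update_beta(10.0, 10.0, successes, failures)

    print("vague prior    ->", vague, round(beta_mean(*vague), 3))
    print("informed prior ->", informed, round(beta_mean(*informed), 3))
```

With the vague prior the posterior mean sits close to the observed proportion, while the informative prior pulls it toward 0.5. The posterior is directly readable as a statement about the parameter, and prior knowledge enters explicitly instead of being hidden.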
RELION, for REgularized LIkelihood OptimizatioN, is an open-source computer program for the refinement of macromolecular structures by single-particle analysis of electron cryo-microscopy (cryo-EM) data. Whereas alternative approaches often rely on user expertise for the tuning of parameters, RELION uses a Bayesian approach to infer parameters of a statistical model from the data. This paper describes developments that reduce the computational costs of the underlying maximum a posteriori (MAP) algorithm, as well as statistical considerations that yield new insights into the accuracy with which the relative orientations of individual particles may be determined. A so-called gold-standard Fourier shell correlation (FSC) procedure to prevent overfitting is also described. The resulting implementation yields high-quality reconstructions and reliable resolution estimates with minimal user intervention and at acceptable computational costs.
The genus Wolffia of the duckweed family (Lemnaceae) contains the smallest flowering plants. Presently, 11 species are recognized and categorized mainly on the basis of morphology. Because of extreme reduction of structure of all species, molecular methods are especially required for barcoding and identification of species and clones of this genus. We applied AFLP combined with Bayesian analysis of population structure to 66 clones covering all 11 species. Nine clusters were identified: (1) W. angusta and W. microscopica (only one clone), (2) W. arrhiza, (3) W. cylindracea (except one clone that might be a transition form), (4) W. australiana, (5) W. globosa, (6) W. globosa, W. neglecta, and W. borealis, (7) W. brasiliensis and W. columbiana, (8) W. columbiana, (9) W. elongata. Furthermore, we investigated the sequences of plastidic regions rps16 (54 clones) and rpl16 (55 clones), and identified the following species: W. angusta, W. australiana, W. brasiliensis, W. cylindracea, W. elongata, W. microscopica, and W. neglecta. Wolffia globosa has been separated into two groups by both methods. One group, which consists only of clones from North America and East Asia, was labelled here “typical W. globosa”. The other group of W. globosa, termed operationally “W. neglecta”, also contains clones of W. neglecta and shows high similarity to W. borealis. None of the methods recognized W. borealis as a distinct species. Although each clone could be characterized individually by AFLP and plastidic sequences, and most species could be bar-coded, the presently available data are not sufficient to identify all taxa of Wolffia.