
Concept: K-means clustering


Complex diseases are typically caused by combinations of molecular disturbances that vary widely among different patients. Endophenotypes, combinations of genetic factors associated with a disease, offer a simplified approach to dissecting complex traits by reducing genetic heterogeneity. Because molecular dissimilarities often exist between patients with indistinguishable disease symptoms, these unique molecular features may reflect pathogenic heterogeneity. To detect molecular dissimilarities among patients and reduce the complexity of high-dimensional data, we have explored an endophenotype-identification analytical procedure that combines non-negative matrix factorization (NMF) and the adjusted Rand index (ARI), a measure of the similarity of two clusterings of a data set. To evaluate this procedure, we compared it with a commonly used method, principal component analysis with k-means clustering (PCA-K). A simulation study with a gene expression dataset and genotype information was conducted to examine the performance of our procedure and PCA-K. The results showed that NMF mostly outperformed PCA-K. Additionally, we applied our endophenotype-identification analytical procedure to a publicly available dataset containing data derived from patients with late-onset Alzheimer’s disease (LOAD). NMF distilled information associated with 1,116 transcripts into three metagenes and three molecular subtypes (MS) for patients in the LOAD dataset: MS1 (n1=80), MS2 (n2=73), and MS3 (n3=23). ARI was then used to determine the most representative transcripts for each metagene; 123, 89, and 71 metagene-specific transcripts were identified for MS1, MS2, and MS3, respectively. These metagene-specific transcripts were identified as the endophenotypes. Our results showed that 14, 38, 0, and 28 candidate susceptibility genes listed in the AlzGene database were found for all patients, MS1, MS2, and MS3, respectively. Moreover, we found that MS2 might be a normal-like subtype. Our proposed procedure provides an alternative approach to investigate the pathogenic mechanism of disease and better understand the relationship between phenotype and genotype.
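
The abstract above describes the adjusted Rand index only as a measure of the similarity of two clusterings. As a rough illustration of what that measure computes, the sketch below implements the standard pair-counting (Hubert-Arabie) form in plain Python. The function name and the toy label lists are illustrative assumptions, not taken from the paper, and the sketch says nothing about how the authors used ARI to select transcripts.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two clusterings of the same items.

    Hypothetical helper for illustration: 1.0 means the two partitions
    agree exactly (up to relabeling); values near 0 mean agreement is
    about what chance alone would give.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: nothing to disagree about

    # Count pairs of items placed together in both, in A, and in B.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / total_pairs  # agreement expected by chance
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (together_both - expected) / (maximum - expected)


if __name__ == "__main__":
    # Same grouping with swapped label names, then a crossed grouping.
    print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
    print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

Because the index works on pairs of items rather than on label values, renaming the clusters does not change the score, which is what makes it usable for comparing the output of two different clustering methods.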

Concepts: DNA, Gene, Genetics, Principal component analysis, Machine learning, Object-oriented programming, K-means clustering, Rand index


BACKGROUND: Students enter medical study with internally generated motives like genuine interest (intrinsic motivation) and/or externally generated motives like parental pressure or desire for status or prestige (controlled motivation). According to Self-determination theory (SDT), students could differ in their study effort, academic performance and adjustment to the study depending on the endorsement of intrinsic motivation versus controlled motivation. The objectives of this study were to generate motivational profiles of medical students using combinations of high or low intrinsic and controlled motivation and to test whether different motivational profiles are associated with different study outcomes. METHODS: Participating students (N = 844) from University Medical Center Utrecht, the Netherlands, were classified into different subgroups through K-means cluster analysis using intrinsic and controlled motivation scores. Cluster membership was used as an independent variable to assess differences in study strategies, self-study hours, academic performance and exhaustion from study. RESULTS: Four clusters were obtained: High Intrinsic High Controlled (HIHC), Low Intrinsic High Controlled (LIHC), High Intrinsic Low Controlled (HILC), and Low Intrinsic Low Controlled (LILC). The HIHC profile, including the students who are interest + status motivated, constituted 25.2% of the population (N = 213). The HILC profile, including interest-motivated students, constituted 26.1% of the population (N = 220). The LIHC profile, including status-motivated students, constituted 31.8% of the population (N = 268). The LILC profile, including students who have low motivation and are neither interest nor status motivated, constituted 16.9% of the population (N = 143). Interest-motivated students (HILC) showed significantly greater use of deep study strategy (p < 0.001), more self-study hours (p < 0.05), higher GPAs (p < 0.001) and lower exhaustion (p < 0.001) than status-motivated (LIHC) and low-motivation (LILC) students. CONCLUSIONS: The interest-motivated profile of medical students (HILC) is associated with good study hours, deep study strategy, good academic performance and low exhaustion from study. The interest + status motivated profile (HIHC) was also found to be associated with a good learning profile, except that students with this profile showed higher surface strategy. Low-motivation (LILC) and status-motivated profiles (LIHC) were associated with the least desirable learning behaviours.

Concepts: Cluster analysis, Educational psychology, Behavior, Motivation, Human behavior, Self-determination theory, K-means clustering, Profiles


Traditional k-means and most k-means variants are still computationally expensive for large datasets, such as microarray data, which have a large dimension size d. In k-means clustering, we are given a set of n data points in d-dimensional space R^d and an integer k. The problem is to determine a set of k points in R^d, called centers, so as to minimize the mean squared distance from each data point to its nearest center. In this work, we develop a novel k-means algorithm, which is simple but more efficient than the traditional k-means and the recent enhanced k-means. Our new algorithm is based on the recently established relationship between principal component analysis and k-means clustering. We provide a correctness proof for this algorithm. Results obtained from testing the algorithm on three biological datasets and six non-biological datasets (three of these are real, while the other three are simulated) also indicate that our algorithm is empirically faster than other known k-means algorithms. We assessed the quality of our algorithm's clusters against the clusters of a known structure using the Hubert-Arabie adjusted Rand index (ARI_HA). We found that when k is close to d, the quality is good (ARI_HA > 0.8) and when k is not close to d, the quality of our new k-means algorithm is excellent (ARI_HA > 0.9). In this paper, the emphasis is on reducing the time requirement of the k-means algorithm and on its application to microarray data, motivated by the desire to create a tool for clustering and malaria research. However, the new clustering algorithm can be used for other clustering needs as long as an appropriate measure of distance between the centroids and the members is used. This has been demonstrated in this work on six non-biological datasets.
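
The problem statement in this abstract (choose k centers that minimize the mean squared distance from each point to its nearest center) is the objective that the traditional k-means heuristic, Lloyd's algorithm, tries to reduce. The sketch below shows that traditional baseline in plain Python, not the PCA-based algorithm the paper proposes. The function names, the random initialization, and the toy points are assumptions made for illustration.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100, seed=0):
    """Traditional (Lloyd's) k-means; a baseline sketch, not the paper's method.

    Returns (centers, assignment, cost), where cost is the mean squared
    distance from each point to its assigned center.
    """
    rng = random.Random(seed)
    # Assumption: initialize centers from k distinct data points chosen at random.
    centers = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the centers are stable
        assignment = new_assignment
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    cost = sum(
        squared_distance(p, centers[a]) for p, a in zip(points, assignment)
    ) / len(points)
    return centers, assignment, cost


if __name__ == "__main__":
    toy = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    centers, assignment, cost = kmeans(toy, k=2)
    print(centers, assignment, cost)
```

Each iteration costs on the order of n * k * d distance terms, which is why large d, as in microarray data, makes the traditional procedure expensive and motivates faster variants.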

Concepts: Cluster analysis, Algorithm, Principal component analysis, Machine learning, Computational complexity theory, K-means clustering, Rand index


A memory-efficient algorithm for the computation of Principal Component Analysis (PCA) of large mass spectrometry imaging data sets is presented. Mass Spectrometry Imaging (MSI) enables two- and three-dimensional overviews of hundreds of unlabeled molecular species in complex samples such as intact tissue. PCA, in combination with data binning or other reduction algorithms, has been widely used in the unsupervised processing of MSI data and as a dimensionality reduction method prior to clustering and spatial segmentation. Standard implementations of PCA require the data to be stored in random access memory. This imposes an upper limit on the amount of data that can be processed, necessitating a compromise between the number of pixels and the number of peaks to include. With increasing interest in multivariate analysis of large 3D multi-slice datasets and ongoing improvements in instrumentation, the ability to retain all pixels and many more peaks is increasingly important. We present a new method which has no limitation on the number of pixels and allows an increased number of peaks to be retained. The new technique was validated against the MATLAB (The MathWorks Inc., Natick, Massachusetts) implementation of PCA (princomp) and then used to reduce, without discarding peaks or pixels, multiple serial sections acquired from a single mouse brain which was too large to be analysed with princomp. k-means clustering was then performed on the reduced dataset. We further demonstrate with simulated data of 83 slices, comprising 20535 pixels per slice and equalling 44 GB of data, that the new method can be used in combination with existing tools to process an entire organ. MATLAB code implementing the memory-efficient PCA algorithm is provided.

Concepts: Multivariate statistics, Data set, Principal component analysis, Machine learning, The MathWorks, MATLAB, K-means clustering, Natick, Massachusetts


Objective: The purpose of this study was to examine the heart rate reserve (HRR) at the first and second ventilatory thresholds (VTs) in postmenopausal women and compare it with the optimal intensity range recommended by the ACSM (40-84% HRR). An additional aim was to evaluate whether a higher aerobic power level corresponded to a higher HRR at the VTs. Methods: Fifty-eight postmenopausal women (aged 48-69) participated in this study. A graded 25 W·min⁻² cycle ergometer (Monark E839) exercise protocol was performed in order to assess aerobic power. The heart rate and gas-exchange variables were measured continuously using a portable gas analyzer system (Cosmed K4b). The first (VT1) and the second (VT2) VTs were determined from the time course curves of ventilation and of the O2 and CO2 ventilatory equivalents. A K-means clustering analysis was used to identify VO2max groups (cut-off of 30.5 ml·kg⁻¹·min⁻¹), and differences were evaluated by an independent-samples t-test. Bland-Altman plots were used to illustrate the agreement between methods. Results: The women’s HRR values at VT1 were similar to 40% HRR in both VO2max groups. At VT2, both VO2max groups exhibited negative differences (P<0.01) from the predicted 84% HRR intensity (-14.46% in the lower VO2max group and -16.32% in the higher VO2max group). Conclusions: An upper limit of 84% overestimates the %HRR value for the second ventilatory threshold, suggesting that the cardiorespiratory target zone for this population should be lower and narrower (40-70% HRR).
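
The abstract reports intensities as a percentage of heart rate reserve without spelling out the arithmetic. Under the standard definition (heart rate reserve is maximal minus resting heart rate, and %HRR is how far a given heart rate sits within that reserve), the conversion can be sketched as below. The function names and the resting and maximal heart rates in the example are hypothetical values chosen for illustration, not data from the study, and the sketch assumes the study used this conventional definition.

```python
def percent_hrr(heart_rate, resting_hr, max_hr):
    """Express a heart rate as a percentage of heart rate reserve (HRR).

    Standard definition: HRR = max_hr - resting_hr. All rates are in
    beats per minute.
    """
    reserve = max_hr - resting_hr
    if reserve <= 0:
        raise ValueError("max_hr must exceed resting_hr")
    return 100.0 * (heart_rate - resting_hr) / reserve


def target_heart_rate(percent, resting_hr, max_hr):
    """Inverse conversion: heart rate corresponding to a given %HRR."""
    return resting_hr + percent / 100.0 * (max_hr - resting_hr)


if __name__ == "__main__":
    # Hypothetical person: resting 70 bpm, maximal 170 bpm.
    for pct in (40, 70, 84):
        print(pct, target_heart_rate(pct, resting_hr=70, max_hr=170))
```

For the hypothetical person in the example, narrowing the upper limit from 84% to 70% HRR, as the conclusion suggests, lowers the top of the target zone by 14 beats per minute.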

Concepts: Pulse, Student's t-test, Heart rate, The Higher, Thresholds, Threshold, Limit superior and limit inferior, K-means clustering


Relaxor/ferroelectric ceramic/ceramic composites have been shown to be promising in generating large electromechanical strain at moderate electric fields. Nonetheless, the mechanisms of polarization and strain coupling between grains of different nature in the composites remain unclear. To rationalize the coupling mechanisms, we performed advanced piezoresponse force microscopy (PFM) studies of 0.92BNT-0.06BT-0.02KNN/0.93BNT-0.07BT (ergodic/non-ergodic relaxor) composites. PFM is able to distinguish grains of different phases by their characteristic domain patterns. Polarization switching has been probed locally, on a sub-grain scale. k-means clustering analysis applied to arrays of local hysteresis loops reveals variations in polarization switching characteristics between the ergodic and non-ergodic relaxor grains. We report a different set of switching parameters for grains in the composites as opposed to the pure phase samples. Our results confirm ceramic/ceramic composites to be a viable approach to tailor the piezoelectric properties and optimize the macroscopic electromechanical characteristics.

Concepts: Electromagnetism, Magnetic field, Fundamental physics concepts, Phase, Hysteresis, Piezoelectricity, Characteristic, K-means clustering


The K-means algorithm is one of the most popular clustering algorithms in current use as it is relatively fast yet simple to understand and deploy in practice. Nevertheless, its use entails certain restrictive assumptions about the data, the negative consequences of which are not always immediately apparent, as we demonstrate. While more flexible algorithms have been developed, their widespread use has been hindered by their computational and technical complexity. Motivated by these considerations, we present a flexible alternative to K-means that relaxes most of the assumptions, whilst remaining almost as fast and simple. This novel algorithm, which we call MAP-DP (maximum a posteriori Dirichlet process mixtures), is statistically rigorous as it is based on nonparametric Bayesian Dirichlet process mixture modeling. This approach allows us to overcome most of the limitations imposed by K-means. The number of clusters K is estimated from the data instead of being fixed a priori as in K-means. In addition, while K-means is restricted to continuous data, the MAP-DP framework can be applied to many kinds of data, for example, binary, count or ordinal data. Also, it can efficiently separate outliers from the data. This additional flexibility does not incur a significant computational overhead compared to K-means, with MAP-DP convergence typically achieved in the order of seconds for many practical problems. Finally, in contrast to K-means, since the algorithm is based on an underlying statistical model, the MAP-DP framework can deal with missing data and enables model testing such as cross validation in a principled way. We demonstrate the simplicity and effectiveness of this algorithm on the health informatics problem of clinical sub-typing in a cluster of diseases known as parkinsonism.

Concepts: Cluster analysis, Statistics, Machine learning, Computational complexity theory, Mixture model, K-means clustering, Algorithmic efficiency


Wildlife-associated diseases and pathogens have increased in importance; however, management of a large number of diseases and diversity of hosts is prohibitively expensive. Thus, the determination of priority wildlife pathogens and risk factors for disease emergence is warranted. We used an online questionnaire survey to assess release and exposure risks, and consequences of wildlife-associated diseases and pathogens in the Republic of Korea (ROK). We also surveyed opinions on pathways for disease exposure, and risk factors for disease emergence and spread. For the assessment of risk, we employed a two-tiered, statistical K-means clustering algorithm to group diseases into three levels (high, medium and low) of perceived risk based on release and exposure risks, societal consequences and the level of uncertainty of the experts' opinions. To examine the experts' perceived risk of routes of introduction of pathogens and disease amplification and spread, we used a Bayesian, multivariate normal order-statistics model. Six diseases or pathogens, including four livestock and two wildlife diseases, were identified as having high risk with low uncertainty. Similarly, 13 diseases were characterized as having high risk with medium uncertainty with three of these attributed to livestock, six associated with human disease, and the remainder having the potential to affect human, livestock and wildlife (i.e., One Health). Lastly, four diseases were described as high risk with high certainty, and were associated solely with fish diseases. Experts identified migration of wildlife, international human movement and illegal importation of wildlife as the three routes posing the greatest risk of pathogen introduction into ROK. Proximity of humans, livestock and wildlife was the most significant risk factor for promoting the spread of wildlife-associated diseases and pathogens, followed by high density of livestock populations, habitat loss and environmental degradation, and climate change. This study provides useful information to decision makers responsible for allocating resources to address disease risks. This approach provided a rapid, cost-effective method of risk assessment of wildlife-associated diseases and pathogens for which the published literature is sparse.

Concepts: Epidemiology, Disease, Risk, Actuarial science, Decision theory, Risk management, Uncertainty, K-means clustering


Understanding the genetic mechanisms of complex diseases is a serious challenge. Existing methods often neglect the heterogeneity of complex diseases, resulting in a lack of power or low reproducibility. Addressing heterogeneity when detecting epistatic single nucleotide polymorphisms (SNPs) can enhance the power of association studies and improve the prediction performance of complex disease diagnosis. In this study, we propose a three-stage framework, comprising epistasis detection, clustering and prediction, to address both epistasis and heterogeneity of complex diseases based on a deep learning method. The epistasis detection stage applies a multi-objective optimization method to find several candidate sets of epistatic SNPs which contribute to different subtypes of complex diseases. Then, a K-means clustering algorithm is used to define subtypes of the case group. Finally, a deep learning model is trained for disease prediction on a graphics processing unit (GPU). Experimental results on pure and heterogeneous datasets show that our method has potential practicality and can serve as a possible alternative to other methods. Therefore, when epistasis and heterogeneity exist at the same time, our method is especially suitable for diagnosis of complex diseases.

Concepts: DNA, Scientific method, Cluster analysis, Bioinformatics, SNP array, Single-nucleotide polymorphism, Machine learning, K-means clustering


For the authentication of white rice from different geographical origins, the selection of outstanding discrimination markers is essential. In this study, 80 commercial white rice samples were collected from local markets of Korea and China and discriminated by mass spectrometry-based untargeted metabolomics approaches. Additionally, the potential markers that belong to sugars & sugar alcohols, fatty acids, and phospholipids were examined using several multivariate analyses to measure their discrimination efficiencies. Unsupervised analyses, including principal component analysis and k-means clustering, demonstrated the potential for geographical classification of white rice between Korea and China by fatty acids and phospholipids. In addition, the accuracy, goodness-of-fit (R2), goodness-of-prediction (Q2), and permutation test p-value derived from phospholipid-based partial least squares-discriminant analysis were 1.000, 0.902, 0.870, and 0.001, respectively. Random Forests further consolidated the discrimination ability of phospholipids. Furthermore, an independent validation set containing 20 white rice samples also confirmed that phospholipids were excellent discrimination markers for white rice between the two countries. In conclusion, the proposed approach successfully highlighted phospholipids as better discrimination markers than sugars & sugar alcohols and fatty acids in differentiating white rice between Korea and China.

Concepts: Nutrition, Fatty acid, Multivariate statistics, Principal component analysis, Machine learning, Glycerol, Multivariate analysis, K-means clustering