SciCombinator


Concept: Statistical classification

178

BACKGROUND: An evidence-based steps/day translation of U.S. federal guidelines for youth to engage in >=60 minutes/day of moderate-to-vigorous physical activity (MVPA) would help health researchers, practitioners, and lay professionals charged with increasing youth’s physical activity (PA). The purpose of this study was to determine the number of free-living steps/day (both raw and adjusted to a pedometer scale) that correctly classified children (6–11 years) and adolescents (12–17 years) as meeting the 60-minute MVPA guideline using the 2005–2006 National Health and Nutrition Examination Survey (NHANES) accelerometer data, and to evaluate the 12,000 steps/day recommendation recently adopted by the President’s Challenge Physical Activity and Fitness Awards Program. METHODS: Analyses were conducted among children (n = 915) and adolescents (n = 1,302) in 2011 and 2012. Receiver Operating Characteristic (ROC) curve plots and classification statistics revealed candidate steps/day cut points that discriminated meeting/not meeting the MVPA threshold by age group, gender and different accelerometer activity cut points. The Evenson and two Freedson age-specific (3 and 4 METs) cut points were used to define minimum MVPA, and optimal steps/day were examined for raw steps and for steps adjusted to a pedometer scale to facilitate translation to lay populations. RESULTS: For boys and girls (6–11 years) with >=60 minutes/day of MVPA, 11,500–13,500 uncensored steps/day was the optimal range that balanced classification errors. For adolescent boys and girls (12–17 years) with >=60 minutes/day of MVPA, 11,500–14,000 uncensored steps/day was optimal. Translation to a pedometer scale reduced these minimum values by 2,500 steps/day to 9,000 steps/day. Area under the curve was >=84% in all analyses. CONCLUSIONS: No single study has definitively identified a precise and unyielding steps/day value for youth. Considering the other evidence to date, we propose a reasonable ‘rule of thumb’ value of >=11,500 accelerometer-determined steps/day for both children and adolescents (and both genders), accepting that more is better. For practical applications, 9,000 steps/day appears to be a more pedometer-friendly value.
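To make the cut-point idea concrete, here is a minimal Python sketch of picking a steps/day threshold from ROC-style statistics. The data are invented for illustration, and the selection rule (Youden's J, which maximizes sensitivity + specificity - 1) is an assumption: the abstract says only that the chosen range "balanced classification errors".

```python
def roc_points(steps, meets_guideline):
    """Return (cut, sensitivity, specificity) for each observed steps/day value.

    A youth is predicted to meet the MVPA guideline when steps >= cut.
    """
    pos = sum(meets_guideline)
    neg = len(meets_guideline) - pos
    points = []
    for cut in sorted(set(steps)):
        tp = sum(1 for s, m in zip(steps, meets_guideline) if m and s >= cut)
        tn = sum(1 for s, m in zip(steps, meets_guideline) if not m and s < cut)
        points.append((cut, tp / pos, tn / neg))
    return points


def best_cut(points):
    """Pick the cut point with the largest Youden's J (assumed criterion)."""
    return max(points, key=lambda p: p[1] + p[2] - 1)


# Hypothetical toy data: daily steps and whether >=60 min/day of MVPA was reached.
steps = [7000, 8500, 9500, 10500, 11000, 12000, 12500, 13500, 15000, 16000]
meets = [False, False, False, True, False, False, True, True, True, True]

cut, sensitivity, specificity = best_cut(roc_points(steps, meets))
print(cut, sensitivity, specificity)
```

Other criteria, such as the point closest to the top-left corner of the ROC plot, can select a different threshold from the same data.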

Concepts: Gender, Receiver operating characteristic, Statistical classification, Accelerometer

169

BACKGROUND: Ensemble predictors such as the random forest are known to have superior accuracy but their black-box predictions are difficult to interpret. In contrast, a generalized linear model (GLM) is very interpretable especially when forward feature selection is used to construct the model. However, forward feature selection tends to overfit the data and leads to low predictive accuracy. Therefore, it remains an important research goal to combine the advantages of ensemble predictors (high accuracy) with the advantages of forward regression modeling (interpretability). To address this goal several articles have explored GLM based ensemble predictors. Since limited evaluations suggested that these ensemble predictors were less accurate than alternative predictors, they have found little attention in the literature. RESULTS: Comprehensive evaluations involving hundreds of genomic data sets, the UCI machine learning benchmark data, and simulations are used to give GLM based ensemble predictors a new and careful look. A novel bootstrap aggregated (bagged) GLM predictor that incorporates several elements of randomness and instability (random subspace method, optional interaction terms, forward variable selection) often outperforms a host of alternative prediction methods including random forests and penalized regression models (ridge regression, elastic net, lasso). This random generalized linear model (RGLM) predictor provides variable importance measures that can be used to define a “thinned” ensemble predictor (involving few features) that retains excellent predictive accuracy. CONCLUSION: RGLM is a state of the art predictor that shares the advantages of a random forest (excellent predictive accuracy, feature importance measures, out-of-bag estimates of accuracy) with those of a forward selected generalized linear model (interpretability). These methods are implemented in the freely available R software package randomGLM.
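The two core ingredients named in the abstract, bootstrap aggregation and the random subspace method, can be sketched in a few lines of plain Python. This is not the randomGLM implementation: it leaves out forward variable selection, interaction terms and variable importance, and fits each logistic GLM by simple gradient descent. All function names, parameters and the toy data are hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, epochs=200, lr=0.1):
    """Fit a logistic GLM by batch gradient descent; returns [bias, w1, ...]."""
    n, p = len(X), len(X[0])
    w = [0.0] * (p + 1)
    for _ in range(epochs):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            err = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def fit_bagged_glm(X, y, n_bags=15, n_feats=2, seed=0):
    """Train one logistic model per bootstrap sample, each on a random feature subset."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    ensemble = []
    for _ in range(n_bags):
        rows = [rng.randrange(n) for _ in range(n)]    # bootstrap sample (with replacement)
        feats = sorted(rng.sample(range(p), n_feats))  # random subspace of features
        Xb = [[X[i][j] for j in feats] for i in rows]
        yb = [y[i] for i in rows]
        ensemble.append((feats, fit_logistic(Xb, yb)))
    return ensemble


def predict_proba(ensemble, x):
    """Average the predicted probabilities of all ensemble members."""
    probs = [sigmoid(w[0] + sum(wj * x[j] for wj, j in zip(w[1:], feats)))
             for feats, w in ensemble]
    return sum(probs) / len(probs)


# Synthetic, well-separated toy data (illustration only).
data_rng = random.Random(1)
X, y = [], []
for label, centre in ((0, -2.0), (1, 2.0)):
    for _ in range(30):
        X.append([data_rng.gauss(centre, 1.0) for _ in range(3)])
        y.append(label)

ensemble = fit_bagged_glm(X, y)
accuracy = sum((predict_proba(ensemble, x) >= 0.5) == bool(t)
               for x, t in zip(X, y)) / len(y)
print(accuracy)
```

The rows left out of each bootstrap sample are what would supply the out-of-bag accuracy estimates the abstract mentions.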

Concepts: Regression analysis, Statistics, Prediction, Machine learning, Generalized linear model, Statistical classification, Predictor, Random multinomial logit

142

Detection of foreign matter in cleaned cotton is instrumental to accurately grading cotton quality, which in turn impacts the marketability of the cotton. Current grading systems return estimates of the amount of foreign matter present, but provide no information about the identity of the contaminants. This paper explores the use of pulsed thermographic analysis to detect and identify cotton foreign matter. The design and implementation of a pulsed thermographic analysis system are described. A sample set of 240 foreign matter and cotton lint samples was collected. Hand-crafted waveform features and frequency-domain features were extracted and analyzed for statistical significance. Classification was performed on these features using linear discriminant analysis and support vector machines. Using waveform features and support vector machine classifiers, detection of cotton foreign matter was performed with 99.17% accuracy. Using frequency-domain features and linear discriminant analysis, identification was performed with 90.00% accuracy. These results demonstrate that pulsed thermographic imaging analysis produces data of significant utility for the detection and identification of cotton foreign matter.

Concepts: Statistics, Statistical classification, Support vector machine, Classification algorithms, Quadratic programming, Kernel trick, Linear classifier, Identity function

137

Surgery for brain cancer is a major problem in neurosurgery. The diffuse infiltration into the surrounding normal brain by these tumors makes their accurate identification by the naked eye difficult. Since surgery is the common treatment for brain cancer, an accurate radical resection of the tumor leads to improved survival rates for patients. However, the identification of the tumor boundaries during surgery is challenging. Hyperspectral imaging is a non-contact, non-ionizing and non-invasive technique suitable for medical diagnosis. This study presents the development of a novel classification method that takes into account the spatial and spectral characteristics of the hyperspectral images to help neurosurgeons accurately determine the tumor boundaries during the resection, avoiding excessive excision of normal tissue or unintentionally leaving residual tumor. The algorithm proposed in this study consists of a hybrid framework that combines both supervised and unsupervised machine learning methods. Firstly, a supervised pixel-wise classification using a Support Vector Machine classifier is performed. The generated classification map is spatially homogenized using a one-band representation of the hyperspectral (HS) cube, employing the Fixed Reference t-Stochastic Neighbors Embedding dimensional reduction algorithm, and performing a K-Nearest Neighbors filtering. The information generated by the supervised stage is combined with a segmentation map obtained via unsupervised clustering employing a Hierarchical K-Means algorithm. The fusion is performed using a majority voting approach that associates each cluster with a certain class. To evaluate the proposed approach, five in vivo hyperspectral images of the brain surface affected by glioblastoma tumor, from five different patients, have been used. The final classification maps obtained have been analyzed and validated by specialists. These preliminary results are promising, obtaining an accurate delineation of the tumor area.
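The majority-voting fusion step can be sketched as follows: every pixel in an unsupervised cluster receives the class that the supervised map assigns most often within that cluster. The flat pixel lists, label names and function name are hypothetical, and the authors' implementation may differ in detail (for example in how ties are resolved).

```python
from collections import Counter, defaultdict


def fuse_by_majority_vote(class_map, cluster_map):
    """Relabel each pixel with the most frequent supervised class in its cluster."""
    votes = defaultdict(Counter)
    for cls, cluster in zip(class_map, cluster_map):
        votes[cluster][cls] += 1
    # Ties are resolved in favour of the class counted first within the cluster.
    cluster_to_class = {c: counts.most_common(1)[0][0] for c, counts in votes.items()}
    return [cluster_to_class[c] for c in cluster_map]


# Toy example: 8 pixels, supervised labels and unsupervised cluster ids.
class_map = ["tumor", "tumor", "normal", "tumor", "normal", "normal", "tumor", "normal"]
cluster_map = [0, 0, 0, 0, 1, 1, 1, 1]

fused = fuse_by_majority_vote(class_map, cluster_map)
print(fused)
```

In this toy case the single "normal" pixel inside cluster 0 and the single "tumor" pixel inside cluster 1 are overruled by their clusters, so each cluster ends up with one class.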

Concepts: Cancer, Oncology, Surgery, Brain tumor, Machine learning, Glioblastoma multiforme, Statistical classification, Unsupervised learning

62

Deep learning is rapidly advancing many areas of science and technology with multiple success stories in image, text, voice and video recognition, robotics and autonomous driving. In this paper we demonstrate how deep neural networks (DNN) trained on large transcriptional response data sets can classify various drugs into therapeutic categories solely based on their transcriptional profiles. We used the perturbation samples of 678 drugs across A549, MCF-7 and PC-3 cell lines from the LINCS project and linked those to 12 therapeutic use categories derived from MeSH. To train the DNN, we utilized both gene level transcriptomic data and transcriptomic data processed using a pathway activation scoring algorithm, for a pooled dataset of samples perturbed with different concentrations of the drug for 6 and 24 hours. When applied to normalized gene expression data for “landmark genes,” the DNN showed cross-validation mean F1 scores of 0.397, 0.285 and 0.234 on 3-, 5- and 12-category classification problems, respectively. At the pathway level the DNN performed best, with cross-validation mean F1 scores of 0.701, 0.596 and 0.546 on the same tasks. In both gene and pathway level classification, the DNN convincingly outperformed the support vector machine (SVM) model on every multiclass classification problem. For the first time we demonstrate a deep learning neural net trained on transcriptomic data to recognize pharmacological properties of multiple drugs across different biological systems and conditions. We also propose using deep neural net confusion matrices for drug repositioning. This work is a proof of principle for applying deep learning to drug discovery and development.
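For readers unfamiliar with the metric, the sketch below computes per-class F1 scores and their unweighted mean from a multiclass confusion matrix. The matrix is made up, and treating "mean F1" as the unweighted (macro) average is an assumption, since the abstract does not state the averaging scheme.

```python
def per_class_f1(confusion):
    """F1 per class from a square confusion matrix (rows = true, columns = predicted)."""
    k = len(confusion)
    scores = []
    for c in range(k):
        tp = confusion[c][c]
        fp = sum(confusion[r][c] for r in range(k)) - tp  # predicted c, truly another class
        fn = sum(confusion[c]) - tp                       # truly c, predicted another class
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


def mean_f1(confusion):
    """Unweighted (macro) average of the per-class F1 scores."""
    scores = per_class_f1(confusion)
    return sum(scores) / len(scores)


# Hypothetical 3-category confusion matrix.
confusion = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 3, 7],
]
print(per_class_f1(confusion), mean_f1(confusion))
```

The off-diagonal cells of such a matrix record which categories the classifier confuses with each other, which is the kind of information the abstract proposes to use for drug repositioning.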

Concepts: DNA, Pharmacology, Gene, Genetics, Gene expression, Transcription, Drug development, Statistical classification

58

High throughput screening determines the effects of many conditions on a given biological target. Currently, estimating the effects of those conditions on other targets requires either strong modeling assumptions (e.g. similarities among targets) or separate screens. Ideally, data-driven experimentation could be used to learn accurate models for many conditions and targets without doing all possible experiments. We have previously described an active machine learning algorithm that can iteratively choose small sets of experiments to learn models of multiple effects. We now show that, with no prior knowledge and with liquid handling robotics and automated microscopy under its control, this learner accurately learned the effects of 48 chemical compounds on the subcellular localization of 48 proteins while performing only 29% of all possible experiments. The results represent the first practical demonstration of the utility of active learning-driven biological experimentation in which the set of possible phenotypes is unknown in advance.

Concepts: Chemistry, Artificial intelligence, Machine learning, Learning, Chemical compound, Knowledge, Statistical classification, Pattern recognition

42

Recent research indicates a high recall in Google Scholar searches for systematic reviews. These reports raised high expectations of Google Scholar as a unified and easy to use search interface. However, studies on the coverage of Google Scholar rarely used the search interface in a realistic approach but instead merely checked for the existence of gold standard references. In addition, the severe limitations of the Google Search interface must be taken into consideration when comparing with professional literature retrieval tools. The objectives of this work are to measure the relative recall and precision of searches with Google Scholar under conditions which are derived from structured search procedures conventional in scientific literature retrieval; and to provide an overview of current advantages and disadvantages of the Google Scholar search interface in scientific literature retrieval.
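As a reminder of how the two measures are defined, here is a small sketch that computes precision and recall for one search. Recall is taken relative to a reference set of known relevant records, such as the gold standard references mentioned above. The record identifiers are invented.

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved; recall = hits / relevant (reference set)."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical search: 8 records returned, 5 known relevant records in the reference set.
retrieved = ["r1", "r2", "r3", "r4", "r5", "r6", "r7", "r8"]
relevant = ["r2", "r5", "r8", "r9", "r10"]

precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

Here 3 of the 8 retrieved records are relevant (precision 0.375) and 3 of the 5 reference records are found (recall 0.6).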

Concepts: Google Scholar, Accuracy and precision, Web search engine, Statistical classification, Object-oriented programming, Searching, Google, Recall

31

Purpose To investigate whether multivariate pattern recognition analysis of arterial spin labeling (ASL) perfusion maps can be used for classification and single-subject prediction of patients with Alzheimer disease (AD) and mild cognitive impairment (MCI) and subjects with subjective cognitive decline (SCD) after using the W score method to remove confounding effects of sex and age. Materials and Methods Pseudocontinuous 3.0-T ASL images were acquired in 100 patients with probable AD; 60 patients with MCI, of whom 12 remained stable, 12 were converted to a diagnosis of AD, and 36 had no follow-up; 100 subjects with SCD; and 26 healthy control subjects. The AD, MCI, and SCD groups were divided into a sex- and age-matched training set (n = 130) and an independent prediction set (n = 130). Standardized perfusion scores adjusted for age and sex (W scores) were computed per voxel for each participant. Training of a support vector machine classifier was performed with diagnostic status and perfusion maps. Discrimination maps were extracted and used for single-subject classification in the prediction set. Prediction performance was assessed with receiver operating characteristic (ROC) analysis to generate an area under the ROC curve (AUC) and sensitivity and specificity distribution. Results Single-subject diagnosis in the prediction set by using the discrimination maps yielded excellent performance for AD versus SCD (AUC, 0.96; P < .01), good performance for AD versus MCI (AUC, 0.89; P < .01), and poor performance for MCI versus SCD (AUC, 0.63; P = .06). Application of the AD versus SCD discrimination map for prediction of MCI subgroups resulted in good performance for patients with MCI diagnosis converted to AD versus subjects with SCD (AUC, 0.84; P < .01) and fair performance for patients with MCI diagnosis converted to AD versus those with stable MCI (AUC, 0.71; P > .05). 
Conclusion With automated methods, age- and sex-adjusted ASL perfusion maps can be used to classify and predict diagnosis of AD, conversion of MCI to AD, stable MCI, and SCD with good to excellent accuracy and AUC values. © RSNA, 2016.
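The abstract does not give the W score formula. As commonly defined, a W score is the residual from a regression on the confounders, fitted in a reference group, divided by the standard deviation of that group's residuals. Under that assumption, and with hypothetical symbols (voxel v, participant i, perfusion value x), the per-voxel computation would look like this:

```latex
\hat{x}_{i}(v) = \beta_{0}(v) + \beta_{\mathrm{age}}(v)\,\mathrm{age}_{i} + \beta_{\mathrm{sex}}(v)\,\mathrm{sex}_{i},
\qquad
W_{i}(v) = \frac{x_{i}(v) - \hat{x}_{i}(v)}{\sigma_{\mathrm{res}}(v)}
```

Here the coefficients \beta(v) and the residual standard deviation \sigma_{\mathrm{res}}(v) are estimated in the reference group, so a W score reads like a z score that has already been corrected for age and sex.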

Concepts: Alzheimer's disease, Type I and type II errors, Sensitivity and specificity, Machine learning, Receiver operating characteristic, Binary classification, Statistical classification, Supervised learning

28

The use of the systemic opioid remifentanil to relieve labor pain has attracted much attention recently. An optimal dosing regimen for administration of remifentanil during labor relies on anticipating the timing of uterine contractions. These predictions should be made early enough to maximize analgesia efficacy during contractions and minimize the impact of the medication between contractions. We have designed a knowledge-assisted sequential pattern analysis framework to 1) predict the intrauterine pressure in real-time; 2) anticipate the next contraction; and, 3) develop a sequential association rule mining approach to identify the patterns of the contractions from historical patient tracings. The basis of this framework is a sequential association rule based collaborative filtering strategy that dynamically selects a better training dataset from historical patient tracings that are similar to the current patient's contraction pattern, and from the current patient's most recent training time series. A k-nearest neighbors (k-NN) based least squares support vector machine (LS-SVM) approach is used to establish the long-term time series prediction. Further, a postprediction process is proposed to enhance the predictive value. The findings validate that the framework is effective, robust, and efficient for uterine contraction prediction.

Concepts: Childbirth, Statistics, Prediction, Futurology, Future, Prophecy, Statistical classification, Contraction

28

Digital staining for the automated annotation of Mass Spectrometry Imaging (MSI) data has previously been achieved using state-of-the-art classifiers such as random forests or support vector machines (SVMs). However, the training of such classifiers requires an expert to label exemplary data in advance. This process is time-consuming and hence costly, especially if the tissue is heterogeneous. In theory, it may be sufficient to label only a few highly representative pixels of an MS image, but it is not known a priori which pixels to select. This motivates active learning strategies in which the algorithm itself queries the expert by automatically suggesting promising candidate pixels of an MS image for labeling. Given a suitable querying strategy, the number of required training labels can be significantly reduced while maintaining classification accuracy. In this work, we propose active learning for convenient annotation of MSI data. We generalize a recently proposed active learning method to the multi-class case and combine it with the random forest classifier. Its superior performance over random sampling is demonstrated on Secondary Ion Mass Spectrometry data, making it an interesting approach for the classification of mass spectrometry images.
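The querying idea can be illustrated with a generic margin-based uncertainty sampler: given per-pixel class probabilities (for instance, the fraction of trees in a random forest voting for each class), it suggests the unlabeled pixels whose two most likely classes are hardest to tell apart. This is a common multi-class strategy used here as an assumption. It is not necessarily the method generalized in the paper, and all names and numbers are hypothetical.

```python
def select_queries(probabilities, labeled, n_queries=2):
    """Return indices of the unlabeled items with the smallest top-two probability margin."""
    margins = []
    for idx, probs in enumerate(probabilities):
        if idx in labeled:
            continue  # already annotated by the expert
        top = sorted(probs, reverse=True)
        margins.append((top[0] - top[1], idx))
    margins.sort()  # smallest margin = most uncertain prediction
    return [idx for _, idx in margins[:n_queries]]


# Hypothetical class probabilities for five pixels and three tissue classes.
probabilities = [
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.34, 0.33, 0.33],
    [0.60, 0.30, 0.10],
    [0.50, 0.45, 0.05],
]
labeled = {4}  # pixel 4 has already been labeled

queries = select_queries(probabilities, labeled)
print(queries)
```

In an active learning loop, the expert would label the suggested pixels, the classifier would be retrained, and the selection would be repeated until the labeling budget is spent.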

Concepts: Mass spectrometry, Machine learning, IMAGE, Statistical classification, Support vector machine, Label, Secondary ion mass spectrometry, Active learning