How does network structure affect diffusion? Recent studies suggest that the answer depends on the type of contagion. Complex contagions, unlike infectious diseases (simple contagions), are affected by social reinforcement and homophily. Hence, the spread within highly clustered communities is enhanced, while diffusion across communities is hampered. A common hypothesis is that memes and behaviors are complex contagions. We show that, while most memes indeed spread like complex contagions, a few viral memes spread across many communities, like diseases. We demonstrate that the future popularity of a meme can be predicted by quantifying its early spreading pattern in terms of community concentration. The more communities a meme permeates, the more viral it is. We present a practical method to translate data about community structure into predictive knowledge about what information will spread widely. This connection contributes to our understanding in computational social science, social media analytics, and marketing applications.
Secondary use of electronic health records (EHRs) promises to advance clinical research and better inform clinical decision making. Challenges in summarizing and representing patient data prevent widespread practice of predictive modeling using EHRs. Here we present a novel unsupervised deep feature learning method to derive a general-purpose patient representation from EHR data that facilitates clinical predictive modeling. In particular, a three-layer stack of denoising autoencoders was used to capture hierarchical regularities and dependencies in the aggregated EHRs of about 700,000 patients from the Mount Sinai data warehouse. The result is a representation we name “deep patient”. We evaluated this representation as broadly predictive of health states by assessing the probability of patients to develop various diseases. We performed evaluation using 76,214 test patients comprising 78 diseases from diverse clinical domains and temporal windows. Our results significantly outperformed those achieved using representations based on raw EHR data and alternative feature learning strategies. Prediction performance for severe diabetes, schizophrenia, and various cancers were among the top performing. These findings indicate that deep learning applied to EHRs can derive patient representations that offer improved clinical predictions, and could provide a machine learning framework for augmenting clinical decision systems.
- Proceedings of the National Academy of Sciences of the United States of America
- Published over 3 years ago
Prediction of human physical traits and demographic information from genomic data challenges privacy and data deidentification in personalized medicine. To explore the current capabilities of phenotype-based genomic identification, we applied whole-genome sequencing, detailed phenotyping, and statistical modeling to predict biometric traits in a cohort of 1,061 participants of diverse ancestry. Individually, for a large fraction of the traits, their predictive accuracy beyond ancestry and demographic information is limited. However, we have developed a maximum entropy algorithm that integrates multiple predictions to determine which genomic samples and phenotype measurements originate from the same person. Using this algorithm, we have reidentified an average of >8 of 10 held-out individuals in an ethnically mixed cohort and an average of 5 of either 10 African Americans or 10 Europeans. This work challenges current conceptions of personal privacy and may have far-reaching ethical and legal implications.
Use of socially generated “big data” to access information about collective states of the minds in human societies has become a new paradigm in the emerging field of computational social science. A natural application of this would be the prediction of the society’s reaction to a new product in the sense of popularity and adoption rate. However, bridging the gap between “real time monitoring” and “early predicting” remains a big challenge. Here we report on an endeavor to build a minimalistic predictive model for the financial success of movies based on collective activity data of online users. We show that the popularity of a movie can be predicted much before its release by measuring and analyzing the activity level of editors and viewers of the corresponding entry to the movie in Wikipedia, the well-known online encyclopedia.
In a random number generation task, participants are asked to generate a random sequence of numbers, most typically the digits 1 to 9. Such number sequences are not mathematically random, and both extent and type of bias allow one to characterize the brain’s “internal random number generator”. We assume that certain patterns and their variations will frequently occur in humanly generated random number sequences. Thus, we introduce a pattern-based analysis of random number sequences. Twenty healthy subjects randomly generated two sequences of 300 numbers each. Sequences were analysed to identify the patterns of numbers predominantly used by the subjects and to calculate the frequency of a specific pattern and its variations within the number sequence. This pattern analysis is based on the Damerau-Levenshtein distance, which counts the number of edit operations that are needed to convert one string into another. We built a model that predicts not only the next item in a humanly generated random number sequence based on the item’s immediate history, but also the deployment of patterns in another sequence generated by the same subject. When a history of seven items was computed, the mean correct prediction rate rose up to 27% (with an individual maximum of 46%, chance performance of 11%). Furthermore, we assumed that when predicting one subject’s sequence, predictions based on statistical information from the same subject should yield a higher success rate than predictions based on statistical information from a different subject. When provided with two sequences from the same subject and one from a different subject, an algorithm identifies the foreign sequence in up to 88% of the cases. In conclusion, the pattern-based analysis using the Levenshtein-Damarau distance is both able to predict humanly generated random number sequences and to identify person-specific information within a humanly generated random number sequence.
Correctly assessing a scientist’s past research impact and potential for future impact is key in recruitment decisions and other evaluation processes. While a candidate’s future impact is the main concern for these decisions, most measures only quantify the impact of previous work. Recently, it has been argued that linear regression models are capable of predicting a scientist’s future impact. By applying that future impact model to 762 careers drawn from three disciplines: physics, biology, and mathematics, we identify a number of subtle, but critical, flaws in current models. Specifically, cumulative non-decreasing measures like the h-index contain intrinsic autocorrelation, resulting in significant overestimation of their “predictive power”. Moreover, the predictive power of these models depend heavily upon scientists' career age, producing least accurate estimates for young researchers. Our results place in doubt the suitability of such models, and indicate further investigation is required before they can be used in recruiting decisions.
Through theoretical analysis, we show how a superorganism may react to stimulus variations according to psychophysical laws observed in humans and other animals. We investigate an empirically-motivated honeybee house-hunting model, which describes a value-sensitive decision process over potential nest-sites, at the level of the colony. In this study, we show how colony decision time increases with the number of available nests, in agreement with the Hick-Hyman law of psychophysics, and decreases with mean nest quality, in agreement with Piéron’s law. We also show that colony error rate depends on mean nest quality, and difference in quality, in agreement with Weber’s law. Psychophysical laws, particularly Weber’s law, have been found in diverse species, including unicellular organisms. Our theoretical results predict that superorganisms may also exhibit such behaviour, suggesting that these laws arise from fundamental mechanisms of information processing and decision-making. Finally, we propose a combined psychophysical law which unifies Hick-Hyman’s law and Piéron’s law, traditionally studied independently; this unified law makes predictions that can be empirically tested.
- Proceedings of the National Academy of Sciences of the United States of America
- Published almost 3 years ago
Why do certain group members end up liking each other more than others? How does affective reciprocity arise in human groups? The prediction of interpersonal sentiment has been a long-standing pursuit in the social sciences. We combined fMRI and longitudinal social network data to test whether newly acquainted group members' reward-related neural responses to images of one another’s faces predict their future interpersonal sentiment, even many months later. Specifically, we analyze associations between relationship-specific valuation activity and relationship-specific future liking. We found that one’s own future (T2) liking of a particular group member is predicted jointly by actor’s initial (T1) neural valuation of partner and by that partner’s initial (T1) neural valuation of actor. These actor and partner effects exhibited equivalent predictive strength and were robust when statistically controlling for each other, both individuals' initial liking, and other potential drivers of liking. Behavioral findings indicated that liking was initially unreciprocated at T1 yet became strongly reciprocated by T2. The emergence of affective reciprocity was partly explained by the reciprocal pathways linking dyad members' T1 neural data both to their own and to each other’s T2 liking outcomes. These findings elucidate interpersonal brain mechanisms that define how we ultimately end up liking particular interaction partners, how group members' initially idiosyncratic sentiments become reciprocated, and more broadly, how dyads evolve. This study advances a flexible framework for researching the neural foundations of interpersonal sentiments and social relations that-conceptually, methodologically, and statistically-emphasizes group members' neural interdependence.
Will they participate? predicting patients' response to clinical trial invitations in a pediatric emergency department
- Journal of the American Medical Informatics Association : JAMIA
- Published over 4 years ago
(1) To develop an automated algorithm to predict a patient’s response (ie, if the patient agrees or declines) before he/she is approached for a clinical trial invitation; (2) to assess the algorithm performance and the predictors on real-world patient recruitment data for a diverse set of clinical trials in a pediatric emergency department; and (3) to identify directions for future studies in predicting patients' participation response.
Comparison of the binding sites of proteins is an effective means for predicting protein functions based on their structure information. Despite the importance of this problem and much research in the past, it is still very challenging to predict the binding ligands from the atomic structures of protein binding sites. Here, we designed a new algorithm, TIPSA (Triangulation-based Iterative-closest-point for Protein Surface Alignment), based on the iterative closest point (ICP) algorithm. TIPSA aims to find the maximum number of atoms that can be superposed between two protein binding sites, where any pair of superposed atoms has a distance smaller than a given threshold. The search starts from similar tetrahedra between two binding sites obtained from 3D Delaunay triangulation and uses the Hungarian algorithm to find additional matched atoms. We found that, due to the plasticity of protein binding sites, matching the rigid body of point clouds of protein binding sites is not adequate for satisfactory binding ligand prediction. We further incorporated global geometric information, the radius of gyration of binding site atoms, and used nearest neighbor classification for binding site prediction. Tested on benchmark data, our method achieved a performance comparable to the best methods in the literature, while simultaneously providing the common atom set and atom correspondences.