Concept: Data set
Most imaging studies in the biological sciences rely on analyses that are relatively simple. However, manual repetition of analysis tasks across multiple regions in many images can complicate even the simplest analysis, making record keeping difficult, increasing the potential for error, and limiting reproducibility. While fully automated solutions are necessary for very large data sets, they are sometimes impractical for the small- and medium-sized data sets common in biology. Here we present the Slide Set plugin for ImageJ, which provides a framework for reproducible image analysis and batch processing. Slide Set organizes data into tables, associating image files with regions of interest and other relevant information. Analysis commands are automatically repeated over each image in the data set, and multiple commands can be chained together for more complex analysis tasks. All analysis parameters are saved, ensuring transparency and reproducibility. Slide Set includes a variety of built-in analysis commands and can be easily extended to automate other ImageJ plugins, reducing the manual repetition of image analysis without the set-up effort or programming expertise required for a fully automated solution.
The emerging Bicycle Sharing System (BSS) provides a new social microscope that allows us to “photograph” the main aspects of society and to create a comprehensive picture of human mobility behavior in this new medium. BSSs have been deployed in many major cities around the world as a short-distance trip supplement to public transportation and private vehicles. A unique value of the bike flow data generated by these BSSs is that they reveal human mobility on short-distance trips. This understanding of population behavior on short-distance trips is lacking, limiting our capacity to manage and operate BSSs. Many existing operations research and management methods for BSS impose assumptions that emphasize statistical simplicity and homogeneity. Therefore, a deep understanding of the statistical patterns embedded in the bike flow data is urgently needed to inform decision-making for a variety of problems including traffic prediction, station placement, bike reallocation, and anomaly detection. In this paper, we conduct a comprehensive analysis of bike flow data using two large datasets collected in Chicago and Hangzhou over a period of months. Our analysis reveals intrinsic structures of the bike flow data and regularities at both spatial and temporal scales, such as a community structure and a taxonomy of the eigen-bike-flows.
It is tempting to treat frequency trends from the Google Books data sets as indicators of the “true” popularity of various words and phrases. Doing so allows us to draw quantitatively strong conclusions about the evolution of cultural perception of a given topic, such as time or gender. However, the Google Books corpus suffers from a number of limitations which make it an obscure mask of cultural popularity. A primary issue is that the corpus is in effect a library, containing one of each book. A single, prolific author is thereby able to noticeably insert new phrases into the Google Books lexicon, whether the author is widely read or not. With this understood, the Google Books corpus remains an important data set to be considered more lexicon-like than text-like. Here, we show that a distinct problematic feature arises from the inclusion of scientific texts, which have become an increasingly substantive portion of the corpus throughout the 1900s. The result is a surge of phrases typical to academic articles but less common in general, such as references to time in the form of citations. We use information theoretic methods to highlight these dynamics by examining and comparing major contributions via a divergence measure of English data sets between decades in the period 1800-2000. We find that only the English Fiction data set from the second version of the corpus is not heavily affected by professional texts. Overall, our findings call into question the vast majority of existing claims drawn from the Google Books corpus, and point to the need to fully characterize the dynamics of the corpus before using these data sets to draw broad conclusions about cultural and linguistic evolution.
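The decade-to-decade comparison described above can be illustrated with a small sketch. The abstract only says "a divergence measure"; the code below assumes the Jensen-Shannon divergence, and the function names and toy word counts are hypothetical, not taken from the study. It returns both the total divergence and the per-word contributions, which is one way to see which words drive the difference between two decades.

```python
import math
from collections import Counter


def normalize(counts):
    """Turn raw word counts into a probability distribution."""
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def jsd_contributions(p, q):
    """Per-word contributions (in bits) to the Jensen-Shannon divergence."""
    contributions = {}
    for word in set(p) | set(q):
        pw, qw = p.get(word, 0.0), q.get(word, 0.0)
        mw = 0.5 * (pw + qw)  # mixture distribution
        term = 0.0
        if pw > 0:
            term += 0.5 * pw * math.log2(pw / mw)
        if qw > 0:
            term += 0.5 * qw * math.log2(qw / mw)
        contributions[word] = term
    return contributions


def jsd(p, q):
    """Total Jensen-Shannon divergence between two distributions."""
    return sum(jsd_contributions(p, q).values())


# Hypothetical toy counts standing in for two decades of a corpus.
decade_a = normalize(Counter({"the": 50, "love": 30, "said": 20}))
decade_b = normalize(Counter({"the": 50, "figure": 30, "et": 20}))

top_words = sorted(jsd_contributions(decade_a, decade_b).items(),
                   key=lambda item: item[1], reverse=True)
print(jsd(decade_a, decade_b), top_words[:2])
```

Words that appear with equal relative frequency in both decades contribute nothing, so the ranking isolates the vocabulary that shifted.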
Extremely large datasets have become routine in biology. However, performing a computational analysis of a large dataset can be overwhelming, especially for novices. Here, we present a step-by-step guide to computing workflows with the biologist end-user in mind. Starting from a foundation of sound data management practices, we make specific recommendations on how to approach and perform computational analyses of large datasets, with a view to enabling sound, reproducible biological research.
Individual differences in brain functional networks may be related to complex personal identifiers, including health, age, and ability. Dynamic network theory has been used to identify properties of dynamic brain function from fMRI data, but the majority of analyses and findings remain at the level of the group. Here, we apply hypergraph analysis, a method from dynamic network theory, to quantify individual differences in brain functional dynamics. Using a summary metric derived from the hypergraph formalism (hypergraph cardinality), we investigate individual variations in two separate, complementary data sets. The first data set (“multi-task”) consists of 77 individuals engaging in four consecutive cognitive tasks. We observe that hypergraph cardinality exhibits variation across individuals while remaining consistent within individuals between tasks; moreover, the analysis of one of the memory tasks revealed a marginally significant correspondence between hypergraph cardinality and age. This finding motivated a similar analysis of the second data set (“age-memory”), in which 95 individuals, aged 18-75, performed a memory task with a similar structure to the multi-task memory task. With the increased age range in the age-memory data set, the correspondence between hypergraph cardinality and age becomes significant. We discuss these results in the context of the well-known finding linking age with network structure, and suggest that hypergraph analysis should serve as a useful tool in furthering our understanding of the dynamic network structure of the brain.
We examined whether it is possible to identify individual subjects on the basis of brain anatomical features. For this, we analyzed a dataset comprising 191 subjects who were scanned three times over a period of two years. Based on FreeSurfer routines, we generated three datasets covering 148 anatomical regions (cortical thickness, area, volume). These three datasets were also combined into a single dataset containing all three measures. In addition, we used a dataset comprising 11 composite anatomical measures for which we used larger brain regions (11LBR). These datasets were subjected to a linear discriminant analysis (LDA) and a weighted K-nearest neighbors approach (WKNN) to identify single subjects. For this, we randomly chose a data subset (training set) with which we calculated the individual identification. The obtained results were applied to the remaining sample (test data). In general, we obtained excellent identification results (reasonably good results were obtained for 11LBR using WKNN). Using different data manipulation techniques (adding white Gaussian noise to the test data and changing sample sizes) still revealed very good identification results, particularly for the LDA technique. Interestingly, using the small 11LBR dataset also revealed very good results, indicating that the human brain is highly individual.
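The weighted K-nearest neighbors step can be sketched as follows. This is a minimal illustration, not the study's implementation: the abstract does not specify the weighting scheme, so inverse-distance weighting is assumed here, and the feature vectors and subject labels are hypothetical. The LDA step is not shown.

```python
import math
from collections import defaultdict


def weighted_knn_predict(train, query, k=3):
    """Identify a subject from a feature vector.

    train: list of (feature_vector, subject_id) pairs from training scans.
    query: feature vector from a held-out scan.
    """
    # Rank training scans by Euclidean distance to the query scan.
    ranked = sorted((math.dist(features, query), subject)
                    for features, subject in train)
    votes = defaultdict(float)
    for distance, subject in ranked[:k]:
        # Assumed weighting: closer scans count more (inverse distance).
        votes[subject] += 1.0 / (distance + 1e-9)
    return max(votes, key=votes.get)


# Hypothetical two-feature scans (e.g. two regional thickness values).
training_scans = [
    ((2.5, 3.1), "S1"), ((2.6, 3.0), "S1"),
    ((1.9, 2.2), "S2"), ((2.0, 2.3), "S2"),
]

print(weighted_knn_predict(training_scans, (2.55, 3.05)))
print(weighted_knn_predict(training_scans, (1.95, 2.25)))
```

With repeated scans per subject, as in the design described above, the scans not used for training play the role of the query vectors.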
Policies ensuring that research data are available on public archives are increasingly being implemented at the government, funding agency [2-4], and journal [5, 6] level. These policies are predicated on the idea that authors are poor stewards of their data, particularly over the long term, and indeed many studies have found that authors are often unable or unwilling to share their data [8-11]. However, there are no systematic estimates of how the availability of research data changes with time since publication. We therefore requested data sets from a relatively homogeneous set of 516 articles published between 2 and 22 years ago, and found that availability of the data was strongly affected by article age. For papers where the authors gave the status of their data, the odds of a data set being extant fell by 17% per year. In addition, the odds that we could find a working e-mail address for the first, last, or corresponding author fell by 7% per year. Our results reinforce the notion that, in the long term, research data cannot be reliably preserved by individual researchers, and further demonstrate the urgent need for policies mandating data sharing via public archives.
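To make the "17% per year" figure concrete, the sketch below reads it as a constant multiplicative factor of 0.83 on the odds for each additional year of article age, then converts odds to a probability. That reading, the function names, and the starting odds are assumptions for illustration only; the abstract does not report a baseline.

```python
def odds_after(initial_odds, years, yearly_drop=0.17):
    """Odds after `years`, assuming a constant proportional drop per year."""
    return initial_odds * (1.0 - yearly_drop) ** years


def odds_to_probability(odds):
    """Convert odds (p / (1 - p)) back to a probability p."""
    return odds / (1.0 + odds)


# Hypothetical baseline: odds of 4.0, i.e. an 80% chance the data are extant.
hypothetical_initial_odds = 4.0
for age in (0, 5, 10, 20):
    odds = odds_after(hypothetical_initial_odds, age)
    print(age, round(odds, 3), round(odds_to_probability(odds), 3))
```

Because the decline acts on odds rather than on probability, the drop in probability per year is not constant: it is largest when the odds are near 1 and flattens as they approach 0.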
Recent advances in genome sequencing technologies provide unprecedented opportunities to characterize individual genomic landscapes and identify mutations relevant for diagnosis and therapy. Specifically, whole-exome sequencing using next-generation sequencing (NGS) technologies is gaining popularity in the human genetics community due to the moderate costs, manageable data amounts and straightforward interpretation of analysis results. While whole-exome and, in the near future, whole-genome sequencing are becoming commodities, data analysis still poses significant challenges and has led to the development of a plethora of tools supporting specific parts of the analysis workflow or providing a complete solution. Here, we surveyed 205 tools for whole-genome/whole-exome sequencing data analysis supporting five distinct analytical steps: quality assessment, alignment, variant identification, variant annotation and visualization. We report an overview of the functionality, features and specific requirements of the individual tools. We then selected 32 programs for variant identification, variant annotation and visualization, which were subjected to hands-on evaluation using four data sets: one set of exome data from two patients with a rare disease for testing identification of germline mutations, two cancer data sets for testing variant callers for somatic mutations, copy number variations and structural variations, and one semi-synthetic data set for testing identification of copy number variations. Our comprehensive survey and evaluation of NGS tools provides a valuable guideline for human geneticists working on Mendelian disorders, complex diseases and cancers.
Human neuroscience research faces several challenges with regard to reproducibility. While scientists are generally aware that data sharing is important, it is not always clear how to share data in a manner that allows other labs to understand and reproduce published findings. Here we report a new open source tool, AFQ-Browser, that builds an interactive website as a companion to a diffusion MRI study. Because AFQ-Browser is portable (it runs in any web browser), it can facilitate transparency and data sharing. Moreover, by leveraging new web-visualization technologies to create linked views between different dimensions of the dataset (anatomy, diffusion metrics, subject metadata), AFQ-Browser facilitates exploratory data analysis, fueling new discoveries based on previously published datasets. In an era where Big Data is playing an increasingly prominent role in scientific discovery, so will browser-based tools for exploring high-dimensional datasets, communicating scientific discoveries, aggregating data across labs, and publishing data alongside manuscripts.
Open data is a vital pillar of open science and a key enabler for reproducibility, data reuse, and novel discoveries. Enforcement of open-data policies, however, largely relies on manual efforts, which invariably lag behind the increasingly automated generation of biological data. To address this problem, we developed a general approach to automatically identify datasets overdue for public release by applying text mining to identify dataset references in published articles and parse query results from repositories to determine if the datasets remain private. We demonstrate the effectiveness of this approach on 2 popular National Center for Biotechnology Information (NCBI) repositories: Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA). Our Wide-Open system identified a large number of overdue datasets, which spurred administrators to respond directly by releasing 400 datasets in one week.
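The first step of this approach, finding dataset references in article text, can be sketched with a regular expression. This is an illustrative assumption rather than the Wide-Open system's actual patterns: it matches GEO series accessions (prefix GSE) and SRA accessions (prefixes SRP, SRR, SRX, SRS), and the accession-like strings in the sample text are hypothetical placeholders. The second step, querying the repositories to check whether a dataset is still private, is not shown.

```python
import re

# Assumed patterns: GEO series (GSE + digits) and SRA study/run/
# experiment/sample accessions (SRP/SRR/SRX/SRS + digits).
ACCESSION_RE = re.compile(r"\b(GSE\d+|SR[PRXS]\d+)\b")


def find_accessions(text):
    """Return unique accession-like strings in order of first appearance."""
    found = []
    for match in ACCESSION_RE.finditer(text):
        accession = match.group(1)
        if accession not in found:
            found.append(accession)
    return found


# Hypothetical article snippet with placeholder accession numbers.
sample_text = (
    "Expression data were deposited in GEO under accession GSE12345, "
    "and raw reads in SRA under SRP067890 (see also GSE12345)."
)

print(find_accessions(sample_text))
```

Each accession found this way would then be looked up in the corresponding repository, and those that remain private after the article's publication would be flagged as overdue.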