SciCombinator

Discover the most talked about and latest scientific content & concepts.

Concept: Bioinformatics

181

Tardigrades are able to tolerate almost complete dehydration by reversibly switching to an ametabolic state. This ability is called anhydrobiosis. In the anhydrobiotic state, tardigrades can withstand various extreme environments including space, but the molecular basis of this tolerance remains largely unknown. Late embryogenesis abundant (LEA) proteins are heat-soluble proteins and can prevent protein aggregation in dehydrated conditions in other anhydrobiotic organisms, but their relevance to tardigrade anhydrobiosis has not been clarified. In this study, we focused on the heat-soluble property characteristic of LEA proteins and conducted heat-soluble proteomics using an anhydrobiotic tardigrade. Our heat-soluble proteomics identified five abundant heat-soluble proteins. None of them showed sequence similarity with LEA proteins, and they formed two novel protein families with distinct subcellular localizations. We named them Cytoplasmic Abundant Heat Soluble (CAHS) and Secretory Abundant Heat Soluble (SAHS) protein families, according to their localization. Both protein families were conserved among tardigrades, but not found in other phyla. Although CAHS protein was intrinsically unstructured and SAHS protein was rich in β-structure in the hydrated condition, proteins in both families changed their conformation to an α-helical structure in water-deficient conditions, as LEA proteins do. Two conserved repeats of 19-mer motifs in CAHS proteins were capable of forming amphiphilic stripes in α-helices, suggesting their roles as molecular shields in water-deficient conditions, though the charge distribution patterns in α-helices were different between CAHS and LEA proteins. Tardigrades might have evolved novel protein families with a heat-soluble property, and this study revealed a novel repertoire of major heat-soluble proteins in these anhydrobiotic animals.

Concepts: DNA, Proteins, Protein, Cell, Bioinformatics, Molecular biology, Tardigrade, Cryptobiosis

179

The 20th annual Database Issue of Nucleic Acids Research includes 176 articles, half of which describe new online molecular biology databases, while the other half provide updates on databases previously featured in NAR and other journals. This year’s highlights include two databases of DNA repeat elements; several databases of transcriptional factors and transcriptional factor-binding sites; databases on various aspects of protein structure and protein-protein interactions; databases for metagenomic and rRNA sequence analysis; and four databases specifically dedicated to Escherichia coli. The increased emphasis on using genome data to improve human health is reflected in the development of databases of genomic structural variation (NCBI’s dbVar and EBI’s DGVa), the NIH Genetic Testing Registry and several other databases centered on the genetic basis of human disease, potential drugs, their targets and the mechanisms of protein-ligand binding. Two new databases present genomic and RNAseq data for monkeys, providing a wealth of data on our closest relatives for comparative genomics purposes. The NAR online Molecular Biology Database Collection, available at http://www.oxfordjournals.org/nar/database/a/, has been updated and currently lists 1512 online databases. The full content of the Database Issue is freely available online on the Nucleic Acids Research website (http://nar.oxfordjournals.org/).

Concepts: DNA, Protein, Gene, Genetics, Bioinformatics, Molecular biology, RNA, Genomics

176

SUMMARY: InterMine is an open-source data warehouse system that facilitates the building of databases with complex data integration requirements and a need for a fast, customisable query facility. Using InterMine, large biological databases can be created from a range of heterogeneous data sources, and the extensible data model allows for easy integration of new data types. The analysis tools include a flexible query builder, genomic region search, and a library of “widgets” performing various statistical analyses. The results can be exported in many commonly used formats. InterMine is a fully extensible framework where developers can add new tools and functionality. Additionally, there is a comprehensive set of web services, for which client libraries are provided in five commonly used programming languages. AVAILABILITY: Freely available from http://www.intermine.org under the LGPL license. CONTACT: g.micklem@gen.cam.ac.uk SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Concepts: Bioinformatics, Statistics, Model organism, Data, Programming language, Data management, Type system, Biological data

175

Metagenomes are often characterized by high levels of unknown sequences. Reads derived from known microorganisms can easily be identified and analyzed using fast homology search algorithms and a suitable reference database, but the unknown sequences are often ignored in further analyses, biasing conclusions. Nevertheless, it is possible to use more data in a comparative metagenomic analysis by creating a cross-assembly of all reads, i.e. a single assembly of reads from different samples. Comparative metagenomics studies the interrelationships between metagenomes from different samples. Using an assembly algorithm is a fast and intuitive way to link (partially) homologous reads without requiring a database of reference sequences.
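
The bookkeeping behind a cross-assembly can be sketched as follows. This is a minimal illustration, not the authors' tool: it assumes an assembler has already assigned each read to a contig, and the function names and the toy read, contig and sample identifiers are all hypothetical.

```python
from collections import Counter, defaultdict


def contig_sample_counts(read_to_contig, read_to_sample):
    """For each cross-assembled contig, count the reads contributed by each sample."""
    counts = defaultdict(Counter)
    for read_id, contig_id in read_to_contig.items():
        counts[contig_id][read_to_sample[read_id]] += 1
    return {contig: dict(per_sample) for contig, per_sample in counts.items()}


def shared_contigs(counts, sample_a, sample_b):
    """Return the contigs that contain reads from both samples."""
    return sorted(
        contig
        for contig, per_sample in counts.items()
        if sample_a in per_sample and sample_b in per_sample
    )


# Toy, made-up assignments: reads r1..r5 from samples A and B, assembled together.
read_to_contig = {"r1": "c1", "r2": "c1", "r3": "c2", "r4": "c2", "r5": "c3"}
read_to_sample = {"r1": "A", "r2": "B", "r3": "A", "r4": "A", "r5": "B"}

counts = contig_sample_counts(read_to_contig, read_to_sample)
shared = shared_contigs(counts, "A", "B")
```

Contigs that draw reads from more than one sample link those samples without any reference database, which is the property the abstract relies on.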

Concepts: Algorithm, Bioinformatics, Metagenomics, Mathematical analysis, Homology

175

As the volume, complexity and diversity of the information that scientists work with on a daily basis continue to rise, so too does the requirement for new analytic software. Such software must resolve the tension between the need to support a high level of scientific reasoning and the requirement for an intuitive, easy-to-use tool that does not demand specialist, and often arduous, training. Information visualization provides a solution to this problem, as it allows for direct manipulation of and interaction with diverse and complex data. The challenge facing bioinformatics researchers is how to apply this knowledge to data sets that are continually growing in a field that is rapidly changing.

Concepts: Scientific method, Psychology, Bioinformatics, Genomics, Emergence, Logic, Problem solving, Functional genomics

174

BACKGROUND: MEDLINE®/PubMed® indexes over 20 million biomedical articles, providing curated annotation of contents using a controlled vocabulary known as Medical Subject Headings (MeSH). The MeSH vocabulary, developed over 50+ years, provides broad coverage of topics across biomedical research. Distilling the essential biomedical themes for a topic of interest from the relevant literature is important both to understand the importance of related concepts and to discover new relationships. RESULTS: We introduce a novel method for determining enriched curator-assigned MeSH annotations in a set of papers associated with a topic, such as a gene, an author or a disease. We generate MeSH Over-representation Profiles (MeSHOPs) to quantitatively summarize the annotations in a form convenient for further computational analysis and visualization. Based on a hypergeometric distribution of assigned terms, MeSHOPs statistically account for the prevalence of the associated biomedical annotation while highlighting unusually prevalent terms based on a specified background. MeSHOPs can be visualized using word clouds, providing a succinct quantitative graphical representation of the relative importance of terms. Using the publication dates of articles, MeSHOPs track changing patterns of annotation over time. Since MeSHOPs are quantitative vectors, they can be compared using standard techniques such as hierarchical clustering. The reliability of MeSHOP annotations is assessed based on the capacity to re-derive the subset of the Gene Ontology annotations with equivalent MeSH terms. CONCLUSIONS: MeSHOPs allow quantitative measurement of the degree of association between any entity and the annotated medical concepts, based directly on relevant primary literature. Comparison of MeSHOPs allows entities to be related based on shared medical themes in their literature. A web interface is provided for generating and visualizing MeSHOPs.
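
A hypergeometric over-representation score of the kind mentioned above can be sketched as an upper-tail probability. The abstract does not give the exact test, the background definition or any multiple-testing correction, so the function name, the parameter names and the one-sided tail used here are assumptions.

```python
from math import comb


def hypergeom_enrichment_pvalue(N, K, n, k):
    """Upper-tail probability P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: papers in the background set
    K: background papers annotated with the term
    n: papers associated with the topic of interest
    k: topic papers annotated with the term
    """
    total = comb(N, n)
    # Sum the probability mass for k, k+1, ..., min(K, n) annotated papers.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / total


# Toy numbers (made up): 10 background papers, 3 carry the term,
# and 2 of the 4 topic papers carry it.
p = hypergeom_enrichment_pvalue(N=10, K=3, n=4, k=2)
```

A small value indicates that the term appears in the topic's papers more often than the background prevalence would suggest; computing one such value per term gives a vector that can be compared across topics.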

Concepts: Bioinformatics, Controlled vocabulary, Medical research, Mesh, Medical Subject Headings, GoPubMed

173

The Ensembl Project provides release-specific Perl APIs for efficient high-level programmatic access to data stored in various Ensembl database schemas. Although Perl scripts are well suited to processing large volumes of text-based data, Perl is not ideal for developing large-scale software applications or for embedding in graphical interfaces. The provision of a novel Java API would facilitate type-safe, modular, object-orientated development of new bioinformatics tools with which to access, analyse and visualize Ensembl data.

Concepts: Bioinformatics, Database, Computer program, C, Application programming interface, Graphical user interface, Computer software, Application software

172

BACKGROUND: With the development of high-throughput methods of gene analysis, there is a growing need for mining tools to retrieve relevant articles in PubMed. As PubMed grows, literature searches become more complex and time-consuming. Automated search tools with good precision and recall are necessary. We developed GO2PUB to automatically enrich PubMed queries with gene names, symbols and synonyms annotated by a GO term of interest or one of its descendants. RESULTS: GO2PUB enriches PubMed queries based on selected GO terms and keywords. It processes the result and displays the PMID, title, authors, abstract and bibliographic references of the articles. Gene names, symbols and synonyms that have been generated as extra keywords from the GO terms are also highlighted. GO2PUB is based on a semantic expansion of PubMed queries using the semantic inheritance between terms through the GO graph. Two experts manually assessed the relevance of GO2PUB, GoPubMed and PubMed on three queries about lipid metabolism. Experts' agreement was high (kappa=0.88). GO2PUB returned 69% of the relevant articles, GoPubMed 40% and PubMed 29%. GO2PUB and GoPubMed have 17% of their results in common, corresponding to 24% of the total number of relevant results. 70% of the articles returned by more than one tool were relevant. 36% of the relevant articles were returned only by GO2PUB, 17% only by GoPubMed and 14% only by PubMed. To determine whether these results can be generalized, we generated twenty queries based on random GO terms with a granularity similar to those of the first three queries and compared the proportions of GO2PUB and GoPubMed results. These were 77% and 40%, respectively, for the first queries, and 70% and 38% for the random queries. The two experts also assessed the relevance of seven of the twenty queries (the three related to lipid metabolism and four related to other domains). Expert agreement was high (0.93 and 0.8). GO2PUB and GoPubMed performances were similar to those of the first queries. CONCLUSIONS: We demonstrated that the use of genes annotated by either GO terms of interest or a descendant of these GO terms yields some relevant articles ignored by other tools. The comparison of GO2PUB, based on semantic expansion, with GoPubMed, based on text mining techniques, showed that the two tools are complementary. The analysis of the randomly generated queries suggests that the results obtained about lipid metabolism can be generalized to other biological processes. GO2PUB is available at http://go2pub.genouest.org.
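
The semantic expansion described above, in which a query is enriched with genes annotated by a term or any of its descendants, can be sketched on a toy graph. This is not GO2PUB's implementation: the term identifiers, gene names, function names and the boolean query format are all hypothetical.

```python
def descendants(term, children):
    """Return the term itself plus every term reachable through child links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(children.get(current, ()))
    return seen


def expand_query(keywords, term, children, annotations):
    """Append an OR-clause of all genes annotated by the term or its descendants."""
    genes = sorted(
        {gene for t in descendants(term, children) for gene in annotations.get(t, ())}
    )
    if not genes:
        return keywords
    gene_clause = " OR ".join(f'"{gene}"' for gene in genes)
    return f"({keywords}) AND ({gene_clause})"


# Toy, made-up ontology fragment and annotations (not real GO terms or genes).
children = {"T:root": ["T:a", "T:b"], "T:a": ["T:c"]}
annotations = {"T:root": ["geneX"], "T:c": ["geneY", "geneZ"], "T:b": []}

query = expand_query("lipid metabolism", "T:a", children, annotations)
```

Here the term T:a has no genes of its own, yet the query still picks up geneY and geneZ through its descendant T:c, which is the inheritance effect the abstract credits for retrieving articles that other tools miss.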

Concepts: Protein, Gene, Bioinformatics, Evolution, Organism, Mining, Information retrieval, Gene Ontology

171

We have developed Cake, a bioinformatics software pipeline that integrates four publicly available somatic variant-calling algorithms to identify single nucleotide variants with higher sensitivity and accuracy than any one algorithm alone. Cake can be run on a high-performance computer cluster or used as a standalone application.
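
One simple way to combine several variant callers is a support threshold: keep a variant only if enough callers report it. The abstract does not state how Cake merges its four algorithms, so the voting rule, the `min_support` parameter and the caller names below are assumptions made purely for illustration.

```python
from collections import Counter


def consensus_calls(calls_by_caller, min_support=2):
    """Keep variants reported by at least `min_support` callers.

    Each variant is a (chromosome, position, ref, alt) tuple.
    """
    support = Counter()
    for calls in calls_by_caller.values():
        # Use a set so a caller that lists a variant twice still counts once.
        support.update(set(calls))
    return sorted(variant for variant, n in support.items() if n >= min_support)


# Toy, made-up calls from four hypothetical callers.
v1 = ("chr1", 100, "A", "T")
v2 = ("chr1", 250, "G", "C")
v3 = ("chr2", 50, "C", "T")
v4 = ("chr2", 75, "T", "G")

calls_by_caller = {
    "caller_1": [v1, v2],
    "caller_2": [v1, v3],
    "caller_3": [v1, v2],
    "caller_4": [v4],
}

consensus = consensus_calls(calls_by_caller, min_support=2)
```

Raising `min_support` trades sensitivity for fewer false positives, while lowering it to 1 gives the union of all callers.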

Concepts: DNA, Algorithm, Bioinformatics, Computer, Computer program, Computer science, Biostatistics

171

MOTIVATION: BLAST remains one of the most widely used tools in computational biology. The rate at which new sequence data becomes available continues to grow exponentially, driving the emergence of new fields of biological research. At the same time, multicore systems and conventional clusters are more accessible. ScalaBLAST has been designed to run on conventional multiprocessor systems with an eye to extreme parallelism, enabling parallel BLAST calculations using over 16,000 processing cores with a portable, robust, fault-resilient design that introduces little to no overhead with respect to serial BLAST. ScalaBLAST 2.0 source code can be freely downloaded from http://omics.pnl.gov/software/ScalaBLAST.php.

Concepts: Bioinformatics, Biology, Parallel computing, Computer program, Computational biology, C, Source code, Exponential growth