Metagenomes are often characterized by high levels of unknown sequences. Reads derived from known microorganisms can easily be identified and analyzed using fast homology search algorithms and a suitable reference database, but the unknown sequences are often ignored in further analyses, biasing conclusions. Nevertheless, it is possible to use more data in a comparative metagenomic analysis by creating a cross-assembly of all reads, i.e. a single assembly of reads from different samples. Comparative metagenomics studies the interrelationships between metagenomes from different samples. Using an assembly algorithm is a fast and intuitive way to link (partially) homologous reads without requiring a database of reference sequences.
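To illustrate how a cross-assembly can link samples without a reference database, the following minimal sketch counts, for each contig, how many reads each sample contributed and lists the contigs that contain reads from more than one sample. The input format (a list of `(contig_id, sample_id)` pairs, one per assembled read) and the function names are assumptions made for illustration; they are not taken from any particular assembler or from the tool described above.

```python
from collections import Counter, defaultdict


def contig_sample_counts(read_assignments):
    """Count reads per sample for every contig.

    read_assignments: iterable of (contig_id, sample_id) pairs, one pair per
    read that the assembler placed in a contig (hypothetical input format).
    """
    counts = defaultdict(Counter)
    for contig_id, sample_id in read_assignments:
        counts[contig_id][sample_id] += 1
    return counts


def cross_contigs(counts):
    """Return the contigs that contain reads from more than one sample."""
    return sorted(c for c, per_sample in counts.items() if len(per_sample) > 1)


# Toy data: contig c1 is shared by samples A and B; c2 and c3 are sample-specific.
toy_assignments = [
    ("c1", "A"), ("c1", "B"),
    ("c2", "A"), ("c2", "A"),
    ("c3", "B"),
]
toy_counts = contig_sample_counts(toy_assignments)
shared = cross_contigs(toy_counts)
```

Contigs returned by `cross_contigs` are the ones that connect samples; the per-sample counts could then feed whatever distance measure a comparative analysis uses.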
BACKGROUND: The metagenomics approach allows the simultaneous sequencing of all genomes in an environmental sample. This results in high complexity datasets, where in addition to repeats and sequencing errors, the number of genomes and their abundance ratios are unknown. Recently developed next-generation sequencing (NGS) technologies significantly improve the sequencing efficiency and cost. On the other hand, they result in shorter reads, which makes the separation of reads from different species harder. Among the existing computational tools for metagenomic analysis, there are similarity-based methods that use reference databases to align reads and composition-based methods that use composition patterns (i.e., frequencies of short words or l-mers) to cluster reads. Similarity-based methods are unable to classify reads from unknown species without close references (which constitute the majority of reads). Since composition patterns are preserved only in significantly large fragments, composition-based tools cannot be used for very short reads, which becomes a significant limitation with the development of NGS. A recently proposed algorithm, AbundanceBin, introduced another method that bins reads based on predicted abundances of the genomes sequenced. However, it does not separate reads from genomes of similar abundance levels. RESULTS: In this work, we present a two-phase heuristic algorithm for separating short paired-end reads from different genomes in a metagenomic dataset. We use the observation that most of the l-mers belong to unique genomes when l is sufficiently large. The first phase of the algorithm results in clusters of l-mers each of which belongs to one genome. During the second phase, clusters are merged based on l-mer repeat information. These final clusters are used to assign reads. The algorithm could handle very short reads and sequencing errors. 
It was initially designed for genomes with similar abundance levels and was then extended to handle arbitrary abundance ratios. The software can be downloaded for free at http://www.cs.ucr.edu/~tanaseio/toss.htm. CONCLUSIONS: Our tests on a large number of simulated metagenomic datasets concerning species at various phylogenetic distances demonstrate that genomes can be separated if the number of common repeats is smaller than the number of genome-specific repeats. For such genomes, our method can separate NGS reads with a high precision and sensitivity.
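The observation that most l-mers belong to a single genome when l is sufficiently large can be checked on toy data. The sketch below is not the authors' two-phase algorithm; it only measures, for a chosen l, the fraction of distinct l-mers that occur in exactly one genome. The function names and the two toy sequences are assumptions made for illustration.

```python
from collections import defaultdict


def lmers(seq, l):
    """Yield every length-l substring (l-mer) of seq."""
    for i in range(len(seq) - l + 1):
        yield seq[i:i + l]


def unique_lmer_fraction(genomes, l):
    """Fraction of distinct l-mers that occur in exactly one genome.

    genomes: dict mapping a genome name to its sequence (toy input).
    Returns 0.0 when no l-mer of that length exists.
    """
    owners = defaultdict(set)
    for name, seq in genomes.items():
        for mer in lmers(seq, l):
            owners[mer].add(name)
    if not owners:
        return 0.0
    unique = sum(1 for names in owners.values() if len(names) == 1)
    return unique / len(owners)


# Two made-up sequences, used only to exercise the functions.
toy_genomes = {
    "genome_a": "ACGTACGTTTGACCA",
    "genome_b": "ACGTTGGACCATTAC",
}

for l in (1, 4, 15):
    print(l, unique_lmer_fraction(toy_genomes, l))
```

On real genomes the interesting quantity is how quickly this fraction approaches 1 as l grows, since that determines how cleanly l-mers can be clustered by genome.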
BACKGROUND: 454 pyrosequencing is a commonly used massively parallel DNA sequencing technology with a wide variety of application fields such as epigenetics, metagenomics and transcriptomics. A well-known problem of this platform is its sensitivity to base-calling insertion and deletion errors, particularly in the presence of long homopolymers. In addition, the base-call quality scores are not informative with respect to whether an insertion or a deletion error is more likely. Surprisingly, not much effort has been devoted to the development of improved base-calling methods and more intuitive quality scores for this platform. RESULTS: We present HPCall, a 454 base-calling method based on a weighted Hurdle Poisson model. HPCall uses a probabilistic framework to call the homopolymer lengths in the sequence by modeling well-known 454 noise predictors. Base-calling quality is assessed based on estimated probabilities for each homopolymer length, which are easily transformed to useful quality scores. CONCLUSIONS: Using a reference data set of the Escherichia coli K-12 strain, we show that HPCall produces superior quality scores that are very informative towards possible insertion and deletion errors, while maintaining a base-calling accuracy that is better than the current one. Given the generality of the framework, HPCall has the potential to also adapt to other homopolymer-sensitive sequencing technologies.
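As a rough illustration of how estimated probabilities for each homopolymer length could be turned into a call and a quality score, the sketch below uses a plain (unweighted) hurdle Poisson distribution with fixed parameters. This is not HPCall's implementation: the parameter names `p_zero` and `lam`, the fixed parameter values, and the Phred-style transform are all assumptions made for illustration. The abstract states neither how the model is fitted to the 454 noise predictors nor how the probabilities are converted to quality scores.

```python
import math


def hurdle_poisson_pmf(k, p_zero, lam):
    """P(length = k) under a hurdle Poisson model.

    p_zero: probability that the homopolymer length is 0 (the hurdle part).
    lam:    rate of the zero-truncated Poisson used for lengths >= 1.
    """
    if k < 0:
        return 0.0
    if k == 0:
        return p_zero
    poisson_k = math.exp(-lam) * lam ** k / math.factorial(k)
    truncated = poisson_k / (1.0 - math.exp(-lam))  # condition on k >= 1
    return (1.0 - p_zero) * truncated


def call_homopolymer(p_zero, lam, max_len=20):
    """Return (most probable length, Phred-style quality of that call)."""
    probs = [hurdle_poisson_pmf(k, p_zero, lam) for k in range(max_len + 1)]
    best = max(range(len(probs)), key=probs.__getitem__)
    p_error = 1.0 - probs[best]
    # Assumed Phred-style transform; floor p_error to avoid log10(0).
    quality = -10.0 * math.log10(max(p_error, 1e-10))
    return best, quality


# Hypothetical parameter values, chosen only to exercise the functions.
length, quality = call_homopolymer(p_zero=0.05, lam=2.5)
```

Because the full probability vector is available, the probabilities of the neighbouring lengths (`best - 1` and `best + 1`) could also be reported, which is one way a score might indicate whether an insertion or a deletion error is more likely.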
Tricholoma matsutake, an ectomycorrhizal fungus that forms a mutualistic relationship with the rootlets of Pinus densiflora, produces a fruiting body that serves as a valuable food in Asia. However, the artificial culture of this fungus has not been successful. Soil fungi, including T. matsutake, coexist with many other microorganisms and plants; therefore, complex microbial communities have an influence on the fruiting body formation of T. matsutake. Here, we report on the structures of fungal communities associated with the fairy ring of T. matsutake through the pyrosequencing method. Soil samples were collected inside the fairy ring zone, in the fairy ring zone, and outside the fairy ring zone. A total of 37,125 sequencing reads were obtained and 728 to 1,962 Operational Taxonomic Units (OTUs) were observed in the sampling zones. The fairy ring zone had the fewest OTUs and the lowest fungal diversity of all sampling zones. The numbers of OTUs and fungal taxa inside and outside the fairy ring zone were, respectively, about two times and 1.5 times higher than in the fairy ring zone. Taxonomic analysis showed that each sampling zone has different fungal communities. In particular, out of 209 genera in total, six genera, such as Hemimycena, were uniquely present in the fairy ring zone, and 31 genera, such as Mycena, Boletopsis, and Repetophragma, were specifically absent from it. The results of metagenomic analysis based on pyrosequencing indicate a decrease in fungal communities in the fairy ring zone and changes in fungal communities depending on the fairy ring growth of T. matsutake.
Metagenomics is a trending research area, calling for the need to analyze large quantities of data generated from next generation DNA sequencing technologies. The need to store, retrieve, analyze, share, and visualize such data challenges current online computational systems. Interpretation and annotation of specific information is especially a challenge for metagenomic data sets derived from environmental samples, because current annotation systems only offer broad classification of microbial diversity and function. Moreover, existing resources are not configured to readily address common questions relevant to environmental systems. Here we developed a new online user-friendly metagenomic analysis server called MetaStorm (http://bench.cs.vt.edu/MetaStorm/), which facilitates customization of computational analysis for metagenomic data sets. Users can upload their own reference databases to tailor the metagenomics annotation to focus on various taxonomic and functional gene markers of interest. MetaStorm offers two major analysis pipelines: an assembly-based annotation pipeline and the standard read annotation pipeline used by existing web servers. These pipelines can be selected individually or together. Overall, MetaStorm provides enhanced interactive visualization to allow researchers to explore and manipulate taxonomy and functional annotation at various levels of resolution.
We describe MetAMOS, an open source and modular metagenomic assembly and analysis pipeline. MetAMOS represents an important step towards fully automated metagenomic analysis, starting with next-generation sequencing reads and producing genomic scaffolds, open-reading frames and taxonomic or functional annotations. MetAMOS can aid in reducing assembly errors, commonly encountered when assembling metagenomic samples, and improves taxonomic assignment accuracy while also reducing computational cost. MetAMOS can be downloaded from: https://github.com/treangen/MetAMOS.
We investigated methanotrophic bacteria in slightly alkaline surface water (pH 7.4-8.7) of oilsands tailings ponds in Fort McMurray, Canada. These large lakes (up to 10 km²) contain water, silt, clay and residual hydrocarbons that are not recovered in oilsands mining. They are primarily anoxic and produce methane but have an aerobic surface layer. Aerobic methane oxidation was measured in the surface water at rates up to 152 nmol CH₄ ml⁻¹ water d⁻¹. Microbial diversity was investigated via pyrotag sequencing of amplified 16S rRNA genes, as well as by analysis of methanotroph-specific pmoA genes using both pyrosequencing and microarray analysis. The predominantly detected methanotroph in surface waters at all sampling times was an uncultured species related to the gammaproteobacterial genus Methylocaldum, although a few other methanotrophs were also detected, including Methylomonas spp. Active species were identified via ¹³CH₄ stable isotope probing (SIP) of DNA, combined with pyrotag sequencing and shotgun metagenomic sequencing of heavy ¹³C-DNA. The SIP-PCR results demonstrated that the Methylocaldum and Methylomonas spp. actively consumed methane in fresh tailings pond water. Metagenomic analysis of DNA from the heavy SIP fraction verified the PCR-based results and identified additional pmoA genes not detected via PCR. The metagenome indicated that the overall methylotrophic community possessed known pathways for formaldehyde oxidation, carbon fixation and detoxification of nitrogenous compounds but appeared to possess only particulate methane monooxygenase not soluble methane monooxygenase.
Several thousand metagenomes have already been sequenced, and this number is set to grow rapidly in the forthcoming years as the uptake of high-throughput sequencing technologies continues. Hand-in-hand with this data bonanza comes the computationally overwhelming task of analysis. Herein, we describe some of the bioinformatic approaches currently used by metagenomics researchers to analyze their data, the issues they face and the steps that could be taken to help overcome these challenges.
Decomposition of lignocelluloses by cooperative microbial actions is an essential process of carbon cycling in nature and provides a basis for biomass conversion to fuels and chemicals in biorefineries. In this study, structurally stable symbiotic aero-tolerant lignocellulose-degrading microbial consortia were obtained from biodiversified microflora present in industrial sugarcane bagasse pile (BGC-1), cow rumen fluid (CRC-1), and pulp mill activated sludge (ASC-1) by successive subcultivation on rice straw under facultative anoxic conditions. Tagged 16S rRNA gene pyrosequencing revealed that all isolated consortia, although originating from highly diverse environmental microflora, shared similar composite phylum profiles comprising mainly Firmicutes, reflecting convergent adaptation of microcosm structures, albeit with substantial differences at the refined genus level. BGC-1, comprising cellulolytic Clostridium and Acetanaerobacterium in stable coexistence with ligninolytic Ureibacillus, showed the highest capability for degradation of agricultural residues and industrial pulp waste, with CMCase, xylanase, and β-glucanase activities in the supernatant. Shotgun pyrosequencing of the BGC-1 metagenome indicated a markedly high relative abundance of genes encoding glycosyl hydrolases, particularly lignocellulolytic enzymes in 26 families. The enzyme system comprised a unique composition of main-chain degrading and side-chain processing hydrolases, dominated by GH2, 3, 5, 9, 10, and 43, reflecting adaptation of enzyme profiles to the specific substrate. Gene mapping showed the metabolic potential of BGC-1 for conversion of biomass sugars to various fermentation products of industrial importance. The symbiotic consortium is a promising simplified model for the study of multispecies mechanisms in consolidated bioprocessing and a platform for discovering efficient synergistic enzyme systems for biotechnological application.
A metagenomic approach based on target-independent next-generation sequencing has become an established method for the detection of both known and novel viruses in clinical samples. This study aimed to use the metagenomic sequencing approach to characterise the viral diversity in respiratory samples from patients with respiratory tract infections. We investigated 86 respiratory samples received from various hospitals in Kuwait between 2015 and 2016 for the diagnosis of respiratory tract infections. Viruses were characterised by a metagenomic approach using a next-generation sequencer. According to the metagenomic analysis, an average of 145,019 reads were identified, and 2% of these reads were of viral origin. Metagenomic analysis of the viral sequences revealed many known respiratory viruses, which were detected in 30.2% of the clinical samples. Sequences of non-respiratory viruses were detected in 14% of the clinical samples, while sequences of non-human viruses were detected in 55.8% of the clinical samples. The average genome coverage of the viruses was 12%, with the highest genome coverage of 99.2% for respiratory syncytial virus and the lowest of 1% for torque teno midi virus 2. Our results showed 47.7% agreement between multiplex Real-Time PCR and metagenomic sequencing in the detection of respiratory viruses in the clinical samples. Though there are some difficulties in applying this method to clinical samples, such as specimen quality, these observations indicate the promising utility of the metagenomic sequencing approach for the identification of respiratory viruses in patients with respiratory tract infections.