Concept: Full genome sequencing
The human gut harbors thousands of bacterial taxa. A profusion of metagenomic sequence data has been generated from human stool samples in the last few years, raising the question of whether more taxa remain to be identified. We assessed metagenomic data generated by the Human Microbiome Project Consortium to determine if novel taxa remain to be discovered in stool samples from healthy individuals. To do this, we established a rigorous bioinformatics pipeline that uses sequence data from multiple platforms (Illumina GAIIX and Roche 454 FLX Titanium) and approaches (whole-genome shotgun and 16S rDNA amplicons) to validate novel taxa. We applied this approach to stool samples from 11 healthy subjects collected as part of the Human Microbiome Project. We discovered several low-abundance, novel bacterial taxa, which span three major phyla in the bacterial tree of life. We determined that these taxa are present in a larger set of Human Microbiome Project subjects and are found in two sampling sites (Houston and St. Louis). We show that the number of false-positive novel sequences (primarily chimeric sequences) would have been two orders of magnitude higher than the true number of novel taxa without validation using multiple datasets, highlighting the importance of establishing rigorous standards for the identification of novel taxa in metagenomic data. The majority of novel sequences are related to the recently discovered genus Barnesiella, further encouraging efforts to characterize the members of this genus and to study their roles in the microbial communities of the gut. A better understanding of the effects of less-abundant bacteria is important as we seek to understand the complex gut microbiome in healthy individuals and link changes in the microbiome to disease.
BACKGROUND: DNA analysis of ancient skeletal remains is invaluable in evolutionary biology for exploring the history of species, including humans. Contemporary human bones and teeth, however, are relevant in forensic DNA analyses that deal with the identification of perpetrators, missing persons, disaster victims or family relationships. They may also provide useful information towards unravelling controversies that surround famous historical individuals. Retrieving information about a deceased person’s externally visible characteristics can be informative in both types of DNA analyses. Recently, we demonstrated that human eye and hair colour can be reliably predicted from DNA using the HIrisPlex system. Here we test the feasibility of the novel HIrisPlex system at establishing eye and hair colour of deceased individuals from skeletal remains of various post-mortem time ranges and storage conditions. METHODS: Twenty-one teeth between 1 and approximately 800 years of age and 5 contemporary bones were subjected to DNA extraction using standard organic protocol followed by analysis using the HIrisPlex system. RESULTS: Twenty-three out of 26 bone DNA extracts yielded the full 24 SNP HIrisPlex profile, therefore successfully allowing model-based eye and hair colour prediction. HIrisPlex analysis of a tooth from the Polish general Władysław Sikorski (1881 to 1943) revealed blue eye colour and blond hair colour, which was positively verified from reliable documentation. The partial profiles collected in the remaining three cases (two contemporary samples and a 14th century sample) were sufficient for eye colour prediction. CONCLUSIONS: Overall, we demonstrate that the HIrisPlex system is suitable, sufficiently sensitive and robust to successfully predict eye and hair colour from ancient and contemporary skeletal remains. Our findings, therefore, highlight the HIrisPlex system as a promising tool in future routine forensic casework involving skeletal remains, including ancient DNA studies, for the prediction of eye and hair colour of deceased individuals.
Remarkable advances in DNA sequencing technology have created a need for de novo genome assembly methods tailored to work with the new sequencing data types. Many such methods have been published in recent years, but assembling raw sequence data to obtain a draft genome has remained a complex, multi-step process, involving several stages of sequence data cleaning, error correction, assembly, and quality control. Successful application of these steps usually requires intimate knowledge of a diverse set of algorithms and software. We present an assembly pipeline called A5 (Andrew And Aaron’s Awesome Assembly pipeline) that simplifies the entire genome assembly process by automating these stages, by integrating several previously published algorithms with new algorithms for quality control and automated assembly parameter selection. We demonstrate that A5 can produce assemblies of quality comparable to a leading assembly algorithm, SOAPdenovo, without any prior knowledge of the particular genome being assembled and without the extensive parameter tuning required by the other assembly algorithm. In particular, the assemblies produced by A5 exhibit 50% or more reduction in broken protein coding sequences relative to SOAPdenovo assemblies. The A5 pipeline can also assemble Illumina sequence data from libraries constructed by the Nextera (transposon-catalyzed) protocol, which have markedly different characteristics to mechanically sheared libraries. Finally, A5 has modest compute requirements, and can assemble a typical bacterial genome on current desktop or laptop computer hardware in under two hours, depending on depth of coverage.
We provide a novel method, DRISEE (duplicate read inferred sequencing error estimation), to assess sequencing quality (alternatively referred to as “noise” or “error”) within and/or between sequencing samples. DRISEE provides positional error estimates that can be used to inform read trimming within a sample. It also provides global (whole sample) error estimates that can be used to identify samples with high or varying levels of sequencing error that may confound downstream analyses, particularly in the case of studies that utilize data from multiple sequencing samples. For shotgun metagenomic data, we believe that DRISEE provides estimates of sequencing error that are more accurate and less constrained by technical limitations than existing methods that rely on reference genomes or the use of scores (e.g. Phred). Here, DRISEE is applied to (non amplicon) data sets from both the 454 and Illumina platforms. The DRISEE error estimate is obtained by analyzing sets of artifactual duplicate reads (ADRs), a known by-product of both sequencing platforms. We present DRISEE as an open-source, platform-independent method to assess sequencing error in shotgun metagenomic data, and utilize it to discover previously uncharacterized error in de novo sequence data from the 454 and Illumina sequencing platforms.
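The idea of inferring error from duplicate reads can be illustrated with a small sketch. This is not the DRISEE implementation: it assumes, for illustration only, that reads sharing an identical prefix are duplicates of one another, takes the most common base at each later position as the consensus, and counts disagreements with it. The function name `positional_error_rates` and the `prefix_len` parameter are hypothetical.

```python
from collections import Counter, defaultdict


def positional_error_rates(reads, prefix_len=20):
    """Estimate a per-position mismatch rate from groups of putative duplicate reads.

    Assumption (illustrative, not taken from DRISEE itself): reads that share
    an identical prefix of length `prefix_len` are treated as duplicates.
    """
    # Bin reads by their prefix.
    bins = defaultdict(list)
    for read in reads:
        if len(read) >= prefix_len:
            bins[read[:prefix_len]].append(read)

    mismatches = Counter()
    observed = Counter()
    for group in bins.values():
        if len(group) < 2:
            continue  # a lone read carries no duplicate information
        shortest = min(len(r) for r in group)
        # Compare bases only beyond the prefix, which is identical by construction.
        for pos in range(prefix_len, shortest):
            column = [r[pos] for r in group]
            _, consensus_count = Counter(column).most_common(1)[0]
            mismatches[pos] += len(column) - consensus_count
            observed[pos] += len(column)

    # Mismatch rate per position, keyed by 0-based read position.
    return {pos: mismatches[pos] / observed[pos] for pos in sorted(observed)}
```

A per-position rate of this kind could inform a trimming threshold, while averaging over all positions would give a rough whole-sample figure.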
BACKGROUND: 454 pyrosequencing is a commonly used massively parallel DNA sequencing technology with a wide variety of application fields such as epigenetics, metagenomics and transcriptomics. A well-known problem of this platform is its sensitivity to base-calling insertion and deletion errors, particularly in the presence of long homopolymers. In addition, the base-call quality scores are not informative with respect to whether an insertion or a deletion error is more likely. Surprisingly, not much effort has been devoted to the development of improved base-calling methods and more intuitive quality scores for this platform. RESULTS: We present HPCall, a 454 base-calling method based on a weighted Hurdle Poisson model. HPCall uses a probabilistic framework to call the homopolymer lengths in the sequence by modeling well-known 454 noise predictors. Base-calling quality is assessed based on estimated probabilities for each homopolymer length, which are easily transformed to useful quality scores. CONCLUSIONS: Using a reference data set of the Escherichia coli K-12 strain, we show that HPCall produces superior quality scores that are very informative towards possible insertion and deletion errors, while maintaining a base-calling accuracy that is better than the current one. Given the generality of the framework, HPCall has the potential to also adapt to other homopolymer-sensitive sequencing technologies.
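The step from estimated homopolymer-length probabilities to quality scores can be sketched as follows. This is a minimal illustration that assumes the probabilities are already available (the weighted Hurdle Poisson model itself is not reproduced here) and uses the standard Phred transform Q = -10 log10(p_error); HPCall's actual scores may be defined differently. The function name `call_homopolymer` and the cap `max_quality` are hypothetical.

```python
import math


def call_homopolymer(length_probs, max_quality=40.0):
    """Call a homopolymer length and summarise the uncertainty of the call.

    length_probs: dict mapping candidate homopolymer length -> probability.
    Returns (called_length, phred_quality, p_insertion, p_deletion).
    """
    # Normalise so the probabilities sum to one.
    total = sum(length_probs.values())
    probs = {n: p / total for n, p in length_probs.items()}

    # Call the most probable length.
    called = max(probs, key=probs.get)
    p_error = 1.0 - probs[called]

    # Phred-style quality, capped because p_error may be zero.
    if p_error <= 0.0:
        quality = max_quality
    else:
        quality = min(max_quality, -10.0 * math.log10(p_error))

    # If the true length is shorter than the call, the call contains an inserted base;
    # if it is longer, the call is missing a base (deletion).
    p_insertion = sum(p for n, p in probs.items() if n < called)
    p_deletion = sum(p for n, p in probs.items() if n > called)
    return called, quality, p_insertion, p_deletion
```

Reporting `p_insertion` and `p_deletion` separately is what makes such a score informative about the direction of a likely error, which a single quality value cannot convey.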
The falling cost of DNA sequencing has made the technology affordable to many research groups, enabling researchers to link genomic variants to observed phenotypes in a range of species. This review focusses on whole exome sequencing and its applications in humans and other species. The exome has traditionally been defined to consist of only the protein coding portion of the genome; a region where mutations are likely to affect protein structure and function. There are several commercial kits available for exome sequencing in a number of species and, owing to the highly conserved nature of exons, many of these can be applied to other closely related species. The data set produced from exome sequencing is many times smaller than that of whole genome sequencing, making it more easily manageable and the analysis less complex. Exome sequencing for disease gene discovery in humans is well established and has been used successfully to identify mutations that are causative of complex and rare diseases. Exome sequencing has also been used in a number of domesticated and companion species. The successful application of exome sequencing to crops has yielded results that may be used in selective breeding to improve production in these species, and there is potential for exome sequencing to provide similar advances in livestock species that have not yet been realized.
Human populations worldwide are increasingly confronted with infectious diseases and antimicrobial resistance spreading faster and appearing more frequently. Knowledge regarding their occurrence and worldwide transmission is important to control outbreaks and prevent epidemics. Here, we performed shotgun sequencing of toilet waste from 18 international airplanes arriving in Copenhagen, Denmark, from nine cities in three world regions. An average of 18.6 Gb (14.8 to 25.7 Gb) of raw Illumina paired-end sequence data was generated, cleaned, trimmed and mapped against reference sequence databases for bacteria and antimicrobial resistance genes. An average of 106,839 (0.06%) reads were assigned to resistance genes, with genes encoding resistance to tetracycline, macrolides and beta-lactams the most abundant in all samples. We found significantly higher abundance and diversity of genes encoding antimicrobial resistance, including critically important resistance (e.g. blaCTX-M), on airplanes from South Asia compared to North America. Salmonella enterica and norovirus were also detected in higher amounts in samples from South Asia, whereas Clostridium difficile was most abundant in samples from North America. Our study provides a first step towards a potential novel strategy for global surveillance enabling simultaneous detection of multiple human health threatening genetic elements, infectious agents and resistance genes.
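The read-count figures above can be related with simple arithmetic. The sketch below is illustrative only: it takes the reported 106,839 reads and 0.06% at face value, although both are averages and the percentage is rounded, so the implied total is a rough estimate rather than a value from the study. The function names are hypothetical.

```python
def percent_of_reads(assigned_reads, total_reads):
    """Percentage of all reads assigned to a category (e.g. resistance genes)."""
    return 100.0 * assigned_reads / total_reads


def implied_total_reads(assigned_reads, percent):
    """Total read count implied by an assigned count and its percentage."""
    return assigned_reads / (percent / 100.0)


# Reported averages: 106,839 reads, i.e. 0.06% of reads, assigned to resistance genes.
total = implied_total_reads(106_839, 0.06)
print(f"implied total reads per sample: roughly {total / 1e6:.0f} million")
```

Expressing counts as a fraction of total reads in this way is what allows samples of different sequencing depth (here 14.8 to 25.7 Gb) to be compared.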
The size and complexity of conifer genomes has, until now, prevented full genome sequencing and assembly. The large research community and economic importance of loblolly pine, Pinus taeda L., made it an early candidate for reference sequence determination.
Routine full characterization of Mycobacterium tuberculosis (TB) is culture-based, taking many weeks. Whole-genome sequencing (WGS) can generate antibiotic susceptibility profiles to inform treatment, augmented with strain information for global surveillance; such data could be transformative if provided at or near point of care. We demonstrate a low-cost DNA extraction method for TB WGS direct from patient samples. We initially evaluated the method using the Illumina MiSeq sequencer (40 smear-positive respiratory samples, obtained after routine clinical testing, and 27 matched liquid cultures). M. tuberculosis was identified in all 39 samples from which DNA was successfully extracted. Sufficient data for antibiotic susceptibility prediction was obtained from 24 (62%) samples; all results were concordant with reference laboratory phenotypes. Phylogenetic placement was concordant between direct and cultured samples. Using an Illumina MiSeq/MiniSeq the workflow from patient sample to results can be completed in 44/16 hours at a reagent cost of £96/£198 per sample. We then employed a non-specific PCR-based library preparation method for sequencing on an Oxford Nanopore Technologies MinION sequencer. We applied this to cultured Mycobacterium bovis BCG strain (BCG), and to combined culture-negative sputum DNA and BCG DNA. For flowcell version R9.4, the estimated turnaround time from patient to identification of BCG, detection of pyrazinamide resistance, and phylogenetic placement was 7.5 hours, with full susceptibility results 5 hours later. Antibiotic susceptibility predictions were fully concordant. A critical advantage of the MinION is the ability to continue sequencing until sufficient coverage is obtained, providing a potential solution to the problem of variable amounts of M. tuberculosis in direct samples.
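The idea of sequencing until coverage is sufficient can be sketched as a simple stopping rule: accumulate the bases from reads as they arrive and stop once mean depth reaches a target. This is a minimal illustration, not the workflow used in the study. It assumes every read counted maps to the target genome (which will not hold for direct samples), uses an approximate M. tuberculosis genome size of 4.4 Mb, and the default `target_depth` is an arbitrary placeholder. The function name is hypothetical.

```python
def reads_until_target_depth(read_lengths, genome_size=4_400_000, target_depth=20.0):
    """Return (reads_needed, depth_reached) for a stream of on-target read lengths.

    Mean depth is total sequenced bases divided by genome size.
    reads_needed is None if the target depth is never reached.
    """
    total_bases = 0
    for count, length in enumerate(read_lengths, start=1):
        total_bases += length
        if total_bases / genome_size >= target_depth:
            return count, total_bases / genome_size  # enough coverage: stop here
    return None, total_bases / genome_size
```

In practice the depth check would be made on reads already classified as M. tuberculosis, so samples with little target DNA would simply run for longer before the rule is satisfied.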
Whole exome and whole genome sequencing have both become widely adopted methods for investigating and diagnosing human Mendelian disorders. As pangenomic agnostic tests, they are capable of more accurate and agile diagnosis compared to traditional sequencing methods. This article describes new software called Mendel,MD, which combines multiple types of filter options and makes use of regularly updated databases to facilitate exome and genome annotation, the filtering process and the selection of candidate genes and variants for experimental validation and possible diagnosis. This tool offers a user-friendly interface, and leads clinicians through simple steps by limiting the number of candidates to achieve a final diagnosis of a medical genetics case. A useful innovation is the “1-click” method, which enables listing all the relevant variants in genes present at OMIM for perusal by clinicians. Mendel,MD was experimentally validated using clinical cases from the literature and was tested by students at the Universidade Federal de Minas Gerais, at GENE-Núcleo de Genética Médica in Brazil and at the Children’s University Hospital in Dublin, Ireland. We show in this article how it can simplify and increase the speed of identifying the culprit mutation in each of the clinical cases that were received for further investigation. Mendel,MD proved to be a reliable web-based tool, being open-source and time efficient for identifying the culprit mutation in different clinical cases of patients with Mendelian Disorders. It is also freely accessible for academic users on the following URL: https://mendelmd.org.