Emerging sequencing technologies allow common and rare variants to be systematically assayed across the human genome in many individuals. In order to improve variant detection and genotype calling, raw sequence data are typically examined across many individuals. Here, we describe a method for genotype calling in settings where sequence data are available for unrelated individuals and parent-offspring trios and show that modeling trio information can greatly increase the accuracy of inferred genotypes and haplotypes, especially on low- to modest-depth sequencing data. Our method considers both linkage disequilibrium (LD) patterns and the constraints imposed by family structure when assigning individual genotypes and haplotypes. Using simulations, we show that trios provide higher genotype calling accuracy across the frequency spectrum, both overall and at hard-to-call heterozygous sites. In addition, trios provide greatly improved phasing accuracy, which in turn improves the accuracy of downstream analyses (such as genotype imputation) that rely on phased haplotypes. To further evaluate our approach, we analyzed data on the first 508 individuals sequenced by the SardiNIA sequencing project. Our results show that our method reduces the genotyping error rate by 50% compared with analysis using existing methods that ignore family structure. We anticipate our method will facilitate genotype calling and haplotype inference for many ongoing sequencing projects.
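The way family structure constrains a low-depth genotype call can be illustrated with a minimal sketch. This is not the method described above: it ignores LD and de novo mutation, handles a single biallelic site, and assumes a binomial read-error model, a Hardy-Weinberg prior for the parents, and illustrative values for the error rate, allele frequency and read counts, none of which come from the abstract.

```python
from itertools import product
from math import comb

GENOTYPES = (0, 1, 2)  # number of alternate-allele copies


def read_likelihood(alt_reads, depth, genotype, error=0.01):
    """P(read counts | genotype) under a simple binomial model (assumed)."""
    p_alt = (error, 0.5, 1.0 - error)[genotype]
    return comb(depth, alt_reads) * p_alt ** alt_reads * (1 - p_alt) ** (depth - alt_reads)


def hwe_prior(genotype, alt_freq):
    """Hardy-Weinberg genotype prior for a founder or unrelated individual."""
    q = alt_freq
    return ((1 - q) ** 2, 2 * q * (1 - q), q * q)[genotype]


def mendel(child, father, mother):
    """P(child genotype | parental genotypes), ignoring de novo mutation."""
    f, m = father / 2, mother / 2  # chance each parent transmits the alt allele
    return ((1 - f) * (1 - m), f * (1 - m) + (1 - f) * m, f * m)[child]


def child_posterior(reads, alt_freq=0.2, use_trio=True):
    """Posterior over the child's genotype; reads maps role -> (alt_reads, depth)."""
    post = [0.0, 0.0, 0.0]
    for f, m, c in product(GENOTYPES, repeat=3):
        child_prior = mendel(c, f, m) if use_trio else hwe_prior(c, alt_freq)
        prior = hwe_prior(f, alt_freq) * hwe_prior(m, alt_freq) * child_prior
        lik = (read_likelihood(*reads["father"], f)
               * read_likelihood(*reads["mother"], m)
               * read_likelihood(*reads["child"], c))
        post[c] += prior * lik
    total = sum(post)
    return [p / total for p in post]


# Hypothetical read counts: well-covered parents, child covered by two reference reads.
reads = {"father": (20, 20), "mother": (0, 20), "child": (0, 2)}
trio = child_posterior(reads, use_trio=True)
unrelated = child_posterior(reads, use_trio=False)
print("with trio:", [round(p, 3) for p in trio])
print("ignoring trio:", [round(p, 3) for p in unrelated])
```

In this toy case the child's two reads alone favour a homozygous reference call, whereas the parental genotypes make the heterozygous call far more probable. That is the kind of hard-to-call heterozygous site at which trio information is expected to help.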
The immune responses of natural killer cells are regulated, in part, by killer cell immunoglobulin-like receptors (KIR). The 16 closely-related genes in the KIR gene system have been diversified by gene duplication and unequal crossing over, thereby generating haplotypes with variation in gene copy number. Allelic variation also contributes to diversity within the complex. In this study, we estimated allele-level haplotype frequencies and pairwise linkage disequilibrium statistics for 14 KIR loci. The typing utilized multiple methodologies by four laboratories to provide at least 2x coverage for each allele. The computational methods generated maximum-likelihood estimates of allele-level haplotypes. Our results indicate the most extensive allele diversity was observed for the KIR framework genes and for the genes localized to the telomeric region of the KIR A haplotype. Particular alleles of the stimulatory loci appear to be nearly fixed on specific, common haplotypes while many of the less frequent alleles of the inhibitory loci appeared on multiple haplotypes, some with common haplotype structures. Haplotype structures cA01 and/or tA01 predominate in this cohort, as has been observed in most populations worldwide. Linkage disequilibrium is high within the centromeric and telomeric haplotype regions but not between them and is particularly strong between centromeric gene pairs KIR2DL5∼KIR2DS3S5 and KIR2DS3S5∼KIR2DL1, and telomeric KIR3DL1∼KIR2DS4. Although 93% of the individuals have unique pairs of full-length allelic haplotypes, large genomic blocks sharing specific sets of alleles are seen in the most frequent haplotypes. These high-resolution, high-quality haplotypes extend our basic knowledge of the KIR gene system and may be used to support clinical studies beyond single gene analysis.
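As a rough illustration of the pairwise linkage disequilibrium statistics mentioned above, the sketch below computes D, D' and r² for two biallelic loci from haplotype counts. It is a simplification: KIR loci are multi-allelic and vary in gene copy number, the function name `ld_stats` is hypothetical, and the counts are made up for the example.

```python
def ld_stats(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, D', r^2) for two biallelic loci from the four haplotype counts."""
    n = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / n
    p_A = (n_AB + n_Ab) / n  # frequency of allele A at locus 1
    p_B = (n_AB + n_aB) / n  # frequency of allele B at locus 2

    d = p_AB - p_A * p_B  # deviation from independence

    # D' rescales D by its maximum attainable magnitude given the allele frequencies.
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = d / d_max if d_max > 0 else 0.0

    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


# Made-up counts: complete LD versus independence.
print(ld_stats(50, 0, 0, 50))
print(ld_stats(25, 25, 25, 25))
```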
In a number of applications there is a need to determine the most likely pedigree for a group of persons based on genetic markers. Adequate models are needed to reach this goal. The markers used to perform the statistical calculations can be linked and there may also be linkage disequilibrium (LD) in the population. The purpose of this paper is to present a graphical Bayesian Network framework to deal with such data. Potential LD is normally ignored and it is important to verify that the resulting calculations are not biased. Even if linkage does not influence results for regular paternity cases, it may have substantial impact on likelihood ratios involving other, more extended pedigrees. Models for LD influence likelihoods for all pedigrees to some degree and an initial estimate of the impact of ignoring LD and/or linkage is desirable, going beyond mere rules of thumb based on marker distance. Furthermore, we show how one can readily include a mutation model in the Bayesian Network; extending other programs or formulas to include such models may require considerable amounts of work and will in many cases not be practical. As an example, we consider the two STR markers vWa and D12S391. We estimate probabilities for population haplotypes to account for LD using a method based on data from trios, while an estimate for the degree of linkage is taken from the literature. The results show that accounting for haplotype frequencies is unnecessary in most cases for this specific pair of markers. When doing calculations on regular paternity cases, the markers can be considered statistically independent. In more complex cases of disputed relatedness, for instance cases involving siblings or so-called deficient cases, or when small differences in the LR matter, independence should not be assumed. (The networks are freely available at http://arken.umb.no/~dakl/BayesianNetworks.)
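To make the role of linkage concrete, the following sketch computes the probabilities with which a parent transmits each two-marker haplotype for a given recombination fraction. It is only an illustration, not the Bayesian Network described above: the function name `gamete_probs`, the allele labels and the recombination fractions are hypothetical, and mutation is ignored.

```python
def gamete_probs(hap1, hap2, theta):
    """Transmission probabilities for a parent carrying haplotypes hap1 and hap2.

    Each haplotype is a tuple (allele at marker 1, allele at marker 2);
    theta is the recombination fraction between the markers (0 <= theta <= 0.5).
    """
    probs = {}
    parental = [hap1, hap2]
    recombinant = [(hap1[0], hap2[1]), (hap2[0], hap1[1])]
    for gamete in parental:
        probs[gamete] = probs.get(gamete, 0.0) + (1 - theta) / 2
    for gamete in recombinant:
        probs[gamete] = probs.get(gamete, 0.0) + theta / 2
    return probs


# Hypothetical allele labels for two linked STR markers.
linked = gamete_probs((17, 20), (18, 22), theta=0.1)
unlinked = gamete_probs((17, 20), (18, 22), theta=0.5)
print(linked)
print(unlinked)
```

With theta = 0.5 all four gametes are equally likely, which corresponds to treating the markers as independent. A smaller theta shifts probability toward the parental haplotypes, and this shift is what can change likelihood ratios in more extended pedigrees.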
We present the global phylogeography of the black sea urchin Arbacia lixula, an amphi-Atlantic echinoid with potential to strongly impact shallow rocky ecosystems. Sequences of the mitochondrial cytochrome c oxidase gene of 604 specimens from 24 localities were obtained, covering most of the distribution area of the species, including the Mediterranean and both shores of the Atlantic. Genetic diversity measures, phylogeographic patterns, demographic parameters and population differentiation were analysed. We found high haplotype diversity but relatively low nucleotide diversity, with 176 haplotypes grouped within three haplogroups: one is shared between Eastern Atlantic (including Mediterranean) and Brazilian populations, the second is found in Eastern Atlantic and the Mediterranean and the third is exclusively from Brazil. Significant genetic differentiation was found between Brazilian, Eastern Atlantic and Mediterranean regions, but no differentiation was found among Mediterranean sub-basins or among Eastern Atlantic sub-regions. The star-shaped topology of the haplotype network and the unimodal mismatch distributions of Mediterranean and Eastern Atlantic samples suggest that these populations have suffered very recent demographic expansions. These expansions could be dated 94-205 kya in the Mediterranean, and 31-67 kya in the Eastern Atlantic. In contrast, Brazilian populations did not show any signature of population expansion. Our results indicate that all populations of A. lixula constitute a single species. The Brazilian populations probably diverged from an Eastern Atlantic stock. The present-day genetic structure of the species in Eastern Atlantic and the Mediterranean is shaped by very recent demographic processes. Our results support the view (backed by the lack of fossil record) that A. lixula is a recent thermophilous colonizer which spread throughout the Mediterranean during a warm period of the Pleistocene, probably during the last interglacial. 
Implications for the possible future impact of A. lixula on shallow Mediterranean ecosystems in the context of global warming trends must be considered.
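The contrast between high haplotype diversity and low nucleotide diversity can be illustrated with a small sketch. It assumes Nei's unbiased estimator of haplotype diversity and the mean pairwise difference per site as nucleotide diversity; the abstract does not state which estimators or software were used, and the toy sequences are made up.

```python
from collections import Counter
from itertools import combinations


def haplotype_diversity(seqs):
    """Nei's haplotype diversity: n/(n-1) * (1 - sum of squared haplotype frequencies)."""
    n = len(seqs)
    counts = Counter(seqs)
    return n / (n - 1) * (1 - sum((c / n) ** 2 for c in counts.values()))


def nucleotide_diversity(seqs):
    """Mean number of pairwise differences per site across all pairs of sequences."""
    n = len(seqs)
    length = len(seqs[0])
    total_diffs = sum(
        sum(a != b for a, b in zip(s, t)) for s, t in combinations(seqs, 2)
    )
    n_pairs = n * (n - 1) / 2
    return total_diffs / n_pairs / length


# Toy aligned sequences (not real data): three haplotypes, each one mutation apart
# from the most common one.
toy = ["AAAA", "AAAT", "AAAA", "AATA"]
print(haplotype_diversity(toy))
print(nucleotide_diversity(toy))
```

Many haplotypes that each differ from a central one by a single substitution give exactly this combination of high haplotype diversity and low nucleotide diversity, and they also produce the star-shaped network noted above.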
BACKGROUND: Genotyping and massively-parallel sequencing projects result in a vast amount of diploid data that is only rarely resolved into its constituent haplotypes. It is nevertheless this phased information that is transmitted from one generation to the next and is most directly associated with biological function and the genetic causes of biological effects. Despite progress made in genome-wide sequencing and phasing algorithms and methods, problems assembling (and reconstructing linear haplotypes in) regions of repetitive DNA and structural variation remain. These dynamic and structurally complex regions are often poorly understood from a sequence point of view. Regions such as these that are highly similar in their sequence tend to be collapsed onto the genome assembly. This in turn means downstream determination of the true sequence haplotype in these regions poses a particular challenge. For structurally complex regions, a more focussed approach to assembling haplotypes may be required. RESULTS: In order to investigate reconstruction of spatial information at structurally complex regions, we have used an emulsion haplotype fusion PCR approach to reproducibly link sequences of up to 1 kb in length to allow phasing of multiple variants from neighbouring loci, using allele-specific PCR and sequencing to detect the phase. By using emulsion systems linking flanking regions to amplicons within the CNV, this led to the reconstruction of a 59 kb haplotype across the DEFA1A3 CNV in HapMap individuals. CONCLUSION: This study has demonstrated a novel use for emulsion haplotype fusion PCR in addressing the issue of reconstructing structural haplotypes at multiallelic copy variable regions, using the DEFA1A3 locus as an example.
Many genes are known to have an influence on conformation and performance traits; however, the role of one gene, Myostatin (MSTN), has been highlighted in recent studies on horses. Myostatin acts as a repressor in the development and regulation of differentiation and proliferative growth of skeletal muscle. Several studies have examined the link between MSTN, conformation and performance in racing breeds, but no studies have investigated the relationship in Icelandic horses. Icelandic horses, a highly unique breed, are known both for their robust and compact conformation as well as their additional gaits tölt and pace. Three SNPs (g.65868604G>T [PR8604], g.66493737C>T [PR3737] and g.66495826A>G [PR5826]) flanking or within equine MSTN were genotyped in 195 Icelandic horses. The SNPs and haplotypes were analyzed for association with official estimated breeding values (EBV) for conformation traits (n=11) and gaits (n=5). The EBV for neck, withers and shoulders was significantly associated with both PR8604 and PR3737 (p<0.05). PR8604 was also associated with EBV for total conformation (p=0.05). These associations were all supported by the haplotype analysis. However, while SNP PR5826 showed a significant association with EBVs for leg stance and hooves (p<0.05), haplotype analyses for these traits failed to fully support these associations. This study demonstrates the possible role of MSTN on both the form and function of horses from non-racing breeds. Further analysis of Icelandic horses as well as other non-racing breeds would be beneficial and likely help to completely understand the influence of MSTN on conformation and performance in horses.
To determine the genetic diversity and paternal origin of Chinese cattle, 302 males from 16 Chinese native cattle breeds as well as 30 Holstein males and four Burma males as controls were analysed using four Y-SNPs and two Y-STRs. In Chinese bulls, the taurine Y1 and Y2 haplogroups and indicine Y3 haplogroup were detected in seven, 172 and 123 individuals respectively, and these frequencies varied among the Chinese cattle breeds examined. Y2 dominates in northern China (91.4%), and Y3 dominates in southern China (90.8%). Central China is an admixture zone, although Y2 predominates overall (72.0%). The geographical distributions of the Y2 and Y3 haplogroup frequencies revealed a pattern of male indicine introgression from south to north China. The three Y haplogroups were further classified into one Y1 haplotype, five Y2 haplotypes and one Y3 haplotype in Chinese native bulls. Due to the interplay between taurine and indicine types, Chinese cattle represent an extensive reservoir of genetic diversity. The Y haplotype distribution of Chinese cattle exhibited a clear geographical structure, which is consistent with mtDNA, historical and geographical information.
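As a quick arithmetic check, the haplogroup counts reported above for the 302 Chinese bulls can be converted to overall frequencies. The counts come from the abstract; the variable names are illustrative.

```python
# Haplogroup counts among Chinese native bulls, as reported in the abstract.
counts = {"Y1": 7, "Y2": 172, "Y3": 123}

total = sum(counts.values())
freqs = {haplogroup: n / total for haplogroup, n in counts.items()}

print("total bulls:", total)
for haplogroup, freq in freqs.items():
    print(f"{haplogroup}: {freq:.1%}")
```

The counts sum to the 302 Chinese males stated in the abstract, and the taurine haplogroups Y1 and Y2 together account for roughly 59% of them.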
In this paper we consider a problem from hematopoietic cell transplant (HCT) studies where there is interest in assessing the effect of donor-patient haplotype match on the cumulative incidence function for right-censored competing risks data. For the HCT study, the donor’s and patient’s genotypes are fully observed and matched but their haplotypes are missing. We describe how to deal with missing covariates of each individual for competing risks data. We suggest a procedure for estimating the cumulative incidence functions for a flexible class of regression models when there are missing data, and establish the large sample properties. Small sample properties are investigated using simulations in a setting that mimics the motivating haplotype matching problem. The proposed approach is then applied to the HCT study.
Familial hemophagocytic lymphohistiocytosis (familial HLH or FHL) is a potentially fatal autosomal recessive disorder. Our previous study demonstrated that UNC13D mutations (FHL3) account for ∼90 % of FHL in Korea, with the recurrent splicing mutation c.754-1G>C (IVS9-1G>C). Notably, half of the FHL3 patients had a monoallelic mutation of UNC13D. Deep intronic mutations in UNC13D were recently reported in patients of European descent. In this study, we performed targeted mutation analyses for deep intronic mutations and investigated the founder effect in FHL3 in Korean patients. The study patients were 72 children with HLH including those with FHL3 previously reported to have a monoallelic UNC13D mutation. All patients were recruited from the Korean Registry of Hemophagocytic Lymphohistiocytosis. In addition to conventional sequencing of FHL2-4, targeted tests for c.118-308C>T and large intronic rearrangement mutations of UNC13D were performed. Haplotype analysis was performed for founder effects using polymorphic markers in the FHL3 locus. FHL mutations were detected in 20 patients (28 %). Seventeen patients had UNC13D mutations (FHL3, 85 %) and three had PRF1 mutations (FHL2, 15 %). UNC13D:c.118-308C>T was detected in ten patients, accounting for 38 % of all mutant alleles of UNC13D, followed by c.754-1G>C (26 %). Haplotype analyses revealed significantly shared haplotypes in both c.118-308C>T and c.754-1G>C, indicating the presence of founder effects. The deep intronic mutation UNC13D:c.118-308C>T accounts for the majority of previously missing mutations and is the most frequent mutation in FHL3 in Korea. Founder effects of two recurrent intronic mutations of UNC13D explain the unusual predominance of FHL3 in Korea.
The aim of this study was to explore whether prostaglandin D(2) receptor (PTGDR) polymorphisms confer susceptibility to asthma. A meta-analysis was conducted on the associations between the PTGDR -549 C/T, -441 C/T, and -197 C/T polymorphisms and asthma using: (1) allele contrast, (2) the recessive model, (3) the dominant model, and (4) the additive model. Three-polymorphism haplotypes were constructed in the order -549/-441/-197. Meta-analysis was performed on the haplotypes CCC (high transcriptional activity) and TCT (low transcriptional activity). A total of 13 separate comparative studies in 9 articles involving 7,155 patients with asthma and 7,285 control subjects were included in this meta-analysis. An association between asthma and the PTGDR -549 C/T polymorphism was found by allele contrast (OR = 1.133, 95 % CI = 1.004-1.279, P = 0.043). Ethnicity-specific meta-analysis showed an association between asthma and the PTGDR -549 C allele in Europeans (OR = 1.192, 95 % CI = 1.032-1.377, P = 0.017). Furthermore, stratifying subjects by age indicated an association between the PTGDR -549 C allele and asthma in adults (OR = 1.248, 95 % CI = 1.076-1.447, P = 0.003), but no association in children (OR = 0.933, 95 % CI = 0.756-1.154, P = 0.324). Analyses using the dominant and additive models showed a pattern similar to that observed for the PTGDR -549 C allele, that is, a significant association in Europeans and adults, but not in children. No association was found between asthma and the PTGDR -441 C/T or -197 C/T polymorphisms, and meta-analysis stratified by ethnicity and age also revealed no association between asthma and these polymorphisms. Furthermore, no association was found between asthma and the CCC and TCT haplotypes of PTGDR, and meta-analysis stratified by ethnicity and age revealed no association between asthma and the CCC and TCT PTGDR haplotypes. 
This meta-analysis demonstrates that the PTGDR -549 C/T polymorphism confers susceptibility to asthma in Europeans and adults. However, no association was found between the PTGDR -441 C/T and -197 C/T polymorphisms or the CCC and TCT haplotypes and asthma susceptibility.
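The relationship between a reported odds ratio, its 95 % confidence interval and the P value can be sketched as follows. The sketch assumes a Wald-type interval that is symmetric on the log-odds scale and a two-sided normal test. The abstract does not state how its intervals and P values were derived, and the published figures are rounded, so the recovered values are approximate. The function name `wald_from_ci` is hypothetical.

```python
from math import erfc, log, sqrt


def wald_from_ci(odds_ratio, lower, upper, z_crit=1.96):
    """Recover (standard error of log OR, z statistic, two-sided P) from an OR and its CI.

    Assumes the interval is log(OR) +/- z_crit * SE, which is an assumption
    about how the interval was built, not something stated in the source.
    """
    se = (log(upper) - log(lower)) / (2 * z_crit)
    z = log(odds_ratio) / se
    p = erfc(abs(z) / sqrt(2))  # two-sided tail probability of a standard normal
    return se, z, p


# Allele-contrast result reported in the abstract: OR = 1.133, 95 % CI = 1.004-1.279.
se, z, p = wald_from_ci(1.133, 1.004, 1.279)
print(f"SE(log OR) = {se:.4f}, z = {z:.2f}, P = {p:.3f}")
```

Under these assumptions the recovered P value for the allele-contrast result is close to the reported P = 0.043, so the interval and the P value appear mutually consistent for that comparison.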