Concept: Linkage disequilibrium
Emerging sequencing technologies allow common and rare variants to be systematically assayed across the human genome in many individuals. In order to improve variant detection and genotype calling, raw sequence data are typically examined across many individuals. Here, we describe a method for genotype calling in settings where sequence data are available for unrelated individuals and parent-offspring trios and show that modeling trio information can greatly increase the accuracy of inferred genotypes and haplotypes, especially on low- to modest-depth sequencing data. Our method considers both linkage disequilibrium (LD) patterns and the constraints imposed by family structure when assigning individual genotypes and haplotypes. Using simulations, we show that trios provide higher genotype calling accuracy across the frequency spectrum, both overall and at hard-to-call heterozygous sites. In addition, trios provide greatly improved phasing accuracy, which in turn improves the accuracy of downstream analyses (such as genotype imputation) that rely on phased haplotypes. To further evaluate our approach, we analyzed data on the first 508 individuals sequenced by the SardiNIA sequencing project. Our results show that our method reduces the genotyping error rate by 50% compared with analysis using existing methods that ignore family structure. We anticipate our method will facilitate genotype calling and haplotype inference for many ongoing sequencing projects.
To ascertain genetic diversity, population structure and linkage disequilibrium (LD) among a representative collection of Chinese winter wheat cultivars and lines, 90 winter wheat accessions were analyzed with 269 SSR markers distributed throughout the wheat genome. A total of 1,358 alleles were detected, with 2 to 10 alleles per locus and a mean genetic richness of 5.05. The average genetic diversity index was 0.60, with values ranging from 0.05 to 0.86. Of the three genomes of wheat, ANOVA revealed that the B genome had the highest genetic diversity (0.63) and the D genome the lowest (0.56); significant differences were observed between these two genomes (P < 0.01). The 90 Chinese winter wheat accessions could be divided into three subgroups based on STRUCTURE, UPGMA cluster and principal coordinate analyses. The population structure derived from STRUCTURE clustering was positively correlated to some extent with geographic eco-type. LD analysis revealed that there was a shorter LD decay distance in Chinese winter wheat compared with other wheat germplasm collections. The maximum LD decay distance, estimated by curvilinear regression, was 17.4 cM (r² > 0.1), with a whole genome LD decay distance of approximately 2.2 cM (r² > 0.1, P < 0.001). Evidence from genetic diversity analyses suggests that wheat germplasm from other countries should be introduced into Chinese winter wheat and distant hybridization should be adopted to create new wheat germplasm with increased genetic diversity. The results of this study should provide valuable information for future association mapping using this Chinese winter wheat collection.
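As a minimal illustration of the r² statistic used above as the LD decay threshold, the sketch below computes D, D′ and r² for one pair of loci from phased haplotypes. It assumes biallelic markers coded 0/1, which is a simplification (the SSR markers in this study are multiallelic); the function name `ld_stats` and the toy haplotypes are hypothetical and not taken from the study.

```python
def ld_stats(haplotypes):
    """Return (D, D', r^2) for two biallelic loci.

    haplotypes: iterable of (a, b) pairs, one per phased haplotype,
    with each allele coded 0 or 1 (an assumption of this sketch).
    """
    haps = list(haplotypes)
    n = len(haps)
    if n == 0:
        raise ValueError("need at least one haplotype")

    # Frequencies of allele 1 at each locus and of the 1-1 haplotype.
    p_a = sum(a for a, _ in haps) / n
    p_b = sum(b for _, b in haps) / n
    p_ab = sum(1 for a, b in haps if a == 1 and b == 1) / n

    # D is the deviation of the haplotype frequency from independence.
    d = p_ab - p_a * p_b

    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    r2 = d * d / denom

    # D' rescales |D| by its maximum possible value given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0

    return d, d_prime, r2


if __name__ == "__main__":
    # Toy sample of 10 haplotypes (hypothetical data).
    sample = [(1, 1)] * 4 + [(0, 0)] * 4 + [(1, 0), (0, 1)]
    print(ld_stats(sample))
```

On this toy sample r² is about 0.36, so the pair would count as being in LD under an r² > 0.1 threshold. An LD decay distance is then typically obtained by fitting r² against genetic distance across many marker pairs and reading off where the fitted curve crosses the threshold.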
Although the concept of genomic selection relies on linkage disequilibrium (LD) between quantitative trait loci and markers, reliability of genomic predictions is strongly influenced by family relationships. In this study, we investigated the effects of LD and family relationships on reliability of genomic predictions and the potential of deterministic formulas to predict reliability using population parameters in populations with complex family structures. Five groups of selection candidates were simulated taking different information sources from the reference population into account: 1) allele frequencies; 2) LD pattern; 3) haplotypes; 4) haploid chromosomes; 5) individuals from the reference population, thereby having real family relationships with reference individuals. Reliabilities were predicted using genomic relationships among 529 reference individuals and their relationships with selection candidates and with a deterministic formula where the number of effective chromosome segments (M_e) was estimated based on genomic and additive relationship matrices for each scenario. At a heritability of 0.6, reliabilities based on genomic relationships were 0.002±0.0001 (allele frequencies), 0.015±0.001 (LD pattern), 0.018±0.001 (haplotypes), 0.100±0.008 (haploid chromosomes) and 0.318±0.077 (family relationships). At a heritability of 0.1, relative differences among groups were similar. For all scenarios, reliabilities were similar to predictions with a deterministic formula using estimated M_e. Thus, reliabilities can be predicted accurately using empirically estimated M_e, and the level of relationship with reference individuals has a much greater effect on reliability than linkage disequilibrium per se. Furthermore, the accumulated length of shared haplotypes is more important in determining the reliability of genomic prediction than the individual shared haplotype length.
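The abstract does not state which deterministic formula was used. As a rough sketch, one widely used approximation expresses reliability as r² = N·h² / (N·h² + M_e), where N is the size of the reference population, h² the heritability and M_e the number of effective chromosome segments. The code below assumes that form; the function names and the M_e values in the example are hypothetical and are not estimates from the study.

```python
def predicted_reliability(n_ref, h2, m_e):
    """Reliability under the assumed formula r^2 = N*h^2 / (N*h^2 + M_e)."""
    if n_ref <= 0 or not 0 < h2 <= 1 or m_e <= 0:
        raise ValueError("need n_ref > 0, 0 < h2 <= 1 and m_e > 0")
    return n_ref * h2 / (n_ref * h2 + m_e)


def implied_m_e(n_ref, h2, reliability):
    """Invert the same assumed formula: the M_e implied by a given reliability."""
    if n_ref <= 0 or not 0 < h2 <= 1 or not 0 < reliability < 1:
        raise ValueError("need n_ref > 0, 0 < h2 <= 1 and 0 < reliability < 1")
    return n_ref * h2 * (1 - reliability) / reliability


if __name__ == "__main__":
    # 529 reference individuals and h2 = 0.6 come from the abstract;
    # the M_e values below are illustrative assumptions only.
    for m_e in (500.0, 5000.0, 50000.0):
        print(m_e, round(predicted_reliability(529, 0.6, m_e), 4))
```

Under this form, a smaller M_e (fewer independent segments to estimate, as when candidates are closely related to the reference population) yields a higher predicted reliability at fixed N and h², which matches the direction of the results reported above.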
In a number of applications there is a need to determine the most likely pedigree for a group of persons based on genetic markers. Adequate models are needed to reach this goal. The markers used to perform the statistical calculations can be linked and there may also be linkage disequilibrium (LD) in the population. The purpose of this paper is to present a graphical Bayesian Network framework to deal with such data. Potential LD is normally ignored and it is important to verify that the resulting calculations are not biased. Even if linkage does not influence results for regular paternity cases, it may have substantial impact on likelihood ratios involving other, more extended pedigrees. Models for LD influence likelihoods for all pedigrees to some degree and an initial estimate of the impact of ignoring LD and/or linkage is desirable, going beyond mere rules of thumb based on marker distance. Furthermore, we show how one can readily include a mutation model in the Bayesian Network; extending other programs or formulas to include such models may require considerable amounts of work and will in many cases not be practical. As an example, we consider the two STR markers vWa and D12S391. We estimate probabilities for population haplotypes to account for LD using a method based on data from trios, while an estimate for the degree of linkage is taken from the literature. The results show that accounting for haplotype frequencies is unnecessary in most cases for this specific pair of markers. When doing calculations on regular paternity cases, the markers can be considered statistically independent. In more complex cases of disputed relatedness, for instance cases involving siblings or so-called deficient cases, or when small differences in the likelihood ratio (LR) matter, independence should not be assumed. (The networks are freely available at http://arken.umb.no/~dakl/BayesianNetworks.)
The main purpose of this study is to evaluate whether the population structure in Danish Jersey (DJ) known from the history of the breed is also reflected in its genomic structure. This is done by comparing the linkage disequilibrium and persistence of phase for subgroups of Jersey animals with high proportions of Danish (DNK) or US (USJ) origin. Furthermore, it is investigated whether a model explicitly incorporating breed origin of animals, inferred either through the known pedigree or from SNP marker data, leads to improved genomic predictions compared to a model ignoring breed origin. The study of the population structure incorporated 1,730 genotyped Jersey animals. In total 39,542 SNP markers were included in the analysis. The 1,079 genotyped bulls with de-regressed proof for udder health were used in the analysis for the predictions of the genomic breeding values. A range of random regression models that included the breed origin were analyzed and compared to a basic genomic model that assumes a homogeneous breed structure. The main finding in this study is that the importation of germplasm from the USJ population is readily reflected in the genomes of modern DJ animals. First, linkage disequilibrium in the group of admixed DJ animals is lower compared to the groups of the original DNK and USJ animals. Second, persistence of linkage disequilibrium phase is not conserved for longer marker distances between animals with mainly Danish or US origin. Third, the STRUCTURE analysis could retrieve genome-based breed proportions in agreement with the pedigree-based breed proportions. However, including this population structure in a random regression prediction model did not clearly improve the reliabilities of the genomic predictions compared to a basic genomic model.
This study determined the population structure and genome-wide marker-trait association of agronomic traits of wheat for drought-tolerance breeding. Ninety-three diverse bread wheat genotypes were genotyped using the Diversity Arrays Technology sequencing (DArTseq) protocol. The number of days-to-heading (DTH), number of days-to-maturity (DTM), plant height (PHT), spike length (SPL), number of kernels per spike (KPS), thousand kernel weight (TKW) and grain yield (GYLD), assessed under drought-stressed and non-stressed conditions, were considered for the study. Population structure analysis and genome-wide association mapping were undertaken based on 16,383 silico DArTs loci with < 10% missing data. The population evaluated was grouped into nine distinct genetic structures. Inter-chromosomal linkage disequilibrium showed the existence of linkage decay as physical distance increased. A total of 62 significant (P < 0.001) marker-trait associations (MTAs) were detected explaining more than 20% of the phenotypic variation observed under both drought-stressed and non-stressed conditions. Significant (P < 0.001) MTA event(s) were observed for DTH, PHT, SPL, SPS, and KPS under both stressed and non-stressed conditions, while additional significant (P < 0.05) associations were observed for TKW, DTM and GYLD under non-stressed conditions. The MTAs reported in this population could be useful to initiate marker-assisted selection (MAS) and targeted trait introgression of wheat under drought-stressed and non-stressed conditions, and for fine mapping and cloning of the underlying genes and QTL.
Five classical designations of sickle haplotypes are made on the basis of the presence or absence of restriction sites and are named after the ethno-linguistic groups or geographic regions from which the individuals with sickle cell anemia originated. Each haplotype is thought to represent an independent occurrence of the sickle mutation rs334 (c.20A>T [p.Glu7Val] in HBB). We investigated the origins of the sickle mutation by using whole-genome-sequence data. We identified 156 carriers from the 1000 Genomes Project, the African Genome Variation Project, and Qatar. We classified haplotypes by using 27 polymorphisms in linkage disequilibrium with rs334. Network analysis revealed a common haplotype that differed from the ancestral haplotype only by the derived sickle mutation at rs334 and correlated collectively with the Central African Republic (CAR), Cameroon, and Arabian/Indian haplotypes. Other haplotypes were derived from this haplotype and fell into two clusters, one composed of Senegal haplotypes and the other composed of Benin and Senegal haplotypes. The near-exclusive presence of the original sickle haplotype in the CAR, Kenya, Uganda, and South Africa is consistent with this haplotype predating the Bantu expansions. Modeling of balancing selection indicated that the heterozygote advantage was 15.2%, an equilibrium frequency of 12.0% was reached after 87 generations, and the selective environment predated the mutation. The posterior distribution of the ancestral recombination graph yielded a sickle mutation age of 259 generations, corresponding to 7,300 years ago during the Holocene Wet Phase. These results clarify the origin of the sickle allele and improve and simplify the classification of sickle haplotypes.
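To make the balancing-selection reasoning concrete, the sketch below iterates the standard one-locus deterministic model of heterozygote advantage and compares the result with the closed-form equilibrium q* = s / (s + t), where s and t are the fitness deficits of the two homozygotes relative to the heterozygote. Reading the 15.2% heterozygote advantage as a relative fitness of 1.152 against a normal homozygote fitness of 1.0 is an assumption of this sketch, and the sickle homozygote fitness of 0.1 and the starting frequency are hypothetical. It is therefore not expected to reproduce the 12.0% equilibrium or the 87 generations reported above, which come from the authors' own model.

```python
def next_frequency(q, w_aa, w_as, w_ss):
    """One generation of selection on the sickle allele frequency q."""
    p = 1.0 - q
    mean_w = p * p * w_aa + 2 * p * q * w_as + q * q * w_ss
    # Frequency of the S allele among surviving gametes.
    return (p * q * w_as + q * q * w_ss) / mean_w


def equilibrium_frequency(w_aa, w_as, w_ss):
    """Stable equilibrium q* = s / (s + t) under heterozygote advantage."""
    s = w_as - w_aa  # deficit of the normal homozygote
    t = w_as - w_ss  # deficit of the sickle homozygote
    if s <= 0 or t <= 0:
        raise ValueError("heterozygote must be fitter than both homozygotes")
    return s / (s + t)


def simulate(q0, generations, w_aa, w_as, w_ss):
    """Iterate the recursion and return the frequency after each generation."""
    trajectory = [q0]
    for _ in range(generations):
        trajectory.append(next_frequency(trajectory[-1], w_aa, w_as, w_ss))
    return trajectory


if __name__ == "__main__":
    # Hypothetical fitness values; see the assumptions stated above.
    w_aa, w_as, w_ss = 1.0, 1.152, 0.1
    path = simulate(q0=1e-4, generations=500, w_aa=w_aa, w_as=w_as, w_ss=w_ss)
    print(path[-1], equilibrium_frequency(w_aa, w_as, w_ss))
```

The equilibrium depends only on the two fitness deficits, while the number of generations needed to approach it also depends on the starting frequency.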
Many microbial populations rapidly adapt to changing environments with multiple variants competing for survival. To quantify such complex evolutionary dynamics in vivo, time-resolved and genome-wide data including rare variants are essential. We performed whole-genome deep sequencing of HIV-1 populations in 9 untreated patients, with 6-12 longitudinal samples per patient spanning 5-8 years of infection. The data can be accessed and explored via an interactive web application. We show that patterns of minor diversity are reproducible between patients and mirror global HIV-1 diversity, suggesting a universal landscape of fitness costs that control diversity. Reversions towards the ancestral HIV-1 sequence are observed throughout infection and account for almost one third of all sequence changes. Reversion rates depend strongly on conservation. Frequent recombination limits linkage disequilibrium to about 100 bp in most of the genome, but strong hitch-hiking due to short-range linkage limits diversity.
Genetic similarity of spouses can reflect factors influencing mate choice, such as physical/behavioral characteristics, and patterns of social endogamy. Spouse correlations for both genetic ancestry and measured traits may impact genotype distributions (Hardy-Weinberg and linkage equilibrium), and therefore genetic association studies. Here we evaluate white spouse-pairs from the Framingham Heart Study (FHS) original and offspring cohorts (N = 124 and 755, respectively) to explore spousal genetic similarity and its consequences. Two principal components (PCs) of the genome-wide association (GWA) data were identified, with the first (PC1) delineating clines of Northern/Western to Southern European ancestry and the second (PC2) delineating clines of Ashkenazi Jewish ancestry. In the original (older) cohort, there was a striking positive correlation between the spouses in PC1 (r = 0.73, P = 3 × 10⁻²²) and also for PC2 (r = 0.80, P = 7 × 10⁻²⁹). In the offspring cohort, the spouse correlations were lower but still highly significant for PC1 (r = 0.38, P = 7 × 10⁻²⁸) and for PC2 (r = 0.45, P = 2 × 10⁻³⁹). We observed significant Hardy-Weinberg disequilibrium for single nucleotide polymorphisms (SNPs) loading heavily on PC1 and PC2 across 3 generations, and also significant linkage disequilibrium between unlinked SNPs; both decreased with time, consistent with reduced ancestral endogamy over generations and congruent with theoretical calculations. Ignoring ancestry, estimates of spouse kinship have a mean significantly greater than 0, and more so in the earlier generations. Adjusting kinship estimates for genetic ancestry through the use of PCs led to a mean spouse kinship not different from 0, demonstrating that spouse genetic similarity could be fully attributed to ancestral assortative mating.
These findings also have significance for studies of heritability that are based on distantly related individuals (kinship less than 0.05), as we also demonstrate the poor correlation of kinship estimates in that range when ancestry is or is not taken into account.
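The theoretical calculations referred to above are not given in the abstract. As a hedged point of comparison, the standard random-mating result is that LD between two loci decays as D_t = D_0 (1 - c)^t, where c is the recombination fraction; for unlinked loci c = 0.5, so D halves every generation once the mating pattern that created it stops. The sketch below encodes only that textbook case; the function name and the starting value are hypothetical.

```python
def ld_after(d0, recomb_fraction, generations):
    """LD coefficient after t generations of random mating: D_t = D_0 * (1 - c)^t."""
    if not 0.0 <= recomb_fraction <= 0.5:
        raise ValueError("recombination fraction must lie in [0, 0.5]")
    if generations < 0:
        raise ValueError("generations must be non-negative")
    return d0 * (1.0 - recomb_fraction) ** generations


if __name__ == "__main__":
    # Unlinked SNPs (c = 0.5) starting from a hypothetical D of 0.08.
    for t in range(4):
        print(t, ld_after(0.08, 0.5, t))
```

Under continued assortative mating by ancestry, new LD between unlinked loci is regenerated each generation, so the observed decline would be slower than this idealized halving.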
Creative activities in music represent a complex cognitive function of the human brain, whose biological basis is largely unknown. In order to elucidate the biological background of creative activities in music we performed genome-wide linkage and linkage disequilibrium (LD) scans in musically experienced individuals characterised for self-reported composing, arranging and non-music related creativity. The participants consisted of 474 individuals from 79 families, and 103 sporadic individuals. We found promising evidence for linkage at 16p12.1-q12.1 for arranging (LOD 2.75, 120 cases), 4q22.1 for composing (LOD 2.15, 103 cases) and Xp11.23 for non-music related creativity (LOD 2.50, 259 cases). Surprisingly, statistically significant evidence for linkage was found for the opposite phenotype of creative activity in music (neither composing nor arranging; NCNA) at 18q21 (LOD 3.09, 149 cases), which contains cadherin genes like CDH7 and CDH19. The locus at 4q22.1 overlaps the previously identified region of musical aptitude, music perception and performance, giving further support for this region as a candidate region for a broad range of music-related traits. The other regions at 18q21 and 16p12.1-q12.1 are also adjacent to previously identified loci for musical aptitude. Pathway analysis of the genes suggestively associated with composing suggested an overrepresentation of the cerebellar long-term depression pathway (LTD), which is a cellular model for synaptic plasticity. The LTD also includes cadherins and AMPA receptors, whose component GSG1L was linked to arranging. These results suggest that molecular pathways linked to memory and learning via LTD affect music-related creative behaviour. Musical creativity is a complex phenotype where a common background with musicality and intelligence has been proposed. Here, we implicate genetic regions affecting music-related creative behaviour, which also include genes with neuropsychiatric associations.
We also propose a common genetic background for music-related creative behaviour and musical abilities at chromosome 4.