Genotype Data Quality Control LIMS ICRISAT



GENOTYPE DATA QUALITY CONTROL 

With advances in DNA sequencer-based technologies, it has become possible to automate several steps of the genotyping process, leading to increased throughput. The main limitation in high-throughput microsatellite genotyping is the manual editing of allele calls that is still required. During SSR genotyping, inconsistencies in allele labels arise when software such as the ABI-Prism Genotyper is used for allele calling. Such software is mostly semi-automatic and requires manual intervention, which creates a bottleneck in throughput. If accurate allele calling is desired, a person must check every automated call by inspecting the electropherograms.

We have implemented an existing algorithm as a software program that automates allele calling and bins the outputs returned by the ABI Genescan/Genotyper software. This algorithm derives representative discrete allele sizes from the observed allele size data from microsatellite markers using least-squares minimization procedures (Idury and Cardon, Genome Research 7:1104-1109, 1997). The program extends the algorithm to allow specification of ploidy level. It generates useful summary statistics and a quality index for each marker, and it checks for the presence of allelic drift so that investigators are notified of potential problems in their raw data when the variability is too large for automated processing. The program also produces a histogram for visual inspection of both the observed and the inferred allele sizes. The “allelobin” software has been written in C as well as in Java and is available to all.

Allelobin within LIMS: The ICRISAT Laboratory Information Management System (LIMS) incorporates this algorithm to make allele labeling uniform. The user loads the non-integer base-pair values obtained from fragment analysis software into the system as Excel sheets. These sheets can then be merged across experiments in a study. The user is alerted if duplicates or null calls are present; otherwise the merged files may be submitted to the allele-binning program, which automatically classifies allele sizes into discrete bins. The “Allelobin” program incorporated within the LIMS is a variation of the original algorithm, rewritten in Java. The variation uses additional statistics, namely the median and the median absolute deviation as a measure of dispersion, and additionally rounds off the bin median to preserve the genetically expected repeat lengths of the marker. The user may download the output from this program, which includes a summary output file containing the newly called alleles, summary statistics and a histogram output.
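The two dispersion statistics named above can be illustrated with a minimal Python sketch. It is not the Allelobin code: the function name `median_and_mad` is hypothetical, and the rounding rule applied to the bin median is not described in enough detail here to reproduce, so it is left out.

```python
from statistics import median


def median_and_mad(sizes):
    """Return (median, median absolute deviation) for the sizes in one bin.

    Hypothetical helper for illustration only.
    """
    m = median(sizes)
    # MAD: the median of the absolute deviations from the median.
    mad = median(abs(x - m) for x in sizes)
    return m, mad
```

For the sizes 100.0, 101.0, 102.0, 103.0 and 110.0, the median is 102.0 and the median absolute deviation is 1.0; the outlying value 110.0 has little influence on either statistic.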

THE METHOD

The objective is to classify the allele sizes accurately into discrete bins in the presence of variability in the dataset, which consists of allele sizes derived from existing software. The output of the method is a categorical allele designated for each data point, along with a measure of accuracy for the bin classification. The least-squares minimization procedure is used to define the bins.


$\rho$ is the repeat length of the marker.

$N$ is the number of individuals genotyped.

$A_j$ denotes the observed allele sizes in the dataset to be binned ($j = 1, 2, \ldots, 2N$ for a diploid).

$T$ is the maximum possible number of alleles, $T = 1 + [(\max_j A_j - \min_j A_j)/\rho]$.

$L_i$ is the lower boundary of bin $i$, where $i = 1, 2, \ldots, T$.

The aim is to determine the optimal $L_i$, $i = 1, 2, \ldots, T$, from the allele sizes obtained with Genescan/Genotyper software.
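As a small worked example of the formula for T, the Python sketch below assumes that the square brackets denote the integer part (floor) of the quotient; that reading is an assumption, and the function name `max_alleles` is hypothetical.

```python
import math


def max_alleles(sizes, rho):
    """T = 1 + [(max - min) / rho], reading [.] as the integer part (assumption)."""
    return 1 + math.floor((max(sizes) - min(sizes)) / rho)
```

With observed sizes spanning 150.0 to 162.0 and a repeat length of 3, the range covers four repeat units, so at most five distinct alleles are possible.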


The basic method sorts allele sizes in ascending order. The initial lower boundaries of bins are set.

$L_1 = A_1 - \rho$ and $L_i = L_{i-1} + \rho = A_1 + (i-2)\rho$ for $i = 2, \ldots, T$.

Each $A_j$ belongs to one and only one of the $T$ bins: $A_j$ is a member of bin $i$ if $L_i \le A_j < L_i + \rho$.
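The initial boundaries and the bin membership test can be sketched in Python as follows. This is an illustration under stated assumptions rather than the program's code: the bins are taken to be half-open intervals of width ρ starting at each lower boundary, and the names `initial_boundaries` and `bin_index` are hypothetical.

```python
def initial_boundaries(sizes, rho, n_bins):
    """Lower boundaries: the first is (smallest size - rho), then steps of rho."""
    first = min(sizes) - rho
    return [first + i * rho for i in range(n_bins)]


def bin_index(size, boundaries, rho):
    """Return the 0-based index of the bin with lower <= size < lower + rho.

    Returns None when the size lies outside every bin.
    """
    for i, lower in enumerate(boundaries):
        if lower <= size < lower + rho:
            return i
    return None
```

For sizes 150.0, 153.0 and 156.0 with a repeat length of 3 and three bins, the boundaries are 147.0, 150.0 and 153.0, so 150.0 falls in the second bin and 153.0 in the third.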

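The average variation within bins is then calculated using the formula:

$$V_w = \sum_{i=1}^{T} \sum_{j=1}^{2N} f_i(j)\,(A_j - M_i)^2$$

where $f_i(j) = 1$ if $A_j$ falls in bin $i$ and $0$ otherwise, and the bin median $M_i = L_i + \rho/2$ is the basis for modelling allelic dispersion.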
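The within-bin variation can be sketched in Python as a double loop over bins and observations. The sketch assumes that each observation contributes only to the bin that contains it, and it returns the plain sum of squared deviations; dividing by the number of observations to obtain an average would not change which set of boundaries minimizes it. The function name `within_bin_variation` is hypothetical.

```python
def within_bin_variation(sizes, boundaries, rho):
    """Sum of squared deviations of each size from the centre of its bin."""
    total = 0.0
    for lower in boundaries:                 # loop over bins
        centre = lower + rho / 2             # bin centre: lower boundary + rho/2
        for a in sizes:                      # loop over observations
            if lower <= a < lower + rho:     # observation belongs to this bin
                total += (a - centre) ** 2
    return total
```

With boundaries 147.0 and 150.0 and a repeat length of 3, the sizes 148.0 and 152.0 each lie 0.5 from their bin centres (148.5 and 151.5), giving a total of 0.5.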


The optimization step computes $V_w$ over $k = \rho/s$ trials, with the step size $s$ set to a small value, say 0.01, such that the following two conditions hold simultaneously:

$L_i(k) = L_i(k-1) + s$ for $i = 1, 2, \ldots, T$

$L_i(k) = L_{i-1}(k) + \rho$ for $i = 2, \ldots, T$

For example, with $\rho = 3$ and $s = 0.01$ there are $k = 300$ trials. At every successive trial the bin boundaries shift by $s$, the bin width remains $\rho$, the number of bins remains $T$, and no bins overlap. The boundary set that gives the smallest $V_w$ is retained.
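The shifting procedure can be sketched in Python as a simple grid search. This is a minimal illustration, not the Allelobin implementation: it assumes the bins form a regular grid of width ρ, so the bin containing a size is found by integer division, and it keeps the first position with the smallest sum of squared deviations. The names `sum_sq_dev` and `best_shift` are hypothetical.

```python
import math


def sum_sq_dev(sizes, first_lower, rho):
    """Sum of squared deviations of each size from the centre of its bin."""
    total = 0.0
    for a in sizes:
        i = math.floor((a - first_lower) / rho)   # index of the bin holding a
        centre = first_lower + (i + 0.5) * rho    # centre of that bin
        total += (a - centre) ** 2
    return total


def best_shift(sizes, rho, step=0.01):
    """Slide the first lower boundary in increments of `step` over one bin width."""
    trials = round(rho / step)
    start = min(sizes) - rho                      # initial first lower boundary
    best_lower, best_v = start, sum_sq_dev(sizes, start, rho)
    for k in range(1, trials):
        lower = start + k * step
        v = sum_sq_dev(sizes, lower, rho)
        if v < best_v:
            best_lower, best_v = lower, v
    return best_lower, best_v
```

For sizes 150.0, 153.0 and 156.0 with a repeat length of 3 and a coarse step of 0.5, the search settles on a first lower boundary of 148.5, which puts every size exactly at the centre of its bin.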

Occasionally, the optimal bin set estimated by the least-squares optimization procedure may not accurately reflect the observed data. This can happen when bin boundaries are slightly shifted, causing adjacent alleles to be binned inappropriately, because the spacing between allele bins may differ slightly from ρ. The method of Idury and Cardon defines an additional parameter, δ, to reflect this "allelic drift".

At each of the k iterations, δ is evaluated at small increments, t, across a range of allowable drift values. Thus, the kth iteration comprises $(\max\delta - \min\delta)/t$ trials indexed by $l$, setting $L_i(k,l) = L_i(k)\,(1 + \delta(l))$, where $\delta(1) = \min\delta$ and $\delta(l) = \delta(l-1) + t$ for $l > 1$. The spacing between adjacent alleles is thus kept constant at $\rho(1 + \delta)$ rather than $\rho$.
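The drift search for a single boundary position can be sketched in Python as below. It is an illustration only: the allowable drift range and increment (`d_min`, `d_max`, `t`) are hypothetical defaults, since no values are given here, and the function names are not taken from the program. Each candidate δ scales the first lower boundary and the bin width by (1 + δ), and the candidate with the smallest sum of squared deviations is returned.

```python
import math


def variation_with_drift(sizes, first_lower, rho, delta):
    """Sum of squared deviations when boundaries and width are scaled by (1 + delta)."""
    lower = first_lower * (1 + delta)
    width = rho * (1 + delta)
    total = 0.0
    for a in sizes:
        i = math.floor((a - lower) / width)   # index of the bin holding a
        centre = lower + (i + 0.5) * width    # centre of that bin
        total += (a - centre) ** 2
    return total


def best_drift(sizes, first_lower, rho, d_min=-0.02, d_max=0.02, t=0.005):
    """Try delta = d_min, d_min + t, ... and return the best candidate."""
    trials = round((d_max - d_min) / t)
    candidates = [d_min + l * t for l in range(trials)]
    return min(candidates,
               key=lambda d: variation_with_drift(sizes, first_lower, rho, d))
```

In the full procedure this search would be nested inside the loop over boundary positions, so that every position is evaluated at every candidate drift value.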

THE PROGRAM AND ITS FEATURES

  • Determines the representative discrete allele sizes
  • Provides summary statistics of inferred alleles
  • Calculates a quality index for each marker
  • Calculates allelic drift, if any
  • Plots the histogram of the inferred and observed allele sizes
  • Stores results in Excel file(s)
  • Allows the user to specify ploidy levels

This page is a copy of the GCPwiki pages on Data Quality Control, http://cropwiki.irri.org/gcp/index.php/Genotype_data_qualitycontrol



      
