Automated microsatellite allele binning for population genetics workflows
All microsatellite analysis software expects allele sizes as integers, while allele scoring produces allele sizes with two decimals that depend not
only on fragment length, but also on fluorescent dye and GC content. Therefore, allele binning is not a trivial task. Tandem2 fills a gap in the microsatellite workflow
by rounding allele sizes to valid integers, depending on the microsatellite repeat units. Publication-ready vector graphics output shows the allele size distribution and
visualizes the rounding method. The average rounding error is reported and indicates the overall quality of the microsatellite data.
Tandem2 runs natively on 64-bit Mac OS X. Earlier versions of Tandem, including Ruby source code that runs on Mac and Windows, are available at www.evolution.unibas.ch/salzburger/software.htm. New features of Tandem2 include a brand-new graphical user interface, a simplified input format (see the example file bundled with Tandem2 in the disk image), highlighting of problematic alleles in table format, the possibility to edit these allele lengths directly in Tandem2, and export formats for MSA, Convert, Arlequin, Genepop, Structure, and Beast. If you'd like to see other export formats or features implemented in Tandem2, or if you have trouble running Tandem2, please contact the authors.
When analyzing microsatellites, biologists commonly use software like Genemapper (Applied Biosystems) to score the sizes of fluorescently labeled PCR products. These products are a combination of forward primer, flanking region, microsatellite, flanking region, and reverse primer. Of these, both primer sizes are known, and flanking region sizes are assumed to be constant among all individuals. Thus, variation in PCR product sizes among individuals should directly reflect different numbers of microsatellite repeats. In the case of dinucleotide repeats, alleles are expected to be either only even or only odd integers. However, Genemapper calculates allele sizes by comparison with labeled size standards of known size that are added to all samples before running them on a capillary sequencer. Errors are introduced by minor differences between runs and capillaries, the precision limit of sequencers, and imperfect linear regressions between size standard and PCR product run lengths. As a result, calculated allele sizes are hardly ever integers. Instead, Genemapper reports allele sizes to two decimals. Its built-in automated binning method requires reference data, which is often not available in population genetic studies, and has a number of problems associated with it (Amos et al. 2007). Yet all microsatellite analysis software expects integer allele sizes.
This is where Tandem2 comes in. Tandem2 reads tab-delimited versions of Excel sheets and rounds all allele sizes to integers. But instead of simply rounding to the nearest even or odd number (or to other numbers following tri-, tetra-, or longer nucleotide repeat patterns), Tandem2 finds the most consistent way of rounding. It transforms all observed allele sizes using the power function
(transformed allele size) = a + b × (observed allele size)^c
and exhaustively optimizes parameters a, b, and c, so that rounding errors are minimal when rounding transformed allele sizes to integers that fit the expected nucleotide repeat patterns (e.g. when rounding transformed allele sizes only to even, or only to odd integers).
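The optimization described above can be illustrated with a minimal Python sketch. This is not Tandem2's actual implementation: the function names, the coarse grid search, and the search ranges for a, b, and c are all assumptions made for illustration. The ranges are deliberately kept close to the identity transformation (a near 0, b and c near 1), because an unconstrained search could trivially reduce the rounding error by collapsing all alleles into a single bin (for example with b = 0). The bin offset (even versus odd integers for dinucleotide repeats) is searched alongside the three parameters.

```python
from itertools import product


def nearest_in_pattern(x, repeat, offset):
    """Return the integer closest to x among offset, offset +/- repeat, ..."""
    return offset + repeat * round((x - offset) / repeat)


def transform(size, a, b, c):
    """Power function from the text: a + b * size ** c."""
    return a + b * size ** c


def mean_rounding_error(sizes, a, b, c, repeat, offset):
    """Average distance between transformed sizes and their nearest bins."""
    errors = []
    for s in sizes:
        t = transform(s, a, b, c)
        errors.append(abs(t - nearest_in_pattern(t, repeat, offset)))
    return sum(errors) / len(errors)


def fit_power_function(sizes, repeat=2):
    """Grid search over a, b, c and the bin offset.

    All search ranges below are illustrative assumptions, not Tandem2's.
    """
    a_grid = [i / 10 for i in range(-10, 11)]     # -1.0 .. 1.0
    b_grid = [(196 + i) / 200 for i in range(9)]  # 0.98 .. 1.02
    c_grid = [(198 + i) / 200 for i in range(5)]  # 0.99 .. 1.01
    best = None
    for a, b, c, offset in product(a_grid, b_grid, c_grid, range(repeat)):
        error = mean_rounding_error(sizes, a, b, c, repeat, offset)
        if best is None or error < best[0]:
            best = (error, a, b, c, offset)
    return best


def bin_alleles(sizes, a, b, c, repeat, offset):
    """Round transformed sizes to integers that fit the repeat pattern."""
    return [nearest_in_pattern(transform(s, a, b, c), repeat, offset)
            for s in sizes]


# Made-up observed sizes for one dinucleotide locus.
observed = [199.58, 201.63, 203.61, 209.57, 215.62]
error, a, b, c, offset = fit_power_function(observed, repeat=2)
binned = bin_alleles(observed, a, b, c, 2, offset)
print(binned, round(error, 3))
```

In this sketch, several parameter combinations can fit equally well (for instance, shifting all alleles onto even or onto odd integers), so only the distances between binned alleles, not their absolute values, should be read as meaningful.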
Tandem2's way of rounding is made transparent by HTML output that includes publication-ready SVG vector graphic plots. These show, per locus, the original allele size distribution as well as the fit of the data to the corresponding repeat size pattern. Two plots are given per locus, one for the full range of allele sizes, and one focussing on the part with the highest density. Per locus, the optimized parameters a, b, c, and the average rounding error are reported alongside other relevant information. Rounding error outliers are highlighted and indicate either individuals that should be removed from the data set or problems with the specified repeat size.
The consistency of Tandem2's rounding scheme becomes obvious from these plots. Grey vertical lines indicate bin centers after allele transformation. If these match the peaks of the allele size distribution (shown in black), Tandem2 successfully optimized all parameters of the power function, so that rounding errors are minimized. This means that Tandem2 was able to bin alleles in the most consistent way relative to each other. If you are working with your own microsatellite data set that you are not going to combine with data sets from other laboratories, and if your next step is a population genetic analysis with software like MSA, Convert, Arlequin, Genepop, or Structure, this is sufficient, as these programs only use relative distances between alleles and never the absolute values. In these cases, relative consistency is all you need to worry about.
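As a rough illustration of how the average rounding error and the outliers relate, the following Python sketch computes per-allele rounding errors for already transformed allele sizes and flags those above a cutoff. The function names, the sample labels, and the cutoff of 0.3 are hypothetical; the text does not state which criterion Tandem2 uses to highlight outliers.

```python
def rounding_errors(transformed, repeat, offset):
    """Distance of each transformed allele size to its nearest bin center."""
    errors = []
    for t in transformed:
        center = offset + repeat * round((t - offset) / repeat)
        errors.append(abs(t - center))
    return errors


def flag_outliers(names, transformed, repeat, offset, cutoff=0.3):
    """Return the mean rounding error and the alleles exceeding the cutoff.

    The cutoff is an assumption for illustration, not Tandem2's criterion.
    """
    errors = rounding_errors(transformed, repeat, offset)
    mean_error = sum(errors) / len(errors)
    outliers = [(n, t, e) for n, t, e in zip(names, transformed, errors)
                if e > cutoff]
    return mean_error, outliers


# Made-up transformed sizes for a dinucleotide locus binned to even integers.
names = ["ind1", "ind2", "ind3", "ind4"]
transformed = [200.02, 201.97, 204.05, 206.90]
mean_error, outliers = flag_outliers(names, transformed, repeat=2, offset=0)
for name, size, err in outliers:
    print(f"{name}: transformed size {size} is {err:.2f} from its bin center")
```

An allele that falls almost exactly between two bins, like the last one here, may point to a scoring problem in that individual or to a wrongly specified repeat size, and is worth checking by hand.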
How to cite Tandem2
Matschiner M, Salzburger W (2009) TANDEM: integrating automated allele binning into genetics and genomics workflows. Bioinformatics, 25(15), 1982-1983.