Passport Data Quality





Passport data play a crucial role in unlocking the genetic diversity held in the world's genebanks, an effort that draws on new technology and a growing knowledge of the genome.
Core collections are selected on the basis of passport data and subsequently genotyped with molecular markers. Studies of physiological contrasts, association genetics, and genotyping panels for DArT studies become feasible with well-chosen, rationally selected germplasm panels, each representing a specific niche of diversity or a cross section of diversity. Successful selection of such panels depends on the quality of the passport data.

Passport data are a relatively simple type of data. The standard for the basic passport data structure is formulated in the Multi-Crop Passport Descriptors (MCPD) list (FAO/IPGRI 2001), a simple list of passport descriptors that provides an easy structure for data exchange. In discussions of data quality, which is a prerequisite for rational data usage, data quality is often defined as “fitness for use”. To get a firmer grip on the concept, different dimensions of data quality, and classifications of these dimensions, have been proposed; they are generally multidimensional but still subjective.
In the domain of passport data, the following dimensions are recognized; they are considered the most relevant and important, and are therefore the most commonly used:
  • completeness, referring to the portion of real-world objects represented in the data set or, from a more technical point of view, the fraction of non-null values,
  • correctness, describing to what extent the data are accurate, where correctness means the nearness of a value to the correct real-world value, and
  • consistency, specifying the fraction of data (records, values) not violating given business rules (e.g. integrity constraints).
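
As an illustration, two of these dimensions can be computed directly from the records. The following is a minimal sketch, not a prescribed procedure: the sample records are invented, the field names are only modelled on MCPD-style descriptors, and the latitude rule is an assumed example of a business rule. Correctness is left out because it requires reference values to compare against.

```python
# Hypothetical passport records. Field names are modelled on MCPD-style
# descriptors; the values are invented for illustration only.
RECORDS = [
    {"ACCENUMB": "A001", "GENUS": "Oryza", "ORIGCTY": "IND", "DECLATITUDE": 21.5},
    {"ACCENUMB": "A002", "GENUS": "Oryza", "ORIGCTY": None, "DECLATITUDE": 95.0},
    {"ACCENUMB": "A003", "GENUS": None, "ORIGCTY": "PHL", "DECLATITUDE": None},
]
FIELDS = ["ACCENUMB", "GENUS", "ORIGCTY", "DECLATITUDE"]


def completeness(records, fields):
    """Fraction of non-null values over all (record, field) cells."""
    total = len(records) * len(fields)
    if total == 0:
        return 0.0
    filled = sum(1 for r in records for f in fields if r.get(f) is not None)
    return filled / total


def latitude_in_range(record):
    """Assumed example rule: a latitude, when present, lies in [-90, 90]."""
    lat = record.get("DECLATITUDE")
    return lat is None or -90.0 <= lat <= 90.0


def consistency(records, rules):
    """Fraction of records that violate none of the given rules."""
    if not records:
        return 0.0
    ok = sum(1 for r in records if all(rule(r) for rule in rules))
    return ok / len(records)


if __name__ == "__main__":
    print("completeness:", completeness(RECORDS, FIELDS))
    print("consistency:", consistency(RECORDS, [latitude_in_range]))
```

In this sample, three of the twelve cells are empty and one of the three records breaks the latitude rule. Missing values are deliberately not counted as rule violations, so that completeness and consistency measure different things.
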
Data quality assessment itself serves two further purposes:
  • it is required to estimate the relevance, significance or, more generally, the value of the results of analyses based on the data, and
  • it indicates where improvements of data quality are necessary and allows the cost-benefit ratio of those improvements to be evaluated.
These data quality assessments result in a set of quality scores, which are assigned to the individual dimensions.

After the scores have been associated with the data, an explicit data quality model is required for interpreting the quality information and for ranking or selecting sources.
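
One simple form such a quality model could take is a weighted average of the per-dimension scores. This is a sketch under stated assumptions: the source names, the scores and the weights below are all hypothetical, and a weighted average is only one of many possible models.

```python
# Hypothetical per-dimension scores in [0, 1] for three data sources.
SCORES = {
    "source_a": {"completeness": 0.90, "correctness": 0.70, "consistency": 0.80},
    "source_b": {"completeness": 0.60, "correctness": 0.95, "consistency": 0.90},
    "source_c": {"completeness": 0.50, "correctness": 0.50, "consistency": 0.50},
}

# Assumed weights: how much each dimension matters for one particular use.
WEIGHTS = {"completeness": 0.2, "correctness": 0.5, "consistency": 0.3}


def overall_score(scores, weights):
    """Weighted average of the dimension scores."""
    total_weight = sum(weights.values())
    return sum(scores[dim] * w for dim, w in weights.items()) / total_weight


def rank_sources(all_scores, weights):
    """Source names ordered from highest to lowest overall score."""
    return sorted(
        all_scores,
        key=lambda name: overall_score(all_scores[name], weights),
        reverse=True,
    )


if __name__ == "__main__":
    for name in rank_sources(SCORES, WEIGHTS):
        print(name, round(overall_score(SCORES[name], WEIGHTS), 3))
```

Because the weights encode a particular use of the data, the same scores can produce a different ranking for a different purpose, which matches the “fitness for use” view of data quality.
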
If the data quality does not fulfill the requirements, data cleaning techniques must be applied: normalization and transformation, outlier detection, identification and repair of missing values, duplicate detection, and domain-specific approaches.
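
Three of these techniques can be sketched in a few lines. The example below is illustrative and does not describe any specific tool: the choice of key fields, the sample values and the threshold of 3.5 for the modified z-score are assumptions.

```python
from statistics import median


def normalize(text):
    """Normalization: trim, collapse whitespace and lower-case a string."""
    return " ".join(text.split()).lower()


def find_duplicates(records, key_fields):
    """Duplicate detection: group record indices by normalized key fields."""
    groups = {}
    for index, record in enumerate(records):
        key = tuple(normalize(str(record.get(f, ""))) for f in key_fields)
        groups.setdefault(key, []).append(index)
    return [indices for indices in groups.values() if len(indices) > 1]


def mad_outliers(values, threshold=3.5):
    """Outlier detection with the modified z-score (median absolute deviation)."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread around the median, so nothing can be flagged
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


if __name__ == "__main__":
    # Hypothetical records: the first two differ only in case and spacing.
    records = [
        {"GENUS": "Oryza", "COLLNUMB": "X 12"},
        {"GENUS": " oryza ", "COLLNUMB": "x  12"},
        {"GENUS": "Zea", "COLLNUMB": "7"},
    ]
    print(find_duplicates(records, ["GENUS", "COLLNUMB"]))
    # Hypothetical collecting-site elevations in metres; 900 is suspect.
    print(mad_outliers([100, 102, 98, 101, 99, 900]))
```

Normalizing before comparing matters here: without it, the first two records would not be recognized as probable duplicates. Flagged values are candidates for review rather than confirmed errors, since an unusual value can still be correct.
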


The GCP makes use of these statistical quality check procedures for phenotypic and passport data.



  

GCP Bioinformatics
and Biometrics