MisPred uses five principles to detect and correct errors in databases, as detailed in BMC Bioinformatics.
Researchers have created a tool that identifies and corrects abnormal, incomplete, or mispredicted protein annotations in public databases. They suggest that it may significantly improve the quality of protein sequence data based on gene predictions.
The system, called MisPred, was developed at the Institute of Enzymology of the Hungarian Academy of Sciences, Budapest. The tool rates annotations according to five principles: extracellular or transmembrane proteins must have appropriate secretory signals; a protein with intra- and extra-cellular parts must have a transmembrane segment; extracellular and nuclear domains must not occur in a single protein; the number of amino acid residues in closely related members of a globular domain family must fall into a relatively narrow range; and a protein must be encoded by exons located on a single chromosome.
There are some exceptions to these rules, points out Laszlo Patthy, who led the study. “Some secreted proteins may truly lack secretory signal peptides since they are subject to leaderless protein secretion. Similarly, it cannot be excluded at present that transchromosomal chimeras can be formed and may have normal physiological functions.
“Nevertheless,” he concludes, “the fact that MisPred analyses of protein sequences of the Swiss-Prot database identified very few such exceptions indicates that the rules of MisPred are generally valid.”
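The five rules listed above lend themselves to simple automated checks. The following is a minimal Python sketch of that idea, not MisPred's actual implementation: the `ProteinAnnotation` record, its field names, the location labels, and the `EXPECTED_DOMAIN_LENGTH` table (including the `example_domain` range) are all hypothetical and chosen only for illustration. In line with the exceptions noted above, a flagged rule marks a candidate for review rather than a proven error.

```python
from dataclasses import dataclass


@dataclass
class ProteinAnnotation:
    """Hypothetical annotation record; field names are illustrative only."""
    has_secretory_signal: bool
    has_transmembrane_segment: bool
    domain_locations: set   # e.g. {"extracellular", "cytoplasmic", "nuclear"}
    domain_lengths: dict    # domain family name -> length in residues
    exon_chromosomes: set   # chromosomes carrying the encoding exons


# Hypothetical length ranges (in residues) for known domain families.
EXPECTED_DOMAIN_LENGTH = {"example_domain": (90, 110)}


def check_annotation(p, expected=EXPECTED_DOMAIN_LENGTH):
    """Return the numbers (1-5) of the rules that the annotation violates."""
    violations = []
    extracellular = "extracellular" in p.domain_locations

    # Rule 1: extracellular or transmembrane proteins need a secretory signal.
    if (extracellular or p.has_transmembrane_segment) and not p.has_secretory_signal:
        violations.append(1)

    # Rule 2: intra- and extracellular parts require a transmembrane segment.
    if extracellular and "cytoplasmic" in p.domain_locations \
            and not p.has_transmembrane_segment:
        violations.append(2)

    # Rule 3: extracellular and nuclear domains must not co-occur.
    if extracellular and "nuclear" in p.domain_locations:
        violations.append(3)

    # Rule 4: domain lengths must fall within the family's expected range.
    # Families missing from the table are skipped rather than flagged.
    for family, length in p.domain_lengths.items():
        if family in expected:
            low, high = expected[family]
            if not low <= length <= high:
                violations.append(4)
                break

    # Rule 5: all exons must lie on a single chromosome.
    if len(p.exon_chromosomes) > 1:
        violations.append(5)

    return violations


suspect = ProteinAnnotation(
    has_secretory_signal=False,
    has_transmembrane_segment=False,
    domain_locations={"extracellular", "nuclear"},
    domain_lengths={"example_domain": 60},
    exon_chromosomes={"1", "7"},
)
print(check_annotation(suspect))  # [1, 3, 4, 5]
```

Returning rule numbers rather than a single pass/fail verdict keeps each flag traceable to the principle it violates, which matters when some violations may turn out to be genuine biological exceptions.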
The researchers found that the absence of expected signal peptides and violation of domain integrity account for the majority of mispredictions. “Interestingly, even the manually curated UniProtKB/Swiss-Prot dataset is contaminated with mispredicted or abnormal proteins, although to a much lesser extent than UniProtKB/TrEMBL or the EnsEMBL or GNOMON predicted entries,” notes the team.
The MisPred tool is described online in BMC Bioinformatics.