AVID: An integrative framework for discovering functional relationships among proteins
Background and usage of the web site

For a complete description of AVID, testing of its performance, and examples of applying it to the yeast proteome, see Jiang, T. & Keating, A. E. BMC Bioinformatics 2005, 6:136 (1 Jun 2005).

Please contact keating-web@mit.edu with any questions or problems.

Determining the functions of uncharacterized proteins is one of the most pressing problems in the postgenomic era. Large scale protein-protein interaction assays, global mRNA expression analyses and systematic protein localization studies provide experimental information that can be used for this purpose. We present AVID (Annotation Via Integration of Data), a computational method that integrates such information with sequence data to generate networks reflecting functional similarities among proteins. Edges in the networks connect proteins predicted to share a common molecular function, to participate in a common biological process, or to co-localize to the same cellular component, according to descriptors defined by the Gene Ontology (GO) consortium. Functional similarities inferred from the AVID networks are ~77% accurate for molecular function or cellular component and ~65% accurate for biological process. The networks are used to assign highly specific GO annotations to uncharacterized proteins. When applied to the yeast Saccharomyces cerevisiae proteome, AVID provides new, highly detailed functional annotations for ~50% of yeast proteins, including 1,490 proteins with no previous annotation in GO. Estimated accuracy for these assignments is ~66% for molecular function and cellular component and ~52% for biological process.

AVID (Annotation via Integration of Data) is a computational method for predicting Gene Ontology (GO) annotation terms using high-throughput experimental and sequence data. The method works by constructing functional correlation networks in which proteins are linked if they are likely to share a common GO descriptor. The networks are used to assign very specific functional annotations to individual proteins.

This web site provides access to predictions of molecular function, biological process and cellular component for proteins from the yeast Saccharomyces cerevisiae. It also gives the user access to the underlying functional correlation networks. Because the networks are more accurate than the precise predictions that result from them, and because they contain more information, we encourage users to explore the networks for their genes of interest.


AVID GO terms are a subset of all of the functional descriptors defined by the GO consortium. They are the set of terms that have no further subcategories, as of early 2004 when this work was performed. These terms are typically quite detailed, and thus provide maximal information about protein function, within the context of the GO framework. AVID predicts AVID GO terms.
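
As an illustration of the definition above, the following minimal sketch selects the terms with no further subcategories from a small hierarchy. The hierarchy, the term names and the function name are hypothetical and are not real GO identifiers or part of AVID itself.

```python
def leaf_terms(children):
    """Return the terms that have no child terms (no further subcategories)."""
    return {term for term, kids in children.items() if not kids}


# Hypothetical toy hierarchy: each term maps to its direct subcategories.
toy_children = {
    "root": ["binding", "catalysis"],
    "binding": ["dna_binding"],
    "catalysis": [],
    "dna_binding": [],
}

# Terms at the bottom of the toy hierarchy, analogous to AVID GO terms.
specific_terms = leaf_terms(toy_children)
```

In this toy example, "catalysis" and "dna_binding" are selected because neither has subcategories.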

MF - molecular function, as defined by GO

BP - biological process, as defined by GO

CC - cellular component, as defined by GO

Neighbors of a given protein are those ORFs connected to it in our functional correlation networks. They are considered likely to share a GO term with the protein of interest.
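
A network of this kind can be thought of as a list of linked ORF pairs. The sketch below shows one way to look up the neighbors of a protein from such a list. The ORF names and the edge-list layout are placeholders chosen for illustration and do not reflect the actual format of the site's output.

```python
from collections import defaultdict


def build_neighbors(edges):
    """Build an undirected neighbor lookup from a list of (orf, orf) pairs."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return neighbors


# Hypothetical edges; ORF_A to ORF_D are placeholder names.
toy_edges = [("ORF_A", "ORF_B"), ("ORF_A", "ORF_C"), ("ORF_C", "ORF_D")]

network = build_neighbors(toy_edges)
neighbors_of_a = network["ORF_A"]
```

Here the neighbors of ORF_A are ORF_B and ORF_C, the two ORFs directly linked to it.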

A novel prediction is a prediction of an AVID GO term for a protein that did not have any existing GO annotation when this work was performed.

A refined prediction is a prediction of an AVID GO term for a protein that was already described at a more general level when this work was performed.

A known annotation is a detailed functional descriptor (that meets the definition of an AVID GO term) that already existed in GO before our work.

Level-N (N = 2, 3 or 4) parents of AVID GO terms are more general functional descriptors than those predicted by AVID. They are GO terms of which the AVID predictions are subcategories. They are generated for novel AVID GO predictions by looking up which MF level 2, BP level 3 or CC level 4 descriptors are parents of the predicted terms in the GO hierarchies. These less descriptive terms may be correct more often than the AVID GO terms themselves. However, AVID was not designed to make this sort of low-level (more general) prediction and we have not tested it rigorously on this task. This information is only included because it may be suggestive to experimentalists using our site.
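
The lookup of level-N parents can be sketched as follows. This is only an illustration under stated assumptions: the root of a hierarchy is taken to be level 1, a term's level is taken to be its shortest distance from the root, and the hierarchy and term names are hypothetical. The level convention actually used for the site may differ.

```python
from collections import deque


def term_levels(children, root):
    """Assign levels by breadth-first search; the root is level 1 (assumption)."""
    levels = {root: 1}
    queue = deque([root])
    while queue:
        term = queue.popleft()
        for child in children.get(term, []):
            if child not in levels:
                levels[child] = levels[term] + 1
                queue.append(child)
    return levels


def ancestors(parents, term):
    """Collect every term reachable by following parent links upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def level_n_parents(children, root, term, n):
    """Return the ancestors of `term` that sit at level `n`."""
    # Invert the child lists so that each term maps to its direct parents.
    parents = {}
    for parent, kids in children.items():
        for kid in kids:
            parents.setdefault(kid, []).append(parent)
    levels = term_levels(children, root)
    return {a for a in ancestors(parents, term) if levels.get(a) == n}


# Hypothetical toy hierarchy, not real GO terms.
toy_children = {
    "root": ["binding", "catalysis"],
    "binding": ["nucleic_acid_binding"],
    "nucleic_acid_binding": ["dna_binding"],
    "catalysis": [],
    "dna_binding": [],
}

level2 = level_n_parents(toy_children, "root", "dna_binding", 2)
```

Because a GO term can have more than one parent, the result is returned as a set rather than a single term.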

Data sources
Predictions are made using data from the following sources, with abbreviations as indicated. For each combination of data source (below) and GO category (MF, BP or CC), a coefficient reflecting the correlation of the data with the relevant GO descriptors was computed. These correlation coefficients are a rough indicator of how important each data source is for establishing links among proteins in the functional correlation networks. In the output, for each linked pair of proteins in one of the networks (MF, BP or CC) we indicate which data sources contributed to the prediction, as well as the correlation coefficient that captured that contribution in the calculation. See the paper for further details of how these correlation coefficients are generated and used.

UCSF local refers to the cellular localization data of Huh, W.K. et al. Global analysis of protein localization in budding yeast. Nature 425, 686-91 (2003).

Y2H refers to the large-scale yeast two-hybrid experiments of Ito et al. and Uetz et al.
Ito, T. et al. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci U S A 98, 4569-74 (2001).
Uetz, P. et al. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 403, 623-7 (2000).

MIPS cplx. refers to complexes among yeast proteins archived at MIPS, including data from the affinity purification experiments of Gavin et al. and Ho et al.
Ho, Y. et al. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 415, 180-3 (2002).
Gavin, A.C. et al. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 415, 141-7 (2002).

sequence similarity refers to sequence similarity, defined as a PSI-BLAST E value of less than 0.001 after three iterations.
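
The E value cutoff above amounts to a simple filter over pairwise hits. The sketch below assumes the PSI-BLAST results have already been parsed into (query, subject, E value) tuples. The tuple layout, the ORF names and the function name are hypothetical.

```python
E_VALUE_CUTOFF = 0.001  # threshold stated above


def similar_pairs(hits, cutoff=E_VALUE_CUTOFF):
    """Keep pairs whose E value is strictly less than the cutoff."""
    return [(query, subject) for query, subject, e_value in hits if e_value < cutoff]


# Hypothetical parsed hits; ORF names and E values are invented for illustration.
toy_hits = [
    ("ORF_A", "ORF_B", 1e-20),
    ("ORF_A", "ORF_C", 0.5),
    ("ORF_B", "ORF_D", 0.001),
]

pairs = similar_pairs(toy_hits)
```

A hit with an E value of exactly 0.001 is excluded, since the definition requires a value of less than 0.001.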

mRNA profile refers to similarity of mRNA expression profiles in the NCBI Gene Expression Omnibus.