Journal of Molecular Biology
Volume 294, Issue 5, 17 December 1999, Pages 1351-1362
Regular article
Sequence and structure-based prediction of eukaryotic protein phosphorylation sites

https://doi.org/10.1006/jmbi.1999.3310

Abstract

Protein phosphorylation at serine, threonine or tyrosine residues affects a multitude of cellular signaling processes. How is specificity in substrate recognition and phosphorylation by protein kinases achieved? Here, we present an artificial neural network method that predicts phosphorylation sites in independent sequences with a sensitivity in the range from 69 % to 96 %. As an example, we predict novel phosphorylation sites in the p300/CBP protein that may regulate interaction with transcription factors and histone acetyltransferase activity. In addition, serine and threonine residues in p300/CBP that can be modified by O-linked glycosylation with N-acetylglucosamine are identified. Glycosylation may prevent phosphorylation at these sites, a mechanism named yin-yang regulation.

The prediction server is available on the Internet at http://www.cbs.dtu.dk/services/NetPhos/ or via e-mail to [email protected]

Introduction

Protein kinases catalyze phosphorylation events that are essential for the regulation of cellular processes such as metabolism, proliferation, differentiation, and apoptosis (Kolibaba and Druker, 1997; Hunter, 1998; Johnson et al., 1996, 1998; Pinna and Ruzzene, 1996; Graves et al., 1997). This very large family of enzymes shares homologous catalytic domains, and the mechanism of substrate recognition may be similar despite large variation in sequence. Crystallization studies indicate that a region of between seven and 12 residues surrounding the acceptor residue contacts the kinase active site (Songyang et al., 1994).

The specificity of protein kinases is dominated by acidic, basic, or hydrophobic residues adjacent to the phosphorylated residue, but the large variation makes it difficult to inspect protein sequences manually and predict the location of biologically active sites. This prompted us to investigate whether the fuzzy sequence patterns can be recognized using artificial neural network techniques. Neural networks are capable of classifying even highly complex and non-linear biological sequence patterns, where correlations between positions are important. The network recognizes the patterns seen during training, and retains the ability to generalize and recognize similar, but non-identical, patterns. Artificial neural networks have been used extensively in biological sequence analysis (Wu, 1997; Baldi and Brunak, 1998). Since the determinants of phosphorylation sites are probably no longer than about ten residues, most local sequence alignment tools, such as BLAST and FASTA, will not be useful for detecting phosphorylation sites: searches with such short patterns return a large number of irrelevant hits in the protein databases, including hits to non-phosphorylated proteins.
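To make the idea of presenting a short sequence context to a neural network concrete, the following is a minimal sketch of how a window around each candidate acceptor residue could be turned into a numeric input vector. The sparse (one-hot) encoding, the window size of nine residues, the padding symbol, and all function names are assumptions chosen for illustration; the text above does not specify the encoding or network architecture that was actually used.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD = "X"  # hypothetical padding symbol for positions beyond the sequence ends


def extract_window(sequence, center, flank=4):
    """Return the (2*flank + 1)-residue window centered on index `center`."""
    padded = PAD * flank + sequence + PAD * flank
    # Index i of the original sequence sits at i + flank in the padded string,
    # so the window starts at padded index `center`.
    return padded[center:center + 2 * flank + 1]


def one_hot(window):
    """Encode each residue as 20 binary values; padding or unknown symbols become all zeros."""
    vector = []
    for residue in window:
        vector.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
    return vector


def candidate_sites(sequence, acceptors="STY", flank=4):
    """Yield (position, residue, window) for every potential acceptor residue."""
    for i, residue in enumerate(sequence):
        if residue in acceptors:
            yield i, residue, extract_window(sequence, i, flank)


if __name__ == "__main__":
    demo = "MKRSPSAAY"  # made-up sequence, not taken from any database
    for position, residue, window in candidate_sites(demo):
        print(position, residue, window, sum(one_hot(window)))
```

Each window yields a vector of 9 × 20 = 180 binary values that a classifier could take as input; a separate classifier per acceptor residue type would be one possible design.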

The related proteins p300 and CBP (CREB (cAMP-response-element-binding)-binding protein) integrate molecular signals at the level of gene transcription and chromatin modification. p300 and CBP interact with the transcription factors CREB, Jun and Fos, the viral oncoproteins E1a and SV40 large T antigen, and the kinases pp90RSK and cyclin E-complexed cyclin-dependent kinase (CDK)-2 (Shikama et al., 1997; Ait-Si-Ali et al., 1998). These interactions may be regulated by reversible phosphorylation of p300/CBP. We demonstrate that regions of p300/CBP that have been shown to interact with other molecules contain probable phosphorylation sites. In addition, we describe sites that may be regulated by both phosphorylation and glycosylation by N-acetylglucosamine (GlcNAc), a regulatory mechanism described as yin-yang dynamic phosphorylation/glycosylation (Hart et al., 1995; Hart, 1997).

The general sequence context at experimentally verified phosphorylation sites

Based on the large sets of experimentally verified phosphorylation sites, sequence logos were generated for each of the three acceptor residues, tyrosine, serine, and threonine (Figure 1). The sequence logos emphasize residues that are frequently found in the context of the phosphorylation sites. The logos do not show the specificity determinants of any single kinase, but rather the overall features of all experimentally verified sites.
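As a rough illustration of what a sequence logo summarizes, the sketch below computes the per-position information content of a set of aligned windows using the standard Shannon formulation (the maximum of log2(20) bits minus the observed entropy). It is an assumption that this plain formulation, without a small-sample correction, matches how the logos in Figure 1 were produced; the function name and the toy windows are hypothetical.

```python
import math
from collections import Counter


def information_content(windows, alphabet_size=20):
    """Per-position information in bits: log2(alphabet_size) minus Shannon entropy."""
    length = len(windows[0])
    if any(len(w) != length for w in windows):
        raise ValueError("all windows must have the same length")
    max_bits = math.log2(alphabet_size)
    bits = []
    for pos in range(length):
        counts = Counter(w[pos] for w in windows)
        total = sum(counts.values())
        entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
        bits.append(max_bits - entropy)
    return bits


if __name__ == "__main__":
    # Toy aligned windows (made up): variable first position, conserved last two.
    toy = ["AAS", "CAS", "DAS", "EAS"]
    for pos, value in enumerate(information_content(toy)):
        print(pos, round(value, 3))
```

Positions where one residue dominates score close to the maximum, while positions with no preference score near zero, which is why a logo over all kinases highlights only the features shared across many sites.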

Tyrosine sequence logo

For tyrosine phosphorylation sites, we found that tryptophan, a

Discussion

The neural network approach presented here for the prediction of phosphorylation sites is top-down, in the sense that an overall, general approach to kinase specificity was taken. This is in contrast to the classical approach, which has been to use a bottom-up philosophy, where the specificity of a single kinase is studied in great detail.

The classical approach is based on determination of the activity of purified protein kinases using in vitro assays with either naturally occurring peptides or

Data sets extracted from PhosphoBase

Experimentally verified phosphorylation sites were extracted mainly from PhosphoBase (Kreegipuu et al., 1999), which is available from http://www.cbs.dtu.dk/databases/PhosphoBase/.

The phosphoproteins were mostly from mammalian sources, with a few examples from viruses or plants. The data set consisted of 584 serine sites (251 protein entries), 108 threonine sites (85 protein entries), and 210 tyrosine sites (98 protein entries). No sites were identical within a 9-mer sequence. Negative examples
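The requirement that no two sites be identical within a 9-mer can be illustrated with a small filtering step. The sketch below is only one way such redundancy removal might be done; the record layout `(protein_id, sequence, position)`, the padding symbol, and the function names are hypothetical and are not taken from PhosphoBase or from the procedure actually used.

```python
def ninemer(sequence, center, pad="X"):
    """Return the 9-residue context centered on `center`, padded at the sequence ends."""
    flank = 4
    padded = pad * flank + sequence + pad * flank
    return padded[center:center + 2 * flank + 1]


def unique_sites(sites):
    """Keep the first site seen for each distinct 9-mer context.

    `sites` is an iterable of (protein_id, sequence, position) tuples,
    where `position` is the 0-based index of the acceptor residue.
    """
    seen = set()
    kept = []
    for protein_id, sequence, position in sites:
        key = ninemer(sequence, position)
        if key not in seen:
            seen.add(key)
            kept.append((protein_id, position, key))
    return kept


if __name__ == "__main__":
    # Made-up records: the first two share the same 9-mer around the serine.
    records = [
        ("P1", "GGGGKRRASVAGGG", 8),
        ("P2", "TTKRRASVAGGTT", 6),
        ("P3", "MEEEYDDLLKA", 4),
    ]
    for entry in unique_sites(records):
        print(entry)
```

Filtering on the local context rather than on whole-protein identity keeps distinct sites from homologous proteins while avoiding duplicated training examples.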

Acknowledgements

We thank Kristoffer Rapacki for competent computer assistance and Jan Hansen for friendly and helpful discussions. This work was supported by the Danish National Research Foundation.

References (38)

  • C.H. Wu. Artificial neural networks for molecular sequence analysis. Comput. Chem. (1997)
  • S. Ait-Si-Ali et al. Histone acetyl-transferase activity of CBP is controlled by cycle-dependent kinases and oncoprotein E1A. Nature (1998)
  • A. Bairoch et al. The PROSITE database, its status in 1997. Nucl. Acids Res. (1997)
  • P. Baldi et al. Bioinformatics: The Machine Learning Approach. (1998)
  • N. Blom et al. Cleavage site analysis in picornaviral polyproteins: discovering cellular targets by neural networks. Protein Sci. (1996)
  • K. Chou et al. Protein subcellular location prediction. Protein Eng. (1999)
  • J. Felsenstein. Phylogeny inference package (version 3.2). Cladistics (1989)
  • L. Graves et al. Historical perspectives and new insights involving the MAP kinase cascades. Advan. Sec. Mess. Phos. Res. (1997)
  • J.E. Hansen et al. NetOglyc: prediction of mucin type O-glycosylation sites based on sequence context and surface accessibility. Glycoconj. J. (1998)

Edited by F. E. Cohen
