Abstract
The extracellular matrix is fast emerging as an important component mediating cell-cell interactions, along with its established role as a scaffold for cell support. Collagen, being the principal component of the extracellular matrix, has been implicated in a number of pathological conditions. However, collagens are complex protein structures belonging to a large family consisting of 28 members in humans; hence, there is a lack of in-depth information about their structural features. Annotating and appreciating the functions of these proteins is possible with the help of the numerous biocomputational tools that are currently available. This study reports a comparative analysis and characterization of the alpha-1 chain of human collagen sequences. Physico-chemical, secondary structural, functional and phylogenetic classification was carried out, based on which collagens 12, 14 and 20, which belong to the FACIT collagen family, have been identified as potential players in diseased conditions, owing to certain atypical properties such as a very high aliphatic index, a low percentage of glycine and proline residues and their proximity in evolutionary history. These collagen molecules might be important candidates to be investigated further for their role in skeletal disorders.
Keywords: Biocomputational tools, Collagen, Comparative characterization, Extracellular matrix
Background
The extracellular matrix (ECM) is an intricate network of macromolecules surrounding a substantial volume of cells. It comprises collagens, proteoglycans, glycoproteins and proteases [1]. These components are arranged in a highly organized manner and play a significant role not only as a scaffold to the cell but also in multiple processes such as cell migration, cell-cell interaction and cell proliferation [2]. Collagens are the most abundant component of the ECM. They form a triple helical structure with three distinct polypeptide chains, commonly known as the alpha chains. This triple helix possesses a characteristic repeating sequence, 'Gly-Xaa-Yaa' [3]. The presence of glycine as every third residue accounts for the stability of the helical structure, owing to its property of being the smallest amino acid. Xaa and Yaa can be any amino acid but are mostly occupied by proline residues. Thus, collagen is known to be a glycine- and proline-rich entity. Collagen proteins are synthesized as inactive precursor forms known as procollagens. The cleavage of the pro-peptides present at the N- and C-termini by peptidases forms the mature, active collagen molecules.
The collagen family is a large and complex family comprising 28 genetically distinct members found in humans [4]. This diverse family includes fiber-forming collagens, amorphous collagens, transmembrane collagens and a specialized type of collagen that forms unique structures [5]. The imperative role of collagen can be ascertained through the wide spectrum of pathological disorders that are associated with it. Mutations in the genes encoding collagen proteins can lead to a variety of diseases such as Osteogenesis Imperfecta, Ehlers-Danlos syndrome, Spondyloepiphyseal dysplasia and Multiple epiphyseal dysplasia [6]. Furthermore, variations in the collagen content or a significant remodeling of the collagen network can lead to several dysfunctions such as parenchymal disease, hypertensive heart disease, renal fibrosis, tumor and fibrotic diseases [7]. Computational biology is an emerging field that helps to unveil the hidden information of a protein structure, both at the genomics and proteomics level. It integrates biology with computational algorithms for a better understanding of complex molecules. Such analysis can be enhanced through a combinatorial dry-to-wet-lab approach, wherein the propositions made through the biocomputational findings help in providing a direction for further research during wet lab studies. Thus, it helps in appreciating and understanding the structural and functional roles of protein molecules, which are at the heart of human disorders. A similar approach has been adopted to characterize the family of matrix metalloproteinases (MMP), where in silico characterization was followed by experimental confirmation and MMP-7 was subsequently proved to be a potential target in cardiac hypertrophy [8,9]. Collagen and its derivatives, such as gelatin, act as substrates of MMPs and are involved in the development of many pathological conditions. An extensive analysis of the collagen family is hence essential to comprehend the process of matrix remodeling in diseases.
Our study reports a comparative characterization of alpha-1 sequences of human collagen using biocomputational tools. Of the three alpha chains present in collagen, alpha-1 was observed in all 28 human collagens; thus alpha-1 was used for further analysis of the collagen proteins. The physico-chemical, secondary structural, functional and phylogenetic analysis of alpha-1 sequences of human collagen was executed. This research aims to provide an insight into various protein attributes of collagen proteins and to characterize this large family. It also intends to propose potential members implicated in disease conditions, based on relative examination of the collagen family or the presence of any atypical characteristics in the collagen molecules. This would aid biologists in carrying out further investigations on these complex molecules, for which the basic structure analysis is of prime importance.
Methodology
Retrieval of human collagen alpha-1 protein sequences:
The complete alpha-1 protein sequences of all 28 members of the human collagen family reported to date were retrieved from UniProtKB/Swiss-Prot, a curated protein database (http://expasy.org/sprot/), in FASTA format with the help of the accession number provided for each collagen sequence (http://www.uniprot.org/) [10]. Complete information about the origin, attributes, annotation, ontologies, binary interactions and sequence of proteins was found in this knowledgebase.
Physico-chemical characterization of collagen family:
Various features including number of amino acids, molecular weight, theoretical isoelectric point (pI), amino acid composition (%), number of positively (Arg + Lys) and negatively charged (Asp + Glu) residues, extinction coefficient, instability index, aliphatic index and Grand Average of Hydropathicity (GRAVY) were computed using ExPASy's ProtParam tool, with the protein sequence in FASTA format as the input data type (http://expasy.org/tools/protparam.html). Other physico-chemical features including number of codons, bulkiness, polarity, refractivity, recognition factors, hydrophobicity, transmembrane tendency, percent buried residues, percent accessible residues, average area buried, average flexibility and relative mutability were calculated for primary structure characterization by ExPASy's ProtScale tool, with the retrieved protein sequence as the input data type (http://expasy.org/tools/protscale.html). A suitable amino acid scale was chosen for computation of each parameter in this sliding-window-based tool, which assigns each amino acid a numerical value known as the amino acid scale.
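Three of the quantities listed above (amino acid composition, aliphatic index and GRAVY) follow simple published formulas and can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ProtParam implementation: the function names and the toy sequence are hypothetical, the aliphatic index uses the commonly cited coefficients of Ikai (2.9 for valine, 3.9 for isoleucine plus leucine, applied to mole percentages), and GRAVY uses the Kyte-Doolittle hydropathy scale.

```python
from collections import Counter

# Kyte-Doolittle hydropathy scale (standard published values).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def composition_percent(seq):
    """Return the mole percentage of each residue present in seq."""
    seq = seq.upper()
    counts = Counter(seq)
    return {aa: 100.0 * n / len(seq) for aa, n in counts.items()}


def aliphatic_index(seq):
    """Aliphatic index: X(Ala) + 2.9*X(Val) + 3.9*(X(Ile) + X(Leu))."""
    pct = composition_percent(seq)
    return (pct.get("A", 0.0)
            + 2.9 * pct.get("V", 0.0)
            + 3.9 * (pct.get("I", 0.0) + pct.get("L", 0.0)))


def gravy(seq):
    """Grand average of hydropathicity: mean Kyte-Doolittle value."""
    seq = seq.upper()
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


# Hypothetical toy peptide, not a real collagen sequence.
toy_chain = "GPAGPVGPLG"
toy_pct = composition_percent(toy_chain)
toy_ai = aliphatic_index(toy_chain)
toy_gravy = gravy(toy_chain)
print(toy_pct["G"], toy_pct["P"], toy_ai, toy_gravy)
```

For a real analysis, the sequence string would be read from the retrieved FASTA record; sequences containing non-standard residue codes would need extra handling that this sketch omits.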
Secondary structural characterization of collagen family:
The secondary structural features of proteins, comprising alpha helix, 3₁₀ helix, Pi helix, beta bridge, extended strand, beta turn, bend region, random coil, ambiguous states and other states, were predicted using the Self-Optimized Prediction Method with Alignment (SOPMA) tool, which takes into account the information derived from alignment of protein sequences belonging to the same family (http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=npsa_sopma.html) [11]. The protein sequence in FASTA format was used as the input data type and the number of conformational states was set to four in order to predict Helix, Sheet, Coil and Turn. The other parameters were left at their default values.
Functional characterization of collagen family:
The Motif Scan tool was used to scan and identify all the known motifs, their nature and location in the selected alpha-1 protein sequences of the collagen family, based on a profile and pattern search (http://myhits.isb-sib.ch/cgi-bin/motif_scan) [12,13]. The protein sequence in FASTA format was used as the input data type and scanned against 'PROSITE Patterns', a protein profile database selected out of the eight available.
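The idea behind a pattern search of this kind can be illustrated by translating a PROSITE-style pattern into a regular expression and scanning a sequence with it. The sketch below is a simplified assumption of how such a scan might be done, not the Motif Scan implementation: the function name `prosite_to_regex`, the toy pattern (three consecutive Gly-Xaa-Yaa triplets) and the toy sequence are all hypothetical, and only the basic pattern syntax is handled (`x` for any residue, `[..]` allowed residues, `{..}` excluded residues, `(n)` or `(n,m)` repeats, `<` and `>` terminal anchors).

```python
import re


def prosite_to_regex(pattern):
    """Translate a basic PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Separate the residue specification from an optional repeat count.
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":                  # any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # excluded residues
        # Single letters and [..] sets are already valid regex syntax.
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


# Hypothetical toy pattern: three consecutive Gly-Xaa-Yaa triplets.
toy_pattern = "G-x(2)-G-x(2)-G-x(2)"
toy_regex = prosite_to_regex(toy_pattern)
toy_sequence = "MKGPPGAPGEKLL"           # invented sequence for illustration
toy_hit = re.search(toy_regex, toy_sequence)
print(toy_regex, toy_hit.span(), toy_hit.group())
```

A plain `re.search` reports only the first, non-overlapping hit; reporting every overlapping occurrence, as a motif scanner typically would, requires a lookahead-based search that is left out here for brevity.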
Phylogenetic classification of collagen family:
The human alpha-1 collagen protein sequences were aligned using the multiple sequence alignment tool ClustalW2, with the protein sequences in FASTA format as the input data type (http://www.ebi.ac.uk/Tools/msa/clustalw2/) [14]. The best alignment for the set of input sequences was computed and all the identities, similarities and differences were highlighted. The evolutionary relationships were established by constructing phylograms from the retrieved alignments using the Neighbor-Joining (NJ) method.
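The Neighbor-Joining procedure itself is compact enough to sketch. The following is a minimal, assumption-laden illustration of the textbook algorithm on a pairwise distance matrix; it is not the ClustalW2 code, the function name `neighbor_joining` is hypothetical, and the five-taxon distance matrix is a toy example rather than collagen data. At each step the pair minimizing Q(i,j) = (n-2)·d(i,j) - Σd(i,·) - Σd(j,·) is joined, branch lengths to the new node are computed, and distances from the new node to the remaining nodes are updated.

```python
def neighbor_joining(names, dist):
    """Run Neighbor-Joining on a symmetric distance table.

    names: list of taxon labels.
    dist:  dict of dicts, dist[a][b] = distance for every a != b.
    Returns (joins, last_edge), where joins is a list of
    (node_a, node_b, branch_a, branch_b) in the order they were joined
    and last_edge is (node_a, node_b, length) for the final two nodes.
    """
    d = {a: dict(dist[a]) for a in names}    # work on a copy
    nodes = list(names)
    joins = []
    while len(nodes) > 2:
        n = len(nodes)
        total = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        # Pick the pair with the smallest Q value (first one wins on ties).
        best = None
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                q = (n - 2) * d[a][b] - total[a] - total[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        # Branch lengths from the joined pair to their new parent node.
        len_a = 0.5 * d[a][b] + (total[a] - total[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        new = "(" + a + "," + b + ")"
        # Distances from the new node to every remaining node.
        d[new] = {}
        for c in nodes:
            if c not in (a, b):
                d[new][c] = d[c][new] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c not in (a, b)] + [new]
        joins.append((a, b, len_a, len_b))
    last_a, last_b = nodes
    return joins, (last_a, last_b, d[last_a][last_b])


# Toy five-taxon distance matrix (illustrative only, not collagen data).
toy_names = ["a", "b", "c", "d", "e"]
toy_matrix = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
toy_dist = {
    a: {b: toy_matrix[i][j] for j, b in enumerate(toy_names) if i != j}
    for i, a in enumerate(toy_names)
}
joins, last_edge = neighbor_joining(toy_names, toy_dist)
for join in joins:
    print(join)
print(last_edge)
```

On this toy matrix the first pair joined is a and b, with branch lengths 2 and 3. In practice the distances would be derived from the multiple sequence alignment, for example as the fraction of differing aligned positions, a step this sketch does not cover.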
Discussion
The functional role of the alpha-1 chain of each human collagen was analyzed (Table 1, see Supplementary material). The MultAlin tool was used to carry out protein sequence alignment, wherein a decreased sequence similarity was witnessed with an increasing number of input sequences [15]. It was inferred that the alpha-1 chain showed identities, similarities and differences at various positions along the protein sequence in all 28 members. The computation of amino acid composition of each human alpha-1 collagen sequence using ExPASy's ProtParam tool indicated very high percentages of glycine and proline as compared to other amino acids (Table 2, see Supplementary material). Glycine content in all collagens was more than 12%, except Collagen 12, 14 and 20 with values of 9.2, 10.9 and 11.5%, respectively. A high percentage of glycine accounts for the stability of the collagen triple helical structure, since incorporation of large amino acids can cause steric hindrance [16]. Furthermore, proline content was more than 10% in most collagens except Collagen 6, 12, 14 and 20, with values of 8.7, 8.6, 8.7 and 10%, respectively. Proline residues are equally essential: they point outward and stabilize the helix, and also act as structural disruptors of the secondary structural elements [17]. Thus, this not only helps collagen to act as a structural molecule but also aids in processes like cell-cell adhesion and migration.
Other essential physico-chemical parameters were also calculated using ExPASy's ProtParam and ProtScale tools (Table 3a and 3b, see Supplementary material). The pI values for 15 collagens were found to lie in the acidic range, while the remaining 13 lay in the alkaline range. Also, analysis of the instability index classified most collagens as stable (instability index <40), while Collagen 15, 17, 18, 20 and 26 were classified as unstable. Additionally, Collagen 20 was regarded as the most thermostable, with the highest aliphatic index (79.61), which describes the relative volume of the protein occupied by aliphatic side chains, closely followed by Collagen 14 (77.67) and 12 (75.45). Furthermore, the GRAVY values, signifying the interaction of collagens with water, were observed within a broad range of 0.261 to 0.919, while the hydrophobicity was found to range between 0.3335 of Collagen 20 (most hydrophilic) to 0.3335 of Collagen 18 (most hydrophobic).
SOPMA analysis of the secondary structural features of human collagen protein sequences showed a predominance of random coils, while β-turns accounted for the lowest percentage. α-helices were found to exceed extended strands in 13 collagens (Collagen 6, 9, 13, 15, 17, 18, 19, 21, 22, 23, 25, 26 and 28) (Table 4, see Supplementary material). A high percentage of random coils, facilitating self-assembly of the monomer units into well-defined structures, can also be linked to the high glycine and proline content, which is vital to provide the desired flexibility and the ability to bond with adjacent units in collagen monomers. This information on the packaging of secondary structural elements may assist in deriving potential tertiary protein structures and also promote advancements in protein engineering.
Motifs of a protein can provide significant knowledge about the protein's mechanism of action and nature. Signature motifs were identified within the protein sequences using the Motif Scan tool (Table 5, see Supplementary material). Collagens 1, 2 and 3 were found to possess the VWFC domain signature, known to participate in oligomerization and hence forming an imperative part of complex-forming proteins [18,19]. The presence of this domain can be correlated with the known characteristic of collagen molecules passing through a complex assembly process to form a triple helical structure. Additionally, Collagens 7 and 28 were observed to have the pancreatic trypsin inhibitor (Kunitz) family signature.
The 28 human alpha-1 collagen protein sequences were aligned based on sequence homology and a phylogram was constructed with the distance-based NJ method to establish evolutionary relations in the complex collagen family (Figure 1). Various clusters with close relationships were identified, including Collagen 1 and 2, 12 and 14, and 23 and 25, relating closely to Collagen 3, 20 and 17, respectively. These molecules may be analyzed together for possible similar properties owing to their close evolutionary history. Also, Collagen 4, widely reported in diseases, shows similarity to Collagen 13, which may also potentially play a role in pathological conditions. Collagen 9, 12, 14 and 20 are included in the FACIT (Fibril Associated Collagens with Interrupted Triple helices) group of the collagen family, where the alpha-1 chain of Collagen 9 is recognized for its role in skeletal and rheumatoid disorders, especially in multiple epiphyseal dysplasias [20]. Owing to their structural similarities, atypical features and proximity in evolutionary history, the other members of this family are also prime candidates for investigation of their role in skeletal disorders.
Figure 1.
Conclusion
With the availability of a wide variety of computational tools, an in-depth study of the information hidden behind a protein structure is possible. Comparative studies of the members belonging to a protein family and their physico-chemical, secondary structural, functional and phylogenetic classification can give extensive information on a protein's structure, function and its relationship with other members of the family. Characterization of the alpha-1 chain of the vast collagen protein family in humans yielded new insights. Based on this comparative characterization, we propose Collagen 12, 14 and 20 as a potential protein cluster showing similarity in many properties along with atypical behavior. These three proteins possess low glycine and proline content, a very high aliphatic index and a close evolutionary relation. Since these collagens form part of the FACIT collagen family, of which Collagen 9 is established for its role in skeletal disorders, these collagen molecules might be possible disease candidates. These findings can help biologists working with ECM proteins concentrate their research on collagen proteins proposed as putative players in diseased conditions. Moreover, this study is a model for researchers to fine-tune their specific systems and comprehend their outcomes better.
Supplementary material
Data 1
97320630008026S1.pdf (110.4KB, pdf)
Acknowledgments
This work was supported by the research grant awarded to Dr. Vibha Rani by the Department of Science and Technology, Government of India (SR/FT/LS-006/2009: Sept 4, 2009). We acknowledge Jaypee Institute of Information Technology, Deemed to be University, for providing the required support.
Footnotes
Citation: Nassa et al, Bioinformation 8(1): 026-033 (2012)
References
- 1. H Jarvelainen, et al. Pharmacol Rev. 2009;61:198.
- 2. SLK Bowers, et al. J Mol Cell Cardiol. 2010;48:474.
- 3. KE Kadler, et al. Biochem J. 1996;316:1.
- 4. L Cen, et al. Pediat Res. 2008;63:492.
- 5. ML Tanzer. J Orthop Sci. 2006;11:326. doi: 10.1007/s00776-006-1012-2.
- 6. JF Bateman, et al. Nat Rev Genetics. 2009;10:173.
- 7. A Gonzalez, et al. Med Clin North Am. 2004;88:83. doi: 10.1016/s0025-7125(03)00125-1.
- 8. A Jaiswal, et al. Bioinformation. 2011;6:23.
- 9. A Chhabra, et al. J Comput Intell Bioinformatics. 2011;4:1.
- 10. A Bairoch, R Apweiler. Nucleic Acids Res. 1997;25:31.
- 11. C Geourjon, G Deleage. Comput Appl Biosci. 1995;11:681. doi: 10.1093/bioinformatics/11.6.681.
- 12. M Pagni, et al. Nucleic Acids Res. 2007;35:W433.
- 13. CJA Sigrist, et al. Nucleic Acids Res. 2010;38:D161. doi: 10.1093/nar/gkp885.
- 14. MA Larkin, et al. Bioinformatics. 2007;23:2947. doi: 10.1093/bioinformatics/btm404.
- 15. F Corpet. Nucleic Acids Res. 1988;16:10881.
- 16. M Van der Rest, R Garrone. FASEB J. 1991;5:2814.
- 17. GN Ramachandran, et al. Biochimica et Biophysica Acta. 1973;332:166.
- 18. LT Hunt, WC Barker. Biochem Biophys Res Commun. 1987;144:876. doi: 10.1016/s0006-291x(87)80046-3.
- 19. J Voorberg, et al. J Cell Biol. 1991;113:195. doi: 10.1083/jcb.113.1.195.
- 20. M Czarny-Ratajczak, et al. Am J Hum Genet. 2001;69:969. doi: 10.1086/324023.