Optimized Spectral Clustering Methods For Potentially Divergent Biological Sequences

Johny Matar; Hicham El  Khoury; Jean-Claude Charr; Christophe Guyeux; Stephane Chretien

doi:10.32350/sir.83.03

Johny Matar 1. Université de Bourgogne Franche-Comté, Besançon, France 2. LaRRIS, Faculty of Sciences, Lebanese University, Fanar, Lebanon
Hicham El Khoury LaRRIS, Faculty of Sciences, Lebanese University, Fanar, Lebanon
Jean-Claude Charr Université de Bourgogne Franche-Comté, Besançon, France
Christophe Guyeux Université de Bourgogne Franche-Comté, Besançon, France
Stephane Chretien Laboratoire ERIC, Université Lyon 2, Bron, France

Keywords: affinity matrices, biological sequence clustering, clustering quality analysis, eigenmap, Gaussian Mixture Model (GMM) spectral clustering

Abstract

Abstract Views: 0

Various recent researches in bioinformatics demonstrated that clustering is a very efficient technique for sequence analysis. Spectral clustering is particularly efficient for highly divergent sequences and GMMs (Gaussian Mixture Models) are often able to cluster overlapping groups if given an adequately designed embedding. The current study used spectral embedding and Mixture Models for clustering potentially divergent biological sequences. The research approach resulted in a pipeline consisting of the following four steps. The first step consists of aligning the biological sequences. The pairwise affinity of the sequences is computed in the second step. Then the Laplacian Eigenmap embedding of the data is performed in the third step. Finally, the last step consists of a GMM-based clustering. Improving the quality of the generated clustering and the performance of this approach is directly related to the enhancement of each one of these four steps. The main contribution is proposing four GMM-based algorithms for automatically selecting the optimal number of clusters and optimizing the clustering quality. A clustering quality assessment method, based on phylogenetic trees, is also proposed. Moreover, a performance study and analysis have been conducted while testing different clustering methods and GMM implementations. Experimental results demonstrated the superiority of using the BIC (Bayesian Information Criterion) for selecting the optimal GMM configuration. Significant processing speed improvements were also recorded for the implementation of the proposed algorithms.

Downloads

Download data is not yet available.

References

Li Y, Xia F, Wang Y, Yan C, Jia A, Zhang Y. Characterization of a highly divergent Sugarcane mosaic virus from Canna indica L. by deep sequencing. BMC Microbiol. 2019;19:1–8. https://doi.org/10.1186/s12866-019-1636-y

Pentney W, Meila M. Spectral clustering of biological sequence data. Paper presented at: The AAAI Conference on Artificial Intelligence; July 9–13, 2005; Pittsburgh, Pennsylvania. https://cdn.aaai.org/AAAI/2005/AAAI05-133.pdf. Accessed December 13, 2023.

Paccanaro A, Casbon JA, Saqi MAS. Spectral clustering of protein sequences. Nucleic Acids Res. 2006;34(5):1571–1580. https://doi.org/10.1093/nar/gkj515

Matar J, El Khoury H, Charr JC, Guyeux C, Chrétien S. SpCLUST: towards a fast and reliable clustering for potentially divergent biological sequences. Comput Biol Med. 2019;114:e103439. https://doi.org/10.1016/j.compbiomed.2019.103439

Bocklet T, Maier A, Bauer JG, Burkhardt F, Noth E. Age and gender recognition for telephone applications based on gmm supervectors and support vector machines. Paper presented at: IEEE International Conference on Acoustics, Speech and Signal Processing; March 31–April 4, 2008; Las Vegas, USA. https://doi.org/10.1109/ICASSP.2008.4517932

Genovese M, Napoli E. ASIC and FPGA implementation of the gaussian mixture model algorithm for real-time segmentation of high definition video. IEEE Transac Very Large Scale Integ Syst. 2013;22(3):537–547. https://doi.org/10.1109/TVLSI.2013.2249295

McLachlan GJ, Peel D. Finite Mixture Models. John Wiley & Sons; 2004.

Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B Methodol. 1977;39(1):1–22. https://doi.org/10.1111/j.2517-6161.1977.tb01600.x

Wu CFJ. On the convergence properties of the EM algorithm. Ann Stat. 1983:95–103.

McLachlan GJ, Krishnan T. The EM Algorithm and Extensions. John Wiley & Sons; 2007.

Chrétien S, Hero AO. Acceleration of the EM algorithm via proximal point iterations. Paper presented at: IEEE International Symposium on Information Theory (Cat. No. 98CH36252); Auguzt 16–21, 1998; Cambridge, USA. https://doi.org/10.1109/ISIT.1998.709049

Chrétien S, Hero AOIII. Kullback proximal algorithms for maximum-likelihood estimation. IEEE Trans Info Theo. 2000;46(5):1800–1810. https://doi.org/10.1109/18.857792

Celeux G, Chrétien S, Forbes F, Mkhadri A. A component-wise EM algorithm for mixtures. J Comput Graph Stat. 2012;10:697–712. https://doi.org/10.1198/106186001317243403

Biernacki C, Castellan G, Chretien S, Guedj B, Vandewalle V. Pitfalls in mixtures from the clustering angle. Paper presented at: Working Group on Model-Based Clustering Summer Session, July17–23, 2016; Paris. https://hal.science/hal-01419755/. Accessed December 13, 2023.

Biernacki C, Chrétien S. Degeneracy in the maximum likelihood estimation of univariate Gaussian mixtures with EM. Stat Probabil Lett. 2003;61:373–382. https://doi.org/10.1016/S0167-7152(02)00396-6

Shi M, Bermak A. An efficient digital VLSI implementation of Gaussian mixture models-based classifier. IEEE Trans Very Large Scale Integ Syst. 2006;14(9):962–974. https://doi.org/10.1109/TVLSI.2006.884048

Kulis B, Jordan MI. Revisiting k-means: new algorithms via Bayesian nonparametrics. arXiv Preprint. earXiv:1111.0352. 2011. https://doi.org/10.48550/arXiv.1111.0352

Cheng Y. Mean shift, mode seeking, and clustering. IEEE Trans Pattern Anal Machine Intell. 1995;17(8):790–799. https://doi.org/10.1109/34.400568

Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22(13):1658–1659. https://doi.org/10.1093/bioinformatics/btl158

Edgar RC. Search and clustering orders of magnitude faster than BLAST. Bioinformatics. 2010;26(19):2460–2461. https://doi.org/10.1093/bioinformatics/btq461

Ghodsi M, Liu B, Pop M. DNACLUST: accurate and efficient clustering of phylogenetic marker genes. BMC Bioinfo. 2011;12:e271. https://www.doi.org/10.1186/1471-2105-12-271

Matias Rodrigues JF, von Mering C. HPC-CLUST: distributed hierarchical clustering for large sets of nucleotide sequences. Bioinformatics. 2013;30(2):287–288. https://doi.org/10.1093/bioinformatics/btt657

Jiang L, Dong Y, Chen N, Chen T. DACE: a scalable DP-means algorithm for clustering extremely large sequence data. Bioinformatics. 2016;33:834–842. https://doi.org/10.1093/bioinformatics/btw722

Bruneau M, Mottet T, Moulin S, et al. A clustering package for nucleotide sequences using Laplacian Eigenmaps and Gaussian mixture model. Comput Biol Med. 2018;93:66–74. https://doi.org/10.1016/j.compbiomed.2017.12.003

Larrañaga P, Calvo B, Santana R, et al. Machine learning in bioinformatics. Brief Bioinfo. 2006;7:86–112. https://doi.org/10.1093/bib/bbk007

Reynolds D. Gaussian mixture models. In: Li SZ, Jain AK, eds. Encyclopedia of Biometrics. Springer; 2009; 827–832.

Vrieze SI. Model selection and psychological theory: a discussion of the differences between the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). Psychol Methods. 2012;17(2):e228. https://doi.org/10.1037/a0027127

Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32(5):1792–1797. https://doi.org/10.1093/nar/gkh340

Substitution matrix. Wikipedia. https://en.wikipedia.org/wiki/Substitution_matrix. Accessed September 27, 2018.

Waterman MS. Efficient sequence alignment algorithms. J Theo Biol. 1984;108(3):333–337. https://doi.org/10.1016/S0022-5193(84)80037-5

Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecul Biol Evol. 2013;30(4):772–780. https://doi.org/10.1093/molbev/mst010

Wright ES. DECIPHER: harnessing local sequence context to improve protein multiple sequence alignment. BMC Bioinfo. 2015;16:e322. https://doi.org/10.1186/s12859-015-0749-z

Larking MA, Blackshields G, Brown NP, et al. ClustalW and ClustalX version 2. Bioinformatics. 2007;23(21):2947–2948. https://doi.org/10.1093/bioinformatics/btm404

Von Luxburg U. A tutorial on spectral clustering. Stat Comput. 2007;17:395–416. https://doi.org/10.1007/s11222-007-9033-z

Langone R, Alzate C, Suykens JAK. Modularity-based model selection for kernel spectral clustering. Paper presented at: International Joint Conference on Neural Networks; July 31–August 5, 2011; San Jose, USA. https://doi.org/10.1109/IJCNN.2011.6033449. Accessed December 13, 2023.

Saade A, Krzakala F, Zdeborová L. Spectral clustering of graphs with the bethe hessian. Paper presented at: The 27th International Conference on Neural Information Processing Systems; December 8–13, 2014; Cambridge, United States. https://dl.acm.org/doi/proceedings/10.5555/2968826. Accessed December 13, 2023.

Dall'Amico L, Couillet R, Tremblay N. Optimized deformed laplacian for spectrum-based community detection in sparse heterogeneous graphs. Paper presented at: 33rd Conference on Neural Information Processing Systems (NeurIPS 2019); December 8–14, 2019; Vancouver, Canada.

Dall'Amico L, Couillet R, Tremblay N. Revisiting the Bethe-Hessian: improved community detection in sparse heterogeneous graphs. Paper presented at: 33rd Conference on Neural Information Processing Systems, December, 2019; Vancouver, Canada. https://inria.hal.science/hal-02429525/. Accessed December 13, 2023.

Scikit Learn. Gaussian mixture. https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html. Accessed December 13, 2023.

Scikit Learn. Spectral_embedding. https://scikit-learn.org/stable/modules/generated/sklearn.manifold.spectral_embedding.html. Accessed December 27, 2019.

Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: machine learning in python. J Mach Learn Res. 2011;12:2825–2830.

GitHub. Paperrune/GMM: C++ implementation of gaussian mixture model. https://github.com/paperrune/GMM. Accessed December 2, 2019.

GitHub. A gaussian mixture model library featuring BIC and AIC. https://github.com/johnymatar/GMM. Accessed October 10, 2021.

Raghavan VV, Gudivada VN, Govindaraju V, Rao CR. Cognitive computing: Theory and applications. Elsevier; 2016.

Campanella JJ, Bitincka L, Smalley J. MatGAT: an application that generates similarity/identity matrices using protein or DNA sequences. BMC Bioinfo. 2003;4:1–4. https://doi.org/10.1186/1471-2105-4-29

Myers EW, Miller W. Optimal alignments in linear space. Bioinformatics. 1988;4(1):11–17. https://doi.org/10.1093/bioinformatics/4.1.11

Guindon S, Dufayard JF, Lefort V, Anisimova M, Hordijk W, Gascuel O. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst Biol. 2010;59(3):307–321. https://doi.org/10.1093/sysbio/syq010

Lefort V, Longueville JE, Gascuel O. SMS: smart model selection in PhyML. Molecul Biol Evol. 2017;34(9):2422–2424. https://doi.org/10.1093/molbev/msx149

Wagner S, Wagner D. Comparing clusterings: an overview. Institute for Theoretical Computer Science (ITI). https://publikationen.bibliothek.kit.edu/1000011477. Accessed October 10, 2022.

Guyeux C, Chrétien S, Tayeh GB, Demerjian J, Bahi J. Introducing and comparing recent clustering methods for massive data management in the internet of things. J Sens Actuator Netw. 2019;8:e56. https://doi.org/10.3390/jsan8040056

Optimized Spectral Clustering Methods For Potentially Divergent Biological Sequences

Abstract

Downloads

References

Editor-in-Chief: Dr. Muhammad Ali

ISSN (E): 2521-2435
ISSN (P): 2521-2427
Publication Frequency: Quarterly
Abbreviated Key Title: Sci Inquiry Rev.
Parallel Title: SIR
Area of Publication: Pure and Applied Sciences (Chemistry,Life Sciences,Mathematics and Physics)

Optimized Spectral Clustering Methods For Potentially Divergent Biological Sequences

Abstract

Downloads

References

Editor-in-Chief: Dr. Muhammad Ali

ISSN (E): 2521-2435ISSN (P): 2521-2427Publication Frequency: QuarterlyAbbreviated Key Title: Sci Inquiry Rev.Parallel Title: SIR Area of Publication: Pure and Applied Sciences (Chemistry,Life Sciences,Mathematics and Physics)

ISSN (E): 2521-2435
ISSN (P): 2521-2427
Publication Frequency: Quarterly
Abbreviated Key Title: Sci Inquiry Rev.
Parallel Title: SIR
Area of Publication: Pure and Applied Sciences (Chemistry,Life Sciences,Mathematics and Physics)