Optimized Spectral Clustering Methods For Potentially Divergent Biological Sequences
Abstract
Abstract Views: 0Various recent researches in bioinformatics demonstrated that clustering is a very efficient technique for sequence analysis. Spectral clustering is particularly efficient for highly divergent sequences and GMMs (Gaussian Mixture Models) are often able to cluster overlapping groups if given an adequately designed embedding. The current study used spectral embedding and Mixture Models for clustering potentially divergent biological sequences. The research approach resulted in a pipeline consisting of the following four steps. The first step consists of aligning the biological sequences. The pairwise affinity of the sequences is computed in the second step. Then the Laplacian Eigenmap embedding of the data is performed in the third step. Finally, the last step consists of a GMM-based clustering. Improving the quality of the generated clustering and the performance of this approach is directly related to the enhancement of each one of these four steps. The main contribution is proposing four GMM-based algorithms for automatically selecting the optimal number of clusters and optimizing the clustering quality. A clustering quality assessment method, based on phylogenetic trees, is also proposed. Moreover, a performance study and analysis have been conducted while testing different clustering methods and GMM implementations. Experimental results demonstrated the superiority of using the BIC (Bayesian Information Criterion) for selecting the optimal GMM configuration. Significant processing speed improvements were also recorded for the implementation of the proposed algorithms.
Downloads
References
Li Y, Xia F, Wang Y, Yan C, Jia A, Zhang Y. Characterization of a highly divergent Sugarcane mosaic virus from Canna indica L. by deep sequencing. BMC Microbiol. 2019;19:1–8. https://doi.org/10.1186/s12866-019-1636-y
Pentney W, Meila M. Spectral clustering of biological sequence data. Paper presented at: The AAAI Conference on Artificial Intelligence; July 9–13, 2005; Pittsburgh, Pennsylvania. https://cdn.aaai.org/AAAI/2005/AAAI05-133.pdf. Accessed December 13, 2023.
Paccanaro A, Casbon JA, Saqi MAS. Spectral clustering of protein sequences. Nucleic Acids Res. 2006;34(5):1571–1580. https://doi.org/10.1093/nar/gkj515
Matar J, El Khoury H, Charr JC, Guyeux C, Chrétien S. SpCLUST: towards a fast and reliable clustering for potentially divergent biological sequences. Comput Biol Med. 2019;114:e103439. https://doi.org/10.1016/j.compbiomed.2019.103439
Bocklet T, Maier A, Bauer JG, Burkhardt F, Noth E. Age and gender recognition for telephone applications based on gmm supervectors and support vector machines. Paper presented at: IEEE International Conference on Acoustics, Speech and Signal Processing; March 31–April 4, 2008; Las Vegas, USA. https://doi.org/10.1109/ICASSP.2008.4517932
Genovese M, Napoli E. ASIC and FPGA implementation of the gaussian mixture model algorithm for real-time segmentation of high definition video. IEEE Transac Very Large Scale Integ Syst. 2013;22(3):537–547. https://doi.org/10.1109/TVLSI.2013.2249295
McLachlan GJ, Peel D. Finite Mixture Models. John Wiley & Sons; 2004.
Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B Methodol. 1977;39(1):1–22. https://doi.org/10.1111/j.2517-6161.1977.tb01600.x
Wu CFJ. On the convergence properties of the EM algorithm. Ann Stat. 1983:95–103.
McLachlan GJ, Krishnan T. The EM Algorithm and Extensions. John Wiley & Sons; 2007.
Chrétien S, Hero AO. Acceleration of the EM algorithm via proximal point iterations. Paper presented at: IEEE International Symposium on Information Theory (Cat. No. 98CH36252); Auguzt 16–21, 1998; Cambridge, USA. https://doi.org/10.1109/ISIT.1998.709049
Chrétien S, Hero AOIII. Kullback proximal algorithms for maximum-likelihood estimation. IEEE Trans Info Theo. 2000;46(5):1800–1810. https://doi.org/10.1109/18.857792
Celeux G, Chrétien S, Forbes F, Mkhadri A. A component-wise EM algorithm for mixtures. J Comput Graph Stat. 2012;10:697–712. https://doi.org/10.1198/106186001317243403
Biernacki C, Castellan G, Chretien S, Guedj B, Vandewalle V. Pitfalls in mixtures from the clustering angle. Paper presented at: Working Group on Model-Based Clustering Summer Session, July17–23, 2016; Paris. https://hal.science/hal-01419755/. Accessed December 13, 2023.
Biernacki C, Chrétien S. Degeneracy in the maximum likelihood estimation of univariate Gaussian mixtures with EM. Stat Probabil Lett. 2003;61:373–382. https://doi.org/10.1016/S0167-7152(02)00396-6
Shi M, Bermak A. An efficient digital VLSI implementation of Gaussian mixture models-based classifier. IEEE Trans Very Large Scale Integ Syst. 2006;14(9):962–974. https://doi.org/10.1109/TVLSI.2006.884048
Kulis B, Jordan MI. Revisiting k-means: new algorithms via Bayesian nonparametrics. arXiv Preprint. earXiv:1111.0352. 2011. https://doi.org/10.48550/arXiv.1111.0352
Cheng Y. Mean shift, mode seeking, and clustering. IEEE Trans Pattern Anal Machine Intell. 1995;17(8):790–799. https://doi.org/10.1109/34.400568
Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22(13):1658–1659. https://doi.org/10.1093/bioinformatics/btl158
Edgar RC. Search and clustering orders of magnitude faster than BLAST. Bioinformatics. 2010;26(19):2460–2461. https://doi.org/10.1093/bioinformatics/btq461
Ghodsi M, Liu B, Pop M. DNACLUST: accurate and efficient clustering of phylogenetic marker genes. BMC Bioinfo. 2011;12:e271. https://www.doi.org/10.1186/1471-2105-12-271
Matias Rodrigues JF, von Mering C. HPC-CLUST: distributed hierarchical clustering for large sets of nucleotide sequences. Bioinformatics. 2013;30(2):287–288. https://doi.org/10.1093/bioinformatics/btt657
Jiang L, Dong Y, Chen N, Chen T. DACE: a scalable DP-means algorithm for clustering extremely large sequence data. Bioinformatics. 2016;33:834–842. https://doi.org/10.1093/bioinformatics/btw722
Bruneau M, Mottet T, Moulin S, et al. A clustering package for nucleotide sequences using Laplacian Eigenmaps and Gaussian mixture model. Comput Biol Med. 2018;93:66–74. https://doi.org/10.1016/j.compbiomed.2017.12.003
Larrañaga P, Calvo B, Santana R, et al. Machine learning in bioinformatics. Brief Bioinfo. 2006;7:86–112. https://doi.org/10.1093/bib/bbk007
Reynolds D. Gaussian mixture models. In: Li SZ, Jain AK, eds. Encyclopedia of Biometrics. Springer; 2009; 827–832.
Vrieze SI. Model selection and psychological theory: a discussion of the differences between the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). Psychol Methods. 2012;17(2):e228. https://doi.org/10.1037/a0027127
Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32(5):1792–1797. https://doi.org/10.1093/nar/gkh340
Substitution matrix. Wikipedia. https://en.wikipedia.org/wiki/Substitution_matrix. Accessed September 27, 2018.
Waterman MS. Efficient sequence alignment algorithms. J Theo Biol. 1984;108(3):333–337. https://doi.org/10.1016/S0022-5193(84)80037-5
Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecul Biol Evol. 2013;30(4):772–780. https://doi.org/10.1093/molbev/mst010
Wright ES. DECIPHER: harnessing local sequence context to improve protein multiple sequence alignment. BMC Bioinfo. 2015;16:e322. https://doi.org/10.1186/s12859-015-0749-z
Larking MA, Blackshields G, Brown NP, et al. ClustalW and ClustalX version 2. Bioinformatics. 2007;23(21):2947–2948. https://doi.org/10.1093/bioinformatics/btm404
Von Luxburg U. A tutorial on spectral clustering. Stat Comput. 2007;17:395–416. https://doi.org/10.1007/s11222-007-9033-z
Langone R, Alzate C, Suykens JAK. Modularity-based model selection for kernel spectral clustering. Paper presented at: International Joint Conference on Neural Networks; July 31–August 5, 2011; San Jose, USA. https://doi.org/10.1109/IJCNN.2011.6033449. Accessed December 13, 2023.
Saade A, Krzakala F, Zdeborová L. Spectral clustering of graphs with the bethe hessian. Paper presented at: The 27th International Conference on Neural Information Processing Systems; December 8–13, 2014; Cambridge, United States. https://dl.acm.org/doi/proceedings/10.5555/2968826. Accessed December 13, 2023.
Dall'Amico L, Couillet R, Tremblay N. Optimized deformed laplacian for spectrum-based community detection in sparse heterogeneous graphs. Paper presented at: 33rd Conference on Neural Information Processing Systems (NeurIPS 2019); December 8–14, 2019; Vancouver, Canada.
Dall'Amico L, Couillet R, Tremblay N. Revisiting the Bethe-Hessian: improved community detection in sparse heterogeneous graphs. Paper presented at: 33rd Conference on Neural Information Processing Systems, December, 2019; Vancouver, Canada. https://inria.hal.science/hal-02429525/. Accessed December 13, 2023.
Scikit Learn. Gaussian mixture. https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html. Accessed December 13, 2023.
Scikit Learn. Spectral_embedding. https://scikit-learn.org/stable/modules/generated/sklearn.manifold.spectral_embedding.html. Accessed December 27, 2019.
Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: machine learning in python. J Mach Learn Res. 2011;12:2825–2830.
GitHub. Paperrune/GMM: C++ implementation of gaussian mixture model. https://github.com/paperrune/GMM. Accessed December 2, 2019.
GitHub. A gaussian mixture model library featuring BIC and AIC. https://github.com/johnymatar/GMM. Accessed October 10, 2021.
Raghavan VV, Gudivada VN, Govindaraju V, Rao CR. Cognitive computing: Theory and applications. Elsevier; 2016.
Campanella JJ, Bitincka L, Smalley J. MatGAT: an application that generates similarity/identity matrices using protein or DNA sequences. BMC Bioinfo. 2003;4:1–4. https://doi.org/10.1186/1471-2105-4-29
Myers EW, Miller W. Optimal alignments in linear space. Bioinformatics. 1988;4(1):11–17. https://doi.org/10.1093/bioinformatics/4.1.11
Guindon S, Dufayard JF, Lefort V, Anisimova M, Hordijk W, Gascuel O. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst Biol. 2010;59(3):307–321. https://doi.org/10.1093/sysbio/syq010
Lefort V, Longueville JE, Gascuel O. SMS: smart model selection in PhyML. Molecul Biol Evol. 2017;34(9):2422–2424. https://doi.org/10.1093/molbev/msx149
Wagner S, Wagner D. Comparing clusterings: an overview. Institute for Theoretical Computer Science (ITI). https://publikationen.bibliothek.kit.edu/1000011477. Accessed October 10, 2022.
Guyeux C, Chrétien S, Tayeh GB, Demerjian J, Bahi J. Introducing and comparing recent clustering methods for massive data management in the internet of things. J Sens Actuator Netw. 2019;8:e56. https://doi.org/10.3390/jsan8040056
Copyright (c) 2024 Johny Matar, Hicham El Khoury, Jean-Claude Charr, Christophe Guyeux, Stephane Chretien
This work is licensed under a Creative Commons Attribution 4.0 International License.