Comparative Analysis of Breast Cancer Detection using Cutting-edge Machine Learning Algorithms

Machine learning (ML) techniques have recently gained popularity in medical diagnosis. Medical professionals use them to learn about and detect the abnormalities associated with life-threatening chronic diseases. The increasing use of ML approaches may be due in part to better disease diagnosis enabled through improved symptom detection. The current study deployed different machine learning algorithms, including Decision Trees (DT), K-Nearest Neighbors (KNN), Multilayer Perceptron (MLP), Support Vector Machines (SVM), and Random Forest (RF), for early prediction of the disease and its symptoms. These models were capable of differentiating between benign and malignant cancer cells. Benign tumours are non-cancerous and, in most cases, non-lethal, and they are mostly confined to the area where they originate. Malignant cancer, in contrast, starts with abnormal cell growth that can quickly spread to nearby tissues and infiltrate adjacent cells, resulting in a potentially fatal condition. The Multilayer Perceptron (MLP) model provided the highest accuracy, 86%, of all the techniques compared.


I. INTRODUCTION
Cancer is a complex disease characterized by the uncontrolled growth of abnormal tissue in the body. Normally, old or damaged cells are replaced by new, healthy cells to maintain the healthy functioning of the body. In contrast, some damaged tissue grows incessantly and becomes a mass known as a tumour. Tumours are of two types, malignant and benign. The current study focused on breast cancer tumours and on Machine Learning Algorithms (MLAs). The breast is composed of two main types of tissue: glandular tissue and connective tissue. Glandular tissues are responsible for producing milk, whereas connective tissues provide structural support and shape the breast. Glandular tissues may turn into malignant tumours with the passage of time. Most breast cancers originate in the cells of the lobules, the anatomical structures that contain the milk-producing glands, or in the ducts embedded in the breast tissue, which act as the passageways that deliver milk from the lobules to the nipple. Breast stromal tissues, which are composed of fatty and fibrous connective tissues, can also give rise to breast cancer.
The overall structural deformation of a woman's breast tissue caused by the presence of a malignant tumour is influenced by age-related changes in the amount of fatty and fibrous tissues in her body [1]. The annual death rate has increased significantly with the rising number of breast cancer cases, and a large share of deaths has recently been attributed to breast cancer [2].
Most female patients do not have a good prognosis by the stage at which the disease is eventually recognised as a cancer of the breast tissue, which accounts for its high lethality. Limited awareness of the disease has led to last-stage detection, and therefore to poor prognosis, by which point malignancy has developed in the breast tissue. Breast cancer usually develops from a genetic abnormality, or defect, in the genetic code. Such an abnormality can cause erroneous gene expression, which becomes a prime reason for the development of the disease. Only 5-10% of cancer cases are due to abnormalities inherited from the mother or father. Instead, 85-90% of breast cancer cases are due to genetic abnormalities that arise with aging [3].
Accurate diagnosis of cancer is essential to optimize various aspects of therapy and to increase patients' chances of survival. In recent years, many researchers have put forth approaches for automatic cell classification to diagnose breast cancer. Machine learning techniques stand out among them for the classification and prediction of breast cancer among female patients [4]. Machine learning methods can identify cancer by determining whether a tumour is present, which makes them useful in the study of breast cancer. These tumours can also be predicted using machine learning algorithms (MLAs), whereas with conventional methods of diagnosis they frequently go undetected for a long time [5], thereby increasing the proportion of deaths caused by cancer.
Machine learning (ML) techniques have recently emerged as a highly relevant area of practical research and are very useful in the prompt diagnosis of breast cancer. Over the past ten years, the use of machine learning techniques for medical diagnosis has gained popularity. This increased use can be partly attributed to the better disease identification and symptom detection they provide [6].
The remainder of this study is organized as follows. Section II reviews the literature. Section III explains the experiment performed and the algorithms adopted. Section IV elaborates the results obtained. Section V discusses the results and concludes the research.

II. LITERATURE REVIEW
Machine Learning (ML) can be used to put forward fresh diagnostic hypotheses, helping to create a more customized therapeutic proposal. Several MLAs, such as Decision Trees, Multilayer Perceptrons, K-Nearest Neighbors, and Support Vector Machines (SVM), have been used in this study. The CSV file of the breast mammograms was taken first, and classification models trained to achieve the aforementioned objective were then applied. Benign tumours are non-cancerous and non-lethal and are confined to the area where they originate, whereas malignant cancer starts with abnormal cell growth and abnormal behaviour of the messages passed between cells [7]. There is a vast body of literature on this subject; the current study aims to deploy a computational method that predicts breast cancer accurately.
On the Wisconsin datasets, practitioners analyzed the performance of four classifiers: Naive Bayes (NB), SVM, Decision Tree (DT), and K-Nearest Neighbors (K-NN) [8]. SVM was identified as the best of the four, achieving an accuracy of 97.13% with the lowest error rate according to the confusion matrices. This figure matters because conventional diagnostic methods are more prone to errors, which can significantly affect the treatment strategy and staging of breast cancer. Currently, biopsy is the most accurate method for diagnosing breast cancer, and it directly affects medical professionals' decisions.
Haifeng developed an SVM-based ensemble model for the early prediction of cancer. The proposed ensemble was made up of six types of kernel functions and two types of SVM structures, a-SVM and C-SVM [9]. Two datasets from Wisconsin, namely WBC and WDBC, were used to test the model (SEER). The proposed model improved diagnostic accuracy compared with other studies that used a single SVM. The major disadvantage of this strategy is that it requires a long training period and is computationally expensive. As a result, its validity became doubtful, because patients with breast cancer are constantly at risk of the disease spreading.
Several deep learning and data mining techniques applied to cancer were examined by several practitioners [10]. They reported that only a few papers used genetic data, whereas imaging was used in the majority of publications. A variety of algorithms, such as CNN and Naive Bayes, were used with imaging techniques. Among Machine Learning Techniques (MLT), Decision Trees, SVM, and Random Forests (RF) were the most popular.
Researchers validated and applied different neural network (NN) techniques, especially for early-stage cancer classification. They found that most NNs were capable of identifying cancerous cells, which are typically difficult to distinguish because they frequently resemble healthy breast tissue. Furthermore, they showed that experts must observe these cells for a certain amount of time to identify their distinctive characteristics, including their structure and growth rate, which can differ from those of normal breast cells. The imaging method, however, required a lot of processing power to preprocess the images [11].
School of System and Technology Volume 3 Issue 1, Spring 2023
In his study, Sudarshan Nayak used 3D images to demonstrate the use of Machine Learning Algorithms (MLAs) for categorizing breast cancer. Many factors were taken into consideration before an abnormal mass in the breast parenchyma was classified. Based on the overall results, it was concluded that SVM is the best model for this purpose [12].
Similarly, Youness Khoudfi and Mohamed Bahaj, in a comparison of machine learning algorithms, found that SVM is the best classifier, with an accuracy of 97.9%, when compared with K-NN, RF, and NB, based on a multilayer perceptron (MLP) with 5 layers and 10-fold cross-validation [13]. Moreover, Ahmed [14] proposed a method for WBCD that combined a clustering strategy with a probabilistic support vector machine, achieving a prediction rate of 99.10% with the SVM technique.
A new approach was introduced by David A. [15], which employed linear discriminant analysis (LDA) to reduce the feature dimensionality and then applied SVM to the reduced feature set. On the BCWD dataset, the practitioner applied five ML algorithms: SVM, Logistic Regression (LR), Decision Tree (DT), Random Forest (RF), and KNN. The performance of these classifiers was assessed and compared once the results had been obtained. The study's main objective was to develop a confusion matrix to check precision and accuracy [16]. The support vector machine (SVM) outperformed all other classifiers, with the highest accuracy of 97.2%.
Using the Waikato Environment for Knowledge Analysis (WEKA) program, Valentina Mikhailova compared classification algorithms. The dataset for this study comprised 286 cases and 10 attributes. The J48, Naive Bayes, Random Forest (RF), MLP, K*, and SVM models were compared using various parameters, and performance metrics were used to assess the developed models [17]. The SVM algorithm and the J48 model offered the highest levels of accuracy, at 75.5% and 79.6%, respectively.
To classify the WBCD, the practitioner used Naive Bayes, SVM, and Decision Trees (DT); SVM produced the best results, with an accuracy of 96.99% [18]. In another study, clinical information from medical intensive care units was used, and machine learning techniques were applied for the early detection of disease in hospitalized patients over the course of 24 hours. KNN and logistic regression produced the highest accuracy when applied to the training data [19].
The biggest problem with breast cancer, according to Nithya et al. [20], is classifying breast tumours, because of the structural distortion these tumours cause. Determining the type of tumour is crucial for a correct prognosis because it determines the impact of the tumour on the breast tissue. Computer-aided diagnosis (CAD) was used to assess breast cancer tumours in the patient. Their main goal was to use data mining technologies to improve breast cancer prediction. The classification performance of many machine learning algorithms was enhanced using Bagging, Multiboot, Random Subspace, and Multilayer Perceptron.
Innovative Computing Review Volume 3 Issue 1, Spring 2023
Kourou et al. examined the classification of patients into high-risk and low-risk groups, using Artificial Neural Networks (ANN), Decision Trees (DT), and SVM to present a model for cancer risk [21].
A study was conducted on mammogram-based diagnosis and biopsy of breast cancer [22]. It used traditional classification techniques such as LR, LDA, QDA, FR, and SVM. Md. Milon Islam compared supervised machine learning algorithms such as SVM, ANNs, and LR [23]. The WBC dataset was taken from the UCI machine learning database, a well-known machine learning resource. Performance was evaluated using the confusion matrix and correlation factors. Additionally, the receiver operating characteristic curves and the areas under the precision-recall curves of the different approaches were assessed. The findings indicated that SVM had the highest accuracy among all applied algorithms, while ANNs had the highest values of accuracy, precision, and F1-score.
Different versions of the DT algorithm for the diagnosis of cancer were implemented using MATLAB, Python, and WEKA. CART, implemented in Python, achieved an accuracy of 97.4% and a sensitivity of 98.9%, while in WEKA both DT algorithms achieved an accuracy of 95.3% [24] [25].
In the testing phase, NSVM, LPSVM, SSVM, and LPSVM achieved the highest accuracy, sensitivity, and specificity, which were 96.5517%, 98.2456%, 96.5517%, and 97.1429%, respectively. Yue et al. provided thorough evaluations of different models, including Decision Tree (DT) methods, for predicting breast cancer on the standard WBCD dataset. According to the practitioners, the highest accuracy can be achieved by combining two deep learning models. This architecture had a classification accuracy of 99.68%, whereas the SVM method combined with a clustering algorithm had a classification accuracy of 99.10%. They also studied an ensemble method that used voting to combine the J48, SVM, and Naive Bayes models; it achieved an accuracy of 97.13% [26]. Infrared imaging coupled with an agent previously administered to the patient can yield a very accurate tumour detector, using a thermally sensitive camera and a model of the breast [27]. Another study examined the technological paradigms and enablers that remote health care systems use to meet their needs for remote monitoring and remote aid, and identified research gaps to stimulate future research [28].

SVM improves breast cancer diagnosis and treatment strategies.
[9] Support vector machine ensemble algorithm: Haifeng's ensemble model demonstrated improved diagnosis accuracy, but its long training time and computational expense raise concerns.
[10] Deep learning and machine learning algorithms: Haifeng's ensemble model demonstrated improved diagnosis accuracy, but its long training time and computational expense raise concerns.
[11] Artificial Neural Network (ANN): Neural networks can identify cancerous cells but require observation over time and significant processing power for image preprocessing.
[ Used data mining technologies to improve breast cancer prediction, using bagging, multiboot, random subspace, and multilayer perceptron to improve the classification performance of machine learning algorithms.

III. MATERIALS AND METHODS
Mammography is a radiological method used to screen for breast cancer. Diagnostic mammography is a special kind of mammogram used, on the advice of a medical professional, to detect abnormalities in women who have been diagnosed with breast problems or cancer. Women over the age of 45 are advised to undergo screening mammography to rule out any breast tumours or malignancies, but the procedure itself carries risks related to radiation exposure, both for young women and for women with malignancies. Moreover, mammography predicts only 70% on the true-positive scale, so many unnecessary biopsies are performed to confirm the indication of this chronic disease.
Recently, several Computer-Aided Diagnosis (CAD) approaches have been put forth to reduce the number of unnecessary biopsies. These systems help clinicians decide, on the basis of a mammogram, whether to perform a biopsy on a suspicious case or to carry out a comprehensive follow-up instead. Due to some important limitations, a biopsy is not usually used to diagnose breast masses.
This dataset can be used to classify masses as benign or malignant based on BI-RADS (Breast Imaging-Reporting and Data System) features and the patient's age, and it is considered a benchmark for breast cancer diagnosis. The BI-RADS assessment correlates with the likelihood of malignancy. Sensitivities and the associated specificities can be determined by assuming that all instances with BI-RADS ratings greater than or equal to a specific value (ranging from 1 to 5) are malignant and all other cases are benign. These values indicate how a CAD system performs in comparison with radiologists. Previous literature shows that different practitioners use different methods, whereas the current study employed machine learning methods to check the significance of the dataset, which fills the gap. These types of models have not previously been applied by researchers to check the accuracy of the cancer genome; in this paper, novel model techniques have been used to aid research in this field.
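The thresholding rule described above can be illustrated with a minimal sketch. The function name `sensitivity_specificity` and the toy ratings and labels are hypothetical and are not taken from the study's dataset; label 1 is assumed to mean malignant.

```python
def sensitivity_specificity(ratings, labels, threshold):
    """Treat every case with rating >= threshold as a malignant prediction.

    ratings: BI-RADS assessments (1 to 5); labels: 1 = malignant, 0 = benign.
    Returns (sensitivity, specificity).
    """
    tp = fp = tn = fn = 0
    for rating, label in zip(ratings, labels):
        predicted = 1 if rating >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity


# Hypothetical toy data, for illustration only.
ratings = [2, 3, 4, 4, 5, 5, 3, 4]
labels = [0, 0, 0, 1, 1, 1, 0, 1]

for threshold in range(1, 6):
    sens, spec = sensitivity_specificity(ratings, labels, threshold)
    print(f"threshold >= {threshold}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Sweeping the threshold from 1 to 5 traces out the trade-off: a low threshold flags every case as malignant (high sensitivity, low specificity), while a high threshold does the reverse.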

A. DATA PREPARATION
Data preprocessing locates and eliminates outliers and removes self-contradictory records. The sample code number was removed from the dataset, since it has no bearing on the illness. The dataset contains 16 missing attribute values, which were replaced by the mean of the corresponding attribute. Random selection from the dataset was employed to ensure that the data were distributed correctly. The dataset contains the following features: id, diagnosis (M = malignant, B = benign), radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension. Ten real-valued features are computed for each cell nucleus, and the mean, the standard deviation of grey-scale values, and the worst (largest) of these features are reported. The dataset is described in Table II below.

TABLE II. DESCRIPTION

Different models were used to find the best result for the dataset. The aforementioned methods were chosen because they provide better classification and interpretation, and because they are widely used in classification problems, which gave clear directions for the analysis. In the future, a convolutional neural network could be used to compare this work with a deep learning model. A generic flow diagram of the procedure was also prepared for better interpretation.
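The two preparation steps described in this subsection, dropping the identifier column and replacing missing values with the attribute mean, can be sketched as follows. This is a standard-library illustration under stated assumptions, not the study's own code: the helper names `drop_column` and `impute_mean` and the three toy records are hypothetical, and missing values are assumed to be encoded as `None`.

```python
from statistics import mean


def drop_column(rows, name):
    """Return copies of the rows without the given column (e.g. the sample id)."""
    return [{key: value for key, value in row.items() if key != name} for row in rows]


def impute_mean(rows, columns):
    """Replace None in each listed numeric column with that column's observed mean."""
    filled = [dict(row) for row in rows]  # copy so the input is left untouched
    for column in columns:
        observed = [row[column] for row in rows if row[column] is not None]
        column_mean = mean(observed)
        for row in filled:
            if row[column] is None:
                row[column] = column_mean
    return filled


# Hypothetical toy records, for illustration only.
records = [
    {"id": 101, "radius": 12.0, "texture": 18.0, "diagnosis": "B"},
    {"id": 102, "radius": None, "texture": 22.0, "diagnosis": "M"},
    {"id": 103, "radius": 16.0, "texture": None, "diagnosis": "M"},
]

cleaned = impute_mean(drop_column(records, "id"), ["radius", "texture"])
for row in cleaned:
    print(row)
```

Computing the mean from the observed values only keeps the imputed entries from shifting the attribute's average.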

C. TRAINING AND TESTING PHASE
The dataset was first split into two parts, for training and testing. In the training phase, the main features were extracted to classify the cancer type, whereas in the testing phase the predicted cases were checked against the true labels in the confusion matrix. In k-fold cross-validation, one fold is used for testing while the remaining k - 1 folds are used for training, and the procedure is repeated so that each fold serves once as the test set. Cross-validation was used to prevent overfitting. The data were partitioned using tenfold cross-validation: each iteration used nine folds for training and the remaining fold for testing.
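The tenfold partitioning described above can be sketched with the standard library alone. The generator name `k_fold_indices`, the sample count of 100, and the fixed seed are assumptions made for illustration.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once so folds are random
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Hypothetical dataset size, for illustration only.
splits = list(k_fold_indices(100, k=10))
for fold_number, (train, test) in enumerate(splits, start=1):
    print(f"fold {fold_number}: {len(train)} training samples, {len(test)} test samples")
```

With 100 samples and k = 10, every iteration trains on nine folds (90 samples) and tests on the remaining fold (10 samples), and no sample appears in both sets within an iteration.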

D. SETUP FOR EXPERIMENTATION
Six machine learning approaches were utilized in this study: SVM, KNN, DT, RF, MLP, and LR. The predictions concerned the benign or cancerous nature of the cells. The experiments were run on an Intel(R) Core(TM) i3-1111G4 CPU @ 3.00GHz (2995 MHz) with 8 GB of RAM. Scikit-learn, an open-source library written in Python, was employed for the analysis. Reports containing narrative text, live code, equations, and graphics were created using Colab.
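A comparison of this kind might be set up in Scikit-learn roughly as shown below. This is a sketch under stated assumptions rather than the study's actual configuration: the data are synthetic (generated with `make_classification`), all hyperparameters are library defaults apart from `max_iter` and `random_state`, and the feature scaling step is an added assumption that the paper does not mention.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real dataset (assumption, for illustration only).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

# The six classifier families named in the text; scaling is an added assumption.
models = {
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "MLP": make_pipeline(
        StandardScaler(), MLPClassifier(max_iter=1000, random_state=0)
    ),
    "LR": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
}

# Mean accuracy over tenfold cross-validation for each model.
scores = {
    name: cross_val_score(model, X, y, cv=10, scoring="accuracy").mean()
    for name, model in models.items()
}

for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.3f}")
```

Wrapping the scaler and the classifier in one pipeline means the scaling statistics are re-estimated on the training folds of each iteration, so no information from the test fold leaks into training.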
Several performance metrics were employed to measure the effectiveness of the Machine Learning Algorithms (MLAs). The parameters concerned were evaluated using confusion matrices, whose entries (TP, TN, FN, and FP) compare the predicted data with the real data. A confusion matrix was calculated for each of the methods employed.

IV. RESULTS
MLP achieved the highest accuracy, 86%, and Logistic Regression achieved the second highest, 85.2%. The highest precision for class 0 was achieved by MLP and Logistic Regression, with MLP achieving 81% for class 1. The lowest precision, 75%, was obtained by SVM for malignant or benign cases. LR provided the highest recall for class 0, whereas the lowest values were obtained by the Decision Tree (DT) and SVM. MLP likewise provided the highest recall for class 0, with the lowest values again given by the Decision Tree (DT) and SVM.
In terms of the F1 score, the highest value for class 0 was achieved by MLP, LR, and the Decision Tree (DT), whereas SVM provided the lowest value. The highest recall for class 1 was achieved by MLP, whereas SVM provided the lowest value.
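The per-class precision, recall, and F1 values reported above follow from the four confusion-matrix counts. The sketch below uses the standard definitions; the function names and the counts passed to `report` are hypothetical and are not the study's results, and class 1 is assumed to denote malignant cases and class 0 benign cases.

```python
def class_metrics(tp, fp, fn):
    """Precision, recall, and F1 for one class treated as the positive class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


def report(tn, fp, fn, tp):
    """Accuracy plus per-class metrics from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {
        "accuracy": accuracy,
        # Class 1 (assumed malignant) as the positive class.
        "class_1": class_metrics(tp, fp, fn),
        # Class 0 (assumed benign) as the positive class: the roles swap.
        "class_0": class_metrics(tn, fn, fp),
    }


# Hypothetical counts, for illustration only.
result = report(tn=50, fp=10, fn=5, tp=35)
print(f"accuracy: {result['accuracy']:.3f}")
for name in ("class_0", "class_1"):
    precision, recall, f1 = result[name]
    print(f"{name}: precision={precision:.3f}, recall={recall:.3f}, f1={f1:.3f}")
```

Because the roles of the counts swap between the two classes, a model can rank highest on recall for one class and lowest for the other, which is why the results are reported per class.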

FIGURE 2. Comparison of models
The MLP model provided the highest accuracy of all the techniques compared, as shown in Figure 1. The remaining results are discussed in the following section.

V. DISCUSSION AND CONCLUSION
The following factors were used to evaluate the specific terms by their equivalent formulas in the investigation. Several comparable characteristics are needed to measure a system's performance adequately.
The experimental results are given in Table VII. Of the six strategies, Logistic Regression (LR) predicted the greatest number of true positives and also the smallest number of false positives. LR had the lowest false-positive rate, whereas the highest false-positive value was obtained by LR and MLP.
The Decision Tree (DT) had the highest false-negative rate, and MLP the lowest. The F1 score for all the techniques was almost 97%, which was significantly better. LR predicted the highest number of true negatives, whereas MLP predicted the lowest.