Prediction of Breast Cancer Using Machine Learning Techniques

Breast cancer affects a large number of women around the world and is a leading cause of cancer death in females. In this study, samples are analyzed with a range of modern techniques to identify the factors most predictive of breast cancer. Logistic regression, discriminant analysis, and principal component analysis are applied for this purpose. The Breast Cancer Wisconsin Diagnostic Dataset, obtained from the Machine Learning Repository, is used. Processing the correlation matrix of the data provides a sound starting point for the work. Principal component analysis, discriminant analysis, and logistic regression are used to extract and select features. Classification models including Decision Tree, Naive Bayes, Logistic Regression, Support Vector Machine, and Artificial Neural Network are trained, and their performance is rigorously examined. The results suggest that the proposed strategy is effective and reduces training time. These methods can help physicians understand the factors associated with breast cancer and distinguish between tumor types. Data mining techniques are used extensively, particularly for feature selection. Conclusion: among all models, classifiers fed with the hybrid discriminant-logistic (DA-LR) selected features perform best, with SVM and Naive Bayes outperforming the others.


Introduction
After cervical cancer, breast cancer is the most common cancer among women and one of the most deadly. Around 12% of women in the US develop a malignant breast tumor that can spread to other organs [1]. Routine screening combined with accurate diagnostics can increase survival rates. The diagnostic process includes an initial physical examination, mammography, ultrasound, MRI scans, experimental breast imaging, and breast biopsy [2]. Data mining plays a major role in breast cancer diagnosis: medical facilities hold vast amounts of data that can help doctors diagnose the disease correctly at an early stage. Mammography is one of the most commonly used screening methods for early detection of breast cancer. Once a mammogram detects a tumor, further testing is required to determine whether it is malignant or benign. There is a plethora of variables to consider when analyzing breast cancer data; if some features are irrelevant or multicollinear, the classification model may lose precision [3]. Feature selection is therefore essential before data mining and machine learning [4,5]. Statistically, only 20% to 30% of biopsies are found to be cancerous. The sensitivity of mammography is approximately 84%; the remaining 16% of false positive cases are incorrectly referred for further investigative tests such as biopsy [6]. Though highly accurate, a biopsy is a painful, expensive, and time-consuming surgical procedure. Artificial intelligence techniques have been successfully applied to breast cancer diagnosis [7][8][9]. Quinlan [10] achieved 94.74% accuracy using 10-fold cross-validation with the C4.5 decision tree method. Hamilton et al. [11] presented rule induction through approximate classification and obtained an accuracy of 96%. Pena-Reyes and Sipper [12] proposed a fuzzy-genetic approach and obtained a success rate of 97.36%.
Abbass [13] applied an evolutionary multi-objective approach to artificial neural networks, achieving 98.1% accuracy at reduced computational cost compared to traditional backpropagation. Sahan et al. [14] proposed a hybrid K-NN algorithm and achieved an accuracy of 99.14% via 10-fold cross-validation. Akay [15] combined SVM with feature selection, using Bare Nuclei, Uniformity of Cell Shape, Uniformity of Cell Size, Clump Thickness, and Bland Chromatin as the selected features, and obtained an accuracy of 99.51% with a 50-50% training-test partition. Chen et al. [16] suggested rough set-based feature selection combined with a support vector machine classifier (RS_SVM). The classifier achieved an accuracy of 100% with a 70%-30% training-test partition using five selected features: Clump Thickness, Uniformity of Cell Shape, Marginal Adhesion, Bare Nuclei, and Mitoses. Jin et al. [17] concluded that two binary classifiers with Naive Bayes and Functional Trees (FT) gave better results than a multiclass (one-step) classifier for predicting the diagnosis and prognosis of breast cancer. Kaya [18] proposed a hybrid RS-ELM model: rough sets (RS) were applied to reduce the attributes and an extreme learning machine (ELM) was used for classification. The method obtained an accuracy of 100% with an 80%-20% training-test partition using four selected features: Clump Thickness, Uniformity of Cell Shape, Bare Nuclei, and Normal Nucleoli. Zheng [19] proposed a hybrid of K-means and SVM for feature reduction and classification, with an accuracy of 97.38%. El-Baz [20] proposed a hybrid intelligent system using rough set-based feature selection and a K-NN classifier. Bhardwaj and Tiwari [21] proposed a genetically optimized neural network and obtained an accuracy of 100% with a 70-30 training-test partition. Onan [22] proposed a hybrid fuzzy-rough nearest neighbor classification model consisting of three phases: instance selection, feature selection, and classification.
The model obtained an accuracy of 99.715%. Hasan et al. [23] proposed a hybrid of genetic algorithms and simulated annealing (GSA) and accomplished an accuracy of 98.84%. Aalaei et al. [24] applied genetic algorithm-based feature selection and obtained an accuracy of 96.9% with a particle swarm classifier. Alickovic and Subasi [25] used genetic algorithm-based feature selection and achieved an accuracy of 99.48% with a rotation forest classifier.

Decision Tree: The decision tree is one of the most commonly used classification algorithms. Its structure follows a simple flowchart-like, top-down approach, creating a model that predicts an output variable from one or more input variables. Internal nodes represent input variables and leaves represent the output variable. The classification path runs from the root node to a leaf node by comparing each node's attribute with the record's attribute value, continuing until a leaf node is reached. To select the best candidate attribute for each node as the tree grows, a statistical property called information gain is used [27]. Building the tree constitutes the training phase of classification; after training, the tree can be converted into if-then rules [27]. This algorithm gives a good understanding of the overall data structure, but becomes more complicated as the number of features increases. One way to overcome this problem is tree pruning, which also mitigates over-fitting [26].

Naive Bayes: The Naive Bayes (NB) algorithm is a machine learning classification technique based on Bayes' theorem. It is a probabilistic statistical classifier used to determine the likelihood of outcomes [28]. The features are assumed to be independent and to contribute equally to the result, which reduces the computational complexity to simple probability multiplication [29].
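The information-gain criterion used to pick splitting attributes can be illustrated with a short sketch; the labels and the split mask below are toy values chosen for illustration, not taken from the dataset:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (base 2) of a 1-D label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(labels, split_mask):
    """Entropy reduction achieved by splitting `labels` with a boolean mask."""
    n = len(labels)
    left, right = labels[split_mask], labels[~split_mask]
    child = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - child

# Toy labels: 0 = benign, 1 = malignant; the mask stands in for a hypothetical
# test such as "clump thickness <= 3".
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
mask = np.array([True, True, True, False, False, False, False, False])
gain = information_gain(y, mask)  # higher gain = better candidate attribute
```

The attribute whose split yields the largest gain is chosen at each node as the tree grows.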
The training dataset is used to estimate the prior likelihood of each label, and the contribution of each attribute is combined with this prior to obtain a probability estimate. The posterior probability of each label is calculated using the naive Bayes equation, and the label with the highest posterior is returned as the prediction. Little training data is needed to fit the model. For most real-world problems the independence assumption is not realistic, because features often depend on one another; in healthcare, for example, a patient's health conditions and characteristics are interdependent, so the independence assumption may introduce bias and lead to misclassification. Nevertheless, the Naive Bayes classifier performs well in terms of classification accuracy [30].

Support Vector Machine: The Support Vector Machine (SVM) is a family of supervised learning algorithms based on statistical learning theory. It is used for the classification and estimation of linear and nonlinear data. The algorithm works by constructing a hyperplane that serves as the decision boundary separating the classes. Optimal separation is tuned via the kernel, regularization, gamma, and margin parameters. The main advantages of SVM are its high classification accuracy and its ability to create complex nonlinear boundaries that are robust to over-fitting. Its main drawback is that training can be very slow [31].

Artificial Neural Network: The artificial neural network (ANN) is inspired by biological networks of neurons. An ANN can be used to model and simulate the relationship between inputs and outputs. In the ANN model, a layer is a collection of nodes called neurons; the network consists of an input layer, one or more optional hidden layers, and an output layer. Connections between the nodes transmit real numbers as input signals.
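As a minimal illustration of the Naive Bayes posterior computation described above, the sketch below fits a Gaussian Naive Bayes model on synthetic two-feature data; the class means, spreads, and the query point are assumptions for illustration, not values from the study:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
# Two hypothetical cytology scores: benign cases cluster around low values,
# malignant cases around high values (synthetic stand-ins, not the real data).
X = np.vstack([rng.normal(2.0, 1.0, size=(100, 2)),
               rng.normal(7.0, 1.5, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)  # 0 = benign, 1 = malignant

clf = GaussianNB().fit(X, y)                 # estimates priors and per-class Gaussians
posterior = clf.predict_proba([[7.5, 8.0]])  # P(label | features) for each class
```

The predicted label is simply the class with the highest posterior probability.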
The input of each node is computed from its incoming signals and an activation function. Each connection has a weight that controls the signal between neurons. ANNs are known for their ability to learn: learning is achieved by continually updating the weights between neurons. The neural network is a complex adaptive system whose architecture can vary depending on the flow of information [32-34].

Logistic Regression: The output of logistic regression is binary, with two possible outcomes. The mathematical concept underlying logistic regression is the natural logarithm of the odds ratio (the logit). A simple example of the logit can be derived from a 2 × 2 contingency table. In general, logistic regression is well suited to describing and testing hypotheses about relationships between an outcome variable and one or more categorical or continuous predictor variables [35].
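The 2 × 2 contingency-table example of the logit can be made concrete with a short sketch; the counts below are hypothetical:

```python
import math

# Hypothetical 2x2 contingency table (counts are illustrative only):
#                  malignant   benign
# feature high         40         10
# feature low          20         30
odds_high = 40 / 10            # odds of malignancy when the feature is high
odds_low = 20 / 30             # odds of malignancy when the feature is low
odds_ratio = odds_high / odds_low
logit = math.log(odds_ratio)   # log odds ratio: the scale logistic regression models
```

A positive logit indicates that high feature values are associated with greater odds of malignancy; logistic regression estimates such log-odds effects jointly across predictors.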
Feature selection is performed to reduce the number of variables and to determine the important factors in the analysis phase. The dataset used in this research was provided by Dr. William H. Wolberg of the University of Wisconsin Hospitals, Madison, and contains 10 attributes and 699 instances. The main objective of this study is feature reduction using Logistic Regression, Discriminant Analysis, and Principal Component Analysis, combined with machine learning tools, to assess and classify breast cancer on the basis of its characteristics. A hybrid DA-LR feature reduction is proposed, and models built on the reduced features are evaluated with Naive Bayes, Support Vector Machine, Logistic Regression, Decision Tree, and Artificial Neural Network classifiers.

Data Preprocessing:
Before applying statistical and data mining techniques, the dataset must be preprocessed because it contains missing values. Removing records with missing values is an option, but it is not ideal because the dataset is not particularly large. Substituting the mean or mode for missing values is another option, but these methods can yield erroneous estimates of variance and covariance. Rather than guessing, it is preferable to estimate the distribution of each variable in the dataset and then use those estimates to fill in the missing values. Heuristic algorithms of this kind fill the gaps in a dataset without introducing significant bias.
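A minimal sketch of the distribution-based imputation idea, assuming a single numeric column with NaN-coded gaps (the values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

def impute_from_distribution(col, rng):
    """Fill NaNs by sampling from the empirical distribution of observed values.

    Unlike mean substitution, this preserves the spread of the variable and so
    avoids deflating variance and covariance estimates.
    """
    col = np.asarray(col, dtype=float).copy()
    missing = np.isnan(col)
    col[missing] = rng.choice(col[~missing], size=missing.sum(), replace=True)
    return col

# Hypothetical 1-10 cytology scores with two missing entries (the Wisconsin
# data codes missing values as '?', which become NaN after parsing).
x = np.array([1, 3, np.nan, 5, 10, np.nan, 2, 4])
filled = impute_from_distribution(x, rng)
```

Each gap is filled with a value drawn from the column's own observed distribution, so summary statistics of the imputed column stay close to those of the observed data.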

Correlation Matrix:
In statistics, correlation measures the relationship between two variables. Analyzing the correlation matrix is extremely beneficial before developing prediction models. Multicollinearity, which occurs when two or more variables are highly linearly related, affects the precision of each predictor's estimated impact, so models with multiple correlated predictors yield more uncertain estimates.
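A minimal sketch of building a correlation matrix and flagging multicollinear pairs, using synthetic features rather than the study's variables:

```python
import numpy as np

rng = np.random.default_rng(1)
# Three synthetic features: f1 and f2 are constructed to be nearly collinear,
# f3 is independent (illustrative stand-ins, not the dataset's attributes).
f1 = rng.normal(size=200)
f2 = 0.9 * f1 + rng.normal(scale=0.3, size=200)
f3 = rng.normal(size=200)
X = np.column_stack([f1, f2, f3])

corr = np.corrcoef(X, rowvar=False)  # 3x3 Pearson correlation matrix

# Flag pairs whose |r| exceeds 0.5 as multicollinearity candidates.
flagged = [(i, j) for i in range(3) for j in range(i + 1, 3)
           if abs(corr[i, j]) > 0.5]
```

Only the (f1, f2) pair is flagged here; in practice the flagged pairs guide which variables to drop or combine before modeling.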

Feature extraction and Selection:
The correlation matrix indicates whether multicollinearity is present in the data. Feature extraction is an effective technique for reducing the dimensionality of the data; the amount of data required to produce an accurate result grows with the size of the feature space. Principal component analysis (PCA), discriminant analysis (DA), and logistic regression (LR) are the three feature extraction techniques investigated in this section for reducing dimensions and extracting informative features. Principal component analysis was used to investigate and reduce the dimensionality of the data. Discriminant analysis assumes multivariate normality, an assumption that cannot hold when the data contain a mixture of categorical and continuous variables. The goal of discriminant analysis is to identify the variables that best distinguish between the two groups [39].
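A minimal sketch of PCA-based dimension reduction via the explained-variance ratio, on synthetic data assumed to share one latent factor:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Five synthetic features sharing one latent factor, so most variance falls on
# the first principal component (a stand-in for correlated cytology scores).
latent = rng.normal(size=(300, 1))
X = np.hstack([latent + rng.normal(scale=0.4, size=(300, 1)) for _ in range(5)])

pca = PCA().fit(X)
ratios = pca.explained_variance_ratio_          # variance share per component
# Keep enough components to reach a cumulative-variance threshold (80% here).
k = int(np.searchsorted(np.cumsum(ratios), 0.80)) + 1
```

The retained components are linear combinations of the original variables; as noted above, whether the resulting loadings are interpretable determines how useful PCA is for feature reduction in practice.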

Results
The investigated dataset is unbalanced, as can be observed in Figure 2, which could lead to biased prediction: the model tends to predict the class with more observations, so accuracy alone cannot be fully trusted in that case. Table 1 shows the correlation matrix of the dataset. Apart from Uniformity of Cell Size and Uniformity of Cell Shape, the variables are not very highly correlated, yet the correlations among some variables still exceed 0.5 and are considered moderately large. Therefore, feature selection and extraction are necessary for choosing the classification inputs. Because the correlation matrix indicates multicollinearity, PCA is used to create new variables that are linear combinations of the original variables. As shown in Figures 3-4, the first two principal components represent 69% and 7% of the total variance, respectively. Figure 3 shows graphically how the two new independent variables cover the original variables. Figure 4 shows the scree plot, which has a steep curve followed by a bend and a horizontal line; the steep part contains the two principal components retained to explain most of the variability in the data. Table 2 shows the eigenvectors of the first three principal components. The first three components explain 80% of the total variance in the data, but the loadings within each principal component are not clearly distinguishable; hence PCA did not provide enough motivation for dimension reduction. Table 3 shows the variables used in the discriminant analysis model. The discriminant analysis suggests that the first five variables discriminate best between the malignant and benign cases, with a 0.05 significance level for entry. The F-statistic score determines the order of the variables. Variables entered in the stepwise discriminant analysis remain if their p-value is less than the significance level for entry.
Similarly, variables entered in the model remain if the p-value of the overall model is less than the significance level to stay. Feature extraction with discriminant analysis is meaningful and consistent with the correlation matrix: "Uniformity of Cell Shape", "Single Epithelial Cell Size", and "Marginal Adhesion" are not selected because they have very high correlations of 0.9, 0.75, and 0.71, respectively, with "Uniformity of Cell Size". The values of the fitted logistic regression models, expressed as the likelihood ratio Chi-square test, Score, and Wald p-values, should be within the acceptable significance level and are shown in Table 4. The logistic regression analysis suggests that six variables are essential to classify effectively between malignant and benign tumors; with a 0.01 significance level for entry, the first four variables are selected in the model. Classification and prediction of breast cancer type is first performed using all features in the dataset with the methods named in the prior section, and their performance is compared in Table 5. As can be seen in Table 5, SVM performs best, with the highest accuracy and AUC. In Tables 5-8, an upper confidence-interval bound reported as 1 is a value such as 0.99999 rounded to 1. Table 6 shows the performance of the classifiers with the LR-selected features in the classification model. As seen in Table 6, NB and SVM perform better than all other classifiers. Comparing LR feature-selection classification with all-feature classification, performance is improved with the LR-selected features. A significance level of alpha = 0.05 is used when testing the null hypothesis.
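The comparison procedure can be sketched as cross-validated AUC scoring of several classifiers; the data below are synthetic stand-ins for the tumor data, and the model settings are assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic binary-classification data (the study itself uses the Wisconsin
# dataset); sample and feature counts here are illustrative.
X, y = make_classification(n_samples=400, n_features=9, n_informative=5,
                           random_state=0)

models = {
    "NB": GaussianNB(),
    "SVM": SVC(),                              # decision_function feeds the AUC
    "LR": LogisticRegression(max_iter=1000),
}
# Mean ROC-AUC over 5-fold cross-validation, one of the metrics compared above.
auc = {name: cross_val_score(m, X, y, cv=5, scoring="roc_auc").mean()
       for name, m in models.items()}
```

Repeating this scoring for each feature subset (all features, LR-selected, DA-selected, hybrid) yields a comparison table like Tables 5-8.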

Discussion
This paper presents an extensive comparative data mining and machine learning analysis of a breast cancer dataset. The correlation matrix of the features indicates the presence of multicollinearity; therefore, feature reduction with PCA, LR, and DA is investigated to reduce the dimensionality and increase classification power. Comparing the classification performance metrics of ANN, DT, LR, SVM, and NB on four different feature sets showed that both NB and SVM have superior performance when fed with DA-selected features. The four feature sets are: all features, features selected by LR, features selected by DA, and the hybrid DA-LR selection. Diagnosis of breast cancer through mammography and biopsy can be very expensive and risky [1,2]. The risk of a biopsy lies in a positive diagnosis when the patient does not actually have cancer, which brings a heavy load of mental and emotional stress and discomfort [40]. Over the last decades, much research has been invested in breast cancer diagnosis using data analytics and, later, machine learning. To this end, data from patients who may have breast cancer are analyzed with different techniques. The datasets used in the literature vary in size and in the variety of variables. Collecting relevant data is time-consuming, and a higher number of features does not necessarily lead to higher diagnostic accuracy [3,4]. For this reason, feature selection is an important part of the methodology in many breast cancer diagnosis studies and similar applied health studies. In this study, we combined different feature selection methods with different classification models to find out which combination leads to higher accuracy.
Feature selection and classification models were chosen based on their frequency of use in highly cited journal papers [7][8][9][10][11][12][13][14][15][16][17][18][19][20][21][22][23][24][25]. Although PCA has been shown to be a strong dimension-reduction technique, we did not find it very insightful for our case study and hence did not use its outputs in further steps [45]. Diagnostic methods have evolved over the years, from physical examination to biopsy to imaging tests such as the mammogram and MRI. The chances of survival for breast cancer, as for other cancers, rise dramatically when it is detected early [46]. To assess the impact of feature selection on classification results fairly, we first applied all five classifiers (NB, DT, LR, SVM, and ANN) with all 10 features included. As shown in Table 5, SVM outperforms the other classifiers, with significantly better accuracy and AUC. Next, 5 features were selected using LR. As shown in Table 6, the overall performance of all classifiers improved significantly, especially ANN; SVM still ranks best among the classifiers, and NB also reaches accuracy as high as SVM's. The analysis was then conducted by feeding the classifiers with the 4 features chosen by LR. While SVM still performs best on the evaluation metrics, the performance of ANN dropped significantly, which shows the sensitivity of ANN to the number of times the model is run compared to the other classification models. We then fed the classification models with the features chosen by the hybrid DA-LR method, which selects 6 features. Bare Nuclei and Clump Thickness are the two features selected by LR, DA, and the hybrid DA-LR alike.
While Bland Chromatin, Marginal Adhesion, and Uniformity of Cell Shape are selected by LR but not by DA, they are retained in the hybrid DA-LR selection, whereas Uniformity of Cell Size, which was selected by DA, is removed from the hybrid selection. It is worth mentioning that Normal Nucleoli, which was selected by both LR and DA, is not selected by the hybrid model. This selection is reliable: after running all the classification models 2000 times, there is little variation in the confidence intervals, and the p-values show the significance of the results. As shown in Table 8, with hybrid feature selection NB and SVM outperform the other classifiers and show improved accuracy and AUC compared to the LR and DA feature selections. The proposed DA-LR feature selection performs best of all techniques when used with the SVM classifier. Therefore, based on the results, SVM is the most suitable method for classifying breast cancer data, while the proposed hybrid DA-LR is the best technique for feature reduction. As shown and discussed in this study, the power of SVM in diagnosing breast cancer with high accuracy is aligned with the results of the reviewed literature [15,16,19]: when the right features are selected, SVM can predict a patient's malignancy with high accuracy in a short amount of time. As a future direction, we intend to use a dataset with a larger number of observations and to try different multivariate classification methods. In addition, running sensitivity analyses on the parameters of each classification model can help validate the robustness of each model.
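One way such a hybrid could be sketched is a two-stage filter: DA-based ranking followed by sparse logistic regression. This is an assumption-laden illustration on synthetic data, not the paper's exact stepwise DA-LR procedure:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

# Synthetic binary-classification data standing in for the tumor features;
# counts and hyperparameters below are assumptions for illustration.
X, y = make_classification(n_samples=400, n_features=9, n_informative=4,
                           random_state=0)

# Stage 1: rank features by the magnitude of the LDA discriminant coefficients.
lda = LinearDiscriminantAnalysis().fit(X, y)
da_selected = set(np.argsort(-np.abs(lda.coef_[0]))[:6])

# Stage 2: L1-penalised logistic regression zeroes out weak predictors.
lr = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
lr_selected = set(np.flatnonzero(lr.coef_[0]))

hybrid = sorted(da_selected & lr_selected)  # features surviving both stages
```

Requiring a feature to survive both stages mirrors the intuition behind combining DA and LR selections, though the paper's stepwise significance-testing rule can also drop features that either stage alone would keep.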

Conclusion
The most advanced techniques involved are logistic regression, discriminant analysis, and principal component analysis, which are useful for identifying the factors associated with breast cancer. Different tumor types can be diagnosed by studying different features with the help of the reported techniques. Data mining is the most applicable methodology for extracting and selecting these features. Many techniques have been developed and analyzed for the diagnosis of tumors; accurate diagnosis of breast cancer depends on extracting and selecting the relevant features from the available data. The Breast Cancer Wisconsin Diagnostic Dataset, obtained from the Machine Learning Repository, was used for this study. The Naive Bayes and Support Vector Machine classifiers outperform the other classification methods, and the hybrid discriminant-logistic (DA-LR) feature selection performs best among all feature selection models.