PREDICTING BREAST CANCER BY MACHINE LEARNING ALGORITHM ON GENE EXPRESSION DATA
*Alpna Sharma, Nisheeth Joshi and Vinay Kumar
ABSTRACT
Breast cancer is widespread worldwide. Detecting pathogenic variations in the expression profiles of genes responsible for breast cancer in women at an early stage can lower the mortality rate of this deadly disease. Microarray techniques help identify differences in gene expression between normal and diseased samples. They are used extensively in biological research because they monitor the expression levels of thousands of genes simultaneously. Detecting the disease, or the extent of its spread, at an early stage is essential for prognosis (prediction of disease outcome) and for choosing treatment options. Because genomic data is big data with thousands of features, good classification accuracy can be achieved by considering only the differentially expressed features. In this study, to deal with the skew of the dataset towards the positive class, three datasets are created: the original dataset, an oversampled dataset and an undersampled dataset. Classification models are tested on all three datasets and the results compared, both to address the class imbalance problem and to validate the accuracy of the results. After standardizing the data to the standard normal scale (mean = 0, variance = 1), the features accounting for 95% of the variance in the dataset are selected using principal component analysis (PCA). The features extracted using PCA are clustered with the K-means clustering algorithm before being fed to a classifier that predicts the diseased and normal classes. The accuracies of several classification models, namely K-Nearest Neighbour (KNN), Naive Bayes classifier, Random Forest and Neural Network, are compared. Random Forest and the Naive Bayes classifier classified normal and diseased samples with 100% accuracy, while KNN and the Neural Network did not reach 100% accuracy but still gave satisfactory results.
Keywords: Breast cancer, gene expression, feature reduction, clustering, classification, Random Forest, Neural Network.
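
The workflow summarized in the abstract (resampling, standardization, PCA retaining 95% of the variance, K-means clustering, then classification) can be illustrated with a minimal Python sketch using scikit-learn. This is not the authors' implementation. The data are synthetic stand-ins for an expression matrix, the helpers `rebalance` and `run_pipeline` are hypothetical names, and appending the K-means cluster label as an extra input column is only one possible reading of "clustered before feeding them to classifier". Resampling is applied to the training split only, so that duplicated samples cannot leak into the test set.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def rebalance(X, y, mode, seed=0):
    """Randomly over- or undersample so all classes have equal counts (hypothetical helper)."""
    if mode not in ("over", "under"):
        raise ValueError("mode must be 'over' or 'under'")
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max() if mode == "over" else counts.min()
    picked = []
    for label, count in zip(classes, counts):
        members = np.flatnonzero(y == label)
        if count < target:
            # Keep every original sample, then add random duplicates.
            extra = rng.choice(members, size=target - count, replace=True)
            members = np.concatenate([members, extra])
        elif count > target:
            # Drop random samples from the larger class.
            members = rng.choice(members, size=target, replace=False)
        picked.append(members)
    idx = rng.permutation(np.concatenate(picked))
    return X[idx], y[idx]


def run_pipeline(X_train, y_train, X_test, y_test, seed=0):
    """Standardize, keep 95% variance with PCA, add a K-means label, fit a random forest."""
    scaler = StandardScaler().fit(X_train)
    pca = PCA(n_components=0.95, svd_solver="full").fit(scaler.transform(X_train))
    Z_train = pca.transform(scaler.transform(X_train))
    Z_test = pca.transform(scaler.transform(X_test))
    # Assumption: the cluster assignment is appended as one more input column.
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=seed).fit(Z_train)
    Z_train = np.column_stack([Z_train, kmeans.predict(Z_train)])
    Z_test = np.column_stack([Z_test, kmeans.predict(Z_test)])
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    clf.fit(Z_train, y_train)
    return clf.score(Z_test, y_test), pca.explained_variance_ratio_.sum()


# Synthetic stand-in data (samples x genes), skewed towards the positive class.
# These are NOT the study's data; 1 = diseased, 0 = normal.
rng = np.random.default_rng(0)
y = np.array([1] * 90 + [0] * 30)
X = rng.normal(size=(120, 300))
X[y == 1, :20] += 2.0  # shift 20 "genes" in the diseased samples

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

results = {}
for name in ("original", "over", "under"):
    if name == "original":
        Xb, yb = X_train, y_train
    else:
        Xb, yb = rebalance(X_train, y_train, name)
    results[name] = run_pipeline(Xb, yb, X_test, y_test)

for name, (accuracy, explained) in results.items():
    print(f"{name:>8}: accuracy={accuracy:.3f}, variance kept={explained:.3f}")
```

Swapping `RandomForestClassifier` for `KNeighborsClassifier`, `GaussianNB` or `MLPClassifier` (all available in scikit-learn) would give a comparison of the same kind as the one described above. The accuracies printed here depend entirely on the synthetic data and say nothing about the study's results.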