Machine Learning Application
Introduction
This report applies machine learning techniques to a publicly available dataset of the kind published on Kaggle or government websites. All the programming is done in Python, a general-purpose programming language widely used by data scientists. The data was obtained from the Kaggle website https://www.kaggle.com/ronitf/heart-disease-uci. It contains 14 variables and 303 observations. The data records the presence or absence of heart disease in each patient; this is the target variable. The remaining variables give information that may or may not influence the presence or absence of heart disease. Some of the variables contained in the dataset are:
- The age of the patients.
- The sex of the patients.
- The type of chest pain experienced by the patients.
- The resting blood pressure of the patients.
- The fasting blood sugar > 120 mg/dl
- The resting electrocardiographic results
- The maximum heart rate achieved
- The exercise-induced angina
- The oldpeak, i.e., the ST depression induced by exercise relative to rest
- The slope of the peak exercise ST segment
- The number of major vessels (0-3) colored by fluoroscopy
- thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
These variables are used to build a machine learning model that predicts the presence or absence of heart disease.
The variables listed above are only some of the factors that raise the risk of, or cause, heart disease. Previous studies have shown that stress is one of the significant risk factors for heart disease. The specific type of stress discussed by Thayer, Yamamoto & Brosschot (2010) is work stress. The problem therefore mainly concerns adults, i.e., people who work, rather than children. This also suggests that age is an important factor in determining the presence or absence of stress. Work stress has been associated with factors such as increased worker turnover, increased absenteeism, and decreased worker satisfaction.
A study by Mosca et al. (2010) on women's health reported that women above 55 years of age are more likely to be affected by and die from heart disease, since they are less aware than younger women that heart disease is a cause of death. The study also stated that black and Hispanic women are 60 % likely to be aware that heart disease can lead to death, as compared to white women. A study by Maas & Appelman (2010) showed that sex is a significant determinant of heart disease. Female patients are 70 % more likely to experience heart disease than male patients. This may be caused by biological differences between males and females. Heart disease is a significant cause of death in women. The female body has specific hormones that pose a greater risk to women than to men. According to Nahar et al. (2013), heart disease is more common in females than in their male counterparts (Appelman et al. 2015).
According to Ettehad et al. (2016), the aging population has an increased risk of developing heart disease. Heart disease affects about 10-15 % of the adult population. Their findings show a significant reduction in heart disease in patients with and without chronic kidney disease (Mitsnefes et al. 2010; Daviglus et al. 2012). Kidney disease is one of the disorders most highly correlated with heart disease. Almost every patient with chronic kidney disease had heart disease. The correlation is stronger in elderly patients than in other patients.
Following these facts, we apply machine learning techniques to the data on the presence or absence of heart disease in patients. Exploratory data analysis is done before creating the models. The exploratory study generates visualizations of the data to understand it better, with a focus on the target variable. The distributions of the variables are also examined using suitable visualizations. The data is partitioned into training and testing sets using a fixed random seed so that the split is reproducible. After partitioning, the variables of the training data are divided into dependent and independent variables. The independent variables are encoded to ensure that they are consistent; encoding the variables can increase the accuracy score of the model. Two machine learning techniques are applied to see which is the most accurate in modeling the data. Both are classification techniques: random forest and logistic regression. The accuracy scores of the two models are determined and compared, as are their precision, recall, and F1-score.
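The workflow described above can be sketched with scikit-learn. This is a minimal illustration, not the report's actual code: the data is synthetic, and the seed value, the column choices, the 80/20 split, and the stratified split are all assumptions. An 80/20 split of 303 rows does give 61 test rows, which matches the 27 + 34 test patients reported later.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

SEED = 42  # assumed value; the report does not state which seed was used

# Synthetic stand-in for the 303-row dataset (NOT the real data):
# two numeric columns, one categorical column, and a noisy binary target.
rng = np.random.default_rng(SEED)
n_rows = 303
age = rng.integers(29, 78, size=n_rows)
max_heart_rate = rng.integers(70, 203, size=n_rows)
chest_pain_type = rng.integers(0, 4, size=n_rows)
X = np.column_stack([age, max_heart_rate, chest_pain_type])
noise = rng.normal(0, 25, size=n_rows)
y = (max_heart_rate + 20 * chest_pain_type + noise > 170).astype(int)

# Partition into training and testing sets with a fixed seed (reproducible).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=SEED, stratify=y
)


def make_preprocess():
    """Scale the numeric columns and one-hot encode the categorical one."""
    return ColumnTransformer([
        ("numeric", StandardScaler(), [0, 1]),
        ("categorical", OneHotEncoder(handle_unknown="ignore"), [2]),
    ])


models = {
    "random_forest": make_pipeline(
        make_preprocess(),
        RandomForestClassifier(n_estimators=100, random_state=SEED),
    ),
    "logistic_regression": make_pipeline(
        make_preprocess(),
        LogisticRegression(max_iter=1000),
    ),
}

# Fit each model, then report accuracy, precision, recall, and F1-score.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    scores[name] = accuracy_score(y_test, predictions)
    print(name, "accuracy:", round(scores[name], 3))
    print(classification_report(y_test, predictions, zero_division=0))
```

Putting the encoder inside each pipeline means it is fitted on the training rows only, so no information from the test set reaches the model during training.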
Discussion and Analytics
- Exploratory data analysis
- The distribution of the target variable
The distribution of the target variable can be described by the counts, i.e., the frequencies, and the percentages of its two classes. Of the 303 patients, 165 had heart disease and 138 did not. The patients who did not have heart disease were 45.54 % of the total, while those who had heart disease were 54.46 %. The same distribution is shown as a bar chart below.
Fig 1: The distribution of the target
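As a quick check, the percentages can be recomputed from the reported counts. The sketch below rebuilds the target column from those counts and assumes it is coded 1 for heart disease and 0 otherwise; the report does not state the coding.

```python
from collections import Counter

# Rebuild the target column from the counts given in the text.
# Coding assumption: 1 = heart disease, 0 = no heart disease.
target = [1] * 165 + [0] * 138

counts = Counter(target)
total = sum(counts.values())
percentages = {label: round(100 * n / total, 2) for label, n in counts.items()}

print(counts)
print(percentages)
```

With these counts, the class with heart disease makes up the larger share of the 303 patients.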
- The Distribution of the Age
From the descriptive statistics, the average age of the patients was 54.37 years. The youngest patient was 29 years old, while the oldest was 77 years old. The median age was 55 years. The standard deviation of the age was 9.082 years, which means that a patient's age typically deviates from the average by about 9 years. A histogram is used to check the distribution of the age and to see whether the data has any outliers. The figure below shows the distribution of the age.
Fig 2: The distribution of the Age
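The descriptive statistics and the histogram bins can be computed as in the sketch below. The `ages` list is a small hypothetical sample, not the real column, so its results differ from the figures reported above. The standard deviation here is the sample standard deviation, which is assumed to be what the report used.

```python
import statistics
from collections import Counter

# Hypothetical sample of patient ages (NOT the real 303 observations).
ages = [29, 41, 48, 52, 55, 57, 60, 63, 67, 77]

mean_age = statistics.mean(ages)
median_age = statistics.median(ages)
std_age = statistics.stdev(ages)  # sample standard deviation (n - 1)
youngest, oldest = min(ages), max(ages)

# Histogram with 10-year bins: key 50 counts the ages 50-59, and so on.
bins = Counter((age // 10) * 10 for age in ages)

print("mean:", mean_age, "median:", median_age, "std:", round(std_age, 3))
print("range:", youngest, "to", oldest)
for start in sorted(bins):
    print(f"{start}-{start + 9}: {'#' * bins[start]}")
```

Sparsely filled bins at either end of such a histogram are where potential outliers would show up.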
- The distribution of Sex
The distribution of sex can be described by the counts, i.e., the frequencies, and the percentages of the sex variable. There were 207 female patients and 96 male patients. The male patients were 31.68 % of the total, while the female patients were 68.32 %. The same distribution is shown as a bar chart below.
- The cross-tabulation of Resting electrocardiographic measurement Vs. target
The figure below shows the cross-tabulation of the resting electrocardiographic measurement vs. the target. It can be observed that the normal category, i.e., 0, and the ST-T wave abnormality category, i.e., 1, had the highest variability against the target variable.
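A cross-tabulation counts how often each pair of category values occurs together. The sketch below builds one from a few hypothetical (resting ECG, target) pairs; the records are invented for illustration and are not taken from the dataset.

```python
from collections import Counter

# Hypothetical (resting_ecg, target) pairs, NOT the real observations.
records = [(0, 1), (0, 0), (1, 1), (1, 1), (1, 0), (2, 0), (0, 0), (1, 1)]

# Count each (row value, column value) combination.
table = Counter(records)
ecg_levels = sorted({ecg for ecg, _ in records})
target_levels = sorted({target for _, target in records})

# Print one row per resting ECG category, one column per target class.
print("resting_ecg", *[f"target={t}" for t in target_levels])
for ecg in ecg_levels:
    print(ecg, *[table[(ecg, t)] for t in target_levels])
```

With pandas, `pandas.crosstab` applied to the two columns produces the same kind of table directly.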
- Multivariate Analysis
The box plot above shows the distribution of trestbps (the resting blood pressure) by sex against the target variable.
- Machine Learning model
The first machine learning model used was a random forest. Random forest is one of the classification models used for prediction, and it is suitable when the dependent variable is binary. The accuracy of the random forest was 100 %, which means that its predictions matched the testing data in every case. This kind of accuracy is rare, and the model does not need any hyperparameter tuning to increase it. The same accuracy was confirmed using logistic regression, whose accuracy score was also 100 %. The random forest classified 27 out of the 27 patients without heart disease as not having heart disease, and 34 out of the 34 patients with heart disease as having it. Similarly, the logistic regression correctly classified all 34 patients who had heart disease and all 27 patients who did not.
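The accuracy, precision, recall, and F1-score follow directly from the four confusion-matrix counts. The sketch below uses a hypothetical helper, `classification_metrics`, and treats "has heart disease" as the positive class, which is an assumption. The reported counts give 34 true positives, 27 true negatives, and no errors.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Counts reported in the text: all 34 positives and all 27 negatives correct.
reported = classification_metrics(tp=34, tn=27, fp=0, fn=0)
print(reported)
```

Any misclassified patient would appear as a false positive or a false negative and would pull the corresponding metric below 1.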
Conclusion
Machine learning techniques are mostly used for prediction. One can train a model on the data and test whether the obtained accuracy level can be relied on. The random forest and the logistic regression had the same accuracy score, and therefore either can be recommended for prediction, especially on this kind of data. Classification is also one of the best machine learning approaches for handling data prediction and modeling.
References
- Appelman, Y., van Rijn, B. B., ten Haaf, M. E., Boersma, E., & Peters, S. A. (2015). Sex differences in cardiovascular risk factors and disease prevention. Atherosclerosis, 241(1), 211-218.
- Daviglus, M. L., Talavera, G. A., Avilés-Santa, M. L., Allison, M., Cai, J., Criqui, M. H., … & LaVange, L. (2012). Prevalence of major cardiovascular risk factors and cardiovascular diseases among Hispanic/Latino individuals of diverse backgrounds in the United States. JAMA, 308(17), 1775-1784.
- Ettehad, D., Emdin, C. A., Kiran, A., Anderson, S. G., Callender, T., Emberson, J., … & Rahimi, K. (2016). Blood pressure lowering for prevention of cardiovascular disease and death: a systematic review and meta-analysis. The Lancet, 387(10022), 957-967.
- Maas, A. H., & Appelman, Y. E. (2010). Gender differences in coronary heart disease. Netherlands Heart Journal, 18(12), 598-603.
- Mitsnefes, M., Flynn, J., Cohn, S., Samuels, J., Blydt-Hansen, T., Saland, J., … & CKiD Study Group. (2010). Masked hypertension associates with left ventricular hypertrophy in children with CKD. Journal of the American Society of Nephrology, 21(1), 137-144.
- Mosca, L., Mochari-Greenberger, H., Dolor, R. J., Newby, L. K., & Robb, K. J. (2010). Twelve-year follow-up of American women’s awareness of cardiovascular disease risk and barriers to heart health. Circulation: Cardiovascular Quality and Outcomes, 3(2), 120-127.
- Nahar, J., Imam, T., Tickle, K. S., & Chen, Y. P. P. (2013). Association rule mining to detect factors which contribute to heart disease in males and females. Expert Systems with Applications, 40(4), 1086-1093.
- Thayer, J. F., Yamamoto, S. S., & Brosschot, J. F. (2010). The relationship of autonomic imbalance, heart rate variability and cardiovascular disease risk factors. International Journal of Cardiology, 141(2), 122-131.