Association of Vitamin D Receptor gene polymorphisms with the occurrence of decreased bone density among participants with type 2 diabetes using classification and regression tree (CART) analysis
Authors:
Maryam Ghodsi[1]
Abbas Ali Keshtkar [2]
Shahin Roshani [3]
Mahsa Mohammad Amoli [4]
Ensieh Nasli-Esfahani [1]
Bagher Larijani[5]
Mohamad Reza Mohajeri [5]*
*Corresponding Author, Email: mrmohajeri@sina.tums.ac.ir
Conflict of interest: Maryam Ghodsi, Abbas Ali Keshtkar, Shahin Roshani, Farideh Razi, Mahsa Mohammad Amoli, Ensieh Nasli-Esfahani, Bagher Larijani, and Mohamad Reza Mohajeri declare that they have no conflict of interest.
Other authors Emails:
- Maryam Ghodsi: ghodsi@gmail.com,
- Abbas Ali Keshtkar: abkeshtkar@gmail.com,
- Shahin Roshani: roshani@ncdrc.info,
- Farideh Razi: razi@gmail.com,
- Mahsa Mohammad Amoli: amolimm@tums.ac.ir,
- Ensieh Nasli-Esfahani: nasli@yahoo.com,
- Bagher Larijani: emri@tums.ac.ir
Keywords: CART analysis, logistic regression, cross-validation, VDR polymorphism, type 2 diabetes, osteoporosis.
Summary
We estimated the accuracy of the CART analysis method in identifying the association between five VDR gene polymorphisms and decreased bone density in people with type 2 diabetes (T2D). The results suggest that the final CART model has high estimated accuracy.
Abstract
Background: Osteoporosis research is increasingly concerned with finding the genetic factors most strongly associated with bone density. These associations are often investigated through traditional multivariable models. However, these methods have fundamental limitations when analyzing data sets with small sample sizes, many predictors (explanatory variables), a high proportion of missing values, and marked multicollinearity. To address these problems, we used a non-parametric, nonlinear analysis method, classification and regression tree (CART) analysis. This study aimed to investigate the accuracy of the CART method in identifying the occurrence pattern of low bone density among a sample of participants with type 2 diabetes (T2D).
Methods: We used retrospective data from a population-based cross-sectional study, the Iranian Multi-Center Osteoporosis Study (IMOS). Data from a total of 158 T2D patients were included in the analysis.
Results: The CART analysis identified X nodes that were combined to create an algorithm of low bone density occurrence with five predictors. The algorithm yielded a sensitivity of X% and specificity of Y%.
Conclusion:
There is increasing interest in using CART analysis in the health domain, primarily because it is easy to implement, use, and interpret, and thus facilitates medical decision-making. This method should be used for analyzing continuous or categorical outcomes in hemophilia, when applicable.
Thus, the CART model could facilitate medical decision-making and provide clinicians with a validated practical bedside tool for ACHBLF risk stratification.
Keywords: stratified samples, logistic regression, CART method, Partial interaction
Introduction
A better understanding of the factors that influence osteoporosis is essential for developing strategies to mitigate its burden. In this regard, osteoporosis research is increasingly concerned with identifying genetic components associated with bone density, such as vitamin D receptor (VDR) gene polymorphisms. These associations are usually investigated through traditional multivariable models: when decreased bone density is treated as a binary outcome variable, the association is studied with multivariable logistic regression (LR) models, and when bone density is treated as a continuous outcome variable, it is studied with multivariable linear regression models [1, 2].
Nevertheless, traditional methods can be difficult to implement and interpret under particular conditions, especially for clinicians making a medical decision [3]. For example, highly skewed data, such as serum vitamin D, cannot be analyzed correctly using these methods. These methods also have problems with data sets that have small sample sizes, many predictors (explanatory variables), missing values, outliers, and multicollinearity. These challenges were the impetus for Breiman to develop an innovative method named “classification and regression tree analysis” (CART), which can address them [4]. This method is less affected by small sample size, missing data, and outliers than other methods, and it is not limited by a large number of explanatory variables (predictors) or by multicollinearity.
Despite researchers' increasing interest in using the CART method in data mining and related medical research, we found only one published study that used the CART methodology in a field related to osteoporosis [5]. In this paper, we aimed to define the importance of five VDR gene polymorphisms (TaqI, BsmI, ApoI, FokI, and EcoRV) for the occurrence patterns of low bone density among individuals with type 2 diabetes (T2D) by implementing the CART statistical method.
Methods
The Study population
The Endocrinology and Metabolism Research Institute (EMRI) conducted a population-based cross-sectional study, the Iranian Multi-Center Osteoporosis Study (IMOS), in two cities of Kurdistan province in Iran between 2012 and 2013 [6]. In 2015, the institute performed a genetic study of five VDR polymorphisms on the remaining stored blood samples of the participants from the city of Sanandaj (1032/1266) [7]. We had access to the complete merged data of these two studies. In the current research, the study population comprised those participants of the 3rd phase of the IMOS study who were considered to have T2D (N=165). We described the recruitment method for participants deemed to have T2D in a previously published article (my 2nd article). We excluded the T2D participants (N=7) who had no bone mineral density (BMD) measurements. All of the remaining participants had BMD results at three sites (the lumbar spine (L2-L4), hip, and femoral neck), measured by dual-energy X-ray absorptiometry (Norland XR46) in 2013 [7].
Given the retrospective nature of this study, renewed patient consent was not required. The Ethics Committee of the EMRI, the authors' affiliated institution, approved the initial study protocol.
Measurements: Response and explanatory variables
We describe the study measurements for the current research in two categories, response and explanatory variables, as follows:
Response variable: We considered bone density status as the primary binary response variable, defined as low or normal. We defined the two groups based on the osteoporosis criteria of the World Health Organization (WHO) [8]. The first group, named “low”, included those T2D patients who had low bone density (LBD), that is, osteopenia or osteoporosis according to the WHO criteria [8], at any of the three sites at which BMD had been measured. Patients with normal bone density at all three sites were included in the “normal” group.
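The grouping rule above can be sketched in Python. This is only an illustrative sketch: it assumes that each site's result is available as a T-score and that the WHO cut-off for normal bone density is a T-score of -1.0 or higher. The function name, site names, and example values are hypothetical and are not taken from the IMOS data set.

```python
from typing import Mapping

# Assumed WHO cut-off: a T-score of -1.0 or higher is classed as normal;
# anything lower is osteopenia or osteoporosis, i.e. low bone density.
NORMAL_T_SCORE_CUTOFF = -1.0


def bone_density_status(t_scores: Mapping[str, float]) -> str:
    """Return "low" if any measured site is below the cut-off, else "normal"."""
    if not t_scores:
        raise ValueError("at least one BMD site is required")
    if any(t < NORMAL_T_SCORE_CUTOFF for t in t_scores.values()):
        return "low"
    return "normal"


# Hypothetical participants with T-scores at the three measured sites.
participant_a = {"lumbar_spine": -0.4, "hip": -0.8, "femoral_neck": -0.2}
participant_b = {"lumbar_spine": -0.4, "hip": -1.7, "femoral_neck": -0.2}

print(bone_density_status(participant_a))
print(bone_density_status(participant_b))
```

A single site below the cut-off is enough to place a participant in the “low” group, which is why the rule uses `any` rather than an average across sites.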
Explanatory variables: We selected a small number of variables as explanatory variables (potential predictors): age and BMI as continuous explanatory variables; gender and vitamin D deficiency as binary explanatory variables; and the five VDR gene polymorphisms (TaqI, BsmI, ApoI, FokI, and EcoRV) as categorical explanatory variables.
Following the Endocrine Society Clinical Practice Guideline [9], we defined vitamin D deficiency as a 25-hydroxyvitamin D level below 20 ng/ml (50 nmol/liter). Each gene polymorphism variable has three categories (genotypes), which we named using the first letter of the polymorphism's name as follows: two capital letters indicate the homozygous wild-type genotype, two lowercase letters indicate the homozygous variant genotype, and a capital letter followed by a lowercase letter indicates the heterozygous genotype.
Additional information about these measures and variables, as well as a full description of the genotyping process, is available in our previous article (My second article).
Analyzing method
Because of the small sample size, we used all of the data in the training stage. Missing values of the explanatory variables were imputed with the median for numerical variables and with the most frequent class for categorical variables [10]. We summarised the explanatory variables in the two groups of the primary outcome variable following the SAMPL guidelines [11]: continuous variables with a normal distribution were summarised as mean (standard deviation); otherwise, they were reported as median (interquartile range). We expressed the distribution of categorical variables as numbers (percentages). We checked the normality assumption using statistical tests (Kolmogorov-Smirnov/Shapiro-Wilk tests) and graphical assessments (histograms, Q-Q plots, and box plots). We used the Chi-squared test (Pearson/Fisher test), t-test, Mann-Whitney U test, and Pearson/Spearman correlation coefficients to evaluate relationships between the primary outcome variable and the explanatory variables.
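As an illustration of the imputation step described above, the following Python sketch fills missing numerical values with the median and missing categorical values with the most frequent class. It is a minimal sketch, assuming missing values are represented as `None`; the function names and the example values are hypothetical, and it does not reproduce the RapidMiner or R procedure used in the study.

```python
from collections import Counter
from statistics import median
from typing import List, Optional


def impute_median(values: List[Optional[float]]) -> List[float]:
    """Replace missing numerical values with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def impute_mode(values: List[Optional[str]]) -> List[str]:
    """Replace missing categorical values with the most frequent observed class."""
    observed = [v for v in values if v is not None]
    fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


# Hypothetical example values (not study data).
bmi = [24.0, None, 31.5, 27.0]
foki = ["FF", "Ff", None, "FF"]

print(impute_median(bmi))   # [24.0, 27.0, 31.5, 27.0]
print(impute_mode(foki))    # ['FF', 'Ff', 'FF', 'FF']
```

The median is used rather than the mean because it is less sensitive to skewed variables and outliers.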
To define the importance of the five VDR gene polymorphisms (TaqI, BsmI, ApoI, FokI, and EcoRV) for the occurrence patterns of low bone density among individuals with type 2 diabetes, we used CART analysis. We estimated the accuracy of the final two models using 10-fold cross-validation and compared them by plotting the prediction intervals in forest plots.
We performed statistical analyses using RapidMiner (ver. 9) and R software [12]. All statistical tests were two-tailed, and a P-value < 0.05 was considered significant.
The CART Methodology
To obtain the “final optimal” tree, we performed CART analysis in RapidMiner using gain ratio, information gain, Gini index, and accuracy as splitting criteria. According to the splitting rules, the CART method grew trees that classified participants into two categories, normal and low (bone density), based on the predictors (explanatory variables). The software selected the final optimal tree according to the criterion that yielded the highest accuracy in the cross-validation process, which also identified the best-fitting CART model for the data set under investigation.
Validation and comparison of the precision of the final trees
Using cross-validation, we calculated the predictive capacity of each of the optimal CART models as an internal validation. We compared the final trees by plotting their prediction intervals in forest plots.
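To make the splitting criteria named above more concrete, the following Python sketch computes the Gini index, information gain, and gain ratio for one candidate categorical split. It uses the standard textbook definitions and is an assumption-labelled illustration only: it does not reproduce RapidMiner's internal implementation, and the genotype and status values are made up.

```python
from collections import Counter
from math import log2
from typing import List, Sequence


def gini(labels: Sequence[str]) -> float:
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (in bits) of the class distribution."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def partition(values: Sequence[str], labels: Sequence[str]) -> List[List[str]]:
    """Group the outcome labels by the value of the splitting variable."""
    groups: dict = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    return list(groups.values())


def information_gain(values: Sequence[str], labels: Sequence[str]) -> float:
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(labels)
    children = sum(len(g) / n * entropy(g) for g in partition(values, labels))
    return entropy(labels) - children


def gain_ratio(values: Sequence[str], labels: Sequence[str]) -> float:
    """Information gain divided by the entropy of the split variable itself."""
    split_info = entropy(values)
    return information_gain(values, labels) / split_info if split_info > 0 else 0.0


def gini_gain(values: Sequence[str], labels: Sequence[str]) -> float:
    """Parent Gini impurity minus the size-weighted impurity of the child nodes."""
    n = len(labels)
    children = sum(len(g) / n * gini(g) for g in partition(values, labels))
    return gini(labels) - children


# Made-up example: six participants, one polymorphism, binary outcome.
genotype = ["FF", "FF", "Ff", "Ff", "ff", "ff"]
status = ["normal", "normal", "low", "normal", "low", "low"]

print(round(information_gain(genotype, status), 4))
print(round(gain_ratio(genotype, status), 4))
print(round(gini_gain(genotype, status), 4))
```

Gain ratio penalises variables with many categories by dividing by the entropy of the split itself, which is one reason different criteria can select different trees on the same data.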
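The cross-validation procedure referred to above can be sketched as follows. This is a minimal, hedged illustration in plain Python: the fold assignment, the `fit_majority` baseline classifier, and the synthetic labels are all assumptions made for the example and stand in for the CART models fitted in RapidMiner.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence


def k_fold_indices(n: int, k: int = 10, seed: int = 0) -> List[List[int]]:
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validated_accuracy(
    X: Sequence[Sequence[float]],
    y: Sequence[str],
    fit: Callable[[list, list], Callable[[Sequence[float]], str]],
    k: int = 10,
    seed: int = 0,
) -> List[float]:
    """Return one accuracy per fold: train on k-1 folds, test on the held-out fold."""
    accuracies = []
    for test_idx in k_fold_indices(len(y), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        predict = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict(X[i]) == y[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return accuracies


def fit_majority(X_train: list, y_train: list) -> Callable[[Sequence[float]], str]:
    """Hypothetical baseline: always predict the most frequent training class."""
    majority = Counter(y_train).most_common(1)[0][0]
    return lambda row: majority


# Synthetic data with the same size as the study sample (158 participants).
X = [[float(i)] for i in range(158)]
y = ["low"] * 80 + ["normal"] * 78

fold_accuracies = cross_validated_accuracy(X, y, fit_majority, k=10)
print(sum(fold_accuracies) / len(fold_accuracies))
```

Any model-fitting function with the same interface (training data in, prediction function out) could be passed in place of `fit_majority`; the spread of the per-fold accuracies is what an interval around the mean accuracy would summarise.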
Results
The study sample contained 158 T2D participants; most of them were women (102/158, 61.82%).
The mean (SD) age of the study population was 50 (12) years (range 26-83 years).
According to this classification, 80 participants (50.63%) had LBD.
The target population of our study included 158/160 previous participants with T2D; details have been previously published (the 2nd article). To study the unadjusted effects of categorical variables, we used Chi-squared tests; for continuous variables, we used t-tests (Table 1).
In the final decision tree, each internal (non-leaf) node tests an attribute (explanatory variable), each branch corresponds to an attribute value, and each leaf node assigns a class.
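The tree structure described above can be represented with a small data structure. The Python sketch below is purely illustrative: the `Node` class, the `classify` function, and in particular the toy tree are hypothetical and are not the tree fitted in this study.

```python
from dataclasses import dataclass, field
from typing import Dict, Mapping, Optional


@dataclass
class Node:
    """Internal nodes test an attribute; leaf nodes carry a class label."""
    attribute: Optional[str] = None                         # set on internal nodes
    branches: Dict[str, "Node"] = field(default_factory=dict)  # attribute value -> child
    label: Optional[str] = None                             # set on leaf nodes only


def classify(node: Node, record: Mapping[str, str]) -> str:
    """Follow the branch matching each tested attribute until a leaf is reached."""
    while node.label is None:
        node = node.branches[record[node.attribute]]
    return node.label


# Made-up tree for illustration only (not a study result).
toy_tree = Node(
    attribute="vitamin_d_deficient",
    branches={
        "yes": Node(label="low"),
        "no": Node(
            attribute="FokI",
            branches={
                "FF": Node(label="normal"),
                "Ff": Node(label="normal"),
                "ff": Node(label="low"),
            },
        ),
    },
)

print(classify(toy_tree, {"vitamin_d_deficient": "no", "FokI": "ff"}))
```

Reading a prediction off such a tree only requires answering the question at each node in turn, which is the property that makes tree output straightforward to use at the bedside.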
Conclusion
Our study aimed to determine the association between VDR gene polymorphisms (TaqI, BsmI, ApoI, FokI, and EcoRV) and the occurrence of low bone density (osteopenia/osteoporosis) in individuals with type 2 diabetes, using both LR and CART methods.
The confusion matrices supported the use of the CART method as the best approach. The main advantages of CART over logistic regression are its inherent ease of implementation, understanding, and interpretation.
The CART analysis worked better for our data set. However, logistic regression, as a parametric method, could be more generalizable than the CART analysis.
For these reasons, the interest of medical researchers in using the CART methodology has increased in recent years.
The advantages of CART vs. logistic regression
Because of its innovative approach, CART analysis can address challenges that other analysis methods cannot. Below we detail the benefits of the CART method compared with multivariable logistic regression:
- Sample size. CART makes no assumption about sample size. In contrast, multivariable logistic regression requires a large sample size.
- Missing values. Missing values are one of the main issues in logistic regression. Because it deletes an observation (individual) entirely when a value is missing for even one of the model's explanatory variables, the sample remaining for building the final model can shrink dramatically. The CART method solves the problem by using surrogate variables. When an explanatory variable is missing for an individual observation, a surrogate splitting variable is sought. A surrogate variable is a variable whose pattern within the data set, relative to the outcome variable, is similar to that of the primary explanatory variable.
- Distributional assumptions. CART is easier to implement than logistic regression because it makes no assumptions about the distribution of the outcome or explanatory variables, or about the type of relationship between them. Logistic regression, however, has a critical assumption: there must be a linear association between each explanatory variable and the logarithm of the odds of the outcome variable (log(P/(1-P)), where P is the probability of the binary outcome). Where this assumption does not hold, the CART method can handle nonlinear relationships through partitioning. In partitioning, the data are split according to a binary sequence, i.e., a series of simple yes/no answers.
- Multicollinearity. In logistic regression, multicollinearity between two or more explanatory variables could cause problems with estimations. In contrast, in the CART method, the problems are solved by selecting the best splitters.
- Outliers. Outlying values must be removed manually when logistic regression is used. In contrast, CART analysis can automatically identify outlying values and isolate them in a separate node, thus preventing them from affecting the splitting.
- Limitation of the number of explanatory variables. In logistic regression, the number of explanatory variables is restricted and should be less than the number of observations (individuals). Moreover, the potential explanatory variables must be selected at the outset. In the CART method, by contrast, there is no restriction on the number of explanatory variables, because CART uses only the best splitters as explanatory variables.
- Complex interactions. In logistic regression analysis, it is almost impossible to identify complex interactions between explanatory variables, whereas the CART method, by selecting the best splitter at each node, can deal with interactions efficiently and illustrate them in the final tree.
- Understanding and interpretation. The final output of CART analysis is a tree structure that includes several terminal nodes; each node specifies a subgroup of patients who have a specific condition or who are indicated for special attention or treatment. For clinicians, such tree structures make more sense than logistic regression models. Especially when deciding at the patient's bedside, clinicians should not have to interpret a complex equation to make a correct decision. With a tree structure, a physician can quickly determine which subgroup a patient belongs to and which subgroups need special attention or treatment for the studied outcome. This method can therefore help physicians distinguish patients who are at higher risk of the studied outcome from those who are not. CART analysis can thus be very useful in medical decision-making, particularly in guideline design.
References
- Hosmer Jr, D.W., S. Lemeshow, and R.X. Sturdivant, Applied logistic regression. Vol. 398. 2013: John Wiley & Sons.
- Marill, K.A., Advanced statistics: linear regression, part II: multiple linear regression. Acad Emerg Med, 2004. 11(1): p. 94-102.
- Lewis, R.J. An introduction to classification and regression tree (CART) analysis. in Annual meeting of the society for academic emergency medicine in San Francisco, California. 2000.
- Breiman, L., et al., Classification and Regression Trees. 1984, Pacific Grove, CA: Wadsworth and Brooks/Cole.
- Su, Y., et al., Can Classification and Regression Tree Analysis Help Identify Clinically Meaningful Risk Groups for Hip Fracture Prediction in Older American Men (The MrOS Cohort Study)? JBMR Plus, 2019. 3(10): p. e10207.
- Keshtkar, A., et al., A Suggested Prototype for Assessing Bone Health. Arch Iran Med, 2015. 18(7): p. 411-5.
- Mohammadi, Z., et al., Prevalence of osteoporosis and vitamin D receptor gene polymorphisms (FokI) in an Iranian general population based study (Kurdistan)(IMOS). Medical journal of the Islamic Republic of Iran, 2015. 29: p. 238.
- Bonjour, J.P., P. Ammann, and R. Rizzoli, Importance of preclinical studies in the development of drugs for treatment of osteoporosis: a review related to the 1998 WHO guidelines. Osteoporos Int, 1999. 9(5): p. 379-93.
- Holick, M.F., et al., Evaluation, Treatment, and Prevention of Vitamin D Deficiency: an Endocrine Society Clinical Practice Guideline. The Journal of Clinical Endocrinology & Metabolism, 2011. 96(7): p. 1911-1930.
- Donders, A.R.T., et al., A gentle introduction to imputation of missing values. Journal of clinical epidemiology, 2006. 59(10): p. 1087-1091.
- Lang, T.A. and D.G. Altman, Basic statistical reporting for articles published in Biomedical Journals: The “Statistical Analyses and Methods in the Published Literature” or the SAMPL Guidelines. International Journal of Nursing Studies, 2015. 52(1): p. 5-9.
- Mierswa, I. and R. Klinkenberg, RapidMiner Studio (9.1) [Data science, machine learning, predictive analytics]. 2018. Retrieved from https://rapidminer.com/.
[1]Diabetes Research Center, Endocrinology and Metabolism Clinical Sciences Institute, Tehran University of Medical Sciences, Tehran, Iran
[2]Department of Health Sciences Education Development, School of Public Health, Tehran University of Medical Sciences, Tehran, Iran
[3]Non-Communicable Diseases Research Center (NCDRC), Endocrinology and Metabolism Molecular-Cellular Sciences Institute, Tehran University of Medical Sciences, Tehran, Iran
[4]Metabolic Disorders Research Center, Endocrinology and Metabolism Molecular-Cellular Sciences Institute, Tehran University of Medical Sciences, Tehran, Iran
[5]Endocrinology and Metabolism Research Centre, Endocrinology and Metabolism Clinical Sciences Institute, Tehran University of Medical Sciences, Tehran, Iran