Putting Big Data analytics to work
Hello Adriaan Stander, good post.
Validating a regression model involves several procedures: checking the model's predictions and coefficients against theory, collecting new data to test the model's predictions, comparing the results with theoretical model calculations, and data splitting (cross-validation), where part of the data is used to estimate the model coefficients and the remaining data is used to measure the model's prediction accuracy. Data splitting is especially useful when collecting new data for model testing is not practical. When there is no obvious variable, such as time, to use as a basis for the split, the DUPLEX algorithm developed by Kennard is a good choice for dividing the data into an estimation set and a prediction set (Ludwig et al., 2015).
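To make the data-splitting idea concrete, here is a minimal sketch in plain Python. It assumes a single predictor, a simple random split (not the DUPLEX algorithm), and synthetic data generated purely for illustration. The function names `fit_line` and `split_validate`, the 70/30 split, and the seeds are my own assumptions, not something taken from the cited paper.

```python
import random


def fit_line(xs, ys):
    """Closed-form ordinary least squares for y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


def split_validate(xs, ys, estimation_fraction=0.7, seed=0):
    """Estimate coefficients on one part of the data, score on the rest."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    rng.shuffle(indices)  # random split; assumes no time ordering matters
    n_est = round(estimation_fraction * len(indices))
    est, pred = indices[:n_est], indices[n_est:]

    # Coefficients come from the estimation set only.
    b0, b1 = fit_line([xs[i] for i in est], [ys[i] for i in est])

    # Prediction accuracy is measured on the held-out prediction set.
    held_out_y = [ys[i] for i in pred]
    mean_held_out = sum(held_out_y) / len(held_out_y)
    ss_res = sum((ys[i] - (b0 + b1 * xs[i])) ** 2 for i in pred)
    ss_tot = sum((y - mean_held_out) ** 2 for y in held_out_y)
    r2_prediction = 1.0 - ss_res / ss_tot
    return b0, b1, r2_prediction


# Synthetic data (an assumption for illustration): y = 1 + 2x + noise.
data_rng = random.Random(42)
xs = [data_rng.uniform(0, 10) for _ in range(200)]
ys = [1.0 + 2.0 * x + data_rng.gauss(0, 0.5) for x in xs]

b0, b1, r2_prediction = split_validate(xs, ys)
print(f"intercept={b0:.3f} slope={b1:.3f} prediction R^2={r2_prediction:.3f}")
```

If the prediction R^2 on the held-out set is much lower than the fit on the estimation set, that would suggest the model does not generalize well, which is the kind of check data splitting is meant to provide.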
Reference
Ludwig, N., Feuerriegel, S. & Neumann, D., 2015. Putting Big Data analytics to work: Feature selection for forecasting electricity prices using the LASSO and random forests. Journal of Decision Systems, 24(1), pp. 1-28.