The Use of the Boosted Regression Tree Optimization Technique to Analyse Air Pollution Data
Noor Zaitun Yahaya1, Zul Fadhli Ibrahim2, Jamaiah Yahaya3
1Noor Zaitun Yahaya, Senior Lecturer, School of Ocean Engineering, University Malaysia Terengganu, Terengganu, Malaysia.
2Zul Fadhli Ibrahim, Researcher, School of Ocean Engineering, University Malaysia Terengganu, Terengganu, Malaysia.
3Jamaiah Yahaya, Assoc. Professor, School of Informatic Technology, Universiti Kebangsaan Malaysia, Bangi, Selangor, Malaysia.

Manuscript received on November 15, 2019. | Revised Manuscript received on November 23, 2019. | Manuscript published on November 30, 2019. | PP: 1565-1575 | Volume-8 Issue-4, November 2019. | Retrieval Number: B3807078219/2019©BEIESP | DOI: 10.35940/ijrte.B3807.118419

© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)

Abstract: The stochastic boosted regression trees (BRT) technique can quantify and explain the relationships between explanatory variables. We applied this machine learning technique to derive the relationships between gaseous air pollutants, meteorological conditions, temporal variables and particulate matter (PM10) concentrations. To obtain the lowest prediction error and avoid over-fitting, the parameters of the BRT model need to be tuned. In this experiment, 25 BRT models were generated from 14 years of hourly data (122,736 one-hour averages from January 2000 to December 2013) gathered from four Continuous Automated Air Quality Monitoring Stations in Peninsular Malaysia, located in Klang, Selangor (CA0011); Perai, Penang (CA0003); Kota Bharu, Kelantan (CA0022); and Kemaman, Terengganu (CA0002). Seventy percent of the data were used for training and 30 percent for validation of the models. An experiment was conducted to determine the best iteration for modelling hourly PM10 concentrations by optimizing the BRT parameters, namely learning rate (lr), tree complexity (tc) and number of trees (nt). Five learning rates (0.001, 0.005, 0.01, 0.05 and 0.1) were tested with tree complexities from 1 to 20 during BRT model development. The combination of lr = 0.05 and tc = 5 achieved the lowest root mean squared error (RMSE) on the training set among the tested combinations. The number of trees also increased with the number of samples: a high coefficient of determination (R2) of 0.90 was found for the linear relationship between the number of samples and nt at all four stations. The optimum number of trees for the model was estimated using 10-fold cross-validation. The best number of iterations for Klang, Perai, Kota Bharu and Kemaman was 12,327, 32,987, 16,370 and 57,634, respectively.
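The interaction between learning rate and number of trees described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single predictor, one-split trees (tc = 1), deterministic rather than stochastic boosting, and a 70/30 hold-out split in place of 10-fold cross-validation. All function names and the synthetic data are hypothetical.

```python
import math
import random


def rmse(obs, pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(obs, pred)) / len(obs))


def fit_stump(x, residual):
    """Least-squares one-split tree on one feature: (threshold, left, right)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    xs = [x[i] for i in order]
    rs = [residual[i] for i in order]
    n, total = len(xs), sum(rs)
    best = None
    left_sum = 0.0
    for k in range(1, n):
        left_sum += rs[k - 1]
        if xs[k] == xs[k - 1]:
            continue  # cannot split between identical x values
        right_sum = total - left_sum
        # Maximising this score is equivalent to minimising squared error.
        score = left_sum ** 2 / k + right_sum ** 2 / (n - k)
        if best is None or score > best[0]:
            threshold = (xs[k - 1] + xs[k]) / 2
            best = (score, threshold, left_sum / k, right_sum / (n - k))
    if best is None:  # all x identical: predict the mean residual
        return (xs[0], total / n, total / n)
    return best[1:]


def stump_predict(stump, xi):
    threshold, left, right = stump
    return left if xi <= threshold else right


def boost(x_tr, y_tr, x_va, y_va, lr, max_trees):
    """Boost stumps on the training set; return (best nt, validation RMSE)."""
    base = sum(y_tr) / len(y_tr)
    f_tr = [base] * len(y_tr)
    f_va = [base] * len(y_va)
    best_nt, best_err = 0, rmse(y_va, f_va)
    for nt in range(1, max_trees + 1):
        residual = [y - f for y, f in zip(y_tr, f_tr)]
        stump = fit_stump(x_tr, residual)
        # Each new tree is shrunk by the learning rate before being added.
        f_tr = [f + lr * stump_predict(stump, xi) for f, xi in zip(f_tr, x_tr)]
        f_va = [f + lr * stump_predict(stump, xi) for f, xi in zip(f_va, x_va)]
        err = rmse(y_va, f_va)
        if err < best_err:
            best_nt, best_err = nt, err
    return best_nt, best_err


# Synthetic stand-in for hourly PM10 data (hypothetical, not the study data).
random.seed(0)
x = [random.uniform(0, 10) for _ in range(200)]
y = [20 + 5 * math.sin(xi) + random.gauss(0, 1) for xi in x]
split = int(0.7 * len(x))  # 70 percent training, 30 percent validation
for lr in (0.01, 0.05, 0.1):
    nt, err = boost(x[:split], y[:split], x[split:], y[split:], lr, 200)
    print(f"lr={lr}: best nt={nt}, validation RMSE={err:.3f}")
```

A fuller experiment would repeat this loop over tree complexities as well and replace the hold-out split with k-fold cross-validation when choosing nt.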
The prediction accuracy of the model was tested using the fraction of predictions within a factor of two (FAC2), mean bias, mean gross error, RMSE, correlation coefficient (R), and index of agreement (IOA). The prediction performance of the final BRT model, based on the R value, was 0.81, 0.78, 0.85 and 0.81 for Perai, Kemaman, Klang and Kota Bharu, respectively, which indicates that the developed BRT model performs well and that the approach can be applied to other atmospheric environmental data.
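The evaluation statistics listed above can be computed as in the following sketch. The abstract does not state which formulation of the index of agreement was used, so Willmott's original (1981) form is assumed here. The function name `evaluate` and the sample values are hypothetical.

```python
import math


def evaluate(obs, pred):
    """Return a dict of FAC2, mean bias, mean gross error, RMSE, R and IOA."""
    n = len(obs)
    if n == 0 or n != len(pred):
        raise ValueError("obs and pred must be non-empty and of equal length")
    diffs = [p - o for o, p in zip(obs, pred)]
    mean_o = sum(obs) / n
    mean_p = sum(pred) / n

    # FAC2: share of predictions with 0.5 <= pred/obs <= 2.0.
    # Pairs with obs == 0 are skipped because the ratio is undefined.
    ratios = [p / o for o, p in zip(obs, pred) if o != 0]
    fac2 = sum(0.5 <= r <= 2.0 for r in ratios) / len(ratios) if ratios else math.nan

    mb = sum(diffs) / n                              # mean bias
    mge = sum(abs(d) for d in diffs) / n             # mean gross error
    rmse = math.sqrt(sum(d * d for d in diffs) / n)  # root mean squared error

    # Pearson correlation coefficient.
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(obs, pred))
    ss_o = sum((o - mean_o) ** 2 for o in obs)
    ss_p = sum((p - mean_p) ** 2 for p in pred)
    r = cov / math.sqrt(ss_o * ss_p) if ss_o > 0 and ss_p > 0 else math.nan

    # Index of agreement, assuming Willmott's original (1981) definition.
    denom = sum((abs(p - mean_o) + abs(o - mean_o)) ** 2 for o, p in zip(obs, pred))
    ioa = 1 - sum(d * d for d in diffs) / denom if denom > 0 else math.nan

    return {"FAC2": fac2, "MB": mb, "MGE": mge, "RMSE": rmse, "R": r, "IOA": ioa}


# Hypothetical observed and predicted PM10 values, for illustration only.
stats = evaluate([10, 20, 30, 40], [12, 18, 33, 100])
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

FAC2 and IOA complement R here: a model can correlate well with observations while still being biased, which R alone would not reveal.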
Keywords: Boosted Regression Tree, Tuning Parameters, Hourly PM10 Model, Particulate Matter.
Scope of the Article: Regression and Prediction.