Statistical Downscaling Output GCM Modeling with Continuum Regression and Pre-Processing PCA Approach

One of the climate models used to predict climatic conditions is the Global Circulation Model (GCM). A GCM is a computer-based model consisting of numerical, deterministic equations that follow the laws of physics. It is a main tool for predicting climate and weather, and it also serves as a primary source of information for assessing the effects of climate change. The Statistical Downscaling (SD) technique is used to bridge the large-scale GCM output and the small scale of the study area. GCM data are spatial and temporal, so spatial correlation between data at different grid points within a single domain is likely. The resulting multicollinearity requires pre-processing of the predictor data X. Continuum Regression (CR) with Principal Component Analysis (PCA) pre-processing is an alternative for SD modelling. CR, developed by Stone and Brooks (1990), generalizes Ordinary Least Squares (OLS), Principal Component Regression (PCR), and Partial Least Squares (PLS), and is used to overcome multicollinearity. Data processing for the stations in Ambon, Pontianak, Losarang, Indramayu, and Yuntinyuat shows that, in the 8x8 and 12x12 domains, the CR method yields better RMSEP and $R^2_{\mathrm{predict}}$ values than PCR and PLS.
Keywords: CR, PCA, PCR, PLS, SD, GCM


I. INTRODUCTION
Recently, General Circulation Models (GCMs) have been widely recognized as important tools for understanding the climate system. However, many in the scientific community have expressed dissatisfaction because GCMs produce forecasts at an inadequate spatial scale [14]. One effort to overcome this problem is the Statistical Downscaling (SD) method [4]. The main advantages of this method are its inexpensive computation and its easy application to the output of many GCM-based simulations and experiments.
Most SD methods for climate studies were developed in high-latitude countries, whereas work in low-latitude regions such as Indonesia is still very limited [4] [14]. SD methods that relate the large scale to the local scale can be grouped by region or spatial domain, temporal scale, dependent variable, independent variable, and statistical method. Commonly used SD methods are classical or multiple regression [1,2], canonical correlation [2,16], Singular Value Decomposition (SVD) [11], and nonlinear approaches such as artificial neural networks [3]. SD models developed in Indonesia include those of Haryoko (2004) and Wigena & Aunuddin (2004) [13], but they did not consider spatial correlation, autocorrelation, or nonlinear data structure.

Institut Teknologi Sepuluh Nopember, Surabaya, 60111, Indonesia. Email: sutikno@statistika.its.ac.id.
2 Hendy Purnomoadi is a student in the Statistics Department Master Program, FMIPA, Institut Teknologi Sepuluh Nopember, Surabaya, 60111, Indonesia.
The problems that arise in the SD method are how to determine the domain (grid) and reduce its dimension; how to obtain independent variables that explain the variability of the dependent variable; how to choose a statistical method that suits the characteristics of the data and describes the relationship between the independent and dependent variables; and how to accommodate extreme events. Methods often used for pre-processing are Principal Component Analysis (PCA), the Discrete Wavelet Transform (DWT), Robust Principal Component Analysis (ROBPCA), and Kernel PCA. Continuum Regression (CR), in turn, models the dependent variable on the pre-processed variables and is one potential method for overcoming multicollinearity.
The purpose of this study is to compare the performance of CR, PCR, and PLS with PCA pre-processing, using the Root Mean Square Error Prediction (RMSEP) and $R^2_{\mathrm{predict}}$ criteria.

A. Principal Components Analysis (PCA)
PCA is a procedure that reduces the dimension of the data by transforming the original correlated variables into a new set of uncorrelated variables. The new variables are called Principal Components (PCs) [6].
PCs can be obtained from the eigenvalue-eigenvector pairs of the covariance matrix or the correlation matrix. When the variables are measured in different units, the data are standardized first, essentially so that one or two variables do not dominate a PC. Let Σ be the variance-covariance matrix of the random vector $X^T = [X_1, X_2, \ldots, X_p]$. Σ is estimated by Maximum Likelihood Estimation (MLE) with the formula in Equation (1).

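$$\hat{\Sigma} = \frac{1}{n} \sum_{j=1}^{n} (x_j - \bar{x})(x_j - \bar{x})^T \quad (1)$$

Let $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$ be the eigenvalues of Σ. If k PCs are taken, where k < p, then the proportion of the total variance explained by the first k PCs is

$$\frac{\lambda_1 + \lambda_2 + \cdots + \lambda_k}{\lambda_1 + \lambda_2 + \cdots + \lambda_p}$$

Furthermore, when the starting point is the covariance matrix of the standardized data (the correlation matrix ρ), the main diagonal contains ones, so the total population variance of the standardized variables is p, the sum of the diagonal elements of ρ. The proportion of the total variance explained by the i-th PC is then

$$\frac{\lambda_i}{p}, \quad i = 1, 2, \ldots, p$$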
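As a small numerical illustration of the proportions above, the following Python sketch computes the share of the total variance explained by each PC and the cumulative share. The eigenvalues are hypothetical values chosen to sum to p = 4, as they would for a correlation matrix; they are not taken from the GCM data, and the function names are illustrative only.

```python
def explained_proportions(eigenvalues):
    """Proportion of total variance explained by each PC."""
    total = sum(eigenvalues)
    return [lam / total for lam in eigenvalues]

def cumulative(values):
    """Running sum of a list of proportions."""
    out, running = [], 0.0
    for v in values:
        running += v
        out.append(running)
    return out

# Hypothetical eigenvalues of a 4 x 4 correlation matrix (they sum to p = 4).
eigenvalues = [2.8, 0.8, 0.3, 0.1]
props = explained_proportions(eigenvalues)   # lambda_i / p for each PC
cum = cumulative(props)                      # share explained by the first k PCs
print(props, cum)
```

With a correlation matrix the denominator is simply the number of variables, so each proportion reduces to the eigenvalue divided by p.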

B. Partial Least Square (PLS)
PLS is a statistical method that generalizes and combines features of factor analysis, PCA, and multiple regression. Its purpose is to form components that capture the information in the independent variables that is useful for predicting the dependent variable.
PCA focuses on the variability within the independent variables, whereas PLS focuses on the covariance between the independent variables and the dependent variable. The PLS model consists of external and internal relations. The external relations describe the blocks of variables individually, and the internal relation links the blocks.

C. Continuum Regression (CR)
CR is a family of regularized regression estimators used to handle collinearity or multicollinearity, that is, near-linear relationships among the independent variables. CR was developed from OLS, PCR, and PLS regression.
Consider the linear regression model

$$y = X\beta + \varepsilon \quad (10)$$

where X (size n x p) is the centered matrix of independent variables and y (size n x 1) is the centered vector of the dependent variable. Under multicollinearity, X is not a full-rank matrix. Consequently, the matrix $X^T X$ is (almost) singular.
For a regressor formed as a weighted linear combination $w^T x_i$ of the independent variables, the OLS principle is to maximize

$$(w^T s)^2 (w^T S w)^{-1} \quad (11)$$

where $x_i$ is the i-th observation vector of the independent variables (i = 1, 2, ..., n) of size p x 1, $s = X^T y$, and $S = X^T X$.
The PCR principle is to maximize

$$w^T S w \quad (12)$$

Formula (12) shows that the basic principle of PCR is to maximize the variance of the independent variables X, so that new variables are formed as several principal components, which are linear combinations of the original variables X. The dependent variable y is then regressed on these principal components using multiple linear regression.
The PLS regression principle is to maximize

$$(w^T s)^2 \quad (13)$$

Formula (13) shows that the PLS regression principle is to maximize the covariance between the dependent and independent variables.
The new variables in CR are written as in Equation (14):

$$y = T_h \xi + \varepsilon, \quad T_h = X W_h \quad (14)$$

where $W_h = (w_1, w_2, \ldots, w_h)$ is the weighting matrix, which contains h columns with h < p.
Another alternative is the formula developed by Malpass (1996) [7], Equation (16). The general formula derived from formula (15), Equation (17), is called the Stone method. Formula (16) can in turn be rewritten as Equation (18), which is called the Portsmouth method [7].
The formula is a generalization of OLS, PCR, and PLS, with the following links:
1. For δ = 0, $G = (w^T s)^2 (w^T S w)^{-1}$, which is equivalent to Equation (11); that is, for δ = 0, CR is OLS.
2. For δ = 0.5, $G = (w^T s)^2$, which is equivalent to Equation (13); that is, for δ = 0.5, CR is PLS.
3. For δ = 1, $G = w^T S w$, which is equivalent to Equation (12); that is, for δ = 1, CR is PCR.
In other words, OLS, PCR, and PLS are special cases of CR. The regression parameters ξ in Equation (14) are estimated by the least squares method:

$$\hat{\xi}_{\delta,h} = (T_h^T T_h)^{-1} T_h^T y$$

where δ is the adjustment parameter, which determines the weights in $W_h$, and h is the number of components.
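The three special cases can be checked numerically. The sketch below is a minimal illustration using NumPy on synthetic, nearly collinear data, not the authors' implementation. It evaluates the criteria for δ = 0, 0.5, and 1 and compares them with the closed-form directions that maximize them. The unit-length constraint on w, the synthetic data, and the helper names `unit` and `criteria` are assumptions introduced here for illustration.

```python
import numpy as np

def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

def criteria(w, S, s):
    """Evaluate the three special-case criteria for a weight vector w."""
    ws2 = float(w @ s) ** 2            # (w^T s)^2
    wSw = float(w @ S @ w)             # w^T S w
    return {"OLS": ws2 / wSw, "PLS": ws2, "PCR": wSw}

rng = np.random.default_rng(0)
n, p = 200, 3
common = rng.normal(size=(n, 1))
X = common + 0.1 * rng.normal(size=(n, p))       # nearly collinear columns
y = X @ np.array([1.0, -0.5, 0.25]) + 0.1 * rng.normal(size=n)
X = X - X.mean(axis=0)                           # center X and y as in Equation (10)
y = y - y.mean()

S = X.T @ X
s = X.T @ y

w_ols = unit(np.linalg.solve(S, s))   # delta = 0: direction of the OLS coefficients
w_pls = unit(s)                       # delta = 0.5: first PLS weight vector
w_pcr = np.linalg.eigh(S)[1][:, -1]   # delta = 1: leading eigenvector of S (first PC)

# Least squares estimate of xi in Equation (14) for h = 1, using the PLS weight.
T = X @ w_pls[:, None]                # T_h = X W_h with a single column
xi_hat = np.linalg.solve(T.T @ T, T.T @ y)
print(criteria(w_ols, S, s), criteria(w_pls, S, s), criteria(w_pcr, S, s), xi_hat)
```

Each direction attains the largest value of its own criterion among unit vectors, which is the sense in which OLS, PLS, and PCR appear as special cases.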

D. Model Goodness
A common measure of model goodness is the coefficient of determination $R^2$, which describes the goodness of prediction.
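The two criteria used in this study can be sketched in Python as follows. RMSEP is computed as the square root of the mean squared difference between actual and predicted out-of-sample values. The form of $R^2_{\mathrm{predict}}$ used here, one minus the ratio of the prediction error sum of squares to the total sum of squares, is an assumption, since the exact definition is not reproduced in the text. The rainfall values are hypothetical.

```python
import math

def rmsep(actual, predicted):
    """Root mean square error of prediction over out-of-sample data."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def r2_predict(actual, predicted):
    """Assumed form: 1 - SSE / SST, computed on out-of-sample data."""
    mean_a = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - sse / sst

# Hypothetical monthly rainfall values (mm), for illustration only.
actual = [120.0, 80.0, 200.0, 150.0]
predicted = [110.0, 90.0, 190.0, 160.0]
print(rmsep(actual, predicted), r2_predict(actual, predicted))
```

A smaller RMSEP and a larger $R^2_{\mathrm{predict}}$ both indicate predictions closer to the observed values.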

E. General Circulation Model (GCM)
A GCM is a computer-based climate model. It uses numerical and deterministic equations that follow the laws of physics. The GCM is the main tool for predicting or forecasting climate and weather, for understanding climate, and for climate change studies. According to [15], the GCM is a major tool in the study of climate variability and climate change. GCM output takes the form of grids with a size of 100-500 km, arranged by latitude and longitude. The model can be used to predict changes in weather elements [16]. However, GCM output is global information, so it is difficult to obtain information at the local scale directly. Local or regional information can still be obtained from the GCM when a downscaling technique is used [13].
Downscaling is defined as an effort to connect global-scale circulation variables (explanatory variables) with local-scale variables (dependent variables) [9]. SD is used to bridge the large-scale GCM and the smaller scale of the study area. SD is a static downscaling process in which data on large-scale grids over a certain time period are used as the basis for determining data on a smaller-scale grid [13].
The SD approach uses regional or global data to obtain the functional relationship between the local scale and the global scale of the GCM. In general, the relationship is expressed as

$$Y = f(Z) + \varepsilon$$

with
Y : dependent variable (rainfall)
Z : independent variables (combination of the results of the spatial (latitude and longitude) reduction of the GCM variables)
The independent variables are CSIRO Mk3 outputs: precipitable water (PRW), sea level pressure (SLP), meridional wind component (VA), zonal wind component (UA), geopotential height (ZG), and specific humidity (HUSS). The levels are 850 hPa, 500 hPa, and 200 hPa. The dependent variable is the monthly rainfall data from five stations.
Two criteria are used to assess the performance of CR, PCR, and PLS with PCA dimension reduction: RMSEP and $R^2_{\mathrm{predict}}$. The best model is the one with a small RMSEP and a high $R^2_{\mathrm{predict}}$.

A. Pre-processing SD Modeling
The first step in SD modeling is dimension reduction, called data pre-processing. Spatial dimension reduction is performed over latitude and longitude (the grid) for all variables at every level and in every domain. Each grid cell is an independent variable, so the 3x3, 8x8, and 12x12 domains contain 9, 64, and 144 variables, respectively, which are then reduced.
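To make the variable counts concrete, the short Python sketch below flattens one time step of a square domain into a single row of independent variables and counts the variables for each domain size. The helper name `flatten_domain` and the grid values are hypothetical placeholders, not GCM output.

```python
def flatten_domain(grid):
    """Turn one time step of a square grid (list of rows) into a flat list of variables."""
    return [value for row in grid for value in row]

# Hypothetical 3x3 domain for a single month; the values are placeholders.
grid_3x3 = [[1.0, 2.0, 3.0],
            [4.0, 5.0, 6.0],
            [7.0, 8.0, 9.0]]
row = flatten_domain(grid_3x3)

# Number of independent variables before reduction for each domain size.
n_vars = {f"{k}x{k}": k * k for k in (3, 8, 12)}
print(len(row), n_vars)
```

Stacking one such row per month gives the data matrix on which the dimension reduction operates.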

B. PCA Method
Principal components are constructed with PCA in three steps: first, compute the variance-covariance matrix; second, obtain the eigenvalues and eigenvectors of that matrix; and third, form linear combinations of the eigenvectors with the original data to obtain the principal components.
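The three steps can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' code: the data are synthetic stand-ins for the nine grid cells of a 3x3 domain, the function name `pca_reduce` is hypothetical, and components are retained until the cumulative variance reaches a threshold given by the parameter `cv_threshold` (set to 0.85 here).

```python
import numpy as np

def pca_reduce(X, cv_threshold=0.85):
    """Keep the fewest PCs whose cumulative variance reaches cv_threshold."""
    Xc = X - X.mean(axis=0)                      # center each grid-cell variable
    cov = Xc.T @ Xc / Xc.shape[0]                # step 1: variance-covariance matrix (divisor n)
    eigvals, eigvecs = np.linalg.eigh(cov)       # step 2: eigenvalues and eigenvectors
    order = np.argsort(eigvals)[::-1]            # sort from largest to smallest eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cv = np.cumsum(eigvals) / eigvals.sum()      # cumulative variance (CV)
    k = min(int(np.searchsorted(cv, cv_threshold)) + 1, len(eigvals))
    scores = Xc @ eigvecs[:, :k]                 # step 3: linear combinations of the data
    return scores, cv[:k]

rng = np.random.default_rng(1)
common = rng.normal(size=(120, 1))               # shared large-scale signal (synthetic)
X = common + 0.2 * rng.normal(size=(120, 9))     # 9 correlated cells of a 3x3 domain
scores, cv = pca_reduce(X)
print(scores.shape, np.round(cv, 3))
```

When the variables have different units, the columns would be standardized before step 1, which amounts to working with the correlation matrix instead of the covariance matrix.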
Following these steps, the number of principal components and the cumulative variance (CV) for the GCM variables are obtained, as listed in Tables 1 to 3.
Based on Table 1, the components produced from the GCM variables by PCA have a CV greater than or equal to 85%. The 3x3 domain uses one principal component for each variable except HUSS, which uses three, subsequently written HUSS1, HUSS2, and HUSS3. The 8x8 domain uses one to three principal components, except for HUSS, which uses six (HUSS1, HUSS2, HUSS3, HUSS4, HUSS5, and HUSS6). The 12x12 domain uses no more than four principal components, except for HUSS and VA500.
In general, for the surface-level variables the number of principal components increases with the domain size, except for SLP, which has only one or two principal components. For ZG, expanding the domain did not affect the number of principal components used. Results for the Ambon and Pontianak stations can be seen in Table 2 and Table 3.

C. CR, PCR, and PLS Method
SD modeling with the CR, PCR, and PLS methods uses the independent variables produced by PCA dimension reduction. It was carried out for the Ambon station (local rainfall type), the Pontianak station (equatorial rainfall type), and the Losarang, Indramayu, and Yuntinyuat stations (monsoon rainfall type). For Ambon, the total number of independent variables is 16 in the 3x3 domain, 28 in the 8x8 domain, and 39 in the 12x12 domain. For Pontianak, it is 20 in the 3x3 domain, 40 in the 8x8 domain, and 53 in the 12x12 domain.
For Losarang, Indramayu, and Yuntinyuat, it is 19 in the 3x3 domain, 34 in the 8x8 domain, and 50 in the 12x12 domain. The comparison of actual and predicted rainfall for each station and each grid is shown in Tables 4 to 8 and in Figs. 1 to 5. Indramayu has better results than the other stations: the difference between the predicted and actual values is relatively small. For the other stations, the comparison is not yet satisfactory, because the predicted values are still far from the actual values.
The RMSEP and $R^2_{\mathrm{predict}}$ values from SD modeling with the CR, PCR, and PLS methods at the Ambon, Pontianak, Losarang, Indramayu, and Yuntinyuat stations for the 3x3, 8x8, and 12x12 domains are shown in Table 9. In the 3x3 domain, PLS has a smaller RMSEP and a higher $R^2_{\mathrm{predict}}$ than CR and PCR. In the 8x8 domain, PLS has the smallest RMSEP, while CR has the highest $R^2_{\mathrm{predict}}$. In the 12x12 domain, CR has the smallest RMSEP and the highest $R^2_{\mathrm{predict}}$. It can therefore be concluded that the CR method performs better than the PCR and PLS methods.

IV. CONCLUSION
CR with PCA pre-processing can be used to overcome multicollinearity problems in SD modeling to forecast monthly rainfall at the Ambon, Pontianak, Losarang, Indramayu, and Yuntinyuat stations on the 3x3, 8x8, and 12x12 grids.
The CR method shows better results than PCR and PLS regression. This can be seen from the average RMSEP and $R^2_{\mathrm{predict}}$ values for each method and each grid.
RMSEP = Root Mean Square Error Prediction
n = number of samples
$Y_i$ = actual values of the out-of-sample data
$\hat{Y}_i$ = predicted values of the out-of-sample data
ε : error

III. METHOD
This research uses secondary data obtained from the output of the CSIRO-Mk3 GCM, with a latitude-longitude grid resolution of 1.865° x 1.875°. The data can be downloaded at http://www-pcmdi.llnl.gov/ipcc. The GCM domains are 3x3, 8x8, and 12x12 for five stations. The Pontianak station uses data from 1947-1990, the Ambon station from 1900-1940, the Losarang station from 1967-2000, the Indramayu station from 1974-2000, and the Yuntinyuat station from 1974-2000. Monthly rainfall data were obtained from Badan Meteorologi Klimatologi dan Geofisika (BMKG).

TABLE 1.
OPTIMAL NUMBER OF PCs AND CUMULATIVE VARIANCE (CV) OF GCM OUTPUT VARIABLES USING THE PCA METHOD

TABLE 2.
OPTIMAL NUMBER OF PCs AND CUMULATIVE VARIANCE (CV) OF GCM OUTPUT VARIABLES USING THE PCA METHOD IN AMBON

TABLE 3.
OPTIMAL NUMBER OF PCs AND CUMULATIVE VARIANCE (CV) OF GCM OUTPUT VARIABLES USING THE PCA METHOD IN PONTIANAK