The Jackknife Interval Estimation of Parameters in Partial Least Squares Regression Model for Poverty Data Analysis

One of the major problems facing data modelling in social research is multicollinearity. Multicollinearity can have a significant impact on the quality and stability of a fitted regression model. The common classical regression technique based on Least Squares estimation is highly sensitive to multicollinearity. In such settings, Partial Least Squares Regression (PLSR) is a useful and flexible tool for statistical model building; however, PLSR yields only point estimates. This paper constructs interval estimates for the PLSR regression parameters by applying the Jackknife technique to poverty data. A SAS macro program is developed to obtain the Jackknife interval estimator for PLSR.
Keywords: Partial Least Squares Regression, multicollinearity, interval estimator, Jackknife


I. INTRODUCTION
Social researchers frequently work with a large and complex set of variables. In such situations, a problem often faced in statistical model building is that the independent variables are numerous and highly collinear. This phenomenon is called multicollinearity or collinearity. Collinearity means codependence. Collinearity increases the standard errors of the estimated regression coefficients: the higher the collinearity among the variables, the higher the standard errors. High standard errors yield wide interval estimates of the parameters, which increases the risk that a predictor is rejected from the regression model as a non-significant variable [1].
There are a number of ways to detect multicollinearity. One is simply to inspect the correlation between variables using scatter plots. However, this is not always sufficient for complex multicollinearity [2]. Another approach is to compute the Variance Inflation Factor (VIF). The VIF measures how much the variance of each regression coefficient is inflated because of multicollinearity, compared with a situation in which the variables are uncorrelated. The larger the VIF, the more serious the multicollinearity problem.
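As an illustration of the VIF described above, the following is a minimal sketch, assuming the usual definition VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing the j-th predictor on the remaining predictors. The function name `vif` and the simulated data are hypothetical and are not taken from the paper's data set.

```python
import numpy as np

def vif(X):
    """Return the variance inflation factor of each column of X (n x p)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress predictor j on an intercept plus all other predictors.
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

# Hypothetical data: x2 is nearly a copy of x1, x3 is unrelated to both.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.05 * rng.normal(size=200)
x3 = rng.normal(size=200)
vifs = vif(np.column_stack([x1, x2, x3]))
tolerance = 1.0 / vifs  # tolerance is the reciprocal of the VIF
```

In this sketch the two nearly identical predictors receive large VIF values, while the unrelated predictor stays close to 1.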
The inverse of the VIF is the tolerance. A small tolerance, say less than 0.1, indicates the presence of multicollinearity. Another way to diagnose multicollinearity is through the $R^2$ value. Multicollinearity might exist when $R^2$ is high but only a few coefficients, or none, are significant. In a serious case of multicollinearity, an indication is a change of sign (positive/negative) of the regression coefficients when a new variable is added to the regression model.
1 Puji Ismartini is a student of the Statistics Department Doctorate Program, FMIPA, Institut Teknologi Sepuluh Nopember, Surabaya, 60111, Indonesia. E-mail: ismartini@mhs.statistika.its.ac.id.
2 Sony S. and Setiawan are with the Department of Statistics, FMIPA, Institut Teknologi Sepuluh Nopember, Surabaya, 60111, Indonesia. E-mail: sonny_s@statistika.its.ac.id, setiawan@statistika.its.ac.id.
Under multicollinearity, the common classical regression technique based on Least Squares yields unstable results [2]. Therefore, a suitable calibration technique is needed to overcome the multicollinearity problem in a regression model.
Several methods have been developed to cope with multicollinearity, such as Principal Component Regression (PCR), Ridge Regression (RR) and Partial Least Squares Regression (PLSR). PCR and RR are commonly used. However, the computation for PCR and RR becomes more complex as the number of variables grows, whereas the computation for PLSR is less complex than for those two methods. PLSR overcomes multicollinearity with a smaller number of components than PCR [4]. PLSR also chooses its components in a distinctive way, using the singular value decomposition of the dependent and independent variables [2], while in PCR each component is obtained from the spectral decomposition of the independent variables alone. The components in PLSR are therefore more directly related to the variability of the dependent variable than those in PCR. It has also been shown that PLSR and RR perform better than PCR [5]. Another characteristic of PLSR is statistical efficiency [6]. For a moderate number of dependent variables, PLSR is more efficient than the other methods [5]. For these reasons, PLSR can avoid the dilemma in PCR and RR.
PLSR yields only point estimates of its parameters, and the accuracy of these estimates is difficult to measure analytically. Alternatively, empirical techniques such as the Jackknife and the Bootstrap can be used to measure that precision easily [2]. Both are techniques for estimating the standard error of an estimator through resampling. Compared with the Bootstrap, the Jackknife is useful for small samples and requires minimal assumptions [10]. The Jackknife is also less computationally demanding [13] than other resampling techniques. The main purpose of this article is to construct Jackknife interval estimates of the regression coefficients in the PLSR model for poverty data analysis, by developing a SAS macro program, in order to measure the accuracy of the PLSR regression coefficient estimators.

II. MODEL SPECIFICATION OF PARTIAL LEAST SQUARES REGRESSION
Partial Least Squares (PLS) is a method developed by Herman Wold in the 1960s for constructing statistical models when the explanatory variables are numerous and highly collinear [3]. The method can also be used when the number of explanatory variables exceeds the number of observations [7]. Basically, PLSR is a method that combines dimension reduction with the construction of a regression model. The two processes are performed simultaneously in PLSR.
Pudji Ismartini 1, Sony Sunaryo 2, and Setiawan 2
The general idea of PLSR is quite similar to that of Principal Component Regression (PCR). PLSR is an indirect modeling approach, since it constructs a regression model by transforming a set of highly collinear independent variables into a set of new, uncorrelated variables [1]. These new variables are called latent variables or components. Each component is a linear combination of the explanatory variables, orthogonal to the other components. For this reason, PLS has also been taken to mean "projection to latent structure", and it rests on the concept of latent component decomposition. Unlike similar approaches such as PCR, the latent components in PLSR are computed by taking into account both the independent and the dependent variables of the regression [7].
To regress the response on the explanatory variables, PLSR uses the Ordinary Least Squares (OLS) method. Because this estimation method does not require strict distributional assumptions, PLSR is also referred to as a soft modeling method [8,9].
Consider the general setting of predicting q continuous response variables $Y_1, \dots, Y_q$ using p continuous predictor variables $X_1, \dots, X_p$, where the available data sample consists of n observations. Let $X$ be the $n \times p$ matrix whose rows are the vectors $x_i = (x_{i1}, x_{i2}, \dots, x_{ip})^T$. Similarly, $Y$ is the $n \times q$ matrix whose rows are $y_i = (y_{i1}, y_{i2}, \dots, y_{iq})^T$. The latent component decomposition of PLSR is given by
$$Y = TQ^T + F \quad (1)$$
$$X = TP^T + E \quad (2)$$
where $T \in \mathbb{R}^{n \times c}$ is a matrix of c latent components, $Q \in \mathbb{R}^{q \times c}$ and $P \in \mathbb{R}^{p \times c}$ are matrices of coefficients (loading matrices of the response variables and the predictor variables, respectively), and $F \in \mathbb{R}^{n \times q}$ and $E \in \mathbb{R}^{n \times p}$ are matrices of random errors. In general, a PLSR analysis consists of the following stages:
Step 1. Center and scale both the response and the predictor variables.
Step 2. Construct a matrix of weights $W$, where $W \in \mathbb{R}^{p \times c}$.
Step 3. Construct a matrix of latent components $T$ as a linear transformation of $X$, i.e.
$$T = XW \quad (3)$$
where the columns of $W$ and $T$ are $w_i = (w_{1i}, w_{2i}, \dots, w_{pi})^T$ and $t_i = (t_{1i}, t_{2i}, \dots, t_{ni})^T$. Thus the linear transformations of $X_1, \dots, X_p$ are
$$T_i = w_{1i} X_1 + w_{2i} X_2 + \dots + w_{pi} X_p, \quad i = 1, \dots, c.$$
Steps 4 to 6 compute the loading matrix $Q$, the coefficient matrix $B$ and the predicted response, ending with
$$\hat{Y} = T(T^T T)^{-1} T^T Y \quad (5)$$
Thus, the predicted response can be calculated using only the information in the latent components and the response variable.
This shows that dimension reduction and regression modeling are performed simultaneously in PLSR, since it produces the matrix of regression coefficients B as well as the matrices W, T, P and Q [6].
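The relations T = XW, B = W(T^T T)^{-1} T^T Y and Ŷ = XB can be traced numerically. The following is a minimal sketch under stated assumptions: it uses a single component whose weight vector is taken proportional to X^T Y (a common choice for the first PLS component, assumed here because the text does not say how W is built), and the function name `plsr_from_weights` and the simulated data are hypothetical. It is not the SAS macro developed in the paper.

```python
import numpy as np

def plsr_from_weights(X, Y, W):
    """Given a weight matrix W (p x c), return T, Q^T, B and the fitted Y."""
    T = X @ W                                # latent components, T = XW
    Qt = np.linalg.solve(T.T @ T, T.T @ Y)   # Q^T = (T^T T)^{-1} T^T Y
    B = W @ Qt                               # B = W Q^T
    Y_hat = X @ B                            # predicted response
    return T, Qt, B, Y_hat

# Hypothetical collinear data: every column of X shares one common signal.
rng = np.random.default_rng(1)
n, p = 30, 5
base = rng.normal(size=(n, 1))
X = base + 0.1 * rng.normal(size=(n, p))
Y = X.sum(axis=1, keepdims=True) + 0.1 * rng.normal(size=(n, 1))

# Step 1: center and scale predictors and response.
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
Y = (Y - Y.mean(axis=0)) / Y.std(axis=0, ddof=1)

# Step 2 (assumption): one weight vector proportional to X^T Y, unit length.
w = X.T @ Y
W = w / np.linalg.norm(w)

T, Qt, B, Y_hat = plsr_from_weights(X, Y, W)
```

Because T = XW, the fitted values XB coincide with the projection of Y onto the latent component, so the prediction depends only on T and Y.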

III. THE JACKKNIFE PROCESS
The Jackknife is a statistical technique introduced by Maurice Henry Quenouille in 1949 for estimating the bias of an estimator and correcting for it [10]. It thus yields a bias-corrected estimator. In 1958, John Wilder Tukey proposed using it to estimate the variance of an estimator, and hence its standard error. It is a nonparametric method for estimating statistical errors such as the bias and standard error of an estimator [11,12]. Since it yields the standard error of an estimator, it can also be used to compute confidence intervals [12]. This nonparametric technique is attractive because parametric analysis requires assumptions that are often difficult to justify [11]. A further advantage of the Jackknife is its low computational cost [13].
The Jackknife is a versatile resampling technique. Its basic idea is similar to the cross-validation procedure. In general, one or several observations are deleted at a time and the regression coefficients are computed for each resulting subset of the data. This process is repeated to obtain a set of regression coefficient vectors [10,11], which gives information about the variability, and hence the standard errors, of the regression coefficients. The Jackknife is useful for small samples and requires minimal assumptions [10]. According to [10,11,14], the scheme of the Jackknife process can be summarized as in Fig. 1.
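The resampling loop described above can be sketched as follows. This is an illustrative sketch only: `jackknife_coefficients` and `ols_fit` are hypothetical names, and an ordinary least squares fit stands in for the PLSR fit, since any function that maps a data subset to a coefficient vector could be passed in its place.

```python
import numpy as np

def jackknife_coefficients(X, y, fit):
    """Refit the model n times, each time leaving out one observation.

    Returns an (n x k) array whose i-th row holds the coefficients
    estimated without observation i.
    """
    n = X.shape[0]
    rows = []
    for i in range(n):
        keep = np.arange(n) != i          # boolean mask dropping row i
        rows.append(fit(X[keep], y[keep]))
    return np.vstack(rows)

def ols_fit(X, y):
    """Stand-in estimator (assumption): intercept and slopes by least squares."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

# Hypothetical noise-free data: y = 1 + 2x, so every refit recovers (1, 2).
x = np.arange(10.0).reshape(-1, 1)
y = 1.0 + 2.0 * x[:, 0]
coefs = jackknife_coefficients(x, y, ols_fit)
```

With noisy data the rows of `coefs` differ from one another, and that spread is what the Jackknife variance formulas summarize.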
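Let an independently and identically distributed sample of size n be used to estimate a parameter $\theta$, yielding an estimator $\hat{\theta}_n$. Then remove a group of m observations from the sample to obtain a sample of size n-m, and let $\hat{\theta}_{n-m}$ be the estimator of the same parameter based on that sample.
The estimated bias of $\hat{\theta}_n$ is reflected in the difference between $\hat{\theta}_{n-m}$ and $\hat{\theta}_n$. In the delete-one case (m = 1), the standard Jackknife bias estimate is
$$\widehat{\mathrm{bias}}_J(\hat{\theta}_n) = (n-1)\left(\bar{\theta}_{(\cdot)} - \hat{\theta}_n\right), \qquad \bar{\theta}_{(\cdot)} = \frac{1}{n}\sum_{i=1}^{n}\hat{\theta}_{(i)},$$
where $\hat{\theta}_{(i)}$ is the estimate computed without the i-th observation.
If the number of deleted observations (m) is small relative to n, the bias of the Jackknife estimator $\hat{\theta}_J$ is generally much smaller than the bias of $\hat{\theta}_n$. The bias of $\hat{\theta}_J$ is commonly of order $n^{-2}$, while the bias of $\hat{\theta}_n$ is generally of order $n^{-1}$ [10].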

A. Jackknife by Deleting One Observation
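Let $\hat{\theta}$ be an estimator of the parameter $\theta$ obtained from a sample of size n, and let $\hat{\theta}_{(i)}$ be the estimator of the same parameter obtained after removing the i-th observation from the sample. The delete-one Jackknife estimator is given by
$$\hat{\theta}_J = n\hat{\theta} - (n-1)\bar{\theta}_{(\cdot)}, \qquad \bar{\theta}_{(\cdot)} = \frac{1}{n}\sum_{i=1}^{n}\hat{\theta}_{(i)}.$$
The Jackknife variance estimator by deleting one observation, based on the pseudo-values
$$\tilde{\theta}_{(i)} = n\hat{\theta} - (n-1)\hat{\theta}_{(i)}, \quad i = 1, 2, \dots, n,$$
is given by
$$\widehat{\mathrm{var}}_J(\hat{\theta}_J) = \frac{1}{n(n-1)}\sum_{i=1}^{n}\left(\tilde{\theta}_{(i)} - \hat{\theta}_J\right)^2 = \frac{n-1}{n}\sum_{i=1}^{n}\left(\hat{\theta}_{(i)} - \bar{\theta}_{(\cdot)}\right)^2 \quad (9)$$
Here $\hat{\theta}_J$ is the mean of the pseudo-values, and $\widehat{\mathrm{var}}_J(\hat{\theta}_J)$ is a consistent estimator of the asymptotic variance of $\hat{\theta}$ and $\hat{\theta}_J$.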
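The delete-one formulas can be checked on a small example. The sketch below assumes the pseudo-value form of the estimator and its variance; the function name `jackknife_delete_one` and the data values are hypothetical. The sample mean is used as the estimator because its Jackknife variance is known to equal the classical s²/n, which gives a simple sanity check.

```python
from statistics import mean, variance

def jackknife_delete_one(sample, estimator):
    """Return the delete-one Jackknife estimate and its variance estimate."""
    n = len(sample)
    theta_hat = estimator(sample)
    # Leave-one-out estimates: drop observation i, re-estimate.
    loo = [estimator(sample[:i] + sample[i + 1:]) for i in range(n)]
    # Pseudo-values: n * theta_hat - (n - 1) * theta_(i).
    pseudo = [n * theta_hat - (n - 1) * t for t in loo]
    theta_jack = mean(pseudo)
    var_jack = sum((p - theta_jack) ** 2 for p in pseudo) / (n * (n - 1))
    return theta_jack, var_jack

# Hypothetical data, stored as a list so that slices can be concatenated.
data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]
theta_jack, var_jack = jackknife_delete_one(data, mean)

# For the sample mean, the Jackknife variance equals s^2 / n.
classical = variance(data) / len(data)
```

For nonlinear estimators, such as regression coefficients, the pseudo-values no longer reduce to the raw observations, and the two variance estimates generally differ.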

B. Jackknife by Deleting m Observations
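Suppose the sample is divided into g mutually exclusive and independent groups of equal size m, where m > 1 and m = n/g. The estimator of the parameter obtained by deleting the m observations of the j-th group is $\hat{\theta}_{(j)}$; it is based on a sample of size n-m. The Jackknife estimator by removing m observations is given by
$$\hat{\theta}_J = g\hat{\theta} - (g-1)\bar{\theta}_{(\cdot)}$$
where $\bar{\theta}_{(\cdot)} = \frac{1}{g}\sum_{j=1}^{g}\hat{\theta}_{(j)}$, and its Jackknife variance estimator is given by
$$\widehat{\mathrm{var}}_J(\hat{\theta}_J) = \frac{g-1}{g}\sum_{j=1}^{g}\left(\hat{\theta}_{(j)} - \bar{\theta}_{(\cdot)}\right)^2.$$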
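A grouped version can be sketched in the same way. The sketch below assumes the standard delete-a-group forms, with the estimator g·θ̂ − (g−1)·θ̄ and the variance (g−1)/g times the sum of squared deviations of the group-deleted estimates; the function name `jackknife_delete_group` and the data are hypothetical, and groups are formed from consecutive observations for simplicity.

```python
from statistics import mean

def jackknife_delete_group(sample, estimator, g):
    """Delete-m Jackknife with g consecutive groups of equal size m = n / g."""
    n = len(sample)
    if n % g != 0:
        raise ValueError("sample size must be divisible by the number of groups")
    m = n // g
    theta_hat = estimator(sample)
    group_est = []
    for j in range(g):
        # Drop the m observations of group j and re-estimate on n - m points.
        kept = sample[:j * m] + sample[(j + 1) * m:]
        group_est.append(estimator(kept))
    theta_bar = mean(group_est)
    theta_jack = g * theta_hat - (g - 1) * theta_bar
    var_jack = (g - 1) / g * sum((t - theta_bar) ** 2 for t in group_est)
    return theta_jack, var_jack

# Hypothetical data: n = 6 observations split into g = 3 groups of m = 2.
data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]
theta_jack, var_jack = jackknife_delete_group(data, mean, g=3)

# With g = n (m = 1) the procedure reduces to the delete-one Jackknife.
theta_one, var_one = jackknife_delete_group(data, mean, g=len(data))
```

Choosing fewer, larger groups reduces the number of refits, which matters when each refit is expensive.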

C. The Jackknife Confidence Interval
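The approximate $(1-\alpha) \times 100\%$ confidence interval for the parameter $\theta$ is given by
$$\hat{\theta}_J \pm t_{\alpha/2,\,\nu}\sqrt{\widehat{\mathrm{var}}_J(\hat{\theta}_J)}$$
where $t_{\alpha/2,\,\nu}$ is the upper $\alpha/2$ quantile of Student's t distribution with $\nu$ degrees of freedom (conventionally $\nu = n-1$ in the delete-one case). For large sample sizes, Student's t distribution converges to the standard normal distribution.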
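The interval can be sketched using the large-sample remark above. Note the substitution: this sketch uses a standard normal quantile in place of the Student's t quantile, so it applies only when the sample is large; for small samples a t quantile would replace `z`. The function name `jackknife_ci_normal` and the input values are hypothetical and are not results from the paper.

```python
from math import sqrt
from statistics import NormalDist

def jackknife_ci_normal(theta_jack, var_jack, alpha=0.05):
    """Approximate (1 - alpha) interval using the normal approximation."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    half_width = z * sqrt(var_jack)
    return theta_jack - half_width, theta_jack + half_width

# Hypothetical Jackknife estimate and variance, for illustration only.
lower, upper = jackknife_ci_normal(0.2, 0.0004)
```

A coefficient would be judged significantly different from zero at level alpha when the resulting interval excludes zero.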

IV. APPLICATION TO POVERTY DATA
Poverty data analysis usually involves social variables that are numerous or highly correlated. Many factors may affect the poverty level in a particular area. Some of those factors are demographic variables.
In this case, PLSR is applied to analyze whether the number of poor people (Y) in Nanggroe Aceh Darussalam (NAD) is influenced by the number of children aged 0-4 years (X1), the number of workers (X2), the number of elderly people (X3), the number of school-age people who are no longer attending school (X4), and the number of people who work in the agriculture sector (X5). The data set is based on the 2008 Socio-economic Survey conducted by Statistics Indonesia (BPS).
Table 1 shows high VIF values for the first three predictors, all above 10. Some tolerance values are also less than 0.1. These indicate multicollinearity in the data. PLSR is therefore used to analyze the data while handling the multicollinearity. The results are shown below.
Table 2 shows the individual and cumulative variation accounted for by the five PLS factors, for both the predictors and the response. Five components can be constructed from the five predictors. Table 2 shows that the first component accounts for about 90% of the variation in both the predictors and the response. This strongly indicates that one component is appropriate for modeling the data. It is confirmed by cross-validation through the Predicted Residual Sum of Squares (PRESS), since the model with only one component yields the minimum PRESS (0.3172). Thus, one component is used in this analysis. The point estimates of PLSR based on one component are given in Table 3. The accuracy of those estimates is measured by the interval estimates of the regression coefficients, which are constructed with the Jackknife technique (Table 4).
The Jackknife confidence intervals give the interval estimates of the PLSR coefficients. They also confirm that all of the factors (number of children aged 0-4 years, number of workers, number of elderly people, number of school-age people who are no longer attending school, and number of people who work in the agriculture sector) positively and significantly influence the number of poor people in NAD. An increase in the number of children aged 0-4 years leads to an increase in the number of poor people of 0.2054. In the same vein, the addition of one elderly person increases the number of poor people by around 0.2129. The contributions of the three other factors to the increase in the number of poor people are similar, at around 0.2.

V. CONCLUSION
PLSR is a powerful method for modeling data with a multicollinearity problem. PLSR yields point estimates, while interval estimates can be constructed with the Jackknife technique. The Jackknife confidence interval can also be used to measure the accuracy of the PLSR estimates. The application of PLSR and the Jackknife process to poverty data in NAD shows that all of the regression coefficients produced by PLSR are positive and significant for the number of poor people in that area. Thus, the numbers of children aged 0-4 years, workers, elderly people, school-age people no longer attending school, and people working in the agriculture sector contribute positively to the increase in the number of poor people in NAD.

Step 4. Compute the matrix of component loadings $Q$. This matrix is obtained from Equation (1) by the least squares method:
$$Y = TQ^T$$
$$T^T Y = T^T T Q^T$$
$$Q^T = (T^T T)^{-1} T^T Y$$
Step 5. Compute the matrix of regression coefficients $B$ for the model $Y = XB + F$. Since $X = TW^T$ and $Y = XB$, it follows that $Y = TW^T B$. From Equation (1), $Y = TQ^T$; consequently, $Q^T = W^T B$. The solution for $B$ can then be obtained from
$$W^T B = Q^T = (T^T T)^{-1} T^T Y$$
$$B = WQ^T = W(T^T T)^{-1} T^T Y \quad (4)$$
Step 6. Calculate the response predictions $\hat{Y}$:
$$\hat{Y} = XB$$
Since $X = TW^T$ and $B = W(T^T T)^{-1} T^T Y$, and taking $W^T W = I$,
$$\hat{Y} = TW^T W (T^T T)^{-1} T^T Y = T(T^T T)^{-1} T^T Y \quad (5)$$

TABLE 1. TOLERANCE AND VIF VALUES