Calibrating Weather Forecast using Bayesian Model Averaging and Geostatistical Output Perturbation

Numerical Weather Prediction (NWP) has not yet been able to produce the weather forecast accurately. In order to overcome that, one approach could be taken is ensemble postprocessing. Ensemble is a combination of several methods to improve its accuracy and precision yet still possesses underdispersive nature. Bayesian Model Averaging (BMA) is intended to calibrate the ensemble prediction and create more reliable interval, though, does not consider spatial correlation. Unlike BMA, Geostatistical Output Perturbation (GOP) reckons spatial correlation among many locations altogether. Analysis applied to calibrate the temperature forecast at eight meteorological sites within Jakarta, Bogor, Tangerang and Bekasi (Jabotabek) are BMA and GOP. The ensemble members of BMA are the prediction of PLS, PCR, and Ridge. For training period over 30 days and based on some assessment indicators, BMA is better than GOP in terms of accuracy, precision, and calibration. Keywords BMA, ensemble, GOP, NWP, Underdispersive.


I. INTRODUCTION 1
n past few years, BMKG Indonesia began to apply numerical weather forecast, that is Numerical Weather Prediction (NWP), to aid the forecasters.However, its forecast bias was quite great since it has not been able to capture the dynamic atmosphere [1].Hence, statistical postprocessing needs to be applied to NWP output by using ensemble, such as combination of NWPs from several meteorological authorities.Though in many cases, ensemble forecast still possesses underdispersive nature, that is the forecast tends to concentrate at a point with low variance causing the observation outside the predictive interval, then as a consequence they need to be calibrated [2].In order to handle such case, BMA and GOP could be applied to calibrate the ensemble forecast, among others.
As in [3], BMA combines the whole ensemble member forecast based on weighted mean, posterior probabilities, that depends to some statistical models instead of the single one.The BMA weights reflecting each member's relative skill then form the predictive PDFs.Despite its advantage, BMA merely considers weather forecast at single location and ignores the spatial correlation which frequently occurs [4].Besides, two parameters, the weights and variances, could not be estimated by Maximum Likelihood and need the iterative approach like Expectation-Maximization (EM) algorithm.
GOP is a method of weather forecast being able to generate ensemble prediction of any size based on spatial association identified from the error correlation [5].
This method perturbs the outputs of NWP models spatially, such as error model, rather than their inputs.In other words, the simulated error has to be added to the regression linear forecast such that one would get spatially calibrated forecast.Like BMA, GOP also needs iterative approach, Limited-Memory BFGS (L-BFGS), to estimate its spatial parameters due to faster convergence when parameters being interest are large in size [6].
This research is intended to calibrate the ensemble temperature forecast at eight meteorological sites within Jabotabek, Indonesia using BMA and GOP to obtain better method being able to utilize NWP for short-range forecast, along with the brief derivation on how to obtain the parameter estimation of both methods.As this research is not equipped with enough NWP from various sources, the member of BMA consists of several statistical models, that is Partial Least Square Regression, Principal Component Regression, and Ridge Regression.
This paper is organized as follows.The materials section reviews the BMA, GOP and the indicator used to assess both method.The method section provides the data information and the way to do parameter estimation and calibrated forecasting.The results section presents the more detailed way of each method parameter estimation and forecasting.Finally, the last section gives the conclusion.

A. Bayesian Model Averaging (BMA)
BMA is a method to calibrate the underdispersive, overdispersive, and biased ensemble forecast where those might cause the forecast being less reliable, particularly the underdispersive one.The predictive PDF of BMA is linear combination of several competing model in which each of them has different weight to the PDF, relative to other I member [3].They are called the posterior probabilities due to their changing value over sliding training period.
Suppose y is the observation of weather quantities with  1 ,  2 , … ,   are different M-forecast models where m = 1, 2, ..., M. Each member m forecast can be corrected using one of many possible methods that yields bias-corrected forecast fm.The forecast fm corresponds to a conditional PDF, g  (|  ), which could be defined as the conditional PDF of y on fm if fm is the best member [7].

(
) ( ) where wm is the m ensemble member weight or posterior probability recognized as the "best" one which is nonnegative and sums up to 1, that is The weight wm depends on forecast fm's performance in training period.
In the case of normally distributed forecast fm, then the interest y's posterior distribution given fm is the best member is shown in Eq. ( 2).
( ) Based on [3], the BMA predictive mean, deterministic forecast, the weighted average of fitted member forecast, is given by Eq. ( 3).
( ) ( ) Following the BMA mean, the BMA variance is shown in Eq. (4) ( ) With   =  ̂0, +  ̂1,   Based on Eq. ( 4), it could be said that the BMA variance is greater than both term on the right-hand side.
Unlike the bias-corrected coefficient β0 and β1, the weight wm and variance σ 2 could not be estimated by MLE and replaced by iterative approach, that is Expectation-Maximization (EM) algorithm.It alternates between two steps, the Expectation or E step and the Maximization or M step [3].This algorithm utilizes unobserved, latent variables zmt , where zmt = 1 if member m is the best forecast for time t, otherwise zmt = 0.
For the normally distributed BMA model, the E step is written on Eq. ( 5).

𝑧𝑧̂𝑚𝑚 𝑚𝑚
The superscript i in Eq. ( 5) refers to the i-th iteration with the density g  �  �  ,  (−1) � is normal with mean  ̂0, +  ̂1,   and standard deviation σ (i-1) evaluated on observation yt [3].In the next M step, iterative estimation of wm and σ 2 is calculated based on current estimate of latent z in Eq. ( 5).Hence, exist Eq. ( 6) where T is the number of observation in the training period.
Both E and M step are iterated to convergence, which is no change greater than tolerance limit in terms of parameter value and zmt in one iteration [8].

B. Geostatistical Output Perturbation (GOP)
GOP is a spatial method modifying and perturbing NWP deterministic outputs by taking the errors correlation among locations of interest into account [5].Such errors were obtained through geostatistical simulation to get calibrated weather field forecast which is reliable and sharp as well.Considering multivariate among sites, let   = [ 1 ,  2 , K,   ]′ and   = [ 1 ,  2 , K,   ]′ denote the  × 1 vector of observed weather quantity and vector of interest NWP output, respectively, where t = 1,2,K,T.Then, GOP model is represented in Eq. ( 7) where 1 is an s x 1 unity vector and εt is residual vector.As in [4], error εt in Eq. ( 7) follows the normal distribution which has covariance Σ relying on covariance structure of error spatially.From here on, error is referred to the difference between observation and bias-corrected forecast.
Given C (si,sj) is stationer and isotropy exponential correlation function, the (ij)th element of Σ is obtained based on Eq. ( 8). ( 1 var 1 , , 2 where ρ 2 is nugget effect, that is the variance of measurement error as well as small-scale variability [4]. Then, the marginal variance of error is known as sill, obtained from ρ 2 + σ 2 .As in Eq. ( 8), the error correlation might be identified through common exponential semivariogram, denoted in Eq. ( 9) where d is obtained from �  −   � denotes the Euclidean distance between set of pair of location si and sj.Range r (in km) indicates the distance from which the spatial error correlation began to diminishing exponentially [4].
In order to estimate ρ 2 , σ 2 , and r, the applied approach is the iterative one using Limited-Memory BFGS (L-BFGS).It minimize the objective function as shown in Eq. ( 10), derived from weighted least square ( ) where nl is number of location pair in bin Bl and N is number of obtained bin [4].

C. Verification Method
This subsection presents some methods used to assess the predictive quality obtained from calibrated ensemble forecast.Calibration is the consistency between the ensemble forecasts and observations.For the research attempting to calibrate the weather forecast, RMSE or MAE is not sufficient to conclude the best model in terms of accuracy.It also needs other tools, such as CRPS and coverage, to verify the bias correction level and sharpness.RMSE and CRPS is expected as little as possible, while coverage is expected much closer to 50% and 90 %, respectively for BMA and GOP in this research.

1) Root Mean Square Error (RMSE)
As an accuracy indicator, RMSE in Eq. ( 11) is calculated from squared root of MSE, the average of sum of square of difference between forecast and verifying observation, ( ) where n is the number of observation [9].

2) Coverage
The sharpness of ensemble forecast could be verified from coverage by comparing the standard coverage and empirical coverage.If an observation lies within the ensemble range, then it could be said that the observation is inside the coverage [9].The standard coverage is given by Eq. ( 12).
1 100% 1 The notation M denotes the number of ensemble member.The ensemble forecasts is calibrated if the empirical coverage is much closer to the standard coverage.

3) Continuous Rank Probability Score (CRPS)
CRPS is used to verify how reliable or precise the predictive interval obtained from BMA or other probabilistic forecast.The less the CRPS, the more reliable the prediction interval [7].CRPS is written on Eq. (13) ( ) ( ) ( ) where n is the number of observation, i is the time basis,    ()and    () are the predictive CDF and empirical CDF at i-th time, respectively.The CDF    () = 1 for observation ≥ forecast, otherwise 0.

III. METHOD
The data in this research are obtained from Meteorology, Climatology, and Geophysics Agency (BMKG) Indonesia, that is NWP Conformal Cubic Atmospheric Model (CCAM) output within period January 1 st , 2009 until December 31 st , 2010 or about 708 observation days.The meteorological stations of interest within Jabotabek, Indonesia are of Kemayoran, Priok, Cengkareng, Pondok Betung, Curug, Dermaga, Tangerang, and Citeko, shown on figure 1 with red dot.As could be seen in figure 2, the middle red dot represents the grid nearest to the station and the rests (black dots) surround the station itself.Preprocessing using PC might be the appropriate way to summary the correlation among 9 grids solely in each NW P parameter.The short description about NWP parameters used to generate calibrated weather forecast is given on table 1.Besides seven parameters measured at three kind of levels, there are eleven parameters measured at surface level, 2 meters above sea surface.Thus, one obtain 32 parameters in total.Since this research seems to involve many meteorological variables, which are highly correlated, then it should be noted that the ensemble which consists of three statistical models is in favor to overcome multicolinearity despite the BMKG's lack of NWP source.
Thus, the predictor (X) of BMA is temperature forecast produced by PLSR, PCR, and Ridge.Meanwhile, the single predictor (X) of GOP is tmaxcr and tmincr for maximum and minimum temperature forecast, respectively, whose grid is very close to the station.This research considers 24hour ahead forecast of NWP parameters, hence the BMA and GOP have to provide the temperature forecast for the same time ahead over 30-day training period.Based on [3], such training length gives more stability and ability to adapt with dynamical properties of weather.

A. Derivation of BMA parameters estimation
The BMA parameters of each member m are able to be classified to two parts, bias-corrected coefficient βm, that is β0,m and β1,m, alongside the weight and variance (wm and σ 2 ).These parameters are used to yield the calibrated weather forecast and predictive interval.Since the weather quantity of interest is temperature which is normally distributed, then it is easily shown based on [10], either by Maximum Likelihood (ML) or Ordinary Least Square (OLS), that the estimation of β0,m and β1,m are expressed on Eq. ( 14) Given yt and fmt are temperature and ensemble member forecast verified at time t where t = 1, 2, ..., T. Since BMA utilizes the concept of sliding training period, the entire parameters, including β0,m and β1,m, changes relative to the trend value of observation and ensemble forecast [7].
Unlike β0,m and β1,m which easily could be estimated, the weight wm and variance σ 2 is not able to be estimated by ML.In order to overcome that, BMA has to extend the derivation using iterative method, for instance using EM algorithm which considers the complete-data likelihood (; , ) based on incomplete-data likelihood (; ) [8].The steps needed to yield the estimation of wm and σ 2 are briefly given below.

1) Step 1: Obtaining the incomplete-data likelihood
As usual, the incomplete-data likelihood given on Eq. ( 15) is carried out to estimate the weight of ensemble member m wm.
( ) ( ) ( ) Eq. ( 15) might be transformed to log-likelihood to make the estimation easier and the result is expressed on Eq. ( 16) It might be said that the solution for wm based on Eq. ( 16) is not exist since it is definitely complicated to be derived.Therefore, the EM algorithm considering latent variable Z where  =  1 T ,  2 T , K,  T T has to be applied to estimate wm as well as σ 2 [8].The subscript T denotes the T-th sliding training period where t = 1, 2, ..., T. It is also given that  t = (Z 1 , Z 2 , K, Z  ).For any t-th time, only one element of Zt is 1, otherwise 0. Hence, for k = 1, 2, ..., M, there exists Eq. ( 17) where k denotes the best ensemble member.

2) Step 2: Obtaining the complete-data likelihood
As the subsequent step after performing incomplete-data likelihood, the complete-data one is obtained by taking latent Z into account [8].By applying indicator function  (  =) to represent the latent, so that one obtain the likelihood on Eq. (18).

3) Step 3: Performing Expectation (E) step
This E step is iteratively conducted to obtain the expectation from likelihood or log-likelihood function of complete-data.Given i as the i-th iteration, then exists Eq. ( 19).
Based on further derivation to the expectation on Eq. (19) as in [8], it would be easily shown ( ) The latent guess   () given by Eq. ( 20) is the posterior probability of the observation at t-th time of ensemble member m [8].Thus, one would obtain Eq. ( 21) to proceed to the next M step.

∑∑
For instance, there are 2 ensemble members so that M = 2, then w2 = 1-w1.It would yield Hence, it would be easily shown that Generally, as in [3], for an ensemble with M in size where m = 1, 2, ..., M, there exists Eq. ( 22) Similar with wm, the M step of which variance σ 2 should carry out yields the below derivative Therefore, the current estimation of σ 2 is given on Eq. ( 23).

B. Derivation of GOP spatial parameters estimation
Before estimating the spatial parameters ρ 2 , σ 2 and r, the first step one should to do is calculating empirical semivariogram  �( 1 ) based on Eq. ( 24) where k is the number of pair of distance between two locations and  1 =  1 (1),  1 (2),  is the distance which represents the whole pair of two locations being involved [4].The next step is carrying out the estimation of those three parameters based on objective function given on Eq. ( 25).
( ) Due to complicated derivative, then the step of those parameters estimation is only represented by nugget ρ 2 .By applying the first-order derivation of Eq. ( 25) with respect to ρ 2 , then equals to 0, one would obtain Eq. ( 26).
( ) ) (1 e ) Eq. ( 26) shows unclosed-form of ρ 2 estimation as it still contains partial sill σ 2 and range r which are going to be estimated as well.Thus, one need L-BFGS approach to obtain ρ2, σ2 and r estimation simultaneously.If  = [ 2 ,  2 ,  2 ], then exists the gradient vector , , , , , , g r g r g r g r with the following steps to carry out L-BFGS, where k = 0, 1, 2, ... denotes the iteration [6].
1. Determining the initial value of z and H, that is z0 and H0.On k-th iteration, z0 should be non-negative with H0 is symmetric positive definite matrix, such as identity matrix.This first step determines a positive integer m as well to seek how long H0 information used to renew iteration.Then, determining β and γ where 0 0.5 In general, m should be less than 10 in order that the iteration is running shortly as well as effectively.2. Calculating ∆  = −  ∇ g (  ) and  +1 =   +   ∆  where ∇ g (  ) is the gradient at zk with constant   satisfying Wolfe condition.3.If �∇ g ( +1 ) − ∇ g (  )� < , with ε arguably small value, then the current iteration should be terminated.Otherwise, the iteration should proceed to next step.4. Updating matrix Hk, shown on Eq. ( 27), based on the information from Hessian H0 m times so that one obtain Hk+1 with ( ) where 5. Returning to the second step to obtain the new ∆  and updating  +1 to check the convergence where k = k + 1.From here on, the discussion of Dermaga station in Bogor, West Java is explained more specific rather than other stations since Bogor is not only the location at which the weather changes more uncertain, but also the steps taken to analyze the PCA pre-processing until produce the calibrated forecast using BMA and GOP are similar.

C. PCA Pre-Processing and Statistical Model Forecast
As have been said, the NWP parameters should be dimensionally reduced by PCA.PCA pre-processing might reduce the model complexity as well, so that it enables the further analysis and interpretation easier to handle.For Dermaga station, table 2 provides the number of PC representing each NWP parameter (variable) with its cumulative variance.Table 2 shows the PCA pre-processing yields 41 PCs in total from 32 NWP parameters being involved.The explained cumulative variances range from 81.76% to almost 100%.Therefore, the association of weather quantities among 9 grids of each NWP parameter itself is extremely high.Those PCs thus would be involved as the predictor in three aforementioned statistical models, that is PLSR, PCR, and Ridge, which yield the forecasts shown only for the first 100 days on figure 3. Figure 3 implied that the forecast of each ensemble member initially could have been follow the general trend of maximum temperature and minimum temperature, they were going up if the observed temperature acted the same and were doing so when it went down.But, the lingering problem was under-fitting or overfitting which happened on the same day.Such problem even was more significantly seen on minimum temperature.
Based on figure 3, it could be said the forecasts produced by those 3 probabilistic models are quite far beyond the verifying temperatures even if they could capture the pattern of temperature.Therefore, it might be necessary to calibrate those models as the ensemble member to produce more accurate and precise weather --temperature--forecast as well as being calibrated.The weighting adopted by BMA supposedly minimizes the impact of under-fitting or overfitting, even of seasonal pattern.

C. BMA Calibrated Temperature Forecast
The calibration is performed in order that the variance would adapt to inevitably under-dispersive nature experienced by ensemble member before.After calibrating, one would obtain more reliable forecast with more proportional variance and narrower predictive interval.To assess whether the ensemble forecast is under dispersive or not, the Verification Rank Histogram (VRH) on figure 4 was necessarily carried out.The under dispersive ensemble is recognized from shaped-U histogram, while the over dispersive one is recognized from bell-shaped histogram [9].As shown on figure 4, both ensemble forecast, either for maximum or minimum temperature, still possess under dispersive properties since each VRH resembles U-shape.They implied that many verifying temperatures are beyond ensemble range, the difference between maximum and minimum value of an ensemble forecast.
Based on figure 4 as well, the empirical coverages obtained were 20% and 6.7% for maximum and minimum temperature, respectively.It is identified from the percentage of observations lie within the second rank and the third rank.It could be said that each coverage is far less than standard coverage 50% and convinces us that the ensemble forecasts are uncalibrated due to under dispersive.It would influence the predictive interval to be unreliable, so that it needs to be calibrated by BMA.
The first step of BMA calibration is performing regression for each ensemble member (PLSR, PCR, and Ridge) with respect to the verifying observation.For instance, table 3 presents the regression coefficient and the weight of each member along with BMA mean as deterministic forecast on November 14th, 2009, particularly for Dermaga station based on 30-day training period.
It should be noted from table 3 that PLSR has the biggest contribution to BMA maximum temperature forecast as its weight is 0.724, greater than PCR's and Ridge's which have weight 0.276 and 0, respectively.However, the latter has the biggest contribution to BMA minimum temperature forecast rather than PLSR and Ridge.It might be said then the BMA's accuracy for both temperatures on that day were not significantly distinguished compared to the each member forecast.
The next optional analysis is graphing the BMA predictive density to find out how fit the calibration done by BMA for maximum temperature at the same station on the same day, as shown on figure 5.It could be seen that figure 5 shows the observation (vertical solid line) lies inside 95% BMA predictive interval (dashed line).As indicated by figure 5, BMA yields more reliable interval, particularly for November 14 th , 2009.The maximum temperature observation lies inside the ensemble range as well.It means that BMA manages to increase each member's precision, shown from PDF BMA which shifts closer to the middle.It likely behaves the same for the minimum temperature.One thing should be noted, though, is the variance of BMA was somehow always larger than each member's itself due to its role to broaden the variance.Then, one should compare the forecast produced by BMA and deterministic NWP merely for Dermaga station with 30-day sliding training period, as shown on table 4.
Based on Table 4, it implied that BMA manages to improve the forecast accuracy over 50% as it has less RMSE.In addition to the improved accuracy, BMA is able to yield the narrower predictive interval rather than raw ensemble since it has less CRPS.As the main goal, BMA manages to calibrate the ensemble forecast for temperature in particular.It could be seen from the coverage which advances significantly ever than before, 20.65% to 49.41% and 6.7% to 49.26%, respectively for maximum temperature and minimum temperature.It means that BMA is able to produce the weather, in this case temperature, forecast which is more consistent.

D. GOP Calibrated Temperature Forecast
The first step of GOP is to estimate β0 and β1 biascorrected coefficient in order to obtain the residual for constructing empirical semivariogram.With 30-day training period, the GOP models for both temperature are given on Eq. (28).
In general, for GOP model, the bias-corrected coefficient on Eq. ( 28) is retained.The next step then is to construct exponential semivariogram, particularly for the maximum temperature, shown on Figure 7.The semivariance on figure 7 is roughly constant after reaching the distance about 8.69 kilometres.It implied that the maximum temperature between two stations is no longer dependent after 8.69 km, with sill recorded at 3.65.The high amount of sill might cause the greater variance of estimation or influence the forecast.
Figure 7 also indicates that spatial inconsistency is exist upon maximum temperature in particular.On some distance pairs, there are few bins which have semivariance far greater than others.Since these patterns are seen on bins representing distance 50 km or more, it might be triggered by Citeko station situated on Puncak plateau.It has been the biggest effect on the unexpected inconsistency.Regardless of above fact, Table 6 shows the forecast along with GOP predictive interval, represented by 5 th percentile and 95 th percentile on January, 31 st 2009.

Station Obs. (°C) NWP(°C) GOP(°C) P5 (°C) P95 (°C)
As shown on Table 6, GOP clearly manages to improve the bias correction rate, reflected on the RMSE of GOP (2.12° C) which is higher than of NWP (2.18° C), although the rate is somewhat less than 5%.It might have been the impact of spatial inconsistency mentioned before.It means GOP is severely vulnerable and risky in case of inadequate location of interest, inappropriate data properties, data mishandle, etc.

E. The assesment of the best calibration method
After carrying out both calibration method, BMA and GOP, one should verify and compare which method gives the better forecast based on few indicator presented on Table 7, using the same 30-day training period.Based on Table 7, the maximum temperature forecast is more accurate using BMA, while the minimum temperature using mean ensemble (the average of PLSR, PCR, and Ridge).Furthermore, BMA manages to better calibrate both temperature than GOP, indicated by less CRPS.Hence, BMA would yield the weather forecasts which are more accurate and reliable than GOP, particularly on those eight stations at Jabotabek.V. CONCLUSION Some parameters of BMA and GOP, such as the weight and variance of BMA and the spatial parameters of GOP, should be estimated by iterative approach, for instance EM and L-BFGS respectively.For 30-day training period, the accuracy of BMA is not different than of the three members, while the former was more reliable, indicated by less CRPS.Furthermore, BMA manages to calibrate the forecast, indicated by the coverage closer to 50%.Lack of fit, though, is still owned by GOP since it has higher RMSE.From both method, BMA forecasts apparently have greater accuracy and precision.

Figure 1 .
Figure 1.Meteorological stations of interest over Jabotabek [1].The predictand (Y) of BMA and GOP is observed temperature, consisting of the maximum one and the minimum one.To produce deterministic BMA forecast, it should define the ensemble members, that is Partial Least Square Regression (PLSR), Principal Component Regression (PCR) and Ridge fitted value of temperature.Such value is obtained first by reducing dimension of each 32 NWP parameter using Principal Component (PC).Each NWP parameter has 9 grids or 3 x 3 in square that corresponds to each station, roughly illustrated on figure2.

Figure 3 .
Figure 3. Trend of ensemble member forecasts and observations for (a) maximum temperature and (b) minimum temperature.

Figure 4 .
Figure 4. VRH for three ensemble members of (a) maximum temperature and (b) minimum temperature, January 31 st 2009 -December 31 st 2009.

Figure 5 .
Figure 5. BMA predictive PDF and its members on November 14 th , 2009.

Figure 7 .
Figure 7. Empirical semivariogram of maximum temperature over 30day training period.

TABLE 2 .
PC REPRESENTATION OF EACH NWP PARAMETER AT DERMAGA.

TABLE 4 .
ASSESSMENT FOR THE FORECASTS OVER 30-DAY TRAINING PERIOD.

TABLE 6 .
GOP FORECAST AND ITS CONFIDENCE LIMIT OF MAXIMUM TEMPERATURE ON JANUARY 31 ST , 2009.

TABLE 7 .
THE COMPARATION OF FORECAST METHODS OVER 30-DAY TRAINING PERIOD.