Constructing Statistical Distributions to Characterize the Transfer Factors of Metals from Soil to Plant (TFsp) Using a Bayesian Method

Plants are able to take up metals from the soil, and consuming these plants can in some situations pose a health risk that needs to be assessed. The transfer of contaminants from soil to food crops is a major route linking soil contamination to human exposure. The soil-plant transfer factor (TFsp), the ratio between the concentration of a contaminant in the plant and its concentration in the soil, is commonly used in exposure and health risk assessment. This research uses the BAPPET database (a database of the trace metal element contents of vegetable plants). The goal is to identify the variables that drive the variability of TFsp and to characterize their effects through their posterior distributions, using Bayesian methods with the Metropolis-Hastings algorithm. The study covers 3 metals (Cd, As and Pb), 4 plant types (leaf, fruit, root and tuber) and 2 analyses (one with all 4 plant types and one with 3 plant types, without tuber), with 4 analysis of variance (ANOVA) models fitted using normal and lognormal likelihoods. For the analysis of 4 plant types, the best model is model II with a lognormal likelihood (yi ~ LN(μi, σi)); for the analysis of 3 plant types, the best model is model IV with a lognormal likelihood (yi ~ LN(μi, σ), μi = μ + αi + Bj + δk, Bj ~ N(0, σB)). For Cd, As and Pb, leaf vegetables present the highest health risk because they have the largest posterior mean of TFsp.

Keywords: BAPPET Database, Metropolis-Hastings, Plant Types, ANOVA, Health Risk, Posterior Distribution.

I. INTRODUCTION
The plant foods consumed by humans come not only from agriculture but also from home gardens and gathering. Plants are able to take up metals from the soil, so consuming vegetables can in some situations pose a health risk that needs to be assessed. The transfer of contaminants from soil to food crops is a major route linking soil contamination to human exposure. The transfer factor of metals from soil to plant (TFsp), the ratio between the concentration of a contaminant in the plant and its concentration in the soil, is commonly used in exposure and health risk assessment. In exposure modeling studies in particular, it is recognized as one of the key parameters. The BAPPET database (a database of the trace metal element contents of vegetable plants) contains many experimental data on vegetable plants contaminated by trace metals, and therefore information on the TFsp parameter; it is the data source used here. These data are analyzed with a Bayesian approach.
The aim of the research is to identify the variables responsible for the variability of TFsp and to characterize their effects through their posterior distributions. The analyses cover three metals (Cd, As and Pb), four plant types (leaf vegetable, fruit vegetable, root vegetable and tuber vegetable) and two methods of extracting the metal content of the soil (total and semi-total extraction).

1 Pratnya Paramitha Oktaviana is with the Department of Statistics, Faculty of Mathematics and Science, Institut Teknologi Sepuluh Nopember, Surabaya, 60111, Indonesia. E-mail: pratnya.paramitha@yahoo.co.id.
2 Marie-Pierre Etienne is with the Department of Applied Mathematics and Informatics, AgroParisTech, Paris, France. E-mail: marie.etienne@agroparistech.fr.

II. METHOD
The Bayesian method described in this chapter is used for the analysis in this research. The steps of the analysis are explained after the Bayesian theory.
Hierarchical Bayesian models combine two things: i) a model written in hierarchical form that is ii) estimated using Bayesian methods. A hierarchical model is one that is written in a modular way, in terms of sub-models. The sub-models are combined to form a hierarchical model, and Bayes' theorem is used to integrate the pieces and account for all the uncertainty that is present. Markov chain Monte Carlo (MCMC) methods are numerical methods for describing posterior distributions; they work especially well with hierarchical models and are the engine that has fueled the development and application of Bayes' theorem [1].

A. Bayes Formula
In the sequel, the notation [.] denotes a probability distribution, whether it is a probability density function or a discrete distribution. In a Bayesian framework, estimating a model consists of updating the prior distribution of the parameters, [θ], with the information contained in the data y, through the Bayes formula [9]:
[θ | y] = [y | θ] [θ] / [y].
A sequence of random variables (θn) is a Markov chain if [θn+1 | θ0, ..., θn] = [θn+1 | θn] for all n ≥ 0. That is to say, the probability distribution of θn+1 given the past variables depends only on θn. This conditional probability distribution is called the transition kernel K, written K(θn, θn+1) [11].
Most Markov chains encountered in Markov chain Monte Carlo (MCMC) methods have a very strong stability property that ensures convergence to a stationary probability distribution: the chain is irreducible, aperiodic and recurrent. A stationary distribution is a probability distribution f such that if θn ~ f, then θn+1 ~ f.
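The stationarity property can be checked numerically on a small example. The sketch below uses a hypothetical two-state chain whose transition kernel K is chosen only for illustration (it is not taken from the study): iterating the kernel from an arbitrary starting distribution approaches a distribution f that is left unchanged by one more step.

```python
# Two-state Markov chain with an assumed (illustrative) transition kernel K.
# K[i][j] is the probability of moving from state i to state j.
K = [[0.9, 0.1],
     [0.3, 0.7]]


def step(dist, kernel):
    """Propagate a distribution one step: new[j] = sum_i dist[i] * kernel[i][j]."""
    n = len(kernel)
    return [sum(dist[i] * kernel[i][j] for i in range(n)) for j in range(n)]


# Start far from stationarity and apply the kernel repeatedly.
dist = [1.0, 0.0]
for _ in range(200):
    dist = step(dist, K)

# For a two-state chain the stationary distribution has a closed form.
a, b = K[0][1], K[1][0]
stationary = [b / (a + b), a / (a + b)]

# One more step leaves the stationary distribution unchanged.
after_one_step = step(stationary, K)
```

Here `dist` ends up numerically equal to `stationary`, which illustrates why the early part of an MCMC run is usually discarded before the samples are used.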

2) Metropolis-Hastings Algorithm
The Metropolis-Hastings algorithm associated with the objective (target) density f and the conditional density g produces a Markov chain {θn}, defined by Robert and Casella [11] as follows:
a. Given θn, generate a candidate Yn ~ g(y | θn).
b. Take θn+1 = Yn with probability ρ(θn, Yn), and θn+1 = θn with probability 1 − ρ(θn, Yn), where ρ(θ, y) = min{ f(y) g(θ | y) / (f(θ) g(y | θ)), 1 }.
The distribution g is called the candidate (proposal) distribution and ρ(θ, y) is called the Metropolis-Hastings acceptance probability.
This algorithm is the basis of the OpenBUGS software, which was used to carry out the inference for the models considered.
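To make the two steps of the algorithm concrete, here is a minimal Python sketch of a generic Metropolis-Hastings sampler. It is not the OpenBUGS implementation. The target (a normal density with mean 1.0 and standard deviation 0.5), the Gaussian random-walk candidate distribution, the step size and the function names are all assumptions chosen for illustration.

```python
import math
import random


def metropolis_hastings(log_f, propose, log_g, theta0, n_iter, rng):
    """Generic Metropolis-Hastings sampler.

    log_f(theta)        : log of the target density f, up to an additive constant
    propose(theta, rng) : draws a candidate y ~ g(. | theta)
    log_g(y, theta)     : log of the candidate density g(y | theta)
    """
    chain = [theta0]
    theta = theta0
    for _ in range(n_iter):
        y = propose(theta, rng)
        # log of f(y) g(theta | y) / (f(theta) g(y | theta))
        log_ratio = (log_f(y) + log_g(theta, y)) - (log_f(theta) + log_g(y, theta))
        rho = math.exp(min(0.0, log_ratio))  # acceptance probability
        if rng.random() < rho:
            theta = y  # accept the candidate, otherwise keep the current value
        chain.append(theta)
    return chain


# Hypothetical target and candidate distribution (illustration only).
TARGET_MEAN, TARGET_SD = 1.0, 0.5
STEP = 0.5


def log_f(theta):
    return -0.5 * ((theta - TARGET_MEAN) / TARGET_SD) ** 2


def propose(theta, rng):
    return rng.gauss(theta, STEP)


def log_g(y, theta):
    # Symmetric random walk: the normalizing constant cancels in the ratio.
    return -0.5 * ((y - theta) / STEP) ** 2


rng = random.Random(0)
chain = metropolis_hastings(log_f, propose, log_g, theta0=5.0, n_iter=20000, rng=rng)
kept = chain[2000:]  # discard an initial burn-in period
post_mean = sum(kept) / len(kept)
```

Because the candidate distribution here is symmetric, the g terms cancel and the acceptance probability reduces to min{f(y)/f(θ), 1}; the general form is kept so that an asymmetric candidate distribution could be substituted.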
3) The Gelman-Rubin Convergence Diagnostic in MCMC
The Gelman-Rubin statistic, as modified by Brooks and Gelman [2], is based on generating multiple chains from dispersed initial values and assessing convergence by comparing the between-chain and within-chain variability. Denote the number of chains generated by M and the length of each chain by 2T. As a measure of posterior variability we take the width of the 100(1 − α)% credible interval for the parameter of interest (in OpenBUGS, α = 0.2). From the final T iterations, calculate the empirical credible interval for each chain, then average the interval widths across the M chains and denote the result by W. Finally, calculate the width B of the empirical credible interval based on all MT samples pooled together. The ratio R = B / W of pooled to average interval widths should be greater than 1 if the starting values are suitably over-dispersed; it also tends to 1 as convergence is approached, so in practice convergence can be assumed if, for example, R < 1.05.
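The interval-based ratio described above can be sketched in a few lines of Python. This is an illustrative reimplementation under stated assumptions, not the OpenBUGS code: the quantile rule (linear interpolation) and the function names are choices made here, and the two sets of chains are simulated purely for demonstration.

```python
import math
import random


def quantile(sorted_values, p):
    """Empirical quantile with linear interpolation (an assumed convention)."""
    pos = p * (len(sorted_values) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1.0 - frac) + sorted_values[hi] * frac


def interval_width(values, alpha):
    """Width of the empirical 100(1 - alpha)% interval."""
    s = sorted(values)
    return quantile(s, 1.0 - alpha / 2.0) - quantile(s, alpha / 2.0)


def interval_ratio(chains, alpha=0.2):
    """R = B / W from M chains of length 2T, using the final T iterations."""
    half = len(chains[0]) // 2
    tails = [chain[-half:] for chain in chains]
    # W: average within-chain interval width.
    w = sum(interval_width(t, alpha) for t in tails) / len(tails)
    # B: interval width of the M*T pooled samples.
    pooled = [x for t in tails for x in t]
    b = interval_width(pooled, alpha)
    return b / w


rng = random.Random(1)
# Three simulated chains sampling the same distribution (mixed well).
mixed = [[rng.gauss(0.0, 1.0) for _ in range(4000)] for _ in range(3)]
# Three simulated chains stuck around different values (not converged).
stuck = [[rng.gauss(m, 1.0) for _ in range(4000)] for m in (0.0, 5.0, 10.0)]

r_mixed = interval_ratio(mixed)
r_stuck = interval_ratio(stuck)
```

With these simulated inputs, `r_mixed` is close to 1 while `r_stuck` is well above 1, which is the behavior the diagnostic relies on.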

4) Deviance Information Criterion
In Bayesian analysis, the Deviance Information Criterion (DIC) summarizes the fit of a model by the posterior expectation of the deviance, D̄, and the complexity of the model by the effective number of parameters, pD [3]. Models that receive the most support from the data are those with the lowest DIC values. The definition of DIC is
DIC = D̄ + pD.

The process has six steps. The stages of data analysis used to achieve the desired objectives are as follows:
1. Calculate the soil-plant transfer factors (TFsp).
2. Identify outliers:
a. Transform the original TFsp data to the natural logarithm (ln) of TFsp. To overcome the problems that asymmetry causes in outlier detection, transformation functions and algorithms exist that can help to increase the symmetry of the distribution [4]. Symmetrical data are important in most outlier detection methods because these methods were designed for data following a normal distribution. Two commonly used functions are the logarithm and the square root, because they have advantageous properties with respect to the variance.
b. Calculate X̄ ± 2SD of the transformed data (X̄ = mean, SD = standard deviation). This is the standard deviation method: transformed values smaller than X̄ − 2SD or greater than X̄ + 2SD are considered outliers [12]. Re-examine the data identified as outliers. If there is sufficient reason to doubt a value (usually because it comes from an experimental setting very different from the others), it is deleted. If there is no objective reason to remove it, it is kept.
3. Several models are considered, including models in which the variability of the plant species within its group (leaf vegetable, fruit vegetable, root vegetable and tuber vegetable) is modeled through a random effect. Since potato is the only representative of the tuber group, this group was removed when this model was considered. Files containing the model, the data and the initial values are needed.
4. For the likelihood of the data (TFsp), normal and lognormal distributions are used.
5. The Bayesian analyses (MCMC, Metropolis-Hastings algorithm, 3 chains) are run from R using the BRugs package, and the outputs of the software are then analyzed in R (R script).
6. Choose the best model, that is, the one with the smallest value of DIC (Deviance Information Criterion). Based on the model chosen, we can assess which variables are responsible for the variability of TFsp and characterize their effects for each plant type from the posterior distribution.
The details of the models compared are presented in Table 1.
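Steps 1 and 2 of the procedure can be sketched as follows. This is a minimal Python illustration, not the authors' R script: the function names and the concentration values are hypothetical, and the sample standard deviation is assumed for SD.

```python
import math
import statistics


def transfer_factor(c_plant, c_soil):
    """TFsp: concentration in the plant divided by concentration in the soil."""
    return c_plant / c_soil


def flag_outliers(tf_values, k=2.0):
    """Flag values whose ln(TFsp) lies outside mean +/- k * SD (k = 2 in the text)."""
    logs = [math.log(v) for v in tf_values]
    mean = statistics.mean(logs)
    sd = statistics.stdev(logs)  # sample standard deviation (assumed convention)
    lower, upper = mean - k * sd, mean + k * sd
    return [not (lower <= x <= upper) for x in logs]


# Hypothetical (plant, soil) concentration pairs in the same units.
pairs = [(1.0, 10.0), (1.2, 10.0), (0.8, 10.0), (1.1, 10.0), (0.9, 10.0),
         (1.0, 10.0), (1.3, 10.0), (0.7, 10.0), (1.0, 10.0), (500.0, 10.0)]

tf_values = [transfer_factor(p, s) for p, s in pairs]
flags = flag_outliers(tf_values)

# Flagged values are candidates for review, not automatic deletions.
to_review = [tf for tf, is_out in zip(tf_values, flags) if is_out]
```

In this made-up data set only the last value is flagged. As in step 2b, a flagged value would then be re-examined and removed only if there is an objective reason to doubt it.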

III. RESULT AND DISCUSSION
This chapter presents the analysis of TFsp for the metals cadmium (Cd), arsenic (As) and lead (Pb) using the Bayesian method. It examines the variability of TFsp of each metal across plant types (leaf vegetable [1], fruit vegetable [2], root vegetable [3] and tuber vegetable [4]) and characterizes their effects from the posterior distributions. Four models are used: models I and II for the analysis of four plant types, and all models (I to IV) for the analysis of three plant types (without tuber vegetable).

A. Application to Cadmium (Cd)
In this section, we apply the method described above to analyze the soil-plant transfer factors (TFsp) of cadmium (Cd). The analysis of four plant types uses 820 TFsp values: 330 for leaf vegetables, 163 for fruit vegetables, 198 for root vegetables and 129 for tuber vegetables. The analysis of three plant types (without tuber vegetables) uses the remaining 691 values.

1) The Outputs of the Metropolis-Hastings MCMC Method
Before discussing the MCMC output, we first present a descriptive analysis of TFsp for Cd. Descriptive statistics of TFsp for Cd by plant type are shown as box plots in Figure 1. Figure 1 shows that TFsp for Cd is greater in leaf vegetables than in the other plant types (fruit vegetable, root vegetable and tuber vegetable). Some data still fall outside the interval shown by the boxes in Figure 1 even though outliers have already been removed; these data cannot be deleted because there is no strong evidence against them. TFsp values can be large when the concentration of trace metal elements (ETM) in the plant is large; this depends on the context.
a. Analysis of 4 Plant Types
Table 2 presents the DIC values for models I and II in the analysis of four plant types, using normal and lognormal likelihoods.
According to the DIC values in Table 2, model II with a lognormal likelihood for yi (yi ~ LN(µi, σi²)) is chosen as the best model because it has the smallest DIC value (828.4). Figure 2 shows the posterior densities (the red line and the histogram) of the parameters µi and σi² (i = 1, 2, 3, 4) of TFsp for Cd. Musample1 is the sample of parameter μ1, and so on up to sigmasample4, which is the sample of parameter σ4². The green line shows the prior density. The posterior summary is shown in Table 3 [...] (mu[2]); 1107 for root vegetable (mu[3]) and 2436 for tuber vegetable (mu[4]).
The conclusion from this model is that the first plant type (leaf vegetable) has the highest TFsp effect for Cd and the fourth plant type (tuber vegetable) has the lowest. The posterior variance (sigma[i]), which reflects the uncertainty around each parameter, is 1187 for mu[1], 2192 for mu[2], 1041 for mu[3] and 1101 for mu[4].
b. Analysis of 3 Plant Types
Table 4 shows the DIC value of each model for the analysis of three plant types. According to these results, model IVb with a lognormal likelihood for yi has the smallest DIC value (909.8) and is therefore chosen as the best model for the analysis of three plant types.

B. Results for the Other Metals
1) Arsenic (As)
The comparison of DIC values for models I and II in the analysis of four plant types is shown in Table 5. Model II with a lognormal likelihood is chosen (DIC = -473.3). Table 6 compares the DIC values of the models for the analysis of three plant types. According to Table 6, the DIC is smaller in every case for the models with a lognormal likelihood, and the differences between these values are not very large, so the choice of model needs care. In the preceding Cd analysis, model IVb with a lognormal likelihood was chosen, so this model is a natural candidate here as well. Although the smallest DIC is that of lognormal model IV, its difference from the DIC of model IVb is only 0.6; model IVb with a lognormal likelihood for yi (DIC = -453.1) is therefore still chosen as the best model for this analysis.
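For readers who want to see how a DIC value of the kind compared above is assembled, the following Python sketch computes the posterior mean deviance, the effective number of parameters pD = D̄ − D(θ̄), and DIC = D̄ + pD for a lognormal likelihood with a single mean and standard deviation. It is a simplified illustration under stated assumptions: the data and the posterior samples are made up, the plug-in point θ̄ is taken as the posterior mean of mu and sigma, and OpenBUGS may use a different parameterization.

```python
import math
import statistics


def deviance_lognormal(y, mu, sigma):
    """D(theta) = -2 * log-likelihood for y_i ~ LN(mu, sigma^2)."""
    log_lik = 0.0
    for v in y:
        z = (math.log(v) - mu) / sigma
        log_lik += -math.log(v) - math.log(sigma) - 0.5 * math.log(2.0 * math.pi) - 0.5 * z * z
    return -2.0 * log_lik


def dic(y, mu_samples, sigma_samples):
    """Return (DIC, D_bar, p_D) from posterior samples of mu and sigma."""
    deviances = [deviance_lognormal(y, m, s) for m, s in zip(mu_samples, sigma_samples)]
    d_bar = statistics.mean(deviances)  # posterior mean deviance
    # Deviance evaluated at the posterior means of the parameters.
    d_hat = deviance_lognormal(y, statistics.mean(mu_samples), statistics.mean(sigma_samples))
    p_d = d_bar - d_hat  # effective number of parameters
    return d_bar + p_d, d_bar, p_d


# Made-up TFsp values and posterior samples (illustration only).
y = [0.05, 0.10, 0.20, 0.15, 0.08]
mu_samples = [-2.5, -2.3, -2.1, -2.4, -2.2]
sigma_samples = [0.5, 0.5, 0.5, 0.5, 0.5]

dic_value, d_bar, p_d = dic(y, mu_samples, sigma_samples)
```

Applying the same function to each candidate model and keeping the one with the smallest DIC mirrors the selection rule used in this section. When two DIC values differ only slightly, as with the 0.6 difference above, other considerations such as consistency across metals can reasonably decide the choice.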

2) Lead (Pb)
The DIC values for the analysis of four plant types using models I and II are presented in Table 7. Model II with a lognormal likelihood for yi is again chosen as the best model (DIC = -4047). For the analysis of three plant types (Table 8), model IVb with a lognormal likelihood for yi (DIC = -2954) is chosen as the best model, for the same reason as in the As analysis: the difference in DIC is small.

IV. CONCLUSION
Based on the results and discussion in the previous chapter, we can conclude that:
1. For the analysis of four plant types, the best model for TFsp (Cd, As and Pb) is Model II with the lognormal likelihood (yi ~ LN(µi, σi²)). The only variable responsible for the variability of TFsp of these metals is the plant type.

[Table header: Model | Explanation]