THE EFFICACY OF CHOOSING STRATEGY WITH GENERAL REGRESSION NEURAL NETWORK ON EVOLUTIONARY MARKOV GAMES

Nowadays, evolutionary game theory, which studies the learning models of players, has attracted more attention than before, because such games can simulate real situations and dynamics over processing time. This paper creates evolutionary Markov games by mapping players' strategy-choosing to a Markov decision process (MDP) with payoffs. The Boltzmann distribution is used for the transition probabilities, and a General Regression Neural Network (GRNN) simulates strategy-choosing in evolutionary Markov games. The method is applied to the prisoner's dilemma, and the output results show that the human strategy-choosing line and the GRNN strategy-choosing line overlap after 48 iterations, after which they choose the same strategies. Also, the error rate of the GRNN trained on the Tit for Tat (TFT) strategy is lower than in similar work, showing a better result.


INTRODUCTION
Game Theory (GT) is the science of logical decision-making in humans and computers whose actions affect each other. When a situation with two or more players involves known payouts or quantifiable consequences, game theory can help determine the most likely outcomes. A few terms commonly used in the study of game theory are listed as follows:
• Game: Any set of conditions whose results depend on the actions of decision-makers (players).
• Player: A strategic decision-maker within the context of the game.
• Strategy: The action which the player will take at any stage of the game.
• Payoff: When a player arrives at a particular outcome, they receive the payout, which can be in any quantifiable form, from dollars to utility.
• Information Set: The facts available at a given point in the game. The term information set is most often applied when the game has sequential parts.
• Equilibrium: The phase in a game where both players have made their final decisions, and the game attains an optimal solution.
A game typically includes several players, strategies or actions, orders, and payoffs, similar to the individuals, genetic operators, and fitness involved in evolutionary algorithms. Players usually select a strategy based on the expected payoffs. A Markov process can be represented as the stochastic extension of a finite state automaton: state transitions are probabilistic, and, in contrast to a finite state automaton, there is no input to the system. Furthermore, the system is in only one state at each time step. Suppose the state space is $I = \{0, 1, 2, \dots\}$ and the time set is $T = \{0, 1, 2, \dots\}$. If, for arbitrary $n \in T$ and states $i_0, i_1, \dots, i_{n-1}, i, j \in I$,

$$P(X_{n+1} = j \mid X_0 = i_0, \dots, X_{n-1} = i_{n-1}, X_n = i) = P(X_{n+1} = j \mid X_n = i),$$

then the stochastic process $\{X_n, n \in T\}$ is called a Markov process: its future probabilities are determined only by its most recent value. The one-step transition probability is the conditional probability

$$p_{ij} = P(X_{n+1} = j \mid X_n = i).$$

The $n$-step transition probabilities can then be computed as a power of the one-step transition matrix, as shown in Equation 3:

$$P^{(n)} = P^{n}.$$
The one-step Markov transition probability matrix, which this paper uses, is defined as shown in Equation 6:

$$P = (p_{ij}) = \begin{pmatrix} p_{11} & p_{12} & \cdots & p_{1N} \\ p_{21} & p_{22} & \cdots & p_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ p_{N1} & p_{N2} & \cdots & p_{NN} \end{pmatrix}, \qquad p_{ij} \ge 0, \quad \sum_{j=1}^{N} p_{ij} = 1.$$
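As a quick numerical illustration of Equation 3, the n-step transition probabilities of a chain can be obtained by raising the one-step matrix to the n-th power. The two-state matrix below uses illustrative probability values, not values taken from the paper:

```python
import numpy as np

# One-step transition matrix of an illustrative two-state Markov chain;
# each row sums to 1 (the values are assumptions, not from the paper).
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])

# Equation 3: the n-step transition probabilities are the n-th power
# of the one-step transition matrix.
P3 = np.linalg.matrix_power(P, 3)

print(P3)
print(P3.sum(axis=1))  # rows of a stochastic matrix power still sum to 1
```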
In some situations, players exploit learning methods. Intelligence consists of making the right decisions in a given situation to achieve a certain goal, and game theory provides mathematical models of real-world situations for studying intelligent behavior. Most of the time, effective decision-making in strategic situations (such as competitive ones) requires a nonlinear mapping between stimulus and response. This can be provided by learning methods such as reinforcement learning, belief learning, imitation, neural networks, and fuzzy neural networks, which help guarantee gains. In this case, the player can decide based on their perception of the actions of other players.
Neural networks are algorithms, modeled loosely after the human brain, that are designed to recognize patterns. They can be effective in programming learning, generalization, and optimization functions [1]. The General Regression Neural Network (GRNN) was introduced by Donald Specht [2] as a memory-based neural network that stores all the independent and dependent training data available for a particular mapping. When a new input vector is presented, the GRNN uses a distance function to compute a weighted average of the dependent training data whose independent parameters are close to the input vector. Apart from the fact that the GRNN and Radial Basis Function (RBF) networks were motivated by different principles, their implementations are very similar. Both networks perform identical processing in the hidden layer and linear operations in the output layer. The main difference is that the GRNN output layer performs a weighted average while the RBF performs a weighted sum. Which of the two networks is better depends mainly on the application. By definition, the GRNN must store most of the training data.
When computational constraints are not significant, such as in stock-market prediction, where predictions may only be required once per day, it is probably best to use the memory-based GRNN. However, computational efficiency is more crucial in most other applications, such as embedded control systems, where the RBF network would be preferred. Figure 1 shows a GRNN; this kind of neural network includes an input layer, a pattern layer, a summation layer, and an output layer.
In this paper, the GRNN simulates the player's decision-making in the learning process. In evolutionary Markov games, the inputs of the neural network are determined by the players' last actions, and the output is the next action the player will choose. Simulation results show that the neural network can successfully simulate evolutionary Markov games. In the continuation of this paper, previous research is first reviewed; second, the theory of evolutionary Markov games and GRNN neural networks is presented; and finally, the simulation output is discussed.

PREVIOUS RESEARCH
In biology, game theory is a critical part of the mathematical and computational approach, so in the 1970s evolutionary game theory was introduced by Maynard Smith. Ganesan et al. [3] in 2015 presented another game-theoretic evolutionary method, named GTDE. GTDE was conducted in the framework of differential evolution (DE), with the strategies and two computational times of DE. Wood et al. [4] in 2016 studied OPEC and the Seven Sisters. They used a methodological toolkit that involved evolutionary game theory and agent-based modeling. For modeling evolutionary game theory, they used heterogeneous populations, energy-specific variables, and behavioral considerations to capture the essentials of the applied problem. To provide detailed results, an agent-based model was used.
Sharma [5] in 2016 proposed a Lyapunov theory-based Markov game fuzzy controller, which was both safe and stable. They attempted to optimize a reinforcement learning (RL) based controller using Markov games and to hybridize it with Lyapunov theory-based control for stability. Yang [6] in 2017 suggested an algorithm that used an adaptive technique based on cellular automata, named evolutionary algorithms based on game theory and cellular automata with coalitions (EACO). In most cases, this algorithm exceeds the compared cellular genetic algorithm, but for some benchmark combinatorial optimization problems it is not functional. The method provides game strategies of better quality than genetic algorithms but needs longer running times.
Garay et al. [7] in 2017 introduced a matrix game under time constraints, where each pairwise interaction had two consequences: both players received a payoff and could not play the next game for a specified time duration. So, their model is defined by two matrices: a payoff matrix and an average time duration matrix. They illustrated the effect of time constraints by the prisoner's dilemma game, where additional time constraints can ensure the existence of unique evolutionary stable strategies (ESS), both pure and mixed, or the coexistence of two pure ESS.
Tosh et al. [8] in 2018 used evolutionary dynamics to analyze both participation and information-sharing games and to understand the outcomes of the game, i.e., evolutionarily stable strategies (ESS). Various conditional constraints derived from the analysis helped to devise a dynamic cost-adaptation algorithm that exploits the participation cost to act as an incentive at the beginning, motivating as many firms as possible to participate.
Quan et al. [9] in 2018 studied the stochastic evolutionary optional public goods game, considering both continuous noise and finite-population-size effects. The evolutionary process of strategies in the population is described as a two-dimensional Markov process. They focused on the SSEs of the system, which were analyzed via the limiting distribution of the Markov process. Furthermore, they investigated the influence of parameters on the probabilities of the system choosing different SSEs.
Overton et al. [10] in 2019 evaluated the fixation probability of invading mutants for given initial conditions by showing how to derive existing pair-approximation models and by developing node-level approximations to stochastic evolutionary processes on arbitrarily complex structured populations represented by finite graphs. They discussed that the construction of these models facilitates systematic analysis and can be valuable for a greater understanding of the influence of population structure on evolutionary processes.
Izquierdo et al. [11] explained how ABED, free and open-source software for simulating evolutionary game dynamics in finite populations, can simulate a wide range of dynamics considered in the literature and many novel dynamics. They introduced a general model of revisions for dynamic evolutionary models, which decomposes strategy updates into selecting candidate strategies, payoff determination, and choice among candidates.
Wang et al. [12] constructed the Wright-Fisher evolutionary game dynamic model by introducing different selection intensities for different strategies. According to their analysis, in finite populations the fixation probability is related to the game payoffs and the population size N and is affected by the different selection intensities.
Gu et al. [13] in 2020 expressed the game payoffs as normal fuzzy numbers. In their study, the stochastic evolutionary game of the Moran process for a finite population was generalized to the fuzzy environment. They also introduced the fuzzy dilemma, according to which they classified the fuzzy games. Singh et al. [14] in 2020 considered a dynamic game model of the bitcoin market, where miners or players use mining systems to mine bitcoin by investing electricity into the mining system. This work is motivated by BTC and can be applied to other mining systems similar to BTC. Their work proposed three concepts of dynamic game-theoretic solutions to the model: the social optimum, the Nash equilibrium, and the myopic Nash equilibrium.
Srivastava et al. [15] in 2020 presented a novel algorithm to control data security using a hybrid model that used an adaptive encoding technique alongside Chaotic Hopfield Neural Network. The proposed computation upgraded the security of a key shared between any nodes. They showed that the security of transmitted data had better performance than traditional algorithms, and also, the computational time for the proposed algorithm was less than known traditional algorithms.
This group of games has dynamic characteristics: the strategies the players choose can change over processing time. One idea for dealing with this is to map players' strategy-choosing to a Markov decision process with payoffs.

MATERIAL AND METHOD
Modeling human gameplay with a powerful neuro-fuzzy network, such as the GRNN, is the main issue in the context of game theory. The cooperation of two or more players leads to a higher payoff for the cooperating group. To be clear, this research is interested in building a GRNN neural network that converges to optimal results. In continuation of this research, the hope is to illuminate decision-making strategies that may be at work, apply them to this network, and compare their results to find the optimal ones, which reach the Nash equilibrium.

Evolutionary Markov Games
This paper considers repeated games, which treat the iteration times of the games as a set of times t = 1, 2, ... in the Markov process. The strategy space is mapped to the state space I of the Markov process. In repeated games, players update their strategies according to their payoffs, and the payoffs are represented as rewards for state transitions. This paper discusses the iterated prisoner's dilemma, which gives the players a binary choice: either cooperation (C) or defection (D).
In general, if at time t (the t-th repeated game) the player is in state i (i = C, D), then at time t + 1 the probability that the player chooses strategy C or D can be obtained from the Markov transition probability matrix, which is shown in Figure 2. So the iterated prisoner's dilemma is a Markov process with two-state transitions.

FIGURE 2 Markov Process for iterated prisoner's dilemma.
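The two-state chain of Figure 2 can be sketched in a few lines; the transition probabilities below are illustrative assumptions, not values from the paper:

```python
import numpy as np

# Two-state Markov chain over the strategies C (cooperate) and D (defect),
# as in Figure 2. trans[i][j] is the probability of choosing strategy j at
# round t+1 given state i at round t; the values are illustrative assumptions.
states = ["C", "D"]
trans = np.array([[0.7, 0.3],   # from C
                  [0.2, 0.8]])  # from D

rng = np.random.default_rng(0)

def next_state(current: int) -> int:
    """Sample the next strategy index from the current state's row."""
    return rng.choice(2, p=trans[current])

# Simulate a short run of strategy choices starting from C.
state = 0
history = [states[state]]
for _ in range(5):
    state = next_state(state)
    history.append(states[state])
print(history)
```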
Consider a Markov process with N states. If the state transfers from i to j, then q_ij is the payoff, giving a series of rewards for state transitions, which are stochastic variables. The Markov transition probability matrix determines their probability distribution.
For more definitions, consider a 2-player symmetric repeated game. If the player's strategy is i, the total expected payoff after n repeated games can be calculated using the Markov process. If v_i(n) is the total expected payoff after n transitions from state i, then the expected reward after a one-step transition from state i can be calculated by Equation 7:

$$q_i = \sum_{j=1}^{N} p_{ij} \, q_{ij}.$$
The following recursion can also be defined:

$$v_i(n) = q_i + \sum_{j=1}^{N} p_{ij} \, v_j(n-1), \qquad v_i(0) = 0.$$
In vector form, where V(n) is the column vector of the v_i(n), Q is the column vector of the q_i, and P is the Markov transition probability matrix, the total expected payoff satisfies

$$V(n) = Q + P\,V(n-1).$$

This paper uses the Boltzmann distribution for calculating p_ij. Equation 11 gives the probability of a transition from state i to state j at time t:

$$p_{ij}(t) = \frac{e^{u_j(t)/\lambda}}{\sum_{k} e^{u_k(t)/\lambda}},$$

where u_j(t) is the payoff when the player chooses strategy j. For the learning process, the parameter λ plays an important role in choosing the optimal strategy accurately: increasing λ increases the randomness of decisions, and vice versa.
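A minimal numerical sketch of the payoff recursion and the Boltzmann choice rule for a two-state chain follows; all numeric values are illustrative assumptions, not values taken from the paper:

```python
import numpy as np

# Sketch of the payoff recursion V(n) = Q + P V(n-1) and the Boltzmann
# choice rule for a two-state chain (C, D). All numeric values here are
# illustrative assumptions, not values from the paper.
P = np.array([[0.7, 0.3],
              [0.2, 0.8]])          # transition probability matrix
q = np.array([[3.0, 0.0],
              [5.0, 1.0]])          # q[i, j]: reward for the transition i -> j

# Equation 7: q_i = sum_j p_ij * q_ij, the expected one-step reward.
Q = (P * q).sum(axis=1)

def expected_payoff(n: int) -> np.ndarray:
    """Total expected payoff after n transitions, via V(n) = Q + P V(n-1)."""
    V = np.zeros(len(Q))
    for _ in range(n):
        V = Q + P @ V
    return V

def boltzmann(u: np.ndarray, lam: float) -> np.ndarray:
    """Equation 11: choice probabilities; larger lam means more randomness."""
    z = np.exp(u / lam)
    return z / z.sum()

print(expected_payoff(3))
print(boltzmann(np.array([3.0, 1.0]), lam=1.0))
```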

General Regression Neural Network (GRNN)
Artificial Neural Networks (ANNs) have two main types: feed-forward ANNs (FFANNs), in which the input flows only forward toward the output layer, and recurrent ANNs (RANNs), in which data can flow in any direction. The Generalized Regression Neural Network (GRNN) is a single-pass, associative-memory, feed-forward ANN that uses normalized Gaussian kernels in the hidden layer as activation functions. The GRNN is made of an input layer, a hidden (pattern) layer, a summation layer with summation and division units, and an output layer.
When the GRNN is trained, it memorizes every unique pattern. This is why it is a single-pass network and does not require any backpropagation algorithm. After training the GRNN with good training patterns, it is able to generalize to new inputs. The output of the GRNN can be calculated using Equation 12 and Equation 13:

$$\hat{Y}(x) = \frac{\sum_{i=1}^{n} Y_i \exp\!\left(-D_i^2 / 2\sigma^2\right)}{\sum_{i=1}^{n} \exp\!\left(-D_i^2 / 2\sigma^2\right)}, \qquad D_i^2 = (x - x_i)^{\mathsf{T}} (x - x_i),$$

where D_i is the Euclidean distance between the input x and the training sample input x_i, Y_i is the training sample output, and σ is the smoothing parameter of the GRNN.
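Equations 12 and 13 can be implemented directly in a few lines; the toy 1-D training set below (samples of y = x²) is an assumption for illustration, not data from the paper:

```python
import numpy as np

# Minimal implementation of Equations 12 and 13: the GRNN output is a
# Gaussian-kernel-weighted average of the stored training outputs.
def grnn_predict(x, X_train, y_train, sigma=0.5):
    d2 = np.sum((X_train - x) ** 2, axis=1)   # squared Euclidean distances D_i^2
    w = np.exp(-d2 / (2.0 * sigma ** 2))      # kernel weights
    return np.sum(w * y_train) / np.sum(w)    # weighted average, not a sum

# Toy 1-D training set (samples of y = x^2); an assumption for illustration.
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 1.0, 4.0, 9.0])

# Halfway between the samples at x = 1 and x = 2, the prediction is close
# to the average of their outputs.
print(grnn_predict(np.array([1.5]), X_train, y_train, sigma=0.3))
```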
The GRNN's advantages include its quick training and its accuracy. On the other hand, one obstacle with the GRNN is the growth of the hidden layer size. However, this issue can be mitigated by a special algorithm that limits the growth of the hidden layer by storing only the most relevant patterns.

RESULTS AND DISCUSSION
The prisoner's dilemma is a classic model in game theory. This game is a non-cooperative, non-zero-sum game and shows why two completely rational individuals might not cooperate, even if it appears to be in their best interests to do so. The prisoner's dilemma is illustrated with a payoff matrix. Figure 3 shows the calculation of the players' payoffs, with the following description: two members of a criminal gang are arrested and imprisoned. Each prisoner is in solitary confinement with no means of communicating with the other. The prosecutors lack sufficient evidence to convict the pair on the principal charge, but they have enough to convict both on a lesser charge. Simultaneously, the prosecutors offer each prisoner a bargain. Each prisoner is given the opportunity either to betray the other by testifying that the other committed the crime or to cooperate with the other by remaining silent. The possible outcomes are:
• If A and B each betray the other, each of them serves three years in prison.
• If A betrays B, but B remains silent, A will be set free, and B will serve five years in prison.
• If A remains silent but B betrays A, A will serve five years in prison, and B will be set free.
• If A and B both remain silent, both of them will serve only one year in prison (on the lesser charge).
So each player has two choices: either cooperation or defection.
It can be understood from Figure 3 that defection is a dominant action, and any rational player will choose defection no matter what the other chooses. But if they both cooperated, they would gain more.
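The outcomes above can be collected into a small payoff table, and the dominance argument checked mechanically (the year values follow the description above):

```python
# Payoff table of the prisoner's dilemma described above, written as years
# in prison (lower is better); keys are (A's action, B's action).
years = {
    ("C", "C"): (1, 1),  # both silent: one year each on the lesser charge
    ("C", "D"): (5, 0),  # A silent, B betrays: A serves five years
    ("D", "C"): (0, 5),  # A betrays, B silent: B serves five years
    ("D", "D"): (3, 3),  # both betray: three years each
}

# Defection dominates: whatever B does, A serves fewer years by defecting.
for b in ("C", "D"):
    assert years[("D", b)][0] < years[("C", b)][0]

# Yet mutual cooperation is better for both than mutual defection.
assert years[("C", "C")][0] < years[("D", "D")][0]
print("defection is dominant, but (C, C) beats (D, D) for both players")
```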
The iterated prisoner's dilemma repeats the conventional game numerous times, with neither player knowing the number of repetitions. Through this iteration, the hope for cooperation increases. The method is widely used to model systems in biology, sociology, psychology, and economics. The situation of this game is dynamic and can be modeled as an evolutionary Markov game with payoffs, where the transition probability is given by the Boltzmann distribution.
The GRNN neural network is paired with a human to play iterated prisoner's dilemma games. The GRNN comprises an input layer, one hidden layer, and one output layer. The input layer includes six input units, which encode the player's and the opponent's last actions, and the hidden layer is composed of 20 units. Table 1 shows the simulation parameters. The evolutionary curves of the human and the neural network are shown in Figure 4: after 70 iterations, the human and the GRNN network cooperate, which is better performance than similar work with an MLP neural network. The figure also shows that after 48 iterations, the human and the GRNN choose the same strategy.
The most successful strategy for playing the prisoner's dilemma is the Tit for Tat (TFT) strategy, submitted by Anatol Rapoport. The game starts with a cooperative move and then repeats the last move made by the opponent. The GRNN neural network is used to play against the TFT strategy, and Figure 5 shows the curve of epoch versus error for the GRNN. The results show that after 40 iterated games, the error rate starts to decrease. The best error rate is 0.0059, reached at 256 epochs.
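The TFT rule described above takes only a few lines of code; the always-defecting opponent below is an illustrative assumption used to show the mirroring behavior:

```python
# Tit for Tat: cooperate on the first move, then repeat the opponent's
# previous move, as described above.
def tit_for_tat(opponent_history):
    if not opponent_history:
        return "C"          # opening move is cooperation
    return opponent_history[-1]

# Against an always-defecting opponent (an illustrative assumption),
# TFT cooperates once and then defects.
opp_moves = []
tft_moves = []
for _ in range(4):
    tft_moves.append(tit_for_tat(opp_moves))
    opp_moves.append("D")
print(tft_moves)  # -> ['C', 'D', 'D', 'D']
```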
Although MLP networks outperform GRNN networks on some problems, in this case the GRNN shows better results than Weibing et al. [16]. In their work, the MLP neural network and the human chose the same strategy only after 217 iterations; with the GRNN, this decreases to 48 iterations. Furthermore, the error rate in their work when playing against the TFT strategy, measured as the sum of squared errors, was 10^-2; the mean squared error achieved by the GRNN was lower than that, as shown in Figure 5.

CONCLUSION
In recent years, evolutionary game theory, especially the study of the learning models of players, has attracted more attention than before, because such games can simulate real situations and dynamics over processing time. This paper maps players' strategy-choosing to a Markov decision process with payoffs to create the evolutionary Markov process. The transition probability is given by the Boltzmann distribution, and the GRNN neural network is used to simulate strategy-choosing in evolutionary Markov games. This process was applied to the iterated prisoner's dilemma, and the output results showed that after 48 iterations, the human

FIGURE 4
The evolutionary curve of human and GRNN neural network.

FIGURE 5
The curve of epoch-error.
strategy-choosing line and the GRNN strategy-choosing line overlapped, and they chose the same strategies. Also, when the GRNN was trained on the TFT strategy, the error rate was lower than in similar work, showing an improved result. For future work, it would be useful to apply other kinds of neural networks, from the unsupervised group, to achieve better performance and decrease the number of iterations needed to choose a strategy, which would lower the computation. It would also be worthwhile to run tournaments of strategies to find fitter ones for reaching the Nash equilibrium, such as Defector, the handshaking CollectiveStrategy, and Aggravater, which are good competitors for the TFT strategy.