This paper aimed to bring out the evolution of cooperation in our society by examining the cooperative hunting scenarios in lions, which biologists have studied extensively over the years. The objective was to simulate the random behavior of animals and show how their actions converge or diverge under different conditions. The simulation was done using Multi-Agent Reinforcement Learning (MARL). The agents learnt their action values through Nash Q-Learning, and we were able to simulate three different scenarios under different conditions: one in which the animals fight, one in which they cooperate, and one in which they mix both strategies.
Our work revolves around a scenario in which two lions exist in a limited area and can choose either to cooperate to catch the prey and share the reward, or to kill off/injure the opponent and keep all the prey for themselves. The prey is assumed to be part of the environment, and the agents are the players. To simulate this game, our first task was to design a theoretical model based on real-world data, on which we would run our computational MARL model. To do this, we first analyzed earlier papers that discussed the cooperative hunting scenario. In The Evolution of Cooperative Hunting, Packer and Ruttan examined several evolutionarily stable strategies for the agents (hunters, in this case), considering the possible roles, i.e. actions, each agent could play.
They came up with their own payoff matrices, with parameters such as prey encounter rates, prey capture rates, the cost of pursuing prey, and the cost of subduing prey. Taking inspiration from these, we came up with two models of our own.
Our first model uses payoffs from the matrix given by Packer and Ruttan, with some modifications. However, instead of a purely solitary agent, we introduce a new role in which an agent must first eliminate (fight) its opponent in order to hunt alone. The agent therefore has to incur a cost to be able to hunt solitarily.
(Cooperate, Cooperate) and (Fight, Fight) have the same utilities for both agents and are hence denoted by single values. In this payoff matrix, subscript one denotes the value for a solitary agent, whereas subscript two denotes the value for the pair of agents.
(Fight, Fight) carries the subscript w, which denotes extremely low real values and represents the case in which both predators decide to fight each other and end up wounding each other. The probability that they then catch the prey is extremely low.
In this case we assume that the prey is not divided equally among the predators but in proportion to their capability. The cases of (Fight, Cooperate) and (Cooperate, Fight) are similar to Case 1, but with different payoff values. The case of (Fight, Fight) is quite interesting, as the loss of energy during the fight is not equal for the two predators.

Fig. 1: Environment
The second model we designed for the hunting scenario is based on similar principles. It has two different cases, depending on how powerful each predator is. In the first case, both predators are equally powerful and the cost of catching the prey is the same for each. In the second case, one predator is more powerful than the other, and the payoff matrix changes accordingly.
| P1/P2 | Cooperate | Fight |
|---|---|---|
| Cooperate | (K-C)N/2, (K-C)N/2 | 0, (K-C)/N |
| Fight | (K-C)/N, 0 | (K-C-I)N/2, (K-C-I)N/2 |

TABLE II: Payoff Matrix for Case 1 (equally powerful predators)
As mentioned before, both predators are equally powerful, so they divide the payoff equally in the cases of (Cooperate, Cooperate) and (Fight, Fight). When one predator tries to cooperate while the other chooses to fight, or vice versa, the payoff for the one who wants to cooperate is zero, since it does not receive the cooperation it needs from the other predator to catch the prey.
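As a quick sanity check, the sketch below plugs values into the Case 1 matrix of Table II and prints the resulting payoffs. The numerical values of K, C, I and N, and their interpretation in the comments, are illustrative assumptions, not measured quantities from the paper.

```python
# Illustrative evaluation of the Case 1 payoff matrix (Table II).
# K, C, I, N are placeholder values; their meanings below are assumed interpretations.
K, C, I, N = 10.0, 2.0, 3.0, 4.0  # reward, hunting cost, injury cost, prey-related factor (assumed)

payoffs = {
    ("Cooperate", "Cooperate"): ((K - C) * N / 2,     (K - C) * N / 2),
    ("Cooperate", "Fight"):     (0.0,                 (K - C) / N),
    ("Fight",     "Cooperate"): ((K - C) / N,         0.0),
    ("Fight",     "Fight"):     ((K - C - I) * N / 2, (K - C - I) * N / 2),
}

for actions, (p1, p2) in payoffs.items():
    print(f"{actions}: P1 = {p1:.2f}, P2 = {p2:.2f}")
```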
| P1/P2 | Cooperate | Fight |
|---|---|---|
| Cooperate | (K-C1)NC2/(C), (K-C2)NC1/(C) | 0, (K-C2)/N |
| Fight | (K-C1)/N, 0 | (K-C1-I1)NC2/(C), (K-C2-I2)NC1/(C) |

TABLE III: Payoff Matrix for Case 2 (unequally powerful predators)
In this second case, we assume that one predator is more powerful than the other, so the cost terms (C1, C2) and the injury terms (I1, I2) differ between the two predators, and the payoff is divided in proportion to each predator's capability.
We modeled the scenario as a grid-based environment, as shown in Fig. 1. Predators A and B start at positions 1 and 2 respectively, and both have to catch the prey, which is at position 7. The prey is blissfully unaware of the existence of the predators. For simplicity, we have assumed that predator A can move only right and up, while predator B can move only left and up. If predator A moves right from position 1, it ends up in position 4; the same goes for predator B moving left from position 2.
The goal of the predators is to catch the prey. However, the game also ends when both of them decide to fight each other, i.e. when both end up in position 4 after the first move. Rewards are given to both predators only when the game finishes, and depend on how the prey was hunted.
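To illustrate how such an environment can be wired up, the sketch below implements a minimal Gym-style two-predator grid game in the spirit of the description above. The cell numbering, the treatment of actions as target cells, and the reward values (which mirror the cooperation scenario of Table IV later in the paper) are simplifying assumptions, not the exact environment used in our experiments.

```python
class TwoPredatorGridEnv:
    """Minimal Gym-style sketch of the two-predator hunting grid (assumed layout and rewards)."""

    FIGHT_CELL = 4  # cell where both predators meet and fight
    PREY_CELL = 7   # cell occupied by the prey

    def __init__(self):
        self.reset()

    def reset(self):
        # Predator A starts at cell 1 and predator B at cell 2, as described above.
        self.positions = {"A": 1, "B": 2}
        self.done = False
        return dict(self.positions)

    def step(self, cell_a, cell_b):
        """Joint step: each predator names the cell it moves to.
        Move-legality checks are omitted to keep the sketch short."""
        if self.done:
            raise RuntimeError("Episode finished; call reset() first.")
        self.positions["A"], self.positions["B"] = cell_a, cell_b

        rewards = (0.0, 0.0)
        if cell_a == self.FIGHT_CELL and cell_b == self.FIGHT_CELL:
            # Both predators move to the fight cell: the game ends with low payoffs.
            rewards, self.done = (-1.0, -1.0), True
        elif cell_a == self.PREY_CELL and cell_b == self.PREY_CELL:
            # Both reach the prey: cooperative capture with a shared reward.
            rewards, self.done = (3.0, 3.0), True
        elif self.PREY_CELL in (cell_a, cell_b):
            # One predator reaches the prey alone (assumed asymmetric payoff).
            rewards = (2.0, 0.0) if cell_a == self.PREY_CELL else (0.0, 2.0)
            self.done = True
        return dict(self.positions), rewards, self.done, {}
```

A typical interaction would be `env = TwoPredatorGridEnv()`, `env.reset()`, then repeated calls to `env.step(cell_a, cell_b)` until `done` is returned as `True`.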
To train our predators to choose the best option available to them, we use a technique called Nash Q-Learning.
In reinforcement learning, an agent tries to learn an optimal control policy by interacting with the external world. The external world is modeled as a discrete-time, finite-state Markov decision process (MDP). Each state-action pair (s, a) is assigned a value that estimates the long-term reward of taking action a in state s.
Q-Learning is an approach that incrementally estimates the utility of executing an action from a given state by continuously updating these estimates with the following rule:
$$Q(s, a) \leftarrow Q(s, a) + \alpha\big(r(s, a) + \gamma \max_{a'} Q(s', a') - Q(s, a)\big) \tag{1}$$

$$Q(s, a) \leftarrow (1-\alpha)\,Q(s, a) + \alpha\big(r(s, a) + \gamma \max_{a'} Q(s', a')\big) \tag{2}$$
where α is the learning rate, γ is the discount factor, r(s, a) is the immediate reward, and s′ is the state reached after taking action a in state s. Equations (1) and (2) are algebraically equivalent forms of the same update.
Watkins and Dayan (1992) have shown that the Q-learning algorithm converges to an optimal decision policy for a finite Markov decision process.
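For reference, a minimal tabular implementation of update (2) might look as follows; the state and action encodings are arbitrary, and the learning rate and discount factor are illustrative values.

```python
from collections import defaultdict

# Minimal tabular Q-learning update corresponding to Eq. (2).
Q = defaultdict(float)   # Q[(state, action)] -> current utility estimate
ALPHA, GAMMA = 0.1, 0.9  # learning rate and discount factor (illustrative values)

def q_update(state, action, reward, next_state, actions):
    """Apply Q(s,a) <- (1-alpha)*Q(s,a) + alpha*(r + gamma * max_a' Q(s',a'))."""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] = (1 - ALPHA) * Q[(state, action)] + ALPHA * (reward + GAMMA * best_next)
```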
Nash Q-learning is an extension of Q-learning to the multi-agent reinforcement learning scenario. In multi-agent reinforcement learning (MARL), the external world seen by an agent cannot be modelled as an MDP and is instead modelled as a stochastic game. Every agent's value at a state depends on the joint actions of all agents, so the max operator in the Q-learning update is replaced by the value of a Nash equilibrium of the stage game defined by the Q-values:
$$Q(s,(a, b)) \leftarrow (1-\alpha)\,Q(s,(a, b)) + \alpha\big(r(s,(a, b)) + \gamma\,\mathrm{NashEq}_{a',b'}\,Q(s',(a',b'))\big) \tag{3}$$
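A minimal sketch of the update in Eq. (3) for a two-agent game is given below. It only searches for pure-strategy equilibria of the stage game and falls back to the first joint action otherwise, whereas the full Nash Q-learning algorithm of Hu and Wellman also handles mixed equilibria; the table layout and constants are assumptions of this sketch.

```python
import itertools
from collections import defaultdict

# Q-tables for the two agents, indexed by (state, (action_a, action_b)).
Q1 = defaultdict(float)
Q2 = defaultdict(float)
ALPHA, GAMMA = 0.1, 0.9  # illustrative values

def pure_nash_value(state, actions):
    """Value of a pure-strategy Nash equilibrium of the stage game (Q1[state, .], Q2[state, .]).
    Falls back to the first joint action if no pure equilibrium exists (a simplification)."""
    for a, b in itertools.product(actions, repeat=2):
        best_a = all(Q1[(state, (a2, b))] <= Q1[(state, (a, b))] for a2 in actions)
        best_b = all(Q2[(state, (a, b2))] <= Q2[(state, (a, b))] for b2 in actions)
        if best_a and best_b:
            return Q1[(state, (a, b))], Q2[(state, (a, b))]
    a = b = actions[0]
    return Q1[(state, (a, b))], Q2[(state, (a, b))]

def nash_q_update(state, joint_action, rewards, next_state, actions):
    """One Nash Q-learning step, Eq. (3), applied to both agents' tables."""
    v1, v2 = pure_nash_value(next_state, actions)
    r1, r2 = rewards
    key = (state, joint_action)
    Q1[key] = (1 - ALPHA) * Q1[key] + ALPHA * (r1 + GAMMA * v1)
    Q2[key] = (1 - ALPHA) * Q2[key] + ALPHA * (r2 + GAMMA * v2)
```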
The environment for the agents to train in was built from scratch, keeping OpenAI Gym standards in mind for seamless integration with other powerful reinforcement learning libraries such as Keras-RL.
OpenAI Gym is an open-source library for developing reinforcement learning algorithms; it gives agents a platform to learn by interacting with an environment (more information can be found at gym.openai).
As mentioned previously, the Nash Q-Learning algorithm was used to train the agents. The algorithm was implemented from scratch in Python. The main hyperparameters used for training the agents are the learning rate α, the discount factor γ, the exploration rate, and the number of training episodes.
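As a rough illustration only, a configuration for tabular Nash Q-learning might look like the following; every numerical value below is a placeholder assumption, not one of the values actually used in our experiments.

```python
# Placeholder training configuration for tabular Nash Q-learning.
# All numbers are illustrative assumptions, not the values used in the paper's experiments.
config = {
    "episodes": 5000,        # number of training episodes
    "learning_rate": 0.1,    # alpha in Eqs. (2)-(3)
    "discount_factor": 0.9,  # gamma in Eqs. (2)-(3)
    "epsilon": 0.2,          # epsilon-greedy exploration rate (assumed exploration scheme)
}
```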
We tested three scenarios: the first, in which the Nash Equilibrium was Cooperation; the second, in which Fight was the Nash Equilibrium; and the final one, in which there were multiple Nash Equilibria.
The payoff matrix for the first scenario is given in Table IV below. The Nash Equilibrium here is (Cooperate, Cooperate).
| P1/P2 | Cooperate | Fight |
|---|---|---|
| Cooperate | 3, 3 | 0, 2 |
| Fight | 2, 0 | -1, -1 |

TABLE IV: Payoff Matrix for the Cooperate Scenario
We do in fact observe our agents converging to that behaviour.
The payoff matrix for the second scenario is given in Table V below. The Nash Equilibrium here is (Fight, Fight).
| P1/P2 | Cooperate | Fight |
|---|---|---|
| Cooperate | 3, 3 | 0, 4 |
| Fight | 4, 0 | 1, 1 |

TABLE V: Payoff Matrix for the Fight Scenario
We observe that our agents' behaviour again converges to the Nash Equilibrium.
The payoff matrix for the final scenario is given in Table VI below. In this scenario, the Nash Equilibria are (Fight, Fight) and (Cooperate, Cooperate).
| P1/P2 | Cooperate | Fight |
|---|---|---|
| Cooperate | 3, 3 | 0, 2 |
| Fight | 2, 0 | 1, 1 |

TABLE VI: Payoff Matrix for the Multiple Equilibria Scenario
The agents randomly choose to fight or to cooperate, and the converged behaviour alternates between the pattern shown in Fig. 2, the pattern shown in Fig. 3, and other ways of cooperating.
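The pure-strategy equilibria quoted for Tables IV–VI can be verified mechanically. The short check below enumerates best responses for each 2x2 matrix; the payoff numbers are copied from the tables above, and the label strings are only for readability.

```python
import itertools

def pure_nash_equilibria(payoffs):
    """Return all pure-strategy Nash equilibria of a 2-player game given as {(a1, a2): (u1, u2)}."""
    actions1 = sorted({a for a, _ in payoffs})
    actions2 = sorted({b for _, b in payoffs})
    equilibria = []
    for a, b in itertools.product(actions1, actions2):
        u1, u2 = payoffs[(a, b)]
        best1 = all(u1 >= payoffs[(a2, b)][0] for a2 in actions1)  # P1 cannot improve by deviating
        best2 = all(u2 >= payoffs[(a, b2)][1] for b2 in actions2)  # P2 cannot improve by deviating
        if best1 and best2:
            equilibria.append((a, b))
    return equilibria

tables = {
    "Table IV (Cooperate)": {("C", "C"): (3, 3), ("C", "F"): (0, 2),
                             ("F", "C"): (2, 0), ("F", "F"): (-1, -1)},
    "Table V (Fight)":      {("C", "C"): (3, 3), ("C", "F"): (0, 4),
                             ("F", "C"): (4, 0), ("F", "F"): (1, 1)},
    "Table VI (Multiple)":  {("C", "C"): (3, 3), ("C", "F"): (0, 2),
                             ("F", "C"): (2, 0), ("F", "F"): (1, 1)},
}

for name, game in tables.items():
    print(name, "->", pure_nash_equilibria(game))
```

Running this prints (C, C) for Table IV, (F, F) for Table V, and both (C, C) and (F, F) for Table VI, matching the equilibria stated above.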
Visit the GitHub link for more figures and the complete code.
Fig. 2: Converged behavior of the two predators as Cooperation (panels (a)–(d))
While this paper uses cooperative hunting to devise a model for predators and prey, it should be noted that the underlying premise is one of supply and demand. The ideas in this paper can be extrapolated to any situation or scenario in which resources are scarce and more than one competitor is present.
A very direct context that can be derived from our paper is that of economic trade, more specifically a duopoly model. Similar to our model, there exist two firms that compete for the same market and can either choose to cooperate or try to exploit the market on their own. The results obtained from our experiments can indeed be observed in this context: in situations of cooperation, firms work together to fix a price level or supply level with a promise not to undercut one another, and in situations of fighting, matters escalate into price wars.
Another interesting application of our methods and model is in the fields of anthropology and evolution. There is significant evidence that humans and other species pass down information through their genes, enabling them to learn behavioural patterns that help them live better and longer.
Fig. 3: Converged behavior of the two predators as Fight (panels (a)–(b))
The methods used in this paper can be utilized to model such behaviour, and in some cases even to predict future behaviour, aiding a better understanding of the workings of evolution.
The implications of the methods and ideas explored in this paper have far-reaching consequences. A direct increase in complexity can be achieved simply by increasing the number of predators and prey in our model. The problem with solving large-scale game-theoretic problems for the Nash Equilibrium with traditional techniques is that computing an equilibrium is PPAD-complete, a complexity class believed to be computationally intractable in general. The learning techniques used in this paper, however, can be used to approach the Nash Equilibrium in a much shorter time.
By utilizing the power of RNNs, similar large-scale, complex problems in economics and business could be tackled. Future work would include models with multiple predators and prey, additional constraints on movement and/or actions, payoff matrices with additional variables, actions for the prey, and so on.
[1] Packer, C. and Ruttan, L. (1988). The Evolution of Cooperative Hunting. The American Naturalist, 132(2), pp. 159–198.
[2] Hu, J. and Wellman, M. (2003). Nash Q-Learning for General-Sum Stochastic Games. Journal of Machine Learning Research, 4, pp. 1039–1069.
[3] Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. Ph.D. thesis, University of Cambridge.
[4] Watkins, C. J. C. H. and Dayan, P. (1992). Technical Note: Q-Learning. Machine Learning, 8(3/4), pp. 279–292. Kluwer Academic Publishers.
[5] Daskalakis, C., Goldberg, P. W. and Papadimitriou, C. H. (2008). The Complexity of Computing a Nash Equilibrium.