Case Study: Teaching an AI to Play a Mobile Game
Many programmers start out hoping to teach a computer something new. Machine learning offers another way to do this, and with reinforcement learning the machine can learn on its own. One developer set out to find out how practical this is by teaching an AI to play their mobile game.
Reinforcement Learning
Reinforcement learning had fascinated the developer for a long time, especially after the DQN algorithm's success on classic Atari games. Attempts to reproduce these results with their own Q-learning agent went nowhere, due to a lack of theoretical knowledge and computational resources.
Later they discovered the PPO (Proximal Policy Optimization) algorithm, which is regarded as more stable and more generally applicable because it addresses several of DQN's problems. This became the reason to try reinforcement learning once more, with their mobile game as the playground.
The Game
The experiment used a simple match-3 mobile game in the style of Bejeweled or Candy Crush, in which the levels get progressively harder and the player starts over from the beginning after losing.
Observation Space
The game revolves around a 7×7 board on which every tile has one of five basic colors or one of seven special items. That amounts to roughly 10³⁴ to 10⁵² possible board states.
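One way this range can arise is a back-of-the-envelope count: treat each of the 49 tiles as taking one of 5 colors (lower bound) or one of 12 states, counting the special items as well (upper bound). A quick check:

```python
# Rough estimate of the number of board states (a back-of-the-envelope
# calculation, not an exact count of reachable states).
tiles = 7 * 7          # 49 positions on the board
colors = 5             # basic tile colors
specials = 7           # special items

lower = colors ** tiles               # colors only: 5^49  ≈ 1.8e34
upper = (colors + specials) ** tiles  # colors + specials: 12^49 ≈ 7.5e52

print(f"{lower:.1e} to {upper:.1e} possible boards")
```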
Action Space
Matches are made by grouping three or more tiles of the same color. Matching more than three at once creates special items with particular behaviors (e.g., area damage on activation). Every possible move can be expressed by letting the agent swipe each tile in one of two directions, which gives an action space of 7 × 7 × 2 = 98 distinct actions. The developer later realized that the action space could be reduced from 98 to 84, but stuck with the original assumption.
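A minimal sketch of how such an action index could be decoded into a swipe, assuming actions are numbered row-major with two directions (right and down) per tile; the 14 removable actions are the swipes that would leave the board, which is where the reduction to 84 comes from:

```python
BOARD_SIZE = 7
NUM_ACTIONS = BOARD_SIZE * BOARD_SIZE * 2  # 98 actions

def decode_action(action: int):
    """Map an action index 0..97 to (row, col, direction)."""
    direction = "right" if action % 2 == 0 else "down"
    tile = action // 2
    row, col = divmod(tile, BOARD_SIZE)
    return row, col, direction

def is_valid(action: int) -> bool:
    """Swipes that would leave the board are invalid; there are 14 of them."""
    row, col, direction = decode_action(action)
    if direction == "right":
        return col < BOARD_SIZE - 1
    return row < BOARD_SIZE - 1

invalid = sum(not is_valid(a) for a in range(NUM_ACTIONS))
print(invalid)  # 14, i.e. 98 - 14 = 84 usable actions
```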
Making the Game Playable by a Computer
The game was created for people, not computers, and runs on mobile devices. The challenge was to distill the core of the game so that a reinforcement learning agent could interact with it.
The developer considered reimplementing the game logic in Python to quickly get it into an OpenAI Gym environment. Although the game mechanics are simple, the original implementation had taken a lot of time and included extensive unit testing. They did not want to risk fresh bugs and flawed game logic by reimplementing it elsewhere: before letting an agent try to learn the game, the game logic had to be solid.
Luckily, the game was written in Dart, which also runs on desktop computers and servers. They therefore opted to wrap the original core game logic in a REST API and built a Gym environment that plays the game through this API. Local HTTP calls add some overhead, but they were eager to try it.
The resulting architecture was an agent running in a Gym environment whose game logic lives in an external process.
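A minimal sketch of what such a Gym environment could look like, assuming the REST server exposes `/reset` and `/step` endpoints that return the board, reward, and done flag as JSON; the endpoint names, payloads, and constructor parameters are guesses, not the developer's actual API:

```python
import gym
import numpy as np
import requests
from gym import spaces

class MatchThreeEnv(gym.Env):
    """Gym environment that delegates the game logic to a local REST server
    wrapping the original Dart game (endpoints and payloads are made up)."""

    def __init__(self, base_url="http://localhost:8080",
                 board_size=7, num_tile_states=12):
        self.base_url = base_url
        self.board_size = board_size
        self.action_space = spaces.Discrete(board_size * board_size * 2)
        self.observation_space = spaces.Box(
            low=0, high=num_tile_states - 1,
            shape=(board_size, board_size), dtype=np.int32)

    def reset(self):
        board = requests.post(f"{self.base_url}/reset").json()["board"]
        return np.array(board, dtype=np.int32)

    def step(self, action):
        result = requests.post(
            f"{self.base_url}/step", json={"action": int(action)}).json()
        obs = np.array(result["board"], dtype=np.int32)
        return obs, float(result["reward"]), bool(result["done"]), {}
```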
Getting a Working PPO Agent
The primary objective was to see whether an existing PPO implementation could be reused, so there was no ambition to write a custom version of the algorithm. The developer started from the example PPO implementation on the Keras website, which is built to solve the CartPole environment from OpenAI Gym.
After confirming that it could indeed solve the CartPole problem, the agent was hooked up to the custom Gym environment and left to learn the game. It was not that easy, though: a lot of details had to be considered and experimented with to find out what worked best.
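Since the Keras PPO example is a script built around `gym.make("CartPole-v0")`, reusing it mostly came down to dropping in the custom environment and reading the input and output sizes from its spaces; a sketch of that swap (variable names roughly follow the example and may differ in detail):

```python
# In the Keras PPO example:
#   env = gym.make("CartPole-v0")
# For this experiment the custom environment is dropped in instead:
env = MatchThreeEnv(base_url="http://localhost:8080")

num_actions = env.action_space.n           # 98 for the full 7x7 board
board_shape = env.observation_space.shape  # (7, 7); fed to the network after encoding
```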
Lessons Learned From Training the Agent
Start Small
When first connected to the Gym environment, the agent learned absolutely nothing. So the observation and action spaces were shrunk substantially by letting the agent play a 3×3 board with just two colors, and the special items were disabled to simplify the game even further. These changes led to some progress.
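As a usage sketch, with the hypothetical constructor parameters from the environment above (the real game server would need matching options for colors and special items):

```python
# Hypothetical parameters; the game server behind the REST API would have to
# be configured to match (smaller board, two colors, no special items).
env = MatchThreeEnv(
    base_url="http://localhost:8080",
    board_size=3,        # 3x3 board instead of 7x7
    num_tile_states=2,   # two colors, special items disabled
)
```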
Input Encoding
The first encouraging results made it clear how important the shape of the observations and rewards is. Experiments led to a one-hot encoding for the observation space: every possible tile state gets its own dimension (channel) in the input and is set to either 0 or 1. This change alone made a significant difference. It also turned out that rewards should be roughly centered on 0 and kept more or less between -1 and 1.
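A minimal sketch of both ideas, assuming the board arrives from the environment as an array of integer tile states; the reward scale is a made-up constant that would depend on the game's scoring:

```python
import numpy as np

NUM_TILE_STATES = 12  # 5 colors + 7 special items in the full game

def one_hot_board(board: np.ndarray, num_states: int = NUM_TILE_STATES) -> np.ndarray:
    """Turn an (H, W) board of integer tile states into an (H, W, num_states)
    tensor where exactly one channel is 1 per tile."""
    return np.eye(num_states, dtype=np.float32)[board]

def squash_reward(raw_reward: float, scale: float = 100.0) -> float:
    """Keep rewards roughly centered on 0 and within [-1, 1]."""
    return float(np.clip(raw_reward / scale, -1.0, 1.0))
```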
Network Architecture
The PPO CartPole example uses two fully connected hidden layers of size 64. After trying several network configurations for the smaller game board, replacing the first of the two hidden layers with two convolutional layers produced far better results.
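A sketch of what the adjusted network could look like in Keras; the filter counts and kernel sizes are guesses, and only the overall shape (two convolutional layers followed by the remaining dense layer of size 64) follows the description above:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_policy_network(board_size=3, num_tile_states=2, num_actions=18):
    """Policy head for PPO; layer sizes are illustrative, not the exact ones used."""
    inputs = tf.keras.Input(shape=(board_size, board_size, num_tile_states))
    x = layers.Conv2D(32, kernel_size=2, padding="same", activation="relu")(inputs)
    x = layers.Conv2D(64, kernel_size=2, padding="same", activation="relu")(x)
    x = layers.Flatten()(x)
    x = layers.Dense(64, activation="relu")(x)  # second hidden layer kept from the example
    logits = layers.Dense(num_actions)(x)       # one logit per swipe action
    return tf.keras.Model(inputs, logits)
```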
All of these initial changes and experiments produced very good results for the 3×3 board with two colors.
Scaling Is Not Easy
With promising results on this smaller problem, the next step was to scale it up gradually and adjust the network architecture and hyperparameters along the way.
That assumption was correct, but figuring out what was actually needed to scale up was hard. As the game grew, the developer's MacBook Pro could no longer handle the training, so rented GPU time in the cloud was used instead, which made experiments quite expensive. It also soon became clear that an epoch size of 200 would not be enough to get decent results; an epoch size of 40,000 was eventually settled on, so every further experiment took far longer than the first ones.
The toughest problem was that, no matter what was tried, a 7×7 board with five colors simply would not work. At first it seemed that the action space was too large for a board of this size, but training on a 7×7 board with just four colors then proved successful. So the size of the action space could not be the problem after all, since it is determined by the board size rather than the number of colors.
Trying other network architectures, numbers of epochs, learning rates, and other hyperparameters did not help either.
Just as they were ready to give up, they simplified the game once more and had the agent learn only the first move. This produced good results. A network trained this way was then used to play games spanning a few more moves; the agent initially performed badly again, but gradually improved. After some time, this newly trained network was used to play whole games. Once more performance dropped at first, but the agent kept learning until it finally converged in a fairly decent state.
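In effect this is a simple curriculum over episode length. A rough sketch of that idea, where the intermediate episode length is a guess and `run_ppo_training` is a hypothetical stand-in for the training loop from the Keras example; the important part is that the same policy network keeps its weights from one stage to the next:

```python
from gym.wrappers import TimeLimit

# Curriculum over episode length: the network trained on one-move episodes is
# reused as the starting point for longer games and finally for whole games.
curriculum = [1, 5, None]  # max moves per episode; None means whole games

policy = build_policy_network(board_size=7, num_tile_states=12,
                              num_actions=7 * 7 * 2)
for max_moves in curriculum:
    env = MatchThreeEnv(base_url="http://localhost:8080", board_size=7)
    if max_moves is not None:
        env = TimeLimit(env, max_episode_steps=max_moves)
    # `run_ppo_training` is a hypothetical stand-in for the PPO training loop;
    # `policy` carries its weights over from the previous stage.
    run_ppo_training(env, policy)
```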
Integrating the Agent Into the Game
With a trained TensorFlow model in hand, the next step was to watch the agent play the actual game.
Thankfully, TensorFlow Lite allows trained models to be shipped in mobile apps. Once the model was converted to a TensorFlow Lite model and the TensorFlow Lite SDK was added to the game, the agent could finally be watched playing the game.
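Converting the trained Keras model takes only a few lines; a sketch with a placeholder file name:

```python
import tensorflow as tf

# `policy` is the trained Keras model; the file name is just a placeholder.
converter = tf.lite.TFLiteConverter.from_keras_model(policy)
tflite_model = converter.convert()

with open("match3_policy.tflite", "wb") as f:
    f.write(tflite_model)
```

The resulting file can then be loaded through the TensorFlow Lite SDK inside the game and run on the one-hot encoded board.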
So, how did it go? Not as well as an experienced human player. The agent still clearly made mistakes and did not seem to have fully explored all of the game's elements. Its limited observation space also meant it did not know the main goal of each level, which increased its chances of failing a level.
Apart from that, it performed rather well. It even figured out how to use the special items to inflict extra damage.
Possible Uses
Level Design and Balance
Playing with the trained model confirmed a suspicion that levels 6–15 were too hard compared to the levels that follow. An AI like this could be used to estimate the difficulty of each level and then to find a suitable difficulty progression.
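One possible approach, sketched here, is to let the trained agent play every level many times and use its failure rate as a difficulty estimate; `play_level` is a hypothetical helper that runs a single episode against the game with the trained policy:

```python
def estimate_difficulty(policy, level: int, episodes: int = 100) -> float:
    """Return the agent's failure rate on a level as a crude difficulty score.
    `play_level` is a stand-in for playing one full episode and returning
    True if the level was cleared."""
    losses = sum(not play_level(policy, level) for _ in range(episodes))
    return losses / episodes
```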
NPC or Opponent
Perhaps not for this particular kind of game, but for related ones, an AI like this could serve as an NPC or as an opponent that a human player has to contend with.
Hint Assistant
Games like this often show the player hints about possible next moves when they are stuck. A trained AI could be used to generate these hints.
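A sketch of how a trained policy could back such hints: feed the current board through the network and suggest the best-scoring valid swipe, reusing the hypothetical `one_hot_board`, `is_valid`, and `decode_action` helpers from the earlier sketches:

```python
import numpy as np

def suggest_move(policy, board: np.ndarray):
    """Return (row, col, direction) for the most promising valid swipe."""
    obs = one_hot_board(board)[np.newaxis]  # add a batch dimension
    logits = policy(obs).numpy()[0]         # one logit per action
    ranked = np.argsort(logits)[::-1]       # best actions first
    for action in ranked:
        if is_valid(int(action)):
            return decode_action(int(action))
```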
Conclusions
It was an intriguing experience to find out whether reinforcement learning can be applied by a software engineer rather than a data scientist or researcher. It most likely can. However, because this approach takes a lot of time and does not always yield the best results, the use case and possible alternative solutions should be considered carefully.
If this were taken one step further, the action space, observation space, and reward function would most likely change. A suitable reward function is crucial, since it drives what the model actually learns. The reward function was changed a few times during the experiments, although it was always kept fairly simple.
The goal of the current level could be included in the observation space. As already noted, the action space could be reduced further or perhaps split up in some way.
For now, though, this project stops here, as running all of these experiments on cloud GPUs eventually became rather costly.
Beyond improving the model and training it further, it would be fascinating to see what kinds of patterns the convolutional layers of the neural network have learned to recognize. Ways of visualizing this could be explored at some point.