Optimize compute resources for multiplayer games: bare metal, cloud or hybrid?

Optimizing Compute for Multiplayer Games
14 February 2023

 
Finding the right compute solution for your multiplayer game can be simple or difficult, depending on how you go about it. Searching for the topic online often turns up contradictory posts, each favoring the company that wrote it. Many of the claims made on the internet are no longer relevant today, are covered in a thick marketing sauce, or simply do not apply to your use case. That is what happens when compute has truly become a commodity: many providers try to simplify it or put their own branding flash on top to stand out from the crowd.

Pros and cons for bare metal and cloud

We will not list why bare metal is better than cloud or vice versa, nor the differences between virtual machines (generally used by cloud providers) and bare-metal machines or dedicated servers, as you probably know the most common ones, such as noisy neighbors or the hypervisor tax, as described here. Instead, we’ll go a layer deeper and figure out what matters most when you are developing a multiplayer game, list some things to look out for when selecting a provider, and debunk some common answers you’ll find on the internet when looking for the differences between cloud and bare metal for multiplayer games.

So, there are obviously pros and cons to both approaches, bare metal and cloud, for running a multiplayer game’s compute resources. Let’s start with the basics and some factors to consider for both approaches:

  • Leveraging bare metal means running your game build on physical servers as opposed to virtual servers in the cloud (although the actual hardware for both tends to be the same). Bare metal tends to offer better performance because there is no hypervisor layer virtualizing the physical server, so all of the machine’s resources go to your game, which runs directly on the hardware. For fast-paced games with high bandwidth requirements this is a good thing. For turn-based games, though, do factors such as hypervisor tax and noisy neighbors really matter?
  • Running your game with a cloud provider can bring benefits in scalability: virtual servers can usually be provisioned and resized quickly, allowing you to scale up and down. This is useful if your multiplayer game has a variable number of concurrent users, or peak usage times during the evenings or weekends. Yet cloud servers can have higher latency and less predictable performance than bare metal, depending on the workload and the actual service provider.
 

Both the pros and the cons have become slim by today’s standards. Bare metal providers have become more flexible, letting you provision physical machines through their APIs, Terraform and other tools, whereas most cloud providers don’t scale as easily and limitlessly as we once thought, and the overhead of virtualization rarely bothers anyone.

So, with that, we’d reach the same conclusion as most articles on the internet: ultimately, the best option depends on your needs and constraints. Certain questions would then guide your decision-making process. Examples include: What type of game are you building? What is the scale of your game and your budget? How much expertise do you have when it comes to infrastructure?

We’d like to challenge that a bit, however, and talk about some of the smart ways in which providers of both bare metal and cloud add value to their commodity product.

Assess the true value of your compute provider

Interconnection with ISPs

At the end of the day, your compute resources are accessed by both your engineers and your players through the internet. The way the internet works, what matters is not always the physical distance between the server (physical or virtual) and the player, but how the player’s home internet service provider connects to the server. We’ve seen examples across two different networks where server x was in the same country as the player and server y in the neighboring country, yet server y provided lower latency to the player simply because of how the internet service provider was interconnected in the region. This is also the reason why “edge” does not automatically translate to lower latency: there is a point of diminishing returns in putting all your servers at the edge. So don’t just look at the number of locations where a provider is available. 100 locations may sound substantial, but it pays off to question what their local network looks like and which networks they peer with locally. Usually you’ll find that the more generalized companies are not as interconnected with internet service providers as the more specialized companies are, so ask for their local peers and run some ping tests and My Traceroute (MTR) reports to figure out what works for you.
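
One way to act on those ping tests is to compare measured round-trip times per candidate location rather than geographic distance. The following is a minimal Python sketch of that comparison, not a measurement tool: the location names and RTT values are invented placeholders, and the `summarize` and `rank_locations` helpers are hypothetical. In practice the samples would come from your own ping or MTR runs taken from the networks your players actually use.

```python
import math
from statistics import median


def summarize(samples_ms):
    """Summarize RTT probes for one location; None marks a lost probe."""
    received = sorted(s for s in samples_ms if s is not None)
    if not received:
        raise ValueError("no successful probes to summarize")
    # Nearest-rank 95th percentile: smallest value covering 95% of probes.
    p95 = received[math.ceil(0.95 * len(received)) - 1]
    return {
        "median_ms": median(received),
        "p95_ms": p95,
        "loss": 1 - len(received) / len(samples_ms),
    }


def rank_locations(probes):
    """Order locations by median RTT, breaking ties on the 95th percentile."""
    stats = {name: summarize(samples) for name, samples in probes.items()}
    return sorted(
        stats.items(),
        key=lambda kv: (kv[1]["median_ms"], kv[1]["p95_ms"]),
    )


# Hypothetical probes from a single player ISP (values are made up).
probes = {
    "same-country": [38.0, 41.0, 40.0, None, 44.0, 39.0],
    "neighbor-country": [21.0, 22.0, 20.0, 23.0, 21.0, 22.0],
}

ranking = rank_locations(probes)
for name, s in ranking:
    print(
        f"{name}: median={s['median_ms']:.1f} ms, "
        f"p95={s['p95_ms']:.1f} ms, loss={s['loss']:.0%}"
    )
```

With these placeholder samples the neighboring-country location ranks first, mirroring the server x and server y example. Running the same comparison per player ISP, rather than once overall, is what surfaces peering differences.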

Software

How intuitive is their product for you to use? Are their APIs clear and easy to integrate into your own backend? What about the data flow between the compute resources and your backend? To what extent do you have to manage the product yourself?

Scalability and flexibility

Both bare metal and cloud providers have become scalable and flexible to a similar degree. Cloud providers once seemed endlessly scalable, but there have been many occasions where one of a provider’s sites was sold out; at the end of the day, those are also physical servers. Admittedly, not every game can cause a cloud region to sell out, so this is a bit of a moot point. On the other side, quite a few bare metal providers have built their product to offer the same scalability and flexibility as the cloud providers, allowing you to use bare metal resources for just an hour, without a commitment, scaling up and down as you go.

Support

Your multiplayer game is a live service, which is something to take into consideration when leveraging compute resources. At the end of the day, you do not own but rent, so if something happens you need to be able to pick up your phone and dial the person who can hurry up and help you get your servers back online. Otherwise your hard-earned CCU will drop, and people will start flocking back to FIFA or whatever game they were playing before. Before deciding, see how deep you can go within the organization: is the VP of network available to troubleshoot a certain region with high latency for you? How does the escalation matrix work, and how quickly can they get your compute resources back online if something goes wrong?

The costs of compute resources for multiplayer games

Comparing the cost of compute resources with various providers is a trickier exercise than just comparing compute prices.  

Bandwidth

You must consider bandwidth: if you have 400 players on an instance averaging 150 kbit/s of traffic, the bulk of your cost will come from bandwidth, not from the hardware resources.
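
To see how quickly that traffic adds up, here is a rough back-of-the-envelope sketch in Python. It rests on several assumptions that are not stated in the example: that the 150 kbit figure is per player and per second, that the instance stays full around the clock for a 30-day month, and that egress is billed per gigabyte. The price below is a made-up placeholder, not a quote from any provider.

```python
def monthly_egress_tb(players, kbit_per_player, days=30):
    """Sustained egress for one instance, in decimal terabytes."""
    bits_per_second = players * kbit_per_player * 1_000
    bytes_total = bits_per_second / 8 * 86_400 * days
    return bytes_total / 1e12


PLAYERS = 400              # from the example in the text
KBIT_PER_PLAYER = 150      # assumed to mean kbit/s per player
EGRESS_USD_PER_GB = 0.05   # hypothetical price, replace with a real quote

throughput_mbit = PLAYERS * KBIT_PER_PLAYER / 1_000
egress_tb = monthly_egress_tb(PLAYERS, KBIT_PER_PLAYER)
egress_cost = egress_tb * 1_000 * EGRESS_USD_PER_GB

print(f"Sustained throughput: {throughput_mbit:.0f} Mbit/s")
print(f"Monthly egress: {egress_tb:.2f} TB")
print(f"Egress cost at the placeholder price: {egress_cost:.0f} USD")
```

Under these assumptions the instance pushes 60 Mbit/s and roughly 19.4 TB per month. Whether that outweighs the hardware cost depends entirely on how your provider actually prices bandwidth, which is why it belongs in the comparison.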

Workforce

Furthermore, what internal overhead will you need in staff to manage the environment? Even with managed compute resources or edge providers, you cannot be fully hands-off. You will still need people on board who understand how it runs, integrate things properly into the backend, and work with the providers’ APIs to set it up properly and keep it online.

Maximizing compute resource efficiency

Then there are a bunch of opportunity costs you should consider:

The hybrid solution approach

Taking these factors into account is important when making your decision. As we have established, neither product is perfect for running multiplayer games, and each brings its own advantages. The hybrid approach is therefore an interesting and viable strategy for hosting your multiplayer game: you run your base, or minimum number of concurrent users, on bare metal for cost control and quality control of your always-on machines. You then scale into various public clouds, or into bare metal providers that offer cloud-like flexibility, to ensure capacity for the traffic peaks on weekends and evenings, scaling all those extra resources down immediately after the peak is over. That is not something you’d have to build yourself: numerous providers offer server orchestrators that take care of just that. Some only scale into their own compute resources, others use a mix of owned compute resources and public cloud, and others are purely software that scales only into compute resources they do not own.
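
The split between always-on and burst capacity can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular orchestrator works: the `split_capacity` helper, the hourly CCU figures and the 400 players per server are all hypothetical, and a real setup would add headroom, regions and start-up time.

```python
import math


def split_capacity(ccu_by_hour, players_per_server):
    """Split demand into an always-on baseline and per-hour burst servers."""
    needed = [math.ceil(ccu / players_per_server) for ccu in ccu_by_hour]
    # Baseline covers the quietest hour, so it is never idle capacity.
    baseline = min(needed)
    burst = [n - baseline for n in needed]
    return baseline, burst


# Hypothetical CCU forecast for six sample hours of a day.
ccu_forecast = [1200, 900, 800, 1500, 3200, 4100]

baseline, burst = split_capacity(ccu_forecast, players_per_server=400)
print(f"Always-on bare metal servers: {baseline}")
print(f"Burst servers per hour: {burst}")
```

Here the baseline would run on bare metal, while the burst list is what gets scaled into flexible capacity and released once the peak has passed.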

The hybrid model allows you to keep iterating on your compute strategy. As your game’s CCU changes over time (up or down), it makes sense to review your compute strategy regularly rather than picking one provider at the outset and letting that run for years, which could really hurt your bottom line, lock you in, close off certain geographical areas for your game and hurt quality.

To conclude, determining which compute resources to use for your multiplayer game is not an easy decision. With so many products and services out there, there are many ways to go about it, yet that doesn’t make your choice any easier. So start with the basics: what do you really need, and how can you mold all the options into a global server fleet that communicates perfectly with your backend, so you can easily patch new game builds and limit issues to keep your players engaged?

Main Take-Aways

Optimizing your multiplayer game’s compute resources requires understanding the true value of a compute provider, such as its level of interconnectivity with ISPs, as well as the true costs of compute resources and where you can maximize efficiency.