For AAA games there is now a consumer expectation that the developer has a post-release strategy. This strategy goes beyond just DLC. Users expect to receive bug fixes, balancing updates, gamemode variations and constant tuning of the game experience. So how can you architect your game technology to facilitate all of this? Stewart explains the unique patching system developed for Crysis 3 Multiplayer, which allowed the team to hot-patch pretty much any asset or data used by the game. He also details the supporting telemetry, server and testing infrastructure required, along with some interesting lessons learned.
2. So... You have just gone Gold. Copies of the game are being distributed in readiness for
launch day and crunch is finally over o/.
At this point the team has already been ramped down to the last few people, most will have moved on to DLC or the next project and as far as higher management is concerned, the project is done.... Right?
Maybe; there isn't a great deal you can do now anyway as the game has been tested, QA'd and certified.
The reality is, however, that this is the most important time. You are about to have the most people you have ever had play the game and they have paid money expecting an awesome experience. You need to be ready to react to what happens next....
3. ...and what happens next is people play your game. If you are lucky then everyone loves your game, it sells millions, never hangs and is perfectly balanced. This talk is not for those lucky people. What I'm interested in is the real world and this talk is for those who make games that are not perfect.
What I am going to talk about today is all the post-release technologies that you can have at your disposal to keep the players of your game happy and keep them playing for as long as possible.
4. So I should mention that this talk is based on the technologies we developed and evolved during the production of Crysis 3. Crysis 3 was developed by both Crytek Frankfurt and Crytek UK. The UK studio's main responsibility, and my role, was the Multiplayer component, but we are also responsible for CryEngine development for the rest of the studios and third-party licensees.
5. Although I have so far described these technologies as allowing you to directly fix up issues that are discovered, they are not limited to just that.
6. As well as the obvious example of weapon rebalancing that may need to take place, it makes sense that you might want to drip-feed extra content over time to keep people coming back.
With an online multiplayer title, finding and getting into good games is always going to be something that is important to get right. Improving the player's experience and perception of this part goes a long way to keeping players happy, so it makes sense to be able to collect and react to data with regards to this.
When you suddenly have a million people stress testing your system for you it is inevitable that someone will uncover an edge case. Having mechanisms in place to capture data from that one-in-a-million user is crucial to being able to recreate and diagnose that issue so that you have some chance of resolving it.
Fixing bugs would traditionally require a title update in the case of console games and come at a cost in terms of time and money. For this reason we have been limited in the number of these we could facilitate.
Themed weekends are great community events but tend to end up being the standard 'Double XP' style events. What if we could do something better than this?
7. So there are clearly plenty of reasons to justify spending time developing these sorts of
systems but what systems are they?...
8. It's all about having the full cycle in place. You need ways to capture information on a global level and on an individual basis, but the key is turning that into something that you can actually apply back to that individual or to everyone as a whole.
In this case, data patching means being able to update assets and data without the need for a traditional title update and without the user having to perform any actions. Release-debug may sound like a strange concept but it is the ability to gather debug information from builds that are your final released version, or from large-scale tests like closed alphas or open betas.
Telemetry gives you the statistical data that you need to balance the gameplay and also provides a way to quantify user feedback.
9. Hopefully you already have some ideas why we are dedicating so much time to a post-release strategy. While some of those reasons are clear, others are maybe not so obvious...
10. It would be great if we could simulate the real world. We spend a great deal of time trying to get as close to this as we can with alphas, betas, simulations and stress testing, but after all of this we are still human and can still overlook something.
Image Courtesy of Levy & fils -
http://en.wikipedia.org/wiki/File:Train_wreck_at_Montparnasse_1895.jpg
11. Just to show you the kind of testing strategy we had on Crysis 3, this timeline here shows you the last few months of development. We had a closed alpha on PC and an open beta (demo) on all three platforms. For the majority of production we would also take part in what we called Tech 200s. These are events organised by our publisher (EA) which involved a scheduled four-hour testing period with 200+ players from Crytek and as many EA QA departments around the world. One of our big risks was that we were switching to a new game lobby solution, Blaze. The game lobby solution is the technology that provides matchmaking, stats, leaderboards, persistent data storage, etc., and so a lot of our Tech 200 tests were geared around stressing these systems. For example, we would coordinate so that we had 10 minutes of people constantly matchmaking, dropping out, pulling cables and button bashing, just to see if we could break the game. Despite all of this we still didn't quite hit the nail on the head, with matchmaking on day 1 taking 15 seconds on average. The point is that there will always be something you miss, overlook or uncover on day 1.
12. If Microsoft or Sony find a critical cert issue you end up having to resubmit. If we have any sense then we will build a buffer into the release schedule to allow for at least one resubmission. But if we had the ability to turn around and say that we could fix this with a day 1 data patch, and could demonstrate this during certification, then there is potential that the platform holders would look favourably on this.
Handily, there have been some recent news stories (referring to 'The Last of Us' using a Boston tube map without permission) around certain games using textures that they didn't have permission for. Wouldn't it be great if we could react to this and turn around a fix in hours rather than days (or weeks)?
Players go out of their way to find holes in the game and exploit these for competitive advantage, for example...
13. Apart from the player having a different weapon, there is one obvious difference between these two screenshots: the tree is missing in the second screenshot. The reason for this is that we support different levels of graphical specification on PC. What had happened here is that those tree brushes had been marked as high-spec only, so they would not appear if the user had set their graphical settings to medium or less. So naiveBob, a player who is standing on top of the airplane, enjoying high-spec graphics and thinking he can duck in and out of the tree, is actually in plain sight of Player B, who has intentionally dropped his graphical settings to low to gain the advantage. Client-authoritative killing. This is something that we would want to resolve as quickly as possible.
14. ...and without any other mechanism the only way to patch is by deploying a title update. This inevitably means that you have to go through certification and the associated costs. Microsoft's recent announcement that they are dropping the financial charges associated with submitting title updates is a positive step, but the fact remains that turning around a patch will take a week at minimum under most circumstances. This fact alone limits the number of times you can update the game, and what we ultimately want is a solution that allows you to turn around a data patch in a matter of hours.
As an example take a look at the timeline below...
Some key points:
- We had to submit the day 10 patch into cert two weeks before launch
- The final game was finishing cert as the open beta went live
- There was only a six-day window between the open beta going live and us submitting the day 10 patch to certification
All of this represents a very compressed timeline around the launch window, but it is not unusual. What it shows is the impact that these long certifications have on our ability to patch. Relying on title updates only would mean that we had six days to turn around resolutions for any feedback from the open beta in time to get it into the day 10 patch. After that, the next planned update on consoles was DLC. What the consumer sees is the open beta, the final game and then, 10 days later, our first title update. What they expect to see is a progression of improvements based on user feedback. With the timeline shown below we clearly cannot support this with traditional title updates alone. The stat above states that 40% of all the Perforce commits between final cert and RTM (between 8th Jan and 18th Feb) were assets and data. Of course we used a lot of this time to tackle known issues in readiness for the day 10 patch, but a lot of that 40% represents the results of the feedback from the open beta.
In fact, what we did was have a day 1 data patch. We released a number of data patches over the first two weeks of release which addressed issues discovered in the open beta and when the game went live.
What did we data patch? Weapons (bug fixes, damage balancing, recoil), melee fixes, Pinger fixes, perks balancing, stats tracking issues, challenges, controller.
15. It is inevitable that PSU numbers will drop and more games will get released. You may already be competing with other AAA releases based on when you launch. So having ways to continually update the game content to keep it fresh is essential. This doesn't mean whole new levels or brand new gamemodes; that is what DLC is for. The ability to stream new content over a period of days and weeks and keep the playing public interested in between releases of DLC is something that can stop that PSU graph falloff looking so dramatic.
For Crysis 3 we created a new system that would create challenges based on your achievements and play style in the game. This also took into account the playstyle of the players in your friends list. So for example,... But because this was all data driven we could release new challenge sets periodically or even create themed weekends around those if we wanted to.
Some other examples of the types of content we could update include:
Gamemode variants. A lot of functionality was data driven so that the designers could create variations on our 8 gamemodes. For example...
Playlists. Having the ability to create new playlists, coupled with the above variants, was a pretty powerful mechanism.
It may be that your game can already support some of the things described above, but the key is being able to do this in a framework which doesn't require a bespoke system for updating each of these, and having the ability to patch content transparently at the asset loading level.
16. Being able to build themed weekends around new content or temporary changes is something that we have all seen in double-XP weekends, but advertising that this is happening, or happening soon, tends to be via small RSS-style feeds or things outside of the game. But what if you could re-skin the frontend to provide a high-impact message? After all, it's just another asset.
17. Here is an example of how we indicated that a double-XP weekend was active. Now I wouldn't blame you all for maybe having to do a double take on this. It's not immediately obvious. It does demonstrate one point though: even with the ability to patch assets on the fly, it may be the case that its full potential is not realised...
Also notice that we updated the frontend screen with all patches as a visual way of verifying that the patch had been downloaded and applied correctly. Without this it was hard to debug.
18. Here is how I envisaged it would look. Being able to deliver relevant messaging directly to your players in the form of in-game visual updates can be a very powerful concept. In the case of themed weekends it could be used to advertise an upcoming event as well as the event itself.
19. Players are very opinionated. There is not always a consensus in the feedback that we get as developers, but where there is a lot of noise about certain features it's good to be able to address these in a timely manner. Players don't want to concern themselves with the restrictions placed on us by certification procedures or title updates; as far as they are concerned they have raised an issue and therefore expect a resolution NOW!
Image courtesy of – http://www.flickr.com/photos/83394598@N00 Under the following
licence - http://creativecommons.org/licenses/by/2.0/deed.en
20. A lot of our feedback came via the website that supports our game, MyCrysis.com. These are typical forums and cover the full spectrum of colourful feedback. However, having a direct dialogue with those players is a very useful thing. They are very willing helpers and respond really well to the fact that you are investigating an issue or have indeed prepared a fix for it. Even though our data patching system was pretty much transparent to the user, it is still important to announce that changes have been made.
We also engaged a lot of individuals through the forums and used those to help us identify common issues using some of the release-debug features you will see later.
21. So being able to override existing assets with new versions is what we want, so how did we do that? Well, it's actually very simple and it all starts with the game-side asset system...
22. Like any asset system, you need a way to reference files which maps to those on the physical storage media.
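The override idea described here can be sketched as a layered lookup: patch paks mounted on top of the shipped paks shadow any file with the same virtual path, so the rest of the game never knows a patch exists. This is a minimal illustrative sketch, not the actual CryEngine API; the class, method and path names are all hypothetical.

```python
# Hypothetical sketch of a patch-aware asset resolver. Paks are modelled as
# dicts mapping a virtual path to file contents; later mounts win, so patch
# paks downloaded at startup transparently override the shipped originals.

class AssetResolver:
    def __init__(self):
        self.pak_layers = []  # mounted paks, oldest first

    def mount_pak(self, pak_contents: dict) -> None:
        # A pak mounted later shadows earlier paks for any path it contains.
        self.pak_layers.append(pak_contents)

    def open(self, virtual_path: str) -> bytes:
        # Search newest-first so a data patch overrides the shipped asset.
        for layer in reversed(self.pak_layers):
            if virtual_path in layer:
                return layer[virtual_path]
        raise FileNotFoundError(virtual_path)

resolver = AssetResolver()
resolver.mount_pak({"scripts/weapons.xml": b"<original/>",
                    "textures/tree.dds": b"tree"})          # shipped pak
resolver.mount_pak({"scripts/weapons.xml": b"<rebalanced/>"})  # data patch
assert resolver.open("scripts/weapons.xml") == b"<rebalanced/>"
assert resolver.open("textures/tree.dds") == b"tree"  # untouched files fall through
```

The key design property is that the override happens at the asset-loading level, so no bespoke per-system update path is needed.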
28. Why Multiplayer only?
- More suited to multiplayer (we collect lots of telemetry to make sensible decisions; potentially unlimited replay value)
- The additional overhead involved in checking for the existence of patches and then downloading them was deemed too high for people just wanting to play the single-player campaign. Paks would have to be downloaded before any game systems were initialised to get the full benefit of this automated patching. We are talking about up to 2 seconds extra on average, but it could be up to 15 seconds if patches are available.
- TCRs dictate that you have to disable online access based on specific user settings. File save location for cached paks
32. Ultimately we allow for in-memory paks that get downloaded every time. We didn't want to have to handle the case where a user's saved game had ended up corrupt; we silently fail.
45. Collecting telemetry is something many studios now do. It tells you a lot about how people are playing your game and helps you to quantify the severity of issues being reported back verbally via forums etc. Collecting telemetry is relatively simple; collecting telemetry that turns into something useful is where the difficulty lies...
46. So at Crytek UK we collect a lot of telemetry. Much more so in development than we do in release. This is because we use it to track all kinds of performance metrics: bandwidth, CPU spikes, memory and anything we have a budget set for. So we need a game-side system which makes it easy for programmers across the board to submit telemetry. Collecting and maintaining telemetry is a group effort so it isn't something we task one individual with.
The API we have in place is a fire-and-forget mechanism. You call a function with a block of memory or a file and the rest is handled for you. Behind the scenes this data is gzipped and uploaded via HTTP to a destination server. Because the nature of the data we collect is session based, it tends to all be sent at the end of a match. So we accept that not all data we submit will make it. Players choose the end of a match to turn off their consoles or jump back to single player, so it may be that an upload gets cut off midway through sending, as well as the possibility of the connection being flaky. If we really wanted to we could cache to disk and repeatedly upload, but we have never felt the need to add this. In any case it is fine, as we tend to track and analyse large data sets or data over long periods of time and so don't rely on the success of any particular upload.
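The fire-and-forget submit described above can be sketched in a few lines: compress the payload, hand it to an uploader on a background thread, and swallow any failure, since no single upload is relied upon. This is an illustrative sketch, not the Crytek API; `upload` here is a stand-in for the HTTP POST to the telemetry server.

```python
import gzip
import threading

def submit_telemetry(payload: bytes, upload) -> threading.Thread:
    # Fire and forget: caller hands over a block of memory and moves on.
    def worker():
        try:
            upload(gzip.compress(payload))  # stand-in for an HTTP POST
        except Exception:
            pass  # losses are acceptable by design; never block the game
    t = threading.Thread(target=worker, daemon=True)
    t.start()
    return t

# Usage with a fake uploader that just records what it was given.
received = []
submit_telemetry(b"end-of-match session data", received.append).join()
assert gzip.decompress(received[0]) == b"end-of-match session data"
```

Returning the thread is only for demonstration; in the design described, the caller never waits on it.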
47. When the zipped-up files hit our server we don't really do a lot with them. It's pretty much a case of just writing that data to disk. Of course it makes sense to organise the data based on things like the platform, the date and the type of data so that it can easily be batched up for downloading. We also add all uploaded files to a per-type index list so that this can be used to quickly iterate through those files in scripts, and also let engineers get some idea of the numbers of files uploaded of that type.
Why don't we do any processing of this data on the server? Well, because there is really no need to. Everything we want to do with the data can be done offline and on demand. We have no uses of this data which demand any immediate access to it. This is preferable as it lowers risk. Server load is always a concern, and having to accommodate processing data for thousands of users or support long-term storage of data in databases has an associated cost. On Crysis 2 we did have some of these requirements because the data we collected was directly used on the MyCrysis.com website. Things like stats, replays and user accounts were all needed. For Crysis 3 we dropped a lot of these features on the supporting website to the point where there were no longer any dependencies. We did still upload a lot of that data though. A lot of that data actually had two uses: providing content for the game's supporting website and as a basis for gameplay analysis.
48. At this point our telemetry servers don't have to be fast and expensive setups with load balancers and terabytes of data storage. The data we collect gets synced to Crytek servers daily so we can actually define a fixed time period for all data. The only thing left is just making sure that we can accommodate the PSUs we expect to see, and I'll address that point in a minute or two.
49. So we have lots of data and want to do something useful with it. First of all we need to determine what it is we want to know, but really we are only interested in things that we can ultimately change. Assuming we have figured all of that out, we basically write a number of scripts to process the raw telemetry data into an intermediate format which we can then visualise in Excel via pivot tables. For production telemetry this is typically the weakest point in our cycle. Trying to quantify gameplay and reach some objective conclusions is difficult. On the other hand, analysis of development telemetry, where you want to know how you are performing against defined budgets, is much easier. For this reason production telemetry tends to be less automated and guided by our own intuition and feedback from the community. There tends to be very little time spent on writing these analysis scripts and they therefore end up being very brute force, which ultimately means that they take several hours to process.
50. I stated previously that we wanted to avoid load-balancing server hardware if possible, so how do we make sure that we don't flood our target server with data? Well, the solution is to limit what you upload...
51. What we need is a deterministic way to say who can upload, and we do this by sampling our players. If you take a look again at Permissions.xml, which is now turning into a generic config file hosted on a remote server, you can see that we have these entries here at the top. Each one of these represents a type of telemetry file we upload and, alongside it, a sampling ratio. Our ultimate goal is to sample a fixed number of players, and these players should be the same players if possible. To achieve this determinism we hash the user name and then apply some maths to reduce the set of global players down to a manageable size.
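The hash-and-threshold scheme can be sketched as follows: hash the user name into a stable bucket, then upload only when the bucket falls below the numerator of the sampling ratio. This is an illustrative sketch under assumed constants; the hash function, denominator and function names are not the shipped values.

```python
import hashlib

DENOMINATOR = 1_000_000  # illustrative; a large value gives fine-grained ratios

def should_upload(user_name: str, numerator: int) -> bool:
    # Hash the name into a stable bucket in [0, DENOMINATOR), then threshold
    # against the ratio numerator/DENOMINATOR. The same name always maps to
    # the same bucket, so the sampled population is deterministic.
    digest = hashlib.sha256(user_name.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % DENOMINATOR
    return bucket < numerator

# Deterministic: repeated calls agree for the same user.
assert should_upload("PlayerOne", 10_000) == should_upload("PlayerOne", 10_000)
# Monotonic: anyone sampled at a low ratio is still sampled at a higher one,
# so widening the ratio only adds players, never swaps the set out.
names = [f"user{i}" for i in range(1000)]
low = {n for n in names if should_upload(n, 10_000)}
high = {n for n in names if should_upload(n, 50_000)}
assert low <= high
```

The monotonic property is what lets the ratio be tuned over time while keeping the same core set of sampled players.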
52. In our case we wanted to gather about 10,000 samples, but the maths on the previous slide means that you are going to get a fixed percentage of the actual users uploading telemetry. To maintain our target sample count we just need to vary the numerator such that the overall sampling ratio increases or decreases. At the same time, the players you are sampling remain the same set. The reason for such a large denominator is to give you the fidelity to be able to vary the sampling ratio by small increments.
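One way to vary the numerator, sketched here under assumed names and with a simple proportional rule that is not necessarily what shipped: scale the numerator by the ratio of target samples to the samples actually observed in the last period.

```python
# Illustrative control step for holding the upload count near a target.
# Because selection is a threshold on the hash bucket, raising the numerator
# keeps every existing sampler and adds new ones; lowering it removes the
# highest buckets first.

def next_numerator(numerator: int, actual_samples: int, target_samples: int,
                   denominator: int = 1_000_000) -> int:
    scaled = numerator * target_samples // max(actual_samples, 1)
    return max(1, min(scaled, denominator))  # clamp to a valid ratio

# If 20,000 players uploaded but we wanted 10,000, halve the sampling ratio.
assert next_numerator(10_000, 20_000, 10_000) == 5_000
# If only 5,000 uploaded, double it.
assert next_numerator(10_000, 5_000, 10_000) == 20_000
```

The large denominator is what makes these proportional adjustments fine-grained enough to be useful.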
53. Production-wise we capture four sets of telemetry to meet our needs.
1) Player-specific stats
2) Anti-cheat analysis data
3) Match replay data – This is the biggest file and averages around 400 KB. It contains all the compressed player movement data along with all player events, for example shots fired, player kills etc. This is the main source of data for our gameplay analysis but we have also used it to show full match replays.
4) Finally, matchmaking telemetry....
54. Matchmaking telemetry was something we only started collecting for Crysis 3. This was born out of our inability to quantify the feedback we had received in previous titles.
Matchmaking is a tricky thing to get right. On one hand players want to get into a game as fast as possible, but without allowing enough time to evaluate all the data you aren't able to choose the best group of players to join. It essentially boils down to having good ping times, and this is a tradeoff; one that you need to constantly re-evaluate over time as the player numbers change.
At the same time you have to contend with players who may intentionally choose to play with friends who are located on different continents, which only adds to the complexity of the issue.
In any case, the type of feedback received from users complains about high ping times, matching against people in different territories and it taking too long to find a session.
55. So what did we do? For this title we had much more control over the matchmaking algorithms, and their configuration could be driven by client and server data. Unlike Crysis 2, matchmaking was performed atomically by a matchmaking server, as opposed to clients trying to choose the best session to join in a changing environment. On the server side we can configure groups of rules which determine the matching conditions, and each of these rules can have relaxation conditions.
As an example, the ping-site rule associated each user with one of several ping sites around the world. In the initial phase of matchmaking, users would only match against people associated with the same ping site. Over time this condition would be relaxed to allow neighbouring ping sites to match against each other. The times and the configurations of the ping sites could all be driven by data files on the server.
At the same time we also allowed the client to specify which group of rules to use and to change these over time.
There was a reason why we chose to control the matchmaking from both the client and the server perspectives. This was down to the way that the Blaze servers required a full restart for any matchmaking configuration changes, a process which we didn't want to instigate regularly as it would end up disconnecting all games currently in progress. On the other hand, data patching the client was less destructive because the changes would only be picked up by each individual client the next time the user entered multiplayer.
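The ping-site rule with relaxation can be sketched as a predicate that widens with elapsed search time. This is an illustrative sketch only: the site names, neighbour table and 30-second threshold are invented for the example, and the real configuration was data-driven on the Blaze servers.

```python
# Hypothetical relaxation rule: strict phase matches only the same ping site;
# after a configured time the rule relaxes to allow neighbouring sites.

PINGSITE_NEIGHBOURS = {
    "eu-west": {"eu-east"},
    "eu-east": {"eu-west", "us-east"},
    "us-east": {"eu-east", "us-west"},
    "us-west": {"us-east"},
}

RELAX_AFTER_SECONDS = 30  # illustrative relaxation threshold

def pingsite_rule(searcher_site: str, candidate_site: str,
                  seconds_searching: float) -> bool:
    if candidate_site == searcher_site:
        return True                       # same site: always allowed
    if seconds_searching >= RELAX_AFTER_SECONDS:
        return candidate_site in PINGSITE_NEIGHBOURS[searcher_site]
    return False                          # strict phase: same site only

assert pingsite_rule("eu-west", "eu-west", 0)        # strict phase, same site
assert not pingsite_rule("eu-west", "eu-east", 10)   # strict phase, neighbour
assert pingsite_rule("eu-west", "eu-east", 45)       # relaxed phase, neighbour
```

Because both the thresholds and the neighbour table live in server-side data, tuning them is a data change rather than a code change.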
56. Now, with all this configuration possibility, we need to collect some data. The starting point for determining what to collect should always be based on what questions you are trying to answer and what you can actually change. Here are some of those questions.
57. Just because we can collect certain data doesn't mean we should, so we wanted to be sensible about what we collected. The nature of matchmaking is that it is a series of events and user actions that happen over time, so the telemetry we collect should reflect that.
If we timestamp each event against a zero base time then we can easily calculate various stats, such as the time it takes from the first matchmaking attempt to successfully getting into a match.
Attaching metadata to certain events also proved useful as it allowed us to gain extra insight into each stage, such as which type of matchmaking was taking place. We had a number of different ways that you could enter matchmaking, and we had a heavy focus on squad and friend-based play.
58. Take note of:
TotalTimeSearching – The time the algorithm took to find a match successfully.
TotalNumberAttempts – We can see how many times a user matchmakes before entering a game; say, for example, if they back out.
The time between JoinedSession and StartedLevelLoad is how long the user sat in the lobby.
Ping times – We collected ping times to every other player every 30 seconds. These would allow you to formulate a metric about the quality of the game.
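Deriving these stats from zero-based event timestamps might look like the following sketch. The event names mirror those on the slide, but the record format and the exact derivation rules are assumptions for illustration.

```python
# Hypothetical derivation of matchmaking stats from a stream of
# (timestamp_seconds, event_name) tuples, timestamps zero-based at the
# first matchmaking attempt.

def matchmaking_stats(events):
    first_time = {}   # first occurrence of each event name
    attempts = 0
    for t, name in events:
        first_time.setdefault(name, t)
        if name == "StartedSearch":
            attempts += 1
    return {
        "TotalTimeSearching": first_time.get("JoinedSession", 0.0)
                              - first_time.get("StartedSearch", 0.0),
        "TotalNumberAttempts": attempts,
        # Lobby time: JoinedSession until the level actually starts loading.
        "TimeInLobby": first_time.get("StartedLevelLoad", 0.0)
                       - first_time.get("JoinedSession", 0.0),
    }

events = [(0.0, "StartedSearch"), (4.0, "StartedSearch"),
          (12.5, "JoinedSession"), (42.5, "StartedLevelLoad")]
stats = matchmaking_stats(events)
assert stats["TotalNumberAttempts"] == 2   # user backed out once
assert stats["TimeInLobby"] == 30.0
```

With this shape, aggregate questions ("how long do players wait in lobbies?") become simple reductions over many uploaded event streams.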
59. The results were very insightful. They allowed us to tune various server rules, such as the ping-site times, but ultimately any modifications became less effective as the number of players matchmaking decreased.
There was a bug in the client-side logic whereby if the player backed out while matchmaking it would progress immediately to the next set of rules. Our default configuration only had two rule sets, and the second set was relaxed to the point where it would pretty much match you into any game which met the gamemode and map criteria. We used client-side patching to temporarily fix this by injecting more copies of the same strict criteria in sequence so that it was less likely to reach the final relaxed state.
63. Telemetry is all about collecting data en masse. This is great for gauging general tendencies and player preferences, but we also have to deal with issues that affect individual users. We tended to get a lot of people reporting very specific problems with graphical settings and network configurations that resulted in sub-par performance or the inability to enter a multiplayer match. Although you can't offer support to everyone with an issue, being able to quickly isolate the cause of common issues is really useful. Sometimes this means providing a means for getting more specific information about that user's setup or the nature of what is happening, without having to rely on what they tell you. In this case you need some form of debug data...
64. We added the ability to switch on debug output, primarily to give us network state information, in response to the number of issues raised post-release on Crysis 2. This proved invaluable in large-scale public tests, where users could post us a screenshot which would help us isolate the problem. To prevent users from being able to access this functionality in the general case, it was enabled on a per-user basis via user entitlements on the account.
65. Another simple way to get more context on an error is to provide the actual error code which led to a particular fail case. If you have ever been involved with TCR compliance, you will know that the messages that you present to your users end up becoming so generic that they lose any meaning. It's therefore very useful to get direct access to these error codes, which don't have to be that prominent. In this case we displayed the Blaze error codes directly, which were already very specific and could lead us back to just a couple of potential areas of code.
It's worth mentioning at this point that although it was great to have these error codes in place, we actually didn't end up using them that much. This was in part down to using a proven technology like Blaze as our game lobby solution. But it illustrates well the point that you may spend time developing certain aspects of this technology and potentially never use it. A lot of what we are doing here is preparing for the unknown, and sometimes that means you don't always need everything you create.