Impressions Of The GDMC AI Settlement Generation Challenge In Minecraft

23 June 2022

The GDMC AI settlement generation challenge is a PCG competition about producing an algorithm that can create an “interesting” Minecraft settlement for a given map. This paper contains a collection of written experiences with this competition, by participants, judges, organizers and advisors. We asked people to reflect both on the artifacts themselves and on the competition in general. The aim of this paper is to offer a shareable and edited collection of experiences and qualitative feedback, which seems to contain many insights on PCG and computational creativity, but would otherwise be lost once the output of the competition is reduced to scalar performance values. We reflect upon some organizational issues for AI competitions, and discuss the future of the GDMC competition.

The GDMC AI Settlement Generation Challenge [20, 19] in Minecraft [18] is an annual (since 2018) competition, where participants submit code capable of generating a settlement on a given Minecraft map. The submitted generators are then applied to three different maps, previously unseen by the participants and provided by the organizers. Making generators that can adapt to various types of terrain, and even produce settlements that reflect the peculiarities of said terrain, is an essential part of the GDMC challenge. Another key element of the challenge is that there is no computable function that determines the quality of the generated settlement - the algorithm has to create a design appropriate to a given scenario with ill-defined goals [24, Chapt. 8.]. For evaluation, the maps are sent out to a range of human judges, including experts from fields such as Minecraft Modding, Game Design, AI and Games research, Architecture, City Planning, and volunteers who applied on their own initiative. All judges are asked to look at the three maps for each entry, explore the settlements, and then score each generator from 0 to 10 in four categories - Adaptivity, Functionality, Evocative Narrative and Aesthetics. The judges are given a list of questions that illustrate the categories (see [20]). In short, Adaptivity is about how well the generated settlement changes in reaction to different input maps. Functionality is about both the game-play and the fictional affordances provided by the settlement. Evocative Narrative concerns how well the settlement tells a story about the people who supposedly live in it and how it came about. Aesthetics is less about how beautiful the settlement is, and more about avoiding the very simple design errors that are immediately obvious to a human but not to an algorithm.
Scores range from 0, for no discernible effort, through 5, for settlements where it becomes unclear whether they were made by a human or an AI, to 10, for an artifact that looks superhuman, or could only have been built manually by a team of experts with a lot of time. The detailed guide for the judges is also available online for competitors as a reference. For detailed criteria and other information see: http://gendesignmc.engineering.nyu.edu/.
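As an illustration, the aggregation described above - per-judge scores of 0 to 10 in four categories, averaged into a single value per entry - can be sketched in a few lines. The data structure and function name below are our own illustrative choices, not the organizers' actual tooling.

```python
# Illustrative sketch of the score aggregation described above: each judge
# scores an entry 0-10 in four categories, and the average across all judges
# and categories yields the entry's final score. Names are illustrative only.

CATEGORIES = ["Adaptivity", "Functionality", "Evocative Narrative", "Aesthetics"]

def average_score(judge_scores):
    """judge_scores: list of dicts mapping category name -> score (0-10)."""
    all_scores = [s[c] for s in judge_scores for c in CATEGORIES]
    return sum(all_scores) / len(all_scores)

entry = [
    {"Adaptivity": 6, "Functionality": 5, "Evocative Narrative": 4, "Aesthetics": 7},
    {"Adaptivity": 5, "Functionality": 6, "Evocative Narrative": 5, "Aesthetics": 6},
]
print(average_score(entry))  # 5.5
```

As discussed below, reducing the judges' rich feedback to this one scalar is exactly the tension the competition design tries to avoid.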

We originally had several aims when designing this competition [28]. For one, we wanted to increase interest and stimulate work in the field of procedural content generation (PCG) [23, 5, 4, 22, 16] and computational creativity [4]. While we wanted the competition to be accessible to students and the general public, we also wanted it to serve as a test bed to try out and compare different PCG approaches and techniques. In contrast to other “citizen science” approaches, we were not just interested in the general public working for us, but were genuinely hoping that an open approach to this problem might lead to the development of new ideas that we, in academia and industry, could learn from. To this end, we tried to design a competition that is not biased towards a certain approach or technique, such as deep learning or genetic algorithms, but rather provides a level playing field, as far as this is even possible [2].

The lack of a clear computational quality metric was also a deliberate design feature and motivation of the GDMC competition; as a secondary goal, the human evaluations presented here might inform the necessary features of such computational measures, as a future engineering challenge.

Many modern PCG methods are based, directly or indirectly, on a notion of optimizing an objective function. While this line of thinking has seen many successes, it also limits what content generation can be to what can at present be quantified. If we based our competition on fulfilling some well-defined objective, we would be liable to the effects of Goodhart’s law [7, 27], according to which a well-defined measure ceases to be useful when it starts being used as a target. In other words, competitors might have created generators that achieved a high value according to an objective function while generating settlements that were unsatisfying to a human. Additionally, creating a meaningful yet well-defined function to optimize for also proved quite hard [1].

The approach we chose can instead be likened to efforts to create open-ended [25, 8] agent behavior in artificial intelligence. As far as we are aware, all existing open-ended AI research is concerned with the behavior of agents in an environment; this competition is an attempt to bring the open-ended mindset to creating generators of environments. The GDMC competition also differs from existing PCG work and competitions [14, 13, 26] in that it focuses on holistic and adaptive content generation. Holistic PCG means that we are not looking at the generation of one aspect of the environment on its own, but rather at all of them together: buildings, paths, natural features, backstories, and so on. Adaptive PCG means that the generators must work with a complex input that they have been given. In this case, it means that the generators are provided with maps (unseen by the generator designers) and must generate a settlement that works with that map. This requirement was in part introduced to counteract the submission of generators that simply create the same settlement over and over (in the extreme, a “generator” could be a single, a priori and manually designed settlement). However, the topic of adaptive PCG is an interesting one in its own right, so we decided to lean into this aspect.

On the critical side, it is somewhat ironic that after deliberately not reducing the “creative” output of the generators to a simple scalar value, we ask the judges to score them on a scale from 0 to 10 which, after calculating the average, is used to determine the winner. It is telling that several of our participants were actually not that interested in their scores - but showed much more appreciation of the additional notes and written feedback provided by the judges. In the first year this feedback was only sent directly to participants, but in later years we also published the feedback given to the participants on our website and on Discord for all involved parties. It quickly became evident that it contained many interesting thoughts, anecdotes, etc. that were deeply insightful for the advancement of both the GDMC competition in particular and computational creativity in general.

This paper is an attempt to collect, preserve, summarize and discuss this feedback, and then provide it in a publishable form to those interested. To this end, we have contacted the judges and participants from the previous years and asked them to provide us some form of written impressions about their experience with the competition. We also allowed for submissions from our community, by advertising this project on our social media channels and via the GDMC’s community Discord channel. For all those wanting to participate, we provided a list of question prompts (see Appendix A), but also expressed our interest in any written text about their experience with the competition. The question prompts were written to elicit responses related to the artifacts (both positive and negative aspects), but also to the competition itself, the way it is judged, and how it relates to computational (co-)creativity. Participants were given the freedom to address any of these points, and answer all, some or none of the questions. We collected all submitted texts, performed minor formatting but no content edits, and now show them in Appendix B.

In the remainder of this paper, we provide a general overview of earlier findings on the importance of textual and less structured feedback, and summarize insights from the experience write-ups for this paper specifically. This write-up also relies on discussions we had on various social media channels, which can all be found via our website: http://gendesignmc.engineering.nyu.edu/.

2 Summary

2.0.1 Participation

The GDMC competition has a growing number of participating teams, with 4, 6 and 11 submissions in 2018, 2019 and 2020, respectively. Among our competitors are hobbyists and academic researchers, and in particular university students who have created a generator as part of their coursework. Some of the past approaches that have been used or developed for the GDMC competition have been published as peer-reviewed papers [9, 29].

2.0.2 Code Reuse

Encouraging modular development, facilitating the reuse of solutions and lowering the barrier to entry were also inter-related points cited by various respondents, and were seen as both positive and negative. As the competition unfolded over the years, techniques for solving common problems such as calculating heightmaps, removing trees and path-finding had to be independently implemented and re-implemented by various participants. Our judges reported that the establishment of best-practice solutions led to an increase in overall quality, as some things are solved, and work that goes beyond these basics can be attempted. Others criticized the lack of creativity and surprise they experienced when seeing the same building or solution over and over again - such as the iconic high-rises first introduced by Eduardo Hauck in 2019. As returning participants incorporate more and more of these solutions in their settlements, this could create a perceived barrier to entry for new participants if they feel that these features are needed to compete. While we encourage participants to make their code public after submission, reverse-engineering an existing solution and incorporating it into a new settlement are non-trivial tasks.
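As a concrete example of one such commonly re-implemented task, a heightmap can be computed by scanning each column of the world downward for the highest block that is neither air nor part of a tree, so that generators place structures on the actual terrain surface. This is a minimal sketch with an illustrative world representation, not code from any actual entry or framework.

```python
# Minimal heightmap sketch: for every (x, z) column, find the y of the
# highest block that is not air and not a tree block, since settlement
# generators usually want the terrain surface rather than the canopy.
# The nested-list world representation here is illustrative only.

AIR = "air"
TREE_BLOCKS = {"log", "leaves"}

def heightmap(world):
    """world[x][z] is a list of block names from y=0 upward."""
    hm = [[0] * len(world[0]) for _ in world]
    for x, row in enumerate(world):
        for z, column in enumerate(row):
            for y in range(len(column) - 1, -1, -1):
                if column[y] != AIR and column[y] not in TREE_BLOCKS:
                    hm[x][z] = y
                    break
    return hm

world = [[["stone", "dirt", "log", "leaves", "air"],
          ["stone", "dirt", "grass", "air", "air"]]]
print(heightmap(world))  # [[1, 2]] - tree blocks are skipped in the first column
```

A shared, documented module for tasks like this is exactly the kind of reusable building block discussed above.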

We, the organizers, are interested in facilitating positive code reuse by identifying some of these tasks with high reuse potential and providing modules to address them. Examples of how to incorporate these modules, such as tutorials or complete settlement generators showcasing the modules, would be desirable. It would also be possible to provide incentives for participants and the larger community to contribute to the effort, e.g. by creating categories for the best code documentation or the best standalone modules. On a related note, the entry barrier for the Chronicle Generation challenge [21] is currently particularly high, as it requires a team with proficiency in both settlement generation and storytelling. The availability of such modules could particularly benefit teams with skills in only one of these domains.

2.0.3 Large Scale Adaptation

Another open question is how to delegate more responsibility for high-level decisions to the generator and thus “climb the computational creativity meta-mountain” [3]. Currently, designers are typically responsible both for choosing a high-level theme and for translating that theme into elements such as structures, materials, decorations, etc. To a large extent, the responsibilities delegated to the generator occur at a lower level, such as selecting suitable spots with appropriate terrain for the structures, laying down paths between various points of interest, and diversifying the settlement by combining the elements in different ways.

While the delegation of creative responsibilities is an ongoing challenge in the field of computational creativity [4] and not exclusive to GDMC, one respondent of the survey highlighted a way in which the judging process discourages participants from investing development time into the delegation of high-level tasks: a generator making a high-level decision runs the risk not only of making a poor choice on a given evaluation map, but also of not showcasing a fair share of its high-level capabilities within a limited number of evaluation maps, especially considering the chance of repetition of high-level themes. A similar problem is often cited as preventing the use of interactive or branching narratives in commercial computer games: something that is great might never be seen by the player. Increasing the number of evaluation maps enough to mitigate this issue would likely be infeasible without placing an undue burden on judges under the current process, but might be feasible under a potential crowd-sourced judging process.
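The risk the respondent describes can be made concrete with a back-of-the-envelope calculation: if a generator picks uniformly among n high-level themes, the chance that any particular theme is never shown across k evaluation maps is (1 - 1/n)^k. The numbers below are purely illustrative, not drawn from any actual entry.

```python
# Illustrative back-of-the-envelope calculation: the probability that a
# particular high-level theme is never shown, if the generator picks
# uniformly among n_themes themes on each of n_maps evaluation maps.
def p_theme_unseen(n_themes, n_maps):
    return (1 - 1 / n_themes) ** n_maps

# With 5 themes and the 3 evaluation maps used by GDMC, each theme is
# missed by the judges about half the time:
print(p_theme_unseen(5, 3))  # ~0.512
```

This is why a richer high-level repertoire can paradoxically look worse under a small evaluation sample, as the text above argues.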

2.0.4 Qualitative Feedback

Already in the first year it became very clear how powerful the textual feedback would be in this competition. Themes such as a lack of bridges over water, the absence of light, or big, unique set pieces would pop up in the qualitative feedback, be discussed in the social channels, and then either reappear or be solved in subsequent years. In many cases the discussion would identify some features of the previous year’s winner, and subsequent submissions would aim to incorporate them. At the urging of several community members, we now also make the textual feedback accessible to a wider public with potential future participants. The textual feedback was also, in parts, surprising to the organizers, and showed that there are elements in the evaluation we did not previously consider.

2.0.5 Video Judging

Two of our judges (Jupiter Hadley and Tommy Thompson) used YouTube and similar platforms to broadcast their judging sessions. This was received very positively. Many of our later participants reported that this was how they found out about the competition. In particular, the live judging session with Dr. Thompson (Fig. 1) was well attended online, saw lively discussions, and was attended by several of the competitors. Comments indicate that for many of the participants, winning is less important than showcasing their work and receiving attention and feedback from professionals. When we discussed potential prizes that would work for our diverse participants, one suggestion was the ability to speak to a game AI professor about their work for 15 minutes and get some advice or feedback.

Observing the live judging session also provided great insight into the importance of genuinely interacting with the created artifact. All our judges are encouraged to actually walk through the settlements, but we, of course, cannot check that. Seeing someone actually interact with the generated artifacts, and exclaim in excitement, was illuminating in figuring out what parts or elements evoke emotions in people. This, and exploring the settlements oneself, also gives an opportunity to experience some of the more ephemeral effects of interacting with PCG. As one commenter pointed out, there is a slightly different relationship with PCG artifacts compared to human-made artifacts when it comes to ownership: changing someone else’s Minecraft settlement without their permission is considered rude, while doing so to an AI’s is less so. This was also evidenced by Dr. Thompson in his live judging session, as he apologized for smashing apart houses, only to then realize that it probably does not matter. This raises the interesting question of whether this perceived need for protection, which we usually assign to pieces of art and creative works by humans, could give us a yardstick to measure the quality and human-likeness of PCG artifacts.

Finally, the comments on the video judging also demonstrated how the cultural and biographical backgrounds of a person matter for their relationship to PCG artifacts. This is not surprising in itself - but it can be seen as positive in the sense that the complexity of the artifacts designed here has risen to a level where it starts to matter. This was also addressed in ICE_JIT’s submission text, which explicitly discusses the Japanese influences used for their generator.

2.0.6 Evaluation

We specifically queried people about our evaluation methodology, which stands apart from what is used by most existing AI competitions. Not only do we not use an objective evaluation function, we even forgo the use of well-established human-centric methods, such as ranking [31], to establish the “best” solution. This is, in part, due to the fact that it is unclear if there really is a best generator, or if several interesting and great solutions can stand side by side. Several GDMC community members have suggested that we should embrace this concept more fully and forgo having winners altogether, and rather go for a “festival of ideas” approach, where we only give qualitative feedback and celebrate different and interesting ideas. Other suggestions are to introduce achievement tiers, where scores are still given, but the aim is to get to a certain level rather than to beat others. Finally, there are repeated suggestions to crowd-source the evaluation. We do already publish the generated settlement maps and most generators, and have in the last year even started to host a public server that contains a composite map showing all submitted generators. However, this is only for a post-judging evaluation by the public. Setting up a crowd-based evaluation would, if successful, provide us with a statistically sounder evaluation of the settlements, but would probably, against our goals, yield a more results- and less feedback-focused evaluation.

The evaluation debate also brought up some other issues, namely that it is unclear what exactly the ideal generated settlement is: should it be similar to what humans would actually build in a Minecraft game - which can often be quite chaotic and aesthetically jarring - or should it resemble a well-designed piece as often produced by professional speed builders? These issues have also been raised in regard to functionality as a criterion, where the actual functionality of a settlement, such as protection from danger, keeping mobs from spawning, navigability for a human and function blocks, is not the same as having a functional-looking city that has car lanes or canals in a world without cars or ships. Several of our criteria are just about an imagined functionality, such as a trading harbor, even though there is no need for it in game-mechanical terms.

2.0.7 Future Direction

Apart from revising the evaluation criteria, there is also the desire to expand the scope of the competition. We already introduced a bonus competition last year, the Chronicle Challenge [21], which asks participants to also produce a book in the Minecraft settlement that tells a meaningful story about the settlement, hence addressing challenges in computational storytelling [6]. There are also requests to permit larger settlements - which would allow for differently scaled building styles. We also discussed several additional forms of adaptivity - such as having a generator continue to build a small settlement started by a human, or to have one settlement generator applied after another. In 2021 we will experiment with both having larger evaluation maps, and maps that have smaller hand-made settlements already present.

There have also been discussions on moving further towards embodied, co-creative aspects [15] - by having actual agents that build settlements in the game, or that could build together with humans. There has not been much development on this front, as we believe this would be technically challenging not only for us to set up, but also for participants to get into. Some initial steps could be taken by cooperating or integrating with other existing Minecraft competitions. For example, there is a reinforcement-learning-focused competition for Minecraft bots [17, 10], and it might be interesting to see if they could operate in procedurally generated settlements. Similarly, there is now a framework for using evolutionary algorithms to build machines in Minecraft [8], and it might be worthwhile to see if that framework could be used for our purposes. We are also currently looking at using additional or new frameworks, beyond the MCEdit [30] and Java mod-based approaches that we embrace at the moment. In 2021 we will be using a new framework, developed by one of our community members, that allows for interaction with a live Minecraft world via an HTTP interface. This will allow competitors to write clients in a range of programming languages, and would also allow for editing the world live, letting a player observe the generation process as it unfolds. While this is not an evaluation format planned for the coming year, it is something that has been requested several times, as it would allow for the generation of an experience similar to watching the popular time-lapse videos of teams building Minecraft settlements.
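To illustrate the kind of language-agnostic client such an HTTP interface enables, here is a minimal sketch that constructs a block-placement request. The endpoint path, port and payload format are our own hypothetical placeholders for illustration, not the actual API of the community framework; its documentation should be consulted for the real interface.

```python
# Hypothetical sketch of a client for an HTTP-based Minecraft interface.
# The endpoint path and payload format below are placeholders; the real
# framework's API may differ. Only standard-library modules are used.
import json

def build_put_block_request(host, x, y, z, block):
    """Return (method, url, body) for placing one block at (x, y, z)."""
    url = f"http://{host}/blocks?x={x}&y={y}&z={z}"
    body = json.dumps({"block": block})
    return "PUT", url, body

method, url, body = build_put_block_request("localhost:9000", 10, 64, -5,
                                            "minecraft:stone")
print(method, url, body)

# Against a running server, the request could then be sent with, e.g.:
#   import urllib.request
#   req = urllib.request.Request(url, data=body.encode(), method=method)
#   urllib.request.urlopen(req)
```

Because the client only speaks HTTP, the same settlement generator logic could be written in any language with an HTTP library, which is precisely the accessibility benefit described above.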

Furthermore, using this HTTP client would also allow us to update to the current Minecraft version. There is also hope that a PCG generator built in a framework that can be directly modded into the game might see more use once developed.

2.1 Conclusion

Overall, we received a lot of positive and constructive feedback - both from participants and the wider academic community. We are planning to keep developing the GDMC competition, and will organize and present the 2021 round of the competition at the Foundations of Digital Games conference.

Acknowledgments: We thank the many unnamed participants and GDMC community members whose hard work made the generators discussed here possible. CG is funded by the Academy of Finland Flagship programme Finnish Center for Artificial Intelligence (FCAI). RC gratefully acknowledges the financial support from Honda Research Institute Europe (HRI-EU).
