Sunday, January 31, 2016

Hot Reloadable JavaScript, Batman!

JavaScript is my new favorite prototyping language. Not because the language itself is fantastic. I mean, it's not too bad. It actually has a lot of similarity to Lua, but it's hidden under a heavy layer of WAT!?, like:

  • Browser incompatibilities!?
  • Semi-colons are optional, but you "should" put them there anyway!?
  • Propagation of null, undefined and NaN until they cause an error very far from where they originated!?
  • Weird type conversions!? "0" == false!?
  • Every function is also an object constructor!? x = new add(5,7)!?
  • Every function is also a method!?
  • You must check everything with hasOwnProperty() when iterating over objects!?

But since Lua is a work of genius and beauty, being a half-assed version of Lua is still pretty good. You could do worse, as languages go.

And JavaScript is actually getting better. Browser compatibility is improving, and automatic updates are a big factor in this. And if your goal is just to prototype and play, as opposed to building robust web applications, you can just pick your favorite browser, go with that and not worry about compatibility. The ES6 standard also adds a lot of nice little improvements, like let, const, class, lexically scoped this (for arrow functions), etc.

But more than the language, the nice thing about JavaScript is that it comes with a lot of the things you need to do interesting stuff -- a user interface, 2D and 3D drawing, a debugger, a console REPL, etc. And it's ubiquitous -- everybody has a web browser. If you do something interesting and want to show it to someone else, it is as easy as sending a link.

OK, so it doesn't have file system access (unless you run it through node.js), but who cares? What's so fun about reading and writing files anyway? The 60's called, they want their programming textbooks back!

I mean in JavaScript I can quickly whip up a little demo scene, add some UI controls and then share it with a friend. That's more exciting. I'm sure someone will tell me that I can do that in Ruby too. I'm sure I could, if I found the right gems to install, picked what UI library I wanted to use and learned how to use that, found some suitable bundling tools that could package it up in an executable, preferably cross-platform. But I would probably run into some annoying and confusing error along the way and just give up.

With increasing age I have less and less patience for the sysadmin part of programming. Installing libraries. Making sure that the versions work together. Converting a configure.sh script to something that works with our build system. Solving PATH conflicts between multiple installed cygwin and mingw based toolchains. Learning the intricacies of some weird framework that will be gone in 18 months anyway. There is enough of that stuff that I have to deal with, just to do my job. I don't need any more. When I can avoid it, I do.

One thing I've noticed since I started to prototype in JavaScript is that since drawing and UI work is so simple to do, I've started to use programming for things that I previously would have done in other ways. For example, I no longer do graphs like this in a drawing program:

Instead I write a little piece of JavaScript code that draws the graph on an HTML canvas (code here: pipeline.js).

JavaScript canvas drawing can not only replace traditional drawing programs, but also Visio (for process diagrams), Excel (graphs and charts), Photoshop and Graphviz. And it can do more advanced forms of visualization and styling that are not possible in any of these programs.

For simple graphs, you could ask if this really saves any time in the long run, as compared to using a regular drawing program. My answer is: I don't know and I don't care. I think it is more important to do something interesting and fun with time than to save it. And for me, using drawing programs stopped being fun some time around when ClarisWorks was discontinued. If you ask me, so called "productivity software" has just become less and less productive since then. These days, I can't open a Word document without feeling my pulse racing. You can't even print the damned things without clicking through a security warning. Software PTSD. Programmers, we should be ashamed of ourselves. Thank god for Markdown.

Another thing I've stopped using is slide show software. That was never any fun either. Keynote was at least tolerable, which is more than you can say about Powerpoint. Now I just use Remark.js instead and write my slides directly in HTML. I'm much happier and I've lost 10 pounds! Thank you, JavaScript!

But I think for my next slide deck, I'll write it directly in JavaScript instead of using Remark. That's more fun! Frameworks? I don't need no stinking frameworks! Then I can also finally solve the issue of auto-adapting between 16:9 and 4:3 so I don't have to letterbox my entire presentation when someone wants me to run it on a 1995 projector. Seriously, people!

This is not the connector you are looking for!

And I can put HTML 5 videos directly in my presentation, so I don't have to shut down my slide deck to open a video in a separate program. Have you noticed that this is something that almost every speaker does at big conferences? Because apparently they haven't succeeded in getting their million dollar presentation software to reliably present a video file! Software! Everything is broken!

Anyhoo... to get back off topic, one thing that surprised me a bit about JavaScript is that there doesn't seem to be a lot of interest in hot-reloading workflows. Online there is JSBin, which is great, but not really practical for writing bigger things. If you start googling for something you can use offline, with your own favorite text editor, you don't find that much. This is a bit surprising, since JavaScript is a dynamic language -- hot reloading should be a hot topic.

There are some node modules that can do this, like budo. But I'd like something that is small and hackable, that works instantly and doesn't require installing a bunch of frameworks. By now, you know how I feel about that.

After some experimentation I found that adding a script node dynamically to the DOM will cause the script to be evaluated. What is a bit surprising is that you can remove the script node immediately afterwards and everything will still work. The code will still run and update the JavaScript environment. Again, since this is only for my personal use I've not tested it on Internet Explorer 3.0, only on the browsers I play with on a daily basis, Safari and Chrome Canary.

What this means is that we can write a require function for JavaScript like this:

function require(s)
{
    var script = document.createElement("script");
    script.src = s + "?" + performance.now();
    script.type = "text/javascript";
    var head = document.getElementsByTagName("head")[0];
    head.appendChild(script);
    head.removeChild(script);
}

We can use this to load script files, which is kind of nice. It means we don't need a lot of <script> tags in the HTML file. We can just put one there for our main script, index.js, and then require in the other scripts we need from there.

Also note the deft use of + "?" + performance.now() to prevent the browser from caching the script files. That becomes important when we want to reload them.

Since, for dynamic languages, reloading a script is the same thing as running it, we can get automatic reloads by just calling require on our own script from a timer:

function reload()
{
    require("index.js");
    render();
}

if (!window.has_reload) {
    window.has_reload = true;
    window.setInterval(reload, 250);
}

This reloads the script every 250 ms.

I use the has_reload flag on the window to ensure that I set the reload timer only the first time the file is run. Otherwise we would create more and more reload timers with every reload, which in turn would cause even more reloads. If I had enough power in my laptop, the resulting chain reaction would vaporize the universe in under three minutes. Sadly, since I don't, all that will happen is that my fans will spin up a bit. Dammit, I need more power!

After each reload() I call my render() function to recreate the DOM, redraw the canvas, etc with the new code. That function might look something like this:

function render()
{
    var body = document.getElementsByTagName("body")[0];
    while (body.hasChildNodes()) {
        body.removeChild(body.lastChild);
    }

    var canvas = document.createElement("canvas");
    canvas.width = 650;
    canvas.height = 530;
    var ctx = canvas.getContext("2d");
    drawGraph(ctx);
    body.appendChild(canvas);
}

Note that I start by removing all the DOM elements under <body>. Otherwise each reload would create more and more content. That's still linear growth, so it is better than the exponential chain reaction you can get from the reload timer. But linear growth of the DOM is still pretty bad.

You might think that reloading all the scripts and redrawing the DOM every 250 ms would create a horrible flickering display. But so far, for my little play projects, everything works smoothly in both Safari and Chrome. Glad to see that they are double buffering properly.

If you do run into problems with flickering you could try using the Virtual DOM method that is so popular with JavaScript UI frameworks these days. But try it without that first and see if you really need it, because ugh frameworks, amirite?

Obviously it would be better to reload only when the files actually change and not every 250 ms. But to do that you would need to do something like adding a file system watcher connected to a web socket that could send a message when a reload was needed. Things would start to get complicated, and I like it simple. So far, this works well enough for my purposes.

As a middle ground you could have a small bootstrap script for doing the reload:

window.version = 23;
if (window.version != window.last_version) {
    window.last_version = window.version;
    reload();
}

You would reload this small bootstrap script every 250 ms. But it would only trigger a reload of the other scripts and a re-render when you change the version number. This avoids the reload spamming, but it also removes the immediate feedback loop -- change something and see the effect immediately -- which I think is really important.

As always with script reloads, you must be a bit careful with how you write your scripts to ensure they work nicely with the reload feature. For example, if you write:

class Rect
{
    ...
};

It works well in Safari, but Chrome Canary complains on the second reload that you are redefining a class. You can get around that by instead writing:

var Rect = class
{
    ...
};

Now Chrome doesn't complain anymore, because obviously you are allowed to change the content of a variable.

To preserve state across reloads, I just put all the state in a global variable on the window:

window.state = window.state || {}

The first time this is run, we get an empty state object, but on future reloads we keep the old state. The render() function uses the state to determine what to draw. For example, for a slide deck I would put the current slide number in the state, so that we stay on the same page after a reload.

Here is a GIF of the hot reloading in action. Note that the browser view changes as soon as I save the file in Atom:

(No psychoactive substances were consumed during the production of this blog post. Except caffeine. Maybe I should stop drinking coffee?)

Friday, January 29, 2016

Stingray Support -- Hello, I Am Someone Who Can Help


Hello, I am someone who can help. 


Here at the Autodesk Games team, we pride ourselves on supporting users of the Stingray game engine in the best ways possible – so to start, let’s cover where you can find information!

General Information Here!

Games Solutions Learning Channel on YouTube:
This is a series of videos about Stingray by the Autodesk Learning Team. They'll be updating the playlist with new videos over time. They're pretty responsive to community requests on the videos, so feel free to log in and comment if there's something specific you'd like to see.
Check out the playlist on YouTube.

Autodesk Stingray Quick Start Series, with Josh from Digital Tutors:
We enlisted the help of Digital Tutors to set up a video series that runs through the major sections of Stingray so you can get up and running quickly.
Check out the playlist on YouTube.

Autodesk Make Games learning site:
This is a site that we've made for people who are brand new to making games. If you've never made a game before, or never touched complex 3D tools or a game engine, this is a good place to start. We run you through Concept Art and Design phases, 3D content creation, and then using a game engine. We've also made a bunch of assets available to help brand new game makers get started.

Creative Market:
The Creative Market is a storefront where game makers can buy or sell 3D content. We've got a page set up just for Stingray, and it includes some free assets to help new game makers get started.

Stingray Online Help
Here you'll find more getting started movies, how-to topics, and references for the scripting and visual programming interfaces. We're working hard to get you all the info you need, and we're really excited to hear your feedback.

Forum Support Tutorial Channel on YouTube:
This is a series of videos that answers recurring forum questions by the Autodesk Support Team. They'll be updating the playlist with new videos over time. They're pretty responsive to community requests on the videos, so feel free to log in and comment if there's something specific you'd like to see.
Check out the playlist on YouTube.

You should also visit the Stingray Public Forums here, as there is a growing wealth of information and knowledge to search from.



Let's Get Started

Let’s get started. Hi, I’m Dan, nice to meet you. I am super happy to help you with any of your Stingray problems, issues, needs or general questions! However, I’m going to need to ask you to HELP ME, HELP YOU!!

It’s not always apparent when a user asks for help just exactly what that user is asking for. That being the case, here is some useful information on how to ask for help and what to provide us so that we can help you better and more quickly!
  • Make sure you are very clear on what your specific problem is and describe it as best you can.
    • Include any pictures or screenshots you may have.
  • Tell us how you came to have this problem.
    • Give us detailed reproduction steps for the issue you are seeing.
  • Attach your log files!
    • They can be found here: C:\Users\”USERNAME”\AppData\Local\Autodesk\Stingray\Logs
  • Attach any file that is a specific problem (zip it so it attaches to the forum post).
  • Make sure to let us know your system specifications.
  • Make sure to let us know what Stingray engine version you are using.


On another note … traduire, traduzir, 翻, Übersetzen, þýða, переведите, ਅਨੁਵਾਦ, and ... translate! We use English as our main support language; however, these days translate.google.com is really, really good! If English is not your first language, please feel free to write your questions and issues in your native language and we will translate it and get back to you. I often find that it is easier to understand a translation, and this helps us get you help just that much more quickly!

In Conclusion

So just to recap, make sure you are ready when you come to ask us a question! Have your issue sorted out, know how to reproduce it, know what engine version you are running and your system specs, and attach your log files. This will help us help you that much faster, and we can get you on your way to making super awesome content in the Stingray game engine. Thanks!

Dan Matlack
Product Support Specialist – Games Solutions
Autodesk, Inc.

Wednesday, January 20, 2016

Introducing the Stingray Package Manager (spm)

The Stingray Package Manager, or spm, is a small Ruby program that is responsible for downloading specific versions of the external artifacts (libraries, sample projects, etc) that are needed to build the Stingray engine and tools. It's a small but important piece of what makes one-button builds possible.

By one-button builds I mean that it should be possible to build Stingray with a single console command and no human intervention. It should work for any version in the code history. It should build all tools, plugins and utilities that are part of the project (as well as meaningful subsets of those for faster builds). In addition, it should work for all target platforms, build configurations (debug, development, release) and options (enabling/disabling Steam, Oculus, AVX, etc).

Before you have experienced one-button builds it's easy to think: So what? What's the big deal? I can download a few libraries manually, change some environment variables when needed, open a few Visual Studio projects and build them. Sure, it is a little bit of work every now and then, but not too bad.

In fact, there are big advantages to having a one-button build system in place:

  • New developers and anyone else interested in the code can dive right in and don't have to spend days trying to figure out how to compile the damned thing.

  • Build farms don't need as much babysitting (of course build farms always need some babysitting).

  • All developers build the engine in the same way, the results are repeatable and you don't get bugs from people building against the wrong libraries.

  • There is a known way to build any previous version of the engine, so you can fix bugs in old releases, do bisect tests to locate bad commits, etc.

But more than these specific things, having one-button builds also gives you one less thing to worry about. As programmers we are always trying to fit too much stuff into our brains. We should just come to terms with the fact that as a species, we're really not smart enough to be doing this. That is why I think that simplicity is the most important virtue of programming. Any way we can find to reduce cognitive load and context switching will allow us to focus more on the problem at hand.

In addition to spm there are two other important parts of our one-button build system:

  • The cmake configuration files for building the various targets.

  • A front-end Ruby script (make.rb) that parses command-line parameters specifying which configuration to build and makes the appropriate calls to spm and cmake.

But let's get back to spm. As I said at the top, the responsibility of spm is to download and install external artifacts that are needed by the build process. There are some things that are important:

  • Exact versions of these artifacts should be specified so that building a specific version of the source (git hash) will always use the same exact artifacts and yield a predictable result.

  • Since some of these libraries are big, hundreds of megabytes, even when zipped (computers are a sadness), it is important not to download more than absolutely necessary for making the current build.

  • For the same reason we also need control over how we cache older versions of the artifacts. We don't want to evict them immediately, because then we have to download hundreds of megabytes every time we switch branch. But we don't want to keep all old versions either, because then we would pretty soon run out of space on small disks.

The last two points are the reason why something like git-lfs doesn't solve this problem out of the box and some more sophisticated package management is needed.

spm takes inspiration from popular package managers like npm and gem and offers a similar set of subcommands: spm install to install a package, spm uninstall to uninstall, etc. At its heart, what spm does is a pretty simple operation:

Upon request, spm downloads a specific artifact version (identified by a hash) from an artifact repository. We support multiple artifact repositories, such as S3, git and Artifactory. The artifact is unzipped and stored in a local library folder where it can be accessed by the build scripts. As specific artifact versions are activated and deactivated we move them in and out of the local artifact cache.

We don't use unique folder names for artifact versions. So the folder name of an artifact (e.g., luajit-2.1.0-windows) doesn't tell us the exact version (y0dqqY640edvzOKu.QEE4Fjcwxc8FmlM). spm keeps track of that in internal data structures.

There are advantages and disadvantages to this approach:

  • We don't have to change the build scripts when we do minor fixes to a library, only the version hash used by spm.
  • We avoid ugly looking hashes in the folder names and don't have to invent our own version numbering scheme, in addition to the official one.
  • We can't see at a glance which specific library versions are installed without asking spm.
  • We can't have two versions of the same library installed simultaneously, since their names could collide, so we can't run parallel builds that use different library versions.
  • If library version names were unique we wouldn't even need the cache folder, we could just keep all the versions in the library folder.

I'm not 100 % sure we have made the right choice, it might be better to enforce unique names. But it is not a huge deal so unless there is a big impetus for change we will stay on the current track.

spm knows which versions of the artifacts to install by reading configuration files that are checked in as part of the source code. These configuration files are simple JSON-like files with entries like this:

cmake = {
    groups = ["cmake", "common"]
    platforms = ["win64", "win32", "osx", "ios", "android", "ps4", "xb1", "webgl"]
    lib = "cmake-3.4.0-r1"
    version = "CZRgSJOqdzqVXey1IXLcswEuUkDtmwvd"
    source =  {
        type = "s3"
        bucket = "..."
        access-key-id = "..."
        secret-access-key = "..."
    }
}

This specifies the name of the package (cmake), the folder name to use for the install (cmake-3.4.0-r1), the version hash and how to retrieve it from the artifact repository (these source parameters can be re-used between different libraries).

To update a library, you simply upload the new version to the repository, modify the version hash and check in the updated configuration file.

The platforms parameter specifies which platforms this library is used on and groups is used to group packages together in meaningful ways that make spm easier to use. For example, there is an engine group that contains all the packages needed to build the engine runtime and a corresponding editor group for building the editor.

So if you want to install all libraries needed to build the engine on Xbox One, you would do:

spm install-group -p xb1 engine

This will install only the libraries needed to build the engine for Xbox One and nothing else. For each library, spm will:

  • If the library is already installed -- do nothing.
  • If the library is in the cache -- move it to the library folder.
  • Otherwise -- download it from the repository.

Downloads are done in parallel, for maximum speed, with a nice command-line based progress report:

The cache is a simple MRU cache that can be pruned either by time (throw away anything I haven't used in a month) or by size (trim the cache down to 10 GB, keeping only the most recently used stuff).

Of course, you usually never even have to worry about calling spm directly, because make.rb will automatically call it for you with the right arguments, based on the build parameters you have specified to make.rb. It all happens behind the scenes.

Even the cmake binary itself is installed by the spm system, so the build is able to bootstrap itself to some extent. Unfortunately, the bootstrap process is not 100 % complete -- there are still some things that you need to do manually before you can start using the one-button builds:

  • Install Ruby (for running spm.rb and make.rb).
  • Specify the location of your library folder (with an SR_LIB_DIR environment variable).
  • Install a suitable version of Visual Studio and/or XCode.
  • Install the platform specific SDKs and toolchains for the platforms you want to target.

I would like to get rid of all of this and have a zero-configuration bootstrap procedure. You sync the repository, give one command and bam -- you have everything you need.

But some of these things are a bit tricky. Without Ruby we need something else for the initial step that at least is capable of downloading and installing Ruby. We can't put restricted software in public repositories and it might be considered hostile to automatically run installers on the users' behalf. Also, some platform SDKs need to be installed globally and don't offer any way of switching quickly between different SDK versions, thwarting any attempt to support quick branch switching.

But we will continue to whittle away at these issues, taking the simplifications where we can find them.

Friday, December 18, 2015

Data Driven Rendering in Stingray


We’re all familiar with the benefits that a data driven architecture brings to gameplay: code is decoupled from data, enabling live linking and rapid iteration. Placing new objects in the editor or modifying the speed of a character has an immediate effect on a live game instance. This really speeds up the development process as you fine-tune scripts, gameplay and other content.

What about graphics programming?  It turns out that the same architecture and associated benefits apply to Stingray’s renderer.

Just by modifying configuration files (albeit somewhat complex configuration files) we can implement new shader programs, post-processing effects and even different cascading shadow map implementations. All in real time, on a live game instance. Which is a big win for graphics programmers: try out new ideas and fine-tune shaders, all with real-time feedback. No more of that long edit/compile/run/debug cycle. And this applies to the entire rendering pipeline: everything from the object space to world space transforms to shadow casting and the final rendering pass is all exposed as config file data, not as C++ code as with traditional architectures.

I gave a presentation on this topic a while back which has now found its way to our YouTube channel:

https://www.youtube.com/channel/UC0fIe6XV1PjilADTei9JMOA

By the way, there’s a lot of other great Stingray content up there so please check it out! The renderer presentation can be found under “Stingray Render Config Tutorial.”

The details as well as a PowerPoint can be found there. The code changes to add a trivial greyscale post-processing effect involve:

settings.ini: 

The render_config variable points to the renderer.config file. Settings.ini also provides a section to override default settings found in the next file, renderer.render_config

core/stingray_renderer/renderer.render_config:

Points to our shader libraries, text files containing actual shader programs. A section called global_resources allocates graphics buffers, such as scratch buffers for the cascading shadow maps and G-buffers for deferred rendering along with the main framebuffer. And most of the actual rendering is invoked in the resource_generators section. Again, more details in the YouTube video though a surprising amount can be learned just by grepping through the various config files and playing with the settings. Which is easy to do since it’s all data driven!

core/stingray_renderer/shader_libraries/development.shader_source:

One of several shader libraries. While shader code can be entered as text here, Stingray also provides a graphical node-based shader editor. And we support ShaderFX materials from Max or Maya. It’s often easier (and more portable) to implement shaders graphically.

But whatever method you choose to implement shaders in, the key point is that Stingray's entire rendering pipeline is fully accessible through configuration files. With our data driven architecture, making complex rendering changes, while still non-trivial, is a whole lot faster and easier (and more portable!) than working with platform-specific C++ code.

Thursday, September 10, 2015

Temporal Reprojection and SAO

We've recently revisited our AO solution with the goal of improving its performance on consoles. We currently use the Scalable Ambient Obscurance algorithm presented by Morgan McGuire. Our target was to bring the cost of the entire effect down to something between 1 and 1.5 ms on Xbox One. To achieve this it was clear that we needed to reduce the number of taps we were taking for each AO sample. An important part of this work went towards improving the efficiency of the temporal reprojection of our AO buffer. I thought I'd share a few observations we've made along the way.

Distributing AO Samples

When reprojecting data the key to success is to make sure your samples are well distributed through time. The Halton sequence was made popular for reprojection methods after Brian Karis presented his High Quality Temporal Supersampling. It is a sequence that gives well distributed samples in space as well as in time.

If we add an offset to a single AO sample, we can see that after 2π it repeats (as expected).



So we use 8 samples ranging from [0, 2π] distributed by using the first 8 terms of a base 3 Halton sequence: {1/3, 2/3, 1/9, 4/9, 7/9, 2/9, 5/9, 8/9} x 2π
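
For reference, here is a minimal sketch (in C++, not our actual shader code) of how these rotation offsets can be generated from the base 3 radical inverse; the function names are just for illustration:

float radical_inverse_base3(unsigned i)
{
    // Mirror the base 3 digits of i around the radix point: 1 -> 1/3, 2 -> 2/3, 3 -> 1/9, ...
    float result = 0.0f;
    float fraction = 1.0f / 3.0f;
    while (i > 0) {
        result += (i % 3) * fraction;
        i /= 3;
        fraction /= 3.0f;
    }
    return result;
}

// First 8 terms (i = 1..8) give 1/3, 2/3, 1/9, 4/9, 7/9, 2/9, 5/9, 8/9,
// which scaled by 2*pi are the rotation offsets listed above.
float halton_rotation_offset(unsigned frame)
{
    const float TWO_PI = 6.28318530718f;
    return radical_inverse_base3(frame % 8 + 1) * TWO_PI;
}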



And this is the result using 6 AO samples.



Notice that some banding can appear. By adding a dithered offset to the current sample's radius, we can remove this quite nicely. We use a 4x4 pattern based on the Bayer matrix.
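
As a rough sketch of what such a dithered offset can look like (the 4x4 Bayer matrix below is the standard one, but how the offset is scaled and applied to the radius here is an assumption, not our exact shader code):

// Standard 4x4 Bayer (ordered dither) matrix, normalized to [0, 1).
static const float BAYER_4x4[4][4] = {
    {  0.0f/16.0f,  8.0f/16.0f,  2.0f/16.0f, 10.0f/16.0f },
    { 12.0f/16.0f,  4.0f/16.0f, 14.0f/16.0f,  6.0f/16.0f },
    {  3.0f/16.0f, 11.0f/16.0f,  1.0f/16.0f,  9.0f/16.0f },
    { 15.0f/16.0f,  7.0f/16.0f, 13.0f/16.0f,  5.0f/16.0f }
};

// Offset the AO sample radius per pixel to break up the banding.
float dithered_radius(float radius, int pixel_x, int pixel_y)
{
    float dither = BAYER_4x4[pixel_y & 3][pixel_x & 3];
    return radius * (0.75f + 0.25f * dither); // scale factors are assumptions
}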



And here are our samples distributed through time. If anyone has a better way of distributing these I would be very interested to know.



Reprojection function

When doing any temporal reprojection it is crucial to have a good reprojection function. I like to refer to this as a similarity function. Its purpose is to identify how likely it is that the reprojected samples correspond to the samples of the current pixel. When writing this function it's quite important to have a convenient way to visualize it. If the function gets complicated, it's a good idea to isolate the different terms of the function so that you can reason and debug them individually. The reprojection function we use is a combination of three main terms.

Disocclusion Term


This is a simple term which identifies depth differences and classifies previous pixels as disoccluded. We use the relative depth difference described by Huw Bowles in Iterative Image Warping.
depth_similarity = saturate(pow(prev_depth/current_depth, 4) + DEPTH_MIN_SIMILARITY);

Velocity Term

The second term is also very straightforward and consists simply of reducing the similarity for fast-moving pixels. A moving pixel has less chance of reprojecting successfully than a still one.
velocity_similarity = saturate(velocity * VELOCITY_SCALAR);

Dangerous Samples Term

The idea is to identify if the AO samples we are gathering are touching moving objects. To do this efficiently, we encode a moving bit as part of the depth buffer info passed in to the SAO algorithm. If you use the mip chain described by the SAO paper, make sure that you forward that 'moving bit' to the lower levels of the mip chain. This idea was presented by Anton Michels during the Labs R&D: Rendering Techniques in Rise of the Tomb Raider presentation earlier this year. Since each AO sample will need to read the depth information we get the 'moving bit' read for free.
samples_similarity = saturate(num_moving_samples * MOVING_SAMPLES_SCALAR);
samples_similarity *= (LOW_VELOCITY_SIMILARITY - MOVING_SAMPLES_MIN_SIMILARITY);
samples_similarity = lerp(samples_similarity, prev_samples_similarity, 0.9);
samples_similarity = min(samples_similarity, current_samples_similarity);


To try and invalidate samples associated with fast moving objects, the Dangerous Samples term is accumulated through time, an idea described quite well by Oliver Mattausch in his TSSAO GPU Pro 2 article, which he calls 'smooth invalidation'.



Here's the kind of ghosting we get without identifying 'dangerous' samples.



With this term as part of the reprojection function we can eliminate most of the ghosting that arises from the temporal reprojection:


Putting it all together

The final similarity term is calculated by combining all terms together.
similarity = depth_similarity * LOW_VELOCITY_SIMILARITY - velocity_similarity;
similarity *= (LOW_VELOCITY_SIMILARITY - HIGH_VELOCITY_SIMILARITY);
similarity = saturate(similarity - samples_similarity);

Well, that's it! Nothing too ground breaking but I thought I'd share. If anyone has ideas or suggestions on how to improve any of this please let us know!

Thursday, August 20, 2015

Allocation Interlude: JavaScript Animation

Making technical illustrations in drawing programs is tedious and boring. We are programmers, we should be programming our illustrations. Luckily, with JavaScript and its canvas, we can. And we can make them move!

To try this out, and to brush up my rusty JavaScript so that I can hang with all the cool web kids, I made an illustration of how the buddy allocator described in the last article works:

Note, I use ECMAScript 2015, so this code currently only works in recent versions of Chrome and Firefox. Sorry, but worrying about compatibility takes all the fun out of JavaScript.

Tuesday, August 4, 2015

Allocation Adventures 3: The Buddy Allocator

Hello, allocator!

The job of a memory allocator is to take a big block of memory (from the OS) and chop it up into smaller pieces for individual allocations:

void *A = malloc(10);
void *B = malloc(100);
void *C = malloc(20);

------------------------------------------------------
|  A  |  free  |   B   |  C  |         free          |
------------------------------------------------------

The allocator needs to be fast at serving an allocation request, i.e. finding a suitable piece of free memory. It also needs to be fast at freeing memory, i.e. making a previously used piece of memory available for new allocations. Finally, it needs to prevent fragmentation -- more about that in a moment.

Suppose we put all free blocks in a linked list and allocate memory by searching that list for a block of a suitable size. That makes allocation an O(n) operation, where n is the total number of free blocks. There could be thousands of free blocks and following the links in the list will cause cache misses, so to make a competitive allocator we need a faster method.

Fragmentation occurs when the free memory cannot be used effectively, because it is chopped up into little pieces:

------------------------------------------------------
|  A  |  free  |   B   |  C  |         free          |
------------------------------------------------------

Here, we might not be able to service a large allocation request, because the free memory is split up in two pieces. In a real world scenario, the memory can be fragmented into thousands of pieces.

The first step in preventing fragmentation is to ensure that we have some way of merging free memory blocks together. Otherwise, allocating blocks and freeing them will leave the memory buffer in a chopped up state where it is unable to handle any large requests:

-------------------------------------------------------
|  free  |  free  |  free  |  free  |  free  |  free  |
-------------------------------------------------------

Merging needs to be a quick operation, so scanning the entire buffer for adjacent free blocks is not an option.

Note that even if we merge all neighboring free blocks, we can still get fragmentation, because we can't merge the free blocks when there is a piece of allocated memory between them:

-----------------------------------------------------------
| free | A |  free  | B | free | C |   free    | D | free |
-----------------------------------------------------------

Some useful techniques for preventing this kind of fragmentation are:

  • Use separate allocators for long-lived and short-lived allocations, so that the short-lived allocations don't create "holes" between the long lived ones.
  • Put "small" allocations in a separate part of the buffer so they don't interfere with the big ones.
  • Make the memory blocks relocatable (i.e. use "handles" rather than pointers).
  • Allocate whole pages from the OS and rely on the page mapping to prevent fragmentation.

The last approach can be surprisingly efficient if you have a small page size and follow the advice suggested earlier in this series, to try to have a few large allocations rather than many small ones. On the other hand, a small page size means more TLB misses. But maybe that doesn't matter so much if you have good data locality. Speculation, speculation! I should provide some real numbers instead, but that is too much work!

Three techniques used by many allocators are in-place linked lists, preambles and postambles.

In-place linked lists is a technique for storing linked lists of free memory blocks without using any extra memory. The idea is that since the memory in the blocks is free anyway, we can just store the prev and next pointers directly in the blocks themselves, which means we don't need any extra storage space.
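
A minimal sketch of what that can look like in C++ (the names are mine, just for illustration; the only requirement is that a free block is big enough to hold the two pointers):

// The list node lives inside the free block itself -- no extra storage needed.
struct FreeBlock
{
    FreeBlock *prev;
    FreeBlock *next;
};

// Push a free block onto the head of a list.
void push_free(FreeBlock **head, void *block_memory)
{
    FreeBlock *block = (FreeBlock *)block_memory;
    block->prev = nullptr;
    block->next = *head;
    if (*head)
        (*head)->prev = block;
    *head = block;
}

// Pop the first block off a list, or return nullptr if the list is empty.
void *pop_free(FreeBlock **head)
{
    FreeBlock *block = *head;
    if (!block)
        return nullptr;
    *head = block->next;
    if (*head)
        (*head)->prev = nullptr;
    return block;
}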

A preamble is a little piece of data that sits just before the allocated memory and contains some information about that memory block. The allocator allocates extra memory for the preamble and fills it with information when the memory is allocated:

void *A = malloc(100);

------------------------
| pre |    A     | post|
------------------------

In C we pretty much need to have a preamble, because when the user calls free(void *p) on a pointer p, we get no information about how big the memory block allocated at p is. That information needs to come from somewhere and a preamble is a reasonable option, because it is easy to access from the free() code:

struct Preamble
{
    unsigned size;
    ...
};

void free(void *p)
{
    Preamble *pre = (Preamble *)p - 1;
    unsigned size = pre->size;
}

Note that there are other options. We could use a hash table to store the size of each pointer. We could reserve particular areas in the memory buffer for allocations of certain sizes and use pointer comparisons to find the area (and hence the size) for a certain pointer. But hash tables are expensive, and having certain areas for allocations of certain sizes only really works if you have a limited number of different sizes. So preambles are a common option.

They are really annoying though. They increase the size of all memory allocations and they mess with alignment. For example, suppose that the user wants to allocate 4 K of memory and that our OS uses 4 K pages. Without preambles, we could just allocate a page from the OS and return it. But if we need a four byte preamble, then we will have to allocate 8 K from the OS so that we have somewhere to put those extra four bytes. So annoying!

And what makes it even more annoying is that in most cases storing the size is pointless, because the caller already knows it. For example, in C++, when we do:

delete x;

The runtime knows the actual type of x, because otherwise it wouldn't be able to call the destructor properly. But since it knows the type, it knows the size of that type and it could provide that information to the allocator when the memory is freed.

Similarly, if the memory belongs to an std::vector, the vector class has a capacity field that stores how big the buffer is, so again the size is known.

In fact, you could argue that whenever you have a pointer, some part of the runtime has to know how big that memory allocation is, because otherwise, how could the runtime use that memory without causing an access violation?

So we could imagine a parallel world where instead of free(void *) we would have free(void *, size_t) and the caller would be required to explicitly pass the size when freeing a memory block. That world would be a paradise for allocators. But alas, it is not the world we live in.

(You could enforce this parallel world in a subsystem, but I'm not sure if it is a good idea to enforce it across the board in a bigger project. Going against the grain of the programming language can be painful.)

A postamble is a similar piece of data that is put at the end of an allocated memory block.

Postambles are useful for merging. As mentioned above, when you free a memory block, you want to merge it with its free neighbors. But how do you know what the neighbors are and if they are free or not?

For the memory block to the right it is easy. That memory block starts where yours end, so you can easily get to it and check its preamble.

The neighbor to the left is trickier. Since you don't know how big that memory block might be, you don't know where to find its preamble. A postamble solves that problem, since the postamble of the block to the left will always be located just before your block.

Again, the alternative to using preambles and postambles to check for merging is to have some centralized structure with information about the blocks that you can query. And the challenge is to make such queries efficient.

If you require all allocations to be 16-byte aligned, then having both a preamble and a postamble will add 32 bytes of overhead to your allocations. That is not peanuts, especially if you have many small allocations. You can get around that by using slab or block allocators for such allocations, or even better, avoid them completely and try to make fewer and bigger allocations instead, as already mentioned in this series.

The buddy allocator

With that short introduction to some general allocation issues, it is time to take a look at the buddy allocator.

The buddy allocator works by repeatedly splitting memory blocks in half to create two smaller "buddies" until we get a block of the desired size.

If we start with a 512 K block allocated from the OS, we can split it to create two 256 K buddies. We can then take one of those and split it further into two 128 K buddies, and so on.

When allocating, we check to see if we have a free block of the appropriate size. If not, we split a larger block as many times as necessary to get a block of a suitable size. So if we want 32 K, we split the 128 K block into 64 K and then split one of those into 32 K.

At the end of this, the state of the allocator will look something like this:

Buddy allocator after 32 K allocation:

    -----------------------------------------------------------------
512 |                               S                               |
    -----------------------------------------------------------------
256 |               S               |               F               |
    -----------------------------------------------------------------
128 |       S       |       F       |
    ---------------------------------
 64 |   S   |   F   |                        S - split
    -----------------                        F - free
 32 | A | F |                                A - allocated
    ---------

As you can see, this method of splitting means that the block sizes will always be a power of two. If you try to allocate something smaller, say 13 K, the allocation will be rounded up to the nearest power of two (16 K) and then get assigned a 16 K block.

So there is a significant amount of fragmentation happening here. This kind of fragmentation is called internal fragmentation since it is wasted memory inside a block, not wasted space between the blocks.

Merging in the buddy allocator is dead simple. Whenever a block is freed, we check if its buddy is also free. If it is, we merge the two buddies back together into the single block they were once split from. We continue to do this recursively, so if this newly created free block also has a free buddy, they get merged together into an even bigger block, etc.

The buddy allocator is pretty good at preventing external fragmentation, since whenever something is freed there is a pretty good chance that we can merge, and if we can't the "hole" should be filled pretty soon by a similarly sized allocation. You can still imagine pathological worst-case scenarios. For example, if we first allocate every leaf node and then free every other of those allocations we would end up with a pretty fragmented memory. But such situations should be rare in practice.

Worst case fragmentation, 16 K block size

    -----------------------------------------------------------------
512 |                               S                               |
    -----------------------------------------------------------------
256 |               S               |               S               |
    -----------------------------------------------------------------
128 |       S       |       S       |       S       |       S       |
    -----------------------------------------------------------------
 64 |   S   |   S   |   S   |   S   |   S   |   S   |   S   |   S   |
    -----------------------------------------------------------------
 32 | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S |
    -----------------------------------------------------------------
 16 |A|F|A|F|A|F|A|F|A|F|A|F|A|F|A|F|A|F|A|F|A|F|A|F|A|F|A|F|A|F|A|F|
    -----------------------------------------------------------------

I'm being pretty vague here, I know. That's because it is quite hard in general to say something meaningful about how "good" an allocator is at preventing fragmentation. You can say how good it does with a certain allocation pattern, but every program has a different allocation pattern.

Implementing the buddy allocator

Articles on algorithms and data structures are often light on implementation details. For example, you can find tons of articles describing the high-level idea behind the buddy allocator as I've outlined it above, but not much information about how to implement the bloody thing!

This is a pity, because the implementation details can really matter. For example, it's not uncommon to see someone carefully implement the A*-algorithm, but using a data structure for the open and closed sets that completely obliterates the performance advantages of the algorithm.

So let's get into a bit more detail.

We start with allocation. How can we find a free block of a requested size? We can use the technique described above: we put the free blocks of each size in an implicit linked list. To find a free block we just take the first item from the list of blocks of that size, remove it from the list and return it.

If there is no block of the right size, we take the block of the next higher size and split that. We use one of the two blocks we get and put the other one on the free list for that size. If the list of blocks of the bigger size is also empty, we can go to the even bigger size, etc.

To make things easier for us, let's introduce the concept of levels. We say that the single block that we start with, representing the entire buffer, is at level 0. When we split that we get two blocks at level 1. Splitting them, we get to level 2, etc.

We can now write the pseudocode for allocating a block at level n:

if the list of free blocks at level n is empty
    allocate a block at level n-1 (using this algorithm)
    split the block into two blocks at level n
    insert the two blocks into the list of free blocks for level n
remove the first block from the list at level n and return it

The only data structure we need for this is a list of pointers to the first free block at each level:

static const int MAX_LEVELS = 32;
void *_free_lists[MAX_LEVELS];

The prev and next pointers for the lists are stored directly in the free blocks themselves.

We can also note some mathematical properties of the allocator:

total_size == (1<<num_levels) * leaf_size
size_of_level(n) == total_size / (1<<n)
max_blocks_of_level(n) = (1<<n)

Note that MAX_LEVELS = 32 is probably enough since that gives a total size of leaf_size * 4 GB and we know leaf_size will be at least 16. (The leaf nodes must have room for the prev and next pointers of the linked list and we assume a 64 bit system.)

Note also that we can create a unique index for each block in the buddy allocator as (1<<level) + index_in_level - 1. The node at level 0 will have index 0. The two nodes at level 1 will have index 1 and 2, etc:

Block indices

    -----------------------------------------------------------------
512 |                               0                               |
    -----------------------------------------------------------------
256 |               1               |               2               |
    -----------------------------------------------------------------
128 |       3       |       4       |       5       |       6       |
    -----------------------------------------------------------------
 64 |   7   |   8   |   9   |  10   |  11   |  12   |  13   |  14   |
    -----------------------------------------------------------------
 32 |15 |16 |17 |18 |19 |20 |21 |22 |23 |24 |25 |26 |27 |28 |29 |30 |
    -----------------------------------------------------------------

The total number of entries in the index is (1 << num_levels) - 1. So if we want to store some data per block, this is how much memory we will need. For the sake of simplicity, let's ignore the - 1 part and just round it off to (1 << num_levels).
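
In code, these relationships might look something like the following sketch (_total_size is an assumed member variable of the allocator):

// Size of a block at a given level. Level 0 is the whole buffer.
size_t size_of_level(int level)
{
    return _total_size >> level;
}

// Number of blocks at a given level.
unsigned blocks_in_level(int level)
{
    return 1u << level;
}

// Unique index of a block, given its level and its position within that level.
// The root block gets index 0, its two children 1 and 2, and so on.
unsigned block_index(int level, unsigned index_in_level)
{
    return (1u << level) + index_in_level - 1;
}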

What about deallocation?

The tricky part is the merging. Doing the merging is simple, we just take the two blocks, remove them from the free list at level n and insert the merged block into the free list at level n-1.

The tricky part is to know when we should merge. I.e., when we are freeing a block p, how do we know if its buddy is also free, so that we can merge them?

First, note that we can easily compute the address of the buddy. Suppose we have freed a block p at level n. We can compute its index in that level as:

index_in_level_of(p,n) == (p - _buffer_start) / size_of_level(n)

If the index i is even, then the buddy is at index i+1, otherwise the buddy is at i-1, and we can use the formula above to solve for the pointer, given the index.
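
As a sketch, using size_of_level() from above and an assumed _buffer_start member:

// Index of a block within its level, computed from its address.
unsigned index_in_level_of(void *p, int level)
{
    return (unsigned)(((char *)p - (char *)_buffer_start) / size_of_level(level));
}

// Address of the buddy of block p at the specified level.
void *buddy_of(void *p, int level)
{
    unsigned i = index_in_level_of(p, level);
    unsigned buddy_i = (i % 2 == 0) ? i + 1 : i - 1;
    return (char *)_buffer_start + buddy_i * size_of_level(level);
}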

So given the address of the buddy, let's call it buddy_ptr, how can we know if it is free or not? We could look through the free list for level n. If we find it there we know it is free and otherwise it's not. But there could be thousands of blocks and walking the list is hard on the cache.

To do better, we need to store some kind of extra information.

We could use preambles and postambles as discussed earlier, but that would be a pity. The buddy allocator has such nice, even block sizes: 1 K, 2 K, 4 K, we really don't want to mess that up with preambles and postambles.

But what we can do is to store a bit for each block, telling us if that block is free or allocated. We can use the block index as described above to access this bitfield. This will require a total of (1 << num_levels) bits. Since the total size of the tree is (1 << num_levels) * leaf_size bytes, we can see that the overhead of storing these extra bits is 1 / 8 / leaf_size. With a decent leaf_size of say 128 (small allocations are better handled by a slab allocator anyway) the overhead of this table is just 0.1 %. Not too shabby.

But in fact we can do even better. We can get by with just half a bit per block. That sounds impossible, but here is how:

For each pair of buddies A and B we store the single bit is_A_free XOR is_B_free. We can easily maintain the state of that bit by flipping it each time one of the buddies is freed or allocated.

When we consider making a merge we know that one of the buddies is free, because it is only when a block has just been freed that we consider a merge. This means we can find out the state of the other block from the XORed bit. If it is 0, then both blocks are free. If it is 1 then it is just our block that is free.

So we can get by with just one bit for every pair of blocks, that's half a bit per block, or an overhead of just 1 / 16 / leaf_size.
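
Here is a sketch of how such a pair bit can be maintained and queried (block_index() as in the sketch above; _merge_bits is an assumed byte array holding one bit per buddy pair, indexed by the pair's parent block; the root block at level 0 has no buddy):

// Flip the pair's bit every time one of the two buddies is allocated or freed.
void toggle_merge_bit(int level, unsigned index_in_level)
{
    unsigned pair = block_index(level - 1, index_in_level / 2);
    _merge_bits[pair / 8] ^= (1u << (pair % 8));
}

// Call this after freeing one of the buddies (and flipping the bit for that free).
// If the XORed bit is now 0, both buddies are free and we can merge.
bool can_merge(int level, unsigned index_in_level)
{
    unsigned pair = block_index(level - 1, index_in_level / 2);
    return (_merge_bits[pair / 8] & (1u << (pair % 8))) == 0;
}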

At this point, careful readers may note that I have been cheating.

All this time I have assumed that we know the level n of the block that we are freeing. Otherwise we cannot compute the address of the buddy or its index in the node tree.

But to know the level n of ptr we must know the size of its allocated block. So this only really works if the user passes the size of the allocation when freeing the block. I.e, the free(void *, size_t) interface that we discussed earlier.

If we want to support the simpler and more common API free(void *p), the allocator needs to somehow store the size of each allocation.

Again, using a preamble is possible, but we don't really want to.

We could use an array, indexed by (p - _buffer_start) / leaf_size to store the sizes. Note that this is not the same as the block index. We can't use the block index, since we don't know the level. Instead this is an index of size 1 << (num_levels - 1) with one entry for each possible pointer that the buddy allocator can return.

We don't have to store the full size (32 bits) in the index, just the level. That's 5 bits assuming that MAX_LEVELS = 32. Since the number of entries in this index is half that of the block index this amounts to 2.5 bits per block.

But we can do even better.

Instead of storing the size explicitly, we can use the block index and store a single bit to keep track of whether the block at that level has been split or not.

To find the level n of an allocated block we can use the algorithm:

n = num_levels - 1
while n > 0
    if block_has_been_split(ptr, n-1)
        return n
    n = n - 1
return 0

Since the leaf blocks can't be split, we only need 1 << (num_levels - 1) entries in the split index. This means that the cost of the split index is the same as for the merge index, 0.5 bits per block. It's a bit amazing that we can do all this with a total overhead of just 1 bit per block.

The price of the memory savings is that we now have to loop a bit to find the allocated size. But num_levels is usually small (in any case <= 32) and since we only have 1 bit per entry the cache usage is pretty good. Furthermore, with this approach it is easy to offer both a free(void *) and a free(void *, size_t) interface. The latter can be used by more sophisticated callers to avoid the loop to calculate the block size.
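
As a sketch, the two interfaces might look like this, where level_of_allocated_block() is the split-bit loop above, level_of_size() rounds a size up to the matching level, and free_block_at_level() does the merging and free-list insertion already described (all names, and the _num_levels and _leaf_size members, are just for illustration):

// Find the level whose block size is the smallest one that fits `size`.
int level_of_size(size_t size)
{
    int n = _num_levels - 1;          // leaf level
    size_t block_size = _leaf_size;
    while (block_size < size && n > 0) {
        block_size *= 2;
        --n;
    }
    return n;
}

// free(void *) -- has to recover the level by looping over the split bits.
void deallocate(void *p)
{
    free_block_at_level(p, level_of_allocated_block(p));
}

// free(void *, size_t) -- sophisticated callers can skip the loop entirely.
void deallocate(void *p, size_t size)
{
    free_block_at_level(p, level_of_size(size));
}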

Memory arrangements

Where do we store this 1 bit of metadata per block? We could use a separate buffer, but it is not that elegant. It would mean that our allocator would have to request two buffers from the system, one for the data and one for the metadata.

Instead, let's just put the metadata in the buffer itself, at the beginning where we can easily find it. We mark the blocks used to store the metadata as allocated so that they won't be used by other allocations:

Initial state of memory after reserving metadata:

    -----------------------------------------------------------------
512 |                               S                               |
    -----------------------------------------------------------------
256 |               S               |               F               |
    -----------------------------------------------------------------
128 |       S       |       F       |
    ---------------------------------
 64 |   S   |   S   |
    -----------------
 32 | S | S | S | F |
    -----------------
 16 |A|A|A|A|A|F|
    -------------
    ********** Metadata

Note that when allocating the metadata we can be a bit sneaky and not round up the allocation to the nearest power of two. Instead we just take as many leaf blocks as we need. That is because when we allocate the metadata we know that the allocator is completely empty, so we are guaranteed to be able to allocate adjacent leaf blocks. In the example above we only have to use 5 * 16 = 80 K for the metadata instead of the 128 K we would have used if we rounded up.

(The size of the metadata has been greatly exaggerated in the illustration above to show this effect. In reality, since the tree in the illustration has only six levels, the metadata is just 1 * (1 << 6) = 64 bits, that's 8 bytes, not 80 K.)

Note that you have to be a bit careful when allocating the metadata in this way, because you are allocating memory for the metadata that your memory allocation functions depend on. That's a chicken-and-egg problem. Either you have to write a special allocation routine for this initial allocation, or be very careful with how you write your allocation code so that this case is handled gracefully.

We can use the same technique to handle another pesky issue.

It's a bit irritating that the size of the buddy allocator has to be a power of two of the leaf size. Say that we happen to have 400 K of memory lying around somewhere. It would be really nice if we could use all of that memory instead of just the first 256 K.

We can do that using the same trick. For our 400 K, we can just create a 512 K buddy allocator and mark the first 144 K of it as "already allocated". We also offset the start of the buffer, so that the start of the usable memory coincides with the start of the buffer in memory. Like this:

    -----------------------------------------------------------------
512 |                               S                               |
    -----------------------------------------------------------------
256 |               S               |               F               |
    -----------------------------------------------------------------
128 |       S       |       S       |
    ---------------------------------
 64 |   S   |   S   |   S   |   F   |
    ---------------------------------
 32 | S | S | S | S | S | F |
    -------------------------
 16 |A|A|A|A|A|A|A|A|A|A|
    ---------------------
    *******************    Unusable, unallocated memory
MET                    *   Metadata
                       ^
                       +-- Usable memory starts here

Again, this requires some care when writing the code that does the initial allocation so that it doesn't write into the unallocated memory and cause an access violation.

The buddy allocator and growing buffers

As mentioned in the previous post, the buddy allocator is perfect for allocating dynamically growing buffers, because what we want there is allocations that progressively double in size, which is exactly what the different levels of the buddy allocator offer.

When a buffer needs to grow, we just allocate the next level from the buddy allocator and set the capacity of the buffer so that it fills up all that space.

Note that this completely avoids the internal fragmentation issue, which is otherwise one of the biggest problems with the buddy allocator. There will be no internal fragmentation because the dynamic buffers will make use of all the available space.

In the next post, I'll show how all of this ties together.