The Setup

Social media has won. In order to have any hope of getting a slice of that sweet, sweet viral marketing pie, we decided it was important for our engine to be able to export shareable snippets of gameplay that the kids can post on their socials.

Goal: We want players to be able to hit a button that exports the last n seconds of gameplay to their desktop. The solution needs to work on Windows, macOS, and Linux.

Let's get into it.

Outline:

This is a three-part post.

  • (Part 1) Efficient transfer of frames from GPU -> RAM
  • (Part 2) Handling large quantities of image data in memory
  • (Part 3) Encoding raw frame data into a video file

Dependencies:

  • OpenGL
    • We use OpenGL 4.1. Any version that provides pixel buffer objects and glMapBufferRange (OpenGL 3.0 and later) should work.
    • We use GLEW for bindings. GLAD and friends should work fine too, but you may need to tweak the below examples.
  • A header file from STB.
  • An ffmpeg executable for the target operating system. We use CMake to provide each target with the proper ffmpeg binary.

Rendering a frame

How to render a frame is outside the scope of this post. LearnOpenGL is, and forever will be, the best resource for learning about this.

I’m assuming that you already have a rendering loop going. The important thing is that glDrawX() renders a frame, leaving it in a GPU framebuffer, where it is available to be displayed on the monitor and/or downloaded back into RAM.

Pixel Buffer Objects

A typical render thread looks something like:

  • Get the latest render data
  • Issue draw commands
  • Repeat as fast as possible

We want it to look like:

  • Get the latest render data
  • Issue draw commands
  • Download the rendered frame into RAM
  • Repeat as fast as possible

The naive implementation of the above is a performance disaster. Making the render thread wait for a frame to download (a relatively slow operation) before it can begin work on the next one dramatically lowers our frame rate. If we used glReadPixels to copy data directly from the GPU into RAM, this is exactly what would happen.

Bad:

// Memory for the pixel data (4 bytes per pixel: RGBA)
GLubyte* pixels = new GLubyte[width * height * 4];
// Synchronously read pixels from the framebuffer. Too slow for us to do every frame.
glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, pixels);

Pixel Buffer Objects (PBOs) solve this problem by doing asynchronous transfers of data, allowing the downloading of data from GPU->RAM to happen ‘in the background’ while the render thread and OpenGL continue to do their thing.

Better:

GLuint bufferId;
GLsizei sizeInBytes = width * height * 4; // 4 bytes per pixel: RGBA

// Generate and bind a Pixel Buffer Object
glGenBuffers(1, &bufferId);
glBindBuffer(GL_PIXEL_PACK_BUFFER, bufferId);
glBufferData(GL_PIXEL_PACK_BUFFER, sizeInBytes, NULL, GL_STREAM_READ);

// Note '0' rather than the 'pixels' pointer as the last parameter.
// While a pixel buffer is bound, that parameter is a byte offset into the buffer, not a pointer into RAM.
// This will cause glReadPixels to asynchronously copy data to the currently bound pixel buffer
glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, 0);

Implementation

Let's go ahead and write a C++ wrapper for a Pixel Buffer Object.

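#include <GL/glew.h> // OpenGL bindings (swap for your loader of choice)
#include <cstring>   // std::memcpy
#include <iostream>  // std::cout

struct PixelBufferConfig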
{
    GLsizei width = 1920; // should match the current window dimensions
    GLsizei height = 1080;
    GLsizei numChannels = 3; //RGB. RGBA is 4.
    GLenum pixelFormat = GL_RGB; //GL_RGB, GL_RGBA, etc.
    GLsizei sizeInBytes() const { return width * height * numChannels; }
    GLsizei rowSizeInBytes() const { return width * numChannels; } // STBI and OpenGL might also call this the "stride"
};
class PixelBufferObject
{
private:
    GLuint m_id = 0; // 0 means no buffer has been generated yet
    PixelBufferConfig m_pboConfig;

public:
    PixelBufferObject(const PixelBufferConfig& pboConfig = PixelBufferConfig()) : m_pboConfig(pboConfig) {}
    ~PixelBufferObject() { if (m_id != 0) glDeleteBuffers(1, &m_id); }
    void bind() { glBindBuffer(GL_PIXEL_PACK_BUFFER, m_id); }

    // Call once, with an OpenGL context current, before capturing anything.
    void generate()
    {
        glGenBuffers(1, &m_id);
        bind();
        glBufferData(GL_PIXEL_PACK_BUFFER, m_pboConfig.sizeInBytes(), NULL, GL_STREAM_READ);
    }

    // Trigger a read of the pixel data into this pixel buffer.
    // This invocation is asynchronous and will return immediately (yay! no stalling!)
    void triggerFrameCapture()
    {
        bind(); // glReadPixels writes into whichever pixel buffer is currently bound
        glReadPixels(0, 0, m_pboConfig.width, m_pboConfig.height, m_pboConfig.pixelFormat, GL_UNSIGNED_BYTE, 0);
    }

    // Copy data from the Pixel Buffer into RAM. We'll use it later to make a video.
    // dataOut must point to at least sizeInBytes() bytes.
    bool downloadPixelData(GLubyte* dataOut)
    {
        bind(); // map this pixel buffer, not whichever one was bound last

        GLintptr offset = 0;
        GLsizeiptr size = m_pboConfig.sizeInBytes();
        GLbitfield access = GL_MAP_READ_BIT;

        // Warning!
        // glMapBufferRange is *synchronous* and will block our render thread if we're not careful.
        GLubyte* ptr = (GLubyte*)glMapBufferRange(GL_PIXEL_PACK_BUFFER, offset, size, access);

        if(!ptr)
        {
            std::cout << "glMapBufferRange failure\n";
            return false;
        }

        std::memcpy(dataOut, ptr, m_pboConfig.sizeInBytes());
        glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
        return true;
    }
};
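
One caveat with rowSizeInBytes(): OpenGL pads every row it writes to a multiple of GL_PACK_ALIGNMENT bytes, and the default alignment is 4. A 1920-pixel RGB row is 5760 bytes, which is already a multiple of 4, so the wrapper above works as written at that width. Other window widths may not be so lucky. Below is a minimal sketch of the padded-size arithmetic, assuming one byte per channel (GL_UNSIGNED_BYTE). The function names are hypothetical and are not part of the wrapper above.

```cpp
#include <cassert>
#include <cstddef>

// Bytes per row as OpenGL writes it for a given GL_PACK_ALIGNMENT.
// Assumes GL_UNSIGNED_BYTE data, i.e. one byte per channel.
// (Hypothetical helper for illustration; not part of PixelBufferObject.)
std::size_t packedRowSizeInBytes(std::size_t width,
                                 std::size_t numChannels,
                                 std::size_t alignment = 4) // OpenGL's default
{
    // OpenGL only accepts these four alignment values.
    assert(alignment == 1 || alignment == 2 || alignment == 4 || alignment == 8);

    const std::size_t tight = width * numChannels;
    // Round up to the next multiple of the alignment.
    return (tight + alignment - 1) / alignment * alignment;
}

// Total bytes needed to hold one frame, padding included.
std::size_t packedFrameSizeInBytes(std::size_t width,
                                   std::size_t height,
                                   std::size_t numChannels,
                                   std::size_t alignment = 4)
{
    return packedRowSizeInBytes(width, numChannels, alignment) * height;
}
```

For example, a 1366-pixel RGB row is 4098 bytes tightly packed but 4100 bytes with the default alignment. If you would rather not deal with padding at all, calling glPixelStorei(GL_PACK_ALIGNMENT, 1) before glReadPixels asks OpenGL for tightly packed rows, in which case width * numChannels holds for any width.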

One Isn’t Enough

But it’s never that easy, is it? Using a single PBO doesn’t actually solve our problem. Consider the following:

glDraw();
myPbo.triggerFrameCapture(); // trigger async operation
myPbo.downloadPixelData(&somewhereInRam); // block until async operation finishes

If we call downloadPixelData() immediately after triggerFrameCapture(), we totally defeat the purpose of using PBOs. We need to give OpenGL time to complete the pixel copy before we go asking for the result.

The solution to this is to use multiple Pixel Buffer Objects and rotate between them each frame. This allows us to wait a few frames for the async triggerFrameCapture() to finish. Giving that time to finish means downloadPixelData() will return much more quickly when we do get around to calling it.

We ended up going with three PBOs (triple buffering), as raising the count further didn’t improve performance. The idea of triple buffering is to always download from the oldest PBO, since it has had the most time to transfer its pixel data. To illustrate:

// Keep multiple PBOs around
PixelBufferObject a;
PixelBufferObject b;
PixelBufferObject c;

// First frame
glDraw();
a.triggerFrameCapture(); // a starts transferring the first frame
b.downloadPixelData(&somewhereInRam); // b hasn't captured anything yet, so discard this result

// Second frame
glDraw();
b.triggerFrameCapture(); // b starts transferring the second frame. a is still working
c.downloadPixelData(&somewhereInRam); // c hasn't captured anything yet either, so discard

// Third frame
glDraw();
c.triggerFrameCapture();
a.downloadPixelData(&somewhereInRam); // get the data from a, which is likely finished by now

// rinse and repeat in the render loop
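
The rotation above boils down to a bit of index arithmetic. Here is a minimal sketch of it, assuming the PBOs live in some indexable container. PboRotation and its method names are hypothetical, made up for illustration, and it deliberately makes no OpenGL calls. It is not the class we use in the engine.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: decides which PBO captures the current frame and
// which PBO (the oldest one) is downloaded from.
class PboRotation
{
public:
    explicit PboRotation(std::size_t count) : m_count(count)
    {
        assert(count >= 2); // with a single PBO there is nothing to rotate
    }

    // Index of the PBO that should capture the frame that was just drawn.
    std::size_t captureIndex() const
    {
        return static_cast<std::size_t>(m_frame % m_count);
    }

    // Index of the PBO whose capture was triggered the longest time ago.
    std::size_t downloadIndex() const
    {
        return static_cast<std::size_t>((m_frame + 1) % m_count);
    }

    // False during the first (count - 1) frames, when the PBO at
    // downloadIndex() has not captured anything yet.
    bool downloadIsValid() const { return m_frame + 1 >= m_count; }

    // Call once at the end of every frame.
    void advance() { ++m_frame; }

private:
    std::uint64_t m_count;
    std::uint64_t m_frame = 0; // frames completed so far
};
```

With a count of 3 this reproduces the sequence above: capture into 0 and download from 1, then 1 and 2, then 2 and 0. downloadIsValid() only becomes true on the third frame, which is the first time the download target holds a real frame.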

You’ll likely want some other class to handle the rotation of the PBOs and the massive amounts of data they’re downloading. We’ll dive into that next time.
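
To put a rough number on "massive", here is a back-of-the-envelope sketch using the default PixelBufferConfig values (1920x1080, RGB). The 60 frames per second and the 10 second clip length are assumptions picked for illustration, and CaptureBudget is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper for estimating how much raw frame data we produce.
struct CaptureBudget
{
    std::uint64_t width = 1920;
    std::uint64_t height = 1080;
    std::uint64_t numChannels = 3;      // RGB, one byte per channel
    std::uint64_t framesPerSecond = 60; // assumption
    std::uint64_t seconds = 10;         // assumption: the "n" in "last n seconds"

    std::uint64_t bytesPerFrame() const { return width * height * numChannels; }
    std::uint64_t bytesPerSecond() const { return bytesPerFrame() * framesPerSecond; }
    std::uint64_t totalBytes() const { return bytesPerSecond() * seconds; }
};
```

Under those assumptions a single frame is 6,220,800 bytes (about 6.2 MB), one second of gameplay is about 373 MB, and a 10 second clip is about 3.7 GB of raw pixels.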

Summary

Recording gameplay requires you to download rendered frames from the GPU. Downloading these frames is fairly slow, so it’s important that the download operation not block the rest of your render thread. This post explained how we can use Pixel Buffer Objects for async frame transfer, and how we can rotate between them to smoothly copy every rendered frame into RAM.

Next time, we’ll explore how to handle the incredible amounts of data being downloaded.

Papa Squat