3D Projection
Mouse Picking
- Backcountry is our first game controlled entirely with the mouse, with point-and-click walking and shooting. This made it possible to play the game on mobile without any extra on-screen controls.
- Knowing which object was clicked and figuring out the 3D world coordinates of the mouse position are interesting problems. While the rendering pipeline transforms 3D positions into 2D screen coordinates, mouse selection uses a reverse process: it goes from 2D coordinates to a position in the 3D world.
- Going from 2D to 3D means that we don't have enough data to precisely determine the 3D position. The source data has one dimension too few.
- A good way of solving this problem is to select objects closest to the camera, i.e. objects which are in front of other possible candidates for selection.
- There are two common techniques used to implement mouse controls in 3D games:
- Second-pass rendering into a special buffer rather than onto the screen. Each object is rendered using a unique color, and the color under the mouse cursor is then inspected to determine which object should be picked.
- Ray-casting from the camera's origin into the far plane, and finding objects which intersect with the ray.
- Backcountry uses the ray-casting technique. The main reason we went with it was that we already had ray-casting implemented for shooting. In a twisted turn of events, we later re-implemented shooting to use AABB collision detection between bullets, NPCs and the player.
- Because the camera is orthographic, there isn't a single origin point from which to cast rays. All rays are parallel to each other in the orthographic projection, so we need to somehow choose both the origin and the destination point for them. The near and the far plane of the camera projection are good candidates for this.
- The whole process consists of casting a ray from the near plane to the far plane, from the position corresponding to the screen position of the cursor, and finding intersections of the ray with colliders in the scene. In more detail:
- First, the screen position of the cursor is normalized in the `[-1, 1]` range, to correspond to the NDC, or normalized device coordinates. For the near plane, `z = -1`, and for the far plane, `z = 1`.
- The two points are then transformed (or, "unprojected") into the camera's space (also called the "eye" space) by multiplying them by the inverse of the camera's projection matrix and dividing the resulting `x`, `y` and `z` by `w`. `gl-matrix`'s `vec3.transformMat4` function takes care of the division by `w`.
- Next, we transform both points one more time to get to the world space coordinates, which is the space all colliders are defined in.
- Finally, a ray is cast between both points, and we check for intersections with the ground or any other entity with the `Collide` component and a special `RayTarget` flag.
- Each intersection stores the world position of the hit as well as the intersection time, or distance from the ray origin to the hit. The collider closest to the ray's origin is then considered "selected".
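- Putting these steps together, here's a minimal sketch of the unprojection with `gl-matrix`. The function and parameter names are illustrative assumptions rather than Backcountry's actual code, and the intersection test against colliders is left out.

```ts
import {mat4, vec3} from "gl-matrix";

// Turn a cursor position (in pixels) into a world-space picking ray.
export function cursor_to_ray(
    cursor_x: number,
    cursor_y: number,
    viewport_width: number,
    viewport_height: number,
    projection: Float32Array, // the camera's (orthographic) projection matrix
    camera_world: Float32Array // the camera's model matrix (world transform)
) {
    // 1. Normalize the cursor position into NDC, i.e. the [-1, 1] range.
    // The Y axis is flipped because screen coordinates grow downwards.
    let x = (cursor_x / viewport_width) * 2 - 1;
    let y = 1 - (cursor_y / viewport_height) * 2;

    // 2. Unproject the near (z = -1) and far (z = 1) points into eye space.
    // vec3.transformMat4 performs the division by w for us.
    let inv_projection = mat4.create();
    mat4.invert(inv_projection, projection);
    let near = vec3.transformMat4(vec3.create(), vec3.fromValues(x, y, -1), inv_projection);
    let far = vec3.transformMat4(vec3.create(), vec3.fromValues(x, y, 1), inv_projection);

    // 3. Transform both points from eye space into world space.
    vec3.transformMat4(near, near, camera_world);
    vec3.transformMat4(far, far, camera_world);

    // The ray is cast from near towards far; among all intersections, the one
    // with the smallest intersection time (closest to the origin) is selected.
    let direction = vec3.create();
    vec3.subtract(direction, far, near);
    vec3.normalize(direction, direction);
    return {origin: near, direction};
}
```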
Frustum Culling
- The maps in Backcountry are randomly generated and are much larger than the area visible on the screen.
- Our rendering pipeline is very simple: for each entity in the scene we compute its world position and send it off to the GPU to be rendered.
- The GPU takes into account the position and the projection matrix of the camera and then discards any vertices which are not going to appear on the screen.
- This is one of many optimizations built into GPUs; it avoids running the fragment shader for primitives which aren't visible anyway.
- For tens of thousands of vertices, as is the case in Backcountry's voxel art, the cost of doing the `projection * view * model` matrix multiplication in the vertex shader was significant, and for the majority of entities it resulted in all vertices being discarded because they were outside the camera's frustum.
- That's a lot of draw calls made and many bytes transferred from RAM to the GPU, all for nothing!
- The map is a square grid of NxN tiles, each composed of 64 voxels. Even with the voxels for an individual tile rendered using WebGL's instanced drawing, we were still issuing N^2 draw calls to the GPU. For N=50, that's 2500 draw calls each frame.
- We added a system called `sys_cull` which checks for entities outside the camera frustum and turns components off to exclude them from their respective systems. In the snippet below you can see how it's used to toggle the `Render` component in the cactus blueprint.

```ts
export function get_cactus_blueprint(game: Game): Blueprint {
    let model = game.Models[Models.CACTUS];
    return {
        Translation: [0, integer(2, 5) + 0.5, 0],
        Using: [render_vox(model), cull(Has.Render)],
    };
}
```
- Because components are encoded as bit masks, it's also possible to toggle more than one component at once! Here's how `cull` is defined in the campfire's particle emitter:

```ts
{
    Using: [
        shake(Infinity),
        emit_particles(2, 0.1),
        render_particles([1, 0, 0], 15),
        cull(Has.Shake | Has.EmitParticles | Has.Render),
    ],
}
```
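- To make the bit-mask idea concrete, here's a tiny sketch; the `Has` flags below are a simplified assumption, not Backcountry's full component list. Disabling or re-enabling several components is a single bitwise operation on the entity's mask.

```ts
// One bit per component type (simplified for illustration).
const enum Has {
    Render = 1 << 0,
    Shake = 1 << 1,
    EmitParticles = 1 << 2,
}

// The bits managed by the cull component for this entity.
let cull_mask = Has.Shake | Has.EmitParticles | Has.Render;

// The entity starts with all three components enabled.
let entity_mask = Has.Render | Has.Shake | Has.EmitParticles;

// Outside the frustum: clear all culled bits at once.
entity_mask &= ~cull_mask;

// Back inside the frustum: set them all again.
entity_mask |= cull_mask;
```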
- The frustum check is very simple and only considers the world position of the entity; it doesn't take into account the actual size of the entity or its bounding box. This works for three reasons (a code sketch follows the list):
- The objects in the game are roughly the same size. We simply added a hardcoded padding to the frustum check so that objects on the edge of the screen are still rendered correctly.
- The camera projection is orthographic, meaning that objects far from the camera are the same size as those closer to it. This allows the padding to be the same for all objects regardless of how far from the camera they are.
- The camera angle is fixed, which means that at all times we control what the user sees. This means we can also hardcode the near and the far planes of the frustum without the risk of clipping the objects visible on the screen.
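- Here's a rough sketch of what such a position-only check could look like; the padding value, names and half-extents are illustrative assumptions rather than Backcountry's actual implementation.

```ts
import {vec3} from "gl-matrix";

// Extra world units around the frustum so that objects whose origin is just
// off-screen are still rendered. A single hardcoded value works because all
// objects are roughly the same size and the projection is orthographic.
const PADDING = 2;

// Returns true when the entity's world position lies outside the padded
// orthographic frustum and its components can be switched off.
export function is_culled(
    world_position: Float32Array,
    view: Float32Array, // the inverse of the camera's model matrix
    half_width: number, // half the width of the orthographic projection
    half_height: number // half the height of the orthographic projection
): boolean {
    // In the camera's space the orthographic frustum is an axis-aligned box,
    // so the check is a comparison per axis. The near and far planes are
    // hardcoded generously and don't need to be checked here.
    let eye = vec3.transformMat4(vec3.create(), world_position, view);
    return (
        Math.abs(eye[0]) > half_width + PADDING ||
        Math.abs(eye[1]) > half_height + PADDING
    );
}
```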
- `sys_cull` became our most expensive system and it made the game CPU-bound rather than GPU-bound. The overall performance win was worth it, however.
Drawing 2D Elements in 3D Space
- There are a number of 2D UI elements which are attached to world-space positions:
- The healthbars always appear over the characters' heads.
- The amount of damage taken.
- The amount of gold collected when picking up a gold bar.
- The exclamation mark levitating over the sheriff's head in the town.
- The dollar sign over the outfitter's head in the town.
- When the camera pans, these UI widgets must be redrawn to reflect the relative change in the position of the anchor on the screen.
- There's a second `<canvas>` element stretched over the main WebGL canvas, filling the entire screen space, and a separate drawing system, `sys_draw`, which draws those UI elements using the regular `CanvasRenderingContext2D` API.
- The first iteration of this system used DOM elements rather than `Canvas2D`. The biggest benefit of this approach was that it made it easy to animate the UI widgets through CSS transforms and animations.
- The drawback, however, was that it required extra code to remove elements from the DOM when their anchors were destroyed in the game, to prevent memory leaks and UI artifacts corresponding to ghost entities.
- In order to draw on the screen we need to transform the 3D world-space coordinates of the anchor entity into the 2D space of the screen. The process is similar to the one performed by the GPU to render vertices on the screen.
- First, transform the world position of the anchor into the eye space (the camera's local space). This can be achieved by multiplying the position by the inverse of the camera's model matrix, also called the view matrix.
- Next, transform the result into the NDC (normalized device coordinates) by multiplying it by the projection matrix. Same as with mouse picking, it's important to realize that `gl-matrix`'s `vec3.transformMat4` also performs the division by `w`, so that it's not necessary to do it again.
- Without the division by `w`, the transformation by the projection matrix only moves us into the so-called clip space.
- In practice, these two transformations are performed at once with a single multiplication by the camera's `PV` matrix, i.e. `Projection * View`.
- The camera's `PV` matrix is computed every frame and also used in `sys_render` as a uniform passed into shaders.
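- With `gl-matrix`, recomputing it boils down to two calls; the names below are illustrative.

```ts
import {mat4} from "gl-matrix";

// Recompute the camera's PV matrix from its projection and model matrices.
export function update_pv(pv: Float32Array, projection: Float32Array, camera_world: Float32Array) {
    // The view matrix is the inverse of the camera's model (world) matrix.
    let view = mat4.create();
    mat4.invert(view, camera_world);
    // PV = Projection * View, ready to be passed to shaders and reused
    // when drawing the 2D UI.
    mat4.multiply(pv, projection, view);
}
```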
- The NDC coordinates are in the `[-1, 1]` range on each axis, where `-1` and `1` can be interpreted as the edges of the screen. Knowing the size of the viewport it's easy to compute the final screen position: `let screen_x = 0.5 * (ndc_x + 1) * viewport_width`, `let screen_y = 0.5 * (ndc_y + 1) * viewport_height`; `ndc_z` is discarded.
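- Here's a minimal sketch of the whole world-to-screen mapping, assuming the camera's `PV` matrix is available; the function name and parameters are illustrative.

```ts
import {vec3} from "gl-matrix";

// Project a world-space anchor position into 2D screen coordinates for the UI canvas.
export function world_to_screen(
    world_position: Float32Array,
    pv: Float32Array, // the camera's Projection * View matrix
    viewport_width: number,
    viewport_height: number
): [number, number] {
    // World space -> NDC in a single multiplication; transformMat4 divides by w.
    let ndc = vec3.transformMat4(vec3.create(), world_position, pv);

    // Map [-1, 1] onto the viewport; ndc[2] is discarded. Depending on the
    // 2D API's Y direction, the Y coordinate may additionally need flipping.
    let screen_x = 0.5 * (ndc[0] + 1) * viewport_width;
    let screen_y = 0.5 * (ndc[1] + 1) * viewport_height;
    return [screen_x, screen_y];
}
```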