Data conventions#

Coordinate conventions#

This section explains the coordinate conventions used in our repo.

Camera/view space#

We use the OpenGL/Blender (and original NeRF) coordinate convention for cameras. +X is right, +Y is up, and +Z is pointing back and away from the camera. -Z is the look-at direction. Other codebases may use the COLMAP/OpenCV convention, where the Y and Z axes are flipped from ours but the +X axis remains the same.
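
To illustrate the relationship between the two conventions, the sketch below converts a 4x4 camera-to-world matrix from the COLMAP/OpenCV convention to ours by negating the camera's Y and Z axes (the second and third columns). The function name `opencv_to_opengl` is hypothetical and not part of the repo, and the sketch assumes the matrix is stored as nested row-major lists.

```python
def opencv_to_opengl(c2w):
    """Flip the camera Y and Z axes of a 4x4 camera-to-world matrix.

    The same operation converts in either direction, because negating
    the two columns twice restores the original matrix.
    """
    flipped = [list(row) for row in c2w]
    for row in flipped[:3]:
        row[1] = -row[1]  # camera +Y axis: down (OpenCV) <-> up (OpenGL)
        row[2] = -row[2]  # camera +Z axis: forward (OpenCV) <-> backward (OpenGL)
    return flipped


# An OpenCV-style camera at (1, 2, 3) whose axes align with the world axes.
c2w_opencv = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 2.0],
    [0.0, 0.0, 1.0, 3.0],
    [0.0, 0.0, 0.0, 1.0],
]
c2w_opengl = opencv_to_opengl(c2w_opencv)
```

The translation column is left untouched, since only the orientation of the camera axes differs between the two conventions.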

World space#

Our world space is oriented such that the up vector is +Z. The XY plane is parallel to the ground plane. In the viewer, you’ll notice that red, green, and blue vectors correspond to X, Y, and Z respectively.

Dataset format#

This section describes the nerfstudio data format. The transforms.json file has a format similar to the one used by Instant NGP.

Camera intrinsics#

At the top of the file, we specify the camera intrinsics. We assume that all the intrinsic parameters are the same for every camera in the dataset. The following is an example:

{
  "fl_x": 1072.281897246229, // focal length x
  "fl_y": 1068.6906965388932, // focal length y
  "cx": 1504.0, // principal point x
  "cy": 1000.0, // principal point y
  "w": 3008, // image width
  "h": 2000, // image height
  "camera_model": "OPENCV_FISHEYE", // camera model type
  "k1": 0.03126218448029553, // first radial distortion parameter
  "k2": 0.005177020067511987, // second radial distortion parameter
  "k3": 0.0006640977794272005, // third radial distortion parameter
  "k4": 0.00010067035656515042, // fourth radial distortion parameter
  "p1": -6.472477652140879e-5, // first tangential distortion parameter
  "p2": -1.374647851912992e-7, // second tangential distortion parameter
  "frames": // ... extrinsics parameters explained below
}

The valid camera_model strings are currently “OPENCV” and “OPENCV_FISHEYE”. “OPENCV” (i.e., perspective) uses the distortion parameters k1, k2, p1, and p2. “OPENCV_FISHEYE” uses k1 through k4.
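
As a minimal sketch of how these fields could be consumed, the code below builds the standard 3x3 pinhole matrix from the focal lengths and principal point, and selects the distortion parameters that apply to each camera model. The names `DISTORTION_KEYS`, `intrinsics_matrix`, and `distortion_params` are hypothetical and not part of the repo, and treating a missing parameter as 0.0 is an assumption made for illustration.

```python
# Distortion parameters used by each camera model, as described above.
DISTORTION_KEYS = {
    "OPENCV": ("k1", "k2", "p1", "p2"),
    "OPENCV_FISHEYE": ("k1", "k2", "k3", "k4"),
}


def intrinsics_matrix(meta):
    """Build the 3x3 pinhole matrix K from the top-level intrinsics."""
    return [
        [meta["fl_x"], 0.0, meta["cx"]],
        [0.0, meta["fl_y"], meta["cy"]],
        [0.0, 0.0, 1.0],
    ]


def distortion_params(meta):
    """Return the distortion parameters relevant to meta["camera_model"]."""
    model = meta["camera_model"]
    if model not in DISTORTION_KEYS:
        raise ValueError(f"Unsupported camera_model: {model!r}")
    # Assumption: a missing parameter is treated as 0.0 (no distortion).
    return {key: meta.get(key, 0.0) for key in DISTORTION_KEYS[model]}


# Values copied from the example intrinsics shown earlier.
meta = {
    "fl_x": 1072.281897246229,
    "fl_y": 1068.6906965388932,
    "cx": 1504.0,
    "cy": 1000.0,
    "w": 3008,
    "h": 2000,
    "camera_model": "OPENCV_FISHEYE",
    "k1": 0.03126218448029553,
    "k2": 0.005177020067511987,
    "k3": 0.0006640977794272005,
    "k4": 0.00010067035656515042,
}
K = intrinsics_matrix(meta)
dist = distortion_params(meta)
```

Note that the `// ...` comments in the example file are explanatory only; strict JSON parsers such as Python's `json` module reject them, so a real transforms.json should omit them.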

Camera extrinsics#

Each transform matrix is a 4x4 camera-to-world matrix. Its first three columns are the camera's +X, +Y, and +Z axes, which define the camera orientation, and the X, Y, Z values in the fourth column define the camera origin. The last row is [0, 0, 0, 1], for compatibility with homogeneous coordinates.

{
  // ... intrinsics parameters
  "frames": [
    {
      "file_path": "images/frame_00001.jpeg",
      "transform_matrix": [
        // [+X0 +Y0 +Z0 X]
        // [+X1 +Y1 +Z1 Y]
        // [+X2 +Y2 +Z2 Z]
        // [0.0 0.0 0.0 1]
        [1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, 0.0, 1.0]
      ]
      // Additional per-frame info
    }
  ]
}
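
The layout of the transform matrix can be made concrete with a short sketch that splits it into the camera origin and axis vectors. The function name `camera_pose` is hypothetical and not part of the repo; the look-at direction follows from our camera convention, in which the camera looks along -Z.

```python
def camera_pose(transform_matrix):
    """Split a 4x4 camera-to-world matrix into origin and axis vectors."""
    rows = transform_matrix[:3]
    right = [row[0] for row in rows]   # camera +X axis in world space
    up = [row[1] for row in rows]      # camera +Y axis in world space
    back = [row[2] for row in rows]    # camera +Z axis in world space
    origin = [row[3] for row in rows]  # camera position in world space
    look_at = [-v for v in back]       # the camera looks along its -Z axis
    return origin, right, up, look_at


frame = {
    "file_path": "images/frame_00001.jpeg",
    "transform_matrix": [
        [1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, 0.0, 1.0],
    ],
}
origin, right, up, look_at = camera_pose(frame["transform_matrix"])
```

For the identity matrix in this example, the camera sits at the world origin and looks along the world -Z axis.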

Depth images#

To train with depth supervision, you can also provide a depth_file_path for each frame in your transforms.json and use one of the methods that support additional depth losses (e.g., depth-nerfacto). Depth values are assumed to be 16-bit or 32-bit and expressed in millimeters, to remain consistent with Polyform. You can adjust the corresponding scaling factor using the depth_unit_scale_factor parameter in NerfstudioDataParserConfig. Note that by default, we resize the depth images to match the shape of the RGB images.

  "frames": [
    {
      // ...
      "depth_file_path": "depth/0001.png"
    }
  ]

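The unit conversion implied by the scaling factor can be sketched as follows. This is only an illustration: the function name `scale_depth` is hypothetical, the depth image is represented as nested lists rather than a loaded image, and the factor 0.001 (millimeters to meters) is an assumption, so check NerfstudioDataParserConfig for the value actually used.

```python
def scale_depth(raw_depth_rows, depth_unit_scale_factor=0.001):
    """Multiply every raw depth value by the scale factor."""
    return [
        [value * depth_unit_scale_factor for value in row]
        for row in raw_depth_rows
    ]


# A tiny 2x2 depth image holding 16-bit values in millimeters.
raw_depth_mm = [
    [1500, 2000],
    [0, 65535],
]
depth_scaled = scale_depth(raw_depth_mm)
```

With the assumed factor, a raw value of 1500 becomes approximately 1.5.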


Masks#

The current implementation of masking is inefficient and will cause large memory allocations.

There may be parts of the training images that should not be used during training (e.g., moving objects such as people). These regions can be masked out using an additional mask image that is specified in the frame data.

  "frames": [
    {
      // ...
      "mask_path": "masks/mask.jpeg"
    }
  ]

The following mask requirements must be met:

  • Must be 1 channel with only black and white pixels

  • Must be the same resolution as the training image

  • Black corresponds to regions to ignore

  • If used, all images must have a mask
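
The requirements above could be checked before training with a sketch like the one below. The function names are hypothetical and not part of the repo. The sketch assumes the mask has already been loaded as rows of 8-bit single-channel pixel values, with black stored as 0 and white as 255, and the second file path is made up for the example.

```python
def check_mask_pixels(mask_rows, image_width, image_height):
    """Check a single-channel mask given as rows of 8-bit pixel values."""
    if len(mask_rows) != image_height:
        return False
    for row in mask_rows:
        if len(row) != image_width:
            return False
        if any(pixel not in (0, 255) for pixel in row):
            return False  # only pure black (0) and white (255) are allowed
    return True


def check_mask_coverage(frames):
    """Return True if either every frame or no frame has a mask_path."""
    with_mask = sum(1 for frame in frames if "mask_path" in frame)
    return with_mask in (0, len(frames))


frames = [
    {"file_path": "images/frame_00001.jpeg", "mask_path": "masks/mask.jpeg"},
    {"file_path": "images/frame_00002.jpeg"},  # hypothetical frame without a mask
]
mask = [
    [0, 255, 255],
    [255, 255, 0],
]
mask_ok = check_mask_pixels(mask, image_width=3, image_height=2)
coverage_ok = check_mask_coverage(frames)
```

Here the 3x2 mask passes the pixel checks, while `check_mask_coverage` reports a problem because the second frame has no mask_path.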