Spatial Video Tool Documentation

The spatial video tool has five commands:

  • info – reports information about a spatial video file.
  • export – exports from a spatial video to a flat video in full or half over/under and side-by-side formats. It can also export to two separate videos, one for the left eye and another for the right eye.
  • make – makes a spatial video from a flat video in one of the supported formats. It can also make a spatial video from two separate input videos, one for the left eye and another for the right eye.
  • combine – combines (muxes) already-encoded audio and MV-HEVC video files into a single output file.
  • metadata – reports and modifies spatial video metadata in an input file and writes it to an output file.

To find the version number of the spatial tool:

./spatial --version

Info

The info command reports information about a spatial video file. This is a great way to interrogate a file to view its spatial metadata. If a spatial entry is present in the file, it will be reported here. For help about the info command and its available parameters:

./spatial help info

To use the info command on a file:

./spatial info --input spatial_file.mov

Where spatial_file.mov is a path to the spatial video file you’d like to investigate.

To get more detailed information about metadata, pass the --debug flag. This lists more technical properties of the file, though it can be more difficult to read.

./spatial info --input spatial_file.mov --debug

The final feature of the info command is to probe a file to see if it contains any spatial video. The probe feature is mostly useful as part of another script. It returns an exit code of 0 if the file contains a spatial video track; otherwise, it returns 1.

./spatial info --input spatial_file.mov --probe
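Because the result is reported through the process exit code, the probe is easy to wrap in a script. Here's a minimal Python sketch; the is_spatial helper and its tool parameter are my own additions for illustration, not part of the tool:

```python
import subprocess

def is_spatial(path, tool="./spatial"):
    # The --probe flag exits with 0 when the file contains a spatial
    # video track, and 1 otherwise; only the exit code matters here.
    result = subprocess.run(
        [tool, "info", "--input", path, "--probe"],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

For example, [f for f in files if is_spatial(f)] filters a list of paths down to just the spatial ones.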

Parameter Names and Order

In the above commands, the long format of each parameter has been used. Long parameter names are preceded by two dashes. Because they’re longer, they’re generally easier to read and understand. However, many of the parameters also have short one-letter versions, and you can discover them by using the aforementioned help feature. For example, the short version of --input is -i (and you can tell, because it only has a single dash). Short versions can make commands easier to type.

Parameters can be specified in any order for all commands. The only thing that has to come first is the name of the command itself (i.e. info, export, make, combine, or metadata).

Export

The export command converts video from a spatial format (perhaps captured by iPhone 15 Pro or Apple Vision Pro) to a standard, flat video file or to two separate video files. These video files can be used in many VR/spatial video players and editors. Because most video tools do not yet understand the new MV-HEVC format, this is a great way to obtain a video that can be edited with common tools. For help about the export command and its available parameters:

./spatial help export

The input video is expected to be an MV-HEVC-encoded spatial video, and if it isn’t, the export command will fail with an error message. The export command needs to know how to format the output video. Four formats are available:

  • over/under ("ou") – this format puts the left eye image at the top of the frame and the right eye image at the bottom. This is a great format, because it preserves the full resolution of the input video while maintaining an overall aspect ratio that is friendly to many video tools.
  • half over/under ("hou") – just like over-under, except each eye is scaled vertically by 50%. This results in an output file that is the same height as the original video. This format is a compromise, because it reduces vertical resolution by half. But, it’s useful if the tool you need to use has limitations about the size of your video frames.
  • side-by-side ("sbs") – this format puts the left eye image at the left of the frame and the right eye image at the right. Like over/under, this format preserves the full resolution of the input video. The output video is twice as wide as the input, and this extra-wide aspect ratio can be a challenge for some tools.
  • half side-by-side ("hsbs") – just like side-by-side, except each eye is scaled horizontally by 50%. This results in an output file that is the same width as the original video. Like half over/under, this format is a compromise for tools that can’t handle the full-sized formats.
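To make these trade-offs concrete, here is a small sketch (my own arithmetic, not part of the tool) that computes the output frame size for each format from the per-eye resolution of the input:

```python
def export_size(eye_width, eye_height, fmt):
    # Full formats stack or tile both eyes at full resolution;
    # "half" formats scale each eye by 50% along the packing axis.
    sizes = {
        "ou":   (eye_width,     eye_height * 2),  # left on top, right below
        "hou":  (eye_width,     eye_height),      # each eye squashed vertically
        "sbs":  (eye_width * 2, eye_height),      # left and right, side by side
        "hsbs": (eye_width,     eye_height),      # each eye squashed horizontally
    }
    return sizes[fmt]
```

For a 1920 x 1080-per-eye input, ou yields a 1920 x 2160 frame and sbs a 3840 x 1080 frame, while the half formats stay at 1920 x 1080.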

To export from a spatial video file to an over/under format:

./spatial export --input spatial_file.mov --format ou \
  --output over_under.mov

To save typing, we can accomplish the same thing by using short parameter names:

./spatial export -i spatial_file.mov -f ou \
  -o over_under.mov

To automatically overwrite/replace the output file if it already exists (without prompting), pass the -y flag:

./spatial export -y -i input_file.mov -f ou \
  -o over_under.mov

Because we didn’t specify a video codec, the output has defaulted to hevc. However, if we want to be more explicit, we can include the name of the codec along with the desired video bitrate:

./spatial export -i spatial_file.mov -f ou \
  --vcodec hevc --bitrate 20M -o over_under.mov

This produces an output file that is encoded with hevc at a video bitrate of 20 Mbps. To encode to h264:

./spatial export -i spatial_file.mov -f ou \
  --vcodec h264 --bitrate 20M -o over_under.mov

Instead of providing a video bitrate, you can provide a quality setting for the h264 and hevc codecs. The value ranges from 0.0 (low quality) to 1.0 (visually lossless). A good starting value for experimentation is 0.5.

./spatial export -i spatial_file.mov -f ou \
  --vcodec hevc --quality 0.5 -o over_under.mov

When video is encoded with h264 or hevc, a full image (similar to a JPG) starts the data stream, and the following frames use a sophisticated technique to encode only the things that change. After some time, a new full image is placed in the data stream, again followed by changes, and this process repeats until the end of the file. These full image frames are called keyframes (or I-frames).

Because keyframes encode a full image, they require the most data; and the more keyframes in a video stream, the more bandwidth that’s required. As with anything related to video encoding, finding the best distance between keyframes is a bit of an art.

To specify a keyframe cadence, use the --maxkey parameter with a value that represents the maximum time between keyframes. If a value is not specified, the default is 2.0 seconds. This example sets the maximum to 4.0 seconds:

./spatial export -i spatial_file.mov -f ou \
  --vcodec hevc --quality 0.5 --maxkey 4.0 \
  -o over_under.mov

Choosing the best keyframe interval is a topic that is beyond the scope of this documentation. In general, though, longer durations are more bandwidth-efficient. It’s normal to see values between 1.0 and 6.0 seconds or so, but feel free to experiment!
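The bandwidth intuition can be made concrete with a little arithmetic of my own (this is not something the tool reports):

```python
import math

def min_keyframe_count(duration_s, maxkey_s=2.0):
    # One keyframe starts the stream, and at least one more appears
    # every maxkey_s seconds after that. Encoders may insert extras
    # (e.g., at scene cuts), so this is only a lower bound.
    return max(1, math.ceil(duration_s / maxkey_s))
```

A 60-second clip implies at least 30 keyframes at the default 2.0-second cadence, but only 15 at --maxkey 4.0, which is why longer intervals tend to be more bandwidth-efficient.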

The final two output codecs that can be used are proRes422 and proRes422HQ. These both provide high-quality output that is more suitable for professional applications. Note that these codecs use neither bitrate nor quality settings, so there is no need to include them (in fact, they will be ignored).

./spatial export -i spatial_file.mov -f ou \
  --vcodec proRes422 -o over_under.mov

By default, the first audio track in the input file is copied (not re-encoded) to the output file. Note that the bitrate parameter that can be passed on the command-line does not affect (nor account for) the audio track bitrate. If you don’t want to copy the audio track, you can add --no-audio to your command line, like:

./spatial export -i spatial_file.mov -f ou \
  --no-audio -o over_under.mov

If you need to swap the left- and right-eye images, use the --reverse flag:

./spatial export -i spatial_file.mov -f ou \
  --reverse -o over_under.mov

If you want to export to two video files: one for the left eye and a separate video file for the right:

./spatial export -i spatial_file.mov -o left_eye.mov \
  -o right_eye.mov

Notice that no format is needed, because each output video contains a full-resolution frame.

To scale the output, pass --scale followed by width and height values like: w=200:h=100, 200:100, 200x100, or 200:-1. A value of -1 maintains the original aspect ratio.

./spatial export -i spatial_file.mov -o left_eye.mov \
  -o right_eye.mov --scale 960x540
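If it helps to see the scale syntax pinned down, here is a sketch of a parser for the forms listed above; it reflects my reading of the syntax, not the tool’s actual implementation:

```python
def parse_scale(spec):
    # Accepts "w=200:h=100", "200:100", "200x100", or "200:-1".
    # Returns (width, height); -1 means "preserve aspect ratio".
    spec = spec.replace("w=", "").replace("h=", "")
    sep = "x" if "x" in spec else ":"
    w, h = spec.split(sep)
    return int(w), int(h)

def apply_scale(src_w, src_h, spec):
    # Resolve a -1 dimension against the source aspect ratio.
    w, h = parse_scale(spec)
    if w == -1:
        w = round(src_w * h / src_h)
    if h == -1:
        h = round(src_h * w / src_w)
    return w, h
```

So for a 1920x1080 source, --scale 960:-1 resolves to 960x540.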

To add st3d spatial metadata to the output file, use the --add-st3d flag to tag the stereo mode of the output video (this flag is ignored for multi-video output).

./spatial export -i spatial_file.mov -f ou \
  -o over_under.mov --add-st3d

Time Ranges

To process just a portion of the input video (even if it’s just to see if your settings are producing the results you’d like), you can specify three different time-related options that form a time range. Note that these are not intended for frame-accurate editing; they’re mostly intended to process a general subsection of the input video.

  • --ss – to specify a start time. The time can be a number of seconds or a time formatted like hh:mm:ss[.xxx]. If you only specify a start time, processing begins at that time and continues for the remainder of the input video. This command starts processing at 10.5 seconds:

    ./spatial export -i spatial_file.mov -f ou --ss 10.5 -o over_under.mov

  • --t – to specify a duration. The duration can also be a number of seconds or a time formatted like hh:mm:ss[.xxx]. The duration is relative to the start time. If a start time is not specified, the duration is relative to the start of the input video. So, to just process the first 5 seconds of an input video:

    ./spatial export -i spatial_file.mov -f ou --t 5 -o over_under.mov

    To start at 6 seconds and continue for 5:

    ./spatial export -i spatial_file.mov -f ou --ss 6 --t 5 -o over_under.mov

  • --to – to specify an end time. The end time can be a number of seconds or a time formatted like hh:mm:ss[.xxx]. If you only specify an end time, processing begins at the beginning of the input file and continues until the end time is reached. This command processes from the start of the video for 8 seconds:

    ./spatial export -i spatial_file.mov -f ou --to 8 -o over_under.mov

    If you also include a start time, video is processed from the start time to the end time. Note that unlike the duration parameter, the end time is relative to video time. So, to process everything starting at 3 seconds and continuing until 10 seconds:

    ./spatial export -i spatial_file.mov -f ou --ss 3 --to 10 -o over_under.mov

Note that it is an error to specify both a duration (--t) and an end time (--to).
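To summarize the three options, here is a sketch (my own code, not the tool’s) that parses both time formats and resolves --ss, --t, and --to into a concrete start/end pair, including the both-at-once error:

```python
def parse_time(value):
    # Accepts plain seconds ("10.5") or "hh:mm:ss[.xxx]" ("00:01:30.250").
    seconds = 0.0
    for part in str(value).split(":"):
        seconds = seconds * 60 + float(part)
    return seconds

def resolve_range(ss=None, t=None, to=None, duration=None):
    # Mirrors the rules above: --t is relative to the start time,
    # --to is absolute video time, and the two are mutually exclusive.
    if t is not None and to is not None:
        raise ValueError("specify a duration (--t) or an end time (--to), not both")
    start = parse_time(ss) if ss is not None else 0.0
    if t is not None:
        return start, start + parse_time(t)
    if to is not None:
        return start, parse_time(to)
    return start, duration
```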

Argument Files

If you find yourself wanting to use specific parameters and values over and over again, you can save yourself some effort and create a standard text file that contains those arguments. For example, let’s say you frequently export to over/under format using the hevc video codec at 20 Mbps and don’t need audio. Your text file might look like this:

-f ou --vcodec hevc --bitrate 20M --no-audio

If you called that file ou.args, you can include those values as if they were typed on the command line. Example:

./spatial export -i spatial_file.mov --args ou.args \
  -o over_under.mov

This is exactly the same as typing:

./spatial export -i spatial_file.mov -f ou \
  --vcodec hevc --bitrate 20M --no-audio \
  -o over_under.mov

Any of the parameters for any of the spatial commands can be included in an arguments file. Well, except for the --args parameter itself.

You can also include comments by starting a line with ’#’. Blank lines and extra whitespace are ignored, making this file equivalent to the prior example:

# Common over/under parameters
-f ou
--vcodec hevc
--bitrate 20M
--no-audio
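Conceptually, expanding an argument file is just tokenizing it with comments and blank lines skipped. A sketch of that idea (the tool’s actual parsing, e.g. of quoted values inside the file, may differ):

```python
def expand_args_file(text):
    # Splits an arguments file into command-line tokens, skipping
    # blank lines and '#' comment lines.
    tokens = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        tokens.extend(line.split())
    return tokens
```

For the file above, this yields the same tokens as typing -f ou --vcodec hevc --bitrate 20M --no-audio directly.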

Make

Make is the command that takes a flat, standard-format video (like the videos made by the export command) or two input videos and outputs an MV-HEVC spatial video. For help about the make command and its available parameters:

./spatial help make

Similar to the export command, the format of the input video needs to be specified (unless two video inputs are provided). The formats are the same as described in the export section (ou, hou, sbs, and hsbs).

No video codecs need to be defined for this command, because the input format is determined when reading the input file, and the output format is always MV-HEVC. A bitrate or quality setting for the output file can be specified (also described in the export section).

./spatial make -i over_under.mov -f ou --bitrate 20M \
  -o spatial_file.mov

While this command creates an MV-HEVC-encoded output file with left- and right-eye views, there are other spatial values that should be specified while encoding. In fact, to be listed as “SPATIAL” in Apple Photos, a couple of these values need to be specified. NOTE: It’s unclear at this time how all of these parameters will affect playback. Also, these values are only written as metadata in the output video; they do not affect how videos are processed by the spatial tool itself.

  • --cdist – this is the distance (in millimeters) between the centers of the two camera lenses that were used to capture the video. This value can contain a decimal value but no units, like 19.24. If you know the camera distance, it should be included.
  • --hfov – this is the horizontal field-of-view of the video capture in degrees. For example, 63.4. If you know the horizontal field-of-view, it should be included. Note that this value is required, along with --cdist and/or --hadjust, for a video to be listed as “SPATIAL” in Apple Photos.
  • --hadjust – this is the horizontal disparity adjustment and it reflects the amount of overlap (-1.0 to 1.0) that should be applied to the left- and right-eye frames during playback. For compatible players, this should influence the amount of depth that is present in the frame. Positive values increase apparent depth and negative numbers reduce it.
  • --projection – this specifies the projection type of the video. For standard video, like from iPhone 15 Pro, this value is rect for rectilinear. Other values are equirect, halfEquirect, and fisheye (see Projection Types section below).
  • --hero – this sets which eye (left or right) should be considered the “hero” or best eye to show when played back in a spatial-aware player. For example, if a thumbnail or UI element needs to show a monoscopic version of the video, it could choose this hero eye for playback. This will not affect which eye is displayed in a standard 2D video player.
  • --primary – this determines which eye (left or right) is played back in a standard, non-spatial-aware 2D video player. MV-HEVC files contain two or more video layers, and video players that play back standard HEVC video will only play the first layer (and ignore the others). This setting determines which eye is encoded to the first layer in the video file. On iPhone 15 Pro, this setting typically matches the camera with the highest-quality image (which might be right in one orientation of the phone, and left when it’s captured upside down).
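With this many parameters, it can help to build the command programmatically. A hypothetical helper (the function and its keyword-to-flag convention are my own, not part of the tool):

```python
def build_make_command(input_path, output_path, fmt, **spatial):
    # Assembles a `spatial make` invocation from keyword settings,
    # e.g. cdist=19.24, hfov=63.4, projection="rect". Keywords are
    # sorted only to keep the output deterministic.
    cmd = ["./spatial", "make", "-i", input_path, "-f", fmt]
    for key, value in sorted(spatial.items()):
        cmd += [f"--{key}", str(value)]
    cmd += ["-o", output_path]
    return cmd
```

The resulting list can be passed directly to subprocess.run.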

If you’re interested in a deeper dive about these parameters, check out my Encoding Spatial Video post.

So, for a rectilinear input video that is similar to what would be captured on an iPhone 15 Pro, these settings could be used:

./spatial make -i over_under.mov -f ou --cdist 19.24 \
  --hfov 63.4 --hadjust 0.02 --projection rect \
  --hero right --primary right -o spatial_file.mov

For a 180-degree (a.k.a. half equirectangular) video, something like:

./spatial make -i over_under.mov -f ou --cdist 63.0 \
  --hfov 180.0 --hadjust 0.0 \
  --projection halfEquirect --hero right \
  --primary right -o spatial_file.mov

For a 360-degree (a.k.a. equirectangular) video:

./spatial make -i over_under.mov -f ou --cdist 63.0 \
  --hfov 360.0 --hadjust 0.0 --projection equirect \
  --hero right --primary right -o spatial_file.mov

Don’t forget the tip where you can put frequently-used arguments in a file and reference the file on the command line. It’s very useful for make commands that have a lot of parameters.

To make a spatial video from two separate video files (one for the left eye and one for the right):

./spatial make -i left_eye.mov -i right_eye.mov \
  -o spatial_file.mov

Notice that no format is needed, because each input video is a full-resolution frame. Audio is copied from the first input that contains an audio track (if any).

To improve start times for progressive downloads, add the --faststart flag. This first encodes video to a temporary file, then copies the data to the final output file, adding the moov atom at the front.

Also, just like with the export command, you can specify time ranges for the make command using the --ss, --t, and --to parameters described in that section.

The --scale parameter (described in the export section) is also available for the make command.

Projection Types

Apple’s spatial media specifications currently include four different projection types. These projections indicate how each video frame is formatted (i.e. what a raw video frame actually looks like) and how each formatted frame should be displayed to each eye.

For a deeper dive into projection formats, read Apple’s Mysterious Fisheye Projection.

Rectilinear

The first projection type is called rectilinear, and the --projection parameter for this type is just rect. You can think of this type as what you see when you watch a traditional 3D movie.

A rectilinear frame looks just like what you’d expect…an image that represents what would be shown on a rectangular screen. Horizontal and vertical lines aren’t curved or distorted. This is what Apple refers to as spatial video, and it’s what iPhone 15 Pro and Apple Vision Pro devices capture.

Equirectangular

The second type is called equirectangular, and the --projection parameter for this type is equirect. Equirectangular images (sometimes called lat/lons) typically represent a full 360 degree by 180 degree image that completely wraps around the viewer like a sphere. In fact, when it’s projected, it is often mapped to a sphere with your viewpoint right in the middle.

If you think about it, you can’t take a normal rectangular video frame and wrap it completely around a sphere without gaps. To work around this problem and to make it possible to encode 360 degree by 180 degree content in a rectangular video frame, there is a simple mapping that converts points on a sphere to points on the encoded frame. And when it’s played back, the same mapping is performed in reverse.

If you look at an equirectangular video frame in its raw format, everything appears to be warped and distorted. This is due to the mapping algorithm.

Full equirectangular projections are the most common format for 360-degree content, and that’s why you’d choose this projection. Note that the spatial tool doesn’t perform any remapping of images, but if you include equirectangular content (generated via some other means), be sure to add this projection type to the metadata of your video.

Half Equirectangular

A half equirectangular projection is simply a normal equirectangular projection that is half as wide, representing 180 degree by 180 degree content. The --projection parameter for this type is halfEquirect.

This is the most common format for content that has a 180-degree (or thereabouts) horizontal field-of-view, and it is typically mapped to a hemisphere during playback. Again, the spatial tool doesn’t perform any remapping of video frames, but if you provide frames in this format, be sure to specify this projection type.

Fisheye

The fourth and final projection type is fisheye, and the --projection parameter for this type is fisheye. There is no documentation about this projection type, though it is typically a square (1:1 aspect ratio) video frame that contains a spherical image that is the output of a fisheye camera lens. Fisheye projections normally represent a horizontal field-of-view of 180 (or so) degrees.

Interestingly, the Immersive Videos that Apple provides in the Apple TV app are encoded in the fisheye projection type. They’ve encoded the highest quality version of those videos at 4320 x 4320, 90fps, HDR10 with a ~50Mbps bitrate. Read more about this format in Apple’s Mysterious Fisheye Projection.

Combine

The combine command muxes an already-encoded audio file with an already-encoded video file. It’s useful if you have a video file and would like to quickly add an audio file without re-encoding the entire video. For help about the combine command and its available parameters:

./spatial help combine

Because the spatial tool relies on Apple’s AVFoundation, the output file must have a .mov extension. This is a technical requirement of the framework.

Input files with the following extensions should work: .ac3, .m4a, .flac, .wav, .mp3, .mp4, and .mov. Note that other formats may also work, so feel free to try them.

The combine command requires input audio (-a) and input video (-v) files:

./spatial combine -a audio.m4a -v spatial_file.mov \
  -o combined.mov

Metadata

The metadata command reports and modifies spatial video metadata in an input file and writes it to an output file. This command can also list information about available metadata and create --args file templates. For help about the metadata command and its available parameters:

./spatial help metadata

This command works with Spherical Video V1 (uuid), Spherical Video V2 (st3d, sv3d), and Apple HEVC Stereo Video (vexu, hfov) metadata keys.

To view spatial metadata in an input file:

./spatial metadata -i spatial_file.mov

Which results in output like:

Input: spatial_file.mov

vexu:cameraBaseline                = 19.24
vexu:eyeViewsReversed              = false
vexu:hasAdditionalViews            = false
vexu:hasLeftEyeView                = true
vexu:hasRightEyeView               = true
vexu:heroEyeIndicator              = right
vexu:horizontalDisparityAdjustment = 0.02
vexu:horizontalFieldOfView         = 63.4

Each metadata entry is identified by a specific key that includes its category prefix followed by a colon and its parameter name. Available categories are: st3d, sv3d, uuid, and vexu.

To show more information about a specific metadata tag (like its data type, description, possible values, and default value), use --show:

./spatial metadata --show vexu:cameraBaseline

To see all of the available tags for a specific metadata category:

./spatial metadata --show vexu

Setting Values

To add or modify metadata, use --set followed by the metadata key, an equals sign, and the value to set. No spaces are allowed before or after the equals sign. So, to set the vexu camera baseline:

./spatial metadata -i spatial_file.mov \
  -o metadata.mov --set vexu:cameraBaseline=65.0

To set multiple values, you can add another --set:

./spatial metadata -i spatial_file.mov \
  -o metadata.mov --set vexu:cameraBaseline=65.0 \
  --set vexu:heroEyeIndicator=left

Or, you can include multiple key=value settings separated by commas (with no spaces around the commas). This has the same effect as the prior example:

./spatial metadata -i spatial_file.mov \
  -o metadata.mov \
  --set vexu:cameraBaseline=65.0,vexu:heroEyeIndicator=left

If the value contains spaces, surround it with quotation marks:

./spatial metadata -i spatial_file.mov \
  -o metadata.mov \
  --set sv3d:projection=equirectangular,sv3d:metadataSource="Encoded with the Spatial Tool"

Note that because sv3d metadata requires a projection, it was necessary to include it in the --set statement.
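For scripting, a combined --set argument decomposes predictably. A sketch of that decomposition (my own code; note the shell has already stripped the quotation marks by the time the tool sees the value, and values that themselves contain commas are not handled here):

```python
def parse_set_argument(arg):
    # Splits "cat:key=value,cat:key=value" into a dict, splitting each
    # pair on the first '=' only.
    settings = {}
    for pair in arg.split(","):
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"missing '=' in {pair!r}")
        settings[key] = value
    return settings
```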

Removing Values

Normally, all existing metadata values are carried forward from the input file to the output file and added or modified by the --set parameters along the way. To specifically remove metadata values for a category, use --remove:

./spatial metadata -i spatial_file.mov \
  -o metadata.mov --remove sv3d

Like the --set parameter, multiple categories can be removed by specifying multiple --remove parameters or by listing categories separated by commas:

./spatial metadata -i spatial_file.mov \
  -o metadata.mov --remove st3d,sv3d

Use the special category all to remove all spatial metadata.

Note that all metadata removals are processed before metadata sets.

Templates

To make it easier to modify and set multiple metadata values, a template can be created to later be passed as an argument file. For example, to create a template for the vexu metadata category:

./spatial metadata --template vexu

Which outputs:

# Use --args <file> to include these metadata settings in a command line.
# Uncomment and edit lines to pass parameters.

# vexu
#--remove vexu
#--set vexu:cameraBaseline=
#--set vexu:eyeViewsReversed=false
#--set vexu:hasAdditionalViews=false
#--set vexu:hasLeftEyeView=true
#--set vexu:hasRightEyeView=true
#--set vexu:heroEyeIndicator=none|left|right
#--set vexu:horizontalDisparityAdjustment=
#--set vexu:horizontalFieldOfView=
#--set vexu:projectionKind=rectilinear|equirectangular|halfEquirectangular|fisheye

To save a template for later use, specify an output file:

./spatial metadata --template vexu -o vexu.args

Now, you can edit the vexu.args file in any standard text editor. Notice that this template includes all available vexu metadata along with a default value or a list of possible values separated by the pipe character. Initially, each line is commented-out, effectively disabling it.

To set a value, un-comment its line by removing the pound sign, then edit the value after the equals sign. For values with a list of options, remove everything but the value you’d like to set.

After saving your edits, include the file as an argument list:

./spatial metadata -i spatial_file.mov \
  -o metadata.mov --args vexu.args

Feedback

Let me know if I missed anything or if you have questions or feedback.