Zero-Copy in Matrix API

Zero-Copy in DekTec Matrix API® 2.0

Turbocharge 4K processing by saving CPU cycles and memory bandwidth

DekTec's Matrix API® 2.0 is an easy-to-use API to create software-based video-processing solutions that receive and transmit SDI streams through our PCIe cards. The new Zero-Copy feature reduces the overhead of transferring video between card and PC and makes it as efficient as possible. This article explains the inner workings of Zero-Copy.

Matrix API for SDI Applications

The Matrix API is a set of C++ classes that allows real-time processing of audio and video in software. It is widely used in PC-based professional A/V applications, such as video encoders and decoders, that are software-based yet require real-time operation and low latency. Especially when interfacing via DekTec's SDI PCIe cards, the Matrix API makes writing applications much easier, completely hiding the complexities of the SDI layer.

However, when pushing the boundaries, e.g. all-software solutions for encoding UHD or multichannel video, there are a couple of system-level challenges to overcome:

First, processing video streams consumes a lot of CPU cycles, so writing optimized code that uses the cores efficiently is essential. For this reason, the pixel conversions in the Matrix API make extensive use of SSE/AVX instructions. This greatly reduces the CPU load caused by the Matrix API, leaving as many CPU cycles as possible for the user application.
Second, and perhaps less obvious, memory bandwidth is also a limited resource. When video frames are moved around in host memory, the bandwidth consumed quickly becomes very large, especially when 4K video is processed. For example, a single UHD frame is about 20 MB of data, and at 60 fps this equates to about 10 Gbit/s. A copy operation requires the data to be both read and written, so the total bandwidth is a whopping 20 Gbit/s.
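The figures above can be reproduced with a few lines of arithmetic. The sketch below is only an illustration, and the function names are made up for this example. It assumes the active picture only (3840x2160) and 20 bits per pixel on average, which is what 10-bit 4:2:2 video amounts to; blanking and ancillary data are ignored.

```cpp
#include <cassert>
#include <cstdint>

// Bytes in one frame of width x height pixels at bitsPerPixel bits per pixel.
std::uint64_t frameBytes(std::uint64_t width, std::uint64_t height,
                         std::uint64_t bitsPerPixel)
{
    const std::uint64_t bits = width * height * bitsPerPixel;
    assert(bits % 8 == 0 && "frame must be a whole number of bytes");
    return bits / 8;
}

// Bit rate of one pass over the stream (reading it once or writing it once).
std::uint64_t passBitsPerSecond(std::uint64_t bytesPerFrame, std::uint64_t fps)
{
    return bytesPerFrame * 8 * fps;
}

// A copy reads every byte once and writes it once, so it costs two passes.
std::uint64_t copyBitsPerSecond(std::uint64_t bytesPerFrame, std::uint64_t fps)
{
    return 2 * passBitsPerSecond(bytesPerFrame, fps);
}
```

Under these assumptions, frameBytes(3840, 2160, 20) gives 20,736,000 bytes. At 60 fps, one pass is about 9.95 Gbit/s and a copy about 19.9 Gbit/s, in line with the rounded 10 and 20 Gbit/s quoted above.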

In practice, memory bandwidth can prove to be a bigger bottleneck than CPU power. To help reduce memory bandwidth, the latest Matrix API version includes a new feature called Zero-Copy. In the rest of this article we'll explain how this feature works.

Inner Workings of the Matrix API

The diagram below shows the standard flow of SDI data when using the Matrix API with the DTA-2174B (as an example).

Figure 1. Standard Matrix API flow: Video, audio and ANC data is stored in the frame.

From left to right:

The DTA-2174B receives an SDI frame and writes it to a cyclic DMA buffer in host memory using its on-board high-performance DMA engine.
A process in the Matrix API waits for an interrupt from the DekTec card indicating that a portion of an SDI frame (typically ¼ frame) is available in the DMA buffer.
Upon receiving this interrupt, the Matrix API begins parsing SDI frames from the DMA buffer, extracting video, embedded audio and ancillary data packets. For the video, conversion from the native 10-bit UYVY video format to the user-configured pixel format is performed using SSE/AVX instructions.
When all SDI data has been parsed and converted to a Frame structure, a user-supplied call-back function is called so that frame processing can start.

Note: The Matrix API flow for SDI output (e.g. in a decoder) is similar, but in the opposite direction.

The Cost of Convenience

In the flow described above, the user application reads (or in the case of output, writes) the video data in the pixel format most convenient for further processing. The downside of this convenience is that the pixel conversion routine effectively copies the video, while transforming the pixels, from the DMA buffer to an intermediate video buffer. This “copy” in the Matrix API results in additional traffic to and from memory, increasing the memory bandwidth used.
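To make the cost of this "copy" concrete, here is a minimal scalar sketch of a conversion routine. It is not the Matrix API's own code, which uses SSE/AVX, and the function name is hypothetical. It also assumes, purely for illustration, that each 10-bit sample is stored in the low bits of a 16-bit word; the actual packing of the native format is not described in this article.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical 10-bit to 8-bit conversion (illustration only).
// Assumption: each 10-bit sample occupies the low bits of a 16-bit word.
// Returns the memory traffic in bytes: every sample is read once from the
// source buffer and written once to the destination buffer.
std::size_t convert10To8(const std::uint16_t* src, std::uint8_t* dst,
                         std::size_t count)
{
    assert(src != nullptr && dst != nullptr);
    for (std::size_t i = 0; i < count; ++i) {
        // Keep the 8 most significant of the 10 bits.
        dst[i] = static_cast<std::uint8_t>((src[i] & 0x3FF) >> 2);
    }
    return count * (sizeof(std::uint16_t) + sizeof(std::uint8_t));
}
```

Even though the routine does useful work, it touches every sample twice: once on the read side and once on the write side. That second buffer is exactly the traffic an application avoids when it consumes the source data in place.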

For applications that already require high memory bandwidth, such as video encoders and decoders, every avoidable copy is worth a lot. Enter Zero-Copy.

Zero-Copy to the Rescue

The Zero-Copy feature eliminates the copy from the driver's DMA buffer to a video buffer (or vice versa for output). It does so by giving the user application direct access to the video data in the DMA buffer. This avoids the copy, but it does mean that the video is only available to the user application in its native 10-bit UYVY format.

For a high-performance encoder implementation, it is usually more efficient to combine the pixel conversion with the encoding process itself than to let the Matrix API convert the pixel format before the video is encoded. The diagram below again shows the flow of data for an encoder type of application, but this time using Zero-Copy for video.

Figure 2. Zero-Copy for video: The user reads video directly from the DMA buffer via line pointers.

The core idea is that video data is not transferred from the DMA buffer to a frame buffer, but the frame structure is given a list of pointers to the individual video lines in the DMA buffer. This way, the user callback function can iterate over the pointers, read the video lines directly from the DMA buffer, and feed the video data to the pixel processing pipeline.
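The line-pointer idea can be sketched as follows. Everything in this example is hypothetical: ZeroCopyFrame, mapFrame and onFrame are stand-ins invented for illustration, not Matrix API types or functions. The sketch also assumes that a video line never straddles the wrap-around point of the cyclic buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a frame in zero-copy mode (not a Matrix API type).
struct ZeroCopyFrame {
    std::vector<const std::uint8_t*> lines;  // one pointer per video line
    std::size_t bytesPerLine = 0;            // valid bytes behind each pointer
};

// Builds the pointer list for a frame starting at byte offset `start` in a
// cyclic buffer. No video data is moved; only pointers are stored.
// Assumption: buffer size and start offset are multiples of the line stride,
// so a line never wraps around the end of the buffer.
ZeroCopyFrame mapFrame(const std::vector<std::uint8_t>& dmaBuffer,
                       std::size_t start, std::size_t stride,
                       std::size_t bytesPerLine, std::size_t numLines)
{
    assert(!dmaBuffer.empty() && stride > 0 && stride >= bytesPerLine);
    assert(dmaBuffer.size() % stride == 0 && start % stride == 0);
    assert(numLines * stride <= dmaBuffer.size());

    ZeroCopyFrame frame;
    frame.bytesPerLine = bytesPerLine;
    frame.lines.reserve(numLines);
    for (std::size_t i = 0; i < numLines; ++i) {
        const std::size_t offset = (start + i * stride) % dmaBuffer.size();
        frame.lines.push_back(dmaBuffer.data() + offset);
    }
    return frame;
}

// Example user callback: reads every line in place and returns a checksum,
// standing in for an encoder that consumes the native format directly.
std::uint64_t onFrame(const ZeroCopyFrame& frame)
{
    std::uint64_t sum = 0;
    for (const std::uint8_t* line : frame.lines) {
        for (std::size_t i = 0; i < frame.bytesPerLine; ++i) {
            sum += line[i];
        }
    }
    return sum;
}
```

Because the frame holds one pointer per line rather than a single base address, the callback does not need to know where the frame starts in the cyclic buffer or whether it wraps around the end.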

Note: For an application with SDI output, the Zero-Copy concept works the same, but the other way around: the application writes the video directly to the DMA buffer, at the location indicated by the line pointers.
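For the output direction, a comparable sketch uses writable line pointers. Again, ZeroCopyOutputFrame and renderFlat are hypothetical names chosen for this illustration, and the byte value written is a placeholder rather than valid 10-bit UYVY data.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical writable frame for SDI output (not a Matrix API type).
struct ZeroCopyOutputFrame {
    std::vector<std::uint8_t*> lines;  // one writable pointer per video line
    std::size_t bytesPerLine = 0;      // bytes the application may write per line
};

// Example producer: writes its output straight to the location indicated by
// each line pointer, without filling an intermediate video buffer first.
// Bytes between lines (padding, if any) are left untouched.
void renderFlat(ZeroCopyOutputFrame& frame, std::uint8_t value)
{
    for (std::uint8_t* line : frame.lines) {
        assert(line != nullptr);
        std::fill_n(line, frame.bytesPerLine, value);
    }
}
```

A real decoder would replace the flat fill with its own routine that emits pixels in the card's native format, one line at a time.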

Conclusion

When reading or writing 4K UHD video at 60 fps with the Matrix API, the new Zero-Copy feature saves about 20 Gbit/s of memory bandwidth. This is a significant portion of the total memory bandwidth available in a PC, and it can be put to good use for other purposes. As a result, multi-core systems in particular will run much more smoothly.