What Everyone Ought to Know about SIP Video

Trying to figure out the H.264 resolution in a SIP Video call

One day, I was trying to figure out what maximum resolution was supported by the far end in a SIP video call.

Yes, you can check the statistics on the VC unit showing the negotiated codec, resolution, frame rate, etc. But those only show what is currently being used, not what the far end is capable of. How do you know the current resolution isn't just the result of a bandwidth limitation?

SDP Will Tell You The Truth

What would you normally do when trying to check the capabilities of the far end in a SIP video call? Have a look at the SDP 🙂

This is part of an SDP advertised by Webex for a VC unit dialing in. Only the relevant parts of the first video m line are shown.

a=rtpmap:97 H264/90000 
a=fmtp:97 profile-level-id=42E014;max-mbps=108000;max-fs=3600

Looks like rubbish, right?
Actually it’s quite easy to decode but wait, where’s the resolution?
Where is the famous 1080p30 or 720p60?

You’d be surprised to know (hopefully) that in H.264 you do not negotiate resolution but macroblocks!

What Is a Macroblock?

A Macroblock is a block of 16 x 16 pixels.
An image size can be expressed as a number of macroblocks (the frame size). How many macroblocks are processed every second is called the macroblock processing rate. Dividing the processing rate by the frame size gives the frame rate.

Let’s have a look at an example of the two most popular resolutions in video conferencing today – 1080p and 720p.

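In the case of 1080p (1920×1080), the frame is 120 macroblocks wide and 68 high (1080 divided by 16 is 67.5, rounded up), so 120 multiplied by 68 equals 8160 Macroblocks.
For 720p (1280×720), it's 80 times 45, resulting in 3600 Macroblocks.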

At the most commonly used frame rate of 30 fps, the macroblock processing rate for 1080p30 is 244800 macroblocks per second (MB/s, not to be confused with megabytes), while for 720p30 it's 108000 MB/s.
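If you'd rather let a script do the arithmetic, here is a minimal Python sketch of the calculation above. The function names are made up for this post, and it assumes dimensions that are not a multiple of 16 are rounded up to whole macroblocks.

```python
import math

MACROBLOCK_EDGE = 16  # a macroblock is 16 x 16 pixels


def frame_size_in_macroblocks(width: int, height: int) -> int:
    """Number of macroblocks needed to cover one frame."""
    columns = math.ceil(width / MACROBLOCK_EDGE)
    rows = math.ceil(height / MACROBLOCK_EDGE)  # 1080 / 16 = 67.5 -> 68
    return columns * rows


def macroblock_rate(width: int, height: int, fps: int) -> int:
    """Macroblocks processed per second at the given frame rate."""
    return frame_size_in_macroblocks(width, height) * fps


for name, (w, h) in {"1080p": (1920, 1080), "720p": (1280, 720)}.items():
    size = frame_size_in_macroblocks(w, h)
    rate = macroblock_rate(w, h, 30)
    print(f"{name}: {size} macroblocks per frame, {rate} macroblocks/s at 30 fps")
```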

You can probably already see how those numbers relate to the SDP example above. But there is one more thing we need to discuss.

What About Profiles and Levels?

When advertising H.264 capability in a SIP video call, a Profile and a Level must be included.
Optional parameters may also be included, and in practice you will notice that they are always present.
They can increase the signaled capabilities above what the Profile and Level define, but cannot reduce them.

So what are those mysterious Profiles and Levels?

A Profile defines a set of codec features that are supported.
If you’re interested in the different profiles and features, have a look at Wikipedia.
In practice, especially when dealing with Cisco, you’ll notice the Baseline Profile being advertised.

A Level specifies the performance required of the decoder. A decoder can be expected to decode video within the constraints a specific Level defines. Maximum frame size, macroblock processing rate and bit rate are the important ones.

To look up different levels, the best place is Table A-1 from the H.264 specification or the Wikipedia article mentioned before.

As an example, Level 3.1 is defined with a macroblock processing rate of 108000 per second and a frame size of 3600 macroblocks.
Those numbers look familiar, right? (If not, have a look at the SDP and Macroblock section above.)

We already know that 1280×720 resolution is in fact 3600 macroblocks. If you want to have a moving picture, that’s typically 30 frames per second, which gives us 108000 MB/s or 720p30.

The only problem with Level 3.1 is that it also implies support for a video bit rate of up to 14 Mbps.
This is not something most VC units support. As an example, most of the Cisco VC devices support up to 6 Mbps on a point-to-point SIP video call.

This is where the optional parameters mentioned earlier come into play.
As they can only be used to increase the signaled capabilities, it's not possible to signal Level 3.1 and reduce the bandwidth requirement. What can be done (and is done) though, is to signal a lower level, e.g. Level 2.0 (which has a maximum bit rate of only 2 Mbps), and then increase the macroblock processing rate and maximum frame size using the following optional parameters as defined in the H.241 specification (they appear in the SDP as max-mbps and max-fs):

CustomMaxMBPS – indicates that the decoder has a higher processing rate capability.
In other words – it’s used to increase the signaled macroblock processing rate above what is defined by the Level.

CustomMaxFS – indicates that the decoder can decode larger picture (frame) sizes.
Again, it’s used to increase the maximum supported resolution above what is defined by the Level.
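The "increase only" rule can be sketched roughly like this. It's an illustration, not how any particular endpoint implements it: the names LEVEL_LIMITS and effective_limits are hypothetical, the table holds only the two levels discussed in this post (values from Table A-1, Baseline Profile bit rates), and taking the larger of the two values is my simplification of "can increase but cannot reduce".

```python
# Limits for the two levels discussed here, from Table A-1 of the H.264 spec.
# Bit rates are the Baseline Profile values, in kbit/s.
LEVEL_LIMITS = {
    "2.0": {"max_mbps": 11880, "max_fs": 396, "max_br_kbps": 2000},
    "3.1": {"max_mbps": 108000, "max_fs": 3600, "max_br_kbps": 14000},
}


def effective_limits(level, custom_max_mbps=None, custom_max_fs=None):
    """Combine a Level with the optional CustomMaxMBPS / CustomMaxFS values.

    The optional parameters may only raise a limit, so a value below the
    Level's own limit is ignored (a simplification for this sketch).
    """
    base = LEVEL_LIMITS[level]
    return {
        "max_mbps": max(base["max_mbps"], custom_max_mbps or 0),
        "max_fs": max(base["max_fs"], custom_max_fs or 0),
        "max_br_kbps": base["max_br_kbps"],  # not touched by these two parameters
    }


# Level 2.0 on its own, then Level 2.0 raised to 720p30 processing capability
print(effective_limits("2.0"))
print(effective_limits("2.0", custom_max_mbps=108000, custom_max_fs=3600))
```

The second call keeps the 2 Mbps bit rate of Level 2.0 while reporting the frame size and processing rate of Level 3.1, which is exactly the trick described above.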

How Does It All Come Together?

Let's combine all of the information above and have another look at our SIP video call m line in detail:

a=rtpmap:97 H264/90000 
a=fmtp:97 profile-level-id=42E014;max-mbps=108000;max-fs=3600
  • a=rtpmap:97 – this line describes the codec associated with RTP payload type 97
  • H264/90000 – the advertised codec is H.264, with a clock rate of 90000
  • profile-level-id=42E014 – this is where the fun starts – the H.264 Profile and Level:
    • 42 – the first two characters indicate the Profile.
      42 means Baseline Profile.
    • E0 – trust me, ignore these two characters. If you don’t, have a look at section 8.1 from RFC3984.
    • 14 – the last two characters indicate the Level in hex. Convert this to decimal, divide by 10 and look it up in Table A-1 from the H.264 specification mentioned above.
      0x14 is 20 in decimal, divided by 10 is Level 2.0
  • max-mbps=108000 – optional CustomMaxMBPS parameter with the value of 108000 MB/s
  • max-fs=3600 – optional CustomMaxFS parameter with the value of 3600 macroblocks, in other words, a resolution of 1280×720 or 720p
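The breakdown above can be turned into a small Python sketch. Treat it as a rough illustration: the function names are invented for this post, the profile table knows only the Baseline value seen here, the middle byte is returned without being interpreted, and special cases such as Level 1b are not handled.

```python
PROFILES = {0x42: "Baseline"}  # only the profile that appears in this post


def parse_fmtp(line: str) -> dict:
    """Split 'a=fmtp:97 key=value;key=value' into a dict of strings."""
    _, _, params = line.partition(" ")
    result = {}
    for item in params.split(";"):
        key, _, value = item.strip().partition("=")
        if key:
            result[key] = value
    return result


def decode_profile_level_id(value: str) -> dict:
    """Decode the three hex bytes of profile-level-id."""
    raw = bytes.fromhex(value)
    if len(raw) != 3:
        raise ValueError("profile-level-id must be exactly 6 hex characters")
    profile_idc, middle_byte, level_idc = raw
    return {
        "profile": PROFILES.get(profile_idc, f"unknown (0x{profile_idc:02X})"),
        "middle_byte": middle_byte,  # the two characters we agreed to ignore
        "level": level_idc / 10,     # 0x14 -> 20 -> Level 2.0
    }


params = parse_fmtp("a=fmtp:97 profile-level-id=42E014;max-mbps=108000;max-fs=3600")
print(decode_profile_level_id(params["profile-level-id"]))
print(int(params["max-mbps"]), int(params["max-fs"]))
```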

To calculate the frame-rate, simply divide the macroblock processing rate by the frame size in macroblocks.
In the example above – 108000 divided by 3600 = 30 fps.
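The same division as a tiny Python helper, in case you want to script it. The function name is hypothetical and it simply divides the two signaled values.

```python
def max_frame_rate(max_mbps: int, frame_size_mbs: int) -> float:
    """Frames per second the far end can decode at a given frame size."""
    if frame_size_mbs <= 0:
        raise ValueError("frame size must be a positive number of macroblocks")
    return max_mbps / frame_size_mbs


print(max_frame_rate(108000, 3600))  # values from the SDP example
print(max_frame_rate(244800, 8160))  # the 1080p30 numbers from earlier
```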

Summary

There you have it.
It’s not really about H.264 resolution but Macroblocks.
Knowing how many macroblocks make up a specific frame size and how many get processed in a second allows you to easily derive the signaled capabilities and troubleshoot any related issues.