2. Unit - IV
Stream characteristics for Continuous media – Temporal Relationship – Object Stream Interactions, Media Levity, Media Synchronization – Models for Temporal Specifications – Streaming of Audio and Video – Jitter – Fixed playout and Adaptive playout – Recovering from packet loss – RTSP – Multimedia Communication Standards – RTP/RTCP – SIP and H.263
3. Multimedia- Definitions
– Multi - many; much; multiple
– Medium - a substance regarded as the means of
transmission of a force or effect; a channel or
system of communication, information, or
entertainment
(Merriam-Webster Dictionary )
• So, Multimedia???
– The terms Multi & Medium don't seem to fit well together
– The term Medium needs more clarification!!!
4. Multimedia- Definitions
• Medium
– Means for distribution and presentation of information
• Criteria for the classification of a medium: perception, representation, presentation, storage, transmission, information exchange
• Classification based on perception (text, audio, video) is appropriate for defining multimedia
5. Multimedia- Definitions
• Time always takes a separate dimension in the media representation
• Based on the time dimension in the representation space, media can be
– Time-independent (Discrete)
• Text, Graphics
– Time-dependent (Continuous)
• Audio, Video
• Video: a sequence of frames (images) presented to the user periodically.
• Time-dependent aperiodic media is not continuous!!
– Discrete & Continuous here have no connection with the internal representation!! (they relate to the viewer's impression…)
6. Multimedia- Definitions
• Multimedia is any combination of digitally manipulated text,
art, sound, animation and video.
• A stricter version of the definition does not allow just any combination of media.
• It requires
– Both continuous & discrete media to be utilized
– A significant level of independence between the media being used
• The less strict version of the definition is used in practice.
7. Multimedia- Definitions
• Multimedia elements are composed into a project using
authoring tools.
• Multimedia Authoring tools are those programs that provide the capability for creating a complete multimedia presentation by linking together objects such as a paragraph of text, an illustration, or an audio clip (song), with appropriate interactive user control.
8. Multimedia- Definitions
• By defining the objects' relationships to each other, and by
sequencing them in an appropriate order, authors (those who
use authoring tools) can produce attractive and useful
graphical applications.
• To name a few authoring tools
– Macromedia Flash
– Macromedia Director
– Authorware
• The hardware and software that govern the limits of what can happen constitute the multimedia platform or environment
9. Multimedia- Definitions
• Multimedia is interactive when the end-user is allowed to control what elements are delivered and when.
• Interactive Multimedia is Hypermedia, when the end-user is
provided with the structure of linked elements through which
he/she can navigate.
10. Multimedia- Definitions
• Multimedia is linear, when it is not interactive and the users
just sit and watch as if it is a movie.
• Multimedia is nonlinear, when the users are given the
navigational control and can browse the contents at will.
11. Multimedia System
• Following the dictionary definitions, a multimedia system is any system that supports more than a single kind of media
– This implies any system processing text and images would be a multimedia system!!!
– Note, this definition is quantitative. A qualitative definition would be more appropriate.
– The kind of media supported should be considered, rather than the number of media
12. Multimedia System
• A multimedia system is characterized by computer-controlled,
integrated production, manipulation, storage and
communication of independent information, which is encoded
at least through a continuous (time-dependent) and a discrete
(time-independent) medium.
13. Data streams
• Data Stream is any sequence of individual packets transmitted
in a time-dependent fashion
– Packets can carry information of either continuous or
discrete media
• Transmission modes
– Asynchronous
• Packets can reach the receiver as fast as possible.
• Suited for discrete media
• Additional time constraints must be imposed for
continuous media
14. Data streams
– Synchronous
• Defines a maximum end-to-end delay
• Packets can be received at an arbitrarily earlier time
• For retrieving uncompressed video at a data rate of 140 Mbit/s and a maximal end-to-end delay of 1 second, the receiver must provide 17.5 Mbytes of temporary storage
– Isochronous
• Defines maximum & minimum end-to-end delay
• Storage requirements at the receiver are reduced
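To make the buffer figures concrete, here is a minimal sketch using the slide's example numbers; the 0.9 s minimum delay in the isochronous case is an assumed value for illustration:

```python
# Receiver buffer sizing for synchronous vs. isochronous transmission,
# using the slide's example: uncompressed video at 140 Mbit/s with a
# maximum end-to-end delay of 1 second.

def buffer_bytes(data_rate_bit_s: float, delay_spread_s: float) -> float:
    """Worst case: everything that may arrive early must be buffered."""
    return data_rate_bit_s * delay_spread_s / 8

# Synchronous: packets may arrive arbitrarily early, so the buffer must
# absorb up to the full 1 s maximum delay.
print(buffer_bytes(140e6, 1.0) / 1e6)        # 17.5 (Mbytes)

# Isochronous: an assumed minimum delay of 0.9 s bounds how early packets
# can arrive, so only the 0.1 s delay spread must be buffered.
print(buffer_bytes(140e6, 1.0 - 0.9) / 1e6)  # ~1.75 (Mbytes)
```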
15. Characterizing continuous media streams
• Periodicity
– Strongly periodic
• PCM (pulse-code modulation) coded speech in telephony behaves this way.
– Weakly periodic
– Aperiodic
• Cooperative applications with shared windows
• Variation of consecutive packet sizes
– Strongly regular
• Uncompressed digital data transmission
– Weakly regular
• The MPEG frame-size ratio I:P:B is 10:1:2
– Irregular
16. Characterizing continuous media streams
• Contiguous packets
– Are packets transmitted directly one after another, or with some delay in between?
– Can be seen as a measure of resource utilization
– Connected / unconnected data streams
17. Streaming Media
• Popular approach to delivering continuous media over the Internet
• Playback at the user's computer is done while the media is being transferred (no waiting for the complete download!!!)
• You can find streaming in
– Internet radio stations
– Distance learning
– Cricket Live!!!
– Movie Trailers
19. Multimedia- Applications
Multimedia plays a major role in the following areas
– Instruction
– Business
– Advertisements
– Training materials
– Presentations
– Customer support services
– Entertainment
– Interactive Games
20. Multimedia- Applications
– Enabling Technology
– Accessibility to web based materials
– Teaching-learning disabled children & adults
– Fine Arts & Humanities
– Museum tours
– Art exhibitions
– Presentations of literature
24. Notion of Synchronization
• Multimedia synchronization is understood in terms of content relations, spatial relations and temporal relations
• Content Relation: defines a dependency of media objects on some data
– Example: dependency between a filled spreadsheet and the graphics that represent the data listed in the spreadsheet
25. Spatial Relation
• Spatial relation is represented through layout relations and defines the space used for the presentation of a media object on an output device at a certain point in time in a multimedia document
– Example: desktop publishing
– Layout frames are placed on an output device and content is assigned to each frame
• Positioning of layout frames:
– Fixed to a position of a document
– Fixed to a position on a page
– Relative to the position of another frame
– Concept of frames is used for positioning of time-dependent
objects
• Example: in window-based systems, layout frames correspond
to windows and video can be positioned in a window
26. Temporal Relation
• Temporal relation defines temporal dependencies between
media objects
– Example: lip synchronization
• This relation is the focus of this discussion; we will not talk about content or spatial relations
• Time-dependent objects represent a media stream, because temporal relations exist between consecutive units of the stream
• Time-independent objects are traditional media such as images or text.
27. Temporal Relations (2)
• Temporal synchronization is supported by many system
components:
– OS (CPU scheduling, semaphores during IPC)
– Communication systems (traffic shaping, network
scheduling)
– Databases
– Document handling
• Synchronization needed at several levels of a Multimedia
System
28. Temporal Relations (3)
• Level 1: the OS and lower communication layers bundle single streams
– Objective: avoid jitter at the presentation time of one stream
• Level 2: on top of this level is the run-time support for the synchronization of multimedia streams (schedulers)
– Objective: bounded skews between the various streams
• Level 3: the next level holds the run-time support for synchronization between time-dependent and time-independent media, together with the handling of user interaction
– Objective: bounded skews between time-dependent and time-independent media
29. Specification of Synchronization
• Implicit Specification
– Temporal relation may be specified implicitly during capturing of
the media objects; the goal of a presentation is to present media in
the same ways as they were originally captured
• Audio/video recording and playback /VOD applications
• Explicit Specification
– Temporal relation may be specified explicitly in the case of
presentations that are composed of independently captured or
otherwise created objects
• Slide show where the presentation designer
– Selects appropriate slides
– Creates the audio for the slides
– Defines the units of the audio presentation stream
– Defines the points in the audio stream where the slides have to be presented
30. Inter-object and Intra-Object
Synchronization
• Intra-object synchronization refers to the time relation between
various presentation units of one time-dependent media object
• Inter-object synchronization refers to the synchronization
between media objects
[Figure: intra-object synchronization as the 40 ms spacing between consecutive video/animation frames; inter-object synchronization among Audio 1, Audio 2, Slides, Video and Animation]
31. Classification of Synchronization Units
• Logical Data Units (LDU)
– Samples or Pixels,
– Notes or frames
– Movements or scenes
– Symphony or movie
• Fixed LDU vs Variable LDU
• LDU specification during recording vs LDU defined by
user
• Open LDU vs Closed LDU
32. Live Synchronization
• Goal is to exactly reproduce at a presentation the temporal
relations as they existed during the capturing process
• Need to capture temporal relation information during the
capturing
• Live Sync is needed in conversational services
– Video Conferencing, Video Phone
– Recording followed by later retrieval is instead considered a retrieval service, i.e., a presentation with delay
33. Synthetic Synchronization
• Temporal relations are artificially specified
• Often used in presentation and retrieval-based systems
with stored data objects that are arranged to provide new
combined multimedia objects
– Authoring and tutoring systems
• Need synchronization editors to support flexible
synchronization relations between media
• Two phases: (1) specification phase defines temporal
relations, (2) presentation phase presents data in a
synchronized mode
– Example: 4 recorded audio messages relate to parts of an engine shown in an animation; the animation sequence shows a slow 360-degree rotation of the engine.
34. Synchronization Requirements
• For intra-object synchronization:
– Accuracy concerning jitters and end-to-end delays in the
presentations of LDUs
• For inter-object synchronization:
– Accuracy in the parallel presentation of media objects
• Implications of the blocking method:
– Fine for time-independent media
– Gap problem for time-dependent media
• What does the blocking of a stream mean for the output device?
• Should previous parts be repeated in the case of speech or music?
• Should the last picture of a stream be shown?
• How long can such a gap exist?
35. Synchronization Requirements (2)
• Solutions to Gap problem
– Restricted blocking method
• Switch to an alternative presentation if the gap between late video and audio exceeds a predefined threshold
• Show the last picture as a still image
– Re-sampling of a stream
• Speed up or slow down streams for the purpose of synchronization
– Off-line re-sampling – used after capturing media streams with independent devices
» Example: a concert captured with two independent audio and video devices
– Online re-sampling – used during a presentation when a gap between media streams occurs at run time
36. Synchronization Requirements (3)
• Lip Synchronization Requirements refer to temporal relation
between audio and video streams for human speaking
• Time difference between related audio and video LDUs is
called synchronization skew
• Streams are in sync if skew = 0 or skew is within the bound
• Streams are out of sync if skew exceeds the bound
• Bounds:
– Audio/Video in sync means -80ms ≤ skew ≤ 80ms
– Audio/Video out of sync means skew < -160ms or skew >
160ms
– Transient means -160ms ≤ skew < -80ms, or 80ms < skew ≤ 160ms
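As an illustration, a minimal sketch that classifies a measured skew against the bounds above (the function name and test values are our own):

```python
# Classify a measured lip-sync skew (in ms, audio relative to video)
# against the in-sync / transient / out-of-sync bounds quoted above.

def lip_sync_state(skew_ms: float) -> str:
    if -80 <= skew_ms <= 80:
        return "in sync"
    if skew_ms < -160 or skew_ms > 160:
        return "out of sync"
    return "transient"  # 80 < |skew| <= 160

for s in (-200, -100, 0, 90, 170):
    print(s, "->", lip_sync_state(s))
```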
37. Synchronization Requirements (4)
• Pointer Synchronization Requirements are very important in
computer-supported cooperative work (CSCW)
• We need synchronization between graphics, pointers and audio
• Comparison
– A lip sync error is a skew of 40 to 60 ms
– A pointer sync error is a skew of 250 to 1500 ms
• Bounds:
– Pointer/Audio/Graphics in sync means -500ms ≤ skew ≤ 750ms
– Out of sync means skew < -1000ms or skew > 1250ms
– Transient means -1000ms ≤ skew < -500ms, or 750ms < skew ≤ 1250ms
38. Synchronization Requirements (5)
• Digital audio on CD-ROM:
– the maximum allowable jitter delay in perception experiments is 5-10 ns; other experiments suggest 2 ms
• Combination of audio and animation is not as stringent as
lip synchronization
– Maximum allowable skew is +/- 80ms
• Stereo audio is tightly coupled
– Maximum allowable skew is 20ms; because of
listening errors, suggestion for skew is +/- 11ms
• Loosely coupled audio channels: speaker and background
music
– Maximum allowable skew is 500ms.
39. Synchronization Requirements (6)
• Production-level synchronization should be guaranteed prior to the presentation of data at the user interface
– For example, in the case of recording synchronized data for subsequent playback
• Stored data should be captured and recorded with no skew
• For playback, the defined lip sync boundaries are 80 ms
• For playback at local and remote workstations simultaneously, sync skews should be between -160 ms and 0 ms (video should be ahead of audio for the remote station due to pre-fetching)
40. Synchronization Requirements (7)
• Presentation-level synchronization should be defined at
the user interface
• This synchronization focuses on human perception
– Examples
• Video and image overlay: +/- 240ms
• Video and image non-overlay: +/- 500ms
• Audio and image (music with notes): +/- 5ms
• Audio and slide show (loosely coupled image): +/- 500ms
• Audio and text (text annotation): +/- 240ms
41. Reference Model for Synchronization
• Synchronization of multimedia objects is classified with respect to a four-level system:
– SPECIFICATION LEVEL: an open layer that includes applications and tools which allow the creation of synchronization specifications (e.g., sync editors); editing and formatting, mapping of user QoS to abstractions at the object level
– OBJECT/SERVICE LEVEL: operates on all types of media and hides the differences between discrete and continuous media; plans, coordinates and initiates presentations
– STREAM LEVEL: operates on multiple media streams and provides inter-stream synchronization; resource reservation and scheduling
– MEDIA LEVEL: operates on a single stream, treated as a sequence of LDUs, and provides intra-stream synchronization; file and device access
42. Synchronization in Distributed
Environments
• Information of synchronization must be transmitted with audio
and video streams, so that the receiver side can synchronize the
streams
• Delivery of complete sync information can be done before the
start of presentation
– This is used in synthetic synchronization
• Advantage: simple implementation
• Disadvantage: presentation delay
• Delivery of complete sync information can be done using out-of-band communication via a separate sync channel
– This is used in live synchronization
• Advantage: no additional presentation delays
• Disadvantage: additional channel is needed; additional
errors can occur
43. Synchronization in Distributed
Environments (2)
• Delivery of complete synchronization information can be done
using in-band communication via multiplexed data streams,
i.e., synchronization information is in headers of the
multimedia PDU
– Advantage: related sync information is delivered together
with media units
– Disadvantage: difficult to use for multiple sources
44. Synchronization in Distributed
Environments (3)
Location of Synchronization Operations
• It is possible to synchronize media objects by recording
objects together and leave them together as one object, i.e.,
combine objects into a new media object during creation;
Synchronization operation happens then at the recording site
• Synchronization operation can be placed at the sink. In this
case the demand on bandwidth is larger because additional
sync information must be transported
• Synchronization operation can be placed at the source. In this
case the demand on bandwidth is smaller because the streams
are multiplexed according to synchronization requirements
45. Synchronization in Distributed Environments (4)
Clock Synchronization
• Consider synchronization accuracy between clocks at
source and destination
• Global time-based synchronization needs clock
synchronization
• In order to re-synchronize, we can allocate buffers at the
sink and start transmission of audio and video in advance,
or use NTP (Network Time Protocol) to bound the
maximum clock offset
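As a sketch of the NTP option, the textbook offset/delay computation from four timestamps; the timestamp values below are hypothetical:

```python
# NTP-style estimate of the clock offset between source and sink.
# t0 = client send time, t1 = server receive time,
# t2 = server send time,  t3 = client receive time.

def ntp_offset_and_delay(t0, t1, t2, t3):
    offset = ((t1 - t0) + (t2 - t3)) / 2   # estimated clock offset
    delay = (t3 - t0) - (t2 - t1)          # round-trip network delay
    return offset, delay

# Hypothetical timestamps (seconds): the sink's clock runs ~0.05 s fast.
print(ntp_offset_and_delay(10.000, 10.070, 10.071, 10.041))
# -> (0.05, 0.04): bound the skew by correcting with the offset estimate
```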
46. Synchronization in Distributed
Environments (5)
Other Synchronization Issues
• Synchronization in distributed environment is a multi-step
process
– Sync must be considered during object acquisition (during video
digitization)
– Sync must be considered during retrieval (synchronize access to
frames of a stored video)
– Sync must be considered during delivery of LDUs to the network
(traffic shaping)
– Sync must be considered during transport (use isochronous
protocols if possible)
– Sync must be considered at the sink (sync delivery to the output
devices)
47. Synchronization Specification Methods
• Interval-based Specification
– The presentation duration of an object is considered as an interval
• Examples of operations: A before(0) B, A overlaps B, A starts B, A equals B, A during B, A while(0,0) B
– Advantage: easy to handle open LDUs and therefore user interactions
– Disadvantage: the model does not include skew specifications
[Figure: example interval timeline for Audio 1, Audio 2, Slides, Video 1, Animation]
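A hedged sketch of how such interval operations can be checked, with intervals as (start, end) pairs; the simplified predicates below take the delay arguments of before(0) and while(0,0) as zero:

```python
# Interval-based relation checks over (start, end) pairs.

def before(a, b):   return a[1] < b[0]                    # A ends, then B starts
def overlaps(a, b): return a[0] < b[0] < a[1] < b[1]      # A spans B's start
def starts(a, b):   return a[0] == b[0] and a[1] < b[1]   # same start, A shorter
def equals(a, b):   return a == b
def during(a, b):   return b[0] < a[0] and a[1] < b[1]    # A lies inside B

audio = (0, 10)   # e.g. Audio 1 plays from t=0 to t=10
video = (0, 12)
print(starts(audio, video))   # True: both start together, audio ends first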
48. Synchronization Specification (2)
• Control Flow-based Specification – Hierarchical
Approach
– The flow of concurrent presentation threads is synchronized at predefined points of the presentation
– Basic hierarchical specification: 1. serial
synchronization, 2. parallel synchronization of actions
– Action can be atomic or compound
– Atomic action handles presentation of a single media
object, user input or delay
– Compound actions are combinations of
synchronization operators and atomic actions
– Delay as an atomic action allows modeling of further
synchronization (e.g. delay in serial presentation)
49. Synchronization Specification (3)
• Control Flow-based Specification – Hierarchical Approach
• Advantage: easy to understand, natural support of hierarchy and
integration of interactive objects is easy
• Disadvantage: additional description of skews and QoS is necessary; presentation durations must be added
[Figure: hierarchical presentation tree combining Slides, Animation, Audio 1, Video and Audio 2]
50. Synchronization Specification (4)
• Control flow-based Synchronization Specification – Timed Petri Nets
[Figure: a Petri net with an input place holding a token, a transition, and an output place]
• Advantage: Timed Petri nets allow all kinds of synchronization specifications
• Disadvantage: complex specifications become difficult, and the model offers insufficient abstraction of media object content because the media objects must be split into sub-objects
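A minimal sketch of the Petri-net firing rule that underlies such specifications; place and transition names are illustrative:

```python
# Timed-Petri-net step rule: a transition fires when all its input places
# hold tokens, consuming them and (after its delay) marking its outputs.

marking = {"audio_ready": 1, "video_ready": 1, "presented": 0}
transitions = {
    "present_sync": {"in": ["audio_ready", "video_ready"],
                     "out": ["presented"], "delay_ms": 40},
}

def fire(name: str) -> bool:
    t = transitions[name]
    if all(marking[p] > 0 for p in t["in"]):   # enabled only if all inputs marked
        for p in t["in"]:
            marking[p] -= 1                    # consume input tokens
        for p in t["out"]:
            marking[p] += 1                    # produce output tokens after delay_ms
        return True
    return False

fire("present_sync")
print(marking)   # {'audio_ready': 0, 'video_ready': 0, 'presented': 1}
```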
52. OBJECTIVES:
To show how audio/video files can be downloaded for future use or
broadcast to clients over the Internet. The Internet can also be used for live
audio/video interaction. Audio and video need to be digitized before being
sent over the Internet.
To discuss how audio and video files are compressed for transmission
through the Internet.
To discuss the phenomenon called jitter that can be created on a packet-switched network when transmitting real-time data.
To introduce the Real-Time Transport Protocol (RTP) and Real-Time
Transport Control Protocol (RTCP) used in multimedia applications.
To discuss voice over IP as a real-time interactive audio/video application.
53. OBJECTIVES (continued):
To introduce the Session Initiation Protocol (SIP) as an application
layer protocol that establishes, manages, and terminates multimedia
sessions.
To introduce quality of service (QoS) and how it can be improved
using scheduling techniques and traffic shaping techniques.
To discuss Integrated Services and Differentiated Services and how they can be implemented.
To introduce Resource Reservation Protocol (RSVP) as a signaling
protocol that helps IP create a flow and makes a resource reservation.
54. INTRODUCTION
We can divide audio and video services into three broad
categories: streaming stored audio/video, streaming live
audio/video, and interactive audio/video, as shown in Figure 1.
Streaming means a user can listen (or watch) the file after the
downloading has started.
59. DIGITIZING AUDIO AND VIDEO
Before audio or video signals can be sent on the
Internet, they need to be digitized. We discuss audio and
video separately.
60. Topics Discussed in the Section
Digitizing Audio
Digitizing Video
62. AUDIO AND VIDEO COMPRESSION
To send audio or video over the Internet requires
compression. In this section, we first discuss audio
compression and then video compression.
63. Topics Discussed in the Section
Audio Compression
Video Compression
72. STREAMING STORED AUDIO/VIDEO
Now that we have discussed digitizing and compressing
audio/video, we turn our attention to specific applications.
The first is streaming stored audio and video.
Downloading these types of files from a Web server can
be different from downloading other types of files. To
understand the concept, let us discuss four approaches,
each with a different complexity.
73. Topics Discussed in the Section
First Approach: Using a Web Server
Second Approach: Using a Web Server with Metafile
Third Approach: Using a Media Server
Fourth Approach: Using a Media Server and RTSP
74. Figure 10 Using a Web server
[Figure: 1. the browser sends GET: audio/video file; 2. the server sends the RESPONSE; 3. the browser passes the audio/video file to the media player]
75. Figure 11 Using a Web server with a metafile
[Figure: 1. the browser sends GET: metafile; 2. the server sends the RESPONSE; 3. the browser passes the metafile to the media player; 4. the media player sends GET: audio/video file; 5. the server sends the RESPONSE to the media player]
76. Figure 12 Using a media server
[Figure: 1. the browser sends GET: metafile to the Web server; 2. the Web server sends the RESPONSE; 3. the browser passes the metafile to the media player; 4. the media player sends GET: audio/video file to the media server; 5. the media server sends the RESPONSE]
77. Figure 13 Using a media server and RTSP
[Figure: 1. GET: metafile; 2. RESPONSE; 3. the metafile is passed to the media player; 4. SETUP; 5. RESPONSE; 6. PLAY; 7. RESPONSE; 8. the audio/video stream flows from the media server; 9. TEARDOWN and RESPONSE]
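To make steps 4-9 concrete, a sketch of the corresponding RTSP messages; the host, URL, session ID and client ports are hypothetical, and the message layout follows RFC 2326:

```python
# Illustrative RTSP requests behind steps 4-9 of Figure 13.
import socket

URL = "rtsp://media.example.com/demo/clip"   # hypothetical media URL

requests = [
    # Steps 4/5: create a session; tell the server where to send RTP/RTCP.
    f"SETUP {URL} RTSP/1.0\r\nCSeq: 1\r\n"
    "Transport: RTP/AVP;unicast;client_port=8000-8001\r\n\r\n",
    # Steps 6/7: start the audio/video stream.
    f"PLAY {URL} RTSP/1.0\r\nCSeq: 2\r\nSession: 12345678\r\n\r\n",
    # Step 8 is the RTP stream itself (not shown); step 9 closes the session.
    f"TEARDOWN {URL} RTSP/1.0\r\nCSeq: 3\r\nSession: 12345678\r\n\r\n",
]

with socket.create_connection(("media.example.com", 554)) as s:  # 554 = RTSP port
    for req in requests:
        s.sendall(req.encode())
        print(s.recv(4096).decode())   # the server's RESPONSE for each step
```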
78. STREAMING LIVE AUDIO/VIDEO
Streaming live audio/video is similar to the
broadcasting of audio and video by radio and TV
stations. Instead of broadcasting to the air, the
stations broadcast through the Internet. There are
several similarities between streaming stored
audio/video and streaming live audio/video. They are
both sensitive to delay; neither can accept
retransmission. However, there is a difference. In the
first application, the communication is unicast and
on-demand. In the second, the communication is
multicast and live.
79. REAL-TIME INTERACTIVE AUDIO/VIDEO
In real-time interactive audio/video, people communicate
with one another in real time. The Internet phone or voice
over IP is an example of this type of application. Video
conferencing is another example that allows people to
communicate visually and orally.
90. Note
Translation means changing the encoding of a payload to a lower quality to match the bandwidth of the receiving network.
92. Note
TCP, with all its sophistication, is not suitable for interactive multimedia traffic because we cannot allow retransmission of packets.
93. Note
UDP is more suitable than TCP for interactive traffic. However, we need the services of RTP, another transport layer protocol, to make up for the deficiencies of UDP.
94. RTP
Real-time Transport Protocol (RTP) is the protocol
designed to handle real-time traffic on the Internet. RTP
does not have a delivery mechanism (multicasting, port
numbers, and so on); it must be used with UDP. RTP
stands between UDP and the application program. The
main contributions of RTP are timestamping, sequencing,
and mixing facilities.
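As an illustration of the timestamping, sequencing and mixing fields just mentioned, a sketch that parses the 12-byte fixed RTP header defined in RFC 3550 (the dictionary keys are our own):

```python
# Parse the fixed RTP header (RFC 3550) from the first 12 bytes of a packet.
import struct

def parse_rtp_header(packet: bytes) -> dict:
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version":    b0 >> 6,         # always 2 for current RTP
        "padding":    (b0 >> 5) & 1,
        "extension":  (b0 >> 4) & 1,
        "csrc_count": b0 & 0x0F,       # contributing sources (mixing)
        "marker":     b1 >> 7,
        "payload_type": b1 & 0x7F,     # identifies the media encoding
        "sequence":   seq,             # detects loss and reordering
        "timestamp":  ts,              # drives playout timing
        "ssrc":       ssrc,            # identifies the stream source
    }

# Example packet: version 2, payload type 0 (PCMU), sequence number 1.
pkt = struct.pack("!BBHII", 0x80, 0x00, 1, 160, 0xDEADBEEF)
print(parse_rtp_header(pkt))
```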
95. Topics Discussed in the Section
RTP Packet Format
UDP Port
96. Figure 18 RTP packet header format
97. Figure 19 RTP packet header format
99. Note
RTP uses a temporary even-numbered UDP port.
100. RTCP
RTP allows only one type of message, one that
carries data from the source to the destination. In
many cases, there is a need for other messages in a
session. These messages control the flow and
quality of data and allow the recipient to send
feedback to the source or sources. Real-Time
Transport Control Protocol (RTCP) is a protocol
designed for this purpose.
101. Topics Discussed in the Section
Sender Report
Receiver Report
Source Description Message
Bye Message
Application-Specific Message
UDP Port
103. Note
RTCP uses an odd-numbered UDP port number that follows the port number selected for RTP.
104. VOICE OVER IP
Let us concentrate on one real-time interactive
audio/video application: voice over IP, or Internet
telephony. The idea is to use the Internet as a
telephone network with some additional capabilities.
Instead of communicating over a circuit-switched
network, this application allows communication
between two parties over the packet-switched
Internet. Two protocols have been designed to
handle this type of communication: SIP and H.323.
We briefly discuss both.
112. Figure 27 H.323 example
[Figure: the terminal finds the IP address of the gatekeeper and registers; Q.931 messages set up the call; RTP carries the audio exchange and RTCP the management; Q.931 messages terminate the call]
114. Chronology of Video Standards
[Timeline 1990-2002 – ITU-T track: H.261 (1990), H.263 (1996), H.263+ (1998), H.263++ (2000), H.263L (2002); ISO track: MPEG-1 (1992), MPEG-2 (1994), MPEG-4 (1999), MPEG-7 (2001)]
115. Chronology of Video Standards
• (1990) H.261, ITU-T
– Designed to work at multiples of 64 kb/s (p×64).
– Operates on standard frame sizes CIF, QCIF.
• (1992) MPEG-1, ISO “Storage & Retrieval of Audio &
Video”
– Evolution of H.261.
– Main application is CD-ROM based video (~1.5 Mb/s).
116. Chronology continued
• (1994-5) MPEG-2, ISO “Digital Television”
– Evolution of MPEG-1.
– Main application is video broadcast (DirecTV, DVD,
HDTV).
– Typically operates at data rates of 2-3 Mb/s and above.
117. Chronology continued
• (1996) H.263, ITU-T
– Evolution of all of the above.
– Supports more standard frame sizes (SQCIF, QCIF,
CIF, 4CIF, 16CIF).
– Targeted low bit rate video <64 kb/s. Works well at
high rates, too.
• (1/98) H.263 Ver. 2 (H.263+), ITU-T
– Additional negotiable options for H.263.
– New features include: deblocking filter, scalability,
slicing for network packetization and local decode,
square pixel support, arbitrary frame size, chromakey
transparency, etc…
118. Chronology continued
• (1/99) MPEG-4, ISO “Multimedia Applications”
– MPEG4 video based on H.263, similar to H.263+
– Adds more sophisticated binary and multi-bit
transparency support.
– Support for multi-layered, non-rectangular video display.
• (2H/’00) H.263++ (H.263V3), ITU-T
– Tentative work item.
– Addition of features to H.263.
– Maintain backward compatibility with H.263 V.1.
119. Chronology continued
• (2001) MPEG7, ISO “Content Representation for Info
Search”
– Specify a standardized description of various types of
multimedia information. This description shall be
associated with the content itself, to allow fast and
efficient searching for material that is of a user’s
interest.
• (2002) H.263L, ITU-T
– Call for Proposals, early ‘98.
– Proposals reviewed through 11/98, decision to proceed.
– Determined in 2001
120. Section 1: Conferencing Video
Video Compression Review
Chronology of Video Standards
The Input Video Format
H.263 Overview
H.263+ Overview
121. Video Format for Conferencing
• Input color format is YCbCr (a.k.a. YUV). Y is the
luminance component, U & V are chrominance (color
difference) components.
• Chrominance is subsampled by two in each direction.
• Input frame size is based on the Common Intermediate
Format (CIF) which is 352x288 pixels for luminance and
176x144 for each of the chrominance components.
[Figure: a CIF frame = one Y plane plus the subsampled Cb and Cr planes]
122. YCbCr (YUV) Color Space
• Defined as input color space to H.263, H.263+, H.261,
MPEG, etc.
• It’s a 3x3 transformation from RGB.
Y  =  0.299·R + 0.587·G + 0.114·B
Cb = -0.169·R - 0.331·G + 0.500·B
Cr =  0.500·R - 0.419·G - 0.081·B
Y represents the luminance of a pixel.
Cb, Cr represent the color difference or chrominance of a pixel.
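A small sketch applying this transform, with R, G, B normalized to [0, 1]:

```python
# RGB -> YCbCr using the 3x3 matrix above.

def rgb_to_ycbcr(r: float, g: float, b: float):
    y  =  0.299 * r + 0.587 * g + 0.114 * b   # luminance
    cb = -0.169 * r - 0.331 * g + 0.500 * b   # blue color difference
    cr =  0.500 * r - 0.419 * g - 0.081 * b   # red color difference
    return y, cb, cr

print(rgb_to_ycbcr(1.0, 1.0, 1.0))  # white: y = 1.0, cb ~ 0, cr ~ 0
```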
123. Subsampled Chrominance
• The human eye is more sensitive to spatial detail in
luminance than in chrominance.
• Hence, it doesn’t make sense to have as many pixels
in the chrominance planes.
124. Spatial relation between luma and chroma pels for CIF 4:2:0
[Figure: sampling grid showing luminance pels, chrominance pels and block edges; the chroma sample positions are different than MPEG-2 4:2:0]
125. Common Intermediate Format
•The input video format is based on Common Intermediate
Format or CIF.
•It is called Common Intermediate Format because it is
derivable from both 525 line/60 Hz (NTSC) and 625 line/50
Hz (PAL) video signals.
•CIF is defined as 352 pels per line and 288 lines per frame.
•The picture area for CIF is defined to have an aspect ratio of about 4:3. However, (352 / 4) × 3 = 264 ≠ 288, so the pixels cannot be square.
126. Picture & Pixel Aspect Ratios
Pixels are not square in CIF.
[Figure: a 352×288 CIF picture; picture aspect ratio 4:3, pixel aspect ratio 12:11]
127. Picture & Pixel Aspect Ratios
Hence on a square pixel display such as a computer screen,
the video will look slightly compressed horizontally. The
solution is to spatially resample the video frames to be
384 x 288 or 352 x 264
This corresponds to a 4:3 aspect ratio for the picture area on
a square pixel display.
128. Blocks and Macroblocks
The luma and chroma planes are divided into 8x8 pixel
blocks. Every four luma blocks are associated with a
corresponding Cb and Cr block to create a macroblock.
[Figure: a macroblock = four 8x8 Y blocks plus one 8x8 Cb block and one 8x8 Cr block]
129. Section 1: Conferencing Video
Video Compression Review
Chronology of Video Standards
The Input Video Format
H.263+ Overview
131. ITU-T Recommendation H.263
• H.263 targets low data rates (< 28 kb/s). For example, it can compress QCIF video to 10-15 fps at 20 kb/s.
• For the first time there is a standard video codec that can be
used for video conferencing over normal phone lines (H.324).
• H.263 is also used in ISDN-based VC (H.320) and
network/Internet VC (H.323).
132. ITU-T Recommendation H.263
Composed of a baseline plus
four negotiable options
Baseline Codec
Unrestricted/Extended Motion Vector Mode
Advanced Prediction Mode
PB Frames Mode
Syntax-based Arithmetic Coding Mode
134. Picture & Macroblock Types
• Two picture types:
– INTRA (I-frame) implies no temporal prediction is
performed.
– INTER (P-frame) may employ temporal prediction.
• Macroblock (MB) types:
– INTRA & INTER MB types (even in P-frames).
• INTER MBs have shorter symbols in P frames
• INTRA MBs have shorter symbols in I frames
– Not coded - MB data is copied from previous decoded
frame.
135. Motion Vectors
• Motion vectors have 1/2 pixel granularity. Reference frames must be interpolated by two.
• MVs are not coded directly; rather, a median predictor is used.
• The predictor residual is then coded using a VLC table.
[Figure: neighboring blocks A (left), B (above) and C (above-right) around the current block X]
MVX = median(MVA, MVB, MVC)
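A minimal sketch of this median prediction; the vector values are illustrative:

```python
# Median motion-vector prediction: the MV for block X is predicted from
# neighbors A (left), B (above), C (above-right); only the residual is coded.

def median3(a, b, c):
    return sorted((a, b, c))[1]

def predict_mv(mv_a, mv_b, mv_c):
    """Each argument is an (x, y) vector in half-pel units."""
    return (median3(mv_a[0], mv_b[0], mv_c[0]),
            median3(mv_a[1], mv_b[1], mv_c[1]))

mv_x = (5, -2)                              # actual MV for block X
pred = predict_mv((4, 0), (6, -3), (5, 1))  # -> (5, 0)
residual = (mv_x[0] - pred[0], mv_x[1] - pred[1])
print(residual)                             # (0, -2) is what gets VLC-coded
```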
137. Transform Coefficient Coding
Assign a variable length code according to three parameters (3-D VLC):
1 - Length of the run of zeros preceding the current nonzero coefficient.
2 - Amplitude of the current coefficient.
3 - Indication of whether the current coefficient is the last one in the block.
The most common combinations are variable length coded (3-13 bits); the rest are coded with escape sequences (22 bits).
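As an illustration, a sketch that derives the (last, run, level) triplets from a zig-zag-ordered coefficient list; the actual VLC table lookup is omitted:

```python
# Turn quantized, zig-zag-ordered coefficients into (last, run, level)
# events, the triplets that index the 3-D VLC table.

def run_level_events(coeffs):
    events, run = [], 0
    nonzero = [i for i, c in enumerate(coeffs) if c != 0]
    for i, c in enumerate(coeffs):
        if c == 0:
            run += 1                     # extend the run of zeros
        else:
            last = (i == nonzero[-1])    # is this the final nonzero coeff?
            events.append((last, run, c))
            run = 0
    return events

print(run_level_events([12, 0, 0, -3, 1, 0, 0, 0]))
# [(False, 0, 12), (False, 2, -3), (True, 0, 1)]
```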
138. Quantization
• H.263 uses a scalar quantizer with center clipping.
• The quantizer varies from 2 to 62, by 2's.
• It can be varied ±1, ±2 at macroblock boundaries (2 bits), or set to any value 2-62 at row and picture boundaries (5 bits).
[Figure: center-clipping quantizer transfer curve; small inputs map to 0, larger inputs map to ±Q, ±2Q, …]
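A simplified sketch of a dead-zone ("center clipping") scalar quantizer in this spirit; the exact H.263 quantization and reconstruction formulas differ in detail:

```python
# Simplified center-clipping scalar quantizer (not the exact H.263 rules).

def quantize(coeff: int, qp: int) -> int:
    """qp: quantizer step (2..62, even). Small inputs clip to zero."""
    if abs(coeff) < qp:
        return 0                                   # dead zone around zero
    return coeff // qp if coeff > 0 else -((-coeff) // qp)

def dequantize(level: int, qp: int) -> int:
    if level == 0:
        return 0
    sign = 1 if level > 0 else -1
    return sign * (abs(level) * qp + qp // 2)      # reconstruct mid-bin

for c in (-130, -40, 3, 75, 200):
    lev = quantize(c, 30)
    print(c, "->", lev, "->", dequantize(lev, 30))
```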
139. Bit Stream Syntax
Hierarchy of three layers:
Picture Layer
GOB* Layer
MB Layer
*A GOB is usually a row of macroblocks, except for frame sizes greater than CIF.
[Figure: bit stream layout – Picture Hdr, then repeated GOB Hdr followed by its MBs]
140. Picture Layer Concepts
Fields: Picture Start Code | Temporal Reference | Picture Type | Picture Quant
• PSC - a sequence of bits that cannot be emulated anywhere else in the bit stream.
• TR - a 29.97 Hz counter indicating the time reference for a picture.
• PType - denotes INTRA, INTER-coded, etc.
• PQuant - indicates which quantizer (2…62) is used initially for the picture.
141. GOB Layer Concepts
GOB Headers are optional.
Fields: GOB Start Code | GOB Number | GOB Quant
• GSC - Another unique start code (17 bits).
• GOB Number - Indicates which GOB, counting
vertically from the top (5 bits).
• GOB Quant - Indicates which quantizer (2…62) is
used for this GOB (5 bits).
• GOB can be decoded independently from the
rest of the frame.
142. Macroblock Layer Concepts
Fields: Coded Flag | MB Type | Coded Block Pattern | DQuant | MV Deltas | Transform Coefficients
• COD - if set, indicates empty INTER MB.
• MB Type - indicates INTER, INTRA, whether MV is
present, etc.
• CBP - indicates which blocks, if any, are empty.
• DQuant - indicates a quantizer change by +/- 2, 4.
• MV Deltas - are the MV prediction residuals.
• Transform coefficients - are the 3-D VLC’s for the
coefficients.
143. Unrestricted/Extended Motion Vector
Mode
• Motion vectors are permitted to point outside the picture
boundaries.
– non-existent pixels are created by replicating the edge
pixels.
– improves compression when there is movement across
the edge of a picture boundary or when there is camera
panning.
• Also possible to extend the range of the motion vectors
from [-16,15.5] to [-31.5,31.5] with some restrictions. This
better addresses high motion scenes.
146. Advanced Prediction Mode
• Includes motion vectors across picture boundaries from
the previous mode.
• Option of using four motion vectors for 8x8 blocks
instead of one motion vector for 16x16 blocks as in
baseline.
• Overlapped motion compensation to reduce blocking
artifacts.
147. Overlapped Motion Compensation
• In normal motion compensation, the current block is composed
of
– the predicted block from the previous frame (referenced by
the motion vectors), plus
– the residual data transmitted in the bit stream for the current
block.
• In overlapped motion compensation, the prediction is a
weighted sum of three predictions.
148. Overlapped Motion Compensation
• Let (m, n) be the column & row indices of an 8×8 pixel block in a frame.
• Let (i, j) be the column & row indices of a pixel within an 8×8 block.
• Let (x, y) be the column & row indices of a pixel within the entire frame, so that:
(x, y) = (m·8 + i, n·8 + j)
149. Overlapped Motion Comp.
• Let (MV0x,MV0y) denote the
motion vectors for the current
block.
• Let (MV1x,MV1y) denote the
motion vectors for the block
above (below) if the current pixel
is in the top (bottom) half of the
current block.
• Let (MV2x,MV2y) denote the
motion vectors for the block to
the left (right) if the current pixel
is in the left (right) half of the
current block.
[Figure: the current block's MV0, with MV1 taken from the block above or below and MV2 from the block to the left or right]
150. Overlapped Motion Comp.
Then the summed, weighted prediction is:
P(x,y) = (q(x,y)·H0(i,j) + r(x,y)·H1(i,j) + s(x,y)·H2(i,j) + 4) / 8
where q, r, s are the previous-frame pixel values displaced by the three motion vectors:
q(x,y) = p_prev(x + MV0x, y + MV0y)
r(x,y) = p_prev(x + MV1x, y + MV1y)
s(x,y) = p_prev(x + MV2x, y + MV2y)
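A sketch of the per-pixel weighted sum; H.263 defines position-dependent 8×8 weight matrices H0, H1, H2 that sum to 8 at each position, but constant illustrative weights (5, 2, 1) are used here just to show the computation:

```python
# Per-pixel overlapped motion compensation (illustrative constant weights;
# the standard's H0/H1/H2 matrices vary with pixel position in the block).

def overlapped_prediction(q: int, r: int, s: int,
                          h0: int = 5, h1: int = 2, h2: int = 1) -> int:
    """q, r, s: the three motion-compensated pixel predictions."""
    return (q * h0 + r * h1 + s * h2 + 4) // 8   # +4 rounds to nearest

print(overlapped_prediction(100, 120, 90))  # blended pixel value: 104
```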
153. PB Frames Mode
• Permits two pictures to be coded as one unit: a P frame
as in baseline, and a bi-directionally predicted frame or
B frame.
• B frames provide more efficient compression at times.
• Can increase frame rate 2X with only about 30%
increase in bit rate.
• Restriction: the backward predictor cannot extend
outside the current MB position of the future frame.
See diagram.
154. PB Frames
[Figure: a PB unit codes Picture 2 (B frame) bidirectionally between Picture 1 and Picture 3 (P or I frames), with forward vector V·1/2 and backward vector -V·1/2 derived from the P frame's motion vector V]
2X frame rate for only 30% more bits.
155. Syntax based Arithmetic Coding Mode
• In this mode, all the variable length coding and
decoding of baseline H.263 is replaced with
arithmetic coding/decoding. This removes the
restriction that each symbol must be represented by
an integer number of bits, thus improving
compression efficiency.
• Experiments indicate that compression can be
improved by up to 10% over variable length
coding/decoding.
• Complexity of arithmetic coding is higher than
variable length coding, however.
156. H.263 Improvements over H.261
• H.261 only accepts the QCIF and CIF formats.
• No 1/2 pel motion estimation in H.261, instead it uses a
spatial loop filter.
• H.261 does not use median predictors for motion vectors
but simply uses the motion vector in the MB to the left as
predictor.
• H.261 does not use a 3-D VLC for transform coefficient
coding.
• GOB headers are mandatory in H.261.
• Quantizer changes at MB granularity require 5 bits in H.261 and only 2 bits in H.263.