
London Video Tech - Adventures in cutting every last millisecond from glass-to-glass latency


  1. Adventures in cutting every last millisecond from glass-to-glass latency
     Kieran Kunhya – kierank@obe.tv
     @openbroadcastsy, @kierank_
  2. Who am I, who are we?
     • I work on FFmpeg, x264 and others…
     • A lot of professional-video OSS probably has my fingerprints on it
     • At $job, Open Broadcast Systems builds software for broadcasters, mainly point-to-point video encoding/decoding for news/sport etc…
     • Not to be confused with:
  3. What I will talk about
     • Minimising every last millisecond of latency from broadcast production processes (before distribution)
     • Encoding and decoding are often the dominant source of latency – will focus on this
     • Doing this from a software engineering standpoint
     • Not much (if anything) written about this at all
     • Hardware-centric industry – “secret sauce” thinking
  4. What I will not talk about
     • Doing live production with high-bandwidth (10-100GbE) networking
     • The network stack in between (FEC vs SRT vs RIST)
     • Not the right audience
     • Demuxed 2017 video
  5. Live broadcast production processes (1)
     • Processes in black boxes, e.g. routing, graphics, switching, mixing, recording, monitoring, playout, subtitling, standards conversion etc…
     • Infrastructure as complex as, if not more complex than, delivery
  6. Live broadcast production processes (2)
     • Heavily hardware (FPGA/DSP) centric
     • Fixed-function, black-box products
     • Low-latency processes in the studio
     • “Video lines” of latency – on the order of 10-100 µs
     • Uncompressed video – high data rates, many Gbps
     • Legacy usage of satellite, fibre, SDI, ASI
     • Includes premium live web video!
  7. Video contribution
     • Getting content from a remote place to one or more central places, often a studio or aggregation centre
     • Minimise latency – often fast-paced interviews/debates
     • Often uneconomical to pay for uncompressed
     • Remote production: director not onsite, back at base
  8. The live production environment
     • Largely SDI (coax) based – unidirectional, Gbps video
     • Latency on the order of ~video lines (40 µs)
     • Many I/O boards to do this
     • But they abstract that low latency away into ~frames (40 ms) – a 1000x increase!
     • SDKs hide the capabilities of the electronics – internal buffering?
     • Hardware doing the data processing (“offload”)
  9. SDI from a software engineer’s point of view
     • I want the software to do as much as is reasonably possible
     • A driver, not the SDK+driver hybrid that *all* manufacturers ship
     • “Offload” is irrelevant in 2019
     • Start processing the data as soon as a field arrives, not the whole frame
     • Later on, process chunk by chunk
     • I/O in the purest sense: write data and have it put on the wire *now*
  10. What you often get in reality
     • Video and audio on separate file descriptors
     • Can never open them simultaneously, so can never have exact lipsync
     • Long delays in and out of the card (~2-3 frames)…
     • Not all audio tracks available; audio out of sync
     • Video downconverted to 8-bit
     • Not all blanking data available; less common parameters not changeable
  11. Built our own SDI card
  12. SDI from a software engineer’s point of view
     • Massive time and expense for the most important 4 lines of code
     • DMA (direct memory access) buffers of 8192 bytes (approx. 1 HD line)
     • Get an interrupt every 32 buffers
     • Can capture, process chunks of video and push them out within ~100s of lines!
     • Tight timescales – need to be aware of thread priority, CPU power-saving etc.
  13. SDI from a software engineer’s point of view
     • The CRC is not software-centric (10-bit data, 25-bit polynomial)
     • We offload this – otherwise a big waste of CPU
     • Very tedious to build a frame correctly – lots of legacy
     • Difficult to verify – the tools are all hardware-based
     • 1080p50/60 – 3G-SDI Level B, very software-unfriendly
     • (and lots of other implementation details)
  14. Pixel formats
     Only the YUV 4:2:2 domain (as an example)!
     • Planar 10-bit – main working format
     • Planar 8-bit – preview quality
     • UYVY 10-bit (16-bit aligned) – SDI datastream
     • Apple v210 – some hardware
     • Contiguous 10-bit – SDI wire format
  15. Pixel formats
     Handwritten (no intrinsics!) SIMD for every mapping (and others).
     • 5-15x speed improvements compared to C
     • Do it once, make it fast once and for all (until a new CPU…)
     • A generic conversion library is a difficult problem
     • Intermediate pixel format(s) are always a compromise
     • Add special cases until you’ve done them all!
  16. Basic encode / decode pipeline
     • Encoder
       • Capture: 1-2 frames
       • Encode (x264 lowest-latency, no audio compression): 1 frame
       • Mux and other processing (~5 ms)
     • Decoder
       • Wait for frame to arrive: 1 frame
       • Decode the frame: 1 frame
       • Frame synchronisation: 1 frame (drop and duplicate video, resample audio)
       • Push to wire: 1-2 frames
     • Basic implementation: 7 frames, 280 ms at 1080i25
  17. Better encode / decode pipeline
     • Encoder
       • Capture: 1 frame
       • Encode (x264 lowest-latency, no audio compression): 1 frame
       • Mux and other processing (~5 ms)
     • Decoder
       • Wait for frame to arrive: 1 frame
       • Decode the frame: 1 frame
       • Frame synchronisation: 1 frame (drop and duplicate video, resample audio)
       • Push to wire: 1 frame (10 ms)
     • Better implementation: 5.x frames, 210 ms at 1080i25
  18. Better encode / decode pipeline
     • Encoder
       • Capture: 1 frame
       • Encode (x264 lowest-latency, no audio compression): 1 frame
       • Mux and other processing (~5 ms)
     • Decoder
       • Wait for frame to arrive: 1 frame
       • Decode the frame: 1 frame
       • Frame synchronisation: 1 frame (drop and duplicate video, resample audio)
       • Push to wire: 1 frame
     • Better implementation: 6 frames, 240 ms at 1080i25
  19. Decode the frame as it arrives on the wire
     • Fix FFmpeg chunk decode
     • Slices arrive at the destination
     • The complete frame is built
  20. Better encode / decode pipeline
     • Encoder
       • Capture: 1 frame
       • Encode (x264 lowest-latency, no audio compression): 1 frame
       • Mux and other processing (~5 ms)
     • Decoder
       • Wait for frame to arrive: 1 frame
       • Decode the frame as it arrives: 1 frame
       • Frame synchronisation: 1 frame (drop and duplicate video, resample audio)
       • Push to wire: 10 ms
     • Better implementation: 4.x frames, 170 ms at 1080i25
  21. Better encode / decode pipeline
     • Encoder
       • Capture: 1 field
       • Encode (x264 lowest-latency, no audio compression): 1 field
       • Mux and other processing (~5 ms)
     • Decoder
       • Decode the frame as it arrives: 1 frame
       • Frame synchronisation: 1 frame (drop and duplicate video, resample audio)
       • Push to wire: 10 ms
     • Better implementation: 3.x frames, 130 ms at 1080i25
  22. Better encode / decode pipeline
     • Encoder
       • Capture: 1 field
       • Encode (x264 lowest-latency, no audio compression): 1 field
       • Mux and other processing (~5 ms)
     • Decoder
       • Decode the frame as it arrives: 1 frame
       • Frame synchronisation: 1 frame (drop and duplicate video, resample audio)
       • Push to wire: 10 ms
     • Better implementation: 3.x frames, ~130 ms at 1080i25
  23. Clocks
     • Drift the local clock to match the remote clock
     • Clocks never match exactly (temperature etc.); drift can be fast
     • Control the onboard oscillator on the SDI card to match the remote clock
     • Saves having to drop/duplicate video and resample audio to match
     • Same number of frames pushed per hour, per day etc.
     • At low latencies, clock drift bites you quicker
  24. Better encode / decode pipeline
     • Encoder
       • Capture: 1 field
       • Encode (x264 lowest-latency, no audio compression): 1 field
       • Mux and other processing (~5 ms)
     • Decoder
       • Decode the frame as it arrives: 1 frame
       • Push to wire: 10 ms
     • Better implementation: 2.x frames, 90 ms at 1080i25
  25. Better encode / decode pipeline
     • Encoder
       • Capture: 1 field
       • Encode (x264 lowest-latency, no audio compression): 1 field
       • Mux and other processing (~5 ms)
     • Decoder
       • Decode the frame to the wire as it arrives: 10 ms
     • Better implementation: 1.x frames, ~50 ms at 1080i25
  26. Chunk-based encode and decode
     • Throughout all of these improvements, bitrate stays roughly the same – no loss in picture quality, owing to bit-exact H.264 decode
     • Diminishing returns now, but some very high-end applications demand even lower latency
     • Not a good idea for H.264 – ratecontrol would prefer the full frame
     • Codecs like JPEG 2000, VC-2 and JPEG XS operate on slices
     • Limited use of slice-based encoding in software
     • Capture, encode, decode and render before the frame has even finished arriving on the wire at the source (~20 ms latency)
     • Concert video walls, VR etc.
  27. Chunk-based encode and decode
     Source → Destination
     • 10-20 ms end-to-end
     • Huge bitrate penalty (~100s of Mbps)
     • A high-quality network is also required
  28. Thanks
     Thanks to the team working on this:
     • James Darnley
     • Rafael Carre
     • Sam Willcocks
  29. The END