Criteo Labs Infrastructure Tech Talk Meetup Nov. 7

  1. 1. Criteo Labs Infrastructure Tech Talk, November 7, 2017. By Dailymotion, Criteo & Leboncoin. Copyright © 2017 Criteo
  2. 2. Hardware-assisted transcoding. Tuesday, November 7th, 2017
  3. 3. SOME BACKGROUND
  4. 4. SOME BACKGROUND: What is transcoding and why it is so important to us
      • uploaded videos come in various containers and codecs
      • you need to feed the native web player with a specific format
      • various qualities are available to our users
      • Diagram: www.dailymotion.com/upload → TRANSCODE → www.dailymotion.com/video/12345 (labels: 1080p, 576p, 4k, 720p, …; M3U8, H264/AAC, TS, HLS, ABR)
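
To make the HLS/ABR/M3U8 labels on this slide concrete, here is a minimal sketch of a master playlist generator that lists one variant stream per quality. The rendition names, bitrates, and URI layout are hypothetical and only illustrate the playlist format; they are not Dailymotion's actual encoding ladder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rendition:
    name: str       # hypothetical label, e.g. "720p"
    width: int
    height: int
    bandwidth: int  # peak bits per second (illustrative value)


def master_playlist(renditions):
    """Build an HLS master playlist with one variant stream per rendition."""
    lines = ["#EXTM3U"]
    # List variants from lowest to highest bandwidth.
    for r in sorted(renditions, key=lambda r: r.bandwidth):
        lines.append(
            f"#EXT-X-STREAM-INF:BANDWIDTH={r.bandwidth},"
            f"RESOLUTION={r.width}x{r.height}"
        )
        lines.append(f"{r.name}/index.m3u8")  # hypothetical URI layout
    return "\n".join(lines) + "\n"


# Illustrative ladder only; values are assumptions, not from the talk.
LADDER = [
    Rendition("1080p", 1920, 1080, 6_000_000),
    Rendition("720p", 1280, 720, 3_000_000),
    Rendition("576p", 1024, 576, 1_500_000),
]

if __name__ == "__main__":
    print(master_playlist(LADDER))
```

The player picks a variant from this list according to the available bandwidth, which is why every upload has to be transcoded into several qualities.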
  5. 5. SOME BACKGROUND: Dailymotion facts
      • more than 150k videos uploaded every day
      • 4 to 8 qualities per video (HLS)
      • from 144p to 2160p
      • 20M transcoding tasks per month
      • fast publication time constraint (3x to 10x)
      • getting the best possible video quality
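
As a rough consistency check on these figures, the sketch below multiplies the upload rate by the number of qualities. It assumes a 30-day month and one transcoding task per quality; both are assumptions, not statements from the talk.

```python
VIDEOS_PER_DAY = 150_000           # "more than 150k videos uploaded every day"
DAYS_PER_MONTH = 30                # assumption
REPORTED_TASKS_PER_MONTH = 20_000_000


def monthly_tasks(qualities_per_video):
    """Tasks per month if every upload yields one task per quality."""
    return VIDEOS_PER_DAY * DAYS_PER_MONTH * qualities_per_video


low = monthly_tasks(4)    # 18,000,000
high = monthly_tasks(8)   # 36,000,000

# Average number of qualities implied by the reported 20M tasks per month.
implied_qualities = REPORTED_TASKS_PER_MONTH / (VIDEOS_PER_DAY * DAYS_PER_MONTH)

if __name__ == "__main__":
    print(f"{low:,} to {high:,} tasks/month; implied average {implied_qualities:.1f}")
```

Under these assumptions the reported 20M tasks per month sits inside the 18M to 36M range, at roughly 4.4 qualities per video on average.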
  6. 6. LEGACY
  7. 7. SOME BACKGROUND: Legacy encoding farm
      • 160 blade servers
      • up to 56 logical threads (Xeon E5-2683 CPU)
      • 240W TDP per blade
      • FFmpeg 3.x with libx264 for video encoding
      • pure software transcoding
  8. 8. WHAT WE WANT
  9. 9. WHAT WE WANT: Can we improve our existing transcoding workflow?
      • reduce OPEX: power consumption
      • reduce CAPEX: unit price
      • better performance means faster publication
      • use the existing video workflow
  10. 10. SOMETHING NEW
  11. 11. SOMETHING NEW: GPU accelerated solution
      • choosing the right solution:
          ○ Nvidia NVENC
          ○ Intel Media SDK (Quick Sync)
      • how can it fit in our workflow?
      • what is the gain in terms of:
          ○ performance
          ○ cost
          ○ power consumption
  12. 12. HARDWARE
  13. 13. HARDWARE: What we use now
      • HPE Moonshot 1500 chassis
      • up to 45 Intel Xeon cartridges per chassis:
          ○ Xeon E3-1585L v5 (4 cores, Hyper-Threaded) with Iris Pro Graphics P580 (72 processing units)
          ○ TDP: 45W
          ○ 500 GB SSD
          ○ 64 GB RAM
  14. 14. SOFTWARE
  15. 15. SOFTWARE: What we use now
      • Intel Media SDK 2017 R3
      • Kernel 4.4.83 with Intel patches
      • FFmpeg 3.3.3 with Intel patches
      • unchanged in-house scheduling solution
      • monitoring through Datadog
  16. 16. WORKFLOW
  17. 17. WORKFLOW: Transcoding workflow (diagram; legend: Software (FFmpeg) / Hardware (Intel GPU))
      • Video path: Input file → Demux (MP4/MKV/…) → Video frames → Decode frames → Filters (Deint/Scale/…) → Encode frames → Remux (MP4) → Output file
      • Audio path: Demux → Audio frames → Transcode audio → Remux (MP4)
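  18. 18. WORKFLOW: Look-ahead algorithm (diagram)
      • Video path: Input file → Demux (MP4/MKV/…) → Video frames → Decode frames → Next decoded frames buffer → Encode frames → Remux (MP4) → Output file
      • Look-ahead: Next decoded frames buffer → Look-ahead bitrate analyser → Set bitrate (on Encode frames)
      • Audio path: Demux → Audio frames → Transcode audio → Remux (MP4)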
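
The idea behind the look-ahead diagram can be sketched with a toy rate allocator: frames are buffered ahead of the encoder, and each frame's bit budget is scaled by its complexity relative to the buffered window. This is an illustration only, not the Intel Media SDK algorithm; the complexity scores, the window depth, and the proportional rule are all assumptions.

```python
from collections import deque


def lookahead_bit_budget(complexities, avg_bits_per_frame, depth):
    """Yield one bit budget per frame.

    Each budget is the average budget scaled by the frame's complexity
    relative to the mean complexity of the look-ahead window.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    frames = iter(complexities)
    window = deque()
    # Pre-fill the "next decoded frames" buffer.
    for c in frames:
        window.append(c)
        if len(window) == depth:
            break
    while window:
        current = window[0]
        window_mean = sum(window) / len(window)
        scale = current / window_mean if window_mean > 0 else 1.0
        yield avg_bits_per_frame * scale
        # Slide the window: drop the encoded frame, buffer one more.
        window.popleft()
        nxt = next(frames, None)
        if nxt is not None:
            window.append(nxt)


if __name__ == "__main__":
    scores = [1, 1, 4, 1]  # hypothetical per-frame complexity scores
    print(list(lookahead_bit_budget(scores, 1000, depth=3)))
    print(list(lookahead_bit_budget(scores, 1000, depth=1)))
```

With a depth of 1 the allocator sees only the current frame and every budget equals the average; with a deeper window it can save bits on simple frames ahead of a complex one.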
  19. 19. RESULTS
  20. 20. RESULTS: Test setup
      The following FFmpeg versions were used during the test:
      • QSV version: FFmpeg 3.3.3 + Intel patches
      • SW only: FFmpeg 3.3.3 (16 threads)
      We also tried to use the same parameters when possible:
      • AVC profile and level
      • Keyframe interval (forced every 3 seconds for HLS)
      • Frame rate left untouched when possible
      • AAC audio
      • MP4 container
      We enabled variable bit-rate look-ahead for QSV transcoding
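
The shared parameters above map naturally onto FFmpeg command lines. The sketch below builds one argument list for the software path and one for the QSV path, using options from mainline FFmpeg (`-hwaccel qsv`, the `h264_qsv` codec and its `-look_ahead` option, `-force_key_frames`). The exact flags used in the talk are not given, the Intel-patched build may expose different options, and an H.264 input is assumed for hardware decoding.

```python
def ffmpeg_args(src, dst, hardware, keyint_seconds=3):
    """Return an FFmpeg argument list for a software or QSV transcode."""
    args = ["ffmpeg", "-y"]
    if hardware:
        # Decode on the GPU; assumes the input video stream is H.264.
        args += ["-hwaccel", "qsv", "-c:v", "h264_qsv"]
    args += ["-i", src]
    if hardware:
        # Encode on the GPU with look-ahead rate control enabled.
        args += ["-c:v", "h264_qsv", "-look_ahead", "1"]
    else:
        # Software encode with libx264 on 16 threads, as in the test setup.
        args += ["-c:v", "libx264", "-threads", "16"]
    args += [
        # Force a keyframe every keyint_seconds seconds (HLS segmenting).
        "-force_key_frames", f"expr:gte(t,n_forced*{keyint_seconds})",
        "-c:a", "aac",
        "-f", "mp4",
        dst,
    ]
    return args


if __name__ == "__main__":
    print(" ".join(ffmpeg_args("input.mkv", "output.mp4", hardware=True)))
    print(" ".join(ffmpeg_args("input.mkv", "output.mp4", hardware=False)))
```

Keeping both paths in one function makes it easy to check that only the video codec options differ between the two runs being compared.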
  21. 21. RESULTS: Performance test 1: single 1080p transcoding
  22. 22. RESULTS: Performance test 2: concurrent 10x1080p transcodings
  23. 23. RESULTS: Performance test 3: concurrent 20x480p transcodings
  24. 24. RESULTS: Power consumption. The following graph shows the power consumption of a full chassis (45 cartridges) over the last 2 weeks: ~3700 W / 45 ≈ 82 W per cartridge (the SW-only solution has a theoretical TDP of 240 W)
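
The per-cartridge figure on this slide is simple arithmetic, reproduced below together with its ratio to the legacy blade TDP. Note the assumption in the comparison: a measured chassis average is set against a theoretical TDP, and a cartridge and a blade are not equivalent units of throughput, so the ratio is only indicative.

```python
CHASSIS_WATTS = 3700            # measured, full chassis
CARTRIDGES_PER_CHASSIS = 45
LEGACY_BLADE_TDP_WATTS = 240    # theoretical TDP of one legacy blade

# Average draw per cartridge, chassis overhead included.
watts_per_cartridge = CHASSIS_WATTS / CARTRIDGES_PER_CHASSIS

# Indicative ratio only: theoretical TDP versus measured average.
ratio_vs_legacy = LEGACY_BLADE_TDP_WATTS / watts_per_cartridge

if __name__ == "__main__":
    print(f"{watts_per_cartridge:.1f} W per cartridge")
    print(f"legacy blade TDP is {ratio_vs_legacy:.1f}x that figure")
```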
  25. 25. RESULTS: Quality
      Test | SW transcoding (2 pass) | HW transcoding | Average gain SW vs HW
      4k → mp4_h264_aac_uhd | 33 | 32 | ~=
      720p → mp4_h264_aac_hd | 44 | 44 | =
      1080p → mp4_h264_aac_fhd | 47 | 40 | x1.2
      movie_sample (1080p) → mp4_h264_aac_fhd | 44 | 44 | =
      • Results are in PSNR units, higher is better
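
For readers unfamiliar with the metric, PSNR compares a reference signal with its distorted version and is expressed in decibels: PSNR = 10 · log10(MAX² / MSE). The sketch below computes it over flat lists of 8-bit sample values; how the talk's scores were measured (tool, colour plane, averaging across frames) is not stated, so this is only the textbook definition.

```python
import math


def psnr(reference, distorted, max_value=255):
    """Peak signal-to-noise ratio in dB between two equal-length sample lists."""
    if len(reference) != len(distorted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error between the two signals.
    mse = sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    original = [100, 100, 100, 100]   # hypothetical pixel values
    encoded = [101, 99, 101, 99]
    print(f"{psnr(original, encoded):.2f} dB")
```

Because the scale is logarithmic, a gap of a few dB, such as 47 versus 40 in the 1080p row, corresponds to a large difference in mean squared error.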
  26. 26. RESULTS: Quality. BW graph with look-ahead enabled; BW graph without look-ahead enabled
  27. 27. MONITORING
  28. 28. MONITORING
  29. 29. CONCLUSION
  30. 30. CONCLUSION: We built our new transcoding farm!
      • Pros
          ○ much faster for single transcoding (up to 12x faster)
          ○ power consumption is much lower
          ○ cheaper (more than 2.5 times cheaper per unit)
      • Cons
          ○ slower with multiple low-res tasks
          ○ quality is not 100% as good, though look-ahead helps
  31. 31. CONCLUSION: Some references
      • Blog post on Medium: http://medium.com/dailymotion-engineering/hardware-assisted-video-transcoding-at-dailymotion-66cd2db448ae
      • SFFmpeg (static FFmpeg build): https://github.com/pyke369/sffmpeg
      • Intel Media SDK FFmpeg patches: https://github.com/Intel-FFmpeg-Plugin/Intel_FFmpeg_plugins
  32. 32. DEMO AND Q&A
  33. 33. Thank you. Gilles Vieira, gilles.vieira@dailymotion.com
  34. 34. RFP Challenge: The quest to find our next DCs. Mohamed Benazza, Nicolas Pérez
  35. 35. RFP: Request For Proposal. An RFP is a set of specifications that describe the sought-after solution, and evaluation criteria that disclose how proposals will be graded. (Margaret Rouse - https://goo.gl/uVHKqM)
  36. 36. Criteo Global Footprint 2017-Q4 (data centers worldwide)
      • Traffic: 90+ Gbps Internet traffic; 1+ Tbps inter-DC capacity
      • Servers: 26,000+ servers; 28 PB storage
      • Growth in 2017: +2 data centers; +6,500 servers; +4 x 100G inter-DC links
      • Power: 8+ megawatts (+/- 7000 homes for 1 year)
  37. 37. Capacity planning (chart, 2017 to 2021, vertical axis 0 to 8000 servers; legend: TY5, #SRV to Add, #SRV Site Capacity, #SRV Total; data labels as extracted: 760, 1019, 1366, 1831, 2454, 2080, 1320, 301)
      • Capacity planning inputs: organic growth, new projects, corporate strategy, resilience, Hadoop
  38. 38. RFP Process: Timeline (chart, Q1 to Q4 2017, February to December; two phases labelled 4 months and 6 months)
      • Dated milestones: RFP launch (February 7th); RFP answers (March 1st); shortlist selection (April 3rd); vendor award (June 9th)
      • Other RFP steps shown: Q&A, offers review, site visit, vendor negotiations, contract review, ETA for PO approval, PO release & contract signed
      • Project steps shown: project launch, BUILD phase, data center infrastructure ready, data center commissioning, hardware procurement, cabling & setup, IP transit & leased line, hardware setup + SRE validation, first billing
  39. 39. Who? R&D Infrastructure, Procurement Team, Legal Counsel, Qapla Team, CTO, CFO, MBG
  40. 40. RFP: Documentation Package
      • Master document with background and planning
      • Appendix 1: Technical Requirement and Answering Grid
      • Mandatory Requirements
      • All questions will be shared on this file
      • Administrative documentation
  41. 41. RFP: Administrative documentation
  42. 42. RFP: Planning and Master Document
  43. 43. RFP: Technical Requirement and Answering Grid
  44. 44. RFP: Pricing Grid
  45. 45.
  46. 46. RFP: Results: Proposal Summary. Vendor #7 is not shortlisted; Vendor #3 is shortlisted
  47. 47. Data Centers Visits
  48. 48. Data Centers Visits
  49. 49.
  50. 50. All that's left to do is …
  51. 51. Questions?
  52. 52.
  53. 53. ACDC - AutomatiC DataCenter. Felix Cantournet & Xavier Krantz, 2017-11-07
  54. 54. Agenda: 1. Leboncoin 2. History 3. Rethinking 4. ACDC 5. Next 6. REX (lessons learned)
  55. 55. Leboncoin: A few figures
  56. 56.
  57. 57.
  58. 58.
  59. 59. 1.2 - Tech stack
      • 2 datacenters
      • 600 physical servers (more than 1000 including virtual ones)
      • 12 Gbit/s of outbound traffic
      • 6 TB of databases
      • 300M images
      • 15k req/s on leboncoin.fr
  60. 60. 1.2 - Tech stack
      • 2 datacenters
      • 600 physical servers (more than 1000 including virtual ones)
      • 12 Gbit/s of outbound traffic
      • 6 TB of databases
      • 300M images
      • 15k req/s on leboncoin.fr
  61. 61. History & Evolution
  62. 62. 2.1 - Initial situation
  63. 63. 2.1 - Initial situation
      ● 1 - Operator
          ○ Find a free IP (Welcome ping!)
      ● 2 - Puppet
          ○ Reserve IP / DNS name in DNS
          ○ Commit + push
          ○ Run the agent on every DHCP node
      ● 3 - Foreman
          ○ Go in Foreman and select a node
          ○ Get the MAC address
          ○ Create the node + put it in build mode
      ● 4 - Puppet
          ○ Reserve MAC address / DNS name in DHCP
          ○ Commit + push
          ○ Run the agent on every DHCP node
  64. 64. 2.1 - Initial situation
      ● 5 - Foreman
          ○ Reboot the node via BMC plugin
      ● 6 - Node installs
          ○ Boot on network (PXE)
          ○ DHCP redirects to TFTP
          ○ TFTP serves the custom PXE config
          ○ Preseed is rendered by Foreman
      ● 7 - Operator
          ○ Follows with Java console
  65. 65. 2.1 - Initial situation
      ● 5 - Foreman
          ○ Reboot the node via BMC plugin
      ● 6 - Node installs
          ○ Boot on network (PXE)
          ○ DHCP redirects to TFTP
          ○ TFTP serves the custom PXE config
          ○ Preseed is rendered by Foreman
      ● 7 - Operator
          ○ Follows with Java console
      Summary: 6 manual steps; error-prone; human conflicts; time-consuming
  66. 66.
  67. 67. 2.2 - Problem statement
      ● Simplify bare-metal provisioning
          ○ Unattended provisioning / installation
          ○ 1 manual step
  68. 68. 2.3 - Attempt 1 - Foreman + SmartProxies. Observation: Foreman is underused. Solution: Smart proxy to automate:
      - IPAM + DHCP
      - DNS
  69. 69. 2.3 - Attempt 1 - Foreman + SmartProxies
      Pain points: DNS
      ● We
          ○ 1 big zone file
      ● Foreman Smart-proxy
          ○ Dynamic updates = nsupdate
          ○ Binary journal file + serial conflicts
      Pain points: DHCP
      ● We
          ○ Do NIC bonding
          ○ Need to register n MAC addresses <> 1 IP
      ● Foreman Smart-proxy
          ○ Not supported
  70. 70. 2.3 - Attempt 1 - Foreman + SmartProxies
      Pain points: Foreman
      ● We
          ○ Do not master Ruby
          ○ Are not “a Tech company”
          ○ Are not that big
      ● Foreman & Smart-proxy
          ○ Very complex code base
          ○ Very complex UI
          ○ Generic, with a lot of (too many) features
  71. 71. Rethinking
  72. 72. 3.1 - Interface with the contractor. Celeris: contractor for on-site interventions in the DC
      ● Spreadsheet
      ● DCIM: Netbox
          ○ Open source
          ○ Digital Ocean
          ○ Python + PostgreSQL
      Integration with Foreman?
  73. 73. 3.2 - Overlapping solutions: IPAM, DCIM, CMDB, ???
  74. 74. Problem statement 2
      ● Automate lifecycle management of physical machines
          ○ Discovery / intake
          ○ Unattended provisioning / installation
          ○ Maintenance, decommission
  75. 75. Collins
      ● Open source project: https://github.com/tumblr/collins
      ● Imposed state machine
      ● Arbitrary hook / callback system on state transitions
      ● Arbitrary key/value metadata attached to each asset
      ● Web UI + HTTP API + firehose
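
The combination of an imposed state machine, hooks on transitions, and free-form key/value metadata can be sketched in a few lines. This is a generic illustration of the pattern, not Collins's implementation or API; the state names, the `Asset` and `Lifecycle` classes, and the example tag are hypothetical.

```python
from collections import defaultdict


class Asset:
    """A physical machine with a lifecycle state and arbitrary metadata."""

    def __init__(self, tag, state="New"):
        self.tag = tag
        self.state = state
        self.attributes = {}  # arbitrary key/value metadata


class Lifecycle:
    """Enforces allowed transitions and runs hooks registered on them."""

    def __init__(self, transitions):
        self._allowed = {src: set(dsts) for src, dsts in transitions.items()}
        self._hooks = defaultdict(list)

    def on(self, src, dst, hook):
        """Register a callable to run after the src -> dst transition."""
        self._hooks[(src, dst)].append(hook)

    def move(self, asset, dst):
        src = asset.state
        if dst not in self._allowed.get(src, ()):
            raise ValueError(f"transition {src} -> {dst} is not allowed")
        asset.state = dst
        for hook in self._hooks[(src, dst)]:
            hook(asset)


# Hypothetical states for illustration only.
TRANSITIONS = {
    "New": ["Provisioning"],
    "Provisioning": ["Provisioned"],
    "Provisioned": ["Maintenance", "Decommissioned"],
    "Maintenance": ["Provisioned"],
}

if __name__ == "__main__":
    lifecycle = Lifecycle(TRANSITIONS)
    lifecycle.on("Provisioning", "Provisioned",
                 lambda a: a.attributes.update(ready="true"))
    node = Asset("node-001")
    lifecycle.move(node, "Provisioning")
    lifecycle.move(node, "Provisioned")
    print(node.state, node.attributes)
```

Because the transitions are declared up front, an operator cannot skip a step, and each automation task hangs off exactly one transition.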
  76. 76. Collins: Tooling
      API clients
      ● go-collins
      ● pycollins
      ● Ruby libs
          ○ collins-auth
          ○ collins-client
          ○ collins-notify
          ○ collins-state
          ○ ...
      CLI
      ● collins-shell
  77. 77. Collins: Web UI
  78. 78. Collins: Web UI
  79. 79. Collins: Lifecycle. Specified workflows:
      - Intake
      - Commissioning
      - Maintenance
      - Decommissioning
  80. 80. Collins: Callback registry
  81. 81. ACDC
  82. 82. 4.1 - Overview
  83. 83. 4.2 - Lorie
  84. 84. 4.2 - Lorie
  85. 85. 4.3 - IPXE Router
  86. 86. 4.4 - Collins callbacks
      ● provisionEvent
          ○ on = "asset_update"
          ○ when
              ■ current.state = "isNew"
      ● nowProvisioned
          ○ on = "asset_update"
          ○ when
              ■ previous.state = "isProvisioning"
              ■ && current.state = "isProvisioned"
      ● unallocated
          ○ on = "asset_update"
          ○ when
              ■ current.state = "isUnallocated"
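
The three callbacks above are rules of the form "on this event, when the previous and current states match". A minimal matcher for such rules could look like the sketch below. The rule names and state predicates come from the slide; the dictionary layout and the `matching_callbacks` function are assumptions, and this is not how Collins itself evaluates its configuration.

```python
# Rules transcribed from the slide; None means "any previous state".
CALLBACKS = {
    "provisionEvent": {
        "on": "asset_update", "previous": None, "current": "isNew",
    },
    "nowProvisioned": {
        "on": "asset_update",
        "previous": "isProvisioning", "current": "isProvisioned",
    },
    "unallocated": {
        "on": "asset_update", "previous": None, "current": "isUnallocated",
    },
}


def matching_callbacks(event, previous_state, current_state, rules=CALLBACKS):
    """Return the names of the rules that fire for this event, in rule order."""
    fired = []
    for name, rule in rules.items():
        if rule["on"] != event:
            continue
        if rule["previous"] is not None and rule["previous"] != previous_state:
            continue
        if rule["current"] != current_state:
            continue
        fired.append(name)
    return fired


if __name__ == "__main__":
    print(matching_callbacks("asset_update", "isProvisioning", "isProvisioned"))
    print(matching_callbacks("asset_update", "isIncomplete", "isNew"))
```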
  87. 87. 4.5 - Provisioning
  88. 88. 4.6 - Tooling
      $ collins-shell
      INFO - ENV Variable COLLINS_CONFIG=/home/xkrantz/Sources/github.schibsted.io/leboncoin/acdc/conf/collins.yaml
      Tasks:
        collins-shell asset <command>                        # Asset related commands
        collins-shell asset_type <command>                   # Asset Type related commands
        collins-shell console                                # drop into the interactive collins shell
        collins-shell help [TASK]                            # Describe available tasks or one specific task
        collins-shell ip_address <command>                   # IP address related commands
        collins-shell ipmi <command>                         # IPMI related commands
        collins-shell latest                                 # check if there is a newer version of collins-shell
        collins-shell log MESSAGE                            # log a message on an asset
        collins-shell logs TAG                               # fetch logs for an asset specified by its tag. Use "all" for a...
        collins-shell power ACTION --reason=REASON --tag=TAG # perform power action (off, on, rebootSoft, rebootHard, etc) o...
        collins-shell power_status                           # check power status on an asset
        collins-shell provision <command>                    # Provisioning related commands
        collins-shell search_logs QUERY                      # search for asset logs
        collins-shell state <command>                        # State management related commands - use with care
        collins-shell tag <command>                          # Tag related commands
        collins-shell version                                # current version of collins-shell
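
The `power` task in the listing above takes an action plus `--reason` and `--tag` options, and the tool reads its configuration path from the `COLLINS_CONFIG` environment variable. A small wrapper for scripting it might look like the sketch below; the function names, the example tag, and the example reason are hypothetical, and only the command shape shown in the help output is relied upon.

```python
import os
import subprocess


def power_command(action, tag, reason):
    """Build the argv for `collins-shell power ACTION --reason=... --tag=...`."""
    return ["collins-shell", "power", action, f"--reason={reason}", f"--tag={tag}"]


def run_power(action, tag, reason, config_path=None):
    """Run the power task, optionally pointing COLLINS_CONFIG at a config file."""
    env = dict(os.environ)
    if config_path is not None:
        env["COLLINS_CONFIG"] = config_path
    # check=True raises CalledProcessError if collins-shell exits non-zero.
    return subprocess.run(power_command(action, tag, reason), env=env, check=True)


if __name__ == "__main__":
    # Printed rather than executed, so the sketch runs without collins-shell.
    print(power_command("rebootSoft", "node-001", "kernel upgrade"))
```

Passing the arguments as a list avoids shell quoting problems when the reason contains spaces.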
  89. 89. Next
  90. 90. 5 - Next: ACDC v2
      Rework
      ● Discovery
      ● OS bootstrapping
      Add
      ● Disk management
      ● Firmware updates
      ● Any maintenance tasks
  91. 91. 5 - Next: ACDC v2
      Rework
      ● Discovery
      ● OS bootstrapping
      Add
      ● Disk management
      ● Firmware updates
      ● Any maintenance tasks
      Discovery
      ● Currently:
          ○ Genesis (Tumblr)
          ○ Ruby DSL (Chef-like)
      ● Next:
          ○ CoreOS in memory + Ansible
  92. 92. 5 - Next: ACDC v2
      Rework
      ● Discovery
      ● OS bootstrapping
      Add
      ● Disk management
      ● Firmware updates
      ● Any maintenance tasks
      OS bootstrapping
      ● Currently:
          ○ Preseed / Kickstart
          ○ Shell scripts
      ● Next:
          ○ CoreOS in memory + Ansible
  93. 93. 5.1 - Ansible jobs runner
  94. 94. 5.1 - Ansible jobs runner
  95. 95. 5.2 - Visualization & federation
  96. 96. 5.3 - Integration
  97. 97. 5.3 - Integration
  98. 98. REX: SPECS
  99. 99. REX: 20% projects are not enough
  100. 100. REX: Services & ownership transition (for Ops)
