One advantage of cloud computing is access to potentially huge-scale resources for your task, which is especially beneficial to data-driven processes with heavy computation. This talk introduces the idea of a job script that orchestrates the execution of workflows across multiple computing nodes. An implementation based on AWS SWF (Simple Workflow Service) is described, with examples from music streaming and video streaming processing at KKBOX.
@PyCon APAC 2015
Orchestrating the execution of workflows for media streaming service and even more
1. Orchestrating the execution of
workflows for media streaming
service and even more
Shuen-Huei (Drake) Guan
sr. principal engineer, KKBOX
vice chairperson, PyCon APAC 2015
2. Who am I?
• administrator, Ptt BBS
• technical director / R&D manager, Digimax
• team player, KKBOX
• contributor, PyCon Taiwan
3. More a story than a tech talk.
No KKBOX trade secrets are revealed.
4. There are just a few slides
about Python.
And it's not about music streaming.
10. If we can make music
streaming work, how
about video streaming?
— KKBOX CxO
11. Let's work on a video-on-demand service
• Adaptive streaming.
• DRM protection.
• Video processing on cloud.
12. We thought video streaming was
similar to music streaming,
but we were wrong.
13. Issue 1. Workflow
multiple distinct interconnected steps that need to be executed in
a particular order in a distributed environment...
— someone
flickr:siddhu2020
http://bit.ly/1FAukT2
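The definition on this slide (interconnected steps that must run in a particular order) can be pictured as a dependency graph. The following is a minimal sketch, not the system described in the talk: the step names and the `WORKFLOW` dictionary are hypothetical, chosen to resemble a video pipeline, and the ordering uses a plain topological sort (Kahn's algorithm).

```python
from collections import deque

# Hypothetical step names; each step maps to the steps it depends on.
WORKFLOW = {
    "verify": [],
    "transcode": ["verify"],
    "encrypt": ["transcode"],
    "package_dash": ["encrypt"],
    "package_hls": ["transcode"],
    "deploy": ["package_dash", "package_hls"],
}


def execution_order(workflow):
    """Return the steps in an order that respects every dependency."""
    remaining = {step: set(deps) for step, deps in workflow.items()}
    # Steps with no dependencies can start immediately.
    ready = deque(sorted(s for s, deps in remaining.items() if not deps))
    order = []
    while ready:
        step = ready.popleft()
        order.append(step)
        # Finishing `step` may unblock other steps.
        for other in sorted(remaining):
            deps = remaining[other]
            if step in deps:
                deps.remove(step)
                if not deps:
                    ready.append(other)
    if len(order) != len(workflow):
        raise ValueError("workflow has a dependency cycle")
    return order


if __name__ == "__main__":
    print(execution_order(WORKFLOW))
```

Steps that become ready at the same time (here `encrypt` and `package_hls`) do not depend on each other, so a distributed scheduler could hand them to different machines.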
15.
def run(source, secret_key, cipher):
    # verify that the source is ok
    if not verify(source):
        return False
    # convert audio with different bitrates
    _ = [convert(source, i) for i in range(4)]
    # update id3 tag for all converted audios
    _ = update_id3_tag(_)
    # encrypt all audios
    _ = encrypt(_, secret_key, cipher)
    # deploy to backend DB
    deploy(_)
    return True
19. Sample client code to submit a workflow [1]
$workflow = new Gearman_Workflow('KKBOX_Convert_Audio', array(
    'source' => $source,
    'args' => $args));
$workflow->attachCallback(function () {
});
$client->run($workflow);
[1] Warning: it's PHP.
20. Sample worker (server) code to do things [1]
class KKBOX_Convert_Audio extends Gearman_Worker {
    public function run($arg) {
        // check the source
        if (!verify()) return;
        // convert audio with different bitrates
        for ($i=0; $i<4; $i++) {
            convert($i);
        }
        // update id3 tag for all audios
        update_id3_tag();
        // encrypt audios
        encrypt();
        // sequentially deploy to backend DB
        for ($i=0; $i<4; $i++) {
            deploy($i);
        }
    }
}
[1] Warning: it's PHP.
22. Sample worker (server) code to do things [1]
class KKBOX_Encode_Video extends Gearman_Worker {
    public function run($arg) {
        transcode();
        encrypt();
    }
}
class KKBOX_Convert_Video extends Gearman_Worker {
    public function run($arg) {
        if (!verify()) return;
        // create asynchronous sub-workflows
        $result = create_sub_workflow(KKBOX_Encode_Video);
        // wait for all sub-workflows to finish
        joint($result);
        create_sub_workflow(KKBOX_Package_DASH, $result->encrypted);
        create_sub_workflow(KKBOX_Package_HLS, $result->plain);
        joint();
        deploy();
    }
}
[1] Warning: it's PHP.
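The fan-out and join pattern in the worker above (start sub-workflows, wait for them, then start the next group) can be sketched in Python with the standard `concurrent.futures` module. This is only an illustration on a single machine, not the Gearman setup from the slides: `encode_video`, `package`, and `convert_video` are hypothetical stand-ins that return file names instead of doing real work.

```python
from concurrent.futures import ThreadPoolExecutor


def encode_video(profile):
    """Stand-in for one encoding sub-workflow (transcode, then encrypt)."""
    return {
        "plain": "plain_%d.mp4" % profile,
        "encrypted": "encrypted_%d.mp4" % profile,
    }


def package(fmt, files):
    """Stand-in for a packaging sub-workflow such as DASH or HLS."""
    return "%s(%s)" % (fmt, ",".join(files))


def convert_video(profiles):
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Fan out: one encoding sub-workflow per profile.
        # list(...) acts as the join: it blocks until all have finished.
        encoded = list(pool.map(encode_video, profiles))
        # Second fan-out: DASH takes encrypted files, HLS takes plain ones.
        dash = pool.submit(package, "DASH", [e["encrypted"] for e in encoded])
        hls = pool.submit(package, "HLS", [e["plain"] for e in encoded])
        # Join again before the caller deploys anything.
        return dash.result(), hls.result()


if __name__ == "__main__":
    print(convert_video(range(2)))
```

Even in this toy form the ordering logic lives inside the worker code, which is the coupling the following slides complain about.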
23. The real Gearman worker code is far more
complicated and lacks the elegance we would like
24. Issue 3. Workflows would evolve...
• Let's save file size and IO.
• Let's make it faster.
• Let's add some more profiles.
• Let's fix some encoding.
26. Everything fails all the time.
— Werner Vogels, CTO of Amazon
flickr:Bill Abbott
http://bit.ly/1GnrSGr
28. Factors we want to pay close attention to
• Encoding workflow.
• Distributing tasks across machines in the cloud.
• Server maintenance.
29. We hope ...
1. no need to maintain this system;
2. easier to distribute workflow/tasks, even to local machine;
3. a high-level workflow: as long as you can draw your
processes on paper, you can map them to a workflow!
30. What Google suggests to us...
• Apache Kafka, Mesos, ...
• Gearman (sorry, but we've tried.)
• Luigi by Spotify
• Celery
• Potentially all message brokers with some additional work.
32.
import boto.swf.layer2 as swf  # boto 2's SWF layer

class HelloWorker(swf.ActivityWorker):
    domain = DOMAIN
    version = VERSION
    task_list = TASKLIST

    def run(self):
        # long-poll SWF for the next activity task
        activity_task = self.poll()
        if 'activityId' in activity_task:
            print('Hello, World!')
            self.complete()
            return True
33.
class HelloDecider(swf.Decider):
    domain = DOMAIN
    task_list = TASKLIST
    version = VERSION

    def run(self):
        history = self.poll()
        if 'events' in history:
            # Find workflow events not related to decision scheduling.
            workflow_events = [e for e in history['events']
                               if not e['eventType'].startswith('Decision')]
            last_event = workflow_events[-1]
            decisions = swf.Layer1Decisions()
            if last_event['eventType'] == 'WorkflowExecutionStarted':
                decisions.schedule_activity_task(...)
            elif last_event['eventType'] == 'ActivityTaskCompleted':
                decisions.complete_workflow_execution()
            self.complete(decisions=decisions)
            return True
34. SWF
• Decider defines the workflow.
• We still need to write the workflow logic in the decider.
• Workers do the action.
• Every time we change a workflow or an action, we need to
redeploy deciders and workers.
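One way to picture "the decider defines the workflow" is to treat the decider as a function from event history to the next decision. The sketch below is an assumption-laden illustration, not KKBOX's decider and not the boto API: `STEPS`, `decide`, and the `'activity'` key on events are hypothetical, and only the `eventType` values mirror real SWF event names.

```python
# Hypothetical workflow definition: an ordered list of activity names.
STEPS = ["verify", "transcode", "package", "deploy"]


def decide(events, steps=STEPS):
    """Return the next decision for a workflow, given its event history.

    `events` are dicts with an 'eventType' key, as in the decider above.
    The 'activity' key is an assumption made for this sketch.
    """
    if any(e["eventType"] == "ActivityTaskFailed" for e in events):
        return ("fail_workflow", None)
    completed = [e["activity"] for e in events
                 if e["eventType"] == "ActivityTaskCompleted"]
    if len(completed) == len(steps):
        return ("complete_workflow", None)
    # Schedule the first step that has not completed yet.
    return ("schedule_activity", steps[len(completed)])


if __name__ == "__main__":
    history = [{"eventType": "WorkflowExecutionStarted"}]
    print(decide(history))
```

Because the step list is plain data here, changing the workflow would mean changing `STEPS` rather than the decider code, which hints at why a declarative job description is attractive.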
41. Make it Pythonic if that makes developers
happier
source = 's3://bucket/source.mp4'
with Job():
    with Task('Source Inspection'):
        Cmd('emilia verify -i %s' % source)
    with Task('Transcode', parallel=True):
        for i in range(4):
            with Task():
                Cmd('ffmpeg -i %s ... -o /tmp/a_%d.mp4' % (source, i))
        for i in range(9):
            with Task():
                Cmd('ffmpeg -i %s ... -o /tmp/v_%d.mp4' % (source, i))
    with Task('Adaptive'):
        with Task('DASH'):
            pass
        with Task('HLS'):
            pass
        with Task('MSS'):
            pass
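The slides do not show how `Job`, `Task`, and `Cmd` are implemented. As a rough guess at the idea, nested `with` blocks can build a task tree without running anything. Everything below is a hypothetical reimplementation for illustration only: the class bodies, the `_stack` attribute, and the `children` and `commands` fields are assumptions.

```python
class Task(object):
    """A node in the job tree (hypothetical sketch, not the talk's code)."""

    _stack = []  # tasks whose `with` blocks are currently open

    def __init__(self, name=None, parallel=False):
        self.name = name
        self.parallel = parallel
        self.children = []  # nested Task objects
        self.commands = []  # command strings added by Cmd()

    def __enter__(self):
        # Attach this task to the innermost open task, if there is one.
        if Task._stack:
            Task._stack[-1].children.append(self)
        Task._stack.append(self)
        return self

    def __exit__(self, exc_type, exc, tb):
        Task._stack.pop()
        return False  # never swallow exceptions


class Job(Task):
    """The root of the tree."""


def Cmd(command):
    """Record a command on the innermost open task; nothing is executed."""
    Task._stack[-1].commands.append(command)


source = 's3://bucket/source.mp4'
with Job() as job:
    with Task('Source Inspection'):
        Cmd('emilia verify -i %s' % source)
    with Task('Transcode', parallel=True):
        for i in range(4):
            with Task():
                Cmd('ffmpeg -i %s ... -o /tmp/a_%d.mp4' % (source, i))

if __name__ == "__main__":
    for child in job.children:
        print(child.name, child.parallel, len(child.children))
```

In this sketch the script only describes the tree; a decider could later walk `job.children` and schedule each command as an activity, in parallel where `parallel=True`.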
42. Status
• 1,500,000 minutes of video encoded.
• 3,000 videos per day (max).
• 800 workers on 100 c3.8xlarge instances (max).
• Spent lots of $.
• Everyone is really happy with that performance.
43. Technical status
• Fault tolerance by retry. [decider]
• Workflow/task has priorities. [SWF]
• try..except..finally mechanism.
[-whendone, -whenerror, -precmds, -postcmds, ...]
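The retry and try..except..finally bullets above can be sketched as a small wrapper. This is a guess at the behaviour, not the actual job-script implementation: `run_task` is hypothetical, and its `whendone` and `whenerror` parameters only borrow their names from the options listed on the slide, with semantics assumed for illustration.

```python
def run_task(action, retries=3, whendone=None, whenerror=None):
    """Run `action`, retrying on failure, then fire the matching hook.

    `whendone(result)` is called after a successful attempt;
    `whenerror(error)` is called once all attempts have failed.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for _attempt in range(retries):
        try:
            result = action()
        except Exception as error:  # retry on any failure
            last_error = error
            continue
        if whendone:
            whendone(result)
        return result
    # Every attempt failed: report it, then re-raise the last error.
    if whenerror:
        whenerror(last_error)
    raise last_error


if __name__ == "__main__":
    attempts = []

    def flaky():
        attempts.append(1)
        if len(attempts) < 3:
            raise RuntimeError("transient failure")
        return "encoded"

    print(run_task(flaky, retries=3))
```

Retrying blindly is only safe when tasks are idempotent, for example when an encoding task overwrites its own output file.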