The document provides an introduction and overview of the FireWorks workflow software. Some key points:
- FireWorks is an open-source, Python-based workflow management software that uses MongoDB and is pip-installable.
- It is used by several large DOE projects and materials science groups for tasks like materials modeling, machine learning, and document processing. Over 100 million CPU-hours have been run through it, with everyday production use.
- FireWorks allows for very dynamic workflows that can modify themselves intelligently and add/remove tasks over long periods of time in response to results. It also features duplicate job detection and persistent job status.
FireWorks overview
1. Anubhav Jain
FireWorks workflow software:
An introduction
LLNL meeting | November 2016
Energy & Environmental Technologies
Berkeley Lab
Slides available at www.slideshare.net/anubhavster
2. - Built w/ Python + MongoDB. Open-source, pip-installable:
  - http://pythonhosted.org/FireWorks/
  - Very easy to install; most people can run the first tutorial within 30 minutes of starting
- At least 100 million CPU-hours used; everyday production use by 3 large DOE projects (Materials Project, JCESR, JCAP) as well as many materials science research groups
- Also used for graphics processing, machine learning, multiscale modeling, and document processing (but not by us)
- #1 Google hit for “Python workflow software”
  - still behind Pegasus, Kepler, Taverna, Trident for “scientific workflow software”
2
4. - Partly, we had trouble learning and using other people’s workflow software
  - Today, I think the situation is much better
  - For example, Pegasus in 2011 gave no instructions to a general user on how to install/use/deploy it apart from a super-complicated user manual
  - Today, Pegasus takes more care to show you how to use it on their web page
  - Other tools like Swift (Argonne) are also providing tutorials
- Partly, the other workflow software wasn’t what we were looking for
  - Other software emphasized completing a fixed workload quickly rather than fluidly adding, subtracting, reprioritizing, searching, etc. workflows over long time periods
4
6. - Millions of small jobs, each at least a minute long
- Small amount of inter-job parallelism (“bundling”) (e.g. <1000 jobs); any amount of intra-job parallelism
- Failures are common; need persistent status
  - like UPS packages, a database is a necessity
- Very dynamic workflows
  - i.e. workflows that can modify themselves intelligently and act like researchers that submit extra calculations as needed
- Collisions/duplicate detection
  - people submitting the same workflow, or workflows that have some steps in common
- Runs on a laptop or a supercomputer
- Not “extreme” or record-breaking applications
- Can be installed/learned/used by yourself, without help or support, by a normal scientist rather than a “workflow expert”
- Python-centric
6
7. - Features
- Potential issues
- Conclusion
- Appendix slides
  - Implementation
  - Getting started
  - Advanced usage
7
9. ?
You can scale without human effort
Easily customize what gets run where
9
10. - PBS
- SGE
- SLURM
- IBM LoadLeveler
- NEWT (a REST-based API at NERSC)
- Cobalt (Argonne LCF; initial runs of ~2 million CPU-hours successful)
10
13. what machine
what time
what directory
what was the output
when was it queued
when did it start running
when was it completed
LAUNCH
- both job details (scripts + parameters) and launch details are automatically stored
13
14. - Soft failures, hard failures, human errors
  - “lpad rerun -s FIZZLED”
  - “lpad detect_unreserved --rerun” OR
  - “lpad detect_lostruns --rerun”
14
15. Xiaohui can be replaced by digital Xiaohui, programmed into FireWorks
15
17. [Workflow diagram] Generate relaxation VASP input files from initial structure → Run VASP calculation with Custodian → Insert results into database → Set up AIMD simulation using final relaxed structure → Generate AIMD VASP input files from relaxed structure → Run VASP calculation with Custodian with Walltime Handler → Insert AIMD simulation results into database → Convergence reached? If No, dynamically add a continuation AIMD Firework that starts from the previous run; if Yes, transfer AIMD calculation output to the specified final location. Done.
Each box represents a FireTask, and each series of boxes with the same color represents a single Firework. Green: initial structure relaxation run. Blue: AIMD simulation. Red: insert AIMD run into db.
Dynamically add multiple parallel AIMD Fireworks, e.g. different INCAR configs, temperatures, etc.
17
18. - Submitting millions of jobs
  - Easy to lose track of what was done before
- Multiple users submitting jobs
- Sub-workflow duplication
Duplicate job detection: if two workflows contain an identical step, ensure that the step is only run once and relevant information is still passed.
18
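The dedup guarantee above can be pictured with a small pure-Python sketch. This is not the actual FireWorks mechanism (FireWorks uses configurable "duplicate finder" objects); it only illustrates the idea: fingerprint each step's spec, and reuse the stored result when an identical fingerprint has already completed.

```python
import hashlib
import json

def spec_fingerprint(spec):
    """Canonical hash of a step's spec, used to spot identical steps."""
    canonical = json.dumps(spec, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

completed = {}  # fingerprint -> stored result of the earlier run

def run_step(spec, run_fn):
    """Run a step only if an identical one has not completed already."""
    fp = spec_fingerprint(spec)
    if fp in completed:
        return completed[fp]  # pass along the earlier result instead of rerunning
    result = run_fn(spec)
    completed[fp] = result
    return result
```

Two workflows submitting the same `spec` would then share one execution, while the second still receives the result it needs to continue.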
19. - Within workflow, or between workflows
- Completely flexible and can be modified whenever you want
19
20. Now seems like a good time to bring up the last few lines of the OUTCAR of all failed jobs...
20
21. - Keep queue full with jobs
- Pack jobs automatically (to a point)
21
23. - Lots of care put into documentation and tutorials
  - Many strangers and outsiders have independently used it w/o support from us
- Built-in tasks
  - run BASH/Python scripts
  - file transfer (incl. remote)
  - write/copy/delete files
23
24. - No direct funding for FWS – certainly not a multimillion dollar project
- Mitigating longevity concerns:
  - FWS is open-source, so the existing code will always be there
  - FWS never required explicit funding for development/enhancement
  - FWS has a distributed user and developer community, shielding it from a single point of failure
  - Several multimillion dollar DOE projects and many research groups, including my own, depend critically on FireWorks. Funding for basic improvements/bugfixes is certainly going to be there if really needed.
- Mitigating support concerns:
  - No funding does mean limited support for external users
  - Support mechanisms favor solving problems broadly (e.g., better code, better documentation) versus working one-on-one with potential users to solve their problems and develop single-serving “workarounds”
  - BUT there is a free support list, and if you look, you will see that even specific individual concerns are handled quickly and efficiently:
    - https://groups.google.com/forum/#!forum/fireworkflows
  - In fact, I have yet to see proof of better user support from well-funded projects:
    - Compare against: http://mailman.isi.edu/pipermail/pegasus-users/
    - Compare against: https://lists.apache.org/list.html?users@taverna.apache.org
    - Compare against: http://swift-lang.org/support/index.php (no results in any search?)
24
25. - Features
- Potential issues
- Conclusion
- Appendix slides
  - Implementation
  - Getting started
  - Advanced usage
25
27. [Benchmark plots: jobs/second vs. number of jobs (0-1000) for the mlaunch and rlaunch commands; seconds per task vs. number of tasks (200-1000) for pairwise, parallel, reduce, and sequence workflow patterns, with 1 and 8 clients]
- Tests indicate that FireWorks can handle a throughput of about 6-7 jobs finishing per second
- Overhead is 0.1-1 sec per task
- Recent changes might enhance speed, but are not tested
27
28. - Computing center issues
  - Almost all computing centers limit the number of “mpirun”-style commands that can be executed within a single job
  - Typically, this sets a limit to the degree of job packing that can be achieved
  - Currently, no good solution; may need to work on “hacking” the MPI communicator, e.g., “wraprun” is one effort at Oak Ridge
28
29. - Features
- Potential issues
- Conclusion
- Appendix slides
  - Implementation
  - Getting started
  - Advanced usage
29
30. - If you are curious, just try spending 1 hour with FireWorks
  - http://pythonhosted.org/FireWorks
  - If you’re not intrigued after an hour, try something else
- If you need help, contact the support list:
  - https://groups.google.com/forum/#!forum/fireworkflows
- If you want to read up on FireWorks, there is a paper – but this is no substitute for trying it
  - “FireWorks: a dynamic workflow system designed for high-throughput applications”. Concurr. Comput. Pract. Exp. 27, 5037–5059 (2015).
  - Please cite this if you use FireWorks
30
31. - Features
- Potential issues
- Conclusion
- Appendix slides
  - Implementation
  - Getting started
  - Advanced usage
31
33. from fireworks import Firework, Workflow, LaunchPad, ScriptTask
from fireworks.core.rocket_launcher import rapidfire
# set up the LaunchPad and reset it (first time only)
launchpad = LaunchPad()
launchpad.reset('', require_password=False)
# define the individual FireWorks and Workflow
fw1 = Firework(ScriptTask.from_str('echo "To be, or not to be,"'))
fw2 = Firework(ScriptTask.from_str('echo "that is the question:"'))
wf = Workflow([fw1, fw2], {fw1:fw2}) # set of FWs and dependencies
# store workflow in LaunchPad
launchpad.add_wf(wf)
# pull all jobs and run them locally
rapidfire(launchpad)
33
34. fws:
- fw_id: 1
  spec:
    _tasks:
    - _fw_name: ScriptTask
      script: echo 'To be, or not to be,'
- fw_id: 2
  spec:
    _tasks:
    - _fw_name: ScriptTask
      script: echo 'that is the question:'
links:
  1:
  - 2
metadata: {}

(this is YAML, a bit prettier for humans but less pretty for computers)
The same JSON document will produce the same result on any computer (with the same Python functions).
34
35. fws:
- fw_id: 1
  spec:
    _tasks:
    - _fw_name: ScriptTask
      script: echo 'To be, or not to be,'
- fw_id: 2
  spec:
    _tasks:
    - _fw_name: ScriptTask
      script: echo 'that is the question:'
links:
  1:
  - 2
metadata: {}

Just some of your search options:
• simple matches
• match in array
• greater than/less than
• regular expressions
• match subdocument
• Javascript function
• MapReduce…
All for free, and all on the native workflow format!
(this is YAML, a bit prettier for humans but less pretty for computers)
35
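To illustrate, here is a minimal pure-Python sketch of how such Mongo-style queries match a Firework document. The real queries are evaluated server-side by MongoDB; `matches` below supports only a small subset of the operators listed and is purely illustrative.

```python
import re

def matches(doc, query):
    """Return True if `doc` satisfies a small subset of Mongo query syntax."""
    for key, cond in query.items():
        # dotted keys descend into subdocuments, e.g. "spec._tasks"
        value = doc
        for part in key.split("."):
            if not isinstance(value, dict) or part not in value:
                return False
            value = value[part]
        if isinstance(cond, dict) and any(k.startswith("$") for k in cond):
            # operator conditions: greater/less than, membership, regex
            for op, arg in cond.items():
                if op == "$gt" and not value > arg:
                    return False
                if op == "$lt" and not value < arg:
                    return False
                if op == "$in" and value not in arg:
                    return False
                if op == "$regex" and not re.search(arg, value):
                    return False
        elif isinstance(value, list) and not isinstance(cond, list):
            if cond not in value:  # "match in array": scalar matches any element
                return False
        elif value != cond:  # simple match (also covers exact subdocument match)
            return False
    return True

fw_doc = {"fw_id": 1, "state": "COMPLETED", "spec": {"temperature": 800}}
print(matches(fw_doc, {"state": "COMPLETED"}))              # True (simple match)
print(matches(fw_doc, {"spec.temperature": {"$gt": 500}}))  # True (range match)
```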
37. - Theme: Worker machine pulls a job & runs it
- Variation 1:
  - different workers can be configured to pull different types of jobs via config + MongoDB
- Variation 2:
  - worker machines sort the jobs by a priority key and pull matching jobs with the highest priority
37
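Variation 2 can be sketched in a few lines of pure Python. This mirrors the idea, not FireWorks internals; the `priority` and `category` field names here are illustrative placeholders.

```python
def pull_next_job(jobs, category=None):
    """Pick the highest-priority READY job this worker is allowed to run."""
    candidates = [
        j for j in jobs
        if j["state"] == "READY"
        and (category is None or j.get("category") == category)
    ]
    if not candidates:
        return None  # nothing for this worker to do
    # higher priority wins; a missing priority defaults to 0
    return max(candidates, key=lambda j: j.get("priority", 0))
```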
38. Queue launcher (running on login node or crontab) submits thruput jobs. Each job wakes up when PBS runs it, grabs the latest job description from an external DB, and runs the job based on the DB description.
38
39. - Multiple processes pull and run jobs simultaneously
  - It is all the same thing, just sliced* different ways!
[Diagram: 1 large job spans “mpirun -> Node 1 … Node n”; independent processes each run “Query Job -> job A/B/X -> update DB” for mol a, mol b, … mol x]
*get it? wink wink
39
41. - Features
- Potential issues
- Conclusion
- Appendix slides
  - Implementation
  - Getting started
  - Advanced usage
41
42. input_array: [1, 2, 3]
1. Sum input array
2. Write to file
3. Pass result to next job
input_array: [4, 5, 6]
1. Sum input array
2. Write to file
3. Pass result to next job
input_data: [6, 15]
1. Sum input data
2. Write to file
3. Pass result to next job
-------------------------------------
1. Copy result to home dir
6 15
43. from fireworks import FireTaskBase, FWAction

class MyAdditionTask(FireTaskBase):
    _fw_name = "My Addition Task"

    def run_task(self, fw_spec):
        input_array = fw_spec['input_array']
        m_sum = sum(input_array)
        print("The sum of {} is: {}".format(input_array, m_sum))
        with open('my_sum.txt', 'a') as f:
            f.writelines(str(m_sum) + '\n')
        # store the sum; push the sum to the input array of the next sum
        return FWAction(stored_data={'sum': m_sum},
                        mod_spec=[{'_push': {'input_array': m_sum}}])

See also: http://pythonhosted.org/FireWorks/guide_to_writing_firetasks.html
44. from fireworks import Firework, Workflow, LaunchPad, FWorker, FileTransferTask
from fireworks.core.rocket_launcher import rapidfire

# set up the LaunchPad and reset it
launchpad = LaunchPad()
launchpad.reset('', require_password=False)

# create a Workflow consisting of AdditionTask FWs + a file transfer
fw1 = Firework(MyAdditionTask(), {"input_array": [1, 2, 3]}, name="pt 1A")
fw2 = Firework(MyAdditionTask(), {"input_array": [4, 5, 6]}, name="pt 1B")
fw3 = Firework([MyAdditionTask(),
                FileTransferTask({"mode": "cp", "files": ["my_sum.txt"], "dest": "~"})],
               name="pt 2")
wf = Workflow([fw1, fw2, fw3], {fw1: fw3, fw2: fw3}, name="MAVRL test")
launchpad.add_wf(wf)

# launch the entire Workflow locally
rapidfire(launchpad, FWorker())
45. - lpad get_wflows -d more
- lpad get_fws -i 3 -d all
- lpad webgui
- Also rerun features
See all reporting at the official docs: http://pythonhosted.org/FireWorks
46. - There are a ton in the documentation and tutorials, just try them!
  - http://pythonhosted.org/FireWorks
- I want an example of running VASP!
  - https://github.com/materialsvirtuallab/fireworks-vasp
  - https://gist.github.com/computron/
    - look for “fireworks-vasp_demo.py”
  - Note: demo is only a single VASP run
  - multiple VASP runs require passing directory names between jobs
    - currently you must do this manually
    - in future, perhaps build into FireWorks
47. - If you can copy commands from a web page and type them into a Terminal, you possess the skills needed to complete the FireWorks tutorials
  - BUT: for long-term use, it is highly suggested you learn some Python
- Go to:
  - http://pythonhosted.org/FireWorks
  - or Google “FireWorks workflow software”
- NERSC-specific instructions & notes:
  - https://pythonhosted.org/FireWorks/installation_notes.html
47
48. - Features
- Potential issues
- Conclusion
- Appendix slides
  - Implementation
  - Getting started
  - Advanced usage
48
49. - Say you have a FWS database with many different job types, and want to run different job types on different machines
- You have three options:
  1. Set the “_fworker” variable in the FW itself. Only the FWorker(s) with the matching name will run the job.
  2. Set the “_category” variable in the FW itself. Only the FWorker(s) with the matching categories will run the job.
  3. Set the “query” parameter in the FWorker. You can set any Mongo query on the FW to decide what jobs this FWorker will run, e.g., jobs with certain parameter ranges.
49
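A simplified sketch of how these three options combine. The key names mirror the FireWorks spec keys (`_fworker`, `_category`), but the matching logic is a stand-in for the real implementation, and the worker dicts are hypothetical; here the worker query is reduced to exact-value matches.

```python
def worker_accepts(fw, worker):
    """Decide whether a worker may run a Firework, per options 1-3 above."""
    spec = fw["spec"]
    # Option 1: the FW names a specific FWorker
    if "_fworker" in spec and spec["_fworker"] != worker["name"]:
        return False
    # Option 2: the FW declares a category that the worker must serve
    if "_category" in spec and spec["_category"] not in worker.get("categories", []):
        return False
    # Option 3: the worker's own Mongo-style query (exact matches only here)
    for key, val in worker.get("query", {}).items():
        if spec.get(key) != val:
            return False
    return True

gpu_worker = {"name": "gpu1", "categories": ["gpu"], "query": {}}
print(worker_accepts({"spec": {"_category": "gpu"}}, gpu_worker))  # True
```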
50. - Both Trackers and BackgroundTasks will run a process in the background of your main FW.
- A Tracker is a quick way to monitor the first or last few lines of a file (e.g., an output file) during job execution. It is also easy to set up: just set the “_trackers” variable in the FW spec with the details of what files you want to monitor.
  - This allows you to track the output files of all your jobs using the database.
  - For example, one command will let you view the output files of all failed jobs – all without logging into any machines!
- A BackgroundTask will run any FireTask in a separate Process from the main task. There are built-in parameters to help.
50
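Conceptually, a Tracker just snapshots the tail of a file so it can be stored in the database with the job record. A minimal sketch of that idea follows; the dict-style tracker settings in the comment are illustrative, not the exact FireWorks syntax.

```python
def tail_lines(text, nlines=25):
    """Return the last `nlines` lines of an output file's contents,
    like the snapshot a Tracker stores in the database."""
    return text.splitlines()[-nlines:]

# In a FW spec the equivalent setting would look roughly like
# (illustrative, check the FireWorks docs for the exact form):
# spec = {"_trackers": [{"filename": "my_output.log", "nlines": 25}]}
```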
51. - Sometimes, the specific Python code that you need to execute (FireTask) depends on what machine you are running on
- A solution to this is FW_env
- Each Worker configuration can set its own “env” variable, which is accessible to the Firework at runtime under the “_fw_env” key
- The same job will see different values of “_fw_env” depending on where it’s running, and can use this to execute the workflow
51
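A sketch of how a FireTask body might consume `_fw_env`; the `vasp_cmd` entry is a made-up example key for illustration, not part of FireWorks itself.

```python
def get_vasp_cmd(fw_spec):
    """Look up a per-machine command from the worker-provided _fw_env.

    The same job sees a different _fw_env depending on which worker
    runs it, so machine-specific paths stay out of the workflow itself.
    """
    env = fw_spec.get("_fw_env", {})
    # fall back to a plain command name if this worker sets no env
    return env.get("vasp_cmd", "vasp")

# on a worker whose config sets env: {"vasp_cmd": "/opt/vasp/bin/vasp_std"}
print(get_vasp_cmd({"_fw_env": {"vasp_cmd": "/opt/vasp/bin/vasp_std"}}))
print(get_vasp_cmd({}))  # "vasp"
```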
52. - Normally, a workflow stops proceeding when a Firework fails, or “fizzles”.
  - at this point, a user might change some backend code and rerun the failed job
- Sometimes, you want a child FW to run even if one or more parents have “fizzled”.
  - For example, the child FW might inspect the parent, determine a cause of failure, and initiate a “recovery workflow”
- To enable a child to run, set the “_allow_fizzled_parents” key in the spec to True
  - FWS also creates a “_fizzled_parents” key in that FW spec that becomes available when the parents fail, and contains details about the parent FW
52
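The scheduling rule can be sketched as follows; this is an illustration of the behavior described above, not the FireWorks implementation.

```python
def child_can_run(parent_states, child_spec):
    """A child becomes runnable when all parents COMPLETED, or, with
    _allow_fizzled_parents set, when every parent finished either way."""
    allow_fizzled = child_spec.get("_allow_fizzled_parents", False)
    done = {"COMPLETED", "FIZZLED"} if allow_fizzled else {"COMPLETED"}
    return all(state in done for state in parent_states)

# a recovery child that runs even after its parent fizzles
print(child_can_run(["FIZZLED"], {"_allow_fizzled_parents": True}))  # True
```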
53. - You might want some statistics on FWS jobs:
  - daily, weekly, monthly reports over certain periods for how many Workflows/FireWorks/etc. completed
  - identify days when there were many job failures, perhaps associated with a computing center outage
  - grouping FIZZLED jobs by a key in the spec, e.g. to get stats on what job types failed most often
- All this is possible with the reporting package; type “lpad report -h” for more information
- You can also introspect to find common factors in job failures; type “lpad introspect -h” for more information
53