SoS
Script of Scripts
Bo Peng, PhD
Department of Bioinformatics and Computational Biology
The University of Texas MD Anderson Cancer Center
Polyglot Notebook and Workflow System for both Interactive
Multi-language Data Analysis and Batch Data Processing
SoS
A quick survey
Introduction
• Have you used more than one Jupyter kernel?
• Have you used more than one Jupyter kernel for a single project?
• Have you used Jupyter to analyze large data?
• Have you used any workflow system for your work?
SoS
Who we are and what we do
Introduction
SoS
Our computational environment
Introduction
SoS
Write and manage scripts written in different languages for different environments
Understand and reproduce others’ (and sometimes my own) projects
workflow
Manage data and workflows on different environments for batch data processing
SoS
The promises of Jupyter ecosystem
Introduction
• Supports virtually all scripting languages
• Unified notebook format and interface
• Flexible client/server architecture
• JupyterHub for enterprise
• JupyterLab was around the corner (now ready for users)
• Binder for reproducible data analysis
SoS
What was missing for our work?
Introduction
More IDE features for interactive data analysis
Multi-language support
Integrated workflow system for batch data processing
snakemake
SoS
SoS Polyglot Notebook
Introduction
[Diagram: three separate stacks, each labelled Notebook, Server, Kernel]
SoS
SoS Polyglot Notebook
Introduction
[Diagram labels: Notebook, Server, and four Kernels]
SoS Introduction
SoS Workflow System
[Diagram labels: Notebook, Server, four Kernels, and Workflow System]
SoS Polyglot Notebook
Polyglot Notebook + Workflow System = Working Environment
SoS
A super kernel for all Jupyter kernels
Polyglot Notebook
Kernel
Subkernel
• Starts and shuts down subkernels
• Receives input from the frontend, (optionally) processes it, and sends it to subkernels
• Receives output from subkernels, (optionally) processes it, and sends it to the frontend
%expand %capture
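The relay described on this slide can be sketched in a few lines of Python. This is a toy illustration and not SoS's implementation: the class SuperKernel, its methods, and the pre/post hooks are hypothetical names, and subkernels are modelled as plain callables rather than separate Jupyter kernel processes.

```python
from typing import Callable, Dict, Optional

Transform = Callable[[str], str]


class SuperKernel:
    """Toy relay: forwards code to a named subkernel and returns its output."""

    def __init__(self) -> None:
        # Maps a subkernel name to a callable that "executes" code (assumption:
        # a real subkernel is a separate process reached over Jupyter messaging).
        self.subkernels: Dict[str, Transform] = {}

    def start(self, name: str, run: Transform) -> None:
        """Register (start) a subkernel."""
        self.subkernels[name] = run

    def shutdown(self, name: str) -> None:
        """Remove (shut down) a subkernel if it is running."""
        self.subkernels.pop(name, None)

    def execute(
        self,
        name: str,
        code: str,
        pre: Optional[Transform] = None,
        post: Optional[Transform] = None,
    ) -> str:
        """Optionally process the input, run it, optionally process the output."""
        if pre is not None:
            code = pre(code)
        output = self.subkernels[name](code)
        if post is not None:
            output = post(output)
        return output
```

The optional pre and post hooks stand in for the two "(optionally) processes it" steps: one runs before the code reaches the subkernel, the other before the output reaches the frontend.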
SoS
Prepare input and capture output of subkernels
Polyglot Notebook
SoS
Data Exchange (magics %get, %put, and %with)
Polyglot Notebook
SoS
How data exchange works
Polyglot Notebook
Python side: arr: [1, 2, 3]
R side: df: data.frame(…)
%put arr --to R: R executes arr <- c(1, 2, 3), creating arr: c(1, 2, 3)
%put df: R executes write_feather(df, tmpfile) and Python executes df = feather.read_dataframe(tmpfile), creating df: pandas.DataFrame(…)
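One way to picture the %put arr --to R step is as text generation: the sending side renders a Python value as an R expression, and the receiving kernel evaluates it. The sketch below is a minimal illustration under that assumption. The function names to_r_expr and put_statement are hypothetical, and only booleans, numbers, strings, and flat lists are handled (data frames, which the slide passes through a feather file, are not).

```python
from typing import Any


def to_r_expr(value: Any) -> str:
    """Render a simple Python value as R source text."""
    # bool must be tested before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return repr(value)
    if isinstance(value, str):
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        return '"' + escaped + '"'
    if isinstance(value, (list, tuple)):
        return "c(" + ", ".join(to_r_expr(v) for v in value) + ")"
    raise TypeError("unsupported type: " + type(value).__name__)


def put_statement(name: str, value: Any) -> str:
    """Build the R assignment that recreates the variable in the R kernel."""
    return name + " <- " + to_r_expr(value)


print(put_statement("arr", [1, 2, 3]))  # arr <- c(1, 2, 3)
```

Rendering to source text keeps the two kernels independent: the R variable is a new object, not a view of the Python one.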
SoS
Data exchange between SoS and supported subkernels
Polyglot Notebook
• Create independent variables in another kernel
• Direct data exchange between subkernels, or by way of SoS
• Create variables of similar types
• One to many (e.g. 1, c(1,2) in R)
• Many to one (e.g. Char and String in Julia, both str in Python)
• Intended to support a majority of datatypes, but with no guarantee of lossless data exchange
• Supports kernels for 11 languages now
R: a=1, b=c(1,2)
Python: a=1, b=[1,2]
Python: c='x', d='Hello'
Julia: c='x', d="Hello"
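The "one to many" and "many to one" cases can be made concrete with a small sketch. It assumes an R numeric vector arrives on the Python side as a list of values, and it uses a hypothetical lookup table for Julia type names. Neither from_r_vector nor JULIA_TO_PYTHON is part of SoS.

```python
from typing import Any, Dict, List, Union

# Many to one: two distinct Julia types end up as the same Python type.
JULIA_TO_PYTHON: Dict[str, str] = {"Char": "str", "String": "str"}


def from_r_vector(values: List[Any]) -> Union[Any, List[Any]]:
    """One to many: R has only vectors, so the length decides the Python type.

    A length-1 vector becomes a scalar; anything else becomes a list.
    """
    if len(values) == 1:
        return values[0]
    return list(values)


print(from_r_vector([1]))     # 1
print(from_r_vector([1, 2]))  # [1, 2]
```

This also shows why a lossless round trip is not guaranteed: after a Julia Char has become a Python str, nothing records that it was a Char.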
SoS
Line-by-line execution in side panel (Ctrl-Shift-Enter)
Polyglot Notebook
The command notebook:run-in-console is available in JupyterLab to execute code in a console panel, but a default shortcut is not yet assigned.
SoS
Preview of expressions and files
Polyglot Notebook
JupyterLab PR #4879 for displaying transient information from kernels is pending.
SoS
%revisions, %sessioninfo, and %sossave
Polyglot Notebook
%sossave is equivalent to sos convert from the command line. Multiple templates are available.
SoS Workflow System
Polyglot Notebook + Workflow System = Working Environment
SoS
Overview of SoS Workflow Syntax
Workflow System
Script format of function calls
• Indentation is recommended but not required
• Alternative sigil is allowed (e.g. expand='${ }')
Function format
Script format
3.6+
Step header and statements
• Headers define “steps” of workflows
• input, output, and depends specify input, output and dependent targets of the step
• task defines the rest of the step as external tasks
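The sigil option mentioned above decides which delimiters mark an expression to be expanded inside a script. The sketch below illustrates the idea only and is not SoS's parser: expand is a hypothetical function, the sigil is assumed to be written as "left right" separated by one space, and eval is used purely to keep the example short.

```python
import re
from typing import Any, Dict


def expand(text: str, namespace: Dict[str, Any], sigil: str = "{ }") -> str:
    """Replace every <left>expr<right> in text with the value of expr."""
    left, right = sigil.split(" ")
    pattern = re.compile(re.escape(left) + r"(.+?)" + re.escape(right))

    def substitute(match: "re.Match[str]") -> str:
        # Evaluate the expression against the supplied variables only.
        return str(eval(match.group(1), {}, namespace))

    return pattern.sub(substitute, text)


print(expand("some_command_to_process {_input}", {"_input": "f1.fastq"}))
# some_command_to_process f1.fastq
print(expand("echo ${name} {untouched}", {"name": "f1"}, sigil="${ }"))
# echo f1 {untouched}
```

The second call shows why an alternative sigil is useful: braces that belong to the embedded script itself are left alone.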
SoS
From subkernels to SoS kernel
Workflow System
Subkernels
(possibly incomplete scripts)
Kernel
(complete scripts)
SoS
Embedded workflows in notebook
Workflow System
Kernel
(shared kernel namespace)
Workflow
(independent workflow namespace)
SoS
Parameters and runtime signatures
Workflow System
SoS
Process-oriented vs outcome-oriented workflows
Workflow System
Process-oriented:
• Numerically numbered steps of a “process”
• Execute sequentially (logically)
Outcome-oriented:
• Steps can provide targets for others
• Workflow constructed to generate specified targets (option -t)
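The two styles differ in how the execution order is obtained. A minimal sketch, assuming step names of the form name_index and a hypothetical providers table that maps each target to the step producing it and that step's inputs (neither function is SoS's scheduler, and cycles are not detected):

```python
from typing import Dict, List, Tuple


def process_order(step_names: List[str]) -> List[str]:
    """Process-oriented: run steps in the order of their numeric suffix."""
    return sorted(step_names, key=lambda name: int(name.rsplit("_", 1)[1]))


def plan(target: str, providers: Dict[str, Tuple[str, List[str]]]) -> List[str]:
    """Outcome-oriented: collect the steps needed to build target, inputs first."""
    order: List[str] = []

    def visit(wanted: str) -> None:
        if wanted not in providers:
            return  # assumed to be an existing file; no step needed
        step, inputs = providers[wanted]
        for dependency in inputs:
            visit(dependency)
        if step not in order:
            order.append(step)

    visit(target)
    return order


providers = {
    "f1.bam": ("align", ["f1.fastq"]),
    "f1.vcf": ("call", ["f1.bam"]),
}
print(process_order(["align_20", "align_10"]))  # ['align_10', 'align_20']
print(plan("f1.vcf", providers))                # ['align', 'call']
```

In the outcome-oriented case only the steps reachable from the requested target are scheduled, which is the behaviour the -t option asks for.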
SoS
Concurrent execution and external tasks
Workflow System
SoS
hosts.yml
SoS task model
Workflow System
Task on the local host (77e3c2ef7079a236.task):
  input: "c:\Project\f1.fastq"
  output: "c:\Project\f1.bam"
  sh: expand=True
    some_command_to_process {_input}
Task on the remote host (77e3c2ef7079a236.task):
  input: "/home/bpeng/Project/f1.fastq"
  output: "/home/bpeng/Project/f1.bam"
  sh: expand=True
    some_command_to_process {_input}
Input file: c:\Project\f1.fastq (local), /Project/f1.fastq (remote)
Job script on the remote host (77e3c2ef7079a236.sh):
  #PBS -N 77e3c2ef7079a236
  #PBS -l nodes=1:ppn=1:mem=10G
  #PBS -l walltime=24:00:00
  cd /home/bpeng1/Project
  sos execute 77e3c2ef7079a236
Output file: /Project/f1.bam (remote), c:\Project\f1.bam (local)
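The task on this slide appears twice with different paths because the same file lives under a different root on each host. A minimal sketch of that translation, assuming the local side is a Windows path such as c:\Project\f1.fastq and that the two project roots are known in advance (for example from a host configuration); map_path is a hypothetical helper, not an SoS function:

```python
from pathlib import PurePosixPath, PureWindowsPath


def map_path(local_path: str, local_root: str, remote_root: str) -> str:
    """Translate a path under local_root into the matching path under remote_root."""
    # relative_to raises ValueError if local_path is not inside local_root.
    relative = PureWindowsPath(local_path).relative_to(PureWindowsPath(local_root))
    return str(PurePosixPath(remote_root, *relative.parts))


print(map_path(r"c:\Project\f1.fastq", r"c:\Project", "/home/bpeng/Project"))
# /home/bpeng/Project/f1.fastq
```

Using the pure path classes means the translation works on any operating system, since no file system access takes place.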
SoS
Execute scripts in docker containers
Workflow System
SoS
DAG and workflow reports
Workflow System
SoS Summary
Polyglot Notebook + Workflow System = Working Environment
SoS
Our previous computational environment
Summary
SoS
Our new computational environment
Summary
SoS
SoS notebooks for reproducible data analysis
Summary
Polyglot Notebook
• Multi-language data analysis with data exchange
• Side panel and magics for interactive data analysis
+
Workflow System
• Powerful Python-based multi-style workflow system
• Remote execution of external tasks
=
Working Environment
• Environment for both interactive data analysis and batch data analysis
• Reproducible notebooks
SoS
SoS Status
Summary
https://vatlab.github.io/SoS
https://github.com/vatlab
https://vatlab.github.io/blog
bpeng@mdanderson.org
ScriptOfScripts
[Logo groups: Browser, Languages, OS, Jupyter, Container, Task queue, License]
sos 0.16.9
sos-notebook 0.16.10
jupyterlab-sos 0.2.4
SoS
Acknowledgements
Summary
• Gao Wang (U Chicago)
• Jun Ma
• Man Chong Leong
• Chris Wakefield
• James Melott
• Yulun Chiu
• Di Du
• Dr. John Weinstein
• Dr. Christopher Amos (BCM)
• Dr. Paul Scheet
• Dr. Suzanne Leal (BCM)
• Grant R01HG008972
• Grant 1R01HG005859 (Dr. Paul Scheet)
• CPRIT RP130397
• Gordon and Betty Moore Foundation (#4559)
• The Michael and Susan Dell Foundation
• The Chapman Foundation
SoS Summary
https://vatlab.github.io/sos/live


Editor's notes

  1. My answers to all these questions are yes.
  2. We are MD Anderson Cancer Center, one of the largest and best cancer hospitals in the world, with one of the largest bioinformatics departments in the nation. We have 15 faculty who have made major contributions to many national and international projects such as TCGA and ICGC. We have a large team of statistical analysts with 20 PhDs (or double MS) who have worked on almost 400 projects for more than 100 Principal Investigators at MD Anderson. Basically, we deal with a lot of data.
  3. Data usually come from our labs. Bioinformaticians need to use many different tools in many languages.
  4. This is JupyterCon, so I will save the time.
  5. Compared to RStudio: line-by-line execution in a console window; variable inspector; preview of variables, figures, etc. Jupyter supports only one kernel in a notebook, so multiple notebooks are needed. BeakerX does not support MATLAB and SAS. A workflow system is needed for batch data processing. It is like Usain Bolt competing with Michael Phelps at swimming. Different environments are counterproductive.
  6. Start at 8
  7. Three ways but all based on the first magic
  8. Start at 18
  9. Explain what this workflow does
  10. Start at 36
  11. SoS has really changed the way we work, and it should work wonders for you! Please test it and let us know what you think.