6. High-Throughput Data
❖ Big Data
❖ File sizes are small, but there are many files
❖ File sizes are large, but there are only a few files
❖ Data size in bioinformatics
❖ 1,000,000,000 records for a single subject (person) is normal
8. Data Size of Sequencing After 5 Years
https://www.nanoporetech.com
70,000 newborn babies × 500 GB = 3.5 × 10^7 GB = 35 PB
30,000 patients × 10,000 cells × 500 GB = 1.5 × 10^11 GB = 150 EB
from Dr. Yu-Tai Wang
1. Counted from current NGS data
2. Does not include civil medical institutes
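The arithmetic behind these estimates can be checked with a short sketch. It assumes decimal (SI) units, where 1 TB = 1,000 GB, 1 PB = 10^6 GB and 1 EB = 10^9 GB; the helper name `humanize_gb` is made up for this illustration.

```python
# Decimal (SI) storage units expressed in gigabytes (assumption: SI, not binary, units).
GB_PER_UNIT = {"GB": 1, "TB": 10**3, "PB": 10**6, "EB": 10**9}

def humanize_gb(gigabytes):
    """Express a size given in GB using the largest unit that keeps the value >= 1."""
    unit = "GB"
    for name, factor in GB_PER_UNIT.items():
        if gigabytes >= factor:
            unit = name
    return gigabytes / GB_PER_UNIT[unit], unit

PER_SAMPLE_GB = 500  # per-sample size quoted on the slide

newborn_gb = 70_000 * PER_SAMPLE_GB            # one sample per newborn
patient_gb = 30_000 * 10_000 * PER_SAMPLE_GB   # 10,000 cells per patient

print(humanize_gb(newborn_gb))   # (35.0, 'PB')
print(humanize_gb(patient_gb))   # (150.0, 'EB')
```

With these units, 70,000 × 500 GB is 35,000,000 GB, which is 35 PB, and the patient estimate is 150 EB.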
9. Computing Power is Required
❖ HPC
❖ InfiniBand cluster
❖ Amazon EC2 cluster
❖ Hadoop cluster
❖ Many CPU cores
❖ Large memory
❖ High I/O efficiency
http://arstechnica.com/business/2012/05/amazons-hpc-cloud-supercomputing-for-the-99/
14. Data Scientist Concerns
❖ Data quality
❖ Filtering factors
❖ Statistics
❖ Visualization
❖ Interpretation
15. Programmers Also Care About
❖ High-throughput data (Big Data) handling
❖ Data format / file format
❖ Data parsing
❖ Statistical tools
❖ Visualization
❖ Profit / markets
69. Ensembl Virtual Machine
❖ Powered by Veewee, Vagrant and Chef
❖ Automatically builds a versioned Ensembl system (Perl)
❖ Includes database, queuing services and analysis tools
❖ Multiple sites and multiple species in one virtual machine
❖ Helps build local and custom systems
from Tse-Ching Ho
70. Ensembl Virtual Machine
Diagram: Ensembl VM Automation
❖ Use existing Vagrant box
❖ Prepare SOP for Chef recipes
❖ Provision VM with Chef recipes
❖ Write Chef recipes
❖ Export VM by VirtualBox
❖ Set up Vagrantfile
❖ Create Vagrant box by Veewee
❖ Write definition of Vagrant box by Veewee
from Tse-Ching Ho
72. DR. RAW
❖ Derived from DRAW and SneakPeek
❖ Written in C/C++, Bash, Perl, Java and Ruby
❖ Supports both DNA and RNA resequencing analysis
❖ Enhanced quality control for DNA and RNA
❖ Distributed computing pipeline
❖ Supports the PBS, LSF and SGE queuing systems
from Hannah Lin
73. DR. RAW
Diagram: DR. RAW components
❖ Quality Control: DNA QC (Forward : Reverse); RNA QC (Forward : Reverse)
❖ Analysis Tools: BWA-0.7.7, Samtools-0.1.19, GATK-3.1, GSNAP-13-10-25, Cufflink-13-11, FusionGene …
❖ Analysis Pipeline: DNA sequencing data; RNA sequencing data
❖ Resource Manager System: SGE (Sun Grid Engine), PBS (Portable Batch System), LSF (Load Sharing Facility)
Legend: green = new components; red = updated components
from Hannah Lin
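Supporting three queuing systems usually means hiding their different submit commands behind one interface. The sketch below is not DR. RAW code; it only illustrates the idea. The function name `submit_command` is hypothetical, the flags are the commonly used ones for `qsub` and `bsub`, and the SGE parallel environment name `smp` is a site-specific assumption.

```python
import shlex

def submit_command(scheduler, job_name, script, cores=1):
    """Return the argv list that would submit `script` to the given scheduler."""
    if scheduler == "sge":
        # "-pe smp N" requests N slots; the environment name "smp" varies per site.
        return ["qsub", "-N", job_name, "-pe", "smp", str(cores), script]
    if scheduler == "pbs":
        # "-l nodes=1:ppn=N" requests N processors on a single node.
        return ["qsub", "-N", job_name, "-l", f"nodes=1:ppn={cores}", script]
    if scheduler == "lsf":
        # "-n N" requests N job slots; the script path is given as the command to run.
        return ["bsub", "-J", job_name, "-n", str(cores), script]
    raise ValueError(f"unknown scheduler: {scheduler}")

# Show the command line each scheduler would receive for the same job.
for name in ("sge", "pbs", "lsf"):
    print(shlex.join(submit_command(name, "align", "align.sh", cores=4)))
```

The commands are only built and printed here, not executed, so the sketch runs on a machine without any scheduler installed.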
75. Neo4j - JRuby Data Parser
❖ Graph database for integrating discrete clinical research documents
❖ Source data are Excel/CSV files collected at different times by different people
❖ Neo4j is well suited to cleaning up such a massive data set
❖ Cooperation between biologists and programmers
from Wei-Ming Wu, Chia-Hsuan Lee
76. Neo4j - JRuby Data Parser
from Wei-Ming Wu, Chia-Hsuan Lee
77. Neo4j - JRuby Data Parser
Collision Rate of Input Data: 1.3%
from Wei-Ming Wu, Chia-Hsuan Lee
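The slide does not define how the collision rate was measured. As one possible reading, the sketch below counts a collision whenever the same key appears with conflicting values across the merged input rows. That definition, the function name `collision_rate` and the sample rows are all assumptions made for illustration.

```python
from collections import defaultdict

def collision_rate(rows, key_field):
    """Fraction of distinct keys that appear with more than one distinct set of values."""
    versions = defaultdict(set)
    for row in rows:
        # Freeze each row so differing copies of the same key can be told apart.
        versions[row[key_field]].add(tuple(sorted(row.items())))
    if not versions:
        return 0.0
    collided = sum(1 for seen in versions.values() if len(seen) > 1)
    return collided / len(versions)

# Hypothetical rows, as if merged from several spreadsheets.
rows = [
    {"patient_id": "P001", "sex": "F"},
    {"patient_id": "P001", "sex": "F"},   # exact duplicate: not a collision
    {"patient_id": "P002", "sex": "M"},
    {"patient_id": "P002", "sex": "F"},   # same key, conflicting value
    {"patient_id": "P003", "sex": "M"},
    {"patient_id": "P004", "sex": "F"},
]
print(collision_rate(rows, "patient_id"))  # 0.25
```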
78. API Server for Third-Party Firm
❖ API server built on Rails, running on JRuby
❖ ActiveRecord models for an Oracle database
❖ activerecord-oracle_enhanced-adapter gem
❖ Import Excel files into the third-party GUI client
❖ Third-party server sends XML requests to the API server
from Wei-Ming Wu, Sean Wang
79. API Server for Third-Party Firm
Diagram: data flow between the third party and our servers
❖ Third Party: Windows GUI; TCHC server
❖ Our Servers: API server (Rails, JRuby); CSIS (Java, Oracle)
❖ Arrow labels: Upload data; Send data by XML; Parse request; Write into database; Read data by client program
from Wei-Ming Wu, Sean Wang
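The "Parse request" step can be illustrated with a minimal sketch. The real server is Rails on JRuby; this example uses Python's standard library only to show the shape of the step. The XML layout, the element and attribute names and the function `parse_request` are hypothetical, since the slides do not show the actual request format.

```python
import xml.etree.ElementTree as ET

# Hypothetical request body; the real schema is not given in the slides.
REQUEST = """<request>
  <project>demo</project>
  <record patient_id="P001">
    <field name="weight">61.5</field>
    <field name="height">170</field>
  </record>
</request>"""

def parse_request(xml_text):
    """Turn an XML request into a project name and a list of record dicts."""
    root = ET.fromstring(xml_text)
    project = root.findtext("project")
    records = []
    for record in root.findall("record"):
        # Collect each <field name="...">value</field> pair into a dict.
        values = {f.get("name"): f.text for f in record.findall("field")}
        records.append({"patient_id": record.get("patient_id"), "values": values})
    return project, records

project, records = parse_request(REQUEST)
print(project, records)
```

After parsing, each record dict would be handed to the database layer for the "Write into database" step.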
80. Daily Checking Rule
❖ Built on Rails, running on JRuby
❖ ActiveRecord models for an Oracle database
❖ activerecord-oracle_enhanced-adapter gem
❖ Users can define rules for checking data, usually the values entered in forms
❖ Checking rules run daily, rather than when the forms are filled in
from Wei-Ming Wu, Sean Wang
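A user-defined checking rule can be thought of as a field name, a message and a predicate, applied to every stored record in a daily batch. The sketch below shows that idea in Python rather than the Rails stack named above. The rule format, field names and sample records are assumptions for illustration only.

```python
# Each rule: (field name, message shown on failure, predicate returning True when valid).
RULES = [
    ("weight_kg", "must be between 1 and 400", lambda v: v is not None and 1 <= v <= 400),
    ("visit_date", "must not be empty", lambda v: bool(v)),
]

def run_daily_check(records, rules=RULES):
    """Apply every rule to every record and collect the violations."""
    violations = []
    for record in records:
        for field, message, is_valid in rules:
            if not is_valid(record.get(field)):
                violations.append((record["id"], field, message))
    return violations

# Hypothetical form records already saved in the database.
records = [
    {"id": 1, "weight_kg": 61.5, "visit_date": "2014-01-15"},
    {"id": 2, "weight_kg": 6150, "visit_date": ""},
]
for violation in run_daily_check(records):
    print(violation)
```

Running the checks as a batch, as the slide describes, lets data entry proceed unblocked while problems are reported afterwards.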
85. Patient Randomization
❖ Built on Rails, running on JRuby
❖ ActiveRecord models for an Oracle database
❖ activerecord-oracle_enhanced-adapter gem
❖ Assigns patients to different groups by a randomization method
❖ Cooperation between statisticians and programmers
from Wei-Ming Wu, Sean Wang
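The slides do not say which randomization method the system uses. As one common example, the sketch below implements permuted block randomization, which keeps group sizes balanced after every complete block. The function name, group labels and block size are assumptions for illustration.

```python
import random

def block_randomize(patient_ids, groups=("A", "B"), block_size=4, seed=None):
    """Assign patients to groups using permuted blocks of `block_size`."""
    if block_size % len(groups) != 0:
        raise ValueError("block_size must be a multiple of the number of groups")
    rng = random.Random(seed)  # a fixed seed makes the allocation reproducible
    assignment = {}
    block = []
    for pid in patient_ids:
        if not block:
            # Start a new block holding each group equally often, in random order.
            block = list(groups) * (block_size // len(groups))
            rng.shuffle(block)
        assignment[pid] = block.pop()
    return assignment

patients = [f"P{i:03d}" for i in range(1, 9)]
for pid, group in block_randomize(patients, seed=42).items():
    print(pid, group)
```

With eight patients and a block size of four, each group receives exactly four patients, whatever the seed.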
89. Database Statistics Dashboard
❖ Built on Rails, running on JRuby
❖ ActiveRecord models for an Oracle database
❖ activerecord-oracle_enhanced-adapter gem
❖ google_visualr gem for visualization
❖ Counts the number of projects, forms, fields, records and patients
from Wei-Ming Wu, Winnie Lui
101. Topics to Act On
❖ Data generation and data management
❖ Data analysis and software
❖ Data processing and storage
❖ Application of bioinformatics in pharma research and development
http://www.giichinese.com.tw/report/bc268909-bioinformatics-technologies-global-markets.html
102. Health Care in Cloud
❖ Health promotion cloud
❖ Vaccination cloud
❖ Exercise cloud
❖ Workplace wellness
❖ Physical checkup cloud
❖ Welfare cloud
from Dr. Chi-Hung Lin