Enabling Discoveries at High Throughput
Small molecule and RNAi HTS at the NCTT

Rajarshi Guha
NIH Center for Translational Therapeutics

May 3, 2011
  
Outline

•  Informatics for small molecule & RNAi screening
•  HCA & automated decision making
    –  Pretty pictures can lead to more efficient screens
•  Large scale cheminformatics
    –  We can do it, but do we need to?
  
NIH Chemical Genomics Center
•  Founded 2004 as part of NIH Roadmap Molecular Libraries Initiative
    –  NCGC staffed with 90+ scientists: biologists, chemists, informaticians, engineers
    –  Post-doc program
•  Mission
    –  MLPCN (screening & chemical synthesis; compound repository; PubChem database; funding for assay, library and technology development)
    –  Develop new chemical probes for basic research and leads for therapeutic development, particularly for rare/neglected diseases
    –  New paradigms & applications of HTS for chemical biology / chemical genomics
•  All NCGC projects are collaborations with a target or disease expert; currently >200 collaborations with investigators worldwide
  	
  
Project Diversity
(A) Disease areas   (B) Target types   (C) Detection methods
Assay formats & detection methods in HTS

Assay formats
•  ligand binding
    –  competition binding
•  enzymatic activity
    –  biochemical
    –  cellular
•  ion or ligand transport
    –  ion-sensitive dyes
    –  membrane potential dyes
•  protein-protein interactions
    –  biochemical
    –  cellular
•  cellular signal transduction
    –  reporter gene
    –  second messenger
•  phenotypic
    –  protein redistribution
    –  cell viability
    –  etc.

Detection modes
•  luminescence
    –  chemiluminescence
    –  bioluminescence
    –  BRET
    –  ALPHA
•  fluorescence
    –  FI
    –  FRET
    –  TRF
    –  TR-FRET
    –  FP
    –  FCS
    –  FLT
•  absorbance
•  radioactivity
    –  SPA
  
Detector Systems: “Reading the assay”

•  ViewLux
    –  Multimodal CCD-based imager
        •  Abs., Luminescence, Fluorescence
•  Envision
    –  PMT-based reader
        •  ALPHA
•  Acumen Explorer
    –  Laser Scanning Imager
        •  “static” cell cytometry
•  Hamamatsu FDSS 7000 Series
    –  rapid kinetics
•  INCell1000
    –  Subcellular imaging
  
qHTS: High Throughput Dose Response

A  Assay concentration ranges over 4 logs (high: ~100 μM)
   1536-well plates, inter-plate dilution series
   Assay volumes 2 – 5 μL
B  Automated concentration-response data collection
   ~1 CRC/sec
C  Informatics pipeline. Automated curve fitting and classification. 300K samples
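
The slide mentions automated curve fitting but does not say which model is fitted. As a minimal sketch, assuming the common four-parameter Hill model for a concentration-response curve (CRC), the code below evaluates the model and recovers log AC50 by a simple grid search with the other parameters held fixed. The class name `HillCurve`, the parameter names, and the grid-search strategy are illustrative assumptions, not the NCGC pipeline.

```java
final class HillCurve {
    /** Response of a four-parameter Hill model at log10 concentration logC. */
    static double response(double logC, double bottom, double top,
                           double logAC50, double slope) {
        return bottom + (top - bottom)
                / (1.0 + Math.pow(10.0, (logAC50 - logC) * slope));
    }

    /** Sum of squared residuals between observed responses and the model. */
    static double sse(double[] logC, double[] observed, double bottom,
                      double top, double logAC50, double slope) {
        double total = 0.0;
        for (int i = 0; i < logC.length; i++) {
            double r = observed[i]
                    - response(logC[i], bottom, top, logAC50, slope);
            total += r * r;
        }
        return total;
    }

    /** Grid search for logAC50; bottom, top and slope are held fixed (a simplification). */
    static double fitLogAC50(double[] logC, double[] observed, double bottom,
                             double top, double slope,
                             double lo, double hi, int steps) {
        double best = lo;
        double bestSse = Double.POSITIVE_INFINITY;
        for (int i = 0; i <= steps; i++) {
            double candidate = lo + i * (hi - lo) / steps;
            double s = sse(logC, observed, bottom, top, candidate, slope);
            if (s < bestSse) {
                bestSse = s;
                best = candidate;
            }
        }
        return best;
    }
}
```

A production fitter would optimize all four parameters at once and then assign a curve class; the grid search only keeps the sketch short and dependency-free.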
Informatics Activities

•  High throughput curve fitting
•  Data integration, automated cherry picking
•  SAR algorithms
    –  QSAR modeling
    –  Fragment based analysis
    –  Activity cliffs
•  Tools: standardizer, tautomers, fragment activity browser, kinome browser and more
•  RNAi hit selection, OTE analysis
•  High content analysis
  
Kinome Navigator

•  Browse kinase panel data
•  Currently focused on the Abbott dataset
•  View
    •  Fragments
    •  Target pairs
    •  Kinome overlay

http://tripod.nih.gov
  
Fragment Browser

•  View activities on a fragment-wise basis
•  Compare activity distributions by fragment
•  Currently based around ChEMBL assays but users can browse their own compounds & activities

http://tripod.nih.gov
  
Structure Activity Landscapes

•  Rugged gorges or rolling hills?
    –  Small structural changes associated with large activity changes represent steep slopes in the landscape
    –  But traditionally, QSAR assumes gentle slopes
    –  We can characterize the landscape using SALI

Maggiora, G.M., J. Chem. Inf. Model., 2006, 46, 1535–1535
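
The slide names SALI without spelling it out. For a pair of molecules i and j, the Structure Activity Landscape Index is |A_i − A_j| / (1 − sim(i, j)), so a large activity difference between very similar molecules gives a large value (a cliff). The sketch below assumes binary fingerprints held in `BitSet`s and Tanimoto similarity; the class name `Sali` and the choice of fingerprint are illustrative assumptions.

```java
import java.util.BitSet;

final class Sali {
    /** Tanimoto similarity of two binary fingerprints (1.0 if both are empty). */
    static double tanimoto(BitSet a, BitSet b) {
        BitSet common = (BitSet) a.clone();
        common.and(b);
        BitSet either = (BitSet) a.clone();
        either.or(b);
        int union = either.cardinality();
        return union == 0 ? 1.0 : (double) common.cardinality() / union;
    }

    /**
     * SALI for one pair: |A_i - A_j| / (1 - sim).
     * When sim == 1.0 the result is Infinity (or NaN if the activities
     * are also equal), so callers should treat such pairs separately.
     */
    static double sali(double activityI, double activityJ, double similarity) {
        return Math.abs(activityI - activityJ) / (1.0 - similarity);
    }
}
```

Computing this over all pairs yields a matrix; thresholding it is one way to pick out the pairs that behave as cliffs.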
  
What Can We Do With SALI’s?

•  SALI characterizes cliffs & non-cliffs
•  For a given molecular representation, SALI’s give us an idea of the smoothness of the SAR landscape
•  Models try to encode this landscape
•  Use the landscape to guide descriptor or model selection

Guha, R.; Van Drie, J.H., J. Chem. Inf. Model., 2008, 48, 646–658
  
Predicting the Landscape

•  Rather than predicting activity directly, we can try to predict the SAR landscape
•  Implies that we attempt to directly predict cliffs
    –  Observations are now pairs of molecules

Original pIC50: RMSE = 0.97     SALI, AbsDiff: RMSE = 1.10     SALI, GeoMean: RMSE = 1.04

Scheiber et al, Statistical Analysis and Data Mining, 2009, 2, 115-122
  
Data Integration

•  It’s nice to simplify data, but we can still be faced with a multitude of data types
•  We want to explore these data in a linked fashion
•  How we explore and what we explore is generally influenced by the task at hand
•  At some point, make inferences over all the data
  
Data Integration

User’s Network     Network of Public Data

Content:
    - Drugs
    - Compounds
    - Scaffolds
    - Assays
    - Genes
    - Targets
    - Pathways
    - Diseases
    - Clinical Trials
    - Documents

Links:
    - Manually curated
    - Derived from algorithms
  
Record View of an Assay
Access Disease Hierarchy & Network
Articles, Patents, Drug Labels, …
NPC Browser

http://tripod.nih.gov/npc/
  
Going Beyond Exploration?

•  Simply being able to explore data in an integrated manner is useful as an idea generator
•  Can we integrate heterogeneous data types & sources to get a systems level view?
    –  Current research problem in genomics and systems biology
    –  Some attempts have been made to merge chemical data with other data types

Young, D.W. et al, Nat. Chem. Biol., 2008, 4, 59-68
  
RNAi Facility Mission

•  Perform collaborative genome-wide RNAi screening-based projects with intramural investigators
•  Advance the science of RNAi and miRNA screening and informatics via technology development to improve efficiency, reliability, and costs.

Range of Assays
    Simple Phenotypes (Viability, cytotoxicity, oxidative stress, etc)
    Pathway (Reporter assays, e.g. luciferase, β-lactamase)
    Complex Phenotypes (High-content imaging, cell cycle, translocation, etc)
RNAi Effectors

RNAi effectors provide an excellent way to conduct gene-specific loss of function studies.
Issues Using RNAi Effectors

•  RNAi effectors give a knockdown, not a knockout (70% - 80% is considered good). Therefore, they may not silence the target enough to give a phenotype even if the target is involved in what you are assaying for.
•  RNAi effectors induce off-target effects!
Examples of Current Projects

•  Protein Quality Control
•  DNA Re-replication
•  Base Excision Repair
•  DNA Damage – ELG1 stabilization
•  Antioxidant Response
•  Hypoxia
•  TNFa Response
•  Interferon Response
•  iPS to RPE
•  Poxvirus
•  Respiratory Viruses
•  Lysosomal Storage Disorders
•  Parkinsons – Mitochondrial Quality Control
•  Ewings Sarcoma
•  Drug Modifiers, Pancreatic Cancer
•  Drug Modifiers, TOP1 Clinical Agents
•  Immunotoxin-Mediated Cell Death
  
User Accessible Tools

RNAi Libraries

•  Ambion Human Genome-Wide Library, 21,585 genes, 3 unique siRNAs per gene.
•  Ambion Mouse Genome-Wide Library, 17,582 genes, 3 unique siRNAs per gene.
•  Dharmacon Human Duet Genome-Wide siRNA Libraries, 18,236 genes, siRNA pools.
•  Human and Mouse miRNA Mimic Libraries & Human miRNA Inhibitor Library
•  Qiagen Human Druggable Genome Library, > 7,000 genes, 4 unique siRNAs per gene.
•  Kinome Libraries, purchased from a number of vendors.

•  Smaller libraries (e.g. kinome and miRNA mimics) will enable high-impact screens in systems less amenable to high throughput applications.
•  Additional species and shRNA resources are being considered.
Druggable Genome Screening Campaign

•  Over 7,000 genes, 4 unique siRNAs per gene (≈36,000 wells).
•  85 genes were selected for follow-up through a variety of threshold-based selection schemes.
•  27 genes were validated as confident hits using siRNAs from multiple vendors.

Figure: Pseudo-colored Blue/Green Ratio (normalized to plate median)
Figure: Significant enrichment for core NF-kB components. Percent Reduction in NF-kB Signal, shown as Average Inhibition (%) for Qiagen and Ambion siRNAs against TNFα Receptor, IKKα, RELA and NEMO
Druggable Genome Screening Campaign

Significant enrichment for proteins that form the 26S proteasome

Figure: Percent Reduction in NF-kB Signal, shown as Average Inhibition (%) for Qiagen and Ambion siRNAs against PSM genes A1–A7, B2–B4, C4, C5, D2, D7 and D14, grouped by PSM protein (α core 20S, β core 20S, RPT 19S, RPN 19S)
Figure: Proteasome schematic (19S regulator particle: RPN, RPT; 20S proteasome: α1-7, β1-7). Murata et al, Nature Reviews Mol. Cell Biol.

An additional 34 genes remain inconclusive but noteworthy hits that require further study.
Some of these tie into the core NF-kB pathway
Seed Sequence Analysis

Other instances of the seeds incorporated within siRNAs targeting PSMA3 do not
exhibit significant activity, adding to the likelihood of this being an on-target effect.

Seed Sequence Analysis

Other instances of the seeds within the active siRNAs targeting SLC24A1 tend to
downregulate NF-kB reporter, adding to the likelihood of this being an off-target effect.
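
The reasoning on these two slides can be sketched in code: take the seed of an active siRNA, find other siRNAs in the library that carry the same seed but target a different gene, and check whether they are active too. The sketch assumes the seed is positions 2 to 8 of the guide strand (some analyses use 2 to 7) and that activity is a single number per siRNA. The class `SeedAnalysis`, its fields, and any sequences passed to it are hypothetical.

```java
import java.util.List;

final class SeedAnalysis {
    /** One siRNA: target gene, guide-strand sequence (5' to 3'), measured activity. */
    static final class Sirna {
        final String gene;
        final String guide;
        final double activity;

        Sirna(String gene, String guide, double activity) {
            this.gene = gene;
            this.guide = guide;
            this.activity = activity;
        }
    }

    /** Heptamer seed, assumed here to be guide positions 2-8 (1-based). */
    static String seed(String guide) {
        return guide.substring(1, 8).toUpperCase();
    }

    /**
     * Mean activity of library siRNAs that share the query's seed but target
     * a different gene; NaN when no such siRNA exists. A high mean hints at
     * a seed-driven (off-target) effect, a low mean at an on-target effect.
     */
    static double meanActivityOfSeedMates(Sirna query, List<Sirna> library) {
        String querySeed = seed(query.guide);
        double sum = 0.0;
        int n = 0;
        for (Sirna other : library) {
            if (!other.gene.equals(query.gene)
                    && seed(other.guide).equals(querySeed)) {
                sum += other.activity;
                n++;
            }
        }
        return n == 0 ? Double.NaN : sum / n;
    }
}
```

A real analysis would compare the seed-mate activities against the library-wide distribution with a statistical test rather than looking at a bare mean.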
RNAi & Small Molecule Screens

•  Reuse pre-existing MLI data
•  Develop new annotated libraries
•  Run parallel RNAi screen

             CAGCATGAGTACTACAGGCCA
             TACGGGAACTACCATAATTTA

What targets mediate activity of siRNA and compound
Pathway elucidation, identification of interactions
Target ID and validation
Link RNAi generated pathway perturbations to small molecule activities. Could provide insight into polypharmacology

Goal: Develop systems level view of small molecule activity
  
Matching Phenotypes

RNAi
Small Molecule
  
Merging Screening Technologies

High throughput screening
•  Lead identification
•  Single (few) read outs
•  High-throughput
•  Moderate data volumes

High content screening
•  Phenotypic profiling
•  Multiple parameters
•  Moderate throughput
•  Very large data volumes

•  We’d like to combine the technologies, to obtain rich high-resolution data at high speed
•  Is this feasible? What are the trade-offs?
  
Merging Screening Technologies

•  A simple solution is to run HTS & HCS as separate, primary & secondary screens
•  Alternatively: Wells to Cells
    –  Integrate HTS & HCS in a single screen using a combined platform for robotics & real time automated HTS analytics
    –  Selective imaging of interesting wells
  
Wells to Cells Workflow

•  Sequential qHTS using laser scanning cytometry followed by high-res microscopy
•  Unit of work is a plate series
•  The same aliquot is analyzed by both techniques
•  A message based system
•  The key is deciding which wells go through the workflow
  
Wells to Cells Assays

•  Cell cycle, cell translocation, DNA re-replication
•  All assays run against LOPAC1280
•  Consistency between cytometry & microscopy is measured by the R² between log AC50’s
    –  Cell cycle, 0.94 – 0.96
    –  Cell translocation, 0.66 – 0.94
    –  DNA re-replication, still in progress
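
The slide does not say how the R² is computed. A minimal sketch, assuming it is the squared Pearson correlation between the log10 AC50 values that the two instruments report for the same compounds; the class name `Consistency` and the input layout (two parallel arrays) are assumptions.

```java
final class Consistency {
    /** Element-wise log10, e.g. to convert AC50 values to log AC50. */
    static double[] log10(double[] values) {
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = Math.log10(values[i]);
        }
        return out;
    }

    /** Squared Pearson correlation between two equally long series. */
    static double rSquared(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two series of equal length >= 2");
        }
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < x.length; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= x.length;
        meanY /= y.length;

        double sxy = 0.0;
        double sxx = 0.0;
        double syy = 0.0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return (sxy * sxy) / (sxx * syy);
    }
}
```

Working on log AC50 rather than raw AC50 keeps a few very weak compounds from dominating the correlation.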
  	
  
Cell Translocation Example Hits

Informatics Platform

InCell Layout File

•  Advanced correction and normalization methods
•  Sophisticated curve fitting algorithm
•  Good performance, allows parallelization of the entire workflow
  
Why Messaging?

•  A messaging architecture allows for significant flexibility
    –  Persistent, can be kept for process tracking, reporting
    –  Asynchronous, allows individual components of the workflow to proceed at their own pace
    –  Modular, new components can be introduced at any time without redesigning the whole workflow
•  We employ Oracle AQ, but any message queue can be employed
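
The asynchronous hand-off described above can be sketched with an in-memory queue. This is not Oracle AQ and it is not persistent: `java.util.concurrent.BlockingQueue` stands in for the message queue, the "HTS analytics" side posts selected wells, and a separate "imaging" thread consumes them at its own pace. The class name `WellQueueDemo`, the message format, and the stop marker are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

final class WellQueueDemo {
    /** Marker message telling the consumer to stop (hypothetical convention). */
    static final String STOP = "__STOP__";

    /** Posts each selected well to a queue; a consumer thread "images" them. */
    static List<String> run(List<String> selectedWells) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        List<String> imaged = Collections.synchronizedList(new ArrayList<String>());

        // Consumer: stands in for the microscope controller.
        Thread imager = new Thread(() -> {
            try {
                String msg = queue.take();
                while (!msg.equals(STOP)) {
                    imaged.add("imaged:" + msg);
                    msg = queue.take();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        imager.start();

        try {
            // Producer: stands in for the real-time HTS analytics.
            for (String well : selectedWells) {
                queue.put(well);
            }
            queue.put(STOP);
            imager.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new ArrayList<>(imaged);
    }
}
```

Because producer and consumer only share the queue, another consumer (say, a reporting component) could be attached without touching either side, which is the modularity the slide points to.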
  
Handling Multiple Platforms

•  Current examples employ InCell hardware
•  We also use Molecular Devices hardware
•  As a result we have two orthogonal image stores / databases
•  Need to integrate them
    –  Support seamless data browsing across multiple screens irrespective of imaging platform used
    –  Support analytics external to vendor code
  
A Unified Interface

•  A client sees a single, simple interface to screening image data
           http://host/rest/protocol/plate/well/image
•  Transparently extract image data via the MetaXpress database or via custom code
•  Currently the interface addresses image serving
•  Unified metadata interface in the works
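
One way to picture the single interface: parse the path `/rest/protocol/plate/well/image` into its parts, then hand the request to whichever backend holds that screen's images. The sketch below is an assumption about the shape of such a layer, not the actual service; `ImageRouter`, `ImageSource`, and the idea of registering a backend per protocol are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class ImageRouter {
    /** Parsed form of /rest/{protocol}/{plate}/{well}/{image}. */
    static final class ImageRequest {
        final String protocol;
        final String plate;
        final String well;
        final String image;

        ImageRequest(String protocol, String plate, String well, String image) {
            this.protocol = protocol;
            this.plate = plate;
            this.well = well;
            this.image = image;
        }
    }

    /** A backend able to return image bytes (vendor database, custom code, ...). */
    interface ImageSource {
        byte[] fetch(ImageRequest request);
    }

    private final Map<String, ImageSource> sourcesByProtocol = new HashMap<>();

    /** Associates a screening protocol with the backend holding its images. */
    void register(String protocol, ImageSource source) {
        sourcesByProtocol.put(protocol, source);
    }

    /** Splits the path and validates its fixed "/rest" prefix and length. */
    static ImageRequest parse(String path) {
        String[] parts = path.split("/");
        // Expected parts: "", "rest", protocol, plate, well, image
        if (parts.length != 6 || !parts[0].isEmpty() || !parts[1].equals("rest")) {
            throw new IllegalArgumentException("unexpected path: " + path);
        }
        return new ImageRequest(parts[2], parts[3], parts[4], parts[5]);
    }

    /** Routes a request to its backend; the client never sees which one. */
    byte[] serve(String path) {
        ImageRequest request = parse(path);
        ImageSource source = sourcesByProtocol.get(request.protocol);
        if (source == null) {
            throw new IllegalArgumentException("unknown protocol: " + request.protocol);
        }
        return source.fetch(request);
    }
}
```

A client only ever builds the path; whether the bytes come from a vendor database or from custom code is decided behind `serve`.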
  
Trade-offs & Opportunities

•  Automation reduces the ability to handle unforeseen errors
    –  Dispense errors and other plate problems
    –  Well selection based on curve classes may need to be modified on the fly
•  Well selection does not consider SAR
    –  Wells are selected independently of each other
    –  If we could model SAR on the fly (or from validation screens), we’d select multiple wells, to obtain positive and negative results
  
Cloud Computing & Cheminformatics

•  Cloud computing is a hot topic
•  A number of examples of computational chemistry / cheminformatics on the cloud
    –  MolPlex, hBar, Numerate, Wingu, Scilligence, Pfizer
•  Many examples use the cloud for remote storage or remote (hosted) computations
•  But providers such as Amazon allow us to run distributed computing applications on the cloud
  
Map/Reduce

•  Map/Reduce is a programming model for efficient distributed computing
•  M/R made “famous” by Google, but the idea has been around for a long time
•  It works like a Unix pipeline:
    –  cat input | grep | sort           | uniq -c | cat > output
    –  Input     | Map  | Shuffle & Sort | Reduce  | Output
•  Efficiency from
    –  Streaming through data, reducing seeks
    –  Pipelining

Owen O’Malley, http://bit.ly/ecHPvB
  
Map/Reduce

Owen O’Malley, http://bit.ly/ecHPvB
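
The pipeline analogy can be made concrete with a single-process sketch that mirrors `grep | sort | uniq -c`. Nothing here is distributed and nothing uses Hadoop: the three phases are ordinary methods, and the class name `MiniMapReduce` is hypothetical. The map step keeps lines containing a pattern and emits (line, 1); shuffle groups the values by key in sorted order; reduce sums each group.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

final class MiniMapReduce {
    /** Map ("grep"): emit (line, 1) for every line that contains the pattern. */
    static List<Map.Entry<String, Integer>> map(List<String> lines, String pattern) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            if (line.contains(pattern)) {
                pairs.add(new SimpleEntry<String, Integer>(line, 1));
            }
        }
        return pairs;
    }

    /** Shuffle & sort: group emitted values by key, keys in sorted order. */
    static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<Integer>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce ("uniq -c"): sum the values collected for each key. */
    static SortedMap<String, Integer> reduce(SortedMap<String, List<Integer>> grouped) {
        SortedMap<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    /** Runs the three phases in order on an in-memory list of lines. */
    static SortedMap<String, Integer> run(List<String> lines, String pattern) {
        return reduce(shuffle(map(lines, pattern)));
    }
}
```

In a real framework the map and reduce calls run on different machines and the shuffle moves data over the network; the shape of the computation is the same.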
  
Hadoop & Cheminformatics

•  Hadoop is an Open Source implementation of the map/reduce paradigm
•  Hadoop is a framework for scalable, distributed computing
    –  Hadoop, HDFS, Hive, Pig
•  Importantly, you can play with all this on your laptop and just copy files to the big cluster when you’re ready for production
  
Why Hadoop?

•  Simple way to make use of large clusters without MPI etc
•  AWS supports Hadoop, so easy to scale up to 100’s or 1000’s of cores
•  Great for Java code, but non-Java code can also make use of Hadoop
•  M/R can be applied to a lot of problems, but one of the simplest is to use it as a “chunker”
  
Cheminformatics in Parallel

•  Many cheminformatics problems are data parallel
    –  Chunk the data and apply the same technique over each chunk
•  This makes many problems amenable to M/R
    –  Substructure / pharmacophore search
    –  Descriptor calculations, virtual screening
    –  Model development (?)
•  In general, each chunk is processed on a distinct node, so the code itself can be non-parallel
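
The "chunker" idea can be sketched on one machine: split the input into chunks, run the same sequential per-item function over each chunk on a separate worker, and concatenate the results in order. Threads stand in for cluster nodes here; the class name `Chunker` and its method names are hypothetical, and the per-item function is whatever sequential toolkit call one wants to apply.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class Chunker {
    /** Splits items into consecutive chunks of at most `size` elements. */
    static <T> List<List<T>> chunk(List<T> items, int size) {
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < items.size(); i += size) {
            int end = Math.min(i + size, items.size());
            chunks.add(new ArrayList<T>(items.subList(i, end)));
        }
        return chunks;
    }

    /** Applies a sequential per-item function to each chunk on its own worker. */
    static <T, R> List<R> processInParallel(List<T> items, int chunkSize,
                                            int threads, Function<T, R> perItem) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<R>>> futures = new ArrayList<>();
            for (List<T> c : chunk(items, chunkSize)) {
                // The work inside a chunk is plain sequential code.
                Callable<List<R>> task = () -> {
                    List<R> out = new ArrayList<>();
                    for (T item : c) {
                        out.add(perItem.apply(item));
                    }
                    return out;
                };
                futures.add(pool.submit(task));
            }
            // Collect in submission order so results line up with the input.
            List<R> all = new ArrayList<>();
            for (Future<List<R>> future : futures) {
                all.addAll(future.get());
            }
            return all;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

The per-item function never needs to know it is running in parallel, which is why existing single-threaded toolkit code can be reused unchanged.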
  
Cheminformatics in Parallel

See http://blog.rguha.net/?tag=hadoop for examples & code
  
Substructure	
  Searching	
  
                                         public class SubSearch {!



•  Substructure	
  
                                         …!
                                              public static class MoleculeMapper extends !
                                                      Mapper<Object, Text, Text, IntWritable> {!


   searching	
  is	
  a	
  trivial	
               private Text matches = new Text();!
                                                   private String pattern;!



   extension	
  of	
  atom	
                     public void setup(Context context) {!
                                                     pattern = context.getConfiguration().get
                                         ("net.rguha.dc.data.pattern");!


   coun6ng	
  
                                                 }!

                                                   public void map(Object key, Text value, Context context) throws!
                                                      IOException, InterruptedException {!


•  If	
  a	
  structure	
  
                                                       try {!
   matches, emit (name,1)
•  Otherwise (name,0)
•  Reducer simply outputs tuples of the form (name,1)

                IAtomContainer molecule = sp.parseSmiles(value.toString());
                sqt.setSmarts(pattern);
                boolean matched = sqt.matches(molecule);
                matches.set((String) molecule.getProperty(CDKConstants.TITLE));
                if (matched) context.write(matches, one);
                else context.write(matches, zero);
            } catch (CDKException e) {
                e.printStackTrace();
            }
        }
    }

    public static class SMARTSMatchReducer extends
            Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values,
                           Context context) throws IOException, InterruptedException {
            for (IntWritable val : values) {
                if (val.compareTo(one) == 0) {
                    result.set(1);
                    context.write(key, result);
                }
            }
        }
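The (name,1) / (name,0) scheme on this slide can be tried out without a cluster. The following is a minimal sketch in plain Java, not the Hadoop code itself: the class `MatchFlagSketch`, the "SMILES name" record layout, and the substring-based `toyMatcher` are all assumptions made for illustration, and the substring test is only a stand-in for real SMARTS matching with a toolkit such as the CDK.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

class MatchFlagSketch {
    // Map step: emit one (name, flag) pair per record; flag is 1 on a match, else 0.
    // Each record is assumed to be "SMILES<whitespace>name" (hypothetical layout).
    static List<Map.Entry<String, Integer>> map(List<String> records,
                                                Predicate<String> matcher) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String record : records) {
            String[] parts = record.trim().split("\\s+", 2);
            if (parts.length < 2) continue; // skip records that carry no name
            int flag = matcher.test(parts[0]) ? 1 : 0;
            out.add(new AbstractMap.SimpleEntry<String, Integer>(parts[1], flag));
        }
        return out;
    }

    // Reduce step: keep only the names whose flag is 1, i.e. output (name, 1).
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> hits = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            if (p.getValue() == 1) hits.put(p.getKey(), 1);
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("NC(=O)C(=O)N oxamide", "CCO ethanol");
        // Toy matcher: a substring test, NOT real SMARTS matching.
        Predicate<String> toyMatcher = smiles -> smiles.contains("C(=O)C(=O)");
        System.out.println(reduce(map(records, toyMatcher)));
    }
}
```

In a real job the framework groups the pairs by key between the two steps; here the list is simply passed from one method to the other.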
Running on AWS

•  All the code was debugged on my laptop with relatively small files
•  To test the scalability, I shifted everything to AWS
    –  Pharmacophore search
    –  136K structures, single conformer, 560MB
    –  Created a single JAR file with CDK & application code
    –  Uploaded data files to S3
•  Total cost of experiments was ~$10
But I Don’t Want to Write Programs

•  All these examples require us to write full-fledged Java classes
•  An easier way is to use Pig & Pig Latin, a platform and query language built on top of Hadoop
•  Lets us write SQL-like queries that make use of Hadoop underneath
•  Flexible due to user-defined functions (UDFs)
    –  UDFs encapsulate the cheminformatics
Cheminformatics & Pig

   A = load 'medium.smi' as (smiles:chararray);
   B = filter A by net.rguha.dc.pig.SMATCH(smiles, 'NC(=O)C(=O)N');
   store B into 'output.txt';

•  Identify molecules in medium.smi that match the SMARTS pattern and dump to output.txt
•  The complexity is now hidden in the UDF
•  Many toolkit functions could be wrapped as UDFs, allowing flexible queries with much simpler code
•  See http://blog.rguha.net/?p=748 for the code
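To make the "complexity hidden in the UDF" point concrete, here is a rough plain-Java sketch of the load / filter / store shape of the query above. It does not use the Pig API and is not the SMATCH implementation from the blog post: `UdfFilterSketch`, `filter`, and `toyUdf` are hypothetical names, and the toy UDF is a substring test standing in for SMARTS matching.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiPredicate;

class UdfFilterSketch {
    // Mirrors "B = filter A by UDF(smiles, pattern)": keep the rows the UDF accepts.
    static List<String> filter(List<String> rows,
                               BiPredicate<String, String> udf,
                               String pattern) {
        List<String> kept = new ArrayList<>();
        for (String row : rows) {
            if (udf.test(row, pattern)) kept.add(row);
        }
        return kept;
    }

    public static void main(String[] args) {
        // Toy UDF: plain substring test, NOT real SMARTS matching.
        BiPredicate<String, String> toyUdf = (smiles, pattern) -> smiles.contains(pattern);
        List<String> a = Arrays.asList("NC(=O)C(=O)N", "CCO", "CNC(=O)C(=O)NC");
        List<String> b = filter(a, toyUdf, "NC(=O)C(=O)N");
        b.forEach(System.out::println); // stands in for "store B into ..."
    }
}
```

The query logic stays three lines long however complicated the function passed in becomes, which is the appeal of wrapping toolkit functions as UDFs.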
  
Latency

•  Hadoop is suited for batch processing
•  Significant network I/O involved in distributing data to compute nodes
•  Not good for
    –  Random ad hoc processing of small subsets
    –  Small volume data
    –  Real time (low latency) work
•  But latency issues can be addressed somewhat by HBase, Hive and other technologies
More than Chunking?

•  But all the examples so far could have been done via PBS/Condor or any other job scheduler
    –  (With Hadoop we don’t have to worry about explicit chunking of the input data)
•  But are there cheminformatics algorithms that can be reworked into the M/R paradigm?
    –  Predictive modeling?
    –  Graph algorithms?
More than Chunking?

•  Both predictive & graph algorithms are increasingly supported in Hadoop
    –  Mahout for M/L algorithms on massive datasets
    –  Cloud9 for graph algorithms
•  A number of bioinformatics applications make use of M/R at the algorithmic level
•  They are all big applications
    –  Crossbow aligns 3 billion paired/unpaired reads
•  Cheminformatics datasets are not very big
Summary

•  HTS data is an ample playground for interesting analytics; multiple data types make it more fun
•  A major challenge in our informatics infrastructure is dealing with proprietary vendor interfaces
•  Hadoop and M/R provide great opportunities for handling large data in a flexible manner
•  But can cheminformatics really make use of it?
Acknowledgments

Informatics
•  Ajit Jadhav
•  Trung Nguyen
•  Noel Southall
•  Ruili Huang
•  Min Shen
•  Hongmao Sun
•  Xin Hu
•  Tongan Zhao

RNAi & Small Molecule
•  Scott Martin
•  Pinar Tuzmen
•  Yu-Chi Chen
•  Carleen Klump
•  Craig Thomas
•  Jim Inglese
•  Ron Johnson
•  Sam Michael
•  Jennifer Wichterman
Counting Atoms

•  The canonical Hadoop program is to count the frequency of words in a text file
    –  Mapper reads a line, outputs a tuple: (word, 1)
    –  Reducer will receive tuples, keyed on word
         •  Summing up the 1’s gives us the frequency of word
•  By default, Hadoop works on a line-by-line basis
•  For cheminformatics problems, SMILES files satisfy this requirement: one line, one molecule
Counting Atoms

•  Uses the CDK to parse SMILES
•  For each molecule loop over atoms
    –  Emit (symbol,1)
•  Reducer simply sums the 1’s for each symbol

   import java.io.IOException;

   import org.apache.hadoop.io.IntWritable;
   import org.apache.hadoop.io.Text;
   import org.apache.hadoop.mapreduce.Mapper;
   import org.apache.hadoop.mapreduce.Reducer;
   import org.openscience.cdk.DefaultChemObjectBuilder;
   import org.openscience.cdk.exception.InvalidSmilesException;
   import org.openscience.cdk.interfaces.IAtom;
   import org.openscience.cdk.interfaces.IAtomContainer;
   import org.openscience.cdk.smiles.SmilesParser;

   public class HeavyAtomCount {

       static SmilesParser sp = new SmilesParser(DefaultChemObjectBuilder.getInstance());

       // Mapper: parse one SMILES line and emit (symbol, 1) for each atom
       public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

           private final static IntWritable one = new IntWritable(1);
           private Text word = new Text();

           public void map(Object key, Text value, Context context)
                   throws IOException, InterruptedException {
               try {
                   IAtomContainer molecule = sp.parseSmiles(value.toString());
                   for (IAtom atom : molecule.atoms()) {
                       word.set(atom.getSymbol());
                       context.write(word, one);
                   }
               } catch (InvalidSmilesException e) {
                   // do nothing for now
               }
           }
       }

       // Reducer: sum the 1's received for each element symbol
       public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
           private IntWritable result = new IntWritable();

           public void reduce(Text key, Iterable<IntWritable> values, Context context)
                   throws IOException, InterruptedException {
               int sum = 0;
               for (IntWritable val : values) {
                   sum += val.get();
               }
               result.set(sum);
               context.write(key, result);
           }
       }
       ….
   }
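The emit-then-sum logic of the atom counter can be checked on a laptop with no Hadoop or CDK on the classpath. The sketch below is only an approximation under stated assumptions: `AtomCountSketch` is a hypothetical class, and its regular-expression tokenizer is a crude stand-in for the CDK SMILES parser that handles only unbracketed organic-subset atoms (B, C, N, O, P, S, F, Cl, Br, I and their aromatic forms) and skips bracket atoms entirely.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AtomCountSketch {
    // Toy tokenizer: bracket atoms first (so they can be skipped), then the
    // two-letter halogens, then one-letter aliphatic and aromatic atoms.
    private static final Pattern ATOM =
        Pattern.compile("\\[[^\\]]*\\]|Cl|Br|[BCNOPSFI]|[bcnops]");

    static Map<String, Integer> countSymbols(List<String> smilesLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : smilesLines) {                 // one line, one molecule
            String smiles = line.trim().split("\\s+")[0]; // ignore any trailing name
            Matcher m = ATOM.matcher(smiles);
            while (m.find()) {
                String token = m.group();
                if (token.startsWith("[")) continue;      // bracket atoms are out of scope
                // Aromatic atoms are lowercase in SMILES; report the element symbol.
                String symbol = token.length() == 1 ? token.toUpperCase() : token;
                counts.merge(symbol, 1, Integer::sum);    // "emit (symbol,1)" and sum
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countSymbols(Arrays.asList("CCO", "c1ccccc1Cl")));
    }
}
```

Here the map and reduce phases are fused into a single loop; on a cluster the same per-symbol sums are produced after the framework has grouped the emitted pairs by key.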
Multiline Records

•  Lots of cheminformatics applications require 3D; SMILES won’t do. Need to support SDF
•  We implement a custom RecordReader to process SD files
•  We’re now ready to tackle most cheminformatics tasks
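The core of a multiline reader for SD files is accumulating lines until the `$$$$` record delimiter. The following plain-Java sketch shows just that step; it is not the custom RecordReader from the talk. `SdfRecordSketch`, `nextRecord`, and `readAll` are hypothetical names, and a real Hadoop RecordReader would additionally have to cope with input splits that begin in the middle of a record, which this sketch ignores.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class SdfRecordSketch {
    // Reads one multi-line record, up to and including the "$$$$" delimiter line.
    // Returns null once the input is exhausted.
    static String nextRecord(BufferedReader in) throws IOException {
        StringBuilder sb = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            sb.append(line).append('\n');
            if (line.trim().equals("$$$$")) return sb.toString();
        }
        // Trailing text without a delimiter is returned as a last record if non-blank.
        return sb.toString().trim().isEmpty() ? null : sb.toString();
    }

    // Convenience wrapper: split a whole SD file held in memory into records.
    static List<String> readAll(String sdfText) {
        List<String> records = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new StringReader(sdfText))) {
            String rec;
            while ((rec = nextRecord(in)) != null) records.add(rec);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        String sdf = "mol1\n  toy\n\nM  END\n$$$$\nmol2\n  toy\n\nM  END\n$$$$\n";
        System.out.println(readAll(sdf).size() + " records");
    }
}
```

Each returned string is one complete molecule record, which is what a mapper would then hand to the toolkit's SD parser.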
  
Why Hadoop?

•  Java and C++ APIs
    –  In Java use Objects, while in C++ bytes
•  Each task can process data sets larger than RAM
•  Automatic re-execution on failure
    –  In a large cluster, some nodes are always slow or flaky
    –  Framework re-executes failed tasks
•  Locality optimizations
    –  M/R queries HDFS for locations of input data
    –  Map tasks are scheduled close to the inputs when possible

Owen O’Malley, http://bit.ly/ecHPvB

Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

  • 1. Enabling  Discoveries  at  High  Throughput     Small  molecule  and  RNAi  HTS  at  the  NCTT   Rajarshi  Guha   NIH  Center  for  Transla6on  Therapeu6cs   May  3,  2011  
  • 2. Outline   •  Informa6cs  for  small  molecule  &  RNAi  screening   •  HCA  &  automated  decision  making   –  Pre7y  pictures  can  lead  to  more  efficient  screens   •  Large  scale  cheminforma6cs       –  We  can  do  it,  but  do  we  need  to?  
  • 3. NIH Chemical Genomics Center •  Founded  2004  as  part  of  NIH  Roadmap  Molecular  Libraries  Ini6a6ve   –  NCGC  staffed  with  90+  scien6sts  –  biologists,  chemists,  informa6cians,  engineers   –  Post-­‐doc  program   •  Mission   –  MLPCN  (screening  &  chemical  synthesis;  compound  repository;  PubChem  database;   funding  for  assay,  library  and  technology  development  )   –  Develop  new  chemical  probes  for  basic  research  and  leads  for  therapeu6c  development,   par6cularly  for  rare/neglected  diseases   –  New  paradigms  &  applica6ons  of  HTS  for  chemical  biology  /  chemical  genomics   •  All  NCGC  projects  are  collabora6ons  with  a  target  or  disease  expert;    currently  >200   collabora6ons  with  inves6gators  worldwide    
  • 4. Project Diversity Project  Diversity   (A) Disease areas (B) Target types (C) Detection methods
  • 5. Assay  formats  &  detec?on  methods  in  HTS   Assay formats •  cellular signal transduction •  luminescence   •  ligand  binding   –  reporter gene –  chemiluminescence   –  compe66on  binding     –  bioluminescence   –  second messenger •  enzyma6c  ac6vity   •  phenotypic –  BRET   –  biochemical   –  ALPHA   –  cellular   –  protein redistribution •  fluorescence   •  ion  or  ligand  transport   –  cell viability –  FI     –  Ion-­‐sensi6ve  dyes   –  etc. –  membrane  poten6al  dyes   Detection modes –  –  FRET     TRF   •  protein-­‐protein  interac6ons     •  absorbance –  TR-­‐FRET   –  biochemical   –  FP     –  cellular   •  radioactivity –  FCS   –  SPA –  FLT  
  • 6. Detector  Systems:  “Reading  the  assay”   •  ViewLux   –  Mul6modal  CCD-­‐based  imager   •  Abs.,  Luminescence,  Fluorescence   •  Envision   –  PMT-­‐based  reader     •  ALPHA   •  Acumen  Explorer   –  Laser  Scanning  Imager   •  “sta6c”  cell  cytometry   •  Hamamatsu  FDS  7000  Series     –  rapid  kine6cs   •  INCell1000   –  Subcellular  imaging  
  • 7. qHTS:  High  Throughput  Dose  Response   Assay concentration ranges over 4 logs Informatics pipeline. Automated curve fitting A   (high:~ 100 μM) 1536-well plates, inter-plate dilution series and classification. 300K samples C   Assay volumes 2 – 5 μL B   Automated concentration-response data collection ~1 CRC/sec
  • 8. Informa?cs  Ac?vi?es   •  High  throughput  curve  fieng   •  Data  integra6on,  automated  cherry  picking   •  SAR  algorithms   –  QSAR  modeling   –  Fragment  based  analysis   –  Ac6vity  cliffs   •  Tools  –  standardizer,  tautomers,  fragment  acDvity   browser,  kinome  browser  and  more   •  RNAi  hit  selec6on,  OTE  analysis   •  High  content  analysis  
  • 9. Kinome  Navigator   •  Browse  kinase   panel  data   •  Currently  focused   on  the  Abbot   dataset   •  View     •  Fragments   •  Target  pairs   •  Kinome  overlay   hip://tripod.nih.gov  
  • 10. Fragment  Browser   •  View  ac6vi6es  on  a  fragment  wise  basis   •  Compare  ac6vity  distribu6ons  by  fragment   •  Currently  based  around  ChEMBL  assays  but  users   can  browse  their  own  compounds  &  ac6vi6es   hip://tripod.nih.gov  
  • 11. Structure  Ac?vity  Landscapes   •  Rugged  gorges  or  rolling  hills?   –  Small  structural  changes  associated  with  large   ac6vity  changes  represent  steep  slopes  in  the   landscape   –  But  tradi6onally,  QSAR  assumes  gentle  slopes     –  We  can  characterize  the  landscape  using  SALI   Maggiora,  G.M.,  J.  Chem.  Inf.  Model.,  2006,  46,  1535–1535  
  • 12. What  Can  We  Do  With  SALI’s?   •  SALI  characterizes  cliffs  &  non-­‐cliffs   •  For  a    given  molecular  representa6on,  SALI’s   gives  us  an  idea  of    the   smoothness  of  the     SAR  landscape   •  Models  try  and  encode   this  landscape   •  Use  the  landscape  to  guide   descriptor  or  model     selec6on   Guha,  R.;  Van  Drie,  J.H.,  J.  Chem.  Inf.  Model.,  2008,  48,  646–658  
  • 13. Predicting the Landscape   • Rather than predicting activity directly, we can try to predict the SAR landscape • Implies that we attempt to directly predict cliffs – Observations are now pairs of molecules • Original pIC50: RMSE = 0.97 • SALI, AbsDiff: RMSE = 1.10 • SALI, GeoMean: RMSE = 1.04   Scheiber et al, Statistical Analysis and Data Mining, 2009, 2, 115-122
  • 14. Data Integration   • It’s nice to simplify data, but we can still be faced with a multitude of data types • We want to explore these data in a linked fashion • How we explore and what we explore is generally influenced by the task at hand • At one point, make inferences over all the data
  • 15. Data Integration   • User’s Network • Network of Public Data • Content: Drugs, Compounds, Scaffolds, Assays, Genes, Targets, Pathways, Diseases, Clinical Trials, Documents • Links: Manually curated, Derived from algorithms
  • 16. Record  View  of  an  Assay  
  • 17. Access  Disease  Hierarchy  &  Network  
  • 18. Articles, Patents, Drug Labels, …
  • 20. Going Beyond Exploration?   • Simply being able to explore data in an integrated manner is useful as an idea generator • Can we integrate heterogeneous data types & sources to get a systems level view? – Current research problem in genomics and systems biology – Some attempts have been made to merge chemical data with other data types   Young, D.W. et al, Nat. Chem. Biol., 2008, 4, 59-68
  • 21. RNAi Facility Mission   • Perform collaborative genome-wide RNAi screening-based projects with intramural investigators • Advance the science of RNAi and miRNA screening and informatics via technology development to improve efficiency, reliability, and costs. • Range of Assays: Simple Phenotypes (Viability, cytotoxicity, oxidative stress, etc) • Pathway (Reporter assays, e.g. luciferase, β-lactamase) • Complex Phenotypes (High-content imaging, cell cycle, translocation, etc)
  • 22. RNAi Effectors   RNAi effectors provide an excellent way to conduct gene-specific loss of function studies.
  • 23. Issues Using RNAi Effectors   • RNAi effectors give a knockdown, not a knockout (70% - 80% is considered good). Therefore, they may not silence enough to give a phenotype even if the target is involved in what you are assaying for. • RNAi effectors induce off-target effects!!!!!
  • 24. Examples of Current Projects   • Protein Quality Control • Poxvirus • DNA Re-replication • Respiratory Viruses • Base Excision Repair • Lysosomal Storage Disorders • DNA Damage – ELG1 stabilization • Parkinsons – Mitochondrial Quality Control • Antioxidant Response • Ewings Sarcoma • Hypoxia • Drug Modifiers, Pancreatic Cancer • TNFa Response • Drug Modifiers, TOP1 Clinical Agents • Interferon Response • iPS to RPE • Immunotoxin-Mediated Cell Death
  • 26. RNAi Libraries   • Ambion Human Genome-Wide Library, 21,585 genes, 3 unique siRNAs per gene. • Ambion Mouse Genome-Wide Library, 17,582 genes, 3 unique siRNAs per gene. • Dharmacon Human Duet Genome-Wide siRNA Libraries, 18,236 genes, siRNA pools. • Human and Mouse miRNA Mimic Libraries & Human miRNA Inhibitor Library • Qiagen Human Druggable Genome Library, > 7,000 genes, 4 unique siRNAs per gene. • Kinome Libraries, purchased from a number of vendors. • Smaller libraries (e.g. kinome and miRNA mimics) will enable high-impact screens in systems less amenable to high throughput applications. • Considerations are being made for additional species and shRNA resources.
  • 27. Druggable Genome Screening Campaign   • Over 7,000 genes, 4 unique siRNAs per gene (≈36,000 wells). • 85 genes were selected for follow-up through a variety of threshold-based selection schemes. • 27 genes were validated as confident hits using siRNAs from multiple vendors. • Significant enrichment for core NF-kB components • [Figure: plate heat map, pseudo-colored Blue/Green Ratio (normalized to plate median)] • [Figure: Percent Reduction in NF-kB Signal, Average Inhibition (%), Qiagen vs Ambion siRNAs, for TNFα Receptor, IKKα, RELA, NEMO]
  • 28. Druggable Genome Screening Campaign   • Significant enrichment for proteins that form the 26S proteasome • [Figure: Percent Reduction in NF-kB Signal, Average Inhibition (%), Qiagen vs Ambion siRNAs, by PSM gene (A1–A7, B2–B4, C4, C5, D2, D7, D14); PSM proteins grouped as α core 20S, β core 20S, RPT 19S, RPN 19S] • [Figure: proteasome schematic, 19S regulator particles (RPN, RPT) flanking the 20S proteasome (α1-7, β1-7); Murata et al, Nature Reviews Mol. Cell Biol.] • An additional 34 genes remain inconclusive but noteworthy hits that require further study. Some of these tie into the core NF-kB pathway
  • 29. Seed Sequence Analysis   Other instances of the seeds incorporated within siRNAs targeting PSMA3 do not exhibit significant activity, adding to the likelihood of this being an on-target effect.
  • 30. Seed Sequence Analysis   Other instances of the seeds within the active siRNAs targeting SLC24A1 tend to downregulate NF-kB reporter, adding to the likelihood of this being an off-target effect.
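The comparison on the two slides above boils down to grouping siRNAs by their seed and looking at the activities of every siRNA that shares it. The sketch below is a hedged illustration only: it assumes the seed is nucleotides 2–8 of the guide strand (a common convention; the slides do not say which definition was used), and the sequences and activity values are invented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SeedGrouping {
    // Seed taken as positions 2-8 (1-based) of the guide strand, read 5'->3'.
    // This definition is an assumption, not something stated in the slides.
    static String seed(String guide) {
        return guide.substring(1, 8);
    }

    // Collect the measured activities of all siRNAs that share a seed.
    static Map<String, List<Double>> activityBySeed(Map<String, Double> activityByGuide) {
        Map<String, List<Double>> bySeed = new TreeMap<>();
        for (Map.Entry<String, Double> e : activityByGuide.entrySet()) {
            bySeed.computeIfAbsent(seed(e.getKey()), k -> new ArrayList<>()).add(e.getValue());
        }
        return bySeed;
    }

    public static void main(String[] args) {
        Map<String, Double> screen = new TreeMap<>();
        screen.put("UACGGAAUCCGAUAGCUAGCU", 82.0); // made-up guide, % inhibition
        screen.put("GACGGAAUGGCUAUUCGCAUA", 79.0); // different gene, same seed
        screen.put("CUUGCAAGCUAGGCUAACGAU", 5.0);
        System.out.println(activityBySeed(screen));
    }
}
```

If unrelated siRNAs carrying the same seed are also active, the phenotype is more likely seed-driven (off-target); if they are inactive, an on-target effect becomes more plausible.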
  • 31. RNAi & Small Molecule Screens   • What targets mediate activity of siRNA and compound • Pathway elucidation, identification of interactions • Target ID and validation • Link RNAi generated pathway perturbations to small molecule activities. Could provide insight into polypharmacology • Reuse pre-existing MLI data • Develop new annotated libraries • Run parallel RNAi screen • Example sequences shown: CAGCATGAGTACTACAGGCCA, TACGGGAACTACCATAATTTA • Goal: Develop systems level view of small molecule activity
  • 32. Matching Phenotypes   RNAi   Small Molecule
  • 33. Merging Screening Technologies   • High throughput screening: Lead identification • Single (few) read outs • High-throughput • Moderate data volumes • High content screening: Phenotypic profiling • Multiple parameters • Moderate throughput • Very large data volumes • We’d like to combine the technologies, to obtain rich high-resolution data at high speed • Is this feasible? What are the trade-offs?
  • 34. Merging Screening Technologies   • A simple solution is to run HTS & HCS as separate, primary & secondary screens • Alternatively – Wells to Cells – Integrate HTS & HCS in a single screen using a combined platform for robotics & real time automated HTS analytics – Selective imaging of interesting wells
  • 35. Wells to Cells Workflow   • Sequential qHTS using laser scanning cytometry followed by high-res microscopy • Unit of work is a plate series • The same aliquot is analyzed by both techniques • A message based system • The key is deciding which wells go through the workflow
  • 36. Wells to Cells Assays   • Cell cycle, cell translocation, DNA rereplication • All assays run against LOPAC1280 • Consistency between cytometry & microscopy is measured by the R² between log AC50’s – Cell cycle, 0.94 – 0.96 – Cell translocation, 0.66 – 0.94 – DNA rereplication, still in progress
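The consistency measure quoted above can be sketched in a few lines. This assumes R² means the squared Pearson correlation between the log AC50 values from the two instruments; the slide does not say exactly how it was computed, and the numbers in `main` are invented.

```java
public class Ac50Agreement {
    // Squared Pearson correlation between paired log AC50 values,
    // e.g. x from laser scanning cytometry and y from microscopy.
    static double rSquared(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0, meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double sxy = 0.0, sxx = 0.0, syy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return (sxy * sxy) / (sxx * syy);
    }

    public static void main(String[] args) {
        double[] cytometry = {-6.1, -5.4, -4.9, -7.0}; // made-up log10 AC50 (M)
        double[] microscopy = {-6.0, -5.6, -4.7, -6.8};
        System.out.println(rSquared(cytometry, microscopy));
    }
}
```

Only compounds with a fitted AC50 on both platforms can enter such a comparison, so the value says nothing about wells that were active on one instrument only.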
  • 38. Informatics Platform   InCell Layout File   • Advanced correction and normalization methods • Sophisticated curve fitting algorithm • Good performance, allows parallelization of the entire workflow
  • 39. Why Messaging?   • A messaging architecture allows for significant flexibility – Persistent, can be kept for process tracking, reporting – Asynchronous, allows individual components of the workflow to proceed at their own pace – Modular, new components can be introduced at any time without redesigning the whole workflow • We employ Oracle AQ, but any message queue can be employed
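The asynchronous part of this argument can be illustrated with a toy in-process queue. This is not Oracle AQ and not the production workflow: it is a minimal sketch using `java.util.concurrent`, with hypothetical names, in which a reader publishes plate identifiers and an analysis thread consumes them at its own pace.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class PlateQueueSketch {
    private static final String DONE = "__DONE__"; // hypothetical end-of-stream marker

    // Publish each plate id as a message; a separate thread "analyses" them.
    static List<String> process(List<String> plateIds) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        List<String> analysed = Collections.synchronizedList(new ArrayList<>());
        Thread analyst = new Thread(() -> {
            try {
                // take() blocks until a message arrives, so the consumer
                // never needs to know how fast the producer is running
                for (String msg = queue.take(); !msg.equals(DONE); msg = queue.take()) {
                    analysed.add("fitted:" + msg);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        analyst.start();
        try {
            for (String id : plateIds) {
                queue.put(id);
            }
            queue.put(DONE);
            analyst.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while processing plates", e);
        }
        return new ArrayList<>(analysed);
    }

    public static void main(String[] args) {
        System.out.println(process(Arrays.asList("plate-001", "plate-002")));
    }
}
```

An in-memory queue gives the asynchronous and modular properties only; the persistence listed on the slide needs a durable broker such as the database-backed queue the authors mention.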
  • 40. Handling Multiple Platforms   • Current examples employ InCell hardware • We also use Molecular Devices hardware • As a result we have two orthogonal image stores / databases • Need to integrate them – Support seamless data browsing across multiple screens irrespective of imaging platform used – Support analytics external to vendor code
  • 41. A Unified Interface   • A client sees a single, simple interface to screening image data   http://host/rest/protocol/plate/well/image   • Transparently extract image data via the MetaXpress database or via custom code • Currently the interface addresses image serving • Unified metadata interface in the works
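From the client side, the URL scheme above amounts to filling four path segments. The helper below is a hypothetical sketch: the host name, the example values, and the assumption that each segment is a plain identifier needing no URL encoding are all illustrative and not taken from the actual service.

```java
import java.net.URI;

public class ImageUrlSketch {
    // Build http://host/rest/protocol/plate/well/image from its parts.
    // Segments are assumed to be simple identifiers (no '/', spaces, etc.).
    static URI imageUri(String host, String protocol, String plate, String well, String image) {
        return URI.create(String.format("http://%s/rest/%s/%s/%s/%s",
                host, protocol, plate, well, image));
    }

    public static void main(String[] args) {
        // All values below are placeholders for illustration
        System.out.println(imageUri("host.example", "cellcycle", "plate42", "A01", "1"));
    }
}
```

Because the client only ever sees this path, the server is free to resolve it against the vendor database or custom code without the caller knowing which imaging platform produced the image.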
  • 42. Trade-offs & Opportunities   • Automation reduces the ability to handle unforeseen errors – Dispense errors and other plate problems – Well selection based on curve classes may need to be modified on the fly • Well selection does not consider SAR – Wells are selected independently of each other – If we could model SAR on the fly (or from validation screens), we’d select multiple wells, to obtain positive and negative results
  • 43. Cloud Computing & Cheminformatics   • Cloud computing is a hot topic • A number of examples of computational chemistry / cheminformatics on the cloud – MolPlex, hBar, Numerate, Wingu, Scilligence, Pfizer • Many examples use the cloud for remote storage and remote (hosted) computations • But providers such as Amazon allow us to run distributed computing applications on the cloud
  • 44. Map/Reduce   • Map/Reduce is a programming model for efficient distributed computing • M/R made “famous” by Google, but the idea has been around for a long time • It works like a Unix pipeline: – cat input | grep | sort | uniq -c | cat > output – Input | Map | Shuffle & Sort | Reduce | Output • Efficiency from – Streaming through data, reducing seeks – Pipelining   Owen O’Malley, http://bit.ly/ecHPvB
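The pipeline analogy can be spelled out with a single-process sketch in plain Java. It uses no Hadoop classes and is only meant to show the three stages (map, shuffle & sort, reduce) on the classic counting problem; the class and method names are hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MapReduceSketch {
    // Map: one input line -> a list of (token, 1) pairs
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(new SimpleEntry<>(token, 1));
            }
        }
        return out;
    }

    // Shuffle & sort: group values by key; TreeMap keeps the keys sorted
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce: sum the values collected for one key
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(pairs).entrySet()) {
            result.put(group.getKey(), reduce(group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("C C O", "N C")));
    }
}
```

In a real framework the map calls run on many nodes and the shuffle moves data across the network; the logic per stage stays as small as it is here.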
  • 45. Map/Reduce   Owen O’Malley, http://bit.ly/ecHPvB
  • 46. Hadoop & Cheminformatics   • Hadoop is an Open Source implementation of the map/reduce paradigm • Hadoop is a framework for scalable, distributed computing – Hadoop, HDFS, Hive, Pig • Importantly, you can play with all this on your laptop and just copy files to the big cluster when you’re ready for production
  • 47. Why Hadoop?   • Simple way to make use of large clusters without MPI etc • AWS supports Hadoop, so easy to scale up to 100’s or 1000’s of cores • Great for Java code, but non-Java code can also make use of Hadoop • M/R can be applied to a lot of problems, but one of the simplest is to use it as a “chunker”
  • 48. Cheminformatics in Parallel   • Many cheminformatics problems are data parallel – Chunk the data and apply the same technique over each chunk • This makes many problems amenable to M/R – Substructure / pharmacophore search – Descriptor calculations, virtual screening – Model development (?) • In general, each chunk is processed on a distinct node – so code itself can be non-parallel
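The "chunk and apply" pattern can be sketched without a cluster. In the example below each chunk is handled by sequential code and the chunks are spread over threads with a parallel stream, standing in for distinct nodes. The matching step is deliberately a stand-in: a plain substring test on SMILES strings is not a substructure search, and a real job would call a toolkit such as the CDK at that point.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ChunkedSearch {
    // Split the input into consecutive chunks of at most `size` items
    static List<List<String>> chunk(List<String> items, int size) {
        List<List<String>> chunks = new ArrayList<>();
        for (int i = 0; i < items.size(); i += size) {
            chunks.add(new ArrayList<>(items.subList(i, Math.min(i + size, items.size()))));
        }
        return chunks;
    }

    // Count matches chunk by chunk; the per-chunk work is ordinary sequential code.
    // NOTE: String.contains is only a placeholder for a real substructure match.
    static long countMatches(List<String> smiles, String fragment, int chunkSize) {
        return chunk(smiles, chunkSize).parallelStream()
                .mapToLong(c -> c.stream().filter(s -> s.contains(fragment)).count())
                .sum();
    }

    public static void main(String[] args) {
        List<String> smiles = Arrays.asList("CCO", "CCN", "c1ccccc1", "OCC");
        System.out.println(countMatches(smiles, "CC", 2));
    }
}
```

The per-chunk function never needs to know that other chunks exist, which is why existing single-threaded cheminformatics code can be reused unchanged.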
  • 49. Cheminformatics in Parallel   See http://blog.rguha.net/?tag=hadoop for examples & code
  • 50. Substructure Searching   • Substructure searching is a trivial extension of atom counting • If a structure matches, emit (name,1) • Otherwise (name,0) • Reducer simply outputs tuples of the form (name,1)

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.openscience.cdk.CDKConstants;
    import org.openscience.cdk.exception.CDKException;
    import org.openscience.cdk.interfaces.IAtomContainer;

    public class SubSearch {
        …   // sp, sqt, one and zero are declared in the part elided on the slide

        public static class MoleculeMapper extends Mapper<Object, Text, Text, IntWritable> {
            private Text matches = new Text();
            private String pattern;

            // Read the SMARTS pattern from the job configuration
            public void setup(Context context) {
                pattern = context.getConfiguration().get("net.rguha.dc.data.pattern");
            }

            // One SMILES line in, one (name, 1) or (name, 0) tuple out
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                try {
                    IAtomContainer molecule = sp.parseSmiles(value.toString());
                    sqt.setSmarts(pattern);
                    boolean matched = sqt.matches(molecule);
                    matches.set((String) molecule.getProperty(CDKConstants.TITLE));
                    if (matched) context.write(matches, one);
                    else context.write(matches, zero);
                } catch (CDKException e) {
                    e.printStackTrace();
                }
            }
        }

        public static class SMARTSMatchReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private IntWritable result = new IntWritable();

            // Keep only the molecules that matched
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                for (IntWritable val : values) {
                    if (val.compareTo(one) == 0) {
                        result.set(1);
                        context.write(key, result);
                    }
                }
            }
        }
    }
  • 51. Running on AWS   • All the code was debugged on my laptop with relatively small files • To test the scalability, I shifted everything to AWS – Pharmacophore search – 136K structures, single conformer, 560MB – Created a single JAR file with CDK & application code – Uploaded data files to S3 • Total cost of experiments was ~ $10
  • 52. But I Don’t Want to Write Programs   • All these examples require us to write full-fledged Java classes • An easier way is to use Pig & Pig Latin – a platform and query language built on top of Hadoop • Lets us write SQL-like queries that make use of Hadoop underneath • Flexible due to user defined functions (UDF’s) – UDF’s encapsulate the cheminformatics
  • 53. Cheminformatics & Pig

    A = load 'medium.smi' as (smiles:chararray);
    B = filter A by net.rguha.dc.pig.SMATCH(smiles, 'NC(=O)C(=O)N');
    store B into 'output.txt';

  • Identify molecules in medium.smi that match the SMARTS pattern and dump to output.txt • The complexity is now hidden in the UDF • Many toolkit functions could be wrapped as UDF’s, allowing flexible queries with much simpler code • See http://blog.rguha.net/?p=748 for the code
  • 54. Latency   • Hadoop is suited for batch processing • Significant network I/O involved in distributing data to compute nodes • Not good for – Random ad hoc processing of small subsets – Small volume data – Real time (low latency) work • But latency issues can be addressed somewhat by HBase, Hive and other technologies
  • 55. More than Chunking?   • But all the examples so far could have been done via PBS/Condor or any other job scheduler – (With Hadoop we don’t have to worry about explicit chunking of the input data) • But are there cheminformatics algorithms that can be reworked into the M/R paradigm? – Predictive modeling? – Graph algorithms?
  • 56. More than Chunking?   • Both predictive & graph algorithms are increasingly supported in Hadoop – Mahout for M/L algorithms on massive datasets – Cloud9 for graph algorithms • A number of bioinformatics applications make use of M/R at the algorithmic level • They are all big applications – Crossbow aligns 3 billion paired/unpaired reads • Cheminformatics datasets are not very big
  • 57. Summary   • HTS data is an ample playground for interesting analytics; multiple data types make it more fun • A major challenge in our informatics infrastructure is dealing with proprietary vendor interfaces • Hadoop and M/R provide great opportunities for handling large data in a flexible manner • But can cheminformatics really make use of it?
  • 58. Acknowledgments   Informatics: • Ajit Jadhav • Trung Nguyen • Noel Southall • Ruili Huang • Min Shen • Hongmao Sun • Xin Hu • Tongan Zhao   RNAi & Small Molecule: • Scott Martin • Pinar Tuzmen • Yu-Chi Chen • Carleen Klump • Craig Thomas • Jim Inglese • Ron Johnson • Sam Michael • Jennifer Wichterman
  • 60. Counting Atoms   • The canonical Hadoop program is to count the frequency of words in a text file – Mapper reads a line, outputs a tuple – (word, 1) – Reducer will receive tuples, keyed on word • Summing up the 1’s gives us the frequency of word • By default, Hadoop works on a line-by-line basis • For cheminformatics problems, SMILES files satisfy this requirement – one line, one molecule
  • 61. Counting Atoms   • Uses the CDK to parse SMILES • For each molecule loop over atoms – Emit (symbol,1) • Reducer simply sums the 1’s for each symbol

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.openscience.cdk.DefaultChemObjectBuilder;
    import org.openscience.cdk.exception.InvalidSmilesException;
    import org.openscience.cdk.interfaces.IAtom;
    import org.openscience.cdk.interfaces.IAtomContainer;
    import org.openscience.cdk.smiles.SmilesParser;

    public class HeavyAtomCount {
        static SmilesParser sp = new SmilesParser(DefaultChemObjectBuilder.getInstance());

        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            // One SMILES line in, one (symbol, 1) tuple out per atom
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                try {
                    IAtomContainer molecule = sp.parseSmiles(value.toString());
                    for (IAtom atom : molecule.atoms()) {
                        word.set(atom.getSymbol());
                        context.write(word, one);
                    }
                } catch (InvalidSmilesException e) {
                    // do nothing for now
                }
            }
        }

        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private IntWritable result = new IntWritable();

            // Sum the 1's emitted for each element symbol
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }
        …
    }
  • 62. Multiline Records   • Lots of cheminformatics applications require 3D – SMILES won’t do. Need to support SDF • We implement a custom RecordReader to process SD files • We’re now ready to tackle pretty much most cheminformatics tasks
  • 63. Why Hadoop?   • Java and C++ APIs – In Java use Objects, while in C++ bytes • Each task can process data sets larger than RAM • Automatic re-execution on failure – In a large cluster, some nodes are always slow or flaky – Framework re-executes failed tasks • Locality optimizations – M/R queries HDFS for locations of input data – Map tasks are scheduled close to the inputs when possible   Owen O’Malley, http://bit.ly/ecHPvB