SlideShare una empresa de Scribd logo
1 de 27
Descargar para leer sin conexión
Data	
  Driven	
  Tes,ng	
  for	
  	
  
              Distributed	
  Systems	
  	
  
          Case	
  study	
  with	
  Apache	
  Helix	
  


Kishore	
  Gopalakrishna,	
  @kishoreg1980	
  
hBp://www.linkedin.com/in/kgopalak	
  
	
  
Outline	
  
•    Intro	
  to	
  Helix	
  
•    Use	
  case:	
  Distributed	
  data	
  store	
  
•    Tradi,onal	
  approach	
  
•    Data	
  driven	
  tes,ng	
  
•    Q	
  &	
  A	
  
What	
  is	
  Helix	
  
•  Generic	
  cluster	
  management	
  framework	
  
   –  Par,,on	
  management	
  
   –  Failure	
  detec,on	
  and	
  handling	
  
   –  Elas,city	
  
Terminologies 	
  	
  
Node	
          A	
  single	
  machine	
  

Cluster	
       Set	
  of	
  Nodes	
  

Resource	
      A	
  logical	
  en/ty	
  e.g.	
  database,	
  index,	
  task	
  

Par,,on	
       Subset	
  of	
  the	
  resource.	
  

Replica	
       Copy	
  of	
  a	
  par,,on	
  

State	
         Status	
  of	
  a	
  par,,on	
  replica,	
  e.g	
  Master,	
  Slave	
  

Transi,on	
     Ac,on	
  that	
  lets	
  replicas	
  change	
  status	
  e.g	
  Slave	
  -­‐>	
  Master	
  




                                                                                                              4	
  
Core	
  concept:	
  Augmented	
  finite	
  state	
  machine	
  

          State	
  Machine	
                        Constraints	
                           Objec,ves	
  

• States	
                              • States	
                                • Par,,on	
  Placement	
  
  • S1,S2,S3	
                            • S1à	
  max=1,	
  S2=min=2	
          • Failure	
  seman,cs	
  
• Transi,on	
                           • Transi,ons	
  
  • S1àS2,	
  S2àS1,	
  S2àS3,	
       • Concurrent(S1-­‐>S2)	
  
    S3àS1	
  	
                            across	
  cluster	
  <	
  5	
  	
  




                                                                                                               5	
  
Helix	
  usage	
  at	
  LinkedIn	
  
	
  




       	
     Espresso	
  




                                                           6	
  
Use	
  case:	
  Distributed	
  data	
  store	
  
•    Timeline	
  consistent	
  par,,oned	
  data	
  store	
  
•    One	
  master	
  replica	
  per	
  par,,on	
  
•    Even	
  distribu,on	
  of	
  master/slave	
  
•    On	
  failure:	
  promote	
  slave	
  to	
  master	
  
      P.1	
        P.2	
        P.3	
     P.5	
          P.6	
       P.7	
     P.9	
         P.10	
       P.11	
  


      P.4	
        P.5	
        P.6	
     P.8	
          P.1	
       P.2	
     P.12	
        P.3	
        P.4	
  
                                                         P.1	
  

      P.9	
        P.10	
                 P.11	
         P.12	
                P.7	
         P.8	
  



                Node	
  1	
                          Node	
  2	
                          Node	
  3	
  
Helix	
  based	
  solu,on	
  
            State	
  Machine	
                                        Constraints	
                                      Objec,ves	
  

• States	
                                               • States	
                                            • Par,,on	
  Placement	
  
  • Offline,	
  Slave,	
  Master	
                           • M=1,	
  S=2	
                                     • Failure	
  seman,cs	
  
• Transi,on	
                                            • Transi,ons	
  
  • O-­‐>S,	
  S-­‐>M,S-­‐>M,	
  M-­‐>S	
                  • concurrent(0-­‐>S)	
  <	
  5	
  	
  




                                                                   COUNT=2             minimize(maxnj∈N	
  S(nj)	
  )
                                                 t1≤5
                                                                        S	
  
                                                        t1                                  t2

                                                              t3                  t4
                                              O	
                                                   M	
     COUNT=1     minimize(maxnj∈N	
  M(nj)	
  )




                                                                                                                                                         8	
  
Tes,ng	
  
•  Happy	
  path	
  func,onality	
  
    –  Meet	
  SLA	
  
        •  	
   99th	
  percen,le	
  latency	
  etc	
  
    –  Writes	
  to	
  master	
  
•  Non	
  happy	
  path	
  
    –  System	
  failures	
  	
  
    –  Applica,on	
  failures	
  
    –  How	
  does	
  system	
  behave	
  in	
  such	
  scenarios	
  
        	
  
Non	
  happy	
  path	
  -­‐	
  Tradi,onal	
  approach	
  
•  Iden,fy	
  scenarios	
  of	
  interest	
  
    –  Node	
  failure	
  
    –  System	
  upgrade	
  
•  Tested	
  each	
  scenario	
  in	
  isola,on	
  via	
  test	
  case	
  
    –  All	
  test	
  passed	
  J	
  
•  Deployed	
  in	
  alpha	
  
    –  First	
  soiware	
  upgrade	
  failed	
  …	
  but	
  we	
  tested	
  it	
  
What	
  was	
  missing	
  
•  Failures	
  don’t	
  happen	
  in	
  isola,on	
  
•  Induc,on	
  principle	
  does	
  not	
  work	
  
    –  If	
  something	
  works	
  once	
  does	
  not	
  mean	
  it	
  will	
  
       always	
  work	
  
•  Lack	
  of	
  tools	
  to	
  debug	
  issues	
  
    –  Could	
  not	
  iden,fy	
  the	
  cause	
  from	
  one	
  log	
  file	
  
•  Poor	
  coverage	
  
    –  Impossible	
  to	
  think	
  of	
  all	
  possible	
  test	
  cases	
  
What	
  we	
  learnt	
  
•  Test	
  with	
  all	
  components	
  integrated	
  
•  Simulate	
  real	
  produc,on	
  environment	
  
   –  Generate	
  load	
  
   –  Random	
  failures	
  of	
  mul,ple	
  components	
  
•  BeBer	
  debugging	
  tools	
  
   –  Need	
  to	
  co-­‐relate	
  messages	
  from	
  mul,ple	
  logs	
  
   –  Failure	
  is	
  a	
  symptom,	
  actual	
  reason	
  in	
  past	
  logs	
  of	
  
      different	
  machine.	
  
Data	
  driven	
  tes,ng	
  
•  Instrument	
  –	
  
       •  	
  Zookeeper,	
  controller,	
  par,cipant	
  logs	
  
•  Simulate	
  –	
  Chaos	
  monkey	
  
•  Analyze	
  –	
  Invariants	
  are	
  
       •  Respect	
  state	
  transi,on	
  constraints	
  
       •  Respect	
  state	
  count	
  constraints	
  
       •  And	
  so	
  on	
  
•  Debugging	
  made	
  easy	
  
       •  Reproduce	
  exact	
  sequence	
  of	
  events	
  	
  
	
  
                                                                    13	
  
Chaos	
  monkey	
  
•  Select	
  a	
  random	
  component(s)	
  to	
  fail	
  
•  How	
  should	
  it	
  fail	
  
    –  Hard/soi	
  failure	
  
    –  Network	
  Par,,on	
  
    –  Garbage	
  collec,on	
  
    –  Process	
  freeze	
  


    	
  
Automa,on	
  of	
  chaos	
  monkey	
  
•  Helix	
  agent	
  on	
  each	
  node	
  
                                                     STATE	
  MACHINE	
  
•  Modify	
  the	
  behavior	
  of	
  
   each	
  service	
  using	
  Helix	
                        STOPPED	
  

    –  Component	
  1	
  
        •  Node1:	
  RUNNING	
       STOP	
                    FREEZED	
               START	
  
        •  Node2:	
  STOPPED	
                  PAUSE	
                      UNPAUSE	
  
        •  Node3:	
  KILLED	
  
                                                              RUNNING	
  
    –  Component	
  2	
  
                                                            KILL	
  
        •  Node1:	
  STOPPED	
  
                                                                KILLED	
  
Pseudo	
  test	
  case	
  
setup	
  cluster	
  	
  
generate	
  load	
  
do	
  
   	
  (c,t)	
  =	
  components	
  to	
  fail	
  and	
  type	
  of	
  failure	
  
   	
  simulate	
  failure	
  
   	
  verify	
  system_is_stable	
  
   	
  restart	
  failed	
  components	
  
while(verify	
  system_is_stable)	
  
Test	
  case	
  failed	
  &	
  here	
  is	
  the	
  sequence	
  of	
  events	
  	
  
   	
  	
  
Cluster	
  verifica,on	
  
•  Verify	
  all	
  constraints	
  are	
  sa,sfied	
  
    –  Is	
  there	
  a	
  master	
  for	
  all	
  par,,on	
  
    –  Is	
  slave	
  replica,ng	
  	
  
    –  Node/component	
  down	
  should	
  not	
  maBer	
  
    –  Validate	
  every	
  ac,on	
  not	
  just	
  end	
  result	
  
        •  Having	
  master	
  is	
  not	
  good	
  enough,	
  if	
  two	
  nodes	
  
           became	
  master	
  and	
  later	
  one	
  of	
  them	
  died.	
  
Log	
  analysis	
  
•  Log	
  important	
  events	
  
    –  Becoming	
  master	
  from	
  slave	
  for	
  this	
  par,,on	
  at	
  
       this	
  ,me	
  
•  Tools	
  to	
  collect,	
  merge	
  &	
  analyze	
  logs	
  
    –  Parsed	
  zookeeper	
  transac,on	
  logs	
  
    –  Gathered	
  helix	
  controller,	
  par,cipant	
  logs	
  
    –  Sorted	
  on	
  ,me.	
  
•  Helix	
  provides	
  these	
  tools	
  out	
  of	
  the	
  box	
  
Structured	
  Log	
  File	
  –	
  sample	
  
 timestamp      partition     instanceName                   sessionId                  state

1323312236368   TestDB_123   express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc   OFFLINE

1323312236426   TestDB_123   express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc   OFFLINE

1323312236530   TestDB_123   express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc   OFFLINE

1323312236530   TestDB_91    express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc   OFFLINE

1323312236561   TestDB_123   express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc   SLAVE

1323312236561   TestDB_91    express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc   OFFLINE

1323312236685   TestDB_123   express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc   SLAVE

1323312236685   TestDB_91    express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc   OFFLINE

1323312236685   TestDB_60    express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc   OFFLINE

1323312236719   TestDB_123   express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc   SLAVE

1323312236719   TestDB_91    express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc   SLAVE

1323312236719   TestDB_60    express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc   OFFLINE

1323312236814   TestDB_123   express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc   SLAVE
Benefits	
  
•  Test	
  case	
  stops	
  as	
  soon	
  as	
  system	
  is	
  unstable	
  
    –  The	
  cluster	
  is	
  available	
  for	
  debugging	
  	
  


•  Provides	
  exact	
  sequence	
  of	
  events	
  
    –  Makes	
  it	
  easy	
  to	
  debug	
  and	
  reproduce	
  
    –  Best	
  part:	
  We	
  auto	
  generated	
  test	
  case.	
  	
  
    	
  	
  
Reproduce	
  the	
  issue	
  	
  
Start	
  state	
                                Orchestrate	
  the	
  sequence	
  
•  Helix	
  brings	
  the	
  system	
  to	
     •  Use	
  Helix	
  messaging	
  api	
  to	
  
   start	
  state.	
                               replay	
  the	
  events	
  
 {
     	
  
   "id" : "MyDataStore",
   "simpleFields" : {                              1.  Node1:MyDataStore_0: Master-Slave
     "IDEAL_STATE_MODE" : "CUSTOM",
     "NUM_PARTITIONS" : ”2",
     "REPLICAS" : "3",                             2. Node1:HARD KILL
     "STATE_MODEL_DEF_REF" : "MasterSlave",
   }
  "mapFields" : {                                  3. Node2:START
     "MyDataStore_0" : {
       "node1" : "MASTER",
       "node2" : "OFFLINE",
       "node3" : "SLAVE",
     },
     "MyDataStore_0" : {
       "node1" : "SLAVE",
       "node2" : "OFFLINE",
       "node3" : "MASTER",
     },
   }
 }
Constraint	
  viola,on	
  
No	
  more	
  than	
  R=2	
  slaves	
  

 Time             State              Number Slaves         Instance
42632          OFFLINE                    0          10.117.58.247_12918
42796            SLAVE                    1          10.117.58.247_12918
43124          OFFLINE                    1          10.202.187.155_12918
43131          OFFLINE                    1          10.220.225.153_12918
43275            SLAVE                    2          10.220.225.153_12918
43323            SLAVE                    3          10.202.187.155_12918
85795          MASTER                     2          10.220.225.153_12918
How	
  long	
  was	
  it	
  out	
  of	
  whack?	
  
Number	
  of	
  Slaves	
            Time	
  	
                          Percentage	
  
0	
                                 1082319	
                           0.5	
  
1	
                                 35578388	
                          16.46	
  
2	
                                 179417802	
                         82.99	
  
3	
                                 118863	
                            0.05	
  


              83%	
  of	
  the	
  ,me,	
  there	
  were	
  2	
  slaves	
  to	
  a	
  par,,on	
  
              93%	
  of	
  the	
  ,me,	
  there	
  was	
  1	
  master	
  to	
  a	
  par,,on	
  

Number	
  of	
  Masters	
           Time	
                              Percentage	
  
                  0                                15490456                        7.164960359
                  1                                200706916                       92.83503964
Invariant	
  2:	
  State	
  Transi,ons	
  
 FROM	
            TO	
            COUNT	
  

MASTER           SLAVE               55
OFFLINE        DROPPED                0
OFFLINE          SLAVE              298
SLAVE           MASTER              155
SLAVE           OFFLINE               0
Fun	
  facts	
  
•  For	
  almost	
  a	
  month	
  the	
  test	
  failed	
  to	
  run	
  
   successfully	
  for	
  one	
  night	
  
•  Most	
  issues	
  were	
  found	
  using	
  one	
  test	
  case	
  
•  Reproduced	
  almost	
  all	
  failures	
  
Conclusion	
  
•  Tradi,onal	
  approach	
  is	
  not	
  good	
  enough	
  
•  Data	
  driven	
  tes,ng	
  is	
  way	
  to	
  go	
  
   –  Focus	
  on	
  workload	
  and	
  analysis	
  
   –  Produc,on	
  system	
  always	
  in	
  test	
  mode	
  
   –  Leverage	
  tools	
  built	
  for	
  tes,ng	
  to	
  debug	
  
      produc,on	
  issues	
  
website	
   helix.incubator.apache.org	
  

users	
      user@helix.incubator.apache.org	
  

dev	
        dev@helix.incubator.apache.org	
  
twiBer	
     @apachehelix	
  




                                                   27	
  

Más contenido relacionado

La actualidad más candente

PuppetConf 2016: High Availability for Puppet – Russ Mull & Zack Smith, Puppet
PuppetConf 2016: High Availability for Puppet – Russ Mull & Zack Smith, PuppetPuppetConf 2016: High Availability for Puppet – Russ Mull & Zack Smith, Puppet
PuppetConf 2016: High Availability for Puppet – Russ Mull & Zack Smith, PuppetPuppet
 
Solr Exchange: Introduction to SolrCloud
Solr Exchange: Introduction to SolrCloudSolr Exchange: Introduction to SolrCloud
Solr Exchange: Introduction to SolrCloudthelabdude
 
Spark Summit Europe: Building a REST Job Server for interactive Spark as a se...
Spark Summit Europe: Building a REST Job Server for interactive Spark as a se...Spark Summit Europe: Building a REST Job Server for interactive Spark as a se...
Spark Summit Europe: Building a REST Job Server for interactive Spark as a se...gethue
 
Linux Cluster next generation
Linux Cluster next generationLinux Cluster next generation
Linux Cluster next generationsamsolutionsby
 
Scala & Spark(1.6) in Performance Aspect for Scala Taiwan
Scala & Spark(1.6) in Performance Aspect for Scala TaiwanScala & Spark(1.6) in Performance Aspect for Scala Taiwan
Scala & Spark(1.6) in Performance Aspect for Scala TaiwanJimin Hsieh
 
Linux-HA with Pacemaker
Linux-HA with PacemakerLinux-HA with Pacemaker
Linux-HA with PacemakerKris Buytaert
 
How does the Cloud Foundry Diego Project Run at Scale, and Updates on .NET Su...
How does the Cloud Foundry Diego Project Run at Scale, and Updates on .NET Su...How does the Cloud Foundry Diego Project Run at Scale, and Updates on .NET Su...
How does the Cloud Foundry Diego Project Run at Scale, and Updates on .NET Su...Amit Gupta
 
Hadoop Query Performance Smackdown
Hadoop Query Performance SmackdownHadoop Query Performance Smackdown
Hadoop Query Performance SmackdownDataWorks Summit
 
Using OVSDB and OpenFlow southbound plugins
Using OVSDB and OpenFlow southbound pluginsUsing OVSDB and OpenFlow southbound plugins
Using OVSDB and OpenFlow southbound pluginsOpenDaylight
 
AKUDA Labs: Pulsar
AKUDA Labs: PulsarAKUDA Labs: Pulsar
AKUDA Labs: PulsarAKUDA Labs
 
Inspecting a multi everything linux system (plmce2k14)
Inspecting a multi everything linux system (plmce2k14)Inspecting a multi everything linux system (plmce2k14)
Inspecting a multi everything linux system (plmce2k14)Frederic Descamps
 
Benchmark: Bananas vs Spark Streaming
Benchmark: Bananas vs Spark StreamingBenchmark: Bananas vs Spark Streaming
Benchmark: Bananas vs Spark StreamingAKUDA Labs
 
Storing Cassandra Metrics
Storing Cassandra MetricsStoring Cassandra Metrics
Storing Cassandra MetricsChris Lohfink
 
Solrcloud Leader Election
Solrcloud Leader ElectionSolrcloud Leader Election
Solrcloud Leader Electionravikgiitk
 
Robust ha solutions with proxysql
Robust ha solutions with proxysqlRobust ha solutions with proxysql
Robust ha solutions with proxysqlMarco Tusa
 
Akka 2.4 plus new commercial features in Typesafe Reactive Platform
Akka 2.4 plus new commercial features in Typesafe Reactive PlatformAkka 2.4 plus new commercial features in Typesafe Reactive Platform
Akka 2.4 plus new commercial features in Typesafe Reactive PlatformLegacy Typesafe (now Lightbend)
 

La actualidad más candente (17)

PuppetConf 2016: High Availability for Puppet – Russ Mull & Zack Smith, Puppet
PuppetConf 2016: High Availability for Puppet – Russ Mull & Zack Smith, PuppetPuppetConf 2016: High Availability for Puppet – Russ Mull & Zack Smith, Puppet
PuppetConf 2016: High Availability for Puppet – Russ Mull & Zack Smith, Puppet
 
Solr Exchange: Introduction to SolrCloud
Solr Exchange: Introduction to SolrCloudSolr Exchange: Introduction to SolrCloud
Solr Exchange: Introduction to SolrCloud
 
Apache Flink Hands On
Apache Flink Hands OnApache Flink Hands On
Apache Flink Hands On
 
Spark Summit Europe: Building a REST Job Server for interactive Spark as a se...
Spark Summit Europe: Building a REST Job Server for interactive Spark as a se...Spark Summit Europe: Building a REST Job Server for interactive Spark as a se...
Spark Summit Europe: Building a REST Job Server for interactive Spark as a se...
 
Linux Cluster next generation
Linux Cluster next generationLinux Cluster next generation
Linux Cluster next generation
 
Scala & Spark(1.6) in Performance Aspect for Scala Taiwan
Scala & Spark(1.6) in Performance Aspect for Scala TaiwanScala & Spark(1.6) in Performance Aspect for Scala Taiwan
Scala & Spark(1.6) in Performance Aspect for Scala Taiwan
 
Linux-HA with Pacemaker
Linux-HA with PacemakerLinux-HA with Pacemaker
Linux-HA with Pacemaker
 
How does the Cloud Foundry Diego Project Run at Scale, and Updates on .NET Su...
How does the Cloud Foundry Diego Project Run at Scale, and Updates on .NET Su...How does the Cloud Foundry Diego Project Run at Scale, and Updates on .NET Su...
How does the Cloud Foundry Diego Project Run at Scale, and Updates on .NET Su...
 
Hadoop Query Performance Smackdown
Hadoop Query Performance SmackdownHadoop Query Performance Smackdown
Hadoop Query Performance Smackdown
 
Using OVSDB and OpenFlow southbound plugins
Using OVSDB and OpenFlow southbound pluginsUsing OVSDB and OpenFlow southbound plugins
Using OVSDB and OpenFlow southbound plugins
 
AKUDA Labs: Pulsar
AKUDA Labs: PulsarAKUDA Labs: Pulsar
AKUDA Labs: Pulsar
 
Inspecting a multi everything linux system (plmce2k14)
Inspecting a multi everything linux system (plmce2k14)Inspecting a multi everything linux system (plmce2k14)
Inspecting a multi everything linux system (plmce2k14)
 
Benchmark: Bananas vs Spark Streaming
Benchmark: Bananas vs Spark StreamingBenchmark: Bananas vs Spark Streaming
Benchmark: Bananas vs Spark Streaming
 
Storing Cassandra Metrics
Storing Cassandra MetricsStoring Cassandra Metrics
Storing Cassandra Metrics
 
Solrcloud Leader Election
Solrcloud Leader ElectionSolrcloud Leader Election
Solrcloud Leader Election
 
Robust ha solutions with proxysql
Robust ha solutions with proxysqlRobust ha solutions with proxysql
Robust ha solutions with proxysql
 
Akka 2.4 plus new commercial features in Typesafe Reactive Platform
Akka 2.4 plus new commercial features in Typesafe Reactive PlatformAkka 2.4 plus new commercial features in Typesafe Reactive Platform
Akka 2.4 plus new commercial features in Typesafe Reactive Platform
 

Similar a Data driven testing: Case study with Apache Helix

Galera explained 3
Galera explained 3Galera explained 3
Galera explained 3Marco Tusa
 
Untangling Cluster Management with Helix
Untangling Cluster Management with HelixUntangling Cluster Management with Helix
Untangling Cluster Management with HelixAmy W. Tang
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsmarkgrover
 
Building Distributed Systems Using Helix
Building Distributed Systems Using HelixBuilding Distributed Systems Using Helix
Building Distributed Systems Using HelixAmy W. Tang
 
Concurrency in Distributed Systems : Leslie Lamport papers
Concurrency in Distributed Systems : Leslie Lamport papersConcurrency in Distributed Systems : Leslie Lamport papers
Concurrency in Distributed Systems : Leslie Lamport papersSubhajit Sahu
 
Cacheconcurrencyconsistency cassandra svcc
Cacheconcurrencyconsistency cassandra svccCacheconcurrencyconsistency cassandra svcc
Cacheconcurrencyconsistency cassandra svccsrisatish ambati
 
Verification with LoLA: 4 Using LoLA
Verification with LoLA: 4 Using LoLAVerification with LoLA: 4 Using LoLA
Verification with LoLA: 4 Using LoLAUniversität Rostock
 
Advanced pipelining
Advanced pipeliningAdvanced pipelining
Advanced pipelininga_spac
 
Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barce...
Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barce...Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barce...
Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barce...NoSQLmatters
 
Storm presentation
Storm presentationStorm presentation
Storm presentationShyam Raj
 
London devops logging
London devops loggingLondon devops logging
London devops loggingTomas Doran
 
The Back Propagation Learning Algorithm
The Back Propagation Learning AlgorithmThe Back Propagation Learning Algorithm
The Back Propagation Learning AlgorithmESCOM
 
Storm 2012-03-29
Storm 2012-03-29Storm 2012-03-29
Storm 2012-03-29Ted Dunning
 
Playing Go with Clojure
Playing Go with ClojurePlaying Go with Clojure
Playing Go with Clojureztellman
 

Similar a Data driven testing: Case study with Apache Helix (20)

Work items
Work itemsWork items
Work items
 
Work items
Work itemsWork items
Work items
 
Galera explained 3
Galera explained 3Galera explained 3
Galera explained 3
 
Apache Helix presentation at Vmware
Apache Helix presentation at VmwareApache Helix presentation at Vmware
Apache Helix presentation at Vmware
 
Untangling Cluster Management with Helix
Untangling Cluster Management with HelixUntangling Cluster Management with Helix
Untangling Cluster Management with Helix
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Building Distributed Systems Using Helix
Building Distributed Systems Using HelixBuilding Distributed Systems Using Helix
Building Distributed Systems Using Helix
 
Concurrency in Distributed Systems : Leslie Lamport papers
Concurrency in Distributed Systems : Leslie Lamport papersConcurrency in Distributed Systems : Leslie Lamport papers
Concurrency in Distributed Systems : Leslie Lamport papers
 
Cacheconcurrencyconsistency cassandra svcc
Cacheconcurrencyconsistency cassandra svccCacheconcurrencyconsistency cassandra svcc
Cacheconcurrencyconsistency cassandra svcc
 
Sd4 tignes 2013-2
Sd4 tignes 2013-2Sd4 tignes 2013-2
Sd4 tignes 2013-2
 
Verification with LoLA: 4 Using LoLA
Verification with LoLA: 4 Using LoLAVerification with LoLA: 4 Using LoLA
Verification with LoLA: 4 Using LoLA
 
Advanced pipelining
Advanced pipeliningAdvanced pipelining
Advanced pipelining
 
Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barce...
Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barce...Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barce...
Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barce...
 
CAP theorem by Ali Ghodsi
CAP theorem by Ali GhodsiCAP theorem by Ali Ghodsi
CAP theorem by Ali Ghodsi
 
ACM 2013-02-25
ACM 2013-02-25ACM 2013-02-25
ACM 2013-02-25
 
Storm presentation
Storm presentationStorm presentation
Storm presentation
 
London devops logging
London devops loggingLondon devops logging
London devops logging
 
The Back Propagation Learning Algorithm
The Back Propagation Learning AlgorithmThe Back Propagation Learning Algorithm
The Back Propagation Learning Algorithm
 
Storm 2012-03-29
Storm 2012-03-29Storm 2012-03-29
Storm 2012-03-29
 
Playing Go with Clojure
Playing Go with ClojurePlaying Go with Clojure
Playing Go with Clojure
 

Más de Kishore Gopalakrishna

Building real time analytics applications using pinot : A LinkedIn case study
Building real time analytics applications using pinot : A LinkedIn case studyBuilding real time analytics applications using pinot : A LinkedIn case study
Building real time analytics applications using pinot : A LinkedIn case studyKishore Gopalakrishna
 
Pinot: Realtime Distributed OLAP datastore
Pinot: Realtime Distributed OLAP datastorePinot: Realtime Distributed OLAP datastore
Pinot: Realtime Distributed OLAP datastoreKishore Gopalakrishna
 
Multi-Tenant Data Cloud with YARN & Helix
Multi-Tenant Data Cloud with YARN & HelixMulti-Tenant Data Cloud with YARN & Helix
Multi-Tenant Data Cloud with YARN & HelixKishore Gopalakrishna
 
Untangling cluster management with Helix
Untangling cluster management with HelixUntangling cluster management with Helix
Untangling cluster management with HelixKishore Gopalakrishna
 

Más de Kishore Gopalakrishna (6)

History of Apache Pinot
History of Apache Pinot History of Apache Pinot
History of Apache Pinot
 
Building real time analytics applications using pinot : A LinkedIn case study
Building real time analytics applications using pinot : A LinkedIn case studyBuilding real time analytics applications using pinot : A LinkedIn case study
Building real time analytics applications using pinot : A LinkedIn case study
 
Pinot: Realtime Distributed OLAP datastore
Pinot: Realtime Distributed OLAP datastorePinot: Realtime Distributed OLAP datastore
Pinot: Realtime Distributed OLAP datastore
 
Multi-Tenant Data Cloud with YARN & Helix
Multi-Tenant Data Cloud with YARN & HelixMulti-Tenant Data Cloud with YARN & Helix
Multi-Tenant Data Cloud with YARN & Helix
 
Helix talk at RelateIQ
Helix talk at RelateIQHelix talk at RelateIQ
Helix talk at RelateIQ
 
Untangling cluster management with Helix
Untangling cluster management with HelixUntangling cluster management with Helix
Untangling cluster management with Helix
 

Último

UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UbiTrack UK
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsSeth Reyes
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesMd Hossain Ali
 
GenAI and AI GCC State of AI_Object Automation Inc
GenAI and AI GCC State of AI_Object Automation IncGenAI and AI GCC State of AI_Object Automation Inc
GenAI and AI GCC State of AI_Object Automation IncObject Automation
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioChristian Posta
 
Cloud Revolution: Exploring the New Wave of Serverless Spatial Data
Cloud Revolution: Exploring the New Wave of Serverless Spatial DataCloud Revolution: Exploring the New Wave of Serverless Spatial Data
Cloud Revolution: Exploring the New Wave of Serverless Spatial DataSafe Software
 
Things you didn't know you can use in your Salesforce
Things you didn't know you can use in your SalesforceThings you didn't know you can use in your Salesforce
Things you didn't know you can use in your SalesforceMartin Humpolec
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationIES VE
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7DianaGray10
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfDianaGray10
 
Babel Compiler - Transforming JavaScript for All Browsers.pptx
Babel Compiler - Transforming JavaScript for All Browsers.pptxBabel Compiler - Transforming JavaScript for All Browsers.pptx
Babel Compiler - Transforming JavaScript for All Browsers.pptxYounusS2
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaborationbruanjhuli
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemAsko Soukka
 
20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf
20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf
20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdfJamie (Taka) Wang
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IES VE
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopBachir Benyammi
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAshyamraj55
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URLRuncy Oommen
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfAijun Zhang
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024SkyPlanner
 

Último (20)

UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and Hazards
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
 
GenAI and AI GCC State of AI_Object Automation Inc
GenAI and AI GCC State of AI_Object Automation IncGenAI and AI GCC State of AI_Object Automation Inc
GenAI and AI GCC State of AI_Object Automation Inc
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and Istio
 
Cloud Revolution: Exploring the New Wave of Serverless Spatial Data
Cloud Revolution: Exploring the New Wave of Serverless Spatial DataCloud Revolution: Exploring the New Wave of Serverless Spatial Data
Cloud Revolution: Exploring the New Wave of Serverless Spatial Data
 
Things you didn't know you can use in your Salesforce
Things you didn't know you can use in your SalesforceThings you didn't know you can use in your Salesforce
Things you didn't know you can use in your Salesforce
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
 
Babel Compiler - Transforming JavaScript for All Browsers.pptx
Babel Compiler - Transforming JavaScript for All Browsers.pptxBabel Compiler - Transforming JavaScript for All Browsers.pptx
Babel Compiler - Transforming JavaScript for All Browsers.pptx
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystem
 
20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf
20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf
20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 Workshop
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URL
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdf
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024
 

Data driven testing: Case study with Apache Helix

  • 1. Data  Driven  Tes,ng  for     Distributed  Systems     Case  study  with  Apache  Helix   Kishore  Gopalakrishna,  @kishoreg1980   hBp://www.linkedin.com/in/kgopalak    
  • 2. Outline   •  Intro  to  Helix   •  Use  case:  Distributed  data  store   •  Tradi,onal  approach   •  Data  driven  tes,ng   •  Q  &  A  
  • 3. What  is  Helix   •  Generic  cluster  management  framework   –  Par,,on  management   –  Failure  detec,on  and  handling   –  Elas,city  
  • 4. Terminologies     Node   A  single  machine   Cluster   Set  of  Nodes   Resource   A  logical  en/ty  e.g.  database,  index,  task   Par,,on   Subset  of  the  resource.   Replica   Copy  of  a  par,,on   State   Status  of  a  par,,on  replica,  e.g  Master,  Slave   Transi,on   Ac,on  that  lets  replicas  change  status  e.g  Slave  -­‐>  Master   4  
  • 5. Core  concept:  Augmented  finite  state  machine   State  Machine   Constraints   Objec,ves   • States   • States   • Par,,on  Placement   • S1,S2,S3   • S1à  max=1,  S2=min=2   • Failure  seman,cs   • Transi,on   • Transi,ons   • S1àS2,  S2àS1,  S2àS3,   • Concurrent(S1-­‐>S2)   S3àS1     across  cluster  <  5     5  
  • 6. Helix  usage  at  LinkedIn       Espresso   6  
  • 7. Use  case:  Distributed  data  store   •  Timeline  consistent  par,,oned  data  store   •  One  master  replica  per  par,,on   •  Even  distribu,on  of  master/slave   •  On  failure:  promote  slave  to  master   P.1   P.2   P.3   P.5   P.6   P.7   P.9   P.10   P.11   P.4   P.5   P.6   P.8   P.1   P.2   P.12   P.3   P.4   P.1   P.9   P.10   P.11   P.12   P.7   P.8   Node  1   Node  2   Node  3  
  • 8. Helix  based  solu,on   State  Machine   Constraints   Objec,ves   • States   • States   • Par,,on  Placement   • Offline,  Slave,  Master   • M=1,  S=2   • Failure  seman,cs   • Transi,on   • Transi,ons   • O-­‐>S,  S-­‐>M,S-­‐>M,  M-­‐>S   • concurrent(0-­‐>S)  <  5     COUNT=2 minimize(maxnj∈N  S(nj)  ) t1≤5 S   t1 t2 t3 t4 O   M   COUNT=1 minimize(maxnj∈N  M(nj)  ) 8  
  • 9. Tes,ng   •  Happy  path  func,onality   –  Meet  SLA   •    99th  percen,le  latency  etc   –  Writes  to  master   •  Non  happy  path   –  System  failures     –  Applica,on  failures   –  How  does  system  behave  in  such  scenarios    
  • 10. Non  happy  path  -­‐  Tradi,onal  approach   •  Iden,fy  scenarios  of  interest   –  Node  failure   –  System  upgrade   •  Tested  each  scenario  in  isola,on  via  test  case   –  All  test  passed  J   •  Deployed  in  alpha   –  First  soiware  upgrade  failed  …  but  we  tested  it  
  • 11. What  was  missing   •  Failures  don’t  happen  in  isola,on   •  Induc,on  principle  does  not  work   –  If  something  works  once  does  not  mean  it  will   always  work   •  Lack  of  tools  to  debug  issues   –  Could  not  iden,fy  the  cause  from  one  log  file   •  Poor  coverage   –  Impossible  to  think  of  all  possible  test  cases  
  • 12. What  we  learnt   •  Test  with  all  components  integrated   •  Simulate  real  produc,on  environment   –  Generate  load   –  Random  failures  of  mul,ple  components   •  BeBer  debugging  tools   –  Need  to  co-­‐relate  messages  from  mul,ple  logs   –  Failure  is  a  symptom,  actual  reason  in  past  logs  of   different  machine.  
  • 13. Data  driven  tes,ng   •  Instrument  –   •   Zookeeper,  controller,  par,cipant  logs   •  Simulate  –  Chaos  monkey   •  Analyze  –  Invariants  are   •  Respect  state  transi,on  constraints   •  Respect  state  count  constraints   •  And  so  on   •  Debugging  made  easy   •  Reproduce  exact  sequence  of  events       13  
  • 14. Chaos  monkey   •  Select  a  random  component(s)  to  fail   •  How  should  it  fail   –  Hard/soi  failure   –  Network  Par,,on   –  Garbage  collec,on   –  Process  freeze    
  • 15. Automa,on  of  chaos  monkey   •  Helix  agent  on  each  node   STATE  MACHINE   •  Modify  the  behavior  of   each  service  using  Helix   STOPPED   –  Component  1   •  Node1:  RUNNING   STOP   FREEZED   START   •  Node2:  STOPPED   PAUSE   UNPAUSE   •  Node3:  KILLED   RUNNING   –  Component  2   KILL   •  Node1:  STOPPED   KILLED  
  • 16. Pseudo  test  case   setup  cluster     generate  load   do    (c,t)  =  components  to  fail  and  type  of  failure    simulate  failure    verify  system_is_stable    restart  failed  components   while(verify  system_is_stable)   Test  case  failed  &  here  is  the  sequence  of  events        
  • 17. Cluster  verifica,on   •  Verify  all  constraints  are  sa,sfied   –  Is  there  a  master  for  all  par,,on   –  Is  slave  replica,ng     –  Node/component  down  should  not  maBer   –  Validate  every  ac,on  not  just  end  result   •  Having  master  is  not  good  enough,  if  two  nodes   became  master  and  later  one  of  them  died.  
  • 18. Log  analysis   •  Log  important  events   –  Becoming  master  from  slave  for  this  par,,on  at   this  ,me   •  Tools  to  collect,  merge  &  analyze  logs   –  Parsed  zookeeper  transac,on  logs   –  Gathered  helix  controller,  par,cipant  logs   –  Sorted  on  ,me.   •  Helix  provides  these  tools  out  of  the  box  
  • 19. Structured  Log  File  –  sample   timestamp partition instanceName sessionId state 1323312236368 TestDB_123 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE 1323312236426 TestDB_123 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE 1323312236530 TestDB_123 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE 1323312236530 TestDB_91 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE 1323312236561 TestDB_123 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc SLAVE 1323312236561 TestDB_91 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE 1323312236685 TestDB_123 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc SLAVE 1323312236685 TestDB_91 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE 1323312236685 TestDB_60 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE 1323312236719 TestDB_123 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc SLAVE 1323312236719 TestDB_91 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc SLAVE 1323312236719 TestDB_60 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE 1323312236814 TestDB_123 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc SLAVE
  • 20. Benefits   •  Test  case  stops  as  soon  as  system  is  unstable   –  The  cluster  is  available  for  debugging     •  Provides  exact  sequence  of  events   –  Makes  it  easy  to  debug  and  reproduce   –  Best  part:  We  auto  generated  test  case.        
  • 21. Reproduce  the  issue     Start  state   Orchestrate  the  sequence   •  Helix  brings  the  system  to   •  Use  Helix  messaging  api  to   start  state.   replay  the  events   {   "id" : "MyDataStore", "simpleFields" : { 1.  Node1:MyDataStore_0: Master-Slave "IDEAL_STATE_MODE" : "CUSTOM", "NUM_PARTITIONS" : ”2", "REPLICAS" : "3", 2. Node1:HARD KILL "STATE_MODEL_DEF_REF" : "MasterSlave", } "mapFields" : { 3. Node2:START "MyDataStore_0" : { "node1" : "MASTER", "node2" : "OFFLINE", "node3" : "SLAVE", }, "MyDataStore_0" : { "node1" : "SLAVE", "node2" : "OFFLINE", "node3" : "MASTER", }, } }
  • 22. Constraint  viola,on   No  more  than  R=2  slaves   Time State Number Slaves Instance 42632 OFFLINE 0 10.117.58.247_12918 42796 SLAVE 1 10.117.58.247_12918 43124 OFFLINE 1 10.202.187.155_12918 43131 OFFLINE 1 10.220.225.153_12918 43275 SLAVE 2 10.220.225.153_12918 43323 SLAVE 3 10.202.187.155_12918 85795 MASTER 2 10.220.225.153_12918
  • 23. How  long  was  it  out  of  whack?   Number  of  Slaves   Time     Percentage   0   1082319   0.5   1   35578388   16.46   2   179417802   82.99   3   118863   0.05   83%  of  the  ,me,  there  were  2  slaves  to  a  par,,on   93%  of  the  ,me,  there  was  1  master  to  a  par,,on   Number  of  Masters   Time   Percentage   0 15490456 7.164960359 1 200706916 92.83503964
  • 24. Invariant  2:  State  Transi,ons   FROM   TO   COUNT   MASTER SLAVE 55 OFFLINE DROPPED 0 OFFLINE SLAVE 298 SLAVE MASTER 155 SLAVE OFFLINE 0
  • 25. Fun  facts   •  For  almost  a  month  the  test  failed  to  run   successfully  for  one  night   •  Most  issues  were  found  using  one  test  case   •  Reproduced  almost  all  failures  
  • 26. Conclusion   •  Tradi,onal  approach  is  not  good  enough   •  Data  driven  tes,ng  is  way  to  go   –  Focus  on  workload  and  analysis   –  Produc,on  system  always  in  test  mode   –  Leverage  tools  built  for  tes,ng  to  debug   produc,on  issues  
  • 27. website   helix.incubator.apache.org   users   user@helix.incubator.apache.org   dev   dev@helix.incubator.apache.org   twiBer   @apachehelix   27  

Notas del editor

  1. In this slide, we will look at the problem from a different perspective and possibly re-define the cluster management problem.So re-cap to solve dds we need to define number of partitions and replicas, and for each replicas we need to different roles like master/slave etcOne of the well proven way to express such behavior is use a state machine
  2. Used in production and manage the core infrastructure components in the companyOperation is easy and easy for dev ops to operate multiple systems
  3. Coverage
  4. Mention coverage