SlideShare una empresa de Scribd logo
1 de 28
Bug bites Elephant?
Test-driven Quality Assurance
in Big Data Application Development
Dr. Dominik Benz, inovex GmbH
2013/06/03, Berlin Buzzwords
? TDD!
?
?
?
?
?
?
?
Write/execute tests, specify
acceptance criteria, …
2
Who speaks… … the Elephant language?
Class A
extends
Mapper…
ROI, $$, …
apt-get
install…
3
The road… … to Big Data QA
our Big Data
QA problem
the FitNesse
approach
test data
definition /
selection
job & workflow
control
result
inspection
4
QA problem Web Intelligence @ 1&1
DWH
Hadoop Cluster
~ 1 billion log
events / day,
~ 1 TB (thrift)
logfiles
chains of MR jobs,
running on
20 nodes / 8 cores /
96 GB RAM (CDH)
BI reporting, web
analytics, …
5
QA problem An exemplary workflow
Log Files
(thrift)
Log Files
(thrift)
Log Files
(thrift)
Inter-
mediate
result
(avro)
MR
job 1
… DWH
(RDBMS)
MR
job 2
create
(sample)
input data
create
(sample)
input data
? inspect
(binary)
formats
inspect
(binary)
formats
? control
workflows
control
workflows
?
method tests what? issues for our usecase
JUnit isolated functions no integration, Java syntax
MRUnit 1 mapper + 1 reducer „little“ integration, Java
syntax
iTest hadoop jobs/workflows Java / Groovy syntax
Scripts/CLI (manual) scripting/inspect. „script chaos“, syntax
6
QA problem Existing Approaches
FitNesse as suitable addition / solution!
7
The road… … to Big Data QA
Big Data QA
is different!
the FitNesse
approach
test data
definition /
selection
job & workflow
control
result
inspection
8
FitNesse In a nutshell
„fully integrated
standalone wiki and
acceptance testing
framework”
• „executable“ Wiki-
Pages (returning test
results)
• (almost) natural
language test
specification
• connection to SUT
via (Java-)“Fixtures“
9
FitNesse Architecture Overview
script |
check |
num results |
3 |
Browser
FitNesse
Server
public int
numResults
{ ... }
System under Test
Fixtures
„calling java methods from
wiki“, compare return values
Integrates with REST, Jenkins…
10
FitNesse An Exemplary Test
11
FitNesse Exemplary Test Source
!path /home/inovex/lib/*.jar
| script | Hadoop |
| upload | viewLog.csv | to hdfs | /testdata/ |
| hadoop job from jar | viewLog.jar | [...] |
| show | job output |
| check | number of output files | 3 |
12
FitNesse Hadoop Fixture Java Code
public class Hadoop {
public boolean uploadToHdfs(String localFile,
String remoteFile) {...}
public boolean hadoopJobFromJar(String jar,
String input, String output) {...}
public String jobOutput() {...}
public String numberOfOutputFiles() {...}
}
13
The road… … to Big Data QA
Big Data QA
is different!
Fitnesse Wiki
test execution!
test data
definition /
selection
job & workflow
control
result
inspection
14
Test Data CSV
‣ Big Data: Efficient data transfer among heterogeneous sources
‣ Define Interface via IDL, Compiler for many languages
15
Test Data Thrift
‣ Dev/Test Hadoop Cluster: Identical Hardware like Prod, but
fewer nodes
‣ (random/biased) sampling e.g. on daily basis
‣ Feedback loop:
‣ identify „special cases“ from real data
‣ include them in (manual) data definition
‣ Gradually increase test coverage / artefact quality
16
Test Data Real World Data
17
The road… … to Big Data QA
Big Data QA
is different!
FitNesse Wiki
test execution!
Define CSV /
thrift / real-
world test data!
job & workflow
control
result
inspection
‣ Execute arbitrary (shell) commands
‣ Mainly a wrapper around apache.commons.exec.CommandLine
18
Job Control Swiss Army Knife: Shell
‣ Hide complexity from test authors
‣ „define“ appropriate test language via (Java) method names
‣ re-use other fixtures (Shell, …) internally
19
Job Control Hadoop Fixture
‣ FitNesse allows to group tests into suites
‣ Can be used to simulate MR processing chains
‣ SetupSuite / TearDownSuite for creating /
destroying test conditions
‣ Tests can still be executed individually
20
Job Control Workflows & Suites
MR job 1
MR job 2
21
The road… … to Big Data QA
Big Data QA
is different!
FitNesse Wiki
test execution!
Define CSV /
thrift / real-
world data!
Use suites & fixtures
for jobs/workflows!
result
inspection
‣ Validate RDBMS contents (via JDBC)
‣ E.g. for checking the final result
‣ Or use Hive + Hive-Server to query raw data
22
Results Data Warehouse / Hive
‣ Execute arbitrary pig commands from Wiki page
‣ Inspect e.g. binary intermediate results (avro, …)
23
Results Pig
public class PigConsole extends PigServer {
public void loadAvroFileUsingAlias(String
filename, String alias) {
this.registerQuery(
alias + "= LOAD" + filename + "USING" +
AVRO_STORAGE_LOADER + ";");
}
}
24
Results Pig Fixture extends PigServer
25
Results Server Infrastructure
Fitnesse Master
TestEnvironments
ProjA ProjB
TestConfigurations
ProjA ProjB
dev qs live dev qs live
Import / edit
tests remotely
QS
ProjA Slave
Dev
ProjA Slave
Live
ProjA SlaveProjA
QS
ProjA Slave
Dev
ProjA Slave
Live
ProjA Slave
Import / edit
config remotely
dev qs live
26
Thank you! dominik.benz@inovex.de
Big Data QA
is different!
FitNesse Wiki
test execution!
Define CSV /
thrift / real-
world data!
Inspect results
via Pig/Hive
Use suites & fixtures
for jobs/workflows!
27
Want more? Inovex trains you!
Android Developer Training (3 days, Karlsruhe/München)
Hadoop Developer Training (3 days, Karlsruhe/Köln)
Certified Scrum Developer Training (5 days, Köln)
Pentaho Data Integration Training (4 days, München/Köln)
Liferay Portal-Admin Training (3 days, Karlsruhe)
Liferay Portal-Developer Training (4 days, Karlsruhe)
information and registration at
www.inovex.de/offene-trainings
28
Inovex @bbuzz
Stefan Kathrin Bernhard
Jörg
Andrew
Christian Christian

Más contenido relacionado

La actualidad más candente

Real-Time Integration Between MongoDB and SQL Databases
Real-Time Integration Between MongoDB and SQL Databases Real-Time Integration Between MongoDB and SQL Databases
Real-Time Integration Between MongoDB and SQL Databases
MongoDB
 

La actualidad más candente (10)

Ajax Lecture Notes
Ajax Lecture NotesAjax Lecture Notes
Ajax Lecture Notes
 
RDKit UGM 2016: Higher Quality Chemical Depictions
RDKit UGM 2016: Higher Quality Chemical DepictionsRDKit UGM 2016: Higher Quality Chemical Depictions
RDKit UGM 2016: Higher Quality Chemical Depictions
 
Meetup spark structured streaming
Meetup spark structured streamingMeetup spark structured streaming
Meetup spark structured streaming
 
2014 09 30_sparkling_water_hands_on
2014 09 30_sparkling_water_hands_on2014 09 30_sparkling_water_hands_on
2014 09 30_sparkling_water_hands_on
 
A Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with LuigiA Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with Luigi
 
Real-Time Integration Between MongoDB and SQL Databases
Real-Time Integration Between MongoDB and SQL Databases Real-Time Integration Between MongoDB and SQL Databases
Real-Time Integration Between MongoDB and SQL Databases
 
Onyx data processing the clojure way
Onyx   data processing  the clojure wayOnyx   data processing  the clojure way
Onyx data processing the clojure way
 
Deep Dumpster Diving
Deep Dumpster DivingDeep Dumpster Diving
Deep Dumpster Diving
 
Presentation of OrientDB v2.2 - Webinar
Presentation of OrientDB v2.2 - WebinarPresentation of OrientDB v2.2 - Webinar
Presentation of OrientDB v2.2 - Webinar
 
Interactive Session on Sparkling Water
Interactive Session on Sparkling WaterInteractive Session on Sparkling Water
Interactive Session on Sparkling Water
 

Destacado

Big data and the data quality imperative
Big data and the data quality imperativeBig data and the data quality imperative
Big data and the data quality imperative
Trillium Software
 
Data Quality Challenges to Big Data_Practical Insights_KPMG Presentation 20.4...
Data Quality Challenges to Big Data_Practical Insights_KPMG Presentation 20.4...Data Quality Challenges to Big Data_Practical Insights_KPMG Presentation 20.4...
Data Quality Challenges to Big Data_Practical Insights_KPMG Presentation 20.4...
Hugo van Hoogstraten
 

Destacado (20)

Dudley dolan q-validus_big data_workshop_dcu_event_aqua_smart_march16_ final
Dudley dolan q-validus_big data_workshop_dcu_event_aqua_smart_march16_ finalDudley dolan q-validus_big data_workshop_dcu_event_aqua_smart_march16_ final
Dudley dolan q-validus_big data_workshop_dcu_event_aqua_smart_march16_ final
 
Metadata quality in digital repositories
Metadata quality in digital repositoriesMetadata quality in digital repositories
Metadata quality in digital repositories
 
Big Data Everywhere Chicago: The Big Data Imperative -- Discovering & Protect...
Big Data Everywhere Chicago: The Big Data Imperative -- Discovering & Protect...Big Data Everywhere Chicago: The Big Data Imperative -- Discovering & Protect...
Big Data Everywhere Chicago: The Big Data Imperative -- Discovering & Protect...
 
Big data governance as a corporate governance imperative
Big data governance as a corporate governance imperativeBig data governance as a corporate governance imperative
Big data governance as a corporate governance imperative
 
Big Data Expo 2015 - Trillium software Big Data and the Data Quality
Big Data Expo 2015 - Trillium software Big Data and the Data QualityBig Data Expo 2015 - Trillium software Big Data and the Data Quality
Big Data Expo 2015 - Trillium software Big Data and the Data Quality
 
Data donderdag data quality sas
Data donderdag data quality sasData donderdag data quality sas
Data donderdag data quality sas
 
Quality Approaches to Big Data
Quality Approaches to Big DataQuality Approaches to Big Data
Quality Approaches to Big Data
 
Big Data Quality Panel : Diachron Workshop @EDBT
Big Data Quality Panel: Diachron Workshop @EDBTBig Data Quality Panel: Diachron Workshop @EDBT
Big Data Quality Panel : Diachron Workshop @EDBT
 
Data quality - The True Big Data Challenge
Data quality - The True Big Data ChallengeData quality - The True Big Data Challenge
Data quality - The True Big Data Challenge
 
Big data and the data quality imperative
Big data and the data quality imperativeBig data and the data quality imperative
Big data and the data quality imperative
 
Strengthening the Quality of Big Data Implementations
Strengthening the Quality of Big Data ImplementationsStrengthening the Quality of Big Data Implementations
Strengthening the Quality of Big Data Implementations
 
Acquire Grow & Retain customers - The business imperative for Big Data
Acquire Grow & Retain customers - The business imperative for Big DataAcquire Grow & Retain customers - The business imperative for Big Data
Acquire Grow & Retain customers - The business imperative for Big Data
 
Towards Quality-Aware Development of Big Data Applications with DICE
Towards Quality-Aware Development of Big Data Applications with DICETowards Quality-Aware Development of Big Data Applications with DICE
Towards Quality-Aware Development of Big Data Applications with DICE
 
Big data for quality education
Big data for quality educationBig data for quality education
Big data for quality education
 
Data profiling-best-practices
Data profiling-best-practicesData profiling-best-practices
Data profiling-best-practices
 
Metadata Workshop
Metadata WorkshopMetadata Workshop
Metadata Workshop
 
HUGE List of IEP Goals
HUGE List of IEP Goals HUGE List of IEP Goals
HUGE List of IEP Goals
 
Big Data Trends
Big Data TrendsBig Data Trends
Big Data Trends
 
DICE & Cloudify – Quality Big Data Made Easy
DICE & Cloudify – Quality Big Data Made EasyDICE & Cloudify – Quality Big Data Made Easy
DICE & Cloudify – Quality Big Data Made Easy
 
Data Quality Challenges to Big Data_Practical Insights_KPMG Presentation 20.4...
Data Quality Challenges to Big Data_Practical Insights_KPMG Presentation 20.4...Data Quality Challenges to Big Data_Practical Insights_KPMG Presentation 20.4...
Data Quality Challenges to Big Data_Practical Insights_KPMG Presentation 20.4...
 

Similar a Bug bites Elephant? Test-driven Quality Assurance in Big Data Application Development

QA Fest 2019. Дмитрий Собко. Testing Big Data solutions fast and furiously
QA Fest 2019. Дмитрий Собко. Testing Big Data solutions fast and furiouslyQA Fest 2019. Дмитрий Собко. Testing Big Data solutions fast and furiously
QA Fest 2019. Дмитрий Собко. Testing Big Data solutions fast and furiously
QAFest
 

Similar a Bug bites Elephant? Test-driven Quality Assurance in Big Data Application Development (20)

QA Fest 2019. Дмитрий Собко. Testing Big Data solutions fast and furiously
QA Fest 2019. Дмитрий Собко. Testing Big Data solutions fast and furiouslyQA Fest 2019. Дмитрий Собко. Testing Big Data solutions fast and furiously
QA Fest 2019. Дмитрий Собко. Testing Big Data solutions fast and furiously
 
Testing Big Data solutions fast and furiously
Testing Big Data solutions fast and furiouslyTesting Big Data solutions fast and furiously
Testing Big Data solutions fast and furiously
 
Productionalizing ML : Real Experience
Productionalizing ML : Real ExperienceProductionalizing ML : Real Experience
Productionalizing ML : Real Experience
 
Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0
 
Hadoop: Code Injection, Distributed Fault Injection
Hadoop: Code Injection, Distributed Fault InjectionHadoop: Code Injection, Distributed Fault Injection
Hadoop: Code Injection, Distributed Fault Injection
 
Java one 2010
Java one 2010Java one 2010
Java one 2010
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedIn
 
Inside Azure Diagnostics (DevLink 2014)
Inside Azure Diagnostics (DevLink 2014)Inside Azure Diagnostics (DevLink 2014)
Inside Azure Diagnostics (DevLink 2014)
 
sudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJAsudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJA
 
What to expect from Java 9
What to expect from Java 9What to expect from Java 9
What to expect from Java 9
 
Inside Azure Diagnostics
Inside Azure DiagnosticsInside Azure Diagnostics
Inside Azure Diagnostics
 
Scalable and Flexible Machine Learning With Scala @ LinkedIn
Scalable and Flexible Machine Learning With Scala @ LinkedInScalable and Flexible Machine Learning With Scala @ LinkedIn
Scalable and Flexible Machine Learning With Scala @ LinkedIn
 
Hadoop & Big Data benchmarking
Hadoop & Big Data benchmarkingHadoop & Big Data benchmarking
Hadoop & Big Data benchmarking
 
Deep Dive Modern Apps Lifecycle with Visual Studio 2012: How to create cross ...
Deep Dive Modern Apps Lifecycle with Visual Studio 2012: How to create cross ...Deep Dive Modern Apps Lifecycle with Visual Studio 2012: How to create cross ...
Deep Dive Modern Apps Lifecycle with Visual Studio 2012: How to create cross ...
 
Start with version control and experiments management in machine learning
Start with version control and experiments management in machine learningStart with version control and experiments management in machine learning
Start with version control and experiments management in machine learning
 
Basic of Big Data
Basic of Big Data Basic of Big Data
Basic of Big Data
 
Big Data Analytics using Mahout
Big Data Analytics using MahoutBig Data Analytics using Mahout
Big Data Analytics using Mahout
 
Basics of big data analytics hadoop
Basics of big data analytics hadoopBasics of big data analytics hadoop
Basics of big data analytics hadoop
 
OWASP ZAP Workshop for QA Testers
OWASP ZAP Workshop for QA TestersOWASP ZAP Workshop for QA Testers
OWASP ZAP Workshop for QA Testers
 

Más de inovex GmbH

Interpretable Machine Learning
Interpretable Machine LearningInterpretable Machine Learning
Interpretable Machine Learning
inovex GmbH
 
Deep Learning for Recommender Systems
Deep Learning for Recommender SystemsDeep Learning for Recommender Systems
Deep Learning for Recommender Systems
inovex GmbH
 
Representation Learning von Zeitreihen
Representation Learning von ZeitreihenRepresentation Learning von Zeitreihen
Representation Learning von Zeitreihen
inovex GmbH
 
Performance evaluation of GANs in a semisupervised OCR use case
Performance evaluation of GANs in a semisupervised OCR use casePerformance evaluation of GANs in a semisupervised OCR use case
Performance evaluation of GANs in a semisupervised OCR use case
inovex GmbH
 

Más de inovex GmbH (20)

lldb – Debugger auf Abwegen
lldb – Debugger auf Abwegenlldb – Debugger auf Abwegen
lldb – Debugger auf Abwegen
 
Are you sure about that?! Uncertainty Quantification in AI
Are you sure about that?! Uncertainty Quantification in AIAre you sure about that?! Uncertainty Quantification in AI
Are you sure about that?! Uncertainty Quantification in AI
 
Why natural language is next step in the AI evolution
Why natural language is next step in the AI evolutionWhy natural language is next step in the AI evolution
Why natural language is next step in the AI evolution
 
WWDC 2019 Recap
WWDC 2019 RecapWWDC 2019 Recap
WWDC 2019 Recap
 
Network Policies
Network PoliciesNetwork Policies
Network Policies
 
Interpretable Machine Learning
Interpretable Machine LearningInterpretable Machine Learning
Interpretable Machine Learning
 
Jenkins X – CI/CD in wolkigen Umgebungen
Jenkins X – CI/CD in wolkigen UmgebungenJenkins X – CI/CD in wolkigen Umgebungen
Jenkins X – CI/CD in wolkigen Umgebungen
 
AI auf Edge-Geraeten
AI auf Edge-GeraetenAI auf Edge-Geraeten
AI auf Edge-Geraeten
 
Prometheus on Kubernetes
Prometheus on KubernetesPrometheus on Kubernetes
Prometheus on Kubernetes
 
Deep Learning for Recommender Systems
Deep Learning for Recommender SystemsDeep Learning for Recommender Systems
Deep Learning for Recommender Systems
 
Azure IoT Edge
Azure IoT EdgeAzure IoT Edge
Azure IoT Edge
 
Representation Learning von Zeitreihen
Representation Learning von ZeitreihenRepresentation Learning von Zeitreihen
Representation Learning von Zeitreihen
 
Talk to me – Chatbots und digitale Assistenten
Talk to me – Chatbots und digitale AssistentenTalk to me – Chatbots und digitale Assistenten
Talk to me – Chatbots und digitale Assistenten
 
Künstlich intelligent?
Künstlich intelligent?Künstlich intelligent?
Künstlich intelligent?
 
Dev + Ops = Go
Dev + Ops = GoDev + Ops = Go
Dev + Ops = Go
 
Das Android Open Source Project
Das Android Open Source ProjectDas Android Open Source Project
Das Android Open Source Project
 
Machine Learning Interpretability
Machine Learning InterpretabilityMachine Learning Interpretability
Machine Learning Interpretability
 
Performance evaluation of GANs in a semisupervised OCR use case
Performance evaluation of GANs in a semisupervised OCR use casePerformance evaluation of GANs in a semisupervised OCR use case
Performance evaluation of GANs in a semisupervised OCR use case
 
People & Products – Lessons learned from the daily IT madness
People & Products – Lessons learned from the daily IT madnessPeople & Products – Lessons learned from the daily IT madness
People & Products – Lessons learned from the daily IT madness
 
Infrastructure as (real) Code – Manage your K8s resources with Pulumi
Infrastructure as (real) Code – Manage your K8s resources with PulumiInfrastructure as (real) Code – Manage your K8s resources with Pulumi
Infrastructure as (real) Code – Manage your K8s resources with Pulumi
 

Último

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Último (20)

ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

Bug bites Elephant? Test-driven Quality Assurance in Big Data Application Development

  • 1. Bug bites Elephant? Test-driven Quality Assurance in Big Data Application Development Dr. Dominik Benz, inovex GmbH 2013/06/03, Berlin Buzzwords
  • 2. ? TDD! ? ? ? ? ? ? ? Write/execute tests, specify acceptance criteria, … 2 Who speaks… … the Elephant language? Class A extends Mapper… ROI, $$, … apt-get install…
  • 3. 3 The road… … to Big Data QA our Big Data QA problem the FitNesse approach test data definition / selection job & workflow control result inspection
  • 4. 4 QA problem Web Intelligence @ 1&1 DWH Hadoop Cluster ~ 1 billion log events / day, ~ 1 TB (thrift) logfiles chains of MR jobs, running on 20 nodes / 8 cores / 96 GB RAM (CDH) BI reporting, web analytics, …
  • 5. 5 QA problem An exemplary workflow Log Files (thrift) Log Files (thrift) Log Files (thrift) Inter- mediate result (avro) MR job 1 … DWH (RDBMS) MR job 2 create (sample) input data create (sample) input data ? inspect (binary) formats inspect (binary) formats ? control workflows control workflows ?
  • 6. method tests what? issues for our usecase JUnit isolated functions no integration, Java syntax MRUnit 1 mapper + 1 reducer „little“ integration, Java syntax iTest hadoop jobs/workflows Java / Groovy syntax Scripts/CLI (manual) scripting/inspect. „script chaos“, syntax 6 QA problem Existing Approaches FitNesse as suitable addition / solution!
  • 7. 7 The road… … to Big Data QA Big Data QA is different! the FitNesse approach test data definition / selection job & workflow control result inspection
  • 8. 8 FitNesse In a nutshell „fully integrated standalone wiki and acceptance testing framework” • „executable“ Wiki- Pages (returning test results) • (almost) natural language test specification • connection to SUT via (Java-)“Fixtures“
  • 9. 9 FitNesse Architecture Overview script | check | num results | 3 | Browser FitNesse Server public int numResults { ... } System under Test Fixtures „calling java methods from wiki“, compare return values Integrates with REST, Jenkins…
  • 11. 11 FitNesse Exemplary Test Source !path /home/inovex/lib/*.jar | script | Hadoop | | upload | viewLog.csv | to hdfs | /testdata/ | | hadoop job from jar | viewLog.jar | [...] | | show | job output | | check | number of output files | 3 |
  • 12. 12 FitNesse Hadoop Fixture Java Code public class Hadoop { public boolean uploadToHdfs(String localFile, String remoteFile) {...} public boolean hadoopJobFromJar(String jar, String input, String output) {...} public String jobOutput() {...} public String numberOfOutputFiles() {...} }
  • 13. 13 The road… … to Big Data QA Big Data QA is different! Fitnesse Wiki test execution! test data definition / selection job & workflow control result inspection
  • 15. ‣ Big Data: Efficient data transfer among heterogeneous sources ‣ Define Interface via IDL, Compiler for many languages 15 Test Data Thrift
  • 16. ‣ Dev/Test Hadoop Cluster: Identical Hardware like Prod, but fewer nodes ‣ (random/biased) sampling e.g. on daily basis ‣ Feedback loop: ‣ identify „special cases“ from real data ‣ include them in (manual) data definition ‣ Gradually increase test coverage / artefact quality 16 Test Data Real World Data
  • 17. 17 The road… … to Big Data QA Big Data QA is different! FitNesse Wiki test execution! Define CSV / thrift / real- world test data! job & workflow control result inspection
  • 18. ‣ Execute arbitrary (shell) commands ‣ Mainly a wrapper around apache.commons.exec.CommandLine 18 Job Control Swiss Army Knife: Shell
  • 19. ‣ Hide complexity from test authors ‣ „define“ appropriate test language via (Java) method names ‣ re-use other fixtures (Shell, …) internally 19 Job Control Hadoop Fixture
  • 20. ‣ FitNesse allows to group tests into suites ‣ Can be used to simulate MR processing chains ‣ SetupSuite / TearDownSuite for creating / destroying test conditions ‣ Tests can still be executed individually 20 Job Control Workflows & Suites MR job 1 MR job 2
  • 21. 21 The road… … to Big Data QA Big Data QA is different! FitNesse Wiki test execution! Define CSV / thrift / real- world data! Use suites & fixtures for jobs/workflows! result inspection
  • 22. ‣ Validate RDBMS contents (via JDBC) ‣ E.g. for checking the final result ‣ Or use Hive + Hive-Server to query raw data 22 Results Data Warehouse / Hive
  • 23. ‣ Execute arbitrary pig commands from Wiki page ‣ Inspect e.g. binary intermediate results (avro, …) 23 Results Pig
  • 24. public class PigConsole extends PigServer { public void loadAvroFileUsingAlias(String filename, String alias) { this.registerQuery( alias + "= LOAD" + filename + "USING" + AVRO_STORAGE_LOADER + ";"); } } 24 Results Pig Fixture extends PigServer
  • 25. 25 Results Server Infrastructure Fitnesse Master TestEnvironments ProjA ProjB TestConfigurations ProjA ProjB dev qs live dev qs live Import / edit tests remotely QS ProjA Slave Dev ProjA Slave Live ProjA SlaveProjA QS ProjA Slave Dev ProjA Slave Live ProjA Slave Import / edit config remotely dev qs live
  • 26. 26 Thank you! dominik.benz@inovex.de Big Data QA is different! FitNesse Wiki test execution! Define CSV / thrift / real- world data! Inspect results via Pig/Hive Use suites & fixtures for jobs/workflows!
  • 27. 27 Want more? Inovex trains you! Android Developer Training (3 days, Karlsruhe/München) Hadoop Developer Training (3 days, Karlsruhe/Köln) Certified Scrum Developer Training (5 days, Köln) Pentaho Data Integration Training (4 days, München/Köln) Liferay Portal-Admin Training (3 days, Karlsruhe) Liferay Portal-Developer Training (4 days, Karlsruhe) information and registration at www.inovex.de/offene-trainings
  • 28. 28 Inovex @bbuzz Stefan Kathrin Bernhard Jörg Andrew Christian Christian