As a Hadoop developer, do you want to quickly develop your Hadoop workflows? Do you want to test your workflows in a sandboxed environment similar to production? Do you want to write unit tests for your workflows and add assertions on top of them?
In just a few years, the number of users writing Hadoop/Spark jobs at LinkedIn has grown from tens to hundreds, and the number of jobs running every day has grown from hundreds to thousands. With the ever-increasing number of users and jobs, it is crucial to reduce the development time for these jobs. It is also important to test these jobs thoroughly before they go to production.
We've tried to address these issues by creating a testing framework for Hadoop/Spark jobs. The testing framework enables users to run their jobs in an environment similar to the production environment and on data sampled from the original data. The testing framework consists of a test deployment system, a data generation pipeline to generate the sampled data, a data management system to help users manage and search the sampled data, and an assertion engine to validate the test output.
In this talk, we will discuss the motivation behind the testing framework before deep diving into its design. We will further discuss how the testing framework is helping Hadoop users at LinkedIn be more productive.
Beyond unit tests: Deployment and testing for Hadoop/Spark workflows
2. Beyond unit tests:
Testing for Spark/Hadoop workflows
Anant Nag
Senior Software Engineer
LinkedIn
Shankar Manian
Staff Software Engineer
LinkedIn
3. A day in the life of a data engineer
● Produce D.E.D Report by 5 AM.
● At 1 AM, an alert goes off saying the pipeline has failed
● Dev wakes up, curses his bad luck and starts the backfill job at high priority
● Finds the cluster busy, starts killing jobs to make way for the D.E.D job
● Debugs failures, finds today is daylight savings day and the data partitions have gone haywire.
● Most days it works on retry
● Some days, we are not so lucky
NO D.E.D => We are D.E.A.D
4. Scale @ LinkedIn
10s of clusters
1000s of machines
1000s of users
100s of 1000s of Azkaban workflows running per month
Powers key business-impacting features
People you may know
Who viewed my profile
5. Nightmares at Data Street
● Cluster gets upgraded
● Data partition changes
● Code needs to be rewritten in a new technology
● Different version of a dependent jar is available
● ...
6. Do you know your dependencies?
● Direct dependencies
● Indirect dependencies
● Hidden dependencies
● Semantic dependencies
“Hey, I am changing column X in data P to format B. Do you foresee any issues?”
13. ● 10s of clusters @LINKEDIN
● Multiple versions of Pig - 0.11, 0.15
● Some clusters update now, some later
● Code should run on all the versions
Configuration override
14. Configuration override
● Write multiple tests
● One test for each version of Pig
● Override the Pig version in the tests
workflowTestSuite("testWithPig11") {
addWorkflow('countByCountryFlow') {
lookup('countByCountry') {
set properties: [
‘pig.home’: ‘/path/to/pig/11’
]
}
}
}
workflowTestSuite("testWithPig12") {
addWorkflow('countByCountryFlow') {
lookup('countByCountry') {
set properties: [
‘pig.home’: ‘/path/to/pig/12’
]
}
}
}
workflowTestSuite("testWithPig15") {
addWorkflow('countByCountryFlow') {
lookup('countByCountry') {
set properties: [
‘pig.home’: ‘/path/to/pig/15’
]
}
}
}
17. ● DataFrames API
● Rewrite all Spark jobs to use DataFrames
● Is my new code ready for production?
● Write tests
○ Assertions on output
● All tests succeed after changes (^_^)
18. Data Validation and Assertion
● Types of checks
○ Record-level checks
○ Aggregate-level checks
■ Data transformation
■ Data aggregation
■ Data distribution
● Assert against Expectation
20. Aggregated validation
● Binary spam classifier C1
● C1 classifies 30% of test input as spam
● Wrote a new classifier C2
● C2 may deviate from C1 by at most 10%
● Write aggregated validations
22. Test execution
$> gradle azkabanTest -Ptestname=test1
● Test results on the terminal
● Reports for the passed and failed tests
Tests [1/1]:=> Flow TEST-countByCountry completed with status SUCCEEDED
Summary of the completed tests
Job Statistics for individual test
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
Flow Name | Latest Exec ID | Status | Running | Succeeded | Failed | Ready | Cancelled | Disabled | Total
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
TEST-countByCountry | 2997645 | SUCCEEDED | 0 | 2 | 0 | 0 | 0 | 0 | 2
23. Test automation
● Auto deployment process for Azkaban artifacts
● Tests run as part of the deployment
● Tests fail => Artifact deployment fails
● No untested code can go to production
25. Test data
● Real data
○ Very large data
○ Tests run very slowly
● Randomly generated data
○ Not equivalent to real data
○ Real issues can never be caught
● Manually created data
○ Covers all the cases
○ Too much effort
● Anything else?
26. Requirements of test data
● Representative of real data
● Smaller in size
● Automated generation
● Sharable
● Discoverable
27. Data sampling
● Sampling definition:
○ Sampling logic + parameters
○ Sampling logic example:
SELECT * FROM ${table} TABLESAMPLE(${sample_percent} PERCENT) s;
○ Parameters: table, sample_percent
● Joinable samples (see the sketch after this list)
○ Join the tables first and then sample, or
○ Hash-bucket on the join column and pick the same buckets from each table
● Metadata
○ Expiry date
○ Refresh rate
○ Permissions
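As a rough illustration of the hash-bucketing idea (a hypothetical Spark sketch, not the sampling pipeline's actual code; the function name, column names, and bucket count are assumptions), keeping only the rows whose join key hashes into the same bucket preserves joinability across tables:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{abs, hash}

object JoinableSampling {
  // Keep roughly 1/numBuckets of the rows: those whose join key hashes into bucket 0.
  // Applying this with the same join column and bucket count to two tables keeps
  // matching keys in both samples, so the samples still join with each other.
  def sampleByJoinKey(df: DataFrame, joinColumn: String, numBuckets: Int): DataFrame =
    df.filter(abs(hash(df(joinColumn))) % numBuckets === 0)
}

For example, sampleByJoinKey(members, "memberId", 100) and sampleByJoinKey(pageViews, "memberId", 100) would each keep about 1% of their rows while still joining on memberId (the table and column names here are purely illustrative).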
28. Sampling pipeline
● Separate repository for sampling definition + metadata
● Sampling processor generates Azkaban workflows
● Each definition corresponds to a single workflow
● Auto deployment to Azkaban
● Workflows scheduled based on metadata
○ Refresh rate
○ Expiry date
● Generated samples are stored in HDFS
● Permissions set using metadata
29. Sampling discovery
● Publish metadata for sampled data to the data discovery service
● LinkedIn WhereHows
● Publish metadata
○ Original data for a sample
○ All samples of a data
○ Default sample of a data
○ Lifecycle details
33. Data validation framework
● Validation logic in schema
● Automated validation on read and write
● Serves as contract between producers and consumers
34. Sampling pipeline
● Hash Bucketing on data ingestion
● Automated discovery of samples through DALI API
● Sampling for model training and testing
● Obfuscation of sensitive data
Hello Everyone, Thank you for coming. I am Shankar and this is Anant and we are going to talk about testing big data workflows.
It has been challenging to do that. As you might have experienced, there are external factors, hard to control, that influence the outcome. Quite often, we have to upgrade our clusters to the latest version of Hadoop. As stable as Hadoop has been, upgrades almost always cause some issue or other for some of the existing pipelines. Sometimes it's the way the data is partitioned or ingested that causes our joins to fail. The big data technology space being as vibrant as it is, we often find new and better ways of doing things. Maybe we want to rewrite the job in Spark for better performance. Or Tez. Or something else that might come up in the future. Or sometimes it's as simple as a newer version of a jar your code depends on. These are just some of the examples; the list is endless.
Just being aware of all the dependencies that can break your pipeline is super hard. Soon your dependencies become so complicated and interwoven that no one knows why a pipeline is failing. Direct dependencies are the easiest ones: these are the datasets, or the flows producing the data, that your flow depends on. But those flows could in turn depend on other flows, and so on, forming a long chain of indirect dependencies. Sometimes the dependencies are hidden: one flow could depend on intermediate data that some other flow produced and did not clean up, with no one aware of the dependency. One fine day the other flow's owner decides to clean up, and voila, your flow is failing. Dependencies on datasets and their schemas are much easier to track than semantic dependencies. A time field could change from PST to UTC. Does it affect you? Will your flow run with that change? It's hard to tell.
There is an easy solution to all of this: "Don't change anything." That might sound paranoid, but in light of all of this, it might be justified. There is one serious issue with that solution: if you are not confident enough to constantly make changes, you will make fewer and fewer of them and lose the innovative edge your company has in the marketplace. Clearly, no company would want that. Is there a better alternative? Is there a way to be confident that, whatever changes, our flow will continue to work? Is there a way to find a potential failure before it happens in production?
These are the questions that motivated us to work on a workflow testing framework we are calling Marvin. To talk more about Marvin and how it helps address these scenarios, let me call Anant.
I'm going to talk about the end-to-end testing framework.
I'll talk about the different components involved, how those components are implemented, and how they interact with each other.
What are the requirements for a workflow testing framework? What do we need?
We definitely need a way to programmatically write our Azkaban workflows.
We can do that using the Azkaban DSL; I'll talk more about what the Azkaban DSL looks like in the upcoming slides.
Once we have written our workflows, we should be able to write tests for them.
We can again use the Azkaban DSL to write tests.
Once we have written the tests, where do we run them?
Can we run them locally?
Well, we can, but it's very difficult to replicate the production environment on a local machine.
Can we run them on the production environment? The tests might be slow and there can be latency issues, but they will be more robust and our code will be immune to environment or configuration changes.
What's next? We have the test definitions, and we know where and how to run them, but we don't have the test data.
Again, we could keep it locally, but since the tests run on Azkaban, the data should be in HDFS.
This is how the complete architecture looks: we have the test definitions, the test execution environment, and the test data.
Now let's see what a workflow definition looks like.
The workflow definition is written in a language called the Hadoop DSL. Using this language, one can define multiple workflows,
add jobs to a workflow, add properties to the jobs, and more.
In this example, we have declared a workflow called workflow1. This workflow contains two jobs:
job1, which is a Spark job, and
job2, which is a Pig job.
There is a linear dependency between these jobs: job2 depends on job1. Each job also has some properties which are required for it to run. For example, to run a Spark job you need the execution jar and the class to run, and you might want to set the executorMemory or the number of executors.
When this DSL is compiled, it produces Azkaban workflows. Visually, the DAG will look something like this.
Now that we know how to define a workflow programmatically, we are ready to write tests for our workflows.
What are the requirements for a good testing framework? What should we be able to do?
We should be able to create a named test.
We can do that using the workflowTestSuite construct.
We should be able to add the workflow that we want to test.
This can be done using the addWorkflow construct.
Provide the name of the workflow that you want to add as a parameter of the addWorkflow construct.
We should also be able to add multiple tests.
So we can add multiple workflowTestSuite blocks with different names.
Adding a workflow to a test is not enough; we may want to override certain parameters of the workflow.
Workflows themselves don't have parameters. Who has parameters? Jobs!
We should be able to find a job and change its parameters.
Parameters such as input_data
Point it to test data
Point the output to test output
Don’t change anything else
Apart from the parameters, there are other things that we can change: environment variables and configurations.
Let me give you a real-world example of why a configuration override or environment override might be necessary.
LinkedIn has 10s of clusters. Some are used as ETL clusters, some as development clusters, and some as production clusters. Different clusters might not have the same set of software or the same environment. Since there are multiple versions of Pig floating around, some clusters will have Pig 0.11, some will have Pig 0.15, and so on.
We update our clusters with the latest versions, but some are updated now and some later, so there is always a mismatch in the installed software.
I cannot always go to every cluster and run my code on each of them.
So what can I do to make sure that my code runs on all the versions of Pig and on all the clusters?
I can write multiple tests for my code.
Each of my tests will run my code against a particular version of Pig.
If all of my tests succeed, then I know that my Pig code works for all the Pig versions and will run on the other clusters as well.
Any test is incomplete without assertions.
Assertions let you add validations on the output of your function.
They help you compare the output of your function to an expected value and see if the output meets expectations.
They also help you categorize your failures: if a particular assertion fails, you know where to look.
So why do we need assertions for our workflow tests? To answer this question, let me tell you a true story.
I had a complex pipeline with 10s of jobs in the workflow. It used to collect metrics from Hadoop and ingest them into HDFS, from where the data was consumed by other pipelines.
Most of my jobs were Spark jobs, and I wrote them using Spark 1.4.
So Spark came up with a new DataFrames API.
I was very excited about the new API and wanted to rewrite all of my jobs to use DataFrames.
Before I started rewriting my jobs, how could I be sure that the new code was ready for production?
What did I do?
I wrote tests and added assertions on the output of the tests.
Now even if I change my code and rewrite it, the output data should stay the same and all my assertions should pass.
And when I ran my code, all of my tests succeeded and I could move my code to production.
The moral of the story is that we need data validation and assertions, especially for workflows.
Now, what kinds of checks or validations can we have on our output data?
We can have record-level checks: validation rules on each record, validated record by record.
Or we can have aggregate-level checks, which means validation rules over a number of records at once. We might not be able to write straightforward aggregate rules, so we may have to transform the data, aggregate it, or look at its distribution before performing any checks.
We can also have a direct comparison against the expected output. For example, you can compare the schema of the output to the expected schema, or you can have hardcoded expected data which should be equal to the output data.
As an example of record-level checks, suppose you have a table with two columns.
The first column is the ID of the country, and the second column is the number of members from that country. In the data we can see the countries US, India, and China and their corresponding member counts.
One kind of check that we can perform here is that the member count of each country is not negative. To express this kind of assertion in our test definition, we can add an assertion workflow, which is just another workflow that runs after the workflow we are testing. In this assertionWorkflow, we can have a job that validates our output and checks that each country's count is not negative.
This can be easily written using Spark, where we use the require method to find all the records with negative count values and assert that the number of such records is zero.
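A minimal sketch of such an assertion job in Spark (not the actual code from the talk; the Parquet format, output path, and column name member_count are assumptions):

import org.apache.spark.sql.SparkSession

object CountByCountryAssertion {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("countByCountryAssertion").getOrCreate()

    // Read the output produced by the workflow under test (illustrative path).
    val output = spark.read.parquet("/test/output/countByCountry")

    // Record-level rule: no country may have a negative member count.
    val negativeRecords = output.filter(output("member_count") < 0).count()
    require(negativeRecords == 0, s"Found $negativeRecords records with a negative member count")

    spark.stop()
  }
}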
We can also have aggregate-level validations.
To give you an example:
Consider that your team has developed a binary spam classifier called C1.
You expect it to perform well, and it classifies about 30% of your test input as spam.
Now your team has started working on a new classifier called C2, which is very similar to classifier C1.
You don't expect classifier C2 to deviate a lot from C1.
It may deviate from C1 by at most 10%.
What can I do to make sure that classifier C2 works as expected?
Well, I can again write tests and have aggregate validations on the test output.
Even with the new classifier, my validations should pass.
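A minimal sketch of such an aggregate validation (assuming, for illustration, that both classifiers write Parquet output with a boolean is_spam column, and reading the 10% bound as an absolute difference in spam rate):

import org.apache.spark.sql.SparkSession

object SpamRateValidation {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("spamRateValidation").getOrCreate()

    // Fraction of records a classifier marked as spam (path and column are illustrative).
    def spamRate(path: String): Double = {
      val df = spark.read.parquet(path)
      df.filter(df("is_spam") === true).count().toDouble / df.count()
    }

    // Aggregate-level rule: the new classifier's spam rate may differ from the
    // old one's by at most 10 percentage points.
    val oldRate = spamRate("/test/output/classifier_c1")
    val newRate = spamRate("/test/output/classifier_c2")
    require(math.abs(newRate - oldRate) <= 0.10,
      f"Spam rate changed from $oldRate%.2f to $newRate%.2f, exceeding the allowed deviation")

    spark.stop()
  }
}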
Now we know how to define the workflows, how to define the tests and add assertions on top of the tests. We also have the test execution environment in place.
But how do we run our tests? How do we deploy and execute them?
The Azkaban DSL comes as part of a Gradle plugin, which has certain tasks:
tasks to deploy an artifact to Azkaban, run executions on Azkaban, and fetch the results.
We have added a new task to the Gradle plugin that can deploy our tests separately, run them, and print the test results on the terminal.
It also creates a nice report for the passed and failed tests, along with information about specific jobs.
On your screen, you can see such an example. We created a test called TEST-countByCountry and ran it in Azkaban.
The test succeeded, and it had two succeeded jobs.
So the azkabanTest task gives us the ability to run ad-hoc tests. Now ad-hoc tests are fine when you are developing.
But testing is incomplete unless you have an automated way of running these tests.
You want to run your tests whenever there is a new change in the code. And these tests should be run before your code is even deployed to production.
So for this we have a test automation system in place.
LinkedIn has an auto-deployment system for deploying artifacts to Azkaban.
This system takes the Azkaban artifacts from Artifactory and uploads them to Azkaban.
Now the tests can run as part of the auto-deployment process. Our deployment service can run all the tests before deploying the artifacts, and
if any test fails, the deployment to production fails as well.
This way we make sure that no buggy or untested code goes to production.
So far we have talked about writing the tests, writing the assertions, and running the tests on Azkaban, but we are still missing a very important component of our testing framework:
test data. We still don't know where to find our test data, how to generate it, or what its requirements are.
There are different ways in which we can get test data.
We can use the real data, but the real data is very large and our tests will run very slowly.
We can use randomly generated data, but it's not equivalent to the real data. Our tests can run very fast, but real issues will never be caught.
Or we can look at the original data and manually create our test data. Manually created data is good and will cover all the cases, but it's too much effort to create: you'll have to look at the original data, observe it, and then manually create the data.
Is there anything else?
Before trying to find an alternative, let us take the learnings from this slide and come up with the requirements for ideal test data.
So we want our test data to be:
Representative of the real data, to catch most of the real-world issues.
It should be smaller in size so that we can run all our tests relatively fast.
We should not have to put in a lot of effort to generate the test data, so we should be able to generate it automatically.
The test data should be sharable: once someone has created test data for their tests, it should be available for other teams to use as well.
It should be discoverable: other teams should be able to discover your data before trying to generate their own.
The test framework allows you to override inputs and outputs to any data of your choice, so people can create their own samples and use them.
The sampling pipeline's goal is to enable discoverability and sharing of existing samples, as well as standardization and debuggability of sampling definitions. In addition, it can improve performance by materializing samples at the ingestion stage.
So at LinkedIn, we realized that sampled data can be a good candidate for test data.
So we came up with the concept of a sampling definition. A sampling definition is nothing but the sampling logic plus some parameters. The sampling logic could be as simple as the one shown here, with sample_percent provided as a parameter. We need this kind of sampling definition because we want multiple developers to share their sampling logic so that others can just plug in their own parameters and generate samples of their own.
We now know the requirements for the test data and how sampled data meets those requirements. But there are still some parts missing.
We still need to automate the generation, and we still need to make the samples discoverable and sharable.
We've created a sampling pipeline which takes care of all of these things.