Data infrastructure architecture for a medium-size organization: tips for collecting, storing, and analyzing data.
1. Egor Pakhomov
Data Architect, AnchorFree
egor@anchorfree.com
Data infrastructure architecture for a medium-size organization: tips for collecting, storing, and analyzing data.
2. Medium organization (<500 people) vs. big organization (>500 people)
• DATA CUSTOMERS: >10 vs. >100
• DATA VOLUME: "Big data" vs. "Big data"
• DATA TEAM PEOPLE RESOURCES: enough to integrate and support some open source stack vs. enough to write our own data tools
• FINANCIAL RESOURCES: enough to buy hardware for a Hadoop cluster vs. enough to buy a cloud solution (Databricks Cloud, Google BigQuery...)
3. Data infrastructure architecture
4. About me
Data Architect at AnchorFree
• Spark contributor since 0.9
• Integrated Spark in Yandex Islands; worked at Yandex Data Factory
• Participated in the development of "Alpine Data", a Spark-based data platform
5. Agenda
1. Data Querying: Why SQL is important and how to use it in Hadoop?
• SQL vs R/Python
• Impala vs Spark
• Zeppelin vs SQL desktop client
2. Data Storage: How to store data to query it fast and change it easily?
• JSON vs Parquet
• Schema vs schema-less
3. Data Aggregation: How to aggregate your data to work better with BI tools?
• Aggregate your data!
• SQL code is code!
6. Part 1: Data Querying
Why SQL is important and how to use it in Hadoop?
1. SQL vs R/Python
2. Impala vs Spark
3. Zeppelin vs SQL desktop client
10. Which one would you choose? Both!
                                   SparkSQL | Impala
SUPPORT FOR HIVE METASTORE             +    |   +
FAST                                   -    |   +
RELIABLE (WORKS NOT ONLY IN RAM)       +    |   -
JSON SUPPORT                           +    |   -
HIVE-COMPATIBLE SYNTAX                 +    |   -
OUT-OF-THE-BOX YARN SUPPORT            +    |   -
MORE THAN JUST A SQL FRAMEWORK         +    |   -
12. Give SQL to users. Step 2: put an ODBC/JDBC server in front of Hadoop.
13. 1. Managing a desktop application on N laptops
2. One Spark context shared by many users
3. Lack of visualization
4. No decent resource scheduling
Would not work...
17. 1. Web-based
2. Notebook-based
3. Great visualization
4. Works with both Impala and Spark
5. Has a cloud solution with support: Zeppelin Hub from NFLabs
It's great!
19. Part 2: Data Storage
How to store data to query it fast and change it easily?
1. JSON vs Parquet
2. Schema vs schema-less
20. What would you need from data storage?
• Flexible format
• Fast querying
• Access to "raw" data
• Has a schema
21. Can we choose just one data format? We need both!
                        JSON | Parquet
FLEXIBLE                  +  |
ACCESS TO "RAW" DATA      +  |
FAST QUERYING                |    +
HAVE SCHEMA                  |    +
IMPALA SUPPORT               |    +
22. Let's compare elegance and speed:
FORMAT  | QUERY                                                                        | QUERY TIME
Parquet | SELECT Sum(some_field) FROM logs.parquet_datasource                          | 136 sec
JSON    | SELECT Sum(Get_json_object(line, '$.some_field')) FROM logs.json_datasource  | 764 sec
Parquet is 5 times faster!
But when you need raw data, 5 times slower is not that bad.
23. How data in these formats compares
JSON:
{
  "First name": "Mike",
  "Last name": "Smith",
  "Gender": "Male",
  "Country": "US"
}
{
  "First name": "Anna",
  "Last name": "Smith",
  "Age": "45",
  "Country": "Canada",
  "Comments": "Some additional info"
}
...
Parquet:
FIRST NAME | LAST NAME | GENDER | AGE
Mike       | Smith     | Male   | NULL
Anna       | Smith     | NULL   | 45
...        | ...       | ...    | ...
25. ● "Big data" does not mean you need to query all the data daily
● BI tools should not do big queries
Aggregate your data!
26. How does aggregation work? Queries live in git; a query executor runs them and fills an aggregated table; the BI tool runs "select * from ..." against that aggregated table.
27. Report development process:
1. Creating an aggregated table in Zeppelin
2. Adding the queries to git to run daily
3. Creating a BI report based on this table
4. Publishing the report
Changing the data behind a report:
1. Change the query in git
29. We do not use the Spark that comes with the Hadoop installation, because we need to:
1. Apply our own patches to the source code
2. Move to new versions before any official release
3. Move to a new version on part of the infrastructure while the rest remains on the old one
Hi, my name is Egor. I'm a Data Architect at AnchorFree, and I'd like to tell you about the lessons we've learned while building our data infrastructure.
Working with big data in a small or medium-size organization is different from working in a big one. You still have a large amount of data to process and a lot of people who work with that data, but you do not have a dedicated team to build your own data tools, and you do not have the financial resources to shop for an existing cloud solution. These constraints force you to use every bit of technology available in the open source Hadoop stack.
So how do you manage big data when you are not so big?
About me: I've worked on data infrastructure at AnchorFree and, before that, at Yandex Data Factory. I've been a Spark contributor since version 0.9.
Today I'll talk about our approach to three big problems in this area: how to query data, how to store it, and how to aggregate it efficiently.
Let’s start with querying.
For us, the core of data processing is SQL. Many companies today invest in R or Python support, but that makes sense mostly where SQL is already well adopted. If you want to hire a data analyst, it's ten times easier to find a person who is good at SQL than one who is good at Python or R. Business people are used to BI tools like Tableau, which require SQL access to data. QA engineers in our company use data to verify that functionality works correctly for real users, and SQL is an easy tool to teach them if they do not know it already. Even for someone who is very advanced in a scripting language, it is much easier to get an answer with a SQL query than by writing and debugging a script. If I had to choose the single most important technical goal for our data infrastructure, it would be providing a fast, reliable, SQL-based interface to the data.
First we need to select a SQL engine. A big data SQL engine should be fast, reliable, able to process terabytes of data, support modern SQL statements, and integrate well with other tools. Another important feature to pay attention to is support for the Hive metastore.
The Hive metastore stores metadata about tables: the schema of each table and the location of the files that hold its data. When a SQL engine works with a table, it first goes to the Hive metastore and retrieves the schema and file locations. You should be able to switch between SQL engines quickly, without migrating the metadata database, and the Hive metastore is the most common solution for that.
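As a minimal sketch of what that means in practice (the database, table, columns, and HDFS path below are hypothetical), a table registered once in the metastore from Spark becomes visible to every engine attached to the same metastore, including Impala:

```scala
import org.apache.spark.sql.SparkSession

// Spark session backed by the shared Hive metastore.
val spark = SparkSession.builder()
  .appName("register-table-example")
  .enableHiveSupport()
  .getOrCreate()

// Register an external table: the metastore keeps the schema and the file location.
spark.sql("CREATE DATABASE IF NOT EXISTS logs")
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS logs.events (
    ts      BIGINT,
    user_id STRING,
    country STRING)
  STORED AS PARQUET
  LOCATION 'hdfs:///data/logs/events'""")

// Spark SQL resolves the table through the metastore...
spark.sql("SELECT country, count(*) FROM logs.events GROUP BY country").show()

// ...and so can Impala, once it refreshes its metadata cache:
//   impala-shell> INVALIDATE METADATA logs.events;
//   impala-shell> SELECT country, count(*) FROM logs.events GROUP BY country;
```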
For us there were always two major players in this field: Impala and Spark SQL. Apache Drill and Presto are also interesting solutions with their own advantages, but when your resources are limited it's hard to support more than two or three SQL engines and, more importantly, it's difficult to teach every data user when to pick each particular tool.
Impala is fast and does not rely on YARN for resource scheduling. That is both good and bad: queries start faster because they do not need to allocate containers, but without containers it is hard to manage resource quotas.
Spark SQL is significantly slower, but it is stable, better at resource allocation, and it provides more than SQL: it is also an ML framework, a streaming framework, and much more. The biggest mistake we made while building our infrastructure was trying to choose between Spark SQL and Impala. There is no way to make such a choice: you need both.
Impala may win only on speed, but speed is very important: 10 seconds versus 2 minutes per query is the difference between being able to do interactive data analysis or not. When you join multiple tables with 100 terabytes of data, Spark is the obvious choice. When you need to query a snapshot of some MySQL database in Hadoop, the volume of data is not so big, so you can use Impala. We use Impala when we are looking for insights in the data with relatively light queries and need speed for fast iteration, and we switch to Spark for stable execution of heavy queries.
The next important choice we made about SQL in our company is the tool for running the queries. Impala and Spark are engines, not a UI that a human can use directly.
We have Tableau. Tableau, like many BI tools, needs a JDBC/ODBC server. Impala has this functionality out of the box, but for Spark you need to set up the Thrift JDBC/ODBC server.
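To make the "JDBC/ODBC server" part concrete, here is a rough sketch of how a client reaches both engines over JDBC. The host names are placeholders, the Hive JDBC driver is assumed to be on the classpath, and the connection options (ports, auth=noSasl) depend on your setup; the Spark side is the Thrift server started with sbin/start-thriftserver.sh from the Spark distribution.

```scala
import java.sql.DriverManager

// Both servers speak the HiveServer2 protocol, so one driver covers them.
Class.forName("org.apache.hive.jdbc.HiveDriver")

// Spark Thrift JDBC/ODBC server (default port 10000).
val sparkConn = DriverManager.getConnection(
  "jdbc:hive2://spark-thrift-host:10000/default", "analyst", "")

// Impala daemon, which exposes HiveServer2 out of the box (default port 21050).
val impalaConn = DriverManager.getConnection(
  "jdbc:hive2://impala-host:21050/default;auth=noSasl")

val rs = sparkConn.createStatement()
  .executeQuery("SELECT count(*) FROM logs.parquet_datasource")
while (rs.next()) println(rs.getLong(1))

sparkConn.close()
impalaConn.close()
```

Tableau talks to the same endpoints through its ODBC drivers rather than JDBC, but the idea is identical.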
After that step we have an easy solution for users to run SQL: they connect to the Impala or Spark server with a desktop SQL client like SQuirreL. But this solution does not work, for several reasons.
First, you need to set up the SQL client on every user's laptop. In our organization of 80 people that would be at least 30 laptops of people who use our data infrastructure, and it's hard to maintain software on so many individual desktops.
Second, it is easy to mess up the Spark context. You can run a query over a big table with many files and partitions, and the Spark server will hang or run out of memory trying to process it. Spark has bugs, so the context can also break without doing anything heavy. And when one user breaks the Spark context, the server stops working for everyone.
Another problem with desktop SQL clients is the lack of visualization tools, which matter when you work with data.
The last reason is deeper: resource allocation. Neither Impala nor a single Spark server has decent resource allocation functionality. It is easier for Impala, since it is faster and people spend less time waiting.
It's worse for Spark, even with the fair scheduler turned on. Spark and Impala are both bad at resource allocation, but guess what is good at it: Hadoop.
For every user you can define a YARN queue on the cluster and have a single Spark application working in that queue. You can set a quota on the queue and define quotas hierarchically by user, department, and organization. So you somehow need one Spark context per user, as sketched below.
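As an illustration (the queue name is hypothetical, and the quotas themselves are defined in the YARN scheduler configuration), the only thing the per-user Spark application has to do is submit itself into that user's queue:

```scala
import org.apache.spark.sql.SparkSession

// One long-lived Spark application for one user, pinned to a hierarchical
// YARN queue; the queue's capacity caps how much of the cluster this user takes.
val spark = SparkSession.builder()
  .appName("sql-notebook-alice")
  .master("yarn")
  .config("spark.yarn.queue", "analytics.alice") // department.user queue
  .enableHiveSupport()
  .getOrCreate()
```

In Zeppelin the same spark.yarn.queue property can be set per instance in the Spark interpreter settings.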
The perfect solution for all the problems I've described is Apache Zeppelin. It's a web-based, notebook-based interactive data analytics tool. If you are familiar with IPython or Jupyter, it's very similar.
It's web based, so you do not have to install anything on users' machines. It has nice visualization. It works with both Impala and Spark. And you can run a Zeppelin per user or per department: every Zeppelin instance has its own Spark application and its own queue in Hadoop, so when a user writes a heavy query and breaks their Spark application, it does not bother anyone else.
The latest version of Zeppelin, which became available around this August, has authentication and a new Livy interpreter, which lets you separate Zeppelin from the Spark context it works with. But it's all rather new, and we haven't had a chance to use it yet.
If you are not ready to try the Livy interpreter, here is a tip about the old way: do not put multiple Zeppelin instances on a single machine. Every Zeppelin's driver needs RAM for caching table metadata and for working with query results, so put them on separate virtual machines. In our case 16 GB of RAM per machine is enough.
Let’s talk about storing the data.
The requirements for the format in which we store data are quite contradictory. We want the format to be fast, like Parquet, since we query it a lot. We need a schema, so people can understand the nature of a datasource without additional help. We need the flexibility to change the data often: with every release, data producers start to report new fields. And of course we need access to the raw data, in case something rare was reported that never made it into the schema.
The compromise that satisfies all these requirements is using JSON and Parquet simultaneously. A data producer generates simple flat JSON and sends it to Hadoop. We store it as is and have a Hive table that wraps this datasource. Every night we transform the previous day's JSON data into Parquet, leaving the JSON data unchanged. This Parquet data is, of course, part of a Hive table as well.
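A sketch of the raw side of this setup, with hypothetical names and assuming a SparkSession `spark` with Hive support as in the earlier examples: the table keeps each JSON record as a single string, so nothing the producers send is ever lost.

```scala
// Raw JSON landing table: one string column per record, partitioned by day.
spark.sql("CREATE DATABASE IF NOT EXISTS logs")
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS logs.json_datasource (line STRING)
  PARTITIONED BY (dt STRING)
  LOCATION 'hdfs:///data/logs/json'""")

// A new daily directory becomes queryable once its partition is registered.
spark.sql("ALTER TABLE logs.json_datasource ADD IF NOT EXISTS PARTITION (dt='2016-10-01')")

// Rare fields that never made it into any schema stay reachable:
spark.sql("""
  SELECT get_json_object(line, '$.some_rare_field') AS value, count(*) AS cnt
  FROM logs.json_datasource
  GROUP BY get_json_object(line, '$.some_rare_field')""").show()
```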
If I need to query some important, long-lived fields, I query the Parquet data with something like "select some_field from logs.parquet_datasource", and it takes 136 seconds. The same query against the JSON datasource looks like "select get_json_object(line, '$.some_field') from logs.json_datasource" and takes 764 seconds: it is both slower and requires looking up the schema somewhere. Most of the time people query a narrow set of fields from the schema, but when they query something rare, five times slower is not so bad. Another thing to keep in mind is that Impala does not work with JSON.
While setting up this infrastructure we made an interesting mistake. We assumed that every field in the JSON was probably important and amended the schema of the Parquet table whenever we saw a new field. It was a mistake, because data producers tend to have bugs and put thousands of irrelevant fields into the JSON, which made our schema far too big. After that we put the following protocol in place:
The nightly Spark job that transforms JSON to Parquet takes the schema from the Hive table of the Parquet datasource and uses it while extracting data from the JSON. This keeps the created Parquet files and the Hive schema consistent. When data producers start to report a new field, it does not affect the schema or size of the Parquet datasource until I manually amend the schema of the Parquet table. This defends our infrastructure against garbage fields.
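Here is a minimal sketch of that nightly job, using the same hypothetical names and assuming the Parquet table is partitioned by day. The key line is taking the schema from the existing table instead of inferring it from the JSON, so unknown fields are silently dropped.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

object JsonToParquet {
  def main(args: Array[String]): Unit = {
    val day = args(0) // previous day, e.g. "2016-10-01"

    val spark = SparkSession.builder()
      .appName(s"json-to-parquet-$day")
      .enableHiveSupport()
      .getOrCreate()

    // Schema comes from the Parquet table itself (minus the partition column),
    // so a new field only appears after someone amends the table schema by hand.
    val dataSchema = StructType(
      spark.table("logs.parquet_datasource").schema.filterNot(_.name == "dt"))

    spark.read.schema(dataSchema)
      .json(s"hdfs:///data/logs/json/dt=$day")
      .write.mode("overwrite")
      .parquet(s"hdfs:///data/logs/parquet/dt=$day")

    // Register the new partition so the table (and Impala, after a REFRESH) sees it.
    spark.sql(s"ALTER TABLE logs.parquet_datasource ADD IF NOT EXISTS PARTITION (dt='$day')")
  }
}
```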
Another thing we tried to avoid for a long time is creating tables with aggregated data.
The big data stack gives you the ability to query terabytes of data in a reasonable time, but that does not mean you can query all of it daily to build reports.
Imagine you have some old-fashioned BI tool that was designed to access data sources with sub-second query latency, and you try to use it against a Hadoop stack where querying a month of data can take minutes. The first naive approach is to put the queries into Tableau anyway. It won't work properly: BI visualization systems are about visualization, not job scheduling, and when you make one run 30-minute extracts, you are making it do job scheduling. That is bad, because a job scheduling tool should be good at quotas, priorities, retries, and logging problems with jobs, and BI tools are bad at all of that. Another problem is that all queries go to a single Spark server, which becomes an additional single point of failure. Yet another problem: we already have thousands of lines of query code that prepares data for our reports. It's SQL, but it's code, and like any code it needs refactoring, extraction of common pieces into separate abstractions, and a version control system. None of that is possible when your code lives inside a BI tool: even a slight change in how we work with the data would require downloading the report, changing the code inside it, and uploading it back. You cannot manage tens of reports that way.
So we put all our report-preparation queries into git, wrote a simple Spark job that takes the SQL queries and executes them, and scheduled this job to run daily. It significantly improved the report development process. An analyst first experiments with queries in a notebook and then commits them to git to run daily; no additional support from the engineering team is required, since the analyst can commit to git himself. Then he just creates a report in Tableau based on the aggregated table, and the query he uses in Tableau is simply "Select * from aggregates.some_table".
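The job itself can stay very small. A sketch, assuming the repository with the *.sql files is checked out to a local directory such as /opt/aggregates/queries and that each file holds a single INSERT ... SELECT statement:

```scala
import java.io.File
import org.apache.spark.sql.SparkSession

// Daily runner for the aggregation queries kept in git.
object RunAggregates {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("daily-aggregates")
      .enableHiveSupport()
      .getOrCreate()

    val queryDir = new File("/opt/aggregates/queries")
    val queryFiles = queryDir.listFiles().filter(_.getName.endsWith(".sql")).sortBy(_.getName)

    queryFiles.foreach { file =>
      val source = scala.io.Source.fromFile(file)
      val sql = try source.mkString.trim.stripSuffix(";") finally source.close()
      println(s"Running ${file.getName}")
      spark.sql(sql) // each file rebuilds or extends one aggregated table
    }
  }
}
```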
When he needs to change the way data is processed for the report, he only needs to change the query in git.
These changes significantly sped up the report development cycle. But at first we executed queries daily that processed the whole history, and that became an issue: creating all the aggregated tables took the cluster 17 hours. So at some point we started keeping two queries for every aggregated table: one that processes the whole history, and one that processes only the previous day and inserts the result into the aggregated table. The daily run time dropped to 3 hours.
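As a hypothetical example of such a pair (the table and columns are made up, and `spark` is a SparkSession with Hive support as in the earlier sketches):

```scala
// Full rebuild over the whole history: run only when the logic changes.
spark.sql("""
  INSERT OVERWRITE TABLE aggregates.daily_users
  SELECT dt, country, count(DISTINCT user_id) AS users
  FROM logs.parquet_datasource
  GROUP BY dt, country""")

// Daily increment: run every night, touches only the previous day.
spark.sql("""
  INSERT INTO TABLE aggregates.daily_users
  SELECT dt, country, count(DISTINCT user_id) AS users
  FROM logs.parquet_datasource
  WHERE dt = date_format(date_sub(current_date(), 1), 'yyyy-MM-dd')
  GROUP BY dt, country""")
```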
I've talked about the major decisions; now I want to share a few smaller tips.
We never use the Spark that comes with the Hadoop installation; we always build our own Spark from source and use it for all processes. There are several reasons for that. We applied some patches for better handling of JSON data, so we need to build Spark from source to include them. Sometimes we move to a newer version of Spark for essential functionality or bug fixes and cannot wait for the official release. And when we move to a newer version, different components move at different speeds: Zeppelin might already be on 2.0 while the data transformation jobs stay on 1.6 until we verify that the new version has no bugs for our type of data and our type of queries.