Data infrastructure architecture for a medium-size organization: tips for collecting, storing, and analyzing data.
1. Egor Pakhomov
Data Architect, AnchorFree
egor@anchorfree.com
Data infrastructure architecture for a medium-size organization: tips for collecting, storing, and analyzing data.
2. Medium organization (<500 people) vs. big organization (>500 people)
• DATA CUSTOMERS: >10 vs. >100
• DATA VOLUME: "Big data" vs. "Big data"
• DATA TEAM PEOPLE RESOURCES: enough to integrate and support some open source stack vs. enough to write our own data tools
• FINANCIAL RESOURCES: enough to buy hardware for a Hadoop cluster vs. enough to buy a cloud solution (Databricks Cloud, Google BigQuery...)
3. Data infrastructure architecture
4. About me
Data Architect at AnchorFree
• Spark contributor since 0.9
• Integrated Spark in Yandex Islands; worked at Yandex Data Factory
• Participated in the development of "Alpine Data", a Spark-based data platform
5. Agenda
1. Data Querying: Why SQL is important and how to use it in Hadoop?
• SQL vs R/Python
• Impala vs Spark
• Zeppelin vs SQL desktop client
2. Data Storage: How to store data to query it fast and change it easily?
• JSON vs Parquet
• Schema vs schema-less
3. Data Aggregation: How to aggregate your data to work better with BI tools?
• Aggregate your data!
• SQL code is code!
6. Part 1: Data Querying
Why SQL is important and how to use it in Hadoop?
1. SQL vs R/Python
2. Impala vs Spark
3. Zeppelin vs SQL desktop client
10. Which one would you choose? Both!
                                   SparkSQL | Impala
SUPPORT FOR HIVE METASTORE             +    |   +
FAST                                   -    |   +
RELIABLE (WORKS NOT ONLY IN RAM)       +    |   -
JSON SUPPORT                           +    |   -
HIVE-COMPATIBLE SYNTAX                 +    |   -
OUT-OF-THE-BOX YARN SUPPORT            +    |   -
MORE THAN JUST A SQL FRAMEWORK         +    |   -
12. Give SQL to users. Step 2: put an ODBC/JDBC server in front of Hadoop.
13. 1. Managing a desktop application on N laptops
2. One Spark context shared by many users
3. Lack of visualization
4. No decent resource scheduling
Would not work...
17. 1. Web-based
2. Notebook-based
3. Great visualization
4. Works with both Impala and Spark
5. Has a cloud solution with support: Zeppelin Hub from NFLabs
It's great!
19. Part 2: Data Storage
How to store data to query it fast and change it easily?
1. JSON vs Parquet
2. Schema vs schema-less
20. What would you need from data storage?
• Flexible format
• Fast querying
• Access to "raw" data
• Has a schema
21. Can we choose just one data format? We need both!
                        JSON | Parquet
FLEXIBLE                  +  |
ACCESS TO "RAW" DATA      +  |
FAST QUERYING                |    +
HAVE SCHEMA                  |    +
IMPALA SUPPORT               |    +
22. Let's compare elegance and speed:
FORMAT  | QUERY                                                                        | QUERY TIME
Parquet | SELECT Sum(some_field) FROM logs.parquet_datasource                          | 136 sec
JSON    | SELECT Sum(Get_json_object(line, '$.some_field')) FROM logs.json_datasource  | 764 sec
Parquet is 5 times faster!
But when you need raw data, 5 times slower is not that bad.
23. How data in these formats compares
JSON:
{
  "First name": "Mike",
  "Last name": "Smith",
  "Gender": "Male",
  "Country": "US"
}
{
  "First name": "Anna",
  "Last name": "Smith",
  "Age": "45",
  "Country": "Canada",
  "Comments": "Some additional info"
}
...
Parquet:
FIRST NAME | LAST NAME | GENDER | AGE
Mike       | Smith     | Male   | NULL
Anna       | Smith     | NULL   | 45
...        | ...       | ...    | ...
25. ● "Big data" does not mean you need to query all the data daily
● BI tools should not do big queries
Aggregate your data!
26. How does aggregation work? Queries live in git; a query executor runs them and fills an aggregated table; the BI tool runs "select * from ..." against that aggregated table.
27. Report development process:
1. Creating an aggregated table in Zeppelin
2. Adding the queries to git to run daily
3. Creating a BI report based on this table
4. Publishing the report
Changing the data behind a report:
1. Change the query in git
29. We do not use the Spark that comes with the Hadoop installation, because we need to:
1. Apply our own patches to the source code
2. Move to new versions before any official release
3. Move to a new version on part of the infrastructure while the rest remains on the old one
Hi, my name is Egor. I'm a Data Architect at AnchorFree, and I'd like to tell you about the lessons we've learned while building our data infrastructure.
Working with big data in a small or medium-size organization is different from working in a big one. You still have a large amount of data to process and a lot of people who work with that data, but you do not have a dedicated team to build your own data tools, and you do not have the financial resources to shop for an existing cloud solution. These constraints force you to use every bit of technology available in the open source Hadoop stack.
So how do you manage big data when you are not so big?
About me: I've worked on data infrastructure at AnchorFree and, before that, at Yandex Data Factory. I've been a Spark contributor since version 0.9.
Today I'll talk about our approach to three big problems in this area: how to query data, how to store it, and how to aggregate it efficiently.
Let’s start with querying.
For us, the core of data processing is SQL. Many companies today invest in R or Python support, but that makes sense mostly where SQL is already well adopted. If you want to hire a data analyst, it's ten times easier to find a person who is good at SQL than one who is good at Python or R. Business people are used to BI tools like Tableau, which require SQL access to data. QA engineers in our company use data to verify that functionality works correctly for real users, and SQL is an easy tool to teach them if they do not know it already. Even for someone who is very advanced in a scripting language, it is much easier to get an answer with a SQL query than by writing and debugging a script. If I had to choose the single most important technical goal for our data infrastructure, it would be providing a fast, reliable, SQL-based interface to the data.
First we need to select a SQL engine. A big data SQL engine should be fast, reliable, able to process terabytes of data, support modern SQL statements, and integrate well with other tools. Another important feature to pay attention to is support for the Hive metastore.
The Hive metastore stores metadata about tables: the schema of each table and the location of the files that hold its data. When a SQL engine works with a table, it first goes to the Hive metastore and retrieves the schema and file locations. You should be able to switch between SQL engines quickly, without migrating the metadata database, and the Hive metastore is the most common solution for that.
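As a minimal sketch of what that means in practice (the database, table, columns, and HDFS path below are hypothetical), a table registered once in the metastore from Spark becomes visible to every engine attached to the same metastore, including Impala:

```scala
import org.apache.spark.sql.SparkSession

// Spark session backed by the shared Hive metastore.
val spark = SparkSession.builder()
  .appName("register-table-example")
  .enableHiveSupport()
  .getOrCreate()

// Register an external table: the metastore keeps the schema and the file location.
spark.sql("CREATE DATABASE IF NOT EXISTS logs")
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS logs.events (
    ts      BIGINT,
    user_id STRING,
    country STRING)
  STORED AS PARQUET
  LOCATION 'hdfs:///data/logs/events'""")

// Spark SQL resolves the table through the metastore...
spark.sql("SELECT country, count(*) FROM logs.events GROUP BY country").show()

// ...and so can Impala, once it refreshes its metadata cache:
//   impala-shell> INVALIDATE METADATA logs.events;
//   impala-shell> SELECT country, count(*) FROM logs.events GROUP BY country;
```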
For us there were always two major players in this field: Impala and Spark SQL. Apache Drill and Presto are also interesting solutions with their own advantages, but when your resources are limited it's hard to support more than two or three SQL engines and, more importantly, it's difficult to teach every data user when to pick each particular tool.
Impala is fast and does not rely on YARN for resource scheduling. That is both good and bad: queries start faster because they do not need to allocate containers, but without containers it is hard to manage resource quotas.
Spark SQL is significantly slower, but it is stable, better at resource allocation, and it provides more than SQL: it is also an ML framework, a streaming framework, and much more. The biggest mistake we made while building our infrastructure was trying to choose between Spark SQL and Impala. There is no way to make such a choice: you need both.
Impala may win only on speed, but speed is very important: 10 seconds versus 2 minutes per query is the difference between being able to do interactive data analysis or not. When you join multiple tables with 100 terabytes of data, Spark is the obvious choice. When you need to query a snapshot of some MySQL database in Hadoop, the volume of data is not so big, so you can use Impala. We use Impala when we are looking for insights in the data with relatively light queries and need speed for fast iteration, and we switch to Spark for stable execution of heavy queries.
The next important choice we made about SQL in our company is the tool for running the queries. Impala and Spark are engines, not a UI that a human can use directly.
We have Tableau. Tableau, like many BI tools, needs a JDBC/ODBC server. Impala has this functionality out of the box, but for Spark you need to set up the Thrift JDBC/ODBC server.
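To make the "JDBC/ODBC server" part concrete, here is a rough sketch of how a client reaches both engines over JDBC. The host names are placeholders, the Hive JDBC driver is assumed to be on the classpath, and the connection options (ports, auth=noSasl) depend on your setup; the Spark side is the Thrift server started with sbin/start-thriftserver.sh from the Spark distribution.

```scala
import java.sql.DriverManager

// Both servers speak the HiveServer2 protocol, so one driver covers them.
Class.forName("org.apache.hive.jdbc.HiveDriver")

// Spark Thrift JDBC/ODBC server (default port 10000).
val sparkConn = DriverManager.getConnection(
  "jdbc:hive2://spark-thrift-host:10000/default", "analyst", "")

// Impala daemon, which exposes HiveServer2 out of the box (default port 21050).
val impalaConn = DriverManager.getConnection(
  "jdbc:hive2://impala-host:21050/default;auth=noSasl")

val rs = sparkConn.createStatement()
  .executeQuery("SELECT count(*) FROM logs.parquet_datasource")
while (rs.next()) println(rs.getLong(1))

sparkConn.close()
impalaConn.close()
```

Tableau talks to the same endpoints through its ODBC drivers rather than JDBC, but the idea is identical.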
After that step we have an easy solution for users to run SQL: they connect to the Impala or Spark server with a desktop SQL client like SQuirreL. But this solution does not work, for several reasons.
First, you need to set up the SQL client on every user's laptop. In our organization of 80 people that would be at least 30 laptops of people who use our data infrastructure, and it's hard to maintain software on so many individual desktops.
Second, it is easy to mess up the Spark context. You can run a query over a big table with many files and partitions, and the Spark server will hang or run out of memory trying to process it. Spark has bugs, so the context can also break without doing anything heavy. And when one user breaks the Spark context, the server stops working for everyone.
Another problem with desktop SQL clients is the lack of visualization tools, which matter when you work with data.
The last reason is deeper: resource allocation. Neither Impala nor a single Spark server has decent resource allocation functionality. It is easier for Impala, since it is faster and people spend less time waiting.
It's worse for Spark, even with the fair scheduler turned on. Spark and Impala are both bad at resource allocation, but guess what is good at it: Hadoop.
For every user you can define a YARN queue on the cluster and have a single Spark application working in that queue. You can set a quota on the queue and define quotas hierarchically by user, department, and organization. So you somehow need one Spark context per user, as sketched below.
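As an illustration (the queue name is hypothetical, and the quotas themselves are defined in the YARN scheduler configuration), the only thing the per-user Spark application has to do is submit itself into that user's queue:

```scala
import org.apache.spark.sql.SparkSession

// One long-lived Spark application for one user, pinned to a hierarchical
// YARN queue; the queue's capacity caps how much of the cluster this user takes.
val spark = SparkSession.builder()
  .appName("sql-notebook-alice")
  .master("yarn")
  .config("spark.yarn.queue", "analytics.alice") // department.user queue
  .enableHiveSupport()
  .getOrCreate()
```

In Zeppelin the same spark.yarn.queue property can be set per instance in the Spark interpreter settings.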
The perfect solution for all the problems I've described is Apache Zeppelin. It's a web-based, notebook-based interactive data analytics tool. If you are familiar with IPython or Jupyter, it's very similar.
It's web based, so you do not have to install anything on users' machines. It has nice visualization. It works with both Impala and Spark. And you can run a Zeppelin per user or per department: every Zeppelin instance has its own Spark application and its own queue in Hadoop, so when a user writes a heavy query and breaks their Spark application, it does not bother anyone else.
The latest version of Zeppelin, which became available around this August, has authentication and a new Livy interpreter, which lets you separate Zeppelin from the Spark context it works with. But it's all rather new, and we haven't had a chance to use it yet.
If you are not ready to try the Livy interpreter, here is a tip about the old way: do not put multiple Zeppelin instances on a single machine. Every Zeppelin's driver needs RAM for caching table metadata and for working with query results, so put them on separate virtual machines. In our case 16 GB of RAM per machine is enough.
Let’s talk about storing the data.
The requirements for the format in which we store data are quite contradictory. We want the format to be fast, like Parquet, since we query it a lot. We need a schema, so people can understand the nature of a datasource without additional help. We need the flexibility to change the data often: with every release, data producers start to report new fields. And of course we need access to the raw data, in case something rare was reported that never made it into the schema.
The compromise that satisfies all these requirements is using JSON and Parquet simultaneously. A data producer generates simple flat JSON and sends it to Hadoop. We store it as is and have a Hive table that wraps this datasource. Every night we transform the previous day's JSON data into Parquet, leaving the JSON data unchanged. This Parquet data is, of course, part of a Hive table as well.
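A sketch of the raw side of this setup, with hypothetical names and assuming a SparkSession `spark` with Hive support as in the earlier examples: the table keeps each JSON record as a single string, so nothing the producers send is ever lost.

```scala
// Raw JSON landing table: one string column per record, partitioned by day.
spark.sql("CREATE DATABASE IF NOT EXISTS logs")
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS logs.json_datasource (line STRING)
  PARTITIONED BY (dt STRING)
  LOCATION 'hdfs:///data/logs/json'""")

// A new daily directory becomes queryable once its partition is registered.
spark.sql("ALTER TABLE logs.json_datasource ADD IF NOT EXISTS PARTITION (dt='2016-10-01')")

// Rare fields that never made it into any schema stay reachable:
spark.sql("""
  SELECT get_json_object(line, '$.some_rare_field') AS value, count(*) AS cnt
  FROM logs.json_datasource
  GROUP BY get_json_object(line, '$.some_rare_field')""").show()
```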
If I need to query some important, long-lived fields, I query the Parquet data with something like "select some_field from logs.parquet_datasource", and it takes 136 seconds. The same query against the JSON datasource looks like "select get_json_object(line, '$.some_field') from logs.json_datasource" and takes 764 seconds: it is both slower and requires looking up the schema somewhere. Most of the time people query a narrow set of fields from the schema, but when they query something rare, five times slower is not so bad. Another thing to keep in mind is that Impala does not work with JSON.
While setting up this infrastructure we made an interesting mistake. We assumed that every field in the JSON was probably important and amended the schema of the Parquet table whenever we saw a new field. It was a mistake, because data producers tend to have bugs and put thousands of irrelevant fields into the JSON, which made our schema far too big. After that we put the following protocol in place:
The nightly Spark job that transforms JSON to Parquet takes the schema from the Hive table of the Parquet datasource and uses it while extracting data from the JSON. This keeps the created Parquet files and the Hive schema consistent. When data producers start to report a new field, it does not affect the schema or size of the Parquet datasource until I manually amend the schema of the Parquet table. This defends our infrastructure against garbage fields.
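Here is a minimal sketch of that nightly job, using the same hypothetical names and assuming the Parquet table is partitioned by day. The key line is taking the schema from the existing table instead of inferring it from the JSON, so unknown fields are silently dropped.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

object JsonToParquet {
  def main(args: Array[String]): Unit = {
    val day = args(0) // previous day, e.g. "2016-10-01"

    val spark = SparkSession.builder()
      .appName(s"json-to-parquet-$day")
      .enableHiveSupport()
      .getOrCreate()

    // Schema comes from the Parquet table itself (minus the partition column),
    // so a new field only appears after someone amends the table schema by hand.
    val dataSchema = StructType(
      spark.table("logs.parquet_datasource").schema.filterNot(_.name == "dt"))

    spark.read.schema(dataSchema)
      .json(s"hdfs:///data/logs/json/dt=$day")
      .write.mode("overwrite")
      .parquet(s"hdfs:///data/logs/parquet/dt=$day")

    // Register the new partition so the table (and Impala, after a REFRESH) sees it.
    spark.sql(s"ALTER TABLE logs.parquet_datasource ADD IF NOT EXISTS PARTITION (dt='$day')")
  }
}
```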
Another thing we tried to avoid for a long time is creating tables with aggregated data.
The big data stack gives you the ability to query terabytes of data in a reasonable time, but that does not mean you can query all of it daily to build reports.
Imagine you have some old-fashioned BI tool that was designed to access data sources with sub-second query latency, and you try to use it against a Hadoop stack where querying a month of data can take minutes. The first naive approach is to put the queries into Tableau anyway. It won't work properly: BI visualization systems are about visualization, not job scheduling, and when you make one run 30-minute extracts, you are making it do job scheduling. That is bad, because a job scheduling tool should be good at quotas, priorities, retries, and logging problems with jobs, and BI tools are bad at all of that. Another problem is that all queries go to a single Spark server, which becomes an additional single point of failure. Yet another problem: we already have thousands of lines of query code that prepares data for our reports. It's SQL, but it's code, and like any code it needs refactoring, extraction of common pieces into separate abstractions, and a version control system. None of that is possible when your code lives inside a BI tool: even a slight change in how we work with the data would require downloading the report, changing the code inside it, and uploading it back. You cannot manage tens of reports that way.
So we put all our report-preparation queries into git, wrote a simple Spark job that takes the SQL queries and executes them, and scheduled this job to run daily. It significantly improved the report development process. An analyst first experiments with queries in a notebook and then commits them to git to run daily; no additional support from the engineering team is required, since the analyst can commit to git himself. Then he just creates a report in Tableau based on the aggregated table, and the query he uses in Tableau is simply "Select * from aggregates.some_table".
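The job itself can stay very small. A sketch, assuming the repository with the *.sql files is checked out to a local directory such as /opt/aggregates/queries and that each file holds a single INSERT ... SELECT statement:

```scala
import java.io.File
import org.apache.spark.sql.SparkSession

// Daily runner for the aggregation queries kept in git.
object RunAggregates {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("daily-aggregates")
      .enableHiveSupport()
      .getOrCreate()

    val queryDir = new File("/opt/aggregates/queries")
    val queryFiles = queryDir.listFiles().filter(_.getName.endsWith(".sql")).sortBy(_.getName)

    queryFiles.foreach { file =>
      val source = scala.io.Source.fromFile(file)
      val sql = try source.mkString.trim.stripSuffix(";") finally source.close()
      println(s"Running ${file.getName}")
      spark.sql(sql) // each file rebuilds or extends one aggregated table
    }
  }
}
```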
When he needs to change the way data is processed for the report, he only needs to change the query in git.
These changes significantly sped up the report development cycle. But at first we executed queries daily that processed the whole history, and that became an issue: creating all the aggregated tables took the cluster 17 hours. So at some point we started keeping two queries for every aggregated table: one that processes the whole history, and one that processes only the previous day and inserts the result into the aggregated table. The daily run time dropped to 3 hours.
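As a hypothetical example of such a pair (the table and columns are made up, and `spark` is a SparkSession with Hive support as in the earlier sketches):

```scala
// Full rebuild over the whole history: run only when the logic changes.
spark.sql("""
  INSERT OVERWRITE TABLE aggregates.daily_users
  SELECT dt, country, count(DISTINCT user_id) AS users
  FROM logs.parquet_datasource
  GROUP BY dt, country""")

// Daily increment: run every night, touches only the previous day.
spark.sql("""
  INSERT INTO TABLE aggregates.daily_users
  SELECT dt, country, count(DISTINCT user_id) AS users
  FROM logs.parquet_datasource
  WHERE dt = date_format(date_sub(current_date(), 1), 'yyyy-MM-dd')
  GROUP BY dt, country""")
```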
I've talked about the major decisions; now I want to share a few smaller tips.
We never use the Spark that comes with the Hadoop installation; we always build our own Spark from source and use it for all processes. There are several reasons for that. We applied some patches for better handling of JSON data, so we need to build Spark from source to include them. Sometimes we move to a newer version of Spark for essential functionality or bug fixes and cannot wait for the official release. And when we move to a newer version, different components move at different speeds: Zeppelin might already be on 2.0 while the data transformation jobs stay on 1.6 until we verify that the new version has no bugs for our type of data and our type of queries.