Egor Pakhomov
Data Architect, AnchorFree
egor@anchorfree.com
Data infrastructure architecture for a medium-size organization:
tips for collecting, storing, and analyzing data.
Data infrastructure architecture

Medium organization (<500 people) vs. big organization (>500 people):
• DATA CUSTOMERS: >10 vs. >100
• DATA VOLUME: "Big data" vs. "Big data"
• DATA TEAM (PEOPLE) RESOURCES: enough to integrate and support some open-source stack vs. enough to write our own data tools
• FINANCIAL RESOURCES: enough to buy hardware for a Hadoop cluster vs. enough to buy some cloud solution (Databricks cloud, Google BigQuery...)
HOW TO MANAGE BIG DATA WHEN YOU ARE NOT THAT BIG?
About me
Data architect at AnchorFree
Spark contributor since 0.9
Integrated Spark into Yandex Islands; worked at Yandex Data Factory
Participated in the development of "Alpine Data", a Spark-based data platform
Agenda
1. Data Querying: Why SQL is important and how to use it in Hadoop
   • SQL vs R/Python
   • Impala vs Spark
   • Zeppelin vs SQL desktop client
2. Data Storage: How to store data to query it fast and change easily
   • JSON vs Parquet
   • Schema vs schema-less
3. Data Aggregation: How to aggregate your data to work better with BI tools
   • Aggregate your data!
   • SQL code is code!
1. Data Querying
Why SQL is important and how to use it in Hadoop
1. SQL vs R/Python
2. Impala vs Spark
3. Zeppelin vs SQL desktop client
(Diagram: SQL sits at the center, used by BI, analysts, QA, and regular data transformations)
What do you need from an SQL engine?
• Fast
• Reliable
• Able to process terabytes of data
• Supports the Hive metastore
• Supports modern SQL statements
Hive metastore role
The Hive Metastore maps table names to the HDFS files that hold their data:
table_1 -> file341, file542, file453
table_2 -> file457, file458, file459
table_3 -> file37, file568, file359
table_4 -> file3457, file568, file349
...
(Diagram: the driver of the SQL engine first asks the Hive Metastore for a table's schema and file locations, then the executors read those files from HDFS)
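A minimal sketch of that lookup from the Spark side, assuming Spark 2.x with Hive support; table_1 is just the placeholder name from the diagram. Any engine that talks to the same metastore resolves the table to the same schema and HDFS files.

import org.apache.spark.sql.SparkSession

// Connect to the shared Hive metastore; the metastore, not the engine, owns the
// mapping from table names to schemas and HDFS file locations.
val spark = SparkSession.builder()
  .appName("metastore-lookup-demo")
  .enableHiveSupport()
  .getOrCreate()

val df = spark.table("table_1")   // schema and file list are fetched from the metastore
df.printSchema()
spark.sql("SELECT COUNT(*) FROM table_1").show()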
Which one would you choose? Both!

                                    SparkSQL   Impala
SUPPORT HIVE METASTORE                 +          +
FAST                                   -          +
RELIABLE (WORKS NOT ONLY IN RAM)       +          -
JSON SUPPORT                           +          -
HIVE COMPATIBLE SYNTAX                 +          -
OUT OF THE BOX YARN SUPPORT            +          -
MORE THAN JUST A SQL FRAMEWORK         +          -
Step 1: Connect Tableau to Hadoop
(Diagram: Tableau → ODBC/JDBC server → Hadoop)
Step 2: Give SQL to users
(Diagram: desktop SQL clients → the same ODBC/JDBC server → Hadoop)
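For Impala the JDBC/ODBC endpoint comes out of the box; for Spark the usual way is the Thrift JDBC/ODBC server, normally started with sbin/start-thriftserver.sh. As a hedged sketch (assuming Spark 2.x and the default HiveServer2 port), it can also be started from code so the BI tool shares one session:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val spark = SparkSession.builder()
  .appName("sql-endpoint")
  .enableHiveSupport()
  .getOrCreate()

// Expose this session over the HiveServer2 JDBC/ODBC protocol so Tableau or a
// desktop SQL client can connect to it.
HiveThriftServer2.startWithContext(spark.sqlContext)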
Would not work...
1. Managing a desktop application on N laptops
2. One Spark context shared by many users
3. Lack of visualization
4. No decent resource scheduling

No decent resource scheduling: one user blocks everyone.
No decent resource scheduling in the engines, but Hadoop is good at resource scheduling!
Apache Zeppelin is our solution
1. Web-based
2. Notebook-based
3. Great visualisation
4. Works with both Impala and Spark
5. Has a cloud offering with support: Zeppelin Hub from NFLabs
It's great!

Apache Zeppelin integration
(Diagram: one Zeppelin instance per user or department, each with its own Spark application and its own queue in Hadoop)
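As a rough illustration (not from the deck), here is what a paragraph in one of those per-user Zeppelin instances might look like, assuming Spark 2.x; logs.parquet_datasource is the table used later in the storage section, while the dt and user_id columns are made up. Each Zeppelin runs its own Spark application in its own YARN queue, and z.show renders the result with Zeppelin's built-in charts.

// Hypothetical %spark paragraph in a per-user / per-department Zeppelin instance
val daily = spark.sql("""
  SELECT dt, COUNT(DISTINCT user_id) AS active_users
  FROM logs.parquet_datasource
  GROUP BY dt
  ORDER BY dt
""")
z.show(daily)   // Zeppelin's ZeppelinContext turns the result into a table or chart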
2. Data Storage
How to store data to query it fast and change easily
1. JSON vs Parquet
2. Schema vs schema-less
What would you need from data storage?
• Flexible format
• Fast querying
• Access to "raw" data
• Has a schema
Can we choose just one data format? We need both!

                         JSON   Parquet
FLEXIBLE                  +
ACCESS TO "RAW" DATA      +
FAST QUERYING                      +
HAVE SCHEMA                        +
IMPALA SUPPORT                     +
Let's compare elegance and speed:

FORMAT    QUERY                                                     QUERY TIME
Parquet   SELECT Sum(some_field) FROM logs.parquet_datasource       136 sec
JSON      SELECT Sum(Get_json_object(line, '$.some_field'))
          FROM logs.json_datasource                                  764 sec

Parquet is 5 times faster!
But when you need raw data, 5 times slower is not that bad.
How data in these formats compare

JSON:
{
  "First name": "Mike",
  "Last name": "Smith",
  "Gender": "Male",
  "Country": "US"
}
{
  "First name": "Anna",
  "Last name": "Smith",
  "Age": "45",
  "Country": "Canada",
  "Comments": "Some additional info"
}
...

Parquet:
FIRST NAME   LAST NAME   GENDER   AGE
Mike         Smith       Male     NULL
Anna         Smith       NULL     45
...          ...         ...      ...
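A minimal sketch of the nightly JSON-to-Parquet job described in the speaker notes, assuming Spark 2.x; the HDFS paths and the date value are assumptions, while logs.parquet_datasource is the table from the example above. Reading the JSON with the Parquet table's Hive schema is what keeps accidental "trash" fields from data producers out of the columnar data.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("json-to-parquet-nightly")
  .enableHiveSupport()
  .getOrCreate()

val day = "2016-06-01"                                        // normally: yesterday's date
val schema = spark.table("logs.parquet_datasource").schema    // schema owned by the Hive table

spark.read
  .schema(schema)                          // only fields present in the table schema are extracted
  .json(s"/data/raw/events/dt=$day")       // hypothetical location of the raw JSON
  .write
  .mode("overwrite")                       // idempotent re-runs for the day's partition
  .parquet(s"/data/warehouse/parquet_datasource/dt=$day")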
3. Data Aggregation
How to aggregate your data to work better with BI tools
1. Aggregate your data!
2. SQL code is code!
Aggregate your data!
● "Big data" does not mean you need to query all the data daily
● BI tools should not do big queries
How aggregation works
(Diagram: SQL queries kept in git → query executor → aggregated table → BI tool runs "select * from ...")
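A minimal sketch of such a query executor (the paths, object name and scheduling are assumptions, not the exact AnchorFree job): a small Spark application that walks a directory of .sql files checked out from git and runs them one by one, so each file can build or refresh one aggregated table.

import java.nio.file.{Files, Paths}
import scala.collection.JavaConverters._
import org.apache.spark.sql.SparkSession

object AggregationRunner {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("daily-aggregations")
      .enableHiveSupport()
      .getOrCreate()

    // Directory with SQL files, checked out from git by the scheduler before each run.
    val queryDir = Paths.get(args.headOption.getOrElse("/opt/aggregations/sql"))
    val sqlFiles = Files.list(queryDir).iterator().asScala
      .filter(_.toString.endsWith(".sql"))
      .toSeq
      .sortBy(_.toString)                  // deterministic execution order

    for (file <- sqlFiles) {
      val query = new String(Files.readAllBytes(file), "UTF-8")
      spark.sql(query)                     // e.g. INSERT OVERWRITE TABLE aggregates.some_table SELECT ...
    }
    spark.stop()
  }
}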
Report development process
1. Create the aggregated table in Zeppelin
2. Add the queries to git to run daily
3. Create a BI report based on this table
4. Publish the report

Data change process for a report:
1. Change the query in git
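The speaker notes add that rebuilding every aggregated table over the whole history took about 17 hours of cluster time, so each table eventually gets two queries in git: a full rebuild and a daily incremental insert, which brought the daily run down to about 3 hours. A hedged sketch of such a pair, with made-up columns on the aggregates.some_table and logs.parquet_datasource tables mentioned elsewhere in the deck:

// Full-history rebuild: run only when the query logic changes.
spark.sql("""
  INSERT OVERWRITE TABLE aggregates.some_table
  SELECT dt, country, COUNT(DISTINCT user_id) AS users
  FROM logs.parquet_datasource
  GROUP BY dt, country
""")

// Daily incremental run: process only the previous day and append it.
val yesterday = "2016-06-01"   // normally computed from the current date
spark.sql(s"""
  INSERT INTO TABLE aggregates.some_table
  SELECT dt, country, COUNT(DISTINCT user_id) AS users
  FROM logs.parquet_datasource
  WHERE dt = '$yesterday'
  GROUP BY dt, country
""")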
One more tip: we do not use the Spark that comes with the Hadoop installation
1. We need to apply our own patches to the source code
2. We move to new versions before any official release
3. We move part of the infrastructure to a new version while the rest remains on the old one
Questions?
Contact:
Egor Pakhomov
egor@anchorfree.com
pahomov.egor@gmail.com
https://www.linkedin.com/in/egor-pakhomov-35179a3a
Speaker notes

  1. Hi, my name is Egor, I'm a data architect at AnchorFree, and I'd like to tell you about the lessons we've learned while building our data infrastructure.
  2. Working with big data in small and medium-size organizations is different from doing it in big ones. You still have a large amount of data to process and you still have a lot of people who work with that data. But you do not have a dedicated team to build your own data tools, and you do not have the financial resources to go shopping for an existing cloud solution. Such constraints force you to use every bit of technology that already exists in the open-source Hadoop stack.
  3. So how do you manage big data when you are not that big?
  4. About me: I've worked on data infrastructure at AnchorFree, and at Yandex Data Factory before that. I've been a Spark contributor since 0.9.
  5. Today I'll talk about our approach to solving three big problems in the big data area: how to query data, how to store it, and how to aggregate it efficiently.
  6. Let's start with querying.
  7. For us, the core of data processing is SQL. Today many companies invest in R or Python support, but that makes sense for companies where SQL is already adopted. If you want to hire a data analyst, it's ten times easier to find a person who is good at SQL than one who is good at Python or R. Business people are used to BI tools like Tableau, which require SQL access to data. QA engineers in our company use data to verify that functionality works correctly for real users, and SQL is an easy tool to teach them if they do not know it already. Even for a person who is very advanced in a scripting language, it is much easier to get an answer with a SQL query than by writing and debugging a script. If I had to choose the single most important technical goal for our data infrastructure, it would be providing a fast, reliable SQL-based interface to the data.
  8. First we need to select an SQL engine. A big data SQL engine should be fast, reliable, able to process terabytes of data, support modern SQL statements, and integrate well with other tools. Another important feature to pay attention to is support for the Hive metastore.
  9. The Hive metastore is a store for meta information about tables. This information includes the schema of a table and the location of the files holding its data. When an SQL engine works with a table, it first goes to the Hive metastore and retrieves the schema and file locations. You should be able to switch between SQL engines in no time, without migrating the meta information database, and the Hive metastore is the most common solution for that.
  10. For us there have always been two major players in this field: Impala and Spark SQL. Other interesting solutions are Apache Drill and Presto; they have their own advantages, but when your resources are limited it's hard to support more than two or three SQL engines, and, more importantly, it's difficult to teach every data user when to pick each particular tool. Impala is fast and does not rely on YARN for resource scheduling. That is both good and bad at the same time: queries start faster because they do not need to allocate a container, but without containers it is hard to manage resource quotas. Spark SQL is significantly slower, but stable, better at resource allocation, and it provides more than just SQL: it is also an ML framework, a data streaming framework, and much more. The biggest mistake we made while building our infrastructure was trying to choose between Spark SQL and Impala. There is no way to make such a choice; you need to use both. Impala may win only on speed, but speed is very important: 10 seconds to execute a query versus 2 minutes is the difference between being able to do interactive data analysis or not. When you join multiple tables with 100 terabytes of data, Spark is the obvious choice. When you need to query a snapshot of some MySQL database in Hadoop, the volume of data is not that big, so you can use Impala. We use Impala when we are looking for insights in the data with relatively light queries and need speed for fast iteration, and we switch to Spark for stable execution.
  11. The next important choice we made about SQL in our company is the tool used to run the SQL queries. Impala and Spark are engines, not a UI that a human can use directly. We have Tableau, and Tableau, like many BI tools, needs a JDBC/ODBC server. Impala has this functionality out of the box, but for Spark you will need to set up the Thrift JDBC/ODBC server.
  12. After that step we have an easy solution for users to run SQL: they connect to the Impala or Spark server with a desktop SQL client like SQuirreL. But this solution would not work, for several reasons.
  13. The first reason: you need to set up an SQL client on every user's laptop. In our organisation of 80 people that would be at least 30 laptops of users who work with our data infrastructure, and it's hard to maintain software on so many individual desktops. The second reason is that it is easy to mess up the Spark context: you can run a query over a big table with many files and partitions and the Spark server will hang, OOM-like, trying to process it. Spark has bugs, so the context can break even without doing anything heavy, and when a user breaks the Spark context, the server stops working for everyone. Another problem with desktop SQL clients is the lack of any visualization tools, which matter when you work with data. The last reason is much deeper: resource allocation. There is no decent resource allocation functionality in Impala or in a single Spark server. It is easier for Impala, since it is faster and people spend less time waiting.
  14. It's worse for Spark, even with the fair scheduler turned on. Spark and Impala are both bad at resource allocation, but guess what is good at it: Hadoop.
  15. For every user you can define a queue on the cluster and have a single Spark application working in that queue. You can set a quota on this queue and define quotas hierarchically by user, department and organization. So you somehow need to have one Spark context per user.
  16. The perfect solution for all the problems I've described is Apache Zeppelin. It's a web-based, notebook-based interactive data analytics tool. If you are familiar with IPython or Jupyter, it's very similar.
  17. It's web-based, so you do not have to install anything on users' machines. It has nice visualisation. It works with both Impala and Spark. And you can have a Zeppelin per user or per department: every separate Zeppelin has its own Spark application and a separate queue in Hadoop, so when a user writes a heavy query and breaks the Spark application, it does not bother anyone else.
  18. The latest version of Zeppelin, which became available around this August, has authentication and a new Livy interpreter, which lets you separate Zeppelin from the Spark context it works with. But it's all rather new and we haven't had a chance to use it. If you are not ready to try the Livy interpreter, here is a tip about the old way: do not put multiple Zeppelin instances on a single machine. Every driver in every Zeppelin requires RAM for caching table meta information and RAM for working with query results, so separate them into separate virtual machines. In our case 16 GB of RAM per machine is enough.
  19. Let's talk about storing the data.
  20. The requirements for the format in which we store data are quite contradictory. We want the format to be fast, like Parquet, since we will query it a lot. We need a schema, so people can understand the nature of a datasource without additional help. We need the flexibility to change the data often: with every release, data producers start reporting new fields. And of course we need access to the raw data in case something very rare was reported and such rare events did not make it into the schema.
  21. The compromise for all these requirements is the simultaneous use of JSON and Parquet. A data producer generates simple flat JSON and sends it to Hadoop. We store it as it is and have a Hive table which wraps this datasource. Nightly, we transform the previous day's JSON data into Parquet format, leaving the JSON data unchanged. Of course this Parquet data is part of a Hive table as well.
  22. If I need to query important, long-lived fields, I query the Parquet data with something like "select some_field from logs.parquet_datasource". If I run the same query against the JSON datasource, it looks like "select properties['some_field'] from logs.json_datasource", which is both slower and requires looking up the schema somewhere. Most of the time people query a narrow set of fields from the schema, but when they query something rare, five times slower is not so bad. Another thing to keep in mind is that Impala does not work with JSON.
  23. While setting up this infrastructure we made an interesting mistake. We thought that all fields in the JSON were probably important and started amending the schema of the Parquet table whenever we saw a new field. It was a mistake, because data producers tend to have bugs and put thousands of irrelevant fields into the JSON, so our schema became too big. After that we put the following protocol in place: the nightly Spark job that transforms JSON to Parquet takes the schema from the Hive table for this Parquet datasource and uses that schema while extracting data from the JSON. This keeps the created Parquet files and the Hive schema consistent. When data producers start reporting some new field, it does not affect the schema or the size of the Parquet datasource until I manually amend the schema of the Parquet table. This defends our infrastructure against trash fields.
  24. Another thing we tried to avoid for a long time: creating tables with aggregated data.
  25. The big data stack gives you the opportunity to query terabytes of data in a reasonable time, but that doesn't mean you can query it all daily to build reports. Imagine you have an old-fashioned BI tool designed to access data sources with sub-second query latency, and then you try to use it against the Hadoop stack, where querying a month of data can take minutes. The first naive approach is to put the queries into Tableau anyway. It doesn't work properly: BI visualisation systems are about visualisation, not job scheduling, and when you make one execute 30-minute extracts, you make it do job scheduling. That's bad, because a job scheduling tool should be good at keeping quotas, priorities and retries and at logging problems with jobs, and BI tools are bad at that. Another problem: all queries go to a single Spark server, which is an additional single point of failure. Another problem: we already have thousands of lines of code for the queries which get the data for our reports. It's SQL code, but it's code anyway. Like any code it requires refactoring, extracting common pieces into separate abstractions, and a version control system to work with it properly. All of that is impossible when your code lives in a BI tool: even a slight change in the way we work with the data would require downloading a report, changing the code inside it and uploading it back. It's impossible to manage tens of reports this way.
  26. So we put all our data-preparation queries for reports into git. We wrote a simple Spark job which takes the SQL queries and executes them, and we scheduled this job to run daily. It significantly improved the report development process. An analyst first experiments with queries in his notebook and then puts these queries into git to run daily. No additional support is required from the engineering team, since the analyst can commit to git himself. Then he just creates a report in Tableau based on the aggregated table, and the query he uses in Tableau is just "Select * from aggregates.some_table".
  27. When he needs to change the way data is processed for this report, he only needs to change the query in git. Such changes significantly speed up the report development cycle. But at first we executed queries daily that processed the whole history, and that became an issue: creating all the aggregated tables takes the cluster 17 hours. That's why at some point we started to keep two queries for every aggregated table: one that processes the whole history and another that processes only the previous day and inserts the data into the aggregated table. The daily run time dropped to 3 hours.
  28. I've talked about the major decisions to make, and I want to share some smaller tips. We never use the Spark that comes with the Hadoop installation. We always build our own Spark from source and use it for all processes. We have several reasons for that: we applied some patches for better work with JSON data, so we need to build Spark from source to include them. Sometimes we move to a newer version of Spark because of essential functionality or bug fixes and cannot wait for the official release. And when we move to a newer version of Spark, different components move at different speeds: Zeppelin might already be on 2.0 while data transformation stays on 1.6 until we verify that the new version has no bugs for our type of data and our type of queries.