How The Weather Company® Uses Spark to Serve Weather Data Faster and Cheaper
Erik Goepfert and Paula Ta-Shma
IBM TWC and IBM Research
By Leveraging IBM Cloud® SQL Query and Cloud Object Storage
The Weather Company started with a
simple mission to
decisions
solutions
Map the atmosphere every 15 minutes
Process over 400 terabytes of data daily
Deliver more than 50 billion requests for weather information every day and produce 25 billion forecasts daily
Source: Qliksense internal report, April 2017; According to internal forecasting system + # of locations in the
world by Lat Lon locations (2 decimal places); 400 terabytes according to internal SUN platform numbers
And has evolved into
Source: ForecastWatch, Three Region Accuracy Overview, 2010-2017, https://www.forecastwatch.com/static/Three_Region_Accuracy_Overview_2010-2017.pdf
History on Demand Conditions (HoD)
Provides access to a worldwide, hourly, high-resolution, gridded dataset of past weather conditions via a web API
Global 4 km grid
0.044-degree resolution
34 potential weather properties
34 million records added every hour
Geospatial and temporal search
Point, bounding box, and polygon search over a time range
Usage
Averages 600,000 requests per day
Used by clients primarily for machine learning and data analytics
Supports research in domains such as climate science, energy &
utilities, agriculture, transportation, insurance, and retail
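As a rough sanity check on the grid figures above, the hourly record count can be estimated from the resolution alone. This is a back-of-the-envelope sketch that assumes a regular latitude/longitude grid covering the whole globe; the actual HoD grid definition is not given in the slides, and the function name is illustrative.

```python
def estimate_grid_points(resolution_deg: float) -> int:
    """Estimate the number of cells in a global lat/lon grid (assumed regular)."""
    lon_cells = round(360 / resolution_deg)  # cells around the globe
    lat_cells = round(180 / resolution_deg)  # cells from pole to pole
    return lon_cells * lat_cells


points = estimate_grid_points(0.044)
print(f"{points:,} grid points per hourly snapshot")
```

Under these assumptions the estimate lands in the same range as the 34 million records per hour quoted above.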
Problems with our previous solution
▪ Expensive
▪ Our synchronous data access solution is expensive
▪ Limited storage capacity
▪ Hard storage limits per cluster with our previous cloud provider and storage solution
▪ We have reduced the data we provide
▪ To lower cost and stay below the storage limit, we've reduced our data to land only, and 20 of the available 34 weather properties
▪ Clients are limited to small requests
▪ To allow for a synchronous interaction, clients are required to limit the scope of their requests to 2,400 records
▪ Slow at retrieving large amounts of data
▪ Because of the small query sizes, it is time consuming to retrieve large amounts of data
Our new asynchronous solution
▪ More cost-effective
▪ Our use of IBM Cloud SQL Query and Cloud Object Storage has resulted in an order of magnitude reduction in cost
▪ Unlimited storage
▪ With Cloud Object Storage we effectively have an unlimited storage capacity
▪ Global weather data coverage with all 34 weather properties
▪ With the reduced cost and unlimited storage we no longer have to limit the data we provide
▪ Support for large requests
▪ With an asynchronous interaction, clients can now submit a single request for everything they're interested in
▪ Large amounts of data retrieved quickly with a single query
▪ Because we can rely on IBM Cloud SQL Query using Spark behind the scenes, large queries complete relatively quickly
Solution Overview
Serverless approach
▪ Pay per use -> Low cost
IBM Cloud SQL Query
▪ Serverless SQL powered by Spark
IBM Cloud Object Storage (COS)
▪ S3 Compatible API
Apply Best Practices
▪ Parquet
▪ Geospatial Data Layout
IBM Cloud SQL Query
▪ Serverless SQL service built on Apache Spark
▪ Supports all Spark native data formats e.g. Parquet, ORC, CSV, Avro, JSON
▪ Intuitive UI, no set up/installation required
▪ Integrated with Watson Studio notebooks, Cloud Functions and supports REST APIs
▪ Query and ETL data on COS directly
▪ Also supports Db2
▪ High Performance
▪ Built-in Catalog – multi-tenant Hive Metastore
▪ Data Skipping indexes
▪ Low Cost
▪ $5/TB scanned
▪ Skip over irrelevant data using Catalog and Data Skipping
▪ Zero standing cost
SQL Query Catalog
▪ Multi-tenant Hive Metastore (HMS)
▪ Critical to achieve high performance for Big Data
1. Spark SQL leverages HMS for partition pruning – avoid reading irrelevant partitions
▪ More flexible than the Hive Style Partitioning naming convention
2. Significantly reduces time spent on object listing
▪ HMS enables listing partitions in parallel – sequential listing can be very slow
3. Stores schema and statistics for Cost Based Optimization
▪ Useful to maintain consistency
▪ Can consistently replace an entire partition
▪ Needed for changing the data layout
▪ Future: use Delta/Iceberg/Hudi format for this
Weather/dt=2020-08-17/part-00085.parquet
Weather/dt=2020-08-17/part-00086.parquet
Weather/dt=2020-08-17/part-00087.parquet
Weather/dt=2020-08-17/part-00088.parquet
Weather/dt=2020-08-18/part-00001.parquet
Weather/dt=2020-08-18/part-00002.parquet
HMS holds partition metadata (Partition MD) for each dt partition
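The partition pruning idea can be sketched in a few lines. The snippet below is a simplified illustration, not the SQL Query or Hive Metastore implementation: it parses Hive-style `dt=...` partition values out of the object names shown above and keeps only the objects whose partition satisfies a predicate. The helper names are hypothetical.

```python
from typing import Callable, Dict, List

OBJECTS = [
    "Weather/dt=2020-08-17/part-00085.parquet",
    "Weather/dt=2020-08-17/part-00086.parquet",
    "Weather/dt=2020-08-17/part-00087.parquet",
    "Weather/dt=2020-08-17/part-00088.parquet",
    "Weather/dt=2020-08-18/part-00001.parquet",
    "Weather/dt=2020-08-18/part-00002.parquet",
]


def partition_values(object_name: str) -> Dict[str, str]:
    """Extract key=value partition segments from a Hive-style object name."""
    values = {}
    for segment in object_name.split("/")[:-1]:  # skip the file name itself
        if "=" in segment:
            key, value = segment.split("=", 1)
            values[key] = value
    return values


def prune(objects: List[str], predicate: Callable[[Dict[str, str]], bool]) -> List[str]:
    """Keep only objects whose partition values satisfy the predicate."""
    return [name for name in objects if predicate(partition_values(name))]


# Equivalent of: WHERE dt = '2020-08-18'
selected = prune(OBJECTS, lambda p: p.get("dt") == "2020-08-18")
print(selected)
```

Note that this sketch still has to list and parse every object name. A catalog avoids that step by looking partition locations up in metadata, which is the object listing saving described above.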
Geospatial Analytics in SQL Query
▪ Supports geospatial operations and data types
- According to the SQL/MM standard
- Aggregation, computation and join via native SQL syntax
▪ Geodetic Full Earth support
- Increased developer productivity
▪ Avoid piece-wise planar projections
- High precision calculations anywhere on earth
- Very large polygons e.g. countries, polar caps etc.
▪ Native geohash support
- Fine granularity
- Fast spatial aggregation
▪ Geospatial Data Skipping
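For readers unfamiliar with geohashes, the following is a minimal sketch of the standard public geohash encoding, which interleaves longitude and latitude bits and writes them in base 32. It is a generic illustration and not the SQL Query implementation; the function name is an assumption.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat: float, lon: float, precision: int = 6) -> str:
    """Encode a latitude/longitude pair as a geohash of the given length."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, nbits = 0, 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        nbits += 1
        if nbits == 5:  # every 5 bits become one base-32 character
            chars.append(_BASE32[bits])
            bits, nbits = 0, 0
    return "".join(chars)


print(geohash_encode(35.90, -78.93, precision=5))
```

Each additional character narrows the cell, and a shorter geohash is always a prefix of a longer one for the same point. Grouping rows by a geohash prefix is therefore a cheap way to aggregate spatially.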
Data Skipping in SQL Query
▪ Avoid reading irrelevant objects using indexes
▪ Complements partition pruning -> object level pruning
▪ Stores aggregate metadata per object to enable skipping decisions
▪ Indexes are stored in COS
▪ Supports multiple index types
▪ Currently MinMax, ValueList, BloomFilter, Geospatial
▪ Underlying data skipping library is extensible
▪ New index types can easily be supported
▪ Enables data skipping for queries with UDFs
▪ e.g. ST_Contains, ST_Distance etc.
▪ UDFs are mapped to indexes
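To illustrate what an extensible set of index types might look like, here is a simplified sketch of two of the index types named above, MinMax and ValueList. Each index stores a small summary per object and answers one question: could this object contain a matching row? The class and method names are hypothetical and do not reflect the actual data skipping library, and the sample values are made up for the example.

```python
from abc import ABC, abstractmethod
from typing import Iterable


class SkippingIndex(ABC):
    """Per-object summary that can rule out objects without reading them."""

    @abstractmethod
    def may_contain(self, value) -> bool:
        """Return False only if the object certainly has no row equal to value."""


class MinMaxIndex(SkippingIndex):
    """Stores the smallest and largest value of a column in one object."""

    def __init__(self, values: Iterable[float]):
        values = list(values)
        self.lo, self.hi = min(values), max(values)

    def may_contain(self, value) -> bool:
        return self.lo <= value <= self.hi


class ValueListIndex(SkippingIndex):
    """Stores the distinct values of a low-cardinality column in one object."""

    def __init__(self, values: Iterable[str]):
        self.values = set(values)

    def may_contain(self, value) -> bool:
        return value in self.values


# Hypothetical sample data for a single object
temps = MinMaxIndex([7.97, 12.5, 26.77])
cities = ValueListIndex(["Raleigh", "Durham"])
print(temps.may_contain(40.0), cities.may_contain("Durham"))
```

A new index type only has to implement the same question, which is one plausible way to read "extensible" here.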
How Data Skipping Works
Spark SQL Query Execution Flow
Without data skipping: Query -> Prune partitions -> Read data
With data skipping: Query -> Prune partitions -> Optional file filter (Metadata Filter) -> Read data
Uses Catalyst optimizer and session extensions API
Data Skipping Example
Weather/dt=2020-08-17/part-00085.parquet
Weather/dt=2020-08-17/part-00086.parquet
Weather/dt=2020-08-17/part-00087.parquet
Weather/dt=2020-08-17/part-00088.parquet
Weather/dt=2020-08-18/part-00001.parquet
Weather/dt=2020-08-18/part-00002.parquet
Data
Object Listing
Example Query
SELECT *
FROM cos://us-geo/twc/Weather STORED AS parquet
WHERE temp > 40
Metadata
Object Name                 Temp Min   Temp Max   ...
...
dt=2020-08-17/part-00085    7.97       26.77
dt=2020-08-17/part-00086    2.45       23.71
dt=2020-08-17/part-00087    6.46       18.62
dt=2020-08-17/part-00088    23.67      41.02
...
Objects shown in red on the slide (those whose Temp Max is below 40) are not relevant to this query
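The skipping decision for this example can be sketched directly from the metadata shown above. This is a minimal illustration of MinMax-based skipping for a `temp > 40` predicate, not the actual implementation; the function name is hypothetical.

```python
from typing import Dict, List, Tuple

# (Temp Min, Temp Max) per object, copied from the metadata above
METADATA: Dict[str, Tuple[float, float]] = {
    "dt=2020-08-17/part-00085": (7.97, 26.77),
    "dt=2020-08-17/part-00086": (2.45, 23.71),
    "dt=2020-08-17/part-00087": (6.46, 18.62),
    "dt=2020-08-17/part-00088": (23.67, 41.02),
}


def objects_for_greater_than(metadata: Dict[str, Tuple[float, float]],
                             threshold: float) -> List[str]:
    """Keep objects that might hold a row with temp > threshold."""
    return [name for name, (_, temp_max) in metadata.items() if temp_max > threshold]


relevant = objects_for_greater_than(METADATA, 40)
print(relevant)
```

Only the object whose maximum exceeds 40 has to be read; the other three listed objects can be skipped without opening them.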
Data layout is important to get good skipping
HoD Data Layout in Production
gcod/v1/
hourly/year=2019/month=2/
20190201T002000Z-part-00000.parquet
20190201T002000Z-part-00001.parquet
…
20190218T232000Z-part-00007.parquet
20190218T232000Z-part-00008.parquet
monthly/year=2019/month=1/
part-00000.parquet
part-00001.parquet
…
part-08191.parquet
HoD Data Layout in Production
Monthly vs. Hourly
* boundaries here are an approximation, not based on actual data
Geospatial Data Skipping Example
Example Query
SELECT *
FROM Weather STORED AS parquet
WHERE
ST_Contains(ST_WKTToSQL('POLYGON((-78.93
36.00, -78.67 35.78, -79.04 35.90, -78.93 36.00))'),
ST_Point(long, lat))
INTO cos://us-south/results STORED AS parquet
Metadata
Object Name                 lat Min   lat Max   ...
...
dt=2020-08-17/part-00085    35.02     36.17
dt=2020-08-17/part-00086    43.59     44.95
dt=2020-08-17/part-00087    34.86     40.62
dt=2020-08-17/part-00088    23.67     25.92
...
Objects shown in red on the slide (those whose lat range cannot overlap the polygon) are not relevant to this query
Raleigh Research Triangle (US)
Map the ST_Contains UDF to necessary conditions on lat, long
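A necessary condition for a point to lie inside the polygon is that its latitude falls within the polygon's latitude range, so any object whose lat range does not overlap that range can be skipped. The sketch below applies this to the lat metadata shown above, using the latitude extent of the example polygon (35.78 to 36.00). It is an illustration only; the same check would apply to longitude, whose metadata is not shown on the slide, and the helper name is hypothetical.

```python
from typing import Dict, Tuple

# (lat Min, lat Max) per object, copied from the metadata above
LAT_METADATA: Dict[str, Tuple[float, float]] = {
    "dt=2020-08-17/part-00085": (35.02, 36.17),
    "dt=2020-08-17/part-00086": (43.59, 44.95),
    "dt=2020-08-17/part-00087": (34.86, 40.62),
    "dt=2020-08-17/part-00088": (23.67, 25.92),
}


def ranges_overlap(lo: float, hi: float, query_lo: float, query_hi: float) -> bool:
    """True if [lo, hi] and [query_lo, query_hi] share at least one value."""
    return lo <= query_hi and hi >= query_lo


# Latitude extent of the Raleigh Research Triangle polygon in the example query
POLY_LAT_MIN, POLY_LAT_MAX = 35.78, 36.00

candidates = [
    name
    for name, (lat_min, lat_max) in LAT_METADATA.items()
    if ranges_overlap(lat_min, lat_max, POLY_LAT_MIN, POLY_LAT_MAX)
]
print(candidates)
```

Objects that pass this check are only candidates: the ST_Contains predicate still has to be evaluated on their rows.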
Query Rewrite Approach
Example Query
SELECT *
FROM Weather STORED AS parquet
WHERE
ST_Contains(ST_WKTToSQL('POLYGON((-78.93
36.00, -78.67 35.78, -79.04 35.90, -78.93 36.00))'),
ST_Point(long, lat))
INTO cos://us-south/results STORED AS parquet
Raleigh Research Triangle (US)
Rewritten Query
SELECT *
FROM Weather STORED AS parquet
WHERE
ST_Contains(ST_WKTToSQL('POLYGON((-78.93
36.00, -78.67 35.78, -79.04 35.90, -78.93 36.00))'),
ST_Point(long, lat))
AND long BETWEEN -79.04 AND -78.67
AND lat BETWEEN 35.78 AND 36.00
INTO cos://us-south/results STORED AS parquet
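The extra BETWEEN conditions in the rewritten query are the bounding box of the polygon. A minimal sketch of how they could be derived from the WKT string follows. It assumes a simple single-ring polygon with "x y" (long lat) coordinate order and ignores cases such as polygons crossing the antimeridian; the function names are hypothetical and this is not the rewrite code used by SQL Query.

```python
import re
from typing import Tuple


def polygon_bbox(wkt: str) -> Tuple[float, float, float, float]:
    """Return (min_x, min_y, max_x, max_y) for a simple WKT polygon."""
    numbers = [float(n) for n in re.findall(r"-?\d+(?:\.\d+)?", wkt)]
    xs, ys = numbers[0::2], numbers[1::2]  # coordinates alternate x, y
    return min(xs), min(ys), max(xs), max(ys)


def bbox_conditions(wkt: str, x_col: str = "long", y_col: str = "lat") -> str:
    """Build the necessary range conditions implied by the polygon."""
    min_x, min_y, max_x, max_y = polygon_bbox(wkt)
    return (
        f"{x_col} BETWEEN {min_x:.2f} AND {max_x:.2f} "
        f"AND {y_col} BETWEEN {min_y:.2f} AND {max_y:.2f}"
    )


WKT = "POLYGON((-78.93 36.00, -78.67 35.78, -79.04 35.90, -78.93 36.00))"
print(bbox_conditions(WKT))
```

Because the conditions are plain range predicates on columns, an engine can check them against MinMax statistics, which is what makes the rewrite useful for skipping.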
Benefits of Consolidated Metadata
Query rewrite approach can leverage MinMax metadata in Parquet/ORC formats
Consolidated metadata approach performs better
▪ Avoids reading footers
▪ Better resource allocation
3.6x faster
10x Acceleration with Data Skipping and Catalog
Assumes query rewrite approach (yellow) is the baseline
• Requires Parquet/ORC
For other formats the acceleration is much larger
• e.g. CSV/JSON/Avro
Experiment uses Raleigh Research Triangle query
10x speedup on average
Demo
Demo Stats
TWC Demo Dataset Properties
▪ 6.404 TB in Parquet format
▪ 172,004 objects
▪ 36 MB per object (on average)
▪ 21 months of weather data
▪ 21 partitions
Performance Stats
▪ Create table: 3.8s
▪ Recover partitions: 21.9s
▪ Create indexes: 12 min 17.0s
- Data scanned: 5.72 MB
▪ Geospatial query: 1 min 14.0s
- Data scanned: 20.4 MB
- Catalog: skips 20 of 21 partitions
- Data skipped: 8186 of 8190 objects
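Since the service is priced per TB scanned ($5/TB, as quoted earlier), the effect of skipping on cost can be estimated from these numbers. The sketch below assumes decimal units (1 TB = 10^12 bytes) and no minimum charge per query, neither of which is stated in the slides; the function name is illustrative.

```python
PRICE_PER_TB_USD = 5.0  # price per TB scanned quoted in the slides


def scan_cost_usd(bytes_scanned: float) -> float:
    """Cost of one query given the bytes it scans (decimal TB assumed)."""
    return bytes_scanned / 1e12 * PRICE_PER_TB_USD


full_scan = scan_cost_usd(6.404e12)  # reading the whole demo dataset
with_skipping = scan_cost_usd(20.4e6)  # the demo geospatial query

print(f"full scan:     ${full_scan:.2f}")
print(f"with skipping: ${with_skipping:.6f}")
```

Under these assumptions a full scan of the demo dataset would cost about $32, while the geospatial query that scans 20.4 MB costs a small fraction of a cent.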
Example Query used by HoD in Production
SELECT *
FROM hod_gcod
WHERE (
year = 2016 AND 10 <= month
OR year BETWEEN 2017 AND 2019
OR year = 2020 AND month <= 3
)
AND date_time BETWEEN timestamp("2016-10-15 00:00:00Z") AND timestamp("2020-03-11 00:00:00Z")
AND ST_Contains(
ST_Boundingbox(-111.711, 41.081, -109.953, 42.840),
ST_Point(longitude, latitude)
)
INTO cos://us-east/my-results-bucket STORED AS CSV
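The year/month conditions in this query mirror the `year=` and `month=` partition columns of the production layout, which lets whole partitions be pruned before the finer `date_time` filter is applied. A possible way to generate such a predicate from a date range is sketched below; the function is hypothetical and only illustrates the shape of the condition.

```python
from datetime import date


def partition_predicate(start: date, end: date) -> str:
    """Build a year/month partition filter covering [start, end]."""
    if start > end:
        raise ValueError("start must not be after end")
    if start.year == end.year:
        return f"(year = {start.year} AND month BETWEEN {start.month} AND {end.month})"
    parts = [f"year = {start.year} AND month >= {start.month}"]
    first_full, last_full = start.year + 1, end.year - 1
    if first_full == last_full:
        parts.append(f"year = {first_full}")
    elif first_full < last_full:
        parts.append(f"year BETWEEN {first_full} AND {last_full}")
    parts.append(f"year = {end.year} AND month <= {end.month}")
    return "(" + " OR ".join(f"({p})" for p in parts) + ")"


print(partition_predicate(date(2016, 10, 15), date(2020, 3, 11)))
```

For the date range in the query above this produces the same three-part condition: the tail of 2016, the full years 2017 to 2019, and the start of 2020.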
Query Runtime for HoD in Production
Querying a 40x40 gridpoint area (25,000 km²) over time
HoD Sync vs Async
Querying a 40x40 gridpoint bbox (25,000 km²) to retrieve 1 year of data
                    Synchronous (previous solution)    Asynchronous (new solution)
Query count         8,000                              1
Total query time    2h 15m                             3m 20s
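The speedup implied by these totals is simple arithmetic on the figures above:

```python
from datetime import timedelta

sync_total = timedelta(hours=2, minutes=15)  # 8,000 synchronous queries
async_total = timedelta(minutes=3, seconds=20)  # 1 asynchronous query

speedup = sync_total / async_total  # timedelta / timedelta yields a float
avg_sync_query = sync_total / 8000  # average time per synchronous query

print(f"speedup: {speedup:.1f}x")
print(f"average synchronous query: {avg_sync_query.total_seconds():.2f} s")
```

Each synchronous query averages about one second, so the previous solution's total time comes mainly from the number of requests needed rather than from any single slow request.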
HoD Sync vs Async
Synchronous (previous solution)
▪ Limited storage
▪ Land only
▪ 20 weather properties
▪ Query result size limit of 2,400 records
Asynchronous (new solution)
▪ Unlimited storage
▪ Global coverage
▪ All 34 weather properties
▪ Unlimited query result size
▪ An order of magnitude reduction in cost
Conclusions
▪ Order of magnitude cost reduction
▪ Order of magnitude performance improvements
▪ Enhanced functionality
▪ Key factors:
▪ Serverless approach with IBM Cloud SQL Query + COS
1. Seamless integration with powerful geospatial library
2. Fully integrated Catalog
3. Geospatial data skipping
▪ Our data skipping work is extensible
Thanks!
Contact Info:
Erik Goepfert erik.goepfert@ibm.com
Paula Ta-Shma paula@il.ibm.com
Thanks to the team:
Ofer Biran, Dat Bui, Linsong Chu, Patrick Dantressangle, Pranita Dewan,
Michael Factor, Oshrit Feder, Raghu Ganti, Michael Haide, Holly Hassenzahl,
Pete Ihlenfeldt, Guy Khazma, Simon Laws, Gal Lushi, Yosef Moatti, Jeremy
Nachman, Daniel Pittner, Mudhakar Srinivasta, Torsten Steinbach
The research leading to these results has received funding from the European Community’s
Horizon 2020 research and innovation program under grant agreement n° 779747.
Feedback
Don’t forget to rate and review this session
Take a look at our latest blog

Más contenido relacionado

La actualidad más candente

Optimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkDatabricks
 
Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Databricks
 
Designing the Next Generation of Data Pipelines at Zillow with Apache Spark
Designing the Next Generation of Data Pipelines at Zillow with Apache SparkDesigning the Next Generation of Data Pipelines at Zillow with Apache Spark
Designing the Next Generation of Data Pipelines at Zillow with Apache SparkDatabricks
 
Spark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, StreamingSpark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, StreamingPetr Zapletal
 
Operationalizing Big Data Pipelines At Scale
Operationalizing Big Data Pipelines At ScaleOperationalizing Big Data Pipelines At Scale
Operationalizing Big Data Pipelines At ScaleDatabricks
 
Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...
Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...
Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...Databricks
 
Growing the Delta Ecosystem to Rust and Python with Delta-RS
Growing the Delta Ecosystem to Rust and Python with Delta-RSGrowing the Delta Ecosystem to Rust and Python with Delta-RS
Growing the Delta Ecosystem to Rust and Python with Delta-RSDatabricks
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDatabricks
 
Change Data Feed in Delta
Change Data Feed in DeltaChange Data Feed in Delta
Change Data Feed in DeltaDatabricks
 
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...Spark Summit
 
Designing and Implementing a Real-time Data Lake with Dynamically Changing Sc...
Designing and Implementing a Real-time Data Lake with Dynamically Changing Sc...Designing and Implementing a Real-time Data Lake with Dynamically Changing Sc...
Designing and Implementing a Real-time Data Lake with Dynamically Changing Sc...Databricks
 
Acid ORC, Iceberg and Delta Lake
Acid ORC, Iceberg and Delta LakeAcid ORC, Iceberg and Delta Lake
Acid ORC, Iceberg and Delta LakeMichal Gancarski
 
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Databricks
 
Managing ADLS gen2 using Apache Spark
Managing ADLS gen2 using Apache SparkManaging ADLS gen2 using Apache Spark
Managing ADLS gen2 using Apache SparkDatabricks
 
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Databricks
 
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Spark Summit
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeDatabricks
 
Spark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with SparkSpark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with SparkSpark Summit
 
Keeping Identity Graphs In Sync With Apache Spark
Keeping Identity Graphs In Sync With Apache SparkKeeping Identity Graphs In Sync With Apache Spark
Keeping Identity Graphs In Sync With Apache SparkDatabricks
 

La actualidad más candente (20)

Optimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache Spark
 
Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0
 
Designing the Next Generation of Data Pipelines at Zillow with Apache Spark
Designing the Next Generation of Data Pipelines at Zillow with Apache SparkDesigning the Next Generation of Data Pipelines at Zillow with Apache Spark
Designing the Next Generation of Data Pipelines at Zillow with Apache Spark
 
Spark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, StreamingSpark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, Streaming
 
Operationalizing Big Data Pipelines At Scale
Operationalizing Big Data Pipelines At ScaleOperationalizing Big Data Pipelines At Scale
Operationalizing Big Data Pipelines At Scale
 
Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...
Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...
Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...
 
Growing the Delta Ecosystem to Rust and Python with Delta-RS
Growing the Delta Ecosystem to Rust and Python with Delta-RSGrowing the Delta Ecosystem to Rust and Python with Delta-RS
Growing the Delta Ecosystem to Rust and Python with Delta-RS
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's Perspective
 
Change Data Feed in Delta
Change Data Feed in DeltaChange Data Feed in Delta
Change Data Feed in Delta
 
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
 
Designing and Implementing a Real-time Data Lake with Dynamically Changing Sc...
Designing and Implementing a Real-time Data Lake with Dynamically Changing Sc...Designing and Implementing a Real-time Data Lake with Dynamically Changing Sc...
Designing and Implementing a Real-time Data Lake with Dynamically Changing Sc...
 
Acid ORC, Iceberg and Delta Lake
Acid ORC, Iceberg and Delta LakeAcid ORC, Iceberg and Delta Lake
Acid ORC, Iceberg and Delta Lake
 
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
 
Managing ADLS gen2 using Apache Spark
Managing ADLS gen2 using Apache SparkManaging ADLS gen2 using Apache Spark
Managing ADLS gen2 using Apache Spark
 
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
 
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
 
Spark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with SparkSpark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with Spark
 
Keeping Identity Graphs In Sync With Apache Spark
Keeping Identity Graphs In Sync With Apache SparkKeeping Identity Graphs In Sync With Apache Spark
Keeping Identity Graphs In Sync With Apache Spark
 

Similar a How The Weather Company Uses Apache Spark to Serve Weather Data Fast at Low Cost

Suburface 2021 IBM Cloud Data Lake
Suburface 2021 IBM Cloud Data LakeSuburface 2021 IBM Cloud Data Lake
Suburface 2021 IBM Cloud Data LakeTorsten Steinbach
 
IBM Cloud Native Day April 2021: Serverless Data Lake
IBM Cloud Native Day April 2021: Serverless Data LakeIBM Cloud Native Day April 2021: Serverless Data Lake
IBM Cloud Native Day April 2021: Serverless Data LakeTorsten Steinbach
 
RaptorX: Building a 10X Faster Presto with hierarchical cache
RaptorX: Building a 10X Faster Presto with hierarchical cacheRaptorX: Building a 10X Faster Presto with hierarchical cache
RaptorX: Building a 10X Faster Presto with hierarchical cacheAlluxio, Inc.
 
Stsg17 speaker yousunjeong
Stsg17 speaker yousunjeongStsg17 speaker yousunjeong
Stsg17 speaker yousunjeongYousun Jeong
 
First in Class: Optimizing the Data Lake for Tighter Integration
First in Class: Optimizing the Data Lake for Tighter IntegrationFirst in Class: Optimizing the Data Lake for Tighter Integration
First in Class: Optimizing the Data Lake for Tighter IntegrationInside Analysis
 
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...DataStax
 
Ceph Day San Jose - Object Storage for Big Data
Ceph Day San Jose - Object Storage for Big Data Ceph Day San Jose - Object Storage for Big Data
Ceph Day San Jose - Object Storage for Big Data Ceph Community
 
IBM Cloud Day January 2021 Data Lake Deep Dive
IBM Cloud Day January 2021 Data Lake Deep DiveIBM Cloud Day January 2021 Data Lake Deep Dive
IBM Cloud Day January 2021 Data Lake Deep DiveTorsten Steinbach
 
Modernizing upstream workflows with aws storage - john mallory
Modernizing upstream workflows with aws storage -  john malloryModernizing upstream workflows with aws storage -  john mallory
Modernizing upstream workflows with aws storage - john malloryAmazon Web Services
 
Druid at naver.com - part 1
Druid at naver.com - part 1Druid at naver.com - part 1
Druid at naver.com - part 1Jungsu Heo
 
Using Pluggable Apache Spark SQL Filters to Help GridPocket Users Keep Up wit...
Using Pluggable Apache Spark SQL Filters to Help GridPocket Users Keep Up wit...Using Pluggable Apache Spark SQL Filters to Help GridPocket Users Keep Up wit...
Using Pluggable Apache Spark SQL Filters to Help GridPocket Users Keep Up wit...Spark Summit
 
(DAT201) Introduction to Amazon Redshift
(DAT201) Introduction to Amazon Redshift(DAT201) Introduction to Amazon Redshift
(DAT201) Introduction to Amazon RedshiftAmazon Web Services
 
Healthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkHealthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkDatabricks
 
Incremental Processing on Large Analytical Datasets with Prasanna Rajaperumal...
Incremental Processing on Large Analytical Datasets with Prasanna Rajaperumal...Incremental Processing on Large Analytical Datasets with Prasanna Rajaperumal...
Incremental Processing on Large Analytical Datasets with Prasanna Rajaperumal...Databricks
 
Hoodie: How (And Why) We built an analytical datastore on Spark
Hoodie: How (And Why) We built an analytical datastore on SparkHoodie: How (And Why) We built an analytical datastore on Spark
Hoodie: How (And Why) We built an analytical datastore on SparkVinoth Chandar
 
Geospatial Sensor Networks and Partitioning Data
Geospatial Sensor Networks and Partitioning DataGeospatial Sensor Networks and Partitioning Data
Geospatial Sensor Networks and Partitioning DataAlexMiowski
 
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu YongUnlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu YongCeph Community
 
Best Practices for Building Robust Data Platform with Apache Spark and Delta
Best Practices for Building Robust Data Platform with Apache Spark and DeltaBest Practices for Building Robust Data Platform with Apache Spark and Delta
Best Practices for Building Robust Data Platform with Apache Spark and DeltaDatabricks
 
Using a Fast Operational Database to Build Real-time Streaming Aggregations
Using a Fast Operational Database to Build Real-time Streaming AggregationsUsing a Fast Operational Database to Build Real-time Streaming Aggregations
Using a Fast Operational Database to Build Real-time Streaming AggregationsVoltDB
 
Optimizing Presto Connector on Cloud Storage
Optimizing Presto Connector on Cloud StorageOptimizing Presto Connector on Cloud Storage
Optimizing Presto Connector on Cloud StorageKai Sasaki
 

Similar a How The Weather Company Uses Apache Spark to Serve Weather Data Fast at Low Cost (20)

Suburface 2021 IBM Cloud Data Lake
Suburface 2021 IBM Cloud Data LakeSuburface 2021 IBM Cloud Data Lake
Suburface 2021 IBM Cloud Data Lake
 
IBM Cloud Native Day April 2021: Serverless Data Lake
IBM Cloud Native Day April 2021: Serverless Data LakeIBM Cloud Native Day April 2021: Serverless Data Lake
IBM Cloud Native Day April 2021: Serverless Data Lake
 
RaptorX: Building a 10X Faster Presto with hierarchical cache
RaptorX: Building a 10X Faster Presto with hierarchical cacheRaptorX: Building a 10X Faster Presto with hierarchical cache
RaptorX: Building a 10X Faster Presto with hierarchical cache
 
Stsg17 speaker yousunjeong
Stsg17 speaker yousunjeongStsg17 speaker yousunjeong
Stsg17 speaker yousunjeong
 
First in Class: Optimizing the Data Lake for Tighter Integration
First in Class: Optimizing the Data Lake for Tighter IntegrationFirst in Class: Optimizing the Data Lake for Tighter Integration
First in Class: Optimizing the Data Lake for Tighter Integration
 
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
 
Ceph Day San Jose - Object Storage for Big Data
Ceph Day San Jose - Object Storage for Big Data Ceph Day San Jose - Object Storage for Big Data
Ceph Day San Jose - Object Storage for Big Data
 
IBM Cloud Day January 2021 Data Lake Deep Dive
IBM Cloud Day January 2021 Data Lake Deep DiveIBM Cloud Day January 2021 Data Lake Deep Dive
IBM Cloud Day January 2021 Data Lake Deep Dive
 
Modernizing upstream workflows with aws storage - john mallory
Modernizing upstream workflows with aws storage -  john malloryModernizing upstream workflows with aws storage -  john mallory
Modernizing upstream workflows with aws storage - john mallory
 
Druid at naver.com - part 1
Druid at naver.com - part 1Druid at naver.com - part 1
Druid at naver.com - part 1
 
Using Pluggable Apache Spark SQL Filters to Help GridPocket Users Keep Up wit...
Using Pluggable Apache Spark SQL Filters to Help GridPocket Users Keep Up wit...Using Pluggable Apache Spark SQL Filters to Help GridPocket Users Keep Up wit...
Using Pluggable Apache Spark SQL Filters to Help GridPocket Users Keep Up wit...
 
(DAT201) Introduction to Amazon Redshift
(DAT201) Introduction to Amazon Redshift(DAT201) Introduction to Amazon Redshift
(DAT201) Introduction to Amazon Redshift
 
Healthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkHealthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache Spark
 
Incremental Processing on Large Analytical Datasets with Prasanna Rajaperumal...
Incremental Processing on Large Analytical Datasets with Prasanna Rajaperumal...Incremental Processing on Large Analytical Datasets with Prasanna Rajaperumal...
Incremental Processing on Large Analytical Datasets with Prasanna Rajaperumal...
 
Hoodie: How (And Why) We built an analytical datastore on Spark
Hoodie: How (And Why) We built an analytical datastore on SparkHoodie: How (And Why) We built an analytical datastore on Spark
Hoodie: How (And Why) We built an analytical datastore on Spark
 
Geospatial Sensor Networks and Partitioning Data
Geospatial Sensor Networks and Partitioning DataGeospatial Sensor Networks and Partitioning Data
Geospatial Sensor Networks and Partitioning Data
 
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu YongUnlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
 
Best Practices for Building Robust Data Platform with Apache Spark and Delta
Best Practices for Building Robust Data Platform with Apache Spark and DeltaBest Practices for Building Robust Data Platform with Apache Spark and Delta
Best Practices for Building Robust Data Platform with Apache Spark and Delta
 
Using a Fast Operational Database to Build Real-time Streaming Aggregations
Using a Fast Operational Database to Build Real-time Streaming AggregationsUsing a Fast Operational Database to Build Real-time Streaming Aggregations
Using a Fast Operational Database to Build Real-time Streaming Aggregations
 
Optimizing Presto Connector on Cloud Storage
Optimizing Presto Connector on Cloud StorageOptimizing Presto Connector on Cloud Storage
Optimizing Presto Connector on Cloud Storage
 

How The Weather Company Uses Apache Spark to Serve Weather Data Fast at Low Cost

  • 1. How The Weather Company® Uses Spark to Serve Weather Data Faster and Cheaper, By Leveraging IBM Cloud® SQL Query and Cloud Object Storage
      Erik Goepfert and Paula Ta-Shma, IBM TWC and IBM Research
  • 2. The Weather Company started with a simple mission to
  • 3. decisions solutions
      Map the atmosphere every 15 minutes
      Process over 400 terabytes of data daily
      Deliver more than 50 billion requests for weather information every day and produce 25 billion forecasts daily
      Source: Qliksense internal report, April 2017; According to internal forecasting system + # of locations in the world by Lat Lon locations (2 decimal places); 400 terabytes according to internal SUN platform numbers
      And has evolved into
  • 4. Source: ForecastWatch, Three Region Accuracy Overview, 2010-2017, https://www.forecastwatch.com/static/Three_Region_Accuracy_Overview_2010-2017.pdf
      History on Demand Conditions (HoD)
      Provides access to a worldwide, hourly, high-resolution, gridded dataset of past weather conditions via a web API
      Global 4 km grid
        ▪ 0.044-degree resolution
        ▪ 34 potential weather properties
        ▪ 34 million records added every hour
      Geospatial and temporal search
        ▪ Point, bounding box, and polygon search over a time range
      Usage
        ▪ Averages 600,000 requests per day
        ▪ Used by clients primarily for machine learning and data analytics
        ▪ Supports research in domains such as climate science, energy & utilities, agriculture, transportation, insurance, and retail
  • 5. Problems with our previous solution
      ▪ Expensive
        ▪ Our synchronous data access solution is expensive
      ▪ Limited storage capacity
        ▪ Hard storage limits per cluster with our previous cloud provider and storage solution
      ▪ We have reduced the data we provide
        ▪ To lower cost and stay below the storage limit, we've reduced our data to land only, and 20 of the available 34 weather properties
      ▪ Clients are limited to small requests
        ▪ To allow for a synchronous interaction, clients are required to limit the scope of their requests to 2,400 records
      ▪ Slow at retrieving large amounts of data
        ▪ Because of the small query sizes, it is time consuming to retrieve large amounts of data
  • 6. Our new asynchronous solution
      ▪ More cost-effective
        ▪ Our use of IBM Cloud SQL Query and Cloud Object Storage has resulted in an order of magnitude reduction in cost
      ▪ Unlimited storage
        ▪ With Cloud Object Storage we effectively have an unlimited storage capacity
      ▪ Global weather data coverage with all 34 weather properties
        ▪ With the reduced cost and unlimited storage we no longer have to limit the data we provide
      ▪ Support for large requests
        ▪ With an asynchronous interaction, clients can now submit a single request for everything they're interested in
      ▪ Large amounts of data retrieved quickly with a single query
        ▪ Because we can rely on IBM Cloud SQL Query using Spark behind the scenes, large queries complete relatively quickly
  • 7. Solution Overview
      Serverless approach
        ▪ Pay per use -> Low cost
      IBM Cloud SQL Query
        ▪ Serverless SQL powered by Spark
      IBM Cloud Object Storage (COS)
        ▪ S3 Compatible API
      Apply Best Practices
        ▪ Parquet
        ▪ Geospatial Data Layout
  • 8. IBM Cloud SQL Query
      ▪ Serverless SQL service built on Apache Spark
      ▪ Supports all Spark native data formats, e.g. Parquet, ORC, CSV, Avro, JSON
      ▪ Intuitive UI, no setup/installation required
      ▪ Integrated with Watson Studio notebooks and Cloud Functions, and supports REST APIs
      ▪ Query and ETL data on COS directly
        ▪ Also supports Db2
      ▪ High Performance
        ▪ Built-in Catalog – multi-tenant Hive Metastore
        ▪ Data Skipping indexes
      ▪ Low Cost
        ▪ $5/TB scanned
        ▪ Skip over irrelevant data using Catalog and Data Skipping
        ▪ Zero standing cost
  • 9. SQL Query Catalog
      ▪ Multi-tenant Hive Metastore (HMS)
      ▪ Critical to achieve high performance for Big Data
        1. Spark SQL leverages HMS for partition pruning – avoid reading irrelevant partitions
           ▪ More flexible than the Hive Style Partitioning naming convention
        2. Significantly reduces time spent on object listing
           ▪ HMS enables listing partitions in parallel – sequential listing can be very slow
        3. Stores schema and statistics for Cost Based Optimization
      ▪ Useful to maintain consistency
        ▪ Can consistently replace an entire partition
        ▪ Needed for changing the data layout
        ▪ Future: use Delta/Iceberg/Hudi format for this
      Object listing:
        Weather/dt=2020-08-17/part-00085.parquet
        Weather/dt=2020-08-17/part-00086.parquet
        Weather/dt=2020-08-17/part-00087.parquet
        Weather/dt=2020-08-17/part-00088.parquet
        Weather/dt=2020-08-18/part-00001.parquet
        Weather/dt=2020-08-18/part-00002.parquet
      [Diagram labels: Partition MD, Partition MD, HMS]
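
The partition pruning described on slide 9 can be illustrated with a short Python sketch. It is not how the Hive Metastore or Spark SQL implements pruning: the helper names `partition_of` and `prune` are hypothetical, and the sketch derives partition values from the Hive-style object names shown on the slide, whereas a catalog would hold that mapping as metadata.

```python
from typing import Dict, List

# Object names copied from the slide; the dt=... segment is the partition value.
OBJECTS: List[str] = [
    "Weather/dt=2020-08-17/part-00085.parquet",
    "Weather/dt=2020-08-17/part-00086.parquet",
    "Weather/dt=2020-08-17/part-00087.parquet",
    "Weather/dt=2020-08-17/part-00088.parquet",
    "Weather/dt=2020-08-18/part-00001.parquet",
    "Weather/dt=2020-08-18/part-00002.parquet",
]


def partition_of(path: str) -> Dict[str, str]:
    """Extract key=value partition columns from a Hive-style object path."""
    columns: Dict[str, str] = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, value = segment.split("=", 1)
            columns[key] = value
    return columns


def prune(paths: List[str], column: str, value: str) -> List[str]:
    """Keep only the objects whose partition column equals the requested value."""
    return [p for p in paths if partition_of(p).get(column) == value]


# A query with WHERE dt = '2020-08-18' only needs the two objects of that partition.
selected = prune(OBJECTS, "dt", "2020-08-18")
print(selected)
```

Only the objects under the requested partition survive, so the four objects of `dt=2020-08-17` are never listed or read.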
  • 10. Geospatial Analytics in SQL Query
      ▪ Supports geospatial operations and data types
        - According to the SQL/MM standard
        - Aggregation, computation and join via native SQL syntax
      ▪ Geodetic Full Earth support
        - Increased developer productivity
      ▪ Avoid piece-wise planar projections
        - High precision calculations anywhere on earth
        - Very large polygons, e.g. countries, polar caps etc.
      ▪ Native geohash support
        - Fine granularity
        - Fast spatial aggregation
      ▪ Geospatial Data Skipping
  • 11. Data Skipping in SQL Query
      ▪ Avoid reading irrelevant objects using indexes
      ▪ Complements partition pruning -> object level pruning
      ▪ Stores aggregate metadata per object to enable skipping decisions
      ▪ Indexes are stored in COS
      ▪ Supports multiple index types
        ▪ Currently MinMax, ValueList, BloomFilter, Geospatial
      ▪ Underlying data skipping library is extensible
        ▪ New index types can easily be supported
      ▪ Enables data skipping for queries with UDFs
        ▪ e.g. ST_Contains, ST_Distance etc.
        ▪ UDFs are mapped to indexes
  • 12. How Data Skipping Works
      Spark SQL query execution flow: Query -> Prune partitions -> Read data
      With data skipping: Query -> Prune partitions -> Optional file filter (metadata filter) -> Read data
      Uses Catalyst optimizer and session extensions API
  • 13. Data Skipping Example
      Data object listing:
        Weather/dt=2020-08-17/part-00085.parquet
        Weather/dt=2020-08-17/part-00086.parquet
        Weather/dt=2020-08-17/part-00087.parquet
        Weather/dt=2020-08-17/part-00088.parquet
        Weather/dt=2020-08-18/part-00001.parquet
        Weather/dt=2020-08-18/part-00002.parquet
      Example query:
        SELECT * FROM cos://us-geo/twc/Weather STORED AS parquet WHERE temp > 40
      Metadata:
        Object Name                 Temp Min   Temp Max
        ...
        dt=2020-08-17/part-00085        7.97      26.77
        dt=2020-08-17/part-00086        2.45      23.71
        dt=2020-08-17/part-00087        6.46      18.62
        dt=2020-08-17/part-00088       23.67      41.02
        ...
      Red objects are not relevant to this query (of the four objects whose metadata is shown, only part-00088, with Temp Max 41.02, can contain rows with temp > 40)
  • 14. Data Skipping Example (same object listing, example query, and metadata as slide 13)
      Data layout is important to get good skipping
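
The skipping decision in this example can be reproduced with a minimal Python sketch. It only illustrates the MinMax idea using the values from the slide's metadata table; the function name `objects_for_greater_than` is hypothetical and this is not the data skipping library's API.

```python
from typing import Dict, List, Tuple

# Per-object (min, max) of the temp column, copied from the slide's metadata table.
TEMP_MINMAX: Dict[str, Tuple[float, float]] = {
    "dt=2020-08-17/part-00085": (7.97, 26.77),
    "dt=2020-08-17/part-00086": (2.45, 23.71),
    "dt=2020-08-17/part-00087": (6.46, 18.62),
    "dt=2020-08-17/part-00088": (23.67, 41.02),
}


def objects_for_greater_than(
    index: Dict[str, Tuple[float, float]], threshold: float
) -> List[str]:
    """Return the objects that might hold a row with temp > threshold.

    An object is safe to skip when its maximum is <= threshold, because
    no row inside it can satisfy the predicate.
    """
    return [name for name, (_low, high) in index.items() if high > threshold]


# WHERE temp > 40
to_read = objects_for_greater_than(TEMP_MINMAX, 40)
print(to_read)  # ['dt=2020-08-17/part-00088']
```

The metadata can rule an object out but never confirms a match, so the surviving object still has to be read and filtered row by row.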
  • 15. HoD Data Layout in Production
      gcod/v1/
        hourly/year=2019/month=2/
          20190201T002000Z-part-00000.parquet
          20190201T002000Z-part-00001.parquet
          …
          20190218T232000Z-part-00007.parquet
          20190218T232000Z-part-00008.parquet
        monthly/year=2019/month=1/
          part-00000.parquet
          part-00001.parquet
          …
          part-08191.parquet
  • 16. HoD Data Layout in Production
      [Figure panels: Monthly, Hourly]
      * boundaries here are an approximation, not based on actual data
  • 17. Geospatial Data Skipping Example
      Example query (polygon: Raleigh Research Triangle (US)):
        SELECT * FROM Weather STORED AS parquet
        WHERE ST_Contains(ST_WKTToSQL('POLYGON((-78.93 36.00, -78.67 35.78, -79.04 35.90, -78.93 36.00))'), ST_Point(long, lat))
        INTO cos://us-south/results STORED AS parquet
      Metadata:
        Object Name                  lat Min    lat Max
        ...
        dt=2020-08-17/part-00085       35.02      36.17
        dt=2020-08-17/part-00086       43.59      44.95
        dt=2020-08-17/part-00087       34.86      40.62
        dt=2020-08-17/part-00088       23.67      25.92
        ...
      Red objects are not relevant to this query (part-00086 and part-00088 can be ruled out on latitude alone, since the polygon spans lat 35.78 to 36.00)
      Map the ST_Contains UDF to necessary conditions on lat, long
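
Mapping `ST_Contains` to a necessary condition on `lat` amounts to an interval-overlap test against each object's metadata. The sketch below assumes only the latitude column shown on the slide (longitude would be handled the same way); the names `overlaps` and `QUERY_LAT` are hypothetical.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]

# Per-object (min, max) of the lat column, copied from the slide's metadata table.
LAT_MINMAX: Dict[str, Interval] = {
    "dt=2020-08-17/part-00085": (35.02, 36.17),
    "dt=2020-08-17/part-00086": (43.59, 44.95),
    "dt=2020-08-17/part-00087": (34.86, 40.62),
    "dt=2020-08-17/part-00088": (23.67, 25.92),
}

# Latitude range of the query polygon: smallest and largest lat among its vertices.
QUERY_LAT: Interval = (35.78, 36.00)


def overlaps(a: Interval, b: Interval) -> bool:
    """True when the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# A point inside the polygon must have lat within QUERY_LAT, so any object
# whose lat range misses QUERY_LAT entirely cannot contribute rows.
candidates: List[str] = [
    name for name, lat_range in LAT_MINMAX.items() if overlaps(lat_range, QUERY_LAT)
]
print(candidates)  # ['dt=2020-08-17/part-00085', 'dt=2020-08-17/part-00087']
```

The condition is necessary but not sufficient: the two remaining objects may still hold no point inside the polygon, which is why the original `ST_Contains` predicate is still evaluated on the rows that are read.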
  • 18. Query Rewrite Approach
      Example query (polygon: Raleigh Research Triangle (US)):
        SELECT * FROM Weather STORED AS parquet
        WHERE ST_Contains(ST_WKTToSQL('POLYGON((-78.93 36.00, -78.67 35.78, -79.04 35.90, -78.93 36.00))'), ST_Point(long, lat))
        INTO cos://us-south/results STORED AS parquet
      Rewritten query:
        SELECT * FROM Weather STORED AS parquet
        WHERE ST_Contains(ST_WKTToSQL('POLYGON((-78.93 36.00, -78.67 35.78, -79.04 35.90, -78.93 36.00))'), ST_Point(long, lat))
        AND long BETWEEN -79.04 AND -78.67
        AND lat BETWEEN 35.78 AND 36.00
        INTO cos://us-south/results STORED AS parquet
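
The extra predicates in the rewritten query are the polygon's bounding box. A minimal Python sketch of that derivation follows; it assumes a single-ring WKT polygon with `long lat` vertex order as in the slide, and the function names `polygon_vertices` and `bounding_box_predicate` are hypothetical rather than part of SQL Query.

```python
import re
from typing import List, Tuple

WKT = "POLYGON((-78.93 36.00, -78.67 35.78, -79.04 35.90, -78.93 36.00))"

# One vertex is two numbers separated by whitespace: "<long> <lat>".
_VERTEX = re.compile(r"(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)")


def polygon_vertices(wkt: str) -> List[Tuple[float, float]]:
    """Parse the (long, lat) pairs of a single-ring WKT polygon."""
    return [(float(x), float(y)) for x, y in _VERTEX.findall(wkt)]


def bounding_box_predicate(wkt: str) -> str:
    """Build BETWEEN predicates that every point inside the polygon satisfies."""
    longs, lats = zip(*polygon_vertices(wkt))
    return (
        f"long BETWEEN {min(longs):.2f} AND {max(longs):.2f} "
        f"AND lat BETWEEN {min(lats):.2f} AND {max(lats):.2f}"
    )


predicate = bounding_box_predicate(WKT)
print(predicate)
# long BETWEEN -79.04 AND -78.67 AND lat BETWEEN 35.78 AND 36.00
```

Because the predicates are plain range comparisons on `long` and `lat`, they can be checked against the MinMax statistics that Parquet and ORC already store, while `ST_Contains` stays in the query to filter out points that fall inside the box but outside the polygon.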
  • 19. Benefits of Consolidated Metadata
      Query rewrite approach can leverage MinMax metadata in Parquet/ORC formats
      Consolidated metadata approach performs better
        ▪ Avoids reading footers
        ▪ Better resource allocation
      3.6x faster
  • 20. 10x Acceleration with Data Skipping and Catalog
      Assumes query rewrite approach (yellow) is the baseline
        • Requires Parquet/ORC
      For other formats the acceleration is much larger
        • e.g. CSV/JSON/Avro
      Experiment uses Raleigh Research Triangle query
      10x speedup on average
  • 21. Demo
  • 22. Demo Stats
      TWC Demo Dataset Properties
        ▪ 6.404 TB in Parquet format
        ▪ 172,004 objects
        ▪ 36 MB per object (on average)
        ▪ 21 months of weather data
        ▪ 21 partitions
      Performance Stats
        ▪ Create table: 3.8s
        ▪ Recover partitions: 21.9s
        ▪ Create indexes: 12 min 17.0s
          ▪ Data scanned: 5.72 MB
        ▪ Geospatial query: 1 min 14.0s
          ▪ Data scanned: 20.4 MB
          ▪ Catalog: skips 20 of 21 partitions
          ▪ Data skipped: 8186 of 8190 objects
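
To put the demo numbers in cost terms, the sketch below applies the "$5/TB scanned" price from slide 8 to the demo figures. It is back-of-the-envelope arithmetic, not a billing calculation: it assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes), strictly linear pricing, and no minimum charge per query, none of which the slides state.

```python
PRICE_PER_TB_USD = 5.0   # "$5/TB scanned" (slide 8)
BYTES_PER_TB = 10**12    # assumption: decimal terabytes
BYTES_PER_MB = 10**6     # assumption: decimal megabytes


def scan_cost_usd(bytes_scanned: float) -> float:
    """Cost of a query that scans the given number of bytes, priced linearly."""
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD


# Hypothetical query that reads the whole 6.404 TB demo dataset.
full_scan_cost = scan_cost_usd(6.404 * BYTES_PER_TB)

# The demo's geospatial query scanned 20.4 MB after catalog pruning and data skipping.
demo_query_cost = scan_cost_usd(20.4 * BYTES_PER_MB)

# Share of objects skipped within the one partition that was not pruned.
objects_skipped_fraction = 8186 / 8190

print(f"full scan:       ${full_scan_cost:.2f}")
print(f"demo query:      ${demo_query_cost:.6f}")
print(f"objects skipped: {objects_skipped_fraction:.2%}")
```

Under those assumptions a full scan would cost about $32, while the demo query costs roughly a hundredth of a cent.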
  • 23. Example Query used by HoD in Production
      SELECT *
      FROM hod_gcod
      WHERE (
          year = 2016 AND 10 <= month
          OR year BETWEEN 2017 AND 2019
          OR year = 2020 AND month <= 3
        )
        AND date_time BETWEEN timestamp("2016-10-15 00:00:00Z") AND timestamp("2020-03-11 00:00:00Z")
        AND ST_Contains(
          ST_Boundingbox(-111.711, 41.081, -109.953, 42.840),
          ST_Point(longitude, latitude)
        )
      INTO cos://us-east/my-results-bucket STORED AS CSV
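
The `year`/`month` conditions in this query mirror the requested time range so that whole partitions can be pruned before `date_time` is evaluated. The slides do not show how HoD builds that clause; the following Python sketch is one way it could be generated, with the hypothetical helper `partition_predicate`.

```python
from datetime import date


def partition_predicate(start: date, end: date) -> str:
    """Build a year/month partition filter covering every month from start to end."""
    if start > end:
        raise ValueError("start must not be after end")
    if start.year == end.year:
        return f"year = {start.year} AND month BETWEEN {start.month} AND {end.month}"

    # First (partial) year: from the start month onwards.
    clauses = [f"year = {start.year} AND {start.month} <= month"]

    # Full years strictly between the first and last year, if any.
    first_full, last_full = start.year + 1, end.year - 1
    if first_full == last_full:
        clauses.append(f"year = {first_full}")
    elif first_full < last_full:
        clauses.append(f"year BETWEEN {first_full} AND {last_full}")

    # Last (partial) year: up to the end month.
    clauses.append(f"year = {end.year} AND month <= {end.month}")
    return " OR ".join(clauses)


predicate = partition_predicate(date(2016, 10, 15), date(2020, 3, 11))
print(predicate)
# year = 2016 AND 10 <= month OR year BETWEEN 2017 AND 2019 OR year = 2020 AND month <= 3
```

The result relies on `AND` binding tighter than `OR`, so it should be wrapped in parentheses, as in the slide, before being combined with the `date_time` and `ST_Contains` conditions.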
  • 24. Query Runtime for HoD in Production
      Querying a 40x40 gridpoint area (25,000 km²) over time
  • 25. HoD Sync vs Async
      Querying a 40x40 gridpoint bbox (25,000 km²) to retrieve 1 year of data
                            Synchronous (previous solution)   Asynchronous (new solution)
        Query count         8,000                             1
        Total query time    2h 15m                            3m 20s
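
The ratio implied by this comparison is not stated on the slide; it follows from simple arithmetic on the two totals, sketched here with Python's `timedelta`. The variable names are illustrative only.

```python
from datetime import timedelta

SYNC_QUERY_COUNT = 8_000                        # synchronous query count
sync_total = timedelta(hours=2, minutes=15)     # synchronous total query time
async_total = timedelta(minutes=3, seconds=20)  # asynchronous total query time

# Dividing two timedeltas yields a plain float ratio.
speedup = sync_total / async_total

# Average wall-clock time spent per synchronous query, in seconds.
seconds_per_sync_query = sync_total.total_seconds() / SYNC_QUERY_COUNT

print(f"speedup: {speedup:.1f}x")
print(f"average per synchronous query: {seconds_per_sync_query:.2f} s")
```

For this request, 8,100 seconds against 200 seconds is a speedup of about 40x, with each synchronous query averaging roughly one second.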
  • 26. HoD Sync vs Async
      Synchronous (previous solution)
        ▪ Limited storage
        ▪ Land only
        ▪ 20 weather properties
        ▪ Query result size limit of 2,400 records
      Asynchronous (new solution)
        ▪ Unlimited storage
        ▪ Global coverage
        ▪ All 34 weather properties
        ▪ Unlimited query result size
        ▪ An order of magnitude reduction in cost
  • 27. Conclusions
      ▪ Order of magnitude cost reduction
      ▪ Order of magnitude performance improvements
      ▪ Enhanced functionality
      ▪ Key factors:
        ▪ Serverless approach with IBM Cloud SQL Query + COS
          1. Seamless integration with powerful geospatial library
          2. Fully integrated Catalog
          3. Geospatial data skipping
      ▪ Our data skipping work is extensible
  • 28. Thanks!
      Contact Info:
        Erik Goepfert erik.goepfert@ibm.com
        Paula Ta-Shma paula@il.ibm.com
      Thanks to the team: Ofer Biran, Dat Bui, Linsong Chu, Patrick Dantressangle, Pranita Dewan, Michael Factor, Oshrit Feder, Raghu Ganti, Michael Haide, Holly Hassenzahl, Pete Ihlenfeldt, Guy Khazma, Simon Laws, Gal Lushi, Yosef Moatti, Jeremy Nachman, Daniel Pittner, Mudhakar Srinivasta, Torsten Steinbach
      The research leading to these results has received funding from the European Community’s Horizon 2020 research and innovation program under grant agreement n° 779747.
  • 29. Feedback Don’t forget to rate and review this session Take a look at our latest blog