SlideShare una empresa de Scribd logo
1 de 44
Descargar para leer sin conexión
Building a Dataset Search
Engine with Spark and
Elasticsearch
Oscar Castañeda-Villagrán
Xoom a PayPal service
About
• Data Scientist at Xoom a PayPal service.
• Interests:
• Data Management,
• Dataset Search,
• Online Learning to Rank.
Spark cluster …
http://bit.ly/2em6RUK
Spark cluster with …
http://bit.ly/2ebM9HO
Spark cluster with Elasticsearch …
Spark cluster with Elasticsearch Inside
And …
Spark cluster with Elasticsearch Inside
And …
And Indexed …
Spark cluster with Elasticsearch Inside
And Indexed RDatasets
Spark cluster with Elasticsearch Inside
Agenda
• Problem Statement, Overview and Motivation.
• Elasticsearch Server inside Spark Cluster.
• Metadata extraction. (Write to ES + Snapshot index.)
• Demo: RDataset Search Engine with Spark and Elasticsearch.
• Q&A
Problem Statement
• Despite Datasets being a key corporate
asset they are generally not given the
importance they deserve in terms of
making them easy to find.
Questions
• How are Datasets produced?
• Can indexing be performed when
Dataset are produced?
Overview
Rdatasets
Take ES
snapshot
Restore ES snapshot
http://bit.ly/2e5H1jL
Overview: Metadata Extraction
Rdatasets
Data Pipelines:
•Extract: Filename, URLs, Column headers.
•Extract: Type information.
•Index Metadata.
Extract Filename
Extract Type information.
Index metadata.
Extract Column headers
Extract URLs
Overview
Data Lake
Overview
Data Lake
Motivation
• Organizing and indexing Datasets and Data Lake(s).
• Spark with Elasticsearch Inside can be used to index datasets
that are produced as part of Spark data pipelines.
• “Replaying Datasets” in Data Lake(s) can be leveraged to create
a dataset index (a posteriori vs. post hoc (Halevy et al., 2016)).
• ES snapshot is a useful means to deploy Dataset search index in
production (out of Spark with Elasticsearch Inside).
Organizing and Indexing Datasets
Index
dataset
features
Input data
Data Pipeline
Create dataset
profile pages.
Using Tableau Javascript APIES Cluster
Fetch dataset from Datalake and run Data Pipeline on demand.
Search
Fetch dataset
from Datalake.
Dataset profile Team Dashboard
Organizing and Indexing Datasets
Input data
Data Pipeline
Organizing and Indexing Datasets
Input data
ES Cluster
Search
Data Pipeline
Index
dataset
features
Organizing and Indexing Datasets
Input data
ES Cluster
Search
Create dataset
profile pages.
Dataset profile
Data Pipeline
Index
dataset
features
Organizing and Indexing Datasets
Input data
ES Cluster
Search
Create dataset
profile pages.
Using Tableau Javascript API
Dataset profile
Data Pipeline
Index
dataset
features
Organizing and Indexing Datasets
Input data
ES Cluster
Search
Create dataset
profile pages.
Using Tableau Javascript API
Dataset profile Team Dashboard
Data Pipeline
Index
dataset
features
Organizing and Indexing Datasets
Input data
ES Cluster
Search
Create dataset
profile pages.
Using Tableau Javascript API
Dataset profile Team Dashboard
Fetch dataset
from Datalake.
Data Pipeline
Index
dataset
features
Organizing and Indexing Datasets
Index
dataset
features
Input data
Data Pipeline
Create dataset
profile pages.
Using Tableau Javascript APIES Cluster
Fetch dataset from Datalake and run Data Pipeline on demand.
Dataset profile Team Dashboard
Search
Fetch dataset
from Datalake.
How do you run Elasticsearch
inside Spark Cluster?
Preliminary Notes
http://bit.ly/2ebM9HO
• Embedded Elasticsearch 5 not supported.
More information:
• https://www.elastic.co/blog/elasticsearch-the-server
• https://github.com/elastic/elasticsearch/issues/19903
Imports
http://bit.ly/2efaib4
http://bit.ly/2di0cFq
http://bit.ly/2ebM9HO
Setup Local ES
server.start()
Metadata Extraction
E.g. do not interpret first line as header
• Only first row
• Split by comma
• Remove additional characters
• Extract column-header names.
Result = Comma separated list of column-header names.
E.g. do interpret first line as header
• run ‘dtypes’ on Data Frame
• Split by comma
• Remove additional characters
• Extract column-header types.
Result = Comma separated list of column-header types.
Writing to local ES
saveToEs("data/set")
Check results on local ES
GET
getUrlAsString(“http://<YOUR-IP-ADDRESS>:9200/_cat/indicies?v”)
Snapshot to S3
Demo!
A posteriori vs. Post-hoc
• Halevy et al (2016) advocate finding data in a
post-hoc manner by collecting and aggregating
metadata after datasets are created or updated.
• I propose an a posteriori approach where
metadata is generated and indexed using Spark
as part of running pipelines.
Pros: A posteriori (1)
• Gathering metadata and recovering dataset
semantics does not require separate processing.
• Dataset normalization to a single “dataset” concept
becomes feasible.
• More granular metrics available to evaluate
metadata regeneration.
Pros: A posteriori (2)
• Granular (per-pipeline) access to recompute dataset
importance based on feedback from user interaction.
• E.g. Datasets with increased user interaction call
for more in-depth metadata extraction.
• Granular (per-pipeline) access to status metadata.
Cons: A posteriori (2)
• Looking at trees instead of the forest.
• Cluster-type (or higher order) analyses limited in-
pipeline (tunnel vision).
• Need to replay indexing pipeline when things
change (per data pipeline).
What have we seen?
• How to create ES Server inside Spark Cluster.
• How to write to Elasticsearch index + snapshot to S3.
• How to extract metadata inside data pipelines.
• Demo: RDataset Search Engine with Spark and
Elasticsearch.
Next Steps (1)
• Replicate on embedded Solr using spark-solr plugin to
read/write from/to Spark RDD’s instead of Elasticsearch.
• Naïvely implement Online Learning to Rank (OLTR)
on Dataset index using Solr Online Learning to Rank Plugin
(Hofmann, 2013) [1].
• Describe Datasets in a structured schema.org way
using Data Catalog Vocabulary [2].
[1] https://bitbucket.org/ilps/solr-online-learning-to-rank-plug-in
[2] http://www.w3.org/TR/vocab-dcat/
Next Steps (2)
• Develop methods to extract relations among Datasets and rank
those using OLTR (interleave (Hofmann, 2013) or multileave (Schuth,
2016)) in place of human relevance judgements (as with featureRank in
(Balakrishnan, et al. 2015)).
• Build a knowledge graph and use GraphX to extract insights. (Useful
e.g. for column concept determination (Deng et al. 2013)).
• Build topic models based on structured Datasets using Glint to
perform scalable topic model extraction in Spark (Jagerman and Eickhoff,
2016) [1].
[1] https://spark-summit.org/eu-2016/events/glint-an-asynchronous-parameter-server-for-spark/
References
• Alon Y. Halevy, Flip Korn, Natalya Fridman Noy, Christopher Olston, Neoklis Polyzotis, Sudip Roy, and Steven Euijong Whang.
Goods: Organizing google’s datasets. In Fatmañzcan, Georgia Koutrika, and Sam Madden, editors, Proceedings of the 2016
International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01,
2016, pages 795–806. ACM, 2016. ISBN 978-1-4503-3531-7. doi: http://doi.acm.org/10.1145/2882903.2903730.
• Katja Hofmann. Fast and Reliable Online Learning to Rank for Information Retrieval. PhD thesis, Informatics Institute,
University of Amsterdam, May 2013.
• Rolf Jagerman and Carsten Eickhoff. Web-scale topic models in spark: An asynchronous parameter server. CoRR, abs/
1605.07422, 2016. URL http://arxiv.org/abs/1605. 07422.
• Dong Deng, Yu Jiang, Guoliang Li, Jian Li, and Cong Yu. Scalable column concept de- termination for web tables using large
knowledge bases. PVLDB, 6(13):1606–1617, 2013. doi: http://www.vldb.org/pvldb/vol6/p1606-li.pdf.
• Anne Schuth, Harrie Oosterhuis, Shimon Whiteson, and Maarten de Rijke. Multileave gradient descent for fast online learning to
rank. In WSDM 2016: The 9th International Conference on Web Search and Data Mining, pages 457-466. ACM, February 2016.
• Sreeram Balakrishnan, Alon Y. Halevy, Boulos Harb, Hongrae Lee, Jayant Madhavan, Afshin Rostamizadeh, Warren Shen,
Kenneth Wilder, Fei Wu 0003, and Cong Yu. Ap- plying webtables in practice. In CIDR 2015, Seventh Biennial Conference on
Innovative Data Systems Research, Asilomar, CA, USA, January 4-7, 2015, Online Proceedings. www.cidrdb.org, 2015.
Q&A
Thank You.
Email: ocastaneda@paypal.com
Twitter: @oscar_castaneda

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
 
Delta Lake: Optimizing Merge
Delta Lake: Optimizing MergeDelta Lake: Optimizing Merge
Delta Lake: Optimizing Merge
 
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
Enabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkEnabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache Spark
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
PLPgSqL- Datatypes, Language structure.pptx
PLPgSqL- Datatypes, Language structure.pptxPLPgSqL- Datatypes, Language structure.pptx
PLPgSqL- Datatypes, Language structure.pptx
 
How We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IOHow We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IO
 
Dynamic filtering for presto join optimisation
Dynamic filtering for presto join optimisationDynamic filtering for presto join optimisation
Dynamic filtering for presto join optimisation
 
Diving into Delta Lake: Unpacking the Transaction Log
Diving into Delta Lake: Unpacking the Transaction LogDiving into Delta Lake: Unpacking the Transaction Log
Diving into Delta Lake: Unpacking the Transaction Log
 
Oracle Database in-Memory Overivew
Oracle Database in-Memory OverivewOracle Database in-Memory Overivew
Oracle Database in-Memory Overivew
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQL
 
What to Expect From Oracle database 19c
What to Expect From Oracle database 19cWhat to Expect From Oracle database 19c
What to Expect From Oracle database 19c
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
Apache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming AnalyticsApache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming Analytics
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
Common Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta LakehouseCommon Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta Lakehouse
 
Spark and S3 with Ryan Blue
Spark and S3 with Ryan BlueSpark and S3 with Ryan Blue
Spark and S3 with Ryan Blue
 

Destacado

Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...
Spark Summit
 
Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...
Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...
Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...
Spark Summit
 
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Spark Summit
 
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Spark Summit
 

Destacado (10)

What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
 
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...
 
Tuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache SparkTuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache Spark
 
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
 
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...
 
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
 
Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...
Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...
Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...
 
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
 
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
 
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
 

Similar a Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit East talk by Oscar Castaneda Villagran

Learning to Rank Datasets for Search with Oscar Castaneda
Learning to Rank Datasets for Search with Oscar CastanedaLearning to Rank Datasets for Search with Oscar Castaneda
Learning to Rank Datasets for Search with Oscar Castaneda
Databricks
 
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
ALTER WAY
 

Similar a Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit East talk by Oscar Castaneda Villagran (20)

Learning to Rank Datasets for Search with Oscar Castaneda
Learning to Rank Datasets for Search with Oscar CastanedaLearning to Rank Datasets for Search with Oscar Castaneda
Learning to Rank Datasets for Search with Oscar Castaneda
 
Solr and ElasticSearch demo and speaker feb 2014
Solr  and ElasticSearch demo and speaker feb 2014Solr  and ElasticSearch demo and speaker feb 2014
Solr and ElasticSearch demo and speaker feb 2014
 
Meetup070416 Presentations
Meetup070416 PresentationsMeetup070416 Presentations
Meetup070416 Presentations
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
 
Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...
 
End-to-End Data Pipelines with Apache Spark
End-to-End Data Pipelines with Apache SparkEnd-to-End Data Pipelines with Apache Spark
End-to-End Data Pipelines with Apache Spark
 
Getting Started with Elasticsearch
Getting Started with ElasticsearchGetting Started with Elasticsearch
Getting Started with Elasticsearch
 
Bioschemas Workshop
Bioschemas WorkshopBioschemas Workshop
Bioschemas Workshop
 
A machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companiesA machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companies
 
Semantic framework for web scraping.
Semantic framework for web scraping.Semantic framework for web scraping.
Semantic framework for web scraping.
 
Spark SQL In Depth www.syedacademy.com
Spark SQL In Depth www.syedacademy.comSpark SQL In Depth www.syedacademy.com
Spark SQL In Depth www.syedacademy.com
 
Customer Feedback Analytics for Starbucks
Customer Feedback Analytics for Starbucks Customer Feedback Analytics for Starbucks
Customer Feedback Analytics for Starbucks
 
Configuring elasticsearch for performance and scale
Configuring elasticsearch for performance and scaleConfiguring elasticsearch for performance and scale
Configuring elasticsearch for performance and scale
 
Analyzing Semi-Structured Data At Volume In The Cloud
Analyzing Semi-Structured Data At Volume In The CloudAnalyzing Semi-Structured Data At Volume In The Cloud
Analyzing Semi-Structured Data At Volume In The Cloud
 
Spark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with SparkSpark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with Spark
 
Scaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data ScienceScaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data Science
 
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
 
Elasticsearch - basics and beyond
Elasticsearch - basics and beyondElasticsearch - basics and beyond
Elasticsearch - basics and beyond
 
Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...
Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...
Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...
 
Royal society of chemistry activities to develop a data repository for chemis...
Royal society of chemistry activities to develop a data repository for chemis...Royal society of chemistry activities to develop a data repository for chemis...
Royal society of chemistry activities to develop a data repository for chemis...
 

Más de Spark Summit

Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Spark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Spark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Spark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Spark Summit
 

Más de Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Último

Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
amitlee9823
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
amitlee9823
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
amitlee9823
 
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men 🔝Thrissur🔝 Escor...
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men  🔝Thrissur🔝   Escor...➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men  🔝Thrissur🔝   Escor...
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men 🔝Thrissur🔝 Escor...
amitlee9823
 
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
amitlee9823
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
amitlee9823
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
amitlee9823
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 

Último (20)

Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
hybrid Seed Production In Chilli & Capsicum.pptx
hybrid Seed Production In Chilli & Capsicum.pptxhybrid Seed Production In Chilli & Capsicum.pptx
hybrid Seed Production In Chilli & Capsicum.pptx
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men 🔝Thrissur🔝 Escor...
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men  🔝Thrissur🔝   Escor...➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men  🔝Thrissur🔝   Escor...
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men 🔝Thrissur🔝 Escor...
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 

Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit East talk by Oscar Castaneda Villagran

  • 1. Building a Dataset Search Engine with Spark and Elasticsearch Oscar Castañeda-Villagrán Xoom a PayPal service
  • 2. About • Data Scientist at Xoom a PayPal service. • Interests: • Data Management, • Dataset Search, • Online Learning to Rank.
  • 4. Spark cluster with … http://bit.ly/2ebM9HO
  • 5. Spark cluster with Elasticsearch …
  • 6. Spark cluster with Elasticsearch Inside And …
  • 7. Spark cluster with Elasticsearch Inside And …
  • 8. And Indexed … Spark cluster with Elasticsearch Inside
  • 9. And Indexed RDatasets Spark cluster with Elasticsearch Inside
  • 10. Agenda • Problem Statement, Overview and Motivation. • Elasticsearch Server inside Spark Cluster. • Metadata extraction. (Write to ES + Snapshot index.) • Demo: RDataset Search Engine with Spark and Elasticsearch. • Q&A
  • 11. Problem Statement • Despite Datasets being a key corporate asset they are generally not given the importance they deserve in terms of making them easy to find.
  • 12. Questions • How are Datasets produced? • Can indexing be performed when Dataset are produced?
  • 13. Overview Rdatasets Take ES snapshot Restore ES snapshot http://bit.ly/2e5H1jL
  • 14. Overview: Metadata Extraction Rdatasets Data Pipelines: •Extract: Filename, URLs, Column headers. •Extract: Type information. •Index Metadata. Extract Filename Extract Type information. Index metadata. Extract Column headers Extract URLs
  • 17. Motivation • Organizing and indexing Datasets and Data Lake(s). • Spark with Elasticsearch Inside can be used to index datasets that are produced as part of Spark data pipelines. • “Replaying Datasets” in Data Lake(s) can be leveraged to create a dataset index (a posteriori vs. post hoc (Halevy et al., 2016)). • ES snapshot is a useful means to deploy Dataset search index in production (out of Spark with Elasticsearch Inside).
  • 18. Organizing and Indexing Datasets Index dataset features Input data Data Pipeline Create dataset profile pages. Using Tableau Javascript APIES Cluster Fetch dataset from Datalake and run Data Pipeline on demand. Search Fetch dataset from Datalake. Dataset profile Team Dashboard
  • 19. Organizing and Indexing Datasets Input data Data Pipeline
  • 20. Organizing and Indexing Datasets Input data ES Cluster Search Data Pipeline Index dataset features
  • 21. Organizing and Indexing Datasets Input data ES Cluster Search Create dataset profile pages. Dataset profile Data Pipeline Index dataset features
  • 22. Organizing and Indexing Datasets Input data ES Cluster Search Create dataset profile pages. Using Tableau Javascript API Dataset profile Data Pipeline Index dataset features
  • 23. Organizing and Indexing Datasets Input data ES Cluster Search Create dataset profile pages. Using Tableau Javascript API Dataset profile Team Dashboard Data Pipeline Index dataset features
  • 24. Organizing and Indexing Datasets Input data ES Cluster Search Create dataset profile pages. Using Tableau Javascript API Dataset profile Team Dashboard Fetch dataset from Datalake. Data Pipeline Index dataset features
  • 25. Organizing and Indexing Datasets Index dataset features Input data Data Pipeline Create dataset profile pages. Using Tableau Javascript APIES Cluster Fetch dataset from Datalake and run Data Pipeline on demand. Dataset profile Team Dashboard Search Fetch dataset from Datalake.
  • 26. How do you run Elasticsearch inside Spark Cluster?
  • 27. Preliminary Notes http://bit.ly/2ebM9HO • Embedded Elasticsearch 5 not supported. More information: • https://www.elastic.co/blog/elasticsearch-the-server • https://github.com/elastic/elasticsearch/issues/19903
  • 30. Metadata Extraction E.g. do not interpret first line as header • Only first row • Split by comma • Remove additional characters • Extract column-header names. Result = Comma separated list of column-header names. E.g. do interpret first line as header • run ‘dtypes’ on Data Frame • Split by comma • Remove additional characters • Extract column-header types. Result = Comma separated list of column-header types.
  • 31. Writing to local ES saveToEs("data/set")
  • 32. Check results on local ES GET getUrlAsString(“http://<YOUR-IP-ADDRESS>:9200/_cat/indicies?v”)
  • 34. Demo!
  • 35. A posteriori vs. Post-hoc • Halevy et al (2016) advocate finding data in a post-hoc manner by collecting and aggregating metadata after datasets are created or updated. • I propose an a posteriori approach where metadata is generated and indexed using Spark as part of running pipelines.
  • 36. Pros: A posteriori (1) • Gathering metadata and recovering dataset semantics does not require separate processing. • Dataset normalization to a single “dataset” concept becomes feasible. • More granular metrics available to evaluate metadata regeneration.
  • 37. Pros: A posteriori (2) • Granular (per-pipeline) access to recompute dataset importance based on feedback from user interaction. • E.g. Datasets with increased user interaction call for more in-depth metadata extraction. • Granular (per-pipeline) access to status metadata.
  • 38. Cons: A posteriori (2) • Looking at trees instead of the forest. • Cluster-type (or higher order) analyses limited in- pipeline (tunnel vision). • Need to replay indexing pipeline when things change (per data pipeline).
  • 39. What have we seen? • How to create ES Server inside Spark Cluster. • How to write to Elasticsearch index + snapshot to S3. • How to extract metadata inside data pipelines. • Demo: RDataset Search Engine with Spark and Elasticsearch.
  • 40. Next Steps (1) • Replicate on embedded Solr using spark-solr plugin to read/write from/to Spark RDD’s instead of Elasticsearch. • Naïvely implement Online Learning to Rank (OLTR) on Dataset index using Solr Online Learning to Rank Plugin (Hofmann, 2013) [1]. • Describe Datasets in a structured schema.org way using Data Catalog Vocabulary [2]. [1] https://bitbucket.org/ilps/solr-online-learning-to-rank-plug-in [2] http://www.w3.org/TR/vocab-dcat/
  • 41. Next Steps (2) • Develop methods to extract relations among Datasets and rank those using OLTR (interleave (Hofmann, 2013) or multileave (Schuth, 2016)) in place of human relevance judgements (as with featureRank in (Balakrishnan, et al. 2015)). • Build a knowledge graph and use GraphX to extract insights. (Useful e.g. for column concept determination (Deng et al. 2013)). • Build topic models based on structured Datasets using Glint to perform scalable topic model extraction in Spark (Jagerman and Eickhoff, 2016) [1]. [1] https://spark-summit.org/eu-2016/events/glint-an-asynchronous-parameter-server-for-spark/
  • 42. References • Alon Y. Halevy, Flip Korn, Natalya Fridman Noy, Christopher Olston, Neoklis Polyzotis, Sudip Roy, and Steven Euijong Whang. Goods: Organizing google’s datasets. In Fatmañzcan, Georgia Koutrika, and Sam Madden, editors, Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016, pages 795–806. ACM, 2016. ISBN 978-1-4503-3531-7. doi: http://doi.acm.org/10.1145/2882903.2903730. • Katja Hofmann. Fast and Reliable Online Learning to Rank for Information Retrieval. PhD thesis, Informatics Institute, University of Amsterdam, May 2013. • Rolf Jagerman and Carsten Eickhoff. Web-scale topic models in spark: An asynchronous parameter server. CoRR, abs/ 1605.07422, 2016. URL http://arxiv.org/abs/1605. 07422. • Dong Deng, Yu Jiang, Guoliang Li, Jian Li, and Cong Yu. Scalable column concept de- termination for web tables using large knowledge bases. PVLDB, 6(13):1606–1617, 2013. doi: http://www.vldb.org/pvldb/vol6/p1606-li.pdf. • Anne Schuth, Harrie Oosterhuis, Shimon Whiteson, and Maarten de Rijke. Multileave gradient descent for fast online learning to rank. In WSDM 2016: The 9th International Conference on Web Search and Data Mining, pages 457-466. ACM, February 2016. • Sreeram Balakrishnan, Alon Y. Halevy, Boulos Harb, Hongrae Lee, Jayant Madhavan, Afshin Rostamizadeh, Warren Shen, Kenneth Wilder, Fei Wu 0003, and Cong Yu. Ap- plying webtables in practice. In CIDR 2015, Seventh Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 4-7, 2015, Online Proceedings. www.cidrdb.org, 2015.
  • 43. Q&A