Multi Source Data Analysis
Using Apache Spark and Tellius
https://github.com/phatak-dev/spark2.0-examples
● Madhukara Phatak
● Director of
Engineering, Tellius
● Work on Hadoop, Spark, ML
and Scala
● www.madhukaraphatak.com
Agenda
● Multi Source Data
● Challenges with Multi Source
● Traditional and Data Lake Approach
● Spark Approach
● Data Source and Data Frame API
● Tellius Platform
● Multi Source analysis in Tellius
Multi Source Data
Multi Source Data
● In the era of cloud computing and big data, data for
analysis can come from various sources
● In every organization, it has become very common to
have multiple storage systems holding a wide variety of
data
● The nature of the data will vary from source to source
● Data can be structured, semi-structured or fully
unstructured
Multi Source Example in Ecommerce
● Relational databases are used to hold product details
and customer transactions
● Big data warehousing tools like Hadoop/Hive/Impala are
used to store historical transactions and ratings for
analytics
● Google Analytics stores the website analytics data
● Log data in S3 / Azure Blob
● Every storage system is optimized to store a specific type
of data
Multi Source Data Analysis
Need of Multi Source Analysis
● If the analysis of the data is restricted to only one
source, then we may lose sight of interesting patterns in
our business
● A complete, 360-degree view of the business is not
possible unless we consider all the data that is
available to us
● Advanced analytics such as ML or AI are more useful when
there is more variety in the data
Traditional Approach
● The traditional way of doing multi source analysis needed
all data to be moved to a single data source
● This approach made sense when the number of sources
was small and the data was well structured
● With an increasing number of sources, the ETL time
grows
● Normalizing the data to a common schema becomes
challenging for semi-structured sources
● Traditional databases also cannot hold data at this
volume
Data Lake Approach
● Move the data from different sources to a big data
enabled repository
● It solves the problem of volume, but there are still
challenges with it
● All the rich schema information in the source may not
translate well to the data lake repository
● ETL time will still be significant
● We will not be able to use the processing capabilities of
the underlying sources
● Not good for exploratory analysis
Apache Spark Approach
Requirements
● Ability to load data uniformly from different sources
irrespective of their type
● Ability to represent the data in a single format
irrespective of its source
● Ability to combine the data from the sources naturally
● Ability to query the data across the sources naturally
● Ability to use the processing of the underlying source
whenever possible
Apache Spark Approach
● Data Source API of Spark SQL allows the user to load
data uniformly from a wide variety of sources
● DataFrame/Dataset API of Spark allows the user to
represent data from all sources uniformly
● Spark SQL can join data from different
sources
● Spark SQL pushes down filters and prunes columns if the
underlying source supports it
Customer 360 Use Case
Customer 360
● Four different datasets from two different sources
● We will be using flat file and MySQL data sources
● Demographics - Primarily focuses on customer information like
age, gender, location etc. (MySQL)
● Transactions - Cost of product, purchase date, store id, store
type, brands, retail department, retail cost (MySQL)
● Credit Information – Reward Member, Redemption Method
● Marketing Information - Ad source, Promotional code
Loading Data
● We are going to use the CSV and JDBC connectors for Spark to
load the data
● Because the schema is inferred automatically, the DataFrame
will carry the schema we need
● After that we are going to preview the data using the show
method
● Ex : MultiSourceLoad
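The loading step can be sketched without a cluster. The block below is a minimal stand-in, not the deck's MultiSourceLoad example: it uses Python's csv module for the flat file and an in-memory sqlite3 database in place of MySQL, and returns rows from both as plain dictionaries so that downstream code sees a single format. The table names, column names and sample values are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical flat-file source (stands in for a CSV file on disk).
CSV_TEXT = """customerid,ad_source,promo_code
1,InstagramAds,SAVE10
2,Email,WELCOME
"""


def load_csv(text):
    """Load CSV text into a list of dicts, inferring int columns."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        # Naive schema inference: digits-only values become ints.
        rows.append({k: int(v) if v.isdigit() else v for k, v in raw.items()})
    return rows


def load_table(conn, table):
    """Load a whole database table into a list of dicts."""
    cur = conn.execute(f"SELECT * FROM {table}")
    columns = [d[0] for d in cur.description]
    return [dict(zip(columns, row)) for row in cur.fetchall()]


# Hypothetical relational source (sqlite3 stands in for MySQL here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demographics (customerid INTEGER, age INTEGER, city TEXT)")
conn.executemany(
    "INSERT INTO demographics VALUES (?, ?, ?)",
    [(1, 34, "Bangalore"), (2, 41, "Pune")],
)

marketing = load_csv(CSV_TEXT)
demographics = load_table(conn, "demographics")

# Preview both sources the same way, similar in spirit to show().
for name, rows in [("marketing", marketing), ("demographics", demographics)]:
    print(name, rows[:2])
```

Both loaders return the same row shape, which is the property the Data Source API provides at scale: code after the load does not need to know where the rows came from.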
Multi Source Data Model
● We can define a data model using Spark joins
● Here we will be joining the 4 datasets on customerid as
the common key
● After an inner join, we get a data model which
has all the sources combined
● Ex : MultiSourceDataModel
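As a rough illustration of the join described above, the sketch below inner-joins four small tables on customerid. It is not the deck's MultiSourceDataModel example: it runs plain SQL on an in-memory sqlite3 database as a stand-in for Spark SQL, and every table, column and value is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE demographics (customerid INTEGER, city TEXT);
CREATE TABLE transactions (customerid INTEGER, department TEXT, revenue REAL);
CREATE TABLE credit (customerid INTEGER, reward_member TEXT);
CREATE TABLE marketing (customerid INTEGER, ad_source TEXT);

INSERT INTO demographics VALUES (1, 'Bangalore'), (2, 'Pune'), (3, 'Delhi');
INSERT INTO transactions VALUES (1, 'Grocery', 120.0), (2, 'Apparel', 80.0);
INSERT INTO credit VALUES (1, 'yes'), (2, 'no'), (3, 'yes');
INSERT INTO marketing VALUES (1, 'InstagramAds'), (2, 'Email'), (3, 'Email');
""")

# Inner join on the shared key: only customers present in all four
# datasets survive, so customer 3 (no transaction) drops out.
DATA_MODEL_SQL = """
SELECT d.customerid, d.city, t.department, t.revenue,
       c.reward_member, m.ad_source
FROM demographics d
JOIN transactions t ON t.customerid = d.customerid
JOIN credit c       ON c.customerid = d.customerid
JOIN marketing m    ON m.customerid = d.customerid
ORDER BY d.customerid
"""
data_model = conn.execute(DATA_MODEL_SQL).fetchall()
for row in data_model:
    print(row)
```

The result is one flat row per matching combination, which is convenient for querying but depends on the join key being unique on each side.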
Multi Source Analysis
● Show us the sales by different sources
● Average Cost and Sum Revenue by City and
Department
● Revenue by Campaign
● Ex : MultiSourceDataAnalysis
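The questions above map onto ordinary group-by aggregations over the combined data model. The sketch below is an assumption-laden stand-in for the deck's MultiSourceDataAnalysis example: it uses sqlite3 instead of Spark SQL, and the customer360 table, its columns and its values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical flat data model produced by joining the sources.
conn.execute(
    "CREATE TABLE customer360 "
    "(customerid INTEGER, city TEXT, department TEXT, "
    "ad_source TEXT, cost REAL, revenue REAL)"
)
conn.executemany(
    "INSERT INTO customer360 VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "Bangalore", "Grocery", "InstagramAds", 10.0, 30.0),
        (2, "Bangalore", "Grocery", "Email", 20.0, 50.0),
        (3, "Pune", "Apparel", "InstagramAds", 40.0, 90.0),
    ],
)

# Sales (sum of revenue) by ad source.
sales_by_source = conn.execute(
    "SELECT ad_source, SUM(revenue) FROM customer360 "
    "GROUP BY ad_source ORDER BY ad_source"
).fetchall()

# Average cost and total revenue by city and department.
by_city_department = conn.execute(
    "SELECT city, department, AVG(cost), SUM(revenue) FROM customer360 "
    "GROUP BY city, department ORDER BY city, department"
).fetchall()

print(sales_by_source)
print(by_city_department)
```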
Introduction to Tellius
About Tellius
Search and AI-powered analytics platform,
enabling anyone to get answers from their business data
using an intuitive search-driven interface and automatically
uncover hidden insights with machine learning
SMART INTUITIVE PERSONALIZED
Customers expect ON-DEMAND, Personalized experience
We live in the era of intelligent consumer apps
● Low analytics adoption: takes days/weeks to get
answers to ad-hoc questions
● Analysis process not scalable: time consuming manual process of
analyzing millions of combinations and charts
● Trust with AI for business outcomes: no easy way for business users and
analysts to understand, trust and leverage ML/AI techniques
So much business data, but very few insights
Tellius is disrupting data analytics with AI
Combining modern search driven user experience with
AI-driven automation to find hidden answers
Tellius Modern Analytics experience
Get Instant answers
Start exploring
Reduce your analysis time from
Hours to Mins
Explainable AI for business
analysts
Time consuming,
Canned reports and dashboards
On-Demand,
Personalized experience
Self-service data prep
Scalable In-Memory Data Platform
Search-driven
Conversational Analytics
Automated discovery
Of insights
Automated Machine
Learning
Only AI Platform that enables collaboration between roles
DATA MANAGEMENT
Visual Data prep with
SQL/ Python support
VISUAL ANALYSIS
Voice Enabled Search Driven
Interface for Asking Questions
Business User
Data Science
Practitioner
Data Analyst
Data Engineer
DISCOVERY OF INSIGHTS
Augmented discovery of insights
With natural language narrative
MACHINE LEARNING
AutoML and deployment of
ML models with Explainable AI
● Intuitive UX: Google-like search-driven conversational
interface
● AI-Driven Automation: reveals hidden relevant insights,
saving 1000’s of hours
● Unified Analytics Experience: eliminating friction
between self-service data prep, ad-hoc analysis
and explainable ML models
● Scalable Architecture: in-memory architecture capable
of handling billions of records
Why Tellius?
Only company providing instant Natural language Search experience, surfacing
AI-driven relevant insights across billions of records across data sources at scale and
enabling users to easily create and explain ML/AI models
Business Value Proposition
● Uncover Hidden Insights: automate discovery of relevant
hidden insights in your data
● Ease of Use: get instant answers with a conversational
search-driven approach
● Save Time: augment the manual discovery process
with automation powered by machine learning
Our Vision - Accelerate the journey to the AI-driven enterprise
CONNECT EXPLORE DISCOVER PREDICT
Customer 360 on Tellius
Loading Data
● Tellius exposes various kinds of data sources to connect
to, using the Spark Data Source API
● In this use case, we will be using the MySQL and CSV
connectors to load the data into the system
● Tellius collects metadata about the data as part of
loading
● Some of the connectors, like Salesforce and Google
Analytics, are homegrown using the same Data Source API
Defining Data Model
● Tellius calls data models business views
● A business view allows the user to create a data model across
datasets seamlessly
● Internally, all datasets in Tellius are represented as Spark
DataFrames
● Defining a business view in Tellius is like defining a
join in Spark SQL
Multi Source analysis using NLP
● Which top 6 sources by avg revenue
● Hey Tellius what’s my revenue broken down by
department
● show revenue by cit
● show revenue by department for InstagramAds
● These ultimately run as Spark queries and produce
the results
● We can also use voice
Multi Source analysis using Assistant
● Show total revenue
● By city
● What about cost
● for InstagramAds
● Use Voice
● Try out Google Home
Challenges
Spark DataModel
● A Spark join creates a flat data model, which is different
from a typical data warehouse data model
● This flat data model works well when there is no
duplication of primary keys, as in a star model
● But if there is duplication, we end up double counting
values when we run the queries directly
● Example : DoubleCounting
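The double counting effect is easy to reproduce on a tiny dataset. The sketch below is not the deck's DoubleCounting example: it uses sqlite3 as a stand-in for Spark SQL, with hypothetical tables in which one customer appears twice on the marketing side of the join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transactions (customerid INTEGER, revenue REAL);
CREATE TABLE marketing (customerid INTEGER, ad_source TEXT);

INSERT INTO transactions VALUES (1, 100.0), (2, 50.0);
-- Customer 1 appears twice: the join key is duplicated on this side.
INSERT INTO marketing VALUES (1, 'InstagramAds'), (1, 'Email'), (2, 'Email');
""")

# Revenue summed at its own grain, before any join.
true_total = conn.execute("SELECT SUM(revenue) FROM transactions").fetchone()[0]

# Summing over the flat join repeats customer 1's revenue once per
# matching marketing row.
joined_total = conn.execute("""
    SELECT SUM(t.revenue)
    FROM transactions t
    JOIN marketing m ON m.customerid = t.customerid
""").fetchone()[0]

print(true_total, joined_total)
```

The joined total is inflated because customer 1's revenue of 100.0 is counted once for each of the two matching marketing rows.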
Handling Double Counting in Tellius
● Tellius has implemented its own query language on top
of the Spark SQL layer to implement data warehouse
like strategies that avoid this double counting
● This layer allows Tellius to provide multi source analysis
on top of Spark with the accuracy of a data warehouse system
● Ex : show point_redemeption_method
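The deck does not describe how the Tellius query layer works internally. As an illustration only, one common warehouse-style strategy is to reduce the duplicated side of the join to one row per key before joining, so that each measure is counted once. The sketch below shows that idea with sqlite3 as a stand-in for Spark SQL. The tables and values are hypothetical, and this is not Tellius's implementation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transactions (customerid INTEGER, revenue REAL);
CREATE TABLE marketing (customerid INTEGER, ad_source TEXT);

INSERT INTO transactions VALUES (1, 100.0), (2, 50.0);
-- Customer 1 has two marketing rows, which would inflate a naive join.
INSERT INTO marketing VALUES (1, 'InstagramAds'), (1, 'Email'), (2, 'Email');
""")

# Naive aggregate over the flat join: customer 1 is counted twice.
naive_total = conn.execute("""
    SELECT SUM(t.revenue)
    FROM transactions t
    JOIN marketing m ON m.customerid = t.customerid
""").fetchone()[0]

# Deduplicate the many side to one row per join key first, so the
# join cannot multiply rows of the table that holds the measure.
safe_total = conn.execute("""
    SELECT SUM(t.revenue)
    FROM transactions t
    JOIN (SELECT DISTINCT customerid FROM marketing) m
      ON m.customerid = t.customerid
""").fetchone()[0]

print(naive_total, safe_total)
```

The trade-off is that the deduplicated join answers "total revenue for customers with any marketing touch"; a breakdown by ad_source still needs an explicit attribution rule for customers reached by more than one source.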
We are Hiring!!!
Thank You

 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxHaritikaChhatwal1
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
INTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingINTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingsocarem879
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataTecnoIncentive
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 

Último (20)

Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data Visualization
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptx
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
INTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingINTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processing
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded data
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 

Multi Source Data Analysis using Spark and Tellius

  • 1. Multi Source Data Analysis Using Apache Spark and Tellius https://github.com/phatak-dev/spark2.0-examples
  • 2. ● Madhukara Phatak ● Director of Engineering, Tellius ● Works on Hadoop, Spark, ML and Scala ● www.madhukaraphatak.com
  • 3. Agenda ● Multi Source Data ● Challenges with Multi Source ● Traditional and Data Lake Approach ● Spark Approach ● Data Source and Data Frame API ● Tellius Platform ● Multi Source analysis in Tellius
  • 5. Multi Source Data ● In the era of cloud computing and big data, data for analysis can come from various sources ● In every organization, it has become very common to store a wide variety of data across multiple storage systems ● The nature of the data varies from source to source ● Data can be structured, semi-structured or fully unstructured
  • 6. Multi Source Example in Ecommerce ● Relational databases hold product details and customer transactions ● Big data warehousing tools like Hadoop/Hive/Impala store historical transactions and ratings for analytics ● Google Analytics stores the website analytics data ● Log data lives in S3 / Azure Blob ● Every storage system is optimized to store a specific type of data
  • 7. Multi Source Data Analysis
  • 8. Need for Multi Source Analysis ● If the analysis is restricted to only one source, we may lose sight of interesting patterns in our business ● A complete, 360-degree view of the business is not possible unless we consider all the data available to us ● Advanced analytics like ML or AI is more useful when there is more variety in the data
  • 9. Traditional Approach ● The traditional way of doing multi source analysis required all data to be moved to a single data source ● This approach made sense when the sources were few and the data was well structured ● As the number of sources grows, the time spent on ETL grows with it ● Normalizing the data to a common schema becomes challenging for semi-structured sources ● Traditional databases also cannot hold the data at this volume
  • 10. Data Lake Approach ● Move the data from different sources to a big data enabled repository ● It solves the problem of volume, but challenges remain ● The rich schema information in the source may not translate well to the data lake repository ● ETL time will still be significant ● We cannot use the processing capabilities of the underlying source ● Not good for exploratory analysis
  • 12. Requirements ● Ability to load data uniformly from different sources irrespective of their type ● Ability to represent the data in a single format irrespective of its source ● Ability to combine data from the sources naturally ● Ability to query data across the sources naturally ● Ability to use the underlying source's processing whenever possible
  • 13. Apache Spark Approach ● The Data Source API of Spark SQL allows the user to load data uniformly from a wide variety of sources ● The DataFrame/Dataset API of Spark allows the user to represent data from all sources uniformly ● Spark SQL can join data from different sources ● Spark SQL pushes down filters and prunes columns if the underlying source supports it
  • 15. Customer 360 ● Four different datasets from two different sources ● We will be using flat file and MySQL data sources ● Transactions - Primarily focuses on customer information like age, gender, location etc. (MySQL) ● Demographics - Cost of product, purchase date, store id, store type, brands, retail department, retail cost (MySQL) ● Credit Information - Reward member, redemption method ● Marketing Information - Ad source, promotional code
  • 16. Loading Data ● We are going to use the csv and jdbc connectors for Spark to load the data ● Thanks to automatic schema inference, the data frames carry all the schema we need ● After that we preview the data using the show method ● Ex : MultiSourceLoad
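The loading step on slide 16 might look roughly like the sketch below. It is not the MultiSourceLoad example from the repository; the function names are hypothetical, and the session is passed in as a parameter so the snippet itself does not import Spark. The reader calls (`format`, `option`, `load`) and the `header`, `inferSchema`, `url`, `dbtable`, `user` and `password` options are standard Spark DataFrameReader usage.

```python
from typing import Any


def load_csv(spark: Any, path: str) -> Any:
    """Load a flat file through the Data Source API with schema inference."""
    return (
        spark.read.format("csv")
        .option("header", "true")       # first row holds the column names
        .option("inferSchema", "true")  # let Spark infer the column types
        .load(path)
    )


def load_jdbc(spark: Any, url: str, table: str, user: str, password: str) -> Any:
    """Load a database table through the same API; only format and options change."""
    return (
        spark.read.format("jdbc")
        .option("url", url)          # e.g. a MySQL JDBC URL
        .option("dbtable", table)    # table to read
        .option("user", user)
        .option("password", password)
        .load()
    )
```

With a real session (for example `SparkSession.builder.getOrCreate()` from `pyspark.sql`) and the MySQL JDBC driver on the classpath, calling `.show()` on either result would preview the data. Both functions return a DataFrame, which is what lets the later join treat the two sources the same way.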
  • 17. Multi Source Data Model ● We can define a data model using Spark joins ● Here we join the 4 datasets on customerid, the common column ● After an inner join, we get a data model that combines all the sources ● Ex : MultiSourceDataModel
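As a rough illustration of the data model on slide 17, the four datasets can be folded into one with successive inner joins on `customerid`. This is a sketch, not the repository's MultiSourceDataModel example: `build_data_model` is a hypothetical helper, and it assumes each input is a Spark DataFrame, whose `join(other, on=..., how=...)` method is used here.

```python
from functools import reduce
from typing import Any, Sequence


def build_data_model(datasets: Sequence[Any], key: str = "customerid") -> Any:
    """Inner-join every dataset on the shared key, left to right."""
    return reduce(
        # Joining on a column name keeps a single copy of the key column.
        lambda left, right: left.join(right, on=key, how="inner"),
        datasets,
    )
```

Calling `build_data_model([transactions, demographics, credit, marketing])` would return one flat DataFrame containing the columns of all four sources.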
  • 18. Multi Source Analysis ● Show us the sales by different sources ● Average Cost and Sum Revenue by City and Department ● Revenue by Campaign ● Ex : MultiSourceDataAnalysis
  • 20. About Tellius ● Search and AI-powered analytics platform, enabling anyone to get answers from their business data using an intuitive search-driven interface and automatically uncover hidden insights with machine learning
  • 21. SMART ● INTUITIVE ● PERSONALIZED ● Customers expect an ON-DEMAND, personalized experience ● We live in the era of intelligent consumer apps
  • 22. Takes days/weeks to get answers to ad-hoc questions ● Time-consuming manual process of analyzing millions of combinations and charts ● No easy way for business users and analysts to understand, trust and leverage ML/AI techniques ● Low analytics adoption ● Analysis process not scalable ● Trust with AI for business outcomes ● So much business data, but very few insights
  • 23. Tellius is disrupting data analytics with AI ● Combining modern search-driven user experience with AI-driven automation to find hidden answers
  • 24. Tellius Modern Analytics experience ● Get instant answers ● Start exploring ● Reduce your analysis time from hours to minutes ● Explainable AI for business analysts ● Time consuming, canned reports and dashboards ● On-demand, personalized experience ● Self-service data prep ● Scalable in-memory data platform ● Search-driven conversational analytics ● Automated discovery of insights ● Automated machine learning
  • 25. Only AI Platform that enables collaboration between roles ● DATA MANAGEMENT: Visual data prep with SQL/Python support ● VISUAL ANALYSIS: Voice-enabled, search-driven interface for asking questions ● Business User ● Data Science Practitioner ● Data Analyst ● Data Engineer ● DISCOVERY OF INSIGHTS: Augmented discovery of insights with natural language narrative ● MACHINE LEARNING: AutoML and deployment of ML models with Explainable AI
  • 26. Google-like search-driven conversational interface ● Reveals hidden relevant insights, saving 1000’s of hours ● Eliminating friction between self-service data prep to ad-hoc analysis and explainable ML models ● In-memory architecture capable of handling billions of records ● Intuitive UX ● AI-Driven Automation ● Unified Analytics Experience ● Scalable Architecture ● Why Tellius? ● Only company providing instant natural language search experience, surfacing AI-driven relevant insights across billions of records across data sources at scale and enabling users to easily create and explain ML/AI models
  • 27. Business Value Proposition ● Automate discovery of relevant hidden insights in your data ● Ease of Use ● Uncover Hidden Insights ● Get instant answers with conversational search-driven approach ● Save Time ● Augment manual discovery process with automation powered by machine learning
  • 28. Our Vision - Accelerate journey to AI-driven Enterprise ● CONNECT ● EXPLORE ● DISCOVER ● PREDICT
  • 29. Customer 360 on Tellius
  • 30. Loading Data ● Tellius exposes various kinds of data sources to connect to, using the Spark Data Source API ● In this use case, we will be using the MySQL and csv connectors to load the data into the system ● Tellius collects metadata about the data as part of loading ● Some connectors, like Salesforce and Google Analytics, are homegrown using the same Data Source API
  • 31. Defining Data Model ● Tellius calls data models business views ● A business view allows the user to create a data model across datasets seamlessly ● Internally, all datasets in Tellius are represented as Spark DataFrames ● Defining a business view in Tellius is like defining a join in Spark SQL
  • 32. Multi Source analysis using NLP ● Which top 6 sources by avg revenue ● Hey Tellius what’s my revenue broken down by department ● show revenue by cit ● show revenue by department for InstagramAds ● These ultimately run as Spark queries and produce the results ● We can also use voice
  • 33. Multi Source analysis using Assistant ● Show total revenue ● By city ● What about cost ● for InstagramAds ● Use Voice ● Try out Google Home
  • 35. Spark DataModel ● A Spark join creates a flat data model, which is different from a typical data warehouse data model ● This flat data model works well when there is no duplication of primary keys, as in a star model ● But if keys are duplicated, we end up double counting values when we run queries directly ● Example : DoubleCounting
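The double counting described on slide 35 can be reproduced with a few rows. The sketch below uses Python's built-in `sqlite3` as a stand-in for Spark SQL, since an inner join fans out rows the same way in both; it is not the repository's DoubleCounting example, and the table names, column names and values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions (customerid INTEGER, revenue REAL);
    CREATE TABLE marketing (customerid INTEGER, ad_source TEXT);
    INSERT INTO transactions VALUES (1, 100.0), (2, 50.0);
    -- customer 1 appears twice, so the join key is not unique on this side
    INSERT INTO marketing VALUES (1, 'InstagramAds'), (1, 'Email'), (2, 'Email');
""")

# Revenue measured on the source table alone.
true_total = conn.execute("SELECT SUM(revenue) FROM transactions").fetchone()[0]

# Revenue measured on the flat joined model: customer 1's row is repeated
# once per matching marketing row, so its revenue is counted twice.
joined_total = conn.execute("""
    SELECT SUM(t.revenue)
    FROM transactions t
    JOIN marketing m ON t.customerid = m.customerid
""").fetchone()[0]

print(true_total)    # 150.0
print(joined_total)  # 250.0
```

The join itself is correct; the problem only appears when a measure from one table is summed over rows that were multiplied by duplicate keys in another.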
  • 36. Handling Double Counting in Tellius ● Tellius has implemented its own query language on top of the Spark SQL layer to implement data-warehouse-like strategies that avoid this double counting ● This layer allows Tellius to provide multi source analysis on top of Spark with the accuracy of a data warehouse system ● Ex : show point_redemeption_method
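The slides do not describe how the Tellius query layer works, so the following is only a generic sketch of one common warehouse-style strategy: aggregate each dataset to one row per key before joining, so that no measure is multiplied by the fan-out. It again uses `sqlite3` as a stand-in for Spark SQL, and all table names, column names and values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions (customerid INTEGER, revenue REAL);
    CREATE TABLE redemptions (customerid INTEGER, points INTEGER);
    INSERT INTO transactions VALUES (1, 100.0), (1, 50.0), (2, 30.0);
    INSERT INTO redemptions VALUES (1, 10), (1, 20), (2, 5);
""")

# Naive: join first, aggregate after. Customer 1 yields 2 x 2 = 4 rows,
# so both measures are counted twice for that customer.
naive = conn.execute("""
    SELECT SUM(t.revenue), SUM(r.points)
    FROM transactions t
    JOIN redemptions r ON t.customerid = r.customerid
""").fetchone()

# Safer: reduce each side to one row per customerid, then join.
safe = conn.execute("""
    SELECT SUM(t.revenue), SUM(r.points)
    FROM (SELECT customerid, SUM(revenue) AS revenue
          FROM transactions GROUP BY customerid) t
    JOIN (SELECT customerid, SUM(points) AS points
          FROM redemptions GROUP BY customerid) r
      ON t.customerid = r.customerid
""").fetchone()

print(naive)  # (330.0, 65)
print(safe)   # (180.0, 35)
```

Pre-aggregation makes the join key unique on both sides, which is the condition slide 35 names for the flat model to give correct totals.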
  • 37. References ● Dataset API - https://www.youtube.com/watch?v=hHFuKeeQujc ● Structured Data Analysis - https://www.youtube.com/watch?v=0jd3EWmKQfo ● Anatomy of Spark SQL - https://www.youtube.com/watch?v=TCWOJ6EJprY