@joe_Caserta@GreatLakesBI
Architecting for Big Data:
Trends, Tips, and Deployment Options
Joe Caserta
President
Caserta Concepts
Top 20 Big Data Consulting - CIO Review
Joe Caserta Timeline
[Timeline graphic; years marked: 1986, 1996, 2001, 2004, 2009, 2010, 2012, 2013, 2014. Milestones as listed on the slide:]
• Launched Big Data practice
• Co-author, with Ralph Kimball, of The Data Warehouse ETL Toolkit (Wiley)
• Dedicated to Data Warehousing, Business Intelligence since 1996
• Began consulting in database programming and data modeling
• 25+ years hands-on experience building database solutions
• Founded Caserta Concepts in NYC
• Web log analytics solution published in Intelligent Enterprise
• Formalized Alliances / Partnerships – System Integrators
• Partnered with Big Data vendors Cloudera, Hortonworks, IBM, Cisco, Datameer, Basho, and more…
• Launched Training practice, teaching data concepts world-wide
• Laser focus on extending Data Warehouses with Big Data solutions
• Launched Big Data Warehousing Meetup in NYC ~ 1,500 Members
• Established best practices for big data ecosystem implementation – Healthcare, Finance, Insurance
• Top 20 Most Powerful Big Data consulting firms
• Dedicated to Data Governance Techniques on Big Data (Innovation)
About Caserta Concepts
• Technology services company with expertise in data analysis:
  • Big Data Solutions
  • Data Warehousing
  • Business Intelligence
  • Data Science & Analytics
  • Cloud Computing
  • Data Interaction & Visualization
• Core focus in the following industries:
  • eCommerce / Retail / Digital Marketing
  • Financial Services / Insurance
  • Healthcare / Higher Education
• Established in 2001:
  • Increased growth year-over-year
  • Industry recognized work force
  • Strategy, Implementation, Analytics
  • Writing, Education, Mentoring
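The Evolution of Enterprise Data
[Diagram: source systems (Sales, Marketing, Finance, Others…) feed through ETL into the Enterprise Data Warehouse, which serves Ad-Hoc/Canned Reporting and Traditional BI. Alongside it sits a Horizontally Scalable Environment - Optimized for Analytics, containing the Big Data Lake and Big Data Analytics: the Hadoop Distributed File System (HDFS) across nodes N1 to N5, with Spark, MapReduce, Pig/Hive and NoSQL Databases on top, supporting Data Exploration and Data Science. ETL connects the components.]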
[Diagram labels: Tools and Technologies; Best Practices; Data Warehousing/ETL/Data Integration; BI/Visualization/Analytics; Big Data Analytics]
The ones you need to know…
• Hadoop Distribution: Cloudera, Hortonworks, MapR, Pivotal-HD, IBM
• Tools:
  • Hive: Map data to structures and use SQL-like queries
  • Pig: Data transformation language for big data
  • Sqoop: Extracts data from external sources and loads it into Hadoop
  • Spark: General-purpose cluster computing framework
  • Storm: Real-time ETL
• NoSQL:
  • Document: MongoDB, CouchDB
  • Graph: Neo4j, Titan
  • Key Value: Riak, Redis
  • Columnar: Cassandra, HBase
  • Search: Lucene, Solr, Elasticsearch
• Languages: Python (with SciPy), Java, R, Scala
Why are we Changing?
Use cases:
• Advertising: Real-time interactive queries on massive audience datasets in the cloud
• Global analytics on the cloud: Integrate SAP implementations from across the globe into a single cloud solution
• Recommendation Engines: “You chose… you might also like…”
• Real-Time: Aggregation, Monitoring & Alerting on events at extremely high message rates… ~1M msgs/sec
• Big Data Warehouse: Extending EDW with Hadoop; governing data from the “lake” to the EDW
Client sectors: Personal/Commercial Banking, Investment/Trading Bank, World-wide beauty company, Cable Television, Audience-based Advertising
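The real-time aggregation, monitoring and alerting use case can be illustrated with a tumbling-window count. This is a minimal single-process sketch of the logic only, not the deployed solution: the event layout, the one-second window, and the alert threshold are assumptions, and message rates near 1M msgs/sec would need a distributed stream processor such as Storm, which the deck lists among its tools.

```python
from collections import Counter

def windowed_counts(events, window_seconds=1.0):
    """Count events per fixed (tumbling) window.

    `events` is an iterable of (timestamp_seconds, payload) pairs (assumed layout).
    """
    counts = Counter()
    for timestamp, _payload in events:
        counts[int(timestamp // window_seconds)] += 1
    return dict(counts)

def windows_over(counts, threshold):
    """Return the window indexes whose event count exceeds the alert threshold."""
    return sorted(window for window, n in counts.items() if n > threshold)

# Hypothetical sample stream.
events = [(0.1, "a"), (0.5, "b"), (0.9, "c"), (1.2, "d"),
          (2.0, "e"), (2.1, "f"), (2.2, "g"), (2.9, "h")]
counts = windowed_counts(events)
print(counts)                             # {0: 3, 1: 1, 2: 4}
print(windows_over(counts, threshold=3))  # [2]
```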
Components of Data Governance
• Organization: This is the ‘people’ part. Establishing Enterprise Data Council, Data Stewards, etc.
• Metadata: Definitions, lineage (where does this data come from), business definitions, technical metadata
• Privacy/Security: Identify and control sensitive data, regulatory compliance
• Data Quality and Monitoring: Data must be complete and correct. Measure, improve, certify
• Business Process Integration: Policies around data frequency, source availability, etc.
• Master Data Management: Ensure consistent business critical data, e.g. Members, Providers, Agents, etc.
• Information Lifecycle Management (ILM): Data retention, purge schedule, storage/archiving
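The "measure, improve, certify" loop under Data Quality and Monitoring can be illustrated with a small completeness check. This is a minimal sketch rather than the author's method: the record layout, the field names, and the 0.95 certification threshold are all assumptions chosen for illustration.

```python
from __future__ import annotations

from typing import Iterable, Mapping

def completeness(records: Iterable[Mapping[str, object]],
                 required: list[str]) -> dict[str, float]:
    """Return, per required field, the fraction of records with a non-empty value."""
    rows = list(records)
    if not rows:
        return {field: 0.0 for field in required}
    scores = {}
    for field in required:
        filled = sum(1 for row in rows if row.get(field) not in (None, ""))
        scores[field] = filled / len(rows)
    return scores

def certify(scores: Mapping[str, float], threshold: float = 0.95) -> bool:
    """Certify the data set only if every field meets the (assumed) threshold."""
    return all(score >= threshold for score in scores.values())

# Hypothetical member records with two gaps.
members = [
    {"member_id": "M1", "provider_id": "P9", "state": "NY"},
    {"member_id": "M2", "provider_id": "", "state": "NJ"},
    {"member_id": "M3", "provider_id": "P4", "state": None},
    {"member_id": "M4", "provider_id": "P4", "state": "CT"},
]
scores = completeness(members, ["member_id", "provider_id", "state"])
print(scores)           # {'member_id': 1.0, 'provider_id': 0.75, 'state': 0.75}
print(certify(scores))  # False
```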
Components of Data Governance: For Big Data
• Add Big Data to overall framework and assign responsibility
• Add data scientists to the Stewardship program
• Assign stewards to new data sets (Twitter, call center logs, etc.)
• Graph databases are more flexible than relational
• Lower latency service required
• Distributed data quality and matching algorithms
• Data Quality and Monitoring (probably home-grown; Drools?)
• Quality checks not only SQL: machine learning, Pig and MapReduce
• Acting on large dataset quality checks may require distribution
• Larger scale
• New datatypes
• Integrate with Hive Metastore, HCatalog, home-grown tables
• Secure and mask multiple data types (not just tabular)
• Deletes are less common (unless there is a regulatory requirement)
• Take advantage of compression and archiving (like AWS Glacier)
• Data detection and masking on unstructured data upon ingest
• Near-zero latency, DevOps, core component of business operations
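Detection and masking of sensitive values in unstructured text at ingest can be sketched with regular expressions. This is an illustrative sketch only: the two patterns (email address and US Social Security number format) and the `<LABEL>` token style are assumptions, and real detection needs much broader coverage than this.

```python
from __future__ import annotations

import re

# Illustrative patterns only (assumptions); production detection needs many more.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def mask_on_ingest(text: str) -> tuple[str, dict[str, int]]:
    """Replace each detected sensitive value with a <LABEL> token and count the hits."""
    counts = {}
    for label, pattern in PATTERNS.items():
        text, hits = pattern.subn(f"<{label}>", text)
        counts[label] = hits
    return text, counts

# Hypothetical call center note.
note = "Caller jane.doe@example.com gave SSN 123-45-6789 and asked for a callback."
masked, found = mask_on_ingest(note)
print(masked)  # Caller <EMAIL> gave SSN <SSN> and asked for a callback.
print(found)   # {'EMAIL': 1, 'SSN': 1}
```

Returning the hit counts alongside the masked text lets the ingest job record what was found, which feeds the metadata catalog without storing the sensitive values themselves.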
The Big Data Pyramid
• Data has different governance demands at each tier.
• Only the top tier of the pyramid is fully governed and ready for Enterprise BI.
Tiers, from top to bottom:
• Big Data Warehouse: Fully data governed (trusted). User community arbitrary queries and reporting.
• Data Science Workspace: Agile business insight through data-munging, machine learning, blending with external data, development of to-be BDW facts. Governance: Metadata → Catalog; ILM → who has access, how long to “manage it”; Data Quality and Monitoring → monitoring of completeness of data.
• Data Lake – Integrated Sandbox: Data is ready to be turned into information: organized, well defined, complete. Governance: Metadata → Catalog; ILM → who has access, how long do we “manage it”; Data Quality and Monitoring → monitor completeness of data.
• Landing Area – Source Data in “Full Fidelity”: Raw machine data collection, collect everything. Governance: Metadata → Catalog; ILM → who has access, how long do we “manage it”.
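The idea that each tier carries a different set of governance controls can be written down as a small data structure. This is a sketch under stated assumptions: the tier-to-control mapping is read off the slide, the control names are hypothetical identifiers, and "fully governed" is taken here to mean all seven governance components apply.

```python
from dataclasses import dataclass

# The seven governance components named in the deck (identifier spellings are assumptions).
ALL_CONTROLS = frozenset({
    "organization", "metadata", "privacy_security", "data_quality",
    "business_process_integration", "master_data_management", "ilm",
})

@dataclass(frozen=True)
class Tier:
    name: str
    controls: frozenset

# Bottom of the pyramid first.
TIERS = [
    Tier("Landing Area", frozenset({"metadata", "ilm"})),
    Tier("Data Lake", frozenset({"metadata", "ilm", "data_quality"})),
    Tier("Data Science Workspace", frozenset({"metadata", "ilm", "data_quality"})),
    Tier("Big Data Warehouse", ALL_CONTROLS),
]

def ready_for_enterprise_bi(tier: Tier) -> bool:
    """A tier is ready for Enterprise BI only when every control applies (assumption)."""
    return tier.controls == ALL_CONTROLS

def missing_controls(tier: Tier) -> list:
    """List the controls a tier would still need before it could be trusted."""
    return sorted(ALL_CONTROLS - tier.controls)

print([t.name for t in TIERS if ready_for_enterprise_bi(t)])  # ['Big Data Warehouse']
print(missing_controls(TIERS[0]))
```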
BI is About to be Disrupted!
• The Big Data movement breaks the relational database barrier and enables analysis on massive amounts of structured and unstructured data.
• NoSQL puts the value of SQL-based relational databases into question. This disruption is forging a new road for the progress and advancement of scalable data analytics.
• The value of legacy Business Intelligence comes into question.
• Rather than forcing data users to become technologists, BI must make data analysis available to the masses.
Who Does BI Today?
• The role of the ‘Business Analyst’, the primary user of the BI tool, is being replaced by two types of data users:
1. Highly technical Data Scientists
2. Non-technical Business Persons
• New analytics (BI) platforms must be created to accommodate the new users. We see these very distinct users using very different technologies.
• Perhaps legacy BI tools will not go away, but the market is absolutely about to be disrupted.
Empower the Data Scientist
• Data Scientists have deep technical knowledge
• They enjoy writing code and mining data
• The best way to serve a data scientist is to provide access to raw data and then get out of their way.
What does a Data Scientist Do, Anyway?
NOT:
• Writes really cool and sophisticated algorithms that impact the way the business runs.
Much of the time of a Data Scientist is spent:
• Searching for the data they need
• Making sense of the data
• Figuring out why the data looks the way it does and assessing its validity
• Cleaning up all the garbage within the data so it represents true business
• Combining events with reference data to give them context
• Correlating event data with other events
• Finally, they write algorithms to perform mining, clustering and predictive analytics – the sexy stuff.
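Combining events with reference data to give them context is typically a keyed lookup. The following is a minimal sketch of that one step; the field names (`customer_id`, `state`, `age_band`) and the sample rows are hypothetical, and events without a reference match are set aside for investigation rather than dropped.

```python
def enrich(events, reference, key="customer_id"):
    """Attach reference attributes to each event; collect events with no match."""
    enriched, orphans = [], []
    for event in events:
        ref = reference.get(event.get(key))
        if ref is None:
            orphans.append(event)  # needs cleanup before it can be analyzed
        else:
            enriched.append({**event, **ref})
    return enriched, orphans

# Hypothetical purchase events and customer reference data.
events = [
    {"customer_id": "C1", "product": "Records", "amount": 12.0},
    {"customer_id": "C2", "product": "Perfume", "amount": 55.0},
    {"customer_id": "C9", "product": "Clothes", "amount": 30.0},  # no reference row
]
reference = {
    "C1": {"state": "NY", "age_band": "18-24"},
    "C2": {"state": "OH", "age_band": "25-34"},
}
enriched, orphans = enrich(events, reference)
print(len(enriched), len(orphans))  # 2 1
```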
Empower the Business Person
• Business users don’t have, and don’t want to have, the technical wherewithal to interact with ‘data’.
• “We have a business to run! Programming should be done by people in rooms with no windows.”
• “I need information at my fingertips and I should not need a PhD in SQL to get it.”
• “It’s a myth that BI tools will solve my problems; I still need IT to get new reports. This is unacceptable.”
• Every business professional on the planet knows how to search for needed information via a Google search bar.
• Business people want to be able to ‘Google’ their corporate data for the information they need.
The Future of BI (if the Business gets its way)…
Navigating Data in BI…
Facets created automatically based on relevant data
Why do we make it so difficult?
• During normal BI implementations, much time is spent (or wasted) on selecting the best way to graphically represent a set of metrics.
• We can embed algorithms that are statistically proven to best represent information depending on the type of question being asked.
• The user should be able to preview and change from the default infographic as easily as clicking ‘next’ on a Yahoo! Slideshow.
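Imagine the Possibilities…
[Mock-up of a search-driven BI screen: the query “Lady gaga sales by state by customer age” with a “Go!” button; facets for Region (Northeast, Midwest, South, West), Product (Records, Perfume, Clothes, Performances) and Dates (2009 to 2013); a “Download to Excel” button; the address joe@casertaconcepts.com.]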
Building the Future of BI (Hint: it’s Big Data)
• Angular: Modern web application framework, developed and supported by Google. Bootstrap used for mobile.
• D3.js: JavaScript library for data visualization. Exposes the full capabilities of CSS3, HTML5 and SVG. Extremely fast. Supports large datasets and dynamic behaviors for interaction.
• Python: The “glue” that brings other components together. The ‘engine’ that transforms search strings into queries. Integrated with the Customer Metadata repository.
• Solr: Full-text and faceted-search engine and database. This is the backbone of the application.
• Cassandra: Customer Metadata repository. Stores all business rules (default facets, etc.) and user preferences (default graph types, etc.). Cassandra may not be the ultimate selection.
• AWS: Amazon Web Services. Product is a zero-footprint cloud-based solution. User experience is the same as Googling info.
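The Python "engine" that turns a search string into a query might look roughly like the sketch below. It is not the product's implementation: splitting on the word "by" is a naive assumption, the facet field names are hypothetical, and a real engine would consult the metadata repository to map phrases to fields. Only standard Solr request parameters (`q`, `rows`, `facet`, `facet.field`) are used.

```python
import re
from urllib.parse import urlencode

def parse_search(text):
    """Split 'subject by facet by facet' into a subject and a list of facet names."""
    parts = re.split(r"\s+by\s+", text.strip(), flags=re.IGNORECASE)
    subject = parts[0]
    # Hypothetical convention: facet phrases become lower-case, underscore-joined field names.
    facets = ["_".join(part.lower().split()) for part in parts[1:]]
    return subject, facets

def to_solr_query_string(subject, facets):
    """Build a Solr query string that returns facet counts only (rows=0)."""
    params = [("q", subject), ("rows", "0"), ("facet", "true")]
    params += [("facet.field", field) for field in facets]
    return urlencode(params)

subject, facets = parse_search("Lady gaga sales by state by customer age")
print(subject, facets)  # Lady gaga sales ['state', 'customer_age']
print(to_solr_query_string(subject, facets))
# q=Lady+gaga+sales&rows=0&facet=true&facet.field=state&facet.field=customer_age
```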
Closing Thought
Innovation is the only sustainable competitive advantage a company can have.
Challenge the status quo!
Thank You & Questions
Joe Caserta
President, Caserta Concepts
joe@casertaconcepts.com
(914) 261-3648
@joe_Caserta

More Related Content

What's hot

Event Streaming: from Projects to Platform (Lyndon Hedderly, Confluent) 2019 ...
Event Streaming: from Projects to Platform (Lyndon Hedderly, Confluent) 2019 ...Event Streaming: from Projects to Platform (Lyndon Hedderly, Confluent) 2019 ...
Event Streaming: from Projects to Platform (Lyndon Hedderly, Confluent) 2019 ...
confluent
 
Driving Business Transformation with Real-Time Analytics Using Apache Kafka a...
Driving Business Transformation with Real-Time Analytics Using Apache Kafka a...Driving Business Transformation with Real-Time Analytics Using Apache Kafka a...
Driving Business Transformation with Real-Time Analytics Using Apache Kafka a...
confluent
 
Connecting Legacy Data Sources to the Data Lifecycle
 Connecting Legacy Data Sources to the Data Lifecycle Connecting Legacy Data Sources to the Data Lifecycle
Connecting Legacy Data Sources to the Data Lifecycle
Precisely
 
Apache Kafka in Gaming Industry (Games, Mobile, Betting, Gambling, Bookmaker,...
Apache Kafka in Gaming Industry (Games, Mobile, Betting, Gambling, Bookmaker,...Apache Kafka in Gaming Industry (Games, Mobile, Betting, Gambling, Bookmaker,...
Apache Kafka in Gaming Industry (Games, Mobile, Betting, Gambling, Bookmaker,...
Kai Wähner
 
Big Data beyond Apache Hadoop - How to integrate ALL your Data
Big Data beyond Apache Hadoop - How to integrate ALL your DataBig Data beyond Apache Hadoop - How to integrate ALL your Data
Big Data beyond Apache Hadoop - How to integrate ALL your Data
Kai Wähner
 

What's hot (20)

Everything you need to know about cloud migration(Build Stuff 2021)
Everything you need to know about cloud migration(Build Stuff 2021)Everything you need to know about cloud migration(Build Stuff 2021)
Everything you need to know about cloud migration(Build Stuff 2021)
 
Event-Streaming verstehen in unter 10 Min
Event-Streaming verstehen in unter 10 MinEvent-Streaming verstehen in unter 10 Min
Event-Streaming verstehen in unter 10 Min
 
Event Streaming: from Projects to Platform (Lyndon Hedderly, Confluent) 2019 ...
Event Streaming: from Projects to Platform (Lyndon Hedderly, Confluent) 2019 ...Event Streaming: from Projects to Platform (Lyndon Hedderly, Confluent) 2019 ...
Event Streaming: from Projects to Platform (Lyndon Hedderly, Confluent) 2019 ...
 
Driving Business Transformation with Real-Time Analytics Using Apache Kafka a...
Driving Business Transformation with Real-Time Analytics Using Apache Kafka a...Driving Business Transformation with Real-Time Analytics Using Apache Kafka a...
Driving Business Transformation with Real-Time Analytics Using Apache Kafka a...
 
LIVE DEMO: Big Data Suite
LIVE DEMO: Big Data SuiteLIVE DEMO: Big Data Suite
LIVE DEMO: Big Data Suite
 
Cloudera - IoT & Smart Cities
Cloudera - IoT & Smart CitiesCloudera - IoT & Smart Cities
Cloudera - IoT & Smart Cities
 
Application Modernization
Application ModernizationApplication Modernization
Application Modernization
 
Connecting Legacy Data Sources to the Data Lifecycle
 Connecting Legacy Data Sources to the Data Lifecycle Connecting Legacy Data Sources to the Data Lifecycle
Connecting Legacy Data Sources to the Data Lifecycle
 
Real-time Streaming Analytics: Business Value, Use Cases and Architectural Co...
Real-time Streaming Analytics: Business Value, Use Cases and Architectural Co...Real-time Streaming Analytics: Business Value, Use Cases and Architectural Co...
Real-time Streaming Analytics: Business Value, Use Cases and Architectural Co...
 
How to Apply Machine Learning with R, H20, Apache Spark MLlib or PMML to Real...
How to Apply Machine Learning with R, H20, Apache Spark MLlib or PMML to Real...How to Apply Machine Learning with R, H20, Apache Spark MLlib or PMML to Real...
How to Apply Machine Learning with R, H20, Apache Spark MLlib or PMML to Real...
 
How does a Modern Integration Platform Innovate
How does a Modern Integration Platform InnovateHow does a Modern Integration Platform Innovate
How does a Modern Integration Platform Innovate
 
Use Cases from Batch to Streaming, MapReduce to Spark, Mainframe to Cloud: To...
Use Cases from Batch to Streaming, MapReduce to Spark, Mainframe to Cloud: To...Use Cases from Batch to Streaming, MapReduce to Spark, Mainframe to Cloud: To...
Use Cases from Batch to Streaming, MapReduce to Spark, Mainframe to Cloud: To...
 
Real Time Business Platform by Ivan Novick from Pivotal
Real Time Business Platform by Ivan Novick from PivotalReal Time Business Platform by Ivan Novick from Pivotal
Real Time Business Platform by Ivan Novick from Pivotal
 
[INFOGRAPHIC] Event-driven Business: How to Handle the Flow of Event Data
[INFOGRAPHIC] Event-driven Business: How to Handle the Flow of Event Data[INFOGRAPHIC] Event-driven Business: How to Handle the Flow of Event Data
[INFOGRAPHIC] Event-driven Business: How to Handle the Flow of Event Data
 
Kafka Summit SF 2017 - Real time Streaming Platform
Kafka Summit SF 2017 - Real time Streaming Platform Kafka Summit SF 2017 - Real time Streaming Platform
Kafka Summit SF 2017 - Real time Streaming Platform
 
Pivotal Big Data Suite: A Technical Overview
Pivotal Big Data Suite: A Technical OverviewPivotal Big Data Suite: A Technical Overview
Pivotal Big Data Suite: A Technical Overview
 
Blockchain and Apache NiFi
Blockchain and Apache NiFiBlockchain and Apache NiFi
Blockchain and Apache NiFi
 
Event Mesh Presentation at Gartner AADI Mumbai
Event Mesh Presentation at Gartner AADI MumbaiEvent Mesh Presentation at Gartner AADI Mumbai
Event Mesh Presentation at Gartner AADI Mumbai
 
Apache Kafka in Gaming Industry (Games, Mobile, Betting, Gambling, Bookmaker,...
Apache Kafka in Gaming Industry (Games, Mobile, Betting, Gambling, Bookmaker,...Apache Kafka in Gaming Industry (Games, Mobile, Betting, Gambling, Bookmaker,...
Apache Kafka in Gaming Industry (Games, Mobile, Betting, Gambling, Bookmaker,...
 
Big Data beyond Apache Hadoop - How to integrate ALL your Data
Big Data beyond Apache Hadoop - How to integrate ALL your DataBig Data beyond Apache Hadoop - How to integrate ALL your Data
Big Data beyond Apache Hadoop - How to integrate ALL your Data
 

Viewers also liked

Streaming Cyber Security into Graph: Accelerating Data into DataStax Graph an...
Streaming Cyber Security into Graph: Accelerating Data into DataStax Graph an...Streaming Cyber Security into Graph: Accelerating Data into DataStax Graph an...
Streaming Cyber Security into Graph: Accelerating Data into DataStax Graph an...
Keith Kraus
 

Viewers also liked (9)

Marketwired - Social Media in the Military: Mining & Monitoring
Marketwired - Social Media in the Military: Mining & MonitoringMarketwired - Social Media in the Military: Mining & Monitoring
Marketwired - Social Media in the Military: Mining & Monitoring
 
Defense Intelligence & The Information Challenge
Defense Intelligence & The Information ChallengeDefense Intelligence & The Information Challenge
Defense Intelligence & The Information Challenge
 
Unveiling FATA a Visual Journey.
Unveiling FATA a Visual Journey.Unveiling FATA a Visual Journey.
Unveiling FATA a Visual Journey.
 
Big data presentation linked in simon zhang 20140714
Big data presentation linked in simon zhang 20140714Big data presentation linked in simon zhang 20140714
Big data presentation linked in simon zhang 20140714
 
Streaming Cyber Security into Graph: Accelerating Data into DataStax Graph an...
Streaming Cyber Security into Graph: Accelerating Data into DataStax Graph an...Streaming Cyber Security into Graph: Accelerating Data into DataStax Graph an...
Streaming Cyber Security into Graph: Accelerating Data into DataStax Graph an...
 
Big data大数据presentation1
Big data大数据presentation1Big data大数据presentation1
Big data大数据presentation1
 
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
 
Big Data for Defense and Security
Big Data for Defense and SecurityBig Data for Defense and Security
Big Data for Defense and Security
 
唯品会大数据实践 Sacc pub
唯品会大数据实践 Sacc pub唯品会大数据实践 Sacc pub
唯品会大数据实践 Sacc pub
 

Similar to Architecting for Big Data: Trends, Tips, and Deployment Options

How to Quickly and Easily Draw Value from Big Data Sources_Q3 symposia(Moa)
How to Quickly and Easily Draw Value  from Big Data Sources_Q3 symposia(Moa)How to Quickly and Easily Draw Value  from Big Data Sources_Q3 symposia(Moa)
How to Quickly and Easily Draw Value from Big Data Sources_Q3 symposia(Moa)
Moacyr Passador
 

Similar to Architecting for Big Data: Trends, Tips, and Deployment Options (20)

Big Data Analytics with Microsoft
Big Data Analytics with MicrosoftBig Data Analytics with Microsoft
Big Data Analytics with Microsoft
 
Incorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic ArchitectureIncorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic Architecture
 
Big Data: Setting Up the Big Data Lake
Big Data: Setting Up the Big Data LakeBig Data: Setting Up the Big Data Lake
Big Data: Setting Up the Big Data Lake
 
Balancing Data Governance and Innovation
Balancing Data Governance and InnovationBalancing Data Governance and Innovation
Balancing Data Governance and Innovation
 
Big Data, NoSQL, NewSQL & The Future of Data Management
Big Data, NoSQL, NewSQL & The Future of Data ManagementBig Data, NoSQL, NewSQL & The Future of Data Management
Big Data, NoSQL, NewSQL & The Future of Data Management
 
The Right Data Warehouse: Automation Now, Business Value Thereafter
The Right Data Warehouse: Automation Now, Business Value ThereafterThe Right Data Warehouse: Automation Now, Business Value Thereafter
The Right Data Warehouse: Automation Now, Business Value Thereafter
 
How to Quickly and Easily Draw Value from Big Data Sources_Q3 symposia(Moa)
How to Quickly and Easily Draw Value  from Big Data Sources_Q3 symposia(Moa)How to Quickly and Easily Draw Value  from Big Data Sources_Q3 symposia(Moa)
How to Quickly and Easily Draw Value from Big Data Sources_Q3 symposia(Moa)
 
What Data Do You Have and Where is It?
What Data Do You Have and Where is It? What Data Do You Have and Where is It?
What Data Do You Have and Where is It?
 
Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics
 
BAR360 open data platform presentation at DAMA, Sydney
BAR360 open data platform presentation at DAMA, SydneyBAR360 open data platform presentation at DAMA, Sydney
BAR360 open data platform presentation at DAMA, Sydney
 
Setting Up the Data Lake
Setting Up the Data LakeSetting Up the Data Lake
Setting Up the Data Lake
 
The Emerging Role of the Data Lake
The Emerging Role of the Data LakeThe Emerging Role of the Data Lake
The Emerging Role of the Data Lake
 
Big Data's Impact on the Enterprise
Big Data's Impact on the EnterpriseBig Data's Impact on the Enterprise
Big Data's Impact on the Enterprise
 
You're the New CDO, Now What?
You're the New CDO, Now What?You're the New CDO, Now What?
You're the New CDO, Now What?
 
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing Keynote
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing KeynoteArchitecting Data For The Modern Enterprise - Data Summit 2017, Closing Keynote
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing Keynote
 
Balancing Data Governance and Innovation
Balancing Data Governance and InnovationBalancing Data Governance and Innovation
Balancing Data Governance and Innovation
 
The Data Lake - Balancing Data Governance and Innovation
The Data Lake - Balancing Data Governance and Innovation The Data Lake - Balancing Data Governance and Innovation
The Data Lake - Balancing Data Governance and Innovation
 
Intro to Data Science on Hadoop
Intro to Data Science on HadoopIntro to Data Science on Hadoop
Intro to Data Science on Hadoop
 
Business in the Driver’s Seat – An Improved Model for Integration
Business in the Driver’s Seat – An Improved Model for IntegrationBusiness in the Driver’s Seat – An Improved Model for Integration
Business in the Driver’s Seat – An Improved Model for Integration
 
Presentation on Big Data
Presentation on Big DataPresentation on Big Data
Presentation on Big Data
 

More from Caserta

Not Your Father's Database by Databricks
Not Your Father's Database by DatabricksNot Your Father's Database by Databricks
Not Your Father's Database by Databricks
Caserta
 
Mastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkMastering Customer Data on Apache Spark
Mastering Customer Data on Apache Spark
Caserta
 
Moving Past Infrastructure Limitations
Moving Past Infrastructure LimitationsMoving Past Infrastructure Limitations
Moving Past Infrastructure Limitations
Caserta
 

More from Caserta (18)

Using Machine Learning & Spark to Power Data-Driven Marketing
Using Machine Learning & Spark to Power Data-Driven MarketingUsing Machine Learning & Spark to Power Data-Driven Marketing
Using Machine Learning & Spark to Power Data-Driven Marketing
 
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
 
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
 
General Data Protection Regulation - BDW Meetup, October 11th, 2017
General Data Protection Regulation - BDW Meetup, October 11th, 2017General Data Protection Regulation - BDW Meetup, October 11th, 2017
General Data Protection Regulation - BDW Meetup, October 11th, 2017
 
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
 
Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)
 
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
 
The Rise of the CDO in Today's Enterprise
The Rise of the CDO in Today's EnterpriseThe Rise of the CDO in Today's Enterprise
The Rise of the CDO in Today's Enterprise
 
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
 
Making Big Data Easy for Everyone
Making Big Data Easy for EveryoneMaking Big Data Easy for Everyone
Making Big Data Easy for Everyone
 
Benefits of the Azure Cloud
Benefits of the Azure CloudBenefits of the Azure Cloud
Benefits of the Azure Cloud
 
Big Data Analytics on the Cloud
Big Data Analytics on the CloudBig Data Analytics on the Cloud
Big Data Analytics on the Cloud
 
Not Your Father's Database by Databricks
Not Your Father's Database by DatabricksNot Your Father's Database by Databricks
Not Your Father's Database by Databricks
 
Mastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkMastering Customer Data on Apache Spark
Mastering Customer Data on Apache Spark
 
Moving Past Infrastructure Limitations
Moving Past Infrastructure LimitationsMoving Past Infrastructure Limitations
Moving Past Infrastructure Limitations
 
Introducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupIntroducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing Meetup
 
Real Time Big Data Processing on AWS
Real Time Big Data Processing on AWSReal Time Big Data Processing on AWS
Real Time Big Data Processing on AWS
 
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!
 

Recently uploaded

Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 

Recently uploaded (20)

Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx

Architecting for Big Data: Trends, Tips, and Deployment Options

  • 1. @joe_Caserta@GreatLakesBI Architecting for Big Data: Trends, Tips, and Deployment Options Joe Caserta President Caserta Concepts
  • 2. Joe Caserta Timeline. Top 20 Big Data Consulting (CIO Review). Launched Big Data practice. Co-author, with Ralph Kimball, of The Data Warehouse ETL Toolkit (Wiley). Dedicated to Data Warehousing and Business Intelligence since 1996. Began consulting in database programming and data modeling. 25+ years hands-on experience building database solutions. Founded Caserta Concepts in NYC. Web log analytics solution published in Intelligent Enterprise. Formalized Alliances / Partnerships – System Integrators. Partnered with Big Data vendors Cloudera, Hortonworks, IBM, Cisco, Datameer, Basho, more… Launched Training practice, teaching data concepts world-wide. Laser focus on extending Data Warehouses with Big Data solutions. Launched Big Data Warehousing Meetup in NYC (~1,500 members). Established best practices for big data ecosystem implementation – Healthcare, Finance, Insurance. Top 20 Most Powerful Big Data consulting firms. Dedicated to Data Governance Techniques on Big Data (Innovation). Years marked on the timeline: 1986, 1996, 2001, 2004, 2009, 2010, 2012, 2013, 2014.
  • 3. About Caserta Concepts • Technology services company with expertise in data analysis: Big Data Solutions; Data Warehousing; Business Intelligence; Data Science & Analytics; Cloud Computing; Data Interaction & Visualization • Core focus in the following industries: eCommerce / Retail / Digital Marketing; Financial Services / Insurance; Healthcare / Higher Education • Established in 2001: increased growth year-over-year; industry-recognized workforce; Strategy, Implementation, Analytics; Writing, Education, Mentoring
  • 4. The Evolution of Enterprise Data (diagram). Source systems: Sales, Marketing, Finance, Others… Traditional BI: ETL into the Enterprise Data Warehouse, serving Ad-Hoc/Canned Reporting. Big Data Analytics: ETL into the Big Data Lake, a Horizontally Scalable Environment Optimized for Analytics, built on the Hadoop Distributed File System (HDFS) across nodes N1 to N5, with Spark, MapReduce, Pig/Hive, and NoSQL Databases, serving Data Exploration and Data Science.
  • 5. Tools and Technologies; Best Practices. Data Warehousing / ETL / Data Integration; BI / Visualization / Analytics; Big Data Analytics
  • 7. The ones you need to know… • Hadoop Distribution: Cloudera, Hortonworks, MapR, Pivotal-HD, IBM • Tools: Hive: Map data to structures and use SQL-like queries; Pig: Data transformation language for big data; Sqoop: Extracts data from external sources and loads it into Hadoop; Spark: General-purpose cluster computing framework; Storm: Real-time ETL • NoSQL: Document: MongoDB, CouchDB; Graph: Neo4j, Titan; Key Value: Riak, Redis; Columnar: Cassandra, HBase • Search: Lucene, Solr, Elasticsearch • Languages: Python, SciPy, Java, R, Scala
  • 8. Why are we Changing? Advertising: Real-time interactive queries on massive audience datasets in the cloud. Global analytics on the cloud: Integrate SAP implementations from across the globe into a single cloud solution. Recommendation Engines: “You chose… you might also like…” Real-Time Aggregation, Monitoring & Alerting on events at extremely high message rates… ~1M msgs/sec. Big Data Warehouse: Extending the EDW with Hadoop; governing data from the “lake” to the EDW. Clients shown: Personal/Commercial Banking; Investment/Trading Bank; World-wide beauty company; Cable Television Audience-based Advertising
  • 9. Components of Data Governance • Organization: This is the ‘people’ part. Establishing Enterprise Data Council, Data Stewards, etc. • Metadata: Definitions, lineage (where does this data come from), business definitions, technical metadata • Privacy/Security: Identify and control sensitive data, regulatory compliance • Data Quality and Monitoring: Data must be complete and correct. Measure, improve, certify • Business Process Integration: Policies around data frequency, source availability, etc. • Master Data Management: Ensure consistent business-critical data, i.e. Members, Providers, Agents, etc. • Information Lifecycle Management (ILM): Data retention, purge schedule, storage/archiving
  • 10. Components of Data Governance, For Big Data • Organization: This is the ‘people’ part. Establishing Enterprise Data Council, Data Stewards, etc. • Metadata: Definitions, lineage (where does this data come from), business definitions, technical metadata • Privacy/Security: Identify and control sensitive data, regulatory compliance • Data Quality and Monitoring: Data must be complete and correct. Measure, improve, certify • Business Process Integration: Policies around data frequency, source availability, etc. • Master Data Management: Ensure consistent business-critical data, i.e. Members, Providers, Agents, etc. • Information Lifecycle Management (ILM): Data retention, purge schedule, storage/archiving. For Big Data: • Add Big Data to overall framework and assign responsibility • Add data scientists to the Stewardship program • Assign stewards to new data sets (Twitter, call center logs, etc.) • Graph databases are more flexible than relational • Lower latency service required • Distributed data quality and matching algorithms • Data Quality and Monitoring (probably home grown, Drools?) • Quality checks not only SQL: machine learning, Pig and MapReduce • Acting on large dataset quality checks may require distribution • Larger scale • New datatypes • Integrate with Hive Metastore, HCatalog, home-grown tables • Secure and mask multiple data types (not just tabular) • Deletes are more uncommon (unless there is a regulatory requirement) • Take advantage of compression and archiving (like AWS Glacier) • Data detection and masking on unstructured data upon ingest • Near-zero latency, DevOps, core component of business operations
  • 11. The Big Data Pyramid • Data has different governance demands at each tier. • Only the top tier of the pyramid is fully governed and ready for Enterprise BI. Tiers, top to bottom: Big Data Warehouse; Data Science Workspace; Data Lake – Integrated Sandbox; Landing Area – Source Data in “Full Fidelity”. Metadata → Catalog; ILM → who has access, how long do we “manage it”. Raw machine data collection, collect everything. Data is ready to be turned into information: organized, well defined, complete. Agile business insight through data-munging, machine learning, blending with external data, development of to-be BDW facts. Metadata → Catalog; ILM → who has access, how long do we “manage it”; Data Quality and Monitoring → monitoring of completeness of data. Metadata → Catalog; ILM → who has access, how long do we “manage it”; Data Quality and Monitoring → monitoring of completeness of data. Fully Data Governed (trusted). User community arbitrary queries and reporting.
  • 12. BI is About to be Disrupted! • The Big Data movement breaks the relational database barrier and enables analysis on massive amounts of structured and unstructured data. • NoSQL puts the value of SQL-based relational databases into question. This disruption is forging a new road for the progress and advancement of scalable data analytics. • The value of legacy Business Intelligence comes into question. • Rather than forcing data users to become technologists, BI must make data analysis available to the masses.
  • 13. Who Does BI Today? • The role of the ‘Business Analyst’, the primary user of the BI tool, is being replaced by two types of data users: 1. Highly technical Data Scientists 2. Non-technical Business Persons • New analytics (BI) platforms must be created to accommodate the new users. We see these very discrete users using very different technologies. • Perhaps legacy BI tools will not go away, but the market is absolutely about to be disrupted.
  • 14. Empower the Data Scientist • Data Scientists have deep technical knowledge • They enjoy writing code and mining data • The best way to serve a data scientist is to provide access to raw data and then get out of their way.
  • 15. What does a Data Scientist Do, Anyway? • Writes really cool and sophisticated algorithms that impact the way the business runs. NOT. • Much of the time of a Data Scientist is spent: searching for the data they need; making sense of the data; figuring out why the data looks the way it does and assessing its validity; cleaning up all the garbage within the data so it represents true business; combining events with reference data to give it context; correlating event data with other events. • Finally, they write algorithms to perform mining, clustering and predictive analytics – the sexy stuff.
  • 16. Empower the Business Person • Business users don’t have, and don’t want to have, technical wherewithal to interact with ‘data’. • “We have a business to run! Programming should be done by people in rooms with no windows.” • “I need information at my fingertips and I should not need a PhD in SQL to get it.” • “It’s a myth that BI tools will solve my problems, I still need IT to get new reports. This is unacceptable.” • Every business professional on the planet knows how to search for needed information via a Google search bar. • Business people want to be able to ‘Google’ their corporate data for the information they need.
  • 17. The Future of BI (if the Business gets its way)…
  • 19. Why do we make it so difficult? • During normal BI implementations, much time is spent/wasted on selecting the best way to graphically represent a set of metrics. • We can embed algorithms that are statistically proven to best represent information depending on the type of question being asked. • The user should be able to preview and change from the default infographic as easily as clicking ‘next’ on a Yahoo! Slideshow.
  • 20. Imagine the Possibilities… (mockup of a search-driven BI screen). Search bar: “Lady gaga sales by state by customer age”, with a Go! button. Facets: Region (Northeast, Midwest, South, West); Product (Records, Perfume, Clothes, Performances); Dates (2009 to 2013). Action: DOWNLOAD TO EXCEL. Signed-in user: joe@casertaconcepts.com
  • 21. Building the Future of BI (Hint: it’s Big Data) • Angular: Modern web application framework. Developed and supported by Google. Bootstrap used for mobile. • D3.js: JavaScript library for data visualization. Exposes the full capability of CSS3, HTML5 and SVG. Is extremely fast. Supports large datasets and dynamic behaviors for interaction. • Python: The “glue” that brings the other components together. The ‘engine’ that transforms search strings into queries. Integrated with the Customer Metadata repository. • Solr: Full-text and faceted-search engine and database. This is the backbone of the application. • Cassandra: Customer Metadata repository. Stores all business rules (default facets, etc.) and user preferences (default graph types, etc.). Cassandra may not be the ultimate selection. • AWS: Amazon Web Services. Product is a zero-footprint cloud-based solution. User experience is the same as Googling info.
  • 22. Closing Thought: Innovation is the only sustainable competitive advantage a company can have. Challenge the status quo!
  • 23. Thank You & Questions. Joe Caserta, President, Caserta Concepts. joe@casertaconcepts.com (914) 261-3648 @joe_Caserta
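
Slide 21 describes a Python ‘engine’ that transforms search strings into queries against Solr, and slide 20 shows the kind of search string a business user might type. The deck does not show how that engine works, so the following is only a minimal sketch of the idea under stated assumptions: the "subject measure by dimension by dimension" grammar, the vocabulary tables, and the field names (`state_s`, `customer_age_i`, `region_s`) are hypothetical and would, in the deck's design, come from the metadata repository. The only real API surface used is Solr's standard `q`, `rows`, `facet`, and `facet.field` request parameters and Python's `urllib.parse.urlencode`.

```python
from urllib.parse import urlencode

# Hypothetical vocabulary; the deck stores rules like these in a metadata repository.
MEASURES = {"sales", "revenue", "units"}
DIMENSION_FIELDS = {
    "state": "state_s",
    "customer age": "customer_age_i",
    "region": "region_s",
}


def parse_search(text):
    """Split 'subject measure by dim by dim' into subject, measure, dimensions."""
    head, *dims = [part.strip() for part in text.lower().split(" by ")]
    words = head.split()
    # The first known measure word is the metric; the remaining words are the subject.
    measure = next((w for w in words if w in MEASURES), None)
    subject = " ".join(w for w in words if w != measure)
    return {"subject": subject, "measure": measure, "dimensions": dims}


def to_solr_params(parsed):
    """Map the parsed search onto standard Solr faceting parameters."""
    facet_fields = [
        DIMENSION_FIELDS[d] for d in parsed["dimensions"] if d in DIMENSION_FIELDS
    ]
    return {
        "q": f'"{parsed["subject"]}"',  # phrase query on the subject
        "rows": 0,                      # only facet counts are wanted, not documents
        "facet": "true",
        "facet.field": facet_fields,    # one facet.field parameter per dimension
    }


def build_query_string(text):
    """Return a URL query string for a Solr select request."""
    return urlencode(to_solr_params(parse_search(text)), doseq=True)


if __name__ == "__main__":
    print(build_query_string("Lady gaga sales by state by customer age"))
```

A real engine would also need to aggregate the measure (for example with Solr's stats or JSON facet features) and handle phrasing outside this toy grammar; the sketch only shows how "by state by customer age" could become facet fields.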