SlideShare una empresa de Scribd logo
1 de 78
NoSQL for the DBA
   Lynn Langit



                    April 2013 – Big Data Tech Con
Data Expertise / Lynn Langit
• Industry awards
   – Microsoft – MVP for SQL Server
   – Google – GDE for Cloud Platform
   – 10Gen – Master for MongoDB
• Practicing Architect
• Technical author / trainer
   –   Pluralsight – Google Cloud Series
   –   DevelopMentor – SQL Server Series
   –   2 books on SQL Server BI
   –   Cloudera trainer (certified)
• Former MSFT FTE
   – 4 years
but first…
Business Intelligence
     to BigData
What is the relationship?


   Business
                NoSQL       ????
 Intelligence
“The Past” BI = Effective Reports

Data
optimized for
Static
READING
BI = Optimized RDBMS
SQL queries & Data Stored on disk
BI = OLAP Cubes storage
BI = OLAP Cubes clients
BI = Transactional Data




  Collecting    • What happened?
Transactional   • Why did that happen?
    data        • Decision Support Systems
So Why Change?
Enter
                     Big Data
Q: What is it?

A: Your Data, plus more data….
BigData Pipeline - STEP 1 – Acquire

Acquire
          Process
                    Store
                            Query & Mine
                                       Visualize
Big Data – an example from weather




          13
Big Data – an example from weather
• Source Data
   • National weather data
   • Satellite data
   • Airplanes with sensors
   • Sensors on boats
   • Sensors in the ocean
   • Sensors on the ground
   • Historical Data
   • Social Media
• Results
   • More accurate predictions
        • Tsunami
        • Tornado
Big Data – an example from health care
• Medical records
    • Regular
    • Emergency
    • Genetic data – 23andMe
• Food data
    • SparkPeople
• Purchasing
    • Grocery card
    • credit card
• Search – Google
• Social media
    • Twitter
    • Facebook
• Exercise
    • Nike Fuel Band
    • Kinect
    • Location - phone
BigData = ‘Next State’ Questions


              • What could happen?
 Collecting   • Why didn’t this happen?
              • When will the next new thing
 Behavioral     happen?
   data       • What will the next new thing be?
              • What happens?
What is the reality of personalized medicine?
2500



2000                                               Key Monitoring



1500


                                                   Sensor Readings
1000



500

                                                   Other Behavioral
   0                                               data

       12:00   12:30   1:00   1:30   2:00   2:30
BigData and Verticals
        •   Retail
        •   Manufacturing
        •   Health Care
        •   Banking
        •   Education
Collecting BigData
• Sensors everywhere
• Structured, Semi-structured, Unstructured vs. Data
  Standards
• M2M
• Public Datasets
   – Freebase
   – Azure DataMarket
   – Hillary Mason’s list




                            19
DEMO – Hilary Mason’s Datasets
• Who is Hilary Mason and why do you care
  about her datasets?
• How do you get her datasets?
• What do you do with her datasets?
Collecting Data – a note about Faces
• Facial recognition
• Voice recognition
• Gesture capture and analysis




               21
Petabytes
    of
 Big Data
Big Data at Apple
Big Data in India




Update: “The total number of AADHAARs issued as of 24-Mar-
2013 is over 304 million. This is more than 25% of the
population of India.”
BigData Pipeline – STEP 5 - Visualize

Acquire
          Process
                    Store
                            Query & Mine
                                       Visualize
DEMO - Visualizing Big Data: Wind Map




           26
Demo - Visualizing Big Data – D3




         27
BigData Pipeline – STEP 2 - Process

Acquire
          Process
                    Store
                            Query & Mine
                                       Visualize
How do you clean up the mess?
•   Data Hygiene
•   Data Scrubbing
•   Data Sprawl
•   The true cost of data
•   …and what about data integrity?
•   …and security?
•   …should your data be in the cloud?
Is NoSQL just Hadoop?

HUGE Hype factor since 2011




Apache Hadoop
• a software framework that supports data-intensive distributed
  applications
• under a free license enables applications to work with thousands of
  nodes and petabytes of data
• was inspired by Google's MapReduce and Google File System (GFS)
  papers
What is the relationship?



NoSQL     Hadoop     ???    BigData
Hadoop in the Enterprise
How you ‘get’ Hadoop
Open source

• roll your own

Commercial distribution

•   Cloudera
•   MapR
•   Hortonworks
•   More…

Rent it via the cloud

• AWS
Demo – Get and Use Cloudera CDH4 VM
Working with Hadoop
About Hadoop MapReduce




     Image from - https://developers.google.com/appengine/docs/python/images/mapreduce_mapshuffle.png
Demo - HDInsight – MapReduce w/Java
Demo - HDInsight – MapReduce w/ Hive
Example Comparison: RDBMS vs. Hadoop

            Traditional RDBMS         Hadoop / MapReduce

Data Size   Gigabytes (Terabytes)     Petabytes and greater

Access      Interactive and Batch     Batch – NOT Interactive

Updates     Read / Write many times   Write once, Read many times

Structure   Static Schema             Dynamic Schema

Integrity   High (ACID)               Low

Scaling     Nonlinear                 Linear

Query       Can be near immediate     Has latency (due to batch
Response                              processing)
Time
BigData Pipeline STEP 3 – Store

Acquire
          Process
                    Store
                            Query & Mine
                                       Visualize
“Small” BigData vs. “Big” BigData

              Hadoop


              NoSQL




              RDBMS
The reality…two pivots



Storage Methods     Storage Locations
• SQL (RDBMS)       • On premises
• NoSQL or Hadoop   • Cloud-hosted
Cloud-hosted NoSQL up to 50x CHEAPER
So many NoSQL options
• More than just the Elephant in the room
• Over 120+ types of NoSQL databases
Flavors of NoSQL
Key/Value   Key/value    Wide-Column   Document   Graph
Volatile    Persistent
Key / Value Database
• Just keys and values
  – No schema
• Persistent or Volatile
• Examples
  – AWS Dynamo DB
  – Riak
DEMO - AWS DynamoDB
• Key/Value store on the AWS cloud
NoSQL BLOB Storage Buckets in the Cloud

•   Amazon – S3 or Glacier
•   Google – Cloud Storage
•   Microsoft Azure BLOBS
•   Others
     – Dropbox
     – Box
     – More…
DEMO - Battle of the Buckets
• Google Cloud Storage VS.
• Windows Azure BLOBS VS.
• AWS S3 / Glacier
Column Database
• Wide, sparse column sets
  • Schema-light
• Examples:
  –   Cassandra
  –   HBase w/Hadoop
  –   BigTable
  –   GAE HR DS
Types of Column Databases
• Column-families
  – Non-relational
  – Sparse
  – Examples:
     • HBase
     • Cassandra
     • xVelocity (SQL 2012 Tabular)
• Column-stores
  – Relational
  – Dense
  – Example:
     • SQL Server 2012
         – Columnstore index
DEMO – SQL Server ‘NoSQL’
• SQL Server 2012 Columnstore Index
• SQL Server 2012 Tabular Model (SSAS)
Document Database (Mongo DB)
• document-oriented (collection of
  JSON documents) w/semi structured
  data
   – Encodings include BSON, JSON, XML…
• binary forms
   – PDF, Microsoft Office documents --
     Word, Excel…)
• Examples:
   – MongoDB
   – Couchbase
Demo - Mongo DB
Graph Databases
• a lot of many-to-many relationships
• recursive self-joins
• when your primary objective is quickly
  finding connections, patterns and
  relationships between the objects
  within lots of data
• Examples:
   – Neo4J
   – Google Freebase
DEMO – Neo4J
CAP Theorem applied = ‘how big is it?’

• CA = RDBMS
  – Highly-available consistency
  – Ex. SQL Server
• CP = NoSQL
  – Enforced consistency
  – Ex. Hadoop
• AP = NoSQL
  – Eventual consistency
  – Ex. MongoDB
“Small” BigData vs. “Big” BigData

                Hadoop



              Key/Value or
                Column


              Document or
                 Graph




                RDBMS
Cloud-hosted RDBMS

• AWS RDS – SQL Server,
  mySQL, Oracle
  – Medium cost
  – Solid feature set, i.e.
    backup, snapshot
  – Use existing tooling
• Google – mySQL
  – Lowest cost
  – Most limited RDBMS
    functionality
• Microsoft – SQLAzure
  – Highest cost
DEMO - AWS RDS
• SQL Server, MySQL or Oracle
• Essential to understand pricing models
Image - http://blog.outsourcing-partners.com/wp-content/uploads/2012/10/performance.png
NoSQL Applied
Columnstore   Log Files
HBase
                          Key/Value




                                       Product Catalogs
                          DynamoDB
                                                          Document




                                                                     Social Games
                                                          MongoDB
                                                                                    Graph




                                                                                            Social aggregators
                                                                                    Neo4j
                                                                                                                 RDBMS




                                                                                                                              Line-of-Business
                                                                                                                 SQL Server
Cloud Offerings– RDBMS AND NoSQL
                   AWS                 Google               Microsoft
RDBMS              RDS – all major     mySQL                SQL Azure
NoSQL buckets      S3 or Glacier       Cloud Storage        Azure Blobs
NoSQL Key-Value    DynamoDB            H/R Data on GAE      Azure Tables
Streaming ML or    Custom EC2          Prospective Search   StreamInsight
(Mahout)                               &
                                       Prediction API
NoSQL Document or MongoDB on EC2       Freebase             MongoDB on
Graph                                                       Windows Azure
NoSQL – Column     Elastic MapReduce   none                 HDInsight
Hadoop (HBase)     using S3 & EC2
Dremel/Warehousi   RedShift            BigQuery             none
ng
BigData Pipeline STEP 4 – Query

Acquire
          Process
                    Store
                            Query & Mine
                                       Visualize
Always MapReduce?
Data Scientists and Languages
Karmasphere Studio for AWS
Can Excel help?
•   Connector to Hadoop
•   Data Explorer
•   Data Quality Services
•   Master Data Services
•   Integration with Azure Data Market
•   Visualize with PowerView
•   Data Mining w/Predixion
Demo - Hadoop Connector to Excel
Google BigQuery w/Excel
• Hadoop-like (Dremel) based service
• For massive amounts of data
• SQL-like query language
DEMO - Google BigQuery
• Hadoop-like (Dremel) based service
• For massive amounts of data
• SQL-like query language
Dremel Realized => Impala
• Interactive Hadoop?
Other types of cloud data services


Hosting public datasets               Cleaning / matching (your)
• Pay to read                         data
• Earn revenue by offering for read   • ETL – Microsoft Data
                                        Explorer, Google Refine
                                      • Data Quality – Windows Azure
                                        Data
                                        Market, InfoChimps, DataMarket
                                        .com
NoSQL To-Do List
Understand CAP & types of NoSQL databases
 • Use NoSQL when business needs designate
 • Use the right type of NoSQL for your business problem

Try out NoSQL on the cloud
 • Quick and cheap for behavioral data
 • Mashup cloud datasets
 • Good for specialized use cases, i.e. dev, test , training environments

Learn noSQL access technologies
 • New query languages, i.e. MapReduce, R, Infer.NET
 • New query tools (vendor-specific) – Google Refine, Amazon
   Karmasphere, Microsoft Excel connectors, etc…
The Changing Data Landscape

                               Other
                              Services
RDBMS
          NoSQL
• recipes)




    www.TeachingKidsProgramming.org
      •   Free Courseware (
      •   Do a Recipe  Teach a Kid (Ages 10 ++)
      •   Java or Microsoft SmallBasic  TKP site
      •   C# via Pluralsight
Toward Data Craftsmanship…

                Follow me @LynnLangit


                     RSS my blog
                  www.LynnLangit.com

         Hire me
         • To help build your BI/Big Data solution
         • To teach your team next gen BI
         • To learn more about using NoSQL solutions

Más contenido relacionado

La actualidad más candente

The Hidden Value of Hadoop Migration
The Hidden Value of Hadoop MigrationThe Hidden Value of Hadoop Migration
The Hidden Value of Hadoop MigrationDatabricks
 
Owning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseOwning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseData Con LA
 
Spark - Migration Story
Spark - Migration Story Spark - Migration Story
Spark - Migration Story Roman Chukh
 
Easy Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Mac...
Easy Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Mac...Easy Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Mac...
Easy Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Mac...Amazon Web Services
 
Big Data Computing Architecture
Big Data Computing ArchitectureBig Data Computing Architecture
Big Data Computing ArchitectureGang Tao
 
Getting Started with Real-time Analytics
Getting Started with Real-time AnalyticsGetting Started with Real-time Analytics
Getting Started with Real-time AnalyticsAmazon Web Services
 
Big Data Adavnced Analytics on Microsoft Azure
Big Data Adavnced Analytics on Microsoft AzureBig Data Adavnced Analytics on Microsoft Azure
Big Data Adavnced Analytics on Microsoft AzureMark Tabladillo
 
AWS Customer Presentation - VMIX AWS Experience
AWS Customer Presentation - VMIX AWS ExperienceAWS Customer Presentation - VMIX AWS Experience
AWS Customer Presentation - VMIX AWS ExperienceAmazon Web Services
 
[DataCon.TW 2017] Data Lake: centralize in on-prem vs. decentralize on cloud
[DataCon.TW 2017] Data Lake: centralize in on-prem vs. decentralize on cloud[DataCon.TW 2017] Data Lake: centralize in on-prem vs. decentralize on cloud
[DataCon.TW 2017] Data Lake: centralize in on-prem vs. decentralize on cloudJeff Hung
 
Data Engineer's Lunch #55: Get Started in Data Engineering
Data Engineer's Lunch #55: Get Started in Data EngineeringData Engineer's Lunch #55: Get Started in Data Engineering
Data Engineer's Lunch #55: Get Started in Data EngineeringAnant Corporation
 
Using Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch dataUsing Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch dataDataWorks Summit/Hadoop Summit
 
TechEvent Databricks on Azure
TechEvent Databricks on AzureTechEvent Databricks on Azure
TechEvent Databricks on AzureTrivadis
 
Enterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
Enterprise Data World 2018 - Building Cloud Self-Service Analytical SolutionEnterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
Enterprise Data World 2018 - Building Cloud Self-Service Analytical SolutionDmitry Anoshin
 
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...Mark Rittman
 
Cloud Computing in the Cloud (Hadoop.tw Meetup @ 2015/11/23)
Cloud Computing in the Cloud (Hadoop.tw Meetup @ 2015/11/23)Cloud Computing in the Cloud (Hadoop.tw Meetup @ 2015/11/23)
Cloud Computing in the Cloud (Hadoop.tw Meetup @ 2015/11/23)Jeff Hung
 
Suburface 2021 IBM Cloud Data Lake
Suburface 2021 IBM Cloud Data LakeSuburface 2021 IBM Cloud Data Lake
Suburface 2021 IBM Cloud Data LakeTorsten Steinbach
 
Azure Big Data Story
Azure Big Data StoryAzure Big Data Story
Azure Big Data StoryLynn Langit
 
Azure Hd insigth news
Azure Hd insigth newsAzure Hd insigth news
Azure Hd insigth newsnnakasone
 

La actualidad más candente (19)

The Hidden Value of Hadoop Migration
The Hidden Value of Hadoop MigrationThe Hidden Value of Hadoop Migration
The Hidden Value of Hadoop Migration
 
Owning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseOwning Your Own (Data) Lake House
Owning Your Own (Data) Lake House
 
Spark - Migration Story
Spark - Migration Story Spark - Migration Story
Spark - Migration Story
 
Easy Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Mac...
Easy Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Mac...Easy Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Mac...
Easy Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Mac...
 
Big Data Computing Architecture
Big Data Computing ArchitectureBig Data Computing Architecture
Big Data Computing Architecture
 
Getting Started with Real-time Analytics
Getting Started with Real-time AnalyticsGetting Started with Real-time Analytics
Getting Started with Real-time Analytics
 
Big Data Adavnced Analytics on Microsoft Azure
Big Data Adavnced Analytics on Microsoft AzureBig Data Adavnced Analytics on Microsoft Azure
Big Data Adavnced Analytics on Microsoft Azure
 
AWS Customer Presentation - VMIX AWS Experience
AWS Customer Presentation - VMIX AWS ExperienceAWS Customer Presentation - VMIX AWS Experience
AWS Customer Presentation - VMIX AWS Experience
 
[DataCon.TW 2017] Data Lake: centralize in on-prem vs. decentralize on cloud
[DataCon.TW 2017] Data Lake: centralize in on-prem vs. decentralize on cloud[DataCon.TW 2017] Data Lake: centralize in on-prem vs. decentralize on cloud
[DataCon.TW 2017] Data Lake: centralize in on-prem vs. decentralize on cloud
 
Data Engineer's Lunch #55: Get Started in Data Engineering
Data Engineer's Lunch #55: Get Started in Data EngineeringData Engineer's Lunch #55: Get Started in Data Engineering
Data Engineer's Lunch #55: Get Started in Data Engineering
 
Using Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch dataUsing Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch data
 
TechEvent Databricks on Azure
TechEvent Databricks on AzureTechEvent Databricks on Azure
TechEvent Databricks on Azure
 
Enterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
Enterprise Data World 2018 - Building Cloud Self-Service Analytical SolutionEnterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
Enterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
 
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
 
Cloud Computing in the Cloud (Hadoop.tw Meetup @ 2015/11/23)
Cloud Computing in the Cloud (Hadoop.tw Meetup @ 2015/11/23)Cloud Computing in the Cloud (Hadoop.tw Meetup @ 2015/11/23)
Cloud Computing in the Cloud (Hadoop.tw Meetup @ 2015/11/23)
 
Suburface 2021 IBM Cloud Data Lake
Suburface 2021 IBM Cloud Data LakeSuburface 2021 IBM Cloud Data Lake
Suburface 2021 IBM Cloud Data Lake
 
Big data in Azure
Big data in AzureBig data in Azure
Big data in Azure
 
Azure Big Data Story
Azure Big Data StoryAzure Big Data Story
Azure Big Data Story
 
Azure Hd insigth news
Azure Hd insigth newsAzure Hd insigth news
Azure Hd insigth news
 

Similar a NoSQL for the SQL Server Pro

Bd cloud v3
Bd cloud v3Bd cloud v3
Bd cloud v3scm24
 
Lecture1 BIG DATA and Types of data in details
Lecture1 BIG DATA and Types of data in detailsLecture1 BIG DATA and Types of data in details
Lecture1 BIG DATA and Types of data in detailsAbhishekKumarAgrahar2
 
IARE_BDBA_ PPT_0.pptx
IARE_BDBA_ PPT_0.pptxIARE_BDBA_ PPT_0.pptx
IARE_BDBA_ PPT_0.pptxAIMLSEMINARS
 
Big data hadoop-no sql and graph db-final
Big data hadoop-no sql and graph db-finalBig data hadoop-no sql and graph db-final
Big data hadoop-no sql and graph db-finalramazan fırın
 
Big Data with Not Only SQL
Big Data with Not Only SQLBig Data with Not Only SQL
Big Data with Not Only SQLPhilippe Julio
 
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...Rittman Analytics
 
Big data4businessusers
Big data4businessusersBig data4businessusers
Big data4businessusersBob Hardaway
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 
NoSQLDatabases
NoSQLDatabasesNoSQLDatabases
NoSQLDatabasesAdi Challa
 
Webinar: The Future of SQL
Webinar: The Future of SQLWebinar: The Future of SQL
Webinar: The Future of SQLCrate.io
 
BigDataCloud meetup Feb 16th - Microsoft's Saptak Sen's presentation
BigDataCloud meetup Feb 16th - Microsoft's Saptak Sen's presentationBigDataCloud meetup Feb 16th - Microsoft's Saptak Sen's presentation
BigDataCloud meetup Feb 16th - Microsoft's Saptak Sen's presentationBigDataCloud
 
Introduction to Cloud computing and Big Data-Hadoop
Introduction to Cloud computing and  Big Data-HadoopIntroduction to Cloud computing and  Big Data-Hadoop
Introduction to Cloud computing and Big Data-HadoopNagarjuna D.N
 
L’architettura di Classe Enterprise di Nuova Generazione
L’architettura di Classe Enterprise di Nuova GenerazioneL’architettura di Classe Enterprise di Nuova Generazione
L’architettura di Classe Enterprise di Nuova GenerazioneMongoDB
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataRoi Blanco
 
Nosql Now 2012: MongoDB Use Cases
Nosql Now 2012: MongoDB Use CasesNosql Now 2012: MongoDB Use Cases
Nosql Now 2012: MongoDB Use CasesMongoDB
 
Hadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelHadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelUwe Printz
 
MongoDB & Hadoop - Understanding Your Big Data
MongoDB & Hadoop - Understanding Your Big DataMongoDB & Hadoop - Understanding Your Big Data
MongoDB & Hadoop - Understanding Your Big DataMongoDB
 

Similar a NoSQL for the SQL Server Pro (20)

Bd cloud v3
Bd cloud v3Bd cloud v3
Bd cloud v3
 
Lecture1 BIG DATA and Types of data in details
Lecture1 BIG DATA and Types of data in detailsLecture1 BIG DATA and Types of data in details
Lecture1 BIG DATA and Types of data in details
 
IARE_BDBA_ PPT_0.pptx
IARE_BDBA_ PPT_0.pptxIARE_BDBA_ PPT_0.pptx
IARE_BDBA_ PPT_0.pptx
 
Big data hadoop-no sql and graph db-final
Big data hadoop-no sql and graph db-finalBig data hadoop-no sql and graph db-final
Big data hadoop-no sql and graph db-final
 
Big Data with Not Only SQL
Big Data with Not Only SQLBig Data with Not Only SQL
Big Data with Not Only SQL
 
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
 
Big data.ppt
Big data.pptBig data.ppt
Big data.ppt
 
Big data4businessusers
Big data4businessusersBig data4businessusers
Big data4businessusers
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Lecture1
Lecture1Lecture1
Lecture1
 
NoSQLDatabases
NoSQLDatabasesNoSQLDatabases
NoSQLDatabases
 
Webinar: The Future of SQL
Webinar: The Future of SQLWebinar: The Future of SQL
Webinar: The Future of SQL
 
BigDataCloud meetup Feb 16th - Microsoft's Saptak Sen's presentation
BigDataCloud meetup Feb 16th - Microsoft's Saptak Sen's presentationBigDataCloud meetup Feb 16th - Microsoft's Saptak Sen's presentation
BigDataCloud meetup Feb 16th - Microsoft's Saptak Sen's presentation
 
Introduction to Cloud computing and Big Data-Hadoop
Introduction to Cloud computing and  Big Data-HadoopIntroduction to Cloud computing and  Big Data-Hadoop
Introduction to Cloud computing and Big Data-Hadoop
 
L’architettura di Classe Enterprise di Nuova Generazione
L’architettura di Classe Enterprise di Nuova GenerazioneL’architettura di Classe Enterprise di Nuova Generazione
L’architettura di Classe Enterprise di Nuova Generazione
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
SoftServe BI/BigData Workshop in Utah
SoftServe BI/BigData Workshop in UtahSoftServe BI/BigData Workshop in Utah
SoftServe BI/BigData Workshop in Utah
 
Nosql Now 2012: MongoDB Use Cases
Nosql Now 2012: MongoDB Use CasesNosql Now 2012: MongoDB Use Cases
Nosql Now 2012: MongoDB Use Cases
 
Hadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelHadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data Model
 
MongoDB & Hadoop - Understanding Your Big Data
MongoDB & Hadoop - Understanding Your Big DataMongoDB & Hadoop - Understanding Your Big Data
MongoDB & Hadoop - Understanding Your Big Data
 

Más de Lynn Langit

VariantSpark on AWS
VariantSpark on AWSVariantSpark on AWS
VariantSpark on AWSLynn Langit
 
Serverless Architectures
Serverless ArchitecturesServerless Architectures
Serverless ArchitecturesLynn Langit
 
10+ Years of Teaching Kids Programming
10+ Years of Teaching Kids Programming10+ Years of Teaching Kids Programming
10+ Years of Teaching Kids ProgrammingLynn Langit
 
Blastn plus jupyter on Docker
Blastn plus jupyter on DockerBlastn plus jupyter on Docker
Blastn plus jupyter on DockerLynn Langit
 
Testing in Ballerina Language
Testing in Ballerina LanguageTesting in Ballerina Language
Testing in Ballerina LanguageLynn Langit
 
Teaching Kids to create Alexa Skills
Teaching Kids to create Alexa SkillsTeaching Kids to create Alexa Skills
Teaching Kids to create Alexa SkillsLynn Langit
 
Understanding Jupyter notebooks using bioinformatics examples
Understanding Jupyter notebooks using bioinformatics examplesUnderstanding Jupyter notebooks using bioinformatics examples
Understanding Jupyter notebooks using bioinformatics examplesLynn Langit
 
Genome-scale Big Data Pipelines
Genome-scale Big Data PipelinesGenome-scale Big Data Pipelines
Genome-scale Big Data PipelinesLynn Langit
 
Teaching Kids Programming
Teaching Kids ProgrammingTeaching Kids Programming
Teaching Kids ProgrammingLynn Langit
 
Serverless Reality
Serverless RealityServerless Reality
Serverless RealityLynn Langit
 
Genomic Scale Big Data Pipelines
Genomic Scale Big Data PipelinesGenomic Scale Big Data Pipelines
Genomic Scale Big Data PipelinesLynn Langit
 
VariantSpark - a Spark library for genomics
VariantSpark - a Spark library for genomicsVariantSpark - a Spark library for genomics
VariantSpark - a Spark library for genomicsLynn Langit
 
Bioinformatics Data Pipelines built by CSIRO on AWS
Bioinformatics Data Pipelines built by CSIRO on AWSBioinformatics Data Pipelines built by CSIRO on AWS
Bioinformatics Data Pipelines built by CSIRO on AWSLynn Langit
 
Serverless Reality
Serverless RealityServerless Reality
Serverless RealityLynn Langit
 
Beyond Relational
Beyond RelationalBeyond Relational
Beyond RelationalLynn Langit
 
New AWS Services for Bioinformatics
New AWS Services for BioinformaticsNew AWS Services for Bioinformatics
New AWS Services for BioinformaticsLynn Langit
 
Google Cloud and Data Pipeline Patterns
Google Cloud and Data Pipeline PatternsGoogle Cloud and Data Pipeline Patterns
Google Cloud and Data Pipeline PatternsLynn Langit
 
Scaling Galaxy on Google Cloud Platform
Scaling Galaxy on Google Cloud PlatformScaling Galaxy on Google Cloud Platform
Scaling Galaxy on Google Cloud PlatformLynn Langit
 

Más de Lynn Langit (20)

VariantSpark on AWS
VariantSpark on AWSVariantSpark on AWS
VariantSpark on AWS
 
Serverless Architectures
Serverless ArchitecturesServerless Architectures
Serverless Architectures
 
10+ Years of Teaching Kids Programming
10+ Years of Teaching Kids Programming10+ Years of Teaching Kids Programming
10+ Years of Teaching Kids Programming
 
Blastn plus jupyter on Docker
Blastn plus jupyter on DockerBlastn plus jupyter on Docker
Blastn plus jupyter on Docker
 
Testing in Ballerina Language
Testing in Ballerina LanguageTesting in Ballerina Language
Testing in Ballerina Language
 
Teaching Kids to create Alexa Skills
Teaching Kids to create Alexa SkillsTeaching Kids to create Alexa Skills
Teaching Kids to create Alexa Skills
 
Practical cloud
Practical cloudPractical cloud
Practical cloud
 
Understanding Jupyter notebooks using bioinformatics examples
Understanding Jupyter notebooks using bioinformatics examplesUnderstanding Jupyter notebooks using bioinformatics examples
Understanding Jupyter notebooks using bioinformatics examples
 
Genome-scale Big Data Pipelines
Genome-scale Big Data PipelinesGenome-scale Big Data Pipelines
Genome-scale Big Data Pipelines
 
Teaching Kids Programming
Teaching Kids ProgrammingTeaching Kids Programming
Teaching Kids Programming
 
Practical Cloud
Practical CloudPractical Cloud
Practical Cloud
 
Serverless Reality
Serverless RealityServerless Reality
Serverless Reality
 
Genomic Scale Big Data Pipelines
Genomic Scale Big Data PipelinesGenomic Scale Big Data Pipelines
Genomic Scale Big Data Pipelines
 
VariantSpark - a Spark library for genomics
VariantSpark - a Spark library for genomicsVariantSpark - a Spark library for genomics
VariantSpark - a Spark library for genomics
 
Bioinformatics Data Pipelines built by CSIRO on AWS
Bioinformatics Data Pipelines built by CSIRO on AWSBioinformatics Data Pipelines built by CSIRO on AWS
Bioinformatics Data Pipelines built by CSIRO on AWS
 
Serverless Reality
Serverless RealityServerless Reality
Serverless Reality
 
Beyond Relational
Beyond RelationalBeyond Relational
Beyond Relational
 
New AWS Services for Bioinformatics
New AWS Services for BioinformaticsNew AWS Services for Bioinformatics
New AWS Services for Bioinformatics
 
Google Cloud and Data Pipeline Patterns
Google Cloud and Data Pipeline PatternsGoogle Cloud and Data Pipeline Patterns
Google Cloud and Data Pipeline Patterns
 
Scaling Galaxy on Google Cloud Platform
Scaling Galaxy on Google Cloud PlatformScaling Galaxy on Google Cloud Platform
Scaling Galaxy on Google Cloud Platform
 

Último

DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 

Último (20)

DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 

NoSQL for the SQL Server Pro

  • 1. NoSQL for the DBA Lynn Langit April 2013 – Big Data Tech Con
  • 2. Data Expertise / Lynn Langit • Industry awards – Microsoft – MVP for SQL Server – Google – GDE for Cloud Platform – 10Gen – Master for MongoDB • Practicing Architect • Technical author / trainer – Pluralsight – Google Cloud Series – DevelopMentor – SQL Server Series – 2 books on SQL Server BI – Cloudera trainer (certified) • Former MSFT FTE – 4 years
  • 4. What is the relationship? Business NoSQL ???? Intelligence
  • 5. “The Past” BI = Effective Reports Data optimized for Static READING
  • 6. BI = Optimized RDBMS SQL queries & Data Stored on disk
  • 7. BI = OLAP Cubes storage
  • 8. BI = OLAP Cubes clients
  • 9. BI = Transactional Data Collecting • What happened? Transactional • Why did that happen? data • Decision Support Systems
  • 11. Enter Big Data Q: What is it? A: Your Data, plus more data….
  • 12. BigData Pipeline - STEP 1 – Acquire Acquire Process Store Query & Mine Visualize
  • 13. Big Data – an example from weather 13
  • 14. Big Data – an example from weather • Source Data • National weather data • Satellite data • Airplanes with sensors • Sensors on boats • Sensors in the ocean • Sensors on the ground • Historical Data • Social Media • Results • More accurate predictions • Tsunami • Tornado
  • 15. Big Data – an example from health care • Medical records • Regular • Emergency • Genetic data – 23andMe • Food data • SparkPeople • Purchasing • Grocery card • credit card • Search – Google • Social media • Twitter • Facebook • Exercise • Nike Fuel Band • Kinect • Location - phone
  • 16. BigData = ‘Next State’ Questions • What could happen? Collecting • Why didn’t this happen? • When will the next new thing Behavioral happen? data • What will the next new thing be? • What happens?
  • 17. What is the reality of personalized medicine? 2500 2000 Key Monitoring 1500 Sensor Readings 1000 500 Other Behavioral 0 data 12:00 12:30 1:00 1:30 2:00 2:30
  • 18. BigData and Verticals • Retail • Manufacturing • Health Care • Banking • Education
  • 19. Collecting BigData • Sensors everywhere • Structured, Semi-structured, Unstructured vs. Data Standards • M2M • Public Datasets – Freebase – Azure DataMarket – Hillary Mason’s list 19
  • 20. DEMO – Hilary Mason’s Datasets • Who is Hilary Mason and why do you care about her datasets? • How do you get her datasets? • What do you do with her datasets?
  • 21. Collecting Data – a note about Faces • Facial recognition • Voice recognition • Gesture capture and analysis 21
  • 22. Petabytes of Big Data
  • 23. Big Data at Apple
  • 24. Big Data in India Update: “The total number of AADHAARs issued as of 24-Mar- 2013 is over 304 million. This is more than 25% of the population of India.”
  • 25. BigData Pipeline – STEP 5 - Visualize Acquire Process Store Query & Mine Visualize
  • 26. DEMO - Visualizing Big Data: Wind Map 26
  • 27. Demo - Visualizing Big Data – D3 27
  • 28. BigData Pipeline – STEP 2 - Process Acquire Process Store Query & Mine Visualize
  • 29. How do you clean up the mess? • Data Hygiene • Data Scrubbing • Data Sprawl • The true cost of data • …and what about data integrity? • …and security? • …should your data be in the cloud?
  • 30. Is NoSQL just Hadoop? HUGE Hype factor since 2011 Apache Hadoop • a software framework that supports data-intensive distributed applications • under a free license enables applications to work with thousands of nodes and petabytes of data • was inspired by Google's MapReduce and Google File System (GFS) papers
  • 31. What is the relationship? NoSQL Hadoop ??? BigData
  • 32. Hadoop in the Enterprise
  • 33. How you ‘get’ Hadoop Open source • roll your own Commercial distribution • Cloudera • MapR • Hortonworks • More… Rent it via the cloud • AWS
  • 34. Demo – Get and Use Cloudera CDH4 VM
  • 36. About Hadoop MapReduce Image from - https://developers.google.com/appengine/docs/python/images/mapreduce_mapshuffle.png
  • 37. Demo - HDInsight – MapReduce w/Java
  • 38. Demo - HDInsight – MapReduce w/ Hive
  • 39. Example Comparison: RDBMS vs. Hadoop Traditional RDBMS Hadoop / MapReduce Data Size Gigabytes (Terabytes) Petabytes and greater Access Interactive and Batch Batch – NOT Interactive Updates Read / Write many times Write once, Read many times Structure Static Schema Dynamic Schema Integrity High (ACID) Low Scaling Nonlinear Linear Query Can be near immediate Has latency (due to batch Response processing) Time
  • 40. BigData Pipeline STEP 3 – Store Acquire Process Store Query & Mine Visualize
  • 41. “Small” BigData vs. “Big” BigData Hadoop NoSQL RDBMS
  • 42. The reality…two pivots Storage Methods Storage Locations • SQL (RDBMS) • On premises • NoSQL or Hadoop • Cloud-hosted
  • 43. Cloud-hosted NoSQL up to 50x CHEAPER
  • 44. So many NoSQL options • More than just the Elephant in the room • Over 120+ types of NoSQL databases
  • 45. Flavors of NoSQL Key/Value Key/value Wide-Column Document Graph Volatile Persistent
  • 46. Key / Value Database • Just keys and values – No schema • Persistent or Volatile • Examples – AWS Dynamo DB – Riak
  • 47. DEMO - AWS DynamoDB • Key/Value store on the AWS cloud
  • 48. NoSQL BLOB Storage Buckets in the Cloud • Amazon – S3 or Glacier • Google – Cloud Storage • Microsoft Azure BLOBS • Others – Dropbox – Box – More…
  • 49. DEMO - Battle of the Buckets • Google Cloud Storage VS. • Windows Azure BLOBS VS. • AWS S3 / Glacier
  • 50. Column Database • Wide, sparse column sets • Schema-light • Examples: – Cassandra – HBase w/Hadoop – BigTable – GAE HR DS
  • 51. Types of Column Databases • Column-families – Non-relational – Sparse – Examples: • HBase • Cassandra • xVelocity (SQL 2012 Tabular) • Column-stores – Relational – Dense – Example: • SQL Server 2012 – Columnstore index
  • 52. DEMO – SQL Server ‘NoSQL’ • SQL Server 2012 Columnstore Index • SQL Server 2012 Tabular Model (SSAS)
  • 53. Document Database (Mongo DB) • document-oriented (collection of JSON documents) w/semi structured data – Encodings include BSON, JSON, XML… • binary forms – PDF, Microsoft Office documents -- Word, Excel…) • Examples: – MongoDB – Couchbase
  • 55. Graph Databases • a lot of many-to-many relationships • recursive self-joins • when your primary objective is quickly finding connections, patterns and relationships between the objects within lots of data • Examples: – Neo4J – Google Freebase
  • 57. CAP Theorem applied = ‘how big is it?’ • CA = RDBMS – Highly-available consistency – Ex. SQL Server • CP = NoSQL – Enforced consistency – Ex. Hadoop • AP = NoSQL – Eventual consistency – Ex. MongoDB
  • 58. “Small” BigData vs. “Big” BigData Hadoop Key/Value or Column Document or Graph RDBMS
  • 59. Cloud-hosted RDBMS • AWS RDS – SQL Server, mySQL, Oracle – Medium cost – Solid feature set, i.e. backup, snapshot – Use existing tooling • Google – mySQL – Lowest cost – Most limited RDBMS functionality • Microsoft – SQLAzure – Highest cost
  • 60. DEMO - AWS RDS • SQL Server, MySQL or Oracle • Essential to understand pricing models
  • 61.
  • 63. NoSQL Applied Columnstore Log Files HBase Key/Value Product Catalogs DynamoDB Document Social Games MongoDB Graph Social aggregators Neo4j RDBMS Line-of-Business SQL Server
  • 64. Cloud Offerings– RDBMS AND NoSQL AWS Google Microsoft RDBMS RDS – all major mySQL SQL Azure NoSQL buckets S3 or Glacier Cloud Storage Azure Blobs NoSQL Key-Value DynamoDB H/R Data on GAE Azure Tables Streaming ML or Custom EC2 Prospective Search StreamInsight (Mahout) & Prediction API NoSQL Document or MongoDB on EC2 Freebase MongoDB on Graph Windows Azure NoSQL – Column Elastic MapReduce none HDInsight Hadoop (HBase) using S3 & EC2 Dremel/Warehousi RedShift BigQuery none ng
  • 65. BigData Pipeline STEP 4 – Query Acquire Process Store Query & Mine Visualize
  • 67. Data Scientists and Languages
  • 69. Can Excel help? • Connector to Hadoop • Data Explorer • Data Quality Services • Master Data Services • Integration with Azure Data Market • Visualize with PowerView • Data Mining w/Predixion
  • 70. Demo - Hadoop Connector to Excel
  • 71. Google BigQuery w/Excel • Hadoop-like (Dremel) based service • For massive amounts of data • SQL-like query language
  • 72. DEMO - Google BigQuery • Hadoop-like (Dremel) based service • For massive amounts of data • SQL-like query language
  • 73. Dremel Realized => Impala • Interactive Hadoop?
  • 74. Other types of cloud data services Hosting public datasets Cleaning / matching (your) • Pay to read data • Earn revenue by offering for read • ETL – Microsoft Data Explorer, Google Refine • Data Quality – Windows Azure Data Market, InfoChimps, DataMarket .com
  • 75. NoSQL To-Do List Understand CAP & types of NoSQL databases • Use NoSQL when business needs designate • Use the right type of NoSQL for your business problem Try out NoSQL on the cloud • Quick and cheap for behavioral data • Mashup cloud datasets • Good for specialized use cases, i.e. dev, test , training environments Learn noSQL access technologies • New query languages, i.e. MapReduce, R, Infer.NET • New query tools (vendor-specific) – Google Refine, Amazon Karmasphere, Microsoft Excel connectors, etc…
  • 76. The Changing Data Landscape Other Services RDBMS NoSQL
  • 77. • recipes) www.TeachingKidsProgramming.org • Free Courseware ( • Do a Recipe  Teach a Kid (Ages 10 ++) • Java or Microsoft SmallBasic  TKP site • C# via Pluralsight
  • 78. Toward Data Craftsmanship… Follow me @LynnLangit RSS my blog www.LynnLangit.com Hire me • To help build your BI/Big Data solution • To teach your team next gen BI • To learn more about using NoSQL solutions

Notas del editor

  1. From the O’Reilly / Strata “Getting Ready for Big Data” Report…“the three Vs of volume, velocity and variety are commonlyused to characterize different aspects of big data”
  2. http://www.inboundlogistics.com/cms/article/m2m-101/http://www.freebase.com/Hilary Mason’s datasets - https://bitly.com/bundles/hmason/1
  3. Facial recognition database - http://www.face-rec.org/databases/http://usahitman.com/unexpected-fr/
  4. https://www.eff.org/deeplinks/2012/09/indias-gargantuan-biometric-database-raises-big-questions
  5. http://hint.fm/wind/
  6. http://d3js.org/
  7. http://hadoop.apache.org/http://en.wikipedia.org/wiki/Apache_Hadoop
  8. http://www.oracle.com/technetwork/bdc/hadoop-loader/overview/index.htmlhttp://www.microsoft.com/download/en/details.aspx?id=27584
  9. http://hortonworks.com/technology/hortonworksdataplatform/More about Hbase, from the O’Reilly ‘Getting Ready for BigData’ report“Enter HBase, a column-oriented database that runs on top of HDFS. Modeled after Google’s BigTable, the project’s goal is to host billions of rows of data for rapid access. MapReduce can use HBase as both a source and a destination for its computations, and Hive and Pig can be used in combination with HBase.In order to grant random access to the data, HBase does impose a few restrictions: performance with Hive is 4-5 times slower than plain HDFS, and the maximum amount of data you can store is approximately a petabyte, versus HDFS’ limit of over 30PB.”http://www.cloudera.com/
  10. http://hortonworks.com/technology/hortonworksdataplatform/More about Hbase, from the O’Reilly ‘Getting Ready for BigData’ report“Enter HBase, a column-oriented database that runs on top of HDFS. Modeled after Google’s BigTable, the project’s goal is to host billions of rows of data for rapid access. MapReduce can use HBase as both a source and a destination for its computations, and Hive and Pig can be used in combination with HBase.In order to grant random access to the data, HBase does impose a few restrictions: performance with Hive is 4-5 times slower than plain HDFS, and the maximum amount of data you can store is approximately a petabyte, versus HDFS’ limit of over 30PB.”http://www.cloudera.com/
  11. https://www.hadooponazure.com/AccountDemo - http://www.youtube.com/watch?v=XcHz8aUDDN8 and http://www.youtube.com/watch?v=c7oHntP8HBI
  12. https://www.hadooponazure.com/AccountDemo - http://www.youtube.com/watch?v=XcHz8aUDDN8 and http://www.youtube.com/watch?v=c7oHntP8HBI
  13. Original Reference: Tom White’s Hadoop: The Definitive Guide (I made some modifications based on my experience)
  14. http://lynnlangit.wordpress.com/2011/11/09/relational-cloud-storage-is-50x-more-expensive-than-nosql/
  15. http://nosql-database.org/http://hadoop.apache.org/ & http://www.mongodb.org/Wikipedia - http://en.wikipedia.org/wiki/NoSQLList of noSQL databases – http://nosql-database.org/The good, the bad - http://www.techrepublic.com/blog/10things/10-things-you-should-know-about-nosql-databases/1772
  16. http://bigdatanerd.wordpress.com/2012/01/04/why-nosql-part-2-overview-of-data-modelrelational-nosql/http://docs.jboss.org/hibernate/ogm/3.0/reference/en-US/html_single/
  17. http://en.wikipedia.org/wiki/Project_Voldemorthttp://aws.amazon.com/http://docs.amazonwebservices.com/amazondynamodb/latest/developerguide/Introduction.htmlhttp://www.allthingsdistributed.com/2012/01/amazon-dynamodb.html
  18. http://code.google.comAccess via REST APIsVery Cheap, but not much functionality includedLots of code to write for application developmentBut…can be a good backup solution
  19. http://stage.hypertable.com/index.php/documentation/architecture/http://code.google.com/appengine/http://code.google.com/appengine/articles/datastore/overview.html
  20. http://cwebbbi.wordpress.com/2012/02/14/so-what-is-the-bi-semantic-model/http://www.databasejournal.com/features/mssql/understanding-new-column-store-index-of-sql-server-2012.htmlhttp://dbmsmusings.blogspot.com/2010/03/distinguishing-two-major-types-of_29.htmlhttp://ayende.com/blog/4500/that-no-sql-thing-column-family-databases
  21. http://en.wikipedia.org/wiki/MongoDBhttp://www.mongodb.org/downloadshttp://www.mongodb.org/display/DOCS/Drivers
  22. http://en.wikipedia.org/wiki/MongoDBhttp://www.mongodb.org/downloadshttp://www.mongodb.org/display/DOCS/Drivers
  23. http://www.infinitegraph.com/what-is-a-graph-database.htmlhttp://en.wikipedia.org/wiki/Graph_databasehttp://www.freebase.com/
  24. http://berb.github.com/diploma-thesis/original/061_challenge.htmlhttp://nosqltips.blogspot.com/2011/04/cap-diagram-for-distribution.htmlhttp://blog.mccrory.me/2010/11/03/cap-theorem-and-the-clouds/http://amitpiplani.blogspot.com/2010/05/u-pick-2-selection-for-nosql-providers.htmlACID VS BASEACIDRDBMS are the predominant database systems currently in use for most applications, including web applications. Their data model and their internals are strongly connected to transactional behaviour when operating on data. However, transactional behaviour is not solely related to RDBMS, but is also used for other systems. A set of properties describes the guarantees that database transactions generally provide in order to maintain the validity of data [Hae83].AtomicityThis property determines that a transaction executes with a "all or nothing" manner. A transaction can either be a single operation, or a sequence of operations resp. sub-transactions. As a result, a transaction either succeeds and the database state is changed, or the database remains unchanged, and all operations are dismissed.ConsistencyIn context of transactions, we define consistency as the transition from one valid state to another, never leaving behind an invalid state when executing transactions.IsolationThe concept of isolation guarantees that no transaction sees premature operations of other running transactions. In essence, this prevents conflicts between concurrent transactions.DurabilityThe durability property assures persistence of executed transactions. Once a transaction has committed, the outcome of the transaction such as state changes are kept, even in case of a crash or other failures.Strongly adhering to the principles of ACID results in an execution order that has the same effect as a purely serial execution. In other words, there is always a serially equivalent order of transactions that represents the exact same state [Dol05]. It is obvious that ensuring a serializable order negatively affects performance and concurrency, even when a single machine is used. In fact, some of the properties are often relaxed to a certain extent in order to improve performance. A weaker isolation level between transactions is the most used mechanism to speed up transactions and their throughput. Stepwise, a transactional system can leave serializablity and fall back to the weaker isolation levels repeatable reads, read committed and read uncommitted. These levels gradually remove range locks, read locks and resp. write locks (in that order). As a result, concurrent transactions are less isolated and can see partial results of other transactions, yielding so called read phenomena. Some implementations also weaken the durability property by not guaranteeing to write directly to disk. Instead, committed states are buffered in memory and eventually flushed to disk. This heavily decreases latencies at the cost of data integrity.Consistency is still a core property of the ACID model, that cannot be relaxed easily. The mutual dependencies of the properties make it impossible to remove a single property without affecting the others. Referring back to the CAP theorem, we have seen the trade-off between consistency and availability regarding distributed database systems that must tolerate partitions. In case we choose a database system that follows the ACID paradigm, we cannot guarantee high availability anymore. The usage of ACID as part of a distributed systems yields the need of distributed transactions or similar mechanisms for preserving the transactional properties when state is shared and sharded onto multiple nodes.Now let us reconsider what would happen if we evict the burden of distributed transactions. As we are talking about distributed systems, we have no global shared state by default. The only knowledge we have is a per-node knowledge of its own past. Having no global time, no global now, we cannot inherently have atomic operations on system level, as operations occur at different times on different machines. This softens isolation and we must leave the notion of global state for now. Having no immediate, global state of the system in turn endangers durability.In conclusion, building distributed systems adhering to the ACID paradigm is a demanding challenge. It requires complex coordination and synchronization efforts between all involved nodes, and generates considerable communication overhead within the entire system. It is not for nothing that distributed transactions are feared by many architects [Hel09,Alv11]. Although it is possible to build such systems, some of the original motivations for using a distributed database system have been mitigated on this path. Isolation and serializablity contradict scalability and concurrency. Therefore, we will now consider an alternative model for consistency that sacrifices consistency for other properties that are interesting for certain systems.BASEThis alternative consistency model is basically the subsumption of properties resulting from a system that provides availability and partition tolerance, but no strong consistency [Pri08]. While a strong consistency model as provided by ACID implies that all subsequent reads after a write yield the new, updated state for all clients and on all nodes of the system, this is weakened for BASE. Instead, the weak consistency of BASE comes up with an inconsistency window, a period in which the propagation oft the update is not yet guaranteed.Basically availableThe availability of the system even in case of failures represents a strong feature of the BASE model.Soft stateNo strong consistency is provided and clients must accept stale state under certain circumstances.Eventually consistentConsistency is provided in a "best effort" manner, yielding a consistent state as soon as possible.The optimistic behaviour of BASE represents a best effort approach towards consistency, but is also said to be simpler and faster. Availability and scaling capacities are primary objectives at the expense of consistency. This has an impact on the application logic, as the developer must be aware of the possibility of stale data. On the other hand, favoring availability over consistency has also benefits for the application logic in some scenarios. For instance, a partial system failure of an ACID system might reject write operations, forcing the application to handle the data to be written somehow. A BASE system in turn might always accept writes, even in case of network failures, but they might not be visible for other nodes immediately. The applicability of relaxed consistency models depends very much on the application requirements. Strict constraints of balanced accounts for a banking application do not fit eventual consistency naturally. But many web applications built around social objects and user interactions can actually accept slightly stale data. When the inconsistency window is on average smaller than the time between request/response cycles of user interactions, a user might not even realize any kind of inconsistency at all.http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
  25. For Google - http://code.google.comFor AWS - https://console.aws.amazon.com/console/home
  26. Hadoop on AWS - http://wiki.apache.org/hadoop/AmazonEC2
  27. http://rickosborne.org/download/SQL-to-MongoDB.pdf
  28. About Data Science -- http://www.romymisra.com/the-new-job-market-rulers-data-scientists/R language - http://www.r-project.org/Infer.NET - http://research.microsoft.com/en-us/um/cambridge/projects/infernet/Julia language -- http://julialang.org/There are a plethora of languages to access, manipulate and process bigData. These languages fall into a couple of categories:RESTful – simple, standardsETL – Pig (Hadoop) is an exampleQuery – Hive (again Hadoop), lots of *QLAnalyze – R, Mahout, Infer.NET, DMX, etc.. Applying statistical (data-mining) algorithms to the data output
  29. http://www.youtube.com/watch?v=gjsMDAcI1Mo - analysthttp://www.youtube.com/watch?v=_MT04szKlyohttp://aws.amazon.com/articles/9574327584309154
  30. http://www.microsoft.com/en-us/bi/default.aspxhttp://dennyglee.com/Demos -   http://www.youtube.com/watch?v=djfpPsGwm6Aand http://www.youtube.com/watch?v=uh9bKWO1K7U
  31. http://research.google.com/pubs/pub36632.html
  32. http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/
  33. DataMarkets – InfoChimps, Factual, DataMarket, Windows Azure Data Marketplace, Wolfram Alpha, Datasifthttp://www.microsoft.com/en-us/sqlazurelabs/default.aspx andhttp://www.microsoft.com/en-us/sqlazurelabs/labs/dataexplorer.aspxhttps://datamarket.azure.com/http://www.freebase.com/http://code.google.com/p/google-refine/
  34. Lynn