Performing Data Science with HBase

  Aaron Kimball – CTO
  Kiyan Ahmadizadeh – MTS
MapReduce and log files



  Log files → Batch analysis → Result data set
The way we build apps is changing
HBase plays a big role
• High performance random access
• Flexible schema & sparse storage
• Natural mechanism for time series data
  … organized by user
Batch machine learning
WibiData: architecture
Data science lays the groundwork
• Feature selection requires insight
• Data lies in HBase, not log files
• MapReduce is too cumbersome for
  exploratory analytics
Data science lays the groundwork
• Feature selection requires insight
• Data lies in HBase, not log files
• MapReduce is too cumbersome for
  exploratory analytics

• This talk: How do we explore data in HBase?
Why not Hive?

• Need to manually sync column schemas
• No complex type support for HBase + Hive
  – Our use of Avro facilitates complex record types
• No support for time series events in columns
Why not Pig?

• Tuples handle sparse data poorly
• No support for time series events in columns
• UDFs and main pipeline are written in
  different languages (Java, Pig Latin)
  – (True of Hive too)
Our analysis needs
•   Read from HBase
•   Express complex concepts
•   Support deep MapReduce pipelines
•   Be written in concise code
•   Be accessible to data scientists more
    comfortable with R & Python than Java
Our analysis needs

• Concise: We use Scala
• Powerful: We use Apache Crunch
• Interactive: We built a shell

wibi>
WDL: The WibiData Language

wibi> 2 + 2
res0: Int = 4

wibi> :tables
Table     Description
=====     ===========
page      Wiki page info
user      per-user stats
Outline
•   Analyzing Wikipedia
•   Introducing Scala
•   An Overview of Crunch
•   Extending Crunch to HBase + WibiData
•   Demo!
Analyzing Wikipedia

• All revisions of all English pages
• Simulates real system that could be built on
  top of WibiData
• Allows us to practice real analysis at scale
Per-user information
• Rows keyed by Wikipedia user id or IP address
• Statistics for several metrics on all edits made
  by each user
Introducing Scala
• Scala language allows declarative statements
• Easier to express transformations over your
  data in an intuitive way
• Integrates with Java and runs on the JVM
• Supports interactive evaluation
Example: Iterating over lists
def printChar(ch: Char): Unit = {
  println(ch)
}

val lst = List('a', 'b', 'c')
lst.foreach(printChar)
… with an anonymous function
val lst = List('a', 'b', 'c')
lst.foreach( ch => println(ch) )


• Anonymous function can be specified as
  argument to foreach() method of a list.
• Lists, sets, etc. are immutable by default
Example: Transforming a list
val lst = List(1, 4, 7)
val doubled = lst.map(x => x * 2)


• map() applies a function to each element,
  yielding a new list. (doubled is the list [2,
  8, 14])
Example: Filtering
• Apply a boolean function to each element of a
  list, keep the ones that return true:

val lst = List(1, 3, 5, 6, 9)
val threes =
  lst.filter(x => x % 3 == 0)
// ‘threes’ is the list [3, 6, 9]
Example: Aggregation
val lst = List(1, 2, 12, 5)
lst.reduceLeft( (sum, x) => sum + x )
// Evaluates to 20.


• reduceLeft() aggregates elements left-to-right,
  in this case by keeping a running sum.
Crunch: MapReduce pipelines
def runWordCount(input: String, output: String) = {
  val wordCounts = read(From.textFile(input))
    .flatMap(line =>
        line.toLowerCase.split("""\s+"""))
    .filter(word => !word.isEmpty)
    .count
  wordCounts.write(To.textFile(output))
}
PCollections: Crunch data sets
• Represent a parallel record-oriented data set
• Items in PCollections can be lines of text,
  tuples, or complex data structures
• Crunch functions (like flatMap() and
  filter()) do work on partitions of
  PCollections in parallel.
PCollections… of WibiRows
•   WibiRow: Represents a row in a Wibi table
•   Enables access to sparse columns
•   … as a value: row(someColumn): Value
•   … as a timeline of values to iterate/aggregate:
    row.timeline(someColumn): Timeline[Value]
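The Timeline abstraction can be modeled roughly as a sequence of timestamped cell values, since HBase keeps multiple versions per cell. The sketch below is for intuition only; the `Cell` class and the sample deltas are made up, not the actual WibiData API:

```scala
// Illustrative model of a timeline: (timestamp, value) pairs, newest
// first. Not the real WibiData Timeline type.
case class Cell[V](timestamp: Long, value: V)

object TimelineSketch {
  def main(args: Array[String]): Unit = {
    val deltas: Seq[Cell[Int]] = Seq(
      Cell(1300L, 42),   // most recent edit delta
      Cell(1200L, -7),
      Cell(1100L, 15)
    )

    // Iterate, as row.timeline(someColumn) would allow:
    deltas.foreach(c => println(s"${c.timestamp}: ${c.value}"))

    // Aggregate across versions, e.g. mean absolute delta:
    val meanAbs = deltas.map(c => math.abs(c.value)).sum.toDouble / deltas.size
    println(meanAbs)
  }
}
```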
Introducing: Kiyan Ahmadizadeh
Demo!
Demo: Visualizing Distributions
• Suppose you have a metric taken on some population of
  users.
• Want to visualize what the distribution of the metric among
  the population looks like.
   – Could inform next analysis steps, feature selection for models,
     etc.
• Histograms can give insight on the shape of a distribution.
   – Choose a set of bins for the metric.
   – Count the number of population members whose metric falls
     into each bin.
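The two steps above can be sketched in plain Scala; the bin edges and sample metric values below are illustrative, not taken from the dataset:

```scala
// Sketch of histogram binning: count population members per bin.
object HistogramSketch {
  def main(args: Array[String]): Unit = {
    val binEdges = Range(0, 100, 25)               // bin edges 0, 25, 50, 75
    val metric = List(3.0, 27.5, 30.1, 88.0, 51.2) // sample metric values

    // Assign each value the largest bin edge not exceeding it,
    // then count how many values land in each bin.
    val counts: Map[Int, Int] = metric
      .map(v => binEdges.reduceLeft((choice, bin) => if (v < bin) choice else bin))
      .groupBy(identity)
      .map { case (bin, vs) => (bin, vs.size) }

    println(counts.toSeq.sortBy(_._1)) // Vector((0,1), (25,2), (50,1), (75,1))
  }
}
```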
Demo: Wikimedia Dataset
• We have a user table containing the average
  delta for all edits made by a user to pages.
• Edit Delta: The number of characters added or
  deleted by an edit to a page.
• Want to visualize the distribution of average
  deltas among users.
Demo!
Code: Accessing Data
val stats =
   ColumnFamily[Stats]("edit_metric_stats")

val userTable = read(
    From.wibi(WibiTableAddress.of("user"), stats))
Code: Accessing Data
val stats =
   ColumnFamily[Stats]("edit_metric_stats")

val userTable = read(
    From.wibi(WibiTableAddress.of("user"), stats))



Will act as a handle
for accessing the
column family.
Code: Accessing Data
val stats =
   ColumnFamily[Stats]("edit_metric_stats")

val userTable = read(
    From.wibi(WibiTableAddress.of("user"), stats))



                Type annotation tells
                WDL what kind of
                data to read out of
                the family.
Code: Accessing Data
val stats =
   ColumnFamily[Stats]("edit_metric_stats")

val userTable = read(
    From.wibi(WibiTableAddress.of("user"), stats))



userTable is a PCollection[WibiRow] obtained by reading the
column family “edit_metric_stats” from the Wibi table
“user.”
Code: UDFs
def getBin(bins: Range, value: Double): Int = {
  bins.reduceLeft ( (choice, bin) =>
    if (value < bin) choice else bin )
}

def inRange(bins: Range, value: Double): Boolean =
  bins.start <= value && value <= bins.end
Code: UDFs
def getBin(bins: Range, value: Double): Int = {
  bins.reduceLeft ( (choice, bin) =>
    if (value < bin) choice else bin )
}

def inRange(bins: Range, value: Double): Boolean =
  bins.start <= value && value <= bins.end


Everyday Scala function declarations!
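A quick sanity check of these UDFs (the bin range and test value are illustrative; `inRange` is written against the `bins` parameter it declares):

```scala
// Exercising getBin and inRange with illustrative bins 0, 10, ..., 90.
object UdfCheck {
  def getBin(bins: Range, value: Double): Int =
    bins.reduceLeft((choice, bin) => if (value < bin) choice else bin)

  def inRange(bins: Range, value: Double): Boolean =
    bins.start <= value && value <= bins.end

  def main(args: Array[String]): Unit = {
    val bins = Range(0, 100, 10)  // bin edges 0, 10, ..., 90
    println(getBin(bins, 37.2))   // 30: largest edge not exceeding 37.2
    println(inRange(bins, 37.2))  // true: 0 <= 37.2 <= 100
  }
}
```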
Code: Filtering
val filtered = userTable.filter { row =>
  // Keep editors who have edit_metric_stats:delta defined
  !row(stats).isEmpty && row(stats).get.contains("delta")
}
Code: Filtering
val filtered = userTable.filter { row =>
  // Keep editors who have edit_metric_stats:delta defined
  !row(stats).isEmpty && row(stats).get.contains("delta")
}



                    Boolean predicate on elements in the PCollection
Code: Filtering
val filtered = userTable.filter { row =>
  // Keep editors who have edit_metric_stats:delta defined
  !row(stats).isEmpty && row(stats).get.contains("delta")
}

  filtered is a PCollection of rows that have the column edit_metric_stats:delta
Code: Filtering
val filtered = userTable.filter { row =>
  // Keep editors who have edit_metric_stats:delta defined
  !row(stats).isEmpty && row(stats).get.contains("delta")
}



Use stats variable we declared earlier to access the column family.
       val stats = ColumnFamily[Stats]("edit_metric_stats")
Code: Binning
val binCounts = filtered.map { row =>
  // Bucket mean deltas for histogram
  getBin(bins, abs(row(stats).get("delta").getMean))
}.count()

binCounts.write(To.textFile("output_dir"))
Code: Binning
val binCounts = filtered.map { row =>
  // Bucket mean deltas for histogram
  getBin(bins, abs(row(stats).get("delta").getMean))
}.count()

binCounts.write(To.textFile("output_dir"))

          Map each editor to the bin their mean delta falls into.
Code: Binning
val binCounts = filtered.map { row =>
  // Bucket mean deltas for histogram
  getBin(bins, abs(row(stats).get("delta").getMean))
}.count()

binCounts.write(To.textFile("output_dir"))


Count how many times each bin occurs in the resulting collection.
Code: Binning
val binCounts = filtered.map { row =>
  // Bucket mean deltas for histogram
  getBin(bins, abs(row(stats).get("delta").getMean))
}.count()

binCounts.write(To.textFile("output_dir"))


        binCounts contains the number of editors that fall in each bin.
Code: Binning
val binCounts = filtered.map { row =>
  // Bucket mean deltas for histogram
  getBin(bins, abs(row(stats).get("delta").getMean))
}.count()

binCounts.write(To.textFile("output_dir"))


        Writes the result to HDFS.
Code: Visualization
Histogram.plot(binCounts,
    bins,
    "Histogram of Editors by Mean Delta",
    "Mean Delta",
    "Number of Editors",
    "delta_mean_hist.html")
Analysis Results: 1% of Data
Analysis Results: Full Data Set
Conclusions



     Scala + Crunch + HBase

     = Scalable analysis of sparse data
Aaron Kimball – aaron@wibidata.com
Kiyan Ahmadizadeh – kiyan@wibidata.com

Más contenido relacionado

La actualidad más candente

Hive Functions Cheat Sheet
Hive Functions Cheat SheetHive Functions Cheat Sheet
Hive Functions Cheat SheetHortonworks
 
Practical data science_public
Practical data science_publicPractical data science_public
Practical data science_publicLong Nguyen
 
pandas - Python Data Analysis
pandas - Python Data Analysispandas - Python Data Analysis
pandas - Python Data AnalysisAndrew Henshaw
 
Be A Hero: Transforming GoPro Analytics Data Pipeline
Be A Hero: Transforming GoPro Analytics Data PipelineBe A Hero: Transforming GoPro Analytics Data Pipeline
Be A Hero: Transforming GoPro Analytics Data PipelineChester Chen
 
Using Spectrum on Demand from MapInfo Pro
Using Spectrum on Demand from MapInfo ProUsing Spectrum on Demand from MapInfo Pro
Using Spectrum on Demand from MapInfo ProPeter Horsbøll Møller
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Cloudera, Inc.
 
Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)
Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)
Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)Michael Rys
 
Data profiling with Apache Calcite
Data profiling with Apache CalciteData profiling with Apache Calcite
Data profiling with Apache CalciteJulian Hyde
 
Introduction to HBase - Phoenix HUG 5/14
Introduction to HBase - Phoenix HUG 5/14Introduction to HBase - Phoenix HUG 5/14
Introduction to HBase - Phoenix HUG 5/14Jeremy Walsh
 
Python business intelligence (PyData 2012 talk)
Python business intelligence (PyData 2012 talk)Python business intelligence (PyData 2012 talk)
Python business intelligence (PyData 2012 talk)Stefan Urbanek
 
U-SQL Partitioned Data and Tables (SQLBits 2016)
U-SQL Partitioned Data and Tables (SQLBits 2016)U-SQL Partitioned Data and Tables (SQLBits 2016)
U-SQL Partitioned Data and Tables (SQLBits 2016)Michael Rys
 
Generalized framework for using NoSQL Databases
Generalized framework for using NoSQL DatabasesGeneralized framework for using NoSQL Databases
Generalized framework for using NoSQL DatabasesKIRAN V
 

La actualidad más candente (20)

Hive Functions Cheat Sheet
Hive Functions Cheat SheetHive Functions Cheat Sheet
Hive Functions Cheat Sheet
 
R Introduction
R IntroductionR Introduction
R Introduction
 
sets and maps
 sets and maps sets and maps
sets and maps
 
Dynamodb
DynamodbDynamodb
Dynamodb
 
Hive(ppt)
Hive(ppt)Hive(ppt)
Hive(ppt)
 
Practical data science_public
Practical data science_publicPractical data science_public
Practical data science_public
 
pandas - Python Data Analysis
pandas - Python Data Analysispandas - Python Data Analysis
pandas - Python Data Analysis
 
Unit 5-apache hive
Unit 5-apache hiveUnit 5-apache hive
Unit 5-apache hive
 
Be A Hero: Transforming GoPro Analytics Data Pipeline
Be A Hero: Transforming GoPro Analytics Data PipelineBe A Hero: Transforming GoPro Analytics Data Pipeline
Be A Hero: Transforming GoPro Analytics Data Pipeline
 
Apache hive
Apache hiveApache hive
Apache hive
 
Apache Hive
Apache HiveApache Hive
Apache Hive
 
Using Spectrum on Demand from MapInfo Pro
Using Spectrum on Demand from MapInfo ProUsing Spectrum on Demand from MapInfo Pro
Using Spectrum on Demand from MapInfo Pro
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
 
Unit 5-lecture4
Unit 5-lecture4Unit 5-lecture4
Unit 5-lecture4
 
Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)
Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)
Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)
 
Data profiling with Apache Calcite
Data profiling with Apache CalciteData profiling with Apache Calcite
Data profiling with Apache Calcite
 
Introduction to HBase - Phoenix HUG 5/14
Introduction to HBase - Phoenix HUG 5/14Introduction to HBase - Phoenix HUG 5/14
Introduction to HBase - Phoenix HUG 5/14
 
Python business intelligence (PyData 2012 talk)
Python business intelligence (PyData 2012 talk)Python business intelligence (PyData 2012 talk)
Python business intelligence (PyData 2012 talk)
 
U-SQL Partitioned Data and Tables (SQLBits 2016)
U-SQL Partitioned Data and Tables (SQLBits 2016)U-SQL Partitioned Data and Tables (SQLBits 2016)
U-SQL Partitioned Data and Tables (SQLBits 2016)
 
Generalized framework for using NoSQL Databases
Generalized framework for using NoSQL DatabasesGeneralized framework for using NoSQL Databases
Generalized framework for using NoSQL Databases
 

Similar a Performing Data Science with HBase

Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...
Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...
Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...Databricks
 
Introduction to Apache HBase, MapR Tables and Security
Introduction to Apache HBase, MapR Tables and SecurityIntroduction to Apache HBase, MapR Tables and Security
Introduction to Apache HBase, MapR Tables and SecurityMapR Technologies
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesDatabricks
 
Java/Scala Lab: Борис Трофимов - Обжигающая Big Data.
Java/Scala Lab: Борис Трофимов - Обжигающая Big Data.Java/Scala Lab: Борис Трофимов - Обжигающая Big Data.
Java/Scala Lab: Борис Трофимов - Обжигающая Big Data.GeeksLab Odessa
 
Data science at the command line
Data science at the command lineData science at the command line
Data science at the command lineSharat Chikkerur
 
Hive User Meeting March 2010 - Hive Team
Hive User Meeting March 2010 - Hive TeamHive User Meeting March 2010 - Hive Team
Hive User Meeting March 2010 - Hive TeamZheng Shao
 
KNOWAGE CUSTOM CHART WIDGET: a technical guide
KNOWAGE CUSTOM CHART WIDGET: a technical guideKNOWAGE CUSTOM CHART WIDGET: a technical guide
KNOWAGE CUSTOM CHART WIDGET: a technical guideKNOWAGE
 
Apache Cassandra, part 1 – principles, data model
Apache Cassandra, part 1 – principles, data modelApache Cassandra, part 1 – principles, data model
Apache Cassandra, part 1 – principles, data modelAndrey Lomakin
 
Shshsjsjsjs-4 - Copdjsjjsjsjsjakakakaaky.pptx
Shshsjsjsjs-4 - Copdjsjjsjsjsjakakakaaky.pptxShshsjsjsjs-4 - Copdjsjjsjsjsjakakakaaky.pptx
Shshsjsjsjs-4 - Copdjsjjsjsjsjakakakaaky.pptx086ChintanPatel1
 
Hadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectHadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectMao Geng
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)Paul Chao
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkCloudera, Inc.
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupDatabricks
 
Spark Sql and DataFrame
Spark Sql and DataFrameSpark Sql and DataFrame
Spark Sql and DataFramePrashant Gupta
 
Types Working for You, Not Against You
Types Working for You, Not Against YouTypes Working for You, Not Against You
Types Working for You, Not Against YouC4Media
 
Think Like Spark
Think Like SparkThink Like Spark
Think Like SparkAlpine Data
 

Similar a Performing Data Science with HBase (20)

Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...
Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...
Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...
 
Introduction to Apache HBase, MapR Tables and Security
Introduction to Apache HBase, MapR Tables and SecurityIntroduction to Apache HBase, MapR Tables and Security
Introduction to Apache HBase, MapR Tables and Security
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFrames
 
Java/Scala Lab: Борис Трофимов - Обжигающая Big Data.
Java/Scala Lab: Борис Трофимов - Обжигающая Big Data.Java/Scala Lab: Борис Трофимов - Обжигающая Big Data.
Java/Scala Lab: Борис Трофимов - Обжигающая Big Data.
 
Data science at the command line
Data science at the command lineData science at the command line
Data science at the command line
 
Hive User Meeting March 2010 - Hive Team
Hive User Meeting March 2010 - Hive TeamHive User Meeting March 2010 - Hive Team
Hive User Meeting March 2010 - Hive Team
 
KNOWAGE CUSTOM CHART WIDGET: a technical guide
KNOWAGE CUSTOM CHART WIDGET: a technical guideKNOWAGE CUSTOM CHART WIDGET: a technical guide
KNOWAGE CUSTOM CHART WIDGET: a technical guide
 
Apache Cassandra, part 1 – principles, data model
Apache Cassandra, part 1 – principles, data modelApache Cassandra, part 1 – principles, data model
Apache Cassandra, part 1 – principles, data model
 
Session 14 - Hive
Session 14 - HiveSession 14 - Hive
Session 14 - Hive
 
Shshsjsjsjs-4 - Copdjsjjsjsjsjakakakaaky.pptx
Shshsjsjsjs-4 - Copdjsjjsjsjsjakakakaaky.pptxShshsjsjsjs-4 - Copdjsjjsjsjsjakakakaaky.pptx
Shshsjsjsjs-4 - Copdjsjjsjsjsjakakakaaky.pptx
 
Couchbas for dummies
Couchbas for dummiesCouchbas for dummies
Couchbas for dummies
 
Hadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectHadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log project
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache Spark
 
No sql Database
No sql DatabaseNo sql Database
No sql Database
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
 
מיכאל
מיכאלמיכאל
מיכאל
 
Spark Sql and DataFrame
Spark Sql and DataFrameSpark Sql and DataFrame
Spark Sql and DataFrame
 
Types Working for You, Not Against You
Types Working for You, Not Against YouTypes Working for You, Not Against You
Types Working for You, Not Against You
 
Think Like Spark
Think Like SparkThink Like Spark
Think Like Spark
 

Más de WibiData

Data Evolution on HBase with Kiji
Data Evolution on HBase with KijiData Evolution on HBase with Kiji
Data Evolution on HBase with KijiWibiData
 
Exploring the Enron Email Dataset with Kiji and Hive
Exploring the Enron Email Dataset with Kiji and HiveExploring the Enron Email Dataset with Kiji and Hive
Exploring the Enron Email Dataset with Kiji and HiveWibiData
 
Analyzing Large-Scale User Data with Hadoop and HBase
Analyzing Large-Scale User Data with Hadoop and HBaseAnalyzing Large-Scale User Data with Hadoop and HBase
Analyzing Large-Scale User Data with Hadoop and HBaseWibiData
 
Building Personalized Applications at Scale
Building Personalized Applications at ScaleBuilding Personalized Applications at Scale
Building Personalized Applications at ScaleWibiData
 
Analyzing Large-Scale User Data with Hadoop and HBase
Analyzing Large-Scale User Data with Hadoop and HBaseAnalyzing Large-Scale User Data with Hadoop and HBase
Analyzing Large-Scale User Data with Hadoop and HBaseWibiData
 
Building Personalized Applications with HBase
Building Personalized Applications with HBaseBuilding Personalized Applications with HBase
Building Personalized Applications with HBaseWibiData
 

Más de WibiData (6)

Data Evolution on HBase with Kiji
Data Evolution on HBase with KijiData Evolution on HBase with Kiji
Data Evolution on HBase with Kiji
 
Exploring the Enron Email Dataset with Kiji and Hive
Exploring the Enron Email Dataset with Kiji and HiveExploring the Enron Email Dataset with Kiji and Hive
Exploring the Enron Email Dataset with Kiji and Hive
 
Analyzing Large-Scale User Data with Hadoop and HBase
Analyzing Large-Scale User Data with Hadoop and HBaseAnalyzing Large-Scale User Data with Hadoop and HBase
Analyzing Large-Scale User Data with Hadoop and HBase
 
Building Personalized Applications at Scale
Building Personalized Applications at ScaleBuilding Personalized Applications at Scale
Building Personalized Applications at Scale
 
Analyzing Large-Scale User Data with Hadoop and HBase
Analyzing Large-Scale User Data with Hadoop and HBaseAnalyzing Large-Scale User Data with Hadoop and HBase
Analyzing Large-Scale User Data with Hadoop and HBase
 
Building Personalized Applications with HBase
Building Personalized Applications with HBaseBuilding Personalized Applications with HBase
Building Personalized Applications with HBase
 

Último

Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 

Último (20)

Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 

Performing Data Science with HBase

  • 2. Performing Data Science with HBase Aaron Kimball – CTO Kiyan Ahmadizadeh – MTS
  • 3. MapReduce and log files Log files Batch analysis Result data set
  • 4. The way we build apps is changing
  • 5. HBase plays a big role • High-performance random access • Flexible schema & sparse storage • Natural mechanism for time series data … organized by user
  • 8. Data science lays the groundwork • Feature selection requires insight • Data lies in HBase, not log files • MapReduce is too cumbersome for exploratory analytics
  • 9. Data science lays the groundwork • Feature selection requires insight • Data lies in HBase, not log files • MapReduce is too cumbersome for exploratory analytics • This talk: How do we explore data in HBase?
  • 10. Why not Hive? • Need to manually sync column schemas • No complex type support for HBase + Hive – Our use of Avro facilitates complex record types • No support for time series events in columns
  • 11. Why not Pig? • Tuples handle sparse data poorly • No support for time series events in columns • UDFs and main pipeline are written in different languages (Java, Pig Latin) – (True of Hive too)
  • 12. Our analysis needs • Read from HBase • Express complex concepts • Support deep MapReduce pipelines • Be written in concise code • Be accessible to data scientists more comfortable with R & Python than Java
  • 13. Our analysis needs • Concise: We use Scala • Powerful: We use Apache Crunch • Interactive: We built a shell wibi>
  • 14. WDL: The WibiData Language wibi> 2 + 2 res0: Int = 4 wibi> :tables Table Description ======== =========== page Wiki page info user per-user stats
  • 15. Outline • Analyzing Wikipedia • Introducing Scala • An Overview of Crunch • Extending Crunch to HBase + WibiData • Demo!
  • 16. Analyzing Wikipedia • All revisions of all English pages • Simulates real system that could be built on top of WibiData • Allows us to practice real analysis at scale
  • 17. Per-user information • Rows keyed by Wikipedia user id or IP address • Statistics for several metrics on all edits made by each user
  • 18. Introducing Scala • The Scala language allows declarative statements • Easier to express transformations over your data in an intuitive way • Integrates with Java and runs on the JVM • Supports interactive evaluation
  • 19. Example: Iterating over lists def printChar(ch: Char): Unit = { println(ch) } val lst = List('a', 'b', 'c') lst.foreach(printChar)
  • 20. … with an anonymous function val lst = List('a', 'b', 'c') lst.foreach( ch => println(ch) ) • Anonymous function can be specified as argument to foreach() method of a list. • Lists, sets, etc. are immutable by default
  • 21. Example: Transforming a list val lst = List(1, 4, 7) val doubled = lst.map(x => x * 2) • map() applies a function to each element, yielding a new list. (doubled is the list [2, 8, 14])
  • 22. Example: Filtering • Apply a boolean function to each element of a list, keep the ones that return true: val lst = List(1, 3, 5, 6, 9) val threes = lst.filter(x => x % 3 == 0) // ‘threes’ is the list [3, 6, 9]
  • 23. Example: Aggregation val lst = List(1, 2, 12, 5) lst.reduceLeft( (sum, x) => sum + x ) // Evaluates to 20. • reduceLeft() aggregates elements left-to-right, in this case by keeping a running sum.
  • 24. Crunch: MapReduce pipelines def runWordCount(input, output) = { val wordCounts = read(From.textFile(input)) .flatMap(line => line.toLowerCase.split("""\s+""")) .filter(word => !word.isEmpty()) .count wordCounts.write(To.textFile(output)) }
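The same chain of operations can be tried on a plain local Scala collection, which mirrors the pipeline's semantics without a cluster. This is only an in-memory sketch: a `Seq` stands in for a Crunch `PCollection`, and `groupBy` plus a size count stands in for Crunch's `count`.

```scala
// Local-collection analogue of the Crunch word-count pipeline above.
val lines = Seq("the quick brown fox", "the lazy dog")

val wordCounts: Map[String, Int] =
  lines
    .flatMap(line => line.toLowerCase.split("""\s+""")) // """\s+""" splits on runs of whitespace
    .filter(word => !word.isEmpty)                      // drop empty tokens
    .groupBy(identity)                                  // word -> all its occurrences
    .map { case (word, occurrences) => (word, occurrences.size) }

// e.g. wordCounts("the") == 2
```

The point of the analogy: because Crunch's API deliberately mirrors the Scala collections API, a pipeline prototyped this way transfers to a `PCollection` with minimal changes.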
  • 25. PCollections: Crunch data sets • Represent a parallel record-oriented data set • Items in PCollections can be lines of text, tuples, or complex data structures • Crunch functions (like flatMap() and filter()) do work on partitions of PCollections in parallel.
  • 26. PCollections… of WibiRows • WibiRow: Represents a row in a Wibi table • Enables access to sparse columns • … as a value: row(someColumn): Value • … as a timeline of values to iterate/aggregate: row.timeline(someColumn): Timeline[Value]
  • 28. Demo!
  • 29. Demo: Visualizing Distributions • Suppose you have a metric taken on some population of users. • Want to visualize what the distribution of the metric among the population looks like. – Could inform next analysis steps, feature selection for models, etc. • Histograms can give insight on the shape of a distribution. – Choose a set of bins for the metric. – Count the number of population members whose metric falls into each bin.
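The two histogram steps described above (choose bins, then count members per bin) can be sketched in plain Scala. This is a local illustration only; the helper name `binFor` is hypothetical and not part of WDL.

```scala
// Count how many population members fall into each histogram bin.
// `bins` holds the lower edge of each bin, in ascending order.
def binFor(bins: Seq[Double], value: Double): Double =
  bins.reduceLeft((choice, bin) => if (value < bin) choice else bin)

val bins    = Seq(0.0, 10.0, 20.0, 30.0)
val metrics = Seq(3.5, 12.0, 14.9, 27.0, 8.1)

// Map each member to its bin's lower edge, then count members per bin.
val histogram: Map[Double, Int] =
  metrics.groupBy(v => binFor(bins, v))
         .map { case (edge, members) => (edge, members.size) }

// histogram == Map(0.0 -> 2, 10.0 -> 2, 20.0 -> 1)
```

Each value is assigned to the largest bin edge that does not exceed it, so the resulting map is exactly the bin-count table a histogram plot consumes.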
  • 30. Demo: Wikimedia Dataset • We have a user table containing the average delta for all edits made by a user to pages. • Edit Delta: The number of characters added or deleted by an edit to a page. • Want to visualize the distribution of average deltas among users.
  • 31. Demo!
  • 32. Code: Accessing Data val stats = ColumnFamily[Stats]("edit_metric_stats") val userTable = read( From.wibi(WibiTableAddress.of("user"), stats))
  • 33. Code: Accessing Data val stats = ColumnFamily[Stats]("edit_metric_stats") val userTable = read( From.wibi(WibiTableAddress.of("user"), stats)) Will act as a handle for accessing the column family.
  • 34. Code: Accessing Data val stats = ColumnFamily[Stats]("edit_metric_stats") val userTable = read( From.wibi(WibiTableAddress.of("user"), stats)) Type annotation tells WDL what kind of data to read out of the family.
  • 35. Code: Accessing Data val stats = ColumnFamily[Stats]("edit_metric_stats") val userTable = read( From.wibi(WibiTableAddress.of("user"), stats)) userTable is a PCollection[WibiRow] obtained by reading the column family “edit_metric_stats” from the Wibi table “user.”
  • 36. Code: UDFs def getBin(bins: Range, value: Double): Int = { bins.reduceLeft ( (choice, bin) => if (value < bin) choice else bin ) } def inRange(range: Range, value: Double): Boolean = range.start <= value && value <= range.end
  • 37. Code: UDFs def getBin(bins: Range, value: Double): Int = { bins.reduceLeft ( (choice, bin) => if (value < bin) choice else bin ) } def inRange(range: Range, value: Double): Boolean = range.start <= value && value <= range.end Everyday Scala function declarations!
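Because these UDFs are ordinary Scala functions, they can be sanity-checked interactively before being used in a pipeline. A self-contained sketch (the function bodies repeat the slide's definitions, with `inRange` consistently using its own parameter name):

```scala
// Binning UDFs: ordinary Scala functions, callable from the shell.
def getBin(bins: Range, value: Double): Int =
  bins.reduceLeft((choice, bin) => if (value < bin) choice else bin)

def inRange(range: Range, value: Double): Boolean =
  range.start <= value && value <= range.end

val bins = 0 to 100 by 10   // bin edges 0, 10, 20, ..., 100

// 25.0 falls in the [20, 30) bin, so getBin returns that bin's lower edge:
// getBin(bins, 25.0) == 20
// inRange(bins, 42.0) == true
```

Exercising a UDF on a handful of values like this catches logic errors cheaply, before the function runs over millions of rows in a MapReduce job.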
  • 38. Code: Filtering val filtered = userTable.filter { row => // Keep editors who have edit_metric_stats:delta defined !row(stats).isEmpty && row(stats).get.contains("delta") }
  • 39. Code: Filtering val filtered = userTable.filter { row => // Keep editors who have edit_metric_stats:delta defined !row(stats).isEmpty && row(stats).get.contains("delta") } Boolean predicate on elements in the PCollection
  • 40. Code: Filtering val filtered = userTable.filter { row => // Keep editors who have edit_metric_stats:delta defined !row(stats).isEmpty && row(stats).get.contains("delta") } filtered is a PCollection of rows that have the column edit_metric_stats:delta
  • 41. Code: Filtering val filtered = userTable.filter { row => // Keep editors who have edit_metric_stats:delta defined !row(stats).isEmpty && row(stats).get.contains("delta") } Use stats variable we declared earlier to access the column family. val stats = ColumnFamily[Stats]("edit_metric_stats")
  • 42. Code: Binning val binCounts = filtered.map { row => // Bucket mean deltas for histogram getBin(bins, abs(row(stats).get("delta").getMean)) }.count() binCounts.write(To.textFile("output_dir"))
  • 43. Code: Binning val binCounts = filtered.map { row => // Bucket mean deltas for histogram getBin(bins, abs(row(stats).get("delta").getMean)) }.count() binCounts.write(To.textFile("output_dir")) Map each editor to the bin their mean delta falls into.
  • 44. Code: Binning val binCounts = filtered.map { row => // Bucket mean deltas for histogram getBin(bins, abs(row(stats).get("delta").getMean)) }.count() binCounts.write(To.textFile("output_dir")) Count how many times each bin occurs in the resulting collection.
  • 45. Code: Binning val binCounts = filtered.map { row => // Bucket mean deltas for histogram getBin(bins, abs(row(stats).get("delta").getMean)) }.count() binCounts.write(To.textFile("output_dir")) binCounts contains the number of editors that fall in each bin.
  • 46. Code: Binning val binCounts = filtered.map { row => // Bucket mean deltas for histogram getBin(bins, abs(row(stats).get("delta").getMean)) }.count() binCounts.write(To.textFile("output_dir")) Writes the result to HDFS.
  • 47. Code: Visualization Histogram.plot(binCounts, bins, "Histogram of Editors by Mean Delta", "Mean Delta", "Number of Editors", "delta_mean_hist.html")
  • 50. Conclusions • HBase + Scala + Crunch = Scalable analysis of sparse data
  • 51. Aaron Kimball – aaron@wibidata.com Kiyan Ahmadizadeh – kiyan@wibidata.com