Big Data in the “Real World”
Edward Capriolo
What is “big data”?
● Big data is a collection of data sets so large and
complex that it becomes difficult to process
using traditional data processing applications.
● The challenges include capture, curation,
storage, search, sharing, transfer, analysis,
and visualization.
http://en.wikipedia.org/wiki/Big_data
Big Data Challenges
● The challenges include:
– capture
– curation
– storage
– search
– sharing
– transfer
– analysis
– visualization
– large
– complex
What is “big data” exactly?
● What is considered "big data" varies depending on
the capabilities of the organization managing the
set, and on the capabilities of the applications that
are traditionally used to process and analyze the
data set in its domain.
● As of 2012, limits on the size of data sets that are
feasible to process in a reasonable amount of
time were on the order of exabytes of data.
http://en.wikipedia.org/wiki/Big_data
Big Data Qualifiers
● varies
● capabilities
● traditionally
● feasibly
● reasonably
● [something]bytes of data
My first “big data” challenge
● Real time news delivery platform
● Ingest news as text and provide full text search
● Qualifiers
– Reasonable: Real time search was < 1 second
– Capabilities: small company, <100 servers
● Big Data challenges
– Storage: roughly 300GB for 60 days data
– Search: searches of thousands of terms
Traditionally
● Data was placed in MySQL
● MySQL full text search
● Easy to insert
● Easy to search
● Worked great!
– Until it got real world load
Feasibly in hardware
(circa 2008)
● 300GB of data and 16GB of RAM
● ...MySQL stores an in-memory binary tree of the keys.
Using this tree, MySQL can calculate the count of matching
rows with reasonable speed. But speed declines
logarithmically as the number of terms increases.
● The platters revolve at 15,000 RPM or so, which works out
to 250 revolutions per second. Average latency is listed as
2.0ms
● As the speed of an HDD increases the power it takes to run
it increases disproportionately
http://serverfault.com/questions/190451/what-is-the-throughput-of-15k-rpm-sas-drive
http://thessdguy.com/why-dont-hdds-spin-faster-than-15k-rpm/
http://dev.mysql.com/doc/internals/en/full-text-search.html
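The latency figure quoted above follows from the spindle speed alone. A minimal sketch of that arithmetic, assuming only the 15,000 RPM number from the slide (the function name is illustrative):

```python
def rotational_latency_ms(rpm):
    """Return (revolutions per second, average rotational latency in ms)."""
    revs_per_second = rpm / 60.0
    full_rotation_ms = 1000.0 / revs_per_second
    # On average the platter must turn half a revolution before the
    # requested sector passes under the head.
    return revs_per_second, full_rotation_ms / 2.0


revs, latency = rotational_latency_ms(15_000)
print(f"{revs:.0f} rev/s, {latency:.1f} ms average rotational latency")
```

This counts rotational delay only. Seek time comes on top of it, so a single drive of this kind serves at most a few hundred random reads per second, which matters once the index no longer fits in RAM.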
“Big Data” is about giving up things
● In theoretical computer science, the CAP theorem states
that it is impossible for a distributed computer system to
simultaneously provide all three of the following guarantees:
– Consistency (all nodes see the same data at the same time)
– Availability (a guarantee that every request receives a response
about whether it was successful or failed)
– Partition tolerance (the system continues to operate despite
arbitrary message loss or failure of part of the system)
http://en.wikipedia.org/wiki/CAP_theorem
http://www.youtube.com/watch?v=I4yGBfcODmU
Multi-Master solution
● Write the data to N MySQL servers and round-robin
reads between them
– Good: More machines to serve reads
– Bad: Requires Nx hardware
– Hard: Keeping machines loaded with the same data,
especially auto-generated IDs
– Hard: What about when the data does not even fit
on a single machine?
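The multi-master idea can be sketched as a toy in-memory model. This is an illustration under stated assumptions, not the setup used in the talk: the class, the server names, and the `id_sequence` helper are hypothetical. The helper shows one common way to keep auto-generated IDs from colliding, giving each server its own starting offset and a shared step, which MySQL exposes through the `auto_increment_offset` and `auto_increment_increment` system variables.

```python
import itertools


class MultiMasterPool:
    """Toy model: every write goes to all N servers, reads rotate among them."""

    def __init__(self, server_names):
        self.servers = {name: {} for name in server_names}
        self._read_cycle = itertools.cycle(server_names)

    def write(self, key, value):
        # Nx hardware: each server stores a full copy of the data.
        for store in self.servers.values():
            store[key] = value

    def read(self, key):
        # Round robin: each read is served by the next server in turn.
        name = next(self._read_cycle)
        return name, self.servers[name].get(key)


def id_sequence(offset, increment, count):
    """IDs a server hands out when it starts at `offset` and steps by `increment`."""
    return [offset + i * increment for i in range(count)]


pool = MultiMasterPool(["db1", "db2", "db3"])
pool.write("story:1", "headline")
readers = [pool.read("story:1")[0] for _ in range(4)]

ids_server_1 = id_sequence(offset=1, increment=3, count=3)
ids_server_2 = id_sequence(offset=2, increment=3, count=3)
```

With three servers and a step of 3, the two ID sequences never overlap. The sketch does nothing about the last bullet: every server still has to hold all of the data.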
Sharding
● Rather than replicate all data to all machines
● Replicate data to selected machines
– Good: localized data
– Good: better caching
– Hard: Joins across shards
– Hard: Management
– Hard: Failure
● Parallel RDBMS = $$$
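One simple way to decide which machine owns which data is to hash a key and take the result modulo the number of shards. The sketch below assumes that scheme. The slides do not say how shards were chosen, and the key names are made up.

```python
import hashlib


def shard_for(key, shard_count):
    """Map a key to a shard number in [0, shard_count) using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def group_by_shard(keys, shard_count):
    """Show which keys land on which shard."""
    placement = {shard: [] for shard in range(shard_count)}
    for key in keys:
        placement[shard_for(key, shard_count)].append(key)
    return placement


keys = ["story:1", "story:2", "story:3", "story:4", "story:5"]
placement = group_by_shard(keys, shard_count=4)
```

A lookup for a single key touches exactly one shard, which is where the locality and caching benefits come from. A join or search across keys has to visit several shards. With plain modulo hashing, changing `shard_count` moves most keys, which is one reason management is listed as hard.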
Life lesson
“applications that are traditionally used to”
● How did we solve our problem?
– We switched to Lucene
● A tool designed for full text search
● Eventually sharded Lucene
● When you hold a hammer:
– Not everything is a nail
● Understand what you really need
● Understand reasonable and feasible
Big Data Challenge #2
● Large high volume web site
● Process its logs and produce reports
● Big Data challenges
– Storage: Store GB of data a day for years
– Analysis, visualization: support reports of existing system
● Qualifiers
– Reasonable: daily reports should take less than one day to produce
– Honestly, it needs to be faster to allow for reruns, etc.
Enter Hadoop
● Hadoop (0.17.X) was fairly new at the time
● Use cases of MapReduce were emerging
– Hive had just been open sourced by Facebook
● Many database vendors were calling
map/reduce “a step backwards”
– They had solved these problems “in the 80s”
Hadoop file system HDFS
● Distributed redundant storage
– We were NoSPOF (no single point of failure) across the board
● Commodity hardware vs buying a big
SAN/NAS device
● We already had processes that scp'ed data to
servers; these were easily adapted to place the data
into HDFS
● HDFS: easy and huge
Map Reduce
● As a proof of concept I wrote a group/count
application that would group and count on a column
in our logs
● Was able to show linear speedup with
increased nodes
Winning (why Hadoop kicked arse)
● Data capture, curation
– Bulk loading data into an RDBMS incurs index maintenance and other overhead
– Bulk loading into Hadoop is a network copy
● Data analysis
– The RDBMS would not parallelize queries (even across
partitions)
– Some queries could cause locking and
performance degradation
http://hardcourtlessons.blogspot.com/2010/05/definition-of-winning.html
Enter Hive
● Capture: NO
● Curation: YES
● Storage: YES
● Search: YES
● Sharing: YES
● Transfer: NO
● Analysis: YES
● Visualization: NO
Logging from Apache to Hive
Sample program: group and count
Source data looks like
jan 10 2009:.........:200:/index.htm
jan 10 2009:.........:200:/index.htm
jan 10 2009:.........:200:/igloo.htm
jan 10 2009:.........:200:/ed.htm
In case you're the math type
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
Map(k1, v1) -> list(k2, v2)
Reduce(k2, list(v2)) -> list(v3)
A mapper
A reducer
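The following is a minimal sketch of a group-and-count mapper and reducer for the sample log lines shown earlier. It is not the code from the original slides. It assumes the URL is the last colon-separated field, and `run_job` is a hypothetical stand-in for the shuffle and sort that the framework would normally perform.

```python
from collections import defaultdict


def mapper(line):
    """Map(k1, v1) -> list(k2, v2): emit (url, 1) for one log line."""
    # Assumption: the URL is the last colon-separated field of the line.
    url = line.rstrip("\n").rsplit(":", 1)[-1]
    yield url, 1


def reducer(url, counts):
    """Reduce(k2, list(v2)) -> list(v3): sum the ones emitted for a URL."""
    yield url, sum(counts)


def run_job(lines):
    """Stand-in for the framework: group mapper output by key, then reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    results = {}
    for key in sorted(grouped):
        for out_key, out_value in reducer(key, grouped[key]):
            results[out_key] = out_value
    return results


sample = [
    "jan 10 2009:.........:200:/index.htm",
    "jan 10 2009:.........:200:/index.htm",
    "jan 10 2009:.........:200:/igloo.htm",
    "jan 10 2009:.........:200:/ed.htm",
]
counts = run_job(sample)
print(counts)
```

For the four sample lines this yields a count of 2 for /index.htm and 1 each for /igloo.htm and /ed.htm. Because addition is associative, the same reducer could also serve as the combine step.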
Hive style
hive> CREATE TABLE web_data (
  sdate STRING, stime STRING, envvar STRING,
  remotelogname STRING, servername STRING, localip STRING,
  literaldash STRING, method STRING, url STRING,
  querystring STRING, status STRING, litteralzero STRING,
  bytessent INT, header STRING, timetoserver INT,
  useragent STRING, cookie STRING, referer STRING);
hive> SELECT url, count(1) FROM web_data GROUP BY url;
Life lessons volume 2
● Feasible and reasonable were completely
different than in case #1
● Query time went from seconds to hours
● Size went from GB to TB
● Feasible went from 4 nodes to 15
Big Data Challenge #3
(work at m6d)
● Large high volume ad serving site
● Process them and produce reports
● Support data science and biz-dev users
● Big Data challenges
– Storage: Store and process terabytes of data
● Complex data types, encoded data
– Analysis, visualization: support reports of existing system
● Qualifiers
– Reasonable: ad hoc, daily, hourly, weekly, monthly reports
Data data everywhere
● We have to use cookies in many places
● Cookies have limited size
● Cookies have complex values encoded
Some encoding tricks we might do
LastSeen: long (64 bits)
Segment: int (32 bits)
Literal ','
Segment: int (32 bits)
Zipcode (32 bits)
● Choose a relevant epoch and use a byte
● Use a byte for the number of segments
● Use a 4-byte radix-encoded number
● ... and so on
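A rough sketch of what such a compact layout might look like, using Python's `struct` module. Everything specific here is an assumption for illustration: the epoch date, the field order, the one-byte day offset, and the function names. It does not implement the radix encoding mentioned above, and it is not m6d's actual format.

```python
import base64
import struct
from datetime import date

EPOCH = date(2011, 1, 1)  # hypothetical "relevant epoch"


def encode_cookie(last_seen, segments, zipcode):
    """Pack last-seen day, segment list and zipcode into a short cookie string."""
    days = (last_seen - EPOCH).days
    if not 0 <= days <= 255 or len(segments) > 255:
        raise ValueError("value does not fit in one byte")
    # Layout: 1 byte days since epoch, 1 byte segment count,
    # N unsigned 32-bit segments, 1 unsigned 32-bit zipcode (big-endian).
    packed = struct.pack(
        f">BB{len(segments)}II", days, len(segments), *segments, zipcode
    )
    return base64.urlsafe_b64encode(packed).decode("ascii")


def decode_cookie(value):
    """Reverse encode_cookie, returning (last_seen, segments, zipcode)."""
    raw = base64.urlsafe_b64decode(value.encode("ascii"))
    days, count = struct.unpack_from(">BB", raw, 0)
    segments = list(struct.unpack_from(f">{count}I", raw, 2))
    (zipcode,) = struct.unpack_from(">I", raw, 2 + 4 * count)
    return date.fromordinal(EPOCH.toordinal() + days), segments, zipcode


cookie = encode_cookie(date(2011, 3, 21), [215286, 195785], 10001)
decoded = decode_cookie(cookie)
```

The count byte replaces the literal ',' separators, and the one-byte day offset replaces the 64-bit timestamp, at the cost of a limited date range. Values packed this way are opaque to SQL, which is the problem the next slides address.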
Getting at embedded data
● Write N UDFs for each object, like:
– getLastSeenForCookie(String)
– getZipcodeForCookie(String)
– ...
● But this would have made a huge toolkit
● Traditionally you do not want to break first
normal form
Struct solution
● Hive has a struct like a C struct
● A struct is a list of name-value pairs
● Structs can contain other structs
● This gives us the serious ability to do object
mapping
● UDFs can return struct types
Using a UDF
● ADD JAR myjar.jar;
● CREATE TEMPORARY FUNCTION parseCookie AS 'com.md6.ParseCookieIntoStruct';
● SELECT parseCookie(encodedColumn).lastSeen FROM mydata;
LATERAL VIEW + EXPLODE
SELECT client_id, entry.spendcreativeid
FROM datatable
LATERAL VIEW explode(AdHistoryAsStruct(ad_history).adEntrylist) entryList AS entry
WHERE hit_date=20110321 AND mid=001406;
3214498023360851706 215286
3214498023360851706 195785
3214498023360851706 128640
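What LATERAL VIEW with explode does to each row can be mimicked in a few lines. This is an illustrative sketch, not Hive's implementation. The `ad_entries` field name is hypothetical, and the values are the three result rows shown above.

```python
def lateral_view_explode(rows, list_field, alias):
    """Yield one output row per element of row[list_field]."""
    for row in rows:
        # A row whose list is empty produces no output rows at all.
        for element in row[list_field]:
            flattened = {k: v for k, v in row.items() if k != list_field}
            flattened[alias] = element
            yield flattened


rows = [
    {
        "client_id": 3214498023360851706,
        "ad_entries": [
            {"spendcreativeid": 215286},
            {"spendcreativeid": 195785},
            {"spendcreativeid": 128640},
        ],
    }
]

result = [
    (r["client_id"], r["entry"]["spendcreativeid"])
    for r in lateral_view_explode(rows, "ad_entries", "entry")
]
```

One input row carrying a list of three structs becomes three output rows that share the same `client_id`, which is how a nested value can be queried as if it were an ordinary table.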
All that data might boil down to...
Life lessons volume #3
● Big data is not only batch or real-time
● Big data is feedback loops
– Machine learning
– Ad hoc performance checks
● Generated SQL tables periodically synced to
web server
● Data shared between sections of an
organization to make business decisions

The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 

Big data nyu

  • 1. Big Data in the “Real World” Edward Capriolo
  • 2. What is “big data”? ● Big data is a collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications. ● The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization. http://en.wikipedia.org/wiki/Big_data
  • 3. Big Data Challenges ● The challenges include: – capture – curation – storage – search – sharing – transfer – analysis – visualization – large – complex
  • 4. What is “big data” exactly? ● What is considered "big data" varies depending on the capabilities of the organization managing the set, and on the capabilities of the applications that are traditionally used to process and analyze the data set in its domain. ● As of 2012, limits on the size of data sets that are feasible to process in a reasonable amount of time were on the order of exabytes of data. http://en.wikipedia.org/wiki/Big_data
  • 5. Big Data Qualifiers ● varies ● capabilities ● traditionally ● feasibly ● reasonably ● [somptha]bytes of data
  • 6. My first “big data” challenge ● Real time news delivery platform ● Ingest news as text and provide full text search ● Qualifiers – Reasonable: Real time search was < 1 second – Capabilities: small company, <100 servers ● Big Data challenges – Storage: roughly 300GB for 60 days of data – Search: searches of thousands of terms
  • 7.
  • 8. Traditionally ● Data was placed in MySQL ● MySQL full text search ● Easy to insert ● Easy to search ● Worked great! – Until it got real world load
  • 9. Feasibly in hardware (circa 2008) ● 300GB data and 16GB RAM ● ...MySQL stores an in-memory binary tree of the keys. Using this tree, MySQL can calculate the count of matching rows with reasonable speed. But speed declines logarithmically as the number of terms increases. ● The platters revolve at 15,000 RPM or so, which works out to 250 revolutions per second. Average latency is listed as 2.0ms ● As the speed of an HDD increases, the power it takes to run it increases disproportionately http://serverfault.com/questions/190451/what-is-the-throughput-of-15k-rpm-sas-drive http://thessdguy.com/why-dont-hdds-spin-faster-than-15k-rpm/ http://dev.mysql.com/doc/internals/en/full-text-search.html
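The drive figures quoted on this slide can be checked with a little arithmetic. The following is a minimal sketch, not part of the original talk: the function names are illustrative, and the 3.5 ms average seek time is an assumed value chosen only to show how few random reads per second a single spindle can serve.

```python
def rotational_latency_ms(rpm: float) -> float:
    """Average rotational latency: the time for half a revolution, in ms."""
    revolutions_per_second = rpm / 60.0
    ms_per_revolution = 1000.0 / revolutions_per_second
    return ms_per_revolution / 2.0


def max_random_reads_per_second(rpm: float, avg_seek_ms: float) -> float:
    """Rough upper bound on random reads per second for one spindle.

    Each random read pays one average seek plus one average rotational
    latency; transfer time is ignored in this simplified model.
    """
    return 1000.0 / (avg_seek_ms + rotational_latency_ms(rpm))


if __name__ == "__main__":
    # 15,000 RPM -> 250 revolutions/s -> 4 ms per revolution -> 2.0 ms average
    print(rotational_latency_ms(15_000))
    # avg_seek_ms=3.5 is an assumption, not a number from the slides
    print(max_random_reads_per_second(15_000, avg_seek_ms=3.5))
```

Under these assumptions one disk serves on the order of a couple of hundred random reads per second, which is why 300GB of data behind 16GB of RAM struggles once most lookups miss the cache.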
  • 10. “Big Data” is about giving up things ● In theoretical computer science, the CAP theorem states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees: – Consistency (all nodes see the same data at the same time) – Availability (a guarantee that every request receives a response about whether it was successful or failed) – Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system) http://en.wikipedia.org/wiki/CAP_theorem http://www.youtube.com/watch?v=I4yGBfcODmU
  • 11. Multi-Master solution ● Write the data to N MySQL servers and round robin reads between them – Good: More machines to serve reads – Bad: Requires Nx hardware – Hard: Keeping machines loaded with the same data, especially auto-generated IDs – Hard: What about when the data does not even fit on a single machine?
  • 12.
  • 13. Sharding ● Rather than replicate all data to all machines ● Replicate data to selected machines – Good: localized data – Good: better caching – Hard: Joins across shards – Hard: Management – Hard: Failure ● Parallel RDBMS = $$$
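One common way to decide which "selected machines" hold a record is to hash its key. The sketch below is an illustration under stated assumptions, not the scheme used in the talk: `shard_for` and `replicas_for` are hypothetical helper names, and SHA-256 is just a convenient stable hash.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash.

    Python's built-in hash() is randomized per process for strings,
    so a digest is used to get the same answer on every machine.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def replicas_for(key: str, num_shards: int, replication: int = 2) -> list[int]:
    """Return the shards holding a key: its primary plus the next neighbours."""
    first = shard_for(key, num_shards)
    return [(first + i) % num_shards for i in range(replication)]


if __name__ == "__main__":
    for doc_id in ("story-1", "story-2", "story-3"):
        print(doc_id, replicas_for(doc_id, num_shards=4))
```

With plain modulo hashing, changing `num_shards` remaps most keys, which is one concrete face of the "Hard: Management" bullet.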
  • 14. Life lesson “applications that are traditionally used to” ● How did we solve our problem? – We switched to Lucene ● A tool designed for full text search ● Eventually sharded Lucene ● When you hold a hammer: – Not everything is a nail ● Understand what you really need ● Understand reasonable and feasible
  • 15. Big data Challenge 2 ● Large high volume web site ● Process its logs and produce reports ● Big Data challenges – Storage: Store GB of data a day for years – Analysis, visualization: support reports of existing system ● Qualifiers – Reasonable to want daily reports in less than one day – Honestly needs to be faster / reruns etc
  • 16. Enter Hadoop ● Hadoop (0.17.X) was fairly new at the time ● Use cases of map reduce were emerging – Hive had just been open sourced by Facebook ● Many database vendors were calling map/reduce “a step backwards” – They had solved these problems “in the 80s”
  • 17. Hadoop file system HDFS ● Distributed redundant storage – We were NoSPOF (no single point of failure) across the board ● Commodity hardware vs buying a big SAN/NAS device ● We already had processes that scp'ed data to servers, easily adapted to placing them into HDFS ● HDFS easy huge
  • 18. Map Reduce ● As a proof of concept I wrote a group/count application that would group/count on a column in our logs ● Was able to show linear speed up with increased nodes
  • 19. Winning (why Hadoop kicked arse) ● Data capture, curation – bulk loading data into RDBMS (indexes, overhead) – bulk loading into Hadoop is network copy ● Data analysis – RDBMS would not parallelize queries (even across partitions) – Some queries could cause locks and performance degradation http://hardcourtlessons.blogspot.com/2010/05/definition-of-winning.html
  • 20. Enter Hive ● Capture: NO ● Curation: YES ● Storage: YES ● Search: YES ● Sharing: YES ● Transfer: NO ● Analysis: YES ● Visualization: NO
  • 22. Sample program: group and count ● Source data looks like: jan 10 2009:.........:200:/index.htm, jan 10 2009:.........:200:/index.htm, jan 10 2009:.........:200:/igloo.htm, jan 10 2009:.........:200:/ed.htm
  • 23. In case you're the math type ● (input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output) ● Map(k1, v1) -> list(k2, v2) ● Reduce(k2, list(v2)) -> list(v3)
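The map -> combine -> reduce pipeline above can be simulated in a single process. This is a sketch for illustration only, not the Hadoop job from the talk: it uses the four sample log lines from slide 22, treats the last colon-separated field as the URL, and reuses the reducer as the combiner. For readability the reducer yields (key, value) pairs rather than bare values.

```python
from collections import defaultdict

# Sample log lines as shown on slide 22 (the dotted field is elided there too).
LINES = [
    "jan 10 2009:.........:200:/index.htm",
    "jan 10 2009:.........:200:/index.htm",
    "jan 10 2009:.........:200:/igloo.htm",
    "jan 10 2009:.........:200:/ed.htm",
]


def map_fn(_offset, line):
    """Map(k1, v1) -> list(k2, v2): emit (url, 1) for each log line."""
    url = line.rsplit(":", 1)[-1]
    yield url, 1


def reduce_fn(url, counts):
    """Reduce(k2, list(v2)): sum the counts collected for one url."""
    yield url, sum(counts)


def run(splits):
    """Run map, a per-split combine, a shuffle, and the final reduce."""
    shuffled = defaultdict(list)
    for split in splits:  # each split stands in for one mapper
        local = defaultdict(list)
        for offset, line in enumerate(split):
            for key, value in map_fn(offset, line):
                local[key].append(value)
        for key, values in local.items():  # combine: pre-aggregate locally
            for k, v in reduce_fn(key, values):
                shuffled[k].append(v)
    result = {}
    for key, values in sorted(shuffled.items()):  # shuffle/sort, then reduce
        for k, v in reduce_fn(key, values):
            result[k] = v
    return result


if __name__ == "__main__":
    print(run([LINES[:2], LINES[2:]]))
```

Reusing the reducer as the combiner works here because summing is associative and commutative; that is not true of every reduce function.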
  • 26. Hive style ● hive> CREATE TABLE web_data (sdate STRING, stime STRING, envvar STRING, remotelogname STRING, servername STRING, localip STRING, literaldash STRING, method STRING, url STRING, querystring STRING, status STRING, litteralzero STRING, bytessent INT, header STRING, timetoserver INT, useragent STRING, cookie STRING, referer STRING); ● SELECT url, count(1) FROM web_data GROUP BY url;
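The same GROUP BY can be tried locally without a Hadoop cluster. The sketch below uses Python's built-in sqlite3 module purely as a stand-in for Hive (an assumption made for illustration; Hive would compile the query into map/reduce jobs instead), with a cut-down three-column version of the web_data table and the sample rows from slide 22.

```python
import sqlite3

# In-memory SQLite database standing in for the Hive warehouse.
conn = sqlite3.connect(":memory:")

# Reduced version of the slide's table: only the columns the query needs.
conn.execute("CREATE TABLE web_data (sdate TEXT, status TEXT, url TEXT)")

rows = [
    ("jan 10 2009", "200", "/index.htm"),
    ("jan 10 2009", "200", "/index.htm"),
    ("jan 10 2009", "200", "/igloo.htm"),
    ("jan 10 2009", "200", "/ed.htm"),
]
conn.executemany("INSERT INTO web_data VALUES (?, ?, ?)", rows)

# The query text is the one from the slide, unchanged.
counts = dict(conn.execute("SELECT url, count(1) FROM web_data GROUP BY url"))

if __name__ == "__main__":
    print(counts)
```

The point of the comparison is that one declarative line replaces the hand-written mapper and reducer.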
  • 27. Life lessons volume 2 ● Feasible and reasonable were completely different than in case #1 ● Query from seconds -> hours ● Size from GB to TB ● Feasible from 4 nodes to 15
  • 28. Big Data Challenge #3 (work at m6d) ● Large high volume ad serving site ● Process them and produce reports ● Support data science and biz-dev users ● Big Data challenges – Storage: Store and process terabytes of data ● Complex data types, encoded data – Analysis, visualization: support reports of existing system ● Qualifiers – Reasonable: ad hoc, daily, hourly, weekly, monthly reports
  • 29. Data data everywhere ● We have to use cookies in many places ● Cookies have limited size ● Cookies have complex values encoded
  • 30. Some encoding tricks we might do LastSeen: long (64 bits) Segment: int (32 bits) Literal ',' Segment: int (32 bits) Zipcode (32 bits) ● Choose a relevant epoch and use a byte ● Use a byte for # of segments ● Use a 4 byte radix encoded number ● ... and so on
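To make the space saving concrete, here is a minimal sketch of two of these tricks using Python's struct module. The slide does not give the exact layout, so everything specific here is an assumption: the epoch date, storing LastSeen as a 2-byte day offset, and the names `encode_cookie` and `decode_cookie` are all hypothetical.

```python
import struct
from datetime import date, timedelta

# Hypothetical "relevant epoch": dates are stored as days since this day.
EPOCH = date(2011, 1, 1)


def encode_cookie(last_seen: date, segments: list[int], zipcode: int) -> bytes:
    """Pack LastSeen as a 2-byte day offset and prefix the segment list with
    a 1-byte count instead of separating segments with a literal ','."""
    days = (last_seen - EPOCH).days
    if not 0 <= days <= 0xFFFF:
        raise ValueError("last_seen is outside the range of a 2-byte offset")
    if len(segments) > 0xFF:
        raise ValueError("too many segments for a 1-byte count")
    header = struct.pack(">HB", days, len(segments))
    body = struct.pack(f">{len(segments)}I", *segments)
    return header + body + struct.pack(">I", zipcode)


def decode_cookie(blob: bytes) -> tuple[date, list[int], int]:
    """Inverse of encode_cookie."""
    days, count = struct.unpack_from(">HB", blob, 0)
    segments = list(struct.unpack_from(f">{count}I", blob, 3))
    (zipcode,) = struct.unpack_from(">I", blob, 3 + 4 * count)
    return EPOCH + timedelta(days=days), segments, zipcode


if __name__ == "__main__":
    blob = encode_cookie(date(2011, 3, 21), [7, 42], 10001)
    print(len(blob), decode_cookie(blob))
```

With two segments, the fixed layout listed on the slide takes 8 + 4 + 1 + 4 + 4 = 21 bytes, while this packing takes 15. A real cookie value would still need a text-safe encoding on top, which is omitted here.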
  • 31. Getting at embedded data ● Write N UDFs for each object like: – getLastSeenForCookie(String) – getZipcodeForCookie(String) – ... ● But this would have made a huge toolkit ● Traditionally you do not want to break first normal form
  • 32. Struct solution ● Hive has a struct like a C struct ● A struct is a list of name-value pairs ● Structs can contain other structs ● This gives us the serious ability to do object mapping ● UDFs can return struct types
  • 33. Using a UDF ● add jar myjar.jar; ● Create temporary function parseCookie as 'com.md6.ParseCookieIntoStruct' ; ● Select parseCookie(encodedColumn).lastSeen from mydata;
  • 34. LATERAL VIEW + EXPLODE ● SELECT client_id, entry.spendcreativeid FROM datatable LATERAL VIEW explode (AdHistoryAsStruct(ad_history).adEntrylist) entryList as entry where hit_date=20110321 AND mid=001406; ● Result rows: 3214498023360851706 215286, 3214498023360851706 195785, 3214498023360851706 128640
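What LATERAL VIEW with explode does to a row can be modelled in a few lines. This sketch only imitates the semantics in plain Python and is not Hive code: the dictionaries stand in for rows whose ad history has already been parsed into a list of structs, and client_id 42 is a made-up row added to show that a row with an empty list produces no output.

```python
from typing import Callable, Iterable, Iterator


def lateral_view_explode(
    rows: Iterable[dict], list_of: Callable[[dict], list]
) -> Iterator[tuple[dict, dict]]:
    """Yield one (row, entry) pair per element of each row's list.

    Rows whose list is empty yield nothing, matching a LATERAL VIEW
    without the OUTER keyword.
    """
    for row in rows:
        for entry in list_of(row):
            yield row, entry


rows = [
    {
        "client_id": 3214498023360851706,
        "ad_history": [
            {"spendcreativeid": 215286},
            {"spendcreativeid": 195785},
            {"spendcreativeid": 128640},
        ],
    },
    {"client_id": 42, "ad_history": []},  # hypothetical row with no history
]

result = [
    (row["client_id"], entry["spendcreativeid"])
    for row, entry in lateral_view_explode(rows, lambda r: r["ad_history"])
]

if __name__ == "__main__":
    for client_id, creative_id in result:
        print(client_id, creative_id)
```

One input row with three ad entries becomes three output rows, which is the shape of the result shown on the slide.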
  • 35. All that data might boil down to...
  • 36. Life lessons volume #3 ● Big data is not only batch or real-time ● Big data is feedback loops – Machine learning – Ad hoc performance checks ● Generated SQL tables periodically synced to web server ● Data shared between sections of an organization to make business decisions