Dušan Zamurović
@codecentricRS



Name: Dušan Zamurović
Where I come from?
◦ codecentric Novi Sad



What I do?
◦ Java web-app background
◦ ♥ JavaScript ♥
 Ajax with DWR lib

◦ Android
◦ currently Big Data (reporting QA)








me
Big Data
Map/Reduce algorithm
Hadoop platform
Pig language
Showcase
◦ Java Map/Reduce implementation
◦ Pig implementation



Conclusion
A revolution that will transform how we live, work,
and think.


3 Vs of big data
◦ Volume
◦ Variety
◦ Velocity



Everyday use-cases
◦ Beautiful
◦ Useful
◦ Funny



The principal characteristic
Studies report
◦ 1.2 trillion gigabytes of new data was created
worldwide in 2011 alone
◦ From 2005 to 2020, the digital universe will grow
by a factor of 300
◦ By 2020 the digital universe will amount to 40
trillion gigabytes (more than 5,200 gigabytes for
every man, woman, and child in 2020)


The biggest growth – unstructured data
◦ Documents
◦ Web logs
◦ Sensor data
◦ Videos and photos
◦ Medical devices
◦ Social media


>90% of this Big Data is unstructured
Analytic value?
◦ 33% valuable info by 2020


Generated at high speed
Needs real-time processing



Example I



◦ Financial world
◦ Thousands or millions of transactions


Example II
◦ Retail
◦ Analyze click streams to offer recommendations
Value of Big Data is potentially great but can be
released only with the right combination of
people, processes and technologies.


…unlock significant value by making
information transparent and usable at much
higher frequency


Measuring heartbeat of a city - Rio de Janeiro



More examples
◦ Product development – most valuable features
◦ Manufacturing – indicators of quality problems
◦ Distribution – optimize inventory and supply chains
◦ Sales – account targeting, resource allocation
 Beer and diapers

Possible issues?

◦ Privacy, security, intellectual property, liability…
"Map/Reduce is a programming model and an
associated implementation for processing and
generating large data sets. Users specify a map
function that processes a key/value pair to
generate a set of intermediate key/value pairs,
and a reduce function that merges all
intermediate values associated with the same
intermediate key."
- research publication
http://research.google.com/archive/mapreduce.html
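
To make the map and reduce contract above concrete, here is a minimal, illustrative word-count pair written against the Hadoop Java API that the showcase later in this deck also uses. It is not part of the original slides; all class and variable names below are mine.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountExample {

  // map: (file offset, line of text) -> (word, 1) for every word in the line
  public static class WordCountMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      for (String token : line.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);   // emit an intermediate key/value pair
        }
      }
    }
  }

  // reduce: (word, [1, 1, ...]) -> (word, total count); merges all values for one key
  public static class WordCountReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) {
        sum += c.get();
      }
      context.write(word, new IntWritable(sum));
    }
  }
}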


In the beginning, there was Nutch



Which problems does it address?
◦ Big Data
◦ Not fit for RDBMS
◦ Computationally extensive



Hadoop && RDBMS

◦ “Get data to process” or “send code where data is”
◦ Designed to run on large number of machines
◦ Separate storage


Distributed File System
◦ Designed for commodity hardware
◦ Highly fault-tolerant
◦ Relaxed POSIX
 To enable streaming access to file system data



Assumptions and Goals
◦ Hardware failure
◦ Streaming data access
◦ Large data sets
◦ Write-once-read-many
◦ Move computation, not data


NameNode
◦ Master server, central component
◦ HDFS cluster has single NameNode
◦ Manages client’s access
◦ Keeps track where data is kept
◦ Single point of failure

Secondary NameNode
◦ Optional component
◦ Checkpoints of the namespace
 Does not provide any real redundancy


DataNode
◦ Stores data in the file system
◦ Talks to NameNode and responds to requests
◦ Talks to other DataNodes
 Data replication



TaskTracker
◦ Should be where DataNode is
◦ Accepts tasks (Map, Reduce, Shuffle…)
◦ Set of slots for tasks
◦ ♥__ ♥__ ♥__ ________ ♥_ ♥ ♥ ♥__________________


JobTracker

◦ Farms tasks to specific nodes in the cluster
◦ Point of failure for MapReduce



How it goes?
1. Client submits jobs → JobTracker
2. JobTracker, whereis → NameNode
3. JobTracker locates TaskTracker
4. JobTracker, tasks → TaskTracker
5. TaskTracker ♥__ ♥__ ♥__
   1. Job failed, TaskTracker informs, JobTracker decides
   2. Job done, JobTracker updates status
6. Client can poll JobTracker for information
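
As a rough illustration of steps 1 and 6 above (the client submits the job and can later poll for status), a minimal job driver for the hypothetical word-count classes sketched earlier might look like this; paths and names here are assumptions, not part of the slides.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");           // Hadoop 1.x / JobTracker-era style

    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCountExample.WordCountMapper.class);
    job.setReducerClass(WordCountExample.WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not exist yet

    // Submits the job to the JobTracker and polls it until the job finishes
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}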


Platform for analyzing large data sets
◦ Language – Pig Latin
◦ High level approach
◦ Compiler
◦ Grunt shell

Pig compared to SQL
◦ Lazy evaluation
◦ Procedural language
◦ More like an execution plan


Pig Latin statements
◦ A relation is a bag
◦ A bag is a collection of tuples
◦ A tuple is an ordered set of fields
◦ A field is a piece of data
◦ A relation is referenced by name, i.e. alias

A = LOAD 'student' USING PigStorage() AS
(name:chararray, age:int, gpa:float);
DUMP A;
(John,18,4.0F)
(Mary,19,3.8F)
(Bill,20,3.9F)
(Joe,18,3.8F)


Data types
◦ Simple
 int – signed 32-bit integer
 long – signed 64-bit integer
 float – 32-bit floating point
 double – 64-bit floating point
 chararray – UTF-8 string
 bytearray – blob
 boolean – since Pig 0.10
 datetime

◦ Complex
 tuple – an ordered set of fields, e.g. (21,32)
 bag – a collection of tuples, e.g. {(21,32),(32,43)}
 map – a set of key value pairs, e.g. [pig#latin]


Data structure and defining schemas
◦ Why define a schema?
◦ Where to define a schema?
◦ How to define a schema?

/* data types not specified */
a = LOAD '1.txt' AS (a0, b0);
a: {a0: bytearray,b0: bytearray}
/* number of fields not known */
a = LOAD '1.txt';
a: Schema for a unknown








Arithmetic: +, -, *, /, %, ? :
Boolean: AND, OR, NOT
Cast
Comparison: ==, !=, <, >, <=, >=, matches
Type construction: (), {}, [] incl. eq. functions
Relational
◦ GROUP
◦ DEFINE
◦ FILTER
◦ FOREACH
◦ JOIN
◦ UNION
◦ STORE
◦ LOAD
◦ SPLIT


Eval functions
◦ AVG, MAX, MIN, COUNT, SUM, …

Load/Store functions
◦ BinStorage
◦ JsonLoader, JsonStorage
◦ PigStorage

Math functions
◦ ABS, COS, …, EXP, RANDOM, ROUND, …

String functions
◦ TRIM, LOWER, SUBSTRING, REPLACE, …

Datetime functions
◦ *Between, Get*, …

Tuple, Bag, Map functions
◦ TOTUPLE, TOBAG, TOMAP


User Defined Functions
◦ Java, Python, JavaScript, Ruby, Groovy



How to write a UDF?
◦ Eval function extends EvalFunc<something>
◦ Load function extends LoadFunc
◦ Store function extends StoreFunc



How to use a UDF?

◦ Register
◦ Define the name of the UDF if you like
◦ Call it
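
As a minimal illustration of the first case, an eval function that just upper-cases a chararray could look like the sketch below. The class here is mine, purely for illustration; the deck’s own, fuller UDF examples (JsonLoader, AverageWeight) follow in the showcase.

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical eval UDF: returns the first field of the input tuple, upper-cased.
public class Upper extends EvalFunc<String> {
  @Override
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0 || input.get(0) == null) {
      return null;
    }
    return input.get(0).toString().toUpperCase();
  }
}

In a script it would then be registered and called much like AVG_WEIGHT in the final showcase script, e.g. REGISTER my-udf.jar; followed by b = FOREACH a GENERATE Upper(name);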



Imaginary social network
Lots of users…
 … with their friends, girlfriends, boyfriends, wives,
husbands, mistresses, etc…



New relationship arises…
◦ … but new friend is not shown in news feed



Where are his/her activities?
◦ Hidden, marked as not important



Find out the value of the relationship
Monitor and log user activities
◦ For each user, of course
◦ Each activity has some value (event weight)
◦ Record user’s activities
◦ Store those logs in HDFS
◦ Analyze those logs from time to time
◦ Calculate needed values
◦ Show only the activities of “important” friends


Events recorded in JSON format

{
  "timestamp": 1341161607860,
  "sourceUser": "marry.lee",
  "targetUser": "ruby.blue",
  "eventName": "VIEW_PHOTO",
  "eventWeight": 1
}
public enum EventType {
VIEW_DETAILS(3),
VIEW_PROFILE(10),
VIEW_PHOTO(1),
COMMENT(2),
COMMENT_LIKE(1),
WALL_POST(3),
MESSAGE(1);
…
}
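
The map function and Pig loader that follow both call MyJsonParser.parse(...), a helper whose implementation is not shown on the slides. A plausible sketch is given below, assuming it simply flattens one event into a positional String[] in document order (timestamp, sourceUser, targetUser, eventName, eventWeight), which matches the indexes used later (tokens[1], tokens[2], tokens[4]); the JSON library choice here is mine.

import java.io.IOException;
import java.util.Map;
import org.apache.hadoop.io.Text;
import org.codehaus.jackson.map.ObjectMapper;   // Jackson 1.x; any JSON library would do

// Hypothetical helper: turns one JSON event record into a positional String[]:
// [0]=timestamp, [1]=sourceUser, [2]=targetUser, [3]=eventName, [4]=eventWeight
public final class MyJsonParser {
  private static final ObjectMapper MAPPER = new ObjectMapper();

  public static String[] parse(Text jsonRecord) throws IOException {
    @SuppressWarnings("unchecked")
    Map<String, Object> json = MAPPER.readValue(jsonRecord.toString(), Map.class);
    return new String[] {
        String.valueOf(json.get("timestamp")),
        String.valueOf(json.get("sourceUser")),
        String.valueOf(json.get("targetUser")),
        String.valueOf(json.get("eventName")),
        String.valueOf(json.get("eventWeight"))
    };
  }
}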
static public class InteractionMap extends
    Mapper<LongWritable, Text, Text, InteractionWritable> {
  @Override
  protected void map(LongWritable offset,
      Text text, Context context) … {
    …
  }
}

// reduce() belongs in a separate Reducer subclass; the slides only show the
// method, so the class name below is illustrative
static public class InteractionReduce extends
    Reducer<Text, InteractionWritable, Text, Text> {
  @Override
  protected void reduce(Text token,
      Iterable<InteractionWritable> interactions,
      Context context) … {
    …
  }
}
void map(LongWritable offset, Text text, Context context) {
  String[] tokens = MyJsonParser.parse(text);
  String sourceUser = tokens[1];
  String targetUser = tokens[2];
  int eventWeight = Integer.parseInt(tokens[4]);
  context.write(new Text(sourceUser),
      new InteractionWritable(targetUser, eventWeight));
}
void reduce(Text token, Iterable<InteractionWritable> iActions,
    Context context) … {
  Map<Text, InteractionValuesWritable> iActionsGroup =
      new HashMap<Text, InteractionValuesWritable>();
  Iterator<InteractionWritable> iActionsIterator = iActions.iterator();

  while (iActionsIterator.hasNext()) {
    InteractionWritable iAction = iActionsIterator.next();
    Text targetUser = new Text(iAction.getTargetUser().toString());
    int weight = iAction.getEventWeight().get();
    int count = 1;

    InteractionValuesWritable iActionValues = iActionsGroup.get(targetUser);
    if (iActionValues != null) {
      weight += iActionValues.getWeight().get();
      count = iActionValues.getCount().get() + 1;
    }
    iActionsGroup.put(targetUser,
        new InteractionValuesWritable(weight, count));
  }

  List orderedInteractions = sortInteractionsByWeight(iActionsGroup);
  for (Entry entry : orderedInteractions) {
    InteractionValuesWritable value = entry.getValue();
    String resLine = … // entry.key + value.weight + value.count
    context.write(token, new Text(resLine));
  }
}
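
The reducer delegates ordering to sortInteractionsByWeight, which the slides do not show. One possible implementation, assuming descending order by accumulated weight (the order visible in the sample output on the next slide), would be a method placed next to reduce():

// Possible helper; imports needed: java.util.* and org.apache.hadoop.io.Text.
// Orders the per-target aggregates by total weight, highest first.
static List<Map.Entry<Text, InteractionValuesWritable>> sortInteractionsByWeight(
    Map<Text, InteractionValuesWritable> interactions) {
  List<Map.Entry<Text, InteractionValuesWritable>> entries =
      new ArrayList<Map.Entry<Text, InteractionValuesWritable>>(interactions.entrySet());
  Collections.sort(entries, new Comparator<Map.Entry<Text, InteractionValuesWritable>>() {
    @Override
    public int compare(Map.Entry<Text, InteractionValuesWritable> a,
                       Map.Entry<Text, InteractionValuesWritable> b) {
      // higher accumulated weight first
      return b.getValue().getWeight().get() - a.getValue().getWeight().get();
    }
  });
  return entries;
}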
casie.keller petar.petrovic 97579 32554
casie.keller marry.lee 97284 32094
casie.keller jane.doe 97247 32400
casie.keller domenico.quatro-formaggi 96712 32106
casie.keller esmeralda.aguero 96665 32251
casie.keller jason.bourne 96499 32043
casie.keller jose.miguel 96304 31927
casie.keller steve.smith 95929 32267
casie.keller john.doe 95664 31996
casie.keller swatka.mawa 95421 31785
casie.keller lee.young 95400 31758
casie.keller ruby.blue 95132 32181
domenico.quatro-formaggi jane.doe 97442 32492
domenico.quatro-formaggi ruby.blue 97072 31916
domenico.quatro-formaggi jason.bourne 96967 3223
…
class JsonLoader extends LoadFunc {
@Override
public InputFormat getInputFormat() throws IOException {
return new TextInputFormat();
}
public ResourceSchema getSchema(String location, Job job) … {
ResourceSchema schema = new ResourceSchema();
ResourceFieldSchema[] fieldSchemas = new
ResourceFieldSchema[SCHEMA_FIELDS_COUNT];
fieldSchemas[0] = new ResourceFieldSchema();
fieldSchemas[0].setName(FIELD_NAME_TIMESTAMP);
fieldSchemas[0].setType(DataType.LONG); …
schema.setFields(fieldSchemas);
return schema;
}
}
class JsonLoader extends LoadFunc {
…
@Override public Tuple getNext() throws IOException {
try {
boolean notDone = in.nextKeyValue();
if (!notDone) {
return null;
}
Text jsonRecord = (Text) in.getCurrentValue();
String[] values = MyJsonParser.parse(jsonRecord);
Tuple tuple = tupleFactory.newTuple(Arrays.asList(values));
return tuple;
} catch (Exception exc) {
throw new IOException(exc);
}
}
}
class AverageWeight extends EvalFunc<String> {
…
@Override public String exec(Tuple input) … {
String output = null;
if (input != null && input.size() == 2) {
Integer totalWeight = (Integer) input.get(0);
Integer totalCount = (Integer) input.get(1);
BigDecimal average = new BigDecimal(totalWeight).
divide(new BigDecimal(totalCount),
SCALE, RoundingMode.HALF_UP);
output = average.stripTrailingZeros().toPlainString();
}
return output;
}
}
REGISTER codingserbia-udf.jar
DEFINE AVG_WEIGHT com.codingserbia.udf.AverageWeight();
interactionRecords = LOAD '/blog/user_interaction_big.json'
USING com.codingserbia.udf.JsonLoader();

interactionData = FOREACH interactionRecords GENERATE
sourceUser,
targetUser,
eventWeight;
groupInteraction = GROUP interactionData BY (sourceUser,
targetUser);
…
…
summarizedInteraction = FOREACH groupInteraction GENERATE
group.sourceUser AS sourceUser,
group.targetUser AS targetUser,
SUM(interactionData.eventWeight) AS eventWeight,
COUNT(interactionData.eventWeight) AS eventCount,
AVG_WEIGHT(
SUM(interactionData.eventWeight),
COUNT(interactionData.eventWeight)) AS averageWeight;
result = ORDER summarizedInteraction BY
sourceUser, eventWeight DESC;
STORE result INTO '/results/pig_mr' USING PigStorage();
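
The script above is meant for the Grunt shell or the pig command line; for completeness, it can also be driven from Java by embedding Pig. A rough sketch using the PigServer API is shown below, assuming the statements are saved to a file named interaction.pig (the file name is my assumption).

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class RunInteractionScript {
  public static void main(String[] args) throws Exception {
    // Connect to the cluster; ExecType.LOCAL works for small local experiments
    PigServer pig = new PigServer(ExecType.MAPREDUCE);

    pig.setBatchOn();                      // collect all STORE statements into one batch
    pig.registerScript("interaction.pig"); // REGISTER, DEFINE, LOAD ... STORE from the slides
    pig.executeBatch();                    // launch the resulting Map/Reduce jobs
  }
}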

Editor's notes

1. Big data. One of the buzzwords of the software industry in the last decade. We have all heard about it, but I am not sure we can actually comprehend it as we should and as it deserves. It reminds me of the Universe – mankind knows that it is big, huge, vast, but no one can really understand its size. The same can be said for the amount of data being collected and processed every day somewhere in the clouds of IT. As Google’s CEO, Eric Schmidt, once said: “There were 5 exabytes of information created by the entire world between the dawn of civilization and 2003. Now that same amount is created every two days.”
2. Almost every organization has to deal with huge amounts of data. Much of this exists in conventional structured forms, stored in relational databases. However, the biggest growth comes from unstructured data, both from inside and outside the enterprise - including documents, web logs, sensor data, videos, medical devices and social media. According to some studies, more than 90% of Big Data is unstructured data. The majority of information in the digital universe, 68% in 2012, is created and consumed by consumers watching digital TV, interacting with social media, sending camera phone images and videos between devices and around the Internet, and so on. But only a fraction of it is explored for analytic value. Some studies say that only 33% of the digital universe will contain valuable info by 2020.
3. As well as volume and variety, Big Data is often said to exhibit "velocity" - meaning that the data is being generated at high speed, and needs real-time processing and analysis. One example of the need for real-time processing of Big Data is in the financial world, where thousands or millions of transactions must be continuously analyzed for possible fraud in a matter of seconds. Another example is in retail, where a business may be analyzing many customer click-streams and purchases to generate real-time intelligent recommendations.
4. As organizations create and store more data in digital form, they can collect more accurate and detailed performance information on everything from product inventories to sick days, and therefore expose variability and boost performance. Companies are using data collection and analysis to conduct controlled experiments to make better management decisions.
5. Measuring the heartbeat of a city - Rio de Janeiro: 6.5M people, 8M vehicles, 4M bus passengers, 44k police... tropical monsoon climate. Big Data is used to monitor weather, traffic (GPS-tracked buses and medical vehicles), police and emergency services - using analytics to predict problems before they occur. A less beautiful example, but Big Data also influences business and decision making: Product Development - incorporate the features that matter most; Manufacturing - flag potential indicators of quality problems; Distribution - quantify optimal inventory and supply chain activities; Marketing - identify your most effective campaigns for engagement and sales; Sales - optimize account targeting, resource allocation, revenue forecasting. Several issues will have to be addressed to capture the full potential of big data. Policies related to privacy, security, intellectual property, and even liability will need to be addressed in a big data world. Organizations need not only to put the right talent and technology in place but also structure workflows and incentives to optimize the use of big data. Access to data is critical - companies will increasingly need to integrate information from multiple data sources, often from third parties, and the incentives have to be in place to enable this.
6. The Hadoop platform was designed to solve problems where you have a lot of data — perhaps a mixture of complex and structured data — and it doesn’t fit nicely into tables. It’s for situations where you want to run analytics that are deep and computationally extensive, like clustering and targeting. That’s exactly what Google was doing when it was indexing the web and examining user behavior to improve performance algorithms. Hadoop applies to a bunch of markets. In finance, if you want to do accurate portfolio evaluation and risk analysis, you can build sophisticated models that are hard to jam into a database engine. But Hadoop can handle it. In online retail, if you want to deliver better search answers to your customers so they’re more likely to buy the thing you show them, that sort of problem is well addressed by the platform Google built. Those are just a few examples. Hadoop is designed to run on a large number of machines that don’t share any memory or disks. That means you can buy a whole bunch of commodity servers, slap them in a rack, and run the Hadoop software on each one. When you want to load all of your organization’s data into Hadoop, what the software does is bust that data into pieces that it then spreads across your different servers. There’s no one place where you go to talk to all of your data; Hadoop keeps track of where the data resides. And because there are multiple copy stores, data stored on a server that goes offline or dies can be automatically replicated from a known good copy. In a centralized database system, you’ve got one big disk connected to four or eight or 16 big processors. But that is as much horsepower as you can bring to bear. In a Hadoop cluster, every one of those servers has two or four or eight CPUs. You can run your indexing job by sending your code to each of the dozens of servers in your cluster, and each server operates on its own little piece of the data. Results are then delivered back to you in a unified whole. That’s MapReduce: you map the operation out to all of those servers and then you reduce the results back into a single result set.
7. The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is now an Apache Hadoop subproject. Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system’s data. The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS. Applications that run on HDFS need streaming access to their data sets. They are not general purpose applications that typically run on general purpose file systems. HDFS is designed more for batch processing rather than interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access. POSIX imposes many hard requirements that are not needed for applications that are targeted for HDFS. POSIX semantics in a few key areas has been traded to increase data throughput rates. Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance. HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed. This assumption simplifies data coherency issues and enables high throughput data access. A MapReduce application or a web crawler application fits perfectly with this model. There is a plan to support appending-writes to files in the future. A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.
8. HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes.

The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system and tracks where across the cluster the file data is kept. It does not store the data of these files itself.

The NameNode is a single point of failure for the HDFS cluster. HDFS is not currently a high-availability system: when the NameNode goes down, the file system goes offline. There is an optional SecondaryNameNode that can be hosted on a separate machine. It only creates checkpoints of the namespace by merging the edits file into the fsimage file and does not provide any real redundancy.
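To show what "namespace operations" means in practice, the sketch below performs only metadata calls, which are answered by the NameNode without touching any file contents on the DataNodes. The /reports paths are invented for the example.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class NamespaceOps {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // Pure metadata operations, handled by the NameNode.
            fs.mkdirs(new Path("/reports/archive"));
            boolean renamed = fs.rename(new Path("/reports/raw.log"),
                                        new Path("/reports/archive/raw.log"));
            System.out.println("renamed: " + renamed
                    + ", exists: " + fs.exists(new Path("/reports/archive/raw.log")));
        }
    }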
9. A DataNode stores data in the Hadoop file system. A functional file system has more than one DataNode, with data replicated across them.

On startup, a DataNode connects to the NameNode, spinning until that service comes up, and then responds to requests from the NameNode for file system operations. Client applications can talk directly to a DataNode once the NameNode has provided the location of the data. DataNode instances can also talk to each other, which is what they do when they are replicating data.

TaskTracker instances can, and indeed should, be deployed on the same servers that host DataNode instances, so that MapReduce operations are performed close to the data.

A TaskTracker is a node in the cluster that accepts tasks (Map, Reduce and Shuffle operations) from a JobTracker. Every TaskTracker is configured with a set of slots, which indicate the number of tasks it can accept. When the JobTracker tries to find somewhere to schedule a task within the MapReduce operations, it first looks for an empty slot on the same server that hosts the DataNode containing the data; if there is none, it looks for an empty slot on a machine in the same rack.

The TaskTracker spawns a separate JVM process to do the actual work, so that a process failure does not take down the TaskTracker itself. The TaskTracker monitors these spawned processes, capturing the output and exit codes. When a process finishes, successfully or not, the tracker notifies the JobTracker. The TaskTrackers also send out heartbeat messages to the JobTracker, usually every few minutes, to reassure the JobTracker that they are still alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated.
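If you want to see the tracker count and slot capacity the JobTracker is working with, the old org.apache.hadoop.mapred client API can report the cluster status. This is a sketch, assuming a Hadoop 1.x (JobTracker/TaskTracker) cluster whose mapred-site.xml is on the classpath.

    import org.apache.hadoop.mapred.ClusterStatus;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class ClusterSlots {
        public static void main(String[] args) throws Exception {
            // JobConf reads the JobTracker address (mapred.job.tracker) from the classpath.
            JobClient client = new JobClient(new JobConf());
            ClusterStatus status = client.getClusterStatus();

            System.out.println("active task trackers: " + status.getTaskTrackers());
            System.out.println("total map slots     : " + status.getMaxMapTasks());
            System.out.println("total reduce slots  : " + status.getMaxReduceTasks());
        }
    }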
10. The JobTracker is the service within Hadoop that farms out MapReduce tasks to specific nodes in the cluster, ideally the nodes that have the data or at least are in the same rack. The JobTracker is a point of failure for the Hadoop MapReduce service: if it goes down, all running jobs are halted.

Client applications submit jobs to the JobTracker. The JobTracker talks to the NameNode to determine the location of the data, locates TaskTracker nodes with available slots at or near the data, and submits the work to the chosen TaskTracker nodes.

The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker. A TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the TaskTracker as unreliable. When the work is completed, the JobTracker updates its status, and client applications can poll the JobTracker for information.
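From the client's point of view, submitting work to the JobTracker is just a driver program. Below is a minimal sketch that reuses the hypothetical WordCountMapper and WordCountReducer from the earlier example; the input and output paths come from the command line.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // waitForCompletion() submits the job to the JobTracker and
            // polls it for progress until the job succeeds or fails.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }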
11. Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Some of its characteristics:

- Lazy evaluation.
- The ability to store data at any point in the pipeline.
- A procedural language, more like an execution plan, which offers more control over the flow of data processing.
- SQL offers the option to join two tables; Pig also offers a choice of how the join is implemented.
12. A Pig relation is a bag of tuples. A Pig relation is similar to a table in a relational database, where the tuples in the bag correspond to the rows in a table. Unlike a relational table, however, Pig relations don't require that every tuple contain the same number of fields or that the fields in the same position (column) have the same type.

Also note that relations are unordered, which means there is no guarantee that tuples are processed in any particular order. Furthermore, processing may be parallelized, in which case tuples are not processed according to any total ordering.
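The same bag-of-tuples model shows up in Pig's Java data API (org.apache.pig.data), which UDF authors work with directly. A small sketch to illustrate that tuples in one bag may differ in arity and field types; the example values are invented.

    import org.apache.pig.data.BagFactory;
    import org.apache.pig.data.DataBag;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.data.TupleFactory;

    public class RelationShapeDemo {
        public static void main(String[] args) throws Exception {
            TupleFactory tupleFactory = TupleFactory.getInstance();
            BagFactory bagFactory = BagFactory.getInstance();

            // Tuples in the same bag do not have to share arity or field types.
            Tuple t1 = tupleFactory.newTuple(2);
            t1.set(0, "alice");
            t1.set(1, 30);

            Tuple t2 = tupleFactory.newTuple(3);
            t2.set(0, "bob");
            t2.set(1, 25L);
            t2.set(2, "extra field");

            DataBag bag = bagFactory.newDefaultBag();
            bag.add(t1);
            bag.add(t2);

            System.out.println(bag); // e.g. {(alice,30),(bob,25,extra field)}
        }
    }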
13. Schemas enable you to assign names to fields and declare types for fields. Schemas are optional, but we encourage you to use them whenever possible; type declarations result in better parse-time error checking and more efficient code execution. Schemas are defined with the LOAD, STREAM, and FOREACH operators using the AS clause.

- You can define a schema that includes both the field name and field type.
- You can define a schema that includes the field name only; in this case, the field type defaults to bytearray.
- You can choose not to define a schema; in this case, the field is un-named and the field type defaults to bytearray.

If you assign a name to a field, you can refer to that field using the name or by positional notation. If you don't assign a name to a field (the field is un-named), you can only refer to it using positional notation. If you assign a type to a field, you can subsequently change the type using the cast operators. If you don't assign a type to a field, the field defaults to bytearray; you can change the default type using the cast operators.
14. Pig provides extensive support for user defined functions (UDFs) as a way to specify custom processing. Pig UDFs can currently be implemented in five languages: Java, Python, JavaScript, Ruby and Groovy.

The most extensive support is provided for Java functions. You can customize all parts of the processing, including data load/store, column transformation, and aggregation. Java functions are also more efficient because they are implemented in the same language as Pig and because additional interfaces are supported.

Limited support is provided for Python, JavaScript, Ruby and Groovy functions. These functions are new, still-evolving additions to the system. Currently only the basic interface is supported; load/store functions are not supported. Furthermore, JavaScript, Ruby and Groovy are provided as experimental features because they did not go through the same amount of testing as Java or Python. At runtime, Pig will automatically detect the usage of a scripting UDF in a Pig script and will ship the corresponding scripting jar (Jython, Rhino, JRuby or Groovy-all) to the backend.

Pig also provides support for Piggy Bank, a repository for Java UDFs. Through Piggy Bank you can access Java UDFs written by other users and contribute Java UDFs that you have written.

Eval is the most common type of function. It can be used in FOREACH statements for whatever purpose; the core method to implement is exec, for example public String exec(Tuple input) for a UDF that returns a chararray.
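For instance, a minimal Java eval function might look like the sketch below. The class name UpperCase is made up; the EvalFunc base class and the exec(Tuple) contract are the real Pig API.

    import java.io.IOException;

    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    // Eval UDF: returns the upper-cased form of its first argument.
    public class UpperCase extends EvalFunc<String> {
        @Override
        public String exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0 || input.get(0) == null) {
                return null;   // Pig convention: null in, null out
            }
            return ((String) input.get(0)).toUpperCase();
        }
    }

Once the jar is registered with REGISTER, the function can be called from a FOREACH ... GENERATE statement like any built-in.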
The load/store UDFs control how data goes into Pig and comes out of Pig. Often the same function handles both input and output, but that does not have to be the case. The Pig load/store API is aligned with Hadoop's InputFormat and OutputFormat classes.

The LoadFunc abstract class is the main class to extend for implementing a loader. The methods which need to be overridden are explained below (a small example loader follows at the end of this section):

- getInputFormat(): This method is called by Pig to get the InputFormat used by the loader. The methods in the InputFormat (and the underlying RecordReader) are called by Pig in the same manner (and in the same context) as by Hadoop in a MapReduce Java program. If the InputFormat is a Hadoop-packaged one, the implementation should use the new API under org.apache.hadoop.mapreduce. If it is a custom InputFormat, it should also be implemented using the new API in org.apache.hadoop.mapreduce. If a custom loader using a text-based or file-based InputFormat wants to read files in all subdirectories under a given input directory recursively, it should use the PigTextInputFormat and PigFileInputFormat classes provided in org.apache.pig.backend.hadoop.executionengine.mapReduceLayer. These Pig InputFormat classes work around a current limitation in the Hadoop TextInputFormat and FileInputFormat classes, which only read one level down from the provided input directory. For example, if the input in the load statement is 'dir1' and there are subdirectories 'dir2' and 'dir2/dir3' beneath dir1, the Hadoop TextInputFormat and FileInputFormat classes read the files under 'dir1' only. Using PigTextInputFormat or PigFileInputFormat (or by extending them), the files in all the directories can be read.
- setLocation(): This method is called by Pig to communicate the load location to the loader. The loader should use this method to communicate the same information to the underlying InputFormat. It is called multiple times by Pig, so implementations should ensure there are no inconsistent side effects due to the multiple calls.
- prepareToRead(): Through this method the RecordReader associated with the InputFormat provided by the LoadFunc is passed to the LoadFunc. The RecordReader can then be used by the implementation in getNext() to return a tuple representing a record of data back to Pig.
- getNext(): The meaning of getNext() has not changed: it is called by the Pig runtime to get the next tuple in the data. The implementation should use the underlying RecordReader and construct the tuple to return.

The StoreFunc abstract class has the main methods for storing data, and for most use cases it should suffice to extend it. The methods which need to be overridden in StoreFunc are explained below:

- getOutputFormat(): This method will be called by Pig to get the OutputFormat used by the storer. The methods in the OutputFormat (and the underlying RecordWriter and OutputCommitter) will be called by Pig in the same manner (and in the same context) as by Hadoop in a MapReduce Java program. If the OutputFormat is a Hadoop-packaged one, the implementation should use the new API under org.apache.hadoop.mapreduce. If it is a custom OutputFormat, it should also be implemented using the new API under org.apache.hadoop.mapreduce. The checkOutputSpecs() method of the OutputFormat will be called by Pig to check the output location up front. It will also be called as part of the Hadoop call sequence when the job is launched, so implementations should ensure that it can be called multiple times without inconsistent side effects.
- setStoreLocation(): This method is called by Pig to communicate the store location to the storer. The storer should use this method to communicate the same information to the underlying OutputFormat. It is called multiple times by Pig, so implementations should ensure there are no inconsistent side effects due to the multiple calls.
- prepareToWrite(): In the new API, writing of the data is done through the OutputFormat provided by the StoreFunc. In prepareToWrite() the RecordWriter associated with that OutputFormat is passed to the StoreFunc. The RecordWriter can then be used by the implementation in putNext() to write a tuple representing a record of data in the manner expected by the RecordWriter.
- putNext(): The meaning of putNext() has not changed: it is called by the Pig runtime to write the next tuple of data. In the new API, this is the method where the implementation uses the underlying RecordWriter to write the Tuple out.
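Putting the LoadFunc contract together, a minimal loader that turns each line of text into a single-field tuple could look roughly like this. It is only a sketch: the class name LineLoader is invented, error handling is reduced to the essentials, and the stock TextInputFormat is used, so nested subdirectories would need PigTextInputFormat as described above.

    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputFormat;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.pig.LoadFunc;
    import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.data.TupleFactory;

    // Minimal loader: one chararray field per input line.
    public class LineLoader extends LoadFunc {
        private final TupleFactory tupleFactory = TupleFactory.getInstance();
        private RecordReader reader;

        @Override
        public InputFormat getInputFormat() throws IOException {
            // Hand Pig the (new-API) InputFormat that actually reads the data.
            return new TextInputFormat();
        }

        @Override
        public void setLocation(String location, Job job) throws IOException {
            // Pass the load location on to the underlying InputFormat;
            // called multiple times, so keep it free of side effects.
            FileInputFormat.setInputPaths(job, location);
        }

        @Override
        public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
            this.reader = reader;
        }

        @Override
        public Tuple getNext() throws IOException {
            try {
                if (!reader.nextKeyValue()) {
                    return null;                      // end of this split
                }
                Text line = (Text) reader.getCurrentValue();
                Tuple tuple = tupleFactory.newTuple(1);
                tuple.set(0, line.toString());
                return tuple;
            } catch (InterruptedException e) {
                throw new IOException(e);
            }
        }
    }

A StoreFunc implementation mirrors this shape with getOutputFormat(), setStoreLocation(), prepareToWrite() and putNext().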