2.
Name: Dušan Zamurović
Where do I come from?
◦ codecentric Novi Sad
What do I do?
◦ Java web-app background
◦ ♥ JavaScript ♥
Ajax with DWR lib
◦ Android
◦ currently Big Data (reporting QA)
5. A revolution that will transform how we live, work,
and think.
3 Vs of big data
◦ Volume
◦ Variety
◦ Velocity
Every day use-cases
◦ Beautiful
◦ Useful
◦ Funny
6.
The principal characteristic
Studies report
◦ 1.2 trillion gigabytes of new data was created
worldwide in 2011 alone
◦ From 2005 to 2020, the digital universe will grow
by a factor of 300
◦ By 2020 the digital universe will amount to 40
trillion gigabytes (more than 5,200 gigabytes for
every man, woman, and child in 2020)
7.
The biggest growth – unstructured data
◦ Documents
◦ Web logs
◦ Sensor data
◦ Videos and photos
◦ Medical devices
◦ Social media
>90% of this Big Data is unstructured
Analytic value?
◦ 33% valuable info by 2020
8.
Generated at high speed
Needs real-time processing
Example I
◦ Financial world
◦ Thousands or millions of transactions
Example II
◦ Retail
◦ Analyze click streams to offer recommendations
9. Value of Big Data is potentially great but can be
released only with the right combination of
people, processes and technologies.
…unlock significant value by making
information transparent and usable at much
higher frequency
10.
Measuring heartbeat of a city - Rio de Janeiro
More examples
◦ Product development – most valuable features
◦ Manufacturing – indicators of quality problems
◦ Distribution – optimize inventory and supply chains
◦ Sales – account targeting, resource allocation
Beer and diapers
Possible issues?
◦ Privacy, security, intellectual property, liability…
11. "Map/Reduce is a programming model and an
associated implementation for processing and
generating large data sets. Users specify a map
function that processes a key/value pair to
generate a set of intermediate key/value pairs,
and a reduce function that merges all
intermediate values associated with the same
intermediate key."
- research publication
http://research.google.com/archive/mapreduce.html
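To make the quoted model concrete, here is a minimal single-process sketch of the map/reduce dataflow using the classic word-count example. This is illustrative only: a real MapReduce framework distributes both phases across machines, and the names `map_fn`, `reduce_fn` and `map_reduce` are ours, not Hadoop API.

```python
from collections import defaultdict

def map_fn(_key, value):
    # Emit an intermediate (word, 1) pair for every word in the document.
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Merge all intermediate values associated with one intermediate key.
    return (key, sum(values))

def map_reduce(documents):
    intermediate = defaultdict(list)
    for key, value in documents:                # map phase
        for k, v in map_fn(key, value):
            intermediate[k].append(v)           # group by intermediate key
    # reduce phase: one call per distinct intermediate key
    return dict(reduce_fn(k, vs) for k, vs in intermediate.items())

counts = map_reduce([("doc1", "big data big deal"), ("doc2", "big data")])
```

The grouping step in the middle is what Hadoop's shuffle phase does between the user-supplied map and reduce functions.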
12.
13.
In the beginning, there was Nutch
Which problems does it address?
◦ Big Data
◦ Not fit for RDBMS
◦ Computationally extensive
Hadoop && RDBMS
◦ “Get data to process” or “send code where data is”
◦ Designed to run on large number of machines
◦ Separate storage
14.
Distributed File System
◦ Designed for commodity hardware
◦ Highly fault-tolerant
◦ Relaxed POSIX
To enable streaming access to file system data
Assumptions and Goals
◦ Hardware failure
◦ Streaming data access
◦ Large data sets
◦ Write-once-read-many
◦ Move computation, not data
15.
NameNode
◦ Master server, central component
◦ HDFS cluster has a single NameNode
◦ Manages clients' access
◦ Keeps track of where data is kept
◦ Single point of failure
Secondary NameNode
◦ Optional component
◦ Checkpoints of the namespace
◦ Does not provide any real redundancy
16.
DataNode
◦ Stores data in the file system
◦ Talks to NameNode and responds to requests
◦ Talks to other DataNodes
Data replication
TaskTracker
◦ Should be where DataNode is
◦ Accepts tasks (Map, Reduce, Shuffle…)
◦ Set of slots for tasks
17.
JobTracker
◦ Farms tasks to specific nodes in the cluster
◦ Point of failure for MapReduce
How it goes?
1. Client submits job to JobTracker
2. JobTracker asks NameNode where the data is
3. JobTracker locates TaskTracker nodes with free slots near the data
4. JobTracker submits tasks to the chosen TaskTrackers
5. TaskTrackers execute the tasks in their slots
◦ Task failed: TaskTracker informs, JobTracker decides
◦ Job done: JobTracker updates status
6. Client can poll JobTracker for information
18.
Platform for analyzing large data sets
◦ Language – Pig Latin
◦ High level approach
◦ Compiler
◦ Grunt shell
Pig compared to SQL
◦ Lazy evaluation
◦ Procedural language
◦ More like an execution plan
19.
Pig Latin statements
◦ A relation is a bag
◦ A bag is a collection of tuples
◦ A tuple is an ordered set of fields
◦ A field is a piece of data
◦ A relation is referenced by name, i.e. alias
A = LOAD 'student' USING PigStorage() AS
(name:chararray, age:int, gpa:float);
DUMP A;
(John,18,4.0F)
(Mary,19,3.8F)
(Bill,20,3.9F)
(Joe,18,3.8F)
20.
Data types
◦ Simple
int – signed 32-bit integer
long – signed 64-bit integer
float – 32-bit floating point
double – 64-bit floating point
chararray – UTF-8 string
bytearray – blob
boolean – since Pig 0.10
datetime – since Pig 0.11
◦ Complex
tuple – an ordered set of fields
bag – a collection of tuples
map – a set of key value pairs
(21,32)
{(21,32),(32,43)}
[pig#latin]
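The three literals above map onto everyday Python structures, which may help if the Pig notation is new. This is a rough analogy (not a Pig API), and note one difference: Pig bags are unordered collections.

```python
# Rough Python analogues of the Pig complex-type literals shown above.
pig_tuple = (21, 32)             # Pig tuple (21,32): an ordered set of fields
pig_bag = [(21, 32), (32, 43)]   # Pig bag {(21,32),(32,43)}: a collection of tuples
pig_map = {"pig": "latin"}       # Pig map [pig#latin]: a set of key/value pairs
```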
21.
Data structure and defining schemas
◦ Why define a schema?
◦ Where to define a schema?
◦ How to define a schema?
/* data types not specified */
a = LOAD '1.txt' AS (a0, b0);
a: {a0: bytearray,b0: bytearray}
/* number of fields not known */
a = LOAD '1.txt';
a: Schema for a unknown
24.
User Defined Functions
◦ Java, Python, JavaScript, Ruby, Groovy
How to write a UDF?
◦ Eval function extends EvalFunc<something>
◦ Load function extends LoadFunc
◦ Store function extends StoreFunc
How to use a UDF?
◦ Register
◦ Define the name of the UDF if you like
◦ Call it
25.
26.
Imaginary social network
Lots of users…
… with their friends, girlfriends, boyfriends, wives,
husbands, mistresses, etc…
New relationship arises…
◦ … but the new friend is not shown in the news feed
Where are his/her activities?
◦ Hidden, marked as not important
27.
Find out the value of the relationship
Monitor and log user activities
◦ For each user, of course
◦ Each activity has some value (event weight)
◦ Records user's activities
◦ Store those logs in HDFS
◦ Analyze those logs from time to time
◦ Calculate needed values
◦ Show only the activities of "important" friends
28.
Events recorded in JSON format
{
"timestamp": 1341161607860,
"sourceUser": "marry.lee",
"targetUser": "ruby.blue",
"eventName": "VIEW_PHOTO",
"eventWeight": 1
}
38. REGISTER codingserbia-udf.jar
DEFINE AVG_WEIGHT com.codingserbia.udf.AverageWeight();
interactionRecords = LOAD '/blog/user_interaction_big.json'
USING com.codingserbia.udf.JsonLoader();
interactionData = FOREACH interactionRecords GENERATE
sourceUser,
targetUser,
eventWeight;
groupInteraction = GROUP interactionData BY (sourceUser,
targetUser);
…
39. …
summarizedInteraction = FOREACH groupInteraction GENERATE
group.sourceUser AS sourceUser,
group.targetUser AS targetUser,
SUM(interactionData.eventWeight) AS eventWeight,
COUNT(interactionData.eventWeight) AS eventCount,
AVG_WEIGHT(
SUM(interactionData.eventWeight),
COUNT(interactionData.eventWeight)) AS averageWeight;
result = ORDER summarizedInteraction BY
sourceUser, eventWeight DESC;
STORE result INTO '/results/pig_mr' USING PigStorage();
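To see what the script computes without a cluster, the same GROUP / SUM / COUNT / AVG / ORDER pipeline can be sketched in plain Python. The three sample events are made up for illustration.

```python
from collections import defaultdict

# A few made-up interaction events in the shape shown on the JSON slide.
events = [
    {"sourceUser": "marry.lee", "targetUser": "ruby.blue", "eventWeight": 1},
    {"sourceUser": "marry.lee", "targetUser": "ruby.blue", "eventWeight": 3},
    {"sourceUser": "marry.lee", "targetUser": "joe.black", "eventWeight": 2},
]

# GROUP interactionData BY (sourceUser, targetUser)
groups = defaultdict(list)
for e in events:
    groups[(e["sourceUser"], e["targetUser"])].append(e["eventWeight"])

# FOREACH group GENERATE SUM, COUNT and the average weight
summarized = [
    {
        "sourceUser": src,
        "targetUser": dst,
        "eventWeight": sum(ws),
        "eventCount": len(ws),
        "averageWeight": sum(ws) / float(len(ws)),
    }
    for (src, dst), ws in groups.items()
]

# ORDER BY sourceUser ASC, eventWeight DESC
result = sorted(summarized, key=lambda r: (r["sourceUser"], -r["eventWeight"]))
```

Each dict in `result` corresponds to one output tuple of the Pig script's `summarizedInteraction` relation.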
Notas del editor
Big data. One of the buzz words of the software industry in the last decade. We have all heard about it, but I am not sure we can actually comprehend it as we should and as it deserves. It reminds me of the Universe - mankind knows that it is big, huge, vast, but no one can really grasp its size. The same can be said for the amount of data being collected and processed every day somewhere in the clouds of IT. As Google's CEO, Eric Schmidt, once said: "There were 5 exabytes of information created by the entire world between the dawn of civilization and 2003. Now that same amount is created every two days."
Almost every organization has to deal with huge amounts of data. Much of this exists in conventional structured forms, stored in relational databases. However, the biggest growth comes from unstructured data, both from inside and outside the enterprise - including documents, web logs, sensor data, videos, medical devices and social media. According to some studies, more than 90% of Big Data is unstructured data. The majority of information in the digital universe, 68% in 2012, is created and consumed by consumers watching digital TV, interacting with social media, sending camera phone images and videos between devices and around the Internet, and so on. But only a fraction of it is explored for analytic value. Some studies say that only 33% of the digital universe will contain valuable info by 2020.
As well as volume and variety, Big Data is often said to exhibit "velocity" - meaning that the data is being generated at high speed, and needs real-time processing and analysis. One example of the need for real-time processing of Big Data is in the financial world, where thousands or millions of transactions must be continuously analyzed for possible fraud in a matter of seconds. Another example is in retail, where a business may be analyzing many customer click-streams and purchases to generate real-time intelligent recommendations.
As organizations create and store more data in digital form, they can collect more accurate and detailed performance information on everything from product inventories to sick days, and therefore expose variability and boost performance. Companies are using data collection and analysis to conduct controlled experiments to make better management decisions.
Measuring the heartbeat of a city - Rio de Janeiro: 6.5M people, 8M vehicles, 4M bus passengers, 44k police... tropical monsoon climate. Big Data is used to monitor weather, traffic (GPS-tracked buses and medical vehicles), police, and emergency services - using analytics to predict problems before they occur.
Not so beautiful an example, but Big Data influences business and decision making:
- Product Development: incorporate the features that matter most
- Manufacturing: flag potential indicators of quality problems
- Distribution: quantify optimal inventory and supply chain activities
- Marketing: identify your most effective campaigns for engagement and sales
- Sales: optimize account targeting, resource allocation, revenue forecasting
Several issues will have to be addressed to capture the full potential of big data. Policies related to privacy, security, intellectual property, and even liability will need to be addressed in a big data world. Organizations need not only to put the right talent and technology in place but also to structure workflows and incentives to optimize the use of big data. Access to data is critical - companies will increasingly need to integrate information from multiple data sources, often from third parties, and the incentives have to be in place to enable this.
The Hadoop platform was designed to solve problems where you have a lot of data - perhaps a mixture of complex and structured data - and it doesn't fit nicely into tables. It's for situations where you want to run analytics that are deep and computationally extensive, like clustering and targeting. That's exactly what Google was doing when it was indexing the web and examining user behavior to improve performance algorithms. Hadoop applies to a bunch of markets. In finance, if you want to do accurate portfolio evaluation and risk analysis, you can build sophisticated models that are hard to jam into a database engine. But Hadoop can handle it. In online retail, if you want to deliver better search answers to your customers so they're more likely to buy the thing you show them, that sort of problem is well addressed by the platform Google built. Those are just a few examples.
Hadoop is designed to run on a large number of machines that don't share any memory or disks. That means you can buy a whole bunch of commodity servers, slap them in a rack, and run the Hadoop software on each one. When you want to load all of your organization's data into Hadoop, what the software does is bust that data into pieces that it then spreads across your different servers. There's no one place where you go to talk to all of your data; Hadoop keeps track of where the data resides. And because there are multiple copy stores, data stored on a server that goes offline or dies can be automatically replicated from a known good copy.
In a centralized database system, you've got one big disk connected to four or eight or 16 big processors. But that is as much horsepower as you can bring to bear. In a Hadoop cluster, every one of those servers has two or four or eight CPUs. You can run your indexing job by sending your code to each of the dozens of servers in your cluster, and each server operates on its own little piece of the data. Results are then delivered back to you in a unified whole.
That’s MapReduce: you map the operation out to all of those servers and then you reduce the results back into a single result set.
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is now an Apache Hadoop subproject.
Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data. The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.
Applications that run on HDFS need streaming access to their data sets. They are not general purpose applications that typically run on general purpose file systems. HDFS is designed more for batch processing than for interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access. POSIX imposes many hard requirements that are not needed for applications that are targeted for HDFS. POSIX semantics in a few key areas have been traded to increase data throughput rates.
Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster.
It should support tens of millions of files in a single instance.
HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed. This assumption simplifies data coherency issues and enables high throughput data access. A MapReduce application or a web crawler application fits perfectly with this model. There is a plan to support appending writes to files in the future.
A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.
HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes.
The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself.
The NameNode is a Single Point of Failure for the HDFS cluster. HDFS is not currently a High Availability system. When the NameNode goes down, the file system goes offline. There is an optional SecondaryNameNode that can be hosted on a separate machine. It only creates checkpoints of the namespace by merging the edits file into the fsimage file and does not provide any real redundancy.
A DataNode stores data in the Hadoop file system. A functional filesystem has more than one DataNode, with data replicated across them.
On startup, a DataNode connects to the NameNode, spinning until that service comes up. It then responds to requests from the NameNode for filesystem operations. Client applications can talk directly to a DataNode, once the NameNode has provided the location of the data. DataNode instances can talk to each other, which is what they do when they are replicating data.
TaskTracker instances can, indeed should, be deployed on the same servers that host DataNode instances, so that MapReduce operations are performed close to the data. A TaskTracker is a node in the cluster that accepts tasks - Map, Reduce and Shuffle operations - from a JobTracker. Every TaskTracker is configured with a set of slots; these indicate the number of tasks that it can accept. When the JobTracker tries to find somewhere to schedule a task within the MapReduce operations, it first looks for an empty slot on the same server that hosts the DataNode containing the data, and if not, it looks for an empty slot on a machine in the same rack.
The TaskTracker spawns separate JVM processes to do the actual work; this is to ensure that process failure does not take down the TaskTracker. The TaskTracker monitors these spawned processes, capturing the output and exit codes. When a process finishes, successfully or not, the tracker notifies the JobTracker. The TaskTrackers also send out heartbeat messages to the JobTracker, usually every few minutes, to reassure the JobTracker that they are still alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated.
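The placement preference described here (an empty slot on the node holding the data first, then a node in the same rack, then anywhere) can be sketched as a toy function. Names and structures below are illustrative only, not Hadoop's API.

```python
def pick_task_tracker(trackers, data_node):
    """Pick a TaskTracker for a task whose input lives on data_node.

    trackers: list of dicts with 'node', 'rack' and 'free_slots' keys.
    """
    data_rack = next(t["rack"] for t in trackers if t["node"] == data_node)
    # 1. Data-local: an empty slot on the server hosting the data.
    for t in trackers:
        if t["node"] == data_node and t["free_slots"] > 0:
            return t["node"]
    # 2. Rack-local: an empty slot on a machine in the same rack.
    for t in trackers:
        if t["rack"] == data_rack and t["free_slots"] > 0:
            return t["node"]
    # 3. Anywhere in the cluster.
    for t in trackers:
        if t["free_slots"] > 0:
            return t["node"]
    return None

trackers = [
    {"node": "n1", "rack": "r1", "free_slots": 0},
    {"node": "n2", "rack": "r1", "free_slots": 2},
    {"node": "n3", "rack": "r2", "free_slots": 4},
]
# Data lives on n1, which has no free slots, so the rack-mate n2 is chosen.
chosen = pick_task_tracker(trackers, "n1")
```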
The JobTracker is the service within Hadoop that farms out MapReduce tasks to specific nodes in the cluster, ideally the nodes that have the data, or at least are in the same rack. The JobTracker is a point of failure for the Hadoop MapReduce service. If it goes down, all running jobs are halted.
Client applications submit jobs to the JobTracker. The JobTracker talks to the NameNode to determine the location of the data. The JobTracker locates TaskTracker nodes with available slots at or near the data, and submits the work to the chosen TaskTracker nodes. The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker. A TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the TaskTracker as unreliable. When the work is completed, the JobTracker updates its status. Client applications can poll the JobTracker for information.
Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
- Lazy evaluation
- Pig's ability to store data at any point
- Procedural language, more like an execution plan, which offers more control over the flow of processing data
- SQL offers an option to join two tables, but Pig also offers a choice of join implementation
A Pig relation is a bag of tuples. A Pig relation is similar to a table in a relational database, where the tuples in the bag correspond to the rows in a table. Unlike a relational table, however, Pig relations don't require that every tuple contain the same number of fields or that the fields in the same position (column) have the same type. Also note that relations are unordered, which means there is no guarantee that tuples are processed in any particular order. Furthermore, processing may be parallelized, in which case tuples are not processed according to any total ordering.
Schemas enable you to assign names to fields and declare types for fields. Schemas are optional, but we encourage you to use them whenever possible; type declarations result in better parse-time error checking and more efficient code execution. Schemas are defined with the LOAD, STREAM, and FOREACH operators using the AS clause.
You can define a schema that includes both the field name and field type. You can define a schema that includes the field name only; in this case, the field type defaults to bytearray. You can choose not to define a schema; in this case, the field is un-named and the field type defaults to bytearray.
If you assign a name to a field, you can refer to that field using the name or by positional notation. If you don't assign a name to a field (the field is un-named), you can only refer to the field using positional notation. If you assign a type to a field, you can subsequently change the type using the cast operators. If you don't assign a type to a field, the field defaults to bytearray; you can change the default type using the cast operators.
Pig provides extensive support for user defined functions (UDFs) as a way to specify custom processing. Pig UDFs can currently be implemented in five languages: Java, Python, JavaScript, Ruby and Groovy.
The most extensive support is provided for Java functions. You can customize all parts of the processing, including data load/store, column transformation, and aggregation. Java functions are also more efficient because they are implemented in the same language as Pig and because additional interfaces are supported. Limited support is provided for Python, JavaScript, Ruby and Groovy functions. These functions are new, still-evolving additions to the system. Currently only the basic interface is supported; load/store functions are not supported. Furthermore, JavaScript, Ruby and Groovy are provided as experimental features because they did not go through the same amount of testing as Java or Python. At runtime, Pig will automatically detect the usage of a scripting UDF in the Pig script and will automatically ship the corresponding scripting jar, either Jython, Rhino, JRuby or Groovy-all, to the backend. Pig also provides support for Piggy Bank, a repository for Java UDFs. Through Piggy Bank you can access Java UDFs written by other users and contribute Java UDFs that you have written.
Eval is the most common type of function. It can be used in FOREACH statements for whatever purpose: public String exec(Tuple input). The load/store UDFs control how data goes into Pig and comes out of Pig. Often, the same function handles both input and output, but that does not have to be the case. The Pig load/store API is aligned with Hadoop's InputFormat and OutputFormat classes. The LoadFunc abstract class is the main class to extend for implementing a loader. The methods which need to be overridden are explained below:
getInputFormat(): This method is called by Pig to get the InputFormat used by the loader.
The methods in the InputFormat (and underlying RecordReader) are called by Pig in the same manner (and in the same context) as by Hadoop in a MapReduce Java program. If the InputFormat is a Hadoop packaged one, the implementation should use the new API based one under org.apache.hadoop.mapreduce. If it is a custom InputFormat, it should be implemented using the new API in org.apache.hadoop.mapreduce. If a custom loader using a text-based InputFormat or a file-based InputFormat would like to read files in all subdirectories under a given input directory recursively, then it should use the PigTextInputFormat and PigFileInputFormat classes provided in org.apache.pig.backend.hadoop.executionengine.mapReduceLayer. The Pig InputFormat classes work around a current limitation in the Hadoop TextInputFormat and FileInputFormat classes, which only read one level down from the provided input directory. For example, if the input in the load statement is 'dir1' and there are subdirs 'dir2' and 'dir2/dir3' beneath dir1, the Hadoop TextInputFormat and FileInputFormat classes read the files under 'dir1' only. Using PigTextInputFormat or PigFileInputFormat (or by extending them), the files in all the directories can be read.
setLocation(): This method is called by Pig to communicate the load location to the loader. The loader should use this method to communicate the same information to the underlying InputFormat. This method is called multiple times by Pig - implementations should bear this in mind and should ensure there are no inconsistent side effects due to the multiple calls.
prepareToRead(): Through this method, the RecordReader associated with the InputFormat provided by the LoadFunc is passed to the LoadFunc.
The RecordReader can then be used by the implementation in getNext() to return a tuple representing a record of data back to Pig.
getNext(): The meaning of getNext() has not changed; it is called by the Pig runtime to get the next tuple in the data - in this method the implementation should use the underlying RecordReader and construct the tuple to return.
The StoreFunc abstract class has the main methods for storing data, and for most use cases it should suffice to extend it. The methods which need to be overridden in StoreFunc are explained below:
getOutputFormat(): This method will be called by Pig to get the OutputFormat used by the storer. The methods in the OutputFormat (and underlying RecordWriter and OutputCommitter) will be called by Pig in the same manner (and in the same context) as by Hadoop in a map-reduce Java program. If the OutputFormat is a Hadoop packaged one, the implementation should use the new API based one under org.apache.hadoop.mapreduce. If it is a custom OutputFormat, it should be implemented using the new API under org.apache.hadoop.mapreduce. The checkOutputSpecs() method of the OutputFormat will be called by Pig to check the output location up-front. This method will also be called as part of the Hadoop call sequence when the job is launched, so implementations should ensure that this method can be called multiple times without inconsistent side effects.
setStoreLocation(): This method is called by Pig to communicate the store location to the storer. The storer should use this method to communicate the same information to the underlying OutputFormat. This method is called multiple times by Pig - implementations should bear in mind that this method is called multiple times and should ensure there are no inconsistent side effects due to the multiple calls.
prepareToWrite(): In the new API, writing of the data is through the OutputFormat provided by the StoreFunc.
In prepareToWrite(), the RecordWriter associated with the OutputFormat provided by the StoreFunc is passed to the StoreFunc. The RecordWriter can then be used by the implementation in putNext() to write a tuple representing a record of data in a manner expected by the RecordWriter.
putNext(): The meaning of putNext() has not changed; it is called by the Pig runtime to write the next tuple of data - in the new API, this is the method wherein the implementation will use the underlying RecordWriter to write the Tuple out.