©2016 TechIPm, LLC All Rights Reserved http://www.techipm.com/
Understanding Big Data Platform from Patents
The Hadoop big data platform is based on the MapReduce framework. Google patent US7650331, titled “System
and method for efficient large-scale data processing,” described the MapReduce framework for the first time. Its
claim 1 clearly explains the function of the Map and Reduce processes as follows.
A system for large-scale processing of data, comprising:
a plurality of processes executing on a plurality of interconnected processors; (Distributed Processor (File) System)
the plurality of processes including a master process, for coordinating a data processing job for processing a set of
input data, and worker processes; (JobTracker)
the master process, in response to a request to perform the data processing job, assigning input data blocks of the
set of input data to respective ones of the worker processes; (<key, value> pairs)
each of a first plurality of the worker processes including an application-independent map module for retrieving a
respective input data block assigned to the worker process by the master process and applying an application-
specific map operation to the respective input data block to produce intermediate data values, wherein at least a
subset of the intermediate data values each comprises a key/value pair, and wherein at least two of the first
plurality of the worker processes operate simultaneously so as to perform the application-specific map operation in
parallel on distinct, respective input data blocks; (Map Step)
a partition operator for processing the produced intermediate data values to produce a plurality of intermediate data
sets, wherein each respective intermediate data set includes all key/value pairs for a distinct set of respective keys,
and wherein at least one of the respective intermediate data sets includes respective ones of the key/value pairs
produced by a plurality of the first plurality of the worker processes; and (Intermediate Step)
each of a second plurality of the worker processes including an application-independent reduce module for
retrieving data, the retrieved data comprising at least a subset of the key/value pairs from a respective intermediate
data set of the plurality of intermediate data sets and applying an application-specific reduce operation to the
retrieved data to produce final output data corresponding to the distinct set of respective keys in the respective
intermediate data set of the plurality of intermediate data sets, and wherein at least two of the second plurality of
the worker processes operate simultaneously so as to perform the application-specific reduce operation in parallel
on multiple respective subsets of the produced intermediate data values. (Reduce Step)
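
The elements of the claim above can be illustrated with a minimal single-machine sketch in Python. It is not the patented implementation: the word-count job, the function names (map_words, reduce_counts, partition, run_job), and the use of a thread pool in place of worker processes on interconnected processors are all assumptions made for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_words(block):
    """Application-specific map operation: emit a (word, 1) pair per word."""
    return [(word.lower(), 1) for word in block.split()]


def reduce_counts(key, values):
    """Application-specific reduce operation: sum the values of one key."""
    return key, sum(values)


def partition(pairs, num_partitions):
    """Partition operator: all pairs sharing a key land in the same set."""
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in pairs:
        # Sum of code points is a deterministic stand-in for a hash function.
        index = sum(map(ord, key)) % num_partitions
        partitions[index][key].append(value)
    return partitions


def run_job(blocks, map_fn, reduce_fn, num_reducers=2):
    """Stand-in for the master process: hand blocks to map workers,
    partition the intermediate pairs, then hand each set to a reduce worker."""

    def reduce_partition(part):
        return [reduce_fn(key, values) for key, values in part.items()]

    with ThreadPoolExecutor() as pool:
        # Map step: one task per input data block, run concurrently.
        mapped = pool.map(map_fn, blocks)
        pairs = [pair for output in mapped for pair in output]
        # Intermediate step: group the pairs by key into intermediate sets.
        partitions = partition(pairs, num_reducers)
        # Reduce step: one task per intermediate data set, run concurrently.
        reduced = pool.map(reduce_partition, partitions)
        return dict(item for output in reduced for item in output)


blocks = ["big data platform", "big data patents", "data platform"]
word_counts = run_job(blocks, map_words, reduce_counts)
print(word_counts)
```

The map and reduce functions are the only application-specific parts. The run_job and partition helpers play the role of the application-independent modules, so a different job would only swap the two functions passed in.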
MapR’s patent application US20110313973, titled “Map-Reduce Ready Distributed File System,” further
developed the MapReduce framework, including a shuffle function that uses the distributed file system (DFS), in its
claim 1 as follows.
A map-reduce compatible shuffle function, comprising:
a distributed file system; and
a map-reduce system, wherein each map function writes to the distributed file system and each reduce function
reads input from the distributed file system.
The shuffle step redistributes the produced intermediate data from the map step, such that all data belonging to one
key is located on the same worker node.
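
A shuffle of this kind, in which each map function writes to a file system and each reduce function reads from it, can be sketched as follows. This is only an illustration under stated assumptions: a local temporary directory stands in for the distributed file system, and the file naming scheme, the CRC32-based partitioning, and the function names (partition_of, map_task, reduce_task, run_shuffle) are hypothetical and not taken from the patent application.

```python
import json
import tempfile
import zlib
from collections import defaultdict
from pathlib import Path


def partition_of(key, num_reducers):
    """Stable partition index, so a given key always goes to one reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def map_task(task_id, block, dfs_root, num_reducers):
    """Map side: write (word, 1) pairs to one file per reducer partition."""
    buckets = defaultdict(list)
    for word in block.split():
        buckets[partition_of(word, num_reducers)].append([word, 1])
    for index, pairs in buckets.items():
        path = dfs_root / f"map{task_id}-part{index}.json"
        path.write_text(json.dumps(pairs))


def reduce_task(index, dfs_root):
    """Reduce side: read every map output for this partition, sum per key."""
    totals = defaultdict(int)
    for path in sorted(dfs_root.glob(f"map*-part{index}.json")):
        for word, count in json.loads(path.read_text()):
            totals[word] += count
    return dict(totals)


def run_shuffle(blocks, num_reducers=2):
    """Run all map tasks, then all reduce tasks, over a shared directory."""
    with tempfile.TemporaryDirectory() as tmp:
        dfs_root = Path(tmp)  # local stand-in for the distributed file system
        for task_id, block in enumerate(blocks):
            map_task(task_id, block, dfs_root, num_reducers)
        return [reduce_task(index, dfs_root) for index in range(num_reducers)]


reducer_outputs = run_shuffle(["big data platform", "big data patents"])
print(reducer_outputs)
```

Because the partition index depends only on the key, every pair for a given word ends up in files read by a single reducer, which is the property the shuffle step has to guarantee.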
Hadoop is a platform designed for large datasets (datasets measured in the terabytes, petabytes, or even greater data
size) that leverages the MapReduce framework. However, many existing websites and applications are built on
systems that differ greatly from those that can take advantage of large quantities of data. To use the Hadoop big
data platform, such systems normally have to be re-engineered. Treasure Data’s
patent application US20130124483, titled “System and method for operating a big-data platform,” illustrates a
system for integrating existing websites and applications with the big-data platform without the re-engineering
process.
The system for operating a big data platform includes a data analysis platform that receives discrete client data
(formatted as a plurality of key-value pairs in row format); a network accessible distributed storage system (hosted
on a distributed cloud storage system such as Amazon's S3/EC2) that stores the client data in a real-time storage
system and merges the client data into a columnar-based distributed archive storage system (using MapReduce);
and a query interface that receives a data query request and selectively interfaces with the client data from the real-time
storage system and the archive storage system according to the query.
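
The three components described above can be sketched in a few lines of Python. This is a toy in-memory model, not the system in the patent application: the class name BigDataPlatformSketch and its methods are hypothetical, plain Python lists stand in for the cloud-hosted storage systems, and a simple loop replaces the MapReduce job that performs the merge.

```python
from collections import defaultdict


class BigDataPlatformSketch:
    """Toy model: row-format real-time store plus columnar archive store."""

    def __init__(self):
        self.realtime_rows = []                   # recent records, row format
        self.archive_columns = defaultdict(list)  # merged records, by column
        self.archived_count = 0                   # number of rows in archive

    def ingest(self, record):
        """Receive one discrete client record as key-value pairs."""
        self.realtime_rows.append(dict(record))

    def merge_to_archive(self):
        """Move real-time rows into the columnar archive (plain loop here,
        where the described system uses MapReduce)."""
        for row in self.realtime_rows:
            for column in set(self.archive_columns) | set(row):
                values = self.archive_columns[column]
                # Pad a newly seen column for rows archived before it existed.
                values.extend([None] * (self.archived_count - len(values)))
                values.append(row.get(column))
            self.archived_count += 1
        self.realtime_rows.clear()

    def query(self, column):
        """Return one column, reading the archive first, then real-time rows."""
        archived = list(
            self.archive_columns.get(column, [None] * self.archived_count)
        )
        recent = [row.get(column) for row in self.realtime_rows]
        return archived + recent


platform = BigDataPlatformSketch()
platform.ingest({"user": "a", "page": "/home"})
platform.ingest({"user": "b", "page": "/docs"})
platform.merge_to_archive()
platform.ingest({"user": "c", "page": "/home", "referrer": "search"})
pages = platform.query("page")
referrers = platform.query("referrer")
print(pages, referrers)
```

The query method hides the split between the two stores from the caller, which mirrors the role of the query interface: a client sees one result set whether a record has already been merged into the archive or is still in real-time storage.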