1. HPCC (High-Performance Computing Cluster) is a massive parallel-processing computing platform that solves Big Data problems. The platform is now Open Source!
2. HPCC vs Hadoop
Declarative programming language: Describe what needs to be done and not
how to do it
Powerful: Unlike Java, high-level primitives such as JOIN, TRANSFORM, PROJECT,
SORT, DISTRIBUTE, MAP, etc. are available (a short sketch follows this list).
Higher-level code means fewer programmers and a shorter time to deliver
complete projects
Extensible: As new attributes are defined, they become primitives that other
programmers can use
Implicitly parallel: Parallelism is built into the underlying platform. The
programmer need not be concerned with it
Maintainable: A high-level programming language, no side effects, and attribute
encapsulation provide for more succinct, reliable, and easier-to-troubleshoot
code
Complete: Unlike Pig and Hive, ECL provides for a complete programming
paradigm.
Homogeneous: One language to express data algorithms across the entire
HPCC platform, including data ETL and delivery.
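As a sketch of those high-level primitives, the following ECL (with hypothetical record and field names, not taken from the HPCC documentation) defines a TRANSFORM and applies it across a dataset with PROJECT; once defined, MakeFull becomes an attribute any other programmer can reuse.
// Hypothetical record layouts, for illustration only
NameRec := RECORD
  STRING FirstName;
  STRING LastName;
END;
FullRec := RECORD
  STRING FullName;
END;
People := DATASET([{'john','smith'}, {'jane','doe'}], NameRec);
// A reusable TRANSFORM: builds one output record from one input record
FullRec MakeFull(NameRec L) := TRANSFORM
  SELF.FullName := L.FirstName + ' ' + L.LastName;
END;
// PROJECT applies the TRANSFORM to every record; SORT orders the result
FullNames := PROJECT(People, MakeFull(LEFT));
OUTPUT(SORT(FullNames, FullName));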
3. The Enterprise Control Language (ECL)
HPCC Systems Enterprise Control Language (ECL) is the query and control
language developed to manage all aspects of the massive data joins, sorts
and builds. ECL truly differentiates HPCC from other technologies in its ability to
provide flexible data analysis on a massive scale.
ECL is a declarative language optimized for the manipulation of massive data
sets and provides for modular structured programming. Moreover, ECL is a
transparent and implicitly parallel programming language which is both
powerful and flexible, allowing for faster and more effective development
cycles, through higher expressiveness, encapsulation and code reuse.
Data analysts can “express” complex queries without the need for the iterative,
time-consuming data transformations and sorts associated with other
programming languages. Traditional low-level languages (Java, C++, etc.)
force the translation of business requirements into functional requirements before
programming can occur. The abstract nature of ECL eliminates the need for
this by making it easy to express business rules directly and succinctly.
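A minimal, hypothetical sketch of expressing such a business rule directly in ECL (the layout and data are illustrative, not from the HPCC documentation): “accounts with a balance over 1000, newest first”.
AcctRec := RECORD
  STRING20 Name;
  DECIMAL10_2 Balance;
  UNSIGNED4 Opened;   // date stored as YYYYMMDD
END;
Accounts := DATASET([{'Acme', 1500.00, 20230105},
                     {'Bolt', 250.00, 20220310},
                     {'Core', 2100.50, 20230920}], AcctRec);
// The filter expresses the rule; the leading minus sorts descending
HighValue := SORT(Accounts(Balance > 1000), -Opened);
OUTPUT(HighValue);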
4. HPCC System Architecture
The HPCC system architecture includes two distinct cluster processing
environments, each of which can be optimized independently for its parallel
data processing purpose. The first of these platforms is called the Data Refinery,
whose overall purpose is the general processing of massive volumes of raw data
of any type for any purpose, but which is typically used for data cleansing and
hygiene, ETL processing of the raw data, record linking and entity resolution,
large-scale ad hoc complex analytics, and creation of keyed data and indexes to
support high-performance structured queries and data warehouse applications.
The Data Refinery is also referred to as Thor.
A Thor cluster is similar in its function, execution environment, filesystem, and
capabilities to the Google and Hadoop MapReduce platforms.
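The sketch below illustrates the kind of cleansing/ETL job described above, using assumed logical file names and an assumed layout; it normalises an email field and removes duplicate records.
IMPORT STD;
RawRec := RECORD
  STRING40 Email;
  STRING60 Name;
END;
// Assumed Thor logical file name, for illustration only
Raw := DATASET('~demo::raw::contacts', RawRec, THOR);
// Normalise the email address; SELF := L copies the remaining fields
RawRec Clean(RawRec L) := TRANSFORM
  SELF.Email := STD.Str.ToLowerCase(L.Email);
  SELF := L;
END;
Cleaned := PROJECT(Raw, Clean(LEFT));
// DEDUP removes adjacent duplicates, so the data is sorted first
Deduped := DEDUP(SORT(Cleaned, Email), Email);
OUTPUT(Deduped, , '~demo::clean::contacts', OVERWRITE);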
5. The diagram shows a representation of a physical Thor processing cluster, which functions as a batch-job execution
engine for scalable, data-intensive computing applications. In addition to the Thor master and slave
nodes, additional auxiliary and common components are needed to implement a complete HPCC
processing environment.
6. Roxie (rapid data delivery engine)
The second of the parallel data processing platforms is called Roxie and
functions as a rapid data delivery engine.
This platform is designed as an online, high-performance structured query and
analysis platform, or data warehouse, that delivers the parallel data-access
processing requirements of online applications through web-services interfaces,
supporting thousands of simultaneous queries and users with sub-second
response times.
Roxie utilizes a distributed indexed filesystem to provide parallel processing of
queries using an optimized execution environment and filesystem for high-
performance online processing.
A Roxie cluster is similar in its function and capabilities to Hadoop with HBase
and Hive capabilities added, and provides for near real time predictable query
latencies.
Both Thor and Roxie clusters utilize the ECL programming language for
implementing applications, increasing continuity and programmer productivity.
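As a hedged sketch of what such a query looks like in ECL (the index name, layout, and parameter name are assumptions for illustration), a Roxie query is typically a keyed read against a pre-built index, parameterised with STORED so it can be published and invoked through a web-service interface.
// Declares an existing index file: key fields, then payload, then file name
PeopleKey := INDEX({STRING25 LastName, STRING15 FirstName}, {UNSIGNED8 Id},
                   '~demo::key::people_by_lastname');
// STORED exposes the value as a query parameter when the query is published
STRING25 searchLast := '' : STORED('LastName');
// Keyed filter against the index; returns matching records
OUTPUT(PeopleKey(LastName = searchLast));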
8. Continued…
The diagram shows a representation of a physical Roxie processing cluster,
which functions as an online query execution engine for high-
performance query and data warehousing applications.
A Roxie cluster includes multiple nodes with server and worker
processes for processing queries; an additional auxiliary component
called an ESP server which provides interfaces for external client
access to the cluster; and additional common components which
are shared with a Thor cluster in an HPCC environment. Although a
Thor processing cluster can be implemented and used without a
Roxie cluster, an HPCC environment which includes a Roxie cluster
should also include a Thor cluster. The Thor cluster is used to build the
distributed index files used by the Roxie cluster and to develop
online queries which will be deployed with the index files to the
Roxie cluster.
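The Thor side of that workflow might look like the following sketch (the layout and logical file names are assumptions): a payload index is declared over a Thor dataset and then built, producing the distributed index file that the Roxie query sketched earlier reads.
PersonRec := RECORD
  UNSIGNED8 Id;
  STRING15 FirstName;
  STRING25 LastName;
END;
// Assumed Thor logical file, for illustration only
People := DATASET('~demo::thor::people', PersonRec, THOR);
// Key fields, payload field, and the target index file name
PeopleKey := INDEX(People, {LastName, FirstName}, {Id},
                   '~demo::key::people_by_lastname');
// BUILD writes the distributed index file across the Thor cluster
BUILD(PeopleKey, OVERWRITE);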
9. More on ECL (data-centric programming language)
ECL is a declarative, data-centric programming language designed in 2000 to allow a
team of programmers to process big data across a high-performance computing cluster
without the programmer being involved in many of the lower-level, imperative decisions.
Sorting problem
// First declare a dataset with one column containing a list of strings
// Datasets can also be binary, CSV, XML or externally defined structures
D := DATASET([{'ECL'}, {'Declarative'}, {'Data'}, {'Centric'},
              {'Programming'}, {'Language'}], {STRING Value;});
SD := SORT(D, Value);
OUTPUT(SD);
10. More on ECL (data-centric programming language)
ECL primitives that act upon datasets include: SORT, ROLLUP, DEDUP, ITERATE,
PROJECT, JOIN, NORMALIZE, DENORMALIZE, PARSE, CHOOSEN, ENTH, TOPN, and
DISTRIBUTE.
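A minimal sketch (with illustrative data) of two of these primitives working together: SORT groups identical words, then ROLLUP merges matching records pairwise to accumulate counts.
WordRec := RECORD
  STRING10 Word;
  UNSIGNED4 Cnt;
END;
Words := DATASET([{'data',1},{'data',1},{'ecl',1},{'ecl',1},{'ecl',1}], WordRec);
// Merge two matching records into one, summing the counts
WordRec DoRollup(WordRec L, WordRec R) := TRANSFORM
  SELF.Cnt := L.Cnt + R.Cnt;
  SELF := L;
END;
Counts := ROLLUP(SORT(Words, Word), LEFT.Word = RIGHT.Word, DoRollup(LEFT, RIGHT));
OUTPUT(Counts);   // yields {'data', 2} and {'ecl', 3}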
Comparison to Map-Reduce
The Hadoop Map-Reduce paradigm actually consists of three phases which
correlate to ECL primitives as follows.