Join us as we continue this series of webinars designed for the community, by the community, with the goal of sharing knowledge, sparking innovation, and further strengthening relationships within our HPCC Systems community.
Episode 11 includes Tech Talks featuring speakers from our community, on topics covering Big Data solutions, Spark integration, and ECL tips leveraging the HPCC Systems platform.
1) Raj Chandrasekaran, CTO & Co-Founder, ClearFunnel - Scaling Data Science capabilities: Leveraging a homogeneous Big Data ecosystem
2) James McMullan, Software Engineer III, LexisNexis Risk Solutions - HDFS Connector Preview
3) Bob Foreman, Senior Software Engineer, LexisNexis Risk Solutions - Building a RELATIONal Dataset - A Valentine’s Day Special!
2. Welcome!
• Please share: Let others know you are here with #HPCCTechTalks
• Ask questions! We will answer as many questions as we can following each speaker.
• Look for polls at the bottom of your screen. Exit full-screen mode or refresh your screen if you don’t see them.
• We welcome your feedback - please rate us before you leave today and visit our blog for information after the event.
• Want to be one of our featured speakers? Let us know! techtalks@hpccsystems.com
The Download: Tech Talks #HPCCTechTalks
3. Community announcements
Dr. Flavio Villanustre
VP Technology
RELX Distinguished Technologist
LexisNexis® Risk Solutions
Flavio.Villanustre@lexisnexisrisk.com
• HPCC Systems Platform updates
• 6.4.10-1 is the latest gold version / Community Changelog
• 6.4.12 RC1 coming soon
• 7.0.0 Beta planned for early Q2 – among the key features:
• Spark integration
• Indexer
• Record Translation
• Session Management Improvements
• VS Code Beta version
• Roadmap items for 2018 and beyond
• Latest Blogs
• HPCC Systems/Tableau Web Data Connector v0.2 Tech Preview
• Machine Learning Demystified
• Reminder: 2018 Summer Internship Proposal Period Open
• Interested candidates can submit proposals from the Ideas List
• Visit the Student Wiki for more details
• Deadline to submit is April 6, 2018
• Program runs late May through mid-August
• Don’t delay!
4. Today’s speakers
Raj Chandrasekaran
CTO & Co-Founder
ClearFunnel
raj@clearfunnel.com
Raj is the CTO and Co-Founder of ClearFunnel, a Big Data analytics-as-a-service platform startup, where he leads product strategy and solutions. ClearFunnel focuses on enabling Marketing Analytics, Advanced Text Analytics, Bioinformatics, and Image Processing for clients in the Technology, Maritime, Publishing, and Healthcare domains.
Featured Community Speaker
5. Today’s speakers
James McMullan
Software Engineer III
LexisNexis Risk Solutions
James.McMullan@lexisnexisrisk.com
James has a broad range of software engineering experience, from developing low-level system drivers for X-ray fluorescence equipment to mobile video games and web applications. He is a recent addition to the LexisNexis team and is part of an internal R&D group, where he has been working on multiple projects including HPCC Systems & Spark benchmarks, integration projects between the HPCC Systems, Spark, and Hadoop ecosystems, and document storage systems.
Bob Foreman
Senior Software Engineer
LexisNexis Risk Solutions
Robert.Foreman@lexisnexisrisk.com
Bob Foreman has worked with the HPCC Systems technology platform and the ECL programming language for over 5 years, and has been a technical trainer for over 25 years. He is the developer and designer of the HPCC Systems Online Training Courses, and is the Senior Instructor for all classroom and Webex/Lync-based training.
6. Scaling Data Science Capabilities:
Leveraging a Homogeneous Big Data Ecosystem
Raj Chandrasekaran
CTO & Co-Founder
ClearFunnel
7. Quick poll:
Where have you had the most success in
deployment of HPCC Systems based solutions?
See poll on bottom of presentation screen
8. To succeed, a Big Data Analytics enterprise needs…
• An efficient Big Data ecosystem, which comprises the following key
capabilities:
• Big Data Processing
• Data Science: ML & AI
• Cloud Integration
• Leveraging these capabilities for Commercial Advantage
• Key Success Factor for any Start-up: Cost of Operations and Cash Flow
9. Big Data Processing
• Top of the list: Hadoop and Spark
• Lots of incremental innovations:
• Hadoop: MapReduce, Hive, HBase, Solr, Pig, Kafka, Yarn, Ambari, Ranger, Knox, Atlas, …
• Spark: Hadoop’s Successor, In-Memory, Directed Acyclic Graph – DAG, Stream Processing,
Machine Learning, SparkSQL, GraphX, Support for Python, Java, R and Scala, …
• Which also means, Lots of Integrations and…
• A variety of Engineering Talent
• Still, all of the above = version 1.01 in the HPCC Systems domain
HPCC Systems Capabilities: Big Data Processing
10. Data Science: ML & AI
• Traditionally, R & Python
• Current State:
• MLlib has a core set of machine learning algorithms, but is certainly not as complete as R or other machine learning libraries such as MADlib
• SparkR is a work in progress… you still need a robust ML library to implement advanced Data Science use cases
• ML is also an evolving field in the HPCC Systems domain.
• ECL-ML modules are fully parallel and cover both supervised and unsupervised models
• Extensibility: ECL is natively designed to manage data, and is therefore easily extensible to implement custom ML algorithms, including neural networks and deep learning.
• ClearFunnel Innovations using ECL-ML:
• Text Processing (Self-learning layered taxonomy, Entity and Topic Extraction, Context Analysis, Point of View Scoring)
• Image Recognition and Pattern Matching (OCR and NN based)
• Maritime Predictive Analytics (Deep Learning with geospatial and IoT streaming data)
HPCC Systems Capabilities: Big Data Processing Data Science
11. Cloud Integration
• AWS: The Big Daddy of Cloud
• Core strengths are really EC2 and S3. All other AWS capabilities and micro-services have been built around these 2 foundational
technologies.
• HPCC Systems on AWS:
• HPCC Systems provides native support for AWS (one-click deployment).
• Additionally, HPCC Systems’ simple, homogeneous tech stack makes it a breeze to operate in the cloud with minimal investment in resources and time.
• ClearFunnel Innovations:
• Spray / De-Spray data between a Thor cluster and S3 at speeds of up to 2 TBPS (Netezza’s data transfer rate is 2 – 4 TB/hr)
• Failsafe job operation (recover instantly from any failures)
• Near Real-Time, Micro-batching, Monitoring, Alert, Data Delivery APIs, etc. capabilities by integrating AWS micro-services
and HPCC Systems
• Key Principles:
• Avoid creating layers of abstractions on both ends (AWS and HPCC Systems).
• Instead integrate HPCC Systems directly with core capabilities of EC2 and S3.
HPCC Systems Capabilities: Big Data Processing Data Science Cloud
12. Leveraging HPCC Systems for Commercial Success
• ClearFunnel has implemented a full-spectrum of complex data engineering use cases
using HPCC Systems:
• Complex and large Graph traversal across nodes
• Image Analytics
• Operational Analytics with Near Real-Time and Stream Processing based Analog data
• Pattern Detection in Bioinformatics
• NLP and advanced Text Analytics
• IoT-based sensor-data integration and analytics
• Advanced Search and Querying
• Single, homogeneous tech stack:
• ClearFunnel’s Big Data Analytics Platform runs these diverse use cases with a homogeneous tech stack,
extending HPCC Systems’ capabilities to meet virtually any Big Data processing requirement
13. Key Success Criteria: Cost of Big Data Operations
• Distinctive Cost benefits from using a Homogeneous tech stack and a highly
productive ECL language
• “Fail-fast, fail-often” and multiple iterations of solution development do not involve a
lot of time, resources, and cost
• Re-use and Refactor core ML & AI modules across use cases (single language
implementation)
• Minimal Cost of Operations:
• ClearFunnel operates multiple production Big Data clusters with hundreds of nodes each, without any dedicated support staff - no Cloud Engineer, Infrastructure Engineer, Network Engineer, Production Support Engineer, DevOps Engineer, or Tech Ops Specialist!
• Enabled by efficient automation and close integration of AWS and HPCC Systems
14. Quick poll:
In your opinion, which of these use cases are most
suitable for implementing in HPCC Systems?
See poll on bottom of presentation screen
17. Quick poll:
Would you be interested in
interacting with the Hadoop
ecosystem from HPCC Systems?
See poll on bottom of presentation screen
18. Overview
• HDFS Connector Motivations
• Why are we making the connector?
• What are our goals for the connector?
• Overview of HDFS Architecture
• How is data stored in HDFS?
• How can we interact with HDFS?
• HDFS Connector Design
• Overview of how the connector works & achieves parallelism
• HDFS Connector Demo
19. HDFS Connector Motivations
• Interact with HDFS datasets and Hadoop processes
• Existing HPCC to Hadoop (h2h) Project
• No longer maintained
• Chance to improve upon h2h
• Tighter integration with HPCC
• Fewer dependencies
• Fewer failure points
• Possibility for New Features
• Variable-length record flat files
• Hadoop File Formats?
20. HDFS Connector Goals
• Robust – Should “Just Work”
• Straightforward
• Few dependencies
• Little to no configuration
• Tightly integrated
• Datasets from HDFS should be first class citizens
• Performant
• Parallelism where possible
• Reduce data transfer costs
21. Overview of HDFS Architecture
• How are files stored in HDFS?
• Stored as blocks of data & metadata
• Blocks are usually 64 MiB
• Blocks replicated for fault tolerance
• Namenode
• File metadata
• Filesystem namespace
• Datanodes
• Blocks of data
• No knowledge of files
[Diagram: a Namenode holding file metadata, with multiple Datanodes storing data blocks]
22. Overview of HDFS Architecture
• Reading & Writing in HDFS
• Namenode arbitrates reads & writes
• Datanodes fulfill reads & writes
• Multiple readers / Single writer
• Client Applications
• Java Hadoop or native libHDFS libraries
• Messaging uses Google Protocol Buffers
[Diagram: multiple client applications communicating with the Namenode and the Datanodes]
23. HDFS Connector Design – Communicating with HDFS
• Java Hadoop or native libHDFS library?
• libHDFS relies on the Java Hadoop libraries
• Both require Hadoop to be installed locally
• Google Protocol Buffers?
• Possible but a lot of work
• libHDFS3
• Part of Apache HAWQ
• Completely native implementation of libHDFS
24. HDFS Connector Design – HPCC Integration
• ECL PIPE?
• High data transfer costs
• Loosely coupled
• Leverages native ECL
• Import Java Library?
• High data transfer costs
• Adds lots of dependencies
• Parallelism is difficult
• Native ECL Plugin?
• Low data transfer costs
• Fewest dependencies
• Parallelism is possible
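Of the integration options above, ECL PIPE is the easiest to sketch. This is a hypothetical example - the record layout, the HDFS path, and the assumption that the Hadoop CLI is available on every Thor node are all illustrative, not part of the connector itself:

```ecl
// Hypothetical layout of a CSV file stored in HDFS
CsvRec := RECORD
    STRING20 name;
    STRING10 state;
END;

// PIPE as a dataset source: run an external command and parse its
// stdout as CSV. Simple and loosely coupled, but the data is streamed
// through an external process - the "high data transfer costs" noted above.
HdfsData := PIPE('hdfs dfs -cat /data/people.csv', CsvRec, CSV);

OUTPUT(HdfsData);
```

This illustrates why PIPE is listed as loosely coupled: HPCC has no knowledge of HDFS block locations, so it cannot align reads with the data, which motivates the native plugin approach.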
25. HDFS Connector Design – Reading Data in Parallel
• CSV Files & Fixed-Record Flat Files
• Break HDFS file into logical chunks
• One chunk per HPCC node
• Chunks aren’t record aligned
• Consume records that begin in our chunk
• Variable-Length Record Flat Files
• Need record split metadata
• Create split metadata on write
• Preprocess step if no metadata
[Diagram: an HDFS file split into chunks, one per HPCC node; each node consumes the records beginning in its own chunk, with split metadata marking record boundaries for variable-length records]
26. HDFS Connector Design – Writing to HDFS
• HDFS is single writer
• Single File
• Each Thor node writes its data to the file in sequence
• Requires Append mode to be enabled
• Interacts well with existing HDFS ecosystem
• Multiple File Parts
• Similar to how HPCC stores files
• Parallel writing
• Existing Hadoop applications would need to be updated
27. HDFS Connector Demo – Writing a dataset to HDFS
28. HDFS Connector Demo – Reading a dataset from HDFS
29. HDFS Connector Demo – Working with HDFS Datasets
30. Quick poll:
Do you currently use HDFS as a data
store?
See poll on bottom of presentation screen
32. ECL Tips and Cool Tricks –
Building a Relational Dataset
Bob Foreman
Senior Software Engineer
LexisNexis Risk Solutions
33. Quick poll:
Have you ever worked with a relational
denormalized dataset in ECL?
See poll on bottom of presentation screen
34. Background
• Most of our datasets on an HPCC cluster are organized in a normalized
architecture.
• A unique linking field in one dataset can be used to join with other datasets
using a one-to-one or a one-to-many relationship.
• At LexisNexis, we affectionately refer to this architecture as the “Data Donut”.
35. The LN Data “Donut”
[Diagram: the LN Data “Donut” - datasets linked by fields such as DID, ADL, IDL, LinkID, and LexID]
Sometimes, analyzing or querying this normalized data can be challenging. Enter the “denormalized” dataset!
36. Given a sample 3-level hierarchical relational database:
People
├─ Vehicle
└─ Property
   └─ Taxdata
Example Data: [shown on slide]
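In ECL, a hierarchy like this is typically declared with nested child datasets. A minimal sketch - the field names below are illustrative, not the actual demo layouts:

```ecl
TaxRec := RECORD
    UNSIGNED4 assess_year;
    UNSIGNED8 tax_amount;
END;

PropertyRec := RECORD
    STRING40 address;
    DATASET(TaxRec) Taxdata;          // third level: tax records per property
END;

VehicleRec := RECORD
    STRING20 make;
    STRING20 model;
END;

PeopleRec := RECORD
    UNSIGNED8 personid;
    STRING30  name;
    DATASET(VehicleRec)  Vehicles;    // second level
    DATASET(PropertyRec) Properties;  // second level, nesting Taxdata
END;
```

One parent record carries its entire family of child records, which is exactly the shape the denormalization on the next slide produces.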
37. Denormalizing Related Data:
[Diagram: one denormalized record - the People record followed by its Vehicle records, then each Property record followed by its Taxdata records, running from “Start One Record” to “End of record” across continued rows]
39. DENORMALIZE(parentoutput, childrecset, condition, transform)
parentoutput – The set of parent records already formatted as the result of the
combination.
childrecset – The set of child records to process.
condition – An expression that specifies how to match records between the
parent and child records.
transform – The TRANSFORM function to call.
The DENORMALIZE function forms flat file records from a parent and any number
of children.
The transform function must take at least two parameters: a LEFT record of the same format as the resulting combined parent and child records, and a RIGHT record of the same format as the childrecset. An optional integer COUNTER parameter can be included, indicating the current iteration through the child records.
DENORMALIZE Function:
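Putting the syntax above together, here is a minimal sketch following the common pattern from the ECL Language Reference. The dataset contents and field names are illustrative; the result nests each parent's child records inside a child DATASET field:

```ecl
ParentRec := RECORD
    UNSIGNED1 NameID;
    STRING20  Name;
END;

ChildRec := RECORD
    UNSIGNED1 NameID;
    STRING20  Addr;
END;

ParentDS := DATASET([{1, 'Kevin'}, {2, 'Liz'}], ParentRec);
ChildDS  := DATASET([{1, '10 Main St'}, {1, '20 Oak Ave'},
                     {2, '5 Elm St'}], ChildRec);

// Combined layout: parent fields plus a nested child dataset
CombinedRec := RECORD
    ParentRec;
    UNSIGNED1 NumRows;
    DATASET(ChildRec) Children;
END;

// Seed the combined layout from the parent records
CombinedRec InitParent(ParentRec L) := TRANSFORM
    SELF.NumRows  := 0;
    SELF.Children := [];
    SELF          := L;
END;

// Append one child per call; COUNTER is the iteration number
CombinedRec AddChild(CombinedRec L, ChildRec R, INTEGER C) := TRANSFORM
    SELF.NumRows  := C;
    SELF.Children := L.Children + R;
    SELF          := L;
END;

DeNormed := DENORMALIZE(PROJECT(ParentDS, InitParent(LEFT)), ChildDS,
                        LEFT.NameID = RIGHT.NameID,
                        AddChild(LEFT, RIGHT, COUNTER));
OUTPUT(DeNormed);
```

Each of Kevin's two addresses is folded into his single denormalized record, with NumRows recording how many children were absorbed.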
40. Querying Relational Data: Implicit Dataset Relationality (nested child datasets)
• Parent record fields are always in memory when operating at the level of the Child.
• You may only reference the related set of Child records when operating at the level of the Parent.
[Diagram: the People → Vehicle / Property → Taxdata hierarchy]
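Both rules can be illustrated with a small nested dataset (names and layouts are illustrative, echoing the People/Vehicle/Property/Taxdata hierarchy):

```ecl
VehicleRec := RECORD
    STRING20 make;
END;
PropertyRec := RECORD
    STRING40 address;
END;
PeopleRec := RECORD
    STRING30 name;
    DATASET(VehicleRec)  Vehicles;
    DATASET(PropertyRec) Properties;
END;

People := DATASET([{'Kevin', [{'Ford'}, {'Mazda'}], [{'10 Main St'}]},
                   {'Liz',   [{'Jeep'}],            []}], PeopleRec);

// Parent level: the child dataset is referenced as a whole,
// e.g. an aggregate filter over each person's Vehicles
MultiCarOwners := People(COUNT(Vehicles) > 1);

// Child level: parent fields stay in scope while children are processed,
// e.g. flattening Properties while carrying the parent's name along
OutRec := RECORD
    STRING30 name;
    STRING40 address;
END;
Flat := NORMALIZE(People, LEFT.Properties,
                  TRANSFORM(OutRec, SELF.name    := LEFT.name,
                                    SELF.address := RIGHT.address));

OUTPUT(MultiCarOwners);
OUTPUT(Flat);
```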
41. NORMALIZE(recordset, expression, transform)
recordset – The set of records to process.
expression – A numeric expression specifying the total number of times to call the transform for that record.
transform – The TRANSFORM function to call for each record in the recordset.
The NORMALIZE function iterates through all the records in the recordset, performing the transform function the expression number of times on each record in turn, to produce relational child records of the parent.
The transform function must take two parameters: A LEFT record of the same
format as the recordset, and an integer COUNTER specifying the number of times
to call the transform for that record. The format of the resulting recordset can be
different from the input.
NORMALIZE Function
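A minimal sketch of the counter form described above (names are illustrative), splitting two address fields on each input record back out into individual child records - the inverse direction of DENORMALIZE:

```ecl
InRec := RECORD
    UNSIGNED1 NameID;
    STRING20  Addr1;
    STRING20  Addr2;
END;

InDS := DATASET([{1, '10 Main St', '20 Oak Ave'},
                 {2, '5 Elm St',   ''}], InRec);

OutRec := RECORD
    UNSIGNED1 NameID;
    STRING20  Addr;
END;

// The transform is called twice per input record; COUNTER picks the field
OutRec SplitAddrs(InRec L, INTEGER C) := TRANSFORM
    SELF.NameID := L.NameID;
    SELF.Addr   := CHOOSE(C, L.Addr1, L.Addr2);
END;

ChildRecs := NORMALIZE(InDS, 2, SplitAddrs(LEFT, COUNTER));
OUTPUT(ChildRecs);
```

A filter such as ChildRecs(Addr <> '') would then drop the empty slots produced by records with fewer addresses.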
43. Summary
• Using a denormalized dataset can improve the power of your queries and reveal hidden relationships in the data.
• ECL provides powerful, easy-to-use support for moving from a normalized to a denormalized format when needed.
• Knowing how to move in both directions, and the best practices for doing so, is a valuable skill for every ECL developer.
44. In closing: LOVE YOUR DATA!
45. Quick poll:
After today’s ECL Tech Tip, will you use
DENORMALIZE for any advanced query
applications?
See poll on bottom of presentation screen
47. • Have a new success story to share?
• Want to pitch a new use case?
• Have a new HPCC Systems application you want to demo?
• Want to share some helpful ECL tips and sample code?
• Have a new suggestion for the roadmap?
• Be a featured speaker for an upcoming episode! Email your idea to
Techtalks@hpccsystems.com
• Visit The Download Tech Talks wiki for more information:
https://wiki.hpccsystems.com/display/hpcc/HPCC+Systems+Tech+Talks
Mark your calendar for the March 15 Tech Talk -
More machine learning topics coming!
Watch our Events page for details.
Submit a talk for an upcoming episode!
48. A copy of this presentation will be made available soon on our blog:
hpccsystems.com/blog
Thank You!