10. Cluster Specification
<?xml version="1.0"?>
<cluster colo="NJ-datacenter" description="" name="prod-cluster" xmlns="uri:falcon:cluster:0.1">
<interfaces>
<interface type="readonly" endpoint="hftp://nn:50070" version="2.2.0" />
<interface type="write" endpoint="hdfs://nn:8020" version="2.2.0" />
<interface type="execute" endpoint="rm:8050" version="2.2.0" />
<interface type="workflow" endpoint="http://os:11000/oozie/" version="4.0.0" />
<interface type="registry" endpoint="thrift://hms:9083" version="0.12.0" />
<interface type="messaging" endpoint="tcp://mq:61616?daemon=true" version="5.1.6" />
</interfaces>
<locations>
<location name="staging" path="/apps/falcon/prod-cluster/staging" />
<location name="temp" path="/tmp" />
<location name="working" path="/apps/falcon/prod-cluster/working" />
</locations>
</cluster>
readonly – needed by DistCp for replications
write – writing to HDFS
execute – used to submit processes as MapReduce
workflow – submit Oozie jobs
registry – Hive metastore, to register/deregister partitions and get events on partition availability
messaging – used for alerts
locations – HDFS directories used by the Falcon server
11. Feed Specification
<?xml version="1.0"?>
<feed description="" name="testFeed" xmlns="uri:falcon:feed:0.1">
<frequency>hours(1)</frequency>
<late-arrival cut-off="hours(6)"/>
<groups>churnAnalysisFeeds</groups>
<tags>externalSource=TeradataEDW-1,externalTarget=Marketing</tags>
<clusters>
<cluster name="cluster-primary" type="source">
<validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
<retention limit="days(2)" action="delete"/>
</cluster>
<cluster name="cluster-secondary" type="target">
<validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
<location type="data" path="/churn/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
<retention limit="days(7)" action="delete"/>
</cluster>
</clusters>
<locations>
<location type="data" path="/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
</locations>
<ACL owner="hdfs" group="users" permission="0755"/>
<schema location="/none" provider="none"/>
</feed>
frequency – feed run frequency in minutes/hours/days/months
late-arrival – late arrival cut-off
groups – feeds can belong to multiple groups
tags – metadata tagging
clusters – one or more source and target clusters for retention and replication
locations – global location across clusters: HDFS paths or Hive tables
ACL – access permissions
12. Process Specification
<process name="process-test" xmlns="uri:falcon:process:0.1">
<clusters>
<cluster name="cluster-primary">
<validity start="2011-11-02T00:00Z" end="2011-12-30T00:00Z" />
</cluster>
</clusters>
<parallel>1</parallel>
<order>FIFO</order>
<frequency>days(1)</frequency>
<inputs>
<input end="today(0,0)" start="today(0,0)" feed="feed-clicks-raw" name="input" />
</inputs>
<outputs>
<output instance="now(0,2)" feed="feed-clicks-clean" name="output" />
</outputs>
<workflow engine="pig" path="/apps/clickstream/clean-script.pig" />
<retry policy="periodic" delay="minutes(10)" attempts="3"/>
<late-process policy="exp-backoff" delay="hours(1)">
<late-input input="input" workflow-path="/apps/clickstream/late" />
</late-process>
</process>
parallel/order/frequency – how frequently the process runs, how many instances can run in parallel, and in what order
clusters – which cluster the process should run on and when
inputs/outputs – input and output feeds for the process
workflow – the processing logic
retry – retry policy on failure
late-process – handling late input feeds
13. Late Data Handling
Defines how the late (out of band) data is handled
Each Feed can define a late cut-off value
<late-arrival cut-off="hours(4)"/>
Each Process can define how this late data is handled
<late-process policy="exp-backoff" delay="hours(1)">
<late-input input="input" workflow-path="/apps/clickstream/late" />
</late-process>
Policies include:
backoff
exp-backoff
final
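As a sketch, a late-process block using the final policy (reusing the input name and workflow path from the example above) would check for late data only once, at the feed's late cut-off:
<late-process policy="final" delay="hours(1)">
<late-input input="input" workflow-path="/apps/clickstream/late" />
</late-process>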
16. Apache Falcon: One-stop Shop for Data Management
Provides (data management needs): multi-cluster management, replication, scheduling, data reprocessing, dependency management, eviction, governance
Orchestrates (tools): Oozie, Sqoop, DistCp, Flume, MapReduce, Hive and Pig jobs
Falcon provides a single interface to orchestrate the data lifecycle.
Sophisticated DLM is easily added to Hadoop applications.
21. Physical Architecture
• STANDALONE
– Single Data Center
– Single Falcon Server
– Hadoop jobs and relevant processing involve only one cluster
• DISTRIBUTED
– Multiple Data Centers
– Falcon Server per DC
– Multiple instances of Hadoop clusters and workflow schedulers
[Diagrams: in the standalone deployment, a single Falcon server at Site 1 manages one Hadoop (store & process) cluster, with replication to a second Hadoop cluster. In the distributed deployment, a Falcon Prism server coordinates standalone Falcon servers at Site 1 and Site 2, each managing its own Hadoop (store & process) cluster, with replication between the sites.]
26. Processing – Single Data Center
[Diagram: Ad Request data, Impression render events, Click events and Conversion events arrive as continuous streams (minutely); an Enrichment step (minutely / 5-minutely) feeds a Summarizer that produces an hourly summary.]
27. Global Aggregation
[Diagram: the same pipeline (continuous minutely streams of Ad Request data, Impression render, Click and Conversion events; minutely / 5-minutely Enrichment; a Summarizer producing an hourly summary) runs in each site from DataCenter1 through DataCenterN, and the per-site summaries are combined into a consumable global aggregate.]
In a typical big data environment involving Hadoop, the use cases tend to be around processing very large volumes of data for either machine or human consumption. Some of the data that reaches the Hadoop platform can contain critical business and financial information. The data processing team in such an environment is often distracted by a multitude of data management and process orchestration challenges, to name a few:
– Ingesting large volumes of events/streams
– Ingesting slowly changing data, typically available in a traditional database
– Creating a pipeline / sequence of processing logic to extract the desired piece of insight / information
– Handling processing complexities relating to changes in data or failures
– Managing eviction of older data elements
– Backing up the data to an alternate location, or archiving it in cheaper storage, for DR/BCP and compliance requirements
– Shipping data out of the Hadoop environment periodically for machine or human consumption
These tend to be standard challenges that are better handled in a platform, which lets the data processing team focus on their core business application. A platform approach also allows best practices for each of these to be adopted once and leveraged by subsequent users of the platform.
As we just noted, numerous data and process management services, when made available to the data processing team, can reduce their day-to-day complexities significantly and allow them to focus on their business application. This is an enumeration of such services, which we intend to cover in adequate detail as we go along.
Pipelines: more often than not, pipelines are sequences of data processing or data movement tasks that need to happen before raw data can be transformed into a meaningfully consumable form. Normally the end stage of the pipeline, where the final sets of data are produced, is on the critical path and may be subject to tight SLA bounds. If any step in the sequence is delayed or fails, the whole pipeline can stall, so it is important that each step hands off to the next without buffering time, allowing seamless progression of the pipeline. People who are familiar with Apache Oozie will recognize this feature as provided through the Coordinator. As pipelines become more time critical and time sensitive, this becomes essential, and it ought to be available off the shelf for application developers. It also needs to scale to support concurrent pipelines.
Retention: the fact that data volumes are large and increasing by the day is the reason one adopts a big data platform like Hadoop, and that automatically means we would run out of space fairly soon if we did not evict and purge older instances of data. A few problems to consider for retention:
– Avoid using a general-purpose superuser with world-writable privileges to delete old data (for obvious reasons)
– Different types of data may require different criteria for aging and hence purging
– Other lifecycle functions, such as archival of old data, ought to be scheduled before eviction kicks in
Replication: Hadoop is increasingly critical for many businesses. For some users the raw data volumes are too large to be shipped to one place for processing; for others, data needs to be redundantly available for business continuity reasons. In either scenario, replication of data from one cluster to another plays a vital role, and having it available as a service frees the application developer from these responsibilities. Key challenges to consider while offering this as a service:
– Bandwidth consumption and management
– Chunking/bulking strategy
– Correctness guarantees
– HDFS version compatibility issues
Data lifecycle is challenging in spite of some good Hadoop tools: a patchwork of tools complicates data lifecycle management. Some of the things we have spoken about so far can be done with a siloed approach. For instance, it is possible to process a few data sets and produce a few more through a scheduler. However, if there are two other consumers of the data produced by the first workflow, the same definitions will be repeated by those two consumers, and so on. There is serious duplication of metadata about what data is ingested, processed or produced, where it is processed and how it is produced. A single system that builds a complete view of this can provide a fairly complete picture of what is happening, compared to a collection of independent scheduled applications. Without it, both the production support and application development teams on the Hadoop platform have to scramble and write custom scripts and monitoring systems to get a broader, holistic view of what is happening. An approach where this information is systematically collected and used for seamless management can alleviate much of the pain of operating or developing data processing applications on Hadoop.
There is also a tendency to burn feed locations, apps, cluster locations and cluster services into applications, but things may change over time: where you ingest from, the feed frequency, file locations, file formats, format conversions, compressions, the app itself. You may end up with multiple clusters; a dataset's location may differ across clusters; some datasets and apps may move from one cluster to another; and things are slightly different in the BCP cluster.
The entity graph at the core is what makes Falcon what it is, and it enables all the unique features Falcon offers or can potentially make available in the future. At the core are the dependencies between data, processing logic and cluster end points, along with the rules governing data management, processing management and metadata management.
Cluster specification is PER CLUSTER. Each cluster can have the following interfaces:
– readonly specifies Hadoop's HFTP address; its endpoint is the value of dfs.http.address, e.g. hftp://corp.namenode:50070/
– write specifies the interface to write to HDFS; its endpoint is the value of fs.default.name, e.g. hdfs://corp.namenode:8020
– execute specifies the interface for the resource manager; its endpoint is the value of mapred.job.tracker, e.g. corp.jt:8021 (on YARN, use the value defined in yarn.resourcemanager.address)
– workflow specifies the interface for the workflow engine; its endpoint is the value of OOZIE_URL, e.g. http://corp.oozie:11000/oozie
– messaging specifies the interface for sending feed availability messages; its endpoint is the broker URL with a TCP address, e.g. tcp://corp.messaging:61616?daemon=true
– registry specifies the interface for HCatalog
SOURCE & TARGET CLUSTERS (LATE DATA)
You can configure multiple source and target clusters.
ACL TAG
<xs:complexType name="ACL">
 <xs:annotation>
  <xs:documentation>Access control list for this feed.</xs:documentation>
 </xs:annotation>
 <xs:attribute type="xs:string" name="owner"/>
 <xs:attribute type="xs:string" name="group"/>
 <xs:attribute type="xs:string" name="permission"/>
</xs:complexType>
RETENTION POLICY ACTIONS
<xs:simpleType name="action-type">
 <xs:restriction base="xs:string">
  <xs:annotation>
   <xs:documentation>action type specifies the action that should be taken on a feed when the retention period of a feed expires on a cluster; the valid actions are archive, delete, chown and chmod.</xs:documentation>
  </xs:annotation>
  <xs:enumeration value="archive"/>
  <xs:enumeration value="delete"/>
  <xs:enumeration value="chown"/>
  <xs:enumeration value="chmod"/>
 </xs:restriction>
</xs:simpleType>
SPECIFYING LOCATIONS
<xs:complexType name="location">
 <xs:annotation>
  <xs:documentation>location specifies the type of location (data, meta, stats) and the corresponding paths. A feed should at least define the location for type data, which specifies the HDFS path pattern where the feed is generated periodically. ex: type="data" path="/projects/TrafficHourly/${YEAR}-${MONTH}-${DAY}/traffic"</xs:documentation>
 </xs:annotation>
 <xs:attribute type="location-type" name="type" use="required"/>
 <xs:attribute type="xs:string" name="path" use="required"/>
</xs:complexType>
Each location has a TYPE:
<xs:simpleType name="location-type">
 <xs:restriction base="xs:string">
  <xs:enumeration value="data"/>
  <xs:enumeration value="stats"/>
  <xs:enumeration value="meta"/>
  <xs:enumeration value="tmp"/>
 </xs:restriction>
</xs:simpleType>
SPECIFYING HIVE TABLES
<xs:complexType name="catalog-table">
 <xs:annotation>
  <xs:documentation>catalog specifies the URI of a Hive table along with the partition spec. uri="catalog:$database:$table#(partition-key=partition-value);+" Example: catalog:logs-db:clicks#ds=${YEAR}-${MONTH}-${DAY}</xs:documentation>
 </xs:annotation>
 <xs:attribute type="xs:string" name="uri" use="required"/>
</xs:complexType>
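To make the catalog URI concrete, here is a hedged sketch (the feed name, cluster name and retention values are hypothetical) of a feed stored in a Hive table rather than an HDFS location, assuming the table element accepts the catalog URI format documented above:
<feed description="" name="clicksTableFeed" xmlns="uri:falcon:feed:0.1">
 <frequency>days(1)</frequency>
 <clusters>
  <cluster name="cluster-primary" type="source">
   <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
   <retention limit="days(90)" action="delete"/>
  </cluster>
 </clusters>
 <!-- the table element takes the place of locations; the URI follows the catalog-table type above -->
 <table uri="catalog:logs-db:clicks#ds=${YEAR}-${MONTH}-${DAY}"/>
 <ACL owner="hdfs" group="users" permission="0755"/>
 <schema location="/none" provider="none"/>
</feed>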
A process defines configuration for a workflow. A workflow is a directed acyclic graph (DAG) which defines the job for the workflow engine, and a process definition defines the configuration required to run that workflow job: the frequency at which the workflow should run, the clusters on which it should run, the inputs and outputs, how failures should be handled, how late inputs should be handled, and so on.
– Process-level validity: how long the process itself is valid. Each cluster specified within a process in turn has its own validity, which gives the window during which the job should run on that cluster.
– Parallel: how many instances of the process can run in parallel. A new instance is started every time the process is kicked off based on the specified frequency.
– Order: the order in which the ready instances of the process are picked up; mostly FIFO is used.
– Timeout: specified on a per-instance basis.
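A small illustrative fragment of these scheduling controls (the values are hypothetical, and the timeout element is an assumption about the process schema rather than something shown on the slide):
<parallel>2</parallel> <!-- at most two instances run concurrently -->
<order>FIFO</order> <!-- ready instances are picked up oldest first -->
<frequency>hours(1)</frequency> <!-- a new instance is materialized every hour -->
<timeout>hours(2)</timeout> <!-- each instance is timed out after two hours -->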
A certain class of applications, SLA-critical machine-consumable data (with some tolerance for error), is not affected much if a small percentage of data arrives late; examples include forecasting, predictions and risk management. However, applications with a "close of books" notion, whose human-consumable output is used for factual reporting and may be subject to audit, cannot ignore data that arrives out of order or late. Late data handling defines how such late data should be handled. Each feed is defined with a late cut-off value, which specifies the time until which late data is valid; for example, a late cut-off of hours(6) means that data for the nth hour can be delayed by up to 6 hours. The late data specification in the process then defines how this late data is handled. The late data policy defines how frequently a check is done to detect late data; the supported policies are backoff, exp-backoff (exponential backoff) and final (a single check at the feed's late cut-off). The policy, along with the delay, defines the interval at which the late data check is done. The late input specification for each input defines the workflow that should run when late data is detected for that input.
With the periodic policy and a 10-minute delay, as in the process example above, the workflow is retried after 10 minutes, 20 minutes and 30 minutes. With exponential backoff, the workflow would be retried after 10 minutes, 20 minutes and 40 minutes.
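The two variants side by side, using the same delay and attempt count as the process example (the comments restate the timings above):
<retry policy="periodic" delay="minutes(10)" attempts="3"/> <!-- retried after 10, 20 and 30 minutes -->
<retry policy="exp-backoff" delay="minutes(10)" attempts="3"/> <!-- retried after 10, 20 and 40 minutes -->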
Falcon provides the key services data processing applications need. Complex data processing logic is handled by Falcon instead of being hard-coded in apps, which means faster development and higher quality for ETL, reporting and other data processing apps on Hadoop.
The system accepts entities described using a DSL: infrastructure, datasets, and pipeline/processing logic. It transforms that input into automated and scheduled workflows, orchestrates the workflows, instruments the execution of configured policies, handles retry logic and late data processing, records audit and lineage information, integrates with the metastore/catalog (WIP), and provides notifications based on availability. The result is an integrated, seamless experience for users: processing is automated and tracked end to end, data set management (replication, retention, etc.) is offered as a service, users can cherry-pick capabilities with no coupling between the primitives, and hooks are provided for monitoring and metrics collection.
– Ad Request, Click, Impression and Conversion feeds: minutely, with identical location and retention configuration, but spanning many data centers
– Summary data: hourly, with multiple partitions (one per data center, each configured as a source, and one target which is the global data center)
– Click, Impression and Conversion enrichment & Summarizer processes: a single definition across multiple data centers, with identical periodicity and scheduling configuration
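A hedged sketch of what the hourly summary feed described above might look like (the feed name, cluster names, paths and retention limits are hypothetical, and the partitions element plus the per-cluster partition attribute are assumptions about how the per-data-center partitions would be expressed):
<feed description="" name="summary-hourly" xmlns="uri:falcon:feed:0.1">
 <partitions>
  <partition name="dc"/> <!-- one partition per data center -->
 </partitions>
 <frequency>hours(1)</frequency>
 <clusters>
  <cluster name="datacenter-1" type="source" partition="dc1">
   <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
   <retention limit="days(7)" action="delete"/>
  </cluster>
  <!-- ...one source cluster per data center... -->
  <cluster name="global-datacenter" type="target">
   <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
   <retention limit="days(30)" action="delete"/>
  </cluster>
 </clusters>
 <locations>
  <location type="data" path="/summary/hourly/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
 </locations>
 <ACL owner="hdfs" group="users" permission="0755"/>
 <schema location="/none" provider="none"/>
</feed>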