Scalable Distributed Systems
Impact of reliability on the scalability of Flume
Report by Mário Almeida (EMDC)


Introduction

This report describes the mechanisms that provide fault tolerance in Flume, as well as their
impact on scalability due to the increasing number of flows and the bottlenecks they may introduce.
In this context, reliability is the ability to keep delivering events and logging them into HDFS
in the face of failures, without losing data. These failures can be due to physical hardware
faults, scarce bandwidth or memory, software crashes, etc.

In order to eliminate single points of failure, previous versions of Flume had a failover
mechanism that would move a flow to a new agent without user intervention. This was
done via a Flume Master node that kept a global view of all the data flows and could
dynamically reconfigure nodes and dataflows through failover chains.

The problem with this implementation was that the Flume Master would become a bottleneck of the
system (even though it could be replicated through secondary Flume Master nodes), and configuring
its behavior for larger sets of dataflows became complicated. This had a clear impact on the
scalability of the system.

The current version of Flume addresses this problem by multiplexing or replicating flows. This
means that an agent can send data through different channels for load balancing, or replicate
the same flow through two different channels. This solution does provide the desired reliability,
but it either duplicates the information in the system or requires more agents in order to
balance the load.
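
As a minimal sketch of the replication case (the names agent1, src, ch1, ch2, to-c1 and to-c2 are
placeholders, not taken from the architecture described below), a Flume NG agent that replicates
its flow towards two collectors could be configured along these lines:

# hypothetical agent replicating its flow towards two collectors
agent1.sources = src
agent1.channels = ch1 ch2
agent1.sinks = to-c1 to-c2

# the replicating channel selector copies every event into both channels
agent1.sources.src.channels = ch1 ch2
agent1.sources.src.selector.type = replicating

agent1.channels.ch1.type = memory
agent1.channels.ch2.type = memory

# each channel feeds an avro sink pointing at a different collector
agent1.sinks.to-c1.type = avro
agent1.sinks.to-c1.hostname = <collector1-dns>
agent1.sinks.to-c1.port = 60000
agent1.sinks.to-c1.channel = ch1
agent1.sinks.to-c2.type = avro
agent1.sinks.to-c2.hostname = <collector2-dns>
agent1.sinks.to-c2.port = 60000
agent1.sinks.to-c2.channel = ch2

With a multiplexing selector instead of a replicating one, the same two channels could be used to
split the load rather than duplicate it.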

In order to show the impact of failures on a Flume architecture, a scenario was chosen in which
the sources are updated very often and are volatile. This case is particularly relevant for failures
of the source-gathering agents, since some events get lost. The case in which the collectors fail
can be tolerated in the newer version through the use of persistent channels.

Multiple experiments were performed to compare the use of memory channels with persistent
JDBC channels. A new mechanism for tolerating failures is also proposed and tested against
the existing architectures.
Proposal
Let's start by describing the architecture of the Flume-based system represented in figure 1.
As one can observe, there are 5 distinct agents, 2 of them acting as collectors. The sources
Source1, Source2 and Source3 consisted of three different C applications that generate
sequences of numbers and output them to a file. At any given moment this file contains only a
single sequence number, in order to achieve the volatility property. All the agents ran on
different machines.




                        Figure 1. Implemented Flume-based architecture.

Given the reasons mentioned above for the Master node being deprecated in the new Flume NG,
this report evaluates a possible architecture for achieving reliability in Flume.

To make it less centralized, the idea is to form smaller clusters whose nodes are responsible for
keeping track of each other. If one of them fails, another takes over its responsibilities, either
gathering information from a source or aggregating data from other agents.
Figure 2. Representation of the clustering of Agents. Red arrows represent ping messages,
black dotted arrows represent possible reconfigurations.

Following the architecture depicted in figure 2, one cluster could consist of Agent11, Agent12 and
Agent21, and another cluster of C1 and C2. For example, if Agent11 fails, Agent12 or Agent21
takes its place gathering information from Source1 until it finally restarts. If Collector2 fails,
Collector1 aggregates the information coming from Agent21.

For this purpose, every agent belonging to a cluster has to have knowledge about the other agents
in the same cluster, including their sources and sinks. In this experiment the agents ping each
other in order to keep track of which agents are alive; a minimal sketch of such a liveness check
is shown below.
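
The sketch is based on the ping one-liner commented out in the configuration script in the
appendix; the peer address, check interval and takeover action are hypothetical placeholders, not
the exact script that was used:

#!/bin/bash
# Hedged sketch of a peer liveness check driving reconfiguration.
# PEER and the takeover action are placeholders.
PEER="ec2-xx-xx-xx-xx.compute-1.amazonaws.com"

while true; do
    # count the replies to 4 ICMP echo requests; 0 received means the peer looks dead
    received=$(ping -c4 "$PEER" | grep transmitted | awk '{print $4}')
    if [ "${received:-0}" -eq 0 ]; then
        echo "$PEER appears down, taking over its flow"
        # e.g. regenerate the local configuration (./configen) and restart the flume-ng agent
    fi
    sleep 10
done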

This notion of multiple small clusters, as shown in figure 2, makes the system less dependent on
a centralized entity and also eliminates the need for extra nodes. It might not preserve some of
the consistency properties that could be achieved through the use of the embedded ZooKeeper,
though.


Experiments
All the experiments ran for a few hours each, generating up to 10 thousand events. The objective
was to make these periods as long as possible, taking the costs and other schedules into account.

Normal execution of Flume
The first experiment consisted of the normal execution of Flume, without any failures and using
memory channels.

Collector fails while using memory channels
The second experiment consisted of disconnecting collector agent 2 during the execution of Flume
and registering the lost events.
Collector fails while using persistent channels
The third experiment consisted of disconnecting collector agent 2 during the execution of Flume
using JDBC channels and registering the lost events.

Agent fails while using persistent channels
In this experiment it is Agent21 that is disconnected, while using JDBC channels.

Dynamic Routing
In this last experiment, both Collector2 and Agent11 are disconnected. It uses memory
channels.


Results

Normal execution of Flume
As expected, during the normal execution of Flume all the events generated by the sources were
logged into HDFS. This means that the rate at which the data was generated was well within the
capacity of the memory channels.

Collector fails while using memory channels
Although I initially thought that the failure of a collector would mean that the data read from the
source would be lost by the agent due to the limited channel capacity, in most tests Flume was
able to store this data and resend it once the collector was restarted.
Flume uses a transactional approach to guarantee the reliable delivery of events: an event is only
removed from a channel once it has been successfully stored in the next channel. This should still
be bounded by the capacity of the channel, but with the implementation used, with a channel
capacity of 10 thousand events, Collector2 could be down for more than one hour without any
channel overflow.
It was then decided to drop this capacity to 100 events and double the rate of data generation
(the corresponding configuration change is sketched below). After this change I disconnected
Collector2 until more than 100 events had been read by Agent21. Once Collector2 came back online
it received the 100 events that were stored in the channel of Agent21, but failed to deliver the
subsequent events. In fact, it stopped working from that point on and required a restart.
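
For reference, the reduced-capacity run corresponds to a change like the following in the agent
configuration generated by the appendix script, together with halving the sleep between writes in
the C generator (the lines shown are illustrative of the described change, not a verbatim copy of
the modified files):

# channel capacity dropped from 10000 to 100 events on the source-side agent
agent2.channels.memoryChannel.capacity = 100

# in the C generator, sleep 1 s instead of 2 s between events to double the rate
#     usleep(1000000);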


Collector fails while using persistent channels
As expected, since Flume stores the events in a relational database instead of in memory, when
a collector dies the events are kept and delivered whenever it becomes available again. This is
indeed how Flume achieves recoverability. Although it works well, it seems that while the
memory channel groups events before sending, the JDBC channel works more like a pipeline,
sending the events one by one. This, together with the inherent cost of writing to persistent
storage, might have a significant impact on performance for large-scale systems.
Also, probably due to this as well, there seemed to be an imbalance between the rates at which
the source data flows reached the collector. Of roughly every 10 events that reached the
collector, only 2 were from Source1 and 8 were from Source2, although both produced data at the
same rate. I wonder what would happen over longer runs; in my case, after around an hour the
difference had already reached about 400 events. There seemed to be no option to group events.
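
In the configuration used for these runs, switching to the persistent channel only amounts to
swapping the channel type. A minimal sketch (the agent and component names follow the appendix
scripts, and it assumes the Flume NG JDBC channel's default embedded Derby store, so no further
parameters are set here):

# memory channel swapped for a JDBC channel on the source-side agent
agent2.channels = jdbcChannel
agent2.channels.jdbcChannel.type = jdbc

agent2.sources.generator.channels = jdbcChannel
agent2.sinks.avro-forward-sink.channel = jdbcChannel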

Agent fails while using persistent channels
Although I expected that when an agent is stopped it would lose some events, it seems that
it still logs them somehow and, after being restarted, resends all the events since the beginning,
repeating up to hundreds of events in the process. The experiments seemed to indicate
that the JDBC channel logs events even when the agent crashes. Other channels, such as the
recoverableMemoryChannel, provide parameters that set how often a background worker checks
old logs, but there is no mention of this for JDBC. Overall, although it repeated hundreds if
not thousands of events, it didn't lose a single one. This would not be the case if the
whole machine crashed, but I couldn't test that since my source was on the same node as the
agent itself. Further testing would require a new architecture.

Dynamic Routing
In the way it was implemented, every time a new configuration is generated and the agent
restarted, the events in memory are lost. This happens when Collector2 is disconnected
and Agent21 has to change its dataflow to Collector1. Overall, some events were lost
while migrating between configurations, but this achieved better results than Flume's normal
in-memory reliability scheme. Although for all-round purposes the JDBC channels probably
achieve better reliability, clustering has lower delays in retrieving data. It might also not
route the data in the most efficient way if, due to failures, all flows end up going through the
same nodes.


Conclusions
As long as the data generation rate doesn't overflow the available capacity of the channel
memory, Flume works well for most cases, with or without failures. There is one failure that
probably can't be handled without creating replicated flows: the "unsubscribing" from a source
due to the failure of the machine (not only the agent) where the source holds volatile
information. Further experiments should be conducted in order to evaluate the performance of
creating clusters versus actually replicating the data flows and doing the subsequent processing
needed to store them in HDFS. Scalability-wise, it seems that well-implemented clusters would
mean fewer nodes and less flow of information, since the ping/heartbeat rate in a real system is
much lower than a data flow's rate. Still, the way Flume has implemented its reliability is
good for its simplicity.

The proposed architecture was implemented through multiple scripts instead of actually
changing the source code of Flume. This means that there were some workarounds, such as
restarting and reloading configurations, that might introduce errors into the experiments. These
scripts would also never provide easy-to-use access to this routing mechanism. That said, more
experiments would be needed to make this report more significant. Even so, it gives an
interesting overview of the mechanisms for achieving reliability in Flume while describing their
limitations.

Implementing dynamic routing in the current architecture of Flume can also be achieved by using
an Avro client to set different headers depending on the intended route. This could possibly be a
solution for implementing the proposed architecture.
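
As a hedged sketch of how this could look (the agent, channel and header names are hypothetical,
not part of the implementation described above), a multiplexing channel selector keyed on a header
set by the client would decide which collector receives the flow:

# hypothetical header-based routing on an intermediate agent
agent21.sources.avro-in.channels = toC1 toC2
agent21.sources.avro-in.selector.type = multiplexing
agent21.sources.avro-in.selector.header = route
agent21.sources.avro-in.selector.mapping.c1 = toC1
agent21.sources.avro-in.selector.mapping.c2 = toC2
agent21.sources.avro-in.selector.default = toC1

Here toC1 would feed an Avro sink towards Collector1 and toC2 one towards Collector2, so an Avro
client only needs to set route=c2 in the event headers to re-route the flow without regenerating
the configuration.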
Appendix
Configuration steps

Tools
   ●    puttygen
   ●    putty
   ●    scp
   ●    pssh
   ●    cloudera manager 3.7!!
   ●    flume
   ●    hadoop


Installing CDH3 and setting up HDFS
   1. Go to Security Groups and create a new group
          a. In the inbound rules add ssh, http and icmp (for ping). Apply Rule Changes. If it's
             only for test purposes you can simply allow all tcp, udp and icmp
   2. Go to Key pairs, create and download pem file
   3. Create the public key file
          a. puttygen key.ppk -L > ~/.ssh/id_rsa.pub
          b. puttygen /path/to/puttykey.ppk -O private-openssh -o ~/.ssh/id_rsa
   4. Create 10 Suse Medium instances in AWS Management Console:
          a. Choose Suse, next
          b. 10 instances, Medium, next
          c. Next
          d. Use the created key pair, next
          e. Choose previously created security group
   5. Choose an AWS instance and rename it to CDHManager. Right click on it, Connect -> copy the
      public DNS
   6. Download Cloudera Manager Free Edition and copy its bin file onto the machine:
          a. scp cloudera-manager-installer.bin root@publicdns:/root/cloudera-manager-installer.bin
   7. SSH into the machine and perform an ls; the Cloudera bin file should be there:
          a. ssh root@publicdns
          b. ls (check that the installer is there)
   8. Install it:
          a. chmod u+x cloudera-manager-installer.bin
          b. sudo ./cloudera-manager-installer.bin
          c. next, next, yes, next, yes...wait till installation finishes
   9. Go to your web browser and paste the public DNS followed by port 7180, like this:
      publicDns:7180. Note that you can't access it yet; that is because our security group doesn't
      allow connections on this port:
          a. Go to Security Groups. Add a custom tcp rule for port 7180. Apply rule changes.
          b. Reopen the webpage. Username: admin, password: admin
          c. Install only the free edition. Continue.
          d. Proceed without registering.
   10. Go to My Instances in AWS and select all instances except the CDHManager. Notice that
       all the public DNS names appear listed below; copy them at once and paste them on the
       webpage. Remove the unneeded parts such as "i-2jh3h53hk3h:". (Sometimes some nodes
       might not be accessible; just delete/restart them, create another and put its public DNS
       in the list.)
   11. Find the instances and install CDH3 on them with default values. Continue.
   12. Choose root, all hosts accept the same public key, select your public key and, for the
       private key, the pem file. Install...
   13. Continue, continue, cluster CDH3
          a. if an error occurs it is usually due to closed ports (generally icmp or others). A
             common error is: "The inspector failed to run on all hosts."
   14. Add the hdfs service, for example with 3 datanodes; one of them can be the namenode as well.


Installing Flume NG
   1. On Linux install pssh: sudo apt-get install pssh (yes, I have a VirtualBox running Ubuntu
      and I prefer to have all these tools there. I tried X servers on Windows but didn't like them.)
   2. Create a hosts.txt file with all the public DNS names.
   3. Install putty on Ubuntu.
          a. sudo apt-get install putty
   4. Install Flume like a boss:
          a. parallel-ssh -O StrictHostKeyChecking=no -v -i -l root -h hosts.txt "sudo zypper --non-interactive install flume-ng"
   5. Make it start at boot:
          a. parallel-ssh -O StrictHostKeyChecking=no -v -i -l root -h hosts.txt "sudo zypper --non-interactive install flume-ng-agent"

Running the experiments
   1. Send the scripts that generate the configuration files to the servers:
          a. while read line; do scp -S ssh -o StrictHostKeyChecking=no -o PasswordAuthentication=no configen root@$line:/etc/flume-ng/conf; done <hosts.txt
          b. while read line; do scp -S ssh -o StrictHostKeyChecking=no -o PasswordAuthentication=no genCollector root@$line:/etc/flume-ng/conf; done <hosts.txt
   2. Run configen on every machine:
          a. parallel-ssh -O StrictHostKeyChecking=no -v -i -l root -h hosts.txt "cd /etc/flume-ng/conf; sudo chmod u+x configen; ./configen"
          b. Check the results:
                 i. parallel-ssh -O StrictHostKeyChecking=no -v -i -l root -h hosts.txt "cd /etc/flume-ng/conf; ls"
   3. SSH into the name node and do:
          a. sudo -u hdfs hadoop fs -mkdir flume
   4. SSH into the collector agents and do:
          a. sudo -u hdfs flume-ng agent -n collector1 -f /etc/flume-ng/conf/flumeCollector1.conf
             (note that sudo -u hdfs is needed because of the authentication mechanism of HDFS;
             otherwise the following error would occur: Caused by:
             org.apache.hadoop.ipc.RemoteException:
             org.apache.hadoop.security.AccessControlException: Permission denied:
             user=root, access=WRITE, inode="/dfs/nn/image/flume":hdfs:hadoop:drwxr-xr-x)
   5. SSH into the agents that access the sources and do the following, replacing # with the
      agent number (1, 2, 3):
          a. ./run S#
          b. Check that tail -f S# produces results; stop it with ctrl+c
          c. sudo flume-ng agent -n agent# -f /etc/flume-ng/conf/flumeAgent#.conf
   6. You will probably get the following error:
          a. Caused by: org.apache.hadoop.ipc.RemoteException:
             org.apache.hadoop.security.AccessControlException: Permission denied:
             user=root, access=WRITE, inode="/dfs/nn/image/flume":hdfs:hadoop:drwxr-xr-x
             (this can happen when the reserved non-DFS space is bigger than the space available
             on the node). Do the following:
                 i. go to the CDH Manager web UI, select Services, hdfs, Configuration. Set
                    dfs.datanode.du.reserved to a value that leaves HDFS some free space, for
                    example 1737418240 (roughly 1.6 GB)
                 ii. restart HDFS
   7. Now you are able to run the experiments. If needed, you can download all the files in HDFS
      to a local directory using the following command:
          a. hadoop fs -copyToLocal hdfs://ec2-23-22-64-132.compute-1.amazonaws.com:8020/flume /etc/flume-ng/conf/fs
          b. or:
                 i. ls Flume* | awk '{print "wget http://ec2-23-22-64-132.compute-1.amazonaws.com:50075/streamFile/flume/"$0"? -O res/data-"$0}' > wget

To conclude, a Java program was written that reads these files, and the data was parsed in order
to obtain the results. This experiment was repeated for multiple configurations.
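
As a rough illustration of that parsing step (this is not the original Java code, and it assumes
the HDFS sink was writing plain text rather than its default SequenceFile format), the per-source
event counts could be extracted along these lines:

#!/bin/bash
# Hypothetical sketch: count logged events per source in the files downloaded from HDFS,
# assuming event bodies of the form "S1 : 1234" separated by carriage returns.
cat res/data-Flume* | tr '\r' '\n' | awk -F' : ' '
    /^S[0-9]/ { count[$1]++; last[$1] = $2 }
    END { for (s in count) printf "%s: %d events, last sequence %s\n", s, count[s], last[s] }'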



Bash script that generates the configuration files

#!/bin/bash

#ping -c4 ip |grep transmitted | awk '{if($4==0){print "online"}else{print "offline"}}'

echo "
#include <stdio.h>
#include <unistd.h>
#include <time.h>

int main(int argc, char *argv[])
{
    int ctr = 0;
    FILE *file = fopen(argv[1], \"w+\");
    if (file == NULL) { puts(\"could not open output file\"); return 1; }
    while (1) {
        /* write the next sequence number and flush so that tail -f picks it up */
        fprintf(file, \"%s : %d\r\", argv[1], ctr++);
        fflush(file);
        usleep(2000000); /* one value every 2 seconds */
        if (ctr == 1000000) ctr = 0;
    }
    fclose(file);
    return 0;
}
" > ccode.c

gcc -o run ccode.c

for agent in 1 2 3
do

if [[ $agent == 1 || $agent == 2 ]]
then
    collector="ec2-50-17-85-221.compute-1.amazonaws.com"
else
    collector="ec2-50-19-2-196.compute-1.amazonaws.com"
fi

#echo "Setting $collector"

echo "
agent$agent.sources = generator
agent$agent.channels = memoryChannel
agent$agent.sinks = avro-forward-sink

# For each one of the sources, the type is defined
agent$agent.sources.generator.type = exec
agent$agent.sources.generator.command = tail -f /etc/flume-ng/conf/S$agent
agent$agent.sources.generator.logStdErr = true

# The channel can be defined as follows.
agent$agent.sources.generator.channels = memoryChannel

# Each sink's type must be defined
agent$agent.sinks.avro-forward-sink.type = avro
agent$agent.sinks.avro-forward-sink.hostname = $collector
agent$agent.sinks.avro-forward-sink.port = 60000

#Specify the channel the sink should use
agent$agent.sinks.avro-forward-sink.channel = memoryChannel

# Each channel's type is defined.
agent$agent.channels.memoryChannel.type = memory

# Other config values specific to each type of channel(sink or source)
# can be defined as well
# In this case, it specifies the capacity of the memory channel
agent$agent.channels.memoryChannel.capacity = 10000
" > flumeAgent$agent.conf
done

#COLLECTORS----------------------
for collector in 1 2
do

if [[ $collector == 1 ]]
then
         #Strangely this agent1 is indeed the collector's address
    agent1="ec2-50-17-85-221.compute-1.amazonaws.com"
         #"ec2-107-21-171-50.compute-1.amazonaws.com"
    #agent2="ec2-23-22-216-49.compute-1.amazonaws.com"
    sources="avro-source-1"
    ./genCollector $collector "$sources" "$agent1"
else
    agent1="ec2-50-19-2-196.compute-1.amazonaws.com" #"ec2-107-22-64-107.compute-1.amazonaws.com"
    sources="avro-source-1"
    ./genCollector $collector "$sources" "$agent1"
fi

done



Bash script that generates a single Collector configuration file

#!/bin/bash

echo "
collector$1.sources = $2
collector$1.channels = memory-1
collector$1.sinks = hdfs-sink
#hdfs-sink

# For each one of the sources, the type is defined
collector$1.sources.avro-source-1.type = avro
collector$1.sources.avro-source-1.bind = $3
collector$1.sources.avro-source-1.port = 60000
collector$1.sources.avro-source-1.channels = memory-1
" > flumeCollector$1.conf

echo "
# Each sink's type must be defined
collector$1.sinks.hdfs-sink.type = hdfs
#logger
collector$1.sinks.hdfs-sink.hdfs.path = hdfs://ec2-23-22-64-132.compute-1.amazonaws.com:8020/flume
collector$1.sinks.hdfs-sink.channel = memory-1

# Each channel's type is defined.
collector$1.channels.memory-1.type = memory

# Other config values specific to each type of channel(sink or source)
# can be defined as well
# In this case, it specifies the capacity of the memory channel
collector$1.channels.memory-1.capacity = 30000
" >> flumeCollector$1.conf

More Related Content

What's hot

Training Slides: 153 - Working with the CLI
Training Slides: 153 - Working with the CLITraining Slides: 153 - Working with the CLI
Training Slides: 153 - Working with the CLIContinuent
 
Scheduling in distributed systems - Andrii Vozniuk
Scheduling in distributed systems - Andrii VozniukScheduling in distributed systems - Andrii Vozniuk
Scheduling in distributed systems - Andrii VozniukAndrii Vozniuk
 
Beyond Off the-Shelf Consensus
Beyond Off the-Shelf ConsensusBeyond Off the-Shelf Consensus
Beyond Off the-Shelf ConsensusRebecca Bilbro
 
Types of Load distributing algorithm in Distributed System
Types of Load distributing algorithm in Distributed SystemTypes of Load distributing algorithm in Distributed System
Types of Load distributing algorithm in Distributed SystemDHIVYADEVAKI
 
Distributed process and scheduling
Distributed process and scheduling Distributed process and scheduling
Distributed process and scheduling SHATHAN
 
Webinar Slides: Tungsten Connector / Proxy – The Secret Sauce Behind Zero-Dow...
Webinar Slides: Tungsten Connector / Proxy – The Secret Sauce Behind Zero-Dow...Webinar Slides: Tungsten Connector / Proxy – The Secret Sauce Behind Zero-Dow...
Webinar Slides: Tungsten Connector / Proxy – The Secret Sauce Behind Zero-Dow...Continuent
 
Survey paper _ lakshmi yasaswi kamireddy(651771619)
Survey paper _ lakshmi yasaswi kamireddy(651771619)Survey paper _ lakshmi yasaswi kamireddy(651771619)
Survey paper _ lakshmi yasaswi kamireddy(651771619)Lakshmi Yasaswi Kamireddy
 
Error in hadoop
Error in hadoopError in hadoop
Error in hadoopLen Bass
 
resource management
  resource management  resource management
resource managementAshish Kumar
 
A load balancing model based on cloud partitioning for the public cloud. ppt
A  load balancing model based on cloud partitioning for the public cloud. ppt A  load balancing model based on cloud partitioning for the public cloud. ppt
A load balancing model based on cloud partitioning for the public cloud. ppt Lavanya Vigrahala
 
Ppt project process migration
Ppt project process migrationPpt project process migration
Ppt project process migrationjaya380
 
Dynamic load balancing in distributed systems in the presence of delays a re...
Dynamic load balancing in distributed systems in the presence of delays  a re...Dynamic load balancing in distributed systems in the presence of delays  a re...
Dynamic load balancing in distributed systems in the presence of delays a re...Mumbai Academisc
 
dos mutual exclusion algos
dos mutual exclusion algosdos mutual exclusion algos
dos mutual exclusion algosAkhil Sharma
 
pptx - Distributed Parallel Inference on Large Factor Graphs
pptx - Distributed Parallel Inference on Large Factor Graphspptx - Distributed Parallel Inference on Large Factor Graphs
pptx - Distributed Parallel Inference on Large Factor Graphsbutest
 
An efficient approach for load balancing using dynamic ab algorithm in cloud ...
An efficient approach for load balancing using dynamic ab algorithm in cloud ...An efficient approach for load balancing using dynamic ab algorithm in cloud ...
An efficient approach for load balancing using dynamic ab algorithm in cloud ...bhavikpooja
 
Simulation of BRKSS Architecture for Data Warehouse Employing Shared Nothing ...
Simulation of BRKSS Architecture for Data Warehouse Employing Shared Nothing ...Simulation of BRKSS Architecture for Data Warehouse Employing Shared Nothing ...
Simulation of BRKSS Architecture for Data Warehouse Employing Shared Nothing ...Dr. Amarjeet Singh
 
Swift container sync
Swift container syncSwift container sync
Swift container syncOpen Stack
 

What's hot (20)

Training Slides: 153 - Working with the CLI
Training Slides: 153 - Working with the CLITraining Slides: 153 - Working with the CLI
Training Slides: 153 - Working with the CLI
 
Scheduling in distributed systems - Andrii Vozniuk
Scheduling in distributed systems - Andrii VozniukScheduling in distributed systems - Andrii Vozniuk
Scheduling in distributed systems - Andrii Vozniuk
 
Beyond Off the-Shelf Consensus
Beyond Off the-Shelf ConsensusBeyond Off the-Shelf Consensus
Beyond Off the-Shelf Consensus
 
CS6601 DISTRIBUTED SYSTEMS
CS6601 DISTRIBUTED SYSTEMSCS6601 DISTRIBUTED SYSTEMS
CS6601 DISTRIBUTED SYSTEMS
 
Distributed DBMS - Unit 9 - Distributed Deadlock & Recovery
Distributed DBMS - Unit 9 - Distributed Deadlock & RecoveryDistributed DBMS - Unit 9 - Distributed Deadlock & Recovery
Distributed DBMS - Unit 9 - Distributed Deadlock & Recovery
 
Types of Load distributing algorithm in Distributed System
Types of Load distributing algorithm in Distributed SystemTypes of Load distributing algorithm in Distributed System
Types of Load distributing algorithm in Distributed System
 
Distributed process and scheduling
Distributed process and scheduling Distributed process and scheduling
Distributed process and scheduling
 
Webinar Slides: Tungsten Connector / Proxy – The Secret Sauce Behind Zero-Dow...
Webinar Slides: Tungsten Connector / Proxy – The Secret Sauce Behind Zero-Dow...Webinar Slides: Tungsten Connector / Proxy – The Secret Sauce Behind Zero-Dow...
Webinar Slides: Tungsten Connector / Proxy – The Secret Sauce Behind Zero-Dow...
 
Survey paper _ lakshmi yasaswi kamireddy(651771619)
Survey paper _ lakshmi yasaswi kamireddy(651771619)Survey paper _ lakshmi yasaswi kamireddy(651771619)
Survey paper _ lakshmi yasaswi kamireddy(651771619)
 
Error in hadoop
Error in hadoopError in hadoop
Error in hadoop
 
resource management
  resource management  resource management
resource management
 
A load balancing model based on cloud partitioning for the public cloud. ppt
A  load balancing model based on cloud partitioning for the public cloud. ppt A  load balancing model based on cloud partitioning for the public cloud. ppt
A load balancing model based on cloud partitioning for the public cloud. ppt
 
Ppt project process migration
Ppt project process migrationPpt project process migration
Ppt project process migration
 
The value of reactive
The value of reactiveThe value of reactive
The value of reactive
 
Dynamic load balancing in distributed systems in the presence of delays a re...
Dynamic load balancing in distributed systems in the presence of delays  a re...Dynamic load balancing in distributed systems in the presence of delays  a re...
Dynamic load balancing in distributed systems in the presence of delays a re...
 
dos mutual exclusion algos
dos mutual exclusion algosdos mutual exclusion algos
dos mutual exclusion algos
 
pptx - Distributed Parallel Inference on Large Factor Graphs
pptx - Distributed Parallel Inference on Large Factor Graphspptx - Distributed Parallel Inference on Large Factor Graphs
pptx - Distributed Parallel Inference on Large Factor Graphs
 
An efficient approach for load balancing using dynamic ab algorithm in cloud ...
An efficient approach for load balancing using dynamic ab algorithm in cloud ...An efficient approach for load balancing using dynamic ab algorithm in cloud ...
An efficient approach for load balancing using dynamic ab algorithm in cloud ...
 
Simulation of BRKSS Architecture for Data Warehouse Employing Shared Nothing ...
Simulation of BRKSS Architecture for Data Warehouse Employing Shared Nothing ...Simulation of BRKSS Architecture for Data Warehouse Employing Shared Nothing ...
Simulation of BRKSS Architecture for Data Warehouse Employing Shared Nothing ...
 
Swift container sync
Swift container syncSwift container sync
Swift container sync
 

Similar to Flume impact of reliability on scalability

What is active-active
What is active-activeWhat is active-active
What is active-activeSaif Ahmad
 
FlexRay Fault Tolerance article
FlexRay Fault Tolerance articleFlexRay Fault Tolerance article
FlexRay Fault Tolerance articleOmar Jaradat
 
congestion control.pdf
congestion control.pdfcongestion control.pdf
congestion control.pdfJayaprasanna4
 
Os solved question paper
Os solved question paperOs solved question paper
Os solved question paperAnkit Bhatnagar
 
Operating system Interview Questions
Operating system Interview QuestionsOperating system Interview Questions
Operating system Interview QuestionsKuntal Bhowmick
 
Flume Event Scalability
Flume Event ScalabilityFlume Event Scalability
Flume Event ScalabilityArinto Murdopo
 
Designing a Fault-Tolerant Channel Extension Network for Internal Recovery
Designing a Fault-Tolerant Channel Extension Network for Internal RecoveryDesigning a Fault-Tolerant Channel Extension Network for Internal Recovery
Designing a Fault-Tolerant Channel Extension Network for Internal Recoveryicu812
 
KALMAN FILTER BASED CONGESTION CONTROLLER
KALMAN FILTER BASED CONGESTION CONTROLLERKALMAN FILTER BASED CONGESTION CONTROLLER
KALMAN FILTER BASED CONGESTION CONTROLLERijdpsjournal
 
Automated re allocator of replicas
Automated re allocator of replicasAutomated re allocator of replicas
Automated re allocator of replicasIJCNCJournal
 
Cluster computing pptl (2)
Cluster computing pptl (2)Cluster computing pptl (2)
Cluster computing pptl (2)Rohit Jain
 
Clustercomputingpptl2 120204125126-phpapp01
Clustercomputingpptl2 120204125126-phpapp01Clustercomputingpptl2 120204125126-phpapp01
Clustercomputingpptl2 120204125126-phpapp01Ankit Soni
 
End to-end arguments in system design
End to-end arguments in system designEnd to-end arguments in system design
End to-end arguments in system designnody111
 
ADVANCED DIFFUSION APPROACH TO DYNAMIC LOAD-BALANCING FOR CLOUD STORAGE
ADVANCED DIFFUSION APPROACH TO DYNAMIC LOAD-BALANCING FOR CLOUD STORAGEADVANCED DIFFUSION APPROACH TO DYNAMIC LOAD-BALANCING FOR CLOUD STORAGE
ADVANCED DIFFUSION APPROACH TO DYNAMIC LOAD-BALANCING FOR CLOUD STORAGEijdpsjournal
 
ADVANCED DIFFUSION APPROACH TO DYNAMIC LOAD-BALANCING FOR CLOUD STORAGE
ADVANCED DIFFUSION APPROACH TO DYNAMIC  LOAD-BALANCING FOR CLOUD STORAGEADVANCED DIFFUSION APPROACH TO DYNAMIC  LOAD-BALANCING FOR CLOUD STORAGE
ADVANCED DIFFUSION APPROACH TO DYNAMIC LOAD-BALANCING FOR CLOUD STORAGEijdpsjournal
 
ADVANCED DIFFUSION APPROACH TO DYNAMIC LOAD-BALANCING FOR CLOUD STORAGE
ADVANCED DIFFUSION APPROACH TO DYNAMIC  LOAD-BALANCING FOR CLOUD STORAGEADVANCED DIFFUSION APPROACH TO DYNAMIC  LOAD-BALANCING FOR CLOUD STORAGE
ADVANCED DIFFUSION APPROACH TO DYNAMIC LOAD-BALANCING FOR CLOUD STORAGEijdpsjournal
 
Data power Performance Tuning
Data power Performance TuningData power Performance Tuning
Data power Performance TuningKINGSHUK MAJUMDER
 

Similar to Flume impact of reliability on scalability (20)

What is active-active
What is active-activeWhat is active-active
What is active-active
 
FlexRay Fault Tolerance article
FlexRay Fault Tolerance articleFlexRay Fault Tolerance article
FlexRay Fault Tolerance article
 
congestion control.pdf
congestion control.pdfcongestion control.pdf
congestion control.pdf
 
Os solved question paper
Os solved question paperOs solved question paper
Os solved question paper
 
Operating system Interview Questions
Operating system Interview QuestionsOperating system Interview Questions
Operating system Interview Questions
 
Flume Event Scalability
Flume Event ScalabilityFlume Event Scalability
Flume Event Scalability
 
Designing a Fault-Tolerant Channel Extension Network for Internal Recovery
Designing a Fault-Tolerant Channel Extension Network for Internal RecoveryDesigning a Fault-Tolerant Channel Extension Network for Internal Recovery
Designing a Fault-Tolerant Channel Extension Network for Internal Recovery
 
S peculative multi
S peculative multiS peculative multi
S peculative multi
 
Chap2 slides
Chap2 slidesChap2 slides
Chap2 slides
 
Compiler design
Compiler designCompiler design
Compiler design
 
KALMAN FILTER BASED CONGESTION CONTROLLER
KALMAN FILTER BASED CONGESTION CONTROLLERKALMAN FILTER BASED CONGESTION CONTROLLER
KALMAN FILTER BASED CONGESTION CONTROLLER
 
Automated re allocator of replicas
Automated re allocator of replicasAutomated re allocator of replicas
Automated re allocator of replicas
 
Cluster computing pptl (2)
Cluster computing pptl (2)Cluster computing pptl (2)
Cluster computing pptl (2)
 
Clustercomputingpptl2 120204125126-phpapp01
Clustercomputingpptl2 120204125126-phpapp01Clustercomputingpptl2 120204125126-phpapp01
Clustercomputingpptl2 120204125126-phpapp01
 
Harmful interupts
Harmful interuptsHarmful interupts
Harmful interupts
 
End to-end arguments in system design
End to-end arguments in system designEnd to-end arguments in system design
End to-end arguments in system design
 
ADVANCED DIFFUSION APPROACH TO DYNAMIC LOAD-BALANCING FOR CLOUD STORAGE
ADVANCED DIFFUSION APPROACH TO DYNAMIC LOAD-BALANCING FOR CLOUD STORAGEADVANCED DIFFUSION APPROACH TO DYNAMIC LOAD-BALANCING FOR CLOUD STORAGE
ADVANCED DIFFUSION APPROACH TO DYNAMIC LOAD-BALANCING FOR CLOUD STORAGE
 
ADVANCED DIFFUSION APPROACH TO DYNAMIC LOAD-BALANCING FOR CLOUD STORAGE
ADVANCED DIFFUSION APPROACH TO DYNAMIC  LOAD-BALANCING FOR CLOUD STORAGEADVANCED DIFFUSION APPROACH TO DYNAMIC  LOAD-BALANCING FOR CLOUD STORAGE
ADVANCED DIFFUSION APPROACH TO DYNAMIC LOAD-BALANCING FOR CLOUD STORAGE
 
ADVANCED DIFFUSION APPROACH TO DYNAMIC LOAD-BALANCING FOR CLOUD STORAGE
ADVANCED DIFFUSION APPROACH TO DYNAMIC  LOAD-BALANCING FOR CLOUD STORAGEADVANCED DIFFUSION APPROACH TO DYNAMIC  LOAD-BALANCING FOR CLOUD STORAGE
ADVANCED DIFFUSION APPROACH TO DYNAMIC LOAD-BALANCING FOR CLOUD STORAGE
 
Data power Performance Tuning
Data power Performance TuningData power Performance Tuning
Data power Performance Tuning
 

More from Mário Almeida

Empirical Study of Android Alarm Usage for Application Scheduling
Empirical Study of Android Alarm Usage for Application SchedulingEmpirical Study of Android Alarm Usage for Application Scheduling
Empirical Study of Android Alarm Usage for Application SchedulingMário Almeida
 
Android reverse engineering - Analyzing skype
Android reverse engineering - Analyzing skypeAndroid reverse engineering - Analyzing skype
Android reverse engineering - Analyzing skypeMário Almeida
 
High-Availability of YARN (MRv2)
High-Availability of YARN (MRv2)High-Availability of YARN (MRv2)
High-Availability of YARN (MRv2)Mário Almeida
 
Dimemas and Multi-Level Cache Simulations
Dimemas and Multi-Level Cache SimulationsDimemas and Multi-Level Cache Simulations
Dimemas and Multi-Level Cache SimulationsMário Almeida
 
Self-Adapting, Energy-Conserving Distributed File Systems
Self-Adapting, Energy-Conserving Distributed File SystemsSelf-Adapting, Energy-Conserving Distributed File Systems
Self-Adapting, Energy-Conserving Distributed File SystemsMário Almeida
 
Smith waterman algorithm parallelization
Smith waterman algorithm parallelizationSmith waterman algorithm parallelization
Smith waterman algorithm parallelizationMário Almeida
 
Man-In-The-Browser attacks
Man-In-The-Browser attacksMan-In-The-Browser attacks
Man-In-The-Browser attacksMário Almeida
 
Flume-based Independent News Aggregator
Flume-based Independent News AggregatorFlume-based Independent News Aggregator
Flume-based Independent News AggregatorMário Almeida
 
Exploiting Availability Prediction in Distributed Systems
Exploiting Availability Prediction in Distributed SystemsExploiting Availability Prediction in Distributed Systems
Exploiting Availability Prediction in Distributed SystemsMário Almeida
 
High Availability of Services in Wide-Area Shared Computing Networks
High Availability of Services in Wide-Area Shared Computing NetworksHigh Availability of Services in Wide-Area Shared Computing Networks
High Availability of Services in Wide-Area Shared Computing NetworksMário Almeida
 
Instrumenting parsecs raytrace
Instrumenting parsecs raytraceInstrumenting parsecs raytrace
Instrumenting parsecs raytraceMário Almeida
 
Architecting a cloud scale identity fabric
Architecting a cloud scale identity fabricArchitecting a cloud scale identity fabric
Architecting a cloud scale identity fabricMário Almeida
 

More from Mário Almeida (14)

Empirical Study of Android Alarm Usage for Application Scheduling
Empirical Study of Android Alarm Usage for Application SchedulingEmpirical Study of Android Alarm Usage for Application Scheduling
Empirical Study of Android Alarm Usage for Application Scheduling
 
Android reverse engineering - Analyzing skype
Android reverse engineering - Analyzing skypeAndroid reverse engineering - Analyzing skype
Android reverse engineering - Analyzing skype
 
Spark
SparkSpark
Spark
 
High-Availability of YARN (MRv2)
High-Availability of YARN (MRv2)High-Availability of YARN (MRv2)
High-Availability of YARN (MRv2)
 
Dimemas and Multi-Level Cache Simulations
Dimemas and Multi-Level Cache SimulationsDimemas and Multi-Level Cache Simulations
Dimemas and Multi-Level Cache Simulations
 
Self-Adapting, Energy-Conserving Distributed File Systems
Self-Adapting, Energy-Conserving Distributed File SystemsSelf-Adapting, Energy-Conserving Distributed File Systems
Self-Adapting, Energy-Conserving Distributed File Systems
 
Smith waterman algorithm parallelization
Smith waterman algorithm parallelizationSmith waterman algorithm parallelization
Smith waterman algorithm parallelization
 
Man-In-The-Browser attacks
Man-In-The-Browser attacksMan-In-The-Browser attacks
Man-In-The-Browser attacks
 
Flume-based Independent News Aggregator
Flume-based Independent News AggregatorFlume-based Independent News Aggregator
Flume-based Independent News Aggregator
 
Exploiting Availability Prediction in Distributed Systems
Exploiting Availability Prediction in Distributed SystemsExploiting Availability Prediction in Distributed Systems
Exploiting Availability Prediction in Distributed Systems
 
High Availability of Services in Wide-Area Shared Computing Networks
High Availability of Services in Wide-Area Shared Computing NetworksHigh Availability of Services in Wide-Area Shared Computing Networks
High Availability of Services in Wide-Area Shared Computing Networks
 
Instrumenting parsecs raytrace
Instrumenting parsecs raytraceInstrumenting parsecs raytrace
Instrumenting parsecs raytrace
 
Architecting a cloud scale identity fabric
Architecting a cloud scale identity fabricArchitecting a cloud scale identity fabric
Architecting a cloud scale identity fabric
 
SOAP vs REST
SOAP vs RESTSOAP vs REST
SOAP vs REST
 

Recently uploaded

The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 

Recently uploaded (20)

The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 

Flume impact of reliability on scalability

  • 1. Scalable Distributed Systems Impact of reliability in the scalability of Flume Report by Mário Almeida (EMDC) Introduction This report will be describing the mechanisms to provide fault tolerance in Flume as well as its impact on scalability due to the increasing of number of flows and possible bottlenecks. In this case, reliability is the ability to continue delivering events and logging them in the HDFS in the face of failures, without losing data. This failures can be due to physical hardware fails, scarce bandwidth or memory, software crashes, etc. In order to eliminate single points of failure, previous versions of flume had a fail over mechanism that would move the flow towards a new agent without user intervention. This was done via a Flume Master node that would have a global state of all the data flows and could dynamically reconfigure nodes and dataflows through fail-over chains. The problem with this implementation was that Flume Master would become a bottleneck of the system, although it could be replicated through secondary Flume Master Nodes, and it would become complicated to configure its behavior for larger sets of dataflows. This could have a clear impact on the scalability of the system. The actual version of Flume addresses this problem by multiplexing or replicating flows. This means that an agent can send data through different channels for load balancing or it can replicate the same flow through two different channels. This solution does provide the so wanted reliability but it either duplicates the information in the system or it needs more agents in order to balance this load. In order to show the impact of failures in a Flume architecture, a scenario in which the sources are updated very often and are volatile was chosen. This specific case is relevant for the failure of the source gathering agents, since some events get lost. The case in which the Collectors fail can be tolerated in the newer version through the use of persistent channels. Multiple experiments were performed that test the usage of memory channels versus persistent JDBC channels. Also a new mechanism for tolerating failures is proposed and tested against the already existent architectures.
  • 2. Proposal Lets start by describing the architecture of the Flume-based System represented in figure 1. As one can observe, there are 5 distinct agents, 2 of them acting as collectors. The sources Source1, Source2 and Source3 consisted of three different C applications that would generate sequences of numbers and outputting them to a file. This file would only contain a sequence number at any given moment in order to achieve the volatility property. All the agents were in different machines. Figure 1. Implemented Flume-based architecture. Due to the mentioned probable reasons why the Master node was deprecated in the new Flume NG, this report will be evaluating a possible architecture for achieving reliability in Flume. In order to make it less centralized, the idea was to form smaller clusters in which its nodes are responsible of keeping track of each others. And in case one of them fails, another will take its responsibilities, either by gathering information from a source or aggregating data from other Agents.
  • 3. Figure 2. Representation of the clustering of Agents. Red arrows represent ping messages, black dotted arrows represent possible reconfigurations. Following the architecture depicted in figure 2, a cluster could consist of Agent11, Agent12 and Agent21 and another cluster could consist of C1 and C2. For example, in case Agent11 fails, Agent12 or Agent21 will take its place gathering information from Source1, until it finally restarts. In case Collector2 fails, Collector1 will be aggregating the information provenient from Agent21. For this purpose any agent belonging to a cluster has to have knowledge about the other agents in the same cluster. This knowledge includes its sources and sinks. In this experiment the agents would ping each other in order to keep track of alive agents. This notion of multiple small clusters as shown in figure 2, makes the system less dependent on a centralized entity and also eliminates the need of having extra nodes. It might not keep some consistency properties that could be achieved through the use of the embedded Zookeeper though. Experiments All the experiments were conducted with a lifespan of a few hours, generating up to 10 thousand events. The objective was to make this periods as big as possible taking into account the costs and other schedules. Normal execution of flume The first experiment consisted of the normal execution of flume, without any failures and using memory channels. Collector fails while using memory channels Second experiment consisted on disconnecting collector agent 2 during the execution of flume and register the lost events.
  • 4. Collector fails while using persistent channels Third experiment consisted on disconnecting collector agent 2 during the execution of flume using JDBC channels and register the lost events. Agent fails while using persistent channels In this experiment it is the Agent21 that is disconnected using JDBC channels. Dynamic Routing In this last experiment, both Collector2 and Agent11 are disconnected. It uses memory channels. Results Normal execution of flume As expected during the normal execution of flume all the events generated by the sources were logged into the HDFS. This means that the rate at which the data was being generated was well supported by the capacity of the memory channels. Collector fails while using memory channels Although initially I thought that the failure of a collector would imply that the data that was read from the source would be lost by the agent due to the limitations of channel capacity, in mosts tests, Flume was able to store this data and resend it once the Collector was restarted. Flume uses a transactional approach to guarantee the reliable delivery of the events. The events are only removed from a channel when they are successfully stored in the next channel. This should still be dependent on the capacity of the channel but for the used implementation with a channel of 10 thousand events of capacity, the Collector2 could be down for more than one hour without any channel overflow. It was decided to drop this capacity to up to 100 events and double the rate of data generation. After this change I disconnected the Collector2 until more than 100 events were read by Agent21. Once Collector2 came back online it received the 100 events that were stored in the channel of Agent21 but failed to deliver the subsequent events. In fact, it stopped working from that point, requiring a restart. Collector fails while using persistent channels As expected since Flume stores the events in a relational database instead of in memory, when a collector dies the events are stored and delivered whenever it becomes available again. It is indeed the way Flume achieves recoverability. Although it works well it seems that while the memory channel groups events before sending, JDBC channel works more like a pipeline, sending the events one by one. This and the implicit problem of writing into persistent storage might have significant impact on the performance for large scale systems. Also, and probably due to this fact as well, there seemed to be an imbalance between the rate at which the sources data flows were reaching the Collector. In around every 10 events that reached the Collector, only 2 were from Source1 and 8 were from Source2, although both produced data at the same rate. I wonder if it runs for long periods what will happen, in my case
Agent fails while using persistent channels
Although I expected that a stopped agent would lose some events, it still logs them somehow and, after being restarted, resends all the events from the beginning, repeating up to hundreds of events in the process. The experiments seem to indicate that the JDBC channel keeps the events logged even if the agent crashes. Other channels, such as the recoverableMemoryChannel, provide parameters to set how often a background worker checks old logs, but there is no mention of this for the JDBC channel. Overall, although it repeated hundreds if not thousands of events, it did not lose a single one. This would not be the case if the whole machine crashed, but I could not test that scenario since the source ran on the same node as the agent itself; further testing would need a new architecture.

Dynamic Routing
In the way it was implemented, every time a new configuration is generated and the agent is restarted, the events held in memory are lost. This happens when Collector2 is disconnected and Agent21 has to redirect its data flow to Collector1. Overall, some events were lost while migrating between configurations, but this approach achieved better results than the normal in-memory reliability scheme of Flume. Although for all-round purposes the JDBC channels probably achieve better reliability, clustering introduces smaller delays in retrieving data. It might, however, not route the data in the most efficient way if, due to failures, all flows end up going through the same nodes.

Conclusions

As long as the data generation rate does not exceed the available channel capacity, Flume works well for most cases, with or without failures. One failure that probably cannot be handled without replicated flows is the "unsubscribing" from a source caused by the failure of the whole machine (not only the agent) where the source holds volatile information. Further experiments should be conducted to compare the performance of creating clusters against the actual replication of data flows and the subsequent processing needed to store them in the HDFS.

Scalability-wise, it seems that well implemented clusters would mean fewer nodes and less flow of information, since the ping/heartbeat rate in a real system is much lower than a data flow. Still, the way Flume has implemented its reliability is good for its simplicity.

The proposed architecture was implemented through multiple scripts instead of actually changing the source code of Flume. This means that there were some workarounds, such as restarting the agents and reloading configurations, that might introduce errors in the experimentation. These scripts would also never provide easy-to-use access to this routing mechanism. That said, more experiments would be needed to make this report more significant. Even so, it gives an interesting overview of the mechanisms to achieve reliability in Flume while describing their limitations.

Implementing dynamic routing within the actual architecture of Flume could also be achieved by using an Avro client to set different headers depending on the intended route. This could be a possible solution for implementing the proposed architecture.
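As an illustration of this last point, Flume NG ships a multiplexing channel selector that routes events to different channels based on an event header. The sketch below shows what this could look like for Agent21; the header name ("route"), the channel names and the mapping values are chosen only for illustration, and the header itself would be set either by the sending Avro client or, for an exec source like the one used here, by a static interceptor.

agent21.sources.generator.channels = channelToCollector1 channelToCollector2
agent21.sources.generator.selector.type = multiplexing
agent21.sources.generator.selector.header = route
agent21.sources.generator.selector.mapping.c1 = channelToCollector1
agent21.sources.generator.selector.mapping.c2 = channelToCollector2
agent21.sources.generator.selector.default = channelToCollector2

# One possible way of tagging the events with the desired route
agent21.sources.generator.interceptors = routeTag
agent21.sources.generator.interceptors.routeTag.type = static
agent21.sources.generator.interceptors.routeTag.key = route
agent21.sources.generator.interceptors.routeTag.value = c2

A monitoring script could then redirect the flow by changing only the header value set by the interceptor or by the client, instead of rewriting the whole data flow definition.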
Appendix

Configuration steps

Tools
● puttygen
● putty
● scp
● pssh
● Cloudera Manager 3.7
● flume
● hadoop

Installing CDH3 and setting up the HDFS
1. Go to Security Groups and create a new group.
   a. In the inbound rules add ssh, http and icmp (for ping). Apply Rule Changes. If it is only for test purposes you can simply allow all tcp, udp and icmp.
2. Go to Key Pairs, create a key pair and download the pem file.
3. Create the public key file:
   a. puttygen key.ppk -L > ~/.ssh/id_rsa.pub
   b. puttygen /path/to/puttykey.ppk -O private-openssh -o ~/.ssh/id_rsa
4. Create 10 SUSE Medium instances in the AWS Management Console:
   a. Choose SUSE, next.
   b. 10 instances, Medium, next.
   c. Next.
   d. Use the created key pair, next.
   e. Choose the previously created security group.
5. Choose an AWS instance and rename it to CDHManager. Right click on it, Connect -> copy the public DNS.
6. Download Cloudera Manager Free Edition and copy its bin file onto the machine:
   a. scp cloudera-manager-installer.bin root@publicdns:/root/cloudera-manager-installer.bin
7. SSH into the machine and perform an ls; the Cloudera bin file should be there:
   a. ssh root@publicdns
   b. ls (check that the file is present)
8. Install it:
   a. chmod u+x cloudera-manager-installer.bin
   b. sudo ./cloudera-manager-installer.bin
   c. next, next, yes, next, yes... wait until the installation finishes.
9. Go to your web browser and paste the public DNS followed by port 7180, like this: publicDns:7180. Note that you cannot access it yet; this is because the security group does not allow connections on this port:
   a. Go to Security Groups. Add a custom tcp rule with port 7180. Apply rule changes.
   b. Reopen the webpage. Username: admin, password: admin.
   c. Install only the free edition. Continue.
   d. Proceed without registering.
10. Go to My Instances in AWS and select all instances except the CDHManager. Notice that all the public DNS names appear listed below; copy them at once and paste them on the webpage. Remove the parts that are not needed, such as "i-2jh3h53hk3h:". (Sometimes some nodes might not be accessible; just delete/restart them, create another one and put its public DNS in the list.)
11. Find the instances and install CDH3 on them with the default values. Continue.
12. Choose root, all accept the same public key, select your public key and, for the private key, select the pem key. Install.
13. Continue, continue, cluster CDH3.
    a. If an error occurs it is usually due to closed ports (generally icmp or others). A common error is: "The inspector failed to run on all hosts.".
14. Add the hdfs service, for example with 3 datanodes; one of them can be the namenode as well.

Installing Flume NG
2. On Linux install pssh: sudo apt-get install pssh (I have a VirtualBox running Ubuntu and prefer to have all these tools there; I tried X servers on Windows but did not like them).
3. Create a hosts.txt file with all the public DNS names.
4. Install putty on Ubuntu:
   a. sudo apt-get install putty
5. Install Flume on all the hosts:
   a. parallel-ssh -O StrictHostKeyChecking=no -v -i -l root -h hosts.txt "sudo zypper --non-interactive install flume-ng"
6. Make it start at boot:
   a. parallel-ssh -O StrictHostKeyChecking=no -v -i -l root -h hosts.txt "sudo zypper --non-interactive install flume-ng-agent"

Running the experiment
1. Send the scripts that generate the configuration files to the servers:
   a. while read line; do scp -S ssh -o StrictHostKeyChecking=no -o PasswordAuthentication=no configen root@$line:/etc/flume-ng/conf; done <hosts.txt
   b. while read line; do scp -S ssh -o StrictHostKeyChecking=no -o PasswordAuthentication=no genCollector root@$line:/etc/flume-ng/conf; done <hosts.txt
2. Run configen on every machine:
   a. parallel-ssh -O StrictHostKeyChecking=no -v -i -l root -h hosts.txt "cd /etc/flume-ng/conf; sudo chmod u+x configen; ./configen"
   b. Check the results:
      i. parallel-ssh -O StrictHostKeyChecking=no -v -i -l root -h hosts.txt "cd /etc/flume-ng/conf; ls"
3. SSH into the name node and do:
   a. sudo -u hdfs hadoop fs -mkdir flume
4. SSH into the collector agents and do:
   a. sudo -u hdfs flume-ng agent -n collector1 -f /etc/flume-ng/conf/flumeCollector1.conf (note that sudo -u hdfs is needed because of the HDFS authentication mechanism; otherwise the following error
would occur: Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.security.AccessControlException: Permission denied: user=root, access=WRITE, inode="/dfs/nn/image/flume":hdfs:hadoop:drwxr-xr-x).
5. SSH into the agents that access the sources and do the following, replacing # with the agent's number (1, 2, 3):
   a. ./run S#
   b. Check that tail -f S# produces output; stop it with ctrl+c.
   c. sudo flume-ng agent -n agent# -f /etc/flume-ng/conf/flumeAgent#.conf
6. You will probably get the following error:
   a. Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.security.AccessControlException: Permission denied: user=root, access=WRITE, inode="/dfs/nn/image/flume":hdfs:hadoop:drwxr-xr-x (this happens because the reserved non-DFS space is bigger than the space available on the node). To fix it:
      i. Go to the CDH Manager web UI, select Services, hdfs, Configuration. Set dfs.datanode.du.reserved to a value that leaves the HDFS some free space, for example 1737418240 (a bit more than 1 GB).
      ii. Restart the HDFS.
7. Now you are able to run the experiments. If needed, you can download all the files in the HDFS to a local directory with the following command:
   a. hadoop fs -copyToLocal hdfs://ec2-23-22-64-132.compute-1.amazonaws.com:8020/flume /etc/flume-ng/conf/fs
   b. or:
      i. ls Flume* | awk '{print "wget http://ec2-23-22-64-132.compute-1.amazonaws.com:50075/streamFile/flume/"$0"? -O res/data-"$0}' > wget

To conclude, a Java file was created that reads these files, and the data was parsed in order to obtain the results (a small shell sketch of this step is included at the end of this appendix). This experimentation was done for multiple configurations.

Bash script that generates the configuration files

#!/bin/bash
#ping -c4 ip | grep transmitted | awk '{if($4==0){print "online"}else{print "offline"}}'

# Write the C source of the sequence generator to ccode.c and compile it
echo "
#include <stdio.h>
#include <unistd.h>
#include <time.h>

int main(int argc, char *argv[])
{
    int ctr = 0;
    FILE *file;
    file = fopen(argv[1], \"w+\");
    if(file == NULL)
        puts(\"fuuuuuu!\");
    while(1){
        //puts(\".\");
        fprintf(file, \"%s : %d\r\", argv[1], ctr++);
        fflush(file);
        //printf(\"%s : %d\n\", argv[1], ctr++);
        usleep(2000000);
        if(ctr == 1000000)
            ctr = 0;
    }
    fclose(file);
    return 0;
}
" > ccode.c
gcc -o run ccode.c

for agent in 1 2 3
do
    if [[ $agent == 1 || $agent == 2 ]]
    then
        collector="ec2-50-17-85-221.compute-1.amazonaws.com"
    else
        collector="ec2-50-19-2-196.compute-1.amazonaws.com"
    fi
    #echo "Setting $collector"
    echo "
agent$agent.sources = generator
agent$agent.channels = memoryChannel
agent$agent.sinks = avro-forward-sink

# For each one of the sources, the type is defined
agent$agent.sources.generator.type = exec
agent$agent.sources.generator.command = tail -f /etc/flume-ng/conf/S$agent
agent$agent.sources.generator.logStdErr = true

# The channel can be defined as follows.
agent$agent.sources.generator.channels = memoryChannel

# Each sink's type must be defined
agent$agent.sinks.avro-forward-sink.type = avro
agent$agent.sinks.avro-forward-sink.hostname = $collector
agent$agent.sinks.avro-forward-sink.port = 60000

# Specify the channel the sink should use
agent$agent.sinks.avro-forward-sink.channel = memoryChannel

# Each channel's type is defined.
agent$agent.channels.memoryChannel.type = memory

# Other config values specific to each type of channel (sink or source)
# can be defined as well
# In this case, it specifies the capacity of the memory channel
agent$agent.channels.memoryChannel.capacity = 10000
" > flumeAgent$agent.conf
done

#COLLECTORS----------------------
for collector in 1 2
do
    if [[ $collector == 1 ]]
    then
        # Strangely, this agent1 is indeed the collector's address
        agent1="ec2-50-17-85-221.compute-1.amazonaws.com" #"ec2-107-21-171-50.compute-1.amazonaws.com"
        #agent2="ec2-23-22-216-49.compute-1.amazonaws.com"
        sources="avro-source-1"
        ./genCollector $collector "$sources" "$agent1"
    else
        agent1="ec2-50-19-2-196.compute-1.amazonaws.com" #"ec2-107-22-64-107.compute-1.amazonaws.com"
        sources="avro-source-1"
        ./genCollector $collector "$sources" "$agent1"
    fi
done

Bash script that generates a single Collector configuration file

#!/bin/bash
echo "
collector$1.sources = $2
collector$1.channels = memory-1
collector$1.sinks = hdfs-sink #hdfs-sink

# For each one of the sources, the type is defined
collector$1.sources.avro-source-1.type = avro
collector$1.sources.avro-source-1.bind = $3
collector$1.sources.avro-source-1.port = 60000
collector$1.sources.avro-source-1.channels = memory-1
" > flumeCollector$1.conf

echo "
# Each sink's type must be defined
collector$1.sinks.hdfs-sink.type = hdfs #logger
collector$1.sinks.hdfs-sink.hdfs.path = hdfs://ec2-23-22-64-132.compute-1.amazonaws.com:8020/flume
collector$1.sinks.hdfs-sink.channel = memory-1

# Each channel's type is defined.
collector$1.channels.memory-1.type = memory

# Other config values specific to each type of channel (sink or source)
# can be defined as well
# In this case, it specifies the capacity of the memory channel
collector$1.channels.memory-1.capacity = 30000
" >> flumeCollector$1.conf
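Finally, as a complement to the configuration generators above, the following is a small sketch of the result-parsing step mentioned in the "Running the experiment" section. It assumes that the HDFS files were downloaded to a local directory and that they contain plain-text events of the form "S1 : 1234" (for example by setting hdfs.fileType = DataStream on the sink); the actual experiments used a small Java program for this step, so this script is only an illustration.

#!/bin/bash
# Counts how many events from each source reached the HDFS (illustrative only).
# Usage: ./countEvents <directory holding the downloaded Flume files>
DATA_DIR=${1:-res}

for src in S1 S2 S3; do
    # each event line looks like "S1 : 1234"; sum the per-file grep counts
    count=$(grep -hc "^$src :" "$DATA_DIR"/* 2>/dev/null | awk '{s+=$1} END{print s+0}')
    echo "$src: $count events logged"
done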