In today’s Internet world, log file analysis has become a necessary task: analyzing customer behaviour helps improve advertising and sales, and in domains such as environmental, medical and banking systems the log data must be analyzed to extract the required knowledge. Web mining is the process of discovering knowledge from web data. Log files are generated very quickly, at a rate of 1-10 MB/s per machine, so a single data center can produce tens of terabytes of log data in a day. Such datasets are huge, and analyzing them requires a parallel processing system and a reliable data storage mechanism. A virtual database system is an effective solution for integrating the data, but it becomes inefficient for large datasets. The Hadoop framework provides reliable data storage through the Hadoop Distributed File System (HDFS) and a parallel processing system for large datasets through the MapReduce programming model. HDFS breaks up the input data and distributes fractions of the original data as blocks to several machines in the Hadoop cluster. This mechanism allows the log data to be processed in parallel on all machines in the cluster and the result to be computed efficiently. The dominant approach provided by Hadoop, "store first, query later", loads the data into HDFS and then executes queries written in Pig Latin. This approach reduces both the response time and the load on the end system. This paper proposes a log analysis system using Hadoop MapReduce that provides accurate results with minimum response time.
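To make the approach concrete, the sketch below shows the kind of MapReduce job such a log analyzer runs: counting requests per HTTP status code from web-server access logs. This is an illustrative sketch, not the paper's actual implementation; the class names and the assumption that the status code is the ninth space-separated field of a Common Log Format line are assumptions for illustration.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical log analyzer: counts requests per HTTP status code.
// Assumes Common Log Format, where the status code is the 9th space-separated field.
public class StatusCodeCount {

  public static class LogMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text status = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split(" ");
      if (fields.length > 8) {           // skip malformed lines
        status.set(fields[8]);           // e.g. "200", "404", "500"
        context.write(status, ONE);
      }
    }
  }

  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "status code count");
    job.setJarByClass(StatusCodeCount.class);
    job.setMapperClass(LogMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Each mapper works on one HDFS block of the log, so the counting proceeds in parallel across all machines in the cluster, exactly as described above.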
The document provides an overview of Hadoop and its core components. It discusses:
- Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of computers.
- The two core components of Hadoop are HDFS for distributed storage, and MapReduce for distributed processing. HDFS stores data reliably across machines, while MapReduce processes large amounts of data in parallel.
- Hadoop can operate in three modes: standalone, pseudo-distributed and fully distributed. The document focuses on setting up Hadoop in standalone mode for development and testing purposes on a single machine (see the configuration sketch after this list).
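A minimal sketch of what standalone (local) mode means in configuration terms, assuming Hadoop 2.x property names; in a fresh install these values are already the defaults, so setting them explicitly mainly documents the intent.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Standalone (local) mode: one JVM, local file system, no HDFS/YARN daemons.
// Both properties below are the defaults in a vanilla install (Hadoop 2.x names assumed).
public class LocalModeExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "file:///");            // read and write the local file system
    conf.set("mapreduce.framework.name", "local");   // run map and reduce tasks in-process
    Job job = Job.getInstance(conf, "standalone-mode test job");
    System.out.println("Job will run with framework: "
        + job.getConfiguration().get("mapreduce.framework.name"));
  }
}
```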
There is a growing trend of applications that must handle huge volumes of information, yet analysing such data remains a very difficult problem today. Several techniques can be considered for this purpose: technologies such as Grid Computing, Volunteer Computing and RDBMS are potential candidates, and the still-maturing Hadoop tool is another option. We survey all of these techniques to identify a suitable approach for managing and working with Big Data.
Introduction to Big Data and Hadoop using Local Standalone Mode (inventionjournals)
Big Data is a term for data sets so large and complex that traditional data processing applications are inadequate to deal with them. The term often refers simply to the use of predictive and other analytic methods that extract value from data. Big data is generally a collection of large datasets that cannot be processed using traditional computing techniques; it is not merely data, but a complete subject involving various tools, techniques and frameworks. Big data can be any collection whose size or structure exceeds the capability of conventional data management methods. Hadoop is a distributed paradigm used to handle such large volumes of data, covering not only storage but also processing. Hadoop is an open-source software framework for distributed storage and processing of big data sets on computer clusters built from commodity hardware. HDFS was built to support high-throughput, streaming reads and writes of extremely large files. Hadoop MapReduce is a software framework for easily writing applications that process vast amounts of data. The WordCount example reads text files and counts how often words occur; the input is a set of text files and the result is a word-count file, each line of which contains a word and the number of times it occurred, separated by a tab.
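The WordCount map and reduce functions can be sketched as below (the driver is omitted; it is set up the same way as any other MapReduce job). This is the canonical example rather than code taken from the document; with the default TextOutputFormat each output line is the word and its count separated by a tab, matching the output described above.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Classic WordCount map and reduce functions.
public class WordCount {

  public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {      // emit (word, 1) for every token in the line
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();   // total occurrences of this word
      context.write(key, new IntWritable(sum));
    }
  }
}
```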
Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of computers. It allows for the reliable, scalable and distributed processing of large datasets. Hadoop consists of Hadoop Distributed File System (HDFS) for storage and Hadoop MapReduce for processing vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. HDFS stores data reliably across machines in a Hadoop cluster and MapReduce processes data in parallel by breaking the job into smaller fragments of work executed across cluster nodes.
This document discusses big data and Hadoop. It defines big data as large datasets that are difficult to process using traditional methods due to their volume, variety, and velocity. Hadoop is presented as an open-source software framework for distributed storage and processing of large datasets across clusters of commodity servers. The key components of Hadoop are the Hadoop Distributed File System (HDFS) for storage and MapReduce as a programming model for distributed processing. A number of other technologies in Hadoop's ecosystem are also described such as HBase, Avro, Pig, Hive, Sqoop, Zookeeper and Mahout. The document concludes that Hadoop provides solutions for efficiently processing and analyzing big data.
Mankind has stored more than 295 billion gigabytes (295 exabytes) of data since 1986, as per a report by the University of Southern California. Storing and monitoring this data around the clock in widely distributed environments is a huge task for global service organizations. These datasets demand processing power that traditional databases cannot offer, since the data is stored in an unstructured format. Although the MapReduce paradigm of Java-based Hadoop can be used to address this problem, it does not by itself provide maximum functionality. Its drawbacks can be overcome using Hadoop Streaming techniques, which allow users to define non-Java executables for processing these datasets. This paper proposes a THESAURUS model that enables faster and easier business analysis.
This document discusses the evolution from traditional RDBMS to big data analytics. As data volumes grow rapidly, traditional RDBMS struggle to store and process large amounts of data. Hadoop provides a framework to store and process big data across commodity hardware. Key components of Hadoop include HDFS for distributed storage, MapReduce for distributed processing, Hive for SQL-like queries, and Sqoop for transferring data between Hadoop and relational databases. The document also outlines some applications and limitations of Hadoop.
Today’s era is generally treated as the era of data: in every field of computing, huge amounts of data are generated. Society is increasingly dependent on computers, so large volumes of data are produced every second in structured, unstructured or semi-structured form. Such huge amounts of data are generally referred to as big data, and analysing big data is one of the biggest challenges in the current world. Hadoop is an open-source framework that allows big data to be stored and processed in a distributed environment across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage, and it generally follows horizontal processing. MapReduce programs typically run on the Hadoop framework and process large amounts of structured and unstructured data. This paper describes the different joining strategies used in MapReduce programming to combine the data of two files in the Hadoop framework and also discusses the skewness problem associated with them.
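As one illustration of the joining strategies mentioned above, here is a hedged sketch of a reduce-side join of two comma-separated files that share a key in their first column; the file names, tags and field layout are assumptions for illustration, not taken from the paper.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Hypothetical reduce-side join of users.csv and orders.csv on their first column.
// The mapper tags each record with its source file; the reducer pairs records per key.
public class ReduceSideJoin {

  public static class TaggingMapper extends Mapper<LongWritable, Text, Text, Text> {
    private String tag;

    @Override
    protected void setup(Context context) {
      // Tag records by the file they came from ("users*" vs anything else).
      String file = ((FileSplit) context.getInputSplit()).getPath().getName();
      tag = file.startsWith("users") ? "U" : "O";
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      int comma = line.indexOf(',');
      if (comma > 0) {
        String joinKey = line.substring(0, comma);
        context.write(new Text(joinKey), new Text(tag + "\t" + line.substring(comma + 1)));
      }
    }
  }

  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      List<String> users = new ArrayList<>();
      List<String> orders = new ArrayList<>();
      for (Text v : values) {                       // separate records by their source tag
        String[] parts = v.toString().split("\t", 2);
        if ("U".equals(parts[0])) users.add(parts[1]); else orders.add(parts[1]);
      }
      for (String u : users)                        // emit the join: cross product per key
        for (String o : orders)
          context.write(key, new Text(u + "," + o));
    }
  }
}
```

The skew problem the paper discusses shows up here directly: if one join key carries a disproportionate share of records, the reducer handling that key becomes the bottleneck.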
NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE (csandit)
Big data analysis has become very popular in the present-day scenario, and the manipulation of big data has gained the keen attention of researchers in the field of data analytics. Analysis of big data is currently considered an integral part of many computational and statistical departments, and as a result, novel approaches to data analysis are evolving on a daily basis. Thousands of transaction requests are handled and processed every day by different websites associated with e-commerce, e-banking, e-shopping carts, etc. Network traffic and weblog analysis plays a crucial role in such situations, where Hadoop can be suggested as an efficient solution for processing the NetFlow data collected from switches as well as the website access logs collected during fixed intervals.
IRJET - Survey Paper on Map Reduce Processing using HADOOP (IRJET Journal)
This document summarizes a survey paper on MapReduce processing using Hadoop. It discusses how big data is growing rapidly due to factors like the internet and social media. Traditional databases cannot handle big data. Hadoop uses MapReduce and HDFS to store and process extremely large datasets across commodity servers in a distributed manner. HDFS stores data in a distributed file system, while MapReduce allows parallel processing of that data. The paper describes the MapReduce process and its core functions like map, shuffle, reduce. It explains how Hadoop provides advantages like scalability, cost effectiveness, flexibility and parallel processing for big data.
Applications of machine learning are widely used in the real world with either supervised or unsupervised learning processes. A recently emerged domain in information technology is Big Data, which refers to data with characteristics such as volume, velocity and variety. Existing machine learning approaches cannot cope with Big Data, so its processing has to be done in an environment that supports distributed programming. In such an environment, like Hadoop, a distributed file system such as the Hadoop Distributed File System (HDFS) is required to support scalable and efficient access to data. Distributed environments are often associated with cloud computing and data centres and are naturally equipped with GPUs (Graphics Processing Units) that support parallel processing; the environment is therefore suitable for processing huge amounts of data in a short span of time. In this paper we propose a framework with generic operations that support the processing of big data. The framework provides building blocks to support clustering of unstructured data in the form of documents, and we propose an algorithm for scheduling the jobs of multiple users. We built a prototype application as a proof of concept, and the empirical results revealed that the proposed framework shows 95% accuracy when its results are compared with the ground truth.
Web Oriented FIM for large scale dataset using Hadoop (dbpublications)
In large-scale datasets, existing parallel algorithms mine frequent itemsets by balancing the load, distributing the enormous data across a collection of computers. However, we identify a performance issue in these existing mining algorithms [1]. To handle this problem, we introduce a new approach: data partitioning using the MapReduce programming model. In our proposed system we introduce a new structure, the frequent itemset ultrametric tree, in place of conventional FP-trees. Experimental results tell us that eliminating redundant transactions improves performance by reducing computing loads.
Big Data Analysis and Its Scheduling Policy – Hadoop (IOSR Journals)
This document discusses scheduling policies in Hadoop for big data analysis. It describes the default FIFO scheduler in Hadoop as well as alternative schedulers like the Fair Scheduler and Capacity Scheduler. The Fair Scheduler was developed by Facebook to allocate resources fairly between jobs by assigning them to pools with minimum guaranteed capacities. The Capacity Scheduler allows multiple tenants to securely share a large cluster while giving each organization capacity guarantees. It also supports hierarchical queues to prioritize sharing unused resources within an organization.
Big Data with Hadoop – For Data Management, Processing and Storing (IRJET Journal)
This document discusses big data and Hadoop. It begins with defining big data and explaining its characteristics of volume, variety, velocity, and veracity. It then provides an overview of Hadoop, describing its core components of HDFS for storage and MapReduce for processing. Key technologies in Hadoop's ecosystem are also summarized like Hive, Pig, and HBase. The document concludes by outlining some challenges of big data like issues of heterogeneity and incompleteness of data.
This document provides an overview and comparison of RDBMS, Hadoop, and Spark. It introduces RDBMS and describes its use cases such as online transaction processing and data warehouses. It then introduces Hadoop and describes its ecosystem including HDFS, YARN, MapReduce, and related sub-modules. Common use cases for Hadoop are also outlined. Spark is then introduced along with its modules like Spark Core, SQL, and MLlib. Use cases for Spark include data enrichment, trigger event detection, and machine learning. The document concludes by comparing RDBMS and Hadoop, as well as Hadoop and Spark, and addressing common misconceptions about Hadoop and Spark.
This document discusses big data analysis using Hadoop and proposes a system for validating data entering big data systems. It provides an overview of big data and Hadoop, describing how Hadoop uses MapReduce and HDFS to process and store large amounts of data across clusters of commodity hardware. The document then outlines challenges in validating big data and proposes a utility that would extract data from SQL and Hadoop databases, compare records to identify mismatches, and generate reports to ensure only correct data is processed.
Design Issues and Challenges of Peer-to-Peer Video on Demand System (cscpconf)
P2P media streaming and file downloading are among the most popular applications on the Internet. These systems reduce server load and provide scalable content distribution. P2P networking is a new paradigm for building distributed applications. This paper describes the design requirements for P2P media streaming and compares live and Video on Demand systems based on their architecture. We describe and study the traditional approaches for P2P streaming systems, along with design issues, challenges and current approaches for providing P2P VoD services.
Survey of Parallel Data Processing in Context with MapReduce (cscpconf)
MapReduce is a parallel programming model and an associated implementation introduced by Google. In this programming model, a user specifies the computation by two functions, Map and Reduce. The underlying MapReduce library automatically parallelizes the computation and handles complicated issues like data distribution, load balancing and fault tolerance. The original MapReduce implementation by Google, as well as its open-source counterpart, Hadoop, is aimed at parallelizing computation on large clusters of commodity machines. This paper gives an overview of the MapReduce programming model and its applications. The author describes the workflow of the MapReduce process, studies important issues such as fault tolerance in more detail, and illustrates how MapReduce works. The data locality issue in heterogeneous environments can noticeably reduce MapReduce performance; the author therefore addresses the placement of data across nodes in such a way that each node has a balanced data-processing load stored in a parallel manner. Given a data-intensive application running on a Hadoop MapReduce cluster, the author shows how data placement is done in the Hadoop architecture and the role of MapReduce within it, and explains the amount of data to store on each node to achieve improved data-processing performance.
This document discusses Big Data and Hadoop. It begins with prerequisites for Hadoop including Java, OOP concepts, and data structures. It then defines Big Data as being on the order of petabytes, far larger than typical files. Hadoop provides a solution for storing, processing, and analyzing this large data across clusters of commodity hardware using its HDFS distributed file system and MapReduce processing paradigm. A case study demonstrates how Hadoop can help a telecom company analyze usage data from millions of subscribers to improve service offerings.
This document provides an introduction and overview of Hadoop, an open-source framework for distributed storage and processing of large datasets across clusters of computers. It discusses how Hadoop uses MapReduce and HDFS to parallelize workloads and store data redundantly across nodes to solve issues around hardware failure and combining results. Key aspects covered include how HDFS distributes and replicates data, how MapReduce isolates processing into mapping and reducing functions to abstract communication, and how Hadoop moves computation to the data to improve performance.
The document summarizes Aginity's efforts to build a 10 terabyte database application using $5,682.10 worth of commodity hardware. They constructed a 9-box server farm with off-the-shelf components to test leading database systems like MapReduce, in-database analytics, and MPP on a scale that previously would have cost $2.2 million. The goal was to build similar big data capabilities on a smaller budget for their research lab to experiment with different technologies.
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT (ijwscjournal)
The computer industry is being challenged to develop methods and techniques for affordable data processing on large datasets at optimum response times, and the technical challenges in dealing with the increasing demand to handle vast quantities of data are daunting and on the rise. One of the recent processing models offering a more efficient and intuitive solution for rapidly processing large amounts of data in parallel is MapReduce, a framework defining a template approach to programming for performing large-scale data computation on clusters of machines in a cloud computing environment. MapReduce provides automatic parallelization and distribution of computation across several processors and hides the complexity of writing parallel and distributed programming code. This paper provides a comprehensive systematic review and analysis of large-scale dataset processing and dataset handling challenges and requirements in a cloud computing environment using the MapReduce framework and its open-source implementation Hadoop. We define requirements for MapReduce systems to perform large-scale data processing, propose the MapReduce framework and one implementation of it on Amazon Web Services, and present an experiment running a MapReduce system in a cloud environment. The paper concludes that MapReduce is one of the best techniques for processing large datasets and can help developers perform parallel and distributed computation in a cloud environment.
This document discusses big data analytics techniques like Hadoop MapReduce and NoSQL databases. It begins with an introduction to big data and how the exponential growth of data presents challenges that conventional databases can't handle. It then describes Hadoop, an open-source software framework that allows distributed processing of large datasets across clusters of computers using a simple programming model. Key aspects of Hadoop covered include MapReduce, HDFS, and various other related projects like Pig, Hive, HBase etc. The document concludes with details about how Hadoop MapReduce works, including its master-slave architecture and how it provides fault tolerance.
Users Approach on Providing Feedback for Smart Home Devices – Phase II (ijujournal)
Smart Home technology has achieved extraordinary success in making individuals' lives simpler and more relaxing, and has recently brought about numerous smart and refined frameworks that advance intelligent-living innovation. In this paper, we investigate the behavioural intention behind users' approach to providing feedback for smart home devices. We conduct an online survey of a sample of three to five students, selected by simple random sampling, to study users' motives for giving feedback on smart home devices and their expectations. We observed that most users are ready to actively share their input on smart home devices so as to improve the product's service and quality, fulfil users' needs and make their lives easier.
October 2023-Top Cited Articles in IJU.pdf (ijujournal)
International Journal of Ubiquitous Computing (IJU) is a quarterly open-access peer-reviewed journal that provides an excellent international forum for sharing knowledge and results in the theory, methodology and applications of ubiquitous computing. The current information age is witnessing dramatic use of digital and electronic devices in the workplace and beyond. Ubiquitous computing presents rather arduous requirements of robustness, reliability and availability to the end user, and it has received significant and sustained research interest in terms of designing and deploying large-scale and high-performance computational applications in real life. The aim of the journal is to provide a platform for researchers and practitioners from both academia and industry to meet and share cutting-edge developments in the field.
ACCELERATION DETECTION OF LARGE (PROBABLY) PRIME NUMBERS (ijujournal)
This document discusses methods for efficiently generating large prime numbers for use in RSA cryptography. It presents experimental results measuring the time taken to generate prime numbers when trial-dividing the candidate by different numbers of initial primes before applying the Miller-Rabin primality test. The optimal number of trial divisions can be estimated as B = E/D, where E is the time for a Miller-Rabin test and D is the maximum usefulness of trial division. Experimental results on numbers of different sizes support dividing by around 20 initial primes as optimal.
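A rough sketch of the generate-and-test loop the summary describes, assuming trial division by the first B small primes before the more expensive probabilistic test; BigInteger.isProbablePrime applies Miller-Rabin internally, and B = 20 here simply reflects the value reported as near-optimal above.

```java
import java.math.BigInteger;
import java.security.SecureRandom;

// Candidate prime generation: cheap trial division by the first B small primes
// filters out most composites before the costlier Miller-Rabin test runs.
public class PrimeCandidate {
  private static final int[] SMALL_PRIMES = {
      2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71};

  static boolean passesTrialDivision(BigInteger n, int b) {
    for (int i = 0; i < b && i < SMALL_PRIMES.length; i++) {
      BigInteger p = BigInteger.valueOf(SMALL_PRIMES[i]);
      if (n.mod(p).signum() == 0) return n.equals(p);  // divisible: composite unless n == p
    }
    return true;
  }

  public static BigInteger generate(int bits, int b) {
    SecureRandom rnd = new SecureRandom();
    while (true) {
      // Random odd candidate of full bit length.
      BigInteger candidate = new BigInteger(bits, rnd).setBit(bits - 1).setBit(0);
      if (passesTrialDivision(candidate, b) && candidate.isProbablePrime(40))  // Miller-Rabin
        return candidate;
    }
  }

  public static void main(String[] args) {
    System.out.println(generate(1024, 20));  // a 1024-bit probable prime, B = 20 trial divisions
  }
}
```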
A novel integrated approach for handling anomalies in RFID data (ijujournal)
Radio Frequency Identification (RFID) is a convenient technology employed in various applications, and the success of these applications depends heavily on the quality of the data stream generated by RFID readers. The various anomalies found predominantly in RFID data limit the widespread adoption of this technology. Our work eliminates the anomalies present in RFID data in an effective manner so that the data can be used in high-end applications. The approach is a hybrid of middleware and deferred cleaning, because it is not always possible to remove all anomalies and redundancies in the middleware; the processing of the remaining anomalies is deferred until query time, where they are cleaned by business rules. Experimental results show that the proposed approach performs cleaning more effectively than existing approaches.
UBIQUITOUS HEALTHCARE MONITORING SYSTEM USING INTEGRATED TRIAXIAL ACCELEROMET... (ijujournal)
Ubiquitous healthcare has become one of the prominent areas of research in order to address the challenges encountered in the healthcare environment. Contributing to this area, this study developed a system prototype that recommends diagnostic services based on physiological data collected in real time from a distant patient. The prototype uses WBAN body sensors worn by the individual and an Android smartphone as a personal server. Physiological data is collected and uploaded to a Medical Health Server (MHS) via GPRS/Internet to be analysed. Our implemented prototype monitors the activity, location and physiological data, such as SpO2 and heart rate (HR), of the elderly and of patients in rehabilitation. The uploaded information can be accessed in real time by medical practitioners through a web application.
ENHANCING INDEPENDENT SENIOR LIVING THROUGH SMART HOME TECHNOLOGIES (ijujournal)
The population of elderly people is ballooning worldwide as people live longer, but getting older often means declining health and trouble living solo. Smart home technology could keep an eye on older adults and summon help quickly when needed so they can stay independent. This paper looks at a system combining wireless sensors, video analytics, automation, resident monitoring, emergency detection and remote access. Sensors track health signs, activities and appliance use, while video analytics spot events such as falls. Sensor fusion and machine learning learn normal patterns so caregivers can see unhealthy changes and send alerts, and multi-channel alerts reach caregivers and emergency services. A LabVIEW system integrates the devices, enables local and remote oversight, and handles emergency responses. The expected benefits are early illness clues, quick help, less burden on caregivers and optimized home settings. But will older adults use all this technology, and can we prove it really helps them live longer and better? More research on maximizing reliability and evaluating real-world impacts is needed; designed thoughtfully, though, smart homes could profoundly improve the aging experience.
SERVICE DISCOVERY – A SURVEY AND COMPARISON (ijujournal)
The document summarizes and compares several major service discovery approaches. It provides an overview of service discovery objectives and techniques, then surveys prominent protocols including SLP, Jini, and UPnP. Each approach is analyzed based on features like service description, discovery architecture, announcement/query mechanisms, and how they handle service usage and dynamic network changes. The comparison aims to identify strengths and limitations to guide future research in improving service discovery.
SIX DEGREES OF SEPARATION TO IMPROVE ROUTING IN OPPORTUNISTIC NETWORKS (ijujournal)
Opportunistic Networks are able to exploit social behaviour to create connectivity opportunities. This paradigm uses pair-wise contacts for routing messages between nodes. In this context we investigated whether the "six degrees of separation" conjecture of small-world networks can be used as a basis for routing messages in Opportunistic Networks. We propose a simple routing approach that outperforms some popular protocols in simulations carried out with real-world traces using the ONE simulator. We conclude that static graph models are not suitable for underlay routing approaches in highly dynamic networks like Opportunistic Networks without taking into account temporal factors such as the time, duration and frequency of previous encounters.
International Journal of Ubiquitous Computing (IJU) (ijujournal)
International Journal of Ubiquitous Computing (IJU) is a quarterly open-access peer-reviewed journal that provides an excellent international forum for sharing knowledge and results in the theory, methodology and applications of ubiquitous computing. The current information age is witnessing dramatic use of digital and electronic devices in the workplace and beyond. Ubiquitous computing presents rather arduous requirements of robustness, reliability and availability to the end user, and it has received significant and sustained research interest in terms of designing and deploying large-scale and high-performance computational applications in real life. The aim of the journal is to provide a platform for researchers and practitioners from both academia and industry to meet and share cutting-edge developments in the field.
PERVASIVE COMPUTING APPLIED TO THE CARE OF PATIENTS WITH DEMENTIA IN HOMECARE... (ijujournal)
The aging population and the consequent increase in the incidence of dementias are causing many challenges for health systems, mainly related to infrastructure, low service quality and high costs. One solution is to provide care at the patient's home through home care services. However, this is not a trivial task, since a patient with dementia requires constant care and monitoring from a caregiver, who suffers physical and emotional overload. In this context, this work presents a model for the development of pervasive systems aimed at helping the care of these patients in order to lessen the burden on the caregiver while the patient continues to receive the necessary care.
A proposed Novel Approach for Sentiment Analysis and Opinion Mining (ijujournal)
As people become increasingly dependent on the Internet, the requirement for user-opinion analysis is growing exponentially. Customers post their experiences and opinions about products, policies and services, but because of the massive volume of reviews, customers cannot read them all. To solve this problem, a great deal of research is being carried out in Opinion Mining. Through Opinion Mining we can learn the content of whole bodies of product reviews. Blogs are websites that allow one or more individuals to write about things they want to share with others; the valuable data contained in posts from a large number of users across geographic, demographic and cultural boundaries provide a rich data source not only for commercial exploitation but also for psychological and sociopolitical research. This paper tries to demonstrate the plausibility of the idea through our clustering and classification opinion mining experiment on blog posts about recent product, policy and service reviews. We propose a novel approach for analysing reviews to extract customer opinion.
USABILITY ENGINEERING OF GAMES: A COMPARATIVE ANALYSIS OF MEASURING EXCITEMEN... (ijujournal)
Usability engineering and usability testing are concepts that continue to evolve, with interesting research studies and new ideas emerging every now and then. This paper tests the hypothesis of using EDA-based physiological measurements as a usability testing tool by considering three measures: observers' opinions, self-reported data, and EDA-based physiological sensor data. These data were analyzed comparatively and statistically. The paper concludes by discussing the findings obtained from those subjective and objective measures, which partially support the hypothesis.
SECURED SMART SYSTEM DESING IN PERVASIVE COMPUTING ENVIRONMENT USING VCS (ijujournal)
Ubiquitous Computing uses mobile phones or tiny devices with embedded sensors for application development. Collecting and storing the information generated by these devices is a big task, and the onward transmission of the data to its intended destination is delay-tolerant. In this paper, we attempt to propose a new security algorithm for providing security to a Pervasive Computing Environment (PCE) system using a Public-Key Encryption (PKE) algorithm, a Biometric Security (BS) algorithm and a Visual Cryptography Scheme (VCS) algorithm. The proposed PCE monitoring system automates various home appliances using VCS and also provides security against intrusion; a Zigbee IEEE 802.15.4-based sensor network, GSM and Wi-Fi networks are embedded through a standard home gateway.
PERFORMANCE COMPARISON OF ROUTING PROTOCOLS IN MOBILE AD HOC NETWORKS (ijujournal)
Routing protocols play an important role in any Mobile Ad Hoc Network (MANET), and researchers have developed several routing protocols with different performance levels. In this paper we give a performance evaluation of the AODV, DSR, DSDV, OLSR and DYMO routing protocols in Mobile Ad Hoc Networks (MANETs) to determine the best in different scenarios. We analyse these MANET routing protocols using the NS-2 simulator and specify how the number-of-nodes parameter influences their performance. In this study, performance is calculated in terms of Packet Delivery Ratio, Average End-to-End Delay, Normalised Routing Load and Average Throughput.
The document compares the performance of various optical character recognition (OCR) tools. It analyzes eight OCR tools - Online OCR, Free Online OCR, OCR Convert, Convert image to text.net, Free OCR, i2OCR, Free OCR to Word Convert, and Google Docs. The document provides sample outputs of each tool processing the same input image. It then evaluates the tools based on character accuracy, character error rate, special symbol accuracy, and special symbol error rate to determine which tools most accurately convert images to editable text.
Optical Character Recognition (OCR) is a technique, used to convert scanned image into editable text
format. Many different types of Optical Character Recognition (OCR) tools are commercially available
today; it is a useful and popular method for different types of applications. OCR can predict the accurate
result depends on text pre-processing and segmentation algorithms. Image quality is one of the most
important factors that improve quality of recognition in performing OCR tools. Images can be processed
independently (.png, .jpg, and .gif files) or in multi-page PDF documents (.pdf). The primary objective of
this work is to provide the overview of various Optical Character Recognition (OCR) tools and analyses of
their performance by applying the two factors of OCR tool performance i.e. accuracy and error rate.
DETERMINING THE NETWORK THROUGHPUT AND FLOW RATE USING GSR AND AAL2Rijujournal
In multi-radio wireless mesh networks, one node is eligible to transmit packets over multiple channels to
different destination nodes simultaneously. This feature of multi-radio wireless mesh network makes high
throughput for the network and increase the chance for multi path routing. This is because the multiple
channel availability for transmission decreases the probability of the most elegant problem called as
interference problem which is either of interflow and intraflow type. For avoiding the problem like
interference and maintaining the constant network performance or increasing the performance the WMN
need to consider the packet aggregation and packet forwarding. Packet aggregation is process of collecting
several packets ready for transmission and sending them to the intended recipient through the channel,
while the packet forwarding holds the hop-by-hop routing. But choosing the correct path among different
available multiple paths is most the important factor in the both case for a routing algorithm. Hence the
most challenging factor is to determine a forwarding strategy which will provide the schedule for each
node for transmission within the channel. In this research work we have tried to implement two forwarding
strategies for the multi path multi radio WMN as the approximate solution for the above said problem. We
have implemented Global State Routing (GSR) which will consider the packet forwarding concept and
Aggregation Aware Layer 2 Routing (AAL2R) which considers the both concept i.e. both packet forwarding
and packet aggregation. After the successful implementation the network performance has been measured
by means of simulation study.
A SURVEY: TO HARNESS AN EFFICIENT ENERGY IN CLOUD COMPUTINGijujournal
Cloud computing affords huge potential for dynamism, flexibility and cost-effective IT operations. Cloud
computing requires many tasks to be executed by the provided resources to achieve good performance,
shortest response time and high utilization of resources. To achieve these challenges there is a need to
develop a new energy aware scheduling algorithm that outperform appropriate allocation map of task to
optimize energy consumption. This study accomplished with all the existing techniques mainly focus on
reducing energy consumption
HMR LOG ANALYZER: ANALYZE WEB APPLICATION LOGS OVER HADOOP MAPREDUCE
International Journal of UbiComp (IJU), Vol.4, No.3, July 2013
DOI: 10.5121/iju.2013.4304
HMR LOG ANALYZER: ANALYZE WEB
APPLICATION LOGS OVER HADOOP MAPREDUCE
Sayalee Narkhede and Tripti Baraskar
Department of Information Technology, MIT-Pune, University of Pune, Pune
sayleenarkhede@gmail.com, baraskartn@gmail.com
ABSTRACT
In today’s Internet world, log file analysis is becoming a necessary task for analyzing the customer’s
behavior in order to improve advertising and sales as well as for datasets like environment, medical,
banking system it is important to analyze the log data to get required knowledge from it. Web mining is the
process of discovering the knowledge from the web data. Log files are getting generated very fast at the
rate of 1-10 Mb/s per machine, a single data center can generate tens of terabytes of log data in a day.
These datasets are huge. In order to analyze such large datasets we need parallel processing system and
reliable data storage mechanism. Virtual database system is an effective solution for integrating the data
but it becomes inefficient for large datasets. The Hadoop framework provides reliable data storage by
Hadoop Distributed File System and MapReduce programming model which is a parallel processing
system for large datasets. Hadoop distributed file system breaks up input data and sends fractions of the
original data to several machines in hadoop cluster to hold blocks of data. This mechanism helps to
process log data in parallel using all the machines in the hadoop cluster and computes result efficiently.
The dominant approach provided by hadoop to “Store first query later”, loads the data to the Hadoop
Distributed File System and then executes queries written in Pig Latin. This approach reduces the response
time as well as the load on to the end system. This paper proposes a log analysis system using Hadoop
MapReduce which will provide accurate results in minimum response time.
KEYWORDS
Hadoop, MapReduce, Log Files, Parallel Processing, Hadoop Distributed File System.
1. INTRODUCTION
Today, almost everything is going online. Every field has its own way of putting its applications and
business on the Internet. Sitting at home we can shop, do banking-related work, get weather information,
and use many more services. In such a competitive environment, service providers are eager to know
whether they are providing the best service in the market, whether people are purchasing their products,
and whether users find the application interesting and friendly to use; in the field of banking, for example,
they need to know how many customers are interested in a particular scheme. Similarly, they also need to
know about problems that have occurred, how to resolve them, how to make websites or web applications
more engaging, which products people are not purchasing and how to improve advertising strategies to
attract customers in that case, and what the future marketing plans should be. Log files help answer all
these questions. A log file contains a list of actions that occurred whenever someone accessed your
website or web application. These log files reside in web servers. Each individual request is listed on a
separate line in a log file and is called a log entry; it is created automatically every time someone makes a
request to your web site. The point of a log file is to keep track of what is happening on the web server.
Log files are also used to keep track of complex systems, so that when a problem does occur, it is easy to
pinpoint and fix. But
there are times when log files are too difficult to read or make sense of, and it is then that log file analysis
becomes necessary. Log files hold a great deal of useful information for service providers; analyzing them
yields insights into website traffic patterns, user activity, user interests, and so on [10][11]. Thus, through
log file analysis we can answer all the questions above, since the log is the record of people's interaction
with websites and applications.
Figure 1. Workflow of the System
1.1. Background
Log files are being generated at a record rate; a data center can produce thousands of terabytes, even
petabytes, of log files in a day. Storing and analyzing such large volumes of logs is very challenging. The
problem is complicated not only by the volume but also by the disparate structure of log files.
Conventional database solutions are not suitable for analyzing such log files because they cannot handle
such a large volume of logs efficiently. Andrew Pavlo and Erik Paulson [13] compared SQL DBMSs with
Hadoop MapReduce in 2009 and found that Hadoop MapReduce starts on the task faster and also loads
data faster than a DBMS; a traditional DBMS, moreover, cannot handle very large datasets. This is where
big data technologies come to the rescue [8]. Hadoop MapReduce [6][8][17] is applicable in many areas of
Big Data analysis. Since log files are one type of big data, Hadoop is a well-suited platform for storing log
files and for the parallel implementation of MapReduce [3] programs that analyze them. Apache Hadoop
offers enterprises a new way to store and analyze data. Hadoop is an open-source project created by Doug
Cutting [17] and administered by the Apache Software Foundation. It enables applications to work with
thousands of nodes and petabytes of data. While it can be used on a single machine, its true power lies in
its ability to scale to hundreds or thousands of computers, each with several processor cores. As described
by Tom White [6], a Hadoop cluster can contain thousands of nodes that store multiple blocks of log files.
Hadoop is specifically designed to work on large volumes of information by connecting commodity
computers so that they work in parallel. Hadoop breaks log files into blocks and distributes these blocks
evenly over the nodes of the cluster; it also replicates the blocks over multiple nodes to provide reliability
and fault tolerance. The parallel computation of MapReduce improves performance on large log files by
breaking the job into a number of tasks.
1.2. Special Issues
1.2.1. Data Distribution
Computation over large volumes of log files has been performed before, but what makes Hadoop different
is its simplified programming model and its efficient, automatic
distribution of data and work across the machines. Condor, for example, does not provide automatic data
distribution; a separate SAN must be managed in addition to the compute cluster.
1.2.2. Isolation of Processes
Each individual record is processed by a task in isolation from the others, which limits the communication
overhead between processes in Hadoop. This makes the whole framework more reliable. In MapReduce,
Mapper tasks process records in isolation, so an individual node failure can be worked around by restarting
its tasks on another machine. Because of this isolation, the other nodes continue to operate as if nothing
went wrong.
1.2.3. Type of Data
Log files are plain text files consisting of semi-structured or unstructured records. A traditional RDBMS
has a pre-defined schema, and all the log data must be made to fit that schema; the Trans-Log algorithm,
for instance, converts unstructured logs into structured form by transforming a simple text log file into an
Oracle table [12]. But a traditional RDBMS again has limitations on the size of the data. Hadoop is
compatible with any type of data and works well even for simple text files.
1.2.4. Fault Tolerance
A Hadoop cluster addresses the problem of data loss: blocks of the input file are replicated, by a factor of
three, on multiple machines in the cluster [4]. Even if a machine goes down, another machine holding the
same block takes over further processing. Thus the failure of some nodes in the cluster does not affect the
operation of the system.
1.2.5. Data Locality and Network Bandwidth
Log files are spread across HDFS as blocks, so a compute process running on a node operates on a subset
of the data. The data a node operates on is chosen by locality, i.e. it is read from the local file system,
which reduces the strain on network bandwidth and prevents unnecessary network transfers. This property
of moving the computation near the data distinguishes Hadoop from other systems [4][6] and is what gives
it high performance.
2. SYSTEM ARCHITECTURE
The system architecture shown in Figure 2 consists of three major components: web servers, a cloud
framework implementing Hadoop storage and the MapReduce programming model, and a user interface.
Figure 2. System Architecture
2.1. Web Servers
This module consists of multiple web servers from which log files are collected. Since log files reside on
the web servers, we need to collect them from these servers, and the collected log files may require
pre-processing before they are stored in HDFS [4][6]. Pre-processing consists of cleaning the log files,
removing redundancy, and so on; it has to be performed because we need good-quality data. These servers
are therefore responsible for supplying the relevant log files that need to be processed.
2.2. Cloud Framework
The cloud consists of a storage module and the MapReduce [3] processing model. Many virtual servers
configured with Hadoop store the log files in a distributed manner in HDFS [4][6]. By dividing log files
into blocks of 64 MB or more, we can store them on multiple virtual servers in a Hadoop cluster. Workers
in MapReduce are assigned Map and Reduce tasks and compute the Map tasks in parallel. The analysis of
log files therefore happens in just two phases, Map and Reduce: the Map tasks generate intermediate
results as (key, value) pairs, and the Reduce tasks provide the summarized value for each key. Pig [5],
installed on the Hadoop virtual servers, maps user queries to MapReduce jobs, because working out how to
fit log processing into the MapReduce pattern is a challenge [14]. The evaluated results of log analysis are
stored back on the HDFS virtual servers.
2.3. User Interface
This module mediates between the user and the HMR system, allowing the user to specify a processing
query, evaluate the results, and also visualize the results of the log analysis in various graphical reports.
3. IMPLEMENTATION OF HMR LOG PROCESSOR
The HMR log processor is implemented in three phases: log pre-processing, interaction with HDFS, and
implementation of the MapReduce programming model.
3.1. Log Pre-processing
In any analytical tool, pre-processing is necessary because a log file may contain noisy and ambiguous
data that can affect the result of the analysis. Log pre-processing is an important step that filters and
organizes only the appropriate information before the MapReduce algorithm is applied. Pre-processing
reduces the size of the log file and increases the quality of the available data; its purpose is to improve log
quality and increase the accuracy of the results.
3.1.1. Individual Field In The Log
A log file contains many fields such as IP address, URL, date, hit, age, country, state, and city. But since
the log file is a simple text file, we need to separate out each field in each log entry. This can be done using
a separator such as a space, ';', or '#'. We have used '#' here to separate the fields in a log entry.
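For illustration, a minimal Java sketch of this field separation is given below; the field order shown (IP address first, URL second, and so on) is only an assumed example layout, not one fixed by the system.

// Minimal sketch: splitting one '#'-separated log entry into its fields.
// The field order (IP, URL, date, hit, age, country, state, city) is an
// illustrative assumption, not a layout prescribed by the paper.
public class LogFieldSplitter {
    public static void main(String[] args) {
        String entry = "10.0.0.1#/home.html#2013-07-01#1#25#India#Maharashtra#Pune";
        String[] fields = entry.split("#");
        System.out.println("IP: " + fields[0] + ", URL: " + fields[1]);
    }
}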
3.1.2. Removing Redundant Log Entries
A person often visits the same page many times. Even if a person has visited a particular page 10 times,
the processor should count this as one hit on that page. We can check this using the IP address and URL:
if the IP address and URL are the same for 10 log entries, the other 9 entries must be removed, leaving
only one entry for that log. This improves the accuracy of the system's results.
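A minimal Java sketch of this redundancy check, assuming the '#'-separated layout described above, keeps only the first entry seen for each (IP address, URL) pair:

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of redundancy removal: keep only the first entry for each
// (IP, URL) pair so that repeated visits by the same client count as one hit.
public class RedundancyFilter {
    public static List<String> dedupe(List<String> entries) {
        Set<String> seen = new HashSet<String>();
        List<String> unique = new ArrayList<String>();
        for (String entry : entries) {
            String[] f = entry.split("#");
            String key = f[0] + "|" + f[1];   // IP + URL identifies a visit
            if (seen.add(key)) {              // add() returns false for duplicates
                unique.add(entry);
            }
        }
        return unique;
    }
}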
3.1.3. Cleaning
Cleaning means removing unnecessary log entries from the log file, i.e. entries for multimedia files,
images, and page styles with extensions such as .jpg, .gif, and .css. These entries are unnecessary for
application log analysis, so we remove them to obtain a log file containing only quality logs. This also
reduces the size of the log file somewhat.
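A small Java sketch of this cleaning filter, again assuming the URL is the second '#'-separated field, might look as follows:

// Sketch of the cleaning step: drop log entries whose requested URL points to
// multimedia or style resources (.jpg, .gif, .css), which carry no useful
// information for application-level analysis.
public class LogCleaner {
    private static final String[] SKIP_EXTENSIONS = {".jpg", ".gif", ".css"};

    public static boolean isRelevant(String entry) {
        String url = entry.split("#")[1].toLowerCase();  // URL assumed to be the 2nd field
        for (String ext : SKIP_EXTENSIONS) {
            if (url.endsWith(ext)) {
                return false;
            }
        }
        return true;
    }
}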
3.2. Interacting With HDFS
Figure 3. Hadoop Distributed File System
The Hadoop Distributed File System holds large log files redundantly across multiple machines to achieve
high availability for parallel processing and durability during failures. It also provides high-throughput
access to the log files. It is a block-structured file system: it breaks log files into blocks of a fixed size.
Hadoop's default block size is 64 MB, but we can also set a block size of our choice. These blocks are
replicated over multiple machines across the Hadoop cluster. The default replication factor is 3, so Hadoop
replicates each block three times; even if a node fails there should be no data loss. Hadoop storage is shown
in Figure 3. In Hadoop, log file data is accessed in a write-once, read-many-times manner. HDFS is a
powerful companion to Hadoop MapReduce: MapReduce jobs automatically draw their input log files
from HDFS when the fs.default.name configuration option points to the NameNode.
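As a rough illustration, the following Java sketch (using the old org.apache.hadoop.mapred API) points a job configuration at a NameNode through fs.default.name; the host and port are placeholders, not values taken from the paper.

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.mapred.JobConf;

// Sketch: pointing a job at the cluster's NameNode so that its input is read
// from HDFS. The NameNode address is a placeholder for illustration only.
public class HdfsJobSetup {
    public static JobConf configure() throws Exception {
        JobConf conf = new JobConf(HdfsJobSetup.class);
        conf.set("fs.default.name", "hdfs://namenode:9000"); // placeholder address
        FileSystem fs = FileSystem.get(conf);                // handle to HDFS
        System.out.println("Default file system: " + fs.getUri());
        return conf;
    }
}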
3.3. MapReduce Framework
MapReduce is a simple programming model for parallel processing of large volumes of data. The data
could be anything, but MapReduce is specifically designed to process lists: its fundamental concept is to
transform lists of input data into lists of output data. Input data is often not in a readable format, and
understanding large input datasets can be difficult; in that case we need a model that can mold the input
lists into readable, understandable output lists. MapReduce performs this transformation in its two major
tasks, Map and Reduce, by dividing the whole workload into a number of tasks and distributing them over
the machines of the Hadoop cluster. The logs in log files are also in the form of lists: a log file consists of
thousands of text records, i.e. logs. Nowadays business servers generate log files of terabytes in size in a
day, and from a business perspective these log files need to be processed so that we have appropriate
reports of
how the business is going. For this application, the MapReduce implementation in Hadoop is one of the
best solutions. MapReduce is divided into two phases: the Map phase and the Reduce phase.
3.3.1. Map Phase
Figure 4. Architecture of Map Phase
The input to MapReduce is the log file; each record in the log file is an input to a Map task. The Map
function takes a key-value pair as input and produces an intermediate result, also as key-value pairs. It
takes each attribute in the record as a key and maps each value in the record to its key, generating the
intermediate output as key-value pairs. Map reads each log from the simple text file, breaks it into the
sequence of keys (x1, x2, ..., xn), and emits a value for each key, which is always 1. If a key x appears n
times among all records, there will be n key-value pairs (x, 1) in the output.
Map: (x1, v1) → [(x2, v2)]
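A possible Mapper corresponding to this description, written against the old org.apache.hadoop.mapred API that the paper refers to (JobConf, OutputCollector, Reporter), is sketched below; the '#' separator is the one chosen in Section 3.1.1, while the class name is only illustrative.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Sketch of the Map phase described above: each log entry is split on '#'
// and every field value x is emitted as a key with the count 1.
public class LogMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text outKey = new Text();

    public void map(LongWritable offset, Text logLine,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        for (String field : logLine.toString().split("#")) {
            outKey.set(field);
            output.collect(outKey, ONE);   // emits (x, 1) for every field x
        }
    }
}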
3.3.2. Grouping
After all the Map tasks have completed successfully, the master controller merges the intermediate result
files from each Map task that are destined for a particular Reduce task and feeds the merged file to that
task as a sequence of key-value pairs. That is, for each key x, the input to the Reduce task that handles key
x is a pair of the form (x, [v1, v2, . . . , vn]), where (x, v1), (x, v2), . . . , (x, vn) are all the key-value pairs
with key x coming from all the Map tasks.
3.3.3. Reduce Phase
The Reduce task takes a key and its list of associated values as input. It combines the values for that key
by reducing the list to a single value, the count of occurrences of the key in the log file, and generates
output in the form of the key-value pair (x, sum).
Reduce: (x2, [v2]) → (x3, v3)
Figure 5. Architecture of Reduce Phase
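The matching Reducer sketch sums the list of 1s emitted for each key to obtain its total number of occurrences; as before, it uses the old mapred API and the class name is illustrative.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Sketch of the Reduce phase: for each key, the 1s produced by the Map tasks
// are summed to give the total number of occurrences (hits) of that key.
public class LogReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));   // emits (x, sum)
    }
}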
4. IN DEPTH MAPREDUCE DATA FLOW
4.1. Input Data
The input to MapReduce comes from HDFS, where the log files are stored on the processing cluster. By
dividing the log files into small blocks we can distribute them over the nodes of the Hadoop cluster. The
format of the input files to MapReduce is arbitrary, but for log files it is line-based: each line is considered
one record, i.e. one log. InputFormat is the class that defines how input files are split and how they are
read. FileInputFormat is the abstract class from which all InputFormats that operate on files inherit their
functionality. When a MapReduce job starts, FileInputFormat reads the log files from the input directory
and splits them into chunks. By calling the setInputFormat() method on the JobConf object we can set any
InputFormat provided by Hadoop. TextInputFormat treats each line of the input file as one record and is
best suited to unformatted data; for line-based log files, TextInputFormat is the natural choice. When
MapReduce runs, the whole job is broken into pieces that operate on blocks of the input files. These pieces
are the Map tasks, which operate on input blocks or on the whole log file. By default a file is broken into
64 MB blocks, but we can control the split size by setting the mapred.min.split.size parameter. For large
log files this helps improve performance by parallelizing the Map tasks. Because tasks can be scheduled
on the specific nodes of the cluster where the log file blocks actually reside, log files can be processed
locally.
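As an example, the input side of such a job might be configured as in the following sketch; the 128 MB minimum split size is only a sample value, not one prescribed by the paper.

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

// Sketch: selecting the line-based input format and a minimum split size.
public class InputSetup {
    public static void configure(JobConf conf) {
        conf.setInputFormat(TextInputFormat.class);      // one log record per line
        conf.set("mapred.min.split.size", "134217728");  // 128 MB, sample value only
    }
}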
Figure 6. MapReduce WorkFlow
4.2. RecordReader
RecordReader is the class that defines how the logs in the log files are accessed and converts them into
(key, value) pairs readable by the Mapper. TextInputFormat provides LineRecordReader, which treats each
line as a new value. The RecordReader is invoked repeatedly until the whole block has been consumed,
and the map() method of the Mapper is called each time the RecordReader produces a record.
4.3. Mapper
The Mapper performs the Map phase of the MapReduce job. Its map() method emits intermediate output in
the form of (key, value) pairs. For each Map task, a new instance of the Mapper is instantiated in a separate
Java process. The map() method receives four parameters: key, value, OutputCollector, and Reporter. The
collect() method of the OutputCollector object forwards the intermediate (key, value) pairs to the Reducer
as input for the Reduce phase. The Map task reports its progress to the Reporter object, which provides
information about the current task, such as its InputSplit, and emits status back to the user.
4.4. Shuffle And Sort
After the first Map tasks have completed, the nodes begin exchanging intermediate output, moving it from
the Map tasks to the Reduce tasks for which it is destined. This process of moving intermediate output
from Map tasks to Reduce tasks as input is called shuffling, and it is the only communication step in
MapReduce. Before these (key, value) pairs are given as input to the Reducers, Hadoop groups all the
values for the same key, no matter where they come from. The partitioned data is then assigned to the
Reduce tasks.
4.5. Reducer
A Reducer instance calls its reduce() method for each key in the partition assigned to it. The method
receives a key and an iterator over all the values associated with that key. Like map(), the reduce() method
has parameters for the key, the iterator over the values, the OutputCollector, and the Reporter, which work
in the same way as for map().
4.6. Output Data
MapReduce passes (key, value) pairs to the OutputCollector, which writes them to the output files. The
OutputFormat defines how these pairs are written. TextOutputFormat works similarly to TextInputFormat:
it writes each (key, value) pair on a separate line of the output file, with the key and value separated by a
tab. Hadoop's OutputFormats write their files mainly to HDFS.
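Putting the pieces together, a driver for the whole job might be sketched as follows, reusing the illustrative LogMapper and LogReducer classes from Section 3.3; all class names and the split-size value are assumptions for illustration.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

// Sketch of a job driver tying the earlier pieces together (old mapred API).
// Input and output paths are taken from the command line.
public class HmrLogDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(HmrLogDriver.class);
        conf.setJobName("hmr-log-analysis");

        conf.setInputFormat(TextInputFormat.class);     // one log entry per line
        conf.setOutputFormat(TextOutputFormat.class);   // writes "key<TAB>value" lines
        conf.set("mapred.min.split.size", "134217728"); // sample 128 MB minimum split

        conf.setMapperClass(LogMapper.class);
        conf.setReducerClass(LogReducer.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);                         // submit and wait for completion
    }
}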
5. EXPERIMENTAL SETUP
We present here sample results of the proposed work. For the sample test we took log files from a banking
application server. These log files contain fields such as IP address, URL, date, age, hit, city, state, and
country. We installed Hadoop on two machines, with three instances of Hadoop on each machine, running
Ubuntu 10.10 or later; we also installed Pig for query processing. The log files are distributed evenly
across these nodes. We created a web application through which the user can distribute log files on the
Hadoop cluster, run the MapReduce job on these files, and view the analyzed results in graphical formats
such as bar charts and pie charts. The analyzed results show total hits for the web application, hits per
page, hits per city, hits per page from each city, hits per quarter of the year, hits for each page during the
whole year, and so on. Figure 7 gives an idea of the total hits from each city, and Figure 8 shows the total
hits per page.
Figure 7. Total hits from each city
Figure 8. Total hits per page
6. CONCLUSIONS
In order to obtain summarized results for a particular web application, we need log analysis, which helps
improve business strategies and generate statistical reports. The Hadoop MapReduce log file analysis tool
provides graphical reports showing hits on web pages, user activity, which parts of the website users are
interested in, traffic sources, and so on. From these reports, business communities can evaluate which parts
of the website need to be improved, who the potential customers are, and from which geographical region
the website is getting maximum hits, all of which helps in designing future marketing plans. Log analysis
can be done by various methods, but what matters is response time. The Hadoop MapReduce framework
provides parallel distributed processing and reliable data storage for large volumes of log files. Firstly,
data is stored hierarchically on several nodes in a cluster so that access time is reduced, which saves much
of the processing time; Hadoop's characteristic of moving the computation to the data, rather than the data
to the computation, helps improve response time. Secondly, MapReduce works successfully on large
datasets, giving efficient results.
REFERENCES
[1] S. Sathya, Prof. M. Victor Jose, (2011) “Application of Hadoop MapReduce Technique to Virtual
Database System Design”, International Conference on Emerging Trends in Electrical and
Computer Technology (ICETECT), pp. 892-896.
[2] Yulai Yuan, Yongwei Wu, Xiao Feng, Jing Li, Guangwen Yang, Weimin Zheng, (2010) “VDB-MR:
MapReduce-based distributed data integration using virtual database”, Future Generation Computer
Systems, vol. 26, pp. 1418-1425.
[3] Jeffrey Dean and Sanjay Ghemawat, (2004) “MapReduce: Simplified Data Processing on Large
Clusters”, Google Research Publication.
[4] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler, (2010) “The Hadoop
Distributed File System”, Mass Storage Systems and Technologies(MSST), Sunnyvale, California
USA, vol. 10, pp. 1-10.
[5] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, (2008) “Pig Latin: a not-so-foreign
language for data processing”, ACM SIGMOD International Conference on Management of Data,
pp. 1099–1110.
[6] Tom White, (2009) “Hadoop: The Definitive Guide”, O’Reilly, Sebastopol, California.
[7] M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica, (2008) “Improving MapReduce
performance in heterogeneous environments”, OSDI’08: 8th USENIX Symposium on Operating
Systems Design and Implementation.
[8] Mr. Yogesh Pingle, Vaibhav Kohli, Shruti Kamat, Nimesh Poladia, (2012)“Big Data Processing
using Apache Hadoop in Cloud System”, National Conference on Emerging Trends in
Engineering & Technology.
[9] Cooley R., Srivastava J., Mobasher B., (1997) “Web mining: information and pattern discovery on
the World Wide Web”, IEEE International Conference on Tools with Artificial Intelligence, pp. 558-
567.
[10] Liu Zhijing, Wang Bin, (2003) “Web mining research”, International conference on computational
intelligence and multimedia applications, pp. 84-89.
[11] Yang, Q. and Zhang, H., (2003) “Web-Log Mining for predictive Caching”, IEEE Trans.
Knowledge and Data Eng., 15( 4), pp. 1050-1053.
[12] P. Nithya, Dr. P. Sumathi, (2012) “A Survey on Web Usage Mining: Theory and Applications”,
International Journal Computer Technology and Applications, Vol. 3, pp. 1625-1629.
[13] Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden,
Michael Stonebraker, (2009) ”A Comparison of Approaches to Large-Scale Data Analysis”, ACM
SIGMOD’09.
[14] Gates et al., (2009) “Building a High-Level Dataflow System on top of Map-Reduce: The Pig
Experience”, VLDB 2009, Section 4.
[15] LI Jing-min, HE Guo-hui, (2010) “Research of Distributed Database System Based on Hadoop”,
IEEE International conference on Information Science and Engineering (ICISE), pp. 1417-1420.
[16] T. Hoff, (2008) “How Rackspace Now Uses MapReduce and Hadoop To Query Terabytes of
Data”.
[17] Apache Hadoop, http://Hadoop.apache.org
Authors
Sayalee Narkhede is an ME student in the IT department of MIT-Pune under the University
of Pune. She received her BE degree in Computer Engineering from AISSMS IOIT under
the University of Pune. Her research interests include Distributed Systems and Cloud
Computing.
Tripti Baraskar is an assistant professor in the IT department of MIT-Pune. She received
her ME degree in Communication Controls and Networking from MIT Gwalior and her
BE degree from the Oriental Institute of Science and Technology under RGTV. Her
research interests include Image Processing.