Hadoop has been a leading solution for storing and processing big data since its release in late 2006. Its processing model follows a master-slave design that splits a large job into many smaller tasks so they can be processed in parallel, an approach adopted instead of pushing one large file through a costly super machine to extract useful information. Hadoop performs very well with large files, but when big data arrives as a large number of small files it can suffer performance problems: slow processing, delayed data access, high latency, and, in extreme cases, a complete cluster shutdown. In this paper we highlight one of Hadoop's limitations that affects data processing performance, known as the "big data in small files" problem, which occurs when a massive number of small files is pushed into a Hadoop cluster and can drive the cluster to a total shutdown. The paper also highlights native and proposed solutions to the small-file problem, how they work to reduce its negative effects on a Hadoop cluster, and how they add extra performance to the storage and access mechanisms.
IRJET: Generate Distributed Metadata using Blockchain Technology within HDFS ... (IRJET Journal)
This document proposes a new HDFS architecture that eliminates the single point of failure of the NameNode by distributing metadata storage using blockchain technology. In the traditional HDFS, the NameNode stores all metadata, but in the new architecture this is replaced by blockchain miners that securely store encrypted metadata across data nodes. Blockchain links data blocks in a serial manner with cryptographic hashes to ensure integrity. The key components are HDFS clients, data nodes for storage, and specially designated miner nodes that help create and store metadata blocks in an encrypted and distributed fashion similar to how transactions are recorded in a blockchain. This architecture aims to provide reliable, secure and faster metadata access without a single point of failure.
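The hash-chained metadata structure described above can be illustrated with a small sketch. The class and field names below are illustrative only, not the paper's actual design: each metadata block records the hash of its predecessor, so any tampering breaks the chain.

```python
import hashlib
import json
import time

def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as a hex string."""
    return hashlib.sha256(data).hexdigest()

class MetadataBlock:
    """One block in a hash-linked chain of HDFS-style file metadata."""
    def __init__(self, metadata: dict, prev_hash: str):
        self.metadata = metadata          # e.g. file name, block ids, owner
        self.prev_hash = prev_hash        # hash of the previous block
        self.timestamp = time.time()
        self.hash = sha256_hex(json.dumps(
            {"meta": metadata, "prev": prev_hash, "ts": self.timestamp},
            sort_keys=True).encode())

def verify_chain(chain):
    """Check that every block still points at the hash of its predecessor."""
    for prev, curr in zip(chain, chain[1:]):
        if curr.prev_hash != prev.hash:
            return False
    return True

genesis = MetadataBlock({"file": "/data/part-0000", "blocks": [1, 2]}, prev_hash="0" * 64)
nxt = MetadataBlock({"file": "/data/part-0001", "blocks": [3]}, prev_hash=genesis.hash)
print(verify_chain([genesis, nxt]))   # True; changing any earlier block breaks the link
```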
Data Deduplication: Venti and its improvements (Umair Amjad)
This document summarizes the data deduplication system called Venti and improvements over it. Venti identifies duplicate data blocks using cryptographic hashes of block contents. It stores only a single copy of each unique block. The document discusses three key limitations of Venti: hash collisions, fixed-size chunking sensitivity, and access control. It then summarizes approaches taken by other systems to improve on these limitations, such as using multiple hash functions to reduce collisions, variable-length chunking, and stronger authentication and encryption. In conclusion, while Venti was effective at eliminating data duplication, later systems aimed to address its remaining challenges to handle growing archive sizes securely and efficiently.
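The fixed-size versus variable-length chunking trade-off mentioned above can be shown with a toy sketch. This is a generic rolling-hash illustration, not Venti's actual algorithm: with content-defined boundaries, an insertion near the start of a file shifts far fewer chunk boundaries, so more duplicate chunks are still detected.

```python
import hashlib

def fixed_chunks(data: bytes, size: int = 8):
    """Cut data into fixed-size chunks; an insertion shifts every later boundary."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def content_defined_chunks(data: bytes, mask: int = 0x07):
    """Cut wherever a simple byte-window hash matches a pattern; boundaries follow content."""
    chunks, start, acc = [], 0, 0
    for i, byte in enumerate(data):
        acc = ((acc << 1) ^ byte) & 0xFFFF
        if (acc & mask) == 0 and i > start:      # boundary condition
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def unique_fingerprints(chunks):
    return {hashlib.sha1(c).hexdigest() for c in chunks}

original = b"the quick brown fox jumps over the lazy dog" * 4
edited = b"X" + original                         # one byte inserted at the front
shared_fixed = unique_fingerprints(fixed_chunks(original)) & unique_fingerprints(fixed_chunks(edited))
shared_cdc = unique_fingerprints(content_defined_chunks(original)) & unique_fingerprints(content_defined_chunks(edited))
print(len(shared_fixed), len(shared_cdc))        # content-defined chunking usually keeps more shared chunks
```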
Cloud Data De-Duplication in Multiuser Environment: DeposM2 (ijtsrd)
Nowadays, cloud computing produces a huge amount of sensitive data, such as personal information, financial data, electronic health records, and social media data. This leads to duplication of data, which hurts the storage and performance of cloud systems. Data deduplication has been widely used to eliminate redundant storage overhead in cloud storage systems and to improve the efficiency of IT resources. However, traditional techniques face a great challenge in big-data deduplication: striking a sensible trade-off between the conflicting goals of scalable deduplication throughput and a high duplicate-elimination ratio. Deduplication reduces the space and bandwidth requirements of data storage services and is most effective when applied across multiple users, a common practice in cloud storage offerings. I study the privacy implications of cross-user deduplication. An interesting and challenging problem is therefore how to deduplicate multimedia data in a multi-user environment, and I propose an efficient system to overcome these problems. In this paper, I introduce a new primitive called DeposM2 which gives a partial positive answer to this challenging problem. I propose two phases, deduplication and proof of storage, where the first allows deduplication of data and the latter provides proof of storage, that is, it grants permission to the respective user, i.e. the owner of the file. Mr. Kaustubh Borate | Prof. Bharti Dhote, "Cloud Data De-Duplication in Multiuser Environment: DeposM2", International Journal of Trend in Scientific Research and Development (IJTSRD), ISSN: 2456-6470, Volume-3, Issue-5, August 2019. URL: https://www.ijtsrd.com/papers/ijtsrd25270.pdf Paper URL: https://www.ijtsrd.com/engineering/computer-engineering/25270/cloud-data-de-duplication-in-multiuser-environment-deposm2/mr-kaustubh-borate
Load Rebalancing with Security for Hadoop File System in Cloud (IJERD Editor)
A file system is used for the organization, storage, retrieval, naming, sharing, and protection of files [1]. A distributed file system offers certain degrees of transparency to the user and the system, such as access transparency, location transparency, failure transparency, heterogeneity, and replication transparency [2]. NFS (Network File System), RFS (Remote File Sharing), and the Andrew File System (AFS) are examples of distributed file systems [1][3]. Distributed file systems are generally used for cloud computing applications based on the MapReduce programming model [4]. A MapReduce program consists of a Map() procedure that performs filtering and a Reduce() procedure that performs a summary operation. In a cloud computing environment, however, failures occur and nodes may be upgraded, replaced, or added to the system, so a load-imbalance problem arises. To solve this problem, a load-rebalancing algorithm is implemented in this paper so that the central node is not overloaded. The implementation is done on the Hadoop Distributed File System. Because Apache Hadoop is used, security issues arise; to address them and increase security, the Kerberos authentication protocol is implemented to handle multiple nodes [20]. The paper reports a real-time implementation experiment on a cluster, with results.
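As a rough illustration of the rebalancing idea (not the paper's algorithm; the node names and tolerance below are made up), blocks can be migrated one at a time from nodes above the cluster average to nodes below it until every node sits within a tolerance band:

```python
def rebalance(load, tolerance=0.1):
    """Greedily move one unit of load at a time from the most-loaded to the
    least-loaded node until all nodes are within `tolerance` of the cluster
    average. Returns the list of (source, destination) moves."""
    avg = sum(load.values()) / len(load)
    moves = []
    while True:
        hot = max(load, key=load.get)
        cold = min(load, key=load.get)
        if load[hot] - avg <= tolerance * avg and avg - load[cold] <= tolerance * avg:
            break
        load[hot] -= 1
        load[cold] += 1
        moves.append((hot, cold))
    return moves

# Hypothetical block counts per datanode.
cluster = {"dn1": 120, "dn2": 40, "dn3": 80, "dn4": 20}
print(len(rebalance(cluster)))   # number of single-block migrations performed
print(cluster)                   # loads now close to the average (65)
```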
Methodology for Optimizing Storage on Cloud Using Authorized De-Duplication –... (IRJET Journal)
This document summarizes a research paper that proposes a methodology for optimizing storage on the cloud using authorized de-duplication. It discusses how de-duplication works to eliminate duplicate data and optimize storage. The key steps are chunking files into blocks, applying secure hash algorithms like SHA-512 to generate unique hashes for each block, and comparing hashes to reference duplicate blocks instead of storing multiple copies. It also discusses using cryptographic techniques like ciphertext-policy attribute-based encryption for authentication and security on public clouds. The proposed approach aims to optimize storage while providing authorized de-duplication functionality.
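A minimal sketch of the chunk-hash-and-reference step described above follows; the chunk size, class, and file names are assumptions made for illustration, not the paper's implementation.

```python
import hashlib

CHUNK_SIZE = 4096          # assumed block size for the illustration

class DedupStore:
    """Store each unique chunk once, keyed by its SHA-512 digest;
    files are kept only as lists of digests (references)."""
    def __init__(self):
        self.chunks = {}    # digest -> chunk bytes
        self.files = {}     # filename -> list of digests

    def put(self, name: str, data: bytes):
        refs = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            digest = hashlib.sha512(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)   # duplicate chunks stored once
            refs.append(digest)
        self.files[name] = refs

    def get(self, name: str) -> bytes:
        return b"".join(self.chunks[d] for d in self.files[name])

store = DedupStore()
payload = b"A" * 10000
store.put("copy1.bin", payload)
store.put("copy2.bin", payload)                  # second copy adds no new chunks
print(len(store.chunks), store.get("copy2.bin") == payload)   # 2 unique chunks, True
```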
This document discusses computer hardware concepts. It explains that organizations invest in computer hardware to improve productivity, increase revenue, reduce costs, provide better customer service, speed up time-to-market, and enable collaboration. It describes the central processing unit (CPU) and its components. It also discusses different types of memory, storage technologies like magnetic disks, tapes, DVDs, and Blu-ray discs. The document also covers access methods, RAID, enterprise storage options, parallel and grid computing, cloud computing, and differences between grid and cloud computing.
Using Fedora Commons To Create A Persistent Archive (Phil Cryer)
With the increasing amount of digital data and demand for open access to view and reuse such data continually increasing, the adoption of open source digital repository software is critical for long term storage and management of digital objects. By utilizing the open source Fedora Commons software, the Missouri Botanical Garden has created a stable, persistent archive for Tropicos digital objects, including specimen images, plant photos, and other digital media. Metadata, organized in standard Dublin Core extracted from Tropicos, are stored alongside the digital objects providing search and sharing of data via open standards such as REST and OAI, opening the door for mash-ups and alternative uses. The presentation will cover initial discovery, required hardware and software, and an overview of our experience implementing Fedora Commons. Lessons learned, pros and cons, and other options will also be covered.
Cloud computing is widely considered to be potentially the next dominant technology in the IT industry. It offers basic system maintenance and scalable resource management with virtual machines (VMs). As an essential technology of cloud computing, the VM has been a hot research topic in recent years. The high overhead of virtualization has been largely addressed by hardware advances in the CPU industry and by software improvements in the hypervisors themselves. However, the high demand on VM image storage remains a difficult problem. Existing systems have tried to reduce VM image storage consumption by means of deduplication inside a storage area network (SAN), but a SAN cannot satisfy the increasing demand of large-scale VM hosting for cloud computing because of its cost. In this project, we propose SILO, an improved deduplication file system designed specifically for large-scale VM deployment. Its design provides fast VM deployment through a similarity- and locality-based fingerprint index for data transfer, and low storage consumption by means of deduplication on VM images. A heartbeat protocol is implemented in the metadata server (MDS) to recover data from the data servers. SILO also provides a comprehensive set of storage features, including a backup server for VM images, on-demand fetching over the network, and caching on local disks using copy-on-read techniques. Experiments show that SILO performs well and introduces only minor performance overhead.
Keywords: Deduplication, Storage Area Network, Load Balancing, Hash Table, Disk Copies.
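The heartbeat-based failure detection mentioned above can be sketched roughly as follows; the intervals, thresholds, and node names are illustrative only, not SILO's actual protocol.

```python
import time

HEARTBEAT_INTERVAL = 3      # seconds between expected heartbeats (assumed)
DEAD_AFTER = 3              # missed intervals before a node is declared dead

class MetadataServer:
    """Tracks the last heartbeat from each data server and flags nodes
    whose heartbeats stop, so their data can be recovered elsewhere."""
    def __init__(self):
        self.last_seen = {}

    def heartbeat(self, node_id: str):
        self.last_seen[node_id] = time.time()

    def dead_nodes(self, now=None):
        now = now or time.time()
        limit = HEARTBEAT_INTERVAL * DEAD_AFTER
        return [n for n, t in self.last_seen.items() if now - t > limit]

mds = MetadataServer()
now = time.time()
# Simulate ds1 reporting recently while ds2 has gone silent for 20 seconds.
mds.last_seen = {"ds1": now - 2, "ds2": now - 20}
print(mds.dead_nodes(now))   # ['ds2']
```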
Chapter 4: Record Storage and Primary File Organization (Jafar Nesargi)
This document discusses file organization and storage in a database management system. It describes how records are stored on secondary storage devices like disks and tapes. Records can have fixed or variable lengths and are organized into files. The document explains how disks are organized into tracks and sectors and how buffers in memory are used to improve performance when accessing multiple disk blocks.
Users are accustomed to having local, immediate access to their email in PST files so they can reach email records whenever they need to, while IT and Legal see uncontrolled email records scattered across the organization with little retention or discovery capability.
Most Popular Hadoop Interview Questions and Answers (Sprintzeal)
The average salary of a Big Data Hadoop developer is close to 135 thousand dollars per annum, and in European countries as well as in the United Kingdom, a big data Hadoop certification can earn one more than £67,000 per annum. These figures reflect how strong the career path is. Little more than a decade ago, companies were already generating more than ten terabytes of data, paying database managers heavily, and still not satisfied with their services. For companies like Google, after a surge and lateral expansion, managing data became very cumbersome. Scientists and engineers pioneered a project that later became known as Hadoop. The idea was to work with different types of data, such as XML, text, binary, SQL, logs, and objects, by mapping them and then reducing them into a single structured result.
IAETSD: Controlling Data Deduplication in Cloud Storage (IAETSD)
This document discusses controlling data deduplication in cloud storage. It proposes an architecture that provides duplicate check procedures with minimal overhead compared to normal cloud storage operations. The key aspects of the proposed system are:
1) It uses convergent encryption to encrypt data for privacy while still allowing deduplication of duplicate files (a minimal sketch of this idea appears after the list).
2) It introduces a private cloud that manages user privileges and generates tokens for authorized duplicate checking in a hybrid cloud architecture.
3) It evaluates the overhead of the proposed authorized duplicate checking scheme and finds it incurs negligible overhead compared to normal cloud storage operations.
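The sketch below shows the convergent-encryption idea from point 1. It uses AES-GCM from the `cryptography` package purely for illustration, and the deterministic key/nonce derivation is an assumption of this sketch rather than the paper's exact construction: because key and nonce are both derived from the plaintext, identical files produce identical ciphertexts, so the cloud can still deduplicate them without learning the key.

```python
# pip install cryptography
import hashlib
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def convergent_encrypt(plaintext: bytes):
    """Derive key and nonce deterministically from the content itself, so
    identical plaintexts always yield identical ciphertexts."""
    key = hashlib.sha256(plaintext).digest()                   # 32-byte AES-256 key
    nonce = hashlib.sha256(b"nonce:" + plaintext).digest()[:12]
    return key, nonce, AESGCM(key).encrypt(nonce, plaintext, None)

k1, n1, c1 = convergent_encrypt(b"same file contents")
k2, n2, c2 = convergent_encrypt(b"same file contents")
print(c1 == c2)                          # True: the cloud can deduplicate the ciphertexts
print(AESGCM(k1).decrypt(n1, c1, None))  # only holders of the plaintext-derived key can decrypt
```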
The document discusses various digital preservation activities the author undertook as part of an assignment, including archiving, harvesting, mirroring files, extracting metadata, and verifying checksums. The author learned how to use tools like PeaZip, Xena, emulators, and metadata extraction software. They created disk images and analyzed them using bulk extractor to identify sensitive data. The author automated a workflow to generate checksums and write them to an Excel file. Overall, the assignment helped the author gain hands-on experience with digital preservation concepts and tools.
SiDe is a cost-efficient cloud storage mechanism that uses data deduplication and compression to minimize storage usage while maintaining reliability. It uses chunk-level deduplication to identify and store only unique chunks of files. For files stored short-term, only one replica is kept, while files stored long-term have one replica and one compressed copy. Simulation results show SiDe reduces storage by 81-84% compared to traditional 3-replica strategies, significantly lowering cloud storage costs.
NameNode and DataNode Coupling for a Power-proportional Hadoop Distributed F... (Hanh Le Hieu)
The document proposes a novel architecture called NDCouplingHDFS for efficient gear-shifting in power-proportional Hadoop Distributed File Systems (HDFS). Current HDFS-based methods have inefficient gear-shifting processes due to bottlenecks from centralized metadata access and high communication costs. The proposed approach distributes metadata management across nodes to eliminate bottlenecks and couples name nodes and data nodes to reduce communication during gear shifts by localizing metadata and enabling bulk data transfers. Experimental evaluation demonstrates the efficiency gains of the new architecture.
A Survey: Enhanced Block Level Message Locked Encryption for Data Deduplication (IRJET Journal)
This document summarizes various techniques for data deduplication. It discusses inline and post-process deduplication approaches and encryption-based deduplication methods like message locked encryption (MLE) and block-level message locked encryption (BL-MLE). The document also reviews literature on deduplication schemes using sampling techniques, flash memory indexing, and combining deduplication with Hadoop Distributed File System. Overall, the document provides an overview of different data deduplication methods and their advantages and disadvantages.
Research Inventy : International Journal of Engineering and Science is published by the group of young academic and industrial researchers with 12 Issues per year. It is an online as well as print version open access journal that provides rapid publication (monthly) of articles in all areas of the subject such as: civil, mechanical, chemical, electronic and computer engineering as well as production and information technology. The Journal welcomes the submission of manuscripts that meet the general criteria of significance and scientific excellence. Papers will be published by rapid process within 20 days after acceptance and peer review process takes only 7 days. All articles published in Research Inventy will be peer-reviewed.
This document describes a proposed tool called Warehouse Creator that can automatically generate data warehouses from heterogeneous data sources within an enterprise. The tool extracts data from various data sources like databases and files, integrates the data by generating dimension and fact tables, and provides a web interface for users to search and retrieve information from the warehouse without needing direct access to the underlying data sources. The tool aims to address issues like the need for users to have detailed knowledge of different data sources and query languages by providing a centralized warehouse that integrates data from multiple sources.
Data deDuplication - deDup
deDuplication - English
Deduplikace - Čeština
Deduplikation - Deutsch
Deduplicación Español
Déduplication - Français
Datadeduplisering - Norsk
Дедупликация - Русский
Дедублікація - Українська
The document provides information about a database course including:
1) An overview of the course content which covers database fundamentals, the relational model, normalization, conceptual modeling, query languages, and advanced SQL topics.
2) Details about the lecturer including their academic background and publications.
3) Assessment details for the course including exams, labs, and project work accounting for 100% of the grade.
Small Data: Bridging the Gap Between Generic and Specific Repositories (Anita de Waard)
The document discusses the challenges of data preservation in scientific research. There are currently many different databases and repositories for storing research data, but the majority of data is still stored locally on hard drives. The document advocates for increasing the amount of data preserved in repositories by addressing scientists' objections and creating tools that provide direct benefits like enabling new discoveries. Examples provided include developing mobile apps to digitize paper lab notebooks and dashboards to analyze shared data. Overall, the goal is to bridge scientific data preservation across the many different existing domain-specific and generic repositories.
The HiTiME project aims to develop a system that can recognize entities like people, organizations, locations, dates and professions in historical text documents. The system splits documents into words, recognizes entities using named entity recognition and stores the output in a database. It also aims to integrate with other systems at the International Institute of Social History to improve search, metadata and visualization of historical data. Some planned improvements include using additional natural language processing tools, disambiguating entities, recognizing composite entities, and integrating with applications like the Basic Word Sequence Analysis tool.
The document discusses the history and development of multimedia database management systems (MMDBMS). It traces the evolution of MMDBMS from early systems in the 1980s to recent advancements in security, indexing/retrieval, manageability, and performance. Personal computer growth, hardware compatibility, and internet usage drove increased demand for multimedia databases. Recent improvements include enhanced security features, improved indexing and retrieval of multimedia data through standards like MPEG, better manageability through synchronization and replication, and higher retrieval performance due to disk technology advances. The ongoing development of networking and hardware will continue shaping this evolving technology.
A New Efficient Cache Replacement Strategy for Named Data Networking (IJCNC Journal)
The Information-Centric Network (ICN) is a future internet architecture with efficient content retrieval and distribution. Named Data Networking (NDN) is one of the proposed architectures for ICN. NDN's in-network caching improves data availability, reduces retrieval delays and network load, alleviates producer load, and limits data traffic. Despite the existence of several caching decision algorithms, fetching and distributing content with minimum resource utilization remains a great challenge. In this paper, we introduce a new cache replacement strategy called Enhanced Time and Frequency Cache Replacement (ETFCR), in which both cache hit frequency and cache retrieval time are used to select evicted data chunks. ETFCR adds time cycles between the last two requests to adjust a data chunk's popularity and cache hits. We conducted extensive simulations using the ccnSim simulator to evaluate the performance of ETFCR and compare it to that of some well-known cache replacement strategies. Simulation results show that ETFCR outperforms the other cache replacement strategies in terms of cache hit ratio and content retrieval delay.
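A rough sketch of an eviction policy that weighs both hit frequency and retrieval time, in the spirit of the strategy summarized above, is shown below; the scoring formula is a simplification invented for illustration, not ETFCR's exact equations.

```python
class Cache:
    """Evict the chunk with the lowest (hit_count * retrieval_time) score:
    rarely used chunks that are cheap to re-fetch go first."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.store = {}          # name -> (hits, retrieval_time_ms)

    def access(self, name: str, retrieval_time_ms: float):
        if name in self.store:
            hits, _ = self.store[name]
            self.store[name] = (hits + 1, retrieval_time_ms)
        else:
            if len(self.store) >= self.capacity:
                victim = min(self.store, key=lambda n: self.store[n][0] * self.store[n][1])
                del self.store[victim]
            self.store[name] = (1, retrieval_time_ms)

cache = Cache(capacity=2)
cache.access("/video/a", 80)
cache.access("/video/a", 80)     # popular and expensive to re-fetch
cache.access("/doc/b", 5)        # unpopular and cheap to re-fetch
cache.access("/img/c", 40)       # forces an eviction: /doc/b is the victim
print(sorted(cache.store))       # ['/img/c', '/video/a']
```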
DocuWare is an enterprise content management system that allows users to store, organize, and access documents from a centralized digital file cabinet. It supports importing documents from various sources like scanners, email, Microsoft Office files, and print streams. Documents are automatically indexed and filed away in user-defined file cabinets. The system provides security, version control, and integrates with other business systems. It aims to streamline document management and improve efficiency.
The document provides an introduction to Hadoop and its distributed file system (HDFS) design and issues. It describes what Hadoop and big data are, and examples of large amounts of data generated every minute on the internet. It then discusses the types of big data and problems with traditional storage. The document outlines how Hadoop provides a solution through its HDFS and MapReduce components. It details the architecture and components of HDFS including the name node, data nodes, block replication, and rack awareness. Some advantages of Hadoop like scalability, flexibility and fault tolerance are also summarized along with some issues like small file handling and security problems.
This document discusses the problem of storing and processing many small files in HDFS and Hadoop. It introduces the concept of "harballing" where Hadoop uses an archiving technique called Hadoop Archive (HAR) to collect many small files into a single large file to reduce overhead on the namenode and improve performance. HAR packs small files into an archive file with a .har extension so the original files can still be accessed efficiently and in parallel. This reduction of small files through harballing increases scalability by reducing namespace usage and load on the namenode.
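For reference, packing a directory of small files into a HAR archive is normally done with the `hadoop archive` tool; the sketch below simply drives that command from Python. The paths are placeholders, and the exact flags should be checked against the Hadoop version in use.

```python
import subprocess

def create_har(archive_name: str, parent_dir: str, src: str, dest_dir: str):
    """Invoke the standard `hadoop archive` tool to pack `src` (relative to
    `parent_dir`) into a single .har archive under `dest_dir`."""
    cmd = [
        "hadoop", "archive",
        "-archiveName", archive_name,   # e.g. smallfiles.har
        "-p", parent_dir,               # parent path of the inputs
        src,                            # directory of small files to pack
        dest_dir,                       # where the .har archive is written
    ]
    subprocess.run(cmd, check=True)

# Hypothetical paths: pack /user/demo/logs into /user/demo/archives/logs.har,
# after which the original files remain readable via the har:// filesystem scheme.
create_har("logs.har", "/user/demo", "logs", "/user/demo/archives")
```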
Toy Hadoop is a distributed compute engine that improves upon Hadoop's ability to handle metadata-intensive workloads with many small files. It uses FusionFS as the underlying distributed file system due to FusionFS's efficient metadata management. Toy Hadoop addresses Hadoop's "small files problem" by grouping small files together and assigning them to task trackers in batches, rather than processing each file individually. This reduces communication overhead and improves performance even for small file workloads. Evaluation shows that Toy Hadoop has significantly better performance than Hadoop for files under 100MB, especially when using FusionFS for its metadata handling capabilities.
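The grouping idea described above can be approximated with a simple greedy packing pass; the 128 MB target below is just HDFS's common default block size, used here for illustration rather than Toy Hadoop's actual batching rule.

```python
TARGET_BATCH_BYTES = 128 * 1024 * 1024   # assumed split/block size

def group_small_files(files):
    """Greedily pack (name, size) pairs into batches whose total size stays
    near the target, so one task processes many small files at once."""
    batches, current, current_size = [], [], 0
    for name, size in sorted(files, key=lambda f: f[1], reverse=True):
        if current and current_size + size > TARGET_BATCH_BYTES:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches

small_files = [(f"log-{i:04d}.txt", 3 * 1024 * 1024) for i in range(100)]  # 100 x 3 MB files
batches = group_small_files(small_files)
print(len(batches), [len(b) for b in batches])   # 3 batches instead of 100 per-file tasks
```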
Tiny Datablock in Saving Hadoop Distributed File System Wasted Memory (IJECEIAES)
This document summarizes a research paper that proposes a new method called Tiny Datablock-HDFS (TD-HDFS) to reduce wasted memory in the Hadoop Distributed File System (HDFS) when storing small files. Currently, HDFS wastes a significant amount of memory by storing each small file in a full-sized datablock. The proposed method reduces the default datablock size and merges related small datablocks into a single "master datablock", exploiting more of the available memory and increasing storage capacity. The results were examined through a comparison of standard HDFS file hosting versus the proposed solution, finding that TD-HDFS wastes a minimal amount of memory while maintaining the same read/write performance.
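The kind of saving TD-HDFS targets can be illustrated with back-of-the-envelope arithmetic; the file counts and sizes below are illustrative, not the paper's measurements.

```python
import math

def block_entries(num_files, avg_file_mb, block_mb):
    """Compare block-table entries when every small file occupies its own
    block versus when small files are merged into shared blocks."""
    one_block_per_file = num_files                        # classic HDFS small-file case
    merged = math.ceil(num_files * avg_file_mb / block_mb)
    return one_block_per_file, merged

separate, merged = block_entries(num_files=1_000_000, avg_file_mb=2, block_mb=128)
print(separate, merged)   # 1000000 vs 15625 block entries
# Each file/block object costs on the order of 150 bytes of NameNode heap
# (a commonly cited ballpark), so metadata pressure drops by a similar factor.
```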
Introduction to Big Data and Hadoop using Local Standalone Mode (inventionjournals)
Big Data is a term for data sets so large and complex that traditional data processing applications are inadequate to deal with them. The term often refers simply to the use of predictive and other analytic methods that extract value from data. Big data is generally a collection of large datasets that cannot be processed using traditional computing techniques; it is not purely data, but rather a complete subject involving various tools, techniques, and frameworks. Hadoop is a distributed framework used to handle such large amounts of data, covering both its storage and its processing. Hadoop is an open-source software framework for distributed storage and processing of big data sets on computer clusters built from commodity hardware. HDFS was built to support high-throughput, streaming reads and writes of extremely large files. Hadoop MapReduce is a software framework for easily writing applications that process vast amounts of data. The word-count example reads text files and counts how often words occur: the input is text files and the result is a word-count file, each line of which contains a word and the count of how often it occurred, separated by a tab; a minimal sketch of such a job is shown below.
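The word-count job described above can be written as a Hadoop Streaming mapper/reducer pair; the sketch below puts both in one Python file. How the job is submitted (for example via the Hadoop Streaming jar) depends on the cluster setup, so this is only an illustrative script.

```python
#!/usr/bin/env python3
"""Word count as a Hadoop Streaming mapper/reducer pair.
Run as: wordcount.py map   or   wordcount.py reduce
(Streaming sorts the mapper output by key before the reducer sees it.)"""
import sys

def mapper():
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")                 # emit <word, 1> pairs

def reducer():
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word == current:
            count += int(value)
        else:
            if current is not None:
                print(f"{current}\t{count}")    # word and its total, tab-separated
            current, count = word, int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```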
This document outlines the course objectives and modules for a Big Data Analytics course. The course focuses on Hadoop Distributed File System concepts, Hadoop tools, business intelligence applications, and data mining techniques including decision trees, text mining, and social network analysis. Key topics covered include HDFS architecture, MapReduce programming, YARN, and ensuring high availability of HDFS through techniques like rack awareness and NameNode replication.
Hadoop is a framework for distributed storage and processing of large datasets across clusters of commodity hardware. It uses HDFS for fault-tolerant storage and MapReduce as a programming model for distributed computing. HDFS stores data across clusters of machines and replicates it for reliability. MapReduce allows processing of large datasets in parallel by splitting work into independent tasks. Hadoop provides reliable and scalable storage and analysis of very large amounts of data.
Hadoop is a framework for distributed storage and processing of large datasets across clusters of commodity hardware. It uses HDFS for fault-tolerant storage and MapReduce as a programming model for distributed computing. HDFS stores data across clusters of machines as blocks that are replicated for reliability. The namenode manages filesystem metadata while datanodes store and retrieve blocks. MapReduce allows processing of large datasets in parallel using a map function to distribute work and a reduce function to aggregate results. Hadoop provides reliable and scalable distributed computing on commodity hardware.
This document discusses big data and Hadoop. It defines big data as large datasets that are difficult to process using traditional methods due to their volume, variety, and velocity. Hadoop is presented as an open-source software framework for distributed storage and processing of large datasets across clusters of commodity servers. The key components of Hadoop are the Hadoop Distributed File System (HDFS) for storage and MapReduce as a programming model for distributed processing. A number of other technologies in Hadoop's ecosystem are also described such as HBase, Avro, Pig, Hive, Sqoop, Zookeeper and Mahout. The document concludes that Hadoop provides solutions for efficiently processing and analyzing big data.
This presentation describes the company where I did my summer training, and covers what big data is, why we use big data, big data challenges, the issues in big data and their solutions, Hadoop, Docker, Ansible, etc.
One of the challenges in storing and processing data is that the latest internet technologies have produced very large volumes of it. The techniques used to manage this massive amount of data and to pull value out of it are collectively called big data. Over recent years there has been rising interest in big data for social media analysis. Online social media have become an important platform across the world for sharing information; Facebook, one of the largest social media sites, receives millions of posts every day. One of the efficient technologies that deals with big data is Hadoop, which uses the MapReduce programming model to process large data volumes. This paper provides a survey of Hadoop and its role at Facebook, and a brief introduction to HIVE.
This document provides an overview of Apache Hadoop, an open source framework for distributed storage and processing of large datasets across clusters of computers. It discusses big data and the need for solutions like Hadoop, describes the key components of Hadoop including HDFS for storage and MapReduce for processing, and outlines some applications and pros and cons of the Hadoop framework.
This document discusses big data and Hadoop. It defines big data as large amounts of unstructured data that would be too costly to store and analyze in a traditional database. It then describes how Hadoop provides a solution to this challenge through distributed and parallel processing across clusters of commodity hardware. Key aspects of Hadoop covered include HDFS for reliable storage, MapReduce for distributed computing, and how together they allow scalable analysis of very large datasets. Popular users of Hadoop like Amazon, Yahoo and Facebook are also mentioned.
Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It provides reliable storage through HDFS and processes large amounts of data in parallel using MapReduce. The core components of Hadoop are HDFS for storage, MapReduce for processing, and YARN for resource management. Hadoop allows for scalable and cost-effective solutions to various big data problems like storage, processing speed, and scalability by distributing data and computation across clusters.
An overview of big data and Hadoop, the architecture Hadoop uses, and the way it works on data sets. The slides also show the various fields where they are mostly used and implemented.
Asserting that Big Data is vital to business is an understatement. Organizations have generated more and more data for years, but struggle to use it effectively. Clearly Big Data has more important uses than ensuring compliance with regulatory requirements. In addition, data is being generated with greater velocity, due to the advent of new pervasive devices (e.g., smartphones, tablets, etc.), social Web sites (e.g., Facebook, Twitter, LinkedIn, etc.) and other sources like GPS, Google Maps, heat/pressure sensors, etc.
Big data refers to large amounts of data from various sources that is analyzed to solve problems. It is characterized by volume, velocity, and variety. Hadoop is an open source framework used to store and process big data across clusters of computers. Key components of Hadoop include HDFS for storage, MapReduce for processing, and HIVE for querying. Other tools like Pig and HBase provide additional functionality. Together these tools provide a scalable infrastructure to handle the volume, speed, and complexity of big data.
This document provides an introduction to big data and Hadoop. It defines big data as massive amounts of structured and unstructured data that is too large for traditional databases to handle. Hadoop is an open-source framework for storing and processing big data across clusters of commodity hardware. Key components of Hadoop include HDFS for storage, MapReduce for parallel processing, and an ecosystem of tools like Hive, Pig, and Spark. The document outlines the architecture of Hadoop, including the roles of the master node, slave nodes, and clients. It also explains concepts like rack awareness, MapReduce jobs, and how files are stored in HDFS in blocks across nodes.
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw... (IJECEIAES)
Medical image analysis has witnessed significant advancements with deep learning techniques. In the domain of brain tumor segmentation, the ability to precisely delineate tumor boundaries from magnetic resonance imaging (MRI) scans holds profound implications for diagnosis. This study presents an ensemble convolutional neural network (CNN) with transfer learning, integrating the state-of-the-art Deeplabv3+ architecture with the ResNet18 backbone. The model is rigorously trained and evaluated, exhibiting remarkable performance metrics, including an impressive global accuracy of 99.286%, a high class accuracy of 82.191%, a mean intersection over union (IoU) of 79.900%, a weighted IoU of 98.620%, and a Boundary F1 (BF) score of 83.303%. Notably, a detailed comparative analysis with existing methods showcases the superiority of the proposed model. These findings underscore the model's competence in precise brain tumor localization, underscoring its potential to revolutionize medical image analysis and enhance healthcare outcomes. This research paves the way for future exploration and optimization of advanced CNN models in medical imaging, emphasizing addressing false positives and resource efficiency.
Embedded machine learning-based road conditions and driving behavior monitoring (IJECEIAES)
Car accident rates have increased in recent years, resulting in losses in human lives, properties, and other financial costs. An embedded machine learning-based system is developed to address this critical issue. The system can monitor road conditions, detect driving patterns, and identify aggressive driving behaviors. The system is based on neural networks trained on a comprehensive dataset of driving events, driving styles, and road conditions. The system effectively detects potential risks and helps mitigate the frequency and impact of accidents. The primary goal is to ensure the safety of drivers and vehicles. Collecting data involved gathering information on three key road events: normal street and normal drive, speed bumps, circular yellow speed bumps, and three aggressive driving actions: sudden start, sudden stop, and sudden entry. The gathered data is processed and analyzed using a machine learning system designed for limited power and memory devices. The developed system resulted in 91.9% accuracy, 93.6% precision, and 92% recall. The achieved inference time on an Arduino Nano 33 BLE Sense with a 32-bit CPU running at 64 MHz is 34 ms and requires 2.6 kB peak RAM and 139.9 kB program flash memory, making it suitable for resource-constrained embedded systems.
Advanced control scheme of doubly fed induction generator for wind turbine us... (IJECEIAES)
This paper describes a speed control device for generating electrical energy on an electricity network based on the doubly fed induction generator (DFIG) used for wind power conversion systems. At first, a double-fed induction generator model was constructed. A control law is formulated to govern the flow of energy between the stator of a DFIG and the energy network using three types of controllers: proportional integral (PI), sliding mode controller (SMC) and second order sliding mode controller (SOSMC). Their different results in terms of power reference tracking, reaction to unexpected speed fluctuations, sensitivity to perturbations, and resilience against machine parameter alterations are compared. MATLAB/Simulink was used to conduct the simulations for the preceding study. Multiple simulations have shown very satisfying results, and the investigations demonstrate the efficacy and power-enhancing capabilities of the suggested control system.
Neural network optimizer of proportional-integral-differential controller par...IJECEIAES
Wide application of proportional-integral-differential (PID)-regulator in industry requires constant improvement of methods of its parameters adjustment. The paper deals with the issues of optimization of PID-regulator parameters with the use of neural network technology methods. A methodology for choosing the architecture (structure) of neural network optimizer is proposed, which consists in determining the number of layers, the number of neurons in each layer, as well as the form and type of activation function. Algorithms of neural network training based on the application of the method of minimizing the mismatch between the regulated value and the target value are developed. The method of back propagation of gradients is proposed to select the optimal training rate of neurons of the neural network. The neural network optimizer, which is a superstructure of the linear PID controller, allows increasing the regulation accuracy from 0.23 to 0.09, thus reducing the power consumption from 65% to 53%. The results of the conducted experiments allow us to conclude that the created neural superstructure may well become a prototype of an automatic voltage regulator (AVR)-type industrial controller for tuning the parameters of the PID controller.
An improved modulation technique suitable for a three level flying capacitor ...IJECEIAES
This research paper introduces an innovative modulation technique for controlling a 3-level flying capacitor multilevel inverter (FCMLI), aiming to streamline the modulation process in contrast to conventional methods. The proposed
simplified modulation technique paves the way for more straightforward and
efficient control of multilevel inverters, enabling their widespread adoption and
integration into modern power electronic systems. Through the amalgamation of
sinusoidal pulse width modulation (SPWM) with a high-frequency square wave
pulse, this controlling technique attains energy equilibrium across the coupling
capacitor. The modulation scheme incorporates a simplified switching pattern
and a decreased count of voltage references, thereby simplifying the control
algorithm.
A review on features and methods of potential fishing zoneIJECEIAES
This review focuses on the importance of identifying potential fishing zones in seawater for sustainable fishing practices. It explores features like sea surface temperature (SST) and sea surface height (SSH), along with classification methods such as classifiers. The features like SST, SSH, and different classifiers used to classify the data, have been figured out in this review study. This study underscores the importance of examining potential fishing zones using advanced analytical techniques. It thoroughly explores the methodologies employed by researchers, covering both past and current approaches. The examination centers on data characteristics and the application of classification algorithms for classification of potential fishing zones. Furthermore, the prediction of potential fishing zones relies significantly on the effectiveness of classification algorithms. Previous research has assessed the performance of models like support vector machines, naïve Bayes, and artificial neural networks (ANN). In the previous result, the results of support vector machine (SVM) were 97.6% more accurate than naive Bayes's 94.2% to classify test data for fisheries classification. By considering the recent works in this area, several recommendations for future works are presented to further improve the performance of the potential fishing zone models, which is important to the fisheries community.
Electrical signal interference minimization using appropriate core material f...IJECEIAES
As demand for smaller, quicker, and more powerful devices rises, Moore's law is strictly followed. The industry has worked hard to make little devices that boost productivity. The goal is to optimize device density. Scientists are reducing connection delays to improve circuit performance. This helped them understand three-dimensional integrated circuit (3D IC) concepts, which stack active devices and create vertical connections to diminish latency and lower interconnects. Electrical involvement is a big worry with 3D integrates circuits. Researchers have developed and tested through silicon via (TSV) and substrates to decrease electrical wave involvement. This study illustrates a novel noise coupling reduction method using several electrical involvement models. A 22% drop in electrical involvement from wave-carrying to victim TSVs introduces this new paradigm and improves system performance even at higher THz frequencies.
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...IJECEIAES
Climate change's impact on the planet forced the United Nations and governments to promote green energies and electric transportation. The deployments of photovoltaic (PV) and electric vehicle (EV) systems gained stronger momentum due to their numerous advantages over fossil fuel types. The advantages go beyond sustainability to reach financial support and stability. The work in this paper introduces the hybrid system between PV and EV to support industrial and commercial plants. This paper covers the theoretical framework of the proposed hybrid system including the required equation to complete the cost analysis when PV and EV are present. In addition, the proposed design diagram which sets the priorities and requirements of the system is presented. The proposed approach allows setup to advance their power stability, especially during power outages. The presented information supports researchers and plant owners to complete the necessary analysis while promoting the deployment of clean energy. The result of a case study that represents a dairy milk farmer supports the theoretical works and highlights its advanced benefits to existing plants. The short return on investment of the proposed approach supports the paper's novelty approach for the sustainable electrical system. In addition, the proposed system allows for an isolated power setup without the need for a transmission line which enhances the safety of the electrical network
Bibliometric analysis highlighting the role of women in addressing climate ch...IJECEIAES
Fossil fuel consumption increased quickly, contributing to climate change
that is evident in unusual flooding and draughts, and global warming. Over
the past ten years, women's involvement in society has grown dramatically,
and they succeeded in playing a noticeable role in reducing climate change.
A bibliometric analysis of data from the last ten years has been carried out to
examine the role of women in addressing the climate change. The analysis's
findings discussed the relevant to the sustainable development goals (SDGs),
particularly SDG 7 and SDG 13. The results considered contributions made
by women in the various sectors while taking geographic dispersion into
account. The bibliometric analysis delves into topics including women's
leadership in environmental groups, their involvement in policymaking, their
contributions to sustainable development projects, and the influence of
gender diversity on attempts to mitigate climate change. This study's results
highlight how women have influenced policies and actions related to climate
change, point out areas of research deficiency and recommendations on how
to increase role of the women in addressing the climate change and
achieving sustainability. To achieve more successful results, this initiative
aims to highlight the significance of gender equality and encourage
inclusivity in climate change decision-making processes.
Voltage and frequency control of microgrid in presence of micro-turbine inter...IJECEIAES
The active and reactive load changes have a significant impact on voltage
and frequency. In this paper, in order to stabilize the microgrid (MG) against
load variations in islanding mode, the active and reactive power of all
distributed generators (DGs), including energy storage (battery), diesel
generator, and micro-turbine, are controlled. The micro-turbine generator is
connected to MG through a three-phase to three-phase matrix converter, and
the droop control method is applied for controlling the voltage and
frequency of MG. In addition, a method is introduced for voltage and
frequency control of micro-turbines in the transition state from gridconnected mode to islanding mode. A novel switching strategy of the matrix
converter is used for converting the high-frequency output voltage of the
micro-turbine to the grid-side frequency of the utility system. Moreover,
using the switching strategy, the low-order harmonics in the output current
and voltage are not produced, and consequently, the size of the output filter
would be reduced. In fact, the suggested control strategy is load-independent
and has no frequency conversion restrictions. The proposed approach for
voltage and frequency regulation demonstrates exceptional performance and
favorable response across various load alteration scenarios. The suggested
strategy is examined in several scenarios in the MG test systems, and the
simulation results are addressed.
Enhancing battery system identification: nonlinear autoregressive modeling fo...IJECEIAES
Precisely characterizing Li-ion batteries is essential for optimizing their
performance, enhancing safety, and prolonging their lifespan across various
applications, such as electric vehicles and renewable energy systems. This
article introduces an innovative nonlinear methodology for system
identification of a Li-ion battery, employing a nonlinear autoregressive with
exogenous inputs (NARX) model. The proposed approach integrates the
benefits of nonlinear modeling with the adaptability of the NARX structure,
facilitating a more comprehensive representation of the intricate
electrochemical processes within the battery. Experimental data collected
from a Li-ion battery operating under diverse scenarios are employed to
validate the effectiveness of the proposed methodology. The identified
NARX model exhibits superior accuracy in predicting the battery's behavior
compared to traditional linear models. This study underscores the
importance of accounting for nonlinearities in battery modeling, providing
insights into the intricate relationships between state-of-charge, voltage, and
current under dynamic conditions.
Smart grid deployment: from a bibliometric analysis to a surveyIJECEIAES
Smart grids are one of the last decades' innovations in electrical energy.
They bring relevant advantages compared to the traditional grid and
significant interest from the research community. Assessing the field's
evolution is essential to propose guidelines for facing new and future smart
grid challenges. In addition, knowing the main technologies involved in the
deployment of smart grids (SGs) is important to highlight possible
shortcomings that can be mitigated by developing new tools. This paper
contributes to the research trends mentioned above by focusing on two
objectives. First, a bibliometric analysis is presented to give an overview of
the current research level about smart grid deployment. Second, a survey of
the main technological approaches used for smart grid implementation and
their contributions are highlighted. To that effect, we searched the Web of
Science (WoS), and the Scopus databases. We obtained 5,663 documents
from WoS and 7,215 from Scopus on smart grid implementation or
deployment. With the extraction limitation in the Scopus database, 5,872 of
the 7,215 documents were extracted using a multi-step process. These two
datasets have been analyzed using a bibliometric tool called bibliometrix.
The main outputs are presented with some recommendations for future
research.
Use of analytical hierarchy process for selecting and prioritizing islanding ...IJECEIAES
One of the problems that are associated to power systems is islanding
condition, which must be rapidly and properly detected to prevent any
negative consequences on the system's protection, stability, and security.
This paper offers a thorough overview of several islanding detection
strategies, which are divided into two categories: classic approaches,
including local and remote approaches, and modern techniques, including
techniques based on signal processing and computational intelligence.
Additionally, each approach is compared and assessed based on several
factors, including implementation costs, non-detected zones, declining
power quality, and response times using the analytical hierarchy process
(AHP). The multi-criteria decision-making analysis shows that the overall
weight of passive methods (24.7%), active methods (7.8%), hybrid methods
(5.6%), remote methods (14.5%), signal processing-based methods (26.6%),
and computational intelligent-based methods (20.8%) based on the
comparison of all criteria together. Thus, it can be seen from the total weight
that hybrid approaches are the least suitable to be chosen, while signal
processing-based methods are the most appropriate islanding detection
method to be selected and implemented in power system with respect to the
aforementioned factors. Using Expert Choice software, the proposed
hierarchy model is studied and examined.
Enhancing of single-stage grid-connected photovoltaic system using fuzzy logi...IJECEIAES
The power generated by photovoltaic (PV) systems is influenced by
environmental factors. This variability hampers the control and utilization of
solar cells' peak output. In this study, a single-stage grid-connected PV
system is designed to enhance power quality. Our approach employs fuzzy
logic in the direct power control (DPC) of a three-phase voltage source
inverter (VSI), enabling seamless integration of the PV connected to the
grid. Additionally, a fuzzy logic-based maximum power point tracking
(MPPT) controller is adopted, which outperforms traditional methods like
incremental conductance (INC) in enhancing solar cell efficiency and
minimizing the response time. Moreover, the inverter's real-time active and
reactive power is directly managed to achieve a unity power factor (UPF).
The system's performance is assessed through MATLAB/Simulink
implementation, showing marked improvement over conventional methods,
particularly in steady-state and varying weather conditions. For solar
irradiances of 500 and 1,000 W/m2
, the results show that the proposed
method reduces the total harmonic distortion (THD) of the injected current
to the grid by approximately 46% and 38% compared to conventional
methods, respectively. Furthermore, we compare the simulation results with
IEEE standards to evaluate the system's grid compatibility.
Enhancing photovoltaic system maximum power point tracking with fuzzy logic-b...IJECEIAES
Photovoltaic systems have emerged as a promising energy resource that
caters to the future needs of society, owing to their renewable, inexhaustible,
and cost-free nature. The power output of these systems relies on solar cell
radiation and temperature. In order to mitigate the dependence on
atmospheric conditions and enhance power tracking, a conventional
approach has been improved by integrating various methods. To optimize
the generation of electricity from solar systems, the maximum power point
tracking (MPPT) technique is employed. To overcome limitations such as
steady-state voltage oscillations and improve transient response, two
traditional MPPT methods, namely fuzzy logic controller (FLC) and perturb
and observe (P&O), have been modified. This research paper aims to
simulate and validate the step size of the proposed modified P&O and FLC
techniques within the MPPT algorithm using MATLAB/Simulink for
efficient power tracking in photovoltaic systems.
Adaptive synchronous sliding control for a robot manipulator based on neural ...IJECEIAES
Robot manipulators have become important equipment in production lines, medical fields, and transportation. Improving the quality of trajectory tracking for
robot hands is always an attractive topic in the research community. This is a
challenging problem because robot manipulators are complex nonlinear systems
and are often subject to fluctuations in loads and external disturbances. This
article proposes an adaptive synchronous sliding control scheme to improve trajectory tracking performance for a robot manipulator. The proposed controller
ensures that the positions of the joints track the desired trajectory, synchronize
the errors, and significantly reduces chattering. First, the synchronous tracking
errors and synchronous sliding surfaces are presented. Second, the synchronous
tracking error dynamics are determined. Third, a robust adaptive control law is
designed,the unknown components of the model are estimated online by the neural network, and the parameters of the switching elements are selected by fuzzy
logic. The built algorithm ensures that the tracking and approximation errors
are ultimately uniformly bounded (UUB). Finally, the effectiveness of the constructed algorithm is demonstrated through simulation and experimental results.
Simulation and experimental results show that the proposed controller is effective with small synchronous tracking errors, and the chattering phenomenon is
significantly reduced.
Remote field-programmable gate array laboratory for signal acquisition and de...IJECEIAES
A remote laboratory utilizing field-programmable gate array (FPGA) technologies enhances students’ learning experience anywhere and anytime in embedded system design. Existing remote laboratories prioritize hardware access and visual feedback for observing board behavior after programming, neglecting comprehensive debugging tools to resolve errors that require internal signal acquisition. This paper proposes a novel remote embeddedsystem design approach targeting FPGA technologies that are fully interactive via a web-based platform. Our solution provides FPGA board access and debugging capabilities beyond the visual feedback provided by existing remote laboratories. We implemented a lab module that allows users to seamlessly incorporate into their FPGA design. The module minimizes hardware resource utilization while enabling the acquisition of a large number of data samples from the signal during the experiments by adaptively compressing the signal prior to data transmission. The results demonstrate an average compression ratio of 2.90 across three benchmark signals, indicating efficient signal acquisition and effective debugging and analysis. This method allows users to acquire more data samples than conventional methods. The proposed lab allows students to remotely test and debug their designs, bridging the gap between theory and practice in embedded system design.
Detecting and resolving feature envy through automated machine learning and m...IJECEIAES
Efficiently identifying and resolving code smells enhances software project quality. This paper presents a novel solution, utilizing automated machine learning (AutoML) techniques, to detect code smells and apply move method refactoring. By evaluating code metrics before and after refactoring, we assessed its impact on coupling, complexity, and cohesion. Key contributions of this research include a unique dataset for code smell classification and the development of models using AutoGluon for optimal performance. Furthermore, the study identifies the top 20 influential features in classifying feature envy, a well-known code smell, stemming from excessive reliance on external classes. We also explored how move method refactoring addresses feature envy, revealing reduced coupling and complexity, and improved cohesion, ultimately enhancing code quality. In summary, this research offers an empirical, data-driven approach, integrating AutoML and move method refactoring to optimize software project quality. Insights gained shed light on the benefits of refactoring on code quality and the significance of specific features in detecting feature envy. Future research can expand to explore additional refactoring techniques and a broader range of code metrics, advancing software engineering practices and standards.
Smart monitoring technique for solar cell systems using internet of things ba...IJECEIAES
Rapidly and remotely monitoring and receiving the solar cell systems status parameters, solar irradiance, temperature, and humidity, are critical issues in enhancement their efficiency. Hence, in the present article an improved smart prototype of internet of things (IoT) technique based on embedded system through NodeMCU ESP8266 (ESP-12E) was carried out experimentally. Three different regions at Egypt; Luxor, Cairo, and El-Beheira cities were chosen to study their solar irradiance profile, temperature, and humidity by the proposed IoT system. The monitoring data of solar irradiance, temperature, and humidity were live visualized directly by Ubidots through hypertext transfer protocol (HTTP) protocol. The measured solar power radiation in Luxor, Cairo, and El-Beheira ranged between 216-1000, 245-958, and 187-692 W/m 2 respectively during the solar day. The accuracy and rapidity of obtaining monitoring results using the proposed IoT system made it a strong candidate for application in monitoring solar cell systems. On the other hand, the obtained solar power radiation results of the three considered regions strongly candidate Luxor and Cairo as suitable places to build up a solar cells system station rather than El-Beheira.
An efficient security framework for intrusion detection and prevention in int...IJECEIAES
Over the past few years, the internet of things (IoT) has advanced to connect billions of smart devices to improve quality of life. However, anomalies or malicious intrusions pose several security loopholes, leading to performance degradation and threat to data security in IoT operations. Thereby, IoT security systems must keep an eye on and restrict unwanted events from occurring in the IoT network. Recently, various technical solutions based on machine learning (ML) models have been derived towards identifying and restricting unwanted events in IoT. However, most ML-based approaches are prone to miss-classification due to inappropriate feature selection. Additionally, most ML approaches applied to intrusion detection and prevention consider supervised learning, which requires a large amount of labeled data to be trained. Consequently, such complex datasets are impossible to source in a large network like IoT. To address this problem, this proposed study introduces an efficient learning mechanism to strengthen the IoT security aspects. The proposed algorithm incorporates supervised and unsupervised approaches to improve the learning models for intrusion detection and mitigation. Compared with the related works, the experimental outcome shows that the model performs well in a benchmark dataset. It accomplishes an improved detection accuracy of approximately 99.21%.
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024Sinan KOZAK
Sinan from the Delivery Hero mobile infrastructure engineering team shares a deep dive into performance acceleration with Gradle build cache optimizations. Sinan shares their journey into solving complex build-cache problems that affect Gradle builds. By understanding the challenges and solutions found in our journey, we aim to demonstrate the possibilities for faster builds. The case study reveals how overlapping outputs and cache misconfigurations led to significant increases in build times, especially as the project scaled up with numerous modules using Paparazzi tests. The journey from diagnosing to defeating cache issues offers invaluable lessons on maintaining cache integrity without sacrificing functionality.
artificial intelligence and data science contents.pptxGauravCar
What is artificial intelligence? Artificial intelligence is the ability of a computer or computer-controlled robot to perform tasks that are commonly associated with the intellectual processes characteristic of humans, such as the ability to reason.
› ...
Artificial intelligence (AI) | Definitio
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECTjpsjournal1
The rivalry between prominent international actors for dominance over Central Asia's hydrocarbon
reserves and the ancient silk trade route, along with China's diplomatic endeavours in the area, has been
referred to as the "New Great Game." This research centres on the power struggle, considering
geopolitical, geostrategic, and geoeconomic variables. Topics including trade, political hegemony, oil
politics, and conventional and nontraditional security are all explored and explained by the researcher.
Using Mackinder's Heartland, Spykman Rimland, and Hegemonic Stability theories, examines China's role
in Central Asia. This study adheres to the empirical epistemological method and has taken care of
objectivity. This study analyze primary and secondary research documents critically to elaborate role of
china’s geo economic outreach in central Asian countries and its future prospect. China is thriving in trade,
pipeline politics, and winning states, according to this study, thanks to important instruments like the
Shanghai Cooperation Organisation and the Belt and Road Economic Initiative. According to this study,
China is seeing significant success in commerce, pipeline politics, and gaining influence on other
governments. This success may be attributed to the effective utilisation of key tools such as the Shanghai
Cooperation Organisation and the Belt and Road Economic Initiative.
Discover the latest insights on Data Driven Maintenance with our comprehensive webinar presentation. Learn about traditional maintenance challenges, the right approach to utilizing data, and the benefits of adopting a Data Driven Maintenance strategy. Explore real-world examples, industry best practices, and innovative solutions like FMECA and the D3M model. This presentation, led by expert Jules Oudmans, is essential for asset owners looking to optimize their maintenance processes and leverage digital technologies for improved efficiency and performance. Download now to stay ahead in the evolving maintenance landscape.
Use PyCharm for remote debugging of WSL on a Windo cf5c162d672e4e58b4dde5d797...shadow0702a
This document serves as a comprehensive step-by-step guide on how to effectively use PyCharm for remote debugging of the Windows Subsystem for Linux (WSL) on a local Windows machine. It meticulously outlines several critical steps in the process, starting with the crucial task of enabling permissions, followed by the installation and configuration of WSL.
The guide then proceeds to explain how to set up the SSH service within the WSL environment, an integral part of the process. Alongside this, it also provides detailed instructions on how to modify the inbound rules of the Windows firewall to facilitate the process, ensuring that there are no connectivity issues that could potentially hinder the debugging process.
The document further emphasizes on the importance of checking the connection between the Windows and WSL environments, providing instructions on how to ensure that the connection is optimal and ready for remote debugging.
It also offers an in-depth guide on how to configure the WSL interpreter and files within the PyCharm environment. This is essential for ensuring that the debugging process is set up correctly and that the program can be run effectively within the WSL terminal.
Additionally, the document provides guidance on how to set up breakpoints for debugging, a fundamental aspect of the debugging process which allows the developer to stop the execution of their code at certain points and inspect their program at those stages.
Finally, the document concludes by providing a link to a reference blog. This blog offers additional information and guidance on configuring the remote Python interpreter in PyCharm, providing the reader with a well-rounded understanding of the process.
Batteries -Introduction – Types of Batteries – discharging and charging of battery - characteristics of battery –battery rating- various tests on battery- – Primary battery: silver button cell- Secondary battery :Ni-Cd battery-modern battery: lithium ion battery-maintenance of batteries-choices of batteries for electric vehicle applications.
Fuel Cells: Introduction- importance and classification of fuel cells - description, principle, components, applications of fuel cells: H2-O2 fuel cell, alkaline fuel cell, molten carbonate fuel cell and direct methanol fuel cells.
Null Bangalore | Pentesters Approach to AWS IAMDivyanshu
#Abstract:
- Learn more about the real-world methods for auditing AWS IAM (Identity and Access Management) as a pentester. So let us proceed with a brief discussion of IAM as well as some typical misconfigurations and their potential exploits in order to reinforce the understanding of IAM security best practices.
- Gain actionable insights into AWS IAM policies and roles, using hands on approach.
#Prerequisites:
- Basic understanding of AWS services and architecture
- Familiarity with cloud security concepts
- Experience using the AWS Management Console or AWS CLI.
- For hands on lab create account on [killercoda.com](https://killercoda.com/cloudsecurity-scenario/)
# Scenario Covered:
- Basics of IAM in AWS
- Implementing IAM Policies with Least Privilege to Manage S3 Bucket
- Objective: Create an S3 bucket with least privilege IAM policy and validate access.
- Steps:
- Create S3 bucket.
- Attach least privilege policy to IAM user.
- Validate access.
- Exploiting IAM PassRole Misconfiguration
-Allows a user to pass a specific IAM role to an AWS service (ec2), typically used for service access delegation. Then exploit PassRole Misconfiguration granting unauthorized access to sensitive resources.
- Objective: Demonstrate how a PassRole misconfiguration can grant unauthorized access.
- Steps:
- Allow user to pass IAM role to EC2.
- Exploit misconfiguration for unauthorized access.
- Access sensitive resources.
- Exploiting IAM AssumeRole Misconfiguration with Overly Permissive Role
- An overly permissive IAM role configuration can lead to privilege escalation by creating a role with administrative privileges and allow a user to assume this role.
- Objective: Show how overly permissive IAM roles can lead to privilege escalation.
- Steps:
- Create role with administrative privileges.
- Allow user to assume the role.
- Perform administrative actions.
- Differentiation between PassRole vs AssumeRole
Try at [killercoda.com](https://killercoda.com/cloudsecurity-scenario/)
2. HADOOP ARCHITECTURE
Hadoop's architecture follows a master-slave pattern for data processing and storage. The master node, called the "Namenode" [1], knows everything happening across the cluster, such as data transactions, block identifiers, data replication, and the health and availability of the other Datanodes, among many other technical jobs [12]; in other words, the Namenode holds all of the cluster's metadata. The remaining nodes, called "Datanodes", store the actual data in smaller units called data blocks, each 64 MB in size [13]. The storage mechanism lets incoming data occupy up to 80% of a block, while the rest of the block is reserved for metadata purposes. Hadoop also has a strong availability technique: every file in the cluster must have three replicas distributed across other Datanodes so the file remains accessible even if one Datanode goes down [4, 14, 15].
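To make the configuration side of this concrete, the sketch below is a minimal probe, assuming only that the hadoop-common client library is on the classpath; dfs.blocksize and dfs.replication are the standard HDFS property keys, and the 64 MB and three-replica fallback values simply mirror the figures discussed in this section rather than any particular cluster.

import org.apache.hadoop.conf.Configuration;

// Minimal probe of the two HDFS settings discussed above. On a real
// deployment the effective values come from hdfs-site.xml on the classpath;
// the fallbacks below only echo the 64 MB / 3-replica figures in the text.
public class HdfsDefaultsProbe {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        long blockSizeBytes = conf.getLong("dfs.blocksize", 64L * 1024 * 1024);
        int replication = conf.getInt("dfs.replication", 3);
        System.out.println("Data block size (bytes): " + blockSizeBytes);
        System.out.println("Replicas per file block: " + replication);
    }
}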
3. BIG DATA IN SMALL FILES DEFINITION
Big data can be more than 1 TB of data, 500 GB of multimedia files such as videos, or 10 TB of log files; however, big data is not only a collection of large files, it can also be a group of small files that are related and meaningful together. Generally, files of around 5 MB or less are called small files [16], although others define a small file as any file smaller than the data block size (64 MB) [17]. Small files can be logs, images, documents, voice records, etc.
4. SMALL FILE ISSUE
Because Hadoop is open source [18], it has been adopted by several companies under different names, such as the HortonWorks distribution by Yahoo, Cloudera backed by Facebook, Google, Oracle and Yahoo, Facebook Corona by Facebook, and others [19, 20]. This shows that Hadoop is a successful tool for treating the big data problem; nevertheless, over the years a problem has floated to the surface called "big data in small files". The Namenode, the master node of Hadoop, stores all the metadata about the Datanodes and about the data held in them, while the Datanodes store the actual data, and due to Hadoop's architecture every single file must be replicated into three copies across the cluster for availability purposes. On a Datanode, storage is divided into 64 MB data blocks; a file smaller than the block size is considered a small file [17], and a data block holds only one file whether that file is 1 MB or 64 MB. Every data block is known to the Namenode through a metadata entry of about 120 B. Thus, if one Datanode holds 1,000,000 data blocks, the metadata sent to the Namenode amounts to 120 B × 1,000,000 = 120 MB per Datanode, and if the cluster contains 10,000 such nodes, this leads to 120 MB × 10,000 = 1.2 TB of metadata. Such a volume of metadata cannot be handled by the Namenode and can lead to a dramatic shutdown; if the Namenode goes down, the secondary Namenode takes the lead for a while and then shuts down as well [19], and finally the whole Hadoop cluster becomes useless [21, 22].
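As a back-of-the-envelope check of the arithmetic above, the following minimal sketch reuses the paper's own figures (120 B per block metadata entry, 1,000,000 blocks per Datanode, 10,000 Datanodes); the class and variable names are illustrative only.

// Reproduces the metadata-load estimate from Section 4.
public class NamenodeMetadataEstimate {
    public static void main(String[] args) {
        long bytesPerBlockEntry = 120L;       // metadata per data block (from the paper)
        long blocksPerDatanode = 1_000_000L;  // assumed blocks held by one Datanode
        long datanodes = 10_000L;             // assumed cluster size

        long perNodeBytes = bytesPerBlockEntry * blocksPerDatanode; // = 120 MB per Datanode
        long clusterBytes = perNodeBytes * datanodes;               // = 1.2 TB for the cluster

        System.out.printf("Metadata per Datanode: %.0f MB%n", perNodeBytes / 1e6);
        System.out.printf("Metadata for the whole cluster: %.1f TB%n", clusterBytes / 1e12);
    }
}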
5. SOLUTIONS
Many research papers have proposed ways out of this problem; some of them are already in use, while the rest remain only documented:
- HAR file: Hadoop Archive was the first notable way out of the big data in small files issue. A HAR file packs the small files into HDFS blocks that the Namenode can deal with directly. A HAR file is created by running the hadoop archive command, which starts a MapReduce job that packs the huge number of small files into a small number of HAR files; dealing with tens of HAR files is far more efficient and smoother than dealing with thousands of small files [1], and besides this, a HAR file does not alter the HDFS architecture. HAR is an excellent way to avoid a Namenode metadata jam: it offers the option to access the files directly inside the archive, and the creation steps are simple commands (an illustrative sketch of reading a file through the har:// file system is given at the end of this section). On the other hand, its shortcoming is that a HAR file cannot be altered after being created: no files can be added to it and no unwanted files can be deleted from it. The biggest shortcoming of HAR is that every file inside it requires two index files (master index and index) to read [23], which means that reading a file from HDFS itself is easier than reading it from a HAR file. Another limitation of HAR is storage: HAR files put extra pressure on the file system because they generate a copy of the original files and take up as much space as those copies need [23].
- nHAR file: new Hadoop Archive is a correction of HAR. It follows almost the same approach but with some differences in architecture: first, an nHAR file needs only one index file for reads; second, an nHAR file can be edited, so more files can be added to the archive after it has been created [17]. Like HAR, nHAR is used to read files without altering the HDFS architecture, but in contrast to HAR, nHAR uses
only one index to access the files, which means better reading performance and efficiency; the reading mechanism is based on building a hash table instead of the master-index approach [24].
Figure 1 shows the nHAR working mechanism: every small file 1…n is hashed and given an index number right before being saved inside a large file called the "Part File". The hash table in nHAR is used instead of the index file adopted in HAR; to generate a hash code, the file name and the index file number are used, then the index files are merged and moved into the part file, and once a part file is full, a new one is generated and filling continues [25]. As in HAR, the actual file is stored in the part file, so nHAR shares the same problem that the actual file adds more data load to the overall HDFS storage; even so, nHAR is easier and simpler than HAR when adding a new file to an existing archive. Both the HAR and nHAR techniques avoid altering the HDFS architecture; the rest of this paper presents further small file reduction techniques that do upgrade the HDFS architecture.
Figure 1. New archive technique [17, 24]
- Improved HDFS: This technique consists of two parts: the first is a client component that packs the small files into one big file, and the second is a Datanode component that handles resource management. Improved HDFS is index based and collects all small files of the same HDFS directory into one file, while a cache manager resides in the Datanode. The packing and storing method works on the small files' offset and length: every small file is stored in one big file, one by one, then the total size of the small files in that big file is determined and compared with the 64 MB data block size; if the large file's total length exceeds the block size, multiple blocks are used to store the file separately [24].
Figure 2 illustrates the general file storing mechanism in HDFS: only one file can be allocated to each data block, whether or not the file fills the block, and every data block is connected to the Namenode through metadata about its ID, Datanode location, and other information. As shown in Figure 2, once a file is placed in a data block, the block is locked against inserting more files.
Figure 2. How files are located in data blocks [26]
- Sequence file: This is another <key, value> technique for big data in small files. Compared with HAR, a sequence file is faster in file access and file creation because accessing a file inside a sequence file requires only one index. A sequence file can be updated after being created, and more small files can be added depending on the block size. To create a sequence file, running a MapReduce job is a must: first, all the targeted small files must be migrated to the respective Hadoop blocks, then a MapReduce job is run to create the sequence
file. The MapReduce job is required because the sequence file architecture is based on <key, value> pairs, where the file name is the key and the file data is the value [27]. The MapReduce job maps all the small files until the default block size is reached and then passes them to the reducer, which merges the files [28] (an illustrative packing sketch is given at the end of this section).
- Map File: Actually it’s a two files, data file and index file, the data file consists all pairs <key,value>,
and the key information are stored in index file, however, both of these two files are sequence files
(Ahad and Biswas 2018), and acting like sequence file but map file doesn’t looking for a entire file when
search by key [23].
- Extended HDFS: EHDFS is another successful technique based on altering HDFS itself. All files are stored in a file called a combine file in order to reduce file access time and the load on the Namenode. EHDFS rests on four advanced indexing and index-prefetching techniques: file merging, file mapping, file prefetching, and file extraction; file merging, file mapping, and file prefetching are carried out by the Namenode, while file extraction is carried out by the Datanode [24]. As mentioned before, all the small files are stored in the combine file, and in the first stage (file merging) the Namenode maintains metadata only for this combine file. Because each data block can hold actual data up to 80% of the block size, leaving the remaining 20% for block metadata, a hash table is placed at the beginning of each block; this table holds an (offset, length) entry for each small file in the block, so the relation between block merging and the combine file is that a merged block is part of the combine file. File mapping is the next stage of EHDFS: a request containing the small file name and the combined file name is sent to the Namenode to obtain the location of the desired small file; in other words, it is the process of matching the small file names to the blocks of the combined file. File prefetching simply reduces metadata usage on the Namenode. File extraction (carried out by the Datanode) is used to extract the desired small file from the data block [24, 29].
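To make the access paths discussed in this section more concrete, three minimal Java sketches follow. They are illustrative only: all paths, archive names, and file names are hypothetical, and they use the stock Hadoop FileSystem, SequenceFile, and MapFile APIs rather than the modified architectures (nHAR, improved HDFS, EHDFS) described above.

Reading a small file packed in a HAR archive goes through the har:// file system, so apart from the path the client code is the same as for plain HDFS (the archive /user/hadoop/logs.har and the member file inside it are assumptions for illustration):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HarReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical archive created earlier with the "hadoop archive" command.
        Path inHar = new Path("har:///user/hadoop/logs.har/2019/node-17.log");
        FileSystem harFs = inHar.getFileSystem(conf);
        try (FSDataInputStream in = harFs.open(inHar);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            System.out.println(reader.readLine()); // first line of the packed small file
        }
    }
}

Packing small files into a sequence file with the file name as the key and the raw bytes as the value can be sketched as below; at scale this is done with a MapReduce job as described above, and the single-process version here (with a hypothetical input directory) only shows the <key, value> layout:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path inputDir = new Path("/data/small-files"); // hypothetical directory of small files
        Path packed = new Path("/data/packed.seq");    // resulting sequence file

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(packed),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (FileStatus status : fs.listStatus(inputDir)) {
                if (!status.isFile()) {
                    continue; // skip sub-directories
                }
                byte[] content = new byte[(int) status.getLen()];
                try (FSDataInputStream in = fs.open(status.getPath())) {
                    IOUtils.readFully(in, content, 0, content.length);
                }
                // key = original file name, value = file bytes
                writer.append(new Text(status.getPath().getName()), new BytesWritable(content));
            }
        }
    }
}

For the map file variant, a lookup by key consults the index file instead of scanning the whole data file; a minimal read sketch, assuming a map file was already written at the hypothetical path /data/packed.map with Text keys and BytesWritable values:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class MapFileLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (MapFile.Reader reader = new MapFile.Reader(new Path("/data/packed.map"), conf)) {
            BytesWritable value = new BytesWritable();
            // get() returns null if the key is absent; otherwise it fills "value".
            Writable hit = reader.get(new Text("report-0042.log"), value);
            System.out.println(hit == null ? "not found" : value.getLength() + " bytes");
        }
    }
}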
6. CONCLUSION
There are many solutions for big data in small files; a few of them use HDFS as is, while the others require altering HDFS behavior. Using HDFS as is would be the recommended choice for a pure Hadoop deployment, because although adding more customization can clear the small file issue, it can also add more complexity to Hadoop's operation.
ACKNOWLEDGMENT
The authors thank the Ministry of Education for funding this study through grant FRGS/1/2017/ICT02/FTMK-CACT/F00345. Gratitude is also due to Universiti Teknikal Malaysia Melaka and the Faculty of Information Technology and Communication for providing excellent research facilities.
REFERENCES
[1] V. G. Korat and K. S. Pamu, “Reduction of Data at Namenode in HDFS using harballing Technique,” vol. 1, no. 4,
pp. 635–642, 2012.
[2] Y. Mo, “A Data Security Storage Method for IoT Under Hadoop Cloud Computing Platform,” Int. J. Wirel. Inf.
Networks, vol. 26, no. 3, pp. 152–157, 2019.
[3] G. Ramu, M. Soumya, A. Jayanthi, J. Somasekar, and K. K. Baseer, “Protecting big data mining association rules
using fuzzy system,” vol. 17, no. 6, pp. 3057–3065, 2019.
[4] M. Arslan, Z. Riaz, and S. Munawar, “Building Information Modeling (BIM) Enabled Facilities Management
Using Hadoop Architecture,” 2017 Portland International Conference on Management of Engineering and
Technology (PICMET), Portland, OR, pp. 1-7. 2017.
[5] T. Renner, J. Müller, L. Thamsen, and O. Kao, “Addressing Hadoop’s Small File Problem With an Appendable
Archive File Format,” Proceedings of the Computing Frontiers Conference, pp. 367–372, 2017.
[6] E. G. Ularu, F. C. Puican, A. Apostu, and M. Velicanu, "Perspectives on Big Data and Big Data Analytics," International Journal of Production Economics, Elsevier, vol. 191(C), pp. 97–112, 2012.
[7] I. Ahmed, “A brief review: security issues in cloud computing and their solutions,” TELKOMNIKA
(Telecommunication Comput. Electron. Control.), vol. 17, no. 6, pp. 2812–2817, 2019.
[8] P. Sharma, S. L. Shimi, and S. Chatterji, "A Review on Obstacle Detection and Vision," International Journal of Engineering Sciences & Research Technology, vol. 4, no. 1, pp. 2–5, 2015.
[9] B. P. Singh and A. Agrawal, "Big Data: An Introduction," International Journal of Advanced Research in Computer Science (www.ijarcs.info), vol. 8, no. 3, pp. 21–23, 2017.
[10] G. Suciu, M. Anwar, I. Rogojanu, A. Pasat, and A. Stanoiu, “Big data technology for scientific applications,” 11th
ROLCG Conf. Grid, Cloud High-Performance Comput. Sci. - Proc., pp. 1–4, 2018.
[11] W. Fan, Z. Han, and R. Wang, “An Evaluation Model and Benchmark for Parallel Computing Frameworks,” 2018.
[12] D. C. U. P. S. D. Y. Singh, “Hadoop: Understanding the Big Data Processing Method,” Int. J. Sci. Res., vol. 4,
no. 3, pp. 1620–1624, 2015.
[13] A. Mirarab, S. L. Mirtaheri, and S. A. Asghari, “Value creation with big data analytics for enterprises : a survey,”
TELKOMNIKA (Telecommunication Comput. Electron. Control.), vol. 17, no. 6, pp. 5–7, 2019.
[14] Z. Li, H. Shen, W. Ligon, and J. Denton, “An Exploration of Designing a Hybrid Scale-Up/Out Hadoop
Architecture Based on Performance Measurements,” IEEE Trans. Parallel Distrib. Syst., vol. 28, no. 2,
pp. 386–400, 2017.
[15] P. P. Deshpande, “Hadoop Distributed FileSystem: Metadata Management,” Int. Res. J. Eng. Technol., vol. 10,
pp. 2395–56, 2017.
[16] H. He, Z. Du, W. Zhang, and A. Chen, “Optimization strategy of Hadoop small file storage for big data in
healthcare,” J. Supercomput., vol. 72, no. 10, pp. 3696–3707, 2016.
[17] M. A. Mir, “An Optimal Solution for small file problem in Hadoop,” International Journal of Advanced Research
in Computer Science, vol. 8, no. 5, p. 5697, 2017.
[18] M. A. Ahad and R. Biswas, “Soft Computing in Data Science,” vol. 545. Springer Singapore, 2015.
[19] K. Aishwarya, A. Arvind Ram, M. C. Sreevatson, C. Babu, and B. Prabavathy, “Efficient prefetching technique for
storage of heterogeneous small files in hadoop distributed file system federation,” 2013 5th Int. Conf. Adv. Comput.
ICoAC 2013, pp. 523–530, 2014.
[20] A. Erraissi, A. Belangour, and A. Tragha, “Digging into Hadoop-based Big Data Architectures,” Int. J. Comput.
Sci. Issues, vol. 14, no. 6, p. 7, 2017.
[21] M. A. Ahad and R. Biswas, “Dynamic Merging based Small File Storage (DM-SFS) Architecture for Efficiently
Storing Small Size Files in Hadoop,” Procedia Comput. Sci., vol. 132, pp. 1626–1635, 2018.
[22] Y. Guan, Z. Ma, and L. Li, “HDFS Optimization Strategy Based On Hierarchical Storage of Hot and Cold Data,”
Procedia CIRP, vol. 83, pp. 415–418, 2019.
[23] B. Gupta, R. Nath, G. Gopal, and K. K, “An Efficient Approach for Storing and Accessing Small Files with Big
Data Technology,” Int. J. Comput. Appl., vol. 146, no. 1, pp. 36–39, 2016.
[24] S. Bende and R. Shedge, “Dealing with Small Files Problem in Hadoop Distributed File System,” Procedia
Comput. Sci., vol. 79, pp. 1001–1012, 2016.
[25] S. Rangrao Kadam and V. Patil, “Review on Big Data Security in Hadoop,” Int. Res. J. Eng. Technol., vol. 4, no. 1,
pp. 1362–1365, 2017.
[26] C. Choi, C. Choi, J. Choi, and P. Kim, “Improved performance optimization for massive small files in cloud
computing environment,” Ann. Oper. Res., vol. 265, no. 2, pp. 305–317, 2018.
[27] R. Chethan, J. K. S. G. H. J, and P. M. C. N, “A Selective Approach for Storing Small Files in Respective Blocks
of Hadoop,” International Journal of Advanced Networking & Applications (IJANA), pp. 461–465.
[28] B. Sahare, A. Naik, and K. Patel, “Study of HADOOP,” Int. J. Comput. Sci. Trends Technol., vol. 2, no. 6,
pp. 40–43, 2014.
[29] S. Dzinamarira, F. Dinu, and T. S. E. Ng, “Ignem: Upward migration of cold data in big data file systems,” Proc. -
Int. Conf. Distrib. Comput. Syst., vol. 2018-July, pp. 65–75, 2018.
BIOGRAPHIES OF AUTHORS
Mohammad Bahjat Masadeh is a PhD candidate in Data Mining/Big Data at UTeM | Universiti Teknikal Malaysia Melaka and received his Master in Information Technology from UUM | Universiti Utara Malaysia in 2008–2009. He is a senior .NET developer and senior SharePoint developer in the information technology deanship of Umm Al-Qura University, Mecca, a WordPress blogger on Hadoop administration, a big data geek, and co-founder of wrshaa.com.
Mohd Sanusi Azmi received his BSc, MSc, and PhD from Universiti Kebangsaan Malaysia (UKM) in 2000, 2003, and 2013. He joined the Department of Software Engineering, Universiti Teknikal Malaysia Melaka (UTeM) in 2003 and is currently an Associate Professor at UTeM. He is the Malaysian pioneer researcher in identification and verification of digital images of Al-Quran Mushaf and is also involved in Digital Jawi Paleography. He actively contributes to the feature extraction domain and has proposed a novel technique based on geometry features for digit and Arabic handwritten documents.
Sharifah Sakinah Syed Ahmad received her B. Applied Science (Hons) in Computer Modelling and MSc in Mathematics from Universiti Sains Malaysia (USM). Her PhD is in Software Engineering and Intelligent Systems from the University of Alberta, Canada. She is an expert in Computer Aided Geometric Design (CAGD) and neural networks.