A survey on data mining and analysis in hadoop and mongo db

Alexander Decker
Alexander DeckerEditor in chief en International Institute for Science, Technology and Education
Computer Engineering and Intelligent Systems www.iiste.org
ISSN 2222-1719 (Paper) ISSN 2222-2863 (Online)
Vol.6, No.6, 2015
11
A Survey on Data Mining and Analysis in Hadoop and MongoDb
Manmitsinh C.Zala
Department of CS&E, Governmernt Engineering Collage,Modasa, Aravalli,Gujarat,India
E-mail manmit.zala@gmail.com
Prof. Jitendra S.Dhobi
Department of CS&E, Governmernt Engineering Collage,Modasa, Aravalli,Gujarat,India
E-mail: jsdhobi@gmail.com
Abstract:
Data Mining is a process to generate pattern and rules from various types of data marts and data warehouses ,in
this process there are several steps which contains data cleaning data anomaly detection then clean data is mined
with various approaches .In this research we have discussed data mining on large datasets ( Big Data) with this
large data set major issues are scalability and security ,Hadoop is the tool to mine the data and Mongo db
provides input for it, which is a key-value paradigm for parsing the data ,Other approaches are discussed with
this report and their capability for data storage ,Map reduce is method which can be used to reduce the data set
to reduce query processing time and improve system throughput, In the Proposed system we are going to mine
the big data this Hadoop and Mongo db and we will try to mine the data with sorted or double sorted key value
pair ,for and analyze the outcome of system.
Keywords- DataMIning , Hadoop, MapReduce, HDFS, MongoDb.
1. Introduction
“Big Data” is data whose scale, diversity, and complexity require new architecture, Techniques, algorithms, and
analytics to manage it and extract value and hidden knowledge from it.
Amount of data generated every day is expanding in drastic manner. Big data is a popular term used to
describe the data which is in zeta byte.[1] . Big Data is large amount of data. This vast amount of data is
generated by social media and networks, scientific instruments, mobile devices, sensor technology and
networks. Ability to manage, analyze, summarize, visualize, and discover knowledge from the collected
unstructured data in a timely manner and in a scalable fashion is very difficult task using traditional data mining
tools. To analyze the data Apache introduce a new technology called Hadoop. We can describe the
characteristics of big data using three Vs Volume, Variety and Velocity.[ 13][14]
Hadoop is the part of Apache projects, Hadoop software library is a framework that supports
distributed processing of large data across clusters of computers using simple programming models. Hadoop is
combination of Map Reduce and Hadoop distributed file system (HDFS). Work of Map Reduce is to process the
data and work of HDFS is to store the data into file system.
NOSQL [3][7] is the term related to “ Not Only Sql “ Sql is a relational database language but for big
data analysis these techniques are not enough so alternative solutions are NoSql databases like
Mongodb,Cassandra,Voldmort etc ..
2. Literature Survey
2.1 Applications
This proposed will provide a new approach analysis Big datamining with hadoop and mongodb which is based
on MapReduce Paradigm. This new approach will try to improve the computational time, more fault tolerance of
system and will handle or deal with Bigdata analysis.
2.2 Related Work
This chapter will provide information about the work done in big data mining and various approaches use and
method proposed
In [1] author has discussed the meaning and importance of big data analysis programming tool use for
big data mining and important of big data , with the example of facebook we can understood that today it is
required to process large number of data sets ,our traditional data sets are not enough for that ,for example
instead of taking large MySql tables we can use caching approach from memcached for n tier elements as Mysql
has very good performance in read but they are lagging in write ,which leads us to very high reliability but low
partition tolerance in our CAP model ,another example author has given is Yelp which uses AWS and Hadoop
for data analysis which uses Amazon S3 server to store large datasets which is RAID service.
The author proposed such data analysis using Apache Hadoop and JSON and data stored From
Amazon web services using their web services and analyze the data, the analysis showed that this method can
Computer Engineering and Intelligent Systems www.iiste.org
ISSN 2222-1719 (Paper) ISSN 2222-2863 (Online)
Vol.6, No.6, 2015
12
analyze the large data from different sources with minimum utilization of resources
In [2] In this paper author has utilized Nosql database Mongo db to implement the big data analysis as
it is advantageous over rigid sql tables which is not useful in today’s large scale data for web logs generated
every day .more over author has compared performance between Mongo db and HDFS frame work using inbuilt
map reduce method with mongo db , author has not defined the modern data store technology and integration
available with hadoop like Hbase , and HIVE for that experiments and results are shown for large amount of data
sets , this is the motive why we choose mongo db data store for
Large data sets .
Figure 2.1 [ HDFS –Mongo Db comparison]
With this framework proposed by author the output comparison shows that Figure shows the effect of
the split size on performance using mongo-hadoop. The number of input records is _ 9:3 million, or 4GB of
input data. With the default split size of 8MB, Hadoop schedules over 500 mappers; by increasing the split size,
we are able to reduce this number to around 40 and achieve a considerable performance improvement. The curve
levels o_ between 128MB and 256MB, so we decided to use 128MB as the split size for the rest of our tests both
for native Hadoop-HDFS and mongo-hadoop.
In , MS At el.[3] has discussed various security issues and threats available with Big data as data is in
zeta byte size it also contains some sensitive and confidential information it is necessary to prevent unauthorized
use of data so apart from storage retrieve and processing security is also an important concern for data mining ,
data application from social web ,consumer oriented work has large impact on big data security according to
author vast use of smart phones has increased photo uploading and other sensitive information on web it is an
issue for that author has proposed metadata analysis in big data which creates an index of each images uploaded
on social web and we can identify from link which gives confidentiality over social media, so each images can
be scanned from big data bases of social media and can be apply for future security policies .
In [4] After considering security in analysis we again come with our problem of analysis the big data
with this paper integration of NOSQL with big data analysis author proposed model of unity architecture for
analysis of data as shown in figure 3.2
Computer Engineering and Intelligent Systems www.iiste.org
ISSN 2222-1719 (Paper) ISSN 2222-2863 (Online)
Vol.6, No.6, 2015
13
Figure 2.2 Unity Architecture
The objectives of this architecture is as follow
• SQL is a declarative language that allows descriptive queries while hiding implementation and query execution
details.
• SQL is a standardized language allowing portability between systems and leveraging a massive existing
knowledge base of database developers.
• Supporting SQL allows a NoSQL system to seamlessly interact with other enterprise systems that use SQL and
JDBC/ODBC without requiring changes.
This system provides combination of both Relational data base system and Nosql system for this
interaction we can translate one schema to another schema by JDBC API and Mongo db connector .
In [6] Mongo db and Oracle databases are compared by their storage method ,syntax and their retrieval
methods also various experiments conducted with different query processing time and number of processing the
results here we are discussing few results achieved with this research .
Figure 2.3. Insert query with Mongo db and Oracle[6]
As we can see for small records inserting oracle databases are faster then mongo db but as the size increases for
Computer Engineering and Intelligent Systems www.iiste.org
ISSN 2222-1719 (Paper) ISSN 2222-2863 (Online)
Vol.6, No.6, 2015
14
records the mongo db is impressively ahead then Oracle database
Same results are achieved with update query comparison
Figure 2.4 Update query on records and time comparison Mongo db vs Oracle
From this we can conclude that mongodb is flexible and scalable for large data sets which provides
batter integration for data storage and retrieval .
In [5] In this paper author has discussed about some very important parameters of mongo db focusing
on CAP model and compared various types of data store available with no Nosql and tested them among various
business intelligent system provided , and concluded that Nosql data stores provides huge opportunity where Sql
data bases are not useful basic advantages are their scalability and cross node operation .the intersection
algorithm for mongo db states the effectiveness of mongo db data store for key value approach to modelize the
data.
In [7] [10] and [11] some practical approaches are shown to interact no sql data stores with various
systems such as distributed architecture[7] ,Hashdoop[11] and evolution in hadoop [10] are proposed in
distributed system data bases are handled by structure system but it fails when data items increase so
unstructured data stores are useful for such problems some major industries are capable to develop their own
unstructured data stores for ex. Google’s Big table, Yahoo’s PNUTS ,Hadoop’s Hbase and many more but what
about small industries ,author stated that there are many open source products are available to handle such data
the comparison between them is shown in below figure
Figure 2.5 Comparison of Difference Data stores [7]
Among all this data stores mongo db is better replacement for MySql as it is semi structured, and
provides batter joins contains laser time for searching and in performing other queries.
In paper [10] multiple Nosql data stores are compared and we can see that mongo db provides
consistency , partition tolerance and crash handling over any other data stores
But in this paper author has limited computation power this system can be improve by adding some
more computation power over large datasets by cloud computing or distributor approach .[11] is an example of
hadoop hash function for anomaly detection using map reduce programming model .the hashdoop framework
Computer Engineering and Intelligent Systems www.iiste.org
ISSN 2222-1719 (Paper) ISSN 2222-2863 (Online)
Vol.6, No.6, 2015
15
splits the traffic using has functions and the detector detects the anomaly from carious hadoop clusters then the
traffic has been divided in less traffic lines how ever author has not applied to store the data back to original data
sets which will be lost vice versa.
3. Background Study
3.1 Big Data
3.1.1 Architecture
Figure 3.1.1. Bigdata [ 9]
Big data is a distributed architecture for storing large amount of data ,According to a research recently the online
data has increased in size CERN research says that data without operating online. For example, “will produce
roughly 15 peta bytes (15 million gigabytes) of data annually – enough to fill more than 1.7 million dual-layer
DVDs a year!” [11 ]
Bigdata architecture consists following three segments
• Storage System
• Processing
• Analysis
Figure 3.1.1 BigData system [9]
3.2 What is NOSQL?
No SQL means an alternative of traditional database system the term was generated ny a scientist named Eric Evans
in Sanfransisco. The nosql databases have variety of different database systems and they provides the data
manipulation as well as low time in reading and writing
Many large organizations have their own NoSql Databases as BIG Table in Google which has much effect
on the no sql The whole point is that they provides alternatives to the traditional databases product. For Example the
many nosql products are available in the market and they are widely used by many companies .
Computer Engineering and Intelligent Systems www.iiste.org
ISSN 2222-1719 (Paper) ISSN 2222-2863 (Online)
Vol.6, No.6, 2015
16
3.2.1 Mongo DB
MongoDb is the nosql type database which is document oriented , the organization which developed it is
10gen ,mongo word means large size it is very fast and reliable it is written in C++ it is also used to store large
files over a distribute location
Also we can store binary data like images , videos , mp3 files.
3.2.1.1: Query Model
Queries for MongoDB are bidding in a JSON like syntax and are forward to MongoDB as BSON altar by the
database driver. The concern archetypal of MongoDB allows queries over all abstracts central a collection,
including the anchored altar and arrays. Through the acceptance of predefined indexes queries can dynamically
be formulated during runtime.
Not all aspects of a concern are formulated aural a concern accent in MongoDB, depending on the MongoDB
disciplinarian for a programming language, some things may be bidding through the abracadabra of a adjustment
of the driver.
{ “employees":[{ "firstName":"Manmit","lastName":"Zala" },
{"firstName":"Pradip" , "lastName":"Chavda" },{"firstName":"Nilay", "lastName":"Parekh" }]} The query
model supports the following features:
1. Queries over documents and embedded subdocuments
2. Comparators (<;_;_;>)
3. Conditional operators (equals, not equals, exists, in, not in ...)
4. Logical perators: AND
5. Sorting by multiple Attributes
6. Group by
3.2.1.2: Sharding
Mongodb harding means the components of mongodb it is also uses following components :
1) Configuration servers
2 ) Shard nodes
3) Services for Routing
These are known as mongos .
3.3Apache Hadoop
Apache Hadoop is java based programming framework which is used for processing large data sets in distributed
computer environment. Hadoop is used in system where multiple nodes are present which can process terabytes
of data hadoop uses its own file system HDFS which facilitates fast transfer of data which can sustain node
failure and avoid system failure as whole.[1]
3.3.1 Architecutre
Figure 3.3.1 HDFS ARCHITECTURE [14]
3.4 Map Reduce Algorithm and Approaches
Map/Reduce is a programming paradigm that was made popular by Google where a task is divided into small
Computer Engineering and Intelligent Systems www.iiste.org
ISSN 2222-1719 (Paper) ISSN 2222-2863 (Online)
Vol.6, No.6, 2015
17
portions and distributed to a large number of nodes for processing (map), and the results are then summarized
into the final answer (reduce). Hadoop also uses Map/Reduce for data processing. Hence different functions for
the processing are written in the form of Hadoop job. A Hadoop job consists of mapper and reducer functions
framework i.e. STS is used as the integrated development environment (IDE) [1].
3.4.1Map reduce with Hadoop
A .Many company uses Hadoop for big data analysis ,For example facebook use HIVE with Hadoop [1]
B. Yelp :uses AWS and Hadoop Yelp uses Amazon S3 to store daily logs and photos, generating around 100GB
of logs per day. The company also uses Amazon Elastic Map Reduce to power approximately 20 separate batch
scripts, most of those processing the logs.
Features powered by Amazon ElasticMapReduce include:[1]
1. People Who Viewed this Also Viewed
2. Review highlights
3. Auto complete as you type on search
4. Search spelling suggestions
5. Top searches
3.4.2 Map Reduce using MongoDb
Mongo db uses key value pair type of storage The Map primitive consists in processing a data list in order to
create key/value pairs. Then, the Reduce primitive will process each pair in order to create new aggregated
key/value pairs.[5]
Example[5]: map(k1, v1) = list(k2, v2) ………...(1)
reduce(k2, list(v2)) = list(v3) …………………...(2)
List : (a; 2)(a; 4)(b; 4)(c; 5)(b; 2)(a; 1)………… (3)
After mapping : (a; [2, 4, 1]), (b; [4, 2]), (c[5]) …(4)
After reducing : (a; 7), (b; 6), (c; 5) …………….(5)
Equations (1) , (2) show both map and reduce primitives.
Figure 3.5.2 Map reduce example[5]
4. Conclusion
Now a day data increases day by day the storage, retrieval and analysis of bigdata in structured databases like
Oracle and Mysql is not possible so we have presented many Nosql system among them Mongo db is preferable
for as an alternate for Mysql , still it is an Active search are for data mining to mine knowledge from bigdata.In
future we are interested in batter method and system for efficient mining of bigdata.
5. References
[1] Jyoti Nandimath , Ankur Patil , Ekata Banerjee , Pratima Kakade :”Big Data Analysis using Apache
Hadoop “ In SKNCOE Pune India,2013
[2] E. Dede, M. Govindaraju ,D. Gunter, R. Canon, L. Ramakrishnan “Performance Evaluation of a MongoDB
and HadoopPlatform for Scientific Data Analysis” In Lawrence Berekely National Lab Berkeley, CA 94720
[3] Matthew Smith, Christian Szongott,Benjamin Henne, Gabriele von Voigt” Big Data Privacy Issues in Public
Social Media”,2013
[4] Ramon Lawrence :“Integration and Virtualization of Relational SQL and NoSQL Systems including MySQL
and MongoDB”At 2014 International Conference on Computational Science and Computational
Intelligence,2014
[5] Laurent Bonne, Anne Laurent, Michel Sala,Benedicte Laurent,Nicolas Sicard:” REDUCE, YOU SAY: What
NoSQL can do for Data Aggregation and BI in Large Repositories “In 2011 22nd International Workshop on
Database and Expert Systems Applications,2011
Computer Engineering and Intelligent Systems www.iiste.org
ISSN 2222-1719 (Paper) ISSN 2222-2863 (Online)
Vol.6, No.6, 2015
18
[6] Alexandru Boicea, Florin Radulescu, Laura Ioana Agapin” MongoDB vs Oracle - database comparison
“2012 Third International Conference on Emerging Intelligent Data and Web Technologies,2012
[7] Suyog S. Nyati, Shivanand Pawar ,Rajesh Ingle: ” Performance Evaluation of Unstructured NoSQL data over
distributed framework” 2013 International Conference on Advances in Computing, Communications and
Informatics (ICACCI),2013
[8] Lior Okman,Nurit Gal-Oz,, Yaron Gonen,, Ehud Gudes ,Jenny Abramov:” Security Issues in NoSQL
Databases” 2011 International Joint Conference of IEEE TrustCom-11/IEEE ICESS-11/FCST-11,2011
[9] Chanchal Yadav, Shuliang Wang , Manoj Kumar: Algorithm and approaches to handle large Data- A Survey.
IJCSN International Journal of Computer Science and Network, Vol 2, Issue 3, 2013
[10] Ruxandra Burtica, Eleonora Maria Mocanu, Mugurel Ionut Andreica,
Nicolae Tapus: Practical application and evaluation of no-SQL databases in Cloud Computing , ©2012 IEEE
[11] Romain Fontugne,,Johan Mazel, Kensuke FukudaHashdoop: A MapReduce Framework for
NetworkAnomaly DetectionCERN – European organization for nuclear Research 2014 IEEE INFOCOM
Workshops: 2014 IEEE INFOCOM Workshop on Security and Privacy in Big Data,2014
WEBREFERENCES
[12] Bigdata-Wikipedia : http://en.wikipedia.org/wiki/Big_data
[13] Big Data Characteristics : http://en.wikipedia.org/wiki/Big_data#Characteristics
[14]Hadoop Architecture:http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
BOOKS
[15]By Jiawei Han And Micheline Kamber,Data Mining Concept and Techniques, Copyright 2006, Second
Edition,pp 5-9 ,119-145.
Business, Economics, Finance and Management Journals PAPER SUBMISSION EMAIL
European Journal of Business and Management EJBM@iiste.org
Research Journal of Finance and Accounting RJFA@iiste.org
Journal of Economics and Sustainable Development JESD@iiste.org
Information and Knowledge Management IKM@iiste.org
Journal of Developing Country Studies DCS@iiste.org
Industrial Engineering Letters IEL@iiste.org
Physical Sciences, Mathematics and Chemistry Journals PAPER SUBMISSION EMAIL
Journal of Natural Sciences Research JNSR@iiste.org
Journal of Chemistry and Materials Research CMR@iiste.org
Journal of Mathematical Theory and Modeling MTM@iiste.org
Advances in Physics Theories and Applications APTA@iiste.org
Chemical and Process Engineering Research CPER@iiste.org
Engineering, Technology and Systems Journals PAPER SUBMISSION EMAIL
Computer Engineering and Intelligent Systems CEIS@iiste.org
Innovative Systems Design and Engineering ISDE@iiste.org
Journal of Energy Technologies and Policy JETP@iiste.org
Information and Knowledge Management IKM@iiste.org
Journal of Control Theory and Informatics CTI@iiste.org
Journal of Information Engineering and Applications JIEA@iiste.org
Industrial Engineering Letters IEL@iiste.org
Journal of Network and Complex Systems NCS@iiste.org
Environment, Civil, Materials Sciences Journals PAPER SUBMISSION EMAIL
Journal of Environment and Earth Science JEES@iiste.org
Journal of Civil and Environmental Research CER@iiste.org
Journal of Natural Sciences Research JNSR@iiste.org
Life Science, Food and Medical Sciences PAPER SUBMISSION EMAIL
Advances in Life Science and Technology ALST@iiste.org
Journal of Natural Sciences Research JNSR@iiste.org
Journal of Biology, Agriculture and Healthcare JBAH@iiste.org
Journal of Food Science and Quality Management FSQM@iiste.org
Journal of Chemistry and Materials Research CMR@iiste.org
Education, and other Social Sciences PAPER SUBMISSION EMAIL
Journal of Education and Practice JEP@iiste.org
Journal of Law, Policy and Globalization JLPG@iiste.org
Journal of New Media and Mass Communication NMMC@iiste.org
Journal of Energy Technologies and Policy JETP@iiste.org
Historical Research Letter HRL@iiste.org
Public Policy and Administration Research PPAR@iiste.org
International Affairs and Global Strategy IAGS@iiste.org
Research on Humanities and Social Sciences RHSS@iiste.org
Journal of Developing Country Studies DCS@iiste.org
Journal of Arts and Design Studies ADS@iiste.org
The IISTE is a pioneer in the Open-Access hosting service and academic event management.
The aim of the firm is Accelerating Global Knowledge Sharing.
More information about the firm can be found on the homepage:
http://www.iiste.org
CALL FOR JOURNAL PAPERS
There are more than 30 peer-reviewed academic journals hosted under the hosting platform.
Prospective authors of journals can find the submission instruction on the following
page: http://www.iiste.org/journals/ All the journals articles are available online to the
readers all over the world without financial, legal, or technical barriers other than those
inseparable from gaining access to the internet itself. Paper version of the journals is also
available upon request of readers and authors.
MORE RESOURCES
Book publication information: http://www.iiste.org/book/
IISTE Knowledge Sharing Partners
EBSCO, Index Copernicus, Ulrich's Periodicals Directory, JournalTOCS, PKP Open
Archives Harvester, Bielefeld Academic Search Engine, Elektronische Zeitschriftenbibliothek
EZB, Open J-Gate, OCLC WorldCat, Universe Digtial Library , NewJour, Google Scholar

Recomendados

B1803031217 por
B1803031217B1803031217
B1803031217IOSR Journals
166 vistas6 diapositivas
A STUDY ON GRAPH STORAGE DATABASE OF NOSQL por
A STUDY ON GRAPH STORAGE DATABASE OF NOSQLA STUDY ON GRAPH STORAGE DATABASE OF NOSQL
A STUDY ON GRAPH STORAGE DATABASE OF NOSQLijscai
156 vistas7 diapositivas
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING por
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMINGEVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMINGijiert bestjournal
22 vistas4 diapositivas
C1803041317 por
C1803041317C1803041317
C1803041317IOSR Journals
152 vistas5 diapositivas
Iaetsd mapreduce streaming over cassandra datasets por
Iaetsd mapreduce streaming over cassandra datasetsIaetsd mapreduce streaming over cassandra datasets
Iaetsd mapreduce streaming over cassandra datasetsIaetsd Iaetsd
112 vistas4 diapositivas

Más contenido relacionado

La actualidad más candente

Big Data Processing with Hadoop : A Review por
Big Data Processing with Hadoop : A ReviewBig Data Processing with Hadoop : A Review
Big Data Processing with Hadoop : A ReviewIRJET Journal
30 vistas4 diapositivas
disertation por
disertationdisertation
disertationRuben Casas
127 vistas15 diapositivas
Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A... por
Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...
Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...IJSRD
149 vistas4 diapositivas
BIG DATA SUMMARIZATION: FRAMEWORK, CHALLENGES AND POSSIBLE SOLUTIONS por
BIG DATA SUMMARIZATION: FRAMEWORK, CHALLENGES AND POSSIBLE SOLUTIONSBIG DATA SUMMARIZATION: FRAMEWORK, CHALLENGES AND POSSIBLE SOLUTIONS
BIG DATA SUMMARIZATION: FRAMEWORK, CHALLENGES AND POSSIBLE SOLUTIONSaciijournal
78 vistas9 diapositivas
E018142329 por
E018142329E018142329
E018142329IOSR Journals
170 vistas7 diapositivas
Trends in Computer Science and Information Technology por
Trends in Computer Science and Information TechnologyTrends in Computer Science and Information Technology
Trends in Computer Science and Information Technologypeertechzpublication
33 vistas6 diapositivas

La actualidad más candente(16)

Big Data Processing with Hadoop : A Review por IRJET Journal
Big Data Processing with Hadoop : A ReviewBig Data Processing with Hadoop : A Review
Big Data Processing with Hadoop : A Review
IRJET Journal30 vistas
Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A... por IJSRD
Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...
Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...
IJSRD 149 vistas
BIG DATA SUMMARIZATION: FRAMEWORK, CHALLENGES AND POSSIBLE SOLUTIONS por aciijournal
BIG DATA SUMMARIZATION: FRAMEWORK, CHALLENGES AND POSSIBLE SOLUTIONSBIG DATA SUMMARIZATION: FRAMEWORK, CHALLENGES AND POSSIBLE SOLUTIONS
BIG DATA SUMMARIZATION: FRAMEWORK, CHALLENGES AND POSSIBLE SOLUTIONS
aciijournal78 vistas
Web Oriented FIM for large scale dataset using Hadoop por dbpublications
Web Oriented FIM for large scale dataset using HadoopWeb Oriented FIM for large scale dataset using Hadoop
Web Oriented FIM for large scale dataset using Hadoop
dbpublications46 vistas
Key aspects of big data storage and its architecture por Rahul Chaturvedi
Key aspects of big data storage and its architectureKey aspects of big data storage and its architecture
Key aspects of big data storage and its architecture
Rahul Chaturvedi313 vistas
A Quantified Approach for large Dataset Compression in Association Mining por IOSR Journals
A Quantified Approach for large Dataset Compression in Association MiningA Quantified Approach for large Dataset Compression in Association Mining
A Quantified Approach for large Dataset Compression in Association Mining
IOSR Journals358 vistas
Implementation of Multi-node Clusters in Column Oriented Database using HDFS por IJEACS
Implementation of Multi-node Clusters in Column Oriented Database using HDFSImplementation of Multi-node Clusters in Column Oriented Database using HDFS
Implementation of Multi-node Clusters in Column Oriented Database using HDFS
IJEACS23 vistas
A Comprehensive Study on Big Data Applications and Challenges por ijcisjournal
A Comprehensive Study on Big Data Applications and ChallengesA Comprehensive Study on Big Data Applications and Challenges
A Comprehensive Study on Big Data Applications and Challenges
ijcisjournal88 vistas
Social Media World News Impact on Stock Index Values - Investment Fund Analyt... por Bernardo Najlis
Social Media World News Impact on Stock Index Values - Investment Fund Analyt...Social Media World News Impact on Stock Index Values - Investment Fund Analyt...
Social Media World News Impact on Stock Index Values - Investment Fund Analyt...
Bernardo Najlis2.7K vistas

Similar a A survey on data mining and analysis in hadoop and mongo db

Unstructured Datasets Analysis: Thesaurus Model por
Unstructured Datasets Analysis: Thesaurus ModelUnstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus ModelEditor IJCATR
87 vistas4 diapositivas
A Study on Graph Storage Database of NOSQL por
A Study on Graph Storage Database of NOSQLA Study on Graph Storage Database of NOSQL
A Study on Graph Storage Database of NOSQLIJSCAI Journal
56 vistas7 diapositivas
A STUDY ON GRAPH STORAGE DATABASE OF NOSQL por
A STUDY ON GRAPH STORAGE DATABASE OF NOSQLA STUDY ON GRAPH STORAGE DATABASE OF NOSQL
A STUDY ON GRAPH STORAGE DATABASE OF NOSQLijscai
20 vistas7 diapositivas
A Study on Graph Storage Database of NOSQL por
A Study on Graph Storage Database of NOSQLA Study on Graph Storage Database of NOSQL
A Study on Graph Storage Database of NOSQLIJSCAI Journal
33 vistas7 diapositivas
Bridging the gap between the semantic web and big data: answering SPARQL que... por
Bridging the gap between the semantic web and big data:  answering SPARQL que...Bridging the gap between the semantic web and big data:  answering SPARQL que...
Bridging the gap between the semantic web and big data: answering SPARQL que...IJECEIAES
4 vistas7 diapositivas
MONGODB VS MYSQL: A COMPARATIVE STUDY OF PERFORMANCE IN SUPER MARKET MANAGEME... por
MONGODB VS MYSQL: A COMPARATIVE STUDY OF PERFORMANCE IN SUPER MARKET MANAGEME...MONGODB VS MYSQL: A COMPARATIVE STUDY OF PERFORMANCE IN SUPER MARKET MANAGEME...
MONGODB VS MYSQL: A COMPARATIVE STUDY OF PERFORMANCE IN SUPER MARKET MANAGEME...ijcsity
5 vistas8 diapositivas

Similar a A survey on data mining and analysis in hadoop and mongo db(20)

Unstructured Datasets Analysis: Thesaurus Model por Editor IJCATR
Unstructured Datasets Analysis: Thesaurus ModelUnstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus Model
Editor IJCATR87 vistas
A Study on Graph Storage Database of NOSQL por IJSCAI Journal
A Study on Graph Storage Database of NOSQLA Study on Graph Storage Database of NOSQL
A Study on Graph Storage Database of NOSQL
IJSCAI Journal56 vistas
A STUDY ON GRAPH STORAGE DATABASE OF NOSQL por ijscai
A STUDY ON GRAPH STORAGE DATABASE OF NOSQLA STUDY ON GRAPH STORAGE DATABASE OF NOSQL
A STUDY ON GRAPH STORAGE DATABASE OF NOSQL
ijscai20 vistas
A Study on Graph Storage Database of NOSQL por IJSCAI Journal
A Study on Graph Storage Database of NOSQLA Study on Graph Storage Database of NOSQL
A Study on Graph Storage Database of NOSQL
IJSCAI Journal33 vistas
Bridging the gap between the semantic web and big data: answering SPARQL que... por IJECEIAES
Bridging the gap between the semantic web and big data:  answering SPARQL que...Bridging the gap between the semantic web and big data:  answering SPARQL que...
Bridging the gap between the semantic web and big data: answering SPARQL que...
IJECEIAES4 vistas
MONGODB VS MYSQL: A COMPARATIVE STUDY OF PERFORMANCE IN SUPER MARKET MANAGEME... por ijcsity
MONGODB VS MYSQL: A COMPARATIVE STUDY OF PERFORMANCE IN SUPER MARKET MANAGEME...MONGODB VS MYSQL: A COMPARATIVE STUDY OF PERFORMANCE IN SUPER MARKET MANAGEME...
MONGODB VS MYSQL: A COMPARATIVE STUDY OF PERFORMANCE IN SUPER MARKET MANAGEME...
ijcsity5 vistas
MONGODB VS MYSQL: A COMPARATIVE STUDY OF PERFORMANCE IN SUPER MARKET MANAGEME... por ijcsity
MONGODB VS MYSQL: A COMPARATIVE STUDY OF PERFORMANCE IN SUPER MARKET MANAGEME...MONGODB VS MYSQL: A COMPARATIVE STUDY OF PERFORMANCE IN SUPER MARKET MANAGEME...
MONGODB VS MYSQL: A COMPARATIVE STUDY OF PERFORMANCE IN SUPER MARKET MANAGEME...
ijcsity21 vistas
MONGODB VS MYSQL: A COMPARATIVE STUDY OF PERFORMANCE IN SUPER MARKET MANAGEME... por ijcsity
MONGODB VS MYSQL: A COMPARATIVE STUDY OF PERFORMANCE IN SUPER MARKET MANAGEME...MONGODB VS MYSQL: A COMPARATIVE STUDY OF PERFORMANCE IN SUPER MARKET MANAGEME...
MONGODB VS MYSQL: A COMPARATIVE STUDY OF PERFORMANCE IN SUPER MARKET MANAGEME...
ijcsity33 vistas
hadoop seminar training report por Sarvesh Meena
hadoop seminar  training reporthadoop seminar  training report
hadoop seminar training report
Sarvesh Meena351 vistas
The Rise of Nosql Databases por JAMES NGONDO
The Rise of Nosql DatabasesThe Rise of Nosql Databases
The Rise of Nosql Databases
JAMES NGONDO334 vistas
DATABASE SYSTEMS PERFORMANCE EVALUATION FOR IOT APPLICATIONS por ijdms
DATABASE SYSTEMS PERFORMANCE EVALUATION FOR IOT APPLICATIONSDATABASE SYSTEMS PERFORMANCE EVALUATION FOR IOT APPLICATIONS
DATABASE SYSTEMS PERFORMANCE EVALUATION FOR IOT APPLICATIONS
ijdms20 vistas
Introduction to Big Data and Hadoop using Local Standalone Mode por inventionjournals
Introduction to Big Data and Hadoop using Local Standalone ModeIntroduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone Mode
inventionjournals47 vistas
Building a Big Data platform with the Hadoop ecosystem por Gregg Barrett
Building a Big Data platform with the Hadoop ecosystemBuilding a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystem
Gregg Barrett4.6K vistas
Introduction to Big Data por Vipin Batra
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
Vipin Batra4.8K vistas
Performance Improvement of Heterogeneous Hadoop Cluster using Ranking Algorithm por IRJET Journal
Performance Improvement of Heterogeneous Hadoop Cluster using Ranking AlgorithmPerformance Improvement of Heterogeneous Hadoop Cluster using Ranking Algorithm
Performance Improvement of Heterogeneous Hadoop Cluster using Ranking Algorithm
IRJET Journal23 vistas
CS828 P5 Individual Project v101 por ThienSi Le
CS828 P5 Individual Project v101CS828 P5 Individual Project v101
CS828 P5 Individual Project v101
ThienSi Le948 vistas
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector... por IRJET Journal
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET Journal14 vistas
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector... por IRJET Journal
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET Journal9 vistas

Más de Alexander Decker

Abnormalities of hormones and inflammatory cytokines in women affected with p... por
Abnormalities of hormones and inflammatory cytokines in women affected with p...Abnormalities of hormones and inflammatory cytokines in women affected with p...
Abnormalities of hormones and inflammatory cytokines in women affected with p...Alexander Decker
10.4K vistas5 diapositivas
A validation of the adverse childhood experiences scale in por
A validation of the adverse childhood experiences scale inA validation of the adverse childhood experiences scale in
A validation of the adverse childhood experiences scale inAlexander Decker
8.1K vistas8 diapositivas
A usability evaluation framework for b2 c e commerce websites por
A usability evaluation framework for b2 c e commerce websitesA usability evaluation framework for b2 c e commerce websites
A usability evaluation framework for b2 c e commerce websitesAlexander Decker
6.6K vistas20 diapositivas
A universal model for managing the marketing executives in nigerian banks por
A universal model for managing the marketing executives in nigerian banksA universal model for managing the marketing executives in nigerian banks
A universal model for managing the marketing executives in nigerian banksAlexander Decker
5.9K vistas13 diapositivas
A unique common fixed point theorems in generalized d por
A unique common fixed point theorems in generalized dA unique common fixed point theorems in generalized d
A unique common fixed point theorems in generalized dAlexander Decker
5.6K vistas14 diapositivas
A trends of salmonella and antibiotic resistance por
A trends of salmonella and antibiotic resistanceA trends of salmonella and antibiotic resistance
A trends of salmonella and antibiotic resistanceAlexander Decker
4.9K vistas13 diapositivas

Más de Alexander Decker(20)

Abnormalities of hormones and inflammatory cytokines in women affected with p... por Alexander Decker
Abnormalities of hormones and inflammatory cytokines in women affected with p...Abnormalities of hormones and inflammatory cytokines in women affected with p...
Abnormalities of hormones and inflammatory cytokines in women affected with p...
Alexander Decker10.4K vistas
A validation of the adverse childhood experiences scale in por Alexander Decker
A validation of the adverse childhood experiences scale inA validation of the adverse childhood experiences scale in
A validation of the adverse childhood experiences scale in
Alexander Decker8.1K vistas
A usability evaluation framework for b2 c e commerce websites por Alexander Decker
A usability evaluation framework for b2 c e commerce websitesA usability evaluation framework for b2 c e commerce websites
A usability evaluation framework for b2 c e commerce websites
Alexander Decker6.6K vistas
A universal model for managing the marketing executives in nigerian banks por Alexander Decker
A universal model for managing the marketing executives in nigerian banksA universal model for managing the marketing executives in nigerian banks
A universal model for managing the marketing executives in nigerian banks
Alexander Decker5.9K vistas
A unique common fixed point theorems in generalized d por Alexander Decker
A unique common fixed point theorems in generalized dA unique common fixed point theorems in generalized d
A unique common fixed point theorems in generalized d
Alexander Decker5.6K vistas
A trends of salmonella and antibiotic resistance por Alexander Decker
A trends of salmonella and antibiotic resistanceA trends of salmonella and antibiotic resistance
A trends of salmonella and antibiotic resistance
Alexander Decker4.9K vistas
A transformational generative approach towards understanding al-istifham por Alexander Decker
A transformational  generative approach towards understanding al-istifhamA transformational  generative approach towards understanding al-istifham
A transformational generative approach towards understanding al-istifham
Alexander Decker4K vistas
A time series analysis of the determinants of savings in namibia por Alexander Decker
A time series analysis of the determinants of savings in namibiaA time series analysis of the determinants of savings in namibia
A time series analysis of the determinants of savings in namibia
Alexander Decker2.8K vistas
A therapy for physical and mental fitness of school children por Alexander Decker
A therapy for physical and mental fitness of school childrenA therapy for physical and mental fitness of school children
A therapy for physical and mental fitness of school children
Alexander Decker1.7K vistas
A theory of efficiency for managing the marketing executives in nigerian banks por Alexander Decker
A theory of efficiency for managing the marketing executives in nigerian banksA theory of efficiency for managing the marketing executives in nigerian banks
A theory of efficiency for managing the marketing executives in nigerian banks
Alexander Decker1.1K vistas
A systematic evaluation of link budget for por Alexander Decker
A systematic evaluation of link budget forA systematic evaluation of link budget for
A systematic evaluation of link budget for
Alexander Decker778 vistas
A synthetic review of contraceptive supplies in punjab por Alexander Decker
A synthetic review of contraceptive supplies in punjabA synthetic review of contraceptive supplies in punjab
A synthetic review of contraceptive supplies in punjab
Alexander Decker733 vistas
A synthesis of taylor’s and fayol’s management approaches for managing market... por Alexander Decker
A synthesis of taylor’s and fayol’s management approaches for managing market...A synthesis of taylor’s and fayol’s management approaches for managing market...
A synthesis of taylor’s and fayol’s management approaches for managing market...
Alexander Decker766 vistas
A survey paper on sequence pattern mining with incremental por Alexander Decker
A survey paper on sequence pattern mining with incrementalA survey paper on sequence pattern mining with incremental
A survey paper on sequence pattern mining with incremental
Alexander Decker616 vistas
A survey on live virtual machine migrations and its techniques por Alexander Decker
A survey on live virtual machine migrations and its techniquesA survey on live virtual machine migrations and its techniques
A survey on live virtual machine migrations and its techniques
Alexander Decker697 vistas
A survey on challenges to the media cloud por Alexander Decker
A survey on challenges to the media cloudA survey on challenges to the media cloud
A survey on challenges to the media cloud
Alexander Decker544 vistas
A survey of private equity investments in kenya por Alexander Decker
A survey of private equity investments in kenyaA survey of private equity investments in kenya
A survey of private equity investments in kenya
Alexander Decker482 vistas
A study to measures the financial health of por Alexander Decker
A study to measures the financial health ofA study to measures the financial health of
A study to measures the financial health of
Alexander Decker617 vistas
A study to evaluate the attitude of faculty members of public universities of... por Alexander Decker
A study to evaluate the attitude of faculty members of public universities of...A study to evaluate the attitude of faculty members of public universities of...
A study to evaluate the attitude of faculty members of public universities of...
Alexander Decker246 vistas

A survey on data mining and analysis in hadoop and mongo db

  • 1. Computer Engineering and Intelligent Systems www.iiste.org ISSN 2222-1719 (Paper) ISSN 2222-2863 (Online) Vol.6, No.6, 2015 11 A Survey on Data Mining and Analysis in Hadoop and MongoDb Manmitsinh C.Zala Department of CS&E, Governmernt Engineering Collage,Modasa, Aravalli,Gujarat,India E-mail manmit.zala@gmail.com Prof. Jitendra S.Dhobi Department of CS&E, Governmernt Engineering Collage,Modasa, Aravalli,Gujarat,India E-mail: jsdhobi@gmail.com Abstract: Data Mining is a process to generate pattern and rules from various types of data marts and data warehouses ,in this process there are several steps which contains data cleaning data anomaly detection then clean data is mined with various approaches .In this research we have discussed data mining on large datasets ( Big Data) with this large data set major issues are scalability and security ,Hadoop is the tool to mine the data and Mongo db provides input for it, which is a key-value paradigm for parsing the data ,Other approaches are discussed with this report and their capability for data storage ,Map reduce is method which can be used to reduce the data set to reduce query processing time and improve system throughput, In the Proposed system we are going to mine the big data this Hadoop and Mongo db and we will try to mine the data with sorted or double sorted key value pair ,for and analyze the outcome of system. Keywords- DataMIning , Hadoop, MapReduce, HDFS, MongoDb. 1. Introduction “Big Data” is data whose scale, diversity, and complexity require new architecture, Techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it. Amount of data generated every day is expanding in drastic manner. Big data is a popular term used to describe the data which is in zeta byte.[1] . Big Data is large amount of data. This vast amount of data is generated by social media and networks, scientific instruments, mobile devices, sensor technology and networks. Ability to manage, analyze, summarize, visualize, and discover knowledge from the collected unstructured data in a timely manner and in a scalable fashion is very difficult task using traditional data mining tools. To analyze the data Apache introduce a new technology called Hadoop. We can describe the characteristics of big data using three Vs Volume, Variety and Velocity.[ 13][14] Hadoop is the part of Apache projects, Hadoop software library is a framework that supports distributed processing of large data across clusters of computers using simple programming models. Hadoop is combination of Map Reduce and Hadoop distributed file system (HDFS). Work of Map Reduce is to process the data and work of HDFS is to store the data into file system. NOSQL [3][7] is the term related to “ Not Only Sql “ Sql is a relational database language but for big data analysis these techniques are not enough so alternative solutions are NoSql databases like Mongodb,Cassandra,Voldmort etc .. 2. Literature Survey 2.1 Applications This proposed will provide a new approach analysis Big datamining with hadoop and mongodb which is based on MapReduce Paradigm. This new approach will try to improve the computational time, more fault tolerance of system and will handle or deal with Bigdata analysis. 2.2 Related Work This chapter will provide information about the work done in big data mining and various approaches use and method proposed In [1] author has discussed the meaning and importance of big data analysis programming tool use for big data mining and important of big data , with the example of facebook we can understood that today it is required to process large number of data sets ,our traditional data sets are not enough for that ,for example instead of taking large MySql tables we can use caching approach from memcached for n tier elements as Mysql has very good performance in read but they are lagging in write ,which leads us to very high reliability but low partition tolerance in our CAP model ,another example author has given is Yelp which uses AWS and Hadoop for data analysis which uses Amazon S3 server to store large datasets which is RAID service. The author proposed such data analysis using Apache Hadoop and JSON and data stored From Amazon web services using their web services and analyze the data, the analysis showed that this method can
  • 2. Computer Engineering and Intelligent Systems www.iiste.org ISSN 2222-1719 (Paper) ISSN 2222-2863 (Online) Vol.6, No.6, 2015 12 analyze the large data from different sources with minimum utilization of resources In [2] In this paper author has utilized Nosql database Mongo db to implement the big data analysis as it is advantageous over rigid sql tables which is not useful in today’s large scale data for web logs generated every day .more over author has compared performance between Mongo db and HDFS frame work using inbuilt map reduce method with mongo db , author has not defined the modern data store technology and integration available with hadoop like Hbase , and HIVE for that experiments and results are shown for large amount of data sets , this is the motive why we choose mongo db data store for Large data sets . Figure 2.1 [ HDFS –Mongo Db comparison] With this framework proposed by author the output comparison shows that Figure shows the effect of the split size on performance using mongo-hadoop. The number of input records is _ 9:3 million, or 4GB of input data. With the default split size of 8MB, Hadoop schedules over 500 mappers; by increasing the split size, we are able to reduce this number to around 40 and achieve a considerable performance improvement. The curve levels o_ between 128MB and 256MB, so we decided to use 128MB as the split size for the rest of our tests both for native Hadoop-HDFS and mongo-hadoop. In , MS At el.[3] has discussed various security issues and threats available with Big data as data is in zeta byte size it also contains some sensitive and confidential information it is necessary to prevent unauthorized use of data so apart from storage retrieve and processing security is also an important concern for data mining , data application from social web ,consumer oriented work has large impact on big data security according to author vast use of smart phones has increased photo uploading and other sensitive information on web it is an issue for that author has proposed metadata analysis in big data which creates an index of each images uploaded on social web and we can identify from link which gives confidentiality over social media, so each images can be scanned from big data bases of social media and can be apply for future security policies . In [4] After considering security in analysis we again come with our problem of analysis the big data with this paper integration of NOSQL with big data analysis author proposed model of unity architecture for analysis of data as shown in figure 3.2
  • 3. Computer Engineering and Intelligent Systems www.iiste.org ISSN 2222-1719 (Paper) ISSN 2222-2863 (Online) Vol.6, No.6, 2015 13 Figure 2.2 Unity Architecture The objectives of this architecture is as follow • SQL is a declarative language that allows descriptive queries while hiding implementation and query execution details. • SQL is a standardized language allowing portability between systems and leveraging a massive existing knowledge base of database developers. • Supporting SQL allows a NoSQL system to seamlessly interact with other enterprise systems that use SQL and JDBC/ODBC without requiring changes. This system provides combination of both Relational data base system and Nosql system for this interaction we can translate one schema to another schema by JDBC API and Mongo db connector . In [6] Mongo db and Oracle databases are compared by their storage method ,syntax and their retrieval methods also various experiments conducted with different query processing time and number of processing the results here we are discussing few results achieved with this research . Figure 2.3. Insert query with Mongo db and Oracle[6] As we can see for small records inserting oracle databases are faster then mongo db but as the size increases for
  • 4. Computer Engineering and Intelligent Systems www.iiste.org ISSN 2222-1719 (Paper) ISSN 2222-2863 (Online) Vol.6, No.6, 2015 14 records the mongo db is impressively ahead then Oracle database Same results are achieved with update query comparison Figure 2.4 Update query on records and time comparison Mongo db vs Oracle From this we can conclude that mongodb is flexible and scalable for large data sets which provides batter integration for data storage and retrieval . In [5] In this paper author has discussed about some very important parameters of mongo db focusing on CAP model and compared various types of data store available with no Nosql and tested them among various business intelligent system provided , and concluded that Nosql data stores provides huge opportunity where Sql data bases are not useful basic advantages are their scalability and cross node operation .the intersection algorithm for mongo db states the effectiveness of mongo db data store for key value approach to modelize the data. In [7] [10] and [11] some practical approaches are shown to interact no sql data stores with various systems such as distributed architecture[7] ,Hashdoop[11] and evolution in hadoop [10] are proposed in distributed system data bases are handled by structure system but it fails when data items increase so unstructured data stores are useful for such problems some major industries are capable to develop their own unstructured data stores for ex. Google’s Big table, Yahoo’s PNUTS ,Hadoop’s Hbase and many more but what about small industries ,author stated that there are many open source products are available to handle such data the comparison between them is shown in below figure Figure 2.5 Comparison of Difference Data stores [7] Among all this data stores mongo db is better replacement for MySql as it is semi structured, and provides batter joins contains laser time for searching and in performing other queries. In paper [10] multiple Nosql data stores are compared and we can see that mongo db provides consistency , partition tolerance and crash handling over any other data stores But in this paper author has limited computation power this system can be improve by adding some more computation power over large datasets by cloud computing or distributor approach .[11] is an example of hadoop hash function for anomaly detection using map reduce programming model .the hashdoop framework
  • 5. Computer Engineering and Intelligent Systems www.iiste.org ISSN 2222-1719 (Paper) ISSN 2222-2863 (Online) Vol.6, No.6, 2015 15 splits the traffic using has functions and the detector detects the anomaly from carious hadoop clusters then the traffic has been divided in less traffic lines how ever author has not applied to store the data back to original data sets which will be lost vice versa. 3. Background Study 3.1 Big Data 3.1.1 Architecture Figure 3.1.1. Bigdata [ 9] Big data is a distributed architecture for storing large amount of data ,According to a research recently the online data has increased in size CERN research says that data without operating online. For example, “will produce roughly 15 peta bytes (15 million gigabytes) of data annually – enough to fill more than 1.7 million dual-layer DVDs a year!” [11 ] Bigdata architecture consists following three segments • Storage System • Processing • Analysis Figure 3.1.1 BigData system [9] 3.2 What is NOSQL? No SQL means an alternative of traditional database system the term was generated ny a scientist named Eric Evans in Sanfransisco. The nosql databases have variety of different database systems and they provides the data manipulation as well as low time in reading and writing Many large organizations have their own NoSql Databases as BIG Table in Google which has much effect on the no sql The whole point is that they provides alternatives to the traditional databases product. For Example the many nosql products are available in the market and they are widely used by many companies .
  • 6. Computer Engineering and Intelligent Systems www.iiste.org ISSN 2222-1719 (Paper) ISSN 2222-2863 (Online) Vol.6, No.6, 2015 16 3.2.1 Mongo DB MongoDb is the nosql type database which is document oriented , the organization which developed it is 10gen ,mongo word means large size it is very fast and reliable it is written in C++ it is also used to store large files over a distribute location Also we can store binary data like images , videos , mp3 files. 3.2.1.1: Query Model Queries for MongoDB are bidding in a JSON like syntax and are forward to MongoDB as BSON altar by the database driver. The concern archetypal of MongoDB allows queries over all abstracts central a collection, including the anchored altar and arrays. Through the acceptance of predefined indexes queries can dynamically be formulated during runtime. Not all aspects of a concern are formulated aural a concern accent in MongoDB, depending on the MongoDB disciplinarian for a programming language, some things may be bidding through the abracadabra of a adjustment of the driver. { “employees":[{ "firstName":"Manmit","lastName":"Zala" }, {"firstName":"Pradip" , "lastName":"Chavda" },{"firstName":"Nilay", "lastName":"Parekh" }]} The query model supports the following features: 1. Queries over documents and embedded subdocuments 2. Comparators (<;_;_;>) 3. Conditional operators (equals, not equals, exists, in, not in ...) 4. Logical perators: AND 5. Sorting by multiple Attributes 6. Group by 3.2.1.2: Sharding Mongodb harding means the components of mongodb it is also uses following components : 1) Configuration servers 2 ) Shard nodes 3) Services for Routing These are known as mongos . 3.3Apache Hadoop Apache Hadoop is java based programming framework which is used for processing large data sets in distributed computer environment. Hadoop is used in system where multiple nodes are present which can process terabytes of data hadoop uses its own file system HDFS which facilitates fast transfer of data which can sustain node failure and avoid system failure as whole.[1] 3.3.1 Architecutre Figure 3.3.1 HDFS ARCHITECTURE [14] 3.4 Map Reduce Algorithm and Approaches Map/Reduce is a programming paradigm that was made popular by Google where a task is divided into small
  • 7. Computer Engineering and Intelligent Systems www.iiste.org ISSN 2222-1719 (Paper) ISSN 2222-2863 (Online) Vol.6, No.6, 2015 17 portions and distributed to a large number of nodes for processing (map), and the results are then summarized into the final answer (reduce). Hadoop also uses Map/Reduce for data processing. Hence different functions for the processing are written in the form of Hadoop job. A Hadoop job consists of mapper and reducer functions framework i.e. STS is used as the integrated development environment (IDE) [1]. 3.4.1Map reduce with Hadoop A .Many company uses Hadoop for big data analysis ,For example facebook use HIVE with Hadoop [1] B. Yelp :uses AWS and Hadoop Yelp uses Amazon S3 to store daily logs and photos, generating around 100GB of logs per day. The company also uses Amazon Elastic Map Reduce to power approximately 20 separate batch scripts, most of those processing the logs. Features powered by Amazon ElasticMapReduce include:[1] 1. People Who Viewed this Also Viewed 2. Review highlights 3. Auto complete as you type on search 4. Search spelling suggestions 5. Top searches 3.4.2 Map Reduce using MongoDb Mongo db uses key value pair type of storage The Map primitive consists in processing a data list in order to create key/value pairs. Then, the Reduce primitive will process each pair in order to create new aggregated key/value pairs.[5] Example[5]: map(k1, v1) = list(k2, v2) ………...(1) reduce(k2, list(v2)) = list(v3) …………………...(2) List : (a; 2)(a; 4)(b; 4)(c; 5)(b; 2)(a; 1)………… (3) After mapping : (a; [2, 4, 1]), (b; [4, 2]), (c[5]) …(4) After reducing : (a; 7), (b; 6), (c; 5) …………….(5) Equations (1) , (2) show both map and reduce primitives. Figure 3.5.2 Map reduce example[5] 4. Conclusion Now a day data increases day by day the storage, retrieval and analysis of bigdata in structured databases like Oracle and Mysql is not possible so we have presented many Nosql system among them Mongo db is preferable for as an alternate for Mysql , still it is an Active search are for data mining to mine knowledge from bigdata.In future we are interested in batter method and system for efficient mining of bigdata. 5. References [1] Jyoti Nandimath , Ankur Patil , Ekata Banerjee , Pratima Kakade :”Big Data Analysis using Apache Hadoop “ In SKNCOE Pune India,2013 [2] E. Dede, M. Govindaraju ,D. Gunter, R. Canon, L. Ramakrishnan “Performance Evaluation of a MongoDB and HadoopPlatform for Scientific Data Analysis” In Lawrence Berekely National Lab Berkeley, CA 94720 [3] Matthew Smith, Christian Szongott,Benjamin Henne, Gabriele von Voigt” Big Data Privacy Issues in Public Social Media”,2013 [4] Ramon Lawrence :“Integration and Virtualization of Relational SQL and NoSQL Systems including MySQL and MongoDB”At 2014 International Conference on Computational Science and Computational Intelligence,2014 [5] Laurent Bonne, Anne Laurent, Michel Sala,Benedicte Laurent,Nicolas Sicard:” REDUCE, YOU SAY: What NoSQL can do for Data Aggregation and BI in Large Repositories “In 2011 22nd International Workshop on Database and Expert Systems Applications,2011
  • 8. Computer Engineering and Intelligent Systems www.iiste.org ISSN 2222-1719 (Paper) ISSN 2222-2863 (Online) Vol.6, No.6, 2015 18 [6] Alexandru Boicea, Florin Radulescu, Laura Ioana Agapin” MongoDB vs Oracle - database comparison “2012 Third International Conference on Emerging Intelligent Data and Web Technologies,2012 [7] Suyog S. Nyati, Shivanand Pawar ,Rajesh Ingle: ” Performance Evaluation of Unstructured NoSQL data over distributed framework” 2013 International Conference on Advances in Computing, Communications and Informatics (ICACCI),2013 [8] Lior Okman,Nurit Gal-Oz,, Yaron Gonen,, Ehud Gudes ,Jenny Abramov:” Security Issues in NoSQL Databases” 2011 International Joint Conference of IEEE TrustCom-11/IEEE ICESS-11/FCST-11,2011 [9] Chanchal Yadav, Shuliang Wang , Manoj Kumar: Algorithm and approaches to handle large Data- A Survey. IJCSN International Journal of Computer Science and Network, Vol 2, Issue 3, 2013 [10] Ruxandra Burtica, Eleonora Maria Mocanu, Mugurel Ionut Andreica, Nicolae Tapus: Practical application and evaluation of no-SQL databases in Cloud Computing , ©2012 IEEE [11] Romain Fontugne,,Johan Mazel, Kensuke FukudaHashdoop: A MapReduce Framework for NetworkAnomaly DetectionCERN – European organization for nuclear Research 2014 IEEE INFOCOM Workshops: 2014 IEEE INFOCOM Workshop on Security and Privacy in Big Data,2014 WEBREFERENCES [12] Bigdata-Wikipedia : http://en.wikipedia.org/wiki/Big_data [13] Big Data Characteristics : http://en.wikipedia.org/wiki/Big_data#Characteristics [14]Hadoop Architecture:http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html BOOKS [15]By Jiawei Han And Micheline Kamber,Data Mining Concept and Techniques, Copyright 2006, Second Edition,pp 5-9 ,119-145.
  • 9. Business, Economics, Finance and Management Journals PAPER SUBMISSION EMAIL European Journal of Business and Management EJBM@iiste.org Research Journal of Finance and Accounting RJFA@iiste.org Journal of Economics and Sustainable Development JESD@iiste.org Information and Knowledge Management IKM@iiste.org Journal of Developing Country Studies DCS@iiste.org Industrial Engineering Letters IEL@iiste.org Physical Sciences, Mathematics and Chemistry Journals PAPER SUBMISSION EMAIL Journal of Natural Sciences Research JNSR@iiste.org Journal of Chemistry and Materials Research CMR@iiste.org Journal of Mathematical Theory and Modeling MTM@iiste.org Advances in Physics Theories and Applications APTA@iiste.org Chemical and Process Engineering Research CPER@iiste.org Engineering, Technology and Systems Journals PAPER SUBMISSION EMAIL Computer Engineering and Intelligent Systems CEIS@iiste.org Innovative Systems Design and Engineering ISDE@iiste.org Journal of Energy Technologies and Policy JETP@iiste.org Information and Knowledge Management IKM@iiste.org Journal of Control Theory and Informatics CTI@iiste.org Journal of Information Engineering and Applications JIEA@iiste.org Industrial Engineering Letters IEL@iiste.org Journal of Network and Complex Systems NCS@iiste.org Environment, Civil, Materials Sciences Journals PAPER SUBMISSION EMAIL Journal of Environment and Earth Science JEES@iiste.org Journal of Civil and Environmental Research CER@iiste.org Journal of Natural Sciences Research JNSR@iiste.org Life Science, Food and Medical Sciences PAPER SUBMISSION EMAIL Advances in Life Science and Technology ALST@iiste.org Journal of Natural Sciences Research JNSR@iiste.org Journal of Biology, Agriculture and Healthcare JBAH@iiste.org Journal of Food Science and Quality Management FSQM@iiste.org Journal of Chemistry and Materials Research CMR@iiste.org Education, and other Social Sciences PAPER SUBMISSION EMAIL Journal of Education and Practice JEP@iiste.org Journal of Law, Policy and Globalization JLPG@iiste.org Journal of New Media and Mass Communication NMMC@iiste.org Journal of Energy Technologies and Policy JETP@iiste.org Historical Research Letter HRL@iiste.org Public Policy and Administration Research PPAR@iiste.org International Affairs and Global Strategy IAGS@iiste.org Research on Humanities and Social Sciences RHSS@iiste.org Journal of Developing Country Studies DCS@iiste.org Journal of Arts and Design Studies ADS@iiste.org
  • 10. The IISTE is a pioneer in the Open-Access hosting service and academic event management. The aim of the firm is Accelerating Global Knowledge Sharing. More information about the firm can be found on the homepage: http://www.iiste.org CALL FOR JOURNAL PAPERS There are more than 30 peer-reviewed academic journals hosted under the hosting platform. Prospective authors of journals can find the submission instruction on the following page: http://www.iiste.org/journals/ All the journals articles are available online to the readers all over the world without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. Paper version of the journals is also available upon request of readers and authors. MORE RESOURCES Book publication information: http://www.iiste.org/book/ IISTE Knowledge Sharing Partners EBSCO, Index Copernicus, Ulrich's Periodicals Directory, JournalTOCS, PKP Open Archives Harvester, Bielefeld Academic Search Engine, Elektronische Zeitschriftenbibliothek EZB, Open J-Gate, OCLC WorldCat, Universe Digtial Library , NewJour, Google Scholar