Blending Big Data and Cloud - Epilepsy Global Data Research and Information System
BITS ZG629T: Thesis
by
AnupSingh
2012HZ12707
Thesis work carried out at
Tata Consultancy Services Limited, LCH.Clearnet Limited,
Investec Bank Plc London,
Birmingham Cancer Research Institute, United Kingdom
Submitted in fulfillment of M.S. by Research - Software Systems
Under the Supervision of
Sandeep Patil, Researcher in NASA, Arlington University,
Ex. BARC Sr. Scientist
Kalwar Shivram, Project Manager,
Tata Consultancy Services Limited, San Jose, United States
Professor B.M. Deshpande, bmd@goa.bits-pilani.ac.in
BIRLA INSTITUTE OF TECHNOLOGY & SCIENCE
PILANI (RAJASTHAN)
April, 2014
ABSTRACT
Epilepsy is one of the most common neurological disorders, affecting 65 million people
worldwide. While medications and other treatments help many people of all ages who
live with epilepsy, more than a million people continue to have seizures that can
severely limit their school achievements, employment prospects and participation in all
of life's experiences. It strikes most often among the very young and the very old,
although anyone can develop epilepsy at any age. Its prevalence is greater than autism
spectrum disorder, cerebral palsy, multiple sclerosis and Parkinson's disease combined.
Despite how common it is and major advances in diagnosis and treatment, epilepsy is
among the least understood of major chronic medical conditions, even though one in
three adults knows someone with the disorder. The Epilepsy Global Data Research and
Information System aims to leverage Big Data, Cloud Computing and data warehouse
features to build a global system that helps doctors and neurosurgeons use shared
information and methodologies to treat children and adults worldwide.
Objectives
• Build a federated database of medical information and services that serves as a
platform for medical research into neurological cases of epilepsy.
• Provide access to very large data sets on patients with different neurological
disorders, helping researchers, doctors and surgeons to make efficient decisions and
share their experiences.
• Deliver the best treatment to children and adults all over the world.
• Enrich and enhance the system's knowledge base so as to stimulate new questions
about epilepsy and its symptoms and, ultimately, lead to fruitful answers on its
treatment.
• Harness the supercomputing power and capabilities of Big Data and Cloud Computing.
Broad Academic Area of Work: Cloud Computing, Big Data, Datawarehouse.
Key words: Hadoop, Twitter Apps, Spring XD, HBASE, HDFS, MapReduce, Hue, Hive,
Pig, HCatalog, JSON Serde, Flume.
ACKNOWLEDGEMENTS
I would like to express my sincere gratitude and deep regards to my supervisor and
additional examiner for their constant motivation, monitoring and guidance throughout
the course of this dissertation work. This is indeed a new beginning for professionals
like us to extend technology beyond boundaries in healthcare. Their blessing, guidance
and help enabled me to begin this journey.
My prime motivation behind this dissertation is my loving nephew Aakash, who has been
treated for epilepsy for the past seven years, and all children over the world. My
sincere regards and appreciation are extended to Dr. Vrajesh Udani and the hospital
staff of Hinduja Hospital, Mumbai, and to Dr. Neeta Ajit Naik, Sion, Mumbai, who are
pioneers in treating epileptic children in India.
I would also like to thank my family for motivating me to build this. It would not have
been possible without their constant support and help.
Indeed, we have a long way to go beyond this.
AnupSingh
TABLE OF CONTENTS
1. Introduction: Understanding the power of Big Data, Cloud features
2. Feasibility Study and Analysis of Algorithms, Application Methodologies
3. Architecture Design of the System
4. Cloud Design of the Epilepsy Global Data Centre
5. Data Storage Structure and Query Processing in HDFS and HBASE
6. Use Cases Overview
7. Conclusion and Recommendations
BIRLA INSTITUTE OF TECHNOLOGY & SCIENCE, PILANI
WORK-INTEGRATED LEARNING PROGRAMMES DIVISION
Second Semester 2013-2014
Introduction: Understanding the power of Big Data, Cloud features
Analysis of large volumes of data in fields such as epilepsy, cardiac disease, genetics
and neuroimaging, across groups of individuals with shared and variable characteristics,
remains poorly approached and poorly understood. The significant challenges of storing,
accessing and building such data sets accurately, and of running complex computations on
them, cannot be met with traditional data warehouse methods.
Globally as well as locally, many families in rural and urban areas, and even modern,
sophisticated hospitals, are unaware of the different types of diseases, their symptoms,
medicines and health care solutions. Sharing a structured and unstructured knowledge
base among researchers, neurologists, doctors, associates and parents is a must. This
calls for a dedicated scientific environment and automated software applications,
delivered at reduced cost. For epilepsy among children, the gap needs to be bridged by
leveraging the technological revolution to predict outcomes and find new, improved ways
of cure.
Mature methodologies such as Kimball's approach, the Enterprise-Wide Data Warehouse
(EDW), traditional RDBMS and ETL/ELT are insufficient for the huge amount of epileptic
data. Over the years, terabytes to petabytes to zettabytes of unused data have
accumulated that can be transformed, utilised and reengineered to devise new findings
to cure epilepsy. We need better techniques for data access, data storage and data
structures.
Big Data environments create the opportunity to ease some of the rigidity of ETL-driven
data integration processes. The nature of big data requires that the infrastructure for
this process can scale cost-effectively. Hadoop and MongoDB have emerged as standard
solutions for managing big data. Big Data refers to the large amounts, at least terabytes,
of poly-structured data that flows continuously through and around organizations,
including video, text, sensor logs, and transactional records.
Rapidly ingesting, storing, and processing big data requires a cost effective
infrastructure that can scale with the amount of data and the scope of analysis. Hadoop
has rapidly emerged as the de facto standard for managing large volumes of
unstructured data.
Hadoop is an open source distributed software platform for storing and processing data.
Written in Java, it runs on a cluster of industry-standard servers configured with direct-
attached storage. Using Hadoop, you can store petabytes of data reliably on tens of
thousands of servers while scaling performance cost-effectively by merely adding
inexpensive nodes to the cluster.
Cloud computing has emerged as a viable alternative to the acquisition and
management of physical or software resources. Scientific applications are being ported
to clouds to build on their inherent elasticity and scalability. Such an application
needs to run in parallel on a large set of resources in order to achieve reasonable
execution times. Cloud platforms such as Amazon Web Services, Azure and Cloudera are an
interesting option to tackle this problem. They provide high-performance cloud
computing infrastructure for handling the variability of epileptic "Big Data", along
with simplified and optimized deployment configurations.
We will be using Amazon Web Services (AWS) to blend the features of Big Data and
Cloud Computing.
Feasibility Study and Analysis of Algorithms, Application Methodologies
Assumptions: Representing all the features of Big Data and Cloud is out of scope; they
can be taken up as separate research areas in epilepsy and other healthcare problems.
We will use Amazon EMR with the Hortonworks Distribution for Hadoop.
Amazon EMR makes it easy to provision and manage Hadoop in the AWS Cloud. Hadoop is
available in multiple distributions, and Amazon EMR gives you the option of using the
Amazon Distribution or the Hortonworks Distribution for Hadoop. Hortonworks delivers on
the promise of Hadoop with a proven, enterprise-grade platform that supports a broad set
of mission-critical and real-time production uses. Hortonworks brings dependability,
ease of use and speed to Hadoop, NoSQL, database and streaming applications in one
unified Big Data platform. Hortonworks is used across financial services, retail, media,
healthcare, manufacturing, telecommunications and government organizations.
Hadoop for Big Data and Cloud
Central to the scalability of Hadoop is the distributed processing framework known as
MapReduce.
MapReduce, the programming paradigm implemented by Hadoop, breaks up a batch job
into many smaller tasks for parallel processing on a distributed system. HDFS, the
distributed file system, stores the data reliably.
MapReduce helps programmers solve data-parallel problems for which the data set can
be sub-divided into small parts and processed independently. MapReduce is an
important advance because it allows ordinary developers, not just those skilled in high-
performance computing, to use parallel programming constructs without worrying about
the complex details of intra-cluster communication, task monitoring, and failure
handling. MapReduce simplifies all that. The system splits the input data-set into
multiple chunks, each of which is assigned a map task that can process the data in
parallel.
Each map task reads the input as a set of (key, value) pairs and produces a transformed
set of (key, value) pairs as the output. The framework shuffles and sorts outputs of the
map tasks, sending the intermediate (key, value) pairs to the reduce tasks, which group
them into final results. MapReduce uses JobTracker and TaskTracker mechanisms to
schedule tasks, monitor them, and restart any that fail. The Hadoop platform also
includes the Hadoop Distributed File System (HDFS), which is designed for scalability
and fault tolerance. HDFS stores large files by dividing them into blocks (usually 64 or
128 MB) and replicating the blocks on three or more servers. HDFS provides APIs for
MapReduce applications to read and write data in parallel. Capacity and performance can
be scaled by adding Data Nodes, and a single NameNode mechanism manages data
placement and monitors server availability. HDFS clusters in production use today
reliably hold petabytes of data on thousands of nodes.
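The map, shuffle and reduce phases described above can be sketched in plain Java. This is a single-process illustration under stated assumptions, not the Hadoop API: the record layout "patientId,country", the class name MapReduceSketch and its method names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process illustration of map -> shuffle/sort -> reduce.
// The record layout "patientId,country" is a hypothetical example.
final class MapReduceSketch {

    // Map phase: turn one input record into a (country, 1) pair.
    static Map.Entry<String, Integer> map(String record) {
        String country = record.split(",")[1].trim();
        return Map.entry(country, 1);
    }

    // Shuffle phase: group the intermediate values by key, sorted by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: collapse all values of one key into a single result.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    // Runs the three phases in order and returns the number of cases per country.
    static Map<String, Integer> countByCountry(List<String> records) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String record : records) pairs.add(map(record));
        Map<String, Integer> result = new TreeMap<>();
        shuffle(pairs).forEach((country, ones) -> result.put(country, reduce(ones)));
        return result;
    }

    public static void main(String[] args) {
        List<String> records = List.of("P1,India", "P2,UK", "P3,India", "P4,US");
        System.out.println(countByCountry(records));
    }
}
```

On a real cluster each map call would run on the node holding the input block and the shuffle would move pairs across the network; here all three phases run in one JVM only to make the data flow visible.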
In addition to MapReduce and HDFS, Hadoop includes many other components, some of
which are very useful for ETL.
• Flume is a distributed system for collecting, aggregating, and moving large amounts
of data from multiple sources into HDFS or another central data store. Enterprises
typically collect log files on application servers or other systems and archive the log files
in order to comply with regulations. Being able to ingest and analyze that unstructured
or semi-structured data in Hadoop can turn this passive resource into a valuable asset.
Spring XD is a system similar to Flume.
• Sqoop is a tool for transferring data between Hadoop and relational databases. You
can use Sqoop to import data from a MySQL or Oracle database into HDFS, run
MapReduce on the data, and then export the data back into an RDBMS. Sqoop
automates these processes, using MapReduce to import and export the data in parallel
with fault-tolerance.
• Hive and Pig are programming languages that simplify development of applications
employing the MapReduce framework. HiveQL is a dialect of SQL and supports a subset
of the syntax. Although slow, Hive is being actively enhanced by the developer
community to enable low-latency queries on HBase and HDFS. Pig Latin is a procedural
programming language that provides high-level abstractions for MapReduce. You can
extend it with User Defined Functions written in Java, Python, and other languages.
• ODBC/JDBC Connectors for HBase and Hive are often proprietary components included
in distributions for Hadoop software. They provide connectivity with SQL applications by
translating standard SQL queries into HiveQL commands that can be executed upon
the data in HDFS or HBase.
• YARN provides cluster resource management capabilities to enable multiple data
processing engines with multiple workloads & applications across a single clustered
environment.
Thus Hadoop is a powerful platform for big data storage and processing.
Architecture Design of the System
Hadoop receives structured and unstructured input data from different sources, such as
hospitals, healthcare data on vaccines, social media and information documents.
The components listed in the feasibility section form the core of the platform, with
HDFS data nodes that can be scaled out for storage.
The output is a set of application layers built on the collated epileptic data: audio,
video, documents, research publications and collaboration-forum information from social
media. Data science can also be applied to identify new research areas, make
predictions and produce analytical reports.
Figure: Architecture Design of the System. Source blocks: Hospitals and Epileptic
Patient's Data; Files (Epileptic Cases, Scenarios); Social Media; Healthcare (Worldwide
Epileptic Vaccines, Instruments); Information. Each source is connected through ETL to
the HDFS Data Nodes, which feed Epilepsy Information and Knowledge Sharing and Advanced
Analytics.
Cloud Design of the Epilepsy Global Data Centre
Figure: Epilepsy Global Data Centres Leveraging Cloud Computing Features. Countries
shown: Pakistan, UK, India, US, Malaysia, Sri Lanka.
The cloud is core to providing infrastructure as a service (IAAS) to the Epilepsy
Global Data Centre across the world. Because the volume, variety and velocity of the
data are huge, the system can be scaled up automatically based on our data needs. The
overhead of maintenance, upgrades, version management and services such as Hadoop, mail
and reporting lies with the cloud provider. Information on epilepsy can be shared across
different countries. We can create customised "Epilepsy Data as a Service" offerings for
clinical research, hospitals, doctors, neuroscientists and social media. Very large data
volumes, potentially petabytes and beyond, can be stored. However, the cloud framework,
network portability and the legal requirements of different countries will hold the key.
The cloud can also provide extra capacity for an existing cluster or an environment for
testing Hadoop applications. Moreover, Hortonworks Data Platform (HDP) 2.0 features
NameNode High Availability, which automates failover and helps keep the full HDP stack
available. The cloud also supports multiple database platforms, whether MySQL, Oracle,
SQL Server or others, and reporting tools such as Jasper, SAP Business Objects,
MicroStrategy and QlikView can interface with Hadoop. The cloud is certainly a multi-use
platform when coupled with Big Data. Hadoop in the cloud makes a great deal of sense:
the elastic resource allocation that cloud computing is premised on works well for
cluster-based data processing infrastructure used on varying analyses and data sets of
indeterminate size.
Data Storage Structure and Query Processing in HDFS and HBASE
Figure: Data Storage Structure and Query Processing Flow in Hadoop Distributed File System (HDFS) and HBASE
HDFS is a distributed file system that is well suited for the storage of large files.
Data in HDFS is organized into files and directories, but it is physically stored as
blocks spread across the data nodes, so we cannot browse it with the usual dir or file
explorer commands of the local operating system; we go through the hadoop fs commands
or the FileSystem API instead. Files are divided into uniform-sized blocks and
distributed across cluster nodes. Blocks are replicated to handle hardware failure. HDFS
keeps checksums of data for corruption detection and recovery. Depending on the
configuration, files are broken into blocks of, for example, 128 MB, and the block size
can be configured per file. The namenode manages the file namespace and enforces file
permissions. It collects block reports from the datanodes to track block locations, and
it re-replicates missing blocks to other datanodes in case of failures. Each datanode
handles the storage of thousands of blocks, which it keeps as files in the underlying
OS's file system. Clients access the blocks directly from the datanodes, based on the
metadata read from the namenode. MapReduce uses the FileSystem interface, hence it can
run on multiple file systems. The HDFS storage structure is depicted below.
Figure: Hadoop Distributed File System Storage Structure (the namenode holds the metadata)
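The block and replication arithmetic behind this storage structure can be worked through with a small sketch. The 1000 MB file is a hypothetical example, and the class and method names are our own; the 128 MB block size and the replication factor of three are the values mentioned in this chapter.

```java
// Back-of-the-envelope HDFS storage arithmetic.
final class HdfsBlockMath {
    static final long MB = 1024L * 1024L;

    // Number of blocks needed for a file: ceiling of fileSize / blockSize.
    static long blockCount(long fileSizeBytes, long blockSizeBytes) {
        if (fileSizeBytes == 0) return 0;
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    // Raw cluster storage consumed once every block has been replicated.
    // The last block only occupies its actual length, so no padding is added.
    static long rawStorageBytes(long fileSizeBytes, int replication) {
        return fileSizeBytes * replication;
    }

    public static void main(String[] args) {
        long fileSize = 1000 * MB;   // hypothetical 1000 MB archive of EEG recordings
        long blockSize = 128 * MB;
        System.out.println("blocks: " + blockCount(fileSize, blockSize));
        System.out.println("raw MB: " + rawStorageBytes(fileSize, 3) / MB);
    }
}
```

With these numbers the file occupies eight blocks (seven full ones and a partial one) and consumes 3000 MB of raw cluster storage, which is the figure to budget for when sizing data nodes.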
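Sample Java code to read the files in HDFS
package org.myorg;
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
// Prints the status and the contents of every file under /hdfs/epilepsycases.
public class cat {
    public static void main(String[] args) throws IOException {
        // The Configuration picks up the cluster settings found on the classpath.
        FileSystem fs = FileSystem.get(new Configuration());
        try {
            FileStatus[] status = fs.listStatus(new Path("/hdfs/epilepsycases"));
            for (FileStatus file : status) {
                // The status line includes path, block size, replication and permissions.
                System.out.println(file);
                // try-with-resources closes the stream even if reading fails.
                try (BufferedReader br = new BufferedReader(new InputStreamReader(
                        fs.open(file.getPath()), StandardCharsets.UTF_8))) {
                    String line;
                    while ((line = br.readLine()) != null) {
                        System.out.println(line);
                    }
                }
            }
        } catch (FileNotFoundException e) {
            System.err.println("File not found: " + e.getMessage());
        }
    }
}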
[root@sandbox /]# hadoop jar epilepsy_case_files.jar org.myorg.cat > epilepsy_case_files.txt
In the output we can see the namenode path, block size, replication factor and
permissions of each file.
HBase is designed as a column store, a more advanced form of a key-value database in
which the keys and values become composite. Think of it as a hash map crossed with a
multidimensional array. Data is grouped and stored by column rather than by row. It is
ideally suited to semi-structured data, on which MapReduce is very often used. Cells are
addressed by row key and column name, so lookups are naturally indexed and the store
scales out horizontally. An HBase table can hold around 500 sparsely filled columns
where an RDBMS table with a hundred columns would already be unwieldy. However, it is
unsuited to complex reads such as joins. HDFS on its own does not provide fast lookups
of individual records in files. HBase, on the other hand, is built on top of HDFS and
provides fast record lookups (and updates) for large tables. This can sometimes be a
point of conceptual confusion. HBase internally puts your data in indexed "StoreFiles"
that exist on HDFS for high-speed lookups. A sample HBASE storage structure in contrast
to an SQL RDBMS table is depicted below.
Figure: HBASE Storage Structure using Key Value Pair and SQL RDBMS Storage Structure
HBASE: Key = PatientID_Country; Value (Column Family: CF_Data) = Firstname_lastname, Doctorname_hospitalname, Evaluation_date_Observations
SQL (RDBMS): Primary Key = PatientID, Country; Table Columns = FirstName, Lastname, DoctorName, HospitalName, Surgical EvaluationDate, Evaluation/Observations
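The "hash map crossed with a multidimensional array" picture can be made concrete with nested sorted maps. This is only an in-memory model of the idea and not the HBase client API; the class ColumnStoreSketch, its methods and the sample row keys are hypothetical, while the column family CF_Data and the PatientID_Country key layout follow the figure.

```java
import java.util.Map;
import java.util.TreeMap;

// In-memory model of an HBase-style table: rowKey -> "family:qualifier" -> value.
final class ColumnStoreSketch {
    private final TreeMap<String, TreeMap<String, String>> rows = new TreeMap<>();

    // Writes one cell; a row only stores the columns that were actually set.
    void put(String rowKey, String family, String qualifier, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(family + ":" + qualifier, value);
    }

    // Point lookup by row key and column; returns null when the cell is absent.
    String get(String rowKey, String family, String qualifier) {
        Map<String, String> row = rows.get(rowKey);
        return row == null ? null : row.get(family + ":" + qualifier);
    }

    // Range scan: rows are kept sorted by key, so a key prefix selects a contiguous slice.
    Map<String, TreeMap<String, String>> scanPrefix(String prefix) {
        return rows.subMap(prefix, true, prefix + Character.MAX_VALUE, false);
    }

    public static void main(String[] args) {
        ColumnStoreSketch table = new ColumnStoreSketch();
        table.put("P100_IN", "CF_Data", "Firstname_lastname", "Patient One");
        table.put("P101_UK", "CF_Data", "Firstname_lastname", "Patient Two");
        table.put("P200_IN", "CF_Data", "Doctorname_hospitalname", "Doctor A, Hospital X");
        System.out.println(table.get("P100_IN", "CF_Data", "Firstname_lastname"));
        System.out.println(table.scanPrefix("P1").keySet());
    }
}
```

Because rows are sorted by key, the choice of key layout decides which scans are cheap: with PatientID first, a prefix scan selects a range of patients, whereas putting the country first would make per-country scans contiguous instead.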
Use Cases
Pool social media data and analyse the information on epilepsy. This is aimed at
self-support care, locally as well as globally. In today's fast-changing world there is
a huge population on Twitter, Facebook and LinkedIn, with a common synergy and a huge
exchange of information.
Figure: Streaming and Analysing Social Media Data Flow in Hadoop. Blocks and labels:
Epilepsy Social Media (Twitter App), Stream App Data, XD Engine, Ingest Data,
Hadoop - HDFS, Parse Unstructured Data (JSON Format), Analytics.
Scenario
This scenario streams unstructured data in real time from the Twitter app "Epilepsy
Social Media" and transforms it into useful information.
Step 1:
Create a collaboration forum app "Epilepsy Social Media" on Twitter:
https://dev.twitter.com/
Note down the API key, API secret, access token and access token secret. We need these
keys to stream information from Twitter. Once we have the keys, we configure the XD
engine installed on the Hadoop server.
Step 2:
Log in to the Spring XD engine in a shell separate from Hadoop. From the XD shell, test
whether HDFS is accessible:
hadoop fs ls /
It should display some files and directories
Step 3
Create the tweet stream for the collaboration forum in Spring XD:
stream create --name epilepsytweets --definition "twitterstream --track='epilepsysociety, epilepsy society' | hdfs"
Step 4
Check whether the stream is writing files under the XD directory in HDFS:
hadoop fs -ls /xd/epilepsytweets
The posted tweets are listed in the files in this directory.
JSON Data Format
{"created_at":"Wed Mar 19 19:33:25 +0000 2014","id":446368866097065984,"id_str":"446368866097065984","text":"@epilepsysociety Hi we should build some ideas and come together to create awareness on epilepsy many countries mothers and fathers dont knw","source":"web","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":87454049,"in_reply_to_user_id_str":"87454049","in_reply_to_screen_name":"epilepsysociety","user":{"id":2387686938,"id_str":"2387686938","name":"AnupSingh","screen_name":"anupsingh4u","location":"","url":null,"description":null,"protected":false,"followers_count":4,"friends_count":8,"listed_count":0,"created_at":"Thu Mar 13 19:00:48 +0000 2014","favourites_count":0,"utc_offset":null,"time_zone":null,"geo_enabled":false,"verified":false,"statuses_count":8,"lang":"en-gb","contributors_enabled":false,"is_translator":false,"is_translation_enabled":false,"profile_background_color":"C0DEED","profile_background_image_url":"http://abs.twimg.com/images/themes/theme1/bg.png","profile_background_image_url_https":"https://abs.twimg.com/images/themes/theme1/bg.png","profile_background_tile":false,"profile_image_url":"http://abs.twimg.com/sticky/default_profile_images/default_profile_0_normal.png","profile_image_url_https":"https://abs.twimg.com/sticky/default_profile_images/default_profile_0_normal.png","profile_link_color":"0084B4","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"default_profile":true,"default_profile_image":true,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"symbols":[],"urls":[],"user_mentions":[{"screen_name":"epilepsysociety","name":"epilepsy society","id":87454049,"id_str":"87454049","indices":[0,16]}]},"favorited":false,"retweeted":false,"filter_level":"medium","lang":"en"}
{"created_at":"Wed Mar 19 20:07:31 +0000 2014","id":446377448163143680,"id_str":"446377448163143680","text":"I'm fundraising for Epilepsy Society &amp; I'd love your support! Text HERB49 \u00a32 to 70070 to sponsor me today. Thanks. http://t.co/C74muxXk9P","source":"\u003ca href=\"http://twitter.com/tweetbutton\" rel=\"nofollow\"\u003eTweet Button\u003c/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":98352324,"id_str":"98352324","name":"Steven Herbert","screen_name":"sherbie40","location":"chepstow","url":null,"description":"Play the guitar til your fingers bleed, quoted by Ted Nugent..\n\nLifes to short get on with it...","protected":false,"followers_count":43,"friends_count":107,"listed_count":1,"created_at":"Mon Dec 21 11:14:17 +0000 2009","favourites_count":1,"utc_offset":0,"time_zone":"London","geo_enabled":true,"verified":false,"statuses_count":119,"lang":"en","contributors_enabled":false,"is_translator":false,"is_translation_enabled":false,"profile_background_color":"C0DEED","profile_background_image_url":"http://abs.twimg.com/images/themes/theme1/bg.png","profile_background_image_url_https":"https://abs.twimg.com/images/themes/theme1/bg.png","profile_background_tile":false,"profile_image_url":"http://pbs.twimg.com/profile_images/442675380076703744/Oje9Ifzk_normal.jpeg","profile_image_url_https":"https://pbs.twimg.com/profile_images/442675380076703744/Oje9Ifzk_normal.jpeg","profile_banner_url":"https://pbs.twimg.com/profile_banners/98352324/1394377010","profile_link_color":"0084B4","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"default_profile":true,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"symbols":[],"urls":[{"url":"http://t.co/C74muxXk9P","expanded_url":"http://www.justgiving.com/Steven-Herbert","display_url":"justgiving.com/Steven-Herbert","indices":[119,141]}],"user_mentions":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"medium","lang":"en"}
Step 5
Stop or undeploy the stream after collecting some data.
stream undeploy --name epilepsytweets
Step 6
Refine the Data using Hive
Create the tables based on the streamed data collected in Hive.
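What "refining" means for one tweet can be sketched in plain Java: a few fields are pulled out of the JSON line and laid out as a flat row. In Hive this job would normally be done by a JSON SerDe; the regex approach below is an assumption chosen only to keep the sketch free of dependencies, and the class TweetFlattener, its methods and the three chosen columns are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Pulls a few string fields out of one tweet line so it can become a table row.
// The regex reads the first occurrence of each key and does not decode
// JSON escape sequences, so it is a teaching aid rather than a JSON parser.
final class TweetFlattener {

    // Returns the raw string value of the first "key":"..." pair, or "" if absent.
    static String field(String json, String key) {
        Pattern p = Pattern.compile("\"" + Pattern.quote(key) + "\":\"((?:[^\"\\\\]|\\\\.)*)\"");
        Matcher m = p.matcher(json);
        return m.find() ? m.group(1) : "";
    }

    // One tab-separated row: created_at, screen_name, text.
    static String toRow(String json) {
        return String.join("\t",
                field(json, "created_at"), field(json, "screen_name"), field(json, "text"));
    }

    public static void main(String[] args) {
        // Abbreviated form of the first tweet collected by the stream.
        String tweet = "{\"created_at\":\"Wed Mar 19 19:33:25 +0000 2014\","
                + "\"text\":\"@epilepsysociety Hi we should build some ideas\","
                + "\"user\":{\"screen_name\":\"anupsingh4u\"},\"lang\":\"en\"}";
        System.out.println(toRow(tweet));
    }
}
```

Rows in this shape map directly onto a Hive table with one string column per field, which is the structured form that the reports are built on.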
We can see in the Hadoop interface that the tweets have been brought into a structured
format. A report can be built on top of them.
Use Cases
Collect and represent information on epilepsy types, symptoms and medicines, with their
pros and cons. Collect and represent information on neurosurgeons, the successful cases
they have handled, and their publications.
Scenario: Collect doctors' data from different hospitals and research centres
The Hive ETL script below loads the list of doctors into the warehouse.
-- Staging table with one raw CSV line per row (its definition is assumed here;
-- the script only references temp_doctor)
create table temp_doctor (col_value string);
LOAD DATA INPATH '/user/hue/Doctors_List.csv' OVERWRITE INTO TABLE temp_doctor;
create table tbl_doctor (id string, name string, age int, hospitalname string,
expertise string, publications_link string, profile_info string, country string, city string);
-- Pick each field by its position in the CSV line
insert overwrite table tbl_doctor
SELECT
regexp_extract(col_value, '^(?:([^,]*),?){1}', 1) doctor_id,
regexp_extract(col_value, '^(?:([^,]*),?){2}', 1) fullname,
regexp_extract(col_value, '^(?:([^,]*),?){10}', 1) age,
regexp_extract(col_value, '^(?:([^,]*),?){3}', 1) organisation,
regexp_extract(col_value, '^(?:([^,]*),?){11}', 1) specialisation,
regexp_extract(col_value, '^(?:([^,]*),?){8}', 1) articles_cited,
regexp_extract(col_value, '^(?:([^,]*),?){13}', 1) wiki_profile,
regexp_extract(col_value, '^(?:([^,]*),?){4}', 1) Country,
regexp_extract(col_value, '^(?:([^,]*),?){5}', 1) City
from temp_doctor;
We can customise the script based on the information received from hospitals and
research centres by changing the column positions. For example, if the specialisation
field in the list of doctors from Hinduja hospital is at position 11, we use the script
above as it is. If the specialisation field in the list of doctors from Fortis hospital
is at position 14, we change the statement to
"regexp_extract(col_value, '^(?:([^,]*),?){14}', 1) specialisation".
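How the positional pattern picks a field can be checked outside Hive, since Hive's regexp_extract follows Java regular expression syntax. The sketch below applies the same pattern with java.util.regex; the class CsvFieldExtract, the method fieldAt and the sample row are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Reproduces the field-picking regex from the Hive script with java.util.regex.
final class CsvFieldExtract {

    // Returns the field at 1-based position n of a comma-separated line.
    // The group ([^,]*) is repeated n times; group 1 keeps the last repetition.
    static String fieldAt(String line, int n) {
        Pattern p = Pattern.compile("^(?:([^,]*),?){" + n + "}");
        Matcher m = p.matcher(line);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        // Hypothetical row: id, name, hospital, country, city
        String row = "D01,Dr A,Hospital X,India,Mumbai";
        System.out.println(fieldAt(row, 3));
        System.out.println(fieldAt(row, 5));
    }
}
```

The pattern splits on every comma, so a field that itself contains a comma inside quotes would shift all later positions; such files need a proper CSV reader before loading.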
Scenario:
Build a catalog of epilepsy types and epilepsy medicines.
HCatalog provides an easy interface to upload files in different formats and set up the
data.
Scenario
Collect patients' data related to presurgical evaluation, medical history, physical
examination and lab tests. The other tables are represented below. We can have
customised ETL jobs based on each hospital's data, and we can automate this process once
we have the list of files. However, it will be essential to encrypt or mask the data
before storing it rather than reveal individual names. This is subject to the healthcare
laws of different nations. This scenario can be complemented by writing Pig scripts to
compare data on epileptic patients across different states or countries.
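One possible way to mask names before loading is to replace each name with a salted hash. The sketch below is an assumption about how this could be done, not a method prescribed by any healthcare law; the class PatientMasker, the method pseudonym and the salt value are hypothetical, and whether hashing alone is sufficient depends on the regulations of each country.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Replaces a patient name with a salted SHA-256 digest before the record is loaded.
final class PatientMasker {

    // The same name and salt always yield the same 64-character hex token,
    // so records of one patient can still be joined without exposing the name.
    static String pseudonym(String patientName, String salt) {
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha256.digest((salt + patientName).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String salt = "per-deployment-secret";   // hypothetical; keep it out of the data set
        System.out.println(pseudonym("Patient One", salt));
    }
}
```

The token is stable for a given salt, which keeps cross-hospital comparisons possible; the salt itself has to be stored separately from the data, otherwise common names could be recovered by guessing.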
Scenario
Information about events can easily be shared by email to increase awareness.
Design the job in the Oozie Editor/Dashboard.
Conclusion and Recommendations
The aim of this blend of Big Data and Cloud is to increase networking among hospitals,
doctors, adults and children, and thereby improve healthcare systems. Hadoop can host a
conventional Kimball-model data warehouse as well as a federated data warehouse. Big
Data is feasible for structured as well as unstructured data.
Since data from different testing methods and research is already available, we can
carry out data mining and make predictions on epileptic data. This will also help to
distinguish normal from abnormal patterns in people with epilepsy.
Cognitive features based on neural networks could be aimed at reading the machine output
of tests carried out on epilepsy patients. Test data and their scenarios could then be
known up front based on the parameters. Algorithms can be developed to make the system
precise and agnostic.
We can aim to build a language interpreter app that shares the epilepsy data in
different languages with target audiences across countries. This will help bridge the
language barrier in communication across the world.
Document stores such as MongoDB can be explored for CT scans, MRI and EEG recordings,
to optimise the handling of audio and video data.
Interfaces with SAP HANA, SAP Business Objects, MicroStrategy, Jasper, QlikView and
other reporting tools can be built so that graphs and data represent normal and
deviated behaviour in seizures.
List of Abbreviations
AWS - Amazon Web Services
EMR - Elastic Map Reduce
HDP - Hortonworks Data Platform
EDW - Enterprise wide Datawarehouse
HDFS - Hadoop Distributed File System
IAAS - Infrastructure As A Service
List of Figures
Hadoop Architecture
Architecture Design of the System
Epilepsy Global Data Centres Leveraging Cloud Computing Features
Data Storage Structure and Query Processing Flow in HDFS and HBASE
HDFS Storage Structure
HBASE Storage Structure
Chapman university-big-data-healthcare-finalColin McNamara
 
Cloud Computing Stats - Cloud for Healthcare
Cloud Computing Stats - Cloud for HealthcareCloud Computing Stats - Cloud for Healthcare
Cloud Computing Stats - Cloud for HealthcareRapidScale
 
Building Data Science Teams: A Moneyball Approach
Building Data Science Teams: A Moneyball ApproachBuilding Data Science Teams: A Moneyball Approach
Building Data Science Teams: A Moneyball Approachjoshwills
 
EHR Hosting Demystified - What to Look for on Your Way to the Healthcare Cloud
EHR Hosting Demystified - What to Look for on Your Way to the Healthcare CloudEHR Hosting Demystified - What to Look for on Your Way to the Healthcare Cloud
EHR Hosting Demystified - What to Look for on Your Way to the Healthcare CloudeMedApps
 
Neches And Upperman, Wiscr
Neches And Upperman, WiscrNeches And Upperman, Wiscr
Neches And Upperman, WiscrRNeches
 
Address the multidepartmental digital imaging conundrum with enterprise level...
Address the multidepartmental digital imaging conundrum with enterprise level...Address the multidepartmental digital imaging conundrum with enterprise level...
Address the multidepartmental digital imaging conundrum with enterprise level...Hitachi Vantara
 
2018 10 igneous
2018 10 igneous2018 10 igneous
2018 10 igneousChris Dwan
 
How Safe and Always-Available Data as Lifeblood Helps a University Medical Ce...
How Safe and Always-Available Data as Lifeblood Helps a University Medical Ce...How Safe and Always-Available Data as Lifeblood Helps a University Medical Ce...
How Safe and Always-Available Data as Lifeblood Helps a University Medical Ce...Dana Gardner
 

La actualidad más candente (20)

Trust threads : Active Curation and Publishing in SEAD
Trust threads : Active Curation and Publishing in SEADTrust threads : Active Curation and Publishing in SEAD
Trust threads : Active Curation and Publishing in SEAD
 
Where Technology Meets Medicine: SickKids High Performance Computing Data Centre
Where Technology Meets Medicine: SickKids High Performance Computing Data CentreWhere Technology Meets Medicine: SickKids High Performance Computing Data Centre
Where Technology Meets Medicine: SickKids High Performance Computing Data Centre
 
Map Reduce in Big fata
Map Reduce in Big fataMap Reduce in Big fata
Map Reduce in Big fata
 
Data Harmonization for a Molecularly Driven Health System
Data Harmonization for a Molecularly Driven Health SystemData Harmonization for a Molecularly Driven Health System
Data Harmonization for a Molecularly Driven Health System
 
Opportunities for HPC in pharma R&D - main deck
Opportunities for HPC in pharma R&D - main deckOpportunities for HPC in pharma R&D - main deck
Opportunities for HPC in pharma R&D - main deck
 
Random Decision Forests at Scale
Random Decision Forests at ScaleRandom Decision Forests at Scale
Random Decision Forests at Scale
 
Paul Allen Open Science
Paul Allen Open SciencePaul Allen Open Science
Paul Allen Open Science
 
Insider's Guide- The Data Protection Imperative
Insider's Guide- The Data Protection ImperativeInsider's Guide- The Data Protection Imperative
Insider's Guide- The Data Protection Imperative
 
Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)
Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)
Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)
 
Increasing the Efficiency of Workflows: Use Cases in the Life Sciences
Increasing the Efficiency of Workflows: Use Cases in the Life SciencesIncreasing the Efficiency of Workflows: Use Cases in the Life Sciences
Increasing the Efficiency of Workflows: Use Cases in the Life Sciences
 
Hanover Attains ‘Always on, Always up’ Availability
Hanover Attains ‘Always on, Always up’ AvailabilityHanover Attains ‘Always on, Always up’ Availability
Hanover Attains ‘Always on, Always up’ Availability
 
Chapman university-big-data-healthcare-final
Chapman university-big-data-healthcare-finalChapman university-big-data-healthcare-final
Chapman university-big-data-healthcare-final
 
Big Data Analysis
Big Data AnalysisBig Data Analysis
Big Data Analysis
 
Cloud Computing Stats - Cloud for Healthcare
Cloud Computing Stats - Cloud for HealthcareCloud Computing Stats - Cloud for Healthcare
Cloud Computing Stats - Cloud for Healthcare
 
Building Data Science Teams: A Moneyball Approach
Building Data Science Teams: A Moneyball ApproachBuilding Data Science Teams: A Moneyball Approach
Building Data Science Teams: A Moneyball Approach
 
EHR Hosting Demystified - What to Look for on Your Way to the Healthcare Cloud
EHR Hosting Demystified - What to Look for on Your Way to the Healthcare CloudEHR Hosting Demystified - What to Look for on Your Way to the Healthcare Cloud
EHR Hosting Demystified - What to Look for on Your Way to the Healthcare Cloud
 
Neches And Upperman, Wiscr
Neches And Upperman, WiscrNeches And Upperman, Wiscr
Neches And Upperman, Wiscr
 
Address the multidepartmental digital imaging conundrum with enterprise level...
Address the multidepartmental digital imaging conundrum with enterprise level...Address the multidepartmental digital imaging conundrum with enterprise level...
Address the multidepartmental digital imaging conundrum with enterprise level...
 
2018 10 igneous
2018 10 igneous2018 10 igneous
2018 10 igneous
 
How Safe and Always-Available Data as Lifeblood Helps a University Medical Ce...
How Safe and Always-Available Data as Lifeblood Helps a University Medical Ce...How Safe and Always-Available Data as Lifeblood Helps a University Medical Ce...
How Safe and Always-Available Data as Lifeblood Helps a University Medical Ce...
 

Destacado

Scalable Data Management Systems for Big Data
Scalable Data Management Systems for Big DataScalable Data Management Systems for Big Data
Scalable Data Management Systems for Big DataViet-Trung TRAN
 
Presentation Thesis Big Data
Presentation Thesis Big DataPresentation Thesis Big Data
Presentation Thesis Big DataNatan Meekers
 
Term paper of cse(211) avdhesh sharma c1801 a24 regd 10802037
Term paper of cse(211) avdhesh sharma c1801 a24 regd 10802037Term paper of cse(211) avdhesh sharma c1801 a24 regd 10802037
Term paper of cse(211) avdhesh sharma c1801 a24 regd 10802037Upendra Sengar
 
Big data performance management thesis
Big data performance management thesisBig data performance management thesis
Big data performance management thesisAhmad Muammar
 
Effective Meetings (Assignment)
Effective Meetings (Assignment)Effective Meetings (Assignment)
Effective Meetings (Assignment)umailaila
 
Demonstration in teaching
Demonstration in teachingDemonstration in teaching
Demonstration in teachingFamelaMelate
 
Acknowledgement
AcknowledgementAcknowledgement
Acknowledgementferdzzz
 

Destacado (10)

Scalable Data Management Systems for Big Data
Scalable Data Management Systems for Big DataScalable Data Management Systems for Big Data
Scalable Data Management Systems for Big Data
 
Presentation Thesis Big Data
Presentation Thesis Big DataPresentation Thesis Big Data
Presentation Thesis Big Data
 
Term paper of cse(211) avdhesh sharma c1801 a24 regd 10802037
Term paper of cse(211) avdhesh sharma c1801 a24 regd 10802037Term paper of cse(211) avdhesh sharma c1801 a24 regd 10802037
Term paper of cse(211) avdhesh sharma c1801 a24 regd 10802037
 
C Programming
C ProgrammingC Programming
C Programming
 
Big data performance management thesis
Big data performance management thesisBig data performance management thesis
Big data performance management thesis
 
Effective Meetings (Assignment)
Effective Meetings (Assignment)Effective Meetings (Assignment)
Effective Meetings (Assignment)
 
Demonstration in teaching
Demonstration in teachingDemonstration in teaching
Demonstration in teaching
 
Example of acknowledgment
Example of acknowledgmentExample of acknowledgment
Example of acknowledgment
 
Acknowledgement
AcknowledgementAcknowledgement
Acknowledgement
 
Theory of teaching
Theory of teachingTheory of teaching
Theory of teaching
 

Similar a Thesis blending big data and cloud -epilepsy global data research and information system

Big Data in Healthcare Made Simple Where It Stands Today and Where .pdf
Big Data in Healthcare Made Simple Where It Stands Today and Where .pdfBig Data in Healthcare Made Simple Where It Stands Today and Where .pdf
Big Data in Healthcare Made Simple Where It Stands Today and Where .pdfannamalaiagencies
 
A Review Paper on Big Data and Hadoop for Data Science
A Review Paper on Big Data and Hadoop for Data ScienceA Review Paper on Big Data and Hadoop for Data Science
A Review Paper on Big Data and Hadoop for Data Scienceijtsrd
 
Big Data Testing Using Hadoop Platform
Big Data Testing Using Hadoop PlatformBig Data Testing Using Hadoop Platform
Big Data Testing Using Hadoop PlatformIRJET Journal
 
Big Data-Survey
Big Data-SurveyBig Data-Survey
Big Data-Surveyijeei-iaes
 
Big Data Analytics in Hospitals By Dr.Mahboob ali khan Phd
Big Data Analytics in Hospitals By Dr.Mahboob ali khan PhdBig Data Analytics in Hospitals By Dr.Mahboob ali khan Phd
Big Data Analytics in Hospitals By Dr.Mahboob ali khan PhdHealthcare consultant
 
How Big Data ,Cloud Computing ,Data Science can help business
How Big Data ,Cloud Computing ,Data Science can help businessHow Big Data ,Cloud Computing ,Data Science can help business
How Big Data ,Cloud Computing ,Data Science can help businessAjay Ohri
 
Moving Health Care Analytics to Hadoop to Build a Better Predictive Model
Moving Health Care Analytics to Hadoop to Build a Better Predictive ModelMoving Health Care Analytics to Hadoop to Build a Better Predictive Model
Moving Health Care Analytics to Hadoop to Build a Better Predictive ModelDataWorks Summit
 
Lecture 5 - Big Data and Hadoop Intro.ppt
Lecture 5 - Big Data and Hadoop Intro.pptLecture 5 - Big Data and Hadoop Intro.ppt
Lecture 5 - Big Data and Hadoop Intro.pptalmaraniabwmalk
 
Big data analytics, survey r.nabati
Big data analytics, survey r.nabatiBig data analytics, survey r.nabati
Big data analytics, survey r.nabatinabati
 
Big Data Summarization : Framework, Challenges and Possible Solutions
Big Data Summarization : Framework, Challenges and Possible SolutionsBig Data Summarization : Framework, Challenges and Possible Solutions
Big Data Summarization : Framework, Challenges and Possible Solutionsaciijournal
 
Big Data Summarization : Framework, Challenges and Possible Solutions
Big Data Summarization : Framework, Challenges and Possible SolutionsBig Data Summarization : Framework, Challenges and Possible Solutions
Big Data Summarization : Framework, Challenges and Possible Solutionsaciijournal
 
BIG DATA SUMMARIZATION: FRAMEWORK, CHALLENGES AND POSSIBLE SOLUTIONS
BIG DATA SUMMARIZATION: FRAMEWORK, CHALLENGES AND POSSIBLE SOLUTIONSBIG DATA SUMMARIZATION: FRAMEWORK, CHALLENGES AND POSSIBLE SOLUTIONS
BIG DATA SUMMARIZATION: FRAMEWORK, CHALLENGES AND POSSIBLE SOLUTIONSaciijournal
 
Big Data Summarization : Framework, Challenges and Possible Solutions
Big Data Summarization : Framework, Challenges and Possible SolutionsBig Data Summarization : Framework, Challenges and Possible Solutions
Big Data Summarization : Framework, Challenges and Possible Solutionsaciijournal
 
Cloud computing in healthcare industry.pdf
Cloud computing in healthcare industry.pdfCloud computing in healthcare industry.pdf
Cloud computing in healthcare industry.pdfMobibizIndia1
 
2016.10 HPDA in Precision Medicine
2016.10 HPDA in Precision Medicine2016.10 HPDA in Precision Medicine
2016.10 HPDA in Precision MedicineMichael Atkins
 
Big Data and Enterprise Data - Oracle -1663869
Big Data and Enterprise Data - Oracle -1663869Big Data and Enterprise Data - Oracle -1663869
Big Data and Enterprise Data - Oracle -1663869Edgar Alejandro Villegas
 
REAL-TIME INTRUSION DETECTION SYSTEM FOR BIG DATA
REAL-TIME INTRUSION DETECTION SYSTEM FOR BIG DATAREAL-TIME INTRUSION DETECTION SYSTEM FOR BIG DATA
REAL-TIME INTRUSION DETECTION SYSTEM FOR BIG DATAijp2p
 

Similar a Thesis blending big data and cloud -epilepsy global data research and information system (20)

Big Data in Healthcare Made Simple Where It Stands Today and Where .pdf
Big Data in Healthcare Made Simple Where It Stands Today and Where .pdfBig Data in Healthcare Made Simple Where It Stands Today and Where .pdf
Big Data in Healthcare Made Simple Where It Stands Today and Where .pdf
 
A Review Paper on Big Data and Hadoop for Data Science
A Review Paper on Big Data and Hadoop for Data ScienceA Review Paper on Big Data and Hadoop for Data Science
A Review Paper on Big Data and Hadoop for Data Science
 
Big Data Testing Using Hadoop Platform
Big Data Testing Using Hadoop PlatformBig Data Testing Using Hadoop Platform
Big Data Testing Using Hadoop Platform
 
Bar camp bigdata
Bar camp bigdataBar camp bigdata
Bar camp bigdata
 
Big Data-Survey
Big Data-SurveyBig Data-Survey
Big Data-Survey
 
Big Data Analytics in Hospitals By Dr.Mahboob ali khan Phd
Big Data Analytics in Hospitals By Dr.Mahboob ali khan PhdBig Data Analytics in Hospitals By Dr.Mahboob ali khan Phd
Big Data Analytics in Hospitals By Dr.Mahboob ali khan Phd
 
How Big Data ,Cloud Computing ,Data Science can help business
How Big Data ,Cloud Computing ,Data Science can help businessHow Big Data ,Cloud Computing ,Data Science can help business
How Big Data ,Cloud Computing ,Data Science can help business
 
Moving Health Care Analytics to Hadoop to Build a Better Predictive Model
Moving Health Care Analytics to Hadoop to Build a Better Predictive ModelMoving Health Care Analytics to Hadoop to Build a Better Predictive Model
Moving Health Care Analytics to Hadoop to Build a Better Predictive Model
 
Lecture 5 - Big Data and Hadoop Intro.ppt
Lecture 5 - Big Data and Hadoop Intro.pptLecture 5 - Big Data and Hadoop Intro.ppt
Lecture 5 - Big Data and Hadoop Intro.ppt
 
Big data analytics, survey r.nabati
Big data analytics, survey r.nabatiBig data analytics, survey r.nabati
Big data analytics, survey r.nabati
 
On Big Data
On Big DataOn Big Data
On Big Data
 
Big Data Summarization : Framework, Challenges and Possible Solutions
Big Data Summarization : Framework, Challenges and Possible SolutionsBig Data Summarization : Framework, Challenges and Possible Solutions
Big Data Summarization : Framework, Challenges and Possible Solutions
 
Big Data Summarization : Framework, Challenges and Possible Solutions
Big Data Summarization : Framework, Challenges and Possible SolutionsBig Data Summarization : Framework, Challenges and Possible Solutions
Big Data Summarization : Framework, Challenges and Possible Solutions
 
BIG DATA SUMMARIZATION: FRAMEWORK, CHALLENGES AND POSSIBLE SOLUTIONS
BIG DATA SUMMARIZATION: FRAMEWORK, CHALLENGES AND POSSIBLE SOLUTIONSBIG DATA SUMMARIZATION: FRAMEWORK, CHALLENGES AND POSSIBLE SOLUTIONS
BIG DATA SUMMARIZATION: FRAMEWORK, CHALLENGES AND POSSIBLE SOLUTIONS
 
Big Data Summarization : Framework, Challenges and Possible Solutions
Big Data Summarization : Framework, Challenges and Possible SolutionsBig Data Summarization : Framework, Challenges and Possible Solutions
Big Data Summarization : Framework, Challenges and Possible Solutions
 
Cloud computing in healthcare industry.pdf
Cloud computing in healthcare industry.pdfCloud computing in healthcare industry.pdf
Cloud computing in healthcare industry.pdf
 
Big Data
Big DataBig Data
Big Data
 
2016.10 HPDA in Precision Medicine
2016.10 HPDA in Precision Medicine2016.10 HPDA in Precision Medicine
2016.10 HPDA in Precision Medicine
 
Big Data and Enterprise Data - Oracle -1663869
Big Data and Enterprise Data - Oracle -1663869Big Data and Enterprise Data - Oracle -1663869
Big Data and Enterprise Data - Oracle -1663869
 
REAL-TIME INTRUSION DETECTION SYSTEM FOR BIG DATA
REAL-TIME INTRUSION DETECTION SYSTEM FOR BIG DATAREAL-TIME INTRUSION DETECTION SYSTEM FOR BIG DATA
REAL-TIME INTRUSION DETECTION SYSTEM FOR BIG DATA
 

Último

Hot Call Girl In Chandigarh 👅🥵 9053'900678 Call Girls Service In Chandigarh
Hot  Call Girl In Chandigarh 👅🥵 9053'900678 Call Girls Service In ChandigarhHot  Call Girl In Chandigarh 👅🥵 9053'900678 Call Girls Service In Chandigarh
Hot Call Girl In Chandigarh 👅🥵 9053'900678 Call Girls Service In ChandigarhVip call girls In Chandigarh
 
Call Girls Hyderabad Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Hyderabad Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Hyderabad Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Hyderabad Just Call 9907093804 Top Class Call Girl Service AvailableDipal Arora
 
VIP Call Girl Sector 32 Noida Just Book Me 9711199171
VIP Call Girl Sector 32 Noida Just Book Me 9711199171VIP Call Girl Sector 32 Noida Just Book Me 9711199171
VIP Call Girl Sector 32 Noida Just Book Me 9711199171Call Girls Service Gurgaon
 
❤️♀️@ Jaipur Call Girls ❤️♀️@ Meghna Jaipur Call Girls Number CRTHNR Call G...
❤️♀️@ Jaipur Call Girls ❤️♀️@ Meghna Jaipur Call Girls Number CRTHNR   Call G...❤️♀️@ Jaipur Call Girls ❤️♀️@ Meghna Jaipur Call Girls Number CRTHNR   Call G...
❤️♀️@ Jaipur Call Girls ❤️♀️@ Meghna Jaipur Call Girls Number CRTHNR Call G...Gfnyt.com
 
(Sonam Bajaj) Call Girl in Jaipur- 09257276172 Escorts Service 50% Off with C...
(Sonam Bajaj) Call Girl in Jaipur- 09257276172 Escorts Service 50% Off with C...(Sonam Bajaj) Call Girl in Jaipur- 09257276172 Escorts Service 50% Off with C...
(Sonam Bajaj) Call Girl in Jaipur- 09257276172 Escorts Service 50% Off with C...indiancallgirl4rent
 
❤️♀️@ Jaipur Call Girls ❤️♀️@ Jaispreet Call Girl Services in Jaipur QRYPCF ...
❤️♀️@ Jaipur Call Girls ❤️♀️@ Jaispreet Call Girl Services in Jaipur QRYPCF  ...❤️♀️@ Jaipur Call Girls ❤️♀️@ Jaispreet Call Girl Services in Jaipur QRYPCF  ...
❤️♀️@ Jaipur Call Girls ❤️♀️@ Jaispreet Call Girl Services in Jaipur QRYPCF ...Gfnyt.com
 
Hot Call Girl In Ludhiana 👅🥵 9053'900678 Call Girls Service In Ludhiana
Hot  Call Girl In Ludhiana 👅🥵 9053'900678 Call Girls Service In LudhianaHot  Call Girl In Ludhiana 👅🥵 9053'900678 Call Girls Service In Ludhiana
Hot Call Girl In Ludhiana 👅🥵 9053'900678 Call Girls Service In LudhianaRussian Call Girls in Ludhiana
 
Call Girls Service Faridabad 📲 9999965857 ヅ10k NiGhT Call Girls In Faridabad
Call Girls Service Faridabad 📲 9999965857 ヅ10k NiGhT Call Girls In FaridabadCall Girls Service Faridabad 📲 9999965857 ヅ10k NiGhT Call Girls In Faridabad
Call Girls Service Faridabad 📲 9999965857 ヅ10k NiGhT Call Girls In Faridabadgragmanisha42
 
VIP Call Girls Noida Jhanvi 9711199171 Best VIP Call Girls Near Me
VIP Call Girls Noida Jhanvi 9711199171 Best VIP Call Girls Near MeVIP Call Girls Noida Jhanvi 9711199171 Best VIP Call Girls Near Me
VIP Call Girls Noida Jhanvi 9711199171 Best VIP Call Girls Near Memriyagarg453
 
Dehradun Call Girls Service ❤️🍑 8854095900 👄🫦Independent Escort Service Dehradun
Dehradun Call Girls Service ❤️🍑 8854095900 👄🫦Independent Escort Service DehradunDehradun Call Girls Service ❤️🍑 8854095900 👄🫦Independent Escort Service Dehradun
Dehradun Call Girls Service ❤️🍑 8854095900 👄🫦Independent Escort Service DehradunNiamh verma
 
Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Book me...
Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Book me...Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Book me...
Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Book me...gragteena
 
❤️Call girls in Jalandhar ☎️9876848877☎️ Call Girl service in Jalandhar☎️ Jal...
❤️Call girls in Jalandhar ☎️9876848877☎️ Call Girl service in Jalandhar☎️ Jal...❤️Call girls in Jalandhar ☎️9876848877☎️ Call Girl service in Jalandhar☎️ Jal...
❤️Call girls in Jalandhar ☎️9876848877☎️ Call Girl service in Jalandhar☎️ Jal...chandigarhentertainm
 
Call Girl In Zirakpur ❤️♀️@ 9988299661 Zirakpur Call Girls Near Me ❤️♀️@ Sexy...
Call Girl In Zirakpur ❤️♀️@ 9988299661 Zirakpur Call Girls Near Me ❤️♀️@ Sexy...Call Girl In Zirakpur ❤️♀️@ 9988299661 Zirakpur Call Girls Near Me ❤️♀️@ Sexy...
Call Girl In Zirakpur ❤️♀️@ 9988299661 Zirakpur Call Girls Near Me ❤️♀️@ Sexy...Sheetaleventcompany
 
VIP Call Girl Sector 25 Gurgaon Just Call Me 9899900591
VIP Call Girl Sector 25 Gurgaon Just Call Me 9899900591VIP Call Girl Sector 25 Gurgaon Just Call Me 9899900591
VIP Call Girl Sector 25 Gurgaon Just Call Me 9899900591adityaroy0215
 
Call Girls Thane Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Thane Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Thane Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Thane Just Call 9907093804 Top Class Call Girl Service AvailableDipal Arora
 
raisen Call Girls 👙 6297143586 👙 Genuine WhatsApp Number for Real Meet
raisen Call Girls 👙 6297143586 👙 Genuine WhatsApp Number for Real Meetraisen Call Girls 👙 6297143586 👙 Genuine WhatsApp Number for Real Meet
raisen Call Girls 👙 6297143586 👙 Genuine WhatsApp Number for Real MeetCall Girls Service
 
VIP Call Girl Sector 88 Gurgaon Delhi Just Call Me 9899900591
VIP Call Girl Sector 88 Gurgaon Delhi Just Call Me 9899900591VIP Call Girl Sector 88 Gurgaon Delhi Just Call Me 9899900591
VIP Call Girl Sector 88 Gurgaon Delhi Just Call Me 9899900591adityaroy0215
 
Chandigarh Call Girls 👙 7001035870 👙 Genuine WhatsApp Number for Real Meet
Chandigarh Call Girls 👙 7001035870 👙 Genuine WhatsApp Number for Real MeetChandigarh Call Girls 👙 7001035870 👙 Genuine WhatsApp Number for Real Meet
Chandigarh Call Girls 👙 7001035870 👙 Genuine WhatsApp Number for Real Meetpriyashah722354
 
Krishnagiri call girls Tamil aunty 7877702510
Krishnagiri call girls Tamil aunty 7877702510Krishnagiri call girls Tamil aunty 7877702510
Krishnagiri call girls Tamil aunty 7877702510Vipesco
 
Vip sexy Call Girls Service In Sector 137,9999965857 Young Female Escorts Ser...
Vip sexy Call Girls Service In Sector 137,9999965857 Young Female Escorts Ser...Vip sexy Call Girls Service In Sector 137,9999965857 Young Female Escorts Ser...
Vip sexy Call Girls Service In Sector 137,9999965857 Young Female Escorts Ser...Call Girls Noida
 

Último (20)

Hot Call Girl In Chandigarh 👅🥵 9053'900678 Call Girls Service In Chandigarh
Hot  Call Girl In Chandigarh 👅🥵 9053'900678 Call Girls Service In ChandigarhHot  Call Girl In Chandigarh 👅🥵 9053'900678 Call Girls Service In Chandigarh
Hot Call Girl In Chandigarh 👅🥵 9053'900678 Call Girls Service In Chandigarh
 
Call Girls Hyderabad Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Hyderabad Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Hyderabad Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Hyderabad Just Call 9907093804 Top Class Call Girl Service Available
 
VIP Call Girl Sector 32 Noida Just Book Me 9711199171
VIP Call Girl Sector 32 Noida Just Book Me 9711199171VIP Call Girl Sector 32 Noida Just Book Me 9711199171
VIP Call Girl Sector 32 Noida Just Book Me 9711199171
 
❤️♀️@ Jaipur Call Girls ❤️♀️@ Meghna Jaipur Call Girls Number CRTHNR Call G...
❤️♀️@ Jaipur Call Girls ❤️♀️@ Meghna Jaipur Call Girls Number CRTHNR   Call G...❤️♀️@ Jaipur Call Girls ❤️♀️@ Meghna Jaipur Call Girls Number CRTHNR   Call G...
❤️♀️@ Jaipur Call Girls ❤️♀️@ Meghna Jaipur Call Girls Number CRTHNR Call G...
 
(Sonam Bajaj) Call Girl in Jaipur- 09257276172 Escorts Service 50% Off with C...
(Sonam Bajaj) Call Girl in Jaipur- 09257276172 Escorts Service 50% Off with C...(Sonam Bajaj) Call Girl in Jaipur- 09257276172 Escorts Service 50% Off with C...
(Sonam Bajaj) Call Girl in Jaipur- 09257276172 Escorts Service 50% Off with C...
 
❤️♀️@ Jaipur Call Girls ❤️♀️@ Jaispreet Call Girl Services in Jaipur QRYPCF ...
❤️♀️@ Jaipur Call Girls ❤️♀️@ Jaispreet Call Girl Services in Jaipur QRYPCF  ...❤️♀️@ Jaipur Call Girls ❤️♀️@ Jaispreet Call Girl Services in Jaipur QRYPCF  ...
❤️♀️@ Jaipur Call Girls ❤️♀️@ Jaispreet Call Girl Services in Jaipur QRYPCF ...
 
Hot Call Girl In Ludhiana 👅🥵 9053'900678 Call Girls Service In Ludhiana
Hot  Call Girl In Ludhiana 👅🥵 9053'900678 Call Girls Service In LudhianaHot  Call Girl In Ludhiana 👅🥵 9053'900678 Call Girls Service In Ludhiana
Hot Call Girl In Ludhiana 👅🥵 9053'900678 Call Girls Service In Ludhiana
 
Call Girls Service Faridabad 📲 9999965857 ヅ10k NiGhT Call Girls In Faridabad
Call Girls Service Faridabad 📲 9999965857 ヅ10k NiGhT Call Girls In FaridabadCall Girls Service Faridabad 📲 9999965857 ヅ10k NiGhT Call Girls In Faridabad
Call Girls Service Faridabad 📲 9999965857 ヅ10k NiGhT Call Girls In Faridabad
 
VIP Call Girls Noida Jhanvi 9711199171 Best VIP Call Girls Near Me
VIP Call Girls Noida Jhanvi 9711199171 Best VIP Call Girls Near MeVIP Call Girls Noida Jhanvi 9711199171 Best VIP Call Girls Near Me
VIP Call Girls Noida Jhanvi 9711199171 Best VIP Call Girls Near Me
 
Dehradun Call Girls Service ❤️🍑 8854095900 👄🫦Independent Escort Service Dehradun
Dehradun Call Girls Service ❤️🍑 8854095900 👄🫦Independent Escort Service DehradunDehradun Call Girls Service ❤️🍑 8854095900 👄🫦Independent Escort Service Dehradun
Dehradun Call Girls Service ❤️🍑 8854095900 👄🫦Independent Escort Service Dehradun
 
Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Book me...
Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Book me...Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Book me...
Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Book me...
 
❤️Call girls in Jalandhar ☎️9876848877☎️ Call Girl service in Jalandhar☎️ Jal...
❤️Call girls in Jalandhar ☎️9876848877☎️ Call Girl service in Jalandhar☎️ Jal...❤️Call girls in Jalandhar ☎️9876848877☎️ Call Girl service in Jalandhar☎️ Jal...
❤️Call girls in Jalandhar ☎️9876848877☎️ Call Girl service in Jalandhar☎️ Jal...
 
Call Girl In Zirakpur ❤️♀️@ 9988299661 Zirakpur Call Girls Near Me ❤️♀️@ Sexy...
Call Girl In Zirakpur ❤️♀️@ 9988299661 Zirakpur Call Girls Near Me ❤️♀️@ Sexy...Call Girl In Zirakpur ❤️♀️@ 9988299661 Zirakpur Call Girls Near Me ❤️♀️@ Sexy...
Call Girl In Zirakpur ❤️♀️@ 9988299661 Zirakpur Call Girls Near Me ❤️♀️@ Sexy...
 
VIP Call Girl Sector 25 Gurgaon Just Call Me 9899900591
VIP Call Girl Sector 25 Gurgaon Just Call Me 9899900591VIP Call Girl Sector 25 Gurgaon Just Call Me 9899900591
VIP Call Girl Sector 25 Gurgaon Just Call Me 9899900591
 
Call Girls Thane Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Thane Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Thane Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Thane Just Call 9907093804 Top Class Call Girl Service Available
 
raisen Call Girls 👙 6297143586 👙 Genuine WhatsApp Number for Real Meet
raisen Call Girls 👙 6297143586 👙 Genuine WhatsApp Number for Real Meetraisen Call Girls 👙 6297143586 👙 Genuine WhatsApp Number for Real Meet
raisen Call Girls 👙 6297143586 👙 Genuine WhatsApp Number for Real Meet
 
VIP Call Girl Sector 88 Gurgaon Delhi Just Call Me 9899900591
VIP Call Girl Sector 88 Gurgaon Delhi Just Call Me 9899900591VIP Call Girl Sector 88 Gurgaon Delhi Just Call Me 9899900591
VIP Call Girl Sector 88 Gurgaon Delhi Just Call Me 9899900591
 
Chandigarh Call Girls 👙 7001035870 👙 Genuine WhatsApp Number for Real Meet
Chandigarh Call Girls 👙 7001035870 👙 Genuine WhatsApp Number for Real MeetChandigarh Call Girls 👙 7001035870 👙 Genuine WhatsApp Number for Real Meet
Chandigarh Call Girls 👙 7001035870 👙 Genuine WhatsApp Number for Real Meet
 
Krishnagiri call girls Tamil aunty 7877702510
Krishnagiri call girls Tamil aunty 7877702510Krishnagiri call girls Tamil aunty 7877702510
Krishnagiri call girls Tamil aunty 7877702510
 
Vip sexy Call Girls Service In Sector 137,9999965857 Young Female Escorts Ser...
Vip sexy Call Girls Service In Sector 137,9999965857 Young Female Escorts Ser...Vip sexy Call Girls Service In Sector 137,9999965857 Young Female Escorts Ser...
Vip sexy Call Girls Service In Sector 137,9999965857 Young Female Escorts Ser...
 

Thesis blending big data and cloud -epilepsy global data research and information system

  • 3. ACKNOWLEDGEMENTS I would like to express my sincere gratitude and deep regards to my supervisor and additional examiner for their constant motivation, monitoring and guidance throughout the course of the dissertation work. This is indeed a new beginning for professionals like us to extend technology beyond boundaries in healthcare. Their blessing, guidance and help enabled me to begin this journey. My prime motivation behind this dissertation is my loving nephew Aakash, who has been treated for epilepsy for the past seven years, and all children over the world. My sincere regards and appreciation are extended to Dr. Vrajesh Udani and the hospital staff of Hinduja Hospital, Mumbai, and to Dr. Neeta Ajit Naik, Sion, Mumbai, who are pioneers in treating children with epilepsy in India. I would also like to thank my family for motivating me to build this. It would not have been possible without their constant support and help. Indeed, we have a long way to go beyond this. AnupSingh
  • 4. TABLE OF CONTENTS
      1. Introduction: Understanding the power of Big Data, Cloud features
      2. Feasibility Study and Analysis of Algorithms, Application Methodologies
      3. Architecture Design of the System
      4. Cloud Design of the Epilepsy Global Data Centre
      5. Data Storage Structure and Query Processing in HDFS and HBASE
      6. Use Cases Overview
      7. Conclusion and Recommendations
  • 5. BIRLA INSTITUTE OF TECHNOLOGY & SCIENCE, PILANI WORK-INTEGRATED LEARNING PROGRAMMES DIVISION Second Semester 2013-2014
    Introduction: Understanding the power of Big Data, Cloud features
    Analysis of large volumes of data in fields such as epilepsy, cardiac disease, genetics and neuroimaging, across groups of individuals with shared and variable characteristics, remains poorly approached and poorly understood. The challenges of storing, accessing and building such data sets, keeping them accurate and running complex computations on them cannot be met with traditional data warehouse methods. Globally as well as locally, many families in rural and urban areas, and even modern, sophisticated hospitals, are unaware of the different types of diseases, symptoms, medicines and healthcare solutions. Sharing a structured and unstructured knowledge base amongst researchers, neurologists, doctors, associates and parents is a must. This calls for a dedicated scientific environment and automated software applications, delivered at reduced cost. For epilepsy among children, the gap needs to be bridged by leveraging the technological revolution to predict and find new, improved ways of cure.
    Mature methodologies such as Kimball's approach, the Enterprise-Wide Data Warehouse (EDW), traditional RDBMSs and ETL/ELT are insufficient for the huge amount of epileptic data. Over the years, terabytes to petabytes to zettabytes of unused data have accumulated that can be transformed, utilised and re-engineered to devise new findings to cure epilepsy. We need better techniques for data access, data storage and data structures. Big Data environments create the opportunity to ease some of the rigidity of ETL-driven data integration processes. The nature of big data requires that the infrastructure for this process can scale cost-effectively.
    Hadoop and MongoDB have emerged as standard solutions for managing big data. Big Data refers to the large amounts, at least terabytes, of poly-structured data that flow continuously through and around organizations, including video, text, sensor logs and transactional records. Rapidly ingesting, storing and processing big data requires a cost-effective infrastructure that can scale with the amount of data and the scope of analysis. Hadoop has rapidly emerged as the de facto standard for managing large volumes of unstructured data. Hadoop is an open source distributed software platform for storing and processing data. Written in Java, it runs on a cluster of industry-standard servers configured with direct-attached storage. Using Hadoop, you can store petabytes of data reliably on tens of thousands of servers while scaling performance cost-effectively by merely adding inexpensive nodes to the cluster.
    Cloud computing has emerged as a viable alternative to acquiring and managing physical or software resources. Scientific applications are being ported to clouds to build on their inherent elasticity and scalability. Such an application needs to run in parallel on a large set of resources in order to achieve reasonable execution times. Platforms such as Amazon Web Services, Azure and Cloudera are an interesting option to tackle this problem. They provide high-performance cloud computing infrastructure for handling the variability of epileptic "Big Data", along with simplified and optimised deployment configurations. We will use Amazon Web Services (AWS) to blend the features of Big Data and Cloud Computing.
  • 6. Feasibility Study and Analysis of Algorithms, Application Methodologies
    Assumptions: Representing all the features of Big Data and Cloud is out of scope; they can be taken up as separate research areas in epilepsy and other healthcare problems. We will use Amazon EMR with the Hortonworks Distribution for Hadoop, which makes it easy to provision and manage Hadoop in the AWS Cloud. Hadoop is available in multiple distributions, and Amazon EMR gives you the option of using the Amazon Distribution or the Hortonworks Distribution for Hadoop. Hortonworks delivers on the promise of Hadoop with a proven, enterprise-grade platform that supports a broad set of mission-critical and real-time production uses. Hortonworks brings dependability, ease of use and speed to Hadoop, NoSQL, database and streaming applications in one unified Big Data platform. Hortonworks is used across financial services, retail, media, healthcare, manufacturing, telecommunications and government organizations.
    Hadoop for Big Data and Cloud
    Hadoop is an open source distributed software platform for storing and processing data. Written in Java, it runs on a cluster of industry-standard servers configured with direct-attached storage. Using Hadoop, you can store petabytes of data reliably on tens of thousands of servers while scaling performance cost-effectively by merely adding inexpensive nodes to the cluster. Central to the scalability of Hadoop is the distributed processing framework known as MapReduce. MapReduce, the programming paradigm implemented by Hadoop, breaks up a batch job into many smaller tasks for parallel processing on a distributed system. HDFS, the distributed file system, stores the data reliably.
  • 7. MapReduce helps programmers solve data-parallel problems for which the data set can be subdivided into small parts and processed independently. MapReduce is an important advance because it allows ordinary developers, not just those skilled in high-performance computing, to use parallel programming constructs without worrying about the complex details of intra-cluster communication, task monitoring and failure handling. The system splits the input data set into multiple chunks, each of which is assigned a map task that can process the data in parallel. Each map task reads the input as a set of (key, value) pairs and produces a transformed set of (key, value) pairs as the output. The framework shuffles and sorts the outputs of the map tasks, sending the intermediate (key, value) pairs to the reduce tasks, which group them into final results. MapReduce uses JobTracker and TaskTracker mechanisms to schedule tasks, monitor them and restart any that fail.
    The Hadoop platform also includes the Hadoop Distributed File System (HDFS), which is designed for scalability and fault tolerance. HDFS stores large files by dividing them into blocks (usually 64 or 128 MB) and replicating the blocks on three or more servers. HDFS provides APIs for MapReduce applications to read and write data in parallel. Capacity and performance can be scaled by adding DataNodes, and a single NameNode mechanism manages data placement and monitors server availability. HDFS clusters in production use today reliably hold petabytes of data on thousands of nodes.
    In addition to MapReduce and HDFS, Hadoop includes many other components, some of which are very useful for ETL:
      • Flume is a distributed system for collecting, aggregating and moving large amounts of data from multiple sources into HDFS or another central data store. Enterprises typically collect log files on application servers or other systems and archive them in order to comply with regulations. Being able to ingest and analyse that unstructured or semi-structured data in Hadoop can turn this passive resource into a valuable asset. Spring XD is a system similar to Flume.
      • Sqoop is a tool for transferring data between Hadoop and relational databases. You can use Sqoop to import data from a MySQL or Oracle database into HDFS, run MapReduce on the data, and then export the data back into an RDBMS. Sqoop automates these processes, using MapReduce to import and export the data in parallel with fault tolerance.
      • Hive and Pig are programming languages that simplify the development of applications employing the MapReduce framework. HiveQL is a dialect of SQL and supports a subset of its syntax. Although slow, Hive is being actively enhanced by the developer community to enable low-latency queries on HBase and HDFS. Pig Latin is a procedural programming language that provides high-level abstractions for MapReduce. You can extend it with User Defined Functions written in Java, Python and other languages.
      • ODBC/JDBC connectors for HBase and Hive are often proprietary components included in distributions of Hadoop software. They provide connectivity with SQL applications by translating standard SQL queries into HiveQL commands that can be executed upon the data in HDFS or HBase.
      • YARN provides cluster resource management capabilities that enable multiple data processing engines, with multiple workloads and applications, to share a single clustered environment.
    Thus Hadoop is a powerful platform for big data storage and processing.
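The map, shuffle-and-sort and reduce phases described above can be sketched in plain Java without a cluster. This is only an in-memory illustration of the data flow, not Hadoop's API: the class name `MapReduceSketch`, its methods and the sample case notes are assumptions made for this example.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration of the MapReduce data flow (hypothetical names, no Hadoop dependency).
public class MapReduceSketch {

    // Map phase: turn one input record into (key, value) pairs; here (word, 1).
    static List<Map.Entry<String, Integer>> map(String record) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : record.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle and sort: group all intermediate values by key, keys in sorted order.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: collapse the values of one key into a single result.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs the three phases sequentially; a real cluster runs map and reduce tasks in parallel.
    static Map<String, Integer> run(List<String> records) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String record : records) {
            intermediate.addAll(map(record));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(intermediate).entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up case notes, used only to show the flow.
        List<String> notes = Arrays.asList("focal seizure at night", "tonic seizure", "absence seizure at school");
        System.out.println(run(notes));
    }
}
```

Because each call to `map` depends only on its own record, the records could be processed on different nodes, which is the property that lets Hadoop scale by adding servers.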
  • 8. Architecture Design of the System
    Hadoop receives structured and unstructured input data from different sources, such as hospitals, healthcare suppliers (vaccines, instruments), social media and information documents. The components listed in the feasibility section form the core of the platform, and the HDFS data nodes can be scaled out for storage. The output is a set of application layers built on the collated epileptic data: audio, video, documents, research publications and collaboration-forum information from social media. Data science can also be applied to this data to find new research areas, make predictions and produce analytical reports.
    [Figure: Architecture Design of the System. Sources (hospitals and epileptic patients' data files with epileptic cases and scenarios; social media; healthcare: worldwide epileptic vaccines and instruments; information) are loaded through ETL into the HDFS data nodes, which feed epilepsy information and knowledge sharing and advanced analytics.]
  • 9. Cloud Design of the Epilepsy Global Data Centre
    [Figure: Epilepsy Global Data Centres Leveraging Cloud Computing Features. Data centres are shown for Pakistan, UK, India, US, Malaysia and Sri Lanka.]
    The cloud is core to providing infrastructure as a service (IaaS) to the Epilepsy Global Data Centres across the world. Because the volume, variety and velocity of the data are huge, the system can be scaled up automatically based on our data needs. The overhead of maintenance, upgrades, version management and the operation of Hadoop, mail and reporting services lies with the cloud provider. Information on epilepsy can be shared across different countries. We can create customised "Epilepsy Data as a Service" offerings for clinical research, hospitals, doctors, neuroscientists and social media. Very large data volumes can be stored, because capacity grows as nodes are added. However, the cloud framework, the network portability of components, and legal matters and laws across different countries will hold the key. The cloud can also be used to provide extra capacity for an existing cluster or to test Hadoop applications. Moreover, the NameNode High Availability feature of Hortonworks Data Platform (HDP) 2.0 automates failover and ensures the availability of the full HDP stack. The cloud also supports multiple database platforms, whether MySQL, Oracle, SQL Server or other databases, and reporting tools such as Jasper, SAP BusinessObjects, MicroStrategy and QlikView that interface with Hadoop. The cloud is certainly a multi-use platform when coupled with Big Data. Hadoop in the cloud makes a great deal of sense: the elastic resource allocation that cloud computing is premised on works well for cluster-based data processing infrastructure used on varying analyses and data sets of indeterminate size.
  • 10. Data Storage Structure and Query Processing Flow in Hadoop Distributed File System (HDFS) and HBASE
    HDFS is a distributed file system that is well suited to the storage of large files. Data in HDFS is organised into files and directories, but it is stored as blocks spread across the cluster rather than as ordinary files on one machine, so we cannot browse it with the operating system's dir or file explorer commands; it is accessed through the Hadoop file system commands and APIs. Files are divided into uniformly sized blocks and distributed across cluster nodes. Blocks are replicated to handle hardware failure, and HDFS keeps checksums of the data for corruption detection and recovery. Depending on the configuration, files are broken into blocks of, for example, 128 MB; the block size can be configured per file.
    The NameNode manages the file namespace, authorisation and authentication. It collects block reports from the DataNodes to track block locations, and it re-replicates missing blocks to other DataNodes in case of failures. Each DataNode stores thousands of blocks as files in the underlying operating system's file system. Clients access blocks directly from the DataNodes, based on the metadata read from the NameNode. MapReduce uses the FileSystem interface, hence it can run on multiple file systems. The HDFS storage structure is depicted below.
    [Figure: Hadoop Distributed File System Storage Structure, showing the NameNode metadata.]
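The effect of block size and replication on storage can be worked through with a little arithmetic. The sketch below is plain Java with no Hadoop dependency; the class `HdfsBlockSketch`, the 1000 MB file size and the replication factor of 3 are illustrative assumptions, not measurements from the thesis cluster.

```java
// Back-of-the-envelope HDFS storage arithmetic (hypothetical helper, not a Hadoop API).
public class HdfsBlockSketch {
    static final long MB = 1024L * 1024L;

    // Number of blocks a file occupies: ceiling of fileSize / blockSize.
    static long blockCount(long fileSizeBytes, long blockSizeBytes) {
        if (fileSizeBytes == 0) {
            return 0;
        }
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    // Size of the final block, which holds only the remaining bytes.
    static long lastBlockBytes(long fileSizeBytes, long blockSizeBytes) {
        long remainder = fileSizeBytes % blockSizeBytes;
        return (remainder == 0 && fileSizeBytes > 0) ? blockSizeBytes : remainder;
    }

    // Raw capacity used across the cluster: each byte is stored once per replica.
    static long rawBytesStored(long fileSizeBytes, int replication) {
        return fileSizeBytes * replication;
    }

    public static void main(String[] args) {
        long fileSize = 1000 * MB;   // assumed size of one recording file
        long blockSize = 128 * MB;   // block size mentioned in the text
        int replication = 3;         // assumed replication factor

        System.out.println("blocks: " + blockCount(fileSize, blockSize));
        System.out.println("last block MB: " + lastBlockBytes(fileSize, blockSize) / MB);
        System.out.println("raw MB stored: " + rawBytesStored(fileSize, replication) / MB);
    }
}
```

Under these assumptions the file occupies eight blocks (seven full blocks and one of 104 MB), and the cluster holds 3000 MB of raw data for it.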
  • 11. Sample Java code to read the files in HDFS
    package org.myorg;

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class cat {
        public static void main(String[] args) {
            try {
                // Connect to the file system named in the Hadoop configuration
                FileSystem fs = FileSystem.get(new Configuration());
                FileStatus[] status = fs.listStatus(new Path("/hdfs/epilepsycases"));
                for (FileStatus file : status) {
                    // Prints path, block size, replication and permissions of the entry
                    System.out.println(file);
                    if (file.isDirectory()) {
                        continue; // only regular files can be opened for reading
                    }
                    // try-with-resources closes the stream even if reading fails
                    try (BufferedReader br = new BufferedReader(
                            new InputStreamReader(fs.open(file.getPath())))) {
                        String line;
                        while ((line = br.readLine()) != null) {
                            System.out.println(line);
                        }
                    }
                }
            } catch (IOException e) {
                // Report the actual cause instead of always printing "File not found"
                System.err.println("Could not read from HDFS: " + e.getMessage());
            }
        }
    }

    [root@sandbox /]# hadoop jar epilepsy_case_files.jar org.myorg.cat > epilepsy_case_files.txt
    In the output we can see the namenode (as part of each file path), the block size, the replication factor and the permissions.
  • 12. HBase is designed as a column store. This is a more advanced form of a key-value database in which the keys and values become composite. Think of it as a hash map crossed with a multidimensional array: each row key leads to column families, and each column family holds named columns with their values. It is ideally suited to semi-structured data, on which MapReduce is very often used. The columns are naturally indexed, and the design is good for scaling out horizontally. Tables can also be much wider than in a relational database: imagine an RDBMS table with a hundred columns next to an HBase table with around 500 columns. However, HBase is unsuited to complex data reads. HBase is built on top of HDFS and provides fast record lookups (and updates) for large tables, which plain HDFS files do not. This can sometimes be a point of conceptual confusion. Internally, HBase puts your data in indexed "StoreFiles" that exist on HDFS for high-speed lookups. A sample HBase storage structure, in contrast to an SQL RDBMS table, is depicted below.
    [Figure: HBASE Storage Structure using Key Value Pair and SQL RDBMS Storage Structure. HBASE: key PatientID_Country; values in column family CF_Data: Firstname_lastname, Doctorname_hospitalname, Evaluation_date_Observations. SQL (RDBMS): primary key PatientID; table columns Country, FirstName, Lastname, DoctorName, HospitalName, Surgical EvaluationDate, Evaluation/Observations.]
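The "hash map crossed with a multidimensional array" picture can be made concrete with nested sorted maps. The sketch below models only the logical layout (row key, column family, column, value) in plain Java; it is not the HBase client API, and the class `HBaseModelSketch`, the sample patient IDs and the cell values are made up for illustration. The row key format PatientID_Country and the family name CF_Data follow the figure.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Logical model of an HBase table as nested sorted maps (hypothetical, not the HBase client API).
public class HBaseModelSketch {
    // rowKey -> columnFamily -> column -> value, each level kept in sorted key order
    private final NavigableMap<String, NavigableMap<String, NavigableMap<String, String>>> rows =
            new TreeMap<>();

    // Composite row key as in the figure: PatientID_Country.
    static String rowKey(String patientId, String country) {
        return patientId + "_" + country;
    }

    void put(String rowKey, String family, String column, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
            .computeIfAbsent(family, k -> new TreeMap<>())
            .put(column, value);
    }

    // Point lookup by row key, family and column; returns null when the cell is absent.
    String get(String rowKey, String family, String column) {
        Map<String, NavigableMap<String, String>> families = rows.get(rowKey);
        if (families == null) {
            return null;
        }
        Map<String, String> columns = families.get(family);
        return columns == null ? null : columns.get(column);
    }

    // Range scan: because row keys are sorted, all rows sharing a prefix are adjacent.
    NavigableMap<String, NavigableMap<String, NavigableMap<String, String>>> scanPrefix(String prefix) {
        return rows.subMap(prefix, true, prefix + Character.MAX_VALUE, false);
    }

    public static void main(String[] args) {
        HBaseModelSketch table = new HBaseModelSketch();
        // Sample values are invented for the sketch.
        table.put(rowKey("P001", "IN"), "CF_Data", "Firstname_lastname", "Patient One");
        table.put(rowKey("P001", "IN"), "CF_Data", "Doctorname_hospitalname", "Doctor A, Hospital X");
        table.put(rowKey("P002", "UK"), "CF_Data", "Firstname_lastname", "Patient Two");

        System.out.println(table.get("P001_IN", "CF_Data", "Firstname_lastname"));
        System.out.println(table.scanPrefix("P001_").keySet());
    }
}
```

Rows need not share the same columns, which is why a sparse, very wide table is cheap in this model, and the sorted row keys are what make lookups and prefix scans fast.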
  • 13. Use Cases
    Pool social media data and analyse the information on epilepsy. This is aimed at supporting self-care as well as care globally. In today's fast-changing world there is a huge population on Twitter, Facebook and LinkedIn, and we see a common synergy and a huge exchange of information.
    [Figure: Streaming and Analysing Social Media Data Flow in Hadoop. The Epilepsy Social Media Twitter app streams data through the Spring XD engine (data ingest) into Hadoop HDFS, where the unstructured JSON data is parsed for data analytics.]
  • 14. Scenario
    This scenario streams unstructured data in real time from the Twitter app "Epilepsy Social Media" and transforms it into useful information.
    Step 1: Create a collaboration forum app "Epilepsy Social Media" on Twitter at https://dev.twitter.com/ and note down the API key, API secret, access token and access token secret. These keys are needed to stream information from Twitter. Once we have the keys, we configure the XD engine installed on the Hadoop server.
  • 15. Step 2: Log in to the Spring XD shell in a separate shell from Hadoop. Test whether HDFS is accessible:
    hadoop fs ls /
    It should display some files and directories.
    Step 3: Create the tweet stream on the collaboration forum in Spring XD:
    stream create --name epilepsytweets --definition "twitterstream --track='epilepsysociety, epilepsy society' | hdfs"
  • 16. Step 4: Check whether files are being streamed into the XD directory:
    hadoop fs -ls /xd/epilepsytweets
  • 17. The tweets that were posted are listed in the files shown in the screenshot.
  • 18. JSON Data Format {"created_at":"Wed Mar 19 19:33:25 +0000 2014","id":446368866097065984,"id_str":"446368866097065984","text":"@epilepsysociety Hi we should build some ideas and come together to create awareness on epilepsy many countries mothers and fathers dont knw","source":"web","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":87454049,"in_reply_to_user_id_str":"87454049","in_reply_to_screen_name":"epilepsysociety","user":{"id":2387686938,"id_str":"2387686938","name":"AnupSingh","screen_name":"anupsingh4u","location":"","url":null,"description":null,"protected":false,"followers_count":4,"friends_count":8,"listed_count":0,"created_at":"Thu Mar 13 19:00:48 +0000 2014","favourites_count":0,"utc_offset":null,"time_zone":null,"geo_enabled":false,"verified":false,"statuses_count":8,"lang":"en-gb","contributors_enabled":false,"is_translator":false,"is_translation_enabled":false,"profile_background_color":"C0DEED","profile_background_image_url":"http://abs.twimg.com/images/themes/theme1/bg.png","profile_background_image_url_https":"https://abs.twimg.com/images/themes/theme1/bg.png","profile_background_tile":false,"profile_image_url":"http://abs.twimg.com/sticky/default_profile_images/default_profile_0_normal.png","profile_image_url_https":"https://abs.twimg.com/sticky/default_profile_images/default_profile_0_normal.png","profile_link_color":"0084B4","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"default_profile":true,"default_profile_image":true,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"symbols":[],"urls":[],"user_mentions":[{"screen_name":"epilepsysociety","name":"epilepsy society","id":87454049,"id_str":"87454049","indices":[0,16]}]},"favorited":false,"retweeted":false,"filter_level":"medium","lang":"en"}
  • 19. {"created_at":"Wed Mar 19 20:07:31 +0000 2014","id":446377448163143680,"id_str":"446377448163143680","text":"I'm fundraising for Epilepsy Society &amp; I'd love your support! Text HERB49 \u00a32 to 70070 to sponsor me today. Thanks. http://t.co/C74muxXk9P","source":"\u003ca href=\"http://twitter.com/tweetbutton\" rel=\"nofollow\"\u003eTweet Button\u003c/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":98352324,"id_str":"98352324","name":"Steven Herbert","screen_name":"sherbie40","location":"chepstow","url":null,"description":"Play the guitar til your fingers bleed, quoted by Ted Nugent..\n\nLifes to short get on with it...","protected":false,"followers_count":43,"friends_count":107,"listed_count":1,"created_at":"Mon Dec 21 11:14:17 +0000 2009","favourites_count":1,"utc_offset":0,"time_zone":"London","geo_enabled":true,"verified":false,"statuses_count":119,"lang":"en","contributors_enabled":false,"is_translator":false,"is_translation_enabled":false,"profile_background_color":"C0DEED","profile_background_image_url":"http://abs.twimg.com/images/themes/theme1/bg.png","profile_background_image_url_https":"https://abs.twimg.com/images/themes/theme1/bg.png","profile_background_tile":false,"profile_image_url":"http://pbs.twimg.com/profile_images/442675380076703744/Oje9Ifzk_normal.jpeg","profile_image_url_https":"https://pbs.twimg.com/profile_images/442675380076703744/Oje9Ifzk_normal.jpeg","profile_banner_url":"https://pbs.twimg.com/profile_banners/98352324/1394377010","profile_link_color":"0084B4","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"default_profile":true,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"symbols":[],"urls":[{"url":"http://t.co/C74muxXk9P","expanded_url":"http://www.justgiving.com/Steven-Herbert","display_url":"justgiving.com/Steven-Herbert","indices":[119,141]}],"user_mentions":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"medium","lang":"en"}
  • 20. Step 5: Stop or undeploy the stream after collecting some data:
    stream undeploy --name epilepsytweets
    Step 6: Refine the data using Hive. Create tables in Hive based on the streamed data collected.
  • 21. In the Hadoop interface we can see that the tweets have been brought into a structured format. A report can be built on top of this data.
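What "brought into a structured format" means can be illustrated by pulling a few fields out of one tweet and emitting them as a row. The sketch below is only a stand-in: it uses a regular expression on an abridged copy of the first tweet above, whereas in Hive this job would normally be done by a JSON SerDe and in application code by a proper JSON parser. The class `TweetFieldSketch`, its methods and the choice of fields are assumptions for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Regex-based field extraction from one tweet (illustration only; use a JSON parser in practice).
public class TweetFieldSketch {

    // Returns the first string value stored under "key", or null if the key has no string value.
    // The value pattern accepts backslash-escaped characters such as \" inside the string.
    static String stringField(String json, String key) {
        Pattern p = Pattern.compile("\"" + Pattern.quote(key) + "\":\"((?:\\\\.|[^\"\\\\])*)\"");
        Matcher m = p.matcher(json);
        return m.find() ? m.group(1) : null;
    }

    // One tab-separated row per tweet: id, creation time, author, text.
    static String toRow(String json) {
        return String.join("\t",
                stringField(json, "id_str"),
                stringField(json, "created_at"),
                stringField(json, "screen_name"),
                stringField(json, "text"));
    }

    public static void main(String[] args) {
        // Abridged version of the first sample tweet; most fields are omitted.
        String tweet = "{\"created_at\":\"Wed Mar 19 19:33:25 +0000 2014\","
                + "\"id_str\":\"446368866097065984\","
                + "\"text\":\"@epilepsysociety Hi we should build some ideas\","
                + "\"in_reply_to_screen_name\":\"epilepsysociety\","
                + "\"user\":{\"screen_name\":\"anupsingh4u\"}}";
        System.out.println(toRow(tweet));
    }
}
```

The sketch relies on the field order of the sample (the tweet-level `id_str`, `created_at` and `text` appear before the nested user fields), which is one reason a real parser is the safer choice.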
  • 22. Use Cases
    Collect and represent information on epilepsy types, symptoms and medicines, together with their pros and cons. Collect and represent information on neurosurgeons, the successful cases they have handled and their publications.
    Scenario: Collect doctors' data from different hospitals and research centres
    The Hive ETL script below loads the list of doctors into the warehouse.
    -- staging table holding one raw CSV line per row
    create table temp_doctor (col_value string);
    LOAD DATA INPATH '/user/hue/Doctors_List.csv' OVERWRITE INTO TABLE temp_doctor;
    create table tbl_doctor (id string, name string, age int, hospitalname string, expertise string, publications_link string, profile_info string, country string, city string);
    -- {n} repeats the group n times, so capture group 1 ends up holding the n-th comma-separated field
    insert overwrite table tbl_doctor
    SELECT
      regexp_extract(col_value, '^(?:([^,]*),?){1}', 1) doctor_id,
      regexp_extract(col_value, '^(?:([^,]*),?){2}', 1) fullname,
      regexp_extract(col_value, '^(?:([^,]*),?){10}', 1) age,
      regexp_extract(col_value, '^(?:([^,]*),?){3}', 1) organisation,
      regexp_extract(col_value, '^(?:([^,]*),?){11}', 1) specialisation,
      regexp_extract(col_value, '^(?:([^,]*),?){8}', 1) articles_cited,
      regexp_extract(col_value, '^(?:([^,]*),?){13}', 1) wiki_profile,
      regexp_extract(col_value, '^(?:([^,]*),?){4}', 1) Country,
      regexp_extract(col_value, '^(?:([^,]*),?){5}', 1) City
    from temp_doctor;
    We can customise the script based on the information received from hospitals and research centres, because column positions can differ between sources. For example, if the specialisation field in the list of doctors from Hinduja Hospital is at position 11, we use the script above. If the specialisation field in the list of doctors from Fortis Hospital is at position 14, we change the statement to "regexp_extract(col_value, '^(?:([^,]*),?){14}', 1) specialisation".
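Hive's `regexp_extract` uses Java regular expressions, so the field-picking pattern from the script can be tried out in plain Java. The sketch below shows why changing the repetition count selects a different column. The class `CsvFieldSketch` and the sample CSV line are assumptions for the example, with columns laid out as in the script (1 id, 2 name, 3 hospital, 4 country, 5 city).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Demonstrates the '^(?:([^,]*),?){n}' pattern from the Hive script (hypothetical helper class).
public class CsvFieldSketch {

    // Returns the n-th comma-separated field (1-based), or "" when the pattern does not match.
    // The non-capturing group repeats n times; capture group 1 keeps only its last repetition.
    static String field(String line, int n) {
        Matcher m = Pattern.compile("^(?:([^,]*),?){" + n + "}").matcher(line);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        // Made-up row for illustration only.
        String line = "D001,Jane Doe,Example Hospital,India,Mumbai";
        System.out.println(field(line, 1)); // id
        System.out.println(field(line, 3)); // hospital
        System.out.println(field(line, 5)); // city
    }
}
```

The pattern splits on every comma, so it does not handle quoted fields that themselves contain commas; source files with such values would need a CSV-aware loader instead.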
  • 23. Scenario: Build a catalog of epilepsy types and epilepsy medicines. HCatalog provides an easy interface to upload files in different formats and set up the data.
  • 24. Scenario: Collect patients' data related to presurgical evaluation, medical history, physical examination and lab tests. The other tables are represented below. We can have customised ETL jobs based on each hospital's data, and we can automate this process once we have the list of files. However, it will be essential to encrypt the stored data or mask it rather than reveal individual names. This will be subject to the healthcare laws of different nations. This scenario can be complemented by writing Pig scripts to compare data on epileptic patients across different states or countries.
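One simple way to avoid revealing individual names is to replace each name with a stable pseudonym before the record is loaded. The sketch below uses a salted SHA-256 hash from the Java standard library; it is an illustration of masking, not encryption and not a statement of what any healthcare law requires. The class `PatientMaskSketch`, the method name and the salt handling are assumptions for the example, and a keyed HMAC with a securely stored key would be a stronger choice in practice.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

// Pseudonymises patient names with a salted hash (hypothetical helper, illustration only).
public class PatientMaskSketch {

    // Same name and salt always give the same 64-character token, so records can still be joined.
    static String pseudonym(String patientName, String secretSalt) {
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            String normalised = patientName.trim().toLowerCase(Locale.ROOT);
            byte[] hash = sha256.digest((secretSalt + ":" + normalised).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is part of every standard Java platform, so this should not happen.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    public static void main(String[] args) {
        // Salt value is a placeholder; a real deployment would keep it secret and outside the code.
        String salt = "example-salt";
        System.out.println(pseudonym("Patient One", salt));
    }
}
```

Because the token is deterministic, the same patient can be followed across presurgical evaluation, history and lab-test tables without the name ever being stored in the warehouse.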
  • 25. Scenario: Information about events can be shared easily by email to increase awareness. Design the job in the Oozie Editor/Dashboard.
  • 26. Conclusion and Recommendations
    The aim of this blended case is to increase networking amongst hospitals, doctors, adults and children, and thus to improve healthcare systems. In Hadoop we can have a proper Kimball-model data warehouse as well as a federated data warehouse. Big Data is feasible for structured as well as unstructured data. Because data from different testing methods and research is already available, we can carry out data mining and make predictions on epileptic data. This will also help in recognising the difference between normal and abnormal patterns in people with epilepsy. Cognitive features based on neural networks could be aimed at reading the machine output of tests carried out on epilepsy patients. Test data and their scenarios can be known up front based on the parameters. Algorithms can be developed to make the system precise and agnostic. We can aim to build a language interpreter app that shares epilepsy data in different languages with target audiences across different countries. This will help bridge the language barrier between the different languages spoken over the world. Document stores for CT scans, MRI and EEG recordings can be explored in MongoDB to optimise the handling of audio and video data. Interfacing with SAP HANA, SAP BusinessObjects, MicroStrategy, Jasper, QlikView and other reporting tools can be carried out so that we have graphs and data representing normal behaviour and deviated behaviour in seizures.
  • 27. List of Abbreviations
      AWS - Amazon Web Services
      EMR - Elastic Map Reduce
      HDP - Hortonworks Data Platform
      EDW - Enterprise-Wide Data Warehouse
      HDFS - Hadoop Distributed File System
      IAAS - Infrastructure As A Service
    List of Figures
      Page 1: Hadoop Architecture
      Page 2: Architecture Design of the System
      Page 3: Epilepsy Global Data Centres Leveraging Cloud Computing Features
      Page 4: Data Storage Structure and Query Processing Flow in HDFS and HBASE
      Page 4: HDFS Storage Structure
      Page 8: HBASE Storage Structure
  • 28. Literature References
      [1] http://www.epilepsyfoundation.org
      [2] Dinkar Sitaram, Geetha Manjunath. Moving To The Cloud: Developing Apps in the New World of Cloud Computing.
      [3] http://bigdatauniversity.com
      [4] http://www.mongodb.com/learn/big-data
      [5] http://ocw.mit.edu/courses/brain-and-cognitive-sciences/
      [6] http://aws.amazon.com/
      [7] Amit Konar. Artificial Intelligence and Soft Computing: Behavioral and Cognitive Modeling of the Human Brain, Volume 1.
      [8] Amit Konar. Computational Intelligence: Principles, Techniques and Applications.
      [9] http://hortonworks.com/
      [10] http://hadoop.apache.org/
      [11] http://projects.spring.io/spring-xd/
      [12] http://guidance.nice.org.uk/
      [13] https://www.hemr.org/wiki/Category:Epilepsy_syndromes
      [14] Dr. Vrajesh Udani. http://www.hindujahospital.com/communityportal/doctors/doctor-details.aspx?did=140&name=dr-vrajesh-udani&cid=36&cname=
      [15] https://twitter.com/epilepsysociety
      [16] Jayapandian CP, Chen CH, Bozorgi A, Lhatoo SD, Zhang GQ, Sahoo SS. Electrophysiological Signal Analysis and Visualization using Cloudwave for Epilepsy Clinical Research. The 14th World Congress on Medical and Health Informatics (MedInfo), 2013. http://www.ncbi.nlm.nih.gov/pubmed/23920671
      [17] Hadoop Architecture. http://www.intel.co.uk/content/www/xa/en/big-data/big-data-analytics-turning-big-data-into-intelligence.html