Handling Large Data on a
Single System
Problems while handling large data:
A large volume of data poses new challenges, such as overloaded memory and
algorithms that never stop running. It forces you to adapt and expand your repertoire of
techniques. But even when you can perform your analysis, you should take care of
issues such as I/O (input/output) and CPU starvation, because these can cause speed
issues.
General Techniques for handling large data:
Never-ending algorithms, out-of-memory errors, and speed
issues are the most common challenges you face when working with large
data. In this section, we’ll investigate solutions to overcome or alleviate these
problems.
Choosing the right algorithm:
Choosing the right algorithm can solve more problems than adding
more or better hardware. An algorithm that’s well suited for handling large data doesn’t
need to load the entire data set into memory to make predictions. Ideally, the algorithm
also supports parallelized calculations.
Three families of algorithms suited to this are:
Online Algorithms,
Block Matrices,
MapReduce.
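As an illustration of the first family, an online algorithm processes one observation at a time and keeps only a small running state, so the full data set never has to sit in memory. The sketch below uses Welford's update for a streaming mean and variance; this technique and the `RunningStats` name are chosen here for illustration and are not taken from the slides.

```python
class RunningStats:
    """Online (streaming) mean and variance using Welford's update."""

    def __init__(self):
        self.n = 0        # number of observations seen so far
        self.mean = 0.0   # running mean
        self._m2 = 0.0    # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new observation into the running state."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Population variance of everything seen so far."""
        return self._m2 / self.n if self.n else float("nan")


stats = RunningStats()
# In practice the values would be streamed from disk or a database cursor.
for value in (2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0):
    stats.update(value)
```

Memory use stays constant no matter how many values are streamed through `update`.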
Choosing the right data structure:
Algorithms can make or break your program, but the way you store
your data is of equal importance. Data structures have different storage requirements,
but also influence the performance of CRUD (create, read, update, and delete) and other
operations on the data set.
Sparse Data:
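A minimal sketch of one way to store sparse data, assuming a plain dictionary keyed by (row, column); the helper names `to_sparse` and `get` are hypothetical. Only the nonzero cells are kept, which saves memory when most entries are zero.

```python
def to_sparse(rows):
    """Convert a dense list of rows into {(row, col): value} for nonzero cells."""
    return {
        (i, j): value
        for i, row in enumerate(rows)
        for j, value in enumerate(row)
        if value != 0
    }


def get(sparse_matrix, i, j):
    """Read a cell; anything not stored is implicitly zero."""
    return sparse_matrix.get((i, j), 0)


# Hypothetical 3 x 4 matrix with only two nonzero entries.
dense = [
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 2, 0, 0],
]
sparse = to_sparse(dense)
```

Here 12 dense cells shrink to 2 stored entries; libraries such as SciPy offer more efficient sparse formats built on the same idea.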
Tree:
Selecting the right tools:
With the right class of algorithms and data structures in place, it’s time to choose the
right tool for the job. The right tool can be a Python library or at least a tool that’s
controlled from Python. The number of helpful tools available is enormous, so we’ll
look at only a handful of them.
General programming tips for dealing with large data sets:
The tricks that work in a general programming context still apply for
data science. Several might be worded slightly differently, but the principles are
essentially the same for all programmers. This section recapitulates those tricks that are
important in a data science context.
Don’t reinvent the wheel
Get the most out of your hardware
Reduce your computing needs
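One common reading of the last tip is to process data in chunks through a generator rather than loading everything at once. The sketch below is an assumption about how that could look; `read_in_chunks` is a hypothetical helper and the in-memory `fake_file` stands in for a large file on disk.

```python
import io


def read_in_chunks(lines, chunk_size):
    """Yield lists of at most chunk_size lines so only one chunk is in memory."""
    chunk = []
    for line in lines:
        chunk.append(line.rstrip("\n"))
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # emit the final, possibly shorter, chunk
        yield chunk


# Stand-in for a large file: ten lines holding the numbers 0..9.
fake_file = io.StringIO("\n".join(str(i) for i in range(10)))

total = 0
chunk_sizes = []
for chunk in read_in_chunks(fake_file, chunk_size=4):
    chunk_sizes.append(len(chunk))
    total += sum(int(x) for x in chunk)
```

The same loop works unchanged on a real file object opened with `open(path)`.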
Case study 1: Predicting malicious URLs:
The internet is probably one of the greatest inventions of modern
times. It has boosted humanity’s development, but not everyone uses this great
invention with honorable intentions. Many companies (Google, for one) try to protect
us from fraud by detecting malicious websites for us. Doing so is no easy task, because
the internet has billions of web pages to scan. In this case study we’ll show how to work
with a data set that no longer fits in memory.
Step 1: Defining the research goal
Step 2: Acquiring the URL data
Step 4: Data exploration
Step 5: Model building
Case study 2: Building a recommender system inside a
database:
In reality most of the data you work with is stored in a relational database, but most
databases aren’t suitable for data mining. But as shown in this example, it’s possible to
adapt our techniques so you can do a large part of the analysis inside the database itself,
thereby profiting from the database’s query optimizer, which will optimize the code for
you. In this example we’ll go into how to use the hash table data structure and how to
use Python to control other tools.
Tools and techniques needed
Step 1: Research question
Step 3: Data preparation
Step 5: Model building
Step 6: Presentation and automation
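To make the hash table idea concrete, here is a small sketch using Python dictionaries, which are hash tables. It is not the case study's own method; the rental data and the `also_rented` helper are hypothetical and only show how hashed lookups avoid scanning the whole data set.

```python
from collections import defaultdict

# Hypothetical (customer, movie) rental records.
rentals = [
    ("ann", "Alien"), ("ann", "Blade Runner"),
    ("bob", "Alien"), ("bob", "Blade Runner"), ("bob", "Casablanca"),
    ("cid", "Casablanca"),
]

# Build two hash-table indexes so each lookup is a direct key access.
movie_to_customers = defaultdict(set)
customer_to_movies = defaultdict(set)
for customer, movie in rentals:
    movie_to_customers[movie].add(customer)
    customer_to_movies[customer].add(movie)


def also_rented(movie):
    """Count the other movies rented by customers who rented `movie`."""
    counts = defaultdict(int)
    for customer in movie_to_customers.get(movie, ()):
        for other in customer_to_movies[customer]:
            if other != movie:
                counts[other] += 1
    return dict(counts)


suggestions = also_rented("Alien")
```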
First steps in big data
Distributing data storage and processing with frameworks:
New big data technologies such as Hadoop and Spark make it much
easier to work with and control a cluster of computers. Hadoop can scale up to
thousands of computers, creating a cluster with petabytes of storage. This enables
businesses to grasp the value of the massive amount of data available.
Hadoop: a framework for storing and processing large data sets
Apache Hadoop is a framework that simplifies working with a cluster
of computers. It aims to be all of the following things and more:
Reliable,
Fault Tolerant,
Scalable,
Portable.
An example of a MapReduce flow for counting the colors in an input text:
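The flow can be sketched on a single machine in plain Python, assuming three phases (map, shuffle, reduce) and hypothetical input lines. Hadoop distributes these phases over a cluster; the function names below are illustrative and are not Hadoop APIs.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (color, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one key."""
    return key, sum(values)


# Hypothetical input text, one record per line.
input_lines = ["green red blue", "blue green", "red red blue"]

mapped = [pair for line in input_lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

Because each line is mapped independently and each key is reduced independently, both phases can run in parallel on different machines.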
Spark: replacing MapReduce for better performance
Data scientists often do interactive analysis and rely
on algorithms that are inherently iterative; it can take a while until an algorithm
converges to a solution. Because this is a weak point of the MapReduce framework, we’ll
introduce the Spark framework to overcome it. Spark can improve performance on
such tasks by an order of magnitude.
Case study: Assessing risk when loaning money
Enriched with a basic understanding of Hadoop and Spark, we’re now
ready to get our hands dirty on big data. The goal of this case study is to get a first
experience with the technologies introduced earlier in this chapter, and to see that, for a
large part, you can (but don’t have to) work much as you would with other technologies.
Step 1: The research goal,
Step 2: Data retrieval,
Step 3: Data preparation
Steps 4 & 6: Exploration and report creation
Join the NoSQL movement
Introduction to NoSQL:
As you’ve read, the goal of NoSQL databases isn’t only to offer a
way to partition databases successfully over multiple nodes, but also to present
fundamentally different ways to model the data at hand to fit its structure to its
use case and not to how a relational database requires it to be modeled.
ACID: the core principle of relational databases,
CAP Theorem: the problem with DBs on many nodes,
The BASE principles of NoSQL databases,
NoSQL database types.
ACID: the core principle of relational databases:
The main aspects of a traditional relational database can be
summarized by the concept ACID:
Atomicity,
Consistency,
Isolation,
Durability.
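Atomicity can be demonstrated with Python's built-in `sqlite3` module. The sketch below assumes a hypothetical `accounts` table and `transfer` helper: a transfer consists of two updates, and if the second one violates a constraint, the first is rolled back as well, so the database never shows a half-finished transfer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; both updates succeed or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst))
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src))
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft; the credit was undone too.
        return False


ok = transfer(conn, "alice", "bob", 30)       # valid transfer
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

After the failed transfer, the balances still reflect only the first, successful one.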
CAP Theorem: the problem with DBs on many nodes
Once a database gets spread out over different servers, it’s difficult to
follow the ACID principle because of the consistency ACID promises; the CAP
Theorem points out why this becomes problematic. The CAP Theorem states that a
database can be any two of the following things but never all three:
Partition tolerant,
Available,
Consistent.
The BASE principles of NoSQL databases
RDBMS follows the ACID principles; NoSQL databases that don’t
follow ACID, such as the document stores and key-value stores, follow BASE. BASE is
a set of much softer database promises:
Basically available,
Soft state,
Eventually consistent.
NoSQL database types
As you saw earlier, there are four big NoSQL types: key-value store,
document store, column-oriented database, and graph database. Each type solves a
problem that can’t be solved with relational databases. Actual implementations are often
combinations of these. OrientDB, for example, is a multi-model database, combining
NoSQL types. OrientDB is a graph database where each node is a document.
Normalization,
Many-to-many relationships
Case Study: What disease is that?
Step 1: Setting the research goal,
Steps 2 and 3: Data retrieval and preparation,
Step 4: Data exploration,
Step 3 revisited: Data preparation for disease profiling,
Step 4 revisited: Data exploration for disease profiling,
Step 6: Presentation and automation
Step 1: Setting the research goal
Steps 2 and 3: Data retrieval and preparation
Data retrieval and data preparation are two distinct steps in the data science
process, and even though this remains true for the case study, we’ll explore both in the same section.
This way you can avoid setting up local intermediate storage and immediately do data preparation while
the data is being retrieved.
Step-3: Data Preparation
Step 4: Data exploration
It’s not lupus. It’s never lupus!
Step 3 revisited: Data preparation for disease profiling
Step 4 revisited: Data exploration for disease profiling
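# Assumes `client` (an Elasticsearch client), `indexName`, and `docType`
# were defined earlier in the case study.
# Note: the "filtered" query is older Elasticsearch syntax; later versions
# express the same filter with a "bool" query.
searchBody = {
    "fields": ["name"],
    "query": {
        "filtered": {
            "filter": {
                "term": {"name": "diabetes"}
            }
        }
    },
    "aggregations": {
        # Terms that are unusually frequent in the matching documents
        "DiseaseKeywords": {
            "significant_terms": {"field": "fulltext", "size": 30}
        },
        # The same, computed over the shingles sub-field (hence "Bigrams")
        "DiseaseBigrams": {
            "significant_terms": {"field": "fulltext.shingles", "size": 30}
        }
    }
}
client.search(index=indexName, doc_type=docType,
              body=searchBody, from_=0, size=3)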
Thank you
