Big Data World
Hossein Zahed
www.hzahed.com
www.linkedin.com/in/hosseinzahed
Table of Contents
• Definitions
• Big Data 3V's
• Internet Stats
• Applications & Examples
• Data Science Areas
• Identities and Skills
• Data Work Flow
• Challenges
• Data Generation
• Data Structure
• Cloud Service Providers
• Hadoop Ecosystem
• Data Visualization
• Data Analytics Methods
• Data Trends
• Programming Languages
• NoSQL Databases
• Interesting Facts
• Interesting Insights
• Data Sources
• Keywords & Glossary
• References
Big Data - Definitions
1. The first documented use of the term “big data” appeared in a 1997 paper
by scientists at NASA, describing the problem they had with visualization
(i.e. computer graphics) which “provides an interesting challenge for
computer systems: data sets are generally quite large, taxing the capacities
of main memory, local disk, and even remote disk. We call this the problem
of big data. When data sets do not fit in main memory (in core), or when
they do not fit even on local disk, the most common solution is to acquire
more resources.” (NASA)
2. Data of a very large size, typically to the extent that its manipulation and
management present significant logistical challenges. (Oxford English
Dictionary)
3. Big data usually includes data sets with sizes beyond the ability of
commonly used software tools to capture, curate, manage, and process
data within a tolerable elapsed time. (Wikipedia)
Big Data – Every 60 Seconds on the Internet
Big Data – Basic 3V’s
Big Data: Extremely large data
sets that may be analyzed
computationally to reveal
patterns, trends, and
associations, especially relating to
human behavior and interactions.
(Google)
Velocity
Variety
Volume
Big Data – Basic 3V’s
1. Volume: Huge amount of data (Terabytes of Records, Transactions,
Tables, Files)
2. Velocity: High rate of data and information flowing into and out of
our systems (Batch, Real-time, Streams, Near-time)
3. Variety: Complexity, thousands or more features per data item
(Structured, Unstructured, Semi-Structured)
Big Data – More V’s
• Veracity: Accuracy and uncertainty of data
• Validity: Data quality, clean/unclean data
• Variability: Constantly changing/dynamic data
• Value: The potential business value/ROI of data
• Venue: Distributed, heterogeneous data from multiple platforms
• Vocabulary: Schema, data models, semantics, ontologies,
taxonomies, context based
• Vagueness: Confusion over the meaning of data
• Visibility: Open/Secure data
• Visualization: Presentation of data in a readable and accessible way
Big Data – Moore’s Law
Moore’s law: the number of transistors on a chip, and with it roughly the capacity and performance of computers, doubles about every two years.
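The doubling rule is easy to turn into arithmetic. The sketch below assumes an idealized, perfectly steady doubling period of two years; the function name `growth_factor` is illustrative and does not come from the slides.

```python
def growth_factor(years: float, doubling_period: float = 2.0) -> float:
    """Return how many times capacity multiplies after `years` of steady doubling."""
    return 2 ** (years / doubling_period)


# Compare a few horizons under the idealized two-year doubling assumption.
for years in (2, 10, 20):
    print(f"after {years:>2} years: x{growth_factor(years):,.0f}")
```

Under this assumption a decade gives a 32-fold increase and two decades a 1,024-fold increase, which is why storage and processing limits that look fixed today tend not to stay fixed.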
Big Data – Gartner’s Emerging Technology (2015)
Big Data – Internet Stats
• Data volumes are exploding: more data has been created in the past two
years than in the entire previous history of the human race.
• Data is growing faster than ever before and by the year 2020, about 1.7
megabytes of new information will be created every second for every human
being on the planet.
• By then, our accumulated digital universe of data will grow from 4.4
zettabytes (1 zettabyte = 10^21 bytes) today to around 44 zettabytes, or 44 trillion gigabytes.
• Every second we create new data. For example, we perform 40,000 search
queries every second (on Google alone), which makes it 3.5 billion searches
per day and 1.2 trillion searches per year.
• In Aug 2015, over 1 billion people used Facebook in a single day.
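The search and storage figures above can be cross-checked with a few lines of arithmetic. This is only a sanity check, assuming decimal (SI) units and a 365-day year; the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# 40,000 Google searches per second, scaled up to a day and a year.
queries_per_second = 40_000
queries_per_day = queries_per_second * SECONDS_PER_DAY
queries_per_year = queries_per_day * 365

# 44 zettabytes expressed in gigabytes (SI: 1 ZB = 10^21 bytes, 1 GB = 10^9 bytes).
ZETTABYTE = 10**21
GIGABYTE = 10**9
gigabytes_in_44_zb = 44 * ZETTABYTE // GIGABYTE

print(f"{queries_per_day:,} searches per day")
print(f"{queries_per_year:,} searches per year")
print(f"{gigabytes_in_44_zb:,} GB in 44 ZB")
```

The results come out at roughly 3.5 billion searches per day, a little over 1.2 trillion per year, and 44 trillion gigabytes, consistent with the rounded numbers on the slide.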
Big Data – Internet Stats – Continued
• Facebook users send on average 31.25 million messages and view 2.77
million videos every minute.
• We are seeing a massive growth in video and photo data, where every
minute up to 300 hours of video are uploaded to YouTube alone.
• In 2015, a staggering 1 trillion photos will be taken and billions of them will be
shared online. By 2017, nearly 80% of photos will be taken on smart phones.
• This year, over 1.4 billion smart phones will be shipped – all packed with
sensors capable of collecting all kinds of data, not to mention the data the
users create themselves.
• By 2020, we will have over 6.1 billion smartphone users globally (overtaking
basic fixed phone subscriptions).
Big Data – Internet Stats – Continued
• Within five years there will be over 50 billion smart connected devices in the world,
all developed to collect, analyze and share data.
• By 2020, at least a third of all data will pass through the cloud (a network of servers
connected over the Internet).
• Distributed computing (performing computing tasks using a network of computers in
the cloud) is very real. Google uses it every day to involve about
1,000 computers in answering a single search query, which takes no more than 0.2
seconds to complete.
• The Hadoop (open source software for distributed computing) market is forecast to
grow at a compound annual growth rate of 58%, surpassing $1 billion by 2020.
• Estimates suggest that by better integrating big data, healthcare could save as much
as $300 billion a year — that’s equal to reducing costs by $1000 a year for every
man, woman, and child.
Big Data – Internet Stats – Continued
• The White House has already invested more than $200 million in big data projects.
• For a typical Fortune 1000 company, just a 10% increase in data accessibility will
result in more than $65 million additional net income.
• Retailers who leverage the full power of big data could increase their operating
margins by as much as 60%.
• 73% of organizations have already invested or plan to invest in big data by 2016
• Favorite fact: At the moment less than 0.5% of all data is ever analyzed and used,
just imagine the potential here.
• More stats: http://www.internetlivestats.com
Big Data – Consumer Applications
• Google Search!
• iPhone Siri
• Microsoft Cortana
• Amazon Suggestions
• Spotify Suggestions
• Yelp Recommendations
• Netflix Recommendations
• Google Now!
Big Data – Business Applications
• Google Ads Searches: Showing relevant ads to users
• Predictive Marketing: consumer behavior, users demographic info
• Banking: Fraud detection, risk reporting, customer data analysis
• Financial: Stocks prediction, Forex
• Fraud Detection: spam filtering, online payments
• Health: self-aware medics, sports analysis, genomics, health records
• Smart Cities: IoT, transportation, traffic, governance, energy, economy
• Social Media: friends, topics, videos recommendations
• Education: LMS tracks & logs, time spent on subjects
Big Data – Research Applications
• Google Trends: Flu, Zika & Ebola virus, racial justice, supporting refugees
and migrant crisis
• National Institutes of Health: BRAIN Initiative (Brain Research through Advancing Innovative Neurotechnologies) to create a
full map of brain functionalities
• NASA: Kepler space telescope searching for exoplanets (planets outside of
our solar system)
• Facebook Graphs: Revealing relationships, six-degrees of separation,
psychological and personality data
• Google Books: Ngram Viewer, History of words, their usage, different
meanings
Big Data – Example 1 – UPS Post
• Insight: Optimize the routing again, predict the
maintenance requirements of vehicles.
• System: ORION database: engine performance,
speed, number of stops, mileage, miles per gallon,
GPS, driver behavior, safety habits, emissions,
fuel consumption, deliveries, customers,
addresses, routes. 250 million+ data points.
• Analysis: Advanced mathematical models that
provide additional optimization and navigational
capabilities to make drivers more efficient.
• Result: Saved over 39 million gallons of fuel,
avoided 364 million miles, reduced engine idle
time by 10 million minutes.
Big Data – Example 2 – Walmart
• Insight: Customers stock up on certain products in
the days leading up to predicted hurricanes.
• System: The Retail Link system records sales, triggers
reordering, scheduling, and delivery. Back-office
scanners track shipments. Partners use RFID
technology to track and coordinate inventories. Data
includes daily sales, shipments, returns, purchase
orders, invoices.
• Analysis: Mines data to get its product mix right
under all sorts of varying environmental conditions.
• Result: Revenues greater than any firm in the US.
RFID boosted sales 20%. Gillette increased sales
19%.
Big Data – Example 3 – Fraud at eBay
• Insight: Fraud spikes mid-week, enabling
fraudsters to receive goods by the weekend. Basic
fraud pattern= long-distance, high-dollar,
expedited shipping.
• System: Names, email, addresses, device
fingerprinting, IP address, geolocation lookups,
time zones, countries in Oracle database of 1.3
billion entries.
• Analysis: Run transactions against 600 rules and 20-plus
machine learning algorithms. Regularly tweak
the fraud rules.
• Result: In 2014, prevented $55-million worth of
fraudulent transactions.
Big Data – Example 4 – Kaiser Permanente
• Insight: Kaiser Permanente:
HealthConnect exchanges data across all facilities,
promotes electronic records. Improved outcomes in
cardiovascular disease and saved $1 billion from
reduced office visits and lab tests.
• System: Pharmaceutical companies have
aggregated years of research and development data
into medical databases, payors and providers have
digitized patient records, public stakeholders have
opened data from clinical trials. 4 billion petabytes.
• Analysis: Determine whether standard protocol for a
disease produces optimal results.
• Result: $300 billion to $450 billion in reduced
health-care spending.
Big Data vs Small Data
Aspect | Small Data | Big Data
Goals | Have a specific goal | May have a goal
Location | On a single computer | In the cloud (multiple servers)
Structure | Highly structured | Semi-structured/unstructured
File Types | SQL, Excel | Documents, multimedia, graphs, tables
Data Preparation | Prepared by one user | Prepared, analyzed, and used by different groups of users
Longevity | Short time period | Continues for a long time
Measurements | Single unit (cm) | Multiple units (cm, inch,…)
Reproducibility | Usually reproducible | Rarely reproducible
Lost Costs | Limited | Huge amount
Introspection | Clear meaning | Complex meaning, meaningless
Analysis | Can be analyzed at once | Needs an analysis procedure
Big Data – Data Science Venn Diagram
Big Data – Professional Identities
• Data Developer: Developer, Engineer
• Data Researcher: Researcher, Scientist, Statistician
• Data Creative: Jack of All Trades, Artist, Hacker
• Data Businessperson: Leader, Businessperson, Entrepreneur
Big Data – Five Skill Groups
• Business: Product Development, Business
• ML / Big Data: Unstructured Data, Structured Data, Machine Learning, Big and Distributed Data
• Math / OR: Optimization, Math, Graphical Models, Bayesian / Monte Carlo Statistics, Algorithms, Simulation
• Programming: System Administration, Back-End Programming, Front-End Programming
• Statistics: Visualization, Temporal Statistics, Surveys and Marketing, Spatial Statistics, Science, Data Manipulation, Classical Statistics
Big Data – Crossed Identities and Skills
Big Data – Scientific Data
• Genetic Data (1V): High Volume of data in a structured way
• Earthquake Prediction (1V): High Velocity of data, almost real-time
• Facial Recognition (1V): High Variety of data
• Jet Engine Sensors (2Vs): High Volume + High Velocity (20TB/hour
data)
• Surveillance Video (2Vs): High Velocity + High Variety of data
streaming
• Google Books (2Vs): High Volume + High Variety of data (30 Million
books)
Big Data – Data Science Work Flow
Big Data – Common Challenges
• Anonymity: danger of de-anonymizing public data, social network graphs,
medical data,…
• Confidentiality: trying to protect data and access levels, storing unimportant
data and its responsibility
• Data Quality: Nearly 95% of spreadsheets have errors
• Incomplete or corrupted data
• Duplicate records
• Typographical errors
• Data without context/missing context
• Incomplete transformations
• Data conversion errors
Big Data – Security Challenges
• Secure computations in distributed programming frameworks
• Security best practices for non-relational data stores
• Secure data storage and transaction logs
• End-point input validation/filtering
• Real-time security/compliance monitoring
• Scalable and composable privacy-preserving data mining and analytics
• Cryptographically enforced access control and secure communication
• Granular access control
• Granular audits
• Data provenance
Big Data – Human Generated Data
• Intentional Data: Chats, photos, videos, comments, likes, web
searches, emails, cell phone call, text messages, online purchases,…
• Meta Data: Data about data, second order data
• Photo metadata taken by cameras
• Cell phones time and location
• Emails To, From, CC, BCC
• Social network connections
• Twitter collects 150 pieces of metadata for each tweet
Big Data – iPhone 4S Photo EXIF Metadata
ExifTool Version Number : 8.68
File Name : IMG_1031.JPG
Directory : .
File Size : 3.1 MB
File Modification Date/Time : 2011:10:05 01:43:44-07:00
File Permissions : rw-r--r--
File Type : JPEG
MIME Type : image/jpeg
Exif Byte Order : Big-endian (Motorola, MM)
Make : Apple
Camera Model Name : iPhone 4S
Orientation : Rotate 180
X Resolution : 72
Y Resolution : 72
Resolution Unit : inches
Software : 5.0
Modify Date : 2011:08:24 13:13:33
Y Cb Cr Positioning : Centered
Exposure Time : 1/286
F Number : 2.4
Exposure Program : Program AE
ISO : 64
Exif Version : 0221
Date/Time Original : 2011:08:24 13:13:33
Create Date : 2011:08:24 13:13:33
Components Configuration : Y, Cb, Cr, -
Shutter Speed Value : 1/286
Aperture Value : 2.4
Brightness Value : 6.992671928
Metering Mode : Multi-segment
Flash : Auto, Did not fire
Focal Length : 4.3 mm
Subject Area : 1631 1223 881 881
Flashpix Version : 0100
Color Space : sRGB
Exif Image Width : 3264
Exif Image Height : 2448
Sensing Method : One-chip color area
Exposure Mode : Auto
White Balance : Auto
Focal Length In 35mm Format : 35 mm
Scene Capture Type : Standard
Sharpness : Normal
GPS Latitude Ref : North
GPS Longitude Ref : West
GPS Altitude Ref : Above Sea Level
GPS Time Stamp : 21:08:30
GPS Img Direction Ref : True North
GPS Img Direction : 346.4727273
Compression : JPEG (old-style)
Thumbnail Offset : 908
Thumbnail Length : 12311
Image Width : 3264
Image Height : 2448
Encoding Process : Baseline DCT, Huffman coding
Bits Per Sample : 8
Color Components : 3
Y Cb Cr Sub Sampling : YCbCr4:2:0 (2 2)
Aperture : 2.4
GPS Altitude : 1222 m Above Sea Level
GPS Latitude : 37 deg 44' 10.80" N
GPS Longitude : 119 deg 35' 58.80" W
GPS Position : 37 deg 44' 10.80" N, 119 deg 35' 58.80" W
Image Size : 3264x2448
Scale Factor To 35 mm Equivalent : 8.2
Shutter Speed : 1/286
Thumbnail Image : (Binary data 12311 bytes, use -b option to extract)
Circle Of Confusion : 0.004 mm
Field Of View : 54.4 deg
Focal Length : 4.3 mm (35 mm equivalent: 35.0 mm)
Hyperfocal Distance : 2.08 m
Light Value : 11.3
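To show how much a single photo's metadata gives away, the GPS fields above can be converted into map-ready decimal coordinates. The following is a minimal sketch that parses the degree/minute/second strings exactly as they appear in this dump; the helper name `dms_to_decimal` is hypothetical and the code does not read image files itself.

```python
import re


def dms_to_decimal(text: str) -> float:
    """Convert a string like: 37 deg 44' 10.80" N  into signed decimal degrees."""
    match = re.fullmatch(r"""\s*(\d+) deg (\d+)' ([\d.]+)"\s*([NSEW])\s*""", text)
    if match is None:
        raise ValueError(f"unrecognized coordinate: {text!r}")
    degrees, minutes, seconds, hemisphere = match.groups()
    value = int(degrees) + int(minutes) / 60 + float(seconds) / 3600
    # South and West are negative by the usual convention.
    return -value if hemisphere in "SW" else value


latitude = dms_to_decimal("37 deg 44' 10.80\" N")
longitude = dms_to_decimal("119 deg 35' 58.80\" W")
print(round(latitude, 6), round(longitude, 6))
```

Pasting the resulting pair of numbers into any mapping service pinpoints where the picture was taken, which is why photo metadata counts as sensitive second-order data.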
Big Data – Computer Generated Data
• Sources: Cell phones connecting to towers, Satellite radio, GPS
connecting, Wi-Fi connections, Web Crawlers,…
• Internet of Things (IoT): Information collected and transmitted via IoT
devices, Production Lines, Smart Meters, Environmental Monitoring,
Industrial Applications, Infrastructure Management, Energy
Management, Medical and Healthcare Systems, Smart Buildings,…
• Machine to Machine: Server to Server connections, Web Services,
Cloud Computations, Real-Time Analytics, Network Monitoring,
Routing and Switching,…
Big Data – Structured vs. Unstructured Data
Features | Structured Data | Unstructured Data
Representation | Discrete rows and columns | Less defined boundaries and easily addressable
Storage | Relational databases or spreadsheets | Unmanaged file structure
Metadata | Syntax | Semantics
Integration Tools | ETL or ELT | Batch processing or manual data entry that involves codes
Standard | SQL, ADO.NET, ODBC,... | OpenXML, JSON, SMTP, SMS, CSV,...
Databases | MSSQL, Oracle, Excel,… | Hadoop, HDInsight, MongoDB,…
Content | Typically text | Text, images, audio, video, documents
Big Data – Cloud Computing Services
SaaS / DaaS
PaaS
IaaS
Big Data – Cloud Computing Services Continued
• IaaS: Infrastructure as a Service
• Servers, Virtual Machines, Storage, Load Balancers, Firewalls, Network
• PaaS: Platform as a Service
• Web Servers, Databases, Development Tools, Execution Runtime
• SaaS: Software as a Service
• CRM, ERP, Email, Virtual Desktop, Communications, Games
• DaaS: Data as a Service (Free or Commercial)
• Stocks, Forex, Google Map, Reddit, Twitter Demographic Data
Big Data – Cloud Service Providers
• Google Big Data Solutions
• Amazon Public Elastic Cloud
• Microsoft Azure
• OpenStack by Rackspace and NASA
• IBM Big Data Solutions
• Cloudera
• Oracle Cloud Platform
• Hortonworks
• SAP Big Data
Big Data – Cloud Providers Comparison
Big Data – Hadoop
• Apache Hadoop (pronunciation: /həˈduːp/) is an open-source software
framework for distributed storage and distributed processing of very large
data sets on computer clusters built from commodity hardware. All the
modules in Hadoop are designed with a fundamental assumption that
hardware failures are common and should be automatically handled by the
framework. (Wikipedia)
• History: Doug Cutting, Mike Cafarella and team took the solution provided
by Google and started an Open Source Project called HADOOP in 2005 and
Doug named it after his son's toy elephant. Now Apache Hadoop is a
registered trademark of the Apache Software Foundation.
• Hadoop is a Free and Open Source Project
Big Data – Hadoop Ecosystem Architecture
Big Data – Hadoop Components
• HDFS: The Hadoop Distributed File System, used to store files across
many computers
• MapReduce:
• Map splits a task into pieces
• Reduce combines the output
• In Hadoop 2, cluster resource management moved out of MapReduce into YARN; MapReduce 2 now runs on top of YARN
• YARN: Hadoop’s cluster resource manager. It runs batch jobs such as MapReduce and also lets
stream-processing and graph-processing engines share the same cluster
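The map and reduce steps can be illustrated with the classic word count. The sketch below runs in a single Python process and does not use any Hadoop API; the function names and the `shuffle` step (which a real framework performs between the two phases) are illustrative assumptions.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document: str):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


# Two input splits stand in for file blocks stored on different machines.
splits = ["big data big clusters", "data moves to clusters"]
mapped = chain.from_iterable(map_phase(split) for split in splits)
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

Because each split is mapped independently and each key is reduced independently, both phases can be spread across many machines, which is the property Hadoop exploits.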
Big Data – Hadoop Components Continued
• Pig: Writes MapReduce programs, uses the Pig Latin programming
language
• Hive: Summarizes, queries, and analyzes data; uses the HiveQL
query language
• HBase: A NoSQL, not relational, not only SQL database
• Storm: Processing and Streaming data
• Spark: In Memory Processing (HDD to RAM)
• Giraph: Graph Processing for Social Networks data
Big Data – Landscape
http://www.hzahed.com/post/big-data-landscape
Big Data – Microsoft HDInsight
Big Data – Google Big Data Cloud
Big Data – Amazon AWS Big Data
Big Data – Who Uses Hadoop
• Google
• Yahoo!
• LinkedIn
• Facebook
• Quantcast
• Amazon
• IBM
• ISI
• Spotify
• Twitter
• Adobe
• eBay
• Alibaba
• Many others
Big Data – ETL Definition
• ETL: Stands for Extract, Transform, Load
• Extract: The process of pulling data from storage such as a database
• Transform: The process of putting data into a common format
• Load: The process of loading data into software for analysis
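The three ETL steps can be sketched end to end with the Python standard library. Everything in this example is an illustrative assumption: the sample CSV text stands in for the source storage, an in-memory SQLite database stands in for the analysis software, and the table and column names are made up.

```python
import csv
import io
import sqlite3

# Hypothetical raw export with inconsistent spacing, casing, and number formats.
RAW_CSV = """name,signup_date,spend
 Alice ,2015-01-03,10.50
BOB,2015-02-14,7
"""


def extract(text):
    """Extract: pull raw rows out of the source (here, CSV text)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: put every row into one common format."""
    return [
        (row["name"].strip().title(), row["signup_date"], float(row["spend"]))
        for row in rows
    ]


def load(records):
    """Load: write the cleaned records into a database ready for analysis."""
    connection = sqlite3.connect(":memory:")
    connection.execute(
        "CREATE TABLE customers (name TEXT, signup_date TEXT, spend REAL)"
    )
    connection.executemany("INSERT INTO customers VALUES (?, ?, ?)", records)
    return connection


connection = load(transform(extract(RAW_CSV)))
rows = connection.execute(
    "SELECT name, spend FROM customers ORDER BY name"
).fetchall()
print(rows)
```

Keeping the three stages as separate functions makes it easy to inspect the data between steps, for example by printing the output of `transform` before loading it.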
Extract Transform Load
Big Data – ETL in Hadoop
• ETL in Hadoop works differently from common databases
• Data starts and ends in Hadoop
• Hadoop can handle different formats
• It doesn’t require as much inspection
• No need to be aware of or worry about ETL processes in Hadoop
• Make it a point to inspect data
Big Data – Monitoring & Anomaly
• Monitoring
• Detects specific events
• Needs specific criteria in advance
• Triggers automatic response
• Anomaly
• Notifies of “unusual activity”
• Based on flexible criteria
• Doesn’t trigger a response
• Instead, invites inspection
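The difference between the two approaches shows up clearly on a small series of readings. This sketch assumes a fixed threshold as the monitoring rule and a simple z-score as the flexible anomaly criterion; the data, the function names, and both cut-off values are made up for illustration.

```python
from statistics import mean, stdev


def monitor(readings, limit):
    """Monitoring: fire on a criterion fixed in advance (reading above `limit`)."""
    return [i for i, value in enumerate(readings) if value > limit]


def anomalies(readings, z_threshold=2.0):
    """Anomaly detection: flag readings far from the typical value for inspection."""
    center, spread = mean(readings), stdev(readings)
    return [
        i
        for i, value in enumerate(readings)
        if spread and abs(value - center) / spread > z_threshold
    ]


# Hypothetical CPU load samples; the sudden drop at index 6 is the odd one out.
cpu_load = [41, 43, 40, 42, 44, 41, 12, 43, 42, 40]
alerts = monitor(cpu_load, limit=90)
unusual = anomalies(cpu_load)
print("monitoring alerts:", alerts)
print("anomalies to inspect:", unusual)
```

The predefined rule never fires because nobody anticipated a drop in load, while the anomaly check flags the unusual reading and leaves it to a person to decide whether it matters.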
Big Data – Visualization – Human vs. Computers
• Computers spot certain patterns
• Computers excel at predictive models
• Computers excel at data mining
• Humans perceive and interpret better
• Human vision still plays an important role
• Humans identify visual patterns
• Humans identify anomalies
• Humans see patterns across groups
• Humans interpret the content of images better
• Humans do better on Gestalt tests
Big Data – Visualization – Gestalt Test
Big Data – Visualization – Best Practices
• Prettier graphs are not always better
• Never use a false third dimension
• Animated and interactive graphs can be distracting
• The goal of data visualization is insight
• Use proper chart formats for visualization
• Choose the right color scheme (Qualitative, Sequential, Diverging)
• Make sure the chart alone can tell your story
Big Data – Microsoft Excel Role
• Excel is the most common data tool
• Millions of people use it and know how to deal with it
• Professional data miners use it
• Excel can do real data science on its own
• ODBC interfaces can connect Excel directly to Hadoop
• Excel is great for sharing data results
• Excel includes interactive PivotTables, Sortable Worksheets, Graphics
and Charts
Big Data – Data Analytics (DA) Methods
• Machine Learning (ML)
• Pattern Recognition (PR)
• Data Mining (DM)
• Natural Language Processing (NLP)
• Information Retrieval (IR)
• Text Mining (TM)
• Predictive Analytics
• Business Intelligence (BI)
• Prescriptive Analytics
Big Data – Machine Learning (ML)
• Definition: Machine Learning (ML) is a subfield of computer science
(more particularly soft computing) that evolved from the study of
pattern recognition and computational learning theory in artificial
intelligence. In 1959, Arthur Samuel defined machine learning as a
"Field of study that gives computers the ability to learn without being
explicitly programmed". (Wikipedia)
• Examples: Recommendations, Classifications, Linear Regression,
Clustering, Neural Networks
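Linear regression is the smallest example of a program that learns from data rather than being told the rule. The sketch below fits a line by ordinary least squares in plain Python; the data set is made up, and the function name `fit_line` is illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


# Made-up training data that follows score = 2 * hours + 50 exactly.
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 56, 58, 60]

slope, intercept = fit_line(hours, scores)
prediction = slope * 6 + intercept  # predict an input the model has not seen
print(slope, intercept, prediction)
```

The relationship between hours and scores is never written into the code; it is recovered from the examples and then applied to a new input.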
Big Data – Pattern Recognition (PR)
• Definition: Pattern Recognition (PR) is a branch of machine learning
that focuses on the recognition of patterns and regularities in data,
although it is in some cases considered to be nearly synonymous with
machine learning. Pattern recognition systems are in many cases
trained from labeled "training" data (supervised learning), but when no
labeled data are available other algorithms can be used to discover
previously unknown patterns (unsupervised learning). (Wikipedia)
• Examples: Face detection, fingerprint verification, screening for
tumors and cancers, shape recognition, navigation systems
Big Data – Data Mining (DM)
• Definition: Data Mining (DM) is an interdisciplinary subfield of
computer science. It is the computational process of discovering
patterns in large data sets involving methods at the intersection of
artificial intelligence, machine learning, statistics, and database
systems. The overall goal of the data mining process is to extract
information from a data set and transform it into an understandable
structure for further use. (Wikipedia)
• Examples: Anomaly Detection, Association Rule Learning, Clustering,
Classification, Regression, Summarization
Big Data – Natural Language Processing (NLP)
• Definition: Natural Language Processing (NLP) is a field of computer
science, artificial intelligence, and computational linguistics concerned
with the interactions between computers and human (natural)
languages. As such, NLP is related to the area of human–computer
interaction. (Wikipedia)
• Examples: Natural language understanding, enabling computers to
derive meaning from human or natural language input; and others
involve natural language generation. (SIRI, Cortana)
Big Data – Information Retrieval (IR)
• Definition: Information Retrieval (IR) is the activity of obtaining
information resources relevant to an information need from a collection
of information resources. Searches can be based on metadata or on full-text (or
other content-based) indexing. (Wikipedia)
• Examples: Automated information retrieval systems are used to
reduce what has been called "information overload". Many universities
and public libraries use IR systems to provide access to books,
journals and other documents. Web search engines (Google & Bing)
are the most visible IR applications.
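Full-text indexing is commonly built on an inverted index, which maps each word to the documents that contain it. The sketch below is a toy version with whitespace tokenization and AND-only queries; the sample titles and the function names are illustrative assumptions, and real search engines add ranking, stemming, and much more.

```python
from collections import defaultdict


def build_index(documents):
    """Map every word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical library catalogue.
library = {
    1: "Introduction to big data",
    2: "Big ideas in statistics",
    3: "Data mining methods",
}
index = build_index(library)
hits = search(index, "big data")
print(hits)
```

Answering a query only touches the index entries for the query words, so lookup time depends on how many documents match rather than on the size of the whole collection.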
Big Data – Text Mining (TM)
• Definition: Text Mining (TM) also referred to as text data mining, roughly
equivalent to text analytics, refers to the process of deriving high-quality
information from text. High-quality information is typically derived through the
devising of patterns and trends through means such as statistical pattern
learning. Text mining usually involves the process of structuring the input text
(usually parsing, along with the addition of some derived linguistic features
and the removal of others, and subsequent insertion into a database),
deriving patterns within the structured data, and finally evaluation and
interpretation of the output. (Wikipedia)
• Examples: Enterprise Business Intelligence/Data Mining, Competitive
Intelligence, National Security/Intelligence, Publishing, Social Media
Monitoring, Search/Information Access, Natural Language/Semantic Toolkit
or Service, Sentiment Analysis Tools, Listening Platforms
Big Data – Predictive Analytics
• Definition: Predictive Analytics encompasses a variety of statistical
techniques from predictive modeling, machine learning, and data mining that
analyze current and historical facts to make predictions about future or
otherwise unknown events. In business, predictive models exploit patterns
found in historical and transactional data to identify risks and opportunities.
Models capture relationships among many factors to allow assessment of
risk or potential associated with a particular set of conditions, guiding
decision making for candidate transactions. (Wikipedia)
• Examples: Actuarial Science, Marketing, Financial Services, Insurance,
Telecommunications, Retail, Travel, Healthcare, Child Protection,
Pharmaceuticals, Capacity Planning
Big Data – Business Intelligence (BI)
• Definition: Business Intelligence (BI) can be described as "a set of
techniques and tools for the acquisition and transformation of raw data into
meaningful and useful information for business analysis purposes". The term
"data surfacing" is also more often associated with BI functionality. BI
technologies are capable of handling large amounts of unstructured data to
help identify, develop and otherwise create new strategic business
opportunities. The goal of BI is to allow for the easy interpretation of these
large volumes of data. Identifying new opportunities and implementing an
effective strategy based on insights can provide businesses with a
competitive market advantage and long-term stability. (Wikipedia)
• Examples: Measurement, Analytics, Enterprise Reporting, Collaboration
Platform, Knowledge management
Big Data – Prescriptive Analytics
• Definition: Prescriptive analytics is the third and final phase of
business analytics (BA) which includes descriptive, predictive and
prescriptive analytics. Predictive analytics answers the question what
will happen. This is when historical performance data is combined with
rules, algorithms, and occasionally external data to determine the
probable future outcome of an event or the likelihood of a situation
occurring. The final phase is prescriptive analytics, which goes beyond
predicting future outcomes by also suggesting actions to benefit from
the predictions and showing the implications of each decision option.
(Wikipedia)
Big Data – Prescriptive Analytics Continued
Big Data – Interest Trends
Big Data – Programming Languages
• MATLAB: 6.3%
• SPSS: 8.1%
• Pig / HiveQL: 8.5%
• Unix shell: 8.8%
• Java: 12.4%
• SQL: 30.6%
• Python: 35.0%
• SAS: 36.4%
• R: 49.0%
Big Data – NoSQL Databases
Database Type | Vendors
Wide Column Store | Hadoop HBase, Cassandra, Hortonworks, Cloudera, Amazon SimpleDB, IBM Informix
Document Store | Elastic, MongoDB, Azure DocumentDB, Terrastore, JSON ODM
Key Value / Tuple Store | Amazon DynamoDB, Azure Table Storage, Oracle NoSQL Database, Genomu
Graph Databases | Neo4J, Infinite Graph, Sparksee, InfoGrid, GraphBase
Multimodel Databases | ArangoDB, OrientDB, RockallDB, FoundationDB
Object Databases | Versant, db4o, Objectivity, Starcounter, Perst, HSS Database, Magma, EyeDB, NDatabase, ObjectDB
Big Data – NoSQL Databases Continued
Database Type | Vendors
Grid & Cloud Database Solutions | Crate Data, Oracle Coherence, GigaSpaces, Infinispan
XML Databases | EMC Documentum xDB, eXist, Sedna, BaseX, QizX, Berkeley DB XML
Multidimensional Databases | Globals, SciDB, MiniM DB, DaggerDB
Multivalue Databases | U2, OpenInsight, Reality, OpenQM, ESENT
Event Sourcing | Event Store, ES4J
Time Series / Streaming Databases | Axibase, Influxdata, kdb+
Other NoSQL Databases | IBM Lotus, eXtremeDB, Yserial, BayesDB, GPUdb, CodernityDB
Big Data – 10 Interesting Facts
1. Every 2 days we create as much information as we did from the beginning
of time until 2003.
2. Over 90% of all the data in the world was created in the past 2 years.
3. It is expected that by 2020 the amount of digital information in existence will
have grown from 3.2 zettabytes today to 40 zettabytes.
4. The total amount of data being captured and stored by industry doubles
every 1.2 years.
5. Every minute we send 204 million emails, generate 1.8 million Facebook
likes, send 278 thousand Tweets, and upload 200 thousand photos to
Facebook.
Big Data – 10 Interesting Facts
6. Google alone processes on average over 40 thousand search queries per
second, making it over 3.5 billion in a single day.
7. Around 100 hours of video are uploaded to YouTube every minute and it
would take you around 15 years to watch every video uploaded by users in
one day.
8. Facebook users share 30 billion pieces of content between them every day.
9. AT&T is thought to hold the world’s largest volume of data in one unique
database – its phone records database is 312 terabytes in size, and
contains almost 2 trillion rows.
10. The amount of data transferred over mobile networks increased by 81% to
1.5 exabytes (1.5 billion gigabytes) per month between 2012 and 2014.
Video accounts for 53% of that total.
Big Data – 10 Interesting Insights
1. “The world is one big data problem.” – Andrew McAfee
2. “In God we trust. All others must bring data.” – W. Edwards Deming
3. “Torture the data, and it will confess to anything.” – Ronald Coase
4. “Information is the oil of the 21st century, and analytics is the
combustion engine.” – Peter Sondergaard
5. “It’s easy to lie with statistics. It’s hard to tell the truth without
statistics.” – Andrejs Dunkels
Big Data – 10 Interesting Insights
6. “The goal is to turn data into information, and information into
insight.” – Carly Fiorina
7. “The most valuable commodity I know of is information.” – Gordon
Gekko
8. “Data really powers everything that we do.” – Jeff Weiner
9. “Numbers have an important story to tell. They rely on you to give
them a voice.” – Stephen Few
10. “Data beats emotions.” – Sean Rad
Big Data – Free Data Sources
• Google Trends: www.google.com/trends/explore
• Google Finance: www.google.com/finance
• Google Freebase: developers.google.com/freebase
• Wikipedia Content: en.wikipedia.org/wiki/Wikipedia:Database_download
• U.S. Government Open Data: www.data.gov
• Quandl: www.quandl.com
• World Health Organization: www.who.int/gho/database/en
• Amazon Public Datasets: aws.amazon.com/datasets
• Facebook Graph: developers.facebook.com/docs/graph-api
• UNICEF: www.unicef.org/statistics/
Big Data – Key Terms & Glossary
• Algorithm
• Analytics Platform
• Apache Hive
• Behavioral Analytics
• Big Data Analytics
• Business Intelligence
• Cascading
• Cloud Computing
• Cluster Analysis
• Comparative Analysis
• Concurrency / Concurrent Computing
• Connection Analytics
• Correlation Analysis
• Data Analyst
• Data Cleansing
• Data Mining
• Data Model / Data Modeling
• Data Warehouse
• Descriptive Analytics
• ETL
• Exabyte
• Hadoop
• Internet of Things (IoT)
• Machine Learning
• Metadata
• Natural Language Processing
• Pattern Recognition
• Petabyte
• Predictive Analytics
• Prescriptive Analytics
• Semi-structured Data
• Sentiment Analysis
• Terabyte
http://bigdata.teradata.com/US/Big-Data-Quick-Start/Glossary
References
• http://www.smartinsights.com/internet-marketing-statistics/happens-online-60-seconds/
• https://www.mapr.com/blog/top-10-big-data-challenges-%E2%80%93-serious-look-10-big-data-v%E2%80%99s
• https://en.wikipedia.org/wiki/Moore%27s_law
• http://www.forbes.com/sites/bernardmarr/2015/09/30/big-data-20-mind-boggling-facts-everyone-must-read/#7f2504de6c1d
• http://www.gartner.com/smarterwithgartner/whats-new-in-gartners-hype-cycle-for-emerging-technologies-2015/
• http://google.org/special-programs/
78
References - Continued
• https://datafloq.com/read/ups-spends-1-billion-big-data-annually/273
• http://joelcadwell.blogspot.de/2016/01/a-data-science-solution-to-question.html
• http://bigdata-madesimple.com/how-i-chose-the-right-programming-language-for-data-science
• http://nosql-database.org
• http://www-01.ibm.com/software/data/bigdata/
• http://www.sequentia.in/why-big-data-matters
• https://www.linkedin.com/pulse/20140502105616-8781298-25-insightful-and-thought-provoking-quotes-about-big-data
79
References - Continued
• http://www.mckinsey.com/insights/health_systems_and_services/the_big-data_revolution_in_us_health_care
• http://www.datanami.com/2015/12/21/tis-the-season-to-hunt-fraudsters-with-big-data
• http://2012books.lardbucket.org/books/getting-the-most-out-of-information-systems-v1.3/s15-07-data-asset-in-action-technolog.html
• http://www.gartner.com
• http://bigdata.teradata.com/US/Big-Data-Quick-Start/Glossary
• https://www.isaca.org/Groups/Professional-English/big-data
• Analyzing the Analyzers Book – Harris, Murphy, Vaisman
80

Big Data World

  • 1. Big Data World Hossein Zahed www.hzahed.com www.linkedin.com/in/hosseinzahed 1
  • 2. Table of Contents • Definitions • Big Data 3V's • Internet Stats • Applications & Examples • Data Science Areas • Identities and Skills • Data Work Flow • Challenges • Data Generation • Data Structure • Cloud Service Providers 2 • Hadoop Ecosystem • Data Visualization • Data Analytics Methods • Data Trends • Programming Languages • NoSQL Databases • Interesting Facts • Interesting Insights • Data Sources • Keywords & Glossary • References
  • 3. Big Data - Definitions 1. The first documented use of the term “big data” appeared in a 1997 paper by scientists at NASA, describing the problem they had with visualization (i.e. computer graphics) which “provides an interesting challenge for computer systems: data sets are generally quite large, taxing the capacities of main memory, local disk, and even remote disk. We call this the problem of big data. When data sets do not fit in main memory (in core), or when they do not fit even on local disk, the most common solution is to acquire more resources.” (NASA) 2. Data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges. (Oxford English Dictionary) 3. Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process data within a tolerable elapsed time. (Wikipedia) 3
  • 4. Big Data – Every 60 Seconds on the Internet 4
  • 5. Big Data – Basic 3V’s Big Data: Extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interactions. (Google) 5 Velocity Variety Volume
  • 6. Big Data – Basic 3V’s 1. Volume: Huge amount of data (Terabytes of Records, Transactions, Tables, Files) 2. Velocity: High rate of data and information flowing into and out of our systems (Batch, Real-time, Streams, Near-time) 3. Variety: Complexity, thousands or more features per data item (Structured, Unstructured, Semi-Structured) 6
  • 7. Big Data – MoreV’s • Veracity: Accuracy and uncertainty of data • Validity: Data quality, clean/unclean data • Variability: Constantly changing/dynamic data • Value: The potential business value/ROI of data • Venue: Distributed, heterogeneous data from multiple platforms • Vocabulary: Schema, data models, semantics, ontologies, taxonomies, context based • Vagueness: Confusion over the meaning of data • Visibility: Open/Secure data • Visualization: Presentation of data in a readable and accessible way 7
  • 8. Big Data – Moore’s Law Physical capacity and performance of computers double about every two years! 8
  • 9. Big Data – Gartner’s EmergingTechnology (2015) 9
  • 10. Big Data – Internet Stats • The data volumes are exploding, more data has been created in the past two years than in the entire previous history of the human race. • Data is growing faster than ever before and by the year 2020, about 1.7 megabytes of new information will be created every second for every human being on the planet. • By then, our accumulated digital universe of data will grow from 4.4 zettabytes (1021) today to around 44 zettabytes, or 44 trillion gigabytes. • Every second we create new data. For example, we perform 40,000 search queries every second (on Google alone), which makes it 3.5 billion searches per day and 1.2 trillion searches per year. • In Aug 2015, over 1 billion people used Facebook FB +2.39% in a single day. 10
  • 11. Big Data – Internet Stats – Continued • Facebook users send on average 31.25 million messages and view 2.77 million videos every minute. • We are seeing a massive growth in video and photo data, where every minute up to 300 hours of video are uploaded to YouTube alone. • In 2015, a staggering 1 trillion photos will be taken and billions of them will be shared online. By 2017, nearly 80% of photos will be taken on smart phones. • This year, over 1.4 billion smart phones will be shipped – all packed with sensors capable of collecting all kinds of data, not to mention the data the users create themselves. • By 2020, we will have over 6.1 billion smartphone users globally (overtaking basic fixed phone subscriptions). 11
  • 12. Internet Stats - Continued • Within five years there will be over 50 billion smart connected devices in the world, all developed to collect, analyze and share data. • By 2020, at least a third of all data will pass through the cloud (a network of servers connected over the Internet). • Distributed computing (performing computing tasks using a network of computers in the cloud) is very real. Google GOOGL +0.63% uses it every day to involve about 1,000 computers in answering a single search query, which takes no more than 0.2 seconds to complete. • The Hadoop (open source software for distributed computing) market is forecast to grow at a compound annual growth rate 58% surpassing $1 billion by 2020. • Estimates suggest that by better integrating big data, healthcare could save as much as $300 billion a year — that’s equal to reducing costs by $1000 a year for every man, woman, and child. 12
  • 13. Internet Stats - Continued • Estimates suggest that by better integrating big data, healthcare could save as much as $300 billion a year — that’s equal to reducing costs by $1000 a year for every man, woman, and child. • The White House has already invested more than $200 million in big data projects. • For a typical Fortune 1000 company, just a 10% increase in data accessibility will result in more than $65 million additional net income. • Retailers who leverage the full power of big data could increase their operating margins by as much as 60%. • 73% of organizations have already invested or plan to invest in big data by 2016 • Favorite fact: At the moment less than 0.5% of all data is ever analyzed and used, just imagine the potential here. • More stats: http://www.internetlivestats.com 13
  • 14. Big Data – Consumer Applications • Google Search! • IPhone Siri • Microsoft Cortana • Amazon Suggestions • Spotify Suggestions • Yelp Recommendations • Netflix Recommendations • Google Now! 14
  • 15. Big Data – Business Applications • Google Ads Searches: Showing relevant ads to users • Predictive Marketing: consumer behavior, users demographic info • Banking: Fraud detection, risk reporting, customer data analysis • Financial: Stocks prediction, Forex • Fraud Detection: spam filtering, online payments • Health: self-aware medics, sports analysis, genomics, health records • Smart Cities: IoT, transportation, traffic, governance, energy, economy • Social Media: friends, topics, videos recommendations • Education: LMS tracks & logs, time spent on subjects 15
  • 16. Big Data – ResearchApplications • Google Trends: Flu, Zika & Ebola virus, racial justice, supporting refugees and migrant crisis • National Institute of Health: Brain Innovative Neurotechnologies to create a full map of brain functionalities • NASA: Kepler space telescope searching for exoplanets/planets out side of our solar system • Facebook Graphs: Revealing relationships, six-degrees of separation, psychological and personality data • Google Books: Ngram Viewer, History of words, their usage, different meanings 16
  • 17. Big Data – Example 1 – UPS Post • Insight: Optimize the routing again, predict the maintenance requirements of vehicles. • System: ORION database: engine performance, speed, number of stops, mileage, miles per gallon, GPS, driver behavior, safety habits, emissions, fuel consumption, deliveries, customers, addresses, routes. 250 million+ data points. • Analysis: Advanced mathematical models that provide additional optimization and navigational capabilities to make drivers more efficient. • Result: Saved over 39 million gallons of fuel, avoided 364 million miles, reduced engine idle time by 10 million minutes. 17
  • 18. Big Data – Example 2 –Walmart • Insight: Customers stock up on certain products in the days leading up to predicted hurricanes. • System: RetailLinksystem records sale, triggers reordering, scheduling, and delivery. Back-office scanners track shipments. Partners use RFID technology to track and coordinate inventories. Data includes daily sales, shipments, returns, purchase orders, invoices. • Analysis: Mines data to get its product mix right under all sorts of varying environmental conditions. • Result: Revenues greater thananyfirm in the US. RFID boosted sales 20%. Gillette increased sales 19%. 18
  • 19. Big Data – Example 3 – Fraud at eBay • Insight: Fraud spikes mid-week, enabling fraudsters to receive goods by the weekend. Basic fraud pattern= long-distance, high-dollar, expedited shipping. • System: Names, email, addresses, device fingerprinting, IP address, geolocation lookups, time zones, countries in Oracle database of 1.3 billion entries. • Analysis: Run transactions against 600 rules, 20- plus machine learning algorithms. Regularly tweak the fraud rules. • Result: In 2014, prevented $55-million worth of fraudulent transactions. 19
  • 20. Big Data – Example 4 – Kaiser Permanente • Insight: Kaiser Permanente: HealthConnect exchanges data across all facilities, promotes electronic records. Improved outcomes in cardiovascular disease and saved $1 billion from reduced office visits and lab tests. • System: Pharmaceutical companies have aggregated years of research and development data into medical databases, payors and providers have digitized patient records, public stakeholders have opened data from clinical trials. 4 billion petabytes. • Analysis: Determine whether standard protocol for a disease produces optimal results. • Result: $300 billion to $450 billion in reduced health-care spending.
  • 21. Big Data vs Small Data • Aspect: Small Data | Big Data • Goals: Have a specific goal | May have a goal • Location: On a single computer | On the cloud (multiple servers) • Structure: Highly structured | Semi-structured/unstructured • File Types: SQL, Excel | Documents, multimedia, graphs, tables • Data Preparation: Prepared by one user | Prepared, analyzed, used by different groups of users • Longevity: Short time period | Continues for a long time • Measurements: Single unit (cm) | Multiple units (cm, inch,…) • Reproducibility: Usually reproducible | Rarely reproducible • Lost Costs: Limited | Huge amount • Introspection: Clear meaning | Complex meaning, meaningless • Analysis: Can be analyzed at once | Needs an analysis procedure
  • 22. Big Data – Data Science Venn Diagram
  • 23. Big Data – Professional Identities • Data Developer: Developer, Engineer • Data Researcher: Researcher, Scientist, Statistician • Data Creative: Jack of All Trades, Artist, Hacker • Data Businessperson: Leader, Businessperson, Entrepreneur
  • 24. Big Data – Five Skill Groups • Business: Product Development, Business • ML / Big Data: Unstructured Data, Structured Data, Machine Learning, Big and Distributed Data • Math / OR: Optimization, Math, Graphical Models, Bayesian / Monte Carlo Statistics, Algorithms, Simulation • Programming: System Administration, Back-End Programming, Front-End Programming • Statistics: Visualization, Temporal Statistics, Surveys and Marketing, Spatial Statistics, Science, Data Manipulation, Classical Statistics
  • 25. Big Data – Crossed Identities and Skills 25
  • 26. Big Data – Scientific Data • Genetic Data (1V): High Volume of data in a structured way • Earthquake Prediction (1V): High Velocity of data, almost real-time • Facial Recognition (1V): High Variety of data • Jet Engine Sensors (2Vs): High Volume + High Velocity (20TB/hour data) • Surveillance Video (2Vs): High Velocity + High Variety of data streaming • Google Books (2Vs): High Volume + High Variety of data (30 Million books) 26
  • 27. Big Data – Data Science Work Flow
  • 28. Big Data – Common Challenges • Anonymity: danger of de-anonymizing public data, social network graphs, medical data,… • Confidentiality: trying to protect data and access levels, storing unimportant data and its responsibility • Data Quality: Nearly 95% of spreadsheets have errors • Incomplete or corrupted data • Duplicate records • Typographical errors • Data without context/missing context • Incomplete transformations • Data conversion errors
  • 29. Big Data – Security Challenges • Secure computations in distributed programming frameworks • Security best practices for non-relational data stores • Secure data storage and transaction logs • End-point input validation/filtering • Real-time security/compliance monitoring • Scalable and composable privacy-preserving data mining and analytics • Cryptographically enforced access control and secure communication • Granular access control • Granular audits • Data provenance
  • 30. Big Data – Human Generated Data • Intentional Data: Chats, photos, videos, comments, likes, web searches, emails, cell phone calls, text messages, online purchases,… • Meta Data: Data about data, second order data • Photo metadata taken by cameras • Cell phone time and location • Email To, From, CC, BCC fields • Social network connectivity • Twitter collects 150 pieces of metadata for each tweet
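The email metadata mentioned above (To, From, CC) can be illustrated with a minimal sketch. The message text and the `extract_metadata` helper are hypothetical and exist only for this illustration; the parsing uses Python's standard `email` module.

```python
from email import message_from_string

# Hypothetical message, used only for illustration.
raw = """From: alice@example.com
To: bob@example.com
Cc: carol@example.com
Subject: Lunch?
Date: Mon, 01 Feb 2016 09:30:00 +0000

See you at noon.
"""

def extract_metadata(raw_message):
    """Return selected header fields (metadata) of an email, ignoring the body."""
    msg = message_from_string(raw_message)
    wanted = ("From", "To", "Cc", "Date")
    return {name: msg[name] for name in wanted if msg[name] is not None}

metadata = extract_metadata(raw)
print(metadata)
```

The body ("See you at noon.") is the intentional data; everything returned by `extract_metadata` is data about that data.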
  • 31. Big Data – iPhone 4S Photo EXIF Metadata • ExifTool Version Number : 8.68 • File Name : IMG_1031.JPG • Directory : . • File Size : 3.1 MB • File Modification Date/Time : 2011:10:05 01:43:44-07:00 • File Permissions : rw-r--r-- • File Type : JPEG • MIME Type : image/jpeg • Exif Byte Order : Big-endian (Motorola, MM) • Make : Apple • Camera Model Name : iPhone 4S • Orientation : Rotate 180 • X Resolution : 72 • Y Resolution : 72 • Resolution Unit : inches • Software : 5.0 • Modify Date : 2011:08:24 13:13:33 • Y Cb Cr Positioning : Centered • Exposure Time : 1/286 • F Number : 2.4 • Exposure Program : Program AE • ISO : 64 • Exif Version : 0221 • Date/Time Original : 2011:08:24 13:13:33 • Create Date : 2011:08:24 13:13:33 • Components Configuration : Y, Cb, Cr, - • Shutter Speed Value : 1/286 • Aperture Value : 2.4 • Brightness Value : 6.992671928 • Metering Mode : Multi-segment • Flash : Auto, Did not fire • Focal Length : 4.3 mm • Subject Area : 1631 1223 881 881 • Flashpix Version : 0100 • Color Space : sRGB • Exif Image Width : 3264 • Exif Image Height : 2448 • Sensing Method : One-chip color area • Exposure Mode : Auto • White Balance : Auto • Focal Length In 35mm Format : 35 mm • Scene Capture Type : Standard • Sharpness : Normal • GPS Latitude Ref : North • GPS Longitude Ref : West • GPS Altitude Ref : Above Sea Level • GPS Time Stamp : 21:08:30 • GPS Img Direction Ref : True North • GPS Img Direction : 346.4727273 • Compression : JPEG (old-style) • Thumbnail Offset : 908 • Thumbnail Length : 12311 • Image Width : 3264 • Image Height : 2448 • Encoding Process : Baseline DCT, Huffman coding • Bits Per Sample : 8 • Color Components : 3 • Y Cb Cr Sub Sampling : YCbCr4:2:0 (2 2) • Aperture : 2.4 • GPS Altitude : 1222 m Above Sea Level • GPS Latitude : 37 deg 44' 10.80" N • GPS Longitude : 119 deg 35' 58.80" W • GPS Position : 37 deg 44' 10.80" N, 119 deg 35' 58.80" W • Image Size : 3264x2448 • Scale Factor To 35 mm Equivalent : 8.2 • Shutter Speed : 1/286 • Thumbnail Image : (Binary data 12311 bytes, use -b option to extract) • Circle Of Confusion : 0.004 mm • Field Of View : 54.4 deg • Focal Length : 4.3 mm (35 mm equivalent: 35.0 mm) • Hyperfocal Distance : 2.08 m • Light Value : 11.3
  • 32. Big Data – Computer Generated Data • Sources: Cell phones connecting to towers, Satellite radio, GPS connecting, Wi-Fi connections, Web Crawlers,… • Internet of Things (IoT): Information collected and transmitted via IoT devices, Production Lines, Smart Meters, Environmental Monitoring, Industrial Applications, Infrastructure Management, Energy Management, Medical and Healthcare Systems, Smart Buildings,… • Machine to Machine: Server to Server connections, Web Services, Cloud Computations, Real-Time Analytics, Network Monitoring, Routing and Switching,…
  • 33. Big Data – Structured vs. Unstructured Data 33
  • 34. Big Data – Structured vs. Unstructured Data • Features: Structured Data | Unstructured Data • Representation: Discrete rows and columns | Less defined boundaries and easily addressable • Storage: Relational Databases or Spreadsheets | Unmanaged file structures • Metadata: Syntax | Semantics • Integration Tools: ETL or ELT | Batch processing or manual data entry that involves codes • Standard: SQL, ADO.NET, ODBC,... | OpenXML, JSON, SMTP, SMS, CSV,... • Databases: MSSQL, Oracle, Excel,… | Hadoop, HDInsight, MongoDB,… • Content: Typically Text | Text, Images, Audio, Video, Documents
  • 35. Big Data – Cloud Computing Services 35 SaaS / DaaS PaaS IaaS
  • 36. Big Data – Cloud Computing Services Continued • IaaS: Infrastructure as a Service • Servers, Virtual Machines, Storage, Load Balancers, Firewalls, Network • PaaS: Platform as a Service • Web Servers, Databases, Development Tools, Execution Runtime • SaaS: Software as a Service • CRM, ERP, Email, Virtual Desktop, Communications, Games • DaaS: Data as a Service (Free or Commercial) • Stocks, Forex, Google Map, Reddit, Twitter Demographic Data 36
  • 37. Big Data – Cloud Service Providers • Google Big Data Solutions • Amazon Public Elastic Cloud • Microsoft Azure • OpenStack by Rackspace and NASA • IBM Big Data Solutions • Cloudera • Oracle Cloud Platform • Hortonworks • SAP Big Data 37
  • 38. Big Data – Cloud Providers Comparison 38
  • 39. Big Data – Hadoop • Apache Hadoop (pronunciation: /həˈduːp/) is an open-source software framework for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common and should be automatically handled by the framework. (Wikipedia) • History: Doug Cutting, Mike Cafarella and team took the solution provided by Google and started an Open Source Project called HADOOP in 2005 and Doug named it after his son's toy elephant. Now Apache Hadoop is a registered trademark of the Apache Software Foundation. • Hadoop is a Free and Open Source Project 39
  • 40. Big Data – Hadoop Ecosystem Architecture
  • 41. Big Data – Hadoop Components • HDFS: The Hadoop Distributed File System, used to store files across many computers • MapReduce: • Map splits a task into pieces • Reduce combines the output • In Hadoop 2, its resource management and job scheduling moved to YARN (MapReduce running on YARN is known as MapReduce 2) • YARN: Cluster resource manager; lets Hadoop run batch processing (MapReduce) as well as stream processing and graph processing engines
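The map and reduce steps described above can be sketched on a single machine with a word count, the customary introductory example. This is a simplified illustration in plain Python, not Hadoop's API; the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are assumptions made for this sketch.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    # Each document stands in for a split that a separate node could process.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data big insight", "data beats emotions"])
print(counts)
```

In a real cluster the map calls run in parallel on different machines and the grouping step moves data across the network; the sketch only shows the data flow.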
  • 42. Big Data – Hadoop Components Continued • Pig: Writes MapReduce programs, uses the Pig Latin programming language • Hive: Summarizes, queries, and analyzes data; uses the HiveQL query language • HBase: A NoSQL (non-relational, "not only SQL") database • Storm: Processing of streaming data • Spark: In-Memory Processing (HDD to RAM) • Giraph: Graph Processing for Social Networks data
  • 43. Big Data – Landscape 43 http://www.hzahed.com/post/big-data-landscape
  • 44. Big Data – Microsoft HDInsight 44
  • 45. Big Data – Google Big Data Cloud 45
  • 46. Big Data – Amazon AWS Big Data 46
  • 47. Big Data – Who Uses Hadoop • Google • Yahoo! • LinkedIn • Facebook • Quantcast • Amazon • IBM • ISI • Spotify • Twitter • Adobe • eBay • Alibaba • Many others
  • 48. Big Data – ETL Definition • ETL: Stands for Extract, Transform, Load • Extract: The process of pulling data from storage such as a database • Transform: The process of putting data into a common format • Load: The process of loading data into software for analysis
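The three ETL steps defined above can be sketched with Python's standard library. The sample CSV text, the `sales` table, and the function names are hypothetical; an in-memory SQLite database stands in for the analysis software.

```python
import csv
import io
import sqlite3

# Hypothetical source data; in practice this would come from a file or database.
SOURCE = "date,store,amount\n2016-01-05,NYC,19.99\n2016-01-05,sf,5\n"

def extract(text):
    """Extract: pull raw rows out of the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: put every row into a common format (upper-case store, numeric amount)."""
    return [(r["date"], r["store"].upper(), float(r["amount"])) for r in rows]

def load(records, conn):
    """Load: write the cleaned records where they can be analyzed."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (date TEXT, store TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(SOURCE)), conn)
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
stores = [row[0] for row in conn.execute("SELECT store FROM sales ORDER BY store")]
print(total, stores)
```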
  • 49. Big Data – ETL in Hadoop • ETL in Hadoop works differently from common databases • Data starts and ends in Hadoop • Hadoop can handle different formats • It doesn’t require as much inspection • No need to be aware of or worry about ETL processes in Hadoop • Make it a point to inspect data 49
  • 50. Big Data – Monitoring & Anomaly • Monitoring • Detects specific events • Needs specific criteria in advance • Triggers automatic response • Anomaly • Notifies of “unusual activity” • Based on flexible criteria • Doesn’t trigger a response • Instead, invites inspection
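The contrast above can be sketched as follows: monitoring checks a rule fixed in advance, while anomaly detection flags values that are unusual relative to the data itself. The z-score cutoff used here is one simple choice among many, and the readings, function names, and thresholds are assumptions made for illustration.

```python
from statistics import mean, stdev

def monitor(readings, limit):
    """Monitoring: flag readings that break a rule fixed in advance."""
    return [i for i, x in enumerate(readings) if x > limit]

def find_anomalies(readings, z_cutoff=2.0):
    """Anomaly detection: flag readings far from the typical value of this data."""
    mu, sigma = mean(readings), stdev(readings)
    if sigma == 0:
        return []  # all readings identical, nothing is unusual
    return [i for i, x in enumerate(readings) if abs(x - mu) / sigma > z_cutoff]

# Hypothetical sensor readings; the value 20 is low but breaks no upper limit.
readings = [50, 52, 49, 51, 50, 48, 51, 20, 50, 49]
alerts = monitor(readings, limit=90)
unusual = find_anomalies(readings)
print(alerts, unusual)
```

The fixed rule raises no alert because no reading exceeds the limit, while the anomaly check points at the reading of 20 for a person to inspect.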
  • 51. Big Data – Visualization – Human vs. Computers • Computers spot certain patterns • Computers excel at predictive models • Computers excel at data mining • Humans perceive and interpret better • Human vision still plays an important role • Humans identify visual patterns • Humans identify anomalies • Humans see patterns across groups • Humans interpret content of images better • Humans perform better on Gestalt tests
  • 52. Big Data – Visualization – Gestalt Test
  • 53. Big Data – Visualization – Best Practices • Prettier graphs are not always better • Never use a false third dimension • Animated and interactive graphs can be distracting • The goal of data visualization is insight • Use proper chart formats for visualization • Choose the right color scheme (Qualitative, Sequential, Diverging) • Make sure the chart alone can tell your story
  • 54. Big Data – Microsoft Excel Role • Excel is the most common data tool • Millions of people use it and know how to deal with it • Professional data miners use it • Excel can do real data science on its own • ODBC interfaces can connect Excel directly to Hadoop • Excel is great for sharing data results • Excel includes interactive PivotTables, Sortable Worksheets, Graphics and Charts 54
  • 55. Big Data – Data Analytics (DA) Methods • Machine Learning (ML) • Pattern Recognition (PR) • Data Mining (DM) • Natural Language Processing (NLP) • Information Retrieval (IR) • Text Mining (TM) • Predictive Analytics • Business Intelligence (BI) • Prescriptive Analytics 55
  • 56. Big Data – Machine Learning (ML) • Definition: Machine Learning (ML) is a subfield of computer science (more particularly soft computing) that evolved from the study of pattern recognition and computational learning theory in artificial intelligence. In 1959, Arthur Samuel defined machine learning as a "Field of study that gives computers the ability to learn without being explicitly programmed". (Wikipedia) • Examples: Recommendations, Classification, Linear Regression, Clustering, Neural Networks
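As a small illustration of one of the examples listed above, linear regression, the following sketch fits a straight line to data by ordinary least squares and uses it to predict a new value. The data points and the `fit_line` helper are hypothetical; real projects would normally use a statistics or machine learning library.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data that follows y = 2x + 1 exactly.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept = fit_line(xs, ys)
prediction = slope * 6 + intercept  # predict y for an unseen x
print(slope, intercept, prediction)
```

The program is never told the rule "y = 2x + 1"; it recovers the slope and intercept from the examples, which is the sense of "learning" in Samuel's definition.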
  • 57. Big Data – Pattern Recognition (PR) • Definition: Pattern Recognition (PR) is a branch of machine learning that focuses on the recognition of patterns and regularities in data, although it is in some cases considered to be nearly synonymous with machine learning. Pattern recognition systems are in many cases trained from labeled "training" data (supervised learning), but when no labeled data are available other algorithms can be used to discover previously unknown patterns (unsupervised learning). (Wikipedia) • Examples: Face detection, fingerprint verification, screening for tumors and cancers, shape recognition, navigation systems 57
  • 58. Big Data – Data Mining (DM) • Definition: Data Mining (DM) is an interdisciplinary subfield of computer science. It is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. (Wikipedia) • Examples: Anomaly Detection, Association Rule Learning, Clustering, Classification, Regression, Summarization 58
  • 59. Big Data – Natural Language Processing (NLP) • Definition: Natural Language Processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human–computer interaction. (Wikipedia) • Examples: Natural language understanding, enabling computers to derive meaning from human or natural language input; and others involve natural language generation. (SIRI, Cortana) 59
  • 60. Big Data – Information Retrieval (IR) • Definition: Information Retrieval (IR) is the activity of obtaining information resources relevant to an information need from a collection of information resources. Searches can be based on metadata or on full-text (or other content-based) indexing. (Wikipedia) • Examples: Automated information retrieval systems are used to reduce what has been called "information overload". Many universities and public libraries use IR systems to provide access to books, journals and other documents. Web search engines (Google & Bing) are the most visible IR applications.
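Full-text indexing, mentioned in the definition above, is commonly built on an inverted index that maps each term to the documents containing it. The sketch below is a minimal version assuming whitespace tokenization and AND semantics for multi-word queries; the documents and function names are hypothetical.

```python
from collections import defaultdict

def build_index(documents):
    """Map every term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical three-document collection.
docs = {1: "big data analytics", 2: "data mining methods", 3: "big ideas"}
index = build_index(docs)
print(search(index, "big data"), search(index, "data"))
```

Looking up a term costs one dictionary access regardless of collection size, which is why search engines index documents ahead of time instead of scanning them per query.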
  • 61. Big Data – Text Mining (TM) • Definition: Text Mining (TM), also referred to as text data mining and roughly equivalent to text analytics, refers to the process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning. Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output. (Wikipedia) • Examples: Enterprise Business Intelligence/Data Mining, Competitive Intelligence, National Security/Intelligence, Publishing, Social Media Monitoring, Search/Information Access, Natural Language/Semantic Toolkit or Service, Sentiment Analysis Tools, Listening Platforms
  • 62. Big Data – Predictive Analytics • Definition: Predictive Analytics encompasses a variety of statistical techniques from predictive modeling, machine learning, and data mining that analyze current and historical facts to make predictions about future or otherwise unknown events. In business, predictive models exploit patterns found in historical and transactional data to identify risks and opportunities. Models capture relationships among many factors to allow assessment of risk or potential associated with a particular set of conditions, guiding decision making for candidate transactions. (Wikipedia) • Examples: Actuarial Science, Marketing, Financial Services, Insurance, Telecommunications, Retail, Travel, Healthcare, Child Protection, Pharmaceuticals, Capacity Planning 62
  • 63. Big Data – Business Intelligence (BI) • Definition: Business Intelligence (BI) can be described as "a set of techniques and tools for the acquisition and transformation of raw data into meaningful and useful information for business analysis purposes". The term "data surfacing" is also more often associated with BI functionality. BI technologies are capable of handling large amounts of unstructured data to help identify, develop and otherwise create new strategic business opportunities. The goal of BI is to allow for the easy interpretation of these large volumes of data. Identifying new opportunities and implementing an effective strategy based on insights can provide businesses with a competitive market advantage and long-term stability. (Wikipedia) • Examples: Measurement, Analytics, Enterprise Reporting, Collaboration Platform, Knowledge management 63
  • 64. Big Data – Prescriptive Analytics • Definition: Prescriptive analytics is the third and final phase of business analytics (BA) which includes descriptive, predictive and prescriptive analytics. Predictive analytics answers the question what will happen. This is when historical performance data is combined with rules, algorithms, and occasionally external data to determine the probable future outcome of an event or the likelihood of a situation occurring. The final phase is prescriptive analytics, which goes beyond predicting future outcomes by also suggesting actions to benefit from the predictions and showing the implications of each decision option. (Wikipedia) 64
  • 65. Big Data – Prescriptive Analytics Continued 65
  • 66. Big Data – Prescriptive Analytics Continued 66
  • 67. Big Data – Interest Trends
  • 68. Big Data – Interest Trends
  • 69. Big Data – Programming Languages • MATLAB: 6.3% • SPSS: 8.1% • Pig / HiveQL: 8.5% • Unix Shell: 8.8% • Java: 12.4% • SQL: 30.6% • Python: 35.0% • SAS: 36.4% • R: 49.0%
  • 70. Big Data – NoSQL Databases • Database Type: Vendors • Wide Column Store: Hadoop HBase, Cassandra, Hortonworks, Cloudera, Amazon SimpleDB, IBM Informix • Document Store: Elastic, MongoDB, Azure DocumentDB, Terrastore, JSON ODM • Key Value / Tuple Store: Amazon DynamoDB, Azure Table Storage, Oracle NoSQL Database, Genomu • Graph Databases: Neo4J, Infinite Graph, Sparksee, InfoGrid, GraphBase • Multimodel Databases: ArangoDB, OrientDB, RockallDB, FoundationDB • Object Databases: Versant, db4o, Objectivity, Starcounter, Perst, HSS Database, Magma, EyeDB, NDatabase, ObjectDB
  • 71. Big Data – NoSQL Databases Continued • Database Type: Vendors • Grid & Cloud Database Solutions: Crate Data, Oracle Coherence, GigaSpaces, Infinispan • XML Databases: EMC Documentum xDB, eXist, Sedna, BaseX, QizX, Berkeley DB XML • Multidimensional Databases: Globals, SciDB, MiniM DB, DaggerDB • Multivalue Databases: U2, OpenInsight, Reality, OpenQM, ESENT • Event Sourcing: Event Store, ES4J • Time Series / Streaming Databases: Axibase, Influxdata, kdb+ • Other NoSQL Databases: IBM Lotus, eXtremeDB, Yserial, BayesDB, GPUdb, CodernityDB
  • 72. Big Data – 10 Interesting Facts 1. Every 2 days we create as much information as we did from the beginning of time until 2003. 2. Over 90% of all the data in the world was created in the past 2 years. 3. It is expected that by 2020 the amount of digital information in existence will have grown from 3.2 zettabytes today to 40 zettabytes. 4. The total amount of data being captured and stored by industry doubles every 1.2 years. 5. Every minute we send 204 million emails, generate 1.8 million Facebook likes, send 278 thousand Tweets, and upload 200 thousand photos to Facebook. 72
  • 73. Big Data – 10 Interesting Facts 6. Google alone processes on average over 40 thousand search queries per second, making it over 3.5 billion in a single day. 7. Around 100 hours of video are uploaded to YouTube every minute and it would take you around 15 years to watch every video uploaded by users in one day. 8. Facebook users share 30 billion pieces of content between them every day. 9. AT&T is thought to hold the world’s largest volume of data in one unique database – its phone records database is 312 terabytes in size, and contains almost 2 trillion rows. 10. The amount of data transferred over mobile networks increased by 81% to 1.5 exabytes (1.5 billion gigabytes) per month between 2012 and 2014. Video accounts for 53% of that total.
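The figures in facts 6 and 7 can be checked with simple arithmetic. The sketch below assumes exactly 40,000 queries per second and 100 hours of video per minute, and a 365-day year of non-stop viewing; the slide gives "over" and "around" values, so only the order of magnitude is being compared.

```python
SECONDS_PER_DAY = 24 * 60 * 60
MINUTES_PER_DAY = 24 * 60
HOURS_PER_YEAR = 24 * 365

# Fact 6: queries per second scaled to one day.
searches_per_day = 40_000 * SECONDS_PER_DAY

# Fact 7: hours of video uploaded in one day, and years needed to watch it non-stop.
video_hours_per_day = 100 * MINUTES_PER_DAY
years_to_watch = video_hours_per_day / HOURS_PER_YEAR

print(searches_per_day, video_hours_per_day, round(years_to_watch, 1))
```

Under these assumptions the daily search count comes to about 3.46 billion and the viewing time to a little over 16 years, both of the same order as the rounded figures quoted on the slide.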
  • 74. Big Data – 10 Interesting Insights 1. “The world is one big data problem.” – Andrew McAfee 2. “In God we trust. All others must bring data.” – W. Edwards Deming 3. “Torture the data, and it will confess to anything.” – Ronald Coase 4. “Information is the oil of the 21st century, and analytics is the combustion engine.” - Peter Sondergaard 5. “It’s easy to lie with statistics. It’s hard to tell the truth without statistics.” – Andrejs Dunkels 74
  • 75. Big Data – 10 Interesting Insights 6. “The goal is to turn data into information, and information into insight.” – Carly Fiorina 7. “The most valuable commodity I know of is information.” – Gordon Gekko 8. “Data really powers everything that we do.” – Jeff Weiner 9. “Numbers have an important story to tell. They rely on you to give them a voice.” – Stephen Few 10. “Data beats emotions.” – Sean Rad
  • 76. Big Data – Free Data Sources • Google Trends: www.google.com/trends/explore • Google Finance: www.google.com/finance • Google Freebase: developers.google.com/freebase • Wikipedia Content: en.wikipedia.org/wiki/Wikipedia:Database_download • U.S. Government Open Data: www.data.gov • Quandl: www.quandl.com • World Health Organization: www.who.int/gho/database/en • Amazon Public Datasets: aws.amazon.com/datasets • Facebook Graph: developers.facebook.com/docs/graph-api • UNICEF: www.unicef.org/statistics/ 76
  • 77. Big Data – Key Terms & Glossary • Algorithm • Analytics Platform • Apache Hive • Behavioral Analytics • Big Data Analytics • Business Intelligence • Cascading • Cloud Computing • Concurrency / Concurrent computing • Cluster Analysis • Comparative Analysis • Connection Analytics • Correlation Analysis • Data Analyst • Data Cleansing • Data Mining • Data Model / Data Modeling • Data Warehouse • Descriptive Analytics • ETL • Hadoop • Exabyte • Internet of Things (IoT) • Machine Learning • Metadata • Natural Language Processing • Pattern Recognition • Petabyte • Predictive Analytics • Prescriptive Analytics • Semi-structured Data • Sentiment Analysis • Terabyte • Source: http://bigdata.teradata.com/US/Big-Data-Quick-Start/Glossary
  • 78. References • http://www.smartinsights.com/internet-marketing-statistics/happens-online-60-seconds/ • https://www.mapr.com/blog/top-10-big-data-challenges-%E2%80%93-serious-look-10-big-data-v%E2%80%99s • https://en.wikipedia.org/wiki/Moore%27s_law • http://www.forbes.com/sites/bernardmarr/2015/09/30/big-data-20-mind-boggling-facts-everyone-must-read/#7f2504de6c1d • http://www.gartner.com/smarterwithgartner/whats-new-in-gartners-hype-cycle-for-emerging-technologies-2015/ • http://google.org/special-programs/
  • 79. References - Continued • https://datafloq.com/read/ups-spends-1-billion-big-data-annually/273 • http://joelcadwell.blogspot.de/2016/01/a-data-science-solution-to-question.html • http://bigdata-madesimple.com/how-i-chose-the-right-programming-language-for-data-science • http://nosql-database.org • http://www-01.ibm.com/software/data/bigdata/ • http://www.sequentia.in/why-big-data-matters • https://www.linkedin.com/pulse/20140502105616-8781298-25-insightful-and-thought-provoking-quotes-about-big-data
  • 80. References - Continued • http://www.mckinsey.com/insights/health_systems_and_services/the_big-data_revolution_in_us_health_care • http://www.datanami.com/2015/12/21/tis-the-season-to-hunt-fraudsters-with-big-data • http://2012books.lardbucket.org/books/getting-the-most-out-of-information-systems-v1.3/s15-07-data-asset-in-action-technolog.html • http://www.gartner.com • http://bigdata.teradata.com/US/Big-Data-Quick-Start/Glossary • https://www.isaca.org/Groups/Professional-English/big-data • Analyzing the Analyzers Book – Harris, Murphy, Vaisman