2. Big Data
Big Data is similar to small data but bigger in size; because the data is bigger, it requires different approaches, techniques, tools, and architectures, with the aim of solving new problems, or old problems in a better way. Big Data generates value from the storage and processing of very large quantities of digital information that cannot be analyzed with traditional computing techniques.
The challenges include:
• capture
• curation
• storage
• search
• sharing
• transfer
• analysis
• visualization
4. Volume
Big Data indicates huge volumes of data being generated on a daily basis from various sources like social media platforms, business processes, machines, networks, human interactions, etc. Such a large amount of data is stored in data warehouses.
Data volume is increasing exponentially:
◦ 44x increase from 2009 to 2020
◦ From 0.8 zettabytes to 35 ZB
•A typical PC might have had 10 gigabytes of storage in 2000.
•Today, Facebook ingests 500 terabytes of new data every day.
•A Boeing 737 generates 240 terabytes of flight data during a single flight across the US.
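As a quick sanity check, the quoted 44x growth factor follows directly from the two volumes above:

```python
# Growth in global data volume quoted above: 0.8 ZB (2009) to 35 ZB (2020).
zb_2009 = 0.8
zb_2020 = 35.0
growth = zb_2020 / zb_2009   # 43.75, i.e. roughly the quoted 44x
print(f"growth factor: {growth:.2f}x")
```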
5. EarthScope
EarthScope is the world's largest science project. Designed to track North America's geological evolution, this observatory records data over 3.8 million square miles, amassing 67 terabytes of data. It analyzes seismic slips in the San Andreas fault, sure, but also the plume of magma underneath.
6. Velocity
The term velocity refers to the speed at which data is generated and how fast it must be processed to meet demand.
Big Data velocity deals with the speed at which data flows in from sources like business processes, applications, networks, social media, sensors, and mobile devices. The flow of data is massive and continuous.
•Data is being generated fast and needs to be processed fast
•Online data analytics
•Late decisions mean missed opportunities
Examples:
E-Promotions: Based on your current location and your purchase history, send promotions right now for the store next to you.
Healthcare monitoring: sensors monitor your activities and body; any abnormal measurements require an immediate reaction.
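The healthcare-monitoring example can be sketched as a single pass over a continuous stream of readings, reacting to each record as it arrives. The reading format and the normal range used below are illustrative assumptions, not part of any real monitoring system:

```python
# Velocity sketch: process sensor readings one at a time as they stream in,
# and react immediately to any abnormal measurement.

def monitor(readings, low=60, high=100):
    """Yield an alert for every heart-rate reading outside the normal range."""
    for timestamp, bpm in readings:
        if bpm < low or bpm > high:
            yield (timestamp, bpm)   # abnormal measurement -> immediate reaction

stream = [(1, 72), (2, 75), (3, 130), (4, 74)]   # made-up readings
alerts = list(monitor(stream))
print(alerts)  # [(3, 130)]
```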
7. Variety
Variety of Big Data refers to structured, unstructured, and semi-structured data that is gathered from multiple sources. While in the past data could only be collected from spreadsheets and databases, today data comes in an array of forms such as emails, PDFs, photos, video, audio, social media posts, and much more.
•Relational Data (Tables/Transaction/Legacy Data)
•Text Data (Web)
•Semi-structured Data (XML)
•Graph Data
◦ Social Network, Semantic Web (RDF), …
• Streaming Data
◦ You can only scan the data once
•Big Public Data (online, weather, finance, etc.)
8. Veracity
Big Data veracity refers to the biases, noise, and abnormality in data: is the data being stored and mined actually meaningful to the problem being analyzed? Veracity is often defined as the quality or trustworthiness of the data you collect. Many practitioners feel that veracity is the biggest challenge in data analysis when compared to concerns like volume and velocity.
It is important to consider how accurate the data you collect and analyze is. In this sense, when it comes to big data, quality is always preferred over quantity. To focus on quality, it is important to set metrics around what type of data you collect and from what sources.
9. The Model Has Changed…
The Model of Generating/Consuming Data has Changed
Old Model: a few companies generate data; all others consume data.
New Model: all of us generate data, and all of us consume data.
10. The Types of Big Data
•Structured: most traditional data sources
•Semi-structured: many sources of big data
•Unstructured: video data, audio data
11. Types
Structured: By structured data, we mean data that can be processed, stored, and retrieved in a fixed format. For instance, the employee table in a company database is structured: the employee details, job positions, salaries, etc., are present in an organized manner.
Unstructured: Unstructured data refers to data that lacks any specific form or structure whatsoever. This makes it very difficult and time-consuming to process and analyze. Email is an example of unstructured data.
Semi-structured: Semi-structured data combines aspects of both formats mentioned above. To be precise, it refers to data that has not been classified under a particular repository (database), yet contains vital information or tags that segregate individual elements within the data.
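The three types can be illustrated with toy values; every name and record below is made up for the sake of the sketch:

```python
import json

# Structured: fixed format, like a row in an employee table.
employee = ("E001", "Asha", "Engineer", 75000)

# Semi-structured: no fixed schema, but tags (here JSON keys) segregate
# individual elements within the data.
record = json.loads('{"id": "E002", "name": "Ravi", "skills": ["SQL", "Python"]}')

# Unstructured: free text with no structure at all, e.g. the body of an email.
email_body = "Hi team, the report is attached. Let me know if anything is missing."

print(record["skills"])  # ['SQL', 'Python']
```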
12. Storing Big Data
Analyzing your data characteristics
Selecting data sources for analysis
Eliminating redundant data
Establishing the role of NoSQL
Overview of Big Data stores
Data models: key value, graph, document, column-family
Hadoop Distributed File System
HBase
Hive
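A rough sketch of how the four data models listed above would shape the same user record, using plain Python structures as stand-ins for real stores (all keys and values are hypothetical):

```python
# Key-value: an opaque value looked up by key.
kv = {"user:42": "Asha"}

# Document: a self-describing nested record keyed by id.
doc = {"user:42": {"name": "Asha", "tags": ["admin", "ops"]}}

# Column-family: rows hold sparse sets of columns grouped into families.
colfam = {"user:42": {"info": {"name": "Asha"},
                      "activity": {"last_login": "2024-01-01"}}}

# Graph: nodes plus edges (adjacency list), e.g. a social network.
graph = {"Asha": ["Ravi"], "Ravi": ["Asha", "Meena"]}

print(doc["user:42"]["name"])  # Asha
```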
13. Processing Big Data
Integrating disparate data stores
• Mapping data to the programming framework
• Connecting and extracting data from storage
• Transforming data for processing
• Subdividing data in preparation for Hadoop
MapReduce
Employing Hadoop MapReduce
• Creating the components of Hadoop MapReduce jobs
• Distributing data processing across server farms
• Executing Hadoop MapReduce jobs
• Monitoring the progress of job flows
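The map, shuffle, and reduce phases can be sketched in miniature with the classic word-count example. A real Hadoop job distributes these phases across a server farm; this sketch runs everything in one process:

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts for each word."""
    return {key: sum(values) for key, values in grouped.items()}

lines = ["big data big ideas", "big jobs"]
pairs = [p for line in lines for p in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 3, 'data': 1, 'ideas': 1, 'jobs': 1}
```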
17. Benefits
•Real-time big data isn't just a process for storing petabytes or exabytes of data in a data warehouse; it's about the ability to make better decisions and take meaningful actions at the right time.
•Fast forward to the present, and technologies like Hadoop give you the scale and flexibility to store data before you know how you are going to process it.
•Technologies such as MapReduce, Hive, and Impala enable you to run queries without changing the data structures underneath.
•Organizations are using big data to target customer-centric outcomes, tap into internal data, and build a better information ecosystem.
19. Introduction
Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.
The basic premise of machine learning is to build algorithms that can receive input data and use statistical analysis to predict an output, while updating outputs as new data becomes available.
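That premise can be illustrated with a deliberately minimal stand-in for a model: a running mean that predicts the next value and updates itself as each new observation arrives. The class name and interface are illustrative assumptions, not a real library:

```python
class RunningMeanPredictor:
    """Toy 'model': predicts the mean of everything seen so far."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value):
        """Learn from a new observation as it becomes available."""
        self.total += value
        self.count += 1

    def predict(self):
        """Predict an output from the data received so far."""
        return self.total / self.count

model = RunningMeanPredictor()
for observation in [10, 20, 30]:
    model.update(observation)
print(model.predict())  # 20.0
```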
21. How Machine Learning Works
Machine learning algorithms are often categorized as supervised, unsupervised, or reinforcement learning.
Supervised learning is learning in which we teach or train the machine using data that is well labeled, meaning some of the data is already tagged with the correct answer. After that, the machine is provided with a new set of examples (data) so that the supervised learning algorithm analyses the training data (the set of training examples) and produces a correct outcome from the labeled data.
Example: suppose you are given a basket filled with different kinds of fruit. The first step is to train the machine on each of the different fruits, one by one, like this:
22. Example
If the shape of the object is rounded with a depression at the top and its color is red, it will be labelled as Apple.
If the shape of the object is a long curving cylinder with a green-yellow color, it will be labelled as Banana.
23. Example
Now suppose that, after training on this data, you are given a new, separate fruit from the basket (say, a banana) and asked to identify it.
Since the machine has already learned from the previous data, it can use that knowledge: it will first classify the fruit by its shape and color, confirm the fruit name as BANANA, and put it in the Banana category.
Supervised learning is classified into two categories of algorithms:
Classification: A classification problem is when the output variable is a category, such as “red” or “blue”, or “disease” and “no disease”.
Regression: A regression problem is when the output variable is a real value, such as “dollars” or “weight”.
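The fruit example can be sketched as a 1-nearest-neighbour classifier trained on labeled examples. The numeric feature encoding (roundness, redness) is an assumption made purely for illustration:

```python
# Labeled training data: (roundness, redness) features -> fruit label.
training = [
    ((1.0, 0.9), "Apple"),
    ((0.9, 0.8), "Apple"),
    ((0.1, 0.1), "Banana"),   # long, green-yellow
    ((0.2, 0.2), "Banana"),
]

def classify(features):
    """Label a new fruit by its closest labeled training example."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(training, key=lambda ex: dist(ex[0], features))[1]

print(classify((0.15, 0.1)))   # Banana
print(classify((0.95, 0.85)))  # Apple
```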
24. Unsupervised Learning
Unsupervised learning is the training of a machine using information that is neither classified nor labeled, allowing the algorithm to act on that information without guidance. Unlike supervised learning, no labeled training data is given to the machine; the machine must find the hidden structure in the unlabeled data by itself.
Example: suppose the machine is given an image containing both dogs and cats that it has never seen before.
25. Example
The machine has no idea about the features of dogs and cats, so it cannot categorize the images as “dogs” and “cats”. But it can categorize them according to their similarities, patterns, and differences: it can split the pictures into two parts, where the first part may contain all the pictures having dogs in them and the second part all the pictures having cats. Here the machine learned nothing beforehand, meaning there is no training data or labeled examples.
Unsupervised learning is classified into two categories of algorithms:
Clustering: A clustering problem is where you want to discover the inherent groupings in the data, such as grouping customers by purchasing behavior.
Association: An association rule learning problem is where you want to discover rules that describe large portions of your data, such as people that buy X also tend to buy Y.
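Clustering can be sketched with a minimal 1-D k-means (k = 2) that groups unlabeled points by similarity, in the spirit of the dogs-and-cats example; the data points are made up:

```python
def kmeans_1d(points, iters=10):
    """Group 1-D points into two clusters around learned centers."""
    centers = [min(points), max(points)]           # naive initialisation
    for _ in range(iters):
        clusters = [[], []]
        for p in points:                           # assign each point to the
            idx = 0 if abs(p - centers[0]) <= abs(p - centers[1]) else 1
            clusters[idx].append(p)                # nearest center
        centers = [sum(c) / len(c) for c in clusters]   # recompute centers
    return clusters

data = [1.0, 1.2, 0.8, 9.0, 9.5, 8.7]
print(kmeans_1d(data))  # [[1.0, 1.2, 0.8], [9.0, 9.5, 8.7]]
```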
26. Reinforcement Learning
This area of machine learning involves models iterating over many attempts to complete a process.
Steps that produce favorable outcomes are rewarded and steps that produce undesired
outcomes are penalized until the algorithm learns the optimal process.
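The reward/penalty loop can be sketched with a toy agent that repeatedly tries actions, reinforces the rewarded ones, and converges on the optimal step. The actions, rewards, exploration rate, and learning rate below are all hypothetical:

```python
import random

rewards = {"left": -1, "right": +1}    # "right" is the optimal action
values = {"left": 0.0, "right": 0.0}   # learned value of each action
random.seed(0)

for _ in range(100):
    # Explore occasionally; otherwise exploit the best-known action.
    if random.random() < 0.2:
        action = random.choice(["left", "right"])
    else:
        action = max(values, key=values.get)
    # Reward favorable steps, penalize undesired ones (learning rate 0.1).
    values[action] += 0.1 * (rewards[action] - values[action])

best = max(values, key=values.get)
print(best)  # right
```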
27. Examples of Machine Learning
Facebook news feeds
Machine learning is being used in a wide range of applications today. One of the most well-known
examples is Facebook's News Feed. The News Feed uses machine learning to personalize each
member's feed. If a member frequently stops scrolling to read or like a particular friend's posts, the
News Feed will start to show more of that friend's activity earlier in the feed. Behind the scenes,
the software is simply using statistical analysis and predictive analytics to identify patterns in the
user's data and use those patterns to populate the News Feed. Should the member no longer stop
to read, like or comment on the friend's posts, that new data will be included in the data set and
the News Feed will adjust accordingly.
28. Examples of ML
Self-driving cars
Machine learning also plays an important role in self-driving cars. Deep learning neural networks
are used to identify objects and determine optimal actions for safely steering a vehicle down the
road.
29. Types of Machine Learning Algorithms
Here are a few of the most commonly used models:
Decision trees. These models use observations about certain actions and identify an optimal
path for arriving at a desired outcome.
K-means clustering. This model groups a specified number of data points into a specific number
of groupings based on like characteristics.
Neural networks. These deep learning models utilize large amounts of training data to identify
correlations between many variables to learn to process incoming data in the future.
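As a minimal illustration of the "neural networks" entry, a single perceptron (one neuron) can learn the AND function from labeled points; real deep learning models stack many such units and need far more data:

```python
def train_perceptron(samples, epochs=10, lr=0.1):
    """Train one neuron: adjust weights toward the labeled targets."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            out = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            err = target - out          # perceptron update rule
            w[0] += lr * err * x1
            w[1] += lr * err * x2
            b += lr * err
    return w, b

and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(and_data)
predict = lambda x1, x2: 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
print([predict(x1, x2) for (x1, x2), _ in and_data])  # [0, 0, 0, 1]
```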
30. Future of Machine Learning
Current machine learning (ML) algorithms identify statistical regularities in complex data sets and
are regularly used across a range of application domains, but they lack the robustness and
generalizability associated with human learning. If ML techniques could enable computers to
learn from fewer examples, transfer knowledge between tasks, and adapt to changing contexts
and environments, the results would have very broad scientific and societal impacts.
31. How Are Big Data and Machine Learning Related?
Machine learning (ML) is based on algorithms that can learn from data without relying on rules-based programming. Big data is the type of data that may be supplied to the analytical system so that an ML model can ‘learn’ (in other words, improve the accuracy of its predictions).
A quick example: preventive machinery maintenance. We use big data from sensors (temperature, humidity, pressure, and vibration readings for each machinery part, arriving every second) to train, test, and retrain an ML model. The role of the model is to identify hidden patterns that lead to machinery failure and to check newly incoming data against the identified patterns. As a final step, the analytical system may trigger alerts to the maintenance team if the model identifies a match with a pre-failure condition pattern.
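The final alerting step of that example can be sketched as a simple pattern check over incoming readings; the thresholds below stand in for patterns a trained ML model would have identified, and are purely hypothetical:

```python
# Assumed learned pre-failure pattern: high temperature AND high vibration.
PRE_FAILURE = {"temperature": 90.0, "vibration": 5.0}

def check(reading):
    """Return True (alert) if a reading matches the pre-failure pattern."""
    return (reading["temperature"] > PRE_FAILURE["temperature"]
            and reading["vibration"] > PRE_FAILURE["vibration"])

incoming = [
    {"temperature": 70.0, "vibration": 1.2},
    {"temperature": 95.0, "vibration": 6.3},   # matches the pattern -> alert
]
alerts = [r for r in incoming if check(r)]
print(len(alerts))  # 1
```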
A good way to think about the relationship between Big Data and Machine Learning is that the data is the raw material that feeds the machine learning process. The tangible benefit to a business is derived from the predictive models that come out at the end of the process, not from the data used to construct them.