Introduction to Data Science
1. An Introduction to Data Science
Anoop V.S
Ph.D Research Scholar
Data Engineering Lab
Indian Institute of Information Technology and Management - Kerala (IIITM-K)
Thiruvananthapuram, India
anoop.res15@iiitmk.ac.in
March 10, 2017
Anoop V.S Introduction to Data Science March 10, 2017 1 / 48
3. Why should you attend this talk ?
Companies have recognized the immense business value that data can
deliver. This has created a huge demand for skilled professionals in
data-related jobs around the world.
Job profiles such as Data Scientist, Data Analyst, Big Data Engineer
and Statistician are heavily sought after by companies. Not only are
they handsomely paid, but a career in analytics has much more to
promise.
After the U.S., India has the largest demand for analytics / big data /
data science professionals. Amid such demand, people find it
confusing to select the job profile that offers the best future.
4. How much can a Data Science Professional earn ?
5. Which cities offer high salaries ?
6. Data Scientist - the SEXIEST JOB OF THE 21st CENTURY !
Requires a mixture of multidisciplinary skills at the intersection of
mathematics, statistics, computer science, communication and
business.
Finding a Data Scientist is hard !
Finding people who understand who a Data Scientist is, is equally
hard !!
The trend is expected to accelerate in the coming years as data from
mobile sensors, sophisticated instruments, the web, and more, grows.
It is predicted that in 2020, the world will generate 50 times the
amount of data it generated in 2011.
7. What skills are needed ?
8. So, what really is Data Science ?
Asking questions (formulating hypotheses) whose answers solve
known problems or unearth unknown solutions, which in turn drive
business value.
Defining the data needed, or working with an existing data set, and
employing (computer science based) tools to collect, store and explore
such data, generally in huge volume and variety.
Identifying the type of analysis needed to get to the answers, and
performing that analysis by implementing various algorithms/tools,
often on a distributed and parallel architecture.
Communicating the insights gathered from the analysis in the form of
simple stories/visualizations/dashboards that a non-data scientist can
understand and build a conversation around.
Building a higher-level abstraction that performs steps 2-3-4
autonomously, analyzing and acting on new data as it is fed to the
system.
9. Summing up in an image
10. Leading by an example
Two of the most famous companies in the world use analytics and Big
Data to shape their products, services and delivery - Amazon and Facebook.
Amazon uses analytics to curate products on its customers’
homepages based on their previous purchases and browsing habits.
Facebook uses analytics to fill your news feed with updates from
people you interact with the most, content from sites you frequent
and products you have checked out on other sites.
11. Types of analytics
Descriptive Analytics, which uses data aggregation and data mining
to provide insight into the past and answer: "What has happened?"
Predictive Analytics, which uses statistical models and forecasting
techniques to understand the future and answer: "What could
happen?"
Prescriptive Analytics, which uses optimization and simulation
algorithms to advise on possible outcomes and answer: "What
should we do?"
12. Descriptive Analytics: Insight into the past
Descriptive analytics, or statistics, does exactly what the name
implies: it describes, or summarizes, raw data and makes it
interpretable by humans.
These are analytics that describe the past. The past refers to any
point in time at which an event has occurred, whether it is one minute
ago or one year ago.
Descriptive analytics are useful because they allow us to learn from
past behaviors, and understand how they might influence future
outcomes.
Common examples of descriptive analytics are reports that provide
historical insights regarding the company’s production, financials,
operations, sales, inventory and customers.
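As a minimal sketch of what "summarizing raw data" can look like in practice, the snippet below reduces a handful of monthly sales figures to a few descriptive statistics. The sales numbers and the `describe` helper are made up for illustration; they are not from the talk.

```python
from statistics import mean, median

# Hypothetical monthly sales figures (units sold); illustrative only.
monthly_sales = {"Jan": 120, "Feb": 135, "Mar": 128, "Apr": 150, "May": 142, "Jun": 165}

def describe(values):
    """Summarize raw numbers into a few human-readable statistics."""
    values = list(values)
    return {
        "count": len(values),
        "total": sum(values),
        "mean": mean(values),
        "median": median(values),
        "min": min(values),
        "max": max(values),
    }

summary = describe(monthly_sales.values())
# The month with the highest sales answers one "what has happened?" question.
best_month = max(monthly_sales, key=monthly_sales.get)

print(summary)
print("Best month:", best_month)
```

A report built this way only looks backwards: it says nothing about next month, which is where predictive analytics comes in.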
13. Predictive Analytics: Understanding the future
Predictive analytics has its roots in the ability to "predict" what
might happen.
Predictive analytics provides companies with actionable insights based
on data.
It is important to remember that no statistical algorithm can predict
the future with 100% certainty, because the foundation of predictive
analytics is probability. Companies use these statistics to forecast
what might happen in the future.
Predictive analytics can be used throughout the organization, from
forecasting customer behavior and purchasing patterns to identifying
trends in sales activities.
14. Prescriptive Analytics: Advise on possible outcomes
The relatively new field of prescriptive analytics allows users to
"prescribe" a number of different possible actions and guides them
towards a solution.
At its best, prescriptive analytics predicts not only what will
happen, but also why it will happen, providing recommendations
regarding actions that will take advantage of the predictions.
Prescriptive analytics uses a combination of techniques and tools such
as business rules, algorithms, machine learning and computational
modelling procedures. These techniques are applied against input
from many different data sets including historical and transactional
data, real-time data feeds, and big data.
15. Now into some basics - What is Data / Information / Knowledge ?
Data is unprocessed facts and figures without any added
interpretation or analysis. "The price of crude oil is $80 per barrel."
Information is data that has been interpreted so that it has meaning
for the user. "The price of crude oil has risen from $70 to $80 per
barrel" gives meaning to the data and so is said to be information to
someone who tracks oil prices.
Knowledge is a combination of information, experience and insight
that may benefit the individual or the organisation. "When crude oil
prices go up by $10 per barrel, it’s likely that petrol prices will rise by
Rs. 20 per litre" is knowledge.
16. Relationship of Data, Information and Intelligence
17. Categories of Data - A quick view
Structured Data covers all data that can be stored in an SQL
database, in tables with rows and columns. It has relational keys and
can be easily mapped into pre-designed fields. Today, such data is
the most commonly processed in development and the simplest kind of
information to manage.
Semistructured Data doesn’t reside in a relational database but
does have some organizational properties that make it easier to
analyze. With some processing, it can be stored in a relational database.
Unstructured Data is often estimated at around 80% of all data. It often includes
text and multimedia content. Examples include e-mail messages,
word processing documents, videos, photos, audio files, presentations,
webpages and many other kinds of business documents.
Unstructured data is everywhere. In fact, most individuals and
organizations conduct their lives around unstructured data
18. Big Data - in recent News
22. Did you know ? "90% of the world’s data was generated in the last few years" !!!
Big data really does mean big data: it is a collection of large datasets
that cannot be processed using traditional computing techniques.
Big data is not merely data; rather, it has become a complete
subject, which involves various tools, techniques and frameworks.
What comes under Big Data ?
Black Box Data
Social Media Data
Stock Exchange Data
Power Grid Data
Transport Data
Search Engine Data etc.
23. 3Vs of Big Data
Volume - Organizations collect data from a variety of sources,
including business transactions, social media and information from
sensor or machine-to-machine data. In the past, storing it would have
been a problem, but new technologies (such as Hadoop) have eased
the burden.
Velocity - Data streams in at an unprecedented speed and must be
dealt with in a timely manner. RFID tags, sensors and smart metering
are driving the need to deal with torrents of data in near-real time.
Variety - Data comes in all types of formats, from structured, numeric
data in traditional databases to unstructured text documents, email,
video, audio, stock ticker data and financial transactions.
24. Who uses Big Data ?
Banking - it’s important to understand customers and boost their
satisfaction, and it’s equally important to minimize risk and fraud while
maintaining regulatory compliance. Big data brings big insights, but
it also requires financial institutions to stay one step ahead of the
game with advanced analytics.
Education - Educators armed with data-driven insight can make a
significant impact on school systems, students and curriculums. By
analyzing big data, they can identify at-risk students, make sure
students are making adequate progress, and can implement a better
system for evaluation and support
Government - When government agencies are able to harness and
apply analytics to their big data, they gain significant ground when it
comes to managing utilities, running agencies, dealing with traffic
congestion or preventing crime.
25. Who uses Big Data ?
Health care - Patient records. Treatment plans. Prescription
information. When it comes to health care, everything needs to be
done quickly, accurately and, in some cases, with enough
transparency to satisfy stringent industry regulations. When big data
is managed effectively, health care providers can uncover hidden
insights that improve patient care.
Manufacturing - More and more manufacturers are working in an
analytics-based culture, which means they can solve problems faster
and make more agile business decisions.
Retail - Retailers need to know the best way to market to customers,
the most effective way to handle transactions, and the most strategic
way to bring back lapsed business
26. Operational Vs. Analytical Big Data
Operational Big Data technologies provide operational features to run
real-time, interactive workloads that ingest and store data.
MongoDB is a top technology for operational Big Data applications,
with over 10 million downloads of its open source software.
Analytical Big Data technologies, on the other hand, are useful for
retrospective, sophisticated analytics of your data.
Hadoop is the most popular example of an Analytical Big Data
technology.
But picking an operational vs analytical Big Data solution isn’t the
right way to think about the challenge. They are complementary
technologies and you likely need both to develop a complete Big Data
solution.
27. Traditional Vs. Google’s solution
In the Traditional approach, an enterprise will have a single computer
to store and process big data. Here data will be stored in an RDBMS
like Oracle Database, MS SQL Server or DB2, and sophisticated software
can be written to interact with the database, process the required
data and present it to the users for analysis purposes.
Limitations - a single computer and database server can only store
and process as much data as its own capacity allows, which becomes a
bottleneck when the volume of data grows very large.
28. Google’s solution
Google solved this problem using an algorithm called MapReduce.
This algorithm divides the task into small parts, assigns those
parts to many computers connected over the network, and collects
the results to form the final result dataset.
Doug Cutting, Mike Cafarella and team took the solution provided by
Google and started an Open Source Project called HADOOP in 2005.
Hadoop runs applications using the MapReduce algorithm, where the
data is processed in parallel on different CPU nodes. In short, the
Hadoop framework makes it possible to develop applications that run
on clusters of computers and perform complete statistical analysis of
huge amounts of data.
29. How MapReduce works ?
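The divide, process and collect idea can be sketched on a single machine with the classic word-count example. This is only an illustration of the map, shuffle and reduce phases, not Hadoop's actual API; the function names and the two sample documents are assumptions made for this sketch.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

def map_reduce(documents):
    # On a real cluster each document (or block) would be mapped on a different node;
    # here the "nodes" are simulated by a plain loop.
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

documents = ["big data is big", "data science uses big data"]
word_counts = map_reduce(documents)
print(word_counts)
```

Because each map call looks at one document only and each reduce call looks at one key only, both phases can be spread across many machines without coordination between the workers.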
30. Machine Learning - Learning from DATA !
Machine learning is a method of data analysis that automates
analytical model building. Using algorithms that iteratively learn from
data, machine learning allows computers to find hidden insights
without being explicitly programmed where to look.
The iterative aspect of machine learning is important because as
models are exposed to new data, they are able to independently adapt.
They learn from previous computations to produce reliable, repeatable
decisions and results
While many machine learning algorithms have been around for a long
time, the ability to automatically apply complex mathematical
calculations to big data over and over, faster and faster is a recent
development.
31. Here are a few widely publicized examples of machine
learning applications you may be familiar with
The heavily hyped, self-driving Google car? The essence of machine
learning.
Online recommendation offers such as those from Amazon and
Netflix? Machine learning applications for everyday life.
Knowing what customers are saying about you on Twitter? Machine
learning combined with linguistic rule creation.
Fraud detection? One of the more obvious, important uses in our
world today.
32. How to learn from DATA ?
1 Supervised Learning
  1 we have training data with correct answers
  2 use the training data to prepare the algorithm
  3 then apply it to data without correct answers
2 Unsupervised Learning
  1 no labelled training data
  2 throw the data into the algorithm
  3 hope it makes some kind of sense out of the data
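The supervised workflow above can be sketched with a 1-nearest-neighbour classifier, about the simplest learner there is: it stores the labelled examples and copies the label of the closest one. The height/weight data and the label names are hypothetical and chosen only for illustration.

```python
# Hypothetical labelled training data: (height_cm, weight_kg) -> label.
training = [
    ((150, 50), "small"),
    ((155, 55), "small"),
    ((180, 85), "large"),
    ((185, 90), "large"),
]

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict(point):
    """1-nearest-neighbour: copy the label of the closest training example."""
    _, label = min(training, key=lambda item: squared_distance(item[0], point))
    return label

# Step 3: apply the prepared model to data without correct answers.
unlabelled = [(152, 52), (182, 88)]
predictions = [predict(p) for p in unlabelled]
print(predictions)
```

An unsupervised method would receive only the points, without the "small" / "large" labels, and would have to discover the two groups by itself.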
33. Some types of learning algorithms
Prediction - Predicting a variable from data
Classification - Assigning records to predefined groups
Clustering - Splitting records into groups based on similarity
Association Learning - Seeing what often appears together with what
Issues with learning - Data is usually noisy in some way. Inductive bias -
the shape of the algorithm we choose may not fit the data at all, and may
induce under-fitting or over-fitting.
34. Testing our model and treating missing values
When a model is used for real problems, testing it is crucial.
Testing means splitting your dataset into training data (used as input
to the algorithm) and test data (used for evaluation only).
You need to compute some measure of performance, e.g. precision /
recall or root mean square error.
Datasets usually contain missing values, and this causes problems for
many Machine Learning algorithms. These can be handled by:
Removing all records with NULL values
Using a default value
Estimating a replacement value, etc.
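The ideas on this slide can be sketched in a few lines of plain Python: two ways of treating missing values, a train/test split, and two of the performance measures mentioned. The records, the helper names and the 25% test fraction are assumptions made for this sketch.

```python
import math
import random

# Hypothetical records: (x, y) pairs where y may be missing (None).
records = [(1, 2.0), (2, None), (3, 6.0), (4, 8.0), (5, None), (6, 12.0)]

# Option 1: remove all records with missing values.
complete = [(x, y) for x, y in records if y is not None]

# Option 3: estimate a replacement value (here, the mean of the known values).
known_mean = sum(y for _, y in complete) / len(complete)
imputed = [(x, y if y is not None else known_mean) for x, y in records]

def train_test_split(data, test_fraction=0.25, seed=0):
    """Shuffle a copy of the data and hold out a fraction for evaluation only."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def precision_recall(actual, predicted, positive=1):
    """Precision and recall for one class treated as the positive class."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == positive and p == positive)
    fp = sum(1 for a, p in zip(actual, predicted) if a != positive and p == positive)
    fn = sum(1 for a, p in zip(actual, predicted) if a == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

train, test = train_test_split(complete)
print(len(train), "training records,", len(test), "test record(s)")
```

Dropping records is the simplest option but throws data away; imputing keeps every record at the cost of inventing values, so the choice depends on how much data is missing and why.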
35. Top 10 Machine Learning Algorithms
Machine Learning algorithms are expected to replace 25% of the jobs
across the world in the next 10 years !!!
Naive Bayes Classifier Algorithm
K Means Clustering Algorithm
Support Vector Machine Algorithm
Apriori Algorithm
Linear Regression
Logistic Regression
Artificial Neural Networks
Random Forests
Decision Trees
Nearest Neighbours
36. Naive Bayes Classifier Algorithm
When to use Naive Bayes Classifier Algorithm ?
If you have a moderate or large training data set.
If the instances have several attributes.
Given the classification parameter, attributes which describe the
instances should be conditionally independent.
Applications of Naive Bayes Classifier Algorithm
Sentiment Analysis - It is used at Facebook to analyse status updates
expressing positive or negative emotions.
Document Categorization - Search engines can use document
classification to help index documents and estimate their relevance.
Google Mail uses the Naive Bayes algorithm to classify your emails as
Spam or Not Spam
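To make the conditional-independence assumption concrete, here is a minimal sketch of a multinomial Naive Bayes text classifier with add-one smoothing. The four training messages are invented for illustration, and this is not a description of how any real mail service implements its filter.

```python
import math
from collections import Counter, defaultdict

# Tiny hypothetical training set; real spam filters use far more data.
training = [
    ("win money now", "spam"),
    ("win a free prize", "spam"),
    ("meeting at noon", "ham"),
    ("lunch meeting tomorrow", "ham"),
]

class_counts = Counter(label for _, label in training)
word_counts = defaultdict(Counter)
for text, label in training:
    word_counts[label].update(text.split())
vocabulary = {word for counts in word_counts.values() for word in counts}

def classify(text):
    """Pick the class with the highest log posterior, treating words as
    conditionally independent given the class (the 'naive' assumption)."""
    scores = {}
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / len(training))  # log prior
        for word in text.split():
            # Add-one (Laplace) smoothing avoids zero probabilities for unseen words.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)

print(classify("free money"))
print(classify("meeting tomorrow"))
```

Working in log space turns the product of many small probabilities into a sum, which avoids numerical underflow on longer messages.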
37. K Means Clustering Algorithm
K-means is a popular unsupervised machine learning algorithm
for cluster analysis.
The algorithm operates on a given data set using a pre-defined
number of clusters, k.
The output of the K Means algorithm is k clusters, with the input data
partitioned among the clusters.
Applications of K Means Clustering Algorithm
The K Means Clustering algorithm is used by search engines
like Yahoo and Google to cluster web pages by similarity and identify the
relevance rate of search results.
This helps search engines reduce the computational time for the users.
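A minimal sketch of the standard K-means procedure (Lloyd's algorithm) follows: assign every point to its nearest centroid, move each centroid to the mean of its cluster, and repeat. The sample points, the fixed iteration count and the "first k points" initialisation are simplifying assumptions; practical implementations use smarter initialisation and a convergence check.

```python
def kmeans(points, k, iterations=10):
    """Lloyd's algorithm: alternate assignment and centroid-update steps."""
    centroids = list(points[:k])  # simplistic initialisation: the first k points
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical 2-D points forming two visible groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.0), (1.0, 0.5), (9.0, 8.5)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```

Note that k has to be chosen up front, exactly as the slide says; the algorithm will always return k clusters, whether or not the data really contains that many groups.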
38. Support Vector Machine Learning Algorithm
Support Vector Machine is a supervised machine learning algorithm
for classification or regression problems
A training dataset teaches the SVM about the classes so that it can
classify new data.
It works by finding a line (hyperplane) which separates the training
data set into classes.
SVM picks the hyperplane with the largest margin between the classes,
which often gives good classification accuracy.
Applications of Support Vector Machine Learning Algorithm
SVM is commonly used for stock market forecasting by various
financial institutions.
It can be used to compare the performance of a stock relative to
other stocks in the same sector.
This relative comparison of stocks helps in making investment
decisions based on the classifications made by the SVM learning
algorithm.
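The idea of separating classes with a hyperplane can be sketched with a small linear SVM trained by stochastic subgradient steps on the hinge loss with an L2 penalty. This is a teaching sketch under stated assumptions (toy data, fixed learning rate, hypothetical function names), not the solver that production libraries use.

```python
def train_linear_svm(samples, labels, epochs=20, learning_rate=0.1, regularization=0.01):
    """Minimise hinge loss plus an L2 penalty with stochastic subgradient steps."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(w * xi for w, xi in zip(weights, x)) + bias)
            if margin < 1:
                # Point is misclassified or inside the margin: adjust the hyperplane.
                weights = [w - learning_rate * (regularization * w - y * xi)
                           for w, xi in zip(weights, x)]
                bias += learning_rate * y
            else:
                # Point is safely classified: only apply the regularisation shrinkage.
                weights = [w - learning_rate * regularization * w for w in weights]
    return weights, bias

def predict(weights, bias, x):
    """Classify by the side of the hyperplane w.x + b = 0 the point falls on."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias >= 0 else -1

# Hypothetical, linearly separable training data with labels +1 / -1.
samples = [(2.0, 2.0), (3.0, 3.0), (-2.0, -2.0), (-3.0, -3.0)]
labels = [1, 1, -1, -1]
weights, bias = train_linear_svm(samples, labels)

print(weights, bias)
print(predict(weights, bias, (4.0, 1.0)), predict(weights, bias, (-1.0, -4.0)))
```

The `margin < 1` test is what distinguishes this from a plain perceptron: points are pushed not just to the correct side of the line but at least a margin away from it.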
39. Apriori Machine Learning Algorithm
Apriori algorithm is an unsupervised machine learning algorithm that
generates association rules from a given data set
An association rule states that if an item A occurs, then item B also
occurs with a certain probability.
Most of the association rules generated are in the IF THEN format.
For example, IF people buy an iPad THEN they also buy an iPad
Case to protect it
It is easy to implement and can be parallelized easily.
Applications of Apriori Machine Learning Algorithm
Detecting Adverse Drug Reactions
Market Basket Analysis
Auto-Complete Applications
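The level-wise search behind Apriori can be sketched as follows: count the support of single items, keep the frequent ones, and only combine frequent itemsets into larger candidates. The baskets are hypothetical and the candidate generation is simplified (it relies on the support check rather than explicit subset pruning).

```python
from itertools import combinations

# Hypothetical shopping baskets.
transactions = [
    {"ipad", "ipad case"},
    {"ipad", "ipad case", "charger"},
    {"ipad", "charger"},
    {"charger"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def frequent_itemsets(min_support):
    """Level-wise search: only itemsets that are frequent get extended."""
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    frequent = list(current)
    while current:
        size = len(current[0]) + 1
        # Join pairs of frequent itemsets that differ by exactly one item.
        candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == size}
        current = [c for c in candidates if support(c) >= min_support]
        frequent.extend(current)
    return frequent

def confidence(antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    return support(antecedent | consequent) / support(antecedent)

frequent = frequent_itemsets(min_support=0.5)
rule_confidence = confidence(frozenset({"ipad case"}), frozenset({"ipad"}))
print(frequent)
print("IF ipad case THEN ipad, confidence:", rule_confidence)
```

The confidence value is the "certain probability" attached to an IF THEN rule; in this toy data the rule in the other direction (IF ipad THEN ipad case) holds in only two of the three iPad baskets.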
40. Linear Regression Machine Learning Algorithm
The Linear Regression algorithm shows the relationship between 2
variables and how a change in one variable impacts the other.
The algorithm shows the impact on the dependent variable of
changing the independent variable.
It is one of the most interpretable machine learning algorithms,
making it easy to explain to others.
It is one of the most widely used machine learning techniques, and it runs fast.
Applications of Linear Regression Machine Learning Algorithm
Estimating Sales - Linear Regression finds great use in business, for
sales forecasting based on the trends
Risk Assessment - Linear Regression helps assess risk involved in
insurance or financial domain. A health insurance company can do a
linear regression analysis on the number of claims per customer
against age
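For two variables, the least-squares line has a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the means. The sketch below applies it to invented age and claims figures in the spirit of the insurance example; the numbers and the `fit_line` helper are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one independent variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: customer age vs. average number of claims per year.
ages = [20, 30, 40, 50, 60]
claims = [1.0, 1.5, 2.0, 2.5, 3.0]

slope, intercept = fit_line(ages, claims)
predicted_claims_at_45 = slope * 45 + intercept

print("slope:", slope, "intercept:", intercept)
print("predicted claims at age 45:", predicted_claims_at_45)
```

The slope is what makes the model easy to explain: here it reads directly as "expected extra claims per additional year of age".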
41. Decision Tree Machine Learning Algorithm
A decision tree is a graphical representation that makes use of
branching methodology to exemplify all possible outcomes of a
decision, based on certain conditions
In a decision tree, each internal node represents a test on an attribute,
each branch of the tree represents an outcome of the test and each
leaf node represents a particular class label.
The classification rules are represented through the path from root to
the leaf node.
Applications of Decision Tree Machine Learning Algorithm
Decision trees are among the popular machine learning algorithms
that find great use in finance for option pricing.
Decision tree algorithms are used by banks to classify loan applicants
by their probability of defaulting payments.
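The node, branch and leaf structure described above can be sketched as nested dictionaries, with classification as a walk from the root to a leaf. The tree below is hand-written with made-up attributes and thresholds to mirror the loan example; a real decision tree algorithm would learn the tests from training data.

```python
# Internal nodes test one attribute; leaves carry a class label.
# Attributes and thresholds are hypothetical, not learned from data.
tree = {
    "attribute": "income",
    "threshold": 50000,
    "below": {"label": "high risk"},
    "at_or_above": {
        "attribute": "existing_loans",
        "threshold": 2,
        "below": {"label": "low risk"},
        "at_or_above": {"label": "medium risk"},
    },
}

def classify(node, applicant):
    """Follow the branch chosen by each test until a leaf is reached."""
    while "label" not in node:
        value = applicant[node["attribute"]]
        branch = "below" if value < node["threshold"] else "at_or_above"
        node = node[branch]
    return node["label"]

applicants = [
    {"income": 30000, "existing_loans": 0},
    {"income": 80000, "existing_loans": 1},
    {"income": 80000, "existing_loans": 3},
]
risk_labels = [classify(tree, a) for a in applicants]
print(risk_labels)
```

Each root-to-leaf path reads as one classification rule, for example "IF income is at least 50000 AND existing loans are fewer than 2 THEN low risk".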
42. The Best Machine Learning Libraries in Python
Python is one of the best languages you can use to learn (and implement)
machine learning techniques for a few reasons:
It’s simple - Python is now becoming the language of choice among
new programmers thanks to its simple syntax and huge community
It’s powerful - Just because something is simple doesn’t mean it
isn’t capable. Python is also one of the most popular languages
among data scientists and web programmers. Its community has
created libraries to do just about anything you want, including
machine learning
Lots of ML libraries - There are tons of machine learning libraries
already written for Python. You can choose one of the hundreds of
libraries based on your use-case, skill, and need for customization.
43. The Best Machine Learning Libraries in Python - contd..
TensorFlow - a high-level neural network library that helps you
program your network architectures while avoiding the low-level details
scikit-learn - The scikit-learn library is definitely one of the most
popular ML libraries out there, if not the most popular, among all
languages. It has a huge number of features for data mining and data
analysis, making it a top choice for researchers and developers alike.
Theano - is a machine learning library that allows you to define,
optimize, and evaluate mathematical expressions involving
multi-dimensional arrays, which can be a point of frustration for some
developers in other libraries
44. The Best Machine Learning Libraries in Python - contd..
Pylearn2 - Most of Pylearn2’s functionality is actually built on top of
Theano, so it has a pretty solid base.
Pyevolve - Pyevolve provides a great framework to build and execute
genetic algorithms and neural networks.
Pattern - This is more of a ’full suite’ library as it provides not only
some ML algorithms but also tools to help you collect and analyze
data. The data mining portion helps you collect data from web
services like Google, Twitter, and Wikipedia. The nice thing about
including these tools is how easy it makes it to both collect and train
on data in the same program.
45. Machine Learning & Big Data Analytics - The perfect
marriage
TWO Orthogonal Aspects !!
Big Data - Handling massive data volumes !
Analytics / Machine Learning - Learning insights from data !
Can be combined so that it gives accurate, effective analysis !!!
46. Books I recommend for Machine Learning
47. Books I recommend for Big Data, Machine Learning
48. Thank you for not yawning !
Questions ?