Covers what data science is, who a data scientist is, the qualifications required to become a data scientist, the responsibilities of a data scientist, the advantages of data science, roles in a data science project, Python libraries for data science, and big data vs. data science.
DataScience.pptx
1.
2.
» Significance
» Advantages
» Process of Data Science
» Roles in a Data Science Project
» Stages of Data Science Project
» Responsibilities of Data Scientist
» Qualifications of Data Scientist
» Data Science vs Big Data
» Python Libraries for Data Science
3. M Vishnuvardhan
Data Science
» Data science, also known as data-driven science, is an interdisciplinary
field about scientific methods, processes, and systems to extract
knowledge or insights from data in various forms, either structured or
unstructured.
» A data scientist manages big data. They take a large number of data
points and use their skills in programming, math, and statistics to
organize and clean them.
4. M Vishnuvardhan
Data Science and its Importance
The concept of data science is to unify statistics, machine
learning, data analysis, and other related methods so that people
can better understand and analyze information with data. The term
data science tends to be used to describe predictive modeling, business
intelligence, business analytics, and other uses of data.
Data science is about uncovering hidden information that may
help companies make smarter choices for their business.
E.g.: Spotify recommends new music, Netflix recommends new
movies, Gmail filters spam, and Amazon runs a recommendation engine.
5. M Vishnuvardhan
Data Science Advantages
» Monetizing the data
» Mitigating company risk
» Better understanding the customers
» Unique insights for businesses
» Business Expansion
» Improve forecasting
» Objective decisions for businesses
6. M Vishnuvardhan
Process of Data Science
» Frame the problem
» Collection of data to solve the problem
» Process the data
» Explore the data
» Perform in-depth analysis
» Communicate the results of the analysis
7. M Vishnuvardhan
Data Science Project - Roles
» Project sponsor
» Client
» Data scientist
» Data architect
» Operations
9. M Vishnuvardhan
Stages of Data Science Project
» Define Goal
» Data Collection and Management
» Modelling
» Model evaluation
» Presentation and documentation
» Model deployment and maintenance
10. M Vishnuvardhan
Stages of Data Science Project
Stage 1: Define Goal
» Why do the sponsors want the project?
» What do they lack, and what do they need?
» What are they doing to solve the problem, and
why isn’t that good enough?
» What resources will you need, e.g., staff and
domain experts?
» How do the project sponsors plan to deploy
your results?
» What are the constraints that have to be met
for successful deployment?
11. M Vishnuvardhan
Stages of Data Science Project
Stage 2: Data Collection & Management
This step includes identifying the data you need,
exploring it, and conditioning it to be suitable for
analysis.
» What data is available to me?
» Will it help me solve the problem?
» Is it enough?
» Is the data quality good enough?
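The data quality question above can be made concrete with a quick count of missing values. The following is a minimal sketch using only the Python standard library; the records, the field names, and the helper `missing_report` are hypothetical and only illustrate the idea.

```python
def missing_report(rows, fields):
    """Return the fraction of missing (None or empty-string) values per field."""
    report = {}
    for field in fields:
        missing = sum(1 for row in rows if row.get(field) in (None, ""))
        report[field] = missing / len(rows)
    return report


# Hypothetical sample records; real data would come from files or databases.
records = [
    {"age": 34, "income": 52000, "city": "Pune"},
    {"age": None, "income": 61000, "city": ""},
    {"age": 45, "income": None, "city": "Delhi"},
    {"age": 29, "income": 48000, "city": "Chennai"},
]

report = missing_report(records, ["age", "income", "city"])
for field, fraction in report.items():
    print(f"{field}: {fraction:.0%} missing")
```

A field with a high missing fraction may need to be cleaned, imputed, or dropped before modelling.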
12. M Vishnuvardhan
Stages of Data Science Project
Stage 3: Modelling
Statistics and machine learning are used during
the modelling stage. The most common data science
modelling tasks are these:
» Classification
» Scoring
» Ranking
» Clustering
» Finding relations
» Characterization
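Three of the tasks listed above (scoring, classification, and ranking) are closely related, which a small sketch can show. The example below assumes a toy churn model with made-up weights; the weights, the feature names, and the customers are all hypothetical and are not fitted to real data.

```python
import math

# Hypothetical weights for a toy churn model (assumed, not learned from data).
WEIGHTS = {"late_payments": 0.8, "support_calls": 0.5}
BIAS = -2.0


def score(customer):
    """Scoring: return a probability-like number between 0 and 1."""
    z = BIAS + sum(WEIGHTS[name] * customer[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def classify(customer, threshold=0.5):
    """Classification: turn the score into a category label."""
    return "churn" if score(customer) >= threshold else "stay"


def rank(customers):
    """Ranking: order customers from highest to lowest score."""
    return sorted(customers, key=score, reverse=True)


customers = [
    {"name": "A", "late_payments": 0, "support_calls": 1},
    {"name": "B", "late_payments": 3, "support_calls": 2},
    {"name": "C", "late_payments": 1, "support_calls": 1},
]

for customer in rank(customers):
    print(customer["name"], round(score(customer), 2), classify(customer))
```

In practice the weights would be estimated from data during the modelling stage rather than written by hand.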
13. M Vishnuvardhan
Stages of Data Science Project
Stage 4: Model evaluation
Once you have a model, you need to determine if
it meets your goals:
» Is it accurate enough for your needs? Does it
generalize well?
» Does it perform better than a simple guess? Better
than whatever estimate you currently use?
» Do the results of the model make sense in the
context of the problem domain?
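One way to check whether a model beats a guess is to compare its accuracy with a baseline that always predicts the most common label. This is a minimal sketch with made-up labels; `model_predictions` stands in for the output of whatever model is being evaluated.

```python
from collections import Counter


def accuracy(predicted, actual):
    """Fraction of predictions that match the actual labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def majority_baseline(train_labels, n):
    """Guess the most common training label for every one of n test items."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return [most_common] * n


# Hypothetical labels; a real project would use held-out test data.
train_labels = ["no", "no", "no", "yes", "no", "yes"]
test_actual = ["no", "yes", "no", "no", "yes"]
model_predictions = ["no", "yes", "no", "yes", "yes"]  # assumed model output

baseline_acc = accuracy(majority_baseline(train_labels, len(test_actual)), test_actual)
model_acc = accuracy(model_predictions, test_actual)

print(f"baseline accuracy: {baseline_acc:.2f}")
print(f"model accuracy:    {model_acc:.2f}")
```

If the model does not clearly outperform the baseline, it probably does not yet meet the project goal.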
14. M Vishnuvardhan
Stages of Data Science Project
Stage 5: Presentation & documentation
Once you have a model that meets your success
criteria, you must present your results to your
project sponsor and other stakeholders. You must
also document the model.
15. M Vishnuvardhan
Stages of Data Science Project
Stage 6: Model deployment & maintenance
The model is put into operation. In many organizations
this means the data scientist no longer has
primary responsibility for the day-to-day
operation of the model. But you still should
ensure that the model will run smoothly and
won’t make disastrous unsupervised decisions.
You also want to make sure that the model can be
updated as its environment changes.
16. M Vishnuvardhan
Responsibilities of a Data Scientist
» Recommend the most cost-effective changes that should be made to
existing strategies and procedures.
» Communicate findings and predictions to IT and management
departments through effective reports and visualizations of data.
» Develop new algorithms to solve problems and create new
tools to automate work.
» Devise data-driven solutions to the challenges that are most pressing.
» Examine and explore data from several different angles to find hidden
opportunities, weaknesses, and trends.
17. M Vishnuvardhan
Responsibilities of a Data Scientist
» Prune and clean data to get rid of irrelevant information.
» Employ sophisticated analytics programs, statistical methods, and
machine learning to get data ready for use in prescriptive and
predictive modelling.
» Extract data from several external and internal sources.
» Conduct undirected research and create open-ended questions.
18. M Vishnuvardhan
Qualifications of Data Scientists - Technical
» Cloud tools such as Amazon S3.
» Big data platforms such as Hadoop, Hive, and Pig.
» Programming languages such as Python, Perl, Java, and C/C++.
» SQL databases, as well as database querying languages.
» SAS and R languages.
» Unstructured data techniques.
» Data visualization and reporting techniques.
» Data munging and cleaning.
» Data mining.
» Software engineering skills.
» Machine learning techniques and tools.
» Statistics.
» Mathematics.
19. M Vishnuvardhan
Qualifications of Data Scientists - Business
» Industry knowledge: It’s important to understand how your chosen
industry works and how the data is utilized, collected, and analyzed.
» Intellectual curiosity: Data Scientists have to explore new territories and
find unusual and creative ways to solve problems.
» Effective communication: Data Scientists have to explain their
discoveries and techniques to non-technical and technical audiences in a
way that they can understand.
» Analytic problem-solving: Data Scientists approach high-level challenges
with a clear view of what is important. They employ the right methods
and approaches to make the best use of human resources and time.
20. M Vishnuvardhan
Data Science and Big Data
Big data refers to large collections of heterogeneous data that come from
various sources. It encompasses all types of data: unstructured,
semi-structured, and structured information, much of which can be
found throughout the internet. Big data includes:
» Structured data: transaction data, OLTP, RDBMS, and other structured
formats.
» Semi-structured data: text files, system log files, XML files, etc.
» Unstructured data: web pages, sensor data, mobile data, online data
sources, digital audio, and video feeds, digital images, tweets, blogs,
emails, social networks, and other sources.
21. M Vishnuvardhan
Data Science and Big Data
Meaning
Big Data:
• Large volumes of data that can’t be handled using a normal database
program.
• Characterized by velocity, volume, and variety.
Data Science:
• Data-focused scientific activity.
• Similar in nature to data mining.
• Harnesses the potential of big data to support business decisions.
• Includes approaches to process big data.
22. M Vishnuvardhan
Data Science and Big Data
Concept
Big Data:
• Includes all formats and types of data.
• Diverse data types are generated from several different sources.
Data Science:
• Helps organizations make decisions.
• Provides techniques to help extract insights and information from
large datasets.
23. M Vishnuvardhan
Data Science and Big Data
Basis of Formation
Big Data:
• Data is generated from system logs.
• Data is created in organizations: emails, spreadsheets, DB,
transactions, and so on.
• Online discussion forums.
• Video and audio streams that include live feeds.
• Electronic devices: RFID, sensors, and so on.
• Internet traffic and users.
Data Science:
• Working apps are made by programming developed models.
• It captures complex patterns from big data and develops models.
• It is related to data analysis, preparation, and filtering.
• Applies scientific methods to find the knowledge in big data.
24. M Vishnuvardhan
Data Science and Big Data
Application Areas
Big Data:
• Security and law enforcement.
• Research and development.
• Commerce.
• Sports and health.
• Performance optimization.
• Optimizing business processes.
• Telecommunications.
• Financial services.
Data Science:
• Web development.
• Fraud and risk detection.
• Image and speech recognition.
• Search recommenders.
• Digital advertisements.
• Internet search.
• Other miscellaneous areas and utilities.
25. M Vishnuvardhan
Data Science and Big Data
Approach
Big Data:
• To understand the market and to gain new customers.
• To find sustainability.
• To establish realistic ROI and metrics.
• To leverage datasets for the advantage of the business.
• To gain competitiveness.
• To develop business agility.
Data Science:
• Data visualization and prediction.
• Data acquisition, preparation, processing, publishing, preservation,
or destruction.
• Programming skills, like NoSQL, SQL, and Hadoop platforms.
• State-of-the-art algorithms and techniques for data mining.
• Involves the extensive use of statistics, mathematics, and other tools.
26. M Vishnuvardhan
Python Libraries used for Data Science
» NumPy: a Python module for numerical computation that can process
large amounts of data and perform array computations. NumPy
integrates seamlessly with other libraries commonly used in data
science, such as pandas and Matplotlib.
» Matplotlib: a plotting package used to build visualizations such as
graphs and charts. It is frequently used in data analysis for the
charts and histograms that it generates.
» Seaborn: a Matplotlib-based package used to make visualizations that
are more attractive and informative. It includes themes, color palettes,
and custom fonts.
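As an illustration of the array computations mentioned for NumPy, here is a minimal sketch that works on whole arrays at once instead of looping over values. The height and weight numbers are made up for the example.

```python
import numpy as np

# Hypothetical measurements for four people.
heights_cm = np.array([150.0, 160.0, 170.0, 180.0])
weights_kg = np.array([50.0, 60.0, 70.0, 80.0])

# Element-wise arithmetic: every element is converted in one expression.
heights_m = heights_cm / 100.0
bmi = weights_kg / heights_m ** 2

# Aggregations and boolean masks operate on the whole array as well.
mean_bmi = bmi.mean()
tall_mask = heights_cm > 165

print("BMI values:", np.round(bmi, 1))
print("Mean BMI:", round(float(mean_bmi), 1))
print("Weights of people taller than 165 cm:", weights_kg[tall_mask])
```

The same arrays could be passed directly to Matplotlib or Seaborn plotting functions.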
27. M Vishnuvardhan
Python Libraries used for Data Science
» Scikit-learn: a machine learning package for Python that offers
practical tools for data analysis and mining.
» TensorFlow: an open-source software framework created by Google
that enables dataflow and differentiable programming for
a variety of purposes, including machine learning. TensorFlow is used to
run ML algorithms on smartphones, on the web, and in the cloud.
» Keras: Keras is a Python-based high-level neural network API that can
operate on top of TensorFlow, CNTK, or Theano. It was created with the
goal of allowing for quick experimentation. It supports neural network
layers, activation functions, loss functions, and optimizers that are
typical in neural networks.
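To give a feel for the scikit-learn workflow (create a model, fit it, predict), here is a minimal sketch that assumes scikit-learn is installed. The one-feature training data is invented and deliberately easy to separate; it is not taken from the slides.

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: one numeric feature, two well-separated classes.
X_train = [[0], [1], [2], [3], [10], [11], [12], [13]]
y_train = [0, 0, 0, 0, 1, 1, 1, 1]

# Create the model and fit it to the training data.
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict the class of two new, unseen values.
predictions = model.predict([[1.5], [11.5]])
print(list(predictions))
```

Most scikit-learn estimators follow the same `fit` and `predict` pattern, so swapping in a different algorithm usually changes only the line that creates the model.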
28. M Vishnuvardhan
Python Libraries used for Data Science
» PyTorch: an open-source machine learning library used for tasks like
computer vision and natural language processing. It was created by
Facebook's AI research team and is extensively used in both business and
academia. PyTorch supports distributed computation, enabling quick and
effective model training on huge datasets.
» Pandas: a popular data science library. It provides a range of
functions for data manipulation, data analysis, and data visualization.
» Statsmodels: provides a range of statistical models and tools for
data scientists. The models include linear regression, logistic
regression, and generalized linear models. It integrates seamlessly with
Pandas to analyze and visualize data stored in data frames.
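The data manipulation and analysis functions mentioned for Pandas can be sketched as follows, assuming pandas is installed. The sales table and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical sales records; one row has a missing "units" value.
sales = pd.DataFrame({
    "region": ["North", "South", "North", "South", "East"],
    "units": [10, 5, 7, None, 3],
    "price": [2.0, 3.0, 2.0, 3.0, 4.0],
})

# Data cleaning: drop rows where the unit count is missing.
clean = sales.dropna(subset=["units"]).copy()

# Data manipulation: derive a new column from existing ones.
clean["revenue"] = clean["units"] * clean["price"]

# Data analysis: total revenue per region.
revenue_by_region = clean.groupby("region")["revenue"].sum()
print(revenue_by_region)
```

A data frame like `clean` is also the kind of input that Statsmodels model functions typically accept.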
29. M Vishnuvardhan
Python Libraries used for Data Science
» NLTK (Natural Language Toolkit): used for natural language
processing, which matters for data scientists who analyze natural
language data. It provides a range of functions for text processing. It
also offers functions for sentiment analysis, which is the process of
determining the sentiment or opinion expressed in a piece of text.