introduction to data science

ICT 3202 - INTRODUCTION
TO DATA SCIENCE
BY
ENGR. JOHNSON C. UBAH
B.ENG, M.ENG, HCNA, ASM

Course description
This course is an introduction to data science. The major goals
of this course are to learn how to use tools for acquiring,
cleaning, analyzing, exploring, and visualizing data; making
data-driven inferences and decisions; and effectively
communicating results. For practical purposes one may work
with Python, Octave/Matlab, ...

Fields to be covered
• Data mining
• Statistics
• Machine learning
• Information visualization
• Network analysis
• Natural language processing
• Algorithms
• Software engineering
• Databases
• Distributed systems
• Big data

Introduction
Data science is an interdisciplinary field that uses scientific
methods, processes, algorithms and systems to
extract knowledge and insights from structured
and unstructured data.
Data science is related to data mining and big data.

Introduction to data science
Data science is a "concept to unify statistics, data analysis, machine
learning and their related methods" in order to "understand and
analyze actual phenomena" with data.[3] It employs techniques and
theories drawn from many fields within the context
of mathematics, statistics, computer science, and information
science.

Big data
Big Data refers to a huge volume of data that can be structured,
semi-structured or unstructured. It is characterized by 5 Vs:
Volume: the amount or size of the data, which for big data can run
into the quintillions of bytes.
Variety: the different types of data, such as social media content, web
server logs, etc.

Big Data
Velocity: how fast the data is growing; data is growing exponentially
and at a very fast rate.
Veracity: the uncertainty of the data, i.e. whether data such as social media
content can be trusted or not.
Value: whether the data we are storing and processing is worthwhile,
and how we benefit from this huge amount of data.

Structured data
Data that is the easiest to search and organize, because it is usually
contained in rows and columns and its elements can be mapped into
fixed pre-defined fields, is known as structured data.
Structured data is often managed using Structured Query Language
(SQL), a programming language developed by IBM in the
1970s for relational databases.

Structured data
Examples of structured data include financial data such as accounting
transactions, address details, demographic information, star ratings
by customers, machine logs, location data from smartphones and
smart devices, etc.
Today, most estimates suggest that structured data accounts for less than 20
percent of all data.
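As a minimal sketch of why fixed rows and columns make data easy to search, the example below stores a few customer star ratings in an in-memory SQLite database and queries them with SQL. The table name, column names and values are made up for illustration.

```python
import sqlite3

# In-memory database; the table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (customer TEXT, product TEXT, stars INTEGER)")
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?)",
    [("ada", "kettle", 5), ("ben", "kettle", 3), ("ada", "toaster", 4)],
)

# Every row has the same pre-defined fields, so SQL can filter and aggregate directly.
rows = conn.execute(
    "SELECT product, AVG(stars) FROM ratings GROUP BY product ORDER BY product"
).fetchall()
print(rows)
conn.close()
```

The query works only because each record maps onto the same fields; free text such as an email body offers no such handle.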

Unstructured data
A much bigger percentage of all the data in our world is unstructured data.
Unstructured data is data that cannot be contained in a row-column
database and doesn’t have an associated data model.
Think of the text of an email message. The lack of structure makes
unstructured data more difficult to search, manage and analyse.

Unstructured data
Other examples of unstructured data include photos, video and audio
files, text files, social media content, satellite imagery, presentations,
PDFs, open-ended survey responses, websites and call center
transcripts/recordings.
Instead of spreadsheets or relational databases, unstructured data is
usually stored in data lakes, NoSQL databases, applications and data
warehouses.

Semi-structured data
Beyond structured and unstructured data, there is a third category, which
is a mix of the two.
Semi-structured data has some defining or
consistent characteristics but doesn’t conform to a structure as rigid as is
expected with a relational database.
It therefore has some organizational properties, such as semantic tags
or metadata, that make it easier to organize, but there’s still fluidity in the data.

Semi-structured data
Email messages are a good example.
While the actual content is unstructured, it does contain structured data such as
name and email address of sender and recipient, time sent, etc.
Another example is a digital photograph.
The image itself is unstructured, but if the photo was taken on a smart phone,
for example, it would be date and time stamped, geo tagged, and would have a
device ID.
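The email example can be sketched with Python's standard `email` module. The message below is invented for illustration; the point is only that the headers behave like named fields while the body is free text.

```python
from email import message_from_string

# A made-up message: headers first, then a blank line, then the body.
raw = """From: ada@example.com
To: ben@example.com
Date: Mon, 01 Jan 2024 09:30:00 +0000
Subject: Quarterly report

Hi Ben, the numbers look good overall, but sales in the north dipped.
"""

msg = message_from_string(raw)

# Structured part: named header fields that can be looked up like columns.
structured = {name: msg[name] for name in ("From", "To", "Date", "Subject")}

# Unstructured part: free text with no pre-defined fields.
body = msg.get_payload().strip()

print(structured)
print(body)
```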

Big data can be analyzed for insights that
lead to better decisions and strategic
business moves.

How much data does it take to
be called Big Data?
Usually, data of 1 TB or more is considered Big
Data. Analysts predict that by 2020, there will be 5,200 GB of data
on every person in the world.
Example: On average, people send about 50 million tweets per day, and
Walmart processes 1 million customer transactions per hour.
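To get a feel for these figures, the short calculation below converts them into per-second rates. It only re-expresses the two numbers quoted above and assumes the load is spread evenly over time, which real traffic is not.

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_HOUR = 60 * 60

# Figures quoted in the text, converted to average per-second rates.
tweets_per_second = 50_000_000 / SECONDS_PER_DAY
transactions_per_second = 1_000_000 / SECONDS_PER_HOUR

print(f"{tweets_per_second:.0f} tweets per second")
print(f"{transactions_per_second:.0f} transactions per second")
```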

Why is Big Data Important?
The importance of Big Data lies not in how much data we have
but in what we can get out of that data. We can analyze data to
reduce cost and time, make smarter decisions, etc.
Challenges:
Storing such a huge amount of data efficiently.
Processing and extracting valuable information from this huge
amount of data within a given timeframe.
Solution: the Hadoop and Spark frameworks
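The general idea behind such frameworks, splitting the data into chunks, processing each chunk independently and then merging the partial results, can be sketched in plain Python. This is not the Hadoop or Spark API; the function names and sample text are assumptions made for illustration.

```python
from collections import Counter
from functools import reduce

# Each string stands in for a chunk of data stored on a different machine.
chunks = [
    "big data needs big storage",
    "data mining finds patterns in data",
]

def map_chunk(chunk):
    """Map step: count words within one chunk, independently of the others."""
    return Counter(chunk.split())

def merge_counts(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right

# In a real cluster the map step would run in parallel across machines.
partial_counts = [map_chunk(chunk) for chunk in chunks]
total = reduce(merge_counts, partial_counts, Counter())
print(total.most_common(2))
```

Because each chunk is counted without looking at the others, the work can be spread over many machines, which is what makes very large data sets tractable.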

Data Mining
Data Mining, also known as Knowledge Discovery in Data (KDD),
refers to extracting knowledge from a large amount of data,
i.e. Big Data. It is mainly used in statistics, machine
learning and artificial intelligence. It is one step of the
“Knowledge discovery in databases” process.

Data Mining basics
Data mining mainly consists of 5 components:
1. Extract, transform and load data into warehouse
2. Store and manage
3. Provide data access (Communication)
4. Analyze (Process)
5. User Interface (Present data to user)
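The five components above can be sketched end to end with the standard library, using an in-memory SQLite database as a stand-in for the warehouse. The CSV content, table name and column names are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a file or another system.
raw_csv = """customer,amount
ada, 120.50
ben,80
ada,
"""

# 1. Extract and transform: parse rows, trim whitespace, skip rows with no amount.
records = []
for row in csv.DictReader(io.StringIO(raw_csv)):
    amount = row["amount"].strip()
    if amount:
        records.append((row["customer"].strip(), float(amount)))

# 1-2. Load into the warehouse (in-memory SQLite here), where the data is stored.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
warehouse.executemany("INSERT INTO sales VALUES (?, ?)", records)

# 3-4. Access and analyze: total spend per customer.
totals = warehouse.execute(
    "SELECT customer, SUM(amount) FROM sales GROUP BY customer ORDER BY customer"
).fetchall()

# 5. Present the result to the user.
for customer, total in totals:
    print(f"{customer}: {total:.2f}")
warehouse.close()
```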

Need for Data Mining
Data mining analyzes relationships and patterns in stored transaction data to obtain
information that supports better business decisions.
Data mining helps with credit ratings, targeted marketing, fraud
detection (e.g. which types of transactions are likely to be fraudulent, judged by
checking a user's past transactions) and customer relationship management
(e.g. which customers are loyal and which will leave for another company).
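One simple way to check a transaction against a user's past transactions is to flag amounts that lie far from that user's historical average. The sketch below uses a three-standard-deviation rule; the threshold, the function name and the sample amounts are assumptions for illustration, not a production fraud model.

```python
from statistics import mean, stdev

# Hypothetical past transaction amounts for one user.
past_amounts = [42.0, 38.5, 45.0, 40.0, 44.5]

def looks_unusual(amount, history, threshold=3.0):
    """Return True if amount is more than `threshold` standard deviations from the mean."""
    mu = mean(history)
    sigma = stdev(history)
    return abs(amount - mu) > threshold * sigma

print(looks_unusual(43.0, past_amounts))   # an ordinary purchase
print(looks_unusual(900.0, past_amounts))  # far outside the usual range
```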

Data mining can uncover 4 kinds of relationships:
1. Classes: used to locate the target
2. Clusters: group data items according to logical relationships
3. Association: identify relationships between data items
4. Sequential patterns: anticipate behavioral patterns and trends.
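As an illustration of the association relationship, the sketch below counts how often pairs of items appear together in a set of shopping baskets and reports the most frequent pair. The baskets are invented, and counting pairs is only the simplest form of association analysis.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets; each set is one transaction.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

pair_counts = Counter()
for basket in transactions:
    # Sort so that each pair is always counted under the same key.
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support = fraction of transactions that contain both items.
support = {pair: count / len(transactions) for pair, count in pair_counts.items()}
best_pair = max(support, key=support.get)
print(best_pair, support[best_pair])
```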

Challenges in Data Mining
1. Mining different types of Knowledge in databases
2. Handling noise and incomplete data
3. Efficiency and scaling of data mining algorithms
4. Handling relational and complex types of data
5. Protection of data security, integrity, and privacy
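Challenge 2 can be made concrete with a small sketch: the list below has missing readings (`None`) and one noisy value, and the gaps are filled with the median of the known values. Median filling is just one possible strategy, chosen here because the median is less affected by the noisy reading than the mean; the data is made up.

```python
from statistics import median

# Hypothetical sensor readings: None marks a missing value, 250.0 is noise.
readings = [21.5, None, 22.0, 250.0, 21.0, None, 22.5]

# Use only the known values to choose a fill value.
known = [r for r in readings if r is not None]
fill_value = median(known)

# Replace each missing reading with the median of the known ones.
cleaned = [fill_value if r is None else r for r in readings]
print(cleaned)
```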

Head To Head Comparison
Between Big Data vs Data Mining
Big Data and Data Mining are two different concepts. Big data is a
term which refers to a large amount of data, whereas data
mining refers to a deep dive into the data to extract the key
knowledge, patterns or information from a small or large amount of
data.

The main concept in Data Mining is to dig deep into analyzing the
patterns and relationships of data that can be used further in
Artificial Intelligence, Predictive Analysis etc. But the main concept
in Big Data is the source, variety, volume of data and how to store
and process this amount of data.

Analyzing Big data to provide a business solution or to make a
business decision plays a crucial role in determining growth.
Data Mining does not depend on Big Data, as it can be done on a
small or large amount of data, but big data surely depends on Data
Mining, because if we are not able to find the value or importance of a
large amount of data then that data is of no use.

Features | Data mining                               | Big Data
Focus    | Focuses mainly on the details of the data | Focuses mainly on the relationships between data
View     | A close-up view of the data               | The big picture of the data
Data     | Expresses the "what" of the data          | Expresses the "why" of the data
Volume   | Can be used on small data or big data     | Refers to a large data set
Features   | Data Mining | Big Data
Definition | A technique for analyzing data | A concept rather than a precise term
Data types | Structured data, relational and dimensional databases | Structured, semi-structured and unstructured data (in NoSQL)
Analysis   | Mainly statistical analysis; focus on prediction and discovery of business factors on a small scale | Mainly data analysis; focus on prediction and discovery of business factors on a large scale
Result     | Mainly for strategic decision making | Dashboards and predictive measures

Big data refers only to a large amount of data, and all big data
solutions depend on the availability of data. It can be considered a
combination of Business Intelligence and Data Mining.
Data mining uses different kinds of tools and software on Big data to
return specific results. It is mainly “looking for a needle in a haystack”.
In short, big data is the asset and data mining is the manager of that asset,
used to provide beneficial results.
