This document surveys the data skills needed in the digital era, covering data science, business intelligence, big data, and data engineering. It gives an overview of each field and lists its key programming languages, tools, and skills: Python, R, SQL, Tableau, and Hadoop for data science; SQL, data warehousing, and Tableau for business intelligence; Java, Python, Scala, and Hadoop for big data; and Linux, NoSQL, Python, and data ingestion tools for data engineering. It also recommends courses from universities such as Michigan and Berkeley for gaining skills in these areas.
4. Data Science
Math & Statistics
Computer Science
Subject Matter Expertise
Mohtat@ut.ac.ir 4
Data Science is an interdisciplinary field about processes and systems to extract knowledge or insights from data. It is a continuation of data analysis fields such as statistics, data mining, and predictive analytics, and is similar to Knowledge Discovery in Databases (KDD).
7. Critical Skills for Data Scientists
Python
R
SQL
Data Mining Tools: KNIME, RapidMiner, IBM SPSS Modeler
Excel
BI Tools: Tableau, Power BI, Qlik
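As a minimal illustration of the SQL skill in this list, the sketch below runs an aggregation query against an in-memory SQLite database from Python; the `sales` table and its columns are hypothetical, chosen only for the example:

```python
import sqlite3

# Build a small in-memory database (hypothetical sales table).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 50.0)],
)

# Aggregate revenue per region -- the kind of query a data
# scientist runs before any modeling starts.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('north', 170.0), ('south', 80.0)]
```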
8. Top Python Libraries in Data Science
TensorFlow: “TensorFlow is an open source software library for numerical computation using data flow graphs.”
PyTorch: “PyTorch is a Python package that provides deep neural networks built on a tape-based autograd system.”
NumPy: “NumPy is the fundamental package needed for scientific computing with Python.”
scikit-learn: “scikit-learn is a Python module for machine learning built on NumPy, SciPy and matplotlib.”
Keras: “Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano.”
SciPy: “SciPy is open-source software for mathematics, science, and engineering.”
pandas: “pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive.”
Matplotlib: “Matplotlib is a Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms.”
Scrapy: “Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages.”
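A minimal sketch of how two of these libraries fit together: a NumPy array does the vectorized numerical work, and pandas wraps it in a labeled structure (the column names and temperature data here are illustrative only):

```python
import numpy as np
import pandas as pd

# NumPy: vectorized numerical computation, no explicit Python loop.
temps_c = np.array([20.0, 22.5, 19.0, 25.0])
temps_f = temps_c * 9 / 5 + 32  # element-wise conversion

# pandas: labeled data structures built on top of NumPy arrays.
df = pd.DataFrame({"celsius": temps_c, "fahrenheit": temps_f})
print(df["fahrenheit"].mean())  # 70.925
```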
9. Top Skills every Data Scientist needs to Master
TensorFlow, Keras, Hadoop, Spark, Hive, Java, MATLAB
10. Most Essential Skills for Data Scientists
Complex Problem Solving
Team Working
Emotional Intelligence
Creativity
Critical Thinking
Negotiation
11. Applied Data Science with Python
University of Michigan (Coursera)
Topics: basic data visualization, machine learning, text mining, social network analysis (SNA)
Introduction to Data Science in Python
Applied Plotting, Charting & Data Representation in Python
Applied Machine Learning in Python
Applied Text Mining in Python
Applied Social Network Analysis in Python
15. Business Intelligence
Business Intelligence (BI) encompasses a wide variety of tools, applications and methodologies that enable organizations to collect data from internal systems and external sources; prepare it for analysis; develop and run queries against that data; and create reports, dashboards and data visualizations that make the analytical results available to corporate decision-makers as well as operational workers.
Business Skills
Link to Business Strategy
Define Priorities
Define BI Vision
Lead Organization / BPR
Analytics Skills
Data Mining
Social BI
IT Skills
Infrastructure
Build Technology
Data Integration & Quality
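As a small sketch of the Data Integration & Quality skill above, the snippet below merges records from two hypothetical source systems (a CRM and an ERP) and flags quality problems such as missing fields; the record layout is invented for the example, stdlib only:

```python
# Hypothetical records from two source systems to be integrated.
crm = [{"id": 1, "email": "a@x.com"}, {"id": 2, "email": None}]
erp = [{"id": 2, "email": "b@x.com"}, {"id": 2, "email": "b@x.com"}]

merged = {}   # id -> email, the integrated view
issues = []   # data-quality findings

for rec in crm + erp:
    if rec["email"] is None:
        issues.append(("missing_email", rec["id"]))
        continue
    # Flag conflicting values for the same key across sources.
    if rec["id"] in merged and merged[rec["id"]] != rec["email"]:
        issues.append(("conflict", rec["id"]))
    merged[rec["id"]] = rec["email"]

print(merged)  # {1: 'a@x.com', 2: 'b@x.com'}
print(issues)  # [('missing_email', 2)]
```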
24. Big Data
Volume: terabytes of data, distributed storage (e.g. Bigtable)
Velocity: real-time data, stream processing
Variety: structured and unstructured data (text, image, video)
Big data is a term used to refer to data sets that are too large or complex for traditional data-processing application software to adequately deal with. It is what organizations do with the data that matters: big data can be analyzed for insights that lead to better decisions and strategic business moves.
26. 3 Types of Big Data Jobs
Big Data Developer
Big Data Administration
Big Data Analytics
27. Top Big Data Programming Languages
Java: Not only Hadoop but many other big data tools, such as Storm, Spark, and Kafka, are written for the JVM and run on it.
Python: Python is a simple, open-source, general-purpose language, so it is easy for anyone to learn. With its rich set of utilities and libraries and its easy-to-use features, it works wonders for big data processing and analysis.
Scala: Scala rivals Java and Python in data science and is becoming more and more popular due to the extensive use of Apache Spark in the Hadoop and big data industry.
29. Big Data Companies & Vendors
Cloudera: Cloudera, Inc. is a US-based software company that provides a software platform for data engineering, data warehousing, machine learning and analytics that runs in the cloud or on premises.
MapR: MapR is a business software company headquartered in Santa Clara, California. MapR provides access to a variety of data sources from a single computer cluster, including big data workloads.
Hortonworks: Hortonworks is a data software company based in Santa Clara, California, that develops, supports, and provides expertise on a set of open-source software designed to manage data and processing for things such as IoT, single view of X, and advanced analytics and machine learning.
32. Big Data Specialization
UC San Diego (Coursera)
Introduction to Big Data
Big Data Modeling and
Management Systems
Big Data Integration and Processing
Machine Learning With Big Data
Graph Analytics for Big Data
36. Data Scientist VS Data Engineer
Data Engineer: Data Pipelines, Programming, System Implementation
Data Scientist: Visualization & Storytelling, Modeling & Advanced Analytics, Math & Statistics
37. Data Engineering
Data engineers develop, maintain, test and evaluate data solutions within organizations. ... A data engineer builds large-scale data processing systems, is an expert in data warehousing solutions, and should be able to work with the latest (NoSQL) database technologies. Data engineers also clean and wrangle data into a usable state.
38. How To Become A Data Engineer
Linux
NoSQL & SQL
Python / Java / Scala
Agile Development
Data Ingestion
Processing Frameworks
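The Data Ingestion skill above can be sketched with the stdlib `csv` module: parsing raw delimited text into typed records before it enters a pipeline (the file layout and field names here are hypothetical):

```python
import csv
import io

# Stand-in for a raw file landed from an upstream system.
raw = io.StringIO("user_id,event,ts\n7,click,1700000000\n8,view,1700000005\n")

records = []
for row in csv.DictReader(raw):
    # Cast string fields to proper types during ingestion.
    records.append({"user_id": int(row["user_id"]),
                    "event": row["event"],
                    "ts": int(row["ts"])})

print(records[0])  # {'user_id': 7, 'event': 'click', 'ts': 1700000000}
```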
39. Best Data Processing Frameworks
MapReduce: MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.
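The MapReduce model can be illustrated in plain Python as a single-process word count; real frameworks run the same map, shuffle, and reduce phases distributed across a cluster:

```python
from collections import defaultdict

docs = ["big data big insight", "data at scale"]

# Map phase: emit (word, 1) pairs from each document.
pairs = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle phase: group emitted values by key.
groups = defaultdict(list)
for word, count in pairs:
    groups[word].append(count)

# Reduce phase: sum the counts for each word.
counts = {word: sum(vals) for word, vals in groups.items()}
print(counts["big"], counts["data"])  # 2 2
```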
Apache Spark: Apache Spark is an open-source distributed general-purpose cluster-computing framework.
Apache Storm: Apache Storm is a free and open-source distributed real-time computation system.
Apache Flink: The core of Apache Flink is a distributed streaming dataflow engine written in Java and Scala.