3. Knowledge Check ...
For now, ask yourself:
1. Why do we need data?
2. Why do we need to store it efficiently? How do we store it efficiently?
3. How and where do we persist data? Hint: maybe “Excel” 🤦♂️
4. What insights do we get after analyzing it?
4. Data Structure(s)
A data organization, management, and storage format that enables efficient
access and modification. 🤔 (the dry “Wikipedia” definition)
A way of organizing data so that it can be used efficiently. 😀
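To make “used efficiently” concrete, here is a small sketch (with hypothetical sample records) showing how the choice of structure changes lookup cost: the same data held in a list needs a linear scan, while a dict gives direct key lookup.

```python
# The same records organized two ways (sample data for illustration).
users_list = [("alice", 30), ("bob", 25), ("carol", 41)]
users_dict = {name: age for name, age in users_list}

def age_from_list(name):
    # O(n): scan every record until we find a match
    for n, age in users_list:
        if n == name:
            return age
    return None

def age_from_dict(name):
    # O(1) on average: the hash table locates the record directly
    return users_dict.get(name)

print(age_from_list("carol"))  # 41
print(age_from_dict("carol"))  # 41
```

Both return the same answer; the difference only matters as the data grows, which is exactly the Big Data situation discussed later.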
5. Data Type(s)
An attribute of data that indicates what type of data we are
storing or manipulating.
In other words, it tells us what kind of data we are dealing with. 😀
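A quick illustration of the idea, using a hypothetical record: every value carries a type, and the type determines what operations make sense on it.

```python
# A sample record: each field has its own data type.
record = {"name": "Ada", "age": 36, "height_m": 1.63, "active": True}

for key, value in record.items():
    print(key, type(value).__name__)
# name str, age int, height_m float, active bool

# The type changes the meaning of the same operator:
print(1 + 2)      # ints: arithmetic -> 3
print("1" + "2")  # strings: concatenation -> "12"
```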
6. Preparing the data … “correctly ✅” ...
This is where understanding the different types of data and data structures comes in handy.
There isn’t one single best way of storing data.
Every organization stores data differently.
7. Initially, develop a general idea of how all data is being:
● Generated
● Collected
● Stored
Only then can we:
● Find data that is “relevant”,
● Process it, and
● Analyze it to gain insights.
8. Why does data matter? 🤔
Data is often called the most valuable commodity in the world: it has value, or can have value.
We want to store data in such a way that it is easy to manipulate and to gain
insights from.
9. Once upon a time … 👴
Data was structured and stored across multiple tables managed by an RDBMS.
The computational power available to process data at the time was low.
Social networks, smartphones, IoT devices, video streaming platforms … these “data sources” were still in their early days.
10. Some years later … ⏩
As we become a more digital society, the amount of data being created and collected is growing and accelerating significantly.
Analyzing this ever-growing data becomes a challenge with traditional analytical tools.
“DATA IS EVERYWHERE”
AND IT IS MOSTLY UNSTRUCTURED
11. 90%
of the data in the world today has been created in the last two years. 😲 (IBM estimate, 2012)
12. Why AWS?
Amazon Web Services (AWS) provides a broad platform of managed services to help you build, secure, and seamlessly scale end-to-end big data applications quickly and with ease.
We need innovation to bridge the gap between the data being generated and the data that can be analyzed effectively. 💡
13. But wait, what is Big Data?
Data so large and complex that it exceeds the processing capacity of
conventional database systems.
3 V’s of Big Data:
● Volume: the size of the data we are dealing with.
● Variety: the fact that data comes from various sources and in different formats.
● Velocity: the speed at which data is being generated.
There can be more V’s (e.g. veracity, value).
So, any data that crashes Excel is “Big Data”. 😬
14. Data from where? 🤔
Ask yourself:
● Where does data come from?
● How is such a huge amount of data being generated?
● Is the data even relevant, and does it come from valid sources?
● Who is storing the data?
15. Data sources …
● IoT devices, sensors, CCTV
● Social networks and search engines
● Stock exchange data
● Online shopping, retail data
● Log files
● ERP and CRM systems
● Healthcare industry, insurance
● Airline data
● Financial data
● Geographical data
AND SO MUCH MORE!!!
16. Structured, Unstructured, and Semi-Structured Data
Structured data has a defined schema. This type of data is well organized.
e.g. relational data.
Unstructured data has no defined schema or structural properties. It makes
up the majority of data collected. e.g. audio/video, images, binary data.
Semi-structured data is somewhere in the middle: too unstructured for a
relational database, but with some organizational structure. e.g. XML or JSON data.
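A short sketch of what “some organizational structure” means in practice, using toy XML and JSON documents and only the Python standard library: there is no fixed relational schema, but the tags and keys let us navigate the data programmatically.

```python
import json
import xml.etree.ElementTree as ET

# Semi-structured XML (toy document): tagged, but no relational schema.
xml_doc = "<user><name>Ada</name><age>36</age></user>"
root = ET.fromstring(xml_doc)
print(root.find("name").text)  # Ada

# JSON is another common semi-structured format; the nested list here
# would not fit directly into a single flat relational row.
json_doc = '{"name": "Ada", "tags": ["math", "computing"]}'
record = json.loads(json_doc)
print(record["tags"][1])  # computing
```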
19. Data Lifecycle
Stages:
1. Data Ingestion
2. Data Staging
3. Data Cleansing
4. Data Analytics and Visualization
5. Data Archive
20. Data Ingestion: the movement of data from an external source to another location for analysis.
Data Staging: performing housekeeping tasks before making data available to users.
Data Cleansing: before data is analyzed, data cleansing detects, corrects, and removes inaccurate data and corrupted records or files.
Data Analytics and Visualization: the real value of data is extracted in this stage. Decision-makers use analytics and visualization tools to predict customer needs, improve operations, transform broken processes, and innovate to compete.
Data Archiving: the AWS Cloud facilitates data archiving, enabling IT departments to invest more time in the other stages of the data lifecycle.
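The cleansing stage is the easiest one to show in code. A minimal sketch, using hypothetical sensor readings: detect and drop records with missing or obviously corrupted values before they reach the analysis stage.

```python
# Toy raw data: a missing reading and a sensor-error sentinel value.
raw = [
    {"id": 1, "temp_c": 21.5},
    {"id": 2, "temp_c": None},    # missing reading
    {"id": 3, "temp_c": -999.0},  # corrupted record (error sentinel)
    {"id": 4, "temp_c": 22.1},
]

def is_valid(row):
    t = row["temp_c"]
    # Keep only readings in a plausible range (assumed bounds).
    return t is not None and -50.0 <= t <= 60.0

clean = [row for row in raw if is_valid(row)]
print([row["id"] for row in clean])  # [1, 4]
```

Real pipelines would also correct recoverable records rather than only dropping them, but the detect-then-filter pattern is the core idea.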
38. Data Integrity 🤔
Simply put, it means the accuracy, completeness, and quality of data as it is maintained over time and across formats.
39. Database Consistency
The database must remain in a consistent state after any transaction.
A consistent transaction will not violate the integrity constraints placed on the data by the database rules.
40. ETL (Extract-Transform-Load)
A way to integrate data into a single location. 😀
ETL is a recurring activity (daily, weekly, monthly) of a Data Warehouse system and needs to be agile, automated, and well documented.
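A toy end-to-end ETL run, assuming a CSV source and a SQLite table standing in for the warehouse (both hypothetical, standard library only): extract raw rows, transform and validate them, then load the survivors into the target store.

```python
import csv
import io
import sqlite3

# Extract: read raw rows from the (simulated) CSV source.
source = io.StringIO("name,amount\nalice,10\nbob,not_a_number\ncarol,5\n")
rows = list(csv.DictReader(source))

# Transform: coerce types and reject corrupted records.
def transform(row):
    try:
        return (row["name"], int(row["amount"]))
    except ValueError:
        return None  # "not_a_number" fails validation

cleaned = [r for r in (transform(row) for row in rows) if r is not None]

# Load: write the transformed rows into the warehouse table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (name TEXT, amount INTEGER)")
db.executemany("INSERT INTO sales VALUES (?, ?)", cleaned)
total = db.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # 15
```

In the ELT variant on the next slide, the raw rows would be loaded first and the transform step would run inside the repository instead.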
42. ETL ... vs. ELT (Extract-Load-Transform)
ELT swaps the last two stages of the ETL process: after being extracted from source databases, data is loaded straight into a central repository, where all transformations occur.
46. Analyzing Data … (becoming “Sherlock” 🕵️♂️)
Understanding the real value contained within the data, so that with those insights we can make business decisions.
In short: extracting information from data to support decision making.
47. Visualizing Data ... 📈
The presentation of data in a pictorial or graphical format.
Because of the way the human brain processes information, using charts or graphs to visualize large amounts of complex data is easier than poring over spreadsheets or reports.
48. Common Data Visualization Ways
Source: https://morphocode.com/location-time-urban-data-visualization/
50. “In God we trust; all others must bring data.”
- W. Edwards Deming
51. AWS provides a host of services to address an organization’s data lifecycle and analytics requirements. 😌
Editor’s notes
Arrays, linked lists, stacks, and queues are some basic data structures. But when it comes to Big Data, we have others, discussed later. This is just the definition.
The types of data in the Big Data world: structured, unstructured, and semi-structured data. Discussed later.
Ask students: how do we store it efficiently? Different data structures and types need to be prepared differently. We cannot just squeeze schema-free data into a relational database. We must “prepare” the data correctly.
We must develop a general idea of how all data is generated, collected, and stored so that we can find the data that is relevant, process it, and analyze it to extract the hidden insights.
We process data to discover meaningful patterns, and with that information we make decisions that make our businesses more profitable and secure.
In the traditional architecture, data was mostly collected in a structured, tabular format and handled via RDBMS (Relational Database Management Systems). But now data is generated at an unimaginable rate, and much of it is unstructured.
The amount of data that one has to process has boomed to unimaginable levels in the past decade. It is important that organizations find ways to manage and analyze it so that they can act on the data and make important business decisions.
Estimated by IBM in 2012. It is 2021 now; think how much more data must have been generated since then. Link to the post: https://www.facebook.com/IBM/posts/90-of-the-data-in-the-world-today-has-been-created-in-the-last-two-years/293229680748471/
Analyzing large data sets requires significant compute capacity that can vary in size based on the amount of input data and the type of analysis. AWS provides the infrastructure and tools to tackle such large datasets with its pay-as-you-go cloud computing model.
Depending on the type of data or how it is structured (the data structure), we have various kinds of databases.