The expression “garbage in, garbage out” emphasizes the need for thorough testing in any Big Data and analytics implementation. Big Data testing means ensuring the correctness and completeness of voluminous, often heterogeneous, data as it moves across different stages (ingestion, storage, analytics, and visualization) to produce actionable insights. What should be our testing focus? Which of the 4 V’s (variety, volume, velocity, and veracity) matter most at which stage? For example, in the ingestion stage, testing needs to focus on the variety of data rather than its volume. As the data moves on to the storage stage, testing needs to focus on veracity rather than velocity. Jaya Bhallamudi presents an approach for analyzing a typical Big Data implementation architecture to identify the testing interfaces and highlight the specific V’s to focus on at each one. The focus is based on the context of the data flow (the type of source from which data originates and the type of target to which it moves) and the context of the data (source data format, target data format, and the business, filter, and transformation rules applied to the data), which are then mapped to different testing strategies. Take back testing strategies and a test automation approach aligned with the 4 V’s of Big Data testing.
The Four V’s of Big Data Testing: Variety, Volume, Velocity, and Veracity
1.
T13 Big Data 10/6/16 13:30
The Four V's of Big Data Testing: Variety, Volume, Velocity, and Veracity
Presented by: Jaya Bhagavathi Bhallamudi, Tata Consultancy Services
Brought to you by:
350 Corporate Way, Suite 400, Orange Park, FL 32073
888-268-8770 · 904-278-0524 - info@techwell.com - http://www.starwest.techwell.com/
2.
Jaya Bhagavathi Bhallamudi
A senior consultant in the assurance services unit of Tata Consultancy Services, Jaya Bhagavathi Bhallamudi heads the Big Data and Analytics Assurance Center of Excellence, which focuses on R&D, test process definitions, test automation solution development, and competency development on Big Data technologies. Jaya has been in the test automation, testing services, and solutions innovation space for fifteen of her seventeen years in IT. She enjoys building test automation frameworks and accelerators for various testing services. Contact Jaya at jayabugs@gmail.com or on LinkedIn.
4. 2
With you today…
• Jaya is a Senior Consultant at TCS and currently heads the Big Data and Analytics Assurance Center of Excellence, which focuses on R&D, test process definitions, test automation solution development, and competency development
• Jaya has 18+ years of experience in the IT industry, with 15+ years in test automation and testing services & solutions innovation
• Jaya holds a Master's degree in Computer Applications from Osmania University, Hyderabad, India
Jayabhagavathi Bhallamudi, Head – Big Data Testing COE, Assurance Services, TCS
TCS Confidential Information – Not to be shared
5. 3
Today we will cover…
1 Need for Big Data Assurance
2 Tester’s Dilemma
3 Framework to tackle the problem
7. 5
Big Data Analytics
Non-traditional internal data &
uncontrolled external data
Complex non-traditional
analytical models
INPUT OUTPUT
8. 6
Garbage in equals Garbage out
[Figure: garbage IN = garbage OUT, with increased risk]
9. 7
How this impacts your business
Bad Data
Wrong Insights
Business / Brand Image Losses
Incorrect Processing
10. 8
Appropriate Big Data Assurance ensures
Good Data
Relevant Actionable Insights
Business Growth
Reliable Processing
12. 10
Scope in terms of data flow
Ingestion
Integration
Migration
Homogenization
Standardization
Storage
Analytics
Apps
Insights
Transformed Data
Raw Data
13. 11
Focus in terms of V’s
BIG DATA
VOLUME: TBs
VARIETY: RDBMS, txt, xml, json, bson, orc, rc…
VELOCITY: Performance
VARIABILITY: Inconsistency
VERACITY: Reliability
VALUE: Relevancy
14. 12
Ingestion
Integration
Migration
Homogenization
Standardization
Storage
Analytics
Apps
Insights
When to focus on which ‘V’?
Or should we focus on all V’s all the time?
17. 15
Step 1: Understand the architecture
Non-Hadoop sources: Databases, Files, Near real-time data streams
Hadoop:
• HDFS (Raw data)
• HIVE / HBASE (Standardized data)
• HIVE / HBASE (Data for creating analytical models)
• HIVE / HBASE (Data for applying analytical models)
Downstream of Hadoop: DWHs, Apps, Analytics (shown twice in the diagram)
19. 17
Step 2: Identify testing interfaces
Non-Hadoop sources: Databases, Files, Near real-time data streams
Hadoop:
• HDFS (Raw data)
• HIVE / HBASE (Standardized data)
• HIVE / HBASE (Data for creating analytical models)
• HIVE / HBASE (Data for applying analytical models)
Downstream of Hadoop: DWHs, Apps, Analytics (shown twice in the diagram)
Testing interfaces marked on the diagram: a, b, c, d, e, f, g, h, i, j, k, l, m, n
20. 18
Step 3: Identify testing type relevant to the interface
21. 19
Step 3: Identify testing type
Interface a: Databases → HDFS (Raw data)
Testing types @ a:
• Data ingestion testing
• Data migration testing
• Data integration testing
22. 20
Step 3: Identify testing type
Interface b: Files → HDFS (Raw data)
Testing types @ b:
• Data ingestion testing
• Data migration testing
• Data integration testing
23. 21
Step 3: Identify testing type
Interface c: Near real-time data streams → HDFS (Raw data)
Testing types @ c:
• Data ingestion testing
• Data integration testing
24. 22
Step 3: Identify testing type
Interface d: HDFS (Raw data) → HIVE / HBASE (Standardized data)
Testing types @ d:
• Data homogenization testing
25. 23
Step 3: Identify testing type
Interface e: HIVE / HBASE (Standardized data)
Testing types @ e:
• Data standardization testing
26. 24
Step 3: Identify testing type
Interface f: HIVE / HBASE (Standardized data) → HIVE / HBASE (Data for creating analytical models)
Testing types @ f:
• Data migration testing
• Data integration testing
27. 25
Step 3: Identify testing type
Interface g: HIVE / HBASE (Data for creating analytical models)
Testing types @ g:
• Analytical model validation
28. 26
Step 3: Identify testing type
Interface h: HIVE / HBASE (Standardized data) → HIVE / HBASE (Data for applying analytical models)
Testing types @ h:
• Data migration testing
• Data integration testing
29. 27
Step 3: Identify testing type
Interface i: HIVE / HBASE (Data for applying analytical models)
Testing types @ i:
• Analytical model effectiveness testing
30. 28
Step 3: Identify testing type
Interfaces j, k, l: Hadoop HIVE / HBASE (Data for applying analytical models) → Analytics, Apps
Testing types @ j, k, l:
• Data provision testing
31. 29
Step 3: Identify testing type
Interface k: Hadoop HIVE / HBASE (Data for applying analytical models) → DWHs
Testing types @ k:
• Data migration testing
• Data ingestion testing
• Data integration testing
33. 31
Step 4: Identify the V to be prioritized for the testing type
34. 32
Step 4: Prioritize V’s
Data Ingestion Testing
VARIETY: High priority for file-based data ingestions
VELOCITY: High priority for real-time data ingestions
35. 33
Step 4: Prioritize V’s
Data Migration Testing
VOLUME: High priority for historical data migrations
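When volume is the priority, a migration test usually cannot compare source and target row by row in memory. The sketch below shows one common reconciliation idea, assuming rows can be streamed from both sides as tuples: compare row counts and an order-independent content fingerprint. The function names `fingerprint` and `reconcile` are hypothetical, and the deck does not prescribe this technique.

```python
import hashlib

MODULUS = 1 << 256  # keeps the running sum at the size of a SHA-256 digest


def fingerprint(rows):
    """Stream rows once and return (row_count, order-independent digest sum).

    Each row is hashed on its own and the digests are added modulo 2**256,
    so the result does not depend on row order and memory use stays constant.
    """
    count = 0
    total = 0
    for row in rows:
        digest = hashlib.sha256(repr(tuple(row)).encode("utf-8")).digest()
        total = (total + int.from_bytes(digest, "big")) % MODULUS
        count += 1
    return count, total


def reconcile(source_rows, target_rows):
    """Compare source and target by row count and content fingerprint."""
    src_count, src_sum = fingerprint(source_rows)
    tgt_count, tgt_sum = fingerprint(target_rows)
    return {
        "source_count": src_count,
        "target_count": tgt_count,
        "counts_match": src_count == tgt_count,
        "content_match": src_count == tgt_count and src_sum == tgt_sum,
    }
```

In practice the two iterables would be cursors over the source database and the target store; the same values must be serialized identically on both sides for the fingerprints to be comparable.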
36. 34
Step 4: Prioritize V’s
Data Integration Testing
VARIABILITY: Inconsistency / non-compliance checks
• High priority for data acquired from multiple sources to a single target
• High priority for data acquired from external sources like social media
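An inconsistency or non-compliance check of this kind can be sketched as a rule applied to every record, with violations grouped by originating source. Everything specific below is an assumption made for illustration: the `source` and `signup_date` field names, the ISO date rule, and the function name `find_noncompliant` are hypothetical and not taken from the deck.

```python
import re
from collections import defaultdict

# Hypothetical compliance rule: dates in the target must be YYYY-MM-DD.
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def find_noncompliant(records, field="signup_date", pattern=ISO_DATE):
    """Group values that are missing or break the rule by their source system."""
    issues = defaultdict(list)
    for record in records:
        value = record.get(field)
        if value is None or not pattern.fullmatch(str(value)):
            issues[record.get("source", "unknown")].append(value)
    return dict(issues)


# Records merged from two sources into a single target (made-up sample data).
merged = [
    {"source": "crm", "signup_date": "2016-10-06"},
    {"source": "social_media", "signup_date": "10/6/16"},
    {"source": "social_media", "signup_date": None},
]
report = find_noncompliant(merged)
```

Grouping by source makes it visible which feed introduces the inconsistency, which matters most when several sources land in one target.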
37. 35
Step 4: Prioritize V’s
Data Homogenization Testing
VARIETY: High priority for unstructured or semi-structured to structured data format conversions
38. 36
Step 4: Prioritize V’s
Data Standardization Testing
VOLUME: High priority for any pre-existing data to be checked for conformance to data standards & industry compliances
39. 37
Step 4: Prioritize V’s
Analytical Model Validation (analytical models based on historical data)
VOLUME: To identify data patterns that were not considered in the development of the model; the entire historical data set is to be considered for testing
40. 38
Step 4: Prioritize V’s
Analytical Model Validation (analytical models not based on historical data)
VERACITY: High priority to identify the data patterns that are not relevant for the business
VALUE: High priority to identify the data patterns that do not bring any value to the business
41. 39
Step 4: Prioritize V’s
Analytical Model Effectiveness Testing (if the actual data, on which the model needs to be run, is available)
VOLUME: High priority to identify wrong predictions and unidentified data patterns
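The Step 4 slides amount to a mapping from testing type to the V's that get priority, each with the condition under which it applies. A minimal sketch of that mapping follows; the conditions paraphrase the slides, while the names `V_PRIORITIES`, `prioritized_vs`, and `vs_for_plan` are hypothetical.

```python
# Testing type -> list of (prioritized V, condition stated on the slide).
# Identifiers are illustrative only; the content paraphrases the Step 4 slides.
V_PRIORITIES = {
    "Data ingestion testing": [
        ("VARIETY", "file-based data ingestions"),
        ("VELOCITY", "real-time data ingestions"),
    ],
    "Data migration testing": [
        ("VOLUME", "historical data migrations"),
    ],
    "Data integration testing": [
        ("VARIABILITY", "data from multiple sources to a single target, "
                        "or from external sources like social media"),
    ],
    "Data homogenization testing": [
        ("VARIETY", "unstructured or semi-structured to structured conversions"),
    ],
    "Data standardization testing": [
        ("VOLUME", "pre-existing data checked for conformance to standards"),
    ],
    "Analytical model validation": [
        ("VOLUME", "models based on historical data"),
        ("VERACITY", "models not based on historical data"),
        ("VALUE", "models not based on historical data"),
    ],
    "Analytical model effectiveness testing": [
        ("VOLUME", "actual data for running the model is available"),
    ],
}


def prioritized_vs(testing_type):
    """Return the V's prioritized for one testing type, in slide order."""
    return [v for v, _condition in V_PRIORITIES[testing_type]]


def vs_for_plan(testing_types):
    """Return the distinct V's across several testing types, first seen first."""
    seen = []
    for testing_type in testing_types:
        for v in prioritized_vs(testing_type):
            if v not in seen:
                seen.append(v)
    return seen
```

Feeding the testing types identified for an interface into `vs_for_plan` gives the V's to emphasize at that interface, which is the answer to the "when to focus on which V" question posed earlier in the deck.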
42. 40
Thank you!
For more information, please write to me at Global.Assurance@tcs.com
Visit TCS at booth # 1