Graphics cards (GPUs) open up new ways of processing and analyzing big data, delivering millisecond selections over billions of rows, as well as telling stories about data. #QikkDB
How do you present data so that everyone understands it? Data analysis is for scientists, but data storytelling is for everyone: managers, product owners, sales teams, the general public. #TellStory
Learn about high-performance computing with GPUs and how to present data with a rich Covid-19 data story example in the upcoming webinar.
Fast data in times of crisis with GPU accelerated database QikkDB | Business Breakfast | 23.4.2020
1.
2. Adastra Group
Our Solution Portfolio
Webinar: Fast data in times of crisis with the help of GPU
One Focus: Data & Digitalization
• Advanced Analytics
• (Big) Data Engineering
• Data Governance
• Cloud Services
• Machine Learning & AI
• Digital Transformation
3. ADASTRA Group
Adastra introduction
• International consulting company that creates functional solutions in various sectors, facilitating the transition to the digital era.
• Cutting-edge software for data quality management, Master Data Management, and data governance.
• Solutions to complex business problems in risk management, sales, and process optimization.
• Specialist in mobile app development.
• Full-service creative agency built on a strong technological background.
• Recruitment for banks, financial institutions, telecoms, insurance companies, and many others, including Adastra.
• Artificial intelligence, machine learning, and optimization services.
• Big data monetization solutions.
4. Adastra Group
Technical & other details
The panel
• Matej Misik – QikkDB & TellStory product owner
• Tomas Synek – Moderator
• Martin Zahumensky – TellStory power user
During the webinar you can:
• Ask questions & answer polls
• Get beta access to the tools we show
• Leave us feedback
5. Agenda for today
Databases & GPU intro with QikkDB [45 mins]
• Intro into the deep-tech DB space
• What GPUs are and how they accelerate HPC
Data storytelling with TellStory [45 mins]
• Traditional BI vs. data storytelling
• Explaining Covid-19 by creating a data story
Let's GO
7. Some of our challenges
Real-time visitor reporting over a stream of data: 30k events per second (~2.6 billion per day), e.g. monitoring the crowd during an event, targeted marketing.
8. Some of our challenges
Data science on large datasets: testing hypotheses and ad-hoc querying when indexing is not predictable; profiling new datasets; large flows of commuters above 500 SIM-cards.
9. We were looking for solutions
We tested different technologies (Elastic, ClickHouse...) that did not work well for us, for various reasons:
• Elastic – slow on one node, slow data ingest
• Actian Vector – faster, but still not performing well on one node
• ClickHouse – much faster, but no geo-spatial capabilities and Linux-only
• MS SQL – not fast enough even when tuned
• MapD (OmniSci) – considered, but far too expensive
Then we came across GPU-accelerated computing. So, why not?
10. Types of databases
By type of use:
• Transactional
• Batch
• Real-time
• Analytical
• Streaming
By resource usage:
• In-memory
• Disk databases
• Hardware accelerated (FPGA, GPU, Quantum)
By stored data:
• Relational
• Columnar
• Time-series
• Graph
• Document
• Key-value
...
11. The technological edge – Why GPU?
GPUs for HPC (high-performance computing):
• ~10x higher performance in a single hardware unit
• Great effectiveness (cheaper computations)
• Power growing exponentially vs. linearly for CPUs
Proven in image processing, tsunami simulation, DNA analyses.
Generic commodity HW, available in the cloud (AWS, Azure).
12. Lots of processors for parallel computing
The Intel® Xeon® Platinum 8253 has 16 cores. The NVIDIA Tesla V100 has 5120 cores and is data center focused. New cards to be announced in 2020 with approx. 8000 cores & 40% higher speed.

Rediscovery of Columnar Data Storage
Utilizing GPUs' computational power requires a different approach to storing data. The most suitable database architecture for parallel processing is columnar storage. In contrast to conventional relational databases, which store data in a row-based format, columnar databases store data in separate columns. In the context of parallel processing, GPUs love long vectors of the same data type.
Figure 1: GPUs have thousands of arithmetic logic units (ALUs) in one piece of hardware.
GPUs help to accelerate compute-intensive use cases.
"1 GPU node replaces up to 54 CPU nodes" (NVIDIA)
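The row-vs-column contrast above can be sketched in a few lines of Python (an illustrative sketch only; qikkDB's actual storage engine is written in C++/CUDA, and the column names here are made up):

```python
# Row store: each record is a tuple, so a filter must touch every field
# of every record, even fields the query never uses.
row_store = [(1, "Apple", 0.5), (2, "Orange", 0.3), (3, "Lemon", 0.4)]

# Column store: each column is one contiguous vector of a single type --
# exactly the long, homogeneous arrays that GPU cores scan efficiently.
column_store = {
    "id":    [1, 2, 3],
    "fruit": ["Apple", "Orange", "Lemon"],
    "price": [0.5, 0.3, 0.4],
}

# SELECT fruit WHERE price < 0.45: only the two relevant columns are read;
# the "id" column never has to be loaded at all.
mask = [p < 0.45 for p in column_store["price"]]
result = [f for f, m in zip(column_store["fruit"], mask) if m]
```

The point of the layout is that the filter touches one homogeneous vector, which maps directly onto a parallel scan.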
13. Inserting a GPU into the machine is not enough
Programs need to be parallelized, which is hard. NVIDIA's CUDA programming model has been around since 2007. Algorithms must be embarrassingly parallel.
14. Multi-GPU
How the computation is spread onto cores

SELECT C FROM FRUIT_TABLE WHERE A >= B AND A < 5

Data is transferred from CPU RAM to GPU memory once; the logical conditions are then evaluated entirely on the GPU, with no intermediate transfers back to CPU RAM. Each GPU CUDA core processes its block of rows in parallel:

Block | A | B | C      | A >= B | A < 5 | Final AND mask | Meets condition | Result after reconstruction
1st   | 1 | 5 | Apple  | 0      | 1     | 0              | -               | Orange
1st   | 2 | 4 | Grapes | 0      | 1     | 0              | -               | Lemon
1st   | 3 | 3 | Orange | 1      | 1     | 1              | Orange          | -
2nd   | 4 | 2 | Lemon  | 1      | 1     | 1              | Lemon           | -
2nd   | 5 | 1 | Banana | 1      | 0     | 0              | -               | -
nth   | ...

The final mask marks the records meeting the condition; reconstruction compacts them into the result set.
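The mask-and-reconstruct steps from the table above can be reproduced in plain Python (each list comprehension stands in for one embarrassingly parallel kernel launch; on the GPU every element is evaluated by its own thread):

```python
# Columns of FRUIT_TABLE from the slide
A = [1, 2, 3, 4, 5]
B = [5, 4, 3, 2, 1]
C = ["Apple", "Grapes", "Orange", "Lemon", "Banana"]

# Each predicate becomes an independent 0/1 mask -- no element depends
# on any other, which is what makes the work embarrassingly parallel.
mask_a_ge_b = [int(a >= b) for a, b in zip(A, B)]   # A >= B
mask_a_lt_5 = [int(a < 5) for a in A]               # A < 5
final_mask  = [m1 & m2 for m1, m2 in zip(mask_a_ge_b, mask_a_lt_5)]

# Reconstruction: compact the surviving values of C into the result set.
result = [c for c, m in zip(C, final_mask) if m]
```

Running this yields the same final mask and result set ("Orange", "Lemon") as the worked table.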
16. Crucial requirements for the database system
• Fast insert
• Fast processing
• Scalability & high availability
• Limited pre-aggregations
• Standardized access and common syntax
17. Deep tech based on real science
The database core is written in the low-level language C++17 (memory management, control of instructions...). Data processing on the GPU is written in CUDA 10 (direct commands to the hardware at single-core level). Libraries are used for specific modules (networking, building, parsing...), e.g. Google Protocol Buffers. Created in cooperation with top talents from the Slovak Technical University.
18. What is qikkDB for?
• Filtering and aggregations over a single huge flat table
• Spatio-temporal data processing
• Complex polygon operations (contains, intersect, union)
• Numeric and datetime data
• Incremental data growing over time
Use cases: network utilization & analysis, risk scoring, dynamic pricing, real-time analytics, hypothesis verification, profiling of big data, machine learning, etc.
Domains: logs, polygons, IoT, GPS, network events, automotive, maps
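As an illustration of the "contains" operation, here is a minimal even-odd (ray-casting) point-in-polygon test in Python. This is a generic textbook algorithm, not qikkDB's implementation; the database evaluates such predicates on the GPU, typically one point per thread:

```python
def contains(polygon, point):
    """Even-odd ray casting: count how many polygon edges a rightward
    ray from the point crosses; an odd count means the point is inside."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
```

Because each point's test is independent of every other point's, the operation is embarrassingly parallel, which is why it fits the GPU model described earlier.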
19. So how fast is it?
• 1.2B data rows in 7 columns
• Average execution time obtained from 200 query runs
• Biggest datasets tested at 400 GB, limited by memory; data can be cached from disk for bigger datasets (benchmarks to come soon)
20. Execution Times Results
1. QikkDB
2. GiraffeDB – leading GPU database
3. CatDB – leading columnar database
4. RacoonDB – tuned leading relational database
We use codenames for well-known databases because for legal reasons we can't tell you who these slow guys are.
GPU machine: p3.8xlarge (4x Tesla V100). CPU machine: c5d.9xlarge (36 CPU cores).

Compared to Other DBs (results in ms)

Query | qikkDB @ p3.8xl | qikkDB @ g4dn.12xl | GiraffeDB @ p3.8xl | CatDB @ c5d.9xl | RacoonDB @ c5d.9xl | Elastic (tuned) | Spark 21x m3xl | Spark i3.8xl
#1    | 22              | 37                 | 25                 | 435             | 22                 | 810             | 2362           | 22000
#2    | 37              | 82                 | 235                | 1061            | 964                | 1818            | 3559           | 25000
#3    | 228             | 925                | 231                | 1630            | 3491               | n/a             | 4019           | 27000
#4    | 283             | 1105               | 417                | 2174            | 3996               | n/a             | 20412          | 65000
Avg   | 143             | 537                | 227                | 1325            | 2118               | n/a             | 7588           | 34750

10 to 100x quicker
21. The blazing speed
Same HW, 1.2bn data points, 2 databases. Both running on AWS g4dn.12xlarge (48 vCPU, 192 GB RAM, 4x Tesla T4 GPU). Deployed beta platform with a data exploration front-end: www.tellstory.cloud
23. What's going on in the background? Data storage & flow
• Persisted data on disk (compressed)
• Pre-loaded data in RAM
• Relevant columns go to the GPU over PCI-E
• Data in GPU RAM (decompressed)
• Filters & aggregations run as CUDA kernels
• Result set is returned
When inserting new data, a column is automatically created (~ "schema-less"), good for IoT and similar use cases.
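One common columnar compression scheme consistent with the compressed-on-disk / decompressed-on-GPU flow above is dictionary encoding. This is a generic sketch of the technique, not a description of the specific codecs qikkDB ships:

```python
def dict_encode(column):
    """Replace repeated column values by small integer codes,
    storing each distinct value only once (the on-disk form)."""
    dictionary = sorted(set(column))
    code_of = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [code_of[v] for v in column]

def dict_decode(dictionary, codes):
    """Reverse step, e.g. when a column is staged into GPU RAM."""
    return [dictionary[c] for c in codes]

col = ["Orange", "Apple", "Orange", "Orange", "Apple"]
dictionary, codes = dict_encode(col)   # compact dict + integer codes
restored = dict_decode(dictionary, codes)
```

Low-cardinality string columns shrink dramatically under this encoding, and the resulting integer vectors are again exactly the homogeneous arrays GPUs scan fastest.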
24. How can it scale?
• Multi-GPU (vertical) scalability on a single node (up to 8 GPUs): accelerating computations, enabling multiple sessions
• Multi-level caching: GPU RAM cache, CPU RAM cache
• On the roadmap: multi-node (horizontal) scalability, high availability, lazy data loading
Not limited by data size: best performance when data fits in GPU memory, but data can be loaded from disk on demand.
25. Why not just index?
Traditional databases use indexing for faster processing, which results in slow inserts. qikkDB does not need indexing (though indexes are available anyway): data are simply appended, and the GPU takes care of fast processing.
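The trade-off can be sketched in Python, using a sorted list as a crude stand-in for an index (a real B-tree pays O(log n) per insert rather than the O(n) shifts here, but the insert-time cost is the point):

```python
import bisect

values = [42, 7, 99, 7, 13]

# Index-maintaining store: every insert must keep the structure ordered,
# paying a reorganization cost at write time.
indexed = []
for v in values:
    bisect.insort(indexed, v)

# Append-only store (the qikkDB approach): inserts are pure O(1) appends...
appended = []
for v in values:
    appended.append(v)

# ...and filtering is a full scan, which the GPU runs massively in parallel
# instead of walking an index.
matches = [v for v in appended if v < 20]
```

Shifting the cost from insert time to (parallelized) query time is what makes fast ingest and fast processing compatible.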
26. Integration with your environment on standards you know
• Streaming data: Kafka connector
• ODBC/JDBC
• Adapters: C#, Java, Python
• Visualization tools (Power BI…)
• Custom applications, data analysis…
Speed up your BI tools and applications, or use TellStory for fast analysis.
TellStory: exploration & analysis front-end, data storytelling with real-time data exploration.
27. Expensive hardware?
GPU on AWS: 12 USD/hour. GPU HW: ~50k USD.
QikkDB can handle queries in a fraction of the time of traditional databases, so you can do more with your hardware allocation in the same time. It also means that to do the same amount of work you need much less hardware, thereby saving on costs.
"1 GPU node replaces up to 54 CPU nodes" (NVIDIA)
28. In short: Interactive analytics on massive data sets
GPU acceleration
§ Billions of data points in milliseconds
Great for spatio-temporal data
§ Finding & understanding links between data points in space & time
Standard SQL syntax
§ Easy to start using & to integrate into the data science environment
Efficiency & speed
§ GPUs are becoming commodity HW, and thanks to their efficiency the cost per 10k queries is on par with CPU approaches
In one picture: GPU columnar DB, real-time queries in milliseconds, API/ODBC/JDBC connects to everything, SQL standard, spatio-temporal data processing, cloud or on-prem.
29. Part 2!
Databases & GPU intro with QikkDB
• Intro into the deep-tech DB space
• What GPUs are and how they accelerate HPC
Data storytelling with TellStory
• Traditional BI vs. data storytelling
• Explaining Covid-19 by creating a data story
• Live stories and fast data
• TellStory Roadmap
• Q&A
Let's GO
30. Martin is the ex-Instarea CEO, now working at Ataccama as Head of Product Strategy. Martin created the https://qikk.ly/covid19 story and will lead you through how he did it.
31. Storytelling
Interpreted data, easy to understand, with new facts brought to the reader. And once they have the story, they can start to sell it to other parties.
[Animated video playing]
32. A story is about being visual
• Cool visualizations
• Plugins & animations
• Newspaper-like reading & interactivity
• Interesting facts
34. LIVE story
When you want to have the story live, you must have the data live, and when you work with billion-row big data sets you need a fast database.
[Animated video playing]
35. TellStory Roadmap
Beta release JUNE. More features to come in Phase 2; "Let AI create your Story" is in progress.
• Find interesting facts
• Minute-by-minute updates (be notified when something interesting happens)
• Animated visualizations (timeline charts, maps)
• Share as video (Instagram upload, YouTube livestream)
• Google Sheets integration
• Auto-update data (scheduled refresh)
• Embed sections (embedding only parts of a story will be possible)
36. Value proposition for Adastra services with these tools
Quick pilots for hands-on experience:
§ GPU data acceleration: 2-month pilot to deliver real-time processing of vast streaming data (e.g. 5G, smart meters, transactions)
§ Data storytelling: 1-month pilot to provide customers with live & interactive intelligence and insights
§ Data storytelling: 1-month pilot to give management the minute-by-minute data they need
38. Useful links
More info
§ https://qikk.ly – product web with basic information
§ https://qikk.ly/downloads/qikkDB_white_paper.pdf – White paper
§ https://docs.qikk.ly/ – Documentation & installation instructions
§ https://support.qikk.ly/ – Issues & features reporting portal
§ https://tellstory.cloud – Front-end for data visualization, SQL console on AWS
§ https://tellstory.ai – Find out more about TellStory