4. Running a platform requires people with the right skills
1. There is complexity to manage
2. Agile working environment
3. Data Rock Stars are rare
4. Data modelling and SQL invaluable
5. What technology enabled us to launch a data platform?
A step change in the evolution of cloud computing.
1. Decoupling storage from compute
Pay for compute only when needed (scale up, down, out)
Pay for storage separately (very cheap)
2. Low barrier to entry
Extremely easy to set up
Very low price (no CAPEX)
3. Same data used by everybody – no impact, each to their own compute
10. Price of Entry
KETL can now offer cloud services with little risk
No software licensing costs
No hardware costs
Time to deployment rapid
We can use existing team skills to carry out Extract Load Transform
We can iterate quickly and let designs evolve
Time to value for clients massively reduced
12. KETL
30 Queen Charlotte St
Bristol
BS1 4JH
+44 (0)117 251 0064
www.ketl.co.uk
info@ketl.co.uk
@KETL_BI
For more information on what we do please contact Helen Woodcock.
We host regular workshops
Editor's notes
This talk is about how KETL were able to launch a data platform
There are three key ingredients – People, Technology, and Pricing
People
This talk is about KETL's experience with developing a cloud platform.
Three key factors in its development are people, technology, and price.
I’d like to share my experiences to date in order to give you some context.
We are all products of our experiences. Good or Bad.
It’s true to say that we can only improve going forward.
As our Marketing team would say “I am a seasoned professional”.
These grey hairs come from experience – I wish I had cloud when I started.
When I started work we had to physically build the servers, compile operating systems, fine tune software and run business processes as best we could.
The environment was fragile because budgetary and technical constraints meant that true resilience was expensive and difficult to obtain.
The financial and emotional investment in systems stymied free form creative development.
Changes to any code often came with downtime, arduous deployments, and on occasion hardware upgrades too.
In the old world, as sharp and keen as I think I was, I was still perceived as a bottleneck by the business.
I was also the custodian and gatekeeper to the data.
We all talk about the cloud and the new world, but many of the people and businesses I meet still carry lots of technical debt, frustrations, and fears – and I completely understand where they are coming from.
CALL CENTRE
I had to replace three spreadsheets and an Access database for a call-centre
If I looked at the data it was mostly complete and superficially fit for purpose
When I sat with the team taking calls I began to understand the stress of using Excel and Access whilst on the phone to the client
When you start to realise that technology is a service that should enable people, and you see the power of the right data at the right time, you can better understand my personal goal of being an enabler
That is what we want for our platform too
I’m not a data rock star yet
The data platform is a service that has to integrate with different organisations and data sources
Interactions with the outside world 70% of the time require dealing with people
Good Service requires good people – people are key
For me good data people display some key personal qualities:
Accuracy (don’t send me a CV with missing full stops and typos)
Consistency and reliability
Evidence of working with data problems (the number of rows do not always determine the complexity of the problem)
The “old skills” are still valuable today
Environments are built from scripts – we can deploy identical client-server environments using a script called with a different variable, plus a vanilla configuration of cloud servers and data services
Coding of key business functions into reusable working patterns – systematic thought processes (we still have to deliver a chain of events even if they now last seconds rather than hours)
SQL skills – it matters little which database you learnt SQL on; the fundamentals of SQL should give an employee an understanding of data warehousing concepts
I look for experience of Kimball or Data Vault schema design (even if not directly stated)
It still requires humans to interpret business processes
Can’t build a data service without a data team
Why is this piece of data here?
Why is there missing data?
What does this piece of data relate to in the business?
What does people friendly mean?
How do we get to the heart of what a customer wants?
Part of our platform has to deliver a semantic layer to the client – this is where we describe and codify data values in business terms
The outcome we are trying to achieve is consistency in business reporting and a centralisation of business logic
We are only able to code this layer if we understand and interact with the business users and match code to meaning – a sketch follows the customer example below
An Example: What is a customer?
Is it a credit card?
An email list member?
A loyalty card?
A gift recipient?
Do they expire?
How are customers counted?
How are they uniquely identifiable?
Other typical scenarios are summarising business activity markers into sales stages or grouping products into categories
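As a minimal sketch of what codifying such rules in a semantic layer can look like (the table and column names here are hypothetical, not an actual client schema):

-- Hypothetical semantic-layer view: one agreed definition of "customer"
create or replace view analytics.customer as
select
    c.customer_id,
    c.email,
    case
        when c.loyalty_card_id is not null then 'Loyalty member'
        when c.credit_card_token is not null then 'Card holder'
        else 'Email list member'
    end as customer_type,
    -- business rule: a customer past their expiry date no longer counts
    iff(c.expiry_date < current_date(), 'Expired', 'Active') as status
from raw.crm_customers c;

Because every report selects from the one agreed view, counts of customers stay consistent across the business.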
IN MY EXPERIENCE:
There is complexity still to manage
Client technical debt
Access to information and third party systems
Multiple data sources, multiple vendors
Provision of data ingestion access points and EC2 servers
Reporting Requirements
We use agile methodologies:
de-risk the complexity
make tasks manageable
Time to value for our clients has improved dramatically with Snowflake
We recently implemented a proof of concept (end-to-end) from extract to load in four days
Previously it would have taken about four months (elapsed time) – hardware, VPCs, software, licences, schema pre-design, etc
Proof of concept uses production data and replicates production functionality, the only limitation was the scope of required outputs
We used to have a long design stage prior to data load, now we spend the time exploring the real data and adapting the design as we go
Typically we can add fields and new sources within a sprint (two weeks)
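To illustrate why this is quick, adding a field can be as small a change as the following sketch (illustrative names, not a real client schema):

-- Surface a newly requested field in the raw table...
alter table raw.orders add column delivery_region varchar;

-- ...and expose it through the reporting view within the same sprint
create or replace view analytics.orders as
select order_id, order_date, delivery_region
from raw.orders;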
New Skills can be taught and learned (many skills transferable)
So many tutorials online, so many great e-learning courses
We are running a Zero to Snowflake course
Like any of us coders who feel we can do pretty much anything – there is no substitute for hands-on experience
Necessity is the mother of invention
Our platform was designed to make the steps we do for all clients repeatable and automated
Data Rock Stars are rare
Teams with a combination of skills, differences of opinion, different backgrounds and domain experience provide the best results
No single point of failure, or anything too esoteric
Domain experience helps speed up insight
Data modelling and SQL are key to the product
Kimball Star Schemas / Data Vault
Understanding join logic
SQL load scripts
Many of you will have these attributes, so starting a journey in Snowflake will not be as hard as you may think
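For anyone new to these terms, here is a minimal sketch of the kind of Kimball-style star schema and join logic meant above (all names are illustrative):

-- A dimension table describing products
create table dim_product (
    product_key  integer,
    product_name varchar,
    category     varchar
);

-- A fact table recording sales events, keyed to the dimension
create table fact_sales (
    date_key    integer,
    product_key integer,        -- joins to dim_product
    quantity    integer,
    net_amount  number(12,2)
);

-- A typical star join: facts aggregated by a dimension attribute
select p.category, sum(f.net_amount) as total_sales
from fact_sales f
join dim_product p on p.product_key = f.product_key
group by p.category;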
TECHNOLOGY is now meeting our expectations
Legacy data pipelines with long batch windows on maxed out limited hardware – fragile and high maintenance, changes difficult
Initial cloud offerings reduced capital expenditure on equipment but still required lots of system administrators
Recent improvements in shared services and containerisation (Docker)
Key elements of the Snowflake technology de-risked the investment
Benefits of the technology easily replicated and shared for different client types
enables fail-fast (development and querying)
lots of different teams can work on the same production data set
the same data can be split between different server groups (no impact across teams)
ability to process data loading/unloading without impacting running queries
zero copy clones and separate compute
time travel
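A minimal sketch of what the last two points look like in practice, assuming a production database called prod_db (the names are illustrative):

-- Zero-copy clone: a full development copy that stores no new data
create database dev_db clone prod_db;

-- Time travel: query a table as it was an hour ago
select count(*)
from prod_db.sales.orders at(offset => -3600);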
We are running a hands-on Zero to Snowflake session November 6th – details at end
There are other MPP databases
In productive use since 2016
TECHNOLOGY:
This architecture diagram illustrates the core concepts
Every user has access to the same data (subject to permissions)
Data is stored once
Teams can use “clones” of production data to carry out development
Different teams can use their own virtual warehouses (compute resources)
The loading warehouse is generally about parallel file ingestion (particularly for legacy sources)
CSV is the quickest
File sizes should be about the same, 10MB-100MB compressed maximum
The NUMBER OF FILES is key
4 cores / 8 threads -> eight files loaded in parallel, one file per thread
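As a sketch of how that ingestion looks in practice (the bucket, stage, and table names are hypothetical, and credentials/storage integration are omitted):

-- Hypothetical external stage over an S3 bucket of compressed CSV extracts
create or replace stage load_stage
    url = 's3://example-bucket/extracts/'
    file_format = (type = csv compression = gzip);

-- COPY ingests the staged files in parallel: a warehouse with 8 threads
-- works through roughly eight similar-sized files at a time
copy into raw.orders
from @load_stage/orders/;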
Adhoc analytics warehouse
Scaled for query time responsiveness
Multiple users, more clusters – resolves concurrency, prevents queued queries
Development can scale up and down the warehouse to test different functions
For a proof of concept, a single small server is adequate for view development
We are running a hands-on Zero to Snowflake session November 6th
Bring a laptop
TECHNOLOGY:
Each virtual warehouse can have from 1 server per cluster (X-Small) to 128 servers per cluster (4X-Large)
Each virtual warehouse cluster can be scaled out identically (up to 10 clusters)
Automated Cluster Scale Out
Single command line scale-up/down
Result Cache persisted for 24 hours – the clock resets each time the results are reused, up to a maximum of 31 days
Pay for only the compute used
As a company we do not have to predict demand but are able to respond to it
We are able to set limits and alerts around usage such that we can be pro-active with running costs
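A sketch of the single-command scaling and cost controls described above (warehouse and monitor names are illustrative; multi-cluster warehouses and resource monitors depend on your Snowflake edition):

-- Scale up for a heavy load window, then back down – one command each way
alter warehouse etl_wh set warehouse_size = 'XLARGE';
alter warehouse etl_wh set warehouse_size = 'SMALL';

-- Scale out for concurrency: extra clusters start only when queries queue
alter warehouse bi_wh set min_cluster_count = 1 max_cluster_count = 4;

-- Limits and alerts around usage, so running costs stay pro-actively managed
create resource monitor monthly_cap with credit_quota = 100
    triggers on 80 percent do notify
             on 100 percent do suspend;
alter warehouse etl_wh set resource_monitor = monthly_cap;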
TECHNOLOGY:
The old ETL processes, where data took ages to load in series and had to be manipulated outside of the database, are over
Load (and reload) all the data, then transform in situ – the Extract Load Transform process (see the sketch after these notes)
Transformation in situ does not require tooling, just a good understanding of SQL
Snowflake allows the parallel loading of data files (streams and other feeds)
The more nodes you have in the virtual server, the more threads you have to load files
Query Result Cache – the cache is part of the Snowflake service and returns previously calculated results
Disk is slow but cheap; the SSD cache is proportional to the number of nodes in the virtual warehouse – it disappears on ramp-down
Tuning – ramp up/ramp down
Ramp-up for parallel ingestion
Ramp out for concurrency – many BI report users
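A minimal sketch of transforming in situ after the load, with illustrative names (here the raw extract lands as semi-structured JSON in a VARIANT column):

-- Raw data lands first, untransformed
create or replace table staging.orders_raw (payload variant);

-- The T of ELT happens inside the database, in plain SQL
create or replace table warehouse.orders as
select
    payload:order_id::integer          as order_id,
    payload:order_date::date           as order_date,
    payload:net_amount::number(12,2)   as net_amount
from staging.orders_raw;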
Snowflake is an enabler
We can ingest data very easily
Bash
Python
Connectors
SQL Scripts and S3
We can rapidly prototype and deploy data models
Develop views in design stage
Share data with customers
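Sharing with customers uses Snowflake secure data sharing; a sketch with hypothetical names (the consumer account identifier would be the client's own):

-- Grant a client account read access to live data without copying it
create share client_share;
grant usage on database analytics_db to share client_share;
grant usage on schema analytics_db.public to share client_share;
grant select on table analytics_db.public.customer_summary to share client_share;
alter share client_share add accounts = client_account;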
As mentioned earlier, we are able to implement Proof of Concept warehouses (end-to-end)
From four months to four days
Data tables can be refined through continual iterations
The scalability and speed allow experimentation and measurement before the outcome has to be fixed
The investigation of data becomes a doing thing rather than a thinking thing – we find issues quicker
Flexibility in modelling
We can use our “old skills” in designing the data model and generating the semantic layer for the client
Notes:
AWS Lambda execution time limit: 15 minutes
EC2 Orchestrated Scripts – CRON / Python / Bash
Apache Airflow
Come and meet our Rock Star
You learn from doing
Other Notes:
Forecasting using auto.arima – candidate ARIMA models are searched through to find the best fit. ARIMA stands for Auto-Regressive Integrated Moving Average