Sponsored by Data Transformed, the KNIME Meetup was a big success. Below are the slides from Dan's, Tom's, Anand's and Chhitesh's presentations.
Agenda:
Registration & Networking
Keynote – Dan Cox, CEO of Data Transformed
KNIME & Harvest Analytics – Tom Park
Office of State Revenue Case Study – Anand Antony
Using Spark with KNIME – Chhitesh Shrestha
Networking & Drinks
4. Energise Organisational Advantage through Awareness and Insight
5. Journey to Best in Class Analytics
We help our clients along this path (a maturity curve plotting value against time):
◦ Static – report and drill-down (Laggards)
◦ Reactive – monitor and alert (Followers)
◦ Proactive – discover and predict (Performers)
◦ Dynamic – analytics-enabled business processes (Innovators)
7. Data Transformed Skill Sets
BUDGET PLANNING (20%): Budgeting, Forecasting, Planning, Demand Planning, Workforce Management, Accounting, Financing, Cashflow, Sales Forecasting, Modelling, Campaign Forecasting
DATA PREPARATION (50%): Data Governance, Data Quality, Master Data Management, Data Warehousing, Data Science, ETL Applications, Data Analytics, SQL Language, Python Language, Scripting, Database Management, Application Development, Database Development, Textual ETL, Text Analytics, Hadoop Ecosphere, Analytical Databases, Relational Databases, Microsoft Analysis Server, OLAP, OLTP, Multi-Dimensional Databases, Data Vault Architectures, Star-Schema Architectures, Data Marting
VISUALISATION (30%): Dashboarding, Reporting, Charting, Location Analytics, Statistical Analytics, Data Analytics, Business Analysis, Story Telling, Semantic Layer, Presentation Layer, Collaboration
11. Traditional Systems Under Pressure
Challenges with traditional systems (ERP, CRM, SCM):
◦ Constrain data to the app
◦ Can’t manage new data
◦ Costly to scale
New data types (clickstream, geolocation, web data, Internet of Things, docs and emails, server logs) now carry much of the business value, and volumes are exploding: 2.8 zettabytes in 2012, a projected 44 zettabytes by 2020. How organisations respond separates industry leaders from laggards.
12. Volume – exponential growth
Variety – new data types
Velocity – time to value
The digital floodgates have opened… and will never be turned off again
13. Big Data equals Big Opportunity
◦ 88% of big data is untouched (data sources and types)
◦ $15 trillion of value in new possibilities for companies
◦ Universal access, time to value
23. Acquire, Grow & Retain Customers
Who are your best customers and how can you keep them satisfied? Where can you find more customers like them?
Big data holds the insights into who your customers are and what motivates them.
24. Optimise Operations & Reduce Fraud
Are your operational processes and systems as efficient as they could be? Could you reduce waste and fraud if you had real-time visibility into your business?
Adopting a big data and analytics strategy can help you plan, manage and maximise operations, supply chains and the use of infrastructure assets.
25. Transform Financial Processes
Do you have real-time access to reliable information about all aspects of your business? Do you have the visibility, insight and control over financial performance to better measure, monitor and shape business outcomes?
Analysing all of your data, including big data, can drive enterprise agility and provide insights to help you make better decisions.
26. Manage Risk
How can you mitigate the financial and operational risks that could devastate your organisation? How can you manage regulatory change and reduce the risk of non-compliance?
Proactively identifying, understanding and managing financial and operational risk can enable more risk-aware, confident decision making.
27. Create New Business Models
Are your competitors making bigger strides in changing your industry or creating new markets than you? Does your organisation’s culture support innovative thinking and exploration?
Explore strategic options for business growth, using new perspectives gained from exploiting big data and analytics.
28. Improve IT Economics
Is your existing IT infrastructure able to provide the insights that decision makers need? Are you doing enough to protect your data centre and data from potential criminal activity or fraud?
Lead the creation of new value and agility for your business by optimising big data and analytics for faster insight at a lower cost.
29. Analytics Trends
1. Data Governance
2. Social Intelligence
3. Analytics Organisation-Wide
4. Community Collaboration
5. Integration of Everything
6. Cloud Analytics
7. Conversational Data
8. Data Journalism
9. Mature Mobility
10. Smart Analytics
30. Areas BIG DATA is Helping
1. Operations & Optimisation
2. Product Development
3. Customer Experience
4. Understanding and Targeting Customers
31. Performance Examples
Actian is Helping These Companies Achieve Leadership
Digital Marketing: Hyper-segmentation every hour
Banking: Enterprise Risk every 2 minutes
Retail: Enterprise Market Basket Analysis every minute
Defense: Network intrusion models every second
Fraud: Adjustments every nanosecond
Amazon Redshift – Actian Matrix: cloud-based, petabyte-scale data warehouse
32. The Value of Business Intelligence
Organisations competing with analytics substantially OUTPERFORM their peers by 220%.
47. KNIME @ OSR
Anand Antony
Senior Data Analyst
Operations Analytics and Intelligence
Office of State Revenue
anandjantony@gmail.com
Ph. 0414491765
48. OSR: Who are we?
As NSW’s principal revenue agency, OSR administers state taxation and revenue for, and on behalf of, the people of NSW
◦ Payroll tax
◦ Land tax
◦ Duties
◦ Grants such as First Home Benefits
49. Data Analytics Team: Who are we?
Operations Analytics & Intelligence is the analytics wing of the Operations Division in OSR
◦ Three teams – Business Intelligence, Data Analytics and Data Team
Data Analytics team consists of 10 analysts
Supports tax auditors by detecting possible non-compliant clients
◦ Via matching data from various sources and analysing them
◦ 60+ data sources
50. Data Analytics Scenario - Past
Data matching, preparation and analysis
◦ SPSS Clementine, SAS Enterprise Guide
Data mining
◦ Salford Systems
Reporting/Dashboards
◦ Excel
Fuzzy data matching
◦ SSA Name (Informatica)
51. Data Analytics Scenario - Current
Data matching, preparation and analysis
◦ KNIME (around 70% transitioned from Clementine/SAS)
Data mining
◦ Salford Systems
◦ Will be evaluating KNIME
Reporting/Dashboards
◦ Excel
Fuzzy data matching
◦ SSA Name (Informatica)
53. Why KNIME?
Start with canvas programming
Enrich with code via snippets
◦ Mostly Java Snippet at the moment
Fast and easy to learn for data scientists
Can tackle almost any analytic task
54. KNIME - Having the best of both worlds!
◦ Canvas programming + coding
55. What do we use KNIME for?
Pretty much for everything! (except reporting and data mining)
◦ Data reading (text files, databases, non-standard formats)
◦ Data merging (potentially fuzzy matching too in future)
◦ Data manipulation
◦ Creating new variables
◦ Data output
◦ Modelling (possibly in future)
56. Key nodes/functionalities
◦ Sorter, Column Reorder, Column Filter, Column Rename
◦ Concatenate, Joiner, Reference Row Filter (anti-join)
◦ Missing Value
◦ Math Formula, String Manipulation, Rule Engine, Java Snippet
◦ GroupBy (aggregate, dedupe)
◦ Value Counter, Pivoting
◦ Looping
◦ Regular expressions/wildcards in various nodes
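For readers more familiar with Python than KNIME, the table operations these nodes perform can be sketched with pandas. This is only an analogy, not KNIME's API; the column names and toy data are invented for illustration.

```python
import pandas as pd

# Toy data standing in for a KNIME table.
df = pd.DataFrame({
    "entity": ["A", "B", "B", "C"],
    "amount": [100.0, 250.0, 250.0, None],
})

# Sorter / Column Rename / Column Filter & Reorder
out = (df.sort_values("entity")               # Sorter
         .rename(columns={"amount": "amt"})   # Column Rename
         [["entity", "amt"]])                 # Column Filter / Column Reorder

# Missing Value node: replace missing amounts with a fixed value
out["amt"] = out["amt"].fillna(0.0)

# GroupBy node, used both to aggregate and to de-duplicate
agg = out.groupby("entity", as_index=False)["amt"].sum()  # aggregate
dedup = out.drop_duplicates(subset=["entity"])            # dedupe

# Reference Row Filter (anti-join): keep rows whose entity is NOT in a
# reference table (here, the first two entities of the aggregate).
ref = agg.head(2)["entity"]
anti = out[~out["entity"].isin(ref)]
```

Concatenate and Joiner map to `pd.concat` and `DataFrame.merge` in the same spirit.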
58. Case study 1
Officers fill in a questionnaire on the entity audited – one Excel spreadsheet per entity
Collate all the spreadsheets stored in a location
Massage the data to produce an analysis dataset with one row per entity
Key KNIME nodes/functionalities used
◦ List Files
◦ Table Row to Variable Loop Start, Loop End
◦ Java Snippet
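The collation pattern above (list files, loop over them, pivot to one row per entity) can be sketched in plain Python. This is a hedged illustration, not the OSR workflow: CSV stands in for the Excel spreadsheets, and all file, entity and question names are invented.

```python
import csv
import glob
import os
import tempfile

workdir = tempfile.mkdtemp()

# Simulate two questionnaire files, one per audited entity,
# each holding question/answer rows.
for entity, answers in [("ent1", {"q1": "yes", "q2": "no"}),
                        ("ent2", {"q1": "no", "q2": "yes"})]:
    with open(os.path.join(workdir, f"{entity}.csv"), "w", newline="") as f:
        writer = csv.writer(f)
        for q, a in answers.items():
            writer.writerow([q, a])

# "List Files" + loop: read each file and massage it into one
# record per entity, pivoting question/answer pairs into columns.
rows = []
for path in sorted(glob.glob(os.path.join(workdir, "*.csv"))):
    entity = os.path.splitext(os.path.basename(path))[0]
    record = {"entity": entity}
    with open(path, newline="") as f:
        for q, a in csv.reader(f):
            record[q] = a
    rows.append(record)
```

In KNIME the same shape is achieved with List Files feeding a Table Row to Variable Loop Start, a reader inside the loop, and a Loop End collecting the per-entity rows.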
66. Case study 2 – Use of Flow variables
Technique
◦ Put metadata rules into a file
◦ Read them and convert into flow variables
Example
◦ Reorder variables in a dataset to match the order in the data dictionary
◦ We use the “Flow Variables” tab of the Column Reorder node to achieve this
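The idea of driving column order from metadata rather than hard-coding it can be sketched in pandas. This is only an analogy to the flow-variable technique; the column names and dictionary order are invented.

```python
import pandas as pd

# A dataset whose columns arrived in an arbitrary order.
df = pd.DataFrame({"amount": [1], "entity": ["A"], "date": ["2015-01-01"]})

# Metadata: the column order as given in the data dictionary.
# In KNIME this would arrive as flow variables read from a file.
dictionary_order = ["entity", "date", "amount"]

# Apply the dictionary order, keeping only columns that actually exist
# (unknown columns could be appended at the end if needed).
ordered = df[[c for c in dictionary_order if c in df.columns]]
```

Because the order lives in data, changing the dictionary file re-orders the output without touching the workflow itself, which is the point of the flow-variable approach.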
67. Use of flow variables
(Screenshot: configure via the “Flow Variables” tab, not the manual column list)
68. KNIME wishlist!
An offset function in some nodes, e.g. Rule Engine, Math Formula.
An offset function gives the value of a variable in a previous row; e.g. in SPSS Clementine, @OFFSET(var,1) gives the value in the previous row.
Note: within a Java Snippet this is readily achieved, since a variable retains its value until it is overwritten. We can first use the value populated from the previous row inside a formula, then update it with the current row's value for use in the next row.
72. Apache Spark on KNIME
Unleash the power of Big Data on Hadoop
73. The Big Data Problem: Data Volume
1. Storage is getting cheaper
2. Data sources are increasing
3. Thus, data is growing faster
But processing it all is still a problem. Why?
74. The Big Data Problem: Processing
Now that memory is cheaper, processing can move from disk into memory.
75. Why Apache Spark?
Apache Spark is an open-source parallel processing framework that enables users to run large-scale data analytics across clustered computers.
• Speed
• Flexible with programming platform
• Generality
• Run Everywhere
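The programming model Spark distributes across a cluster can be illustrated in miniature with plain Python. This is not Spark's API, just a toy map/reduce word count showing the style of computation: independent per-partition work (map) that can run in parallel, followed by a merge of partial results (reduce).

```python
from collections import Counter
from functools import reduce

lines = ["big data on hadoop", "spark on hadoop"]

# map: each line is processed independently into partial word counts,
# which is what makes the work parallelisable across machines.
mapped = [Counter(line.split()) for line in lines]

# reduce: merge the partial counts into a final result.
counts = reduce(lambda a, b: a + b, mapped)
```

In actual Spark the same shape would be written against distributed collections (e.g. `flatMap` and `reduceByKey`), with the framework handling the partitioning and shuffling.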
The cloud is everywhere, and we will continue to see adoption at extreme volumes. And big data is driving a lot of the cloud's growth: revenues for the top 50 public cloud providers shot up 47% in Q4 of 2013 to $6.2B according to Technology Business Research. Amazon Redshift and Google BigQuery are growing dramatically. Database players like Teradata are also jumping in the game.
Snowflake
It has been suggested that 80% of an analyst’s time is spent on data prep, while only 20% is spent looking for insights. Enter the personal data cleansing tools focused on the analyst. Tools like Trifacta, Alteryx, Paxata and Informatica Rev are making data preparation easier to use with less technology and infrastructure required to support it.
KNIME
Some may think that the jury is still deliberating, but NoSQL is making a mark in the industry. NoSQL was founded to provide scale, flexibility, and the ability to leverage large sets of data faster. Companies like MarkLogic, Cassandra, Couchbase, and MongoDB are bringing new innovation to the database market and are doing quite well with large production implementations in surprising places.
Whether you are of the belief that Hadoop will take over current database architecture, or there will be a mix of Hadoop and other styles of databases, one thing is clear, Hadoop is now a part of the big data architecture in many companies. The legacy data storage vendors have incorporated Hadoop into their architecture in one way or another. Some classical database providers have embraced the market leading Hadoop players like Teradata, SAP, and HP. Others, like IBM, have built their own flavor of Hadoop. Spark and Impala continue to mature, putting more pressure on the traditional stack. In any case, Hadoop looks like it is here to stay and is synonymous with big data architectures.
The concept of a big data lake, a large body of data that exists in a natural or unrefined state, is in early stages. This idea answers some fundamental questions around how to effectively store, manage and use the massive amounts of incoming data. The cutting edge companies Google and Facebook have developed useful ways to leverage the data lake, but should be considered early adopters. As it is, the data lake is still in a nascent concept, and we should expect to see advances in managing and securing the big data lake this year. And as Gartner points out, the data lake requires a new kind of management to be effective.
When new ways of doing things come about, it creates a new ecosystem around it. The same holds true for big data. We have new ways to store data, clean data, add content to data, bring in social media, analyze machine data, do deep analysis on data and, of course, visualize data. Over the next year we will see some surprising changes in the current ecosystem. Specifically, we will see MPP (Massively Parallel Processing) databases play a different and less prominent role.
Actian Matrix (better known as Amazon Redshift)
Your Ford Fusion sends 250GB of data back to Ford, who in turn lets you know that something is wrong with your car. Sounds like fantasy, but hardware and semi-conductor companies are betting on it. Companies like Ford, GE, and Rolls Royce jet engines are just a few examples of companies investing in IoT. In 2015, we will see a greater use from manufacturers. Some technology companies like Cisco will create solutions around the concept to help manage the massive amounts of data.
Acquire, grow and retain customers:
Who are your best customers and how can you keep them satisfied? Where can you find more customers like them?
Big data holds the insights into who your customers are and what motivates them. Analysing big data can help you discover ways to improve customer interactions, add value and build relationships that last.
Optimise operations and reduce fraud:
Are your operational processes and systems as efficient as they could be? Could you reduce waste and fraud if you had real-time visibility into your business? Adopting a big data and analytics strategy can help you plan, manage and maximise operations, supply chains and the use of infrastructure assets. Gain the insights you need to reduce costs, increase efficiencies and productivity, and limit threats.
Transform financial processes
Do you have real-time access to reliable information about all aspects of your business? Do you have the visibility, insight and control over financial performance to better measure, monitor and shape business outcomes? Analysing all of your data, including big data, can drive enterprise agility and provide insights to help you make better decisions.
Manage risk
How can you mitigate the financial and operational risks that could devastate your organisation? How can you manage regulatory change and reduce the risk of non-compliance? Proactively identifying, understanding and managing financial and operational risk can enable more risk-aware, confident decision making.
Create new business models
Are your competitors making bigger strides in changing your industry or creating new markets than you? Does your organisation’s culture support innovative thinking and exploration? Explore strategic options for business growth, using new perspectives gained from exploiting big data and analytics.
Improve IT economics
Is your existing IT infrastructure able to provide the insights that decision makers need? Are you doing enough to protect your data centre and data from potential criminal activity or fraud? Lead the creation of new value and agility for your business by optimising big data and analytics for faster insight at a lower cost.
Just as the business intelligence landscape has transformed to self-service data, so too must governance transform. Simple approaches like locking down all enterprise data won’t work any longer—nor will the approach of doing away with any process at all. Organizations will begin to investigate what governance means in a world of self-service analytics.
In 2014 we saw organizations begin to analyze social data in earnest. In 2015, the leading edge will start to take advantage of their capabilities. Tracking conversations at scale via social will let companies find out when a topic is starting to trend and what their customers are talking about. Social analytics will open the door to responsive product optimization.
Today’s data analyst may be an operations manager, a supply chain executive or even a salesperson. New, easier-to-use technologies that provide browser-based analytics let people answer ad-hoc business questions. Companies that recognise this as a strategic advantage will begin to support everyday analysts with data, tools and training to help them do more of what they’re doing.
The consumerization of IT is no longer theoretical, it’s a fact. People use products that they enjoy using, and analytics software is no different. Companies whose products inspire and empower are seeing their communities flourish. And prospective customers will also look to the health of product communities as important proof points in crowded marketplaces.
The last 10 years have seen a massive amount of innovation across the data space, resulting in mixed environments for everything from data storage to analytics to business applications. We won’t see a return to the age of monolithic systems. However, organizations are losing patience with multiple logins and clunky processes to move and manage data. Rapid integration leveraging simple interfaces is going to become the standard.
In 2015, we’ll start to see the first major use of cloud analytics for on-premise data. Until now, cloud analytics have been primarily used for data in cloud apps. In 2015 companies will begin to choose the cloud when it makes sense for their business case, not only because the data is there.
We are starting to see an age when data is interactive enough that it can become the backbone of a conversation. Now that people have speed-of-thought analytical tools, they can quickly analyze data, mash it up with other data and redesign it to create a new perspective. And as a result of these data conversations, organisations will get more insight from their data.
The arrival on the scene of Vox and the continued ascendance of sites like fivethirtyeight.com will force more newsrooms to integrate data analytics into their online presence. This trend will have a spillover effect from the public sphere to organisations, encouraging companies that are lagging in analytics to get with the times.
Workers are spending less time at their desks. But that doesn’t mean they should be less informed by data; in fact they have a greater need for data than ever before. Mobile solutions for many analytics emerged years ago and are finally reaching a level of maturity that means that mobile workers really can do light analysis from the road. And the emphasis on mobile has forced vendors to offer more natural and intuitive interfaces across the board.
Advances in graphical, intuitive modelling will mean that business users can begin to use predictive analytics without the need for extensive expert consultation or scripting. As self-service analytics becomes more mainstream, tasks such as forecasting and prediction will become more common – and a lot less painful.
Since I graduated with a Master of Statistics, analytics has been a core theme in my 20+ year career. Using data to solve problems is a passion that drives me to seek out and apply technology innovatively. In the new digital world, I aim to be a champion and an evangelist for the principle of "Evidence-based Decision Making".
Currently Director, Risk Analytics, Deloitte Australia
A Data Analyst with 15 years of experience (taxation – 10 years, data-driven marketing – 5 years). Experience across a spectrum of data analysis tasks (exploratory analysis, developing risk/predictive variables, predictive modelling, reporting). Well-developed programming skills in a range of data analysis software such as KNIME, SAS, and SPSS Clementine (IBM Modeler).
He’s a highly regarded Data Analyst at OSR.