I was recently asked to build an analytical platform for a project. But what is an analytical platform? The client, a retailer, described it as a database where it could store data and a front end where it could do statistical work. This work would range from simple means and standard deviations through to more complex predictive analytics that could be used, for example, to analyze the past performance of a customer to assess the likelihood that the customer will exhibit a future behavior. Or it might involve using models to classify customers into groups, and ultimately to bring the two processes together into an area known as decision models. The customer had also come up with an innovative way to resource the work: it offered work placements to master's degree students studying statistics at the local university and arranged for them to work with the customer insight team to describe and develop the advanced models. All the customer needed was a platform to work with.

From a systems architecture and development perspective, we could describe the requirements in three relatively simple statements:
1. Build a database with a very simple data model that could be easily loaded, that was capable of supporting high-performance queries, and that did not consume a massive amount of disk space. It would also ideally be capable of being placed in the cloud.
2. Create a Web-based interface that would allow users to securely log on, to write statistical programs that could use the database as a source of data, and to output reports and graphics as well as to populate other tables (for example, target lists) as a result of statistical models.
3. Provide a way to automate the running of the statistical data models, once developed, so that they can be run without engaging the statistical development resources.

Of course, time was of the essence and costs had to be as low as possible – but we've come to expect that with every project.

Step 1: The database
Our chosen solution for the database was an SAP® Sybase® IQ database, a technology our client was already familiar with. SAP Sybase IQ is a column-store database. This means that instead of storing all the data in its rows, as many other databases do, the data is organized on disk by its columns. For example, a table with a country column would otherwise have the text of each country (for example, "United Kingdom") stored many times. In a column-store database the text is stored only once and given a unique ID. This is repeated for each column, and therefore the "row" of data consists of a list of IDs linked to the data held for each column.

This approach brings significant advantages for reporting and analytical databases. The first is reduced disk space. In our example, "United Kingdom" would occupy 14 bytes, while the ID might occupy only 1 byte – reducing the storage for that one value in that one column by a ratio of 14:1 – and this saving is repeated throughout the data. Furthermore, because there is less data on the disk, the time taken to read the data from disk and to process it is reduced, which massively speeds up the queries too. Finally, each column is already indexed, which again helps the overall query speed.

An incidental but equally useful consequence of using a column-store database such as SAP Sybase IQ is that there is no advantage in creating a star schema as a data model. Instead, holding all the data in one large wide table is just as efficient, because indexing each column with a key means that the underlying storage of the data is effectively a star schema. Creating a star schema in a column-store database rather than a large single table would mean incurring unnecessary additional join and processing overhead. As a result of choosing SAP Sybase IQ's column-store database, we are able to have a data model that consists of a number of simple, single-table data sets that is quick to load and to query.

It should be noted that this type of database is not well suited to online transaction processing (OLTP) applications because of the cost of doing small inserts and updates. However, this is not relevant for this particular solution.

The solution can be deployed only on a Linux platform. We use Linux for three reasons. First, RStudio Server Edition is not yet available for Microsoft Windows. Second, precompiled packages for all elements of the solution on Linux reduce the installation effort. Third, Linux environments are normally cheaper than Windows environments due to the cost of the operating system license. We chose CentOS because it is a Red Hat derivative that is free.

One additional advantage of this solution for some organizations is the ability to deploy it in the cloud. Since the solution's footprint is small once compression is considered, and since all querying is done via a Web interface, it is possible to use any cloud provider.
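The 14:1 figure above is simply dictionary encoding at work. The sketch below is a rough illustration in plain Python – not SAP Sybase IQ's actual storage format – showing how a repeated country column collapses into one small lookup table plus a list of 1-byte IDs:

```python
# Illustrative sketch of column-store dictionary encoding (not SAP
# Sybase IQ's real on-disk format): repeated strings in a column are
# replaced by a small dictionary plus a list of compact integer IDs.

def dictionary_encode(column):
    """Map each distinct value in a column to a small integer ID."""
    dictionary = {}            # value -> ID
    ids = []
    for value in column:
        if value not in dictionary:
            dictionary[value] = len(dictionary)
        ids.append(dictionary[value])
    return dictionary, ids

# A country column as a row store would hold it: full text in every row.
country = ["United Kingdom"] * 800 + ["France"] * 200

dictionary, ids = dictionary_encode(country)

row_store_bytes = sum(len(v) for v in country)    # text repeated per row
column_store_bytes = (
    sum(len(v) for v in dictionary)               # each text stored once
    + len(ids)                                    # 1 byte per ID (< 256 values)
)

print(f"row store:    {row_store_bytes} bytes")
print(f"column store: {column_store_bytes} bytes")
print(f"ratio:        {row_store_bytes / column_store_bytes:.1f}:1")
```

The ratio for the whole column is a little below the per-value 14:1 because the dictionary itself and the lower-frequency values take their share of space; the real engine's encoding and compression are, of course, more sophisticated than this.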
SAP White Paper – Building an Analytical Platform 3
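The star-schema point above can be made concrete. The sketch below uses SQLite purely as a stand-in (the paper's database is SAP Sybase IQ, and the table and column names are invented): both layouts answer the same question, but the explicit star schema pays for a join that the wide table avoids, while a column store dictionary-encodes the wide table's repeated values behind the scenes anyway.

```python
# Sketch contrasting a star schema with a single wide table. SQLite is
# a stand-in engine; table/column names are illustrative only.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Star schema: the fact table holds a surrogate key into a dimension.
cur.execute("CREATE TABLE dim_country (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE fact_sales (country_id INTEGER, amount REAL)")
cur.executemany("INSERT INTO dim_country VALUES (?, ?)",
                [(1, "United Kingdom"), (2, "France")])
cur.executemany("INSERT INTO fact_sales VALUES (?, ?)",
                [(1, 100.0), (1, 50.0), (2, 75.0)])

# Wide table: the country name sits directly in each row. A column
# store would dictionary-encode it internally, so little space is lost.
cur.execute("CREATE TABLE sales_wide (country TEXT, amount REAL)")
cur.executemany("INSERT INTO sales_wide VALUES (?, ?)",
                [("United Kingdom", 100.0), ("United Kingdom", 50.0),
                 ("France", 75.0)])

# Same totals per country -- but the star schema needs a join.
star = cur.execute(
    "SELECT d.name, SUM(f.amount) FROM fact_sales f "
    "JOIN dim_country d ON d.id = f.country_id "
    "GROUP BY d.name ORDER BY d.name").fetchall()
wide = cur.execute(
    "SELECT country, SUM(amount) FROM sales_wide "
    "GROUP BY country ORDER BY country").fetchall()

print(star == wide)   # identical results, simpler query for the wide table
```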
SAP Sybase IQ has the functionality to call external C++ programs as user-defined functions. These C++ programs "talk" to a process known as Rserve, which in turn executes the R program and returns the results to SAP Sybase IQ. This allows R functions to be embedded directly into SAP Sybase IQ SQL commands. While setting this up requires a little more programming experience, it does mean that all processing can be done within SAP Sybase IQ.

Conversely, it is possible to run R from the command line and call a program that in turn uses the RJDBC connection to read and write data to the database. Having a choice of methods is very helpful, as it means that the solution can be integrated with the ETL environment in the most appropriate way. If the ETL tool can only issue database commands, then the user-defined function (UDF) route is the most attractive. However, if the ETL tool supports host callouts (as ours does), then running R programs from a command-line callout is quicker than developing the UDF.

Conclusions
Business intelligence requirements are changing, and business users are moving more and more from historical reporting into predictive analytics in an attempt to get both a better and deeper understanding of their data.

Traditionally, building an analytical platform has required an expensive infrastructure and a considerable amount of time for setup and deployment. By combining the high performance and low footprint of SAP Sybase IQ with the open-source R and RStudio statistical packages, it is possible to quickly deploy an analytical platform in the cloud for which there are readily available skills.

This infrastructure can be used both for rapid prototyping of analytical models and for running completed models on new data sets to deliver greater insight into the data.

About the Author
David Walker has been involved with business intelligence and data warehousing since founding Data Management & Warehousing (http://datamgmt.com) in 1995.

David and his team have worked around the world on projects designed to deliver value by converting data into information and by helping organizations exploit that information.

David's project work has given him experience in a wide variety of industries, including manufacturing, transportation, and the public sector, as well as a broad and deep knowledge of business intelligence and data warehousing technologies.
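The command-line route above is just a host callout: the ETL tool launches an external program, the program reads and writes the database (over RJDBC in the R case), and the ETL tool checks the exit status. A minimal stand-in for that pattern is sketched below, with plain Python playing the role of R and a file standing in for the database; the script name and CSV columns are hypothetical.

```python
# Sketch of the host-callout pattern an ETL tool would use: launch an
# external analytics program and fail the step if it fails. Python
# stands in for R here, and the "model" writes a CSV where the real
# script would read and write the database over RJDBC.
import subprocess
import sys
import tempfile
from pathlib import Path

# A hypothetical model script, as the host callout would invoke it.
MODEL_SCRIPT = """\
import csv, sys
with open(sys.argv[1], "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["customer_id", "score"])   # e.g. a target list
    writer.writerow([42, 0.87])
"""

with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "model.py"
    output = Path(tmp) / "target_list.csv"
    script.write_text(MODEL_SCRIPT)

    # The callout itself: run the program, check that it succeeded.
    result = subprocess.run([sys.executable, str(script), str(output)],
                            capture_output=True, text=True)
    assert result.returncode == 0, result.stderr

    print(output.read_text())
```

The same scheduling hook works unchanged whichever statistical tool sits behind it, which is what makes the callout route quick to adopt when the ETL tool already supports it.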