Overview
Skill Level: Any Skill Level
This recipe shows how to extract, transform and load weather data stored by IBM Cloud Object
Storage using IBM SQL Query, pandas, Apache Spark, IBM Watson Studio and Jupyter
notebooks for Python.
Ingredients
Software Requirements
IBM Cloud Account
IBM Watson Studio
To obtain an IBM Cloud Account and get access to the IBM Cloud and to IBM Watson Studio, please
follow the instructions outlined here:
Get Started with IBM Cloud
Sign up for IBM Watson Studio
 
Recipe Requirement
This recipe covers part 3 of the 'Weather Data meets IBM Cloud' series.
Get set up for this recipe by covering the following prerequisite recipes:
Weather Data meets IBM Cloud. Part 1 - Ingestion and Processing of Weather Data.
Weather data meets IBM Cloud. Part 2 - Storage and Query of Weather Data
Step-by-step
Introduction
The recipe titled ‘Weather Data meets IBM Cloud. Part 1 – Ingestion and Processing‘ showed
how to ingest and process weather data from the Weather Company into the IBM Cloud
Platform using IBM Functions and IBM Event Streams. A second recipe titled ‘Weather data
meets IBM Cloud. Part 2 – Storage and Query of Weather Data‘ followed up by showing how
to store and query weather data using IBM Cloud Object Storage and IBM SQL Query.
In this recipe we shall continue with a more comprehensive demonstration of how to
transform and aggregate weather data in preparation for analysis and visualization. In doing so
we will implement a number of Extract, Transform and Load (ETL) jobs for weather
observation and lightning data using a variety of technologies, each with their pros and
cons, namely IBM SQL Query, the pandas toolkit for Python and Apache Spark. We will
start out by aggregating weather observation data, computing daily minimum, average and
maximum temperatures using IBM SQL Query and store the resulting information in IBM
Cloud Object Storage as a CSV file. In a second step we will then look into lightning data and
transform it into a form that can be used for analytics by using pandas – a toolkit frequently
used for data wrangling purposes when developing data science and machine learning
solutions. In a third step we will briefly demonstrate how Spark SQL can be used to aggregate
weather observation data as well and then finally store the results in a PostgreSQL database
on IBM Cloud rather than a CSV file. For this step we will use the IBM Analytics Engine as
well as IBM Cloud Databases for PostgreSQL. Key components in all tasks will be IBM
Watson Studio and its support for Jupyter notebooks with Python.
In the recipe you will go through the following exercises:
Section 3: Get started by provisioning the required services – e.g. IBM Watson Studio,
IBM Cloud Databases for PostgreSQL and IBM Analytics Engine running Apache Spark.
Section 4 & 5: Define transformations by aggregating weather data using IBM SQL Query
as a standalone tool interactively, then continue running the same query using Jupyter
notebooks and the IBM SQL Query API.
Section 6: Define transformations by transforming lightning data using Jupyter notebooks
and pandas.
Section 7: Define transformations by aggregating weather observation data using Jupyter
notebooks and Apache Spark SQL to finally store the data in a PostgreSQL database.
 
Before the hands-on exercises we will briefly introduce the overall architecture and the
main new components: IBM Watson Studio, IBM Cloud Databases for PostgreSQL and
IBM Analytics Engine running Apache Spark.
Architectural Overview
Before you start it is important to provide you with some details on the architecture of the
solution. The architecture diagram is shown below and is a typical IoT architecture with
services for event data ingestion, event handling (message bus), landing zone (object
storage) and event data processing to transform, aggregate, load and visualize event data in
a form so that it becomes business relevant. In other words, it is a data refinement pipeline turning
“raw oil” into “business-relevant information”, covering data at rest.
In the first part of this series, weather data was retrieved from the Weather Company Data
service by IBM Cloud Functions (based on Apache OpenWhisk) and delegated onwards
to IBM Event Streams (based on Apache Kafka). In the second part of the series IBM Cloud
Object Storage Bridge was then configured to write data from IBM Event Streams to IBM
Cloud Object Storage. Once that the weather data had been stored in IBM Cloud Object
Storage, it became possible to query the data using IBM SQL Query. As a prerequisite for
writing and reading data from IBM Cloud Object Storage, proper access permissions had
however to be created using IBM Identity and Access Management (behind the scene).
In this part we shall continue transforming and aggregating the weather data in IBM Cloud
Object Storage using a variety of technologies and then briefly discuss the pros and cons of
each. None of the transformations performed will be complex. The main purpose of these
transformations will be to prepare the data for analytics so that it can be used for defining
dashboards and we have deliberately chosen to focus on the principles of using the
technologies rather than delving into complex transformations requiring lengthy
computations. For all tasks we will use IBM Watson Studio to develop Jupyter
notebooks for Python.
During the sections of this recipe we will work with two kinds of data: the weather observation
data that was ingested and stored in the IBM Cloud Object Storage using IBM Functions and
IBM Event Streams, as well as worldwide lightning data for a given point in time. The latter
will be available as a JSON file that has been extracted from the Weather Company API and
made available to you. You will work with a pre-extracted file that just needs to be uploaded to
IBM Watson Studio rather than real-time data for a very simple reason: extracting real-time
lightning data from the Weather Company API requires a special API Key with limited access
and charges involved. So in this recipe we shall just pretend that the data has already been
ingested and then work with an uploaded file.
In this recipe three different kinds of technologies will be used for defining the transformations:
IBM SQL Query: For the weather observations data we will initially aggregate the data to
show minimum, average and maximum temperatures per location on a daily basis making
a deliberate distinction between night and day temperatures. For this we will use IBM
SQL Query, continuing the work of the second recipe by defining a SQL query that
loads weather observation data from a COS bucket, aggregates it and then stores the
results in a dedicated CSV file.
Pandas: In a second step we will look into the lightning data. Unfortunately, this data
comes in a form which can neither be processed by IBM SQL Query nor be used for BI
using IBM Watson Studio. The data must therefore be transformed into an acceptable
form, and for this purpose we will use the pandas library for Python, which is frequently used
for data wrangling purposes as long as the amount of data can be adequately processed
in the memory of a single machine. 
Apache Spark: For large data sets we will need to adopt other technologies than pandas.
In a third step we will therefore show how to use Apache Spark and Spark SQL to aggregate
the weather observation data, but rather than loading the result into a CSV file as was the
case for IBM SQL Query, we will load the aggregated data into a PostgreSQL database
using the SQLAlchemy toolkit for Python. For this purpose you will need two additional
services in IBM Cloud: the IBM Analytics Engine offering Apache Spark (as well as
Hadoop which will not be needed since we are using IBM Cloud Object Storage as data
lake) and IBM Cloud Databases for PostgreSQL.
 
In the context of data transformations, the terms Extract-Transform-Load (ETL) and Extract-
Load-Transform (ELT) are usually mentioned as two different kinds of approaches.
ETL is the traditional approach adopted for Data Warehouses and is normally a well-
defined workflow implemented using specific ETL tools. ETL first extracts data from
various data sources. Then data is cleansed, enriched and transformed to finally be stored in
a data warehouse. The Data Warehouse is typically well defined according to a schema,
and adding new information to the DW is typically a rigorous process requiring changes to
ETL jobs, the Data Warehouse schema and the required reports and dashboards.
ELT (Extract, Load, Transform) on the other hand is of newer origin and is a variant of
ETL where the extracted data is first loaded into a target system. Transformations are
then performed after the data is loaded. ELT typically works well when the target system
is powerful enough to handle transformations. Distributed repositories such as Hadoop or
IBM Cloud Object Storage combined with e.g. Apache Spark are often used in ELT
pipelines because they are highly efficient in performing transformations on large sets of
data. ELT is less rigorous than traditional approaches using ETL and DWs, and it does
not (necessarily) presume that the data is stored according to a schema.
 
If we look at the architecture above ‘in the large’ then it follows the ELT paradigm: data is
extracted from The Weather Company, loaded into IBM Cloud Object Storage to eventually
be transformed by e.g. IBM SQL Query. However, there is also an instance of a classical ETL
approach in the architecture insofar as Apache Spark is used to extract data from IBM Cloud
Object Storage, transform it by performing appropriate aggregations to finally load it into the
Data Warehouse on PostgreSQL for further analysis.
An overview of IBM Functions, IBM Event Streams, IBM Cloud Object Storage, IBM SQL
Query and IBM Identity and Access Management has already been given in the previous two
parts of this series. We shall therefore provide a short overview of the three new components in
this recipe: IBM Watson Studio, IBM Cloud Databases for PostgreSQL and IBM Analytics
Engine.
 
IBM Watson Studio
IBM Watson Studio provides users with an environment and tools to solve business problems
by collaboratively working with data. Users can choose the tools needed to analyse and
visualize data, to cleanse and shape data, to ingest streaming data, or to create, train, and
deploy machine learning models.
The main functionality offered relates to components for:
Create Projects to organize the resources (such as data connections, data assets,
collaborators, notebooks) to achieve an analytics goal.
Access data from Connections to your cloud or on-premises data sources.
Upload files to the project’s object storage.
Create and maintain Data Catalogs to discover, index, and share data.
Refine data by cleansing and shaping the data to prepare it for analysis.
Perform Data Science tasks by creating Jupyter notebooks for Python or Scala to run
code that processes data and then view the results inline.
Ingest and Analyze Streams data with the Streams Designer tool.
Create, test and deploy Machine Learning and Deep Learning models.
Classify images by training deep learning models to recognize image content.
Create and share Dashboards of data visualizations without coding.
Reuse content such as data sets, notebooks and tutorials available on the Watson
Community.
 
IBM Watson Studio is technically based on a variety of Open Source and IBM technology. In
this recipe we shall define several Jupyter notebooks for Python that will extract, transform and
then finally load the result of the transformation into either a file in IBM Cloud Object Storage
or a database table. The below picture provides an overview of some of the components of
IBM Watson Studio and the IBM Watson Data Platform. The ones with solid yellow lines are
used in this recipe, the ones with dotted lines in previous or upcoming recipes:
During the hands-on in this recipe you will work with a number of Jupyter notebooks as
shown below – either using IBM SQL Query, pandas or Spark SQL. The notebooks will take
data assets such as the weather data bucket of IBM Cloud Object Storage (connection
‘COSWeatherData’) holding weather observation or lightning data (‘Lightning_global_1min’)
as input, and produce several CSV files as the result:
‘weather-observations.csv’ containing detailed weather data extracted from the JSON
objects in IBM Cloud Object Storage.
‘weather-observation-temperatures.csv’ containing aggregated min, max and average
temperatures.
‘lightning-data.csv’ containing lightning data in tabular form.
 
IBM Analytics Engine and Apache Spark
IBM Analytics Engine has been designed for the development and deployment of analytics
applications using open source Apache Spark and Apache Hadoop. Using the service, it is
possible to customize the cluster using your own analytics libraries and open source packages.
The service also integrates with IBM Watson Studio or third-party applications to submit jobs
to the cluster. In this recipe we shall take a narrow view of the service, focusing solely on the
ability to run a Spark job from within a Jupyter notebook managed by IBM Watson
Studio.
Apache Spark is a distributed computing platform designed to be a fast and general-purpose
engine for processing big data – be it at rest or in motion. Spark applications run as
independent sets of processes on a cluster, coordinated by a driver program. The root
object is the SparkContext, which can be used to connect to the cluster manager, which allocates
resources across the cluster. Spark comes with its own scheduler but also works with
YARN and Mesos. Once connected to the cluster manager, Spark acquires executors on
worker nodes in the cluster, which are processes that run computations and store data for the
application.
When a Spark program is run, data is partitioned into different blocks across the cluster. The
driver program sends the code to be executed to each worker node in the cluster.
The executors on each node/block will perform the work using tasks. After the first action
completes the result is sent back to the driver. Data local to each node may be cached to
speed up re-use of intermediate results. The following figure
(from https://spark.apache.org/docs/latest/cluster-overview.html) outlines the cluster
architecture:
Apache Spark achieves high speed by using in-memory computations and by distributing the
computations across a cluster of nodes. Spark's primary abstraction is that of the so-
called Resilient Distributed Datasets – or RDDs for short. An RDD is a distributed collection
of elements stored across the nodes of the Spark cluster which provides the basis for
parallelized processing of data across the cluster. It offers two forms of
operations: transformations (e.g. map, filter) and actions. Transformations are lazily
evaluated and are operations that take one or more RDDs as input and yield an RDD as
result. No operations are performed immediately on the RDD by transformations
until an action is performed; rather, Spark builds up a Directed Acyclic Graph (DAG) that can
be optimized before actual execution. This mode of operation resembles the inner workings of
lazy functional programming languages such as Haskell. Actions are the operations that
trigger execution of a sequence of transformations on an RDD and return a value. This
mode contributes to the fault tolerance of a Spark cluster: should a node disappear, or
intermediate results be lost, Spark has the option to create a new node, load the data and
perform the operations of the DAG again. To improve performance when handling large
datasets and complex transformations it is furthermore possible to cache intermediate results
– either in memory, on disk or using a combination of both.
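The lazy evaluation model can be illustrated with a tiny PySpark sketch (a minimal example, assuming a SparkContext named ‘sc’ such as the one provided by a Spark notebook runtime):

# Transformations are lazy – these two lines only extend the DAG; nothing executes yet.
numbers = sc.parallelize(range(1, 1000001))
evens_squared = numbers.filter(lambda n: n % 2 == 0).map(lambda n: n * n)

# The action triggers execution of the whole chain across the cluster.
print(evens_squared.count())  # 500000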
The Spark Core is the foundation of the overall Spark system. It provides distributed task
dispatching, scheduling, and basic I/O functionalities, exposed through an application
programming interface centered on the RDD abstraction. On top of the core, Spark provides four
libraries: Spark SQL for defining transformations using SQL-like queries, Spark Streaming for
handling real-time processing of data in motion, MLlib for machine learning and GraphX for
graph processing. RDDs, and the operations on RDDs in the form of transformations and actions,
remain however the underlying kernel abstraction of Spark. Spark provides APIs for Scala,
Python, Java and R. Its native language is Scala, but in this recipe we will use Python
and Spark SQL, which provides a much higher level of abstraction than RDDs. When running
SQL from within another programming language such as Python the results will be returned
as a Dataset/DataFrame. Datasets provide the benefits of RDDs (strong typing, the ability to use
powerful lambda functions) combined with the benefits of Spark SQL's optimized execution engine. A
DataFrame is a Dataset organized into named columns and is conceptually equivalent to a
table in a relational database; unlike Datasets, DataFrames are supported by Python.
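As a small illustration (a sketch, assuming a SparkSession named ‘spark’ and a temporary view ‘weatherdata’ registered over the observation data), a SQL statement issued from Python comes back as a DataFrame with named columns:

# SQL issued from Python returns a DataFrame whose columns can be inspected directly.
temps = spark.sql("SELECT obs_name, temp FROM weatherdata WHERE class = 'observation'")
temps.printSchema()  # named, typed columns – conceptually a relational table
temps.show(5)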
 
IBM Cloud Database Services
IBM Cloud™ Databases for PostgreSQL offers an object-relational SQL database,
complemented by powerful enhancements like indexable JSON, publish/subscribe functions
and drivers. The service is fully managed, with IBM Cloud taking care of high availability,
backups, logging, monitoring, scaling, hardware setup and software patching. IBM provides
ready-to-use integrations with IBM Identity and Access Management to extend access control
and auditing. IBM Cloud Databases for PostgreSQL deployments are built as highly available
clustered configurations leveraging multiple availability zones in a given region and have
been designed with scalability and economics in mind: Databases for PostgreSQL is billed
hourly and enables you to separately scale disk and RAM for each of your deployments.
High availability of IBM Cloud Databases is provided by running the databases in several
availability zones, with one of the zones taking the role of leader and the other
the role of follower. Resiliency is achieved by backing up data to a cross-regional instance of
IBM Cloud Object Storage.
Getting Started
We will assume that you have already gained access to IBM Cloud and possibly also IBM
Watson Studio (see the “Prerequisites” section at the beginning of the recipe for the links
needed for registering).
In this section of the recipe you will get started by doing the following:
Create an instance of IBM Databases for PostgreSQL.
Create an instance of the IBM Analytics Engine service (running Apache Spark).
Create an instance of IBM Watson Studio.
Create an IBM Watson Studio project.
Upload a dataset for lightning data from GitHub.
 
Provision IBM Cloud PostgreSQL Database Service
You will need to create an instance of the PostgreSQL database and configure credentials for
the service:
Select the IBM Cloud Catalog.
Enter ‘post’ as search string.
Select the ‘Databases for PostgreSQL’ service.
On the ‘Databases for PostgreSQL’ page:
Provide a name for the service (optionally using ‘Weather Data Demo’ as prefix).
As resource group choose the one that you created in the previous recipe (e.g.
‘weather-data-demo’).
Add a tag (e.g. ‘weather-data-demo’) to the service.
Scroll down and select for pricing the Standard plan.
Click Create to create the service.
Wait until the service has been created.
Select Service Credentials in the toolbar to the left.
Select the New Credential button.
Enter a name for the credentials in the Add new credential dialog.
Click Add to create the credentials.
Select the View credentials action.
Copy and paste the JSON object that defines the new credential to a local file for later
use.
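Later, in section 7, this credential will be needed to build a connection string for the database. A minimal sketch of how the saved file might be parsed in Python is shown below; the field names are an assumption, so inspect your own credentials JSON:

import json

# A sketch – the field names are an assumption; inspect your own credentials JSON.
with open('postgres-credentials.json') as f:
    creds = json.load(f)

# The credentials typically expose a ready-made PostgreSQL connection URI.
conn_str = creds['connection']['postgres']['composed'][0]
print(conn_str)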
 
Provision IBM Cloud Analytics Service
You will need to create an instance of the IBM Cloud Analytics service:
Select the IBM Cloud Catalog.
Enter ‘spark’ as search string.
Select the ‘Analytics Engine’ service.
On the ‘Analytics Engine’ page:
Provide a name for the service (optionally using ‘Weather Data Demo’ as prefix).
As resource group choose the one that you created in the previous recipe (e.g.
‘weather-data-demo’).
Add a tag (e.g. ‘weather-data-demo’) to the service.
Scroll down and select for pricing the Lite plan.
Click Configure to configure the service.
On the Configure service instance page:
Keep the default for the Hardware configuration.
Keep the default of 1 for Number of compute nodes.
Keep the default of ‘Spark and Hive’ for the Software package.
Click Create.
Close the dialog that informs you that the service creation may take some time.
 
You do not need to wait for the service to be fully provisioned but can continue with the next
steps right away. The service will not be needed until you reach section 7 of this recipe, which
uses Apache Spark.
 
Create IBM Watson Studio Service
To create the Watson Studio service, do the following:
Select the IBM Cloud Catalog.
Enter ‘studio’ as search string.
Select the ‘Watson Studio’ service.
On the Watson Studio page:
Provide a name for the service (optionally using ‘Weather Data Demo’ as prefix).
As resource group choose the one that you created in the previous recipe (e.g.
‘weather-data-demo’).
Add a tag (e.g. ‘weather-data-demo’) to the service.
Scroll down and select for pricing the Lite plan.
Click Create to create the service.
Wait until the service has been created.
On the dashboard of the service, select Get Started.
This will take you to the Home Page showing you commands for creating a project and
searching a Catalog.
 
If you already have an account for IBM Watson Studio you can of course decide to use that
service. You will then need to log into the service and optionally change the region to the
region used for the other services of this recipe (or preferably a region that is close to the
location of those services).
 
Create IBM Watson Studio Project
Next you will need to create an IBM Watson Studio project. Go to the dashboard of IBM
Watson Studio and do the following:
Click Create a project.
On the next page titled ‘Create a Project’, select the Standard Project template.
Click Create Project.
In the New Project dialog,
Enter a name to the project such as “Weather Data Demo”.
Provide a short description of the project.
Select the Object Storage Instance where you want project artefacts to be stored.
Click Create.
Wait until the project has been created.
 
Watson Studio will now provide you with an overview of the project as well as tabs for viewing
assets, environments, bookmarks, collaborators and settings that allow you to add services,
access tokens and integrations to e.g. GitHub.
 
Upload Lightning Data Set to IBM Watson Studio Project
To upload the lightning data set do the following:
Download the ‘Lightning_global_1min_201812041030.json‘ data set from the GitHub
project weather-data-demo.
Select the Find and add data command in the main toolbar of Watson Studio (located in
the upper right corner).
Drag the file named ‘Lightning_global_1min_201812041030.json‘ from your local file
system to the pane with the title ‘drop files here’.
Select the command Add to project.
 
If you go back to the IBM Cloud dashboard and select Resource List from the Navigation
Menu in the top left corner, you should now be able to view the resources tagged with
‘weather-data-demo’. The list should (eventually) show you the following services:
The Analytics Engine will appear in the list as well, as soon as it has been completely
provisioned.
Transformations using IBM SQL Query
In the previous recipe we used IBM SQL Query in a very simple example to query all
the weather observations stored in a large number of JSON files. In this recipe we shall
continue with a more refined and realistic example by aggregating the weather data based on
observation location, date and time of day (night/day) and then computing statistical features such
as min, max and average temperatures.
The SQL Query that we will use is defined as follows:
WITH weatherdata AS (
    SELECT class, day_ind, CAST(from_unixtime(valid_time_gmt) AS DATE) AS obs_date,
           obs_name, pressure, temp, wdir, wspd
    FROM cos://eu-de/weather-data-demo/* STORED AS JSON o
    WHERE o.class = "observation"
)
SELECT
   weatherdata.obs_name,
   weatherdata.obs_date,
   weatherdata.day_ind,
   ROUND(MIN(weatherdata.temp),1) AS mintemp,
   ROUND(AVG(weatherdata.temp),1) AS avgtemp,
   ROUND(MAX(weatherdata.temp),1) AS maxtemp
FROM weatherdata
GROUP BY weatherdata.obs_name, weatherdata.obs_date, weatherdata.day_ind
ORDER BY weatherdata.obs_name, weatherdata.obs_date, weatherdata.day_ind
INTO cos://eu-geo/weatherdatademo-donotdelete-pr-2qk5tcfdz1qs9t/result/ STORED AS CSV
 The query will, when executed, return results as shown in the screenshot below:
The SELECT statement in the WITH clause is similar to the query used in the previous recipe
and will return a consolidated table of weather observations from the bucket named
‘weather-data-demo’. The next SELECT statement groups the information according to observation
location, date and time of day and then computes the average, maximum and minimum
temperatures for each group. The last part of the query outputs the results to the bucket that
contains the IBM Watson Studio project artefacts. This way we can easily import the result
into the project once it has been generated. For more details regarding the supported
SQL syntax please see the SQL Reference available in the online documentation for the IBM
SQL Query service.
There is a crucial point worth mentioning which has to do with the URL for the IBM
Cloud Object Storage endpoint. The source bucket that contains the weather data
(‘weather-data-demo’) and the target bucket that holds the project artefacts
(‘weatherdatademo-donotdelete…’) have different endpoint locations, although both buckets share the same IBM
Cloud Object Storage instance. The FROM clause of the weather data demo bucket uses the
endpoint location ‘eu-de’ whereas the INTO clause for the bucket holding the project artefacts
uses the location ‘eu-geo’:
The reason for this is simple. The weather data demo bucket was created as a regional
bucket located in the IBM Cloud availability zones in Frankfurt whereas the bucket containing
the project artefacts has been defined as a cross region bucket replicated across the IBM
Cloud data centres in Europe. IBM Watson Studio uses cross region buckets to store project
artefacts to achieve a higher level of availability. Usually this is transparent to the user, but the
difference becomes important when defining queries or data source connections inside IBM
Watson Studio that accesses the bucket: if you use the wrong location, e.g. ‘eu-geo’ for the
weather-data-demo bucket rather than ‘eu-de’, the query or the connection will fail and you
will get an error.
To run the query, do the following:
Open the dashboard for your IBM Cloud Object Storage service.
Select Buckets in the toolbar to the left and note (copy) name and location of the two
relevant buckets.
Open the dashboard of your IBM SQL Query service.
Copy the query from the recipe into the editor of the service.
Modify the FROM clause and the INTO clause so that the names and locations match
your buckets.
For the first test run, prefix the INTO clause with ‘--’ to comment it out. This way you avoid
cluttering up your project workspace with erroneous results. They will be written to the
default location instead (which is us-south).
Select the Run command to run the query.
Check that the query has returned meaningful results.
If this is the case, delete the comment marker in the INTO clause and rerun the query to
send the results to your project bucket.
 
Once the query has run successfully, it is time to import the results into the project.
To achieve this, do the following:
Open the dashboard for IBM Watson Studio.
Open the weather data demo project that you created.
Select Assets.
Select the Find and add data button in the top right corner (labelled with 0 and 1).
Select the Files tab.
You should now see a number of generated files that have been produced by the SQL
Query jobs.
Select the last file.
Select the menu “…” as shown on the screenshot.
Select the Add data asset menu item in the resulting pull-down menu.
Select Apply in the resulting dialog named ‘Apply changes’.
 
This will add the file. In a last step you can then change the name of the file to something
more meaningful than the cryptic generated file name:
Select the name of the generated CSV file in the list of data assets. This will open up the
file and render its content.
In the upper left part of the screen click the pencil to the right of the generated file name.
Select the part of the name before ‘.csv’ and delete the text.
Change the name of the data asset to ‘weather-observation-temperatures.csv’
Select the Apply command.
In the upper left part of the screen select the link named Weather Data Demo to take you
back to your list of project assets.
 
You have now transformed and aggregated the weather data to finally add the generated
result file as a data asset to your project. This file can conveniently be used e.g. as input to a
dashboard, which we will look into in the next recipe.
 
As a last step of this section we will also need to generate a CSV file containing the detailed
weather observations. Basically, this means re-running the query that was used in the
previous recipe, but with the addition that the data is this time to be stored in the bucket that
contains the project artefacts. This file will then be used in the next recipe for generating
statistical diagrams using matplotlib. This time you will have to use the query from Part 2,
which is defined as follows:
SELECT
   class,
   clds,
   day_ind,
   expire_time_gmt,
   valid_time_gmt,
   CAST(from_unixtime(valid_time_gmt) AS DATE) AS date,
   obs_name,
   pressure,
   temp,
   uv_desc,
   wdir,
   wdir_cardinal,
   wspd,
   wx_phrase
FROM cos://eu-de/weather-data-demo/* STORED AS JSON o WHERE o.class = "observation"
ORDER BY valid_time_gmt DESC
INTO cos://eu-geo/weatherdatademo-donotdelete-pr-2qk5tcfdz1qs9t/result/weather-observations.csv STORED AS CSV
 
You will need to go through the following steps – very similar to what you have already been doing
so far to generate ‘weather-observation-temperatures.csv’:
Copy, paste and run the above query using IBM SQL Query but for the URL in the FROM
and INTO clauses use your URLs for the buckets containing the weather data and the
bucket containing your project artefacts – respectively.
Add the generated CSV file as a data asset to your project in a same way as you added
‘weather-observation-temperatures.csv’.
Rename the added file to ‘weather-observations.csv’.
Open the data asset and view the results.
 
Transformations using Jupyter Notebooks and IBM SQL
Query API
In this section of the recipe we shall demonstrate how to run the SQL Query using API calls.
For this purpose we will reuse and configure an existing notebook named ‘IBM Cloud SQL
Query’ that is available in the IBM Watson Studio Community. IBM Watson Studio comes
with a fairly comprehensive set of assets that are shared with the community of Watson
Studio users. These assets include articles, data sets, notebooks, papers and tutorials.
To import the required notebook into your project, do the following:
Select the Navigation Menu of IBM Watson Studio in the upper left corner of the
dashboard.
Select the menu item named Community from the pull-down menu. This will take you to
the Community page:
Enter the search string ‘SQL Query’. This will provide you with icons for a number of
various artefacts.
Select All Filters to filter the result set further.
Select Notebook to just show the Jupyter notebooks.
Select the notebook named ‘IBM Cloud SQL Query’ (see screenshot).
In the notebook, click the Copy command in the upper right part of the toolbar to copy the
notebook into your project.
In the ‘New Notebook’ dialog,
Select your weather data demo project as target project.
Select the Free runtime system for Python.
Click Create Notebook.
 
You have now imported the notebook into your project. However, the notebook will need to be
modified as follows in order to get it to work with weather data:
Provide the credentials for the COS bucket containing the weather data – including API
Key and Cloud Resource Name.
Insert project token for the COS project (target) bucket, which is needed to store the
resulting artefact of the query in the project using API calls.
Provide a definition for the SQL query to be submitted.
Delete superfluous cells in the notebook.
 
The notebook starts in section 1 by installing and importing libraries needed to configure and
run the SQL query. Make sure that the libraries get installed by doing the following:
Select the first code cell in section 1 that installs the ‘pyarrow’ and ‘sqlparse’ libraries.
Click the Run Cell command in the toolbar and wait for the import to complete (it is
complete when the cell marker changes from * to a number).
Run the two remaining cells in the notebook.
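For orientation, the first cell typically looks along the lines of the sketch below (the exact package list may differ in your copy of the notebook; ibmcloudsql is the client library the notebook uses to call the SQL Query API):

# Install the required libraries (run once per runtime), then import the SQL Query client.
!pip install --quiet pyarrow sqlparse ibmcloudsql
import ibmcloudsql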
 
Section 2 of the notebook, titled ‘Configure SQL Query’, sets the credentials and other access
information needed to obtain access to IBM Cloud Object Storage and IBM SQL
Query. Follow the instructions in the notebook with a couple of modifications:
For the API Key:
Invoke the command Manage > Access (IAM).
Select IBM Cloud API Keys in the toolbar to the left.
Invoke the Create an IBM Cloud API Key command.
In the ‘Create API Key’ dialog enter a name for the key, e.g. ‘Weather Data Demo
Key’ and click Create.
In the ‘API Key successfully created’ dialog, copy and download the key. Then close
the dialog window.
Paste the key into the notebook.
For the IBM SQL Query Cloud Resource Name (CRN):
Go to the IBM Cloud dashboard and select the resource lists.
Filter the resources using the tag ‘weather-data-demo’.
Select your instance of IBM SQL Query and copy the CRN from the dashboard of the
service.
Paste the CRN into the notebook so that it is assigned to the variable ‘instancecrn’.
For the SQL COS Endpoint:
Open the dashboard for your IBM Cloud Object Storage service and select
the Buckets tab.
For the bucket that contains your weather data, invoke the Bucket Actions > SQL
URL command in the pulldown menu to the right of the bucket.
Copy the value and assign it to a variable named ‘sql_cos_endpoint’ in the notebook
as shown above in the screenshot.
Run the cell.
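Taken together, the configuration cell ends up looking roughly like the sketch below (all values are placeholders to be replaced with your own):

# Placeholders only – substitute the values you collected in the steps above.
apikey = '<your IBM Cloud API key>'
instancecrn = 'crn:v1:bluemix:public:sql-query:...'      # CRN copied from the SQL Query dashboard
sql_cos_endpoint = 'cos://<endpoint>/<weather-bucket>'   # SQL URL copied from the bucket actions menu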
 
The ‘Configure SQL Query’ section continues with cells that set the project id and project
token and on this basis compute the name of the bucket holding the project artefacts as well
as the target URL for the target project. This information is needed in order to get access to
the project bucket when the results of the query are saved:
You will need to obtain the project token and insert it into the notebook:
Open a new window for the project, e.g. by pressing CTRL-Left Click on the name
(Weather Data Demo) of the project in the upper left part of the dashboard.
Select the Settings tab in for the project in the project dashboard.
Scroll down to the section named Access Tokens.
Invoke the command New token.
In the dialog named ‘New token’:
Enter a name for the token, e.g. ‘WeatherDataDemoWriter’.
Select Editor as role.
Click Create.
The project token will now be listed in the section titled Access Token.
You can view the token by invoking the Action pull-down menu to the left of the access
token and then select the menu item named View.
Go back to your notebook.
Select the cell that contains the heading ‘inserting the project token’.
Invoke More > Insert project token, which will create a hidden cell akin to the selected
cell in the screenshot above.
Check that the inserted project access token matches the one in step (7). If not, copy the
one from step (7) and replace the access token in the notebook with it.
Run the three cells in the screenshot above in the order they are listed. When asked about
input (e.g. for the ‘apikey’) just leave the text field blank and press Return.
 
You have now created a project token and the necessary statements in the notebook that will
provide the SQL Query API with sufficient credentials to create assets in the project for
storing the results of the query.
The notebook continues in section 3 with Python statements that define and format the
SQL query to be submitted:
Section 3 builds the query string whereas section 4 contains the final statements for
executing the SQL query and returning the results in the form of a pandas data frame. In this
recipe we shall first use the following simple query definition, just to get confirmation that
everything works:
sql = 'SELECT class, day_ind, valid_time_gmt, obs_name, pressure, temp, ' + \
      'uv_desc, wdir, wdir_cardinal, wspd, wx_phrase ' + \
      ' FROM ' + sql_cos_endpoint + \
      '/* STORED AS JSON o WHERE o.class = "observation"'
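For reference, the cells in section 4 then submit this string through the SQL Query client along the lines of the following sketch (assuming the ibmcloudsql client; the exact cell contents in the community notebook may differ slightly):

# Create the client from the values configured in section 2 and submit the query.
sqlClient = ibmcloudsql.SQLQuery(apikey, instancecrn, targeturl)
result_df = sqlClient.run_sql(sql)  # waits for the job and returns a pandas data frame
result_df.head(10)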
Do the following to change the query and then run it:
Replace the SQL query in section 3 of the notebook with the one above.
Run the cell.
Press Return if asked to enter the query.
Select the first cell in section 4.
Invoke Run and wait for the execution to complete (this may take some time).
Invoke Run in the next cell to see the first 10 records.
 
The results returned should look something akin to:
We will only use the first part of the notebook, so you can delete all cells following the first two
cells in section 4, from there down to section 8 of the notebook. The easiest way to do this is
to use keyboard shortcuts. You can get information on the available shortcuts from the Help
menu by invoking Help > Keyboard Shortcuts. To delete the superfluous content:
Set the editor from insert mode (where cells are marked green in the left margin) into
command mode (where cells are marked blue in the left margin) by pressing Esc.
Scroll down and select cell 12 (this is the cell after the cell with the orange text ‘Orders by
Employee’).
Then cut it by pressing X.
Continue until all superfluous cells are deleted and you have reached the section titled
‘Next Steps’.
Invoke File > Save As to save the file.
Provide a new name for the notebook, e.g. ‘Weather Data ETL – IBM Cloud SQL Query’.
 
You can now change the query and try the one defined in section 4, which aggregates
weather observation temperatures by location, day and time of day. Replace the code cell in
section 3 with the following to change the definition of the query:
sql = "WITH wd AS (SELECT day_ind, CAST(from_unixtime(valid_time_gmt) AS
DATE) as obs_date, obs_name, temp n" + 
"FROM " + sql_cos_endpoint + "/* STORED AS JSON o WHERE o.class =
'observation') n " + 
"SELECT wd.obs_name, wd.obs_date, wd.day_ind, ROUND(MIN(wd.temp),1) AS
mintemp, ROUND(AVG(wd.temp),1) AS avgtemp, ROUND(MAX(wd.temp),1) AS
maxtemp n " + 
"FROM wd n " + 
"GROUP BY wd.obs_name, wd.obs_date, wd.day_ind n " + 
"ORDER BY wd.obs_name, wd.obs_date, wd.day_ind n " + 
"INTO {}result/ STORED AS CSV".format(targeturl)
print('nYour SQL statement is:n')
print(sql)
 
Then run all cells in sections 3 and 4 to get the aggregated figures:
Rename the notebook to ‘Weather Data ETL – IBM Cloud SQL Query V2’ and save it.
A minimal version of the notebook titled ‘ETL Weather Data Using IBM Cloud SQL Query 2‘
that does not format the query or ask the user for inputs has been uploaded to GitHub. To use
this notebook, you will need to provide your project credentials and bucket name in cell 3 and
the project token in cell 4.
Transformations using Jupyter Notebooks and Pandas
In this section you will work with a Jupyter notebook that extracts lightning data from the
Weather Company, transforms it from a multi-line JSON representation to a tabular
representation as a pandas data frame, and finally loads it into a CSV file in a format that can
be used for analytics. Rather than fetching the data from the Weather Company API directly
in real time (which would require a special API Key), you will work with the file that was uploaded
to IBM Watson Studio in section 2.
The data in the input file (named ‘Lightning_global_1min_201812041030.json’) comes in a
multi-line JSON representation. If you select the file in IBM Watson Studio it will be opened,
and you can investigate the structure in more detail:
The measures such as latitude, longitude and intensity are represented as single properties
with an associated list that contains the values for all observations. Each single observation is
delimited by a comma and a newline. In principle it would be possible to turn such a structure
into a collection of observations in table format, with one row for each observation and one
column for each property, using an SQL query of the form:
SELECT explode(lightning.intensity) intensity
FROM cos://eu-geo/weatherdatademo-donotdelete-pr-2qk5tcfdz1qs9t/Lightning_global_1min_201812041030.json
STORED AS JSON
A more comprehensive example of a query using the explode function, which has been used
as a basis for defining the query above, can be found in the sample queries of IBM SQL Query
(see ‘Samples > CSV and JSON Parsing Queries > Advanced JSON flattening’).
However, there is one caveat here: IBM SQL Query does not, for the time being, accept
multi-line JSON structures of the form that the file is in. There are several solutions to the
problem. One could for example use the command line utility jq with the -c option (e.g. ‘jq -c ".[]"’),
or a conversion using functions of the JSON package as described in ‘Converting
JSON into newline delimited JSON in Python’. A third option would be to use pandas, which in
general comes in very handy for transforming data sets that are not very large – which is
actually the case for the 1.3 MB file that we are dealing with just now.
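The essence of the pandas approach is sketched below (a minimal sketch: the file path is assumed, and the property names follow the measures mentioned above):

import json
import pandas as pd

# Load the multi-line JSON document (path assumed for illustration).
with open('Lightning_global_1min_201812041030.json') as f:
    doc = json.load(f)

# Each property holds one list with a value per observation; pandas turns
# these parallel lists into the rows of a tabular data frame.
df = pd.DataFrame({
    'latitude': doc['latitude'],
    'longitude': doc['longitude'],
    'intensity': doc['intensity'],
})
print(df.head())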
We shall therefore convert the file using pandas and a Jupyter notebook for Python that you
can download from GitHub. You will need to go through the following steps:
Add the notebook from GitHub to the project.
Insert pandas commands for reading the data set into your notebook.
Run the cells in the notebook in the given order and observe the effects.
 
To create and download the notebook from GitHub do the following in IBM Watson Studio:
Select the command Add to project from the toolbar.
Select Notebook in the ‘Choose asset type’ dialog.
Select From URL in the ‘New Notebook’ dialog.
Copy the URL ‘https://github.com/EinarKarlsen/weather-data-
demo/blob/master/Weather%20Lightning%20Data%20ETL%20-%20Pandas.ipynb‘ to the
URL Field. This should set the name of the notebook as well (‘Weather Lightning Data
ETL – Pandas’).
In the Select runtime field, select the runtime ‘Default Python 3.5 Free’ which provides a
runtime free of cost.
Click Create Notebook.
 
Once the notebook has been imported you will initially need to insert commands for reading
the lightning data set in your project into the first cell of the notebook, which is empty.
Fortunately, this is supported by commands of IBM Watson Studio:
Select the first cell in the notebook.
Select the Find and add data command in the toolbar.
Select Insert to code > Insert pandas DataFrame for the data set containing the
lightning data. This will insert all the code needed to read the data.
Uncomment the last two lines in the first cell that reads the data frame and assigns it to a
variable.
In the next cell, assign the variables ‘cos’, ‘wsbucket’ and ‘df1’ to the relevant
values used in cell 1.
Invoke File > Save to save the changes.
Run the first 4 cells to extract the data and print out some information about the
structure of the data set.
 
Having imported the data, the notebook outlines the structure of the dataset:
This data will need to be transformed into a format where the attributes of interest are
represented as columns and where each observation is represented as a row. This is simply
done in pandas by constructing a new data frame:
Run cells 5 and 6 to transform the data and print out the first 5 observations in the new
data frame.
Having transformed the data, we can save it as a CSV file in IBM Cloud Object Storage for
later use. The data frame will first be written to CSV using the ‘to_csv’ command. In the next
step, this file is then uploaded to the project, as sketched below:
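In outline, the two steps look as follows (a sketch with assumed names: ‘df2’ stands for the transformed data frame, while ‘cos’ and ‘wsbucket’ are created by the code inserted in cell 1):

# Write the transformed frame to a local CSV file, then upload it to the project bucket.
df2.to_csv('lightning-data.csv', index=False)
cos.upload_file(Filename='lightning-data.csv', Bucket=wsbucket, Key='lightning-data.csv')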
To generate the CSV file and import it into the project do the following:
Run cells 7-10 to create and upload the CSV file.
Select the Assets tab for the project in IBM Watson Studio.
Select the Find and add data button in the top right corner (labelled with 0 and 1).
Select the Files tab.
Enter ‘lightning’ in the Find in storage text field.
Select the file named ‘lightning-data.csv’.
Select the menu “…” as shown on the screenshot.
Select the Add data asset menu item in the resulting pull-down menu.
Select Apply in the resulting dialog named ‘Apply changes’.
For additional information on how to use notebooks to work with COS see for example the
notebook ‘Working With IBM Cloud Object Storage In Python‘ and the online documentation
‘Using Python functions to work with Cloud Object Storage‘. The uploading and downloading
of files is touched upon in the blog ‘How to Upload/Download Files to/from Notebook in my
Local machine‘ by Charles Gomes.
Transformations using Jupyter Notebooks and Apache Spark
In this section of the recipe we shall again deal with weather observation data rather than
lightning data. Suppose that we would like to aggregate the data just like in section 4 of the
recipe, but rather than storing the data in a CSV file (or Parquet file) we would like to store it
in a PostgreSQL database (or some other kind of database not currently supported as a
target database for IBM SQL Query). A simple solution would be to do it in two steps: let IBM
SQL Query aggregate and store the data in an intermediate CSV file and then, in a second
step, let another program in the ETL pipeline read the CSV file and store the information in
PostgreSQL. The second step can be achieved with a few (around 10) lines of code using
pandas, as sketched below.
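A sketch of that two-step alternative is shown below; the connection string and file name are placeholders:

import pandas as pd
from sqlalchemy import create_engine

# Placeholder: substitute the composed connection URI from your service credentials.
engine = create_engine('postgresql://user:password@host:port/ibmclouddb')

# Read the CSV produced by IBM SQL Query and write it to a PostgreSQL table.
df = pd.read_csv('weather-observation-temperatures.csv')
df.to_sql('weather_observation_temperatures', engine, if_exists='replace', index=False)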
However, in situations where the transformations to be performed are very complex, calling
e.g. for a solution expressed in a procedural or functional language rather than SQL, other
technologies like Apache Spark come in very handy – using either Apache Spark RDDs or
Data Frames. We shall therefore briefly show how to use Apache Spark to implement the ETL
process, but for simplicity we will in this example keep the transformation so simple
that we can reuse the SQL query of section 4 and just change the target from a CSV file to a
PostgreSQL database.
In this section of the recipe you will need to create a new Jupyter notebook using a Spark for Python runtime. The code for extracting the data using Apache Spark can be generated by IBM Watson Studio, provided that there is a data connection and data asset available for the data in the project. The connection to IBM Cloud Object Storage and a data asset in the weather data bucket must therefore be established first. Once this is done it is quite easy to create a Spark SQL data frame, submit the query and get hold of the resulting aggregated data. The loading of the data into the PostgreSQL database will finally be done using the SQLAlchemy toolkit.
You will consequently need to go through the following steps:
Create a connection to the COS weather data bucket.
Add a (random) data set in that bucket to your project as a data asset.
Download the notebook for aggregating weather observation data from GitHub.
Add Apache Spark Python code to the notebook to read the weather observations
in your weather data bucket.
Run the commands for loading and transforming the weather data.
Add the URL endpoint (including credentials) for your instance of the PostgreSQL database.
Run the commands that save the aggregated data frame to a database table in
PostgreSQL.
 
To create the data connection and connected data set do the following:
Select the command Add to project from the toolbar.
Select Connection in the ‘Choose asset type’ dialog.
Select Cloud Object Storage in the New Connection dialog.
In the New Connection dialog for IBM Cloud Object Storage:
Enter a name for the connection.
For the Login URL, paste in the endpoint for your weather data bucket (you can find
this information in the endpoint section of the dashboard for your IBM Cloud Object
Storage service; be sure to select the correct resiliency and location settings).
For the API Key, paste in the API key that you created for the bucket in the previous
recipe (you can find this information in the credentials section of your IBM Cloud
Object Storage service; be sure to use the credentials that you created in that
recipe – in our case we named them ‘WeatherDataWriter’).
Click Create to create the new connection.
In IBM Watson Studio you should now be able to see the new connection in the list of
assets.
 
Continue by adding a connected data set to the project:
Select the command Add to project from the toolbar.
Select Connected data in the ‘Choose asset type’ dialog.
Click Select Source to select the source.
Expand the selection until you have selected one of the weather observation files.
Click Select to select the object file.
In the Add Asset from Connection dialog, enter a name for the asset (e.g.
‘COSWeatherDataObservationFile’).
Invoke the Select command.
In IBM Watson Studio you should now be able to see the new connected data in the list of
assets.
 
To create and download the notebook from GitHub do the following:
Select the command Add to project from the toolbar.
Select Notebook in the ‘Choose asset type’ dialog.
Select From URL in the New Notebook dialog.
Copy the URL ‘https://github.com/EinarKarlsen/weather-data-demo/blob/master/Weather%20Temperature%20ETL%20-%20Spark%20SQL.ipynb‘ into the URL field.
This should set the name of the notebook as well (‘Weather Temperature ETL – Spark
SQL’). If not, enter a name for the notebook.
In the Select runtime field, select the runtime ‘Python 3.5 with Spark’.
Click Create Notebook.
 
Once the notebook has been imported you will initially need to insert commands for reading the weather observation data set into the first cell of the notebook, which is empty.
Fortunately, this is supported by commands of IBM Watson Studio:
Select the first cell in the notebook.
Select the Find and add data command in the toolbar.
Select Insert to code > Insert SparkSession Setup for the data set containing the
weather observations. This will insert all the code needed to read the single file (a
hedged sketch of what this generated code typically looks like follows this list).
Change the COS URL in the last line of cell 1 so that the URL designates all files in the
weather data bucket. This basically means that you will need to replace the designator for
a single object with ‘/*’ (see screenshot above).
In the next cell, assign the variables ‘cos’, ‘wsbucket’ and ‘df1’ to the relevant
values used in cell 1.
Run the first 3 cells to import the data and print out the schema.
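The generated setup code typically resembles the following sketch; the credentials, configuration name and bucket name are placeholders for the values that Watson Studio inserts:

import ibmos2spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder credentials for the COS connection; Watson Studio fills
# in the actual values when the setup code is inserted.
credentials = {
    'endpoint': 'https://s3.eu-de.cloud-object-storage.appdomain.cloud',
    'service_id': '<service-instance-id>',
    'iam_service_endpoint': 'https://iam.cloud.ibm.com/identity/token',
    'api_key': '<api-key>'
}
configuration_name = 'os_weatherdata_configs'
cos = ibmos2spark.CloudObjectStorage(spark.sparkContext, credentials,
                                     configuration_name, 'bluemix_cos')

# Read all weather observation files in the bucket: the designator for
# a single object has been replaced with '*' as described above.
df1 = spark.read.json(cos.url('*', 'weather-data-demo'))
df1.printSchema()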
 
The schema shows the various properties found in the files that were read. Notice that there may be corrupted records as well, e.g. files that cannot be interpreted as JSON files. Next, the data needs to be aggregated and grouped by location, date and time of day to compute minimum, average and maximum temperatures:
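A sketch of the aggregation using Spark SQL is shown below; it mirrors the IBM SQL Query statement of section 4 and assumes the data frame ‘df1’ and the ‘spark’ session from the setup cell. The exact date expression in the notebook may differ slightly:

# Register the raw observations as a temporary view for Spark SQL.
df1.createOrReplaceTempView('weatherdata')

# Group by location, date and day/night indicator; the observation
# date is derived from the Unix timestamp with from_unixtime/to_date.
df2 = spark.sql("""
    SELECT obs_name,
           TO_DATE(FROM_UNIXTIME(valid_time_gmt)) AS obs_date,
           day_ind,
           ROUND(MIN(temp), 1) AS mintemp,
           ROUND(AVG(temp), 1) AS avgtemp,
           ROUND(MAX(temp), 1) AS maxtemp
    FROM weatherdata
    WHERE `class` = 'observation'
    GROUP BY obs_name, TO_DATE(FROM_UNIXTIME(valid_time_gmt)), day_ind
    ORDER BY obs_name, obs_date, day_ind
""")
df2.show(5)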
Notice that the query needs to compute the date from a timestamp (as in section 4) but that
the expression using Spark SQL is slightly different from the expression used by IBM SQL
Query. 
Having transformed the data, we can store it in a database rather than in object storage. In this recipe we shall use PostgreSQL as the data warehouse. For more information on how to manage PostgreSQL on IBM Cloud we refer to the online documentation titled ‘About IBM Cloud Databases for PostgreSQL‘.
There are several libraries that provide access to a PostgreSQL instance from Python. What is needed in this context is a library that can conveniently save a pandas data frame to a database table. How to do this is outlined, for example, in the YouTube video titled ‘Pandas Tutorial 14: Read Write Data From Database (read_sql, to_sql)‘ and in the online documentation for Pandas SQL. Both use the SQLAlchemy library, which will also be used in this recipe. There is also an existing notebook on the IBM Watson community titled ‘Access PostgreSQL with Python‘ that shows how to interact with PostgreSQL, but this notebook uses psycopg2.
The notebook contains cells for setting the PostgreSQL credentials and endpoint, followed by a cell that creates the engine. Saving to a database is done with the data frame operation named ‘to_sql’, which takes the database table name and the engine as parameters. Additional parameters then specify that the existing table content (if any) shall be replaced, that the data frame index shall be ignored when saving, and that the data shall be sent to PostgreSQL with a given chunk size. For testing purposes, we then follow up by reading the content of the database table and printing the first 5 records.
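A hedged sketch of these cells is shown below; the endpoint URL, table name and chunk size are placeholders, and ‘df2’ is the aggregated Spark data frame from the previous step:

import pandas as pd
from sqlalchemy import create_engine

# Endpoint with user name and password filled in (placeholder values).
postgreSQLEndPoint = 'postgresql://<user>:<password>@<host>:<port>/ibmclouddb'
engine = create_engine(postgreSQLEndPoint)

# Convert the aggregated Spark data frame to pandas and save it,
# replacing any existing table content, ignoring the data frame
# index and sending the rows with a given chunk size.
df2.toPandas().to_sql('weather_temperatures', engine,
                      if_exists='replace', index=False, chunksize=500)

# For testing purposes, read the table back and print the first 5 records.
pd.read_sql('SELECT * FROM weather_temperatures', engine).head()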
To get the notebook to run on your data do the following:
In IBM Cloud open up the dashboard for your PostgreSQL service.
Scroll down and copy the ‘Public postgres endpoint’ to your notebook and assign it to the
variable ‘postgreSQLEndPoint’. This URL contains placeholders for the user name and
the password for the service.
Remove the ‘?sslmode=verify-full’ parameter at the end of the URL.
Select the Service credentials tab.
Open the credentials that you created in section 3 (or alternatively create new ones).
Copy the variables ‘postgres.authentication.username’ and
‘postgres.authentication.password’ and insert them in the endpoint URL in place of the
placeholders.
Run cells 7-13 to load the data into PostgreSQL.
 
Conclusion
In this recipe we have shown various ways of transforming weather data using IBM SQL
Query, pandas and Apache Spark SQL (running under the hood of IBM Analytics Engine).
IBM SQL Query has the advantage that it is a lightweight, easy-to-use and cost-effective solution. It does not require a runtime cluster to be configured, which may consume resources even when idle, but instead adopts a paradigm more akin to Functions as a Service. Queries that are conveniently expressed in SQL are good use cases for IBM SQL Query, provided that the input and output formats are supported. However, it is old news that some computational problems are very hard to express in SQL, whereas they can easily be formulated in a procedural or functional language. In such cases pandas can conveniently be used as a lightweight and cost-effective alternative to IBM SQL Query, since it only requires a runtime system for Python. The combination of Python and pandas is a proven solution for data wrangling purposes, but with the limitation that it does not address big data use cases. The fallback position – in case of complex transformations, special input and output sources, or big data – is therefore to use Apache Spark, be it Apache Spark RDDs, Data Frames and/or Spark SQL. The disadvantage here is that it requires a cluster of compute nodes to be available, i.e. unlike IBM SQL Query, you will need to provision a cluster first before running the Spark jobs. There is a free plan available for IBM Analytics Engine that can be used for a limited time for test purposes, but in general one must expect – in the context of Cloud offerings – that a cost will be incurred for the compute and storage resources set aside for the cluster, whether it is being used or not.
The lessons learned from this recipe also have an impact on the architecture of the solution. Actually, it would be relevant to go back to IBM Functions and change the definition of the sequences dealing with multi-line JSON data structures from The Weather Company by including actions that transform the ingested data before it is sent to IBM Event Streams for storage in IBM Cloud Object Storage. If this is done carefully, the data would be stored in IBM Cloud Object Storage in the right format, IBM SQL Query could be used directly to work on the data, and there would be no need to transform it using e.g. notebooks and pandas. That would of course mean changing the approach from an ELT approach to one looking more like ETL.
It is also worth mentioning that it is possible to get IBM SQL Query to work with IBM Functions so that queries are executed in the context of actions. For more information please see the ‘SQL Query Package‘ project on GitHub. This implies another relevant change to the architecture: the aggregation jobs currently implemented with IBM SQL Query could be triggered using IBM Functions.
Acknowledgement
The Jupyter notebook named ‘IBM Cloud SQL Query’ represents work by Torsten
Steinbach from the IBM Cloud Data Platform team. I would also like to thank IBM Cloud
support (Witchaya Keosung and Daniel Pittner) for pointing me to the workarounds for
transforming lightning data from a multi-line to a single-line JSON file using e.g. jq.
Más contenido relacionado

Similar a Weather data meets ibm cloud. part 3 transformation and aggregation of weather data - ibm developer recipes

Key projects Data Science and Engineering
Key projects Data Science and EngineeringKey projects Data Science and Engineering
Key projects Data Science and EngineeringVijayananda Mohire
 
Key projects Data Science and Engineering
Key projects Data Science and EngineeringKey projects Data Science and Engineering
Key projects Data Science and EngineeringVijayananda Mohire
 
Rishikesh Sharma Portfolio
Rishikesh Sharma PortfolioRishikesh Sharma Portfolio
Rishikesh Sharma Portfoliosharmarishikesh
 
Building AWS Redshift Data Warehouse with Matillion and Tableau
Building AWS Redshift Data Warehouse with Matillion and TableauBuilding AWS Redshift Data Warehouse with Matillion and Tableau
Building AWS Redshift Data Warehouse with Matillion and TableauLynn Langit
 
Train, predict, serve: How to go into production your machine learning model
Train, predict, serve: How to go into production your machine learning modelTrain, predict, serve: How to go into production your machine learning model
Train, predict, serve: How to go into production your machine learning modelCloudera Japan
 
IoT with OpenPicus Flyport
IoT with OpenPicus FlyportIoT with OpenPicus Flyport
IoT with OpenPicus FlyportIonela
 
Run Oracle Apps in the Cloud with dashDB
Run Oracle Apps in the Cloud with dashDBRun Oracle Apps in the Cloud with dashDB
Run Oracle Apps in the Cloud with dashDBIBM Cloud Data Services
 
Getting Started with Apache Spark on Kubernetes
Getting Started with Apache Spark on KubernetesGetting Started with Apache Spark on Kubernetes
Getting Started with Apache Spark on KubernetesDatabricks
 
Creating a scalable & cost efficient BI infrastructure for a startup in the A...
Creating a scalable & cost efficient BI infrastructure for a startup in the A...Creating a scalable & cost efficient BI infrastructure for a startup in the A...
Creating a scalable & cost efficient BI infrastructure for a startup in the A...vcrisan
 
Prototyping your Next Big Idea - An introduction to Google Cloud Platform
Prototyping your Next Big Idea - An introduction to Google Cloud PlatformPrototyping your Next Big Idea - An introduction to Google Cloud Platform
Prototyping your Next Big Idea - An introduction to Google Cloud PlatformXinYingLim12
 
Cloud-based Data Lake for Analytics and AI
Cloud-based Data Lake for Analytics and AICloud-based Data Lake for Analytics and AI
Cloud-based Data Lake for Analytics and AITorsten Steinbach
 
Google professional data engineer exam dumps
Google professional data engineer exam dumpsGoogle professional data engineer exam dumps
Google professional data engineer exam dumpsTestPrep Training
 
Giga Spaces Data Grid / Data Caching Overview
Giga Spaces Data Grid / Data Caching OverviewGiga Spaces Data Grid / Data Caching Overview
Giga Spaces Data Grid / Data Caching Overviewjimliddle
 
Snowflake for Data Engineering
Snowflake for Data EngineeringSnowflake for Data Engineering
Snowflake for Data EngineeringHarald Erb
 
Scheduling Apps in the Cloud - Glenn Renfro & Roy Clarkson
Scheduling Apps in the Cloud - Glenn Renfro & Roy ClarksonScheduling Apps in the Cloud - Glenn Renfro & Roy Clarkson
Scheduling Apps in the Cloud - Glenn Renfro & Roy ClarksonVMware Tanzu
 
IBM THINK 2019 - Cloud-Native Clickstream Analysis in IBM Cloud
IBM THINK 2019 - Cloud-Native Clickstream Analysis in IBM CloudIBM THINK 2019 - Cloud-Native Clickstream Analysis in IBM Cloud
IBM THINK 2019 - Cloud-Native Clickstream Analysis in IBM CloudTorsten Steinbach
 
Scheduling Apps in the Cloud - Glenn Renfro & Roy Clarkson
Scheduling Apps in the Cloud - Glenn Renfro & Roy ClarksonScheduling Apps in the Cloud - Glenn Renfro & Roy Clarkson
Scheduling Apps in the Cloud - Glenn Renfro & Roy ClarksonVMware Tanzu
 
IMCSummit 2015 - Day 1 Developer Track - Implementing Operational Intelligenc...
IMCSummit 2015 - Day 1 Developer Track - Implementing Operational Intelligenc...IMCSummit 2015 - Day 1 Developer Track - Implementing Operational Intelligenc...
IMCSummit 2015 - Day 1 Developer Track - Implementing Operational Intelligenc...In-Memory Computing Summit
 

Similar a Weather data meets ibm cloud. part 3 transformation and aggregation of weather data - ibm developer recipes (20)

Key projects Data Science and Engineering
Key projects Data Science and EngineeringKey projects Data Science and Engineering
Key projects Data Science and Engineering
 
Key projects Data Science and Engineering
Key projects Data Science and EngineeringKey projects Data Science and Engineering
Key projects Data Science and Engineering
 
Rishikesh Sharma Portfolio
Rishikesh Sharma PortfolioRishikesh Sharma Portfolio
Rishikesh Sharma Portfolio
 
Building AWS Redshift Data Warehouse with Matillion and Tableau
Building AWS Redshift Data Warehouse with Matillion and TableauBuilding AWS Redshift Data Warehouse with Matillion and Tableau
Building AWS Redshift Data Warehouse with Matillion and Tableau
 
Train, predict, serve: How to go into production your machine learning model
Train, predict, serve: How to go into production your machine learning modelTrain, predict, serve: How to go into production your machine learning model
Train, predict, serve: How to go into production your machine learning model
 
IoT with OpenPicus Flyport
IoT with OpenPicus FlyportIoT with OpenPicus Flyport
IoT with OpenPicus Flyport
 
Run Oracle Apps in the Cloud with dashDB
Run Oracle Apps in the Cloud with dashDBRun Oracle Apps in the Cloud with dashDB
Run Oracle Apps in the Cloud with dashDB
 
Getting Started with Apache Spark on Kubernetes
Getting Started with Apache Spark on KubernetesGetting Started with Apache Spark on Kubernetes
Getting Started with Apache Spark on Kubernetes
 
Creating a scalable & cost efficient BI infrastructure for a startup in the A...
Creating a scalable & cost efficient BI infrastructure for a startup in the A...Creating a scalable & cost efficient BI infrastructure for a startup in the A...
Creating a scalable & cost efficient BI infrastructure for a startup in the A...
 
Prototyping your Next Big Idea - An introduction to Google Cloud Platform
Prototyping your Next Big Idea - An introduction to Google Cloud PlatformPrototyping your Next Big Idea - An introduction to Google Cloud Platform
Prototyping your Next Big Idea - An introduction to Google Cloud Platform
 
Cloud-based Data Lake for Analytics and AI
Cloud-based Data Lake for Analytics and AICloud-based Data Lake for Analytics and AI
Cloud-based Data Lake for Analytics and AI
 
Google professional data engineer exam dumps
Google professional data engineer exam dumpsGoogle professional data engineer exam dumps
Google professional data engineer exam dumps
 
Giga Spaces Data Grid / Data Caching Overview
Giga Spaces Data Grid / Data Caching OverviewGiga Spaces Data Grid / Data Caching Overview
Giga Spaces Data Grid / Data Caching Overview
 
Snowflake for Data Engineering
Snowflake for Data EngineeringSnowflake for Data Engineering
Snowflake for Data Engineering
 
Scheduling Apps in the Cloud - Glenn Renfro & Roy Clarkson
Scheduling Apps in the Cloud - Glenn Renfro & Roy ClarksonScheduling Apps in the Cloud - Glenn Renfro & Roy Clarkson
Scheduling Apps in the Cloud - Glenn Renfro & Roy Clarkson
 
IBM THINK 2019 - Cloud-Native Clickstream Analysis in IBM Cloud
IBM THINK 2019 - Cloud-Native Clickstream Analysis in IBM CloudIBM THINK 2019 - Cloud-Native Clickstream Analysis in IBM Cloud
IBM THINK 2019 - Cloud-Native Clickstream Analysis in IBM Cloud
 
Synopsis
SynopsisSynopsis
Synopsis
 
Scheduling Apps in the Cloud - Glenn Renfro & Roy Clarkson
Scheduling Apps in the Cloud - Glenn Renfro & Roy ClarksonScheduling Apps in the Cloud - Glenn Renfro & Roy Clarkson
Scheduling Apps in the Cloud - Glenn Renfro & Roy Clarkson
 
Machine Learning with Apache Spark
Machine Learning with Apache SparkMachine Learning with Apache Spark
Machine Learning with Apache Spark
 
IMCSummit 2015 - Day 1 Developer Track - Implementing Operational Intelligenc...
IMCSummit 2015 - Day 1 Developer Track - Implementing Operational Intelligenc...IMCSummit 2015 - Day 1 Developer Track - Implementing Operational Intelligenc...
IMCSummit 2015 - Day 1 Developer Track - Implementing Operational Intelligenc...
 

Último

%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisamasabamasaba
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfVishalKumarJha10
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verifiedSector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verifiedDelhi Call girls
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park masabamasaba
 
Pharm-D Biostatistics and Research methodology
Pharm-D Biostatistics and Research methodologyPharm-D Biostatistics and Research methodology
Pharm-D Biostatistics and Research methodologyAnusha Are
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdfAzure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdfryanfarris8
 
Exploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfExploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfproinshot.com
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesVictorSzoltysek
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...SelfMade bd
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrandmasabamasaba
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is insideshinachiaurasa2
 
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...Jittipong Loespradit
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrainmasabamasaba
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️Delhi Call girls
 
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdfThe Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdfayushiqss
 

Último (20)

%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verifiedSector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
 
Pharm-D Biostatistics and Research methodology
Pharm-D Biostatistics and Research methodologyPharm-D Biostatistics and Research methodology
Pharm-D Biostatistics and Research methodology
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdfAzure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
 
Exploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfExploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdf
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is inside
 
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdfThe Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
 

Weather data meets ibm cloud. part 3 transformation and aggregation of weather data - ibm developer recipes

  • 1. 6/29/2020 Weather Data meets IBM Cloud. Part 3 - Transformation and Aggregation of Weather Data - IBM Developer Recipes https://developer.ibm.com/recipes/tutorials/weather-data-meets-ibm-cloud-part-3-transformation-and-aggregation-of-weather-data/ 1/49 Overview Skill Level: Any Skill Level This recipe shows how to extract, transform and load weather data stored by IBM Cloud Object Storage using IBM SQL Query, pandas, Apache Spark, IBM Watson Studio and Jupyter notebooks for Python. Ingredients Software Requirements IBM Cloud Account IBM Watson Studio To obtain an IBM Cloud Account and get access to the IBM Cloud and to IBM Watson Studio, please follow the instructions outlined here: Get Started with IBM Cloud Sign up for IBM Watson Studio   Recipe Requirement This recipe covers part 3 of the 'Weather Data meets IBM Cloud' series. Get the setup ready for this recipe, by covering the following prerequisite recipe: Weather Data meets IBM Cloud. Part 1 - Ingestion and Processing of Weather Data. Weather data meets IBM Cloud. Part 2 - Storage and Query of Weather Data Step-by-step Introduction
  • 2. 6/29/2020 Weather Data meets IBM Cloud. Part 3 - Transformation and Aggregation of Weather Data - IBM Developer Recipes https://developer.ibm.com/recipes/tutorials/weather-data-meets-ibm-cloud-part-3-transformation-and-aggregation-of-weather-data/ 2/49 The recipe titled ‘Weather Data meets IBM Cloud. Part 1 – Ingestion and Processing‘ showed how to ingest and process weather data from the Weather Company into the IBM Cloud Platform using IBM Functions and IBM Event Stream. A second recipe titled ‘Weather data meets IBM Cloud. Part 2 – Storage and Query of Weather Data‘ followed up by showing how to store and query weather data using IBM Cloud Object Storage and IBM SQL Query. In this recipe we shall continue up with a more comprehensive demonstration of how to transform and aggregate weather data in preparing it for analysis an visualization. In doing so we will implement a number of Extract, Transform and Load (ETL) jobs for weather observation and weather lightning data using a variety of technologies each with their pro and cons, namely IBM SQL Query, the pandas toolkit of Python as well as Apache Spark. We will start out aggregating weather observation data computing daily minimum, average and maximum temperatures using IBM SQL Query and store the resulting information in IBM Cloud Object Storage as a CSV file. In a second step we will then look into lightning data and transform it into a form that can be used for analytics by using pandas – a toolkit frequently used for data wrangling purposes when developing data science and machine learning solutions. In a third step we will briefly demonstrate how Spark SQL can be used to aggregate weather observation data as well and then finally store the results in a PostgreSQL database on IBM Cloud rather than a CSV file. For this step we will use the IBM Analytics Engine as well as IBM Cloud Databases for PostgreSQL. Key components in all tasks will be IBM Watson Studio and its support for Jupyter notebooks with Python. In the recipe you will go through the following exercises: Section 3: Get started by provisioning the required services – e.g. IBM Watson Studio, IBM Cloud Databases for PostgreSQL and IBM Analytics Engine running Apache Spark. Section 4 & 5: Define transformations by aggregating weather data using IBM SQL Query as a standalone tool interactively, then continue running the same query using Jupyter notebooks and the IBM SQL Query API. Section 6: Define transformations by transforming lightning data using Jupyter notebooks and pandas. Section 7: Define transformations by aggregating weather observation data using Jupyter notebooks and Apache Spark SQL to finally store the data in a PostgreSQL database.   Before the hands-on exercises we will briefly introduce the overall architecture and introduce the main new components: IBM Watson Studio, IBM Cloud Databases for PostgreSQL and
  • 3. 6/29/2020 Weather Data meets IBM Cloud. Part 3 - Transformation and Aggregation of Weather Data - IBM Developer Recipes https://developer.ibm.com/recipes/tutorials/weather-data-meets-ibm-cloud-part-3-transformation-and-aggregation-of-weather-data/ 3/49 IBM Analytics Engine running Apache Spark Architectural Overview Before you start it is important to provide you with some details on the architecture of the solution. The architecture diagram is shown below and is a typical IoT architecture with services for event data ingestion, event handling (message bus), landing zone (object storage) and event data processing to transform, aggregate, load and visualize event data in a form so that it becomes business relevant. In other words, it is a data refine pipeline turning “raw oil” into “business relevant information” covering data at rest.
  • 4. 6/29/2020 Weather Data meets IBM Cloud. Part 3 - Transformation and Aggregation of Weather Data - IBM Developer Recipes https://developer.ibm.com/recipes/tutorials/weather-data-meets-ibm-cloud-part-3-transformation-and-aggregation-of-weather-data/ 4/49 In the first part of this series, weather data was retrieved from the Weather Company Data service by IBM Cloud Functions (based on Apache OpenWhisk) and delegated onwards to IBM Event Streams (based on Apache Kafka). In the second part of the series IBM Cloud Object Storage Bridge was then configured to write data from IBM Event Streams to IBM Cloud Object Storage. Once that the weather data had been stored in IBM Cloud Object Storage, it became possible to query the data using IBM SQL Query. As a prerequisite for writing and reading data from IBM Cloud Object Storage, proper access permissions had however to be created using IBM Identity and Access Management (behind the scene). In this part we shall continue transforming and aggregating the weather data in IBM Cloud Object Storage using a variety of technologies and then briefly discuss the pro and cons of each. None of the transformations performed will be complex. The main purpose of these transformations will be to prepare the data for analytics so that it can be used for defining dashboards and we have deliberately chosen to focus on the principles of using the technologies rather than dwelling into complex transformations requiring lengthy computations. For all tasks we will use IBM Watson Studio to develop Jupyter notebooks for Python. During the sections of this recipe we will work with two kinds of data: the weather observation data that was ingested and stored in the IBM Cloud Object Storage using IBM Functions and IBM Event Streams, as well as lightning data World Wide for a given point in time. The latter will be available as a JSON file that has been extracted from the Weather Company API and made available to you. You will work with a pre-extracted file that just needs to be uploaded to IBM Watson Studio rather than real-time data for a very simple reason: extracting real-time lightning data from the Weather Company API requires a special API Key with limited access and charges involved. So in this recipe we shall just pretend that the data has already been ingested and then work with an uploaded file. In this recipe 3 different kind of technologies will be used for defining the transformations: IBM SQL Query: For the weather observations data we will initially aggregate the data to show minimum, average and maximum temperatures per location on a daily basis making a deliberate distinction between night and day temperatures. For this we will use IBM SQL Query and will continue the work in the second recipe by defining a SQL query that loads weather observation data from a COS bucket, aggregates it and then stores the results in a dedicated CSV file. Pandas: In a second step we will look into the lightning data. Unfortunately, this data comes in a form which can neither be processed by IBM SQL Query, nor be used in for BI
  • 5. 6/29/2020 Weather Data meets IBM Cloud. Part 3 - Transformation and Aggregation of Weather Data - IBM Developer Recipes https://developer.ibm.com/recipes/tutorials/weather-data-meets-ibm-cloud-part-3-transformation-and-aggregation-of-weather-data/ 5/49 using IBM Watson Studio. The data must therefore be transformed into an acceptable form and for this purpose we will the pandas library for Python which is frequently used for data wrangling purposes as long as the amount of data can adequately be processed in memory of a single machine.  Apache Spark: For large data sets we will need to adopt other technologies than pandas. In a third step we will there show how to use Apache Spark and Spark SQL to aggregate the weather observation data, but rather than loading the result into a CSV file as was the case for IBM SQL Query, we will load the aggregated data into a PostgreSQL database using the SQLAlchemy toolkit for Python. For this purpose you will need two additional services in IBM Cloud: the IBM Analytics Engine offering Apache Spark (as well as Hadoop which will not be needed since we are using IBM Cloud Object Storage as data lake) and IBM Cloud Databases for PostgreSQL.   In context of transformations of data the terms Extract-Transform-Load (ETL) and Extract- Load-Transform (ELT) are usually mentioned as two different kind of approaches. ETL is the traditional approach adopted for Data Warehouses and is normally a well- defined workflow implemented using specific ETL tools. ETL first extracts data from various data sources. Then data is cleansed, enriched, transformed to finally be stored in a data warehouse. The Data Warehouse is typically well defined according to a schema and adding new information to the DW is typically a rigorous process requiring changes to ETL jobs, the Data Warehouse schema and the required reposts and dashboard. ELT (Extract, Load, Transform) ion the other hand is of never origin and is a variant of ETL where the extracted data is first loaded into a target system. Transformations are then performed after the data is loaded. ELT typically works well when the target system is powerful enough to handle transformations. Distributed repositories such as Hadoop or IBM Cloud Object Storage combined with e.g. Apache Spark are often used in ELT pipelines because they are highly efficient in performing transformations on large set of data. ELT is less rigorous than traditional approaches using ETL and DWs, and it does not (necessarily) presume that the data is stored according to a schema.   If we look at the architecture above ‘in the large’ then it follows the ELT paradigm: data is extracted from The Weather Company, loaded into IBM Cloud Object Storage to eventually be transformed by e.g. IBM SQL Query. However, there is also an instance of a classical ETL approach in the architecture insofar that Apache Spark is used to extract data from IBM Cloud
  • 6. 6/29/2020 Weather Data meets IBM Cloud. Part 3 - Transformation and Aggregation of Weather Data - IBM Developer Recipes https://developer.ibm.com/recipes/tutorials/weather-data-meets-ibm-cloud-part-3-transformation-and-aggregation-of-weather-data/ 6/49 Object Storage, transform it by performing appropriate aggregations to finally load it into the Data Warehouse on PostgreSQL for further analysis. An overview of IBM Functions, IBM Event Streams, IBM Cloud Object Storage, IBM SQL Query and IBM Identity and Access Management has already been given in the previous 2 parts of this series. We shall therefore provide a short overview of the 3 new components in this recipe: IBM Watson Studio, IBM Cloud Databases for PostgreSQL and IBM Analytics Engine.   IBM Watson Studio IBM Watson Studio provides users with environment and tools to solve business problems by collaboratively working with data. Users can choose the tools needed to analyse and visualize data, to cleanse and shape data, to ingest streaming data, or to create, train, and deploy machine learning models. The main functionality offers relates to components for: Create Projects to organize the resources (such as data connections, data assets, collaborators, notebooks) to achieve an analytics goal. Access data from Connections to your cloud or on-premises data sources. Upload files to the project’s object storage.
  • 7. 6/29/2020 Weather Data meets IBM Cloud. Part 3 - Transformation and Aggregation of Weather Data - IBM Developer Recipes https://developer.ibm.com/recipes/tutorials/weather-data-meets-ibm-cloud-part-3-transformation-and-aggregation-of-weather-data/ 7/49 Create and maintain Data Catalogs to discover, index, and share data. Refine data by cleansing and shaping the data to prepare it for analysis. Perform Data Science tasks by creating Jupyter notebooks for Python or Scala to run code that processes data and then view the results inline. Ingest and Analyze Streams data with the Streams Designer tool. Create, test and deploy Machine Learning and Deep Learning models. Classify images by training deep learning models to recognize image content. Create and share Dashboards of data visualizations without coding. Reuse content such as data sets, notebooks and tútorials available on the Watson Community.   IBM Watson Studio is technically based on a variety of Open Source and IBM technology.  In this recipe we shall define several Jupyter notebook for Python that will extract, transform and then finally load the result of the transformation into either a file in IBM Cloud Object Storage or a database table. The below picture provides an overview of some of the components of IBM Watson Studio and the IBM Watson Data Platform. The ones with solid yellow lines are used in this recipe, the ones with dotted lines in previous or upcoming recipes:
  • 8. 6/29/2020 Weather Data meets IBM Cloud. Part 3 - Transformation and Aggregation of Weather Data - IBM Developer Recipes https://developer.ibm.com/recipes/tutorials/weather-data-meets-ibm-cloud-part-3-transformation-and-aggregation-of-weather-data/ 8/49 During the hands-on in this recipe you will work with a number of Jupyter notebooks as shown below – either using IBM SQL Query, pandas or Spark SQL. The notebooks will take data assets such as the weather data bucket of IBM Cloud Object Storage (connection ‘COSWeatherData’) holding weather observation or lightning data (‘Lightning_global_1min’) as input, and produce several CSV files as the result: ‘weather-observations.csv’ containing detailed weather data extracted from the JSON. objects in IBM Cloud Object Storage. ‘weather-observation-temperatures.csv’ containing aggregated min, max and average temperatures. ‘lightning-data.csv’ containing lightning data in tabular form.   IBM Analytics Engine and Apache Spark
  • 9. 6/29/2020 Weather Data meets IBM Cloud. Part 3 - Transformation and Aggregation of Weather Data - IBM Developer Recipes https://developer.ibm.com/recipes/tutorials/weather-data-meets-ibm-cloud-part-3-transformation-and-aggregation-of-weather-data/ 9/49 IBM Analytics Engine has been designed for the development and deployment of analytics applications using open source Apache Spark and Apache Hadoop. Using the service, it is possible to customize the cluster using own analytics libraries and open source packages. The service also integrates with IBM Watson Studio or third-party applications to submit jobs to the cluster. In this recipe we shall take a one-eyed view of the service just focusing on the ability to run a Spark job from within a Jupyter notebook being managed by IBM Watson Studio. Apache Spark is a distributed computing platform designed to be a fast and general-purpose engine for processing big data – be it at rest or in motion. Spark applications run as independent sets of processes on a cluster, coordinated by a driver program. The root object is SparkContext which can be used to connect to the cluster manager, which allocate resources across the cluster. Spark comes with its own scheduler but does also work with YARN and Mesos. Once connected to the cluster manager, Spark acquires executors on worker nodes in the cluster, which are processes that run computations and store data for the application. When a Spark program is run, data is partitioned into different blocks across the cluster. The driver program sends the code to be executed to each worker node in the cluster. The executors on each node/block will perform the work using tasks. After the first action completes the result is send back to the driver. Data locally to each node may be cached to speed up re-use of intermediate results. The following figure (from https://spark.apache.org/docs/latest/cluster-overview.html) outlines the cluster architecture: Apache Spark achieves high speed by using in-memory computations and by distributing the computations across a cluster of nodes. Sparks primary abstraction is that of the so called Resilient Distributed Datasets – or RDD for short. An RDD is a distributed collection
  • 10. 6/29/2020 Weather Data meets IBM Cloud. Part 3 - Transformation and Aggregation of Weather Data - IBM Developer Recipes https://developer.ibm.com/recipes/tutorials/weather-data-meets-ibm-cloud-part-3-transformation-and-aggregation-of-weather-data/ 10/49 of elements stored across the nodes of the Spark cluster which provides the basis for parallelized processing of data across the cluster. It offers two forms of operations: transformations (e.g. map, filter) and actions. Transformations are lazily evaluated and are operations that take one or more RDD’s as input and yields an RDD as result. No operations are however performed immediately on the RDD by transformations until an action is performed, rather Spark build up a Directed Acyclic Graph (DAG) that can be optimized before actual execution. This mode of operation resemble the inner working of lazy functional programming languages such as Haskell. Actions are the operations that triggers execution of a sequence of transformations on an RDD and returns a value. This mode contributes to the fault tolerance of a Spark cluster: should a node disappear, or intermediate results be lost, Spark has the option to create a new node, load the data and perform the operations of the DAG again. To improve performance when handling large datasets and complex transformations it is furthermore possible to cache intermediate results – either in memory or on disk or by using a combination of both. The Spark Core is the foundation of the overall Spark system. It provides distributed task dispatching, scheduling, and basic I/O functionalities, exposed through an application programming interface centered on the RDD abstraction. On top of the core, Spark provides 4 libraries: Spark SQL for defining transformations using SQL like queries, Spark Streaming for handling real time processing of data in motion, MLLib for machine learning and GraphX for graph processing. RDD’s, and the operations on RDD’s in form of transformation and actions, remain however the underlying kernel abstraction of Spark. Spark provides APIs for Scala, Python, Java and R. Its native language is Scala, but in this recipe we will use Python and Spark SQL which provides a much higher level of abstraction than RDDs. When running SQL from within another programming language such as Python the results will be returned as a Dataset/DataFrame. Datasets provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine. A DataFrame is a Dataset organized into named columns and is conceptually equivalent to a table in a relational database and unlike Datasets it is supported by Python.    IBM Cloud Database Services IBM Cloud™ Databases for PostgreSQL offers an object-relational SQL database, complemented by powerful enhancements like indexable JSON, publish/subscribe functions and drivers. The service is fully managed with IBM Cloud taking care of high availability, backups, logging, monitoring, scaling, hardware setup and software patching. IBM provides ready-to-use integrations with IBM Identity and Access Management to extend access control
  • 11. 6/29/2020 Weather Data meets IBM Cloud. Part 3 - Transformation and Aggregation of Weather Data - IBM Developer Recipes https://developer.ibm.com/recipes/tutorials/weather-data-meets-ibm-cloud-part-3-transformation-and-aggregation-of-weather-data/ 11/49 and auditing. IBM Cloud Databases for PostgreSQL deployment are built as highly available clustered configurations leveraging multiple availability zones in a given region and have been designed with scalability and economics in mind: Databases for PostgreSQL are billed hourly and helps enable you to separately scale disk and RAM for each of your deployments. High availability of IBM Cloud Databases is provided by running the databases on several availability zones, with one of the zones taking the role as the leader and the other database the role of follower. Resiliency is achieved by backing up data to a cross regional instance of IBM Cloud Object Storage. Getting Started We will assume that you have already gained access to IBM Cloud and eventually also IBM Watson Studio (see the “Prerequisites” section at the beginning of the recipe for the links needed for registering). In this section of the recipe you will get started by doing the following: Create an instance of IBM Databases for PostgreSQL. Create an instance of the IBM Analytics service (running Apache Spark). Create an instance of IBM Watson Studio. Create an IBM Watson Studio project. Upload a dataset for lightning data from GitHub.  
  • 12. 6/29/2020 Weather Data meets IBM Cloud. Part 3 - Transformation and Aggregation of Weather Data - IBM Developer Recipes https://developer.ibm.com/recipes/tutorials/weather-data-meets-ibm-cloud-part-3-transformation-and-aggregation-of-weather-data/ 12/49 Provision IBM Cloud PostgreSQL Database Service You will need to create an instance of the PostgreSQL database and configure credentials for the service: Select the IBM Cloud Catalog. Enter ‘post’ as search string. Select the ‘Databases for PostgreSQL’ service. On the ‘Databases for PostgreSQL’ page: Provide a name for the service (optionally using ‘Weather Data Demo’ as prefix). As resource group choose the one that you created in the previous recipe (e.g. ‘weather-data-demo’). Add a tag (e.g. ‘weather-data-demo’) to the service. Scroll down and select for pricing the Standard plan. Click Create to create the service. Wait until the service has been created. Select Service Credentials in the toolbar to the left. Select the New Credential button.
  • 13. 6/29/2020 Weather Data meets IBM Cloud. Part 3 - Transformation and Aggregation of Weather Data - IBM Developer Recipes https://developer.ibm.com/recipes/tutorials/weather-data-meets-ibm-cloud-part-3-transformation-and-aggregation-of-weather-data/ 13/49 Enter a name for the credentials in the Add new credential dialog. Click Add to create the Select the View credentials action. Copy and paste the JSON object that defines the new credential to a local file for later use.   Provision IBM Cloud Analytics Service You will need to create an instance of the IBM Cloud Analytics service: Select the IBM Cloud Catalog. Enter ‘spark’ as search string. Select the ‘Analytics Engine’ service. On the ‘Analytics Engine’ page: Provide a name for the service (optionally using ‘Weather Data Demo’ as prefix). As resource group choose the one that you created in the previous recipe (e.g. ‘weather-data-demo’). Add a tag (e.g. ‘weather-data-demo’) to the service. Scroll down and select for pricing the Lite plan.
  • 14. 6/29/2020 Weather Data meets IBM Cloud. Part 3 - Transformation and Aggregation of Weather Data - IBM Developer Recipes https://developer.ibm.com/recipes/tutorials/weather-data-meets-ibm-cloud-part-3-transformation-and-aggregation-of-weather-data/ 14/49 Click Configure to configure the service. On the Configure service instance page: Keep the default for the Hardware configuration. Keep the default of 1 for Number of compute nodes. Keep the default of ‘Spark and Hive’ for the Software package. Click Create. Close the dialog that informs you that the service creation may take some time.   You do not need to wait for the service to be fully provisioned but can continue with the next steps right away. The service will not be needed before you start in section 7 of this recipe using Apache Spark.   Create IBM Watson Studio Service To create the Watson Studio service, do the following: Select the IBM Cloud Catalog. Enter ‘studio’ as search string.
  • 15. 6/29/2020 Weather Data meets IBM Cloud. Part 3 - Transformation and Aggregation of Weather Data - IBM Developer Recipes https://developer.ibm.com/recipes/tutorials/weather-data-meets-ibm-cloud-part-3-transformation-and-aggregation-of-weather-data/ 15/49 Select the ‘Watson Studio’ service. On the Watson Studio page: Provide a name for the service (optionally using ‘Weather Data Demo’ as prefix). As resource group choose the one that you created in the previous recipe (e.g. ‘weather-data-demo’). Add a tag (e.g. ‘weather-data-demo’) to the service. Scroll down and select for pricing the Lite plan. Click Create to create the service. Wait until the service has been created. On the dashboard of the service, select Get Started. This will take you to the Home Page showing you commands for creating a project and searching a Catalog.   If you already have an account for IBM Watson Studio you can of course decide to use that service. You will then need to log into the service and optionally change the region to the region used for the other services of this recipe (or preferably a region that is close to the location of those services).
Create IBM Watson Studio Project

Next you will need to create an IBM Watson Studio project. Go to the dashboard of IBM Watson Studio and do the following:

1. Click Create a project.
2. On the next page, titled 'Create a Project', select the Standard project template.
3. Click Create Project.
4. In the New Project dialog:
   - Enter a name for the project, such as 'Weather Data Demo'.
   - Provide a short description of the project.
   - Select the Object Storage instance where you want project artefacts to be stored.
5. Click Create.
6. Wait until the project has been created.

Watson Studio will now provide you with an overview of the project as well as tabs for viewing assets, environments, bookmarks, collaborators and settings. The settings allow you to add services, access tokens and integrations to e.g. GitHub.

Upload Lightning Data Set to IBM Watson Studio Project

To upload the lightning data set, do the following:

1. Download the 'Lightning_global_1min_201812041030.json' data set from the GitHub project weather-data-demo.
2. Select the Find and add data command in the main toolbar of Watson Studio (located in the upper right corner).
3. Drag the file named 'Lightning_global_1min_201812041030.json' from your local file system to the pane with the title 'drop files here'.
4. Select the command Add to project.
If you go back to the IBM Cloud dashboard and select Resource List from the Navigation Menu in the top left corner, you should now be able to view the resources tagged with 'weather-data-demo'. The list should (eventually) show you the services created so far; the Analytics Engine will appear in the list as well, as soon as it has been completely provisioned.

Transformations using IBM SQL Query

In the previous recipe we used IBM SQL Query on a very simple example to query all the weather observations stored in a large number of JSON files. In this recipe we shall continue with a more refined and realistic example by aggregating the weather data by observation location, date and time of day (night/day) and then computing statistical features such as minimum, maximum and average temperatures. The SQL query that we will use is defined as follows:
WITH weatherdata AS (
    SELECT class, day_ind, CAST(from_unixtime(valid_time_gmt) AS DATE) AS obs_date,
           obs_name, pressure, temp, wdir, wspd
    FROM cos://eu-de/weather-data-demo/* STORED AS JSON o
    WHERE o.class = "observation"
)
SELECT
    weatherdata.obs_name,
    weatherdata.obs_date,
    weatherdata.day_ind,
    ROUND(MIN(weatherdata.temp),1) AS mintemp,
    ROUND(AVG(weatherdata.temp),1) AS avgtemp,
    ROUND(MAX(weatherdata.temp),1) AS maxtemp
FROM weatherdata
GROUP BY weatherdata.obs_name, weatherdata.obs_date, weatherdata.day_ind
ORDER BY weatherdata.obs_name, weatherdata.obs_date, weatherdata.day_ind
INTO cos://eu-geo/weatherdatademo-donotdelete-pr-2qk5tcfdz1qs9t/result/ STORED AS CSV

The query will, when executed, return results as shown in the screenshot below. The SELECT statement in the WITH clause is similar to the query used in the previous recipe and returns a consolidated table of weather observations from the bucket named 'weather-data-demo'.
The next SELECT statement groups the information by observation location, date and time of day and then computes the average, maximum and minimum temperatures for each group. The last part of the query outputs the results to the bucket that contains the IBM Watson Studio project artefacts. This way we can easily import the result into the project once it has been generated. For more details regarding the supported SQL syntax, please see the SQL Reference in the online documentation for the IBM SQL Query service.

There is a crucial point worth mentioning, which has to do with the URL for the IBM Cloud Object Storage endpoint. The source bucket that contains the weather data ('weather-data-demo') and the target bucket that holds the project artefacts ('weatherdatademo-donotdelete…') have different endpoint locations, although both buckets belong to the same IBM Cloud Object Storage instance. The FROM clause for the weather data demo bucket uses the endpoint location 'eu-de', whereas the INTO clause for the bucket holding the project artefacts uses the location 'eu-geo'.

The reason for this is simple. The weather data demo bucket was created as a regional bucket located in the IBM Cloud availability zones in Frankfurt, whereas the bucket containing the project artefacts has been defined as a cross-region bucket replicated across the IBM Cloud data centres in Europe. IBM Watson Studio uses cross-region buckets to store project artefacts in order to achieve a higher level of availability. Usually this is transparent to the user, but the difference becomes important when defining queries or data source connections inside IBM Watson Studio that access the bucket: if you use the wrong location, e.g. 'eu-geo' for the weather-data-demo bucket rather than 'eu-de', the query or the connection will fail and you will get an error.
To run the query, do the following:

1. Open the dashboard for your IBM Cloud Object Storage service.
2. Select Buckets in the toolbar to the left and note (copy) the name and location of the two relevant buckets.
3. Open the dashboard of your IBM SQL Query service.
4. Copy the query from the recipe into the editor of the service.
5. Modify the FROM clause and the INTO clause so that the names and locations match your buckets.
6. For the first test run, prefix the INTO clause with '--' to comment it out. This way you avoid cluttering up your project workspace with erroneous results; they will be written to the default location instead (which is us-south).
7. Select the Run command to run the query.
8. Check that the query has returned meaningful results. If this is the case, delete the comment marker in the INTO clause and rerun the query to send the results to your project bucket.

Once the query has run successfully, it is time to import the results into the project. To achieve this, do the following:

1. Open the dashboard for IBM Watson Studio.
2. Open the weather data demo project that you created.
3. Select Assets.
4. Select the Find and add data button in the top right corner (labelled with 0 and 1).
5. Select the Files tab. You should now see a number of generated files that have been produced by the SQL Query jobs.
6. Select the last file.
7. Select the "…" menu as shown on the screenshot.
8. Select the Add data asset menu item in the resulting pull-down menu.
9. Select Apply in the resulting dialog named 'Apply changes'.
This will add the file. In a last step you can then change the name of the file to something more meaningful than the cryptic generated file name:

1. Select the name of the generated CSV file in the list of data assets. This will open up the file and render its content.
2. In the upper left part of the screen, click the pencil to the right of the generated file name.
3. Select the part of the name before '.csv' and delete the text.
4. Change the name of the data asset to 'weather-observation-temperatures.csv'.
5. Select the Apply command.
6. In the upper left part of the screen, select the link named Weather Data Demo to take you back to your list of project assets.

You have now transformed and aggregated the weather data and added the generated result file as a data asset to your project. This file can conveniently be used e.g. as input to a dashboard, which we will look into in the next recipe.
As a last step of this section we will also need to generate a CSV file containing the detailed weather observations. Basically, this means re-running the query that was used in the previous recipe, with the addition that the data is this time stored in the bucket that contains the project artefacts. This file will then be used in the next recipe for generating statistical diagrams using matplotlib. This time you will have to use the query from Part 2, which is defined as follows:

SELECT
    class,
    clds,
    day_ind,
    expire_time_gmt,
    valid_time_gmt,
    CAST(from_unixtime(valid_time_gmt) AS DATE) AS date,
    obs_name,
    pressure,
    temp,
    uv_desc,
    wdir,
    wdir_cardinal,
    wspd,
    wx_phrase
FROM cos://eu-de/weather-data-demo/* STORED AS JSON o
WHERE o.class = "observation"
ORDER BY valid_time_gmt DESC
INTO cos://eu-geo/weatherdatademo-donotdelete-pr-2qk5tcfdz1qs9t/result/weather-observations.csv STORED AS CSV
You will need to go through the following steps, very similar to what you have already been doing to generate 'weather-observation-temperatures.csv':

1. Copy, paste and run the above query using IBM SQL Query, but for the URLs in the FROM and INTO clauses use the URLs of the bucket containing your weather data and the bucket containing your project artefacts, respectively.
2. Add the generated CSV file as a data asset to your project in the same way as you added 'weather-observation-temperatures.csv'.
3. Rename the added file to 'weather-observations.csv'.
4. Open the data asset and view the results.

Transformations using Jupyter Notebooks and IBM SQL Query API

In this section of the recipe we shall demonstrate how to run the SQL query using API calls. For this purpose we will reuse and configure an existing notebook named 'IBM Cloud SQL Query' that is available on the IBM Watson Studio Community. IBM Watson Studio comes with a fairly comprehensive set of assets that are shared with the community of Watson Studio users. These assets include articles, data sets, notebooks, papers and tutorials.
To import the required notebook into your project, do the following:

1. Select the Navigation Menu of IBM Watson Studio in the upper left corner of the dashboard.
2. Select the menu item named Community from the pull-down menu. This will take you to the Community page.
3. Enter the search string 'SQL Query'. This will provide you with icons for a number of artefacts.
4. Select All Filters to filter the result set further.
5. Select Notebook to show just the Jupyter notebooks.
6. Select the notebook named 'IBM Cloud SQL Query' (see screenshot).
7. In the notebook, click the Copy command in the upper right part of the toolbar to copy the notebook into your project.
8. In the 'New Notebook' dialog:
   - Select your weather data demo project as target project.
   - Select the free runtime system for Python.
9. Click Create Notebook.

You have now imported the notebook into your project. However, the notebook will need to be modified as follows in order to get it to work with the weather data:

- Provide the credentials for the COS bucket containing the weather data, including the API key and the Cloud Resource Name (CRN).
- Insert a project token for the COS project (target) bucket, which is needed to store the resulting artefact of the query in the project using API calls.
- Provide a definition for the SQL query to be submitted.
- Delete superfluous cells in the notebook.

The notebook starts in section 1 by installing and importing the libraries needed to configure and run the SQL query. Make sure that the libraries get installed by doing the following:

1. Select the first code cell in section 1, which installs the 'pyarrow' and 'sqlparse' libraries.
2. Click the Run Cell command in the toolbar and wait for the installation to complete (it is complete when the cell marker changes from * to a number).
3. Run the two remaining cells in section 1.
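For reference, the install-and-import cells amount to something like the following. This is a hedged sketch, assuming the notebook relies on the ibmcloudsql client library; the exact contents of the community notebook may differ:

!pip install pyarrow sqlparse

import ibmcloudsql   # Python client for the IBM SQL Query service
import sqlparse      # used by the notebook to pretty-print the SQL statement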
Section 2 of the notebook, titled 'Configure SQL Query', sets the credentials and other access information needed to obtain access to IBM Cloud Object Storage and IBM SQL Query. Follow the instructions in the notebook with a couple of modifications. For the API key:

1. Invoke the command Manage > Access (IAM).
2. Select IBM Cloud API Keys in the toolbar to the left.
3. Invoke the Create an IBM Cloud API Key command.
4. In the 'Create API Key' dialog, enter a name for the key, e.g. 'Weather Data Demo Key', and click Create.
5. In the 'API Key successfully created' dialog, copy and download the key. Then close the dialog window.
6. Paste the key into the notebook.

For the IBM SQL Query Cloud Resource Name (CRN):

1. Go to the IBM Cloud dashboard and select the resource list.
2. Filter the resources using the tag 'weather-data-demo'.
3. Select your instance of IBM SQL Query and copy the CRN from the dashboard of the service.
4. Paste the CRN into the notebook so that it is assigned to the variable 'instancecrn'.

For the SQL COS endpoint:

1. Open the dashboard for your IBM Cloud Object Storage service and select the Buckets tab.
2. For the bucket that contains your weather data, invoke the Bucket Actions > SQL URL command in the pull-down menu to the right of the bucket.
3. Copy the value and assign it to a variable named 'sql_cos_endpoint' in the notebook, as shown above in the screenshot.
4. Run the cell.

The 'Configure SQL Query' section continues with cells that set the project id and project token, and from this information compute the name of the bucket holding the project artefacts as well as the target URL for the project. This information is needed in order to get access to the project bucket when the results of the query are saved.
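Taken together, the configuration cells boil down to assignments of the following form. This is a hedged sketch; all values below are placeholders that you must replace with your own key, CRN and bucket URLs:

import getpass

# the key created under Manage > Access (IAM)
apikey = getpass.getpass('Enter your IBM Cloud API key: ')
# hypothetical CRN; copy the real one from the dashboard of your SQL Query instance
instancecrn = 'crn:v1:bluemix:public:sql-query:eu-de:a/<account-id>:<instance-id>::'
# SQL URL of the regional bucket holding the weather data
sql_cos_endpoint = 'cos://eu-de/weather-data-demo'
# target URL pointing at the cross-region project bucket (hence 'eu-geo')
targeturl = 'cos://eu-geo/weatherdatademo-donotdelete-pr-2qk5tcfdz1qs9t/'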
To modify the notebook, you will need to obtain the project token and insert it into the notebook:

1. Open a new window for the project, e.g. by pressing CTRL-Left Click on the name of the project (Weather Data Demo) in the upper left part of the dashboard.
2. Select the Settings tab of the project in the project dashboard.
3. Scroll down to the section named Access Tokens.
4. Invoke the command New token.
5. In the dialog named 'New token':
   - Enter a name for the token, e.g. 'WeatherDataDemoWriter'.
   - Select Editor as role.
   - Click Create.
6. The project token will now be listed in the section titled Access Tokens. You can view the token by invoking the Action pull-down menu to the left of the access token and then selecting the menu item named View.
7. Go back to your notebook.
8. Select the cell that contains the heading for inserting the project token.
9. Invoke More > Insert project token, which will create a hidden cell akin to the selected cell in the screenshot above.
10. Check that the inserted project access token matches the one in step 6. If not, copy the one from step 6 and replace the access token in the notebook with it.
11. Run the three cells in the screenshot above in the order they are listed. When asked for input (e.g. for the 'apikey'), just leave the text field blank and press Return.
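The hidden cell generated by Insert project token looks roughly as follows. This is a hedged sketch, assuming the standard project_lib helper of Watson Studio; the id and token values are placeholders filled in by Watson Studio:

# @hidden_cell
# the project token authorizes the notebook to read and write project resources
from project_lib import Project
project = Project(project_id='<project-id>', project_access_token='<project-token>')
pc = project.project_context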
You have now created a project token, and the corresponding statements in the notebook will provide the SQL Query API with sufficient credentials to create assets in the project for storing the results of the query. The notebook continues in section 3 with Python statements that define and format the SQL query to be submitted. Section 3 builds the query string, whereas section 4 contains the final statements for executing the SQL query and returning the results in the form of a pandas data frame. In this recipe we shall first use the following simple query definition, just to get a confirmation that everything works:

sql = 'SELECT class, day_ind, valid_time_gmt, obs_name, pressure, temp, uv_desc, wdir, wdir_cardinal, wspd, wx_phrase ' + \
      ' FROM ' + sql_cos_endpoint + '/* STORED AS JSON o WHERE o.class = "observation"'
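Under the hood, section 4 submits this string to the SQL Query service and fetches the result. A hedged sketch of what those cells amount to, assuming the ibmcloudsql client API and the variables configured above:

import ibmcloudsql

# create a client bound to your SQL Query instance and the project bucket for results
sqlClient = ibmcloudsql.SQLQuery(apikey, instancecrn, target_cos_url=targeturl)
sqlClient.logon()

result_df = sqlClient.run_sql(sql)   # blocks until the job completes, returns a pandas DataFrame
result_df.head(10)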
To change the query and run it, do the following:

1. Replace the SQL query in section 3 of the notebook with the one above.
2. Run the cell. Press Return if asked to enter the query.
3. Select the first cell in section 4.
4. Invoke Run and wait for the execution to complete (this may take some time).
5. Invoke Run in the next cell to see the first 10 records.

The results returned should look something akin to the screenshot above. We will only use the first part of the notebook, so you can delete all cells following the first two cells of section 4, down to section 8 of the notebook. The easiest way to do this is to use keyboard shortcuts. You can get information on the available shortcuts from the Help menu by invoking Help > Keyboard Shortcuts. To delete the superfluous content:
1. Set the editor from insert mode (where cells are marked green in the left margin) into command mode (where cells are marked blue in the left margin) by pressing Esc.
2. Scroll down and select cell 12 (this is the cell after the cell with the orange text 'Orders by Employee'). Then cut it by pressing X.
3. Continue until all superfluous cells are deleted and you have reached the section titled 'Next Steps'.
4. Invoke File > Save As to save the file, providing a new name for the notebook, e.g. 'Weather Data ETL – IBM Cloud SQL Query'.

You can now change the query and try the one that was defined in section 4, which aggregates weather observation temperatures by location, date and time of day. Replace the code cell in section 3 with the following to change the definition of the query:

sql = "WITH wd AS (SELECT day_ind, CAST(from_unixtime(valid_time_gmt) AS DATE) as obs_date, obs_name, temp \n" + \
      "FROM " + sql_cos_endpoint + "/* STORED AS JSON o WHERE o.class = 'observation') \n" + \
      "SELECT wd.obs_name, wd.obs_date, wd.day_ind, ROUND(MIN(wd.temp),1) AS mintemp, ROUND(AVG(wd.temp),1) AS avgtemp, ROUND(MAX(wd.temp),1) AS maxtemp \n" + \
      "FROM wd \n" + \
      "GROUP BY wd.obs_name, wd.obs_date, wd.day_ind \n" + \
      "ORDER BY wd.obs_name, wd.obs_date, wd.day_ind \n" + \
      "INTO {}result/ STORED AS CSV".format(targeturl)
print('\nYour SQL statement is:\n')
print(sql)

Then run all cells in sections 3 and 4 to get the aggregated figures.
Rename the notebook to 'Weather Data ETL – IBM Cloud SQL Query V2' and save it. A minimal version of the notebook, titled 'ETL Weather Data Using IBM Cloud SQL Query 2', that neither formats the query nor asks the user for input has been uploaded to GitHub. To use this notebook, you will need to provide your project credentials and bucket name in cell 3 and the project token in cell 4.

Transformations using Jupyter Notebooks and Pandas

In this section you will work with a Jupyter notebook that extracts lightning data from the Weather Company, transforms it from a multi-line JSON representation to a tabular representation as a pandas data frame, and finally loads it into a CSV file in a format that can be used for analytics. Rather than fetching the data from the Weather Company API directly in real time (which would require a special API key), you will work with the file that was uploaded to IBM Watson Studio in section 2.
The data in the input file (named 'Lightning_global_1min_201812041030.json') comes in a multi-line JSON representation. If you select the file in IBM Watson Studio it will be opened, and you can investigate the structure in more detail. The measures such as latitude, longitude and intensity are represented as single properties, each with an associated list that contains the values for all observations. Each single observation is delimited by a comma and a newline.

In principle it would be possible to turn such a structure into a collection of observations in table format, with one row for each observation and one column for each property, using an SQL query of the form:

SELECT explode(lightning.intensity) intensity
FROM cos://eu-geo/weatherdatademo-donotdelete-pr-2qk5tcfdz1qs9t/Lightning_global_1min_201812041030.json STORED AS JSON

A more comprehensive example of a query using the explode function, which has been used as a basis for defining the query above, can be found in the sample queries of IBM SQL Query (see 'Samples > CSV and JSON Parsing Queries > Advanced JSON flattening'). However, there is one caveat: IBM SQL Query does not, for the time being, accept multi-line JSON structures of the form that the file is in.

There are several solutions to the problem. One could, for example, use the command line utility jq with the -c option, e.g. "$ jq -c ".[]"", or a conversion using functions of the JSON package as described in 'Converting JSON into newline delimited JSON in Python'.
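As an illustration of the second workaround, here is a hedged sketch of compacting the pretty-printed file to single-line JSON with the standard json module; the file names are just examples:

import json

# read the multi-line (pretty-printed) JSON document
with open('Lightning_global_1min_201812041030.json') as f:
    data = json.load(f)

# write it back as a single line, a form that IBM SQL Query can consume
with open('Lightning_global_1min_201812041030.oneline.json', 'w') as f:
    f.write(json.dumps(data, separators=(',', ':')))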
A third option would be to use pandas, which in general comes in very handy for transforming data sets that are not very large, which is actually the case for the 1.3 MB file that we are dealing with just now. We shall therefore convert the file using pandas and a Jupyter notebook for Python that you can download from GitHub. You will need to go through the following steps:

1. Add the notebook on GitHub to the project.
2. Insert pandas commands for reading the data set into your notebook.
3. Run the cells in the notebook in the given order and observe the effects.

To create the notebook from GitHub, do the following in IBM Watson Studio:

1. Select the command Add to project from the toolbar.
2. Select Notebook in the 'Choose asset type' dialog.
3. Select From URL in the 'New Notebook' dialog.
4. Copy the URL 'https://github.com/EinarKarlsen/weather-data-demo/blob/master/Weather%20Lightning%20Data%20ETL%20-%20Pandas.ipynb' to the URL field. This should set the name of the notebook as well ('Weather Lightning Data ETL – Pandas').
5. In the Select runtime field, select the runtime 'Default Python 3.5 Free', which provides a runtime free of cost.
6. Click Create Notebook.

Once the notebook has been imported, you will initially need to insert commands for reading the lightning data set in your project into the first cell of the notebook, which is empty.
Fortunately, this is supported by commands of IBM Watson Studio:

1. Select the first cell in the notebook.
2. Select the Find and add data command in the toolbar.
3. Select Insert to code > Insert pandas DataFrame for the data set containing the lightning data. This will insert all the code needed to read the data.
4. Uncomment the last two lines in the first cell, which read the data frame and assign it to a variable.
5. In the next cell, assign the variables 'cos', 'wsbucket' and 'df1' to the relevant values used in cell 1.
6. Invoke File > Save to save the changes.
7. Run the first 4 cells to extract the information and print out some information about the structure of the data set.
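The generated cell follows the usual Watson Studio pattern for reading a COS object into pandas. A hedged sketch, assuming the ibm_boto3 client that Watson Studio typically generates; all credentials and endpoints are placeholders:

import pandas as pd
import ibm_boto3
from ibm_botocore.client import Config

# client for the COS instance backing the project (placeholder credentials)
cos = ibm_boto3.client(service_name='s3',
                       ibm_api_key_id='<api-key>',
                       ibm_auth_endpoint='https://iam.cloud.ibm.com/identity/token',
                       config=Config(signature_version='oauth'),
                       endpoint_url='https://<cos-endpoint>')

wsbucket = 'weatherdatademo-donotdelete-pr-2qk5tcfdz1qs9t'
body = cos.get_object(Bucket=wsbucket, Key='Lightning_global_1min_201812041030.json')['Body']
df1 = pd.read_json(body)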
Having imported the data, the notebook outlines the structure of the data set. This data will need to be transformed into a format where the attributes of interest are represented as columns and where each observation is represented as a row. In pandas this is simply done by constructing a new data frame. Run cells 5 and 6 to transform the data and print out the first 5 observations in the new data frame.
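The transformation amounts to something along the following lines. This is a hedged sketch, assuming the raw frame df1 exposes one list entry per observation for each measure; the actual notebook cells may pick a different set of columns:

# build a tabular frame with one column per attribute of interest and one row per observation
df2 = pd.DataFrame({
    'latitude':  df1['latitude'],
    'longitude': df1['longitude'],
    'intensity': df1['intensity'],
})
df2.head(5)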
Having transformed the data, we can save it as a CSV file in IBM Cloud Object Storage for later use. The data frame is first written to CSV using the 'to_csv' command; in a next step, this file is then uploaded to the project.
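A hedged sketch of what cells 7-10 amount to, assuming the project_lib helper set up earlier is used for the upload:

# serialize the tabular frame and store it as a data file in the project's COS bucket
project.save_data('lightning-data.csv', df2.to_csv(index=False), overwrite=True)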
To generate the CSV file and import it into the project, do the following:

1. Run cells 7-10 to create and upload the CSV file.
2. Select the Assets tab for the project in IBM Watson Studio.
3. Select the Find and add data button in the top right corner (labelled with 0 and 1).
4. Select the Files tab.
5. Enter 'lightning' in the Find in storage text field.
6. Select the file named 'lightning-data.csv'.
7. Select the "…" menu as shown on the screenshot.
8. Select the Add data asset menu item in the resulting pull-down menu.
9. Select Apply in the resulting dialog named 'Apply changes'.

For additional information on how to use notebooks to work with COS, see for example the notebook 'Working With IBM Cloud Object Storage In Python' and the online documentation 'Using Python functions to work with Cloud Object Storage'. The uploading and downloading of files is touched upon in the blog 'How to Upload/Download Files to/from Notebook in my Local machine' by Charles Gomes.

Transformation using Jupyter Notebooks and Apache Spark

In this section of the recipe we shall again deal with weather observation data rather than lightning data. Suppose that we would like to aggregate the data just like in section 4 of the recipe, but rather than storing the result in a CSV file (or Parquet file), we would like to store it in a PostgreSQL database (or some other kind of database not currently supported as a target by IBM SQL Query). A simple solution would be to do it in two steps: let IBM SQL Query aggregate and store the data in an intermediate CSV file and then, in a second step, let another program in the ETL pipeline read the CSV file and store the information in PostgreSQL. The second step can be achieved with a few (around 10) lines of code using pandas. However, in situations where the transformations to be performed are very complex, calling e.g. for a solution expressed in a procedural or functional language rather than SQL, other technologies like Apache Spark come in very handy, using either Apache Spark RDDs or data frames. We shall therefore briefly show how to use Apache Spark to implement the ETL process; for simplicity, we will keep the transformation simple enough that we can reuse the SQL query of section 4, changing only the target from a CSV file to a PostgreSQL database.
In this section of the recipe you will need to create a new Jupyter notebook using a Spark for Python runtime. The code for extracting the data using Apache Spark can be generated by IBM Watson Studio, provided that there is a data connection and a data asset available for the data in the project. The connection to IBM Cloud Object Storage and a data asset in the weather data bucket must therefore be established first. Once this is done, it is quite easy to create a Spark SQL data frame, submit the query and get hold of the resulting aggregated data. The loading of the data into the PostgreSQL database will finally be done using the SQLAlchemy toolkit. You will consequently need to go through the following steps:

1. Create a connection to the COS weather data bucket.
2. Add a (random) data set in that bucket to your project as a data asset.
3. Download the notebook for aggregating weather observation data from GitHub.
4. Add Apache Spark Python code to the notebook to read the weather observations in your weather data bucket.
5. Run the commands for loading and transforming the weather data.
6. Add the URL endpoint (including credentials) for your instance of the PostgreSQL database.
7. Run the commands that save the aggregated data frame to a database table in PostgreSQL.

To create the data connection, do the following:

1. Select the command Add to project from the toolbar.
2. Select Connection in the 'Choose asset type' dialog.
3. Select Cloud Object Storage in the New Connection dialog.
4. In the New Connection dialog for IBM Cloud Object Storage:
   - Enter a name for the connection.
   - For the Login URL, paste in the endpoint for your weather data bucket (you can get this information in the endpoint section of the dashboard for your IBM Cloud Object Storage service, but be aware that you must select the correct resiliency and location settings).
   - For the API Key, paste in the API key that you created for the bucket in the previous recipe (you can find this information in the credentials section of your IBM Cloud Object Storage service, but be aware that you use the credentials that you created in this recipe; in our case we named the credentials 'WeatherDataWriter').
5. Click Create to create the new connection.

In IBM Watson Studio you should now be able to see the new connection in the list of assets. Continue by adding a connected data set to the project:

1. Select the command Add to project from the toolbar.
2. Select Connected data in the 'Choose asset type' dialog.
3. Click Select Source to select the source.
4. Expand the selection until you have selected one of the weather observation files.
5. Click Select to select the object file.
6. In the Add Asset from Connection dialog, enter a name for the asset (e.g. 'COSWeatherDataObservationFile').
7. Invoke the Select command.

In IBM Watson Studio you should now be able to see the new connected data set in the list of assets.
To create the notebook from GitHub, do the following:

1. Select the command Add to project from the toolbar.
2. Select Notebook in the 'Choose asset type' dialog.
3. Select From URL in the New Notebook dialog.
4. Copy the URL 'https://github.com/EinarKarlsen/weather-data-demo/blob/master/Weather%20Temperature%20ETL%20-%20Spark%20SQL.ipynb' to the URL field. This should set the name of the notebook as well ('Weather Temperature ETL – Spark SQL'). If not, enter a name for the notebook.
5. In the Select runtime field, select the runtime 'Python 3.5 with Spark'.
6. Click Create Notebook.

Once the notebook has been imported, you will initially need to insert commands for reading the weather observation data into the first cell of the notebook, which is empty.
Fortunately, this is supported by commands of IBM Watson Studio:

1. Select the first cell in the notebook.
2. Select the Find and add data command in the toolbar.
3. Select Insert to code > Insert SparkSession Setup for the data set containing the weather observations. This will insert all the code needed to read the single file.
4. Change the COS URL in the last line of cell 1 so that the URL designates all files in the weather data bucket. This basically means that you will need to replace the designator for a single object with '/*' (see screenshot above).
5. In the next cell, assign the variables 'cos', 'wsbucket' and 'df1' to the relevant values used in cell 1.
6. Run the first 3 cells to import the data and print out the schema.
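A hedged sketch of what the generated setup typically looks like, assuming the ibmos2spark helper that Watson Studio commonly generates for Spark runtimes; all credential values are placeholders, and 'sc' is the SparkContext predefined in the Spark kernel:

import ibmos2spark
from pyspark.sql import SparkSession

# placeholder credentials for the COS instance holding the weather data
credentials = {
    'endpoint': 'https://<cos-endpoint>',
    'api_key': '<api-key>',
    'service_id': '<service-id>',
    'iam_service_endpoint': 'https://iam.cloud.ibm.com/identity/token'
}
configuration_name = 'os_weather_data_config'
cos = ibmos2spark.CloudObjectStorage(sc, credentials, configuration_name, 'bluemix_cos')

spark = SparkSession.builder.getOrCreate()
wsbucket = 'weather-data-demo'
# '*' makes Spark read every object in the bucket instead of a single file
df1 = spark.read.json(cos.url('*', wsbucket))
df1.printSchema()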
The schema shows the various properties found in the files read. Notice that there may be corrupted records as well, e.g. files that cannot be interpreted as JSON. Next, the data will need to be aggregated and grouped by location, date and time of day to compute minimum, average and maximum temperatures:
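A hedged sketch of such an aggregation cell using Spark SQL; the temp view and variable names are arbitrary choices, not taken from the notebook:

df1.createOrReplaceTempView('weatherdata')

aggregated_df = spark.sql("""
    SELECT obs_name,
           to_date(from_unixtime(valid_time_gmt)) AS obs_date,
           day_ind,
           ROUND(MIN(temp), 1) AS mintemp,
           ROUND(AVG(temp), 1) AS avgtemp,
           ROUND(MAX(temp), 1) AS maxtemp
    FROM weatherdata
    WHERE class = 'observation'
    GROUP BY obs_name, to_date(from_unixtime(valid_time_gmt)), day_ind
    ORDER BY obs_name, obs_date, day_ind
""")
aggregated_df.show(5)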
Notice that the query needs to compute the date from a timestamp (as in section 4), but that the expression in Spark SQL is slightly different from the one used by IBM SQL Query.

Having transformed the data, we can store it in a database rather than in object storage. In this recipe we shall use PostgreSQL as the data warehouse. For more information on how to manage PostgreSQL on IBM Cloud, we refer to the online documentation titled 'About IBM Cloud Databases for PostgreSQL'.

There are several libraries that provide access to a PostgreSQL instance from Python. What is needed in this context is a library that can conveniently save a pandas data frame to a database table. How to do this is, for example, outlined in the YouTube video titled 'Pandas Tutorial 14: Read Write Data From Database (read_sql, to_sql)' and in the online documentation for pandas SQL support. Both use the SQLAlchemy library, which will also be used in this recipe. There is also an existing notebook in the IBM Watson community titled 'Access PostgreSQL with Python' that shows how to interact with PostgreSQL, but this notebook uses psycopg2.
The notebook contains cells for setting the PostgreSQL credentials and endpoint, followed by a cell that creates the engine. Saving to a database is done with the data frame operation named 'to_sql', which takes the database table name and the engine as parameters. Additional parameters then specify that the existing table content (if any) shall be replaced, that the data frame index shall be ignored when saving, and that the data shall be sent to PostgreSQL with a given chunk size. For testing purposes, we then follow up by reading the content of the database table and printing the first 5 records.

To get the notebook to run on your data, do the following:

1. In IBM Cloud, open up the dashboard for your PostgreSQL service.
2. Scroll down and copy the 'Public postgres endpoint' to your notebook and assign it to the variable 'postgreSQLEndPoint'. This URL contains placeholders for the user name and the password for the service.
3. Comment out the '?sslmode=verify-full' parameter at the end of the URL.
4. Select the Service credentials tab.
5. Open the credentials that you created in section 3 (or alternatively create new ones).
6. Copy the values of 'postgres.authentication.username' and 'postgres.authentication.password' and insert them into the endpoint URL in place of the placeholders.
7. Run cells 7-13, which load the data into PostgreSQL.
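Put together, the engine creation and save amount to something like the following. This is a hedged sketch; the endpoint, table name and chunk size are placeholders, and the aggregated Spark frame is converted to pandas before saving:

from sqlalchemy import create_engine
import pandas as pd

# placeholder endpoint assembled from the service credentials
postgreSQLEndPoint = 'postgresql://<username>:<password>@<host>:<port>/ibmclouddb'
engine = create_engine(postgreSQLEndPoint)

# convert the aggregated Spark data frame to pandas and write it to a table,
# replacing any existing content and ignoring the data frame index
pdf = aggregated_df.toPandas()
pdf.to_sql('weather_temperatures', engine, if_exists='replace', index=False, chunksize=1000)

# read back the first records as a sanity check
print(pd.read_sql('SELECT * FROM weather_temperatures', engine).head(5))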
Conclusion

In this recipe we have shown various ways of transforming weather data using IBM SQL Query, pandas and Apache Spark SQL (running under the hood of IBM Analytics Engine). IBM SQL Query has the advantage of being a lightweight, easy to use and cost-effective solution. It does not require a runtime cluster to be configured, which may consume resources even when idle, but adopts a paradigm more akin to that of Functions as a Service. Queries that are conveniently expressed in SQL are good use cases for IBM SQL Query, provided that the input and output formats are supported. Moreover, it is old news that some computational problems are very hard to express in SQL, whereas they can be easily formulated in procedural or functional languages. In such cases, pandas can conveniently be used as a lightweight and cost-effective alternative to IBM SQL Query, since it only requires a runtime system for Python. The combination of Python and pandas is a proven solution for data wrangling purposes, but with the limitation that it does not address big data use cases. So the fallback position, whether in the case of complex transformations, special input and output sources, or big data, would be to use Apache Spark, be it Apache Spark RDDs, data frames and/or Spark SQL. The disadvantage here is that a cluster of compute nodes must be available: unlike IBM SQL Query, you will need to provision a cluster first and then run the Spark jobs once this has been done. There is a free plan available for IBM Analytics Engine that can be used for a limited time for test purposes, but in general one must expect, in the context of cloud offerings, that a cost will occur for the compute and storage resources set aside for the cluster, whether it is being used or not.

The lessons learned from this recipe also have an impact on the architecture of the solution. It would actually be relevant to go back to IBM Functions and change the definition of the sequences dealing with multi-line JSON data structures from The Weather Company, by including actions that transform the ingested data before it is sent to IBM Event Streams for storage in IBM Cloud Object Storage. If this is carefully done, the data would be stored in IBM Cloud Object Storage in the right format, IBM SQL Query could work directly on the data, and there would be no need to transform it using e.g. notebooks and pandas. That, of course, would mean changing the approach from an ELT approach to one looking more like ETL. It is also worth mentioning that it is possible to get IBM SQL Query to work with IBM Functions, so that queries are executed in the context of actions. For more information, please see the 'SQL Query Package' project on GitHub. This implies another relevant change to the architecture: the aggregation jobs currently implemented with IBM SQL Query could be triggered using IBM Functions.

Acknowledgement

The Jupyter notebook named 'IBM Cloud SQL Query' represents work by Torsten Steinbach from the IBM Cloud Data Platform team. I would also like to thank IBM Cloud support (Witchaya Keosung and Daniel Pittner) for pointing me to the workarounds for transforming lightning data from a multi-line to a single-line JSON file using e.g. jq.