Big Insights v4.1

Contents
BigInsights 4.1.0 1
BigInsights 3
Accessibility 3
Using keyboard shortcuts and accelerators 4
Terms and Conditions 7
Notices 8
Product overview 13
BigInsights information roadmap 14
Introduction to BigInsights 18
IBM Open Platform with Apache Hadoop and the BigInsights value-add services 20
IBM BigInsights Quick Start Edition for Non-Production Environments 21
Features and architecture 22
File systems 24
Hadoop Distributed File System 25
MapReduce frameworks 26
Hadoop MapReduce 27
Yarn 28
Open source technologies 29
Ambari 32
Flume 34
Hadoop 36
HBase 38
Hive 40
Kafka 43
Oozie 44
Pig 45
Slider 47
Solr 49
Spark 50
Sqoop 53
ZooKeeper 54
Other Apache projects 55
Text Analytics 56
IBM Big SQL 57
Integration with other IBM products 60
Suggested services layout for IBM Open Platform with Apache Hadoop and
BigInsights value-added services 62
Scenarios for working with big data 63
Predictive modeling 64
Consumer sentiment insight 66
Research and business development 67
Where BigInsights fits in an enterprise data architecture 69
Release notes 70
Release notes - IBM Open Platform with Apache Hadoop, and BigInsights value-
add services, Version 4.1 71
What's new for Version 4.1 75

Installing 80
Installing IBM Open Platform with Apache Hadoop 81
Important installation information 83
Preparing to install IBM Open Platform with Apache Hadoop 86
Reference Architecture 88
BigInsights value-added services 89
Meeting minimum system requirements 91
Collect host names and other information for your installation 92
Preparing your environment 93
Users and groups for IBM Open Platform with Apache Hadoop 102
Default Ports created by a typical installation 104
Setting up port forwarding from private to edge nodes 106
Configuring your browser 108
Configuring authentication 109
Configuring LDAP authentication on RHEL 6 110
Obtaining software for the IBM Open Platform with Apache Hadoop 113
Downloading the IBM repository definition for the IBM Open Platform with
Apache Hadoop 115
Creating a mirror repository for the IBM Open Platform with Apache Hadoop
software 116
Creating a repository for Spectrum Scale 118
Installing IBM Open Platform with Apache Hadoop on SUSE Linux Enterprise
Server (SLES) by using tar files 120
Running the installation package 122
Validating your installation 131
Upgrading the Java (JDK) version 132
Installing and configuring HttpFS on IBM Open Platform with Apache Hadoop 134
Installing and configuring WebHDFS with Knox on IBM Open Platform with Apache
Hadoop 137
Installing additional services in your IBM Open Platform 139
Cleaning up nodes before reinstalling software 140
HostCleanup.ini file 144
HostCleanup_Custom_Actions.ini file 146
Advanced installation planning 147
Serviceability tools 148
Directories created when installing IBM Open Platform with Apache Hadoop150
Planning for high availability 152
Configuring Slider HBase 153
Setting up a dual network for IBM Open Platform with Apache Hadoop 157
Enabling CMX compression to compress the output of the intermediate jobs
generated by Pig 159
Configuring YARN container execution 160
Installing the IBM BigInsights value-added services on IBM Open Platform with Apache
Hadoop 162
Preparing to install the BigInsights value-add services 164
Users, groups, and ports for BigInsights value-add services 169

Default Ports created by a typical BigInsights value-add services installation 171
Obtaining the BigInsights value-add services 172
Additional related software 175
Obtaining the BigInsights Quick Start Edition for non-production use 177
Installing the BigInsights value-add packages 178
Installing BigInsights Home 185
Installing the BigSheets service 187
Installing the BigInsights - Big SQL service 190
Preinstallation checker utility for Big SQL 194
Clean-up utility for Big SQL 198
Migrating from Big SQL 1.0 200
Installing the BigInsights - Data Server Manager service 203
Installing the Text Analytics service 207
Installing the Big R service 211
Installing Big R on a workstation or notebook 217
Enabling Knox for value-add services 218
Removing BigInsights value-add services 222
Installing the Enterprise Manager module for IBM Open Platform with Apache Hadoop
226
Acquiring Spectrum Scale and Platform Symphony 227
Spectrum Scale FPO (GPFS) 228
Installing Spectrum Scale FPO (GPFS) 229
Performance analysis for GPFS 230
Platform Symphony 232
Installing and configuring Platform Symphony 234
Installing the IBM Platform Symphony service stack on the Ambari server
235
Adding Platform Symphony as an Ambari service 237
Post-installation checks 239
Running a service check on the Platform Symphony cluster 240
Verifying component installation and configuration 241
Cleaning up an incomplete Symphony deployment from Ambari 242
Post-installation configuration 244
Mapping Hadoop YARN queues to Symphony YARN queues 245
Granting MySQL access privileges 248
Updating the Ambari GUI manually after integrating Symphony 249
Changing port numbers 251
Configuring master failover in an Ambari environment 252
Configuring automatic failure recovery for the Symphony YARN Resource
Manager in an Ambari environment 253
Opening the Platform Management Console from the Ambari console 254
Controlling hosts with Platform Symphony 255
Integrating open source MapReduce, YARN, or Spark in a Symphony cluster
256
Undoing integration of open source MapReduce, YARN, or Spark in a
Symphony cluster 257
Upgrading IBM Platform Symphony to IBM Open Platform 4.1 258

Getting started with Spark on EGO 261
Submitting a sample Spark job 263
Spark on EGO FAQs 265
Installing the free IBM Open Platform with Apache Hadoop and BigInsights Quick Start
Edition, non-production software 266
IBM Open Platform with Apache Hadoop and IBM BigInsights Quick Start Edition
for non-production environments, v4.1: Docker image README 267
Optional: Get nodes on a cloud environment 276
IBM BigInsights Quick Start Edition for Non-Production Environment, v4.1: VMware
image README 279
Tutorials 284
Tutorial: Analyzing big data with BigSheets 285
Lesson 1: Creating master workbooks from social media data 287
Lesson 2: Tailoring your data by creating child workbooks 290
Lesson 3: Combining the data from two workbooks 293
Lesson 4: Creating columns by grouping data 297
Lesson 5: Viewing data in BigSheets diagrams 299
Lesson 6: Visualizing and refining the results in charts 300
Lesson 7: Exporting data from your workbooks 306
Summary of analyzing data with BigSheets tutorial 308
Tutorial: Developing Big SQL queries to analyze big data 309
Setting up the Big SQL tutorial environment 311
Creating a directory in the distributed file system to hold your samples 312
Getting the sample data 313
Accessing the Big SQL sample data installed in the Big SQL service 314
Downloading sample data from a developerWorks source 316
Creating tables and loading sample data 317
Module 1: Creating and running SQL script files 320
Lesson 1.1: Creating an SQL script file 322
Lesson 1.2: Creating and running a simple query to begin data analysis 323
Lesson 1.3: Creating a view that represents the inventory shipped by branch
327
Lesson 1.4: Analyzing products and market trends with Big SQL Joins and
Predicates 329
Lesson 1.5: Creating advanced Big SQL queries that include common table
expressions, aggregate functions, and ranking 334
Lesson 1.6: Advanced: Porting an existing Hive UDF to Big SQL 337
Lesson 1.7: Advanced: Creating and running a simple Big SQL query from a
JDBC client application 339
Module 2: Analyzing big data by using Big SQL and BigSheets 343
Lesson 2.1: Preparing queries to export to BigSheets that examine the results
of sales by year 345
Lesson 2.2: Exporting Big SQL data about total sales by year to BigSheets 347
Lesson 2.3: Creating tables for BigSheets from other tables 349
Lesson 2.4: Exporting BigSheets data about IBM Watson blogs to Big SQL
tables 351
Task 2.4.1: Creating and modifying a BigSheets workbook from JSON
Array formatted data 352

Task 2.4.2: Exporting the BigSheets blog data workbook to a TSV file 354
Task 2.4.3: Creating a Big SQL script that creates Big SQL tables from the
exported TSV file 355
Task 2.4.4: Exporting the BigSheets workbook as a JSON Array for use
with a SerDe application in Big SQL 357
Task 2.4.5: Creating a Big SQL table that uses the SerDe application to
process the Watson blog data 359
Module 3: Analyzing Big SQL data in a client spreadsheet program 361
Lesson 3.1: Installing the IBM Data Server Driver Package for the client ODBC
drivers 362
Lesson 3.2: Importing Big SQL data to a client spreadsheet program 365
Module 4: Using Federation in Big SQL 367
Lesson 4.1: Setting up the client data source 368
Lesson 4.2: Configuring Big SQL as the Federated Server 370
Lesson 4.3: Integrating information from two companies 373
Module 5: Working with HBase tables 375
Lesson 5.1: Creating and populating HBase tables 376
Lesson 5.2: Creating and using HBase views 378
Summary of developing Big SQL queries to analyze big data 379
Tutorial: Analyzing big data with Big R 380
Lesson 1: Uploading the airline data set to the BigInsights server with Big R 382
Lesson 2: Exploring the structure of the data set with Big R 384
Lesson 3: Analyzing data with Big R 385
Lesson 4: Visualizing big data with Big R 387
Lesson 5: Extending R packages to work with big data 389
Lesson 6: Building scalable machine learning models with Big R 392
Summary of analyzing data with Big R tutorial 395
Tutorial: Analyzing text with BigInsights Text Analytics 396
Lesson 1: Setting up your project 398
Lesson 2: Selecting input documents and identifying examples 400
Lesson 3: Creating and testing extractors 403
Lesson 4: Writing and testing extractors for candidates 406
Lesson 5: Creating and testing final extractors 411
Lesson 6: Finalizing and saving the extractor 413
Summary of creating your first Text Analytics extractor 414
Importing and exporting data 415
Identifying your data resources 417
Recommended tools for importing data 420
Importing data at rest 421
Importing data by using Hadoop shell commands 423
Importing data in motion 425
Importing data by using Flume 427
Importing data from a data warehouse 428
Big SQL LOAD 430
How to diagnose and correct LOAD HADOOP problems 460
How to monitor the progress of LOAD HADOOP 463
Importing and exporting DB2 data by using Sqoop 464

Importing IMS data by using Sqoop 466
Integrating the Teradata Connector for Hadoop 468
Installing the Teradata Connector for Hadoop 469
Importing data with the Teradata Connector for Hadoop 471
Exporting data with the Teradata Connector for Hadoop 475
Running tdimport or tdexport from an Oozie application 479
Corresponding Sqoop and Teradata options 482
Setting up and administering security 484
Securing IBM Open Platform with Apache Hadoop 485
Setting up HTTPS with a self-signed certificate for the Ambari web interface 487
Setting up HTTPS with an authority certificate for the Ambari web interface 489
Setting up two-way SSL between the Ambari server and Ambari agents 491
Manually configuring SSL support for HBase REST gateway with Knox 492
Manually configuring SSL support for HBase, MapReduce, YARN, and HDFS web
interfaces 495
Manually configuring SSL support for HiveServer2 499
Manually configuring SSL support for Oozie 501
Apache Knox gateway overview 503
Hadoop service access in Knox 506
Knox Gateway directories 509
Knox Gateway samples 511
Changing the Knox Gateway port or path 513
Managing the master secret 514
Redeploying cluster topologies 515
Manually starting and stopping Apache Knox 518
Adding a new service to the Knox Gateway 519
Cluster topology definition in Apache Knox Gateway 520
Knox topology configuration to connect to Hadoop cluster services 522
Setting up Hadoop service URLs 523
Example: service definitions 524
Service connectivity validation 526
Configuring authentication on Knox and Ambari 528
Setting up LDAP authentication in Knox 529
Example: Active Directory configuration 531
Example: OpenLDAP configuration 532
Setting up LDAP or Active Directory authentication in Ambari 533
Knox Gateway Identity Assertion 537
Defining an identify-assertion provider 539
Adding a user mapping rule to an identity-assertion provider: 540
Concat Identity Assertion 541
User Mapping Example 542
Configuring group mapping 543
Knox Gateway security 545
Implementing web application security 546
Configuring Knox with a secured Hadoop cluster 548
Configuring wire encryption (SSL) 550
Using CA-signed certificates for production 551

Kerberos in IBM Open Platform with Apache Hadoop 552
Overview of Kerberos in IBM Open Platform with Apache Hadoop 553
Setting up Kerberos for IBM Open Platform with Apache Hadoop 556
Setting up a KDC manually 561
Manually generating keytabs for Kerberos authentication 563
Properties in the Kerberos descriptors 570
Enabling SPNEGO authentication for IBM Open Platform with Apache Hadoop
572
User and group management 573
Changing the administrator account password 575
Creating a local user 576
Changing the password of a local user 577
Deleting a local user 578
Creating a local group 579
Managing local group membership 580
Deleting a local group 581
Enabling transparent data encryption 582
Securing the BigInsights value-added services 585
Restarting Knox to access value added components 586
Setting up Kerberos for the BigInsights - Big R service 587
Setting up Kerberos for the BigInsights - Big SQL service 588
Setting up Kerberos for the BigInsights - Text Analytics service 590
Setting up Kerberos for the BigInsights - BigSheets service 591
Understanding and configuring BigSheets access of Big SQL table data 593
Enabling SSL encryption for the Big R service 595
Managing Access Control Lists (ACL) and Authorizations 597
ACL Management for Hive 598
Storage-Based Authorization 600
SQL-Standard Based Authorization 603
ACL Management for HDFS 607
ACL Management for HBase 609
ACL Management for YARN 612
Administering Ambari and components 613
High availability 614
Setting up NameNode high availability 615
Pointing to a new NameNode location in the Hive metastore 618
Setting up Oozie high availability 621
Setting up Resource Manager high availability 624
Enabling work-preserving ResourceManager restart 625
Enabling Hive metastore high availability 627
Enabling HiveServer2 high availability 629
High availability in Big SQL 632
Enabling Big SQL high availability 635
Disabling Big SQL high availability 638
Example of configuring clients to work with Big SQL high availability 639
Configuring high availability for HBase 641
Managing Flume 642

Flume configuration scenario 644
Decommissioning slave nodes 646
Ambari alerts and monitoring services 649
Working with alerts in the Ambari web interface 652
Configuring notifications 654
Creating or editing alert groups 657
Creating alert notification to track status changes in high availability failover 659
Pre-defined alerts in Ambari 663
HDFS service alerts 664
NameNode High Availability service alerts 667
YARN service alerts 669
MapReduce2 service alerts 671
HBase service alerts 672
Hive service alerts 674
Oozie servie alerts 675
ZooKeeper service alerts 676
Ambari alerts 677
Ambari metrics 678
Working with widgets 682
Working with service specific widgets 685
How to switch the Ambari metrics system to a distributed mode 689
Ambari views 691
Creating a Capacity-Scheduler view instance 692
Creating a Files view instance 694
Creating a Hive view instance 697
Additional Hive view configurations and setup 701
Creating a Pig view instance 703
Creating a Slider view instance 707
Developing applications to access and manage data 709
Developing Big SQL applications in your Hadoop environment 710
What's New with the Big SQL server 712
Big SQL configuration and log management 714
JSqsh client 715
Big SQL connections 716
Security in Big SQL 717
HBase tables 718
Data types that are supported by Big SQL 719
Big SQL Catalog schema 721
What you need to know before writing Big SQL applications 722
File formats supported by Big SQL 723
User-defined external scalar functions in Big SQL 728
Transactional behavior of Hadoop tables 731
Transactional behavior of CREATE TABLE ... AS in Big SQL 732
Transactional behavior of INSERT in Big SQL 733
Transactional behavior of LOAD HADOOP USING 734
Understanding data types 735
Data types migrated from Hive applications 736

Data types that are supported by Big SQL 737
How Hive handles NULL values on a partitioning column of type String 743
How to work with HBase tables 745
Security considerations when Big SQL accesses HBase objects 755
HDFS caching 757
Memory calculator worksheet 764
Developing routines 767
Routines 769
Overview of routines 770
Benefits of using routines 772
Types of routines 774
Built-in and user-defined routines 777
Built-in 778
User-defined 779
Comparison of user-defined and built-in routines 781
Choosing to use built-in or user-defined routines 783
Functional types of routines 784
Procedures 786
Functions 788
Scalar functions 790
Row functions 792
Table functions 793
Methods 794
Comparison of routine functional types 795
Choosing a routine functional type 798
Implementations of routines 800
Built-in routines 802
Sourced routines 803
SQL routines 804
External routines 805
Supported APIs and programming languages 807
Comparison of APIs and programming languages 808
Comparison of routine implementations 812
Choosing a routine implementation 815
Usage of routines 817
Administering databases with built-in routines 818
Extension of SQL function support with user-defined functions 820
Auditing using SQL table functions 821
Tools for developing routines 824
IBM Data Studio routine development support 825
SQL statements that can be executed in routines and triggers 826
SQL access levels 833
Determining what SQL statements can be executed in routines 835
Portability of routines 837
Interoperability of routines 838
Performance of routines 839
Security of routines 847

Securing routines 849
Authorizations and binding of routines that contain SQL 851
Data conflicts when procedures read from or write to tables 855
Debugging compiled SQL PL objects overview 857
External routines 858
External routine features 860
External function and method features 862
Scalar user-defined functions 864
External scalar function and method invocation 866
External table functions 867
External table function processing 868
Generic table functions 870
Using generic table functions 871
Java table function execution model 873
Scratchpads for external functions and external methods 875
Scratchpads for 32-bit and 64-bit operating systems 879
SQL in external routines 881
Parameter styles for external routines 884
Parameter handling 890
Supported routine programming languages 892
Comparison of APIs and programming languages 894
Performance considerations for developing routines 894
Security considerations for routines 897
Routine code page considerations 900
Application and routine support 901
32-bit and 64-bit support for external routines 903
Performance of 32-bit routines in 64-bit environments 904
XML data type support 905
Restrictions on external routines 907
Creating external routines 910
Writing routines 913
Debugging routines 915
Library and class management considerations 917
Deployment of routine library or class files 919
Security of external routine library or class files 921
Resolution of external routine library or class files 922
Modifications to external routine library or class files 923
Backup and restore of external routine library and class files 924
Performance and library management 925
C and C++ routines 926
Supported software (C) 928
Supported software (C++) 929
Tools for developing C and C++ routines 930
Designing C and C++ routines 931
Include file required for C and C++ routine development 933
Parameters in C and C++ routines 935
Parameter styles supported 937

Parameter null indicators 938
Parameter style SQL C and C++ procedures 939
Parameter style SQL C and C++ functions 943
Passing parameters by value and by reference 946
Parameters not required for result sets 947
The dbinfo structure routine parameter 948
Scratchpad as function parameter 952
Program type MAIN support for procedures 954
SQL data type representation 956
SQL data type handling 960
How to pass arguments to C routines 969
Graphic host variables 981
C++ type decoration 982
Returning result sets from procedures 984
Creating C and C++ routines 986
Building C and C++ routine code 989
Building C and C++ routine code using the sample bldrtn script 990
Building routines in C or C++ using the sample build script (UNIX)
992
Building C/C++ routines on Windows 992
Building C and C++ routine code from the command line 992
Compile and link options for C and C++ routines 994
AIX C routine compile and link options 995
AIX C++ routine compile and link options 995
HP-UX C routine compile and link options 995
HP-UX C++ routine compile and link options 995
Linux C routine compile and link options 995
Linux C++ routine compile and link options 995
Solaris C routine compile and link options 995
Solaris C++ routine compile and link options 995
Windows C and C++ routine compile and link options 995
Rebuilding routine shared libraries 995
Updating the database manager configuration parameters 996
COBOL procedures 997
Supported software 1000
Supported SQL data types in COBOL embedded SQL applications 1001
Building COBOL routines 1001
Compile and link options for COBOL routines 1002
AIX IBM COBOL routine compile and link options 1003
AIX Micro Focus COBOL routine compile and link options 1003
HP-UX Micro Focus COBOL routine compile and link options 1003
Solaris Micro Focus COBOL routine compile and link options 1003
Linux Micro Focus COBOL routine compile and link options 1003
Windows IBM COBOL routine compile and link options 1003
Windows Micro Focus COBOL routine compile and link options 1003
Building IBM COBOL routines on AIX 1003
Building UNIX Micro Focus COBOL routines 1003

Building IBM COBOL routines on Windows 1003
Building Micro Focus COBOL routines on Windows 1003
Java routines 1003
Supported software 1005
JDBC and SQLJ API support 1006
Specifying JDK for Java routine development (Linux and UNIX) 1007
Specification of a driver for Java routines 1009
Tools for developing Java routines 1010
Designing Java routines 1011
SQL data type representation 1013
Connection contexts in SQLJ routines 1015
Parameters in Java routines 1016
Parameter style JAVA procedures 1018
Parameter style JAVA functions 1020
Parameter style HIVE functions 1021
Supported SQL data types in HIVE routines 1024
Parameter style DB2GENERAL routines 1025
DB2GENERAL UDFs 1026
Supported SQL data types in DB2GENERAL routines 1029
Java classes for DB2GENERAL routines 1031
DB2GENERAL Java class: COM.ibm.db2.app.StoredProc
1032
DB2GENERAL Java class: COM.ibm.db2.app.UDF 1034
DB2GENERAL Java class: COM.ibm.db2.app.Lob 1037
DB2GENERAL Java class: COM.ibm.db2.app.Blob 1038
DB2GENERAL Java class: COM.ibm.db2.app.Clob 1039
Passing parameters of data type ARRAY to Java routines 1040
Returning result sets from Java (JDBC) procedures 1042
Returning result sets from Java (SQLJ) procedures 1043
Retrieving procedure result sets in Java (JDBC) applications and
procedures 1044
Retrieving procedure result sets in Java (SQLJ) applications and
procedures 1046
Restrictions on Java routines 1048
Java table function execution model 1050
Creating Java routines 1050
Creating Java routines from the command line 1052
Building Java routine code 1055
Building JDBC routines 1056
Building SQLJ routines 1058
Compile and link options for Java (SQLJ) routines 1059
SQLJ routine options for Linux and UNIX 1060
Deploying Java routines 1061
JAR file administration 1063
Updating Java routines 1065
Examples of Java (JDBC) routines 1067
Example: Array data type in Java (JDBC) procedure 1068

Example: XML and XQuery support in Java (JDBC) procedure 1069
Invoking routines 1074
Authorizations and binding of routines that contain SQL 1077
Routine names and paths 1077
Nested routine invocations 1079
Invoking 32-bit routines on a 64-bit database server 1080
References to procedures 1081
Calling procedures 1082
Calling procedures from applications or external routines 1084
Calling procedures from triggers or SQL routines 1086
Calling stored procedures from the CLP 1089
Calling stored procedures from CLI applications 1092
Calling stored procedures with array parameters from CLI
applications 1092
Procedure result sets 1092
Result sets from SQL data changes 1094
Result sets from SQL data changes using cursors 1097
References to functions 1098
Function selection 1100
Distinct types as UDF or method parameters 1102
LOB values as UDF parameters 1103
Invoking scalar functions or methods 1104
Invoking user-defined table functions 1106
Analyzing big data by using BigInsights value-added services 1108
Analyzing data with BigSheets 1109
Overview of BigSheets 1110
Workbooks and sheets 1112
Sheet types 1114
Group sheet 1116
Building sets of data 1120
Creating master workbooks from catalog tables 1121
Creating workbooks from existing workbooks 1122
Changing a column data type in a master workbook 1123
Data types 1124
Changing the data source for master workbooks 1125
Copying workbooks to a new cluster 1126
Exporting workbook metadata 1127
Importing workbook metadata 1128
Discovering data 1129
Adding columns in sheets 1130
Modifying columns in sheets 1131
Adding sheets to workbooks 1132
Viewing related sheets 1133
Viewing related workbooks 1134
Changing the data reader for workbooks 1135
Data readers 1136
Running workbooks 1140

Visualizing your result data in charts and maps 1141
Chart and map types 1142
Understanding null behavior 1145
Deleting workbooks 1146
Restoring deleted workbooks 1147
Purging deleted workbooks 1148
Formulas 1149
Functions 1150
Conditional functions 1152
DateTime functions 1157
Pattern syntax for custom DateTime formats 1165
Entity functions 1167
HTML and XML functions 1171
Math functions 1177
Geospatial functions 1182
Statistical functions 1190
Selection functions 1193
Text functions 1195
Text comparison functions 1214
URL functions 1219
Formula examples 1225
Sharing data 1227
Exporting data from a workbook 1228
Sharing workbooks with other users 1229
Creating and deleting catalog tables 1230
Extending BigSheets 1232
Administering BigSheets by using REST APIs 1233
Creating BigSheets plug-ins 1243
Building customized functions 1246
Building customized readers 1251
Building customized charts 1257
Uploading custom plug-ins to BigSheets 1261
Analyzing big data with Text Analytics 1262
Developing extractors in the web tool 1263
Information Extraction Web Tool 1264
Designing text extraction projects 1265
Design your project 1267
What are the provided extractors? 1269
Linguistic support 1270
Case study: Extracting insights from financial documents 1271
Create the extractor 1272
Refine the results 1275
Manage projects and extractors 1276
Use the workspace 1277
Manage projects 1278
Adding and removing sample documents 1279
Document size limitations 1283

Manage extractors 1285
Create, edit, and combine extractors 1287
Define dictionaries 1289
Define a list 1290
Define a mapping table 1291
Define a literal 1292
Define regular expressions 1293
Define sequence patterns 1295
Add proximity rules 1297
Define unions of extractors 1299
Run an extractor and refine results 1301
Refine results 1303
Eliminate duplicate and overlapping results 1306
Refine results using filters 1308
Export refined extractor results 1310
Extract in languages other than English 1312
Extend the provided extractors 1314
Define new extractors based on linguistic patterns 1316
Custom extractors 1317
Exporting Extractors 1318
Exporting extractors to AQL 1319
Exporting extractors as map/reduce jobs 1321
Exporting extractors as BigSheets functions 1323
Developing Text Analytics extractors using Annotation Query Language (AQL)1326
Annotation Query Language (AQL) 1328
Extractors 1329
Modules 1331
Scenarios that illustrate modules 1334
Best practices for developing modules 1342
AQL files 1343
Views 1345
Dictionaries 1347
Tables 1348
Functions 1349
Pre-built extractor libraries 1350
Named entity extractors 1354
Financial extractors 1361
Generic extractors 1372
Other extractors 1376
Sentiment extractors 1379
Machine Data Analytics extractors 1382
Base modules 1485
Guidelines for writing AQL 1487
Using basic feature AQL statements 1491
Using candidate generation AQL statements 1493
Using filter and consolidate AQL statements 1496
Creating complex AQL statements 1499

Enhancing content of AQL views 1502
Using naming conventions 1505
Data collection formats 1507
UTF-8 encoded text files 1508
UTF-8 encoded CSV files 1510
UTF-8 encoded JSON files in Hadoop text input format 1512
Multilingual support for Text Analytics 1515
Text Analytics Optimizer 1517
Execution plan 1522
Operators 1525
Relational operators 1526
Span aggregation operators 1528
Span extraction operators 1529
Specialized operators 1531
Tokenization 1532
Running Text Analytics extractors 1535
Run extractors on distributed files from the web tool 1536
Running extractors with the Java Text Analytics APIs 1538
Reading document collections with the DocReader API 1550
Text Analytics URI formats 1553
Improving extractor performance 1556
Know your performance requirements 1557
Follow the guidelines for regular expressions 1558
Use the consolidate clause wisely 1563
Using external resources in Text Analytics 1566
Analyzing and manipulating big data with Big SQL 1568
Configuring security for Big SQL 1570
Enabling authentication for Big SQL 1572
Authorization of Big SQL objects 1574
Database authorization 1577
Enabling SSL (Secure Socket Layer) encryption 1581
Default privileges granted on the bigsql database 1582
Configuration parameters 1584
Big SQL architecture 1587
Managing the Big SQL server 1589
Configuring the IBM Big SQL server 1590
Big SQL Scheduler 1593
Big SQL Input/Output 1595
JDBC and ODBC drivers 1596
JDBC driver 1597
ODBC driver for Linux 1599
ODBC driver for Windows 1601
Connecting to the Big SQL server that is part of the Big SQL service 1603
Downloading and Installing IBM Data Studio 1604
Creating or changing a JDBC driver definition 1605
Connecting to a Big SQL server 1606
Analyzing data with Big SQL 1608

How to run Big SQL queries 1610
Running Big SQL queries with Big SQL monitoring and edit tool 1611
Java SQL Shell (JSqsh) 1612
Big SQL statistics 1616
Statistics gathered from expression columns in statistical views 1617
Extending Big SQL 1619
Working with Hive ACID tables in Big SQL 1622
LOAD performance guidelines 1625
Tuning HBase performance 1627
HBase basics 1628
General HBase tuning 1631
Major compaction and data locality 1634
Hints for designing HBase tables 1636
Mapping data types to fields in HBase 1639
Hints for designing indexes 1642
Properties that can optimize HBase table scans 1645
Hints for optimizing LOAD 1646
Monitoring Big SQL in the IBM Open Platform with Apache Hadoop environment
1649
Monitoring metrics with the Big SQL query interface 1650
Monitoring the cluster status of your Big SQL queries 1653
Analyzing data with Big R 1654
Overview of Big R 1655
Connecting to a data set with Big R 1656
Running Big R scripts 1657
Troubleshooting and support 1658
Resolving problems with BigInsights 1659
Logging 1660
Logs and their locations 1661
Problems and workarounds 1664
Installation 1665
Unable to open Ambari browser 1666
Installing IBM Open Platform with Apache Hadoop does not complete
successfully because of connection issues 1667
Starting the Ambari server on Linux Power operating systems fails due to
connection error when the number of cores in the machine is greater than
48 1668
Components and value-add services 1669
Cannot stop all BigInsights value-add services from web interface 1671
Oozie: Cannot start Oozie service - ERROR XSDB6 1672
Restart all Hive services does not start all services correctly 1673
Failed to get schema version when starting Hive Metastore Service 1674
Adding additional Kafka Brokers after the initial installation might result in
an error when starting the broker 1675
Running Kafka producer with localhost generates error 1676
Continual "Hive Metastore Process" alerts showing in the Ambari web
interface even when process is running on pLinux 1678

After Kerberos is enabled, Ambari Quick Links might no longer work 1679
Ambari is unable to create a Files view for a cluster after Kerberos is
enabled 1680
Spark Thrift Server (1.5.1) goes down when Kerberos is enabled on the
cluster 1681
Restarting the Solr service might fail 1682
Big SQL 1683
Hive and Big SQL catalogs inconsistent 1685
TPCIP ports in FIN_WAIT1 state 1687
JSqsh Big SQL connection profile points to the wrong Big SQL service port
1688
Command fails with authorization error after successfully connecting to a
Big SQL server 1689
Installing the Big SQL service failed because of a tty requirement 1690
Cannot decommission a dead Big SQL worker node 1691
How to delete a Big SQL worker node 1693
Big SQL monitoring utility fails to install 1695
Big SQL monitoring utility (DSM) is not configured properly because of a
packaging error 1696
Uninstalling the BigInsights - Big SQL service does not completely uninstall
components 1697
Cannot use Hive directly to work with Big SQL HBase tables 1698
Big SQL instance owner does not exist in the server that hosts the
NameNode service causing Hadoop operations to fail 1699
Big SQL authorization errors after switching to an HDFS standby
NameNode 1700
How to remove a faulty node from the Big SQL service 1701
Interrupted operations can cause metadata inconsistency 1703
The Big SQL scheduler can report an incorrect number of nodes in a
Spectrum Scale FPO (GPFS) environment, affecting the Big SQL plan
quality 1704
Text Analytics 1705
Stopping and starting the Hive service that is installed on the same node as
BigInsights - Text Analytics might result in a failure. 1706
BigSheets 1707
Workbook names must be unique 1708
Workbook run does not progress 1709
BigSheets reader cannot view data 1710
BigSheets service start fails in a Kerberos environment 1712
Area charts hide data sets 1713
BigSheets hangs when you remove columns 1714
BigInteger columns cannot be Y axis 1715
Data missing from processed results 1716
Unable to create a table from BigSheets 1718
BigInsights Home service 1719
After installing the BigInsights Home service, other services will not start
1720

Big R problems and workarounds 1721
Removing Big R is not always successful 1722
General troubleshooting techniques and resources 1723
Subscribing to IBM Support updates 1724
Searching knowledge bases 1726
Getting fixes from Fix Central 1727
Contacting IBM Support 1729
Exchanging information with IBM 1731
Reference 1733
IBM Big SQL reference 1734
IBM Big SQL Reference 1735
How to read the syntax diagrams 1736
Conventions used for the SQL topics 1739
Error conditions 1740
Highlighting conventions 1741
Conventions describing Unicode data 1742
Language elements 1743
Characters 1744
Tokens 1746
Identifiers 1748
Data types 1777
Data type list 1779
Numbers 1780
Character strings 1783
Graphic strings 1788
National character strings 1790
Binary strings 1791
Large objects (LOBs) 1792
Datetime values 1794
Boolean values 1798
Cursor values 1799
XML values 1800
Array values 1801
Row values 1804
Anchored types 1806
User-defined types 1807
Promotion of data types 1811
Casting between data types 1813
Assignments and comparisons 1822
Rules for result data types 1842
Rules for string conversions 1849
String comparisons in a Unicode database 1851
Resolving the anchor object for an anchored type 1853
Resolving the anchor object for an anchored row type 1855
Database partition-compatible data types 1857
Constants 1859
Special registers 1865

CURRENT CLIENT_ACCTNG 1868
CURRENT CLIENT_APPLNAME 1869
CURRENT CLIENT_USERID 1870
CURRENT CLIENT_WRKSTNNAME 1871
CURRENT DATE 1872
CURRENT DBPARTITIONNUM 1873
CURRENT DECFLOAT ROUNDING MODE 1874
CURRENT DEFAULT TRANSFORM GROUP 1875
CURRENT DEGREE 1876
CURRENT EXPLAIN MODE 1877
CURRENT EXPLAIN SNAPSHOT 1879
CURRENT FEDERATED ASYNCHRONY 1880
CURRENT IMPLICIT XMLPARSE OPTION 1881
CURRENT ISOLATION 1882
CURRENT LOCALE LC_MESSAGES 1883
CURRENT LOCALE LC_TIME 1884
CURRENT LOCK TIMEOUT 1885
CURRENT MAINTAINED TABLE TYPES FOR OPTIMIZATION 1886
CURRENT MDC ROLLOUT MODE 1887
CURRENT MEMBER 1888
CURRENT OPTIMIZATION PROFILE 1889
CURRENT PACKAGE PATH 1890
CURRENT PATH 1891
CURRENT QUERY OPTIMIZATION 1892
CURRENT REFRESH AGE 1893
CURRENT SCHEMA 1894
CURRENT SERVER 1895
CURRENT SQL_CCFLAGS 1896
CURRENT TEMPORAL BUSINESS_TIME 1897
CURRENT TEMPORAL SYSTEM_TIME 1899
CURRENT TIME 1901
CURRENT TIMESTAMP 1902
CURRENT TIMEZONE 1904
CURRENT USER 1905
SESSION_USER 1906
SYSTEM_USER 1907
USER 1908
Global variables 1909
Types of global variables 1910
Authorization required for global variables 1912
Resolution of global variable references 1913
Using global variables 1915
Functions 1917
Methods 1933
Conservative binding semantics 1941
Expressions 1944
Datetime operations and durations 1957

CASE expression 1963
CAST specification 1966
Field reference 1972
XMLCAST specification 1974
ARRAY element specification 1976
Array constructor 1977
Dereference operation 1979
Method invocation 1981
OLAP specification 1983
ROW CHANGE expression 1998
Sequence reference 2000
Subtype treatment 2004
Determining data types of untyped expressions 2005
Row expression 2012
Predicates 2014
Search conditions 2015
Basic predicate 2018
Quantified predicate 2021
ARRAY_EXISTS predicate 2024
BETWEEN predicate 2025
Cursor predicates 2026
EXISTS predicate 2028
IN predicate 2029
LIKE predicate 2031
NULL predicate 2037
REGEXP_LIKE predicate 2038
Trigger event predicates 2041
TYPE predicate 2042
VALIDATED predicate 2044
XMLEXISTS predicate 2047
Built-in global variables 2050
CATALOG_SYNC_MODE global variable 2052
CLIENT_HOST global variable 2053
CLIENT_IPADDR global variable 2054
CLIENT_ORIGUSERID global variable 2055
COMPATIBILITY_MODE global variable 2056
CLIENT_USRSECTOKEN global variable 2057
MON_INTERVAL_ID global variable 2058
NLS_STRING_UNITS global variable 2059
PACKAGE_NAME global variable 2060
PACKAGE_SCHEMA global variable 2061
PACKAGE_VERSION global variable 2062
ROUTINE_MODULE global variable 2063
ROUTINE_SCHEMA global variable 2064
ROUTINE_SPECIFIC_NAME global variable 2065
ROUTINE_TYPE global variable 2066
TRUSTED_CONTEXT global variable 2067

Built-in functions 2068
Aggregate functions 2082
ARRAY_AGG 2083
AVG 2088
CORRELATION 2090
COUNT 2092
COVARIANCE 2094
COVARIANCE_SAMP 2096
GROUPING 2098
LISTAGG 2100
MAX 2103
MEDIAN 2105
MIN 2107
PERCENTILE_CONT 2109
PERCENTILE_DISC 2111
Regression functions (REGR_AVGX, REGR_AVGY, REGR_COUNT, ...)
2113
STDDEV 2117
STDDEV_SAMP 2119
SUM 2121
VARIANCE 2123
VARIANCE_SAMP 2125
XMLAGG 2127
XMLGROUP 2129
Scalar functions 2133
ABS or ABSVAL 2134
ACOS 2135
ADD_DAYS 2136
ADD_HOURS 2138
ADD_MINUTES 2140
ADD_MONTHS 2142
ADD_SECONDS 2144
ADD_YEARS 2146
AGE 2148
ARRAY_DELETE 2150
ARRAY_FIRST 2152
ARRAY_LAST 2153
ARRAY_NEXT 2154
ARRAY_PRIOR 2156
ASCII 2158
ASIN 2159
ATAN 2160
ATAN2 2161
ATANH 2162
BIGINT 2163
BINARY 2165
BITAND, BITANDNOT, BITOR, BITXOR, and BITNOT 2167

BLOB 2170
CARDINALITY 2171
CEILING or CEIL 2172
CHAR 2173
CHARACTER_LENGTH 2180
CHR 2182
CLOB 2183
COALESCE 2184
COLLATION_KEY 2185
COLLATION_KEY_BIT 2187
COMPARE_DECFLOAT 2189
CONCAT 2191
COS 2193
COSH 2194
COT 2195
CURSOR_ROWCOUNT 2196
DATAPARTITIONNUM 2197
DATE 2199
DAY 2201
DAYNAME 2203

IBM BigInsights
IBM BigInsights 4.1 documentation
Welcome to IBM® BigInsights®, a collection of powerful value-add services that
can be installed on top of the IBM Open Platform with Apache Hadoop. IBM Open
Platform with Apache Hadoop is a platform for analyzing and visualizing Internet-
scale data volumes that is powered by Apache Hadoop, an open source distributed
computing platform. The value-add services include Big SQL, BigSheets, Big R,
and Text Analytics. This information was updated December 2015.
Getting started
Introduction
Hadoop-Dev
FAQs
What’s new?
Release notes
Installing IBM Open Platform with Apache Hadoop
Installing the value-added services
Detailed system requirements
Common tasks
Security
Analyzing data with IBM Big SQL
A typical Big SQL scenario
Analyzing data (Big R, BigSheets, Big SQL, Text Analytics)
CREATE TABLE (HADOOP) (Big SQL)
CREATE TABLE (HBASE) (Big SQL)
Troubleshooting and support
Troubleshooting BigInsights
1

BigInsights support portal
Query IBM Support knowledge base
BigInsights for Hadoop
More information
Tutorials
Best Practices
IBM Big Data education
Understanding BigInsights
IBM Redbooks
© Copyright IBM Corporation 2009, 2015
2

-
-
-
-
IBM BigInsights
Accessibility
Accessibility features help users with physical disabilities, such as restricted mobility
or limited vision, to use software products successfully.
The following list specifies the major accessibility features:
All IBM® Open Platform with Apache Hadoop functionality is available using the
keyboard for navigation instead of the mouse.
You can customize the size of the fonts on IBM Open Platform with Apache
Hadoop user interfaces with your web browser.
BigInsights documentation is provided in an accessible format.
Accessible documentation
Documentation for BigInsights products is provided in XHTML 1.0 format, which is
viewable in most Web browsers. XHTML allows you to view documentation
according to the display preferences set in your browser. It also allows you to use
screen readers and other assistive technologies.
Syntax diagrams are provided in dotted decimal format. This format is available
only if you are accessing the online documentation using a screen-reader.
Using keyboard shortcuts and accelerators
You can use keys or key combinations to perform operations that can also be
done by using a mouse.
Parent topic:BigInsights
3

-
-
-
-
-
IBM BigInsights
Using keyboard shortcuts and accelerators
You can use keys or key combinations to perform operations that can also be done
by using a mouse.
About this task
You can initiate menu actions from the keyboard. Some menu items have
accelerators, which allow you to invoke the menu option without expanding the
menu. For example, you can enter CTRL+F for find, when the focus is on the details
view.
The major accessibility features in the Knowledge Center enable users to use
assistive technologies, magnify what is displayed on the screen, and initiate menu
actions from the keyboard. In addition, all images are provided with alternative text
so that users with vision impairments can understand the contents of the images.
Procedure
You can initiate menu actions from the keyboard in the following ways:
Press F10 to activate the keyboard. Then press the arrow keys to access specific
options, or press the same letter as the one that is underlined in the name of the
menu option you want to select. For example, to select the Help Index, press F10
to activate the main menu; then use the arrow keys to select Help > Help Index.
Press and hold the Alt key. Then press the same letter as the one that is
underlined in the name of the main menu option that you want to select. For
example, to select the General Help menu option, press Alt+H; then press G.
Tip: On some UNIX operating systems, you might need to press Ctrl instead of Alt
.
To exit the main menu without selecting an option, press Esc.
Often the directions for how to access a window or wizard instruct you to right-click
an object and select an object from the pop-up menu.
To open the pop-up menu by using keyboard shortcuts, first select the object then
press Shift+F10.
To access a specific menu option on the pop-up menu, press the same letter as
the one that is underlined in the name of the menu option you want to select.
The following tables provide instructions for using keyboard shortcuts and
accelerators.Table 1. General keyboard shortcuts and accelerators
Action Shortcut
Access the menu bar Alt or F10
Exit the main menu without selecting an option Esc
Go to the next menu item arrow keys, or the underlined letter in the menu
option
Go to the next field in a window Tab
Go back to the previous field in a window Shift+Tab
Go from the browser address bar to the browser
content area
F6
4

-
-
-
Table 2. Keyboard shortcuts for table actions
Table 3. Tree navigation
Table 4. Editing actions
Knowledge Center navigation: The major accessibility features in the Knowledge
Center enable users to do the following:
Use assistive technologies, such as screen-reader software and digital speech
synthesizers, to hear what is displayed on the screen. In this Knowledge Center, all
information is provided in HTML format. Consult the product documentation of the
assistive technology for details on using assistive technologies with HTML-based
information.
Operate specific or equivalent features by using only the keyboard.
Magnify what is displayed on the screen.
The following table gives instructions for how to navigate the Knowledge Center by
using the keyboard.Table 5. Keyboard shortcuts in the Knowledge Center
Find Ctrl+F
Find Next Alt+N
Action Shortcut
Move to the cell above or below up or down arrows
Move to the cell to the left or right left or right arrows
Give the next component focus Tab
Give the previous component focus Shift+Tab
Action Shortcut
Navigate out forward Tab
Navigate out backward Shift+Tab
Expand entry Right
Collapse entry Left
Toggle expand/collapse for entry Enter
Move up/down one entry up or down arrows
Move to first entry Home
Move to last visible entry End
Action Shortcut
Copy Ctrl+C
Cut Ctrl+X
Paste Ctrl+V
Select All Ctrl+A
Undo Ctrl+Z
Action Shortcut
5

- Standard operating system keystrokes are used for standard operating system
operations.
Parent topic:Accessibility
Go to the next link, button, or topic
branch from inside a frame (page)
Tab
Expand or collapse a topic branch Right and Left arrow keys
Move to the next topic branch Down arrow or Tab
Move to the previous topic branch Up arrow or Shift+Tab
Scroll to the top Home
Scroll to the bottom End
Go back Alt+Left arrow
Go forward Alt+Right arrow
Next frame Ctrl+Tab
Previous frame Shift+Ctrl+Tab
Print the current page or active frame Ctrl+P
6

IBM BigInsights
Terms and Conditions
Permissions for the use of these publications is granted subject to the following
terms and conditions.
Personal use: You may reproduce these Publications for your personal, non
commercial use provided that all proprietary notices are preserved. You may not
distribute, display or make derivative work of these Publications, or any portion
thereof, without the express consent of IBM.
Commercial use: You may reproduce, distribute and display these Publications
solely within your enterprise provided that all proprietary notices are preserved. You
may not make derivative works of these Publications, or reproduce, distribute or
display these Publications or any portion thereof outside your enterprise, without the
express consent of IBM.
Except as expressly granted in this permission, no other permissions, licenses or
rights are granted, either express or implied, to the Publications or any information,
data, software or other intellectual property contained therein.
IBM reserves the right to withdraw the permissions granted herein whenever, in its
discretion, the use of the Publications is detrimental to its interest or, as determined
by IBM, the above instructions are not being properly followed.
You may not download, export or re-export this information except in full compliance
with all applicable laws and regulations, including all United States export laws and
regulations.
IBM MAKES NO GUARANTEE ABOUT THE CONTENT OF THESE
PUBLICATIONS. THE PUBLICATIONS ARE PROVIDED "AS-IS" AND WITHOUT
WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING
BUT NOT LIMITED TO IMPLIED WARRANTIES OF MERCHANTABILITY, NON-
INFRINGEMENT, AND FITNESS FOR A PARTICULAR PURPOSE.
7

IBM BigInsights
Notices
This information was developed for products and services offered in the US. This
material might be available from IBM in other languages. However, you may be
required to own a copy of the product or product version in that language in order to
access it.
IBM may not offer the products, services, or features discussed in this document in
other countries. Consult your local IBM representative for information on the
products and services currently available in your area. Any reference to an IBM
product, program, or service is not intended to state or imply that only that IBM
product, program, or service may be used. Any functionally equivalent product,
program, or service that does not infringe any IBM intellectual property right may be
used instead. However, it is the user's responsibility to evaluate and verify the
operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter
described in this document. The furnishing of this document does not grant you any
license to these patents. You can send license inquiries, in writing, to:IBM Director
of Licensing
IBM Corporation
North Castle Drive, MD-NC119
Armonk, NY 10504-1785
US
For license inquiries regarding double-byte character set (DBCS) information,
contact the IBM Intellectual Property Department in your country or send inquiries, in
writing, to:Intellectual Property Licensing
Legal and Intellectual Property Law
IBM Japan Ltd.
19-21, Nihonbashi-Hakozakicho, Chuo-ku
Tokyo 103-8510, Japan
INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS
PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS
OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A
PARTICULAR PURPOSE. Some jurisdictions do not allow disclaimer of express or
implied warranties in certain transactions, therefore, this statement may not apply to
you.
This information could include technical inaccuracies or typographical errors.
Changes are periodically made to the information herein; these changes will be
incorporated in new editions of the publication. IBM may make improvements and/or
changes in the product(s) and/or the program(s) described in this publication at any
time without notice.
Any references in this information to non-IBM websites are provided for convenience
only and do not in any manner serve as an endorsement of those websites. The
materials at those websites are not part of the materials for this IBM product and use
8

of those websites is at your own risk.
IBM may use or distribute any of the information you provide in any way it believes
appropriate without incurring any obligation to you.
Licensees of this program who wish to have information about it for the purpose of
enabling: (i) the exchange of information between independently created programs
and other programs (including this one) and (ii) the mutual use of the information
which has been exchanged, should contact:IBM Director of Licensing
IBM Corporation
North Castle Drive, MD-NC119
Armonk, NY 10504-1785
US
Such information may be available, subject to appropriate terms and conditions,
including in some cases, payment of a fee.
The licensed program described in this document and all licensed material available
for it are provided by IBM under terms of the IBM Customer Agreement, IBM
International Program License Agreement or any equivalent agreement between us.
The performance data and client examples cited are presented for illustrative
purposes only. Actual performance results may vary depending on specific
configurations and operating conditions.
Information concerning non-IBM products was obtained from the suppliers of those
products, their published announcements or other publicly available sources. IBM
has not tested those products and cannot confirm the accuracy of performance,
compatibility or any other claims related to non-IBMproducts. Questions on the
capabilities of non-IBM products should be addressed to the suppliers of those
products.
Statements regarding IBM's future direction or intent are subject to change or
withdrawal without notice, and represent goals and objectives only.
All IBM prices shown are IBM's suggested retail prices, are current and are subject
to change without notice. Dealer prices may vary.
This information is for planning purposes only. The information herein is subject to
change before the products described become available.
This information contains examples of data and reports used in daily business
operations. To illustrate them as completely as possible, the examples include the
names of individuals, companies, brands, and products. All of these names are
fictitious and any similarity to actual people or business enterprises is entirely
coincidental.
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which
illustrate programming techniques on various operating platforms. You may copy,
modify, and distribute these sample programs in any form without payment to IBM,
for the purposes of developing, using, marketing or distributing application programs
conforming to the application programming interface for the operating platform for
which the sample programs are written. These examples have not been thoroughly
tested under all conditions. IBM, therefore, cannot guarantee or imply reliability,
serviceability, or function of these programs. The sample programs are provided
"AS IS", without warranty of any kind. IBM shall not be liable for any damages
arising out of your use of the sample programs.
9

-
-
-
-
-
-
Each copy or any portion of these sample programs or any derivative work must
include a copyright notice as follows:
© (your company name) (year).
Portions of this code are derived from IBM Corp. Sample Programs.
© Copyright IBM Corp. _enter the year or years_.
Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of
International Business Machines Corp., registered in many jurisdictions worldwide.
Other product and service names might be trademarks of IBM or other companies.
A current list of IBM trademarks is available on the web at "Copyright and trademark
information" at www.ibm.com/legal/copytrade.shtml.
The following terms are trademarks or registered trademarks of other companies
and have been used in at least one of the documents in the BigInsights
documentation library:
Microsoft, Windows, and the Windows logo are trademarks of Microsoft
Corporation in the United States, other countries, or both.
Intel, Intel logo, Intel Inside, Intel Inside logo, Intel Centrino, Intel Centrino logo,
Celeron, Intel Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or
registered trademarks of Intel Corporation or its subsidiaries in the United States
and other countries.
Java™ and all Java-based trademarks and logos are trademarks or registered
trademarks of Oracle and/or its affiliates.
UNIX is a registered trademark of The Open Group in the United States and other
countries.
Linux is a registered trademark of Linus Torvalds in the United States, other
countries, or both.
Adobe, the Adobe logo, PostScript, and the PostScript logo are either registered
trademarks or trademarks of Adobe Systems Incorporated in the United States,
and/or other countries.
Other company, product, or service names may be trademarks or service marks of
others.
Terms and conditions for product documentation
Permissions for the use of these publications are granted subject to the following
terms and conditions.
Applicability
These terms and conditions are in addition to any terms of use for the IBM website.
Personal use
You may reproduce these publications for your personal, noncommercial use
provided that all proprietary notices are preserved. You may not distribute, display or
make derivative work of these publications, or any portion thereof, without the
10

express consent of IBM.
Commercial use
You may reproduce, distribute and display these publications solely within your
enterprise provided that all proprietary notices are preserved. You may not make
derivative works of these publications, or reproduce, distribute or display these
publications or any portion thereof outside your enterprise, without the express
consent of IBM.
Rights
Except as expressly granted in this permission, no other permissions, licenses or
rights are granted, either express or implied, to the publications or any information,
data, software or other intellectual property contained therein.
IBM reserves the right to withdraw the permissions granted herein whenever, in its
discretion, the use of the publications is detrimental to its interest or, as determined
by IBM, the above instructions are not being properly followed.
You may not download, export or re-export this information except in full compliance
with all applicable laws and regulations, including all United States export laws and
regulations.
IBM MAKES NO GUARANTEE ABOUT THE CONTENT OF THESE
PUBLICATIONS. THE PUBLICATIONS ARE PROVIDED "AS-IS" AND WITHOUT
WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING
BUT NOT LIMITED TO IMPLIED WARRANTIES OF MERCHANTABILITY, NON-
INFRINGEMENT, AND FITNESS FOR A PARTICULAR PURPOSE.
IBM Online Privacy Statement
IBM Software products, including software as a service solutions, (“Software
Offerings”) may use cookies or other technologies to collect product usage
information, to help improve the end user experience, to tailor interactions with the
end user, or for other purposes. In many cases no personally identifiable information
is collected by the Software Offerings. Some of our Software Offerings can help
enable you to collect personally identifiable information. If this Software Offering
uses cookies to collect personally identifiable information, specific information about
this offering’s use of cookies is set forth below.
This Software Offering does not use cookies or other technologies to collect
personally identifiable information.
If the configurations deployed for this Software Offering provide you as customer the
ability to collect personally identifiable information from end users via cookies and
other technologies, you should seek your own legal advice about any laws
applicable to such data collection, including any requirements for notice and
consent.
For more information about the use of various technologies, including cookies, for
these purposes, see IBM’s Privacy Policy at http://www.ibm.com/privacy and
IBM’s Online Privacy Statement at http://www.ibm.com/privacy/details in the
section entitled “Cookies, Web Beacons and Other Technologies,” and the “IBM
Software Products and Software-as-a-Service Privacy Statement” at
11

http://www.ibm.com/software/info/product-privacy.
12

-
-
-
-
-
IBM BigInsights
Product overview
BigInsights® is a flexible software platform that provides capabilities to discover and
analyze business insights that are hidden in large volumes of structured and
unstructured data, giving value to previously dormant data.
BigInsights information roadmap
This document provides links to the information resources that are available for
IBM® Open Platform with Apache Hadoop and the BigInsights value-add services.
Introduction to BigInsights
BigInsights is a software platform for discovering, analyzing, and visualizing data
from disparate sources. You use this software to help process and analyze the
volume, variety, and velocity of data that continually enters your organization every
day. BigInsights is a collection of value-added services that can be installed on top
of the IBM Open Platform with Apache Hadoop, wihch is the open Hadoop
foundation.
Release notes
Release notes contain information about the installation and administration of IBM
BigInsights Enterprise Edition and its components.
Installing the free IBM Open Platform with Apache Hadoop and BigInsights Quick
Start Edition, non-production software
What's new for Version 4.1
13

-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
IBM BigInsights
BigInsights information roadmap
This document provides links to the information resources that are available for
IBM® Open Platform with Apache Hadoop and the BigInsights® value-add services.
Product overview
Evaluating
Planning
Installing
Administering
Getting started
Developing
Analyzing
Reference
Community resources
Product overview
BigInsights home page
This web page provides an overview of BigInsights and its related components.
Introduction
These topics introduce you to BigInsights, including its product modules and
components. Last updated: 2015-008
New features and capabilities
These topics include information about the features, capabilities, and updates
that are included in the most recent version of BigInsights. Last updated: 2015-
08
SQL-on-Hadoop without compromise
This white paper contains information on the updated Big SQL for BigInsights
V3.0 and the speed, portability, and robust functionality that this SQL on
Hadoop solution provides. Last updated: 2014-04
Evaluating
Analyzing social media and structured data with InfoSphere BigInsights
This article provides a quick start on BigSheets. You'll learn how to model big
data in BigSheets, manipulate this data using built-in macros and functions,
create charts to visualize your work, and export the results of your analysis in
one of several popular output formats. Last updated: 2012
Big Data Networked Storage Solution for Hadoop
This IBM® Redpaper™ provides a reference architecture, based on Apache
Hadoop, to help businesses gain control over their data, meet tight service level
agreements (SLAs) around their data applications, and turn data-driven insight
into effective action. Big Data Networked Storage Solution for Hadoop delivers
the capabilities for ingesting, storing, and managing large data sets with high
reliability. IBM BigInsights provides an innovative analytics platform that
processes and analyzes all types of data to turn large complex data into
14

-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
insight.Last updated: 2013-07
Quick Start Edition
IBM BigInsights Quick Start Edition is a free, downloadable non-production
version of BigInsights that enables new solutions that cost effectively turn large,
complex volumes of data into insight by combining Apache Hadoop, (including
the MapReduce framework and the Hadoop Distributed File Systems), with
unique, enterprise-ready technologies and capabilities from across IBM,
including Big SQL, text analytics and BigSheets. Last updated: 2015-08
Planning
System requirements
This document describes the system requirements for BigInsights. Last
updated: 2015-08
Performance and Capacity Implications for Big Data
The purpose of this IBM Redpaper™ publication is to consider the performance
and capacity implications of big data solutions, which must be taken into
account for them to be viable. This paper describes the benefits that big data
approaches can provide. We then cover performance and capacity
considerations for creating big data solutions. We conclude with what this
means for big data solutions, both now and in the future. Last updated: 2014-
02-07
Using the IBM Big Data and Analytics Platform to Gain Operational Efficiency
This IBM® Redbooks® Solution Guide describes how to use IBM Big Data and
Analytics Platform to provide a more comprehensive view of a customer’s
interaction with an organization’s products and services. It provides an example
about how to take multiple sources of information and analyze that data to gain
insight. This example uses disparate data sources, real-time analytics, an
appliance data warehouse, analytic modeling, and reporting tools to support the
business decision-making process. Last updated: 2014-04
Big SQL: Data warehouse-grade performance on Hadoop
This Impact 2014 presentation introduces Big SQL and places it in the SQL on
Hadoop context. Last updated: 2014-04
Taming Big Data with Big SQL
This Impact 2014 presentation introduces big data, BigInsights, and Big SQL. It
provides performance and security best practices for Big SQL and announces
security improvements over Hive 0.12. Last updated: 2014-04
Installing
Release notes
The release notes contain critical information to ensure the successful
installation and operation of BigInsights. Last updated: 2015-08
Administering
Setting up Security and Administering BigInsights
These topics describe how to complete general administration tasks, such as
configuring user security, administering individual components, and deploying
and running programs. Last updated: 2015-08
15

-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
Getting started
BigInsights FAQs
The FAQs area in the Hadoop Dev community provides answers to questions
frequently asked about BigInsights, Big SQL, Hive, and HBase.
BigInsights tutorials
These topics include tutorials that you can use to quickly get started with
BigInsights. Last updated: 2015-08
IBM big data education
IBM offers classroom, on-site, and e-Learning training classes to help build and
enhance your skills BigInsights.
IBM BigInsights education
These e-Learning training classes include material on BigInsights Foundation,
Big SQL, BigInsights Analytics for Business Analysts, and BigInsights Analytics
for Programmers.
Big Data University
The Big Data University website contains courses, downloads, and educational
materials about Hadoop and other big data applications.
Understanding BigInsights
This developerWorks® article provides an introduction to BigInsights, including
architecture, capabilities, and scenarios for how you can use the product in your
organization. Last updated: 2011-10
Developing
Developing and administering applications
These topics describe how to develop and maintain BigInsights applications.
Last updated: 2015-08
Set up and use federation
This developerWorks article introduces Big SQL federation capabilities by using
many data sources, including IBM DB2 for Linux, UNIX, and Windows, IBM
PureData System for Analytics, IBM PureData System for Operational
Analytics, Teradata, and Oracle. Federation enables you to sent distributed
requests to multiple data sources within a single SQL statement. Last updated:
2014-07-08
Analyzing
Analyzing big data
These topics describe how to analyze data with IBM BigSheets, analyzing and
manipulating data with Jaql, and analyzing documents with text analytics. Last
updated: 2015-08
BigSheets
This website includes information about BigSheets, which is a browser-based
visualization tool that you can use to extend the scope of your business
intelligence data.
16

-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
Troubleshooting
These topics describe how to troubleshoot issues with BigInsights components,
security, and text analytics Last updated: 2015-08
dW Answers for BigInsights
This Q&A area in the Hadoop Dev community provides a place to ask questions
and get answers from experts.
BigInsights product support
The IBM Support Portal is a unified, customizable view of all technical support
tools and information for all IBM systems, software, and services. Updated
continuously.
Reference
Reference
Use the reference information to read more about commands and functions,
supported languages, and console messages. Last updated: 2015-08
Community resources
Hadoop Dev
This dev-to-dev community site where you can find resources and tips from
experts, ask questions, and share with others. Updated continuously.
IBM Meetup Groups
This website, designed for developers, data scientists, and big data
enthusiasts, provides an opportunity to work hands-on with the solutions and
tools in the big data portfolio.
IBM developerWorks
The IBM developerWorks website contains developer resources, tutorials, and
articles about BigInsights.
The Big Data and Analytics Hub
This website provides links to big data communities, events, blogs, videos and
podcasts, and developer-centric material. Updated continuously.
Video Guide
This developerWorks article provides links to new and trending videos from the
IBM Big Data channel on YouTube. You can also go to the Videos area in
Hadoop Dev for a searchable and continuously updated list.
Parent topic:Product overview
17

-
-
-
-
-
IBM BigInsights
Introduction to BigInsights
BigInsights® is a software platform for discovering, analyzing, and visualizing data
from disparate sources. You use this software to help process and analyze the
volume, variety, and velocity of data that continually enters your organization every
day. BigInsights is a collection of value-added services that can be installed on top
of the IBM® Open Platform with Apache Hadoop, wihch is the open Hadoop
foundation.
.BigInsights helps your organization understand and analyze massive volumes of
unstructured information as easily as smaller volumes of information. The flexible
platform is built on an Apache Hadoop open source framework that runs in parallel
on commonly available, low-cost hardware. You can easily scale the platform to
analyze hundreds of terabytes, petabytes, or more of raw data that is derived from
various sources. As information grows, you add more hardware to support the influx
of data.
BigInsights helps application developers, data scientists, and administrators in your
organization quickly build and deploy custom analytics to capture insight from data.
This data is often integrated into existing databases, data warehouses, and
business intelligence infrastructure. By using BigInsights, users can extract new
insights from this data to enhance knowledge of your business. For more
information about the IBM Open Source Platform, see Installing IBM Open Source
Platorm for Apache Hadoop.
BigInsights incorporates tooling and value-add services for numerous users,
speeding time to value and simplifying development and maintenance:
Software developers can use the value-add services that are provided to develop
custom text analytic functions to analyze loosely structured or largely unstructured
text data.
Data scientists and business analysts can use the data analysis tools within the
value-add services to explore and work with unstructured data in a familiar
spreadsheet-like environment.
IBM Open Platform with Apache Hadoop and the BigInsights value-add services
The content of IBM Open Platform with Apache Hadoop and the BigInsights value-
add services includes the following:
IBM BigInsights Quick Start Edition for Non-Production Environments
Test drive the IBM Open Platform with Apache Hadoop and BigInsights value-add
modules, Version 4.1 by downloading the Quick Start Edition, which is a free, non-
production software.
BigInsights features and architecture
BigInsights provides distinct capabilities for discovering and analyzing business
insights that are hidden in large volumes of data. These technologies and features
combine to help your organization manage data from the moment that it enters
your enterprise.
18

-
-
-
BigInsights value-added services
In your multi-node cluster, it is suggested that you have at least one management
node in your non-high availability environment, if performance is not an issue. If
performance is a concern, consider configuring at least three management nodes.
If you use the BigInsights - Big SQL service, consider configuring four
management nodes. If you use a high availability environment, consider six
management nodes. Use the following list as a guide for the nodes in your cluster.
Scenarios for working with big data
BigInsights provides capabilities to derive business value from complex,
unstructured information. BigInsights supports various scenarios that can help
different organizations grow by finding value that is hidden in data and data
relationships.
Where BigInsights fits in an enterprise data architecture
Reusing business investments and incorporating existing assets is important when
expanding your enterprise data architecture. BigInsights supports data exchange
with a number of sources, relational data stores, and applications so that it can
integrate into your existing architecture.
Parent topic:Product overview
19

IBM BigInsights
IBM Open Platform with Apache Hadoop and the
BigInsights value-add services
The content of IBM® Open Platform with Apache Hadoop and the BigInsights®
value-add services includes the following:
Table 1. Supported features of the BigInsights editions
Parent topic:Introduction to BigInsights
Modules Supported features
IBM Open Platform with Apache Hadoop Cluster management, and services such as
Hive, HBase, Oozie, Flume, HDFS
IBM BigInsights Analyst Module Big SQL and BigSheets
IBM BigInsights Data Scientist Module The contents of the IBM BigInsights Analyst
Module, plus Text Analytics and Big R
IBM BigInsights Enterprise Management
Module
GPFS and Platform Symphony
IBM BigInsights for Apache Hadoop The contents of the IBM BigInsights Data
Scienties module, the BigInsights Analyst
module, and the BigInsights Enterprise
Management module. In addition, it contains a
license that provides limited-use licenses for
other software so that you can get even more
value out of Hadoop.
IBM BigInsights Quick Start Edition Big SQLIBM BigInsights Big RBigSheetsText
AnalyticsConnectorsIBM Hadoop core
20

-
-
-
-
IBM BigInsights
IBM BigInsights Quick Start Edition for Non-
Production Environments
Test drive the IBM® Open Platform with Apache Hadoop and BigInsights® value-
add modules, Version 4.1 by downloading the Quick Start Edition, which is a free,
non-production software.
Use the Quick Start Edition to begin exploring the features of IBM Open Platform
with Apache Hadoop and BigInsights value-add modules by using real data and
running real applications.
The Quick Start Edition comes loaded with most of the same features as the IBM
Open Platform with Apache Hadoop, and the related services bundled in the Data
Scientist and Business Analyst packages, without any need to upgrade or uninstall
your current products. The Quick Start Edition puts no data limit on the cluster and
there is no time limit on the license.
Download the software
You can download the native software. See Installing the value-add services
for information how downloading and installing. Or, you can download the
VM image, the latter of which comes preconfigured.
Complete the tutorials
After you download the software, use the BigInsights tutorials to begin working
with big data.
For more information, view the video tutorials on the BigInsights home page.
The following table highlights the supported and unsupported features of the Quick
Start Edition.
Table 1. Supported and unsupported features of the Quick Start Edition
Related concepts:
IBM BigInsights Quick Start Edition for Non-Production Environments: VM image
README
Supported features Unsupported features
Big SQLIBM BigInsights Big RBigSheetsText
AnalyticsWorkload optimizationQuery
SupportConnectorsManagement toolsIBM Open
Platform with Apache Hadoop
High availability (HA) capabilityGeneral Parallel
File System (GPFS™)Production support
21

IBM BigInsights
BigInsights features and architecture
BigInsights® provides distinct capabilities for discovering and analyzing business
insights that are hidden in large volumes of data. These technologies and features
combine to help your organization manage data from the moment that it enters your
enterprise.
By combining these technologies, BigInsights extends the Hadoop open source
framework with enterprise-grade security, governance, availability, integration into
existing data stores, tools that simplify developer productivity, and more.
Hadoop is a computing environment built on top of a distributed, clustered file
system that is designed specifically for large-scale data operations. Hadoop is
designed to scan through large data sets to produce its results through a highly
scalable, distributed batch processing system. Hadoop comprises two main
components: a file system, known as the Hadoop Distributed File System (HDFS),
and a programming paradigm, known as Hadoop MapReduce. To develop
applications for Hadoop and interact with HDFS, you use additional technologies
and programming languages such as Pig, Hive, Flume, and many others.
Apache Hadoop helps enterprises harness data that was previously difficult to
manage and analyze. BigInsights features Hadoop and its related technologies as a
core component.
22

-
-
-
-
-
-
File systems
The Hadoop Distributed File System (HDFS) comes with IBM Open Platform with
Apache Hadoop as your distributed file system.
MapReduce frameworks
The MapReduce framework is the core of Apache Hadoop. This programming
paradigm provides for massive scalability across hundreds or thousands of servers
in a Hadoop cluster.
Open source technologies
The following open source technologies are included with IBM Open Platform with
Apache Hadoop version 4.1.
Text Analytics
BigInsights includes Text Analytics, which extracts structured information from
unstructured and semistructured data.
IBM Big SQL
Big SQL is a massively parallel processing (MPP) SQL engine that deploys directly
on the physical Hadoop Distributed File System (HDFS) cluster.
Integration with other IBM products
BigInsights complements and extends existing business capabilities by integrating
with other IBM products. These integration points extend existing technologies to
encompass more comprehensive information types, enabling a complete view of
your business.
23

-
IBM BigInsights
File systems
The Hadoop Distributed File System (HDFS) comes with IBM® Open Platform with
Apache Hadoop as your distributed file system.
Hadoop Distributed File System (HDFS)
The Hadoop Distributed File System (HDFS) allows applications to run across
multiple servers. HDFS is highly fault tolerant, runs on low-cost hardware, and
provides high-throughput access to data.
Parent topic:BigInsights features and architecture
24

IBM BigInsights
Hadoop Distributed File System (HDFS)
The Hadoop Distributed File System (HDFS) allows applications to run across
multiple servers. HDFS is highly fault tolerant, runs on low-cost hardware, and
provides high-throughput access to data.
Data in a Hadoop cluster is broken into smaller pieces called blocks, and then
distributed throughout the cluster. Blocks, and copies of blocks, are stored on other
servers in the Hadoop cluster. That is, an individual file is stored as smaller blocks
that are replicated across multiple servers in the cluster.
Each HDFS cluster has a number of DataNodes, with one DataNode for each node
in the cluster. DataNodes manage the storage that is attached to the nodes on
which they run. When a file is split into blocks, the blocks are stored in a set of
DataNodes that are spread throughout the cluster. DataNodes are responsible for
serving read and write requests from the clients on the file system, and also handle
block creation, deletion, and replication.
An HDFS cluster supports NameNodes, an active NameNode and a standby
NameNode, which is a common setup for high availability. The NameNode regulates
access to files by clients, and tracks all data files in HDFS. The NameNode
determines the mapping of blocks to DataNodes, and handles operations such as
opening, closing, and renaming files and directories. All of the information for the
NameNode is stored in memory, which allows for quick response times when adding
storage or reading requests. The NameNode is the repository for all HDFS
metadata, and user data never flows through the NameNode.
A typical HDFS deployment has a dedicated computer that runs only the
NameNode, because the NameNode stores metadata in memory. If the computer
that runs the NameNode fails, then metadata for the entire cluster is lost, so this
computer is typically more robust than others in the cluster.
Parent topic:File systems
Related reference:
Read the HDFS Architecture Guide
Read the HDFS User Guide
25

-
-
IBM BigInsights
MapReduce frameworks
The MapReduce framework is the core of Apache Hadoop. This programming
paradigm provides for massive scalability across hundreds or thousands of servers
in a Hadoop cluster.
Hadoop MapReduce
In IBM® Open Platform with Apache Hadoop, the MapReduce framework,
MapReduce version 2, is run as a YARN workload framework. The benefits of this
new approach are that resource management is separated from workload
management, and MapReduce applications can coexist with other types of
workloads such as Spark or Slider.
Yarn
The current version of the product supports the new Apache Hadoop YARN
framework and integrates it with the rest of the IBM Open Platform with Apache
Hadoop components. Yarn decouples resource management from workload
management.
26

IBM BigInsights
Hadoop MapReduce
In IBM® Open Platform with Apache Hadoop, the MapReduce framework,
MapReduce version 2, is run as a YARN workload framework. The benefits of this
new approach are that resource management is separated from workload
management, and MapReduce applications can coexist with other types of
workloads such as Spark or Slider.
In this programming paradigm, applications are divided into self-contained units of
work. Each of these units of work can be run on any node in the cluster. In a
Hadoop cluster, a MapReduce program is known as a job. A job is run by being
broken down into pieces, known as tasks. These tasks are scheduled to run on the
nodes in the cluster where the data exists.
MapReduce version 2 jobs are executed by YARN in the Hadoop cluster. The YARN
ResourceManager spawns a MapReduce ApplicationMaster container, which
requests additional containers for mapper and reducer tasks. The ApplicationMaster
communicates with the NameNode to determine where all of the data required for
the job exists across the cluster. It attempts to schedule tasks on the cluster where
the data is stored, rather than sending data across the network to complete a task.
The YARN framework and the Hadoop Distributed File System (HDFS) typically
exist on the same set of nodes, which enables the ResourceManager program to
schedule tasks on nodes where the data is stored.
As the name MapReduce implies, the reduce task is always completed after the map
task. A MapReduce job splits the input data set into independent chunks that are
processed by map tasks, which run in parallel. These bits, known as tuples, are
key/value pairs. The reduce task takes the output from the map task as input, and
combines the tuples into a smaller set of tuples.
Each MapReduce ApplicationMaster monitors its spawned tasks. If a task fails to
complete, the ApplicationMaster will reschedule that task on another node in the
cluster.
This distribution of work enables map tasks and reduce tasks to run on smaller
subsets of larger data sets, which ultimately provides maximum scalability. The
MapReduce framework also maximizes parallelism by manipulating data stored
across multiple clusters. MapReduce applications do not have to be written in
Java™, though most MapReduce programs that run natively under Hadoop are
written in Java.
Parent topic:MapReduce frameworks
27

IBM BigInsights
Yarn
The current version of the product supports the new Apache Hadoop YARN
framework and integrates it with the rest of the IBM® Open Platform with Apache
Hadoop components. Yarn decouples resource management from workload
management.
The YARN framework uses a ResourceManager service, a NodeManagers service,
and an Application master service.
The Application master service is an import type of YARN services that is a per-
application service. It is responsible for negotiating with the Resource Manager to
apply resources for a particular application. It also monitors the status of the
application execution and provides tracking information. The bottleneck for the
central service, like Resource Manager, on a highly concurrent and utilized cluster is
resolved by transferring the scheduling responsibility from Resource Manager to
Application Manager.
The ResourceManager is in charge of scheduling resources for jobs. The basic
allocation unit is a container. Containers are workload agnostic, and they can
represent any type of computation, such as a map or reduce task in MapReduce.
The ResourceManager ensures that the cluster capacity is not exceeded by keeping
track of the scheduled containers and queueing requests when resources are busy.
NodeManagers spawn containers scheduled by the ResourceManager and monitor
that they do not go beyond the expected resource utilization. Containers that use
more memory or CPU than allocated are terminated.
For more information about the Yarn architecture, see
http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
Parent topic:MapReduce frameworks
28

-
-
-
-
IBM BigInsights
Open source technologies
The following open source technologies are included with IBM® Open Platform with
Apache Hadoop version 4.1.
Table 1. Open source technology versions by IBM BigInsights value-add services
release
Ambari
Apache Ambari is an open framework for provisioning, managing, and monitoring
Apache Hadoop clusters. Ambari provides an intuitive and easy-to-use Hadoop
management web UI backed by its collection of tools and APIs that simplify the
operation of Hadoop clusters.
Flume
Apache Flume is a distributed, reliable, and available service for efficiently
collecting, aggregating, and moving large amounts of streaming event data. Flume
helps you aggregate data from many sources, manipulate the data, and then add
the data into your Hadoop environment.
Hadoop
Apache Hadoop contains open-source software for reliable, scalable, distributed
computing and storage. The Apache Hadoop software library is a framework that
allows for the distributed processing of large data sets across clusters of
computers using simple programming models. It is designed to scale up from
single servers to thousands of machines, each offering local computation and
storage.
HBase
Apache HBase is a column-oriented database management system that runs on
top of HDFS and is often used for sparse data sets. Unlike relational database
systems, HBase does not support a structured query language like SQL. HBase
Open source
technology
4.1.0.0 4.1.0.1 4.1.0.2
Ambari 2.1.0 2.1.0 2.1.0
Flume 1.5.2 1.5.2 1.5.2
Hadoop (HDFS, YARN
MapReduce)
2.7.1 2.7.1 2.7.1
HBase 1.1.1 1.1.1 1.1.1
Hive 1.2.1 1.2.1 1.2.1
Kafka 0.8.2.1 0.8.2.1 0.8.2.1
Knox 0.6.0 0.6.0 0.6.0
Oozie 4.2.0 4.2.0 4.2.0
Pig 0.15.0 0.15.0 0.15.0
Slider 0.80.0 0.80.0 0.80.0
Solr 5.1.0 5.1.0 5.1.0
Spark 1.4.1 1.4.1 1.5.1
Sqoop 1.4.6 1.4.6 1.4.6
Zookeeper 3.4.6 3.4.6 3.4.6
29

-
-
-
-
-
-
-
-
applications are written in Java™, much like a typical MapReduce application.
HBase allows many attributes to be grouped into column families so that the
elements of a column family are all stored together. This approach is different from
a row-oriented relational database, where all columns of a row are stored together.
Hive
Apache Hive is a data warehouse infrastructure that facilitates data extract-
transform-load (ETL) operations, in addition to analyzing large data sets that are
stored in the Hadoop Distributed File System (HDFS). IBM Open Platform with
Apache Hadoop includes a JDBC driver that is used for programming with Hive
and for connecting with Cognos Business Intelligence software.
Kafka
Apache Kafka is a distributed publish-subscribe messaging system rethought as a
distributed commit log. It is designed to be fast, scalable, durable, and fault-tolerant
providing a unified, high-throughput, low-latency platform for handling real-time
data feeds. Kafka is often used in place of traditional message brokers because of
its higher throughput, reliability and replication.
Oozie
Apache Oozie is a management application that simplifies workflow and
coordination between MapReduce jobs. Oozie provides users with the ability to
define actions and dependencies between actions. Oozie then schedules actions
to run when the required dependencies are met. Workflows can be scheduled to
start based on a given time or based on the arrival of specific data in the file
system.
Pig
Apache Pig is a platform for analyzing large data sets that consist of a high-level
language for expressing data analysis programs, coupled with infrastructure for
evaluating these programs. A key property of Pig programs is that their structure is
amenable to substantial parallelization, which in turns enables them to handle very
large data sets.
Slider
Apache Slider (incubating) is a YARN application to deploy existing distributed
applications on YARN, monitor them, and make them larger or smaller as desired,
even while they are running.
Solr
Solr is an enterprise search tool from the Apache Lucene project that offers
powerful search tools, including hit highlighting, as well as indexing capabilities,
reliability and scalability, a central configuration system, and failover and recovery.
Spark
Spark is a component of IBM Open Platform with Apache Hadoop that includes
Apache Spark. Apache Spark is a fast and general-purpose cluster computing
system. It provides high-level APIs in Java, Scala and Python, and an optimized
engine that supports general execution graphs. It also supports a rich set of higher-
level tools including Spark SQL for SQL and structured data processing, MLLib for
machine learning, GraphX for combined data-parallel and graph-parallel
computations, and Spark Streaming for streaming data processing.
Sqoop
Sqoop is a tool designed to easily import information from structured databases
(such as SQL) and related Hadoop systems (such as Hive and HBase) into your
30

-
-
Hadoop cluster. You can also use Sqoop to extract data from Hadoop and export it
to relational databases and enterprise data warehouses.
ZooKeeper
ZooKeeper is a centralized infrastructure and set of services that enable
synchronization across a cluster. ZooKeeper maintains common objects that are
needed in large cluster environments, such as configuration information,
distributed synchronization, and group services. Many other open source projects
that use Hadoop clusters require these cross-cluster services. Having these
services available in ZooKeeper ensures that each project can embed ZooKeeper
without having to build new synchronization services into each project.
Other Apache projects
The IBM Open Platform with Apache Hadoop is a pure open source offering with
the latest components in the Apache Hadoop and Spark ecosystems.
Related reference:
Apache Hadoop website
Related information:
Apache Solr
31

-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
IBM BigInsights
Ambari
Apache Ambari is an open framework for provisioning, managing, and monitoring
Apache Hadoop clusters. Ambari provides an intuitive and easy-to-use Hadoop
management web UI backed by its collection of tools and APIs that simplify the
operation of Hadoop clusters.
Core Ambari
The release of IBM® Open Platform with Apache Hadoop includes an updated
Apache Ambari 2.1.0 with more functionality and improvements.
Customizable Dashboards [AMBARI-9792] : Ability to customize the Metric widgets
displayed on HDFS, YARN and HBase on Service Summary pages. Includes
ability for Operators to create new widgets and share widgets in a Widget Library.
Guided Configs [AMBARI-9794] : Service Configs for HDFS, YARN, Hive and
HBase included new UI controls (such as slider-bars) and an improved
organization/layout.
Manual Kerberos [AMBARI-9783] : When enabling Kerberos, ability to perform
setup Kerberos manually.
New User Views : Hive, Pig, Files and Capacity Scheduler user views are included
by default with Ambari.
Rack Awareness [AMBARI-6646] : Ability to set Rack ID on hosts. Ambari will
generate a topology script automatically and set the configuration for HDFS.
Alerts Log Appender [AMBARI-10249] : Log alert state change events to ambari-
alerts.log.
JDK 1.8 [AMBARI-9784] : Added support for Oracle JDK 1.8.
RHEL/CentOS/Oracle Linux 7 [AMBARI-979] : Added support for
RHEL/CentOS/Oracle Linux 7.
Ambari Alerts (AMBARI-6354)
Ambari Metrics (AMBARI-5707)
Simplified Kerberos Setup (AMBARI-7204)
Hive Metastore HA (AMBARI-6684)
HiveServer2 HA (AMBARI-8906)
Oozie HA (AMBARI-6683)
Add HDFS-NFS gateway as a new component to HDFS in Ambari stack (AMBARI-
9224)
Extensibility
Blueprints: Host Discovery [AMBARI-10750] : Ability to automatically add hosts to
a blueprint-created cluster.
Views Framework: Auto-create [AMBARI-10424] : Ability to specify how to
automatically create a view instance.
Views Framework: Auto-configure [AMBARI-10306] : Ability to specify how to
automatically configuration a view instance based on the cluster being managed by
Ambari.
32

For more information about the updates, see https://issues.apache.org/jira/browse/
Parent topic:Open source technologies
33

-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
IBM BigInsights
Flume
Apache Flume is a distributed, reliable, and available service for efficiently
collecting, aggregating, and moving large amounts of streaming event data. Flume
helps you aggregate data from many sources, manipulate the data, and then add
the data into your Hadoop environment.
Use the following terms as a guide for working with Flume:
sources
Any data source that Flume supports,
channels
A repository where the data are staged.
sinks
The target of data where Flume sends the interceptors to.
IBM® Open Platform with Apache Hadoop and BigInsights include the following
changes on top of Flume 1.5.2:
FLUME-2095: JMS source with TIBCO
FLUME-924: Implement a JMS source for Flume NG
FLUME-997: Support secure transport mechanism
FLUME-1502: Support for running simple configurations embedded in host process
FLUME-1516: FileChannel Write Dual Checkpoints to avoid replays
FLUME-1632: Persist progress on each file in file spooling client/source
FLUME-1735: Add support for a plugins.d directory
FLUME-1894: Implement Thrift RPC
FLUME-1917: FileChannel group commit (coalesce fsync)
FLUME-2010: Support Avro records in Log4jAppender and the HDFS Sink
FLUME-2048: Avro container file deserializer
FLUME-2070: Add a Flume Morphline Solr Sink
FLUME-1227: Introduce some sort of SpillableChannel
FLUME-2056: Allow SpoolDir to pass just the filename that is the source of an
event
FLUME-2071: Flume Context doesn’t support float or double configuration values.
FLUME-2185: Upgrade morphlines to 0.7.0
FLUME-2188: flume-ng-log4jappender Support user supplied headers
FLUME-2225: Elasticsearch Sink for ES HTTP API
FLUME-2294: Add a sink for Kite Datasets
FLUME-2309: Spooling directory should not always consume the oldest file first.
For a complete list of the new features, improvements, and bug fixes available, refer
to the CHANGELOG.txt file located in your Flume installation directory.
For more information about Flume, see http://flume.apache.org/.
34

-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
IBM BigInsights
Hadoop
Apache Hadoop contains open-source software for reliable, scalable, distributed
computing and storage. The Apache Hadoop software library is a framework that
allows for the distributed processing of large data sets across clusters of computers
using simple programming models. It is designed to scale up from single servers to
thousands of machines, each offering local computation and storage.
The Apache Hadoop 2.7.1 release includes important new features and
improvements since Hadoop 2.2.0:
Support for Access Control Lists in HDFS
Native support for Rolling Upgrades in HDFS
Usage of protocol-buffers for HDFS FSImage for smooth operational upgrades
Complete HTTPS support in HDFS
Enhanced support for new applications on YARN with Application History Server
and Application Timeline Server
Support for strong SLAs in YARN CapacityScheduler via Preemption
Support for Heterogeneous Storage hierarchy in HDFS.
In-memory cache for HDFS data with centralized administration and management.
Simplified distribution of MapReduce binaries with HDFS in YARN Distributed
Cache.
The IBM-specific changes are:
Backport HDFS-8432:Introduce a minimum compatible layout version to allow
downgrade in more rolling upgrade use cases.
Backport HADOOP-9431: TestSecurityUtil#testLocalHostNameForNullOrWild on
systems where hostname contains capital letters
Fix to remove cross-site scripting / scripting injection in HDFS webapps
Backport HADOOP-11138: Stream yarn daemon and container logs through log4j
Fix hadoop2 scripts to enable log streaming
Backport HADOOP-10420: Add support to Swift-FS to support tempAuth
Backport MAPREDUCE-5621: mr-jobhistory-daemon.sh doesn't have to execute
mkdir and chown all the time
Make the location of container executor config file configurable
Add log streaming for container logs
Backport HADOOP-7436: Bundle Log4j socket appender Metrics plugin in Hadoop
Upgrade jsch to 0.1.50 because of JDK 1.7 incompatibilities
Backport MAPREDUCE-6191: TestJavaSerialization fails with getting incorrect MR
job result
Backport HADOOP-11418: Property "io.compression.codec.lzo.class" does not
work with other value besides default
Fix race condition in Configuration.write
Backport MAPREDUCE-6246: DBOutputFormat.java appending extra semicolon to
query which is incompatible with DB2
Backport HDFS-7282. Fix intermittent TestShortCircuitCache and
TestBlockReaderFactory failures resulting from TemporarySocketDirectory GC
Backport HDFS-7182. JMX metrics aren't accessible when NN is busy
36

-
-
-
Upgrade jetty version to 6.1.26-ibm
Backport HADOOP-10062. race condition in
MetricsSystemImpl#publishMetricsNow that causes incorrect results.
Backport HDFS-6874: Add GET_BLOCK_LOCATIONS operation to HttpFS
37

Big Insights v4.1

Recomendados

Recomendados

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Destacado

Destacado (8)

Similar a Big Insights v4.1

Similar a Big Insights v4.1 (20)

Último

Último (20)

Big Insights v4.1