This document provides best practices for optimizing the performance of InfoSphere BigInsights and InfoSphere Streams when deployed in the cloud. It discusses optimizing disk performance by choosing cloud providers and instances with good disk I/O, partitioning and formatting disks correctly, and configuring HDFS to use multiple data directories. It also discusses optimizing Java performance by correctly configuring JVM memory and optimizing MapReduce performance by setting appropriate values for map and reduce tasks based on machine resources.
2. Please Note
IBM’s statements regarding its plans, directions, and intent are subject to
change or withdrawal without notice at IBM’s sole discretion.
Information regarding potential future products is intended to outline our general
product direction and it should not be relied on in making a purchasing decision.
The information mentioned regarding potential future products is not a
commitment, promise, or legal obligation to deliver any material, code or
functionality. Information about potential future products may not be
incorporated into any contract. The development, release, and timing of any
future features or functionality described for our products remains at our sole
discretion.
Performance is based on measurements and projections using standard IBM
benchmarks in a controlled environment. The actual throughput or performance
that any user will experience will vary depending upon many factors, including
considerations such as the amount of multiprogramming in the user’s job
stream, the I/O configuration, the storage configuration, and the workload
processed. Therefore, no assurance can be given that an individual user will
achieve results similar to those stated here.
3. Agenda
Introduction
Optimizing for disk performance
Optimizing Java for computational performance
Optimizing MapReduce for computational performance
Optimizing with Adaptive MapReduce
Common considerations for InfoSphere BigInsights and InfoSphere Streams
Questions and Answers
4. Prerequisites
To get the most out of this session, you should be familiar
with the basics of the following:
− Hadoop and Streams
− MapReduce
− HDFS or GPFS
− Linux shell
− XML
5. My Team
IBM Information Management Cloud Computing Centre of Competence
− Information Management Demo Cloud
− Deploy complete stacks of IBM software for demonstration and evaluation purposes
imcloud@ca.ibm.com
Images and templates with IBM software for public clouds
IBM SmartCloud Enterprise
IBM SoftLayer
Amazon EC2
6. My Work
Development:
− Ruby on Rails, Python, Bash/KSH shell scripting, Java
IBM SmartCloud Enterprise
− Public cloud
− InfoSphere BigInsights, InfoSphere Streams, DB2
RightScale and Amazon EC2
− Public cloud
− InfoSphere BigInsights, InfoSphere Streams, DB2
IBM PureApplication System
− Private cloud appliance
− DB2
7. Background
BigInsights recommendations are based on my experience
optimizing BigInsights Enterprise 2.1 performance on an
OpenStack private cloud
Streams recommendations are based on my experience
optimizing Streams 3.1 performance on IBM SmartCloud
Enterprise
Some recommendations are based on work with the IBM
Social Media Accelerator to process enormous amounts of
Twitter data using BigInsights and Streams
8. Hadoop Challenges in the Cloud
Hadoop does batch processing of data stored on disk.
The bottleneck is disk I/O.
Infrastructure-as-a-Service clouds have traditionally
focused on uses such as web servers that are optimized
for in-memory operation and have different constraints.
10. Disk Performance
Hadoop performance is I/O bound. It depends on disk
performance.
Hadoop is for batch processing of data stored on disks
Contrast with real-time and in-memory workloads (Streams,
Apache), which depend on memory and processor speed
Infrastructure-as-a-Service clouds (IaaS) were originally
optimized for in-memory workloads, not disk workloads
Cloud disk performance has traditionally been weak due to
virtualization abstraction and network separation between
computational units and storage
Different clouds have different solutions to this
11. Disk Performance – Choice of Cloud
Choice of cloud provider and instance type is crucial
Some cloud providers are worse for Hadoop than others
Favour local storage over network-attached storage (NAS)
− For example, EBS on Amazon tends to be slower than local storage
Options
− SoftLayer and clouds of physical hardware
− Storage-optimized instances on Amazon EC2
− Other public and private clouds that keep storage as close to computational nodes as possible
12. Disk performance – Concepts
Hadoop Distributed File System (HDFS) and General Parallel
File System (GPFS) are both abstractions
HDFS and GPFS run on top of disk filesystems
A disk is a device
A disk is divided into partitions
Partitions are formatted with filesystems
Formatted partitions can be mounted as a directory and used
to store anything
For Hadoop, we want Just-a-Bunch-Of-Disks (JBOD), not
RAID. HDFS has built-in redundancy.
Eschew Linux Logical Volume Manager (LVM).
13. Disk performance – Partitioning
We’ll use /dev/sdb as a sample disk name
Disks greater than 2TB in size require the use of a GUID
Partition Table (GPT) instead of Master Boot Record (MBR)
− parted -s /dev/sdb mklabel gpt
For Hadoop storage, create a single partition per disk
Partition editor can be finicky about where that partition stops and starts
− end=$( parted /dev/sdb print free -m | grep sdb | cut -d: -f2 )
− parted -s /dev/sdb mkpart logical 1 $end
If you were working with disk /dev/sdb, you will now have a partition called /dev/sdb1
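The grep/cut pipeline above is easier to follow against a concrete example. The sketch below runs the same extraction over a simulated `parted -m` machine-readable output line (the sample disk string is an assumption; real output varies by parted version):

```shell
# Simulated machine-readable `parted -m` output; field 2 of the /dev/sdb line
# is the total disk size, which we use as the partition end
sample='BYT;
/dev/sdb:10737MB:scsi:512:512:gpt:Sample Disk;'
end=$(printf '%s\n' "$sample" | grep sdb | cut -d: -f2)
echo "$end"
```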
14. Disk performance – Formatting
Many options: ext4, ext3, xfs
xfs is not included in base Red Hat Enterprise Linux (RHEL),
so assume ext4
− mkfs -t ext4 -m 1 -O dir_index,extent,sparse_super /dev/sdb1
“-m 1” reduces the number of filesystem blocks reserved for
root to 1%. Hadoop does not run as root.
“dir_index” makes listing files in a directory faster. Instead of
using a linked list, the filesystem will use a hashed B-tree.
“extent” makes the filesystem faster when working with large
files. HDFS divides data into blocks of 64MB or more, so
you’ll have many large files.
“sparse_super” saves space on large filesystems by keeping
fewer backups of superblocks. Big Data processing implies
large filesystems.
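As a quick illustration of what “-m 1” buys you, the arithmetic below estimates the space reclaimed versus the 5% ext4 default on a hypothetical 1 TB filesystem:

```shell
# Reserved-blocks saving from `-m 1` versus the 5% ext4 default
# on a hypothetical 1 TB (1000 GB) filesystem
size_gb=1000
echo "$(( size_gb * (5 - 1) / 100 )) GB reclaimed for Hadoop data"
```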
15. Disk performance – Mounting
Before you can access a partition, you have to mount it in an
empty directory
− mkdir -p /disks/sdb1
− mount -o noatime,nodiratime /dev/sdb1 /disks/sdb1
“noatime” skips writing file access time to disk every time a file is accessed
“nodiratime” does the same for directories
In order for the system to re-mount your partition after reboot, you also have to add it to the /etc/fstab configuration file
− echo "/dev/sdb1 /disks/sdb1 ext4 defaults,noatime,nodiratime 1 2" >> /etc/fstab
16. HDFS Data Storage on Multiple Partitions
Don’t forget that you can spread HDFS across multiple
partitions (and so disks) on a single system
In the cloud, the root partition / is usually very small. You
definitely don’t want to store Big Data on it.
Don’t use the root of a mounted filesystem (e.g. /disks/sdb1)
as the data path. Create a subdirectory (e.g.
/disks/sdb1/data)
− mkdir -p /disks/sdb1/data
Otherwise, HDFS will get confused by things Linux puts in
the root (e.g. /disks/sdb1/lost+found)
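A minimal sketch of preparing several disks at once, assuming hypothetical partition names sdb1 and vdc1. It builds the comma-separated path list used by the HDFS configuration on the following slides (the mkdir is commented out so the sketch runs anywhere):

```shell
# Hypothetical partitions; create a data subdirectory on each and build the
# comma-separated value later used for <hdfs-data-directory /> / dfs.data.dir
dirs=""
for part in sdb1 vdc1; do
  # mkdir -p "/disks/$part/data"   # uncomment on a real data node
  dirs="${dirs:+$dirs,}/disks/$part/data"
done
echo "$dirs"
```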
17. HDFS Data Storage – Installation and Timing
You can set HDFS data storage path during installation or
after installation.
BigInsights has a fantastic installer for Hadoop – it offers both a web-based graphical installer and a powerful silent install driven by a response file.
The web-based graphical installer will generate a silent-install response file for you for future automation.
BigInsights also comes with sample silent install response
files.
18. HDFS Data Storage – During installation
During installation, HDFS data storage path is controlled by
the values of <hdfs-data-directory /> and <data-directory />
For example:
− <cluster-configuration>
    <hadoop><datanode><data-directory>
      /disks/sdb1/data,/disks/vdc1/data
    </data-directory></datanode></hadoop>
    <node-list><node><hdfs-data-directory>
      /disks/sdb1/data,/disks/vdc1/data
    </hdfs-data-directory></node></node-list>
  </cluster-configuration>
19. HDFS Data Storage – During Installation (2)
Multiple paths are separated by commas
Any path with an omitted initial / is considered relative to the
installation’s <directory-prefix />
If <directory-prefix/> is “/mnt”, then the <hdfs-data-directory/>
“hadoop/data” would be interpreted as “/mnt/hadoop/data”
You can mix relative and absolute paths in the comma-separated list of directories
20. HDFS Data Storage – After Installation
You can change the path of HDFS data storage after
installation
Path is controlled by dfs.data.dir variable in hdfs-site.xml
In Hadoop 2.0, dfs.data.dir is renamed to
dfs.datanode.data.dir
Note: With BigInsights, never modify configuration files in
$BIGINSIGHTS_HOME/hadoop-conf/ directly
− Modify $BIGINSIGHTS_HOME/hdm/hadoop-conf-staging/hdfs-site.xml
− Then run syncconf.sh to apply the configuration setting across the cluster
echo 'y' | syncconf.sh hadoop force
Note: Never reformat data nodes in BigInsights. Reformatting
will erase BigInsights libraries from HDFS.
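A small sketch of checking the current value in an hdfs-site.xml-style file with sed before editing the staging copy. The file path and values here are illustrative, not taken from a real installation:

```shell
# Write an illustrative hdfs-site.xml fragment, then extract dfs.data.dir's value
cat > /tmp/hdfs-site-demo.xml <<'EOF'
<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/disks/sdb1/data,/disks/vdc1/data</value>
  </property>
</configuration>
EOF
sed -n 's|.*<value>\(.*\)</value>.*|\1|p' /tmp/hdfs-site-demo.xml
```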
21. HDFS Namenode Storage
The Namenode of a Hadoop cluster stores the locations of all
the files on the cluster
During installation, the path of this storage is determined by
the value of <name-directory />
After installation, the path of namenode storage is determined by the value of the dfs.name.dir variable in hdfs-site.xml
You can separate multiple locations with commas
In Hadoop 2.0, dfs.name.dir is renamed to
dfs.namenode.name.dir
23. Java and Computational Performance
BigInsights and Hadoop are Java-based
Configuring the Java Virtual Machine (JVM) correctly is crucial to processing of Big Data in Hadoop
Correct JVM configuration depends on both the machine as
well as the type of data
BigInsights has a configuration preprocessor that will easily
size the configuration to match the machine
24. Java and Computational Performance
Note: Never modify mapred-site.xml in
$BIGINSIGHTS_HOME/hadoop-conf/ directly
Modify mapred-site.xml in
$BIGINSIGHTS_HOME/hdm/hadoop-conf-staging/
Run syncconf.sh to process the calculations and apply the
new configuration to the cluster
25. Java and Computational Performance
A key property for performance is the amount of memory
allocated to each Java process or task
Keep in mind many tasks will be running at the same time,
and you’ll want them all to fit within available machine
memory with some margin
A good value for many use cases is 600m
− <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx600m</value>
  </property>
When working with the IBM Social Media Accelerator, you’ll
want much more memory per task. 4096m or more is
common, with implications for size of machine expected.
Note: Do not enable -Xshareclasses. This was a bad default
in older BigInsights releases.
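As a rough sanity check of the 600m figure, the arithmetic below totals the task heap for a machine running 16 mappers and 8 reducers (counts borrowed from the sizing table later in the deck):

```shell
# Rough memory-budget check: 16 mappers + 8 reducers at -Xmx600m each
tasks=$(( 16 + 8 ))
heap_mb=600
echo "$(( tasks * heap_mb )) MB of task heap"
```

At 14400 MB of heap, such a machine fits comfortably within 61 GB of RAM with margin for the OS, the TaskTracker, and the DataNode.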
26. Java and Computational Performance –
Streams
Streams and Streams Studio are Java applications
You can increase the amount of memory allocated to the
Streams Web Server (SWS) as follows, where X is in
megabytes:
− streamtool stopinstance --instance-id myinstance
− streamtool setproperty --instance-id myinstance SWS.jvmMaximumSize=X
− streamtool startinstance --instance-id myinstance
You can increase the amount of memory for Streams Studio
in <install-directory>/StreamsStudio/streamsStudio.ini
− After -vmargs, add -Xmx1024m or similar
27. MapReduce and Computational Performance
Hadoop traditionally uses the MapReduce algorithm for
processing Big Data in parallel on a cluster of machines
Each machine runs a certain number of Mappers and
Reducers
A Hadoop Mapper is a task that splits input data into intermediate key-value pairs
A Hadoop Reducer is a task that reduces a set of intermediate key-value pairs with a shared key to a smaller set of values
28. MapReduce and Computational Performance
You’ll want more than one reduce task per machine, with both the number of available cores and the amount of available memory constraining the number you can have
The 600 denominator in the sizing formulas comes from the value for JVM memory in mapred.child.java.opts
− <property>
    <name>mapred.reduce.tasks</name>
    <value><%= Math.ceil(numOfTaskTrackers * avgNumOfCores * 0.5 * 0.9) %></value>
  </property>
29. MapReduce and Computational Performance
Map tasks and reduce tasks use the machine differently. Map
tasks will fetch input locally, while reduce tasks will fetch
input from the network. They will run at the same time.
Running more tasks than will fit in a machine’s memory will
cause tasks to fail.
Set the number of map tasks per machine to roughly the number of available processor cores, capped by available memory
− <name>tasktracker.map.tasks.maximum</name>
− <value><%= Math.min(Math.ceil(numOfCores * 1.0),Math.ceil(0.8*0.66*totalMem/600)) %></value>
Set the number of reduce tasks per machine to half the number of map tasks
− <name>tasktracker.reduce.tasks.maximum</name>
− <value><%= Math.min(Math.ceil(numOfCores * 0.5),Math.ceil(0.8*0.33*totalMem/600)) %></value>
30. MapReduce and Computational Performance
Cloud machine size    Number of mappers    Number of reducers
1 core, 2GB           1                    1
1 core, 4GB           1                    1
2 core, 8GB           2                    1
4 core, 15GB          4                    2
16 core, 61GB         16                   8
16 core, 117GB        16                   8
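The sizing formulas from the previous slide can be checked against one row of the table. This sketch emulates Math.ceil in awk and evaluates both formulas for a 4-core, 15 GB (15360 MB) machine:

```shell
# Evaluate the mapper/reducer sizing formulas for a 4-core, 15 GB machine;
# expect output matching the table row: 4 mappers, 2 reducers
awk -v cores=4 -v memMB=15360 '
function ceil(x) { return (x == int(x)) ? x : int(x) + 1 }
BEGIN {
  memcap = ceil(0.8 * 0.66 * memMB / 600)
  maps = (cores < memcap) ? cores : memcap
  r1 = ceil(cores * 0.5); r2 = ceil(0.8 * 0.33 * memMB / 600)
  reds = (r1 < r2) ? r1 : r2
  print maps, reds
}'
```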
31. More options in mapred-site.xml
“mapred.child.ulimit” lets you control virtual memory used by Hadoop’s Java processes. 1.5x the size of mapred.child.java.opts is a good value. Note that the value is in kilobytes. If the Java options are “-Xmx600m”, then a good value for the ulimit is 600*1.5*1024, which is “921600”.
“io.sort.mb” controls the size of the output buffer for map
tasks. When it’s 80% full, it will start being written to disk.
Increasing the size of the output buffer will reduce the
number of separate writes to disk. Increasing the size will use
more memory and do less disk I/O.
“io.sort.factor” defines the number of files that can be merged at one time. Merging is done when a map task is complete, and again before reducers start executing your analytic code.
Increasing the size will use more memory and do less disk
I/O.
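The ulimit arithmetic above can be expressed directly in shell:

```shell
# mapred.child.ulimit (KB) = heap size (MB) * 1.5 * 1024
heap_mb=600
echo $(( heap_mb * 3 / 2 * 1024 ))
```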
32. More options in mapred-site.xml (2)
“mapred.compress.map.output” enables compression when writing the output of map tasks. Compression uses more processor capacity but reduces disk I/O. The compression algorithm is determined by
“mapred.map.output.compression.codec”
“mapred.job.tracker.handler.count” determines the size of the
thread pool for responding to network requests from clients
and tasktrackers. A good value is the natural logarithm (ln) of
cluster size times 20. “dfs.namenode.handler.count” should
also be set to this, as it performs the same functions for
HDFS.
“mapred.jobtracker.taskScheduler” determines the algorithm
used for assigning tasks to task trackers. For production,
you’ll want something more sophisticated than the default
JobQueueTaskScheduler.
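As a worked example of the handler-count heuristic, for a hypothetical 50-node cluster (awk's log is the natural logarithm, and printf "%d" truncates the result):

```shell
# Handler-count heuristic: 20 * ln(cluster size) for a 50-node cluster
awk -v nodes=50 'BEGIN { printf "%d\n", 20 * log(nodes) }'
```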
33. Kernel Configuration
Linux kernel configuration is stored in /etc/sysctl.conf
“vm.swappiness” controls kernel’s swapping of data from
memory to disk. You’ll want to discourage swapping to disk,
so 0 is a good value.
“vm.overcommit_memory” allows more memory to be
allocated than exists on the system. If you experience
memory shortages, you may want to set this to 1 as the way
the JVM spawns Hadoop processes will have them request
more memory than they need. Further tuning is done through
“vm.overcommit_ratio”.
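The two settings above can be captured as a small config fragment. This sketch only prints the lines to append to /etc/sysctl.conf; apply them as root and reload with `sysctl -p`:

```shell
# Kernel settings for Hadoop nodes: discourage swapping, allow overcommit
printf '%s\n' 'vm.swappiness = 0' 'vm.overcommit_memory = 1'
```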
35. IBM Big Data Platform
[Architecture diagram: IBM InfoSphere BigInsights]
Layers: Visualization & Discovery, Applications & Development, Administration, Integration, Advanced Analytic Engines, Workload Optimization, Runtime / Scheduler, Data Store, File System, Management, Security
Components: BigSheets, Apps, Workflow, Dashboard & Visualization, Text Analytics, Pig & Jaql, MapReduce, Hive, Admin Console, Monitoring, JDBC, Netezza, R, Text Processing Engine & Extractor Library, Adaptive Algorithms, DB2, Streams, Integrated Installer, Enhanced Security, Splittable Text Compression, Adaptive MapReduce, ZooKeeper, Oozie, Jaql, Flexible Scheduler, Lucene, Pig, Index, Symphony, Symphony AE, DataStage, HCatalog, Guardium, Platform Computing, Cognos, Audit & History, HBase, Flume, Lineage, HDFS, Sqoop, GPFS FPO
Legend: Open Source, IBM, Optional
36. Adaptive MapReduce
Adaptive MapReduce lets mappers communicate through a distributed metadata store and take into account the global state of the job
Open the install.properties file before you install BigInsights
To enable Adaptive MapReduce, set the following:
− AdaptiveMR.Enable=true
To also enable High Availability, set the following:
− AdaptiveMR.HA.Enable=true
High Availability requires a minimum number of nodes in your cluster
Adaptive MapReduce is a single-tenant implementation of
IBM Platform Symphony
38. Common Considerations
Both BigInsights and Streams rely on working with large
numbers of open files and running processes
Raise the Linux limit on the number of open files (“nofile”) to
131072 or more in /etc/security/limits.conf
Raise the Linux limit on the number of processes (“nproc”) to
unlimited in /etc/security/limits.conf
Remove RHEL forkbomb protection from
/etc/security/limits.d/90-nproc.conf
Validate your changes with a fresh login as your BigInsights
and Streams users (e.g. biadmin, streamsadmin) and the
ulimit command
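A sketch of the corresponding limits.conf entries. The user names biadmin and streamsadmin are the typical defaults mentioned above and may differ on your system; the sketch prints the lines rather than writing them, leaving the append (as root) to you:

```shell
# Lines to append to /etc/security/limits.conf for BigInsights and Streams users
printf '%s\n' \
  'biadmin       -  nofile  131072' \
  'biadmin       -  nproc   unlimited' \
  'streamsadmin  -  nofile  131072' \
  'streamsadmin  -  nproc   unlimited'
```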
41. Communities
• On-line communities, User Groups, Technical Forums, Blogs, Social
networks, and more
o Find the community that interests you …
• Information Management bit.ly/InfoMgmtCommunity
• Business Analytics bit.ly/AnalyticsCommunity
• Enterprise Content Management bit.ly/ECMCommunity
• IBM Champions
o Recognizing individuals who have made the most outstanding contributions to
Information Management, Business Analytics, and Enterprise Content
Management communities
• ibm.com/champion
42. Thank You
Your feedback is important!
• Access the Conference Agenda Builder to
complete your session surveys
o Any web or mobile browser at
http://iod13surveys.com/surveys.html
o Any Agenda Builder kiosk onsite
Editor’s notes
On this chart, you can get a quick overview of the various open source and IBM technologies provided with BigInsights Enterprise Edition. Open source technologies are shown in yellow, while IBM-specific technologies are shown in blue