3. Build Vs. Buy
Build
• No dedicated team to
support infrastructure
• Very specific tasks
• Exclusive use of
infrastructure
• Reasonable scale
Buy
• Product can be bought as a
service (internal or external)
• Large scale
• Multi-tenancy
• You are going to use
advanced features
(e.g. map/reduce)
4. “Casual” computing
• Small computation farms (< 100 servers)
• Team owns both application and grid
• Java platform
• Reasonably short batches (< 24 hours)
• Reasonably small data sets (< 10 TiB)
6. Simple master slave topology
Control plane
RMI
Queue / scheduler
Simple in memory queue
May be more complex than just task queue
Data plane
…
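A rough sketch of the "simple in memory queue" idea from this slide. Everything here is an assumption for illustration: `InMemoryTaskQueue` is a hypothetical name, and local threads stand in for slaves that would pull tasks from the master over RMI in a real grid.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: a master-side in-memory task queue.
// Local threads play the role of slaves; a real grid would hand tasks out over RMI.
final class InMemoryTaskQueue {
    private final BlockingQueue<Callable<String>> tasks = new LinkedBlockingQueue<>();
    private final ConcurrentLinkedQueue<String> results = new ConcurrentLinkedQueue<>();

    void submit(Callable<String> task) {
        tasks.add(task);
    }

    // Drains the queue using the given number of workers and returns all results.
    List<String> runBatch(int workers) {
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            Thread t = new Thread(() -> {
                Callable<String> task;
                // poll() returns null once the queue is empty, which ends the worker
                while ((task = tasks.poll()) != null) {
                    try {
                        results.add(task.call());
                    } catch (Exception e) {
                        results.add("FAILED: " + e);
                    }
                }
            });
            t.start();
            threads.add(t);
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return new ArrayList<>(results);
    }
}
```

Results arrive in completion order, not submission order. A scheduler that is "more complex than just task queue" would add priorities, retries, or task dependencies on top of this.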
7. Data plane
Never, ever, try to send data over RMI
File system
Avoid network mounts!
In-memory key-value
Client side sharding works best
Disk database (RDBMS or NoSQL)
Consider prefetch of data
Direct socket streaming
…
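One possible shape of client-side sharding for an in-memory key-value store, as a minimal sketch. `ShardRouter` and the node addresses are hypothetical; the point is only that the client computes the target node itself, so nothing sits between it and the data.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: the client picks the key-value node itself,
// so no proxy or coordinator sits on the data path.
final class ShardRouter {
    private final List<String> nodes;

    ShardRouter(List<String> nodes) {
        if (nodes.isEmpty()) {
            throw new IllegalArgumentException("at least one node is required");
        }
        this.nodes = nodes;
    }

    // Maps a key to one node; the same key always lands on the same node.
    String nodeFor(String key) {
        // floorMod keeps the index non-negative even for negative hash codes
        int index = Math.floorMod(key.hashCode(), nodes.size());
        return nodes.get(index);
    }

    public static void main(String[] args) {
        // Node addresses below are made up for the example
        ShardRouter router = new ShardRouter(Arrays.asList("kv-1:6379", "kv-2:6379", "kv-3:6379"));
        System.out.println("trade-42 -> " + router.nodeFor("trade-42"));
    }
}
```

Plain modulo hashing remaps most keys when the node list changes. For a batch whose node set is fixed up front, that is usually acceptable.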
10. Brute force
Build / package
Deploy / SCP
Restart slaves
Start batch
Change code, repeat
Deployment problem
Computation grid software
Compile and run batch
Behind the scenes
Your classes would be collected
Associated with batch
Deployed on participating slaves
13. Flavors of parallel processing
Flow organized tasks
• Input data available before
task starts
• e.g. Map/Reduce
Collaborative tasks
• Tasks communicate
intermediate results to each
other
• e.g. physics simulations
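To make the "flow organized" flavor concrete, here is a toy in-process word count in Map/Reduce style. It is only a sketch: `WordCountFlow` is a hypothetical name, and the map calls run in a plain loop where a grid would send each chunk to a different slave.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a flow organized batch: every task has its full
// input before it starts, and tasks never talk to each other while running.
final class WordCountFlow {

    // Map phase: each task sees only its own chunk of input.
    static Map<String, Integer> map(String chunk) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : chunk.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Reduce phase: starts only once the map outputs exist, then merges them.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new HashMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, count) -> total.merge(word, count, Integer::sum));
        }
        return total;
    }

    static Map<String, Integer> run(List<String> chunks) {
        List<Map<String, Integer>> partials = new ArrayList<>();
        for (String chunk : chunks) {
            partials.add(map(chunk)); // on a grid, each call could run on a different slave
        }
        return reduce(partials);
    }
}
```

Collaborative tasks do not fit this shape, because they need a channel to exchange intermediate results while they run.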
14. Get back to data plane
Rules of thumb
• Insert / delete – never update
• Write locally (reducing risks)
• Read remotely (retry on error)
• Store input as is
File system
Document / column oriented NoSQL
• Input and temporary data are different
Choose the right store for each
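The "read remotely (retry on error)" rule could be sketched as below. `RetryingReader` and its parameters are hypothetical names chosen for the example. Blind retries are reasonable here because of the "insert / delete, never update" rule: a record that is never modified can be re-read safely.

```java
import java.util.concurrent.Callable;

// Hypothetical sketch: retry a remote read a fixed number of times.
// Safe only because stored records are immutable (insert / delete, never update).
final class RetryingReader {

    // Calls the read operation up to maxAttempts times, pausing between failures.
    static <T> T readWithRetry(Callable<T> read, int maxAttempts, long pauseMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return read.call();
            } catch (Exception e) {
                last = e; // remember the failure and try again
                if (attempt < maxAttempts) {
                    pause(pauseMillis);
                }
            }
        }
        throw new IllegalStateException("read failed after " + maxAttempts + " attempts", last);
    }

    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted during retry pause", e);
        }
    }
}
```

Writes get no such helper on purpose: writing locally removes the network from the failure path, so there is less to retry.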
15. Exploiting file system
Avoid network file systems
• File system concept is not designed to be distributed
• A good network file system cannot exist
• Use simple remote file access protocols
• SCP (unencrypted data transfer options added by CERN guys)
• HTTP (if you really do not want SCP)
A cheap SAN can be built from open source
16. Algorithmic optimization
Parallel computing
• N times speed up will increase
your OPEX and CAPEX cost by N*lg(N)
Algorithmic optimization
• Up front costs only
• Orders of magnitude optimization opportunities
• Exciting coding
• Ecological way of computing
17. Streaming algorithms
Finding N most frequent elements
• Count-Min sketch
Estimating number of unique values
• HyperLogLog
Distribution histograms
https://github.com/addthis/stream-lib
https://github.com/rwl/ParallelColt
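For a feel of how such algorithms trade accuracy for memory, here is a tiny Count-Min sketch, the structure usually used for frequency estimates behind "N most frequent elements". This is a self-contained toy, not the stream-lib implementation: `TinyCountMinSketch` is a hypothetical class, and its hash mixing is a simplification rather than a properly independent hash family.

```java
import java.util.Random;

// Hypothetical toy Count-Min sketch: approximate per-item counts in fixed memory.
// Estimates never undercount; they may overcount when different items collide.
final class TinyCountMinSketch {
    private final int depth;
    private final int width;
    private final long[][] table;
    private final int[] seeds;

    TinyCountMinSketch(int depth, int width, long seed) {
        this.depth = depth;
        this.width = width;
        this.table = new long[depth][width];
        this.seeds = new int[depth];
        Random rnd = new Random(seed);
        for (int row = 0; row < depth; row++) {
            seeds[row] = rnd.nextInt();
        }
    }

    // One cheap hash per row, derived from hashCode() and a per-row seed.
    private int bucket(String item, int row) {
        int h = item.hashCode() ^ seeds[row];
        h ^= (h >>> 16);
        h *= 0x85ebca6b;
        h ^= (h >>> 13);
        return Math.floorMod(h, width);
    }

    // Increments one counter in every row.
    void add(String item) {
        for (int row = 0; row < depth; row++) {
            table[row][bucket(item, row)]++;
        }
    }

    // The smallest counter across rows is the least polluted by collisions.
    long estimate(String item) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < depth; row++) {
            min = Math.min(min, table[row][bucket(item, row)]);
        }
        return min;
    }
}
```

Memory use is `depth * width` counters regardless of how many distinct items pass through. Keeping a small heap of the items with the highest estimates turns this into a top-N tracker.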
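19. As easy as …
    import java.net.InetAddress;
    import java.util.concurrent.Callable;
    import org.junit.Test;

    // Cloud and CloudFactory come from the NanoCloud library (imports omitted here)
    public class HelloRemoteWorldTest {

        @Test
        public void hello_remote_world() {
            Cloud cloud = CloudFactory.createSimpleSshCloud();
            // The Callable is shipped to the remote node and executed there
            cloud.node("myserver.acme.com").exec(new Callable<Void>() {
                @Override
                public Void call() throws Exception {
                    String localhost = InetAddress.getLocalHost().toString();
                    System.out.println("Hi! I'm running on " + localhost);
                    return null;
                }
            });
        }
    }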
20. All you need is …
NanoCloud requirements
SSHd
Java (1.6 and above) present
Works through NAT and firewalls
Works on Amazon EC2
Works everywhere where SSH works
21. Master – slave communications
[Diagram labels: Master process; SSH (single TCP) to Slave host; Agent; Slave controllers; Slaves; multiplexed slave streams (std in, std out, std err, diag); RMI (TCP)]