This document discusses ideas and technologies for building scalable software systems and processing big data. It covers:
1. A bi-modal distribution of developers shapes architecture and design, and the choice between loosely and tightly coupled code.
2. Internet companies like Google and Facebook innovate at large scale using open source tools and REST architectures.
3. A REST architecture allows scalability, extensible development, and integration of tools/ideas from the internet for non-internet applications.
2. John D. Almon
• Full-stack software engineer
• Implemented RTM on GPU using MPI
• Implemented cloud-based WEM using SOA
• Terabyte-scale database design and data warehousing
• Architected hybrid web interpretation and processing system
• C++, Java, MPI, C, Oracle PL/SQL, HTML, web-based systems, XML
• Managed software team
• Currently serves as CEO of Advanced Seismic Technologies
4. Small HPC setup - Guess what company
• Fiber optic to every desktop using HPC grid
• 400 terabytes of storage
• 300 x 10 GbE ports
• 1,500 x 1 GbE ports
• Desktop workstations automatically added to HPC grid after hours
• 5,000 AMD processors + 3,000 desktop processors at night
7. Monsters University
• 100 Million CPU hours
• 5.5 million individual hairs
• 127 simulated garments
• Global illumination ray tracing
8. Key point #1
Perhaps we can learn new techniques from
other industries that operate at scale
10. Bi-Modal Distribution of Developers
This shapes architecture and design innovation
• One mode: loosely coupled code, fast hardware, open source
• Other mode: closely coupled code, slow hardware, more optimization
• The geoscience gap: massive hardware changes
11. Better compilers and cheaper hardware have
changed everything about software development
• No more Fortran ( sort of )
• Object-oriented approach
• Teenage internet billionaires
12. Software access patterns affect memory
speed ( and are affected by data and users )
• Word size affects memory bandwidth
• Temporal locality and spatial locality can affect bandwidth
13. Memory Mountain software code
#define MAXELEMS (1 << 24)
double data[MAXELEMS];   /* global array that the test reads */

/* Iterate over the first "elems" elements of array "data"
 * with stride "stride". */
void test(int elems, int stride)
{
    int i;
    double result = 0.0;
    volatile double sink;

    for (i = 0; i < elems; i += stride)
        result += data[i];
    sink = result; /* so the compiler doesn't optimize away the loop */
}
14. Everything is a cache ( memory hierarchy )
• Register, ~2 ns
• Primary cache, ~4-5 ns
• Secondary cache, ~30 ns
• Main memory, ~220 ns
• SSD, ~100 µs
• Magnetic disk, ~3 ms
• File server on Gigabit Ethernet
• Cloud
For the on-board tiers the bottleneck is the memory bus; for the file
server and cloud the bottleneck is the network
15. New Paradigm for Optimization of Compute
at Cluster / Cloud level
• Pre-sorting / caching of data for maximum
throughput
• Heuristic analysis at the application level
• Optimization of hardware resources determined by
the application
• Hardware switching based on access patterns of
application and user
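One way to read the pre-sorting bullet above: if pending reads are reordered by file and offset before dispatch, sequential access replaces random seeks. A minimal sketch in Python, with purely hypothetical volume names and offsets:

```python
# Pending (volume, byte_offset) read requests, arriving in random order.
# Names and offsets are illustrative only.
requests = [("vol2", 900), ("vol1", 10), ("vol2", 100), ("vol1", 50)]

# One simple heuristic: group by volume, then sort by ascending offset,
# so each file is read sequentially instead of with random seeks.
batched = sorted(requests)

print(batched)
# [('vol1', 10), ('vol1', 50), ('vol2', 100), ('vol2', 900)]
```

A real scheduler would weigh more signals (user priority, cache residency, link contention), but the principle is the same: reorder work to match the storage layout before it hits the hardware.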
16. All developers are:
( artists | engineers | brilliant | clueless )
• There is no one right way to build a piece of software
• Heterogeneous development staff builds heterogeneous
solutions
• What about UI / UX ( User Interface / User Experience )?
• Business workflows should drive UI / UX
• Steve Jobs was tyrannical about every detail fitting into his one
overarching product vision
23. • $50 billion in revenue
• 30,000+ employees
• Optimization throughout the entire stack
• Google File System, operating system, Chrome
• 2,000,000 servers
• Free food to keep their developers working long
hours
25. Google tools
• Google Hangouts - collaboration
• Google Maps
• Google Compute Engine
• Google BigQuery
27. • $1 billion data center in Iowa
• 450,000 servers
• API-first development strategy
• Supports multiple interface connectivity using
“RESTful” applications
• Competes on UI / UX
• Creates user lock-in through iterative conditioning
28. Iterative conditioning
• Workflows are hard to learn
• You should need software training to learn how to use software
• Software fatigue
• Switching cost
• Adoption rates
• Advanced features
• Tracking all of this drives dynamic menus and configuration
29. Facebook tools and contributions
• Apache Cassandra ( big data database, linear
scalability )
• Apache Thrift ( cross-language services )
Architecture choices provide insight … still have to
implement for the specifics of Oil and Gas
30. Open Source Licensing
• MIT X11 License – ANY use permissible
• BSD – nearly identical to MIT X11
• GPL – no linking from closed-source code
• LGPL – linking allowed
• Appliances – ethical versus legal
Must read the fine print before using, but these frameworks and
implementations can save a very large amount of time where
possible
34. Representational State Transfer
• 6 constraints
• Client Server – clients are not concerned with data storage
• Stateless – server does not store client context
• Cacheable – client stores responses
• Layered system – client does not know if it is at end server or intermediary
• Optional code on demand – the client downloads code and runs it
• Uniform interface – decouples interface and allows each part to evolve
independently
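The client-server and stateless constraints above can be made concrete with a toy dispatcher: every request carries all the context the server needs, and the server keeps no per-client state between calls. This is an illustrative sketch only; the resource names and data are invented, not from the deck:

```python
import json

# Toy in-memory resource store; contents are invented for illustration.
VELOCITY_MODELS = {"gulf-01": {"cells": 1000000, "units": "m/s"}}

def handle(method, path):
    """Stateless REST-style dispatch: route (method, path) to (status, body).

    No client context is stored between calls, and every resource is
    addressed by a uniform /models/<id> path.
    """
    parts = path.strip("/").split("/")
    if method == "GET" and len(parts) == 2 and parts[0] == "models":
        model = VELOCITY_MODELS.get(parts[1])
        if model is None:
            return 404, json.dumps({"error": "not found"})
        return 200, json.dumps(model)
    return 405, json.dumps({"error": "unsupported"})

status, body = handle("GET", "/models/gulf-01")
print(status)  # 200
```

Because the interface is uniform and stateless, any client (a browser tab, a script, another service) can issue the same request and get the same answer, with no session to coordinate.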
36. Simplified REST
Web Browser → Web Server → Database, File Servers, Compute Engine ( via REST API )
• The presentation layer can’t handle geoscience or local compute
• The web server has the majority of control
38. REST with Mashup
Web Browser → Web Server 1 + Web Server 2 → Database, File Servers, Compute Engine ( via REST API )
• The presentation layer can mash up data from two separate sources
39. REST with new application layer
Form window, OpenGL window, Web Browser → Application → Database, File Servers, Compute Engine, Web Server 2 ( via REST API )
40. Internet architecture / legacy-style code
• REST architecture for NON-INTERNET
applications
• Can keep inside corporate networks
• Distributed systems architecture
• Predominant web API design model
• Allows for a distributed development team
• Separates the data model from the view model
• But allows for computation on either side
42. Client Server
• FINALLY !! Interactive HPC apps made easy
• Our tabs are the client’s connection to the application
layer via a “REST”-style API
• Application layer provides caching and file system
access
• Application layer provides access to heterogeneous
compute
43. Stateless
• Each tab does not know about other tabs
• This makes it possible for developers from different
teams and disciplines to very quickly work
independently
• Application layer provides synchronization states
• Application layer provides for off-workstation
transferability ( work from iPad on the Beach )
44. Cacheable
• Heuristic data sorting and precaching based on user /
algorithm needs
• Allows for compute distribution without presentation layer
needing to know
• Allows for disparate file systems
• Abstracts data location from user
• Communicate with HPC grid in more advanced manner
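At the application layer, the simplest form of the caching described above is memoizing expensive reads so a repeated request never hits the backend twice. A sketch using Python’s standard `functools.lru_cache`; the function and survey names are hypothetical:

```python
from functools import lru_cache

backend_reads = {"count": 0}  # counts how often the "slow" backend is hit

@lru_cache(maxsize=128)
def fetch_slice(volume_id, inline):
    """Stand-in for an expensive file-system or HPC-grid read."""
    backend_reads["count"] += 1
    return "trace-data:%s:%d" % (volume_id, inline)

fetch_slice("survey-A", 400)
fetch_slice("survey-A", 400)  # identical request: served from the cache
print(backend_reads["count"])  # 1
```

The presentation layer calls `fetch_slice` identically either way, which is the point: it never needs to know whether the data came from cache, a local disk, or the grid.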
45. Layered System
• Allows for use of 3rd party plugins
• Allows EVERY application to connect to the HPC grid
• Graphics as plugins
• Workflows as plugins - dynamic workflow
• No menu on Amazon
• Optimize each layer independently
46. Code on demand
• Safer, since security is controlled by the application layer
• Sandbox each user and only grant access with additional security
credentials
• Can download and run legacy code through P/Invoke
• DLL injection
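Python’s `ctypes` offers a rough analogue of P/Invoke: loading a native shared library at runtime and calling into it. A minimal sketch using the system C math library as a stand-in for legacy native code (assumes a Unix-like system where `find_library("m")` resolves):

```python
import ctypes
import ctypes.util

# Locate and load the system C math library at runtime. Calling legacy
# native code this way is the rough Python analogue of .NET's P/Invoke.
libm = ctypes.CDLL(ctypes.util.find_library("m"))

# Declare the C signature so values cross the boundary correctly.
libm.cos.restype = ctypes.c_double
libm.cos.argtypes = [ctypes.c_double]

print(libm.cos(0.0))  # 1.0
```

The same mechanism would let an application layer expose vetted entry points of legacy libraries behind its own security checks, rather than handing clients raw access.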
47. Uniform Interface
• HTML for cross platform consistency
• User adoption and ease of use
• Internet-style decoupling of functionality from
graphics creates a better user experience and a more
intuitive style of workflow
• Most graphic designers do NOT know C++
• Geoscientists won’t always agree on color scheme,
styles, icons
48. Most important benefits
• More flexibility means rapid application development and easier
maintenance
• Presentation-layer needs change as business requirements change
over time
• Hooking into outside tools that have REST APIs
• Data
• Social
• Compute engines
• Mashups
49. Key point #4
A REST architecture enables scalability,
extensible development, and mashup of
tools and ideas created for the Internet
52. Google BigQuery
• The underlying technology is called Dremel
• Uses the Google File System as the abstraction underneath the database
• Dremel can even execute a complex regular-expression text match on a huge
logging table of about 35 billion rows and 20 TB in merely tens of
seconds
53. Cassandra
• Cassandra provides a structured key-value store with tunable
consistency.
• Keys map to multiple values, which are grouped into column families.
The column families are fixed when a Cassandra database is created,
but columns can be added to a family at any time.
• Furthermore, columns are added only to specified keys, so different
keys can have different numbers of columns in any given family.
• The values from a column family for each key are stored together.
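The wide-row layout described above can be mimicked in a few lines of plain Python, just to make the data model concrete. This is an illustrative toy, not the Cassandra API, and the keys and columns are invented:

```python
# A "column family" maps each key to its own dict of columns, so
# different keys can carry different columns -- mirroring Cassandra's
# wide-row data model. Toy in-memory stand-in only.
users = {}  # column family "users"

def put(cf, key, column, value):
    """Add a column to one key; other keys are unaffected."""
    cf.setdefault(key, {})[column] = value

put(users, "alice", "email", "a@example.com")
put(users, "alice", "city", "Houston")
put(users, "bob", "email", "b@example.com")  # bob never gets a "city" column

print(sorted(users["alice"]))  # ['city', 'email']
```

In real Cassandra the values for each key are also stored physically together on disk, which is what makes reading one key’s whole row cheap.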
54. Palantir
• Does work for government agencies
• High-security layer that sits on top of disparate data sources
• The Palantir Stack Layer
• Brings together structured and unstructured data
• Serves as the foundation for applications using the data API
• Search and discovery layer
• Granular, multi-layered security model
• Revisioning database with original-source tracking
• Collaboration and data editing
55. Ayasdi
• Topological data analysis using machine learning
• Can cross-analyze multiple data sources
• Query-free approach
56. Zoom Data
• Automated connectivity to third party sources
• Visualization studio
• Interactive visualizations
57. WebGL ( OpenGL in the web browser )
• Could be used for the presentation layer on a mobile device
http://demos.vicomtech.org/x3dom/test/functional/volrenShaderBoundaryEnh.xhtml
http://ourbricks.com/viewer/178d62ac29aa44459a6d57ce474fa6b6