Financial market prediction has always been one of the hottest topics in data science and machine learning. However, the prediction algorithm is just a small piece of the puzzle. Building a data stream pipeline that constantly combines the latest price information with high-volume historical data is extremely challenging on traditional platforms, requiring a lot of code and careful thought about how to scale or move to the cloud. This session walks through the architecture and implementation details of an application built on top of open-source tools, demonstrating how to build a stock prediction solution with almost no source code: just a few lines of R and a web interface that consumes data in real time through a RESTful endpoint. The solution leverages in-memory data grid technology for high-speed ingestion, combining real-time data streaming with distributed processing for stock indicator algorithms.
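As a concrete illustration of the "stock indicator algorithms" the abstract mentions, here is a minimal sketch (not the demo code from the talk) of one common indicator, an exponential moving average, updated incrementally as each price tick streams in:

```python
# Illustrative only: a streaming exponential moving average (EMA),
# one kind of stock indicator a pipeline like this might compute.
class StreamingEMA:
    def __init__(self, period):
        # Standard EMA smoothing factor for the given period.
        self.alpha = 2.0 / (period + 1)
        self.value = None

    def update(self, price):
        # Seed with the first observation, then blend each new tick in.
        if self.value is None:
            self.value = price
        else:
            self.value = self.alpha * price + (1 - self.alpha) * self.value
        return self.value

ema = StreamingEMA(period=3)   # alpha = 0.5
for p in [10.0, 12.0, 11.0]:
    latest = ema.update(p)
```

Because each update touches only one stored value, an indicator like this distributes naturally: every node in a grid can maintain EMAs for its own partition of symbols.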
9. Why so hard?
• Hard to add new data sources
• Hard to scale
• Hard to make it real-time
10. HDFS Data Lake
Store / Analytics
• Hard to change
• Labor intensive
• Inefficient
• No real-time information
• ETL-based
• Data-source specific
Traditional models are reactive and static
11. HDFS Data Lake
Expert System / Machine Learning
In-Memory Real-Time Data
Continuous Learning
Continuous Improvement
Continuous Adapting
Data Stream Pipeline
Multiple Data Sources
Real-Time Processing
Store Everything
Stream-based, real-time closed-loop analytics are needed
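The "continuous learning / continuous adapting" loop above can be made concrete with a toy example. Here the "model" is a single weight fit by online gradient descent; this is only an illustration of the closed-loop idea (predict, observe the real outcome, adapt immediately), not the deck's actual algorithm:

```python
# A toy closed loop: predict, observe the real outcome, update the model.
# The model (one weight, online gradient descent) is made up for
# illustration; the point is that learning happens inside the stream.
def closed_loop(pairs, lr=0.1):
    w = 0.0
    for x, y in pairs:
        pred = w * x          # score the current input
        err = pred - y        # compare with what actually happened
        w -= lr * err * x     # adapt immediately, then keep streaming
    return w

w = closed_loop([(1.0, 2.0)] * 50)   # converges toward y = 2x
```

A static, ETL-based model would instead be retrained offline and redeployed; the closed-loop version improves with every event it processes.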
12. Info
Analysis
Look at past trends
(for similar input)
Evaluate current input
Score / Predict
Neural Network
How can it be addressed?
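The flow on this slide — look at past trends for similar input, evaluate the current input, then score — can be sketched without any ML framework. The deck uses a neural network; in this self-contained sketch a simple nearest-neighbor lookup stands in for the learned model:

```python
# A minimal sketch of "find similar past input, then score/predict".
# Nearest-neighbor matching stands in for the deck's neural network.
def predict_next(history, window):
    """Find past windows most similar to `window` and average what followed."""
    n = len(window)
    candidates = []
    for i in range(len(history) - n):
        past = history[i:i + n]
        dist = sum((a - b) ** 2 for a, b in zip(past, window))
        candidates.append((dist, history[i + n]))
    # Average the outcomes of the 2 closest matches.
    candidates.sort(key=lambda c: c[0])
    best = candidates[:2]
    return sum(next_val for _, next_val in best) / len(best)

history = [1.0, 2.0, 3.0, 1.0, 2.0, 3.5, 1.0, 2.0]
prediction = predict_next(history, [1.0, 2.0])
# the closest past windows were followed by 3.0 and 3.5
```

A neural network replaces the explicit similarity search with learned weights, but the input/score contract is the same, which is what makes the model swappable inside the pipeline.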
20. Ingest Transform Sink
SpringXD
Store / Analyze
Fast Data
Distributed Computing
Predict / Machine Learning
Other Sources and Destinations
JMS
Streaming real-time analytics architecture
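The Ingest → Transform → Sink chain in this architecture is expressed in Spring XD as a stream definition. A hypothetical example (the stream name, port, and SpEL expressions are made up for illustration; `http`, `filter`, `transform`, and `log` are built-in Spring XD modules):

```
xd:> stream create --name tickdemo --definition "http --port=9000 | filter --expression=payload.contains('AAPL') | transform --expression=payload.toUpperCase() | log" --deploy
```

Swapping `log` for an HDFS or Kafka sink changes the destination without touching the rest of the pipeline, which is the "little or no coding" point the next slide makes.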
23. SpringXD
INGEST / SINK
• Little or no coding required
• Dozens of built-in connectors
• Seamless integration with Kafka, Sqoop
• Create new connectors easily using Spring
PROCESS
• Call Spark, Reactor or RxJava
• Built-in configurable filtering, splitting and transformation
• Out-of-box configurable jobs for batch processing
ANALYZE
• Import and invoke PMML jobs easily
• Call Python, R, MADlib and other tools
• Built-in configurable counters and gauges
Data Stream Pipelining
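The pipelining pattern itself — independent stages composed into a stream — can be modeled with plain Python generators (an illustration of the Ingest → Transform → Sink shape, not of Spring XD's internals):

```python
# Illustrative only: the ingest | transform | sink pattern from the
# slides, modeled with plain Python generators.
def ingest(ticks):
    """Source: emit raw tick strings, e.g. 'AAPL,101.5'."""
    for tick in ticks:
        yield tick

def parse(stream):
    """Processor: split 'SYMBOL,price' into (symbol, float)."""
    for tick in stream:
        symbol, price = tick.split(",")
        yield symbol, float(price)

def sink(stream):
    """Sink: collect results (a real sink would write to HDFS, Kafka, ...)."""
    return list(stream)

result = sink(parse(ingest(["AAPL,101.5", "GOOG,530.0"])))
```

Each stage only knows its input and output, so stages can be added, replaced, or distributed independently — the same property the pipe (`|`) gives Spring XD stream definitions.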
40. Follow-up: In-Memory Unconference
"A place for all things in-memory: projects, people, ideas, roadmaps, discussions."
Location: Hill Country A/B
Weds 4:15pm - 6pm. (after this talk)
The demo code is on GitHub!
@fredmelo_br
@william_markito