© 2014 Silicon Valley Data Science LLC
All Rights Reserved.
svds.com @SVDataScience
Railroad Modeling at Hadoop Scale
Hadoop Summit
3 June 2014, San Jose, CA
John Akred (@BigDataAnalysis),
Tatsiana Maskalevich (@notrockstar)
Why is a data science &
engineering consulting company
building its own Caltrain app?
• Commuter rail linking San Francisco with San Mateo and Santa Clara counties; ~30 stations
• 118 passenger cars
• 60% of cars are at least 30 years old
• 2014 weekday ridership is 52,019 people
• On-time performance is about 92%
• No reliable real-time status information
• API outage between April 5th and June 2nd
HOW DO WE KNOW IF THE TRAIN IS LATE?
• Direct observation
– We can hear the train horn
– We can see the train when it goes by
• Purpose-built systems
– We can use Caltrain APIs (when working)
• Other signals
– We can check Twitter for delay info or rider comments
SVDS Approach
• Take advantage of the available signals
• Use historical data to make direct and latent observations more useful
• Provide a service that gives users valuable planning and riding features
• Don’t let the perfect be the enemy of the good
DATA RESILIENCY
Stovepipe: One-to-one relationship from data source to product
Hard Failure: If the data source is broken, so is the app.
Multi-sourced: Redundancy of overlapping data sources makes your products more resilient
Graceful Degradation: If a data source breaks, there is a backup and your app continues to function
Production data services abstract the probabilistic integration of overlapping data sources. We call this model a Data Mesh.
[Diagram: Data Sources, Broken Data Sources, Data Services, Products]
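One way to picture graceful degradation in a multi-sourced data service is a weighted average over whichever sources are still reporting. This is only a minimal sketch: the `Reading` type, the source names, the weights, and the averaging rule are all hypothetical and are not taken from the talk.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Reading:
    """One source's estimate of a train's delay (all fields are hypothetical)."""
    source: str
    delay_min: Optional[float]  # None means the source is currently broken
    weight: float               # relative trust in this source, must be > 0


def fuse_delay(readings):
    """Weighted average over the sources that are still reporting.

    Returns (estimate, sources_used). The estimate is None only when every
    source is down, the one case where the service cannot answer.
    """
    live = [r for r in readings if r.delay_min is not None]
    if not live:
        return None, []
    total = sum(r.weight for r in live)
    estimate = sum(r.delay_min * r.weight for r in live) / total
    return estimate, [r.source for r in live]


readings = [
    Reading("caltrain_api", None, 3.0),  # API outage: skipped, not fatal
    Reading("audio", 4.0, 2.0),
    Reading("twitter", 7.0, 1.0),
]
estimate, used = fuse_delay(readings)
print(estimate, used)
```

In a stovepipe design the `None` from the API would be the whole answer; here the remaining sources still produce an estimate.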
Source Signals: Audio, Image, Text, API
Variety, Volume, Velocity
Audio Capture and Ingest
• Microphone connected to a Raspberry Pi: mic -> preamp -> analog-to-digital converter -> USB
• PyAudio running on the Raspberry Pi serializes audio as an array of 2-byte integers.
• Sound data + metadata -> Flume on AWS via flumelogger
• We use FFT + decision trees to detect trains and classify them as express or local based on the whistle sound.
[Diagram: Raspberry Pi, Raw Audio Agent, Raw Audio Agent]
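To make "an array of 2-byte integers" concrete, the sketch below quantizes a synthetic tone to signed 16-bit samples and round-trips it through bytes using only the standard library. It stands in for the microphone and PyAudio; the sample rate, the tone, and the function names are assumptions for illustration, and the byte order is whatever the host uses natively.

```python
import array
import math

SAMPLE_RATE = 8000  # Hz; hypothetical, the slides do not give the capture rate


def tone(freq_hz, n_samples, amplitude=0.5):
    """Synthesize a sine tone as floats in [-1, 1] (stand-in for microphone input)."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
            for i in range(n_samples)]


def to_int16_bytes(samples):
    """Quantize floats to signed 2-byte integers and serialize them."""
    ints = array.array("h", (int(round(s * 32767)) for s in samples))
    return ints.tobytes()


def from_int16_bytes(payload):
    """Inverse of to_int16_bytes: recover the list of 16-bit samples."""
    ints = array.array("h")
    ints.frombytes(payload)
    return ints.tolist()


payload = to_int16_bytes(tone(440.0, 1024))
decoded = from_int16_bytes(payload)
print(len(payload), len(decoded))
```

Each sample costs two bytes, so a 1024-sample window becomes a 2048-byte payload before any metadata is attached.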
Image Capture and Ingest
• wget pulls images from the camera’s built-in server 2-3 times a second and saves them on a local server/NAS
• Flume pushes the image data to our EC2 servers
• OpenCV (Python) is used to detect trains in images
[Diagram: Local Server, Raw Image Agent, Raw Image Agent]
Text Capture and Ingest: Twitter
• Capture all tweets containing the keyword ‘Caltrain’ via the Twitter API
• A Flume agent sends tweets to an Apache Storm topology for processing
• Tweets are parsed and written to HDFS and HBase
• Event detection is based on the baseline number of tweets per hour and on keywords
[Diagram: Twitter APIs, Raw Image Agent]
Caltrain API Data Capturing
• Real-time departure times are available via the 511.org developer APIs
• A Python script collects data once a minute from the 511.org APIs and stores it in HDFS as sequence files using the WebHDFS API.
• A Python script collects data from the Caltrain site, which includes the run #
• The API did not function from April 5th until June 2nd, 2014
[Diagram: 511.Org APIs, data_collector_api.py, Caltrain Webpage, scraper.py]
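As a rough illustration of the WebHDFS side of such a collector, the sketch below only builds the URL for the first step of a WebHDFS CREATE call; it makes no network requests. The namenode host, user name, and directory layout are hypothetical, and it targets plain per-minute files rather than the sequence files mentioned above.

```python
from datetime import datetime, timezone
from urllib.parse import quote, urlencode

# Hypothetical namenode address; 50070 was the usual Hadoop 2 HTTP port.
NAMENODE = "http://namenode.example.com:50070"


def hdfs_path_for(ts):
    """One file per polling minute, partitioned by day (hypothetical layout)."""
    return ts.strftime("/data/caltrain/511/%Y-%m-%d/%H%M.json")


def webhdfs_create_url(path, user="collector", overwrite=False):
    """Build the first-step WebHDFS CREATE URL.

    WebHDFS answers this PUT with a 307 redirect to a datanode; the file
    body is then sent with a second PUT to the redirect location.
    """
    query = urlencode({
        "op": "CREATE",
        "overwrite": "true" if overwrite else "false",
        "user.name": user,
    })
    return f"{NAMENODE}/webhdfs/v1{quote(path)}?{query}"


ts = datetime(2014, 6, 3, 8, 15, tzinfo=timezone.utc)
url = webhdfs_create_url(hdfs_path_for(ts))
print(url)
```

Keying the path on the polling minute keeps each once-a-minute pull in its own file, so a failed pull shows up as a missing file rather than a corrupted one.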
Combining the Signals
[Diagram: Audio Signal Detection, Image Recognition, Text Analysis; STATE of complex system]
[Architecture diagram]
External Data Sources: Twitter API, Caltrain API, Microphone on Raspberry Pi, Web Camera
Agents: Twitter Agent, Caltrain Agent, Sound Agent, Image Agent
Data Platform spouts: Twitter Spout, Caltrain Spout, Sound Spout, Image Spout
Other components: Tweet Parser, Tweets Counter, Sounds Classifier, Train Detector, Schedule Integrator, Event Detector, HDFS Writer, HBase Writer, Event Storage, Alerts, Transmit to APP, Analytics, Dev, MapReduce
Data Science: Audio
Batch:
• Apply FFT to audio data to identify a train based on the fundamental frequencies of its whistle.
• A decision tree is trained to classify trains as local or express based on the minimum and maximum fundamental frequencies (Doppler effect)
Real-Time:
• Execute local / express classifier
• Send data to the Event Detector for APP alerts
• Store results in HBase
• Apply FFT to the audio signal
• Extract min and max fundamental frequencies
[Figure: Histogram of Whistle Frequencies Over a Period of Time; x-axis: Frequency, Hz; y-axis: Frequency Counts]
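The FFT and min/max-frequency steps can be sketched on synthetic data as follows. This is not the team's code: the sample rate, window size, and test tones are made up, the strongest spectral bin is used as a simple proxy for the fundamental, and a hand-written threshold rule stands in for the trained decision tree. The rule assumes a faster pass-by produces a larger Doppler spread between the highest and lowest observed pitch.

```python
import cmath
import math

SAMPLE_RATE = 8192  # Hz; hypothetical, gives 8 Hz bins for a 1024-sample window


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def peak_frequency(window):
    """Frequency (Hz) of the strongest non-DC bin in one audio window."""
    spectrum = fft(window)
    half = len(window) // 2
    k = max(range(1, half), key=lambda i: abs(spectrum[i]))
    return k * SAMPLE_RATE / len(window)


def classify(min_hz, max_hz, spread_threshold=0.05):
    """Hypothetical stand-in for the trained decision tree: a single split
    on the relative spread between max and min observed frequency."""
    spread = (max_hz - min_hz) / min_hz
    return "express" if spread > spread_threshold else "local"


def whistle(freq_hz, n=1024):
    """Synthetic pure tone standing in for a recorded whistle window."""
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]


# Two windows of a synthetic pass-by: higher pitch approaching, lower receding.
peaks = [peak_frequency(whistle(f)) for f in (448.0, 400.0)]
label = classify(min(peaks), max(peaks))
print(peaks, label)
```

A real whistle has harmonics and background noise, so picking the strongest bin would need to be replaced by a proper fundamental-frequency estimate, and the split threshold would come from training data.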
Data Science: Image
Real-Time:
• The ORB algorithm (OpenCV) is used to detect the train in an image
• Results are sent to the Event Detector to identify the train and compare it to the schedule
• The Event Detector updates the APP with the train’s status and alerts if it is late
[Figure: Number of Key-Points That Are the Same in Two Consecutive Images; x-axis: Time (Sec); y-axis: Number of Matching Points]
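ORB produces binary descriptors, and two key-points are usually considered the same when their descriptors differ in only a few bits (Hamming distance). The sketch below counts such matches between two consecutive frames in plain Python. It is a toy stand-in for OpenCV's ORB detector and brute-force Hamming matcher: the 8-bit descriptors, the distance cutoff, and the frame data are invented for illustration, whereas real ORB descriptors are 256 bits long.

```python
def hamming(a, b):
    """Number of differing bits between two binary descriptors stored as ints."""
    return bin(a ^ b).count("1")


def count_matches(desc_prev, desc_curr, max_distance=2):
    """Count descriptors in the previous frame whose nearest neighbour in the
    current frame is within max_distance bits (brute-force matching)."""
    matches = 0
    for d in desc_prev:
        best = min((hamming(d, e) for e in desc_curr), default=None)
        if best is not None and best <= max_distance:
            matches += 1
    return matches


# Toy 8-bit descriptors for two consecutive frames (hypothetical data).
frame_a = [0b10110010, 0b01001101, 0b11110000]
frame_b = [0b10110011, 0b01001101, 0b00001111]

n_same = count_matches(frame_a, frame_b)
print(n_same)
```

Tracking this count frame by frame yields the kind of time series plotted above; how a passing train shifts the count, and what threshold to alert on, would have to be read off real camera data.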
Data Science: Text
Batch:
• Update baseline tweet frequencies for each hour as additional historical data is collected
• Store model parameters in HBase
Real-Time:
• Count tweets as they stream through the topology
• Alert based on frequency deviations from the baseline
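The batch/real-time split for tweet counts can be sketched as a per-hour baseline plus a deviation check. The slides do not say how a deviation is measured, so the rule below (count more than `k` standard deviations above the hourly mean) is an assumption, as are the function names, the sample history, and the tuning values.

```python
from collections import defaultdict
from statistics import mean, pstdev


def hourly_baseline(history):
    """Batch step. history is an iterable of (hour_of_day, tweet_count)
    pairs from past days; returns {hour: (mean, std)}."""
    by_hour = defaultdict(list)
    for hour, count in history:
        by_hour[hour].append(count)
    return {h: (mean(c), pstdev(c)) for h, c in by_hour.items()}


def is_event(hour, count, baseline, k=3.0, min_std=1.0):
    """Real-time step. Flag counts more than k standard deviations above the
    baseline for that hour; k and min_std are hypothetical tuning knobs."""
    mu, sigma = baseline[hour]
    return count > mu + k * max(sigma, min_std)


history = [(8, 10), (8, 12), (8, 14), (17, 20), (17, 22), (17, 24)]
baseline = hourly_baseline(history)
print(is_event(8, 13, baseline), is_event(8, 40, baseline))
```

Keeping the baseline keyed by hour means a busy evening commute is compared with other evenings, not with a quiet mid-morning.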
Baseline Calculation
[Figure; legend: Baseline]
Future Work
• Detect direction of train in image processing
• Use natural language processing on Twitter data for the event detector
• Continue evaluation of analytical frameworks for model computation
• Add observation posts
• Release Caltrain Rider Application
COMING SOON: CALTRAIN RIDER APP
• Find out what train to catch using our ‘Ride Now’ view
• Select a train and see when it should reach each stop in a trip detail view
• For more info: www.svds.com/trains
questions
Yes, We’re Hiring
www.svds.com/join-us
THANK YOU
John @BigDataAnalysis
Tatsiana @notrockstar
22

Más contenido relacionado

La actualidad más candente

Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...Databricks
 
Power BI Streaming Datasets - San Diego BI Users Group
Power BI Streaming Datasets - San Diego BI Users GroupPower BI Streaming Datasets - San Diego BI Users Group
Power BI Streaming Datasets - San Diego BI Users GroupGreg McMurray
 
A Study on New York City Taxi Rides
A Study on New York City Taxi RidesA Study on New York City Taxi Rides
A Study on New York City Taxi RidesCaglar Subasi
 
A study of Data Quality and Analytics
A study of Data Quality and AnalyticsA study of Data Quality and Analytics
A study of Data Quality and AnalyticsAli Habeeb
 
The Future of Data Pipelines
The Future of Data PipelinesThe Future of Data Pipelines
The Future of Data PipelinesAll Things Open
 
FPGA Acceleration of Apache Spark on AWS
FPGA Acceleration of Apache Spark on AWSFPGA Acceleration of Apache Spark on AWS
FPGA Acceleration of Apache Spark on AWSChristoforos Kachris
 
Asynchronous Hyperparameter Search with Spark on Hopsworks and Maggy
Asynchronous Hyperparameter Search with Spark on Hopsworks and MaggyAsynchronous Hyperparameter Search with Spark on Hopsworks and Maggy
Asynchronous Hyperparameter Search with Spark on Hopsworks and MaggyJim Dowling
 
Cruising in data lake from zero to scale
Cruising in data lake from zero to scaleCruising in data lake from zero to scale
Cruising in data lake from zero to scaleJohn Varghese
 
Graph Analytics for big data
Graph Analytics for big dataGraph Analytics for big data
Graph Analytics for big dataSigmoid
 

La actualidad más candente (12)

Resume
ResumeResume
Resume
 
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
 
Power BI Streaming Datasets - San Diego BI Users Group
Power BI Streaming Datasets - San Diego BI Users GroupPower BI Streaming Datasets - San Diego BI Users Group
Power BI Streaming Datasets - San Diego BI Users Group
 
A Study on New York City Taxi Rides
A Study on New York City Taxi RidesA Study on New York City Taxi Rides
A Study on New York City Taxi Rides
 
A study of Data Quality and Analytics
A study of Data Quality and AnalyticsA study of Data Quality and Analytics
A study of Data Quality and Analytics
 
The Future of Data Pipelines
The Future of Data PipelinesThe Future of Data Pipelines
The Future of Data Pipelines
 
StreamSet ETL tool
StreamSet  ETL toolStreamSet  ETL tool
StreamSet ETL tool
 
FPGA Acceleration of Apache Spark on AWS
FPGA Acceleration of Apache Spark on AWSFPGA Acceleration of Apache Spark on AWS
FPGA Acceleration of Apache Spark on AWS
 
Asynchronous Hyperparameter Search with Spark on Hopsworks and Maggy
Asynchronous Hyperparameter Search with Spark on Hopsworks and MaggyAsynchronous Hyperparameter Search with Spark on Hopsworks and Maggy
Asynchronous Hyperparameter Search with Spark on Hopsworks and Maggy
 
Cruising in data lake from zero to scale
Cruising in data lake from zero to scaleCruising in data lake from zero to scale
Cruising in data lake from zero to scale
 
YingqiCV
YingqiCVYingqiCV
YingqiCV
 
Graph Analytics for big data
Graph Analytics for big dataGraph Analytics for big data
Graph Analytics for big data
 

Destacado

From Gust To Tempest: Scaling Storm
From Gust To Tempest: Scaling StormFrom Gust To Tempest: Scaling Storm
From Gust To Tempest: Scaling StormDataWorks Summit
 
Historical occupational classification and stratification schemes (lecture)
Historical occupational classification and stratification schemes (lecture)Historical occupational classification and stratification schemes (lecture)
Historical occupational classification and stratification schemes (lecture)Richard Zijdeman
 
Harry Verwayen, The More You Give The More You Get
Harry Verwayen, The More You Give The More You GetHarry Verwayen, The More You Give The More You Get
Harry Verwayen, The More You Give The More You GetEUscreen
 
Andreas Fickers: Transmedia Storytelling and Media History
Andreas Fickers: Transmedia Storytelling and Media HistoryAndreas Fickers: Transmedia Storytelling and Media History
Andreas Fickers: Transmedia Storytelling and Media HistoryEUscreen
 
Revista Académica de la Universidad La Salle
Revista Académica de la Universidad La SalleRevista Académica de la Universidad La Salle
Revista Académica de la Universidad La SallePérez Esquer
 
Polka Dot Teapots
Polka Dot TeapotsPolka Dot Teapots
Polka Dot Teapotscinty45
 
Dealing with an Upside Down Internet With High Performance Time Series Database
Dealing with an Upside Down Internet  With High Performance Time Series DatabaseDealing with an Upside Down Internet  With High Performance Time Series Database
Dealing with an Upside Down Internet With High Performance Time Series DatabaseDataWorks Summit
 
Marketing de Experiencias Online
Marketing de Experiencias OnlineMarketing de Experiencias Online
Marketing de Experiencias OnlineRicardo Valdés
 
Hadoop: Making it work for the Business Unit
Hadoop: Making it work for the Business UnitHadoop: Making it work for the Business Unit
Hadoop: Making it work for the Business UnitDataWorks Summit
 
White eagle, by robert
White eagle, by robertWhite eagle, by robert
White eagle, by robertbeatusest2
 
An Exploratory Study on the Links between Individual Upcycling, Product Attac...
An Exploratory Study on the Links between Individual Upcycling, Product Attac...An Exploratory Study on the Links between Individual Upcycling, Product Attac...
An Exploratory Study on the Links between Individual Upcycling, Product Attac...Kyungeun Sung
 
Curso gestión estratégica del tiempo 2016
Curso gestión estratégica del tiempo 2016Curso gestión estratégica del tiempo 2016
Curso gestión estratégica del tiempo 2016juansalas
 
Computación para todos (primaria) 5to grado
Computación para todos (primaria)  5to gradoComputación para todos (primaria)  5to grado
Computación para todos (primaria) 5to gradoMarcos Torres
 
陰陽編 太極から八卦へ
陰陽編 太極から八卦へ陰陽編 太極から八卦へ
陰陽編 太極から八卦へreigan_s
 

Destacado (20)

From Gust To Tempest: Scaling Storm
From Gust To Tempest: Scaling StormFrom Gust To Tempest: Scaling Storm
From Gust To Tempest: Scaling Storm
 
Historical occupational classification and stratification schemes (lecture)
Historical occupational classification and stratification schemes (lecture)Historical occupational classification and stratification schemes (lecture)
Historical occupational classification and stratification schemes (lecture)
 
Assignment.econ 260
Assignment.econ 260Assignment.econ 260
Assignment.econ 260
 
Examen de-química-ii
Examen de-química-iiExamen de-química-ii
Examen de-química-ii
 
Harry Verwayen, The More You Give The More You Get
Harry Verwayen, The More You Give The More You GetHarry Verwayen, The More You Give The More You Get
Harry Verwayen, The More You Give The More You Get
 
Andreas Fickers: Transmedia Storytelling and Media History
Andreas Fickers: Transmedia Storytelling and Media HistoryAndreas Fickers: Transmedia Storytelling and Media History
Andreas Fickers: Transmedia Storytelling and Media History
 
Revista Académica de la Universidad La Salle
Revista Académica de la Universidad La SalleRevista Académica de la Universidad La Salle
Revista Académica de la Universidad La Salle
 
Examen de-química-ii-1
Examen de-química-ii-1Examen de-química-ii-1
Examen de-química-ii-1
 
Polka Dot Teapots
Polka Dot TeapotsPolka Dot Teapots
Polka Dot Teapots
 
Dealing with an Upside Down Internet With High Performance Time Series Database
Dealing with an Upside Down Internet  With High Performance Time Series DatabaseDealing with an Upside Down Internet  With High Performance Time Series Database
Dealing with an Upside Down Internet With High Performance Time Series Database
 
Marketing de Experiencias Online
Marketing de Experiencias OnlineMarketing de Experiencias Online
Marketing de Experiencias Online
 
Hadoop: Making it work for the Business Unit
Hadoop: Making it work for the Business UnitHadoop: Making it work for the Business Unit
Hadoop: Making it work for the Business Unit
 
White eagle, by robert
White eagle, by robertWhite eagle, by robert
White eagle, by robert
 
Homework 3
Homework 3Homework 3
Homework 3
 
An Exploratory Study on the Links between Individual Upcycling, Product Attac...
An Exploratory Study on the Links between Individual Upcycling, Product Attac...An Exploratory Study on the Links between Individual Upcycling, Product Attac...
An Exploratory Study on the Links between Individual Upcycling, Product Attac...
 
Examen quimica diciembre
Examen quimica diciembreExamen quimica diciembre
Examen quimica diciembre
 
Curso gestión estratégica del tiempo 2016
Curso gestión estratégica del tiempo 2016Curso gestión estratégica del tiempo 2016
Curso gestión estratégica del tiempo 2016
 
Computación para todos (primaria) 5to grado
Computación para todos (primaria)  5to gradoComputación para todos (primaria)  5to grado
Computación para todos (primaria) 5to grado
 
陰陽編 太極から八卦へ
陰陽編 太極から八卦へ陰陽編 太極から八卦へ
陰陽編 太極から八卦へ
 
Ngss poster
Ngss posterNgss poster
Ngss poster
 

Similar a Railroad Modeling at Hadoop Scale

Beyond a Big Data Pilot: Building a Production Data Infrastructure - Stampede...
Beyond a Big Data Pilot: Building a Production Data Infrastructure - Stampede...Beyond a Big Data Pilot: Building a Production Data Infrastructure - Stampede...
Beyond a Big Data Pilot: Building a Production Data Infrastructure - Stampede...StampedeCon
 
Fast data for fitness 10 nov 2020
Fast data for fitness 10 nov 2020Fast data for fitness 10 nov 2020
Fast data for fitness 10 nov 2020Timothy Spann
 
Microservices, Events, and Breaking the Data Monolith with Kafka
Microservices, Events, and Breaking the Data Monolith with KafkaMicroservices, Events, and Breaking the Data Monolith with Kafka
Microservices, Events, and Breaking the Data Monolith with KafkaVMware Tanzu
 
SmartCity StreamApp Platform: Real-time Information for Smart Cities and Tran...
SmartCity StreamApp Platform: Real-time Information for Smart Cities and Tran...SmartCity StreamApp Platform: Real-time Information for Smart Cities and Tran...
SmartCity StreamApp Platform: Real-time Information for Smart Cities and Tran...Cubic Corporation
 
Accumulo Summit 2014: Accumulo with Distributed SQL queries
Accumulo Summit 2014: Accumulo with Distributed SQL queriesAccumulo Summit 2014: Accumulo with Distributed SQL queries
Accumulo Summit 2014: Accumulo with Distributed SQL queriesAccumulo Summit
 
Observing Intraday Indicators Using Real-Time Tick Data on Apache Superset an...
Observing Intraday Indicators Using Real-Time Tick Data on Apache Superset an...Observing Intraday Indicators Using Real-Time Tick Data on Apache Superset an...
Observing Intraday Indicators Using Real-Time Tick Data on Apache Superset an...DataWorks Summit
 
Crime Analysis & Prediction System
Crime Analysis & Prediction SystemCrime Analysis & Prediction System
Crime Analysis & Prediction SystemBigDataCloud
 
_Python Ireland Meetup - Serverless ML - Dowling.pdf
_Python Ireland Meetup - Serverless ML - Dowling.pdf_Python Ireland Meetup - Serverless ML - Dowling.pdf
_Python Ireland Meetup - Serverless ML - Dowling.pdfJim Dowling
 
Spark and Hadoop Perfect Togeher by Arun Murthy
Spark and Hadoop Perfect Togeher by Arun MurthySpark and Hadoop Perfect Togeher by Arun Murthy
Spark and Hadoop Perfect Togeher by Arun MurthySpark Summit
 
Spark Summit EMEA - Arun Murthy's Keynote
Spark Summit EMEA - Arun Murthy's KeynoteSpark Summit EMEA - Arun Murthy's Keynote
Spark Summit EMEA - Arun Murthy's KeynoteHortonworks
 
Building Data Pipelines with Spark and StreamSets
Building Data Pipelines with Spark and StreamSetsBuilding Data Pipelines with Spark and StreamSets
Building Data Pipelines with Spark and StreamSetsPat Patterson
 
Developing high frequency indicators using real time tick data on apache supe...
Developing high frequency indicators using real time tick data on apache supe...Developing high frequency indicators using real time tick data on apache supe...
Developing high frequency indicators using real time tick data on apache supe...Zekeriya Besiroglu
 
Boston Future of Data Meetup: May 2017: Spark Introduction with Credit Card F...
Boston Future of Data Meetup: May 2017: Spark Introduction with Credit Card F...Boston Future of Data Meetup: May 2017: Spark Introduction with Credit Card F...
Boston Future of Data Meetup: May 2017: Spark Introduction with Credit Card F...Carolyn Duby
 
Horses for Courses: Database Roundtable
Horses for Courses: Database RoundtableHorses for Courses: Database Roundtable
Horses for Courses: Database RoundtableEric Kavanagh
 
Building Fast Applications for Streaming Data
Building Fast Applications for Streaming DataBuilding Fast Applications for Streaming Data
Building Fast Applications for Streaming Datafreshdatabos
 
(ARC346) Scaling To 25 Billion Daily Requests Within 3 Months On AWS
(ARC346) Scaling To 25 Billion Daily Requests Within 3 Months On AWS(ARC346) Scaling To 25 Billion Daily Requests Within 3 Months On AWS
(ARC346) Scaling To 25 Billion Daily Requests Within 3 Months On AWSAmazon Web Services
 
GSJUG: Mastering Data Streaming Pipelines 09May2023
GSJUG: Mastering Data Streaming Pipelines 09May2023GSJUG: Mastering Data Streaming Pipelines 09May2023
GSJUG: Mastering Data Streaming Pipelines 09May2023Timothy Spann
 
ShibiaoNong_Resume_ColumbiaMS (1)
ShibiaoNong_Resume_ColumbiaMS (1)ShibiaoNong_Resume_ColumbiaMS (1)
ShibiaoNong_Resume_ColumbiaMS (1)Shibiao Nong
 
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming FeaturesHDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming FeaturesHortonworks
 

Similar a Railroad Modeling at Hadoop Scale (20)

Beyond a Big Data Pilot: Building a Production Data Infrastructure - Stampede...
Beyond a Big Data Pilot: Building a Production Data Infrastructure - Stampede...Beyond a Big Data Pilot: Building a Production Data Infrastructure - Stampede...
Beyond a Big Data Pilot: Building a Production Data Infrastructure - Stampede...
 
Fast data for fitness 10 nov 2020
Fast data for fitness 10 nov 2020Fast data for fitness 10 nov 2020
Fast data for fitness 10 nov 2020
 
Microservices, Events, and Breaking the Data Monolith with Kafka
Microservices, Events, and Breaking the Data Monolith with KafkaMicroservices, Events, and Breaking the Data Monolith with Kafka
Microservices, Events, and Breaking the Data Monolith with Kafka
 
SmartCity StreamApp Platform: Real-time Information for Smart Cities and Tran...
SmartCity StreamApp Platform: Real-time Information for Smart Cities and Tran...SmartCity StreamApp Platform: Real-time Information for Smart Cities and Tran...
SmartCity StreamApp Platform: Real-time Information for Smart Cities and Tran...
 
Accumulo Summit 2014: Accumulo with Distributed SQL queries
Accumulo Summit 2014: Accumulo with Distributed SQL queriesAccumulo Summit 2014: Accumulo with Distributed SQL queries
Accumulo Summit 2014: Accumulo with Distributed SQL queries
 
Liwanshi-Raheja-Resume
Liwanshi-Raheja-ResumeLiwanshi-Raheja-Resume
Liwanshi-Raheja-Resume
 
Observing Intraday Indicators Using Real-Time Tick Data on Apache Superset an...
Observing Intraday Indicators Using Real-Time Tick Data on Apache Superset an...Observing Intraday Indicators Using Real-Time Tick Data on Apache Superset an...
Observing Intraday Indicators Using Real-Time Tick Data on Apache Superset an...
 
Crime Analysis & Prediction System
Crime Analysis & Prediction SystemCrime Analysis & Prediction System
Crime Analysis & Prediction System
 
_Python Ireland Meetup - Serverless ML - Dowling.pdf
_Python Ireland Meetup - Serverless ML - Dowling.pdf_Python Ireland Meetup - Serverless ML - Dowling.pdf
_Python Ireland Meetup - Serverless ML - Dowling.pdf
 
Spark and Hadoop Perfect Togeher by Arun Murthy
Spark and Hadoop Perfect Togeher by Arun MurthySpark and Hadoop Perfect Togeher by Arun Murthy
Spark and Hadoop Perfect Togeher by Arun Murthy
 
Spark Summit EMEA - Arun Murthy's Keynote
Spark Summit EMEA - Arun Murthy's KeynoteSpark Summit EMEA - Arun Murthy's Keynote
Spark Summit EMEA - Arun Murthy's Keynote
 
Building Data Pipelines with Spark and StreamSets
Building Data Pipelines with Spark and StreamSetsBuilding Data Pipelines with Spark and StreamSets
Building Data Pipelines with Spark and StreamSets
 
Developing high frequency indicators using real time tick data on apache supe...
Developing high frequency indicators using real time tick data on apache supe...Developing high frequency indicators using real time tick data on apache supe...
Developing high frequency indicators using real time tick data on apache supe...
 
Boston Future of Data Meetup: May 2017: Spark Introduction with Credit Card F...
Boston Future of Data Meetup: May 2017: Spark Introduction with Credit Card F...Boston Future of Data Meetup: May 2017: Spark Introduction with Credit Card F...
Boston Future of Data Meetup: May 2017: Spark Introduction with Credit Card F...
 
Horses for Courses: Database Roundtable
Horses for Courses: Database RoundtableHorses for Courses: Database Roundtable
Horses for Courses: Database Roundtable
 
Building Fast Applications for Streaming Data
Building Fast Applications for Streaming DataBuilding Fast Applications for Streaming Data
Building Fast Applications for Streaming Data
 
(ARC346) Scaling To 25 Billion Daily Requests Within 3 Months On AWS
(ARC346) Scaling To 25 Billion Daily Requests Within 3 Months On AWS(ARC346) Scaling To 25 Billion Daily Requests Within 3 Months On AWS
(ARC346) Scaling To 25 Billion Daily Requests Within 3 Months On AWS
 
GSJUG: Mastering Data Streaming Pipelines 09May2023
GSJUG: Mastering Data Streaming Pipelines 09May2023GSJUG: Mastering Data Streaming Pipelines 09May2023
GSJUG: Mastering Data Streaming Pipelines 09May2023
 
ShibiaoNong_Resume_ColumbiaMS (1)
ShibiaoNong_Resume_ColumbiaMS (1)ShibiaoNong_Resume_ColumbiaMS (1)
ShibiaoNong_Resume_ColumbiaMS (1)
 
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming FeaturesHDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
 

Más de DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Railroad Modeling at Hadoop Scale

  • 1. Railroad Modeling at Hadoop Scale. Hadoop Summit, 3 June 2014, San Jose, CA. John Akred (@BigDataAnalysis), Tatsiana Maskalevich (@notrockstar). © 2014 Silicon Valley Data Science LLC. All Rights Reserved. www.svds.com @SVDataScience
  • 2. Why is a data science & engineering consulting company building its own Caltrain app?
  • 4.
      • Commuter rail between San Francisco and San Mateo and Santa Clara counties (~30 stations)
      • 118 passenger cars
          – 60% are at least 30 years old
      • 2014 weekday ridership is 52,019 people
      • On-time performance is about 92%
      • No reliable real-time status information
      • API outage between April 5th and June 2nd
  • 5. How do we know if the train is late?
      • Direct observation
          – We can hear the train horn
          – We can see the train when it goes by
      • Purpose-built systems
          – We can use Caltrain APIs (when working)
      • Other signals
          – We can check Twitter for delay info or rider comments
  • 6. SVDS Approach
      • Take advantage of the available signals
      • Use historical data to make direct and latent observations more useful
      • Provide a service that gives users valuable planning and riding features
      • Don’t let the perfect be the enemy of the good
  • 7. Data Resiliency
      • Stovepipe: one-to-one relationship from data source to product
      • Hard failure: if the data source is broken, so is the app
      • Multi-sourced: redundancy of overlapping data sources makes your products more resilient
      • Graceful degradation: if a data source breaks, there is a backup and your app continues to function
      • Production data services abstract the probabilistic integration of overlapping data sources. We call this model a Data Mesh.
      [Diagram labels: Products, Data Sources, Broken Data Sources, Data Services]
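
The graceful-degradation idea on this slide can be sketched as a small data service that polls several overlapping sources and skips any that are broken. This is only an illustration under stated assumptions: the function name `estimate_delay`, the reliability weights, and the weighted average are hypothetical stand-ins for the "probabilistic integration" the slide mentions, which the deck does not spell out.

```python
def estimate_delay(sources):
    """Combine delay estimates (in minutes) from overlapping sources.

    `sources` is a list of (name, weight, fetch) tuples, where `fetch` is a
    zero-argument callable that returns a delay or raises if the source is
    broken. The weights are hypothetical reliability scores.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    used = []
    for name, weight, fetch in sources:
        try:
            value = fetch()
        except Exception:
            continue  # broken source: degrade gracefully instead of failing
        if value is None:
            continue  # source is up but has nothing to report
        weighted_sum += weight * value
        total_weight += weight
        used.append(name)
    if total_weight == 0:
        return None, used  # every source failed; the caller decides what to show
    return weighted_sum / total_weight, used


def broken_api():
    # Stands in for an API that is currently down.
    raise ConnectionError("API is down")


estimate, used = estimate_delay([
    ("api", 1.0, broken_api),
    ("audio", 0.5, lambda: 6.0),
    ("twitter", 0.25, lambda: 9.0),
])
print(estimate, used)
```

With a stovepipe design the failed `api` source would take the product down; here the estimate is simply built from the sources that still respond.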
  • 8. Source Signals: Audio, Image, Text, API [Diagram labels: Variety, Volume, Velocity]
  • 9. Audio Capture and Ingest
      • Microphone connected to Raspberry Pi: mic -> preamp -> analog-to-digital converter -> USB
      • PyAudio running on the Raspberry Pi serializes audio as an array of 2-byte integers
      • Sound data + metadata -> Flume on AWS via flumelogger
      • We use FFT + decision trees to detect trains and classify them as express or local based on the whistle sound
      [Diagram labels: Raspberry Pi, Raw Audio Agent]
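
To make "an array of 2-byte integers" plus metadata concrete, here is a minimal sketch using only the Python standard library. The framing (a JSON header line followed by raw samples), the field names, and the station label are assumptions for illustration; the deck does not show the actual payload format, and the PyAudio capture and flumelogger calls are deliberately left out.

```python
import array
import json
import time


def pack_audio(samples, sample_rate, station):
    """Pack 16-bit samples and metadata into one payload (hypothetical framing)."""
    pcm = array.array("h", samples)  # "h" = signed 2-byte integers
    header = {
        "station": station,
        "sample_rate": sample_rate,
        "n_samples": len(pcm),
        "captured_at": time.time(),
    }
    # JSON header on the first line, raw native-byte-order samples after it.
    return json.dumps(header).encode("utf-8") + b"\n" + pcm.tobytes()


def unpack_audio(payload):
    """Reverse of pack_audio: return (metadata dict, list of samples)."""
    header_bytes, body = payload.split(b"\n", 1)
    pcm = array.array("h")
    pcm.frombytes(body)
    return json.loads(header_bytes), list(pcm)


payload = pack_audio([0, 1000, -1000, 32767, -32768], 44100, "hypothetical-station")
header, samples = unpack_audio(payload)
print(header["n_samples"], samples)
```

Sender and receiver must agree on byte order; this sketch uses the machine's native order on both sides.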
  • 10. Image Capture and Ingest
      • wget pulls images from the camera’s built-in server 2-3 times a second and saves them on a local server/NAS
      • Flume pushes the image data to our EC2 servers
      • OpenCV (Python) is used to detect trains in images
      [Diagram labels: Raw Image Agent, Local Server]
  • 11. Text Capture and Ingest: Twitter
      • Capturing all the tweets with keyword ‘Caltrain’ via the Twitter API
      • Flume agent sends tweets to an Apache Storm topology for processing
      • Tweets are parsed and written to HDFS and HBase
      • Event detection is based on the baseline number of tweets per hour and keywords
      [Diagram labels: Raw Image Agent, Twitter APIs]
  • 12. Caltrain API Data Capturing
      • Real-time departure times are available via the 511.org developer APIs
      • A Python script collects data once a minute from the 511.org APIs and stores it in HDFS as sequence files using the WebHDFS API
      • A Python script collects data from the Caltrain site that includes the run number
      • The API didn’t function from April 5th until June 2nd, 2014
      [Diagram labels: scraper.py, 511.org APIs, Caltrain Webpage, data_collector_api.py]
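
As a rough sketch of the once-a-minute WebHDFS write described on this slide, the helper below builds the URL for a WebHDFS `CREATE` request. The host, port, user name, and directory layout are all hypothetical; the code only constructs the URL and performs no network request, and encoding the payload as a sequence file is out of scope here.

```python
from datetime import datetime
from urllib.parse import quote, urlencode


def minute_path(base_dir, when):
    """Hypothetical layout: one file per collection minute."""
    return f"{base_dir}/{when:%Y/%m/%d}/departures-{when:%H%M}.seq"


def webhdfs_create_url(namenode, hdfs_path, user, overwrite=False):
    """Build the URL for a WebHDFS CREATE call.

    The client sends PUT to this URL on the NameNode, which answers with a
    redirect to a DataNode; the file body is then sent to that location.
    """
    query = urlencode({
        "op": "CREATE",
        "overwrite": str(overwrite).lower(),
        "user.name": user,
    })
    return f"http://{namenode}/webhdfs/v1{quote(hdfs_path)}?{query}"


path = minute_path("/data/caltrain", datetime(2014, 6, 3, 8, 15))
url = webhdfs_create_url("namenode.example:50070", path, "collector")
print(url)
```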
  • 13. Combining the Signals: Audio Signal Detection, Image Recognition, and Text Analysis feed the STATE of the complex system
  • 14. [Architecture diagram]
      • External Data Sources: Twitter API, Microphone on Raspberry Pi, Web Camera, Caltrain API
      • Agents: Twitter Agent, Sound Agent, Image Agent, Caltrain Agent
      • Spouts: Twitter Spout, Sound Spout, Image Spout, Caltrain Spout
      • Other labels: Data Platform, Tweet Parser, Tweets Counter, Sounds Classifier, Train Detector, Schedule Integrator, Event Detector, Alerts, Transmit to APP, HDFS Writer, HBase Writer, Event Storage, MapReduce, Analytics Dev
  • 15. Data Science: Audio
      • Batch:
          – Apply FFT to audio data to identify a train based on the fundamental frequencies of its whistle
          – Decision tree trained to classify trains as local or express based on minimum and maximum fundamental frequencies (Doppler effect)
      • Real-Time:
          – Apply FFT to audio signal
          – Extract min and max fundamental frequencies
          – Execute local / express classifier
          – Send data to the Event Detector for APP alerts
          – Store results in HBase
      [Chart: Histogram of Whistle Frequencies Over a Period of Time; axes: Frequency (Hz), Frequency Counts]
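
The real-time audio steps can be sketched end to end in plain Python: an FFT per frame, the minimum and maximum peak frequency across frames, then a classifier. Several parts are assumptions rather than the deck's method: the strongest FFT bin stands in for the fundamental frequency, a single hand-set threshold on the frequency spread stands in for the trained decision tree, and the direction of the rule (a larger Doppler spread means express) is a guess. The two synthetic tones replace real whistle recordings.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def dominant_frequency(frame, sample_rate):
    """Frequency (Hz) of the strongest non-DC bin in one frame."""
    spectrum = fft(frame)
    half = len(frame) // 2
    peak = max(range(1, half), key=lambda k: abs(spectrum[k]))
    return peak * sample_rate / len(frame)


def min_max_frequency(samples, sample_rate, frame_size=1024):
    """Minimum and maximum dominant frequency over consecutive frames."""
    freqs = [
        dominant_frequency(samples[i:i + frame_size], sample_rate)
        for i in range(0, len(samples) - frame_size + 1, frame_size)
    ]
    return min(freqs), max(freqs)


def classify(f_min, f_max, spread_threshold_hz=40.0):
    """One-split stand-in for the decision tree; the threshold is hypothetical."""
    return "express" if (f_max - f_min) > spread_threshold_hz else "local"


def tone(freq, sample_rate, n):
    return [math.sin(2 * math.pi * freq * i / sample_rate) for i in range(n)]


SAMPLE_RATE = 8192  # chosen so that each FFT bin is exactly 8 Hz wide
approaching = tone(480.0, SAMPLE_RATE, 1024)  # pitch shifted up
receding = tone(400.0, SAMPLE_RATE, 1024)     # pitch shifted down
f_min, f_max = min_max_frequency(approaching + receding, SAMPLE_RATE)
label = classify(f_min, f_max)
print(f_min, f_max, label)
```

In the batch path described on the slide, the split point would be learned from labeled historical recordings instead of being set by hand.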
  • 16. Data Science: Image
      • Real-Time:
          – ORB algorithm (OpenCV) is used to detect the train in the image
          – Sends results to the Event Detector to identify the train and compare to the schedule
          – Event Detector updates the APP with the train’s status and alerts if late
      [Chart: Number of Key-Points That Are the Same in Two Consecutive Images; axes: Time (sec), Number of Matching Points]
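
The chart on this slide tracks how many key-points match between two consecutive images. A minimal sketch of that step follows, assuming binary descriptors (ORB produces binary descriptors that are compared by Hamming distance) and assuming that a sharp drop in matches signals a train entering or leaving the frame; the deck does not state the exact rule. The descriptors here are toy 8-bit integers and the matcher is a pure-Python stand-in, so no OpenCV calls are shown.

```python
def hamming(a, b):
    """Number of differing bits between two integer-encoded descriptors."""
    return bin(a ^ b).count("1")


def count_matches(desc_a, desc_b, max_distance=2):
    """Count cross-checked nearest-neighbour matches within max_distance."""
    if not desc_a or not desc_b:
        return 0
    best_ab = [min(range(len(desc_b)), key=lambda j: hamming(a, desc_b[j]))
               for a in desc_a]
    best_ba = [min(range(len(desc_a)), key=lambda i: hamming(desc_a[i], b))
               for b in desc_b]
    return sum(
        1
        for i, j in enumerate(best_ab)
        if best_ba[j] == i and hamming(desc_a[i], desc_b[j]) <= max_distance
    )


def detect_scene_change(frames, min_matches):
    """Indices of frames with too few matches against the previous frame."""
    return [
        t for t in range(1, len(frames))
        if count_matches(frames[t - 1], frames[t]) < min_matches
    ]


# Toy descriptors: an empty-track background and a frame containing a train.
background = [0b00000000, 0b00001111, 0b11110000, 0b00111100]
train = [0b11111111, 0b10101010, 0b01010101, 0b11000011]
frames = [background, background, train, background, background]
flagged = detect_scene_change(frames, min_matches=3)
print(flagged)
```

Frames 2 and 3 are flagged because the train frame shares no close descriptors with the background on either side of it.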
  • 17. Data Science: Text
      • Batch:
          – Update baseline tweet frequencies for each hour as additional historical data is collected
          – Store model parameters in HBase
      • Real-Time:
          – Count tweets as they stream through the topology
          – Alert based on frequency deviations from the baseline
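
The batch and real-time halves of the tweet model can be sketched as a per-hour baseline plus a deviation check. The deck does not say how a "deviation" is measured, so the mean plus `k` standard deviations rule, the parameter names, and the sample counts below are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def build_baseline(history):
    """Batch step: history is an iterable of (hour_of_day, tweet_count) pairs."""
    by_hour = defaultdict(list)
    for hour, count in history:
        by_hour[hour].append(count)
    # One (mean, standard deviation) pair per hour of the day.
    return {hour: (mean(counts), pstdev(counts)) for hour, counts in by_hour.items()}


def is_anomalous(baseline, hour, count, k=3.0, min_std=1.0):
    """Real-time step: flag counts well above that hour's baseline.

    k and min_std are hypothetical tuning knobs; min_std avoids alerting on
    tiny fluctuations when the history for an hour is nearly constant.
    """
    avg, std = baseline[hour]
    return count > avg + k * max(std, min_std)


history = [(8, 10), (8, 12), (8, 14), (17, 20), (17, 22), (17, 24)]
baseline = build_baseline(history)
print(is_anomalous(baseline, 8, 13), is_anomalous(baseline, 8, 40))
```

Rebuilding `baseline` as more history accumulates corresponds to the batch update on the slide; the resulting parameters are what would be stored for the streaming check to read.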
  • 18. Baseline Calculation [Chart; label: Baseline]
  • 19. Future Work
      • Detect direction of train in image processing
      • Use natural language processing on Twitter data for the event detector
      • Continue evaluation of analytical frameworks for model computation
      • Add observation posts
      • Release Caltrain Rider Application
  • 20. Coming Soon: Caltrain Rider App
      • Find out what train to catch using our ‘Ride Now’ view
      • Select a train and see when that train should be reaching each stop in a trip detail view
      • For more info: www.svds.com/trains
  • 21. Questions. Yes, We’re Hiring: www.svds.com/join-us
  • 22. Thank You. John @BigDataAnalysis, Tatsiana @notrockstar

Editor's notes

  1. When a train is detected, the information is sent to HBase and to the Event Detector. The camera has a network connection, so we can drop images via wget to the local server. Label wget.
  2. Add API setup