10. Ground Six
Tech company based in the North East of
England
Specialise in developing web and mobile
applications
Provide investment (financial and tech) to
interesting app ideas
Got an idea? Looking for investment?
www.groundsix.com
18. What's in store
Challenges, solutions and approaches when dealing with
billions of inserts per day
Processing and storing the data
Querying the data quickly
Reporting against the data
Keeping the application responsive
Keeping the application running
Legacy project, problems and code
25. Electric Vehicles: Need
for Data
We need to receive all of the data
We need to keep all of the data
We need to be able to display data in real time
We need to transfer large chunks of data to
customers and government departments
We need to be able to calculate performance
metrics from the data
30. Some stats
500 (approx) telemetry enabled vehicles
using the system
2500 data points captured per vehicle per
second
> 1.5 billion MySQL inserts per day
World's largest vehicle telematics project
outside of Formula 1
31. More stats
Constant minimum of 4000 inserts per
second within the application
Peaks:
3 million inserts per second
36. Receiving continuous
data streams
We need to be online
We need to have capacity to process the
data
We need to scale
39. Message Queue
Fast, secure, reliable and scalable
Hosted: they worry about the server
infrastructure and availability
We only have to process what we can
40. AMQP + PHP
php-amqplib (github.com/videlalvaro/php-amqplib)
OR install it via Composer: videlalvaro/php-amqplib
Pure PHP implementation
Handles publishing and consuming messages
from a queue
41. AMQP: Consume
// connect to the AMQP server
$connection = new AMQPConnection($host, $port, $user, $password);
// create a channel; a logical, stateful link to our physical connection
$channel = $connection->channel();
// declare the exchange (where messages are sent)
$channel->exchange_declare($exchange, 'direct');
// bind the queue to the exchange
$channel->queue_bind($queue, $exchange);
// consume by sending each message to our processing callback
// (flags: no_local, no_ack, exclusive, nowait)
$channel->basic_consume($queue, $consumerTag, false, false, false, false,
    $callbackFunctionName);
while (count($channel->callbacks)) {
    $channel->wait();
}
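The publishing side is the mirror image; a minimal sketch using the same php-amqplib classes (the payload and routing key variables here are illustrative, not from the original deck):

```php
// connect and open a channel, exactly as on the consume side
$connection = new AMQPConnection($host, $port, $user, $password);
$channel = $connection->channel();

// declare the exchange we publish to (idempotent if it already exists)
$channel->exchange_declare($exchange, 'direct');

// wrap the payload and publish it; delivery_mode 2 marks it persistent
$message = new AMQPMessage($payload, array('delivery_mode' => 2));
$channel->basic_publish($message, $exchange, $routingKey);

$channel->close();
$connection->close();
```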
43. Pulling in the data
Dedicated application and hardware to
consume from the Message Queue and
convert to MySQL Inserts
MySQL: LOAD DATA INFILE
Very fast
Due to high volumes of data, these “bulk
operations” only cover a few seconds of
time - still giving a live stream of data
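A bulk load of this shape looks roughly like the statement below; table and column names are made up for illustration, not the project's actual schema:

```sql
-- load a few seconds' worth of buffered telemetry in one statement
LOAD DATA INFILE '/tmp/telemetry-buffer.csv'
INTO TABLE datavalue
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n'
(vehicle_id, recorded_at, variable_id, value);
```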
44. Optimising MySQL
innodb_flush_method=O_DIRECT
Lets the buffer pool bypass the OS cache
InnoDB's buffer pool is more efficient than the OS cache
Can have negative side effects
Improve write performance:
innodb_flush_log_at_trx_commit=2
Prevents per-commit log flushing
Query cache size (query_cache_size)
Measure your application's usage and make a judgement
Our data stream was too frequent to make use of the cache
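Collected into a my.cnf fragment, the settings above look like this; the values are illustrative, so measure your own workload before copying them:

```ini
[mysqld]
# write straight through to disk; rely on the InnoDB buffer pool, not the OS cache
innodb_flush_method = O_DIRECT

# flush the log to disk about once a second rather than on every commit;
# trades up to ~1s of transactions on crash for much higher write throughput
innodb_flush_log_at_trx_commit = 2

# query cache disabled here: a constant insert stream invalidates it faster than it helps
query_cache_size = 0
```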
45. Sharding (1)
Evaluate data, look for natural break points
Split the data so each data collection unit
(vehicle) had a separate database
Gives some support for horizontal scaling
Provided the data per vehicle is a
reasonable size
47. But the MQ can store data...why
do you have a problem?
Message Queue isn’t designed for storage
Messages are transferred in a compressed
form
Nature of vehicle data (CAN) means that a 16
character string is actually 4 - 64 pieces of
data
48. Sam Lambert
Solves big-data MySQL problems for
breakfast
Constantly tweaking the servers and
configuration to get more and more
performance
Pushing the capabilities of our SAN,
tweaking configs where no DBA has gone
before
www.samlambert.com
http://www.samlambert.com/2011/07/how-to-push-your-san-with-open-iscsi_13.html
http://www.samlambert.com/2011/07/diagnosing-and-fixing-mysql-io.html
Twitter: @isamlambert
51. Long Running Queries
More and more vehicles came into service
Huge amount of data resulted in very slow
queries
Page load
Session locking
Slow exports
Slow backups
52. Real time information
Original database schema dictated all
information was accessed via a query, or a
separate subquery. Expensive.
Live information:
Up to 30 data points
Refreshing every 5 - 30 seconds via AJAX
Painful
53. Requests
Asynchronous requests let the page load before
the data
Number of these requests had to be monitored
Real time information used Fusion Charts
1 AJAX call per chart
10 - 30 charts per vehicle live screen
Refresh every 5 - 30 seconds
55. Single entry point
Multiple entry points make it difficult to
dynamically change the time out and
memory usage of key pages, as well as
dealing with session locking issues
effectively.
Single point of entry is essential
Check out the Symfony Routing component...
56. Symfony Routing
// locate the directory containing your route definitions
$locator = new FileLocator( array( __DIR__ . '/../../' ) );
$loader = new YamlFileLoader( $locator );
// build the request context from the current URI
$requestURL = isset( $_SERVER['REQUEST_URI'] ) ? $_SERVER['REQUEST_URI'] : '';
$requestURL = ( strlen( $requestURL ) > 1 ) ? rtrim( $requestURL, '/' ) : $requestURL;
$requestContext = new RequestContext( $requestURL );
// set up the router
$router = new RoutingRouter( $loader, 'routes.yml',
    array( 'cache_dir' => null ), $requestContext );
// get the route for your request
$route = $router->match( $requestURL );
// act on the route
58. Sharding (2)
Data is very time relevant
Only care about specific days
Don’t care about comparing data too much
Split the data so that each week had a
separate table
59. Supporting Sharding
Simple PHP function to run all queries
through. Works out the table name. Link
with a sprintf to get the full query string
/**
 * Get the sharded table to use for a specific date
 * @param String $date YYYY-MM-DD
 * @return String
 */
public function getTableNameFromDate( $date )
{
    // ASSUMPTION: today's table is ALWAYS THERE
    // ASSUMPTION: you shouldn't be querying for data in the future
    $date = ( $date > date( 'Y-m-d' ) ) ? date( 'Y-m-d' ) : $date;
    $stt = strtotime( $date );
    if( $date >= $this->switchOver ) {
        // early January can still fall into ISO week 52 of the previous year
        $year = ( date( 'm', $stt ) == 1 && date( 'W', $stt ) == 52 )
            ? date( 'Y', $stt ) - 1 : date( 'Y', $stt );
        return 'datavalue_' . $year . '_' . date( 'W', $stt );
    }
    else {
        return 'datavalue';
    }
}
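The helper then slots into queries via sprintf; a usage sketch (the column and variable names are illustrative):

```php
// resolve the weekly shard for the requested date, then build the query
$table = $this->getTableNameFromDate( '2012-06-08' );
$query = sprintf( 'SELECT recorded_at, value FROM %s WHERE vehicle_id = %d',
    $table, $vehicleID );
```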
60. Sharding: an excuse
Alterations to the database schema
Code to support smaller buckets of data
Take advantage of needing to touch queries
and code: improve them!
61. Index Optimisation
Two sharding projects left the schema as a
Frankenstein
Indexes still had data from before the first shard
(the vehicle ID)
Wasting storage space
Increasing the index size
Increasing query time
Makes the index harder to fit into memory
62. Schema Optimisation
MySQL provides a range of data-types
Varying storage implications
Does that need to be a BIGINT?
Do you really need DOUBLE PRECISION when a
FLOAT will do?
Are those tables, fields or databases still required?
Perform regular schema audits
63. Query Optimisation
Run your queries through EXPLAIN
EXTENDED
Check they hit the indexes
For big queries avoid functions such as
CURDATE - this helps ensure the cache is hit
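As an illustration of the CURDATE point: the two queries below return the same rows, but only the second can be served from the query cache, because the first contains a non-deterministic function (table and column names are made up):

```sql
-- not cacheable: CURDATE() is evaluated at run time
SELECT vehicle_id, value FROM datavalue WHERE recorded_on = CURDATE();

-- cacheable: the application substitutes today's date as a literal
SELECT vehicle_id, value FROM datavalue WHERE recorded_on = '2012-06-08';
```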
66. Reports & Intensive
Queries
How far did the vehicle travel today?
Calculation involves looking at every single
motor speed value for the day
How much energy did the vehicle use today?
Calculation involves looking at multiple
variables for every second of the day
Lookup time + calculation time
67. Group the queries
Leverage indexes
Perform related queries in succession
Then perform calculations
Catching up on a backlog of calculations and
exports?
Do a table of queries at a time
Make use of indexes
68. Save the report
Automate the queries in dead time, grouped
together nicely
Save the results in a reports table
Only a single record per vehicle per day of
performance data
Means users and management can run
aggregate and comparison queries
themselves quickly and easily
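A reports table along these lines might look like the sketch below; the columns are hypothetical, chosen to match the distance and energy examples earlier:

```sql
CREATE TABLE vehicle_daily_report (
    vehicle_id   INT UNSIGNED NOT NULL,
    report_date  DATE         NOT NULL,
    distance_km  FLOAT        NOT NULL,
    energy_kwh   FLOAT        NOT NULL,
    PRIMARY KEY (vehicle_id, report_date)
) ENGINE=InnoDB;

-- users can now aggregate cheaply, e.g. fleet distance per day:
SELECT report_date, SUM(distance_km)
FROM vehicle_daily_report
GROUP BY report_date;
```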
70. Check for efficiency
savings
Initial export scripts maintained a MySQLi
connection per database (500!)
Updated to maintain one per server and
simply switch to the database in question
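In PHP terms the fix amounts to reusing one connection per server and switching the default database per vehicle; a sketch with assumed variable names and database naming:

```php
// one MySQLi connection per *server*, not per database
$connection = new mysqli($serverHost, $user, $password);

foreach ($vehicleIDs as $vehicleID) {
    // switch the default database instead of opening a new connection
    $connection->select_db('vehicle_' . $vehicleID);
    // ... run this vehicle's export queries on $connection ...
}

$connection->close();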
71. Leverage your RAM
Intensive queries might only use X% of your
RAM
Safe to run more than one report / export
at a time
Add support for multiple exports / reports
within your scripts e.g.
73. Extrapolate & Assume
Data is only stored when it changes
Known assumptions are used to extrapolate
values for all seconds of the day
Saves MySQL but costs in RAM
“Interlation”
74. Interlation
* Add an array to the interlation
public function addArray( $name, $array )
* Get the time that we first receive data in one of our arrays
public function getFirst( $field )
* Get the time that we last received data in any of our arrays
public function getLast( $field )
* Generate the interlaced array
public function generate( $keyField, $valueField )
* Break the interlaced array down into separate days
public function dayBreak( $interlationArray )
* Generate an interlaced array and fill for all timestamps within the range
of _first_ to _last_
public function generateAndFill( $keyField, $valueField )
* Populate the new combined array with key fields using the common field
public function populateKeysFromField( $field, $valueField=null )
http://www.michaelpeacock.co.uk/interlation-library
77. Session Locking
Some queries were still (understandably, and
acceptably) slow
Sessions would lock and AJAX scripts would
enter race conditions
User would attempt to navigate to another
page: their session with the web server
wouldn’t respond
78. Session Locking:
Resolution
Session locking is caused by how PHP handles
sessions: the session file stays locked until the
request has finished executing
Potential solution: use another method e.g.
database
Our solution: manually close the session
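Manually closing the session is a single call: once the page has read what it needs from $_SESSION, release the lock before the slow work starts. A minimal sketch:

```php
session_start();
$userID = $_SESSION['user_id']; // read what this request needs

// release the session file lock so parallel AJAX requests aren't serialised
session_write_close();

// ... long-running query or export runs here without blocking other requests ...
// note: writes to $_SESSION after this point are not persisted
```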
80. Live real-time data
Request consolidation helped
Each data point on the live screen was still a
separate query due to original design
constraints
Live fleet information spanned multiple
databases e.g. a map of all vehicles
belonging to a customer
Solution: caching
81. Caching with memcached
Fast, in-memory key-value store
Used to keep a copy of the most recent
data from each vehicle
$mc = new Memcache();
$mc->connect($memcacheServer, $memcachePort);
$realTimeData = $mc->get($vehicleID . '-' . $dataVariable);
Failover: Moxi Memcached Proxy
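The write side is symmetrical: the import process overwrites each vehicle's latest reading as it lands. A sketch using the same Memcache extension and key scheme as above; the 60-second expiry is an assumption:

```php
$mc = new Memcache();
$mc->connect($memcacheServer, $memcachePort);

// keep only the most recent reading per vehicle/variable;
// each new value simply replaces the previous one
$mc->set($vehicleID . '-' . $dataVariable, $value, 0, 60);
```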
87. Templates and sessions
Closing and opening sessions means you need
to know when data has been sent to the
browser
Separation of concerns and template systems
help with this
88. Database rollouts
Specific database table defines how the data should
be processed
Log database deltas
Automated process to roll out changes
Backup existing table first
DATE=`date +%H-%M-%d-%m-%y`
mysqldump -h HOST -u USER -pPASSWORD DATABASE TABLENAME > /backups/dictionary_$DATE.sql
cd /var/www/pdictionarypatcher/repo/
git pull origin master
cd src
php index.php
Rollout changes
92. NoSQL?
MySQL was used as a “golden hammer”
Original team of contractors who built the
system knew it
Easy to hire developers who know it
Not necessarily the best option
We had to introduce application-level
sharding for it to suit the growing needs
94. Direct queue interaction
Types of message queue could allow our live
data to be streamed direct from a queue
We could use this infrastructure to share
the data with partners instead of providing
them regular processed exports
97. PHP needs lots of
friends
PHP is a great tool for:
Displaying the data
Processing the data
Exporting the data
Binding business logic to the data
It needs friends to:
Queue the data
Insert the data
Visualise the data
101. Compile Data
Keep related data together
Look at storing summaries of data
Approach used by analytics companies: granularity
changes over time:
This week: per second data
Last week: Hourly summaries
Last month: Daily summaries
Last year: Monthly summaries
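Rolling per-second data up into coarser buckets can be done with a scheduled aggregate insert; a sketch where the summary table, columns and shard name are all illustrative:

```sql
-- summarise one weekly shard's per-second readings into hourly buckets
INSERT INTO datavalue_hourly
    (vehicle_id, variable_id, hour_start, avg_value, min_value, max_value)
SELECT vehicle_id, variable_id,
       DATE_FORMAT(recorded_at, '%Y-%m-%d %H:00:00'),
       AVG(value), MIN(value), MAX(value)
FROM datavalue_2012_22
GROUP BY vehicle_id, variable_id,
         DATE_FORMAT(recorded_at, '%Y-%m-%d %H:00:00');
```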
Hello everyone; thanks for coming. I spent the last 12 months working on a large-scale, data-intensive project, focusing on the development of a PHP web application which had to support, display, process, report against and export a phenomenal amount of data each day.
The project concerned vehicle telematics data from vehicles produced by Smith Electric Vehicles, one of the world's largest manufacturers of all-electric commercial vehicles. As a new and emerging industry, performance, efficiency and fault reporting data from these vehicles is very valuable. As I'm sure you can imagine, with electric vehicles the drive and battery systems generate a large amount of data - with batteries broken down into smaller cells, each giving us temperature, current, voltage and state-of-charge data.
As the data may relate to performance and faults, we need to ensure we get the data; telematics projects which offer safety features have this as an even more important issue. We also have government partners who subsidise the vehicle cost in exchange for some of this data, so we need to be able to give the data to them as well as receiving it ourselves. And because EVs rely on chemistry and external factors, we need to keep the data so we can compare readings from different times of the year and different locations.
What you will realise is that we in effect built a large-scale distributed-denial-of-service system and pointed it directly at our own hardware, with the caveat of actually needing the data from the DDoS attack!
Before we could do anything, we needed to be able to process the data and store it within the system. This includes actually transferring the data to our servers, inserting it into our database cluster and performing business logic on the data.
In order for us to reliably receive the data, we need the system to be online so that data can be transferred. We also need to have the server capacity to process the data, and we need to be able to scale the system. Just because there are X number of data collection units out there, we don't know how many will be on at a given time, and we have to deal with more and more collection units being built and delivered.
The biggest problem is dealing with the pressure of that data stream.
There are a range of AMQP libraries for PHP, some of them based on the C library and other difficult dependencies. A couple of guys developed a pure PHP implementation which is really easy to use, and which can be installed directly via Composer. As a pure PHP implementation it's really easy to get up and running on any platform. It provides support for both publishing and consuming messages from a queue - great not only for dealing with streams of data but also for storing events and requests across multiple sessions, or dispatching jobs.
A small buffer allows us to cope with connectivity problems to our message queue, or signal problems with the data collection devices.
To give data import the resources it needs, the system had dedicated hardware to consume messages from the message queue, perform business logic and convert them to MySQL inserts. Although it's an obvious one, it's also easily overlooked: the data is bundled together into LOAD DATA INFILE statements with MySQL.
With a project of this scale, dealing with business-critical data could lead to deployment anxiety: a bug in rolled-out code could cause problems with displaying real-time data, or cause exported data or processed reports to be incorrect, requiring them to be re-run at a cost of CPU time - most of which was already in use generating that day's reports or dealing with that day's data imports. The architecture of the application also imposed constraints on maintenance.