3. Data Storage
• Hard Disk Drive - HDD
• Magnetizes a thin film of ferromagnetic material on a disk
• Reads it with a magnetic head on an actuator arm
• Solid State Drive – SSD
• Uses integrated circuit assemblies as memory to store data persistently
• No moving parts
4. Areal Storage Density
• SSD
• 2.8 Tbit/in²
• HDD
• 1.5 Tbit/in²
Terabits per square inch – numbers as of 2016 (see Wikipedia, our materials are
improving)
8. Streams: Computing Concept
Definitions
• Idea originating in the 1950s
• Standard way to get Input and
Output
• A source or sink of data
Who uses them
• C – stdin, stderr, stdout
• C++ iostream
• Perl IO
• Python io
• Java
• C#
9. What is a Stream?
• Access input and output generically
• Can write and read linearly
• May or may not be seekable
• Comes in chunks of data
10. Why do I care about streams?
• They are created to handle massive amounts of data
• Assume all files are too large to load into memory
• If this means checking size before load, do it
• If this means always treating a file as very large, do it
• PHP streams were meant for this!
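A minimal sketch of the "check size before load" and "always treat a file as very large" habits above. The helper names loadSmallFile() and countLinesStreaming() and the 1 MiB default budget are illustrative assumptions, not part of PHP or of the original slides.

```php
<?php
// Hypothetical helper: refuse to slurp a file that is larger than a budget.
function loadSmallFile(string $path, int $maxBytes = 1048576): string
{
    $size = @filesize($path);
    if ($size === false) {
        throw new RuntimeException("Cannot stat {$path}");
    }
    if ($size > $maxBytes) {
        throw new LengthException("{$path} is {$size} bytes; stream it instead");
    }
    $contents = file_get_contents($path);
    if ($contents === false) {
        throw new RuntimeException("Cannot read {$path}");
    }
    return $contents;
}

// Hypothetical helper: treat the file as huge and hold only one line at a time.
function countLinesStreaming(string $path): int
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open {$path}");
    }
    $count = 0;
    try {
        while (fgets($handle) !== false) {
            $count++;
        }
    } finally {
        fclose($handle);
    }
    return $count;
}
```

The first helper fails loudly instead of exhausting memory; the second never needs the check because its memory use does not grow with the file.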
11. What uses streams in PHP?
• EVERYTHING
• include/require (and their _once variants)
• stream functions
• file system functions
• many other extensions
15. What are Filters?
• Performs operations on stream data
• Can be prepended or appended (even on the fly)
• Can be attached to read or write
• When a filter is added for read and write, two instances of the filter are
created.
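As a rough illustration of attaching a filter to the read side only and removing it again on the fly, here is a sketch using the built-in string.toupper filter on an in-memory stream. The sample text is made up; the stream and filter functions are standard PHP.

```php
<?php
// An in-memory stream stands in for a real file or socket.
$handle = fopen('php://memory', 'w+b');
fwrite($handle, "terabytes of cat gifs\n");
rewind($handle);

// Attach the built-in filter to the read chain only.
$filter = stream_filter_append($handle, 'string.toupper', STREAM_FILTER_READ);
$upper = fgets($handle);

// Remove the filter on the fly, then read the same bytes unfiltered.
stream_filter_remove($filter);
rewind($handle);
$plain = fgets($handle);

fclose($handle);
```

The rewind() after removing the filter matters: it discards the internal read buffer, which still holds data that already went through the filter.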
17. Things to watch for!
• Data has an input and output state
• When reading in chunks, you may need to cache in between reads to make
filters useful
• Use the right tool for the job
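The point about caching between reads can be sketched with a user filter that works on whole lines. The class name LineNumberFilter and the filter name demo.line_numbers are hypothetical; the sketch assumes the filter is attached for reading, where the closing call arrives at end of data while the stream is still open. The chunk size is forced down to 4 bytes only to make the chunk boundaries visible.

```php
<?php
// Hypothetical filter: prefixes every line with its line number.
final class LineNumberFilter extends php_user_filter
{
    private $carry = '';   // partial line left over from the previous chunk
    private $lineNo = 0;   // state that must survive between chunks

    public function filter($in, $out, &$consumed, $closing): int
    {
        $output = '';
        while ($bucket = stream_bucket_make_writeable($in)) {
            $consumed += $bucket->datalen;
            $lines = explode("\n", $this->carry . $bucket->data);
            // The last piece has no newline yet, so hold it back for the next chunk.
            $this->carry = array_pop($lines);
            foreach ($lines as $line) {
                $output .= ++$this->lineNo . ': ' . $line . "\n";
            }
        }
        // At end of data, flush whatever is still cached.
        if ($closing && $this->carry !== '') {
            $output .= ++$this->lineNo . ': ' . $this->carry;
            $this->carry = '';
        }
        if ($output === '') {
            return PSFS_FEED_ME; // nothing complete yet, ask for more input
        }
        stream_bucket_append($out, stream_bucket_new($this->stream, $output));
        return PSFS_PASS_ON;
    }
}

stream_filter_register('demo.line_numbers', LineNumberFilter::class);

$handle = fopen('php://memory', 'w+b');
fwrite($handle, "cat\ngif\nlast");
rewind($handle);
stream_set_chunk_size($handle, 4); // tiny chunks so lines span several reads
stream_filter_append($handle, 'demo.line_numbers', STREAM_FILTER_READ);
$numbered = stream_get_contents($handle);
fclose($handle);
```

Without the carry buffer, a line split across two chunks would be numbered twice; the filter cannot assume a chunk ends where a record ends.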
18. Throw away your assumptions except for:
There will be Terabytes of Cat Gifs!!
20. Random Access Memory (RAM)
• The CPU uses RAM to work
• It randomly shoves data inside and pulls data back out
• RAM is faster than SSD and HDD
• It’s also more expensive
22. There are two reasons you’ll see that error
• Recursion recursion recursion recursion
• Solution: install xdebug and get your stacktrace
• Loading too much data into memory
• Solution: manage your memory
23. Inherently PHP hides this problem
• Share nothing architecture
• Extensions with C libraries that hide memory consumption
• FastCGI/CGI blows away processes, restoring memory
• Max child and other Apache settings blow away children, restoring memory
26. Arrays are evil
• There are other ways to store data that are more efficient
• They should be used for small amounts of data
• No matter how hard you try, there is C overhead
27. Process with the appropriate tools
• Load data into the appropriate place for processing
• Hint – arrays are IN MEMORY – that is generally not an appropriate place
for processing
• Datastores are meant for storing and retrieving data, use them
29. Use the iteration, Luke
• Lazy fetching exists for database fetching – use it!
• Always page (window) your result sets from the database – ALWAYS
• Use filters or generators to format or alter results on the fly
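A small sketch of altering results on the fly with generators instead of building arrays. readLines() and formatNames() are hypothetical helpers, and a plain text file stands in for the database cursor; the same shape applies to rows fetched lazily from a statement.

```php
<?php
// Hypothetical lazy source: yields one trimmed line at a time.
function readLines(string $path): Generator
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open {$path}");
    }
    try {
        while (($line = fgets($handle)) !== false) {
            yield rtrim($line, "\r\n");
        }
    } finally {
        fclose($handle);
    }
}

// Hypothetical formatter: alters each row as it passes through.
function formatNames(iterable $rows): Generator
{
    foreach ($rows as $row) {
        yield ucfirst(strtolower($row));
    }
}
```

Chained as formatNames(readLines($path)) inside a foreach, only one row is in memory at any time, no matter how long the input is.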
30. The N+1 problem
• In simple terms, nested loops
• Don’t distance yourself too much from your datastore
• Collapse into one or two queries instead
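To make the N+1 shape concrete, here is a sketch against an in-memory SQLite database. It assumes the pdo_sqlite extension is available, and the authors/posts tables and their rows are invented for the example.

```php
<?php
// Assumption: pdo_sqlite is installed; schema and data are made up.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT)');
$pdo->exec('CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT)');
$pdo->exec("INSERT INTO authors (id, name) VALUES (1, 'Ann'), (2, 'Bob')");
$pdo->exec("INSERT INTO posts (id, author_id, title) VALUES (1, 1, 'Streams'), (2, 1, 'Filters'), (3, 2, 'Sockets')");

// N+1: one query for the authors, then one more query per author.
$nPlusOne = [];
$nPlusOneQueries = 1;
$authors = $pdo->query('SELECT id, name FROM authors ORDER BY id')->fetchAll(PDO::FETCH_ASSOC);
$postsByAuthor = $pdo->prepare('SELECT title FROM posts WHERE author_id = ? ORDER BY id');
foreach ($authors as $author) {
    $postsByAuthor->execute([$author['id']]);
    $nPlusOne[$author['name']] = $postsByAuthor->fetchAll(PDO::FETCH_COLUMN);
    $nPlusOneQueries++;
}

// Collapsed: a single JOIN returns the same data in one round trip.
$collapsed = [];
$rows = $pdo->query(
    'SELECT a.name, p.title FROM authors a JOIN posts p ON p.author_id = a.id ORDER BY a.id, p.id'
);
foreach ($rows as $row) {
    $collapsed[$row['name']][] = $row['title'];
}
```

With two authors the loop costs three queries; with a million rows it costs a million and one, while the JOIN stays at one.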
36. What does this have to do with PHP?
• You are limited by the CPU your site is deployed upon.
• Yes even in a cloud – there are still physical systems running your stuff
• Yes even in a VM – there are still physical systems running your stuff
• Follow good programming habits
• PROFILE
37. Good programming habits
• Turn on opcache in production!
• Keep your code error AND WARNING free
• Watch complex logic in loops
• Short circuit the loop
• Rewrite to do the logic on the entire set in one step
• Calculate values only once
• On small arrays use array_walk
• On large arrays use generators/iterators
• Use isset instead of in_array if possible
• Profile to find the place to rewrite for slow code issues
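The isset-instead-of-in_array habit can be sketched as follows. The extension list is made-up sample data; flipping the array once also illustrates "calculate values only once".

```php
<?php
$allowedExtensions = ['gif', 'png', 'jpg']; // hypothetical data

// in_array() walks the list on every call.
$slowHit = in_array('jpg', $allowedExtensions, true);

// Flip once so the values become keys, then each check is a hash lookup.
$lookup = array_flip($allowedExtensions);
$fastHit = isset($lookup['jpg']);
$fastMiss = isset($lookup['bmp']);
```

This only pays off when the same list is checked many times, for example inside a loop; for a single check the flip costs more than it saves.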
39. Distribute the load
• Perfect for heavy processing for some type of data
• Queue code that requires heavy processing but not immediate viewing
• Design your UX so you can inform users of completed jobs
• Cache complex work items
40. Pick your system
• php-resque
• Gearman
• Beanstalkd
• IronMQ
• RabbitMQ
• ZeroMQ
• Amazon SQS
• Just visit http://queues.io
44. Networking 101
• IP – forwards packets of data based on a destination address
• TCP – verifies the correct delivery of data from client to server with error
and lost data correction
• Network Sockets – subroutines that provide TCP/IP (and UDP and some
other support) on most systems
46. Speed in the series of tubes
• Bandwidth – size of your pipe
• Latency – length of your pipe including size changes
• Jitter – air bubbles in your pipe
48. Definitions
• Socket
• Bidirectional network stream that speaks a protocol
• Transport
• Tells a network stream how to communicate
• Wrapper
• Tells a stream how to handle specific protocols and encodings
50. What does this have to do with PHP?
• APIs fail
• APIs go bye-bye
• AWS goes down
• Or loses network connection to a specific area
• Or otherwise fails
52. Prepare for failure
• Handle timeouts
• Handle failures
• Abstract enough to replace systems if necessary, but only as much as
necessary
• If you’re not paying for it, don’t base your business model on it
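One possible shape for "handle timeouts, handle failures" with nothing but stream contexts. fetchOrFallback() is a hypothetical helper, and the fallback string stands in for whatever cached or default value fits your application.

```php
<?php
// Hypothetical helper: fetch a URL, but give up quickly and fall back.
function fetchOrFallback(string $url, float $timeoutSeconds, string $fallback): string
{
    // The http context options also apply to https:// URLs.
    $context = stream_context_create([
        'http' => ['timeout' => $timeoutSeconds],
    ]);
    // @ silences the warning; the false return value is handled explicitly.
    $body = @file_get_contents($url, false, $context);
    return $body === false ? $fallback : $body;
}
```

A call such as fetchOrFallback('https://api.example.com/cats', 2.0, '[]') (the URL is a placeholder) returns the fallback after roughly two seconds instead of hanging on the default socket timeout.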
53. Checklist
• Cultivate good coding habits
• Try not to loop logic or processing
• Don’t be afraid to offload work to other systems or services
• Assume every file is huge
• Assume there are 1 million rows in your DB table
• Assume that every network request is slow or going to fail
• Profile to find code bottlenecks, DON’T assume you know the bottleneck
• Wrap 3rd party tools enough to deal with downtime or retirement of APIs
No matter how many virtual machines you throw at a problem, you always have the physical limitations of hardware. Memory, CPU, and even your NIC's throughput have finite limits. Are you trying to load that 5 GB CSV into memory to process it? No really, you shouldn't! PHP has many built-in features to deal with data in more efficient ways than pumping everything into an array or object. Using PHP streams and stream filtering mechanisms you can work with chunked data in an efficient manner; with sockets and processes you can farm out work efficiently and still keep track of what your application is doing. These features can help with memory, CPU, and other physical system limitations to help you scale without the giant AWS bill.
Our first physical law we’ll talk about is mass, how much matter is in a thing
In physics, mass is a property of a physical body. It is the measure of an object's resistance to acceleration (a change in its state of motion) when a force is applied.[1] It also determines the strength of its mutual gravitational attraction to other bodies.
Mass is not the same as weight, even though we often calculate an object's mass by measuring its weight with a spring scale, rather than comparing it directly with known masses. An object on the Moon would weigh less than it does on Earth because of the lower gravity, but it would still have the same mass. This is because weight is a force, while mass is the property that (along with gravity) determines the strength of this force.
NAND sacrifices the random-access and execute-in-place advantages of NOR. NAND is best suited to systems requiring high capacity data storage. It offers higher densities, larger capacities, and lower cost. It has faster erases, sequential writes, and sequential reads.
HDDs
• Enthusiast multimedia users and heavy downloaders: Video collectors need space, and you can only get to 4TB of space cheaply with hard drives.
• Budget buyers: Ditto. Plenty of cheap space. SSDs are too expensive for $500 PC buyers.
• Graphic arts and engineering professionals: Video and photo editors wear out storage by overuse. Replacing a 1TB hard drive will be cheaper than replacing a 500GB SSD.
The maximum areal storage density for flash memory used in SSDs is 2.8 Tbit/in² in laboratory demonstrations as of 2016, and the maximum for HDDs is 1.5 Tbit/in². The areal density of flash memory is doubling every two years, similar to Moore's law (40% per year) and faster than the 10–20% per year for HDDs. As of 2016, maximum capacity was 10 terabytes for an HDD,[10] and 15 terabytes for an SSD.[15] HDDs were used in 70% of the desktop and notebook computers produced in 2016, and SSDs were used in 30%. The usage share of HDDs is declining and could drop below 50% in 2018–2019 according to one forecast, because SSDs are replacing smaller-capacity (less than one-terabyte) HDDs in desktop and notebook computers and MP3 players.[154]
Areal density is a measure of the quantity of information bits that can be stored on a given length of track, area of surface, or in a given volume of a computer storage medium. Generally, higher density is more desirable, for it allows greater volumes of data to be stored in the same physical space. Density therefore has a direct relationship to storage capacity of a given medium. Density also generally has a fairly direct effect on the performance within a particular medium, as well as price.
Story about me and my vm drive and the bad blocks go bad
Let’s talk about how space on disk is important
Talk about a very early experiment writing a chat room for my personal website… using a text file that I appended to and read
Hey, this was 1998 and it was running PHP-Nuke ;)
But that rapidly changed to a gigabyte file when my friends tested it all night long
Quick computer science lesson
Originally done with magic numbers in Fortran; C and Unix standardized the way it worked
On Unix and related systems based on the C programming language, a stream is a source or sink of data, usually individual bytes or characters. Streams are an abstraction used when reading or writing files, or communicating over network sockets. The standard streams are three streams made available to all programs.
Who else uses them? Most languages descended from C have the “files as streams” concept and ways to extend the IO functionality beyond merely files, which lets files and other kinds of IO be handled together in the same way
Great way to standardize the way data is grabbed and used
Questions on who has used streams in other languages
Streams are a huge underlying component of PHP
Streams were introduced with PHP 4.3.0 – they are old, but underuse means they can have rough edges… so TEST TEST TEST
But they are more powerful than almost anything else you can use
Why is this better ?
Lots and lots of data in small chunks lets you do large volumes without maxing out memory and cpu
So this is a very common problem in PHP scripts, PHP bombing out because a file_get_contents call loaded something too big into memory
Although file_get_contents is pretty great, doing it without a size check is deadly (well, unless you totally control the file)
Consider files to be user data, just like that POST data you just got from a user
Any good extension will use the underlying streams API to let you use any kind of stream
for example, cairo does this
Documentation for working with PHP streams is spread across at least two portions of the manual, plus appendixes for the built-in transports/filters/context options. It’s very poorly arranged, so be sure to take the time to learn where to look in the manual – there should be three main places
What doesn’t use streams? chmod, touch and some other very file-specific functionality, lazy/bad extensions, and extensions with issues in the libraries they wrap around
All input and output comes into PHP
It gets pushed through a streams filter
Then through the streams wrapper
During this point the stream context is available for the filter and wrapper to use
Streams themselves are the “objects” coming in
Wrappers are the “classes” defining how to deal with the stream
Some notes – file_get_contents and its cousin stream_get_contents are your fastest, most efficient way if you need the whole file
file() is going to be the best way to get the whole file split by lines
Both are going to stick the whole file into memory at some point.
For very large files and to help with memory consumption, the use of fgets and fread will help
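As a sketch of the fread approach, the following computes a checksum while holding at most one chunk in memory. hashFileInChunks() is a hypothetical helper and the 8192-byte chunk size is an arbitrary choice.

```php
<?php
// Hypothetical helper: SHA-256 of a file without loading the whole file.
function hashFileInChunks(string $path, int $chunkBytes = 8192): string
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open {$path}");
    }
    $context = hash_init('sha256');
    try {
        while (!feof($handle)) {
            $chunk = fread($handle, $chunkBytes);
            if ($chunk === false) {
                throw new RuntimeException("Read failed on {$path}");
            }
            hash_update($context, $chunk); // only this chunk is in memory
        }
    } finally {
        fclose($handle);
    }
    return hash_final($context);
}
```

The result matches hashing the full contents in one go, but peak memory stays near the chunk size whether the file is a kilobyte or a terabyte.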
You don’t even have to load all the data in to work on it with PHP! You can do everything on the fly in chunks
That’s the magic of filtering
A filter is a final piece of code which may perform operations on data as it is being read from or written to a stream. Any number of filters may be stacked onto a stream. Custom filters can be defined in a PHP script using stream_filter_register() or in an extension using the API Reference in Working with streams. To access the list of currently registered filters, use stream_get_filters().
Stream data is read from resources (both local and remote) in chunks, with any unconsumed data kept in internal buffers. When a new filter is prepended to a stream, data in the internal buffers, which has already been processed through other filters will not be reprocessed through the new filter at that time. This differs from the behavior of stream_filter_append().
Filters are nice for manipulating data on the fly – but remember you’ll be getting data in chunks, so your filter needs to be smart enough to handle that
Filters can be appended or prepended – and attached to READ or WRITE
Notice that stream_filter_prepend() and stream_filter_append() are smart – if you opened the stream with the r flag, by default the filter attaches to the read chain; if you opened it with the w flag, it attaches to the write chain
Note: When a filter is added for read and write, two instances of the filter are created. stream_filter_prepend() must be called twice with STREAM_FILTER_READ and STREAM_FILTER_WRITE to get both filter resources.
Well it may look like manipulating data in a variable is preferable to the above. But the above is just a simple example. Once you add a filter to a stream it basically hides all the implementation details from the user. You will be unaware of the data being manipulated in a stream.
And also the same filter can be used with any stream (files, urls, various protocols etc.) without any changes to the underlying code.
Also multiple filters can be chained together, so that the output of one can be the input of another.
The filters need an input state and an output state. And they need to respect the fact that number of requested bytes does not necessarily mean reading the same amount of data on the other end. In fact the output side does generally not know whether less, the same amount or more input is to be read. But this can be dealt with inside the filter. However the filters should return the number input vs the number of output filters always independently. Regarding states we would be interested if reaching EOD on the input state meant reaching EOD on the output side prior to the requested amount, at the requested amount or not at all yet (more data available).
Throw away all your old assumptions and make a new one
Trust no one with your file/stream manipulations
Assume that file is a terabyte zip file of cat gifs
Nope, not kidding
Dimension is a neat word because we overload it, like filter
In physics and mathematics, the dimension of a mathematical space (or object) is informally defined as the minimum number of coordinates needed to specify any point within it.
The "memory wall" is the growing disparity of speed between CPU and memory outside the CPU chip. An important reason for this disparity is the limited communication bandwidth beyond chip boundaries, which is also referred to as bandwidth wall. From 1986 to 2000, CPU speed improved at an annual rate of 55% while memory speed only improved at 10%. Given these trends, it was expected that memory latency would become an overwhelming bottleneck in computer performance
Uploading items kept failing
Realized the issue was the sheer amount of data being synced, because the system had waited all evening when wifi went out
A lot of people don’t realize that you as a developer are responsible for managing the amount of memory consumed by PHP
No one wants to hear that but it’s true
PHP’s inherent characteristics are hiding this issue with memory
It’s very hard to duplicate very large scale issues in testing, which is often why this stuff isn’t caught until it’s time to deploy
SO, I had a system that worked locally on a client and then had a nightly upload
The upload itself was working properly – saving the data and processing it appropriately
But the RETURN was failing and screwing up the clients, because the return package was simply too large to send down
Two changes were made to the system to allow syncing already saved entries, and we no longer passed back the changed data
Frankly because it wasn’t important
But also because we added windowing to the paging system
In PHP 5.x a whopping 144 bytes per element were required. In PHP 7 the value is down to 36 bytes, or 32 bytes for the packed case, but it’s STILL not the best
So think about 30K items in an array: at those per-element costs that is roughly 4.3 MB in PHP 5.x and still about 1 MB in PHP 7
Story about a cron job adding new items to our database, and no windowing functionality in sight
Nested loops are of the devil
Assume the data is always big
Because if it’s not now, some day it will be
Tests usually miss this, so just make a good habit
So the first thing you always want to think about in your PHP application is speed right? After all PHP is soooo slow, computers are so slow…
In everyday use and in kinematics, the speed of an object is the magnitude of its velocity (the rate of change of its position); it is thus a scalar quantity.
The fastest possible speed at which energy or information can travel, according to special relativity, is the speed of light in a vacuum c = 299,792,458 metres per second (approximately 1,079,000,000 km/h or 671,000,000 mph). Matter cannot quite reach the speed of light, as this would require an infinite amount of energy. In relativity physics, the concept of rapidity replaces the classical idea of speed.
A microprocessor -- also known as a CPU or central processing unit -- is a complete computation engine that is fabricated on a single chip.
Using its ALU (Arithmetic/Logic Unit), a microprocessor can perform mathematical operations like addition, subtraction, multiplication and division. Modern microprocessors contain complete floating point processors that can perform extremely sophisticated operations on large floating point numbers.
A microprocessor can move data from one memory location to another.
A microprocessor can make decisions and jump to a new set of instructions based on those decisions.
Transmission delays occur in the wires that connect things together on a chip. The "wires" on a chip are incredibly small aluminum or copper strips etched onto the silicon. A chip is nothing more than a collection of transistors and wires that hook them together, and a transistor is nothing but an on/off switch. When a switch changes its state from on to off or off to on, it has to either charge up or drain the wire that connects the transistor to the next transistor down the line. Imagine that a transistor is currently "on." The wire it is driving is filled with electrons. When the switch changes to "off," it has to drain off those electrons, and that takes time. The bigger the wire, the longer it takes.
As the size of the wires has gotten smaller over the years, the time required to change states has gotten smaller, too. But there is some limit -- charging and draining the wires takes time. That limit imposes a speed limit on the chip.
There is also a minimum amount of time that a transistor takes to flip states. Transistors are chained together in strings, so the transistor delays add up. On a complex chip like the G5, there are likely to be longer chains, and the length of the longest chain limits the maximum speed of the entire chip.
Finally, there is heat. Every time the transistors in a gate change state, they leak a little electricity. This electricity creates heat. As transistor sizes shrink, the amount of wasted current (and therefore heat) has declined, but there is still heat being created. The faster a chip goes, the more heat it generates. Heat build-up puts another limit on speed.
Moore’s law, as popularly stated: processor speeds, or overall processing power for computers, will double every two years
Overclocking and burning a chip to death story
Talk about my experiment with overclocking my Athlon
Was my first custom built computer, I thought I was cool beans because I figured out how to flash it, Athlons were the new shiny
I overclocked the crap out of it
And… it caught on fire (partially because I applied the heat stuff too thin, but also because I overclocked it too much)
The smell of a burning processor is something I will not forget, and overclocking like that is something I will never do again!
Profile (don’t guess) if you need to improve speed beyond good habits
These are just a few “good habits” to cultivate when coding
None of them SHOULD be new, and these are often considered micro optimizations
It’s probably not worth rewriting your code for these, but it IS worth it to cultivate them as a natural part of your coding style
It just takes practice
I’m sure there are more!
There is nothing wrong with offloading work
PHP scales VERY well horizontally, and often pretty cheaply horizontally as well
Spin up a dedicated box for jobs
If you have scaling in place, you can spin up two during heavy load times!
Heavy processing is often reports or generating files or images
It’s not realistic to expect complex reports to be done in seconds; physics apply here too, and good UX will mask your offloading
Caching is a good way to balance offloaded work with immediate results
I’m not going to go into a lot of detail here, because what you eventually pick for jobs/queuing is going to be specific to your needs
Queues.io is actually a really nice resource with lots of different queue types for many different languages
Story about the render system, and how good choices here (queueing, triggering one job from another, etc) made even huge file generation just work
Velocity is a physical vector quantity; both magnitude and direction are needed to define it. The scalar absolute value (magnitude) of velocity is called "speed", being a coherent derived unit whose quantity is measured in the SI (metric) system as metres per second (m/s) or as the SI base unit of (m⋅s−1). For example, "5 metres per second" is a scalar, whereas "5 metres per second east" is a vector.
Speed describes only how fast an object is moving, whereas velocity gives both how fast and in what direction the object is moving.
As with all other communications protocols, TCP/IP is composed of layers:
IP - is responsible for moving packets of data from node to node. IP forwards each packet based on a four byte destination address (the IP number). The Internet authorities assign ranges of numbers to different organizations. The organizations assign groups of their numbers to departments. IP operates on gateway machines that move data from department to organization to region and then around the world.
TCP - is responsible for verifying the correct delivery of data from client to server. Data can be lost in the intermediate network. TCP adds support to detect errors or lost data and to trigger retransmission until the data is correctly and completely received.
Sockets - is a name given to the package of subroutines that provide access to TCP/IP on most systems.
So your application level is the basic data you want to send
in most HTTP applications this is your HTTP page INCLUDING the headers section
the transport is how you’re sending it – UDP and TCP are the most popular
the Internet layer is the “IP” layer – with the header telling the system what address (IP) to send the data to; the port to deliver it to lives in the transport (TCP/UDP) header
then you get a frame header and footer on the actual piece of data the packet being sent
This is a VERY simplified analogy, but for the basic idea – think of the internet as water flowing through pipes at a constant pressure (data is electricity so close to the speed of light)
bigger and better pipes can handle more, you can get air bubbles in the pipes, and no matter what you did, if the pipe is longer it will take longer
There are different socket types you can use; a lot of people use TCP and HTTP because they’re known protocols
What is streamable behavior? We’ll get to that in a bit
Protocol: set of rules which is used by computers to communicate with each other across a network
Resource: A resource is a special variable, holding a reference to an external resource
Talk about resources in PHP and talk about general protocols, get a list from the audience of protocols they can name (yes http is a protocol)
A socket is a special type of stream – pound this into their heads
A socket is an endpoint of communication to which a name can be bound. A socket has a type and one associated process. Sockets were designed to implement the client-server model for interprocess communication where:
In PHP, a wrapper ties the stream to the transport – so your http wrapper ties your PHP data to the http transport and tells it how to behave when reading and writing data
By default sockets are going to assume TCP – since that’s a pretty standard way of doing things. Notice that we have to do things the old-fashioned way just for this simple HTTP request – sticking our headers together, making sure stuff gets closed. However, if you can’t use allow_url_fopen this is a way around it
a dirty dirty way but – there you have it
remember allow_url_fopen only stops “drive-by” hacking
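The old-fashioned approach described above might look roughly like the following sketch. buildGetRequest() and rawHttpGet() are hypothetical names, and stream_socket_client() is used here as one way to open the TCP socket; the slide's own code is not reproduced in these notes, so treat this as an illustration rather than the original example.

```php
<?php
// Hypothetical helper: stick the request line and headers together by hand.
function buildGetRequest(string $host, string $path): string
{
    return "GET {$path} HTTP/1.1\r\n"
        . "Host: {$host}\r\n"
        . "Connection: close\r\n"
        . "\r\n";
}

// Hypothetical helper: plain HTTP over a TCP socket, with timeouts handled.
function rawHttpGet(string $host, string $path = '/', int $port = 80, float $timeout = 5.0): string
{
    $socket = @stream_socket_client("tcp://{$host}:{$port}", $errno, $errstr, $timeout);
    if ($socket === false) {
        throw new RuntimeException("Connect failed: {$errstr} ({$errno})");
    }
    stream_set_timeout($socket, (int) ceil($timeout));
    try {
        fwrite($socket, buildGetRequest($host, $path));
        $response = '';
        while (!feof($socket)) {
            $chunk = fread($socket, 8192);
            if ($chunk === false) {
                break;
            }
            $response .= $chunk;
            if (stream_get_meta_data($socket)['timed_out']) {
                throw new RuntimeException('Read timed out');
            }
        }
        return $response; // raw status line, headers and body
    } finally {
        fclose($socket); // making sure stuff gets closed
    }
}
```

The return value still contains the status line and headers, which is exactly the work the http wrapper normally does for you.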
Docker and s3 and how abstracting stuff out kept me sane
Also how error handling
Your checklist for not running out of PHP memory when your code runs
There is SOOO much more you can do from hooking objects to hooking the engine!