Just a reminder that the Official DrupalCon Party is tonight. Buses start leaving from here at 4pm and will keep leaving continuously for a while, which is good since all of you have places to be for the next 50 minutes…
Discuss what data sharding is, when you might need to shard your data, and what effects this has on your site or application. HOW: horizontal (partitioning) and vertical (federation).
Horizontal: more machines. Vertical: bigger machines. Vertical will always eventually reach a limit.
What is it? I’ll cover the different types and ways you can shard your data. How does sharding help? How does it hurt? In short, WHEN is sharding right for me? Why not just keep scaling vertically?
Breaking apart your data is the easy part. The hard part is putting it back together again seamlessly. This was one of several broken plates that came from my wife’s great-grandmother. I didn’t do it?
It’s easier to scale smaller pieces, which makes it easier to scale horizontally. Take one application that shares sensitive data and split it. When you moved your cache to memcache, that IS sharding. So is using Varnish or a CDN like Akamai (both are forms of federated sharding).
Reduce your table indices: the more data you have, the larger your table index overhead will be. Reduce that and you gain performance. All else being equal, a table with a million rows will generally perform better than a table with 10 million rows. Share your data with other applications or users: great for taking CVs or form data that will be processed by an internal (proprietary) system. Sometimes physically storing sensitive data (user information, credit card numbers, etc.) in a different database can be a good idea. Don’t store these things in a database that can be accessed via non-SSL web servers.
You guys are here to hear about scaling, so let’s talk about all the other things you do to scale. Load balancers: Apache’s mod_proxy and mod_proxy_balancer modules are a cheap way to load balance. There are plenty of cloud-based as well as hardware balancers you can use. Drupal 7 offers the concept of slave-safe queries (even in Views 3).
Have you performance tested? Is your problem data or application? Make sure that the size of your data is your problem… Compile PHP and Apache without default modules. (Gentoo joke.) Do you really need PDFLib or LibXML? Memory is cheap, DBAs are not.
Make the individual pieces smaller vs. make the whole smaller. A partition is a single piece split in half: even/odd IDs, or letters of the alphabet for user names. It reduces index size. A “federation” is defined as a “set of things”: logical divisions such as states, counties, countries. These tend to be discrete or atomic.
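A rough sketch of the two partitioning rules mentioned above (even/odd IDs and letters of the alphabet). The shard names and the a-m / n-z split point are assumptions made up for illustration; this is not Drupal code, just the routing idea.

```python
def shard_by_parity(user_id: int) -> str:
    """Even/odd split: even IDs go to one shard, odd IDs to the other."""
    # "shard_even" and "shard_odd" are hypothetical shard names.
    return "shard_even" if user_id % 2 == 0 else "shard_odd"


def shard_by_initial(username: str) -> str:
    """Alphabet split: names starting with a-m go to one shard, everything else to the other."""
    first = username[:1].lower()
    # Empty names and names starting with digits or symbols fall through to the second shard.
    return "shard_a_m" if first and "a" <= first <= "m" else "shard_n_z"
```

Either function gives you the shard to query for a given user. Each shard then holds roughly half the rows, which is where the smaller index comes from.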
Reasons to choose horizontal partitioning. Everything includes memcached, load-balanced web servers, and master/slave MySQL replication. This is the sharding technique of last resort.
The total number of rows in each table is reduced. This reduces index size, which generally improves search performance.
This is why, in theory, horizontal scaling sounds great: you have N database clusters.
Manageability: have you seen the number of tables in a Drupal install, especially in an install with tons of modules?
The secondary databases no longer need to be MySQL. Notice how the secondary database clusters are starting to look more like cache clusters.
Disqus for commenting. Edge-side includes for CDNs. These are examples of application sharding.
I want my website to collect resumes. I want to dump resumes into my HR database, but I don’t want all my HR data exposed to the web.
Suppose your corporation’s web site sees thousands of applications per month or week. It might be a good idea to shard this data for scale. But you can also shard it so that your HR department’s software can repurpose the data. Maybe you don’t want those guys with administrative access on the site… Keep personal information secure and off your company’s main website.
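A minimal sketch of the federated idea in the resume example: sensitive tables live on a separate connection and everything else stays on the main site database. The table names and connection keys here are hypothetical, and this is plain Python showing the lookup, not how Drupal itself routes queries.

```python
# Hypothetical mapping of sensitive tables to a separate connection key.
FEDERATION = {
    "resumes": "hr",
    "applicants": "hr",
}

# Hypothetical key for the main website database.
DEFAULT_CONNECTION = "default"


def connection_for(table: str) -> str:
    """Return the connection key that holds the given table."""
    # Anything not listed in FEDERATION stays on the main site database.
    return FEDERATION.get(table, DEFAULT_CONNECTION)
```

The point of keeping the mapping in one place is that the web site only ever needs credentials for the tables it writes to. The rest of the HR data never has to be reachable from the web.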
This takes place in settings.php. In this example we are sharing user data between multiple sites or applications. Profile field data will be available to both.
This takes place in settings.php. Since profiles are integrated as fields, you may not have those tables.
Note: This scheme will only work with databases of the same type. You can’t mix PostgreSQL and MySQL connections here. You’ll be able to use different connection strings with usernames, etc.
This does not HAVE to take place in settings.php, but it should be there if at all possible. moduleKey can be anything unique to your module.
Setting the schema is not part of this, but strongly advised. drupal_get_schema() will static cache the table definition. db_set_active() will switch database connections and THEN load the schema: from static cache first, then database cache, then from code. If it can’t find the cache tables after you’ve switched database connections, it tries to throw an error, which cascades down a dark path of errors after it can’t find the system table, etc.
What are the advantages of switching database connections? You can still use Drupal’s schema and database APIs. A smaller database for your website helps with master/slave replication (faster), makes backups more manageable, and means less overhead.
From Drupal’s perspective, here’s how that looks
Mongo abstracts away the need to scale horizontally: Mongo does the horizontal partitioning for you. This scales the application vertically.
I’m not affiliated with 10gen, I just wanted to mention their conference since we’re all here in London. They’ll have several Drupal-related sessions.
Out of the box, the MongoDB module already does some things to help speed up and scale your site.
Here’s a sample document that contains resume data. It’s stored in BSON (binary JSON).
This is a sample query to return all users with the last name of “Smith”. Applicants is a collection object. Applicant is a cursor object that you can loop through. You can use findOne() to get a single return: $user = $users->findOne(array('username' => 'Smith', 'ssn' => 1), array('first_name', 'last_name'));
THERE’S NO WEB SERVER INVOLVED AT ALL. In addition to performance, you can share your MongoDB data via REST for use in additional services. You can share your data using REST and JSON to display content without costly queries.
This gets a JSON object. Note the trailing slash after the collection name. You might need another REST interface like Sleepy.Mongoose for more advanced REST data.