MySQL and Search at Craigslist

Jeremy Zawodny
jzawodn@craigslist.org
http://craigslist.org/

Jeremy@Zawodny.com
http://jeremy.zawodny.com/blog/
Who Am I?
● Creator and co-author of High Performance MySQL
● Creator of mytop
● Perl Hacker
● MySQL Geek
● Craigslist Engineer (as of July, 2008)
    – MySQL, Data, Search, Perl
● Ex-Yahoo (Perl, MySQL, Search, Web Services)
What is Craigslist?
● Local Classifieds
    – Jobs, Housing, Autos, Goods, Services
● ~500 cities world-wide
● Free
    – Except for jobs in ~18 cities and brokered apartments in NYC
    – Over 20B pageviews/month
    – 50M monthly users
    – 50+ countries, multiple languages
    – 40+M ads/month, 10+M images
What is Craigslist?
● Forums
    – 100M posts
    – 100s of forums
Technical and other Challenges
● High ad churn rate
    – Post half-life can be short
● Growth
● High traffic volume
● Back-end tools and data analysis needs
● Growth
● Need to archive postings... forever!
    – 100s of millions, searchable
● Internationalization and UTF-8
Technical and other Challenges
● Small Team
    – Fires take priority
    – Infrastructure gets creaky
    – Organic code and schema growth over years
● Growth
● Lack of abstractions
    – Too much embedded SQL in code
● Documentation vs. Institutional Knowledge
    – “Why do we have things configured like this?”
Goals
● Use Open Source
● Keep infrastructure small and simple
    – Lower power is good!
    – Efficiency all around
    – Do more with less
● Keep site easy and approachable
    – Don't overload with features
    – People are easily confused
Craigslist Internals Overview
[Diagram: layers of the stack, top to bottom]
● Load Balancer
● Read Proxy Array (Perl + memcached) alongside Write Proxy Array (...)
● Web Read Array (Apache 1.3 + mod_perl)
● Object Cache (Perl + memcached) alongside Search Cluster (Sphinx)
● Read DB Cluster (MySQL 5.0.xx)
● Not Included:
    – user db, image db
    – async tasks, email
    – accounting, internal tools
    – and more!
Vertical Partitioning: Roles
[Diagram of database roles. Labels, row by row: Users, Classifieds, Forums; Write, Read, Long, Trash; Stats, Archive]
Vertical Partitioning
● Different roles have different access patterns
    – Sub-roles based on query type
● Easier to manage and scale
● Logical, self-contained data
● Servers may not need to be as big/fast/expensive
● Difficult to do retroactively
● Various named db “handles” in code
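The named db “handles” idea can be pictured as a small registry keyed by role and sub-role. This is only a sketch in Python, not Craigslist's code: the `Handle` structure, the `HANDLES` table, the host names, and `get_handle` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Handle:
    """Connection settings for one role/sub-role (hypothetical structure)."""
    host: str
    database: str
    read_only: bool


# Hypothetical mapping of "role/sub-role" names to servers.
HANDLES = {
    "classifieds/write": Handle("db-cl-master.example", "classifieds", False),
    "classifieds/read": Handle("db-cl-read.example", "classifieds", True),
    "classifieds/long": Handle("db-cl-long.example", "classifieds", True),
    "forums/read": Handle("db-forums-read.example", "forums", True),
    "archive/read": Handle("db-archive.example", "archive", True),
}


def get_handle(name: str) -> Handle:
    """Look up a handle by name so callers never hard-code a host."""
    try:
        return HANDLES[name]
    except KeyError:
        raise ValueError(f"unknown db handle: {name!r}") from None


if __name__ == "__main__":
    handle = get_handle("classifieds/long")
    print(handle.host, handle.read_only)
```

Because application code asks for a role by name, a role can move to a different (possibly smaller) server by editing the table alone.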
Horizontal Partitioning: Hydra
[Diagram: one client connected to cluster_01, cluster_02, cluster_03, ..., cluster_N]
Horizontal Partitioning: Hydra
● Need to retrofit a lot of code
● Need non-blocking Perl MySQL client
● Wrapped http://code.google.com/p/perl-mysql-async/
● Eventually can size DB boxes based on price/power and adjust mapping function(s)
    – Choose hardware first
    – Make the db “fit”
● Archiving lets us age a cluster instead of migrating its data to a new one.
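A mapping function of the kind mentioned above might look like the following Python sketch. The slides do not describe the real function, so everything here is an assumption: the weights, the use of MD5 for hashing, and the names `CLUSTERS` and `build_mapper`. The weights stand in for "size DB boxes based on price/power": a bigger box gets a bigger share of keys.

```python
import bisect
import hashlib
from itertools import accumulate

# Hypothetical cluster names with relative capacities (weights).
CLUSTERS = [("cluster_01", 1), ("cluster_02", 1), ("cluster_03", 2)]


def build_mapper(clusters):
    """Return a function that maps a record key to a cluster name.

    Each cluster owns a share of hash slots proportional to its weight.
    """
    names = [name for name, _ in clusters]
    bounds = list(accumulate(weight for _, weight in clusters))
    total = bounds[-1]

    def cluster_for(key: int) -> str:
        # Hash the key so that sequential IDs spread across clusters.
        digest = hashlib.md5(str(key).encode("ascii")).digest()
        slot = int.from_bytes(digest[:8], "big") % total
        # Find the first cluster whose cumulative weight exceeds the slot.
        return names[bisect.bisect_right(bounds, slot)]

    return cluster_for


if __name__ == "__main__":
    cluster_for = build_mapper(CLUSTERS)
    for posting_id in (1001, 1002, 1003):
        print(posting_id, cluster_for(posting_id))
```

A plain modulo scheme like this reassigns many keys whenever the weights change, so it only illustrates where a mapping function sits between the client and the clusters.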
Search Evolution
● Problem: Users want to find stuff.
● Solution: Use MySQL Full Text.
● ...time passes...
● Problem: MySQL Full Text Doesn't Scale!
● Solution: Use Sphinx.
● ...time passes...
● Problem: Sphinx doesn't scale!
● Solution: Patch Sphinx.
MySQL Full-Text Problems
● Hitting invisible limits
    – CPU not pegged, Memory available
    – Disk I/O not unreasonable
    – Locking / Mutex contention? Probably.
● MyISAM has occasional crashing / corruption
● 5 clusters of 5 machines
    – Partitioning based on city and category
    – All “hand balanced” and high-maintenance
● ~30M queries/day
    – Close to limits
Sphinx: My First CL Project
● Sphinx is designed for text search
● Fast and lean C++ code
● Forking model scales well on multi-core
● Control over indexing, weighting, etc.
● Also spent some time looking at Apache Solr
Search Implementation Details
● Partitioning based on cities (each has a numeric id)
● Attributes vs. Keywords
● Persistent Connections
    – Custom client and server modifications
● Minimal stopword List
● Partition into 2 clusters (1 master, 4 slaves)
Sphinx Incremental Indexing
● Re-index every N minutes
● Use main + delta strategy
    – Adopted as: index + today + delta
    – One set per city (~500 * 3)
● Slaves handle live queries, update via rsync
● Need lots of FDs
● Use all 4 cores to index
● Every night, perform “daily merge”
● Generate config files via Perl
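To make the "one set per city" layout concrete, here is a rough Python sketch of generating the index stanzas. The talk says the real generator is written in Perl; Python is used here only for illustration. The tier names come from the slide, while the index naming scheme, the data directory, and the function names are hypothetical. A real Sphinx config would also need a matching `source` block for each index, which is omitted.

```python
TIERS = ("index", "today", "delta")  # the three sets named on the slide


def index_name(city_id: int, tier: str) -> str:
    """Hypothetical naming scheme: one index per (city, tier) pair."""
    return f"city{city_id}_{tier}"


def index_stanza(city_id: int, tier: str, data_dir: str) -> str:
    """Render one Sphinx `index` block as text."""
    name = index_name(city_id, tier)
    return (
        f"index {name}\n"
        "{\n"
        f"    source = {name}\n"
        f"    path   = {data_dir}/{name}\n"
        "}\n"
    )


def build_config(city_ids, data_dir="/var/sphinx/data"):
    """Concatenate the stanzas for every city and tier."""
    return "\n".join(
        index_stanza(city_id, tier, data_dir)
        for city_id in city_ids
        for tier in TIERS
    )


def indexes_to_search(city_id: int):
    """Assumed here: a city's query consults all three of its indexes."""
    return [index_name(city_id, tier) for tier in TIERS]


if __name__ == "__main__":
    print(build_config([1, 2]))
    print(indexes_to_search(7))
```

With roughly 500 cities this yields about 1,500 indexes, which is consistent with the slide's note about needing lots of file descriptors.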
Sphinx Issues
● Merge bugs [fixed]
● File descriptor corruption [fixed]
● Persistent connections [fixed]
    – Overhead of fork() was substantial in our testing
    – 200 queries/sec vs. 1,000 queries/sec per box
● Missing attribute updates [unreported]
● Bogus docids in responses
● We need to upgrade to latest Sphinx soon
● Andrew and team have been excellent!
Search Project Results
● From 25 MySQL Boxes to 10 Sphinx
● Lots more headroom!
● New Features
    – Nearby Search
● No seizing or locking issues
● 1,000+ qps during peak w/room to grow
● 50M queries per day w/steady growth
● Cluster partitioning built but not needed (yet?)
● Better separation of code
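The daily totals quoted in this talk can be turned into average query rates with simple arithmetic. The Python snippet below does only that; it assumes load is spread evenly over the day (real traffic is not) and divides by the total box count, even though the slides say the Sphinx slaves are the ones handling live queries.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_qps(queries_per_day: float) -> float:
    """Average queries per second, assuming evenly spread load."""
    return queries_per_day / SECONDS_PER_DAY


# Figures quoted on the slides.
mysql_qps = average_qps(30_000_000)   # MySQL full-text setup, 25 boxes
sphinx_qps = average_qps(50_000_000)  # Sphinx setup, 10 boxes

print(f"MySQL FT: {mysql_qps:.0f} qps average, {mysql_qps / 25:.1f} per box")
print(f"Sphinx:   {sphinx_qps:.0f} qps average, {sphinx_qps / 10:.1f} per box")
```

The averages come out to roughly 347 qps and 579 qps, which fits the slide's figure of 1,000+ qps at peak.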
Sphinx Wishlist
● Efficient delete handling (kill lists)
● Non-fatal “missing” indexes
● Index dump tool
● Live document add/change/delete
● Built-in replication
● Stats and counters
● Text attributes
● Protocol checksum
Data Archiving, Replication, Indexes
● Problem: We want to keep everything.
● Solution: Archive to an archive cluster.
● Problem: Archiving is too painful. Index updates are expensive! Slaves affected.
● Solution: Archive with home-grown eventually consistent replication.
Data Archiving: OOB Replication
● Eventual Consistency
● Master process
    – SET SQL_LOG_BIN=0
    – Select expired IDs
    – Export records from live master
    – Import records into archive master
    – Delete expired from live master
    – Add IDs to list
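The master-side steps above can be sketched in Python, with in-memory SQLite databases standing in for the MySQL servers so the example runs anywhere. The table layouts, the `expires` column, and the idea that the ID list is a table on a separate connection are assumptions; the slides do not say where the list lives or what the schema looks like.

```python
import sqlite3

POSTINGS_DDL = (
    "CREATE TABLE postings (id INTEGER PRIMARY KEY, body TEXT, expires INTEGER)"
)


def setup():
    """Create stand-ins for the live master, archive master, and ID list."""
    live = sqlite3.connect(":memory:")
    archive = sqlite3.connect(":memory:")
    id_list = sqlite3.connect(":memory:")
    live.execute(POSTINGS_DDL)
    archive.execute(POSTINGS_DDL)
    id_list.execute(
        "CREATE TABLE expired_ids "
        "(seq INTEGER PRIMARY KEY AUTOINCREMENT, posting_id INTEGER)"
    )
    return live, archive, id_list


def archive_expired(live, archive, id_list, now, batch_size=100):
    """Move one batch of expired postings from live to archive."""
    # On MySQL the session would start with SET SQL_LOG_BIN=0, which keeps
    # this session's writes out of the binary log. SQLite has no equivalent.
    rows = live.execute(
        "SELECT id, body, expires FROM postings "
        "WHERE expires <= ? ORDER BY id LIMIT ?",
        (now, batch_size),
    ).fetchall()
    if not rows:
        return 0
    ids = [(row[0],) for row in rows]
    # Import into the archive and commit before touching the live copy.
    archive.executemany("INSERT OR REPLACE INTO postings VALUES (?, ?, ?)", rows)
    archive.commit()
    live.executemany("DELETE FROM postings WHERE id = ?", ids)
    live.commit()
    # Publish the IDs so that slave processes can delete their copies later.
    id_list.executemany("INSERT INTO expired_ids (posting_id) VALUES (?)", ids)
    id_list.commit()
    return len(rows)
```

Committing the archive insert before the live delete means a crash between the two steps leaves a duplicate rather than a lost record.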
Data Archiving: OOB Replication
● Slave process
    – One per MySQL slave
    – Throttled to minimize impact
    – State kept on slave
        ● Clone friendly
    – Simple logic
        ● Select expired IDs added since my sequence number
        ● Delete expired records
        ● Update local “last seen” sequence number
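The slave-side logic above is small enough to sketch in full. As before this is Python with in-memory SQLite standing in for MySQL, and the table names (`expired_ids`, `archive_state`), the batch size, and the sleep-based throttle are assumptions rather than details from the talk.

```python
import sqlite3
import time


def make_demo():
    """Build a stand-in slave (5 postings) and an ID list (2 expired IDs)."""
    slave = sqlite3.connect(":memory:")
    slave.execute("CREATE TABLE postings (id INTEGER PRIMARY KEY, body TEXT)")
    slave.executemany(
        "INSERT INTO postings VALUES (?, ?)",
        [(i, f"ad {i}") for i in range(1, 6)],
    )
    # The "last seen" sequence number lives on the slave itself.
    slave.execute("CREATE TABLE archive_state (last_seen INTEGER NOT NULL)")
    slave.execute("INSERT INTO archive_state VALUES (0)")
    slave.commit()
    id_list = sqlite3.connect(":memory:")
    id_list.execute(
        "CREATE TABLE expired_ids "
        "(seq INTEGER PRIMARY KEY AUTOINCREMENT, posting_id INTEGER)"
    )
    id_list.executemany(
        "INSERT INTO expired_ids (posting_id) VALUES (?)", [(2,), (4,)]
    )
    id_list.commit()
    return slave, id_list


def prune_slave(slave, id_list, batch_size=100, pause=0.0):
    """Apply pending deletes to one slave; return how many IDs were handled."""
    (last_seen,) = slave.execute("SELECT last_seen FROM archive_state").fetchone()
    processed = 0
    while True:
        batch = id_list.execute(
            "SELECT seq, posting_id FROM expired_ids "
            "WHERE seq > ? ORDER BY seq LIMIT ?",
            (last_seen, batch_size),
        ).fetchall()
        if not batch:
            return processed
        slave.executemany(
            "DELETE FROM postings WHERE id = ?", [(pid,) for _, pid in batch]
        )
        last_seen = batch[-1][0]
        slave.execute("UPDATE archive_state SET last_seen = ?", (last_seen,))
        slave.commit()  # deletes and new position are committed together
        processed += len(batch)
        time.sleep(pause)  # crude throttle between batches
```

Because the position is stored on the slave, a cloned slave starts with the same sequence number as its source and simply continues from there, which is one way to read the "clone friendly" note.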
Long Term Data Archiving
● Schema coupling is bad
    – ALTER TABLE takes forever
    – Lots of NULLs flying around
● CouchDB or similar long-term?
    – Schema-free feels like a good fit
● Tested some home grown solutions already
● Separate storage and indexing?
    – Indexing with Sphinx?
Drizzle, XtraDB, Future Stuff
● CouchDB looks very interesting. Maybe for archive?
● XtraDB / InnoDB plugin
    – Better concurrency
    – Better tuning of InnoDB internals
● libdrizzle + Perl
    – DBI/DBD may not fit an async model well
    – Can talk to both MySQL and Drizzle!
● Oracle buying Sun?!?!
We're Hiring!
● Work in San Francisco
● Flexible, Small Company
● Excellent Benefits
● Help Millions of People Every Week
● We Need Perl/MySQL Hackers
● Come Help us Scale and Grow
Questions?

"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 

MySQL and Search at Craigslist

  • 1. MySQL and Search at Craigslist Jeremy Zawodny jzawodn@craigslist.org http://craigslist.org/ Jeremy@Zawodny.com http://jeremy.zawodny.com/blog/
  • 2. Who Am I?
      ● Creator and co-author of High Performance MySQL
      ● Creator of mytop
      ● Perl Hacker
      ● MySQL Geek
      ● Craigslist Engineer (as of July, 2008)
          – MySQL, Data, Search, Perl
      ● Ex-Yahoo (Perl, MySQL, Search, Web Services)
  • 4. What is Craigslist?
      ● Local Classifieds
          – Jobs, Housing, Autos, Goods, Services
      ● ~500 cities world-wide
      ● Free
          – Except for jobs in ~18 cities and brokered apartments in NYC
          – Over 20B pageviews/month
          – 50M monthly users
          – 50+ countries, multiple languages
          – 40+M ads/month, 10+M images
  • 5. What is Craigslist?
      ● Forums
          – 100M posts
          – 100s of forums
  • 6. Technical and other Challenges
      ● High ad churn rate
          – Post half-life can be short
      ● Growth
      ● High traffic volume
      ● Back-end tools and data analysis needs
      ● Growth
      ● Need to archive postings... forever!
          – 100s of millions, searchable
      ● Internationalization and UTF-8
  • 7. Technical and other Challenges
      ● Small Team
          – Fires take priority
          – Infrastructure gets creaky
          – Organic code and schema growth over years
      ● Growth
      ● Lack of abstractions
          – Too much embedded SQL in code
      ● Documentation vs. Institutional Knowledge
          – “Why do we have things configured like this?”
  • 8. Goals
      ● Use Open Source
      ● Keep infrastructure small and simple
          – Lower power is good!
          – Efficiency all around
          – Do more with less
      ● Keep site easy and approachable
          – Don't overload with features
          – People are easily confused
  • 9. Craigslist Internals Overview
      [Architecture diagram. Labels: Load Balancer; Read Proxy Array and Write Proxy Array (Perl + memcached); Web Read Array (Apache 1.3 + mod_perl); Object Cache (Perl + memcached); Search Cluster (Sphinx); Read DB Cluster (MySQL 5.0.xx)]
      ● Not Included:
          – user db, image db
          – async tasks, email
          – accounting, internal tools
          – and more!
  • 10. Vertical Partitioning: Roles
      [Diagram. Labels: Users, Classifieds, Forums, Write, Read, Long, Trash, Stats, Archive]
  • 11. Vertical Partitioning
      ● Different roles have different access patterns
          – Sub-roles based on query type
      ● Easier to manage and scale
      ● Logical, self-contained data
      ● Servers may not need to be as big/fast/expensive
      ● Difficult to do retroactively
      ● Various named db “handles” in code
  • 12. Horizontal Partitioning: Hydra
      [Diagram. Labels: client; cluster_01, cluster_02, cluster_03, ... cluster_N]
  • 13. Horizontal Partitioning: Hydra
      ● Need to retrofit a lot of code
      ● Need non-blocking Perl MySQL client
      ● Wrapped http://code.google.com/p/perl-mysql-async/
      ● Eventually can size DB boxes based on price/power and adjust mapping function(s)
          – Choose hardware first
          – Make the db “fit”
      ● Archiving lets us age a cluster instead of migrating its data to a new one.
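The talk does not show Hydra's mapping function. As a minimal sketch only, assuming a weighted modulo scheme (the names `CLUSTER_WEIGHTS`, `build_slots`, and `cluster_for` are hypothetical, and Python stands in for the Perl used at Craigslist), sizing clusters by price/power and adjusting the mapping to match could look like this:

```python
import bisect
import itertools

# Hypothetical weights: a bigger box is given a larger share of the ids.
CLUSTER_WEIGHTS = {"cluster_01": 1, "cluster_02": 1, "cluster_03": 2}

def build_slots(weights):
    """Return sorted cluster names and their cumulative weight boundaries."""
    names = sorted(weights)
    bounds = list(itertools.accumulate(weights[n] for n in names))
    return names, bounds

def cluster_for(record_id, weights=CLUSTER_WEIGHTS):
    """Map a numeric id to a cluster name in proportion to cluster weight."""
    names, bounds = build_slots(weights)
    slot = record_id % bounds[-1]  # slot in [0, total_weight)
    return names[bisect.bisect_right(bounds, slot)]

if __name__ == "__main__":
    for record_id in range(5):
        print(record_id, cluster_for(record_id))
```

Changing the weights in a scheme like this remaps existing ids, so it is an illustration of "adjust mapping function(s)" rather than a migration plan.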
  • 14. Search Evolution
      ● Problem: Users want to find stuff.
      ● Solution: Use MySQL Full Text.
      ● ...time passes...
      ● Problem: MySQL Full Text Doesn't Scale!
      ● Solution: Use Sphinx.
      ● ...time passes...
      ● Problem: Sphinx doesn't scale!
      ● Solution: Patch Sphinx.
  • 15. MySQL Full-Text Problems
      ● Hitting invisible limits
          – CPU not pegged, Memory available
          – Disk I/O not unreasonable
          – Locking / Mutex contention? Probably.
      ● MyISAM has occasional crashing / corruption
      ● 5 clusters of 5 machines
          – Partitioning based on city and category
          – All “hand balanced” and high-maintenance
      ● ~30M queries/day
          – Close to limits
  • 16. Sphinx: My First CL Project
      ● Sphinx is designed for text search
      ● Fast and lean C++ code
      ● Forking model scales well on multi-core
      ● Control over indexing, weighting, etc.
      ● Also spent some time looking at Apache Solr
  • 17. Search Implementation Details
      ● Partitioning based on cities (each has a numeric id)
      ● Attributes vs. Keywords
      ● Persistent Connections
          – Custom client and server modifications
      ● Minimal stopword List
      ● Partition into 2 clusters (1 master, 4 slaves)
  • 18. Sphinx Incremental Indexing
      ● Re-index every N minutes
      ● Use main + delta strategy
          – Adopted as: index + today + delta
          – One set per city (~500 * 3)
      ● Slaves handle live queries, update via rsync
      ● Need lots of FDs
      ● Use all 4 cores to index
      ● Every night, perform “daily merge”
      ● Generate config files via Perl
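The slide mentions one set of three indexes per city and config files generated by script. The sketch below illustrates that idea in Python rather than Perl. The `city_<id>_<tier>` naming scheme, the `src_` source names, and the data directory are assumptions made for illustration; only the three tier names come from the slide.

```python
PARTS = ("index", "today", "delta")  # the three index tiers named on the slide

def index_names(city_id):
    """Names for one city's set of indexes (naming scheme is hypothetical)."""
    return [f"city_{city_id}_{part}" for part in PARTS]

def config_stanza(name, data_dir="/var/sphinx/data"):
    """Render one Sphinx-style index stanza; source name and path are assumed."""
    return (
        f"index {name}\n"
        "{\n"
        f"    source = src_{name}\n"
        f"    path   = {data_dir}/{name}\n"
        "}\n"
    )

def generate_config(city_ids):
    """Concatenate stanzas for every city: three indexes per city id."""
    return "\n".join(config_stanza(n) for c in city_ids for n in index_names(c))

if __name__ == "__main__":
    print(generate_config([1, 2]))
```

With roughly 500 cities this yields about 1,500 index stanzas, which matches the "~500 * 3" on the slide and suggests why hand-maintained config files would not be practical.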
  • 20. Sphinx Issues
      ● Merge bugs [fixed]
      ● File descriptor corruption [fixed]
      ● Persistent connections [fixed]
          – Overhead of fork() was substantial in our testing
          – 200 queries/sec vs. 1,000 queries/sec per box
      ● Missing attribute updates [unreported]
      ● Bogus docids in responses
      ● We need to upgrade to latest Sphinx soon
      ● Andrew and team have been excellent!
  • 21. Search Project Results
      ● From 25 MySQL Boxes to 10 Sphinx
      ● Lots more headroom!
      ● New Features
          – Nearby Search
      ● No seizing or locking issues
      ● 1,000+ qps during peak w/room to grow
      ● 50M queries per day w/steady growth
      ● Cluster partitioning built but not needed (yet?)
      ● Better separation of code
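To put the daily and peak figures on the same scale, the short calculation below converts queries per day to an average rate. It assumes load spread evenly over 24 hours, which real traffic is not, so treat the ratio as a rough indication only. The 30M/day figure is the one quoted for the earlier MySQL full-text setup.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def average_qps(queries_per_day):
    """Average queries per second if load were spread evenly over a day."""
    return queries_per_day / SECONDS_PER_DAY

sphinx_avg = average_qps(50_000_000)  # "50M queries per day"
mysql_avg = average_qps(30_000_000)   # "~30M queries/day" under MySQL full-text
peak_to_avg = 1_000 / sphinx_avg      # "1,000+ qps during peak"

print(f"Sphinx average:   {sphinx_avg:.0f} qps")   # 579
print(f"MySQL FT average: {mysql_avg:.0f} qps")    # 347
print(f"Peak / average:   {peak_to_avg:.2f}")      # 1.73
```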
  • 22. Sphinx Wishlist
      ● Efficient delete handling (kill lists)
      ● Non-fatal “missing” indexes
      ● Index dump tool
      ● Live document add/change/delete
      ● Built-in replication
      ● Stats and counters
      ● Text attributes
      ● Protocol checksum
  • 23. Data Archiving, Replication, Indexes
      ● Problem: We want to keep everything.
      ● Solution: Archive to an archive cluster.
      ● Problem: Archiving is too painful. Index updates are expensive! Slaves affected.
      ● Solution: Archive with home-grown eventually consistent replication.
  • 24. Data Archiving: OOB Replication
      ● Eventual Consistency
      ● Master process
          – SET SQL_LOG_BIN=0
          – Select expired IDs
          – Export records from live master
          – Import records into archive master
          – Delete expired from live master
          – Add IDs to list
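The master-process steps can be sketched as follows. This is not the Craigslist implementation: two in-memory SQLite databases stand in for the live and archive MySQL masters, and the `postings` and `expired_ids` tables, their columns, and the function names are all hypothetical.

```python
import sqlite3

def make_db():
    """In-memory SQLite stand-in for a MySQL master (schema is hypothetical)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE postings "
               "(id INTEGER PRIMARY KEY, body TEXT, expires_at INTEGER)")
    db.execute("CREATE TABLE expired_ids "
               "(seq INTEGER PRIMARY KEY AUTOINCREMENT, posting_id INTEGER)")
    return db

def archive_expired(live, archive, now):
    """One pass of the master process, following the slide's step order."""
    # On MySQL the session would first run SET SQL_LOG_BIN=0 so that the
    # delete below is not written to the binary log; SQLite has no equivalent.
    rows = live.execute(
        "SELECT id, body, expires_at FROM postings "
        "WHERE expires_at <= ? ORDER BY id", (now,)
    ).fetchall()                                   # select + export
    archive.executemany("INSERT INTO postings VALUES (?, ?, ?)", rows)  # import
    ids = [(row[0],) for row in rows]
    live.executemany("DELETE FROM postings WHERE id = ?", ids)          # delete
    live.executemany("INSERT INTO expired_ids (posting_id) VALUES (?)", ids)
    live.commit()
    archive.commit()
    return [i[0] for i in ids]

live, archive = make_db(), make_db()
live.executemany("INSERT INTO postings VALUES (?, ?, ?)",
                 [(1, "old couch", 100), (2, "bike", 500), (3, "desk", 150)])
moved = archive_expired(live, archive, now=200)
print(moved)
```

The `expired_ids` list with its increasing `seq` column is the piece a slave-side process would read; how that list reaches the slaves is not stated on the slide.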
  • 25. Data Archiving: OOB Replication
      ● Slave process
          – One per MySQL slave
          – Throttled to minimize impact
          – State kept on slave
              ● Clone friendly
          – Simple logic
              ● Select expired IDs added since my sequence number
              ● Delete expired records
              ● Update local “last seen” sequence number
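The slave-side logic amounts to a small loop over a sequence-numbered list. The sketch below uses an in-memory SQLite database as a stand-in for one MySQL slave. The table names (`postings`, `expired_ids`, `archive_state`) and the batch-size approach to throttling are assumptions, not details from the talk.

```python
import sqlite3

def make_slave(postings, expired):
    """In-memory SQLite stand-in for one MySQL slave (schema is hypothetical)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE postings (id INTEGER PRIMARY KEY, body TEXT)")
    db.execute("CREATE TABLE expired_ids "
               "(seq INTEGER PRIMARY KEY, posting_id INTEGER)")
    # State lives on the slave itself, so a cloned slave resumes where the
    # original left off.
    db.execute("CREATE TABLE archive_state (last_seen INTEGER NOT NULL)")
    db.execute("INSERT INTO archive_state VALUES (0)")
    db.executemany("INSERT INTO postings VALUES (?, ?)", postings)
    db.executemany("INSERT INTO expired_ids VALUES (?, ?)", expired)
    db.commit()
    return db

def purge_batch(db, batch_size=2):
    """Delete up to batch_size expired records and advance the sequence number."""
    (last_seen,) = db.execute("SELECT last_seen FROM archive_state").fetchone()
    rows = db.execute(
        "SELECT seq, posting_id FROM expired_ids "
        "WHERE seq > ? ORDER BY seq LIMIT ?", (last_seen, batch_size)
    ).fetchall()
    if not rows:
        return 0
    db.executemany("DELETE FROM postings WHERE id = ?",
                   [(posting_id,) for _, posting_id in rows])
    db.execute("UPDATE archive_state SET last_seen = ?", (rows[-1][0],))
    db.commit()
    return len(rows)

slave = make_slave(
    postings=[(1, "old couch"), (2, "bike"), (3, "desk"), (4, "lamp")],
    expired=[(1, 1), (2, 3), (3, 4)],
)
first = purge_batch(slave)   # small batches stand in for throttling
second = purge_batch(slave)
third = purge_batch(slave)   # nothing new: returns 0
```

A real throttle would presumably also pause between batches. Because each slave tracks its own "last seen" number, slaves can fall behind and catch up independently, which is where the eventual consistency comes from.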
  • 26. Long Term Data Archiving
      ● Schema coupling is bad
          – ALTER TABLE takes forever
          – Lots of NULLs flying around
      ● CouchDB or similar long-term?
          – Schema-free feels like a good fit
      ● Tested some home grown solutions already
      ● Separate storage and indexing?
          – Indexing with Sphinx?
  • 27. Drizzle, XtraDB, Future Stuff
      ● CouchDB looks very interesting. Maybe for archive?
      ● XtraDB / InnoDB plugin
          – Better concurrency
          – Better tuning of InnoDB internals
      ● libdrizzle + Perl
          – DBI/DBD may not fit an async model well
          – Can talk to both MySQL and Drizzle!
      ● Oracle buying Sun?!?!
  • 28. We're Hiring!
      ● Work in San Francisco
      ● Flexible, Small Company
      ● Excellent Benefits
      ● Help Millions of People Every Week
      ● We Need Perl/MySQL Hackers
      ● Come Help us Scale and Grow