Sunday, August 29, 2010
Ted Han




!




If you would like a copy of
                        these slides, they are here:
                   http://cl.ly/6233b0f56bb686e57b74

            (or at http://twitter.com/knowtheory)




Labor Rights

          •Eight Hours for Work
          •Eight Hours for Rest
          •Eight Hours for What We Will!

          [Pie chart: Work 8, Rest 8, What We Will 8]




                              This may not be a
                              pattern that hackers are
                              all that familiar with.




We trade our time and
                          expertise for money
                           for 8+ hours a day at work




But now the 8 hours of our
                            free time are just as valuable
                          to companies as our work time.




Who collects your data?
             Do you know what data they collect?
                 What do you get in return?




What do you get for your Data?

         • Google:      Gmail, Search
         • Apple:       iTunes Genius
         • Amazon: Recommendation
         • Last.fm:    Rec’s & Neighbors
         • Facebook: ??? (Your friends’
           families’ crazy rants)

Companies benefit from our
                           data and can ask and answer
                          questions about our behavior.




We benefit indirectly,
                          but why can’t we benefit
                              directly as well?




We can, if we know
                          where and how to look.




Ruby can help!




Basic Data Mining
         • Data Collection
         • Data Querying & Manipulation
         • Data Analysis




DataMapper will help
                           with these things!



It would be nice to analyze
                             our search histories, but...

                          Google doesn’t provide an API.




But, we can search our
                            Google Chrome histories!

           ~/Library/Application Support/Google/
                  Chrome/Default/History

                           (make a copy of your History.
                          sqlite3 dbs are easy to corrupt)
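
A minimal sketch of the "make a copy" step, using only Ruby's standard library. The helper names `default_history_path` and `snapshot_history` are made up for illustration, and the path is the macOS one from the slide; other platforms keep the file elsewhere.

```ruby
require "fileutils"
require "tmpdir"

# Location shown on the slide (macOS). Hypothetical helper, not a library call.
def default_history_path
  File.join(Dir.home, "Library", "Application Support",
            "Google", "Chrome", "Default", "History")
end

# Copies the History file into dest_dir and returns the path of the copy.
# The original file is only read, never opened for writing.
def snapshot_history(source = default_history_path, dest_dir = Dir.tmpdir)
  stamp = Time.now.strftime("%Y%m%d%H%M%S")
  copy  = File.join(dest_dir, "History-#{stamp}.sqlite3")
  FileUtils.cp(source, copy)
  copy
end

# Usage (run it while Chrome is closed, then point your tools at the copy):
#   path = snapshot_history
#   puts "Working copy at #{path}"
```

Everything after this point can be run against the copy, so a bad query or a crash never touches the live browser profile.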


Once we have a datasource
                              we need to answer yes to
                           at least one of three questions
                          about the format of our source.




• Does a DataMapper
                      Adapter already exist?
                    • Can you write an adapter?
                    • Can you write a scraper to
                      import your data?



Does a DataMapper
                           Adapter already exist?

                          Yep! Google Chrome’s History is an
                                  sqlite3 database!




Urls Table

    CREATE TABLE urls(
      id                INTEGER PRIMARY KEY,
      url               LONGVARCHAR,
      title             LONGVARCHAR,
      visit_count       INTEGER DEFAULT 0 NOT NULL,
      typed_count       INTEGER DEFAULT 0 NOT NULL,
      last_visit_time   INTEGER NOT NULL,
      hidden            INTEGER DEFAULT 0 NOT NULL,
      favicon_id        INTEGER DEFAULT 0 NOT NULL
    );

    Querying requires us to map data out of our source.
    To do this we have to tell DataMapper what the
    source schema is.




Url model (naive)

    class Url
      include DataMapper::Resource

      property :id,              Serial   # Integer, :key => true
      property :url,             String
      property :title,           String
      property :visit_count,     Integer, :default  => 0
      property :typed_count,     Integer, :default  => 0
      property :last_visit_time, Integer, :required => true
      property :hidden,          Integer, :default  => 0
      property :favicon_id,      Integer, :default  => 0

      has n, :segments
      has n, :visits, :through => :segments
    end




Url model (naive)

    class Url
      include DataMapper::Resource

      property :id,              Serial
      property :url,             String
      property :title,           String
      property :visit_count,     Integer, :default  => 0
      property :typed_count,     Integer, :default  => 0
      property :last_visit_time, Integer, :required => true
      property :hidden,          Integer, :default  => 0
      property :favicon_id,      Integer, :default  => 0

      has n, :segments
      has n, :visits, :through => :segments
    end

    Inline Validations




Urls Table

    CREATE TABLE urls(
      id                INTEGER PRIMARY KEY,
      url               LONGVARCHAR,
      title             LONGVARCHAR,
      visit_count       INTEGER DEFAULT 0 NOT NULL,
      typed_count       INTEGER DEFAULT 0 NOT NULL,
      last_visit_time   INTEGER NOT NULL,
      hidden            INTEGER DEFAULT 0 NOT NULL,
      favicon_id        INTEGER DEFAULT 0 NOT NULL
    );

    Database Constraints



Sanity Check

    The schemata match! Now let's test.

    >> Url.first(:url => "http://rubykaigi.org/")
    => #<Url @id=1294 @url="http://rubykaigi.org/"
    @title="RubyKaigi 2010, August 27-29"
    @visit_count=8 ... >
    >> Url.count
    => 47007
    >> Url.count("visit_count.lt" => 1)
    => 20
    >> # wat.




Url model (w/ Sanity)

    Let's add some business rule validations.

    class Url
      include DataMapper::Resource

      property :id,              Serial
      property :url,             String,  :format   => :url
      property :title,           String
      property :visit_count,     Integer, :min      => 1
      property :typed_count,     Integer, :default  => 0
      property :last_visit_time, Integer, :required => true
      property :hidden,          Integer, :default  => 0
      property :favicon_id,      Integer, :default  => 0

      has n, :segments
      has n, :visits, :through => :segments
    end




Data Manipulation

    require 'dm-types'

    class Url
      include DataMapper::Resource

      property :id,              Serial
      property :url,             URI,     :format   => :url
      property :title,           String
      property :visit_count,     Integer, :min      => 1
      property :typed_count,     Integer, :default  => 0
      property :last_visit_time, Integer, :required => true
      property :hidden,          Integer, :default  => 0
      property :favicon_id,      Integer, :default  => 0

      has n, :segments
      has n, :visits, :through => :segments
    end




Data Manipulation

    >> u = Url.first("url.like" => "%rubykaigi%")
    => #<Url @id=1294 @url=#<Addressable::URI:
    0x81c7a1b0 URI:http://rubykaigi.com/
    @title="RubyKaigi 2010, August 27-29"
    @last_visit_time=12927095498867853 ...>
    >> u.url
    => #<Addressable::URI:0x81c7a1b0 URI:http://
    rubykaigi.com/>
    >> u.url.host
    => "rubykaigi.com" # oops, .org is canonical
    >> u.url.host = "rubykaigi.org"; u.url
    => #<Addressable::URI:0x81ccfdf4 URI:http://
    rubykaigi.org/>




Data Manipulation

    >> u = Url.first("url.like" => "%rubykaigi%")
    => #<Url @id=1294 @url=#<Addressable::URI:
    0x81c7a1b0 URI:http://rubykaigi.com/
    @title="RubyKaigi 2010, August 27-29"
    @last_visit_time=12927095498867853 ...>
    >> u.last_visit_time
    => 12927095498867853 # wtf is this?




Urls Table

    CREATE TABLE urls(
      id                INTEGER PRIMARY KEY,
      url               LONGVARCHAR,
      title             LONGVARCHAR,
      visit_count       INTEGER DEFAULT 0 NOT NULL,
      typed_count       INTEGER DEFAULT 0 NOT NULL,
      last_visit_time   INTEGER NOT NULL,
      hidden            INTEGER DEFAULT 0 NOT NULL,
      favicon_id        INTEGER DEFAULT 0 NOT NULL
    );

      Not a lot of clues here...
      Okay, it’s an integer time, but it’s also freaking huge:
          12927095498867853?




chromium/src/base/time.h

    // Time represents an absolute point
    // in time, internally represented as
    // microseconds (s/1,000,000) since
    // a platform-dependent epoch. Each
    // platform's epoch, along with other
    // system-dependent clock interface
    // routines, is defined in time_PLATFORM.cc.




chromium/src/base/time_mac.cc

    // Core Foundation uses a double second
    // count since 2001-01-01 00:00:00 UTC.
    // The UNIX epoch is 1970-01-01 00:00:00 UTC.
    // Windows uses a Gregorian epoch of 1601.
    // We need to match this internally
    // so that our time representations match across
    // all platforms. See bug 14734.
    //   irb(main):010:0> Time.at(0).getutc()
    //   => Thu Jan 01 00:00:00 UTC 1970
    //   irb(main):011:0> Time.at(-11644473600).getutc()
    //   => Mon Jan 01 00:00:00 UTC 1601

    Examples already in Ruby? Nice.



Url model v2 (lib types)

    Write ChromeEpochTime.

    class Url
      include DataMapper::Resource

      property :id,              Serial
      property :url,             URI,             :format   => :url
      property :title,           String
      property :visit_count,     Integer,         :min      => 1
      property :typed_count,     Integer,         :default  => 0
      property :last_visit_time, ChromeEpochTime, :required => true
      property :hidden,          Integer,         :default  => 0
      property :favicon_id,      Integer,         :default  => 0

      has n, :segments
      has n, :visits, :through => :segments
    end




chrome_epoch_time.rb

    module DataMapper
      class Property
        # Integer here resolves to DataMapper::Property::Integer.
        class ChromeEpochTime < Integer
          # Stored value: microseconds since 1601-01-01 UTC.
          # 11644473600 is the number of seconds from 1601 to the UNIX epoch.
          def load(value)
            return value unless value.respond_to?(:to_i)
            ::Time.at((value/10**6)-11644473600)
          end

          # Inverse of load: Ruby time back to Chrome's integer format.
          def dump(value)
            case value
              when ::Integer, ::Time then (value.to_i + 11644473600) * 10**6
              when ::DateTime then (value.to_time.to_i + 11644473600) * 10**6
            end
          end
        end # class ChromeEpochTime
      end # class Property
    end # module DataMapper
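
The same arithmetic can be tried without DataMapper. This is a standalone sketch: `chrome_to_time` and `time_to_chrome` are illustrative names, not part of any library, and sub-second precision is dropped just as in the property class above.

```ruby
# Seconds between 1601-01-01 and 1970-01-01 (UTC), the offset used on the slides.
EPOCH_OFFSET_SECONDS    = 11_644_473_600
MICROSECONDS_PER_SECOND = 1_000_000

# Chrome timestamp (microseconds since 1601) -> Ruby Time.
def chrome_to_time(microseconds)
  Time.at(microseconds / MICROSECONDS_PER_SECOND - EPOCH_OFFSET_SECONDS)
end

# Ruby Time -> Chrome timestamp, rounded down to whole seconds.
def time_to_chrome(time)
  (time.to_i + EPOCH_OFFSET_SECONDS) * MICROSECONDS_PER_SECOND
end

stamp = 12_927_095_498_867_853   # the "freaking huge" value from the slide
time  = chrome_to_time(stamp)

puts time.getutc                 # 2010-08-24 03:51:38 UTC
puts time_to_chrome(time)        # 12927095498000000
```

03:51:38 UTC is 12:51:38 at +0900, which matches what the mapped property prints in the next session.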




Data Manipulation

    >> u = Url.first("url.like" => "%rubykaigi.com%")
    => #<Url @id=42846 @url=#<Addressable::URI:
    0x81e232f0 URI:http://rubykaigi.com/
    @title="RubyKaigi 2010, August 27-29"
    @last_visit_time=Tue Aug 24 12:51:38 +0900
    2010 ...>
    >> u.last_visit_time
    => Tue Aug 24 12:51:38 +0900 2010




Histograms, yay! (Analysis)

    # Count visits per hour of the day (0..23).
    hour_histogram = Hash.new(0)
    Visit.all.each do |v|
      hour_histogram[v.visit_time.hour] += 1
    end
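
Here is the same counting trick on plain `Time` objects, as a sketch you can run without a database. The timestamps are made up and stand in for `Visit#visit_time`; `hour_histogram` is written as a method here purely for illustration.

```ruby
# Made-up timestamps standing in for Visit#visit_time values.
visit_times = [
  Time.utc(2010, 8, 24, 3, 51, 38),
  Time.utc(2010, 8, 24, 3, 59, 2),
  Time.utc(2010, 8, 25, 14, 5, 0)
]

# Hash.new(0) returns 0 for hours that were never seen,
# so += 1 works without initialising each bucket first.
def hour_histogram(times)
  counts = Hash.new(0)
  times.each { |t| counts[t.hour] += 1 }
  counts
end

histogram = hour_histogram(visit_times)

# One row per hour, so quiet hours show up as empty bars.
(0..23).each do |hour|
  puts format("%02d:00 %s", hour, "#" * histogram[hour])
end
```

Reading a missing hour from the hash returns 0 without storing a key, so the histogram only holds hours that actually occurred.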




Over what span of time?

    >> Visit.first.visit_time
    => Fri May 28 17:04:39 +0900 2010
    >> Visit.last.visit_time
    => Thu Aug 26 01:51:32 +0900 2010




Aggregate Browsing by Hour

    [Chart: visits per hour of day; x-axis Midnight through 9pm, y-axis 0 to 8000]




More Histograms, yay!

    # Same histogram, restricted to visits to ruby-doc URLs.
    ruby_doc = Url.all("url.like" => "%ruby-doc%")
    hour_histogram = Hash.new(0)
    ruby_doc.visits.each do |v|
      hour_histogram[v.visit_time.hour] += 1
    end




Aggregate Browsing for ruby-doc.org by Hour

    [Chart: visits per hour of day; x-axis Midnight through 9pm, y-axis 0 to 50]




But what happens when
                           we have a data source
                          which isn’t well behaved?




"Does Edge have an anti-PS3 bias?"
       http://arstechnica.com/civis/viewtopic.php?f=22&t=62024




                                 Last year a thread on Ars Technica titled
                                 "Does Edge have an anti-PS3 bias?"
                                 turned into a flame war between
                                 PS3 fans and Xbox360 fans over whether
                                 or not the PS3 was receiving unfair
                                 treatment, particularly when held up against a
                                 game's score on metacritic.com.




Helpfully, the thread title
                           is a testable hypothesis




Are a review outlet’s aggregate
       game scores (dis)similar to the aggregate
          Metascore for those same games?




Unfortunately,
                          Metascore also has no API.




Time for the Poor Man’s API:
                               HTML scraping :(




Save me Nokogiri!




Yeah, that’s not pretty.

    require 'nokogiri'
    require 'open-uri'

    def scores_for(game)
      # Accept either a URL string or an already-parsed Nokogiri document.
      game_page = case
        when (game.is_a? String)
          begin
            Nokogiri::HTML(URI.open(game)) # plain open(game) on 2010-era Ruby
          rescue
            puts "[FAIL] Failed to open #{game}"
            return # give up on this game; `break` is not valid outside a loop
          end
        when (game.is_a? Nokogiri::HTML::Document)
          game
        else
          raise StandardError, "you need to provide either a url, or a nokogiri document"
      end

      # The <title> is expected to end in "(<platform>: <year>): Reviews".
      page_title = game_page.css('title').text
      junk, title, platform, year = page_title.match(/^(.+)\s*\((#{PLATFORMS.join("|")}): (\d+)\): Reviews$/).to_a
      title.strip!
      metascore = game_page.css('table#scoretable img').select{ |i| /Metascore:/ =~ i.attributes['alt'] }.first.attributes['alt'].to_s.split.last
      puts "[WIN] #{title} on the #{platform} (#{year}) has a score of #{metascore}"
      #review_count = game_page.to_s.match(/based on <b>(\d+) reviews/).to_a.last
      reviews = game_page.css('div.scoreandreview')

      # Checksum: the number of scraped reviews must equal the number the page claims.
      review_count = reviews.size
      checksum = game_page.to_s.match(/based on <b>(\d+) reviews/).to_a.last.to_i
      checksum_message = "Number of Reviews on the page not equal to the claimed number of reviews"
      raise StandardError, checksum_message unless review_count == checksum
      scores = reviews.map do |review|
        score = review.css('div.criticscore').text
        pub = review.css('span.publication').text
        [score,pub]
      end
      return { :title =>title.strip, :metascore => metascore, :platform => platform, :publish_year => year, :reviews => scores }
    end
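
The title regex is the fiddly part, so here it is on its own as a sketch. The `PLATFORMS` values and the sample title are assumptions (the slide uses `PLATFORMS` without showing its contents), and `parse_page_title` is a made-up helper name.

```ruby
# Hypothetical platform list; the slide references PLATFORMS without defining it.
PLATFORMS = %w[ps3 xbox360 wii].freeze

# Same shape as the regex in scores_for: "<title> (<platform>: <year>): Reviews".
TITLE_PATTERN = /^(.+)\s*\((#{PLATFORMS.join("|")}): (\d+)\): Reviews$/

# Returns a hash of the captured parts, or nil when the title does not match.
def parse_page_title(page_title)
  match = TITLE_PATTERN.match(page_title)
  return nil unless match

  # (.+) is greedy and keeps the space before "(", hence the strip.
  { :title => match[1].strip, :platform => match[2], :publish_year => match[3] }
end

p parse_page_title("Example Game (ps3: 2009): Reviews")
p parse_page_title("Not a review page")
```

Returning `nil` on a non-match lets the caller skip odd pages instead of crashing on `nil.strip!`, which is what the inline version would do.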




But it works!
                          <3 Nokogiri




Models

    class Game
      include DataMapper::Resource

      property :id,           Serial
      property :title,        String, :length => 255
      property :platform,     String
      property :release_date, DateTime
      property :esrb_rating,  String
      property :metascore,    Float
      property :review_count, Integer
      property :created_at,   DateTime
      property :updated_at,   DateTime

      class Review
        include DataMapper::Resource

        property :game_id,             Integer, :key => true
        property :review_publisher_id, Integer, :key => true
        property :score,               Integer

        belongs_to :review_publisher
        belongs_to :game
      end

      class Developer
        include DataMapper::Resource

        property :id,   Serial
        property :name, String, :length => 255

        has n, :games
      end
    end

    class ReviewPublisher
      include DataMapper::Resource

      property :id,   Serial
      property :name, String, :length => 255

      has n, :reviews, :model => "Game::Review"
      has n, :games, :through => :reviews
    end




Student’s T-Test (Analysis!)

    def t_value(prop1, collection1, prop2, collection2)
      c1_std   = collection1.std(prop1)
      c1_avg   = collection1.avg(prop1)
      c1_count = collection1.count

      c2_std   = collection2.std(prop2)
      c2_avg   = collection2.avg(prop2)
      c2_count = collection2.count

      (c1_avg - c2_avg) /
        Math.sqrt(
          (c1_std**2 / c1_count)+(c2_std**2 / c2_count)
        )
    end
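
To see the formula on its own, here is a sketch over plain arrays instead of DataMapper collections. It is the unequal-variance form of the statistic (often called Welch's t). Using the sample standard deviation (n - 1 in the denominator) is an assumption; the slide does not say which one `std` returns. The scores are made up.

```ruby
def mean(values)
  values.sum.to_f / values.size
end

# Sample standard deviation (n - 1 denominator). This choice is an assumption.
def sample_std(values)
  m = mean(values)
  Math.sqrt(values.sum { |v| (v - m)**2 } / (values.size - 1))
end

# t = (mean_a - mean_b) / sqrt(s_a^2 / n_a + s_b^2 / n_b)
def t_value(a, b)
  (mean(a) - mean(b)) /
    Math.sqrt(sample_std(a)**2 / a.size + sample_std(b)**2 / b.size)
end

# Made-up scores, purely for illustration.
outlet_scores = [8, 9, 7, 8]
meta_scores   = [6, 5, 7, 6]

puts t_value(outlet_scores, meta_scores)
```

With these numbers both groups have variance 2/3, so t comes out at 2 / sqrt(1/3), roughly 3.46. Two identical groups with zero variance give 0/0, which is why the later slides filter out non-finite values.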




PS3 Reviewers vs Metascore

    outlets = ReviewPublisher.all("games.platform"=>"ps3")
    t_scores = outlets.map do |outlet|
      t_value(:metascore, outlet.games(:platform=>"ps3"),
              :score, outlet.reviews("game.platform"=>"ps3"))
    end # .size => 140

    significant = t_scores.select do |t|
      (t > 1.96 or t < -1.96) and not t.infinite?
    end

    low = significant.select{ |s| s < -1.96} # .size => 20
    high = significant.select{ |s| s > 1.96} # .size => 10
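
Why 1.96? It is the two-tailed 5% critical value of the standard normal distribution, which is a reasonable stand-in for t when samples are large. A small sketch of the filtering step on a made-up list of t scores; `split_by_significance` is an illustrative name, not part of the talk's code.

```ruby
# Two-tailed 5% critical value of the standard normal distribution.
CRITICAL_T = 1.96

# Drops non-finite scores (infinite or NaN), then buckets the rest.
def split_by_significance(t_scores)
  finite = t_scores.select { |t| t.finite? }
  low    = finite.select { |t| t < -CRITICAL_T }
  high   = finite.select { |t| t >  CRITICAL_T }
  { :low => low, :high => high, :neutral => finite - low - high }
end

# Made-up t scores, purely for illustration.
result = split_by_significance([5.1, -2.3, 0.4, Float::INFINITY, 1.96])
p result[:low]
p result[:high]
p result[:neutral]
```

A score exactly at 1.96 lands in the neutral bucket because the comparison is strict, matching the `>` and `<` used on the slide.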




Xbox360 Reviewers vs Metascore

    outlets = ReviewPublisher.all("games.platform"=>"xbox360")
    t_scores = outlets.map do |outlet|
      t_value(:metascore, outlet.games(:platform=>"xbox360"),
              :score, outlet.reviews("game.platform"=>"xbox360"))
    end # .size => 169

    significant = t_scores.select do |t|
      (t > 1.96 or t < -1.96) and not t.infinite?
    end

    low = significant.select{ |s| s < -1.96} # .size => 37
    high = significant.select{ |s| s > 1.96} # .size => 29




What about Edge Magazine?

    >> outlet = ReviewPublisher.first("name.like"=>"%Edge%")
    => #<ReviewPublisher @id=36 @name="Edge Magazine">
    >> t = t_value(:metascore, outlet.games(:platform=>"ps3"), :score, outlet.reviews("game.platform"=>"ps3"))
    => 5.10786212293491
    >> t > 1.96
    => true # Edge has a PRO PS3 bias, not Anti!




There are lots of other possibilities!
                  What would you like to learn?




Learn about DataMapper perhaps?
                       http://www.datamapper.org
                   irc://irc.freenode.net#datamapper




Thanks!
                             @knowtheory
                          ted@knowtheory.net





 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 

Mapping the world with DataMapper

  • 4. If you would like a copy of these slides, they are here: http://cl.ly/6233b0f56bb686e57b74 (or at http://twitter.com/knowtheory) Sunday, August 29, 2010
  • 5. Labor Rights • Eight Hours for Work • Eight Hours for Rest • Eight Hours for What We Will! (Pie chart: Work 8, Rest 8, What We Will 8.) This may not be a pattern that hackers are all that familiar with.
  • 6. We trade our time and expertise for money at work for 8+ hours a day.
  • 7. But now the 8 hours of our free time are just as valuable to companies as our work time.
  • 8. Who collects your data? Do you know what data they collect? What do you get in return?
  • 9. What do you get for your Data? • Google: Gmail, Search • Apple: iTunes Genius • Amazon: Recommendation • Last.fm: Rec’s & Neighbors • Facebook: ??? (Your friends’ families’ crazy rants)
  • 10. Companies benefit from our data and can ask and answer questions about our behavior.
  • 11. We benefit indirectly, but why can’t we benefit directly as well?
  • 12. We can, if we know where and how to look.
  • 13. Ruby can help!
  • 14. Basic Data Mining • Data Collection • Data Querying & Manipulation • Data Analysis
  • 15. DataMapper will help with these things!
  • 16. It would be nice to analyze our search histories, but... Google doesn’t provide an API.
  • 17. But, we can search our Google Chrome histories! ~/Library/Application Support/Google/Chrome/Default/History (Make a copy of your History; sqlite3 dbs are easy to corrupt.)
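The "make a copy" advice can be sketched with Ruby's standard library alone. The helper name `snapshot_history` and the name of the copied file are hypothetical; the source path is the macOS location quoted on the slide, and other platforms keep the file elsewhere.

```ruby
require 'fileutils'
require 'tmpdir'

# Path quoted on the slide (macOS). Expanded lazily inside the helper.
CHROME_HISTORY = '~/Library/Application Support/Google/Chrome/Default/History'

# Hypothetical helper: copy the History file into dest_dir and return the
# path of the copy, so that queries never touch the live database.
def snapshot_history(source = File.expand_path(CHROME_HISTORY), dest_dir = Dir.tmpdir)
  raise ArgumentError, "no such file: #{source}" unless File.file?(source)
  dest = File.join(dest_dir, 'History.copy.sqlite3')
  FileUtils.cp(source, dest)
  dest
end
```

Pointing the sqlite3 adapter at the returned path, rather than at the original, keeps the browser's own database out of harm's way.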
  • 18. Once we have a datasource we need to answer yes to at least one of three questions about the format of our source.
  • 19. • Does a DataMapper Adapter already exist? • Can you write an adapter? • Can you write a scraper to import your data?
  • 20. Does a DataMapper Adapter already exist? Yep! Google Chrome’s History is an sqlite3 database!
  • 21. Urls Table
      CREATE TABLE urls(
        id INTEGER PRIMARY KEY,
        url LONGVARCHAR,
        title LONGVARCHAR,
        visit_count INTEGER DEFAULT 0 NOT NULL,
        typed_count INTEGER DEFAULT 0 NOT NULL,
        last_visit_time INTEGER NOT NULL,
        hidden INTEGER DEFAULT 0 NOT NULL,
        favicon_id INTEGER DEFAULT 0 NOT NULL
      );
      Querying requires us to map data out of our source. To do this we have to tell DataMapper what the source schema is.
  • 22. Url model (naive)
      class Url
        include DataMapper::Resource
        property :id, Serial # Integer, :key=>true
        property :url, String
        property :title, String
        property :visit_count, Integer, :default => 0
        property :typed_count, Integer, :default => 0
        property :last_visit_time, Integer, :required => true
        property :hidden, Integer, :default => 0
        property :favicon_id, Integer, :default => 0
        has n, :segments
        has n, :visits, :through => :segments
      end
  • 23. Url model (naive): the same class as on slide 22, annotated "Inline Validations".
  • 24. Urls Table: the same CREATE TABLE statement as on slide 21, annotated "Database Constraints".
  • 25. Sanity Check: The Schemata Match! Now let's test.
      >> Url.first(:url => "http://rubykaigi.org/")
      => #<Url @id=1294 @url="http://rubykaigi.org/" @title="RubyKaigi 2010, August 27-29" @visit_count=8 ... >
      >> Url.count
      => 47007
      >> Url.count("visit_count.lt" => 1)
      => 20
      >> # wat.
  • 26. Url model (w/ Sanity): let's add some business rule validations.
      class Url
        include DataMapper::Resource
        property :id, Serial
        property :url, String, :format => :url
        property :title, String
        property :visit_count, Integer, :min => 1
        property :typed_count, Integer, :default => 0
        property :last_visit_time, Integer, :required => true
        property :hidden, Integer, :default => 0
        property :favicon_id, Integer, :default => 0
        has n, :segments
        has n, :visits, :through => :segments
      end
  • 27. Data Manipulation
      require 'dm-types'
      class Url
        include DataMapper::Resource
        property :id, Serial
        property :url, URI, :format => :url
        property :title, String
        property :visit_count, Integer, :min => 1
        property :typed_count, Integer, :default => 0
        property :last_visit_time, Integer, :required => true
        property :hidden, Integer, :default => 0
        property :favicon_id, Integer, :default => 0
        has n, :segments
        has n, :visits, :through => :segments
      end
  • 28. Data Manipulation
      >> u = Url.first("url.like" => "%rubykaigi%")
      => #<Url @id=1294 @url=#<Addressable::URI:0x81c7a1b0 URI:http://rubykaigi.com/> @title="RubyKaigi 2010, August 27-29" @last_visit_time=12927095498867853 ...>
      >> u.url
      => #<Addressable::URI:0x81c7a1b0 URI:http://rubykaigi.com/>
      >> u.url.host
      => "rubykaigi.com" # oops, .org is canonical
      >> u.url.host = "rubykaigi.org"; u.url
      => #<Addressable::URI:0x81ccfdf4 URI:http://rubykaigi.org/>
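The host fix on this slide can be tried without DataMapper. The sketch below swaps in the standard library's `URI` for the `Addressable::URI` object shown on the slide, so it only illustrates the idea of editing a parsed URL instead of a string; `with_host` is a hypothetical helper name.

```ruby
require 'uri'

# Hypothetical helper: return url_string with its host replaced.
# Uses stdlib URI as a stand-in for Addressable::URI.
def with_host(url_string, new_host)
  uri = URI.parse(url_string)
  uri.host = new_host
  uri.to_s
end

puts with_host('http://rubykaigi.com/', 'rubykaigi.org')
```

Because the value is a parsed URL, only the host component changes; scheme and path are carried over untouched.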
  • 29. Data Manipulation
      >> u = Url.first("url.like" => "%rubykaigi%")
      => #<Url @id=1294 @url=#<Addressable::URI:0x81c7a1b0 URI:http://rubykaigi.com/> @title="RubyKaigi 2010, August 27-29" @last_visit_time=12927095498867853 ...>
      >> u.last_visit_time
      => 12927095498867853 # wtf is this?
  • 30. Urls Table: the same CREATE TABLE statement as on slide 21. Not a lot of clues here... Okay, it’s an integer time, but it’s also freaking huge: 12927095498867853?
  • 31. chromium/src/base/time.h
      // Time represents an absolute point
      // in time, internally represented as
      // microseconds (s/1,000,000) since
      // a platform-dependent epoch. Each
      // platform's epoch, along with other
      // system-dependent clock interface
      // routines, is defined in time_PLATFORM.cc.
  • 32. chromium/src/base/time_mac.cc
      // Core Foundation uses a double second
      // count since 2001-01-01 00:00:00 UTC.
      // The UNIX epoch is 1970-01-01 00:00:00 UTC.
      // Windows uses a Gregorian epoch of 1601.
      // We need to match this internally
      // so that our time representations match
      // across all platforms. See bug 14734.
      //   irb(main):010:0> Time.at(0).getutc()
      //   => Thu Jan 01 00:00:00 UTC 1970
      //   irb(main):011:0> Time.at(-11644473600).getutc()
      //   => Mon Jan 01 00:00:00 UTC 1601
      Examples already in Ruby? Nice.
  • 33. Url model v2 (lib types): write ChromeEpochTime
      class Url
        include DataMapper::Resource
        property :id, Serial
        property :url, URI, :format => :url
        property :title, String
        property :visit_count, Integer, :min => 1
        property :typed_count, Integer, :default => 0
        property :last_visit_time, ChromeEpochTime, :required => true
        property :hidden, Integer, :default => 0
        property :favicon_id, Integer, :default => 0
        has n, :segments
        has n, :visits, :through => :segments
      end
  • 34. chrome_epoch_time.rb
      module DataMapper
        class Property
          class ChromeEpochTime < Integer
            def load(value)
              return value unless value.respond_to?(:to_i)
              ::Time.at((value/10**6)-11644473600)
            end
            def dump(value)
              case value
              when ::Integer, ::Time then (value.to_i + 11644473600) * 10**6
              when ::DateTime then (value.to_time.to_i + 11644473600) * 10**6
              end
            end
          end # class ChromeEpochTime
        end # class Property
      end # module DataMapper
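The arithmetic inside `load` and `dump` can be checked on its own, without DataMapper. This is a minimal sketch in plain Ruby; the helper names `chrome_to_time` and `time_to_chrome` are hypothetical, while the offset and the microsecond unit come from the Chromium comments quoted on the earlier slides.

```ruby
# Seconds between 1601-01-01 00:00:00 UTC (the epoch named in the
# time_mac.cc comment) and 1970-01-01 00:00:00 UTC (the UNIX epoch).
EPOCH_OFFSET_SECONDS = 11_644_473_600
MICROSECONDS_PER_SECOND = 10**6

# Hypothetical helper: Chrome timestamp (microseconds) to a UTC Time.
def chrome_to_time(microseconds)
  Time.at(microseconds / MICROSECONDS_PER_SECOND - EPOCH_OFFSET_SECONDS).utc
end

# Hypothetical helper: Time back to a Chrome timestamp (whole seconds only).
def time_to_chrome(time)
  (time.to_i + EPOCH_OFFSET_SECONDS) * MICROSECONDS_PER_SECOND
end

t = chrome_to_time(12_927_095_498_867_853)
puts t                  # 2010-08-24 03:51:38 UTC, i.e. 12:51:38 +0900
puts time_to_chrome(t)  # 12927095498000000 (the sub-second part is dropped)
```

As in the slide's version, integer division discards the microsecond remainder, so a round trip is only exact to the second.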
  • 35. Data Manipulation
      >> u = Url.first("url.like" => "%rubykaigi.com%")
      => #<Url @id=42846 @url=#<Addressable::URI:0x81e232f0 URI:http://rubykaigi.com/> @title="RubyKaigi 2010, August 27-29" @last_visit_time=Tue Aug 24 12:51:38 +0900 2010 ...>
      >> u.last_visit_time
      => Tue Aug 24 12:51:38 +0900 2010
  • 36. Histograms, yay! (Analysis)
      hour_histogram = Hash.new(0)
      Visit.all.map do |v|
        hour_histogram[v.visit_time.hour] += 1
      end
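The same counting idea works on any list of `Time` values. Here is a small sketch in plain Ruby, with made-up timestamps standing in for the `visit_time` values of the `Visit` model; `hour_histogram` as a method is a hypothetical wrapper around the slide's loop.

```ruby
# Count how many timestamps fall into each hour of the day (0..23).
# Hash.new(0) makes every unseen hour start at zero.
def hour_histogram(times)
  histogram = Hash.new(0)
  times.each { |t| histogram[t.hour] += 1 }
  histogram
end

# Made-up visit times for illustration only.
visits = [
  Time.utc(2010, 8, 24, 3, 51, 38),
  Time.utc(2010, 8, 24, 3, 59, 2),
  Time.utc(2010, 8, 25, 17, 4, 39)
]

# Print a crude text bar chart, one row per hour that has visits.
hour_histogram(visits).sort.each do |hour, count|
  puts format('%02d:00 %s', hour, '#' * count)
end
```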
  • 37. Over what span of time?
      >> Visit.first.visit_time
      => Fri May 28 17:04:39 +0900 2010
      >> Visit.last.visit_time
      => Thu Aug 26 01:51:32 +0900 2010
  • 38. Aggregate Browsing by Hour (bar chart; y-axis: visits, 0 to 8000; x-axis: hour of day, Midnight to 9pm)
  • 39. More Histograms, yay!
      ruby_doc = Url.all("url.like" => "%ruby-doc%");
      hour_histogram = Hash.new(0)
      ruby_doc.visits.map do |v|
        hour_histogram[v.visit_time.hour] += 1
      end
  • 40. Aggregate Browsing for ruby-doc.org by Hour (bar chart; y-axis: visits, 0 to 50; x-axis: hour of day, Midnight to 9pm)
  • 41. But what happens when we have a data source which isn’t well behaved?
  • 42. "Does Edge have an anti-PS3 bias?" http://arstechnica.com/civis/viewtopic.php?f=22&t=62024 Last year a thread on Ars Technica titled "Does Edge have an anti-PS3 bias?" turned into a flame war between PS3 fans and Xbox360 fans over whether or not the PS3 was receiving unfair treatment, particularly when held up against a game's score on metacritic.com.
  • 43. Helpfully, the thread title is a testable hypothesis
  • 44. Are a review outlet’s aggregate game scores (dis)similar to the aggregate Metascore for those same games?
  • 45. Unfortunately, Metascore also has no API.
  • 46. Time for the Poor Man’s API: HTML scraping :(
  • 47. Save me Nokogiri!
  • 48. Yeah, that’s not pretty.
      def scores_for(game)
        game_page = case
        when (game.is_a? String)
          begin
            Nokogiri::HTML(open(game))
          rescue
            puts "[FAIL] Failed to open #{game}"
            return
          end
        when (game.is_a? Nokogiri::HTML::Document)
          game
        else
          raise StandardError, "you need to provide either a url, or a nokogiri document"
        end
        page_title = game_page.css('title').text
        junk, title, platform, year = page_title.match(/^(.+)\s*\((#{PLATFORMS.join("|")}): (\d+)\): Reviews$/).to_a
        title.strip!
        metascore = game_page.css('table#scoretable img').select{ |i| /Metascore:/ =~ i.attributes['alt'] }.first.attributes['alt'].to_s.split.last
        puts "[WIN] #{title} on the #{platform} (#{year}) has a score of #{metascore}"
        #review_count = game_page.to_s.match(/based on <b>(\d+) reviews/).to_a.last
        reviews = game_page.css('div.scoreandreview')
        review_count = reviews.size
        checksum = game_page.to_s.match(/based on <b>(\d+) reviews/).to_a.last.to_i
        checksum_message = "Number of Reviews on the page not equal to the claimed number of reviews"
        raise StandardError, checksum_message unless review_count == checksum
        scores = reviews.map do |review|
          score = review.css('div.criticscore').text
          pub = review.css('span.publication').text
          [score,pub]
        end
        return { :title =>title.strip, :metascore => metascore, :platform => platform, :publish_year => year, :reviews => scores }
      end
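The title-parsing step of the scraper can be isolated and tried without Nokogiri or a network connection. This is a sketch under assumptions: the contents of `PLATFORMS` are hypothetical (the slide uses the constant without showing it), the sample title is invented, and the page-title shape is inferred from the regular expression itself. `parse_title` is a hypothetical helper name.

```ruby
# Hypothetical platform list; the slide does not show the real one.
PLATFORMS = %w[ps3 xbox360 wii].freeze

# Expected shape, inferred from the pattern: "Title (platform: year): Reviews"
TITLE_PATTERN = /^(.+)\s*\((#{PLATFORMS.join('|')}): (\d+)\): Reviews$/

# Split a page title into its parts, or return nil if it does not match.
def parse_title(page_title)
  match = TITLE_PATTERN.match(page_title)
  return nil unless match
  { :title => match[1].strip, :platform => match[2], :publish_year => match[3] }
end

p parse_title('Example Game (ps3: 2009): Reviews')
p parse_title('Some unrelated page')
```

Returning nil on a failed match lets the caller skip pages that do not follow the expected layout instead of failing on a nil title.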
  • 49. But it works! <3 Nokogiri
  • 50. Models
      class Game
        include DataMapper::Resource
        property :id, Serial
        property :title, String, :length=>255
        property :platform, String
        property :release_date, DateTime
        property :esrb_rating, String
        property :metascore, Float
        property :review_count, Integer
        property :created_at, DateTime
        property :updated_at, DateTime
        class Review
          include DataMapper::Resource
          property :game_id, Integer, :key => true
          property :review_publisher_id, Integer, :key => true
          property :score, Integer
          belongs_to :review_publisher
          belongs_to :game
        end
        class Developer
          include DataMapper::Resource
          property :id, Serial
          property :name, String, :length => 255
          has n, :games
        end
      end
      class ReviewPublisher
        include DataMapper::Resource
        property :id, Serial
        property :name, String, :length => 255
        has n, :reviews, :model => "Game::Review"
        has n, :games, :through => :reviews
      end
  • 51. Student’s T-Test (Analysis!)
      def t_value(prop1, collection1, prop2, collection2)
        c1_std = collection1.std(prop1)
        c1_avg = collection1.avg(prop1)
        c1_count = collection1.count
        c2_std = collection2.std(prop2)
        c2_avg = collection2.avg(prop2)
        c2_count = collection2.count
        (c1_avg - c2_avg) / Math.sqrt( (c1_std**2 / c1_count)+(c2_std**2 / c2_count) )
      end
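The formula in `t_value` can be exercised on plain arrays. This is a sketch, not the talk's code: `mean`, `sample_std` and `t_statistic` are hypothetical helpers, and the n - 1 denominator in the standard deviation is an assumption, since the slide does not show how its `std` aggregate is defined. The unpooled form used here, with each variance divided by its own sample size, is often called Welch's t statistic.

```ruby
# Arithmetic mean of an array of numbers.
def mean(values)
  values.sum.to_f / values.size
end

# Sample standard deviation (n - 1 in the denominator); an assumption,
# since the slide does not show how its std method is defined.
def sample_std(values)
  m = mean(values)
  Math.sqrt(values.sum { |v| (v - m)**2 } / (values.size - 1))
end

# Same formula as t_value on the slide, applied to two plain arrays.
def t_statistic(a, b)
  (mean(a) - mean(b)) /
    Math.sqrt(sample_std(a)**2 / a.size + sample_std(b)**2 / b.size)
end

puts t_statistic([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```

A negative result means the first sample's mean is lower than the second's; the magnitude is what gets compared against the cutoff.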
  • 52. PS3 Reviewers vs Metascore
      outlets = ReviewPublisher.all("games.platform"=>"ps3")
      t_scores = outlets.map do |outlet|
        t_value(:metascore, outlet.games(:platform=>"ps3"), :score, outlet.reviews("game.platform"=>"ps3"))
      end # .size => 140
      significant = t_scores.select do |t|
        (t > 1.96 or t < -1.96) and not t.infinite?
      end
      low = significant.select{ |s| s < -1.96} # .size => 20
      high = significant.select{ |s| s > 1.96} # .size => 10
  • 53. Xbox360 Reviewers vs Metascore
      outlets = ReviewPublisher.all("games.platform"=>"xbox360")
      t_scores = outlets.map do |outlet|
        t_value(:metascore, outlet.games(:platform=>"xbox360"), :score, outlet.reviews("game.platform"=>"xbox360"))
      end # .size => 169
      significant = t_scores.select do |t|
        (t > 1.96 or t < -1.96) and not t.infinite?
      end
      low = significant.select{ |s| s < -1.96} # .size => 37
      high = significant.select{ |s| s > 1.96} # .size => 29
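The filtering step shared by both platform slides can be sketched on a plain array of t scores. `split_significant` is a hypothetical helper and the input numbers are made up; 1.96 is the cutoff used on the slides, which corresponds to a two-tailed 5% threshold for large samples.

```ruby
CRITICAL_VALUE = 1.96  # cutoff used on the slides

# Split t scores into those significantly below and above zero,
# ignoring infinite or NaN values.
def split_significant(t_scores)
  finite = t_scores.select { |t| t.to_f.finite? }
  {
    :low  => finite.select { |t| t < -CRITICAL_VALUE },
    :high => finite.select { |t| t > CRITICAL_VALUE }
  }
end

result = split_significant([5.1, -2.4, 0.3, Float::INFINITY, -1.0])
puts "low: #{result[:low].size}, high: #{result[:high].size}"
```

Dropping non-finite values first covers the case the slides guard against with `infinite?`, which arises when an outlet's sample has zero variance.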
  • 54. What about Edge Magazine?
      >> outlet = ReviewPublisher.first("name.like"=>"%Edge%")
      => #<ReviewPublisher @id=36 @name="Edge Magazine">
      >> t = t_value(:metascore, outlet.games(:platform=>"ps3"), :score, outlet.reviews("game.platform"=>"ps3"))
      => 5.10786212293491
      >> t > 1.96
      => true # Edge has a PRO PS3 bias, not Anti!
  • 55. There are lots of other possibilities! What would you like to learn?
  • 56. Learn about DataMapper perhaps? http://www.datamapper.org irc://irc.freenode.net#datamapper
  • 57. Thanks! @knowtheory ted@knowtheory.net