The Effects of Time on Query Flow Graph-based Models for Query Suggestion
Carlos Castillo, Debora Donato (Yahoo! Research Barcelona)
Ranieri Baraglia, Franco Maria Nardini, Raffaele Perego, Fabrizio Silvestri (HPC Lab, ISTI-CNR, Pisa)




Tuesday, 4 May 2010
Outline
   • Introduction
   • Aims of this Work
   • The Query-Flow Graph
   • Evaluating the Aging Effect
   • Combating the Aging Effect
   • Distributed QFG Building
   • Conclusions & Future Work


Introduction
   • Web search engines use query recommender systems to improve users’ search experience;
   • Query recommender systems give hints to users on possible “interesting queries”:
     • relative to their information needs;
   • Query recommender systems exploit the knowledge of past Web search engine users:
     • recorded in query logs.
Aims of this Work
   • to show that time has negative effects on a query recommender model:
     • the model becomes unable to generate good suggestions as time passes;
     • bursty queries;
   • to extend a state-of-the-art recommender system by providing a methodology for dealing efficiently with evolving data:
     • to define a “good” strategy to update the model;
     • to define a distributed/parallel algorithm to update the model;
The Query-Flow Graph
   • QFG [Boldi et al., CIKM’08] is a compact and powerful representation of Web search engine users’ behavior;
   • QFG is a graph composed of:
     1. a set of nodes, V = Q ∪ {s, t};
     2. a set of directed edges, E ⊆ V × V:
        • (q, q’) are connected if they are consecutive at least once in at least one session;
     3. a weighting function w: E → (0, 1]:
        • assigning a weight w(q, q’) to each edge;

   [Figure: example query-flow graph with weighted directed edges among the queries "barcelona", "barcelona fc", "barcelona fc website", "barcelona fc fixtures", "real madrid", "barcelona hotels", "cheap barcelona hotels", "luxury barcelona hotels", "barcelona weather", "barcelona weather online", and the terminal node <T>.]
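
The definition above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `build_qfg`, the node labels `<S>` and `<T>`, and the toy sessions are assumptions, and the weight used here is a simple relative frequency (the share of transitions leaving q that go to q’).

```python
from collections import Counter

START, END = "<S>", "<T>"  # hypothetical labels for the special nodes s and t


def build_qfg(sessions):
    """Return {(q, q2): weight} from a list of sessions (lists of query strings).

    Two queries are connected when they are consecutive in at least one session.
    """
    pair_counts = Counter()
    out_counts = Counter()
    for session in sessions:
        path = [START, *session, END]
        for q, q2 in zip(path, path[1:]):
            pair_counts[(q, q2)] += 1
            out_counts[q] += 1
    # Relative frequency: share of the transitions leaving q that reach q2,
    # so every weight falls in (0, 1].
    return {(q, q2): n / out_counts[q] for (q, q2), n in pair_counts.items()}


# Toy sessions, invented for illustration only.
sessions = [
    ["barcelona", "barcelona hotels", "cheap barcelona hotels"],
    ["barcelona", "barcelona weather"],
    ["barcelona hotels", "luxury barcelona hotels"],
]
qfg = build_qfg(sessions)
```

Each key of `qfg` is a directed edge and each value is its weight, which matches the triple (V, E, w) on the slide when V is taken as the set of all endpoints.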


The Query-Flow Graph

   • two weighting schemes:
     • relative frequencies: counting query occurrences;
     • chaining probabilities: the probability that (q, q’) belong to the same chain;
       • classification on a set of features (text, n-grams, session) over all sessions where (q, q’) are consecutive;
   • noisy edges: edges with low probability are removed;
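
As a rough illustration of the noisy-edge removal step, the sketch below drops every edge whose weight falls below a threshold. The function name `prune_edges`, the toy edge dictionary, and the choice of keeping weights equal to the threshold are assumptions made for this example.

```python
def prune_edges(edges, threshold):
    """Keep only the edges whose weight is at least `threshold`.

    edges: {(q, q2): weight}. Whether the comparison should be strict is an
    assumption of this sketch.
    """
    return {edge: w for edge, w in edges.items() if w >= threshold}


# Toy weights, invented for illustration only.
example_edges = {("a", "b"): 0.9, ("a", "c"): 0.05, ("b", "c"): 0.5}
pruned = prune_edges(example_edges, 0.5)
```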
The Query-Flow Graph

   • Query recommendation:
     • random walk with restart on the graph;
     • considering the history of the user (through the preference vector);
   • A score is associated with each suggestion;
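
A random walk with restart can be approximated by power iteration. The following is a minimal sketch under stated assumptions, not the system described in the talk: the names `random_walk_with_restart` and `suggest`, the damping value `alpha=0.85`, the fixed iteration count, and the handling of nodes without outgoing edges (their mass restarts) are all choices made for this example. The `preference` dictionary plays the role of the preference vector: putting mass on several recent queries is one way to take the user's history into account.

```python
def random_walk_with_restart(edges, preference, alpha=0.85, iterations=50):
    """Score nodes by a random walk with restart (power iteration).

    edges: {(q, q2): weight}; preference: {q: probability}, summing to 1.
    With probability alpha the walker follows an outgoing edge, otherwise it
    jumps back to a node drawn from `preference`.
    """
    nodes = {n for edge in edges for n in edge} | set(preference)
    out_total = {n: 0.0 for n in nodes}
    for (q, _), w in edges.items():
        out_total[q] += w

    scores = {n: preference.get(n, 0.0) for n in nodes}
    for _ in range(iterations):
        nxt = {n: (1 - alpha) * preference.get(n, 0.0) for n in nodes}
        # Mass sitting on nodes without outgoing edges restarts as well.
        dangling = sum(scores[n] for n in nodes if out_total[n] == 0)
        for (q, q2), w in edges.items():
            nxt[q2] += alpha * scores[q] * w / out_total[q]
        for n in nodes:
            nxt[n] += alpha * dangling * preference.get(n, 0.0)
        scores = nxt
    return scores


def suggest(edges, query, top=3):
    """Return the `top` highest-scoring nodes other than `query`, with scores."""
    scores = random_walk_with_restart(edges, {query: 1.0})
    ranked = sorted((n for n in scores if n != query), key=lambda n: -scores[n])
    return [(n, scores[n]) for n in ranked[:top]]


# Toy graph, invented for illustration only.
toy_edges = {("a", "b"): 0.8, ("a", "c"): 0.2, ("b", "c"): 1.0}
toy_scores = random_walk_with_restart(toy_edges, {"a": 1.0})
toy_suggestions = suggest(toy_edges, "a", top=2)
```

The scores form a probability distribution over the nodes, so each suggestion comes with a comparable score, as the last bullet states.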
Experimental Framework
   • Experiments on the AOL query log:
     • 20 million queries;
     • 650,000 different users;
     • 3 months (03/01/2006 to 05/31/2006).
   • Three segments of the query log:

   [Figure: timeline of the query log divided into segments, with M1 and M2 marked.]
Experimental Assumptions
   • M1, M2 are used for training;
     • two different QFGs;
   • Queries in the third month are used for testing;
   • We evaluate the aging effect by measuring the quality of suggestions produced by the models built on M1 and M2;
     • If the model ages, M2 outperforms M1 in terms of quality of suggestions;

   time window   id   nodes       edges
   March 06      M1   3,814,748   6,129,629
   April 06      M2   3,832,973   6,266,648

   Table 1: Number of nodes and edges for the graphs corresponding to the two different training segments.

   Excerpt from the accompanying paper shown on the slide (cropped at the right margin):
   […] Boldi et al. in [4]. This method uses chaining probabilities measured by means of a machine learning method. The initial step was thus to extract those features from each training log, and storing them into a compressed graph representation. In particular we extracted 25 different features (time-related, session and textual features) for each pair of queries (q, q’) that are consecutive in at least one session […] the query log.
   Table 1 shows the number of nodes and edges of the different graphs corresponding to each query log segment […] for training.
   It is important to remark that we have not re-trained […] classification model for the assignment of weights associated with QFG edges. We reuse the one that has been used […] for segmenting users sessions into query chains. […] another point in favor of QFG-based models. Once you train the classifier to assign weights to QFG edges, you can reuse it on different data-sets without losing in effectiveness.
Evaluating the Aging Effect
   • Two classes of test queries:
     • F1: 30 queries highly frequent in M1 having a large drop in the test month (e.g. shakira);
     • F3: 30 queries highly frequent in the test month having a large drop in M1 (e.g. da vinci code, mothers day gift);
   • F1, F3 contain very diverse queries;

   [Figure: log-log plot, x axis from 1 to 1000 and y axis from 1 to 1e+06, with two series: "Top 1000 queries in month 1 on month 1" and "Top 1000 queries in month 3 on month 1".]
Evaluating the Aging Effect (II)
   • When k suggestions share the same score, those are useless;
   • Same suggestion score:
     • same probability on the graph;
     • the model is not able to give a priority to recommendations;
   • Confirmed by a user study on F1 and F3;

   [Figure: chart with two series over five x positions; the value pairs printed with it are (3742, 2652), (2162, 2615), (2001, 2341), (1913, 2341), (1913, 2341); the legend text is illegible.]
Evaluating the Aging Effect (III)
   • Working hypothesis:
     • useful recommendations do not share the same recommendation score;
   • Automatic evaluation:
     • 400 highly frequent queries in the test month;
     • evaluating the number of useful recommendations;
     • k = 3;
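
One possible reading of this automatic evaluation is sketched below: a suggestion is counted as useful when fewer than k suggestions share its score. This interpretation, the function name `count_useful`, the exact-equality comparison of scores, and the toy data are assumptions; the slides do not spell out the procedure.

```python
from collections import Counter


def count_useful(suggestions, k=3):
    """Count suggestions whose score is shared by fewer than k suggestions.

    suggestions: list of (query, score) pairs for one test query.
    Scores are compared by exact equality here; real scores may need rounding.
    """
    score_counts = Counter(score for _, score in suggestions)
    return sum(1 for _, score in suggestions if score_counts[score] < k)


# Toy suggestion list, invented for illustration only.
toy = [("a", 0.4), ("b", 0.2), ("c", 0.1), ("d", 0.1), ("e", 0.1)]
useful = count_useful(toy, k=3)
```

Averaging this count over all test queries would give one number per model, which is the kind of figure an automatic comparison between M1 and M2 needs.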
Evaluating the Aging Effect (IV)
   • Results:

   filtering threshold   average number of useful suggestions on M1   average number of useful suggestions on M2
   0                     2.84                                         2.91
   0.5                   5.85                                         6.23
   0.65                  5.85                                         6.23
   0.75                  5.85                                         6.18

   Table 4: Recommendation statistics obtained by using the automatic evaluation method on a set of 400 queries drawn from the most frequent in the third month.

   • Average number of useful suggestions is greater in M2 than in M1;
   • Filtering process helps a lot;

   Excerpt from the accompanying paper shown on the slide (cropped):
   […] reduces the “noise” on the data and generates more precise knowledge on which recommendations are computed. Furthermore, the increase is quite independent from the threshold level, i.e. by increasing the threshold from 0.5 to 0.75 the overall quality is, roughly, constant.
   We further break down the overall results shown in Table 4 to show the number of queries on which the QFG-based […]
martedì 4 maggio 2010
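The role of the filtering threshold in Table 4 can be pictured with a small sketch. It assumes, purely for illustration, that every QFG edge carries a weight in [0, 1] and that filtering drops the edges whose weight falls below the threshold. The edge weights, the query "weather", and the helper name `filter_edges` are hypothetical and are not taken from the slides.

```python
# Hypothetical QFG edges: (source query, target query) -> weight in [0, 1].
# The weights are made-up values, not data from the query log.
edges = {
    ("shakira", "shakira video"): 0.91,
    ("shakira", "free video downloads"): 0.32,
    ("shakira video", "shakira wallpaper"): 0.70,
    ("shakira", "weather"): 0.05,
}


def filter_edges(edges, threshold):
    """Keep only the edges whose weight is at least `threshold` (assumed rule)."""
    return {pair: weight for pair, weight in edges.items() if weight >= threshold}


# Same threshold values as in Table 4.
for threshold in (0, 0.5, 0.65, 0.75):
    kept = filter_edges(edges, threshold)
    print(threshold, len(kept))
```

With a threshold of 0 every edge survives, so low-weight (noisy) transitions can still surface as suggestions; raising the threshold keeps only the stronger transitions.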
Evaluating the Aging Effect (V)

       • On a histogram (cumulative distribution):

         [Figure: histogram with one series for M1 and one for M2; x axis ticks from 0 to 18, y axis ticks from 0 to 400]

       • Results on M2 are always better than those on M1:
          • fewer queries without suggestions;
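A minimal sketch of how such a comparison could be tabulated. The per-query counts below are invented for illustration (the slides report results on 400 queries, not on these lists), and `summarize` is a hypothetical helper.

```python
from collections import Counter

# Made-up number of useful suggestions per test query, one list per model.
useful_m1 = [0, 0, 3, 5, 1, 0, 7, 2]
useful_m2 = [0, 4, 3, 6, 1, 2, 7, 2]


def summarize(counts):
    """Return (histogram, number of queries with no useful suggestion)."""
    histogram = Counter(counts)      # k -> number of queries with exactly k
    return histogram, histogram[0]   # a Counter yields 0 for missing keys


for name, counts in (("M1", useful_m1), ("M2", useful_m2)):
    histogram, without = summarize(counts)
    print(name, "queries without suggestions:", without)
```

On this toy input the M1 list has three queries without suggestions and the M2 list has one, which is the kind of gap the last bullet refers to.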
Combating the Aging Effect

   • QFG recommender models age:
     • Average recommendation quality degrades;
     • Recommendations should not be influenced by time;
   • Update of the model vs. rebuilding it “from scratch”;
Combating the Aging Effect (II)

    • Solution: incremental update of M1 by means of “fresh data” in M2
       • Graph algebra [Bordino et al., 2008];
    • Some measures on the two different approaches:

                                          “From scratch”      “Incremental”
               Dataset                    strategy [min.]     strategy [min.]
               M1 (March 2006)            21                  14
               M2 (April 2006)            22                  15
               M12 (March and April)      44                  32

      Table 5: Time needed to build a Query Flow Graph from scratch and using our “incremental” approach (by merging two QFGs, each representing half of the data).

      Suppose the model used to generate recommendations consists of a portion of data representing one month (for M1 and M2) or two months (for M12) of the query log. The model is updated every 15 days (for M1 and M2) or every 30 days (for M12). With the first approach, we pay 22 (44) minutes every 15 (30) days to rebuild the new model from scratch on a new set of data obtained from the last two months of the query log. With the second approach, we pay only 15 (32) minutes to update the one-month (two-month) QFG.

    • Incremental updates: about 2/3 of the build time w.r.t. the “from scratch” strategy;
    • Evaluation on the same set of 400 queries;
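The "2/3 of the build time" figure can be checked directly against Table 5. The sketch below only redoes that arithmetic; the dictionary layout and the helper name `incremental_ratio` are illustrative choices, while the minutes are copied from the table.

```python
# Build times in minutes, copied from Table 5: (from scratch, incremental).
build_minutes = {
    "M1 (March 2006)": (21, 14),
    "M2 (April 2006)": (22, 15),
    "M12 (March and April)": (44, 32),
}


def incremental_ratio(from_scratch, incremental):
    """Fraction of the from-scratch build time needed by the incremental strategy."""
    return incremental / from_scratch


for dataset, (scratch, incremental) in build_minutes.items():
    print(f"{dataset}: {incremental_ratio(scratch, incremental):.2f}")
```

The ratios come out at 0.67, 0.68 and 0.73, so "2/3" is exact for M1 and a slight underestimate of the cost for M12.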
Combating the Aging Effect (III)

         shakira        3698   shakira video
                        3135   shakira nude
                        3099   shakira wallpaper
                        3020   shakira biography
                        3018   shakira aol music
                        2015   free video downloads

         Table 7: Some examples of recommendations generated on different QFG models. Queries used to generate recommendations are taken from different query sets.

      • Results:

                filtering     average number of useful     average number of useful
                threshold     suggestions on M2            suggestions on M12
                0             2.91                         3.64
                0.5           6.23                         7.95
                0.65          6.23                         7.94
                0.75          6.18                         7.9

         Table 8: Recommendation statistics obtained by using the automatic evaluation method on a relatively large set of 400 queries drawn from the most frequent in the third month.

      • The average number of useful suggestions is greater in M12 than in M2 or in M1;
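To put the difference between M2 and M12 in relative terms, the figures of Table 8 can be compared row by row. This is only a worked calculation on the published averages; `relative_gain` is a hypothetical helper name.

```python
# Average number of useful suggestions per filtering threshold, from Table 8:
# threshold -> (average on M2, average on M12).
table8 = {
    0: (2.91, 3.64),
    0.5: (6.23, 7.95),
    0.65: (6.23, 7.94),
    0.75: (6.18, 7.90),
}


def relative_gain(old, new):
    """Relative increase of `new` over `old`, as a fraction."""
    return (new - old) / old


for threshold, (m2, m12) in table8.items():
    print(f"threshold {threshold}: {relative_gain(m2, m12):+.1%}")
```

For every threshold the two-month model yields roughly 25% to 28% more useful suggestions on average than M2.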
Combating the Aging Effect (IV)

         [Figure 4: Histogram showing the number of queries (on the y axis) having a certain number of useful recommendations (on the x axis), for M1, M2 and M12. Results are evaluated automatically.]

          • On a histogram (cumulative distribution):

         [Figure 5: Histogram showing the total number of queries (on the y axis) having at least a certain number of useful recommendations (on the x axis), for M1, M2 and M12. For instance the third bucket shows how many queries …]

          • Results on M12 are always better than those on M1 and M2;
            • large improvement in the number of queries with at least four good suggestions;
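The relation between the two figures (exact counts in Figure 4, "at least" counts in Figure 5) is a running sum taken from the right. A minimal sketch, with a made-up histogram rather than the data behind the figures; `at_least` is a hypothetical helper name.

```python
from itertools import accumulate


def at_least(histogram):
    """Turn histogram[k] (queries with exactly k useful recommendations)
    into result[k] (queries with at least k useful recommendations)."""
    # Sum from the right: reverse, take running totals, reverse back.
    return list(accumulate(reversed(histogram)))[::-1]


# Made-up histogram for 10 queries: index k = number of useful recommendations.
exact = [3, 2, 0, 4, 1]
print(at_least(exact))
```

For this input the result is [10, 7, 5, 5, 1]: all 10 queries have at least zero useful recommendations, and a single query has at least four.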
Distributed QFG Building

   • a parallel way to update QFGs: Divide-and-Conquer approach;
     • the query log is split into m parts;
     • parallel extraction of the features;
     • compressing step;
     • merging graphs;
     • final operations (normalization, pagerank, etc.);

     4. using the graph algebra described in [8], each pa… graph is iteratively merged. Each iteration is do… parallel on the different available nodes of the clo…
     5. the final resulting data-graph is now processed … other steps [4] (normalization, chain extraction, …dom walk) to obtain the complete and usable QF…

     [Figure 6: Example of the building of a two mo…]
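The divide-and-conquer steps listed on this slide can be sketched as follows. This is an illustration under strong assumptions, not the authors' implementation: the log is a list of sessions, a partial graph is a plain count of consecutive query pairs, merging two graphs adds their counts (the actual graph algebra of [Bordino et al., 2008] may differ), and the final step is reduced to a simple normalization. The compressing step is omitted, a thread pool stands in for the cluster nodes, and all function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def build_partial(sessions):
    """Count consecutive query pairs (q, q') in one part of the log."""
    counts = Counter()
    for session in sessions:
        counts.update(zip(session, session[1:]))
    return counts


def merge_all(graphs):
    """Merge partial graphs pairwise until one remains (assumed: counts add up)."""
    graphs = list(graphs)
    while len(graphs) > 1:
        merged = [graphs[i] + graphs[i + 1] for i in range(0, len(graphs) - 1, 2)]
        if len(graphs) % 2:          # the odd one out is carried to the next round
            merged.append(graphs[-1])
        graphs = merged
    return graphs[0] if graphs else Counter()


def normalize(counts):
    """Turn raw counts into per-source transition fractions."""
    totals = Counter()
    for (source, _), count in counts.items():
        totals[source] += count
    return {pair: count / totals[pair[0]] for pair, count in counts.items()}


def build_qfg(sessions, m):
    """Split the log into m parts, build partial graphs in parallel, then merge."""
    parts = [sessions[i::m] for i in range(m)]
    with ThreadPoolExecutor(max_workers=m) as pool:
        partials = list(pool.map(build_partial, parts))
    return normalize(merge_all(partials))


# Made-up sessions, each a list of consecutive queries issued by one user.
log = [
    ["shakira", "shakira video"],
    ["shakira", "shakira wallpaper"],
    ["shakira", "shakira video", "free video downloads"],
]
print(build_qfg(log, m=2))
```

Because adding counts is associative and commutative, the result does not depend on how the log is split, which is what makes the pairwise merge rounds safe to run in parallel.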
Conclusions
   • We study the effects of time on QFG-based query recommender systems;
   • We built different QFGs from the AOL query log:
     • we analyze the quality of recommendations;
     • we show that recommendation models age;
     • we introduce an “incremental” algorithm for updating the model;
     • we propose a parallel/distributed way of building QFGs;
Future Works
   • to define a strategy for merging graphs that assigns different weights to each subgraph;
     • more importance to “fresh” data;
   • to compare the robustness of QFG recommender systems with that of other query recommenders with respect to aging;
   • to design a MapReduce algorithm to efficiently build and update QFG-based recommender systems;
Questions?


                   Thank you for your attention!



References

      • [Boldi et al., CIKM’08]: The Query Flow Graph: model and applications. Boldi, Bonchi, Castillo, Donato, Gionis, Vigna. CIKM’08.
      • [Boldi et al., WSCD’09]: Query Suggestions using Query-Flow Graphs. Boldi, Bonchi, Castillo, Donato, Vigna. WSCD’09.
      • [Bordino et al., 2008]: Algebra for the joint mining of query log graphs, 2008.

 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Français Patch Tuesday - Avril
Français Patch Tuesday - AvrilFrançais Patch Tuesday - Avril
Français Patch Tuesday - Avril
 
Infrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsInfrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platforms
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
 
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 

The Effects of Time on Query Flow Graph-based Models for Query Suggestion

  • 1. The Effects of Time on Query Flow Graph-based Models for Query Suggestion
      Carlos Castillo, Debora Donato (Yahoo! Research Barcelona)
      Ranieri Baraglia, Franco Maria Nardini, Raffaele Perego, Fabrizio Silvestri (HPC Lab, ISTI-CNR, Pisa)
      Tuesday, 4 May 2010
  • 3. Outline
      • Introduction
      • Aims of this Work
      • The Query-Flow Graph
      • Evaluating the Aging Effect
      • Combating the Aging Effect
      • Distributed QFG Building
      • Conclusions & Future Work
  • 7. Introduction
      • Web search engines use query recommender systems to improve users’ search experience;
      • Query recommender systems give hints to users on possible “interesting queries”:
          • relative to their information needs;
      • Query recommender systems exploit the knowledge of past web search engine users:
          • recorded in query logs.
  • 10. Aims of this Work
      • to show that time has negative effects on a query recommender model:
          • the model becomes unable to generate good suggestions as time passes;
          • bursty queries;
      • to extend a state-of-the-art recommender system by providing a methodology for dealing efficiently with evolving data:
          • to define a “good” strategy to update the model;
          • to define a distributed/parallel algorithm to update the model;
  • 13. The Query-Flow Graph
      • QFG [Boldi et al., CIKM’08] is a compact and powerful representation of Web search engine users’ behavior;
      • QFG is a graph composed of:
          1. a set of nodes, V = Q ∪ {s, t};
          2. a set of directed edges, E ⊆ V × V:
              • (q, q’) are connected if they are consecutive at least one time in at least one session;
          3. a weighting function w: E → (0, 1]:
              • assigning a weight w(q, q’) to each edge;
      [Figure: example QFG with weighted edges (weights between 0.011 and 0.523) among the queries “barcelona”, “barcelona fc”, “barcelona fc website”, “barcelona fc fixtures”, “real madrid”, “barcelona hotels”, “cheap barcelona hotels”, “luxury barcelona hotels”, “barcelona weather”, “barcelona weather online” and the terminal node <T>]
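The definition above can be sketched in a few lines of Python. This is only an illustrative toy, not the authors' implementation: the function name `build_qfg`, the node labels `<s>` and `<t>`, and the sample sessions are assumptions, and the weight used here (the fraction of transitions leaving q that go to q’) is just one simple way to obtain values in (0, 1].

```python
from collections import Counter, defaultdict

START, END = "<s>", "<t>"  # hypothetical labels for the special nodes s and t


def build_qfg(sessions):
    """Build a toy query-flow graph from a list of sessions (lists of queries).

    Returns {q: {q2: w}} where w(q, q2) is the fraction of transitions
    leaving q that go to q2, so every weight lies in (0, 1].
    """
    counts = defaultdict(Counter)
    for session in sessions:
        path = [START, *session, END]
        for q, q_next in zip(path, path[1:]):
            counts[q][q_next] += 1  # (q, q') are consecutive in this session
    graph = {}
    for q, successors in counts.items():
        total = sum(successors.values())
        graph[q] = {q2: c / total for q2, c in successors.items()}
    return graph


# Made-up sessions, loosely following the queries in the slide's figure.
sessions = [
    ["barcelona", "barcelona hotels", "cheap barcelona hotels"],
    ["barcelona", "barcelona weather"],
    ["barcelona", "barcelona hotels"],
    ["barcelona fc", "real madrid"],
]
qfg = build_qfg(sessions)
```

Every node's outgoing weights sum to 1 in this sketch; the chaining-probability scheme described in the talk would replace the counting step with a classifier score.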
  • 16. The Query-Flow Graph
      • two weighting schemes:
          • relative frequencies: counting query occurrences;
          • chaining probabilities: (q, q’) in the same chain
              • classification on a set of features (text, n-grams, session) over all sessions where (q, q’) are consecutive;
      • noisy edges: edges with low probability are removed;
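The removal of noisy edges can be pictured as a simple threshold filter. The sketch below assumes the graph is stored as a nested dictionary `{q: {q2: weight}}`; the function name `prune_edges`, the toy weights and the threshold value are all made up for illustration.

```python
def prune_edges(graph, threshold):
    """Drop every edge whose weight is below `threshold`.

    Nodes left without outgoing edges are removed from the mapping.
    """
    pruned = {}
    for q, successors in graph.items():
        kept = {q2: w for q2, w in successors.items() if w >= threshold}
        if kept:
            pruned[q] = kept
    return pruned


# Toy graph with made-up weights; "free games" stands in for a noisy edge.
toy = {
    "barcelona": {"barcelona hotels": 0.6, "barcelona fc": 0.38, "free games": 0.02},
    "free games": {"barcelona": 0.01},
}
filtered = prune_edges(toy, threshold=0.05)
```

Whether the surviving weights are renormalised afterwards is a separate design choice that this sketch leaves open.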
  • 19. The Query-Flow Graph
      • Query recommendation:
          • random walk with restart on the graph;
          • considering history of the users (on the preference vector);
      • A score is associated to each suggestion;
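A random walk with restart can be approximated by power iteration. The following is a minimal sketch under stated assumptions, not the recommender used in the talk: the restart probability of 0.15, the number of iterations, the handling of nodes without outgoing edges and the toy graph are all illustrative choices.

```python
def random_walk_with_restart(graph, preference, restart=0.15, iterations=50):
    """Power iteration for a random walk with restart.

    graph:      {node: {successor: weight}}; weights are row-normalised here.
    preference: {node: mass}, the restart distribution (e.g. the user's recent queries).
    Returns {node: score}; higher scores mean stronger suggestions.
    """
    nodes = set(graph) | {v for succ in graph.values() for v in succ} | set(preference)
    total_pref = sum(preference.values())
    pref = {n: preference.get(n, 0.0) / total_pref for n in nodes}
    scores = dict(pref)
    for _ in range(iterations):
        nxt = {n: restart * pref[n] for n in nodes}
        for u in nodes:
            succ = graph.get(u, {})
            out = sum(succ.values())
            if out == 0:
                # Dangling node: send its mass back to the preference vector.
                for n in nodes:
                    nxt[n] += (1 - restart) * scores[u] * pref[n]
                continue
            for v, w in succ.items():
                nxt[v] += (1 - restart) * scores[u] * w / out
        scores = nxt
    return scores


# Toy graph with made-up weights.
toy = {
    "barcelona": {"barcelona hotels": 0.7, "barcelona weather": 0.3},
    "barcelona hotels": {"cheap barcelona hotels": 1.0},
}
scores = random_walk_with_restart(toy, preference={"barcelona": 1.0})
suggestions = sorted((q for q in scores if q != "barcelona"), key=scores.get, reverse=True)
```

Putting more than one query into `preference` is how a user's recent history could bias the walk; the scores in the result play the role of the per-suggestion score mentioned on the slide.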
  • 25. Experimental Framework
      • Experiments on the AOL query log:
          • 20 million queries;
          • 650,000 different users;
          • 3 months (03/01/2006 → 05/31/2006).
      • Three segments of the query log:
          [Figure: timeline of the query log with the segments M1 and M2; the remaining labels are not recoverable]
  • 29. Experimental Assumptions
      • M1, M2 are used for training;
          • two different QFGs;
      • Queries in the third month are used for testing;
      • We evaluate the aging effect by measuring the quality of suggestions produced by models on M1 and M2;
      • If the model ages, M2 outperforms M1 in terms of quality of suggestions;
      Paper excerpt shown on the slide (right margin cropped):
      “… Boldi et al. in [4]. This method uses chaining probabilities measured by means of a machine learning method. The initial step was thus to extract those features from each training log, and storing them into a compressed graph representation. In particular we extracted 25 different features (time-related, session and textual features) for each pair of queries (q, q’) that are consecutive in at least one session of the query log. Table 1 shows the number of nodes and edges of the different graphs corresponding to each query log segment used for training. … It is important to remark that we have not re-trained the classification model for the assignment of weights associated with QFG edges. We reuse the one that has been used in […] for segmenting users sessions into query chains. This is another point in favor of QFG-based models. Once you train the classifier to assign weights to QFG edges, you can reuse it on different data-sets without losing in effectiveness.”

      time window   id   nodes       edges
      March 06      M1   3,814,748   6,129,629
      April 06      M2   3,832,973   6,266,648
      Table 1: Number of nodes and edges for the graphs corresponding to the two different training segments.
  • 33. Evaluating the Aging Effect
      • Two classes of test queries:
          • F1: 30 queries highly frequent in M1 having a large drop in the test month (e.g. shakira);
          • F3: 30 queries highly frequent in the test month having a large drop in M1 (e.g. da vinci code, mothers day gift);
      • F1, F3 contain very diverse queries;
      [Figure: log-log plot with the series “Top 1000 queries in month 1 on month 1” and “Top 1000 queries in month 3 on month 1”; x axis from 1 to 1000, y axis from 1 to 1e+06]
  • 38. Evaluating the Aging Effect (II)
      • When k suggestions share the same score, those are useless;
      • Same suggestion score:
          • same probability on the graph;
          • the model is not able to give a priority to recommendations;
      • Confirmed by a user study on F1 and F3;
  • 41. Evaluating the Aging Effect (III)
      • Working hypothesis:
          • useful recommendations do not share the same recommendation score;
      • Automatic evaluation:
          • 400 highly frequent queries in the test month;
          • evaluating the number of useful recommendations;
          • k = 3;
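One possible reading of this automatic evaluation is sketched below: a suggestion counts as useful unless its score is shared by at least k suggestions. The function name `count_useful`, the tie rule and the toy scores are assumptions for illustration; the slides do not spell out the exact procedure.

```python
from collections import Counter


def count_useful(scored_suggestions, k=3):
    """Count suggestions whose score is shared by fewer than k suggestions."""
    ties = Counter(score for _, score in scored_suggestions)
    return sum(1 for _, score in scored_suggestions if ties[score] < k)


# Made-up toy scores: the last three suggestions are tied and hence not useful for k = 3.
ranked = [
    ("shakira video", 0.9),
    ("shakira wallpaper", 0.7),
    ("free video downloads", 0.2),
    ("free music", 0.2),
    ("free games", 0.2),
]
useful = count_useful(ranked)
```

With real-valued scores from a random walk, an implementation would probably compare scores up to a small tolerance rather than by exact equality.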
  • 44. Evaluating the Aging Effect (IV)
      • Results:

      filtering threshold   avg. useful suggestions on M1   avg. useful suggestions on M2
      0                     2.84                            2.91
      0.5                   5.85                            6.23
      0.65                  5.85                            6.23
      0.75                  5.85                            6.18
      Table 4: Recommendation statistics obtained by using the automatic evaluation method on a set of 400 queries drawn from the most frequent in the third month.

      • Average number of useful suggestions is greater in M2 than in M1;
      • Filtering process helps a lot;
      Paper excerpt shown on the slide (cropped):
      “… reduces the “noise” on the data and generates more precise knowledge on which recommendations are computed. Furthermore, the increase is quite independent from the threshold level, i.e. by increasing the threshold from 0.5 to 0.75 the overall quality is, roughly, constant. … We further break down the overall results shown in Table 4 to show the number of queries on which the QFG-based …”
      (clipped caption fragment: “…ate recommendations are taken from different query … recommendations with their assigned relative scores.”)
  • 47. Evaluating the Aging Effect (V)
      • On a histogram (cumulative distribution):
          [Figure: histogram with x axis from 0 to 18 and y axis from 0 to 400; series M1 and M2]
      • Results on M2 are always better than those on M1:
          • fewer queries without suggestions;
  • 50. Combating the Aging Effect
      • QFG recommender models age:
          • Average recommendation quality degrades;
          • Recommendations should not be influenced by time;
      • Update of the model vs. rebuilding it “from scratch”;
  • 53. Combating the Aging Effect (II)
      • Solution: incremental update of M1 by means of “fresh data” in M2:
          • Graph algebra [Bordino et al., 2008];
      • Some measures on the two different approaches:

      Dataset                 “From scratch” strategy [min.]   “Incremental” strategy [min.]
      M1 (March 2006)         21                               14
      M2 (April 2006)         22                               15
      M12 (March and April)   44                               32
      Table 5: Time needed to build a Query Flow Graph from scratch and using our “incremental” approach (by merging two QFGs, each representing half of the data).

      • Incremental updates: 2/3 of the build time w.r.t. the “from scratch” strategy;
      • Evaluation on the same set of 400 queries;
      Paper excerpt shown on the slide (cropped):
      “… Suppose the model used to generate recommendations consists of a portion of data representing one month (for M1 and M2) or two months (for M12) of the query log. The model is being updated every 15 days (for M1 and M2) or every 30 days (for M12). By using the first approach, we pay 22 (44) minutes every 15 (30) days to rebuild the new model from scratch on a new set of data obtained from the last two months of the query log. Instead, by using the second approach, we need to pay only 15 (32) minutes for updating the one-month (two-months) QFG.”
  • 56. Combating the Aging Effect (III)
      • Results:

      filtering threshold   avg. useful suggestions on M2   avg. useful suggestions on M12
      0                     2.91                            3.64
      0.5                   6.23                            7.95
      0.65                  6.23                            7.94
      0.75                  6.18                            7.9
      Table 8: Recommendation statistics obtained by using the automatic evaluation method on a relatively large set of 400 queries drawn from the most frequent in the third month.

      • Average number of useful suggestions is greater in M12 than in M2 or in M1;
      Fragment of Table 7 visible on the slide (score, suggestion), next to the label “shakira”:
          3698   shakira video
          3135   shakira nude
          3099   shakira wallpaper
          3020   shakira biography
          3018   shakira aol music
          2015   free video downloads
      Table 7: Some examples of recommendations generated on different QFG models. Queries used to generate recommendations are taken from different query sets.
      (clipped text: “…gated the main reasons why we obtain such an improvement.”)
  • 59. Combating the Aging Effect (IV)
      • On a histogram (cumulative distribution):
          [Figure: histogram with x axis from 0 to 18 and y axis from 0 to 400; series M1, M2 and M12]
      • Results on M12 are always better than those on M1 and M2;
          • large improvement of queries with at least four good suggestions;
      Figure 4: Histogram showing the number of queries (on the y axis) having a certain number of useful recommendations (on the x axis). Results are evaluated automatically.
      Figure 5: Histogram showing the total number of queries (on the y axis) having at least a certain number of useful recommendations (on the x axis). For instance the third bucket shows how many queries …
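The cumulative histogram described in the Figure 5 caption (queries having at least a certain number of useful recommendations) can be computed as follows. The function name `at_least_histogram` and the per-query counts are made up for illustration.

```python
def at_least_histogram(useful_counts, max_n):
    """Return h where h[n] is the number of queries with at least n useful suggestions."""
    return [sum(1 for c in useful_counts if c >= n) for n in range(max_n + 1)]


# Made-up toy values: number of useful suggestions for seven test queries.
useful_per_query = [0, 0, 2, 3, 3, 5, 8]
hist = at_least_histogram(useful_per_query, max_n=8)

# Queries with no useful suggestion at all: bucket 0 minus bucket 1.
without_suggestions = hist[0] - hist[1]
```

Comparing two models then amounts to comparing their `hist` lists bucket by bucket; a model is uniformly better when every bucket from 1 upwards is at least as large.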
  • 61. Distributed QFG Building
      • a parallel way to update QFGs: Divide-and-Conquer approach;
          • the query log is split in m parts;
          • parallel extraction of the features;
          • compressing step;
          • merging graphs;
          • final operations (normalization, pagerank, etc.);
      [Figure 6: Example of the building of a two mo… (caption cropped); diagram of partial QFGs merged pairwise into a final QFG, labels not recoverable]
      Paper excerpt shown on the slide (right margin cropped):
      “4. using the graph algebra described in [8], each partial graph is iteratively merged. Each iteration is done in parallel on the different available nodes of the cloud.
      5. the final resulting data-graph is now processed with other steps [4] (normalization, chain extraction, random walk) to obtain the complete and usable QFG.”
  • 64. Conclusions
      • We study the effects of time on QFG-based query recommender systems;
      • We built different QFGs from the AOL query log:
          • we analyze the quality of recommendations;
          • we show that recommendation models age;
          • we introduce an “incremental” algorithm for updating the model;
          • we propose a parallel/distributed way of building QFGs;
  • 68. Future Work
      • to define a strategy for merging graphs assigning different weights to each subgraph;
          • more importance to “fresh” data;
      • to compare the robustness of QFG recommender systems with other query recommenders with respect to aging;
      • to design a MapReduce algorithm to efficiently build and update QFG recommender systems;
  • 69. Questions? Thank you for your attention!
  • 70. References
      • [Boldi et al., CIKM’08]: The Query Flow Graph: model and applications. Boldi, Bonchi, Castillo, Donato, Gionis, Vigna. CIKM’08.
      • [Boldi et al., WSCD’09]: Query Suggestions using Query-Flow Graphs. Boldi, Bonchi, Castillo, Donato, Vigna. WSCD’09.
      • [Bordino et al., 2008]: Algebra for the joint mining of query log graphs, 2008.