SlideShare una empresa de Scribd logo
1 de 4
Descargar para leer sin conexión
2007 IEEE/WIC/ACM International Conference on Intelligent Agent Technology




       Simple Algorithms for Representing Tag Frequencies in the SCOT
                                  Exporter

                        Hak-Lae Kim                                               Sung-Kwon Yang
             Digital Enterprise Research Institute                     Biomedical Knowledge Engineering Lab
            National University of Ireland, Galway                            Seoul National University
              IDA Business Park, Lower Dangan                            28-22 Yeonkun-Dong, Chongno-Ku
                       Galway, Ireland                                              Seoul, Korea
                     haklae.kim@deri.org                                     sungkwon.yang@gmail.com
                        John G. Breslin                                            Hong-Gee Kim
             Digital Enterprise Research Institute                     Biomedical Knowledge Engineering Lab
            National University of Ireland, Galway                           Seoul National University
              IDA Business Park, Lower Dangan                            28-22 Yeonkun-Dong, Chongno-Ku
                        Galway, Ireland                                             Seoul, Korea
                     john.breslin@deri.org                                        hgkim@snu.ac.kr


                            Abstract                                   because we do not know which type of frequency for-
                                                                       mat (i.e. absolute and relative frequency etc) to use.
      In this paper we describe the SCOT Exporter and its              To solve the limitations, we need a semantically and
   algorithms to create instance data based on the SCOT                statistically enriched model to be able to describe tag-
   (Social Semantic Cloud of Tags) ontology for sharing                ging information.
   and reusing tag data. The algorithms use tag frequen-                  The SCOT (Social Semantic Cloud Of Tags)1 on-
   cies and co-occurrence relations to represent statisti-             tology is an ontology for sharing and reusing tag data
   cal information via the SCOT ontology. We give an                   and representing social relations among individuals. It
   overview of the Exporter and the algorithms, and then               provides the structure and semantics for describing re-
   discuss some experimental results.                                  sources, tags, users and extended tag information such
                                                                       as tag frequency, tag co-occurrence frequency, and tag
                                                                       equivalence.
   1. Introduction                                                        In this paper, we describe the SCOT Exporter and
                                                                       the algorithms for constructing data based on the
      Social tagging systems encourage user participation              SCOT ontology. In particular, the contributions of this
   through easy-to-use free tagging tools and produce ag-              paper are:
   gregations of user metadata through bottom-up con-                    • SCOT Exporters that can automatically construct
   sensus. However, these systems on the Web are not suf-                  SCOT instances from various data sources such as
   ficient to provide uniform way to define the semantics                    weblogs, relational databases, etc. The algorithms
   of the tag and to calculate its frequencies. In fact, folk-             for the Exporters support an efficient way to gen-
   sonomies, statistically-weighted list of tags, have faced               erate the necessary information for SCOT.
   a number of important linguistic issues (i.e. polysemy,
   synonymy, homonymy, plurals, and spelling variants,                   • The experimental evaluation demonstrates that
   etc.) and statistical issues (i.e. frequency, weighted                  the SCOT Exporter is scalable and generates
   value, etc.). In comparison with the linguistic issues,                 SCOT instances with good run time performance.
   the statistical issues have not been focused on at the
                                                                       The rest of the paper is organized as follows: Section
   moment. However, when we intend to integrate or ex-
                                                                       2 describes the SCOT Exporters and its algorithms for
   change tag data across various applications, different
   folksonomies, or different users, it is a critical problem             1 http://scot-project.org




0-7695-3027-3/07 $25.00 © 2007 IEEE                              536
                                                                 542
                                                                 544
DOI 10.1109/IAT.2007.101
constructing tags and their co-occurrences. Section 3            frequency, the actual number of frequency, tells us how
describes the experimental setting and results. Section          a frequently tag is used in a give domain while a relative
4 discusses related work and Section 5 concludes the             frequency of each tag means its proportion in the total
paper.                                                           tag occurrence.
                                                                    In SCOT, this information is represented by the
2. SCOT Exporter                                                 scot:ownAFrequency and scot:ownRFrequency prop-
                                                                 erties which are subproperties of scot:AFrequency
                                                                 and scot:RFrequency respectively3 . We use the Tag
   We describe how data using the SCOT ontology can
                                                                 Frequency Computation (see Algorithm 1) to calculate
be automatically generated from both client-side we-
                                                                 the frequency of each tag.
blog engines and databases. Two types of exporters
for creating the SCOT ontology are as follows:
                                                                 Algorithm 1 Tag Frequency Computation
  • Exporter for the WordPress 2 . This allows                   Require: tagList is a list of tags
    the production of SCOT ontology from a blog.                  var tagList ← empty list
    Its design is based on the assumption that cat-               var totalTagFrequency ← 0
    egories in WordPress are used as tags. This
    plugin is activated in the plugin menu on the                  for all item ∈ getUserItemList(user ) do
    WordPress administration panel and it requires                   for all tag ∈ getItemTagList(item) do
    no user configuration in order to work. The on-                     if tagList.exist(tag) then
    tology created by the plugin can be found at                          tag.AFrequency ← tag.AFrequency + 1
    http://yourhost/scot/scot.rdf.                                     else
                                                                          tagList.add(tag)
  • Exporter for database. This aims to create the                     end if
    SCOT ontology data from a large number of RSS                      totalTagFrequency ← totalTagFrequency + 1
    feeds in a DBMS. This type of data set consists                  end for
    of multiple users from various blog systems, so it             end for
    creates the ontology data per each individual user.            for all tag ∈ tagList do
                                                                     tag.RFrequency ← tag.AFrequency / totalTagFre-
Both types of the exporters have almost the same func-
                                                                     quency
tionalities that basically create user profiles such as
                                                                   end for
name, blog name, blog url etc. and extended tag in-
formation such as name, frequency, co-occurrence, hi-
erarchy between tags, equivalence of tags etc.                      Several functions are assumed:           getUserItem-
                                                                 List(user) returns the item lists for the user from the
2.1. Tag Frequency Computation Algo-                             data source. getItemTagList(item) gets the tag lists for
     rithm                                                       the given item. tagList.exist(tag) returns true if the tag
                                                                 exists in the tagList. tagList.add(tag) inserts the tag
                                                                 into the tagList.
   A tag has a weighted frequency that is associated
with or assigned to certain items. The frequency of an
individual tag would reflect how popular it is.                   2.2. Co-occurrence Frequency Computation
   Frequencies can be expressed as absolute frequencies                Algorithm
or relative frequencies [4]. The absolute frequencies
are raw observations, that have not been normalized                 People are inclined to generate one more tags
with respect to the base rates of the event in question.         when they tag in resources. The meaning of the
When we speak about the frequency of tags it usually             tag becomes more specific when the tag is combined
means the absolute frequency format. At the same                 with a set of tags. To simply define the term co-
time, the ’relative frequency’ means a frequency that            occurrence: if an item contains both the tags se-
is expressed in relation to a sample size or rate. This          manticweb and blog, these two tags are said to
format is used to compare the occurrence of objects              co-occur or have a first order co-occurrence.    It
in two or more groups. Accordingly, this format is               can play an important role to reduce ’tag ambigu-
used in tag clouds where the size of each tag represents         ity’ in tagging systems. scot:cooccurAFrequency
the proportion of the tags. Simply stated, an absolute
                                                                    3 The former is intended to describe the absolute format and
  2 http://scot-project.org/?page_id=12                          the purpose of the latter is to represent the relative format




                                                           537
                                                           543
                                                           545
and scot:cooccurRFrequency properties describe co-                dication feeds (e.g. using the dc:subject attribute for
occurrence as an absolute and relative value to the               an RSS item).
frequency amongst a set of tags.      Co-occurrence
frequencies will be computed by the Co-occurrence                 3.2. Evaluation Results
Frequency Computation, as shown in Algorithm 2.
Note that getUserItemList(user) returns the item                     All the tests were run on an Intel T2600 2.16 GHz
                                                                  machine with 2 GB of RAM. The mean number of posts
Algorithm 2 Co-occurrence Frequency Computation                   per blog for the data set is 96.6 and the mean number
Require: cooccurList is a list of co-occurrences                  of tags per blog is 29.4. The mean number of frequency
 var cooccurList ← empty list                                     of tag usage per blog is 97.4, therefore more than 30%
 var totalCooccurFrequency ← 0                                    of the tags are overlapping (i.e. 29.4/97.4). Table 1
 for all item ∈ getUserItemList(user) do                          shows the top five tags. We can see that several tags
   itemT agList ← getItemTagList(item)                            are associated with the country Ireland (i.e. irishblogs
   if itemT agList.count ≥ 2 then                                 and ireland etc).
      cooccur ← cooccurList.exist(itemT agList)
      if cooccur = null then                                                    Tag Ranking           AF       RF
         cooccur.AFrequency ← cooccur.AFrequency +                                irishblogs         2557     3.7%
         1                                                                          general          1828     2.7%
      else                                                                          ireland          1821     2.7%
         cooccur ← newCooccur(itemT agList)                                     uncategorized        976      1.4%
         cooccurList.add(cooccur)                                                    music           714      1.0%
      end if                                                                            .              .        .
      totalCooccurFrequency ← totalCooccurFre-                                          .              .        .
      quency + 1                                                                                    68785     100%
   end if
 end for                                                                             Table 1. Top five tags
 for all cooccur ∈ cooccurList do
   cooccur.RFrequency ← cooccur.AFrequency / to-                     The co-occurring tags4 in Table 2 are created by cal-
   talCooccurFrequency                                            culating tags which appeared together using Algorithm
 end for                                                          2. We can see that the tags such as irishblogs, ireland,
                                                                  blogs, and irish in Table 1 are frequently used with
list for the user from the data source. getItem-                  others. This means that frequently used tags are likely
TagList(item) gets the tag list for the given item. cooc-         to combine with other tags. In SCOT, only the first
curList.exist(itemTagList) returns the tag set if there           order co-occurrence that happened in real world usage
exists a same cooccurring tag set in the cooccurList.             without a conditional probability is described. It is not
cooccurList.add(cooccur) inserts the co-occurring tag             the main of the SCOT to describe a similarity among
set into the cooccurList.                                         tags or other types of term relationships as this ontol-
                                                                  ogy is intended to reflect the fact that tags happened
                                                                  in a given domain.
3. Evaluation                                                        The run time performance is a critical issue when
                                                                  creating and maintaining a SCOT data set, due to the
3.1. Datasets                                                     complex analysis of word relations and the frequent
                                                                  updates. To generate the SCOT data, we must carry
   Planet Journals is a blog aggregator website that              out analyzes such as tag frequencies and co-occurrence
collects country-specific blogs for residents or cit-              relations among tags using the Algorithms 1-2 in or-
izens of a particular country.       The Ireland site             der. Figure 1 shows the runtime performance for the
(planet.journals.ie) has a collection of around 1322              given data. The number of tags+cooccurrences on the
blogs, 119101 posts, and 13688 tags. The data set used            Y axis is found from the sum of ’total frequency of tags’
here is taken from that website during the two years              and ’total frequency of tag sets’. The time required de-
period between February 2005 and October 2006, and                pends on the amount of tags: the mean time for it is
the software that powers the website (Planet PHP) also            1.1 seconds. Most cases in the data (under 200 tags)
allows the aggregation of any tags that are used in the              4 gaeilge is the Irish word for gaelic and podchraoladh is pod-

various blogs and that are passed through via RSS syn-            casting.




                                                            538
                                                            544
                                                            546
Cooccurrences             AF       RF                resources using keyword. Therefore the core concepts
   gaeilge irishblogs podchraoladh      71     0.8%               of the ontologies are Tagging, Tagger and Resource
   cork flickr ireland photography       36     0.4%               to represent users, events, and resources respectively.
            blogs irishblogs            35     0.4%               However, there is no way to describe frequency of tags
           ireland real news            32     0.4%               in the ontologies. The SCOT ontology can be easily
      irishphotos photography           31     0.4%               represent this information using properties of tag fre-
                    .                    .       .                quency.
                    .                    .       .
                                      11514    100%               5   Conclusion

        Table 2. Top five co-occurring tags                           The SCOT Exporter provides an efficient run time
                                                                  performance to construct SCOT instances. The algo-
                                                                  rithms for the exporter are simple and provide the com-
are generated in less than 1 second. Although there               plete set of information to describe metadata using the
are several large tags in the data, it takes less than            SCOT ontology. The exporters are implemented for
6 seconds to generate the corresponding SCOTs. We                 each site or application since there are many possible
believe that the algorithms and the tool produces an              kinds of social tagging sites for which exporters can be
efficient performance in terms of time.                             developed. Our approach is a starting point to rep-
                                                                  resent the structure and the semantics of social tag-
                                                                  ging. We will provide further information through the
                                                                  project web site (http://scot-project.org).

                                                                  Acknowledgments

                                                                     This material is based upon works supported by
                                                                  the Science Foundation Ireland under Grant No.
                                                                  SFI/02/CE1/I131.

                                                                  References

                                                                   [1] R. Newman, Tag Ontology design, available
          Figure 1. Runtime performance
                                                                       at:   http://www.holygoat.co.uk/projects/tags/
                                                                       (viewed 16/08/2007), 2005.

4. Related Work                                                    [2] T. Gruber, Ontology of Folksonomy: A Mash-up
                                                                       of Apples and Oranges, First on-Line conference
                                                                       on Metadata and Semantics Research (MTSR’05).
   There are several efforts that try to represent the
                                                                       http://www.metadata-semantics.org/.
concept of tagging, the operation of tagging, and the
tags themselves. Newman [1] describes the relation-                [3] Knerr, T. (2006),Tagging Ontology- Towards a
ship between an agent, an arbitrary resource, and one                  Common Ontology for Folksonomies, available at:
or more tags. In his ontology, there are three core con-               http://code.google.com/p/tagont/
cepts such as Taggers, Tagging, and Tags to represent
tagging activity. Gruber [2] describes the core idea of            [4] Clare Harries, N. H., Are absolute frequencies,
tagging that consists of object, tag, tagger, and source.              relative frequencies, or both effective in reducing
Knerr [3] describes the concept of tagging in the Tag-                 cognitive biases?, Journal of Behavioral Decision
ging Ontology. Since his approach is based on the ideas                Making 13, Issue 4, 431-444, 2000.
from [1, 2], the core element of the ontology is Tagging.
                                                                   [5] Tagcommons            project             website:
The ontology consists of time, user, domain, visibility,
                                                                       http://tagcommons.org
tag, resource, and type. There are some other projects
[5, 6] related to tag sharing and semantic tagging.                [6] Tagora Project website:       http://www.tagora-
   The approaches in the related work are focused on                   project.eu
the tagging activities or events that people use to tag



                                                            539
                                                            545
                                                            547

Más contenido relacionado

Destacado

My Personal Brand
My Personal BrandMy Personal Brand
My Personal BrandKim Lucas
 
悠邁科技公司簡介
悠邁科技公司簡介悠邁科技公司簡介
悠邁科技公司簡介Sean Liu
 
Exist, To Be Or Not To Be
Exist, To Be Or Not To BeExist, To Be Or Not To Be
Exist, To Be Or Not To Bebilly1986
 
悠邁介紹(Update)
悠邁介紹(Update)悠邁介紹(Update)
悠邁介紹(Update)Sean Liu
 
High Performance Jdbc
High Performance JdbcHigh Performance Jdbc
High Performance JdbcSam Pattsin
 
Cedfl 2013
Cedfl 2013Cedfl 2013
Cedfl 2013CED FL
 

Destacado (8)

My Personal Brand
My Personal BrandMy Personal Brand
My Personal Brand
 
悠邁科技公司簡介
悠邁科技公司簡介悠邁科技公司簡介
悠邁科技公司簡介
 
Library Lovers Day at Broughton
Library Lovers Day at BroughtonLibrary Lovers Day at Broughton
Library Lovers Day at Broughton
 
Exist, To Be Or Not To Be
Exist, To Be Or Not To BeExist, To Be Or Not To Be
Exist, To Be Or Not To Be
 
悠邁介紹(Update)
悠邁介紹(Update)悠邁介紹(Update)
悠邁介紹(Update)
 
Book Week at BACIRC
Book Week at BACIRCBook Week at BACIRC
Book Week at BACIRC
 
High Performance Jdbc
High Performance JdbcHigh Performance Jdbc
High Performance Jdbc
 
Cedfl 2013
Cedfl 2013Cedfl 2013
Cedfl 2013
 

Similar a Algorithm

IRJET- Recruitment Chatbot
IRJET- Recruitment ChatbotIRJET- Recruitment Chatbot
IRJET- Recruitment ChatbotIRJET Journal
 
Semantic Web in Action: Ontology-driven information search, integration and a...
Semantic Web in Action: Ontology-driven information search, integration and a...Semantic Web in Action: Ontology-driven information search, integration and a...
Semantic Web in Action: Ontology-driven information search, integration and a...Amit Sheth
 
Study on Different Code-Clone Detection Techniques & Approaches to MitigateCo...
Study on Different Code-Clone Detection Techniques & Approaches to MitigateCo...Study on Different Code-Clone Detection Techniques & Approaches to MitigateCo...
Study on Different Code-Clone Detection Techniques & Approaches to MitigateCo...IRJET Journal
 
Study on Different Code-Clone Detection Techniques & Approaches to MitigateCo...
Study on Different Code-Clone Detection Techniques & Approaches to MitigateCo...Study on Different Code-Clone Detection Techniques & Approaches to MitigateCo...
Study on Different Code-Clone Detection Techniques & Approaches to MitigateCo...IRJET Journal
 
IRJET- Automatic Database Schema Generator
IRJET- Automatic Database Schema GeneratorIRJET- Automatic Database Schema Generator
IRJET- Automatic Database Schema GeneratorIRJET Journal
 
On the Choice of Models of Computation for Writing Executable Specificatoins ...
On the Choice of Models of Computation for Writing Executable Specificatoins ...On the Choice of Models of Computation for Writing Executable Specificatoins ...
On the Choice of Models of Computation for Writing Executable Specificatoins ...ijeukens
 
Syst biol 2012-burguiere-sysbio sys069
Syst biol 2012-burguiere-sysbio sys069Syst biol 2012-burguiere-sysbio sys069
Syst biol 2012-burguiere-sysbio sys069Thomas Burguiere
 
Named Entity Recognition For Hindi-English code-mixed Twitter Text
Named Entity Recognition For Hindi-English code-mixed Twitter Text Named Entity Recognition For Hindi-English code-mixed Twitter Text
Named Entity Recognition For Hindi-English code-mixed Twitter Text Amogh Kawle
 
Reviews on swarm intelligence algorithms for text document clustering
Reviews on swarm intelligence algorithms for text document clusteringReviews on swarm intelligence algorithms for text document clustering
Reviews on swarm intelligence algorithms for text document clusteringIRJET Journal
 
IRJET- A Novel Approch Automatically Categorizing Software Technologies
IRJET- A Novel Approch Automatically Categorizing Software TechnologiesIRJET- A Novel Approch Automatically Categorizing Software Technologies
IRJET- A Novel Approch Automatically Categorizing Software TechnologiesIRJET Journal
 
AI and Web-Based Interactive College Enquiry Chatbot
AI and Web-Based Interactive College Enquiry ChatbotAI and Web-Based Interactive College Enquiry Chatbot
AI and Web-Based Interactive College Enquiry ChatbotIRJET Journal
 
IRJET- Towards Efficient Framework for Semantic Query Search Engine in Large-...
IRJET- Towards Efficient Framework for Semantic Query Search Engine in Large-...IRJET- Towards Efficient Framework for Semantic Query Search Engine in Large-...
IRJET- Towards Efficient Framework for Semantic Query Search Engine in Large-...IRJET Journal
 
Towards From Manual to Automatic Semantic Annotation: Based on Ontology Eleme...
Towards From Manual to Automatic Semantic Annotation: Based on Ontology Eleme...Towards From Manual to Automatic Semantic Annotation: Based on Ontology Eleme...
Towards From Manual to Automatic Semantic Annotation: Based on Ontology Eleme...IJwest
 
Cohesive Software Design
Cohesive Software DesignCohesive Software Design
Cohesive Software Designijtsrd
 
GENERIC CODE CLONING METHOD FOR DETECTION OF CLONE CODE IN SOFTWARE DEVELOPMENT
GENERIC CODE CLONING METHOD FOR DETECTION OF CLONE CODE IN SOFTWARE DEVELOPMENT GENERIC CODE CLONING METHOD FOR DETECTION OF CLONE CODE IN SOFTWARE DEVELOPMENT
GENERIC CODE CLONING METHOD FOR DETECTION OF CLONE CODE IN SOFTWARE DEVELOPMENT IAEME Publication
 
Introduction to Data Structure
Introduction to Data Structure Introduction to Data Structure
Introduction to Data Structure Prof Ansari
 
IRJET - Voice based Natural Language Query Processing
IRJET -  	  Voice based Natural Language Query ProcessingIRJET -  	  Voice based Natural Language Query Processing
IRJET - Voice based Natural Language Query ProcessingIRJET Journal
 

Similar a Algorithm (20)

IRJET- Recruitment Chatbot
IRJET- Recruitment ChatbotIRJET- Recruitment Chatbot
IRJET- Recruitment Chatbot
 
Semantic Web in Action: Ontology-driven information search, integration and a...
Semantic Web in Action: Ontology-driven information search, integration and a...Semantic Web in Action: Ontology-driven information search, integration and a...
Semantic Web in Action: Ontology-driven information search, integration and a...
 
Study on Different Code-Clone Detection Techniques & Approaches to MitigateCo...
Study on Different Code-Clone Detection Techniques & Approaches to MitigateCo...Study on Different Code-Clone Detection Techniques & Approaches to MitigateCo...
Study on Different Code-Clone Detection Techniques & Approaches to MitigateCo...
 
Study on Different Code-Clone Detection Techniques & Approaches to MitigateCo...
Study on Different Code-Clone Detection Techniques & Approaches to MitigateCo...Study on Different Code-Clone Detection Techniques & Approaches to MitigateCo...
Study on Different Code-Clone Detection Techniques & Approaches to MitigateCo...
 
IRJET- Automatic Database Schema Generator
IRJET- Automatic Database Schema GeneratorIRJET- Automatic Database Schema Generator
IRJET- Automatic Database Schema Generator
 
On the Choice of Models of Computation for Writing Executable Specificatoins ...
On the Choice of Models of Computation for Writing Executable Specificatoins ...On the Choice of Models of Computation for Writing Executable Specificatoins ...
On the Choice of Models of Computation for Writing Executable Specificatoins ...
 
Syst biol 2012-burguiere-sysbio sys069
Syst biol 2012-burguiere-sysbio sys069Syst biol 2012-burguiere-sysbio sys069
Syst biol 2012-burguiere-sysbio sys069
 
Named Entity Recognition For Hindi-English code-mixed Twitter Text
Named Entity Recognition For Hindi-English code-mixed Twitter Text Named Entity Recognition For Hindi-English code-mixed Twitter Text
Named Entity Recognition For Hindi-English code-mixed Twitter Text
 
LEXICAL ANALYZER
LEXICAL ANALYZERLEXICAL ANALYZER
LEXICAL ANALYZER
 
Reviews on swarm intelligence algorithms for text document clustering
Reviews on swarm intelligence algorithms for text document clusteringReviews on swarm intelligence algorithms for text document clustering
Reviews on swarm intelligence algorithms for text document clustering
 
IRJET- A Novel Approch Automatically Categorizing Software Technologies
IRJET- A Novel Approch Automatically Categorizing Software TechnologiesIRJET- A Novel Approch Automatically Categorizing Software Technologies
IRJET- A Novel Approch Automatically Categorizing Software Technologies
 
AI and Web-Based Interactive College Enquiry Chatbot
AI and Web-Based Interactive College Enquiry ChatbotAI and Web-Based Interactive College Enquiry Chatbot
AI and Web-Based Interactive College Enquiry Chatbot
 
IRJET- Towards Efficient Framework for Semantic Query Search Engine in Large-...
IRJET- Towards Efficient Framework for Semantic Query Search Engine in Large-...IRJET- Towards Efficient Framework for Semantic Query Search Engine in Large-...
IRJET- Towards Efficient Framework for Semantic Query Search Engine in Large-...
 
Towards From Manual to Automatic Semantic Annotation: Based on Ontology Eleme...
Towards From Manual to Automatic Semantic Annotation: Based on Ontology Eleme...Towards From Manual to Automatic Semantic Annotation: Based on Ontology Eleme...
Towards From Manual to Automatic Semantic Annotation: Based on Ontology Eleme...
 
Distributed Tracing
Distributed TracingDistributed Tracing
Distributed Tracing
 
Cohesive Software Design
Cohesive Software DesignCohesive Software Design
Cohesive Software Design
 
GENERIC CODE CLONING METHOD FOR DETECTION OF CLONE CODE IN SOFTWARE DEVELOPMENT
GENERIC CODE CLONING METHOD FOR DETECTION OF CLONE CODE IN SOFTWARE DEVELOPMENT GENERIC CODE CLONING METHOD FOR DETECTION OF CLONE CODE IN SOFTWARE DEVELOPMENT
GENERIC CODE CLONING METHOD FOR DETECTION OF CLONE CODE IN SOFTWARE DEVELOPMENT
 
CytoScape
CytoScapeCytoScape
CytoScape
 
Introduction to Data Structure
Introduction to Data Structure Introduction to Data Structure
Introduction to Data Structure
 
IRJET - Voice based Natural Language Query Processing
IRJET -  	  Voice based Natural Language Query ProcessingIRJET -  	  Voice based Natural Language Query Processing
IRJET - Voice based Natural Language Query Processing
 

Algorithm

  • 1. 2007 IEEE/WIC/ACM International Conference on Intelligent Agent Technology Simple Algorithms for Representing Tag Frequencies in the SCOT Exporter Hak-Lae Kim Sung-Kwon Yang Digital Enterprise Research Institute Biomedical Knowledge Engineering Lab National University of Ireland, Galway Seoul National University IDA Business Park, Lower Dangan 28-22 Yeonkun-Dong, Chongno-Ku Galway, Ireland Seoul, Korea haklae.kim@deri.org sungkwon.yang@gmail.com John G. Breslin Hong-Gee Kim Digital Enterprise Research Institute Biomedical Knowledge Engineering Lab National University of Ireland, Galway Seoul National University IDA Business Park, Lower Dangan 28-22 Yeonkun-Dong, Chongno-Ku Galway, Ireland Seoul, Korea john.breslin@deri.org hgkim@snu.ac.kr Abstract because we do not know which type of frequency for- mat (i.e. absolute and relative frequency etc) to use. In this paper we describe the SCOT Exporter and its To solve the limitations, we need a semantically and algorithms to create instance data based on the SCOT statistically enriched model to be able to describe tag- (Social Semantic Cloud of Tags) ontology for sharing ging information. and reusing tag data. The algorithms use tag frequen- The SCOT (Social Semantic Cloud Of Tags)1 on- cies and co-occurrence relations to represent statisti- tology is an ontology for sharing and reusing tag data cal information via the SCOT ontology. We give an and representing social relations among individuals. It overview of the Exporter and the algorithms, and then provides the structure and semantics for describing re- discuss some experimental results. sources, tags, users and extended tag information such as tag frequency, tag co-occurrence frequency, and tag equivalence. 1. Introduction In this paper, we describe the SCOT Exporter and the algorithms for constructing data based on the Social tagging systems encourage user participation SCOT ontology. In particular, the contributions of this through easy-to-use free tagging tools and produce ag- paper are: gregations of user metadata through bottom-up con- • SCOT Exporters that can automatically construct sensus. However, these systems on the Web are not suf- SCOT instances from various data sources such as ficient to provide uniform way to define the semantics weblogs, relational databases, etc. The algorithms of the tag and to calculate its frequencies. In fact, folk- for the Exporters support an efficient way to gen- sonomies, statistically-weighted list of tags, have faced erate the necessary information for SCOT. a number of important linguistic issues (i.e. polysemy, synonymy, homonymy, plurals, and spelling variants, • The experimental evaluation demonstrates that etc.) and statistical issues (i.e. frequency, weighted the SCOT Exporter is scalable and generates value, etc.). In comparison with the linguistic issues, SCOT instances with good run time performance. the statistical issues have not been focused on at the The rest of the paper is organized as follows: Section moment. However, when we intend to integrate or ex- 2 describes the SCOT Exporters and its algorithms for change tag data across various applications, different folksonomies, or different users, it is a critical problem 1 http://scot-project.org 0-7695-3027-3/07 $25.00 © 2007 IEEE 536 542 544 DOI 10.1109/IAT.2007.101
  • 2. constructing tags and their co-occurrences. Section 3 frequency, the actual number of frequency, tells us how describes the experimental setting and results. Section a frequently tag is used in a give domain while a relative 4 discusses related work and Section 5 concludes the frequency of each tag means its proportion in the total paper. tag occurrence. In SCOT, this information is represented by the 2. SCOT Exporter scot:ownAFrequency and scot:ownRFrequency prop- erties which are subproperties of scot:AFrequency and scot:RFrequency respectively3 . We use the Tag We describe how data using the SCOT ontology can Frequency Computation (see Algorithm 1) to calculate be automatically generated from both client-side we- the frequency of each tag. blog engines and databases. Two types of exporters for creating the SCOT ontology are as follows: Algorithm 1 Tag Frequency Computation • Exporter for the WordPress 2 . This allows Require: tagList is a list of tags the production of SCOT ontology from a blog. var tagList ← empty list Its design is based on the assumption that cat- var totalTagFrequency ← 0 egories in WordPress are used as tags. This plugin is activated in the plugin menu on the for all item ∈ getUserItemList(user ) do WordPress administration panel and it requires for all tag ∈ getItemTagList(item) do no user configuration in order to work. The on- if tagList.exist(tag) then tology created by the plugin can be found at tag.AFrequency ← tag.AFrequency + 1 http://yourhost/scot/scot.rdf. else tagList.add(tag) • Exporter for database. This aims to create the end if SCOT ontology data from a large number of RSS totalTagFrequency ← totalTagFrequency + 1 feeds in a DBMS. This type of data set consists end for of multiple users from various blog systems, so it end for creates the ontology data per each individual user. for all tag ∈ tagList do tag.RFrequency ← tag.AFrequency / totalTagFre- Both types of the exporters have almost the same func- quency tionalities that basically create user profiles such as end for name, blog name, blog url etc. and extended tag in- formation such as name, frequency, co-occurrence, hi- erarchy between tags, equivalence of tags etc. Several functions are assumed: getUserItem- List(user) returns the item lists for the user from the 2.1. Tag Frequency Computation Algo- data source. getItemTagList(item) gets the tag lists for rithm the given item. tagList.exist(tag) returns true if the tag exists in the tagList. tagList.add(tag) inserts the tag into the tagList. A tag has a weighted frequency that is associated with or assigned to certain items. The frequency of an individual tag would reflect how popular it is. 2.2. Co-occurrence Frequency Computation Frequencies can be expressed as absolute frequencies Algorithm or relative frequencies [4]. The absolute frequencies are raw observations, that have not been normalized People are inclined to generate one more tags with respect to the base rates of the event in question. when they tag in resources. The meaning of the When we speak about the frequency of tags it usually tag becomes more specific when the tag is combined means the absolute frequency format. At the same with a set of tags. To simply define the term co- time, the ’relative frequency’ means a frequency that occurrence: if an item contains both the tags se- is expressed in relation to a sample size or rate. This manticweb and blog, these two tags are said to format is used to compare the occurrence of objects co-occur or have a first order co-occurrence. It in two or more groups. Accordingly, this format is can play an important role to reduce ’tag ambigu- used in tag clouds where the size of each tag represents ity’ in tagging systems. scot:cooccurAFrequency the proportion of the tags. Simply stated, an absolute 3 The former is intended to describe the absolute format and 2 http://scot-project.org/?page_id=12 the purpose of the latter is to represent the relative format 537 543 545
  • 3. and scot:cooccurRFrequency properties describe co- dication feeds (e.g. using the dc:subject attribute for occurrence as an absolute and relative value to the an RSS item). frequency amongst a set of tags. Co-occurrence frequencies will be computed by the Co-occurrence 3.2. Evaluation Results Frequency Computation, as shown in Algorithm 2. Note that getUserItemList(user) returns the item All the tests were run on an Intel T2600 2.16 GHz machine with 2 GB of RAM. The mean number of posts Algorithm 2 Co-occurrence Frequency Computation per blog for the data set is 96.6 and the mean number Require: cooccurList is a list of co-occurrences of tags per blog is 29.4. The mean number of frequency var cooccurList ← empty list of tag usage per blog is 97.4, therefore more than 30% var totalCooccurFrequency ← 0 of the tags are overlapping (i.e. 29.4/97.4). Table 1 for all item ∈ getUserItemList(user) do shows the top five tags. We can see that several tags itemT agList ← getItemTagList(item) are associated with the country Ireland (i.e. irishblogs if itemT agList.count ≥ 2 then and ireland etc). cooccur ← cooccurList.exist(itemT agList) if cooccur = null then Tag Ranking AF RF cooccur.AFrequency ← cooccur.AFrequency + irishblogs 2557 3.7% 1 general 1828 2.7% else ireland 1821 2.7% cooccur ← newCooccur(itemT agList) uncategorized 976 1.4% cooccurList.add(cooccur) music 714 1.0% end if . . . totalCooccurFrequency ← totalCooccurFre- . . . quency + 1 68785 100% end if end for Table 1. Top five tags for all cooccur ∈ cooccurList do cooccur.RFrequency ← cooccur.AFrequency / to- The co-occurring tags4 in Table 2 are created by cal- talCooccurFrequency culating tags which appeared together using Algorithm end for 2. We can see that the tags such as irishblogs, ireland, blogs, and irish in Table 1 are frequently used with list for the user from the data source. getItem- others. This means that frequently used tags are likely TagList(item) gets the tag list for the given item. cooc- to combine with other tags. In SCOT, only the first curList.exist(itemTagList) returns the tag set if there order co-occurrence that happened in real world usage exists a same cooccurring tag set in the cooccurList. without a conditional probability is described. It is not cooccurList.add(cooccur) inserts the co-occurring tag the main of the SCOT to describe a similarity among set into the cooccurList. tags or other types of term relationships as this ontol- ogy is intended to reflect the fact that tags happened in a given domain. 3. Evaluation The run time performance is a critical issue when creating and maintaining a SCOT data set, due to the 3.1. Datasets complex analysis of word relations and the frequent updates. To generate the SCOT data, we must carry Planet Journals is a blog aggregator website that out analyzes such as tag frequencies and co-occurrence collects country-specific blogs for residents or cit- relations among tags using the Algorithms 1-2 in or- izens of a particular country. The Ireland site der. Figure 1 shows the runtime performance for the (planet.journals.ie) has a collection of around 1322 given data. The number of tags+cooccurrences on the blogs, 119101 posts, and 13688 tags. The data set used Y axis is found from the sum of ’total frequency of tags’ here is taken from that website during the two years and ’total frequency of tag sets’. The time required de- period between February 2005 and October 2006, and pends on the amount of tags: the mean time for it is the software that powers the website (Planet PHP) also 1.1 seconds. Most cases in the data (under 200 tags) allows the aggregation of any tags that are used in the 4 gaeilge is the Irish word for gaelic and podchraoladh is pod- various blogs and that are passed through via RSS syn- casting. 538 544 546
  • 4. Cooccurrences AF RF resources using keyword. Therefore the core concepts gaeilge irishblogs podchraoladh 71 0.8% of the ontologies are Tagging, Tagger and Resource cork flickr ireland photography 36 0.4% to represent users, events, and resources respectively. blogs irishblogs 35 0.4% However, there is no way to describe frequency of tags ireland real news 32 0.4% in the ontologies. The SCOT ontology can be easily irishphotos photography 31 0.4% represent this information using properties of tag fre- . . . quency. . . . 11514 100% 5 Conclusion Table 2. Top five co-occurring tags The SCOT Exporter provides an efficient run time performance to construct SCOT instances. The algo- rithms for the exporter are simple and provide the com- are generated in less than 1 second. Although there plete set of information to describe metadata using the are several large tags in the data, it takes less than SCOT ontology. The exporters are implemented for 6 seconds to generate the corresponding SCOTs. We each site or application since there are many possible believe that the algorithms and the tool produces an kinds of social tagging sites for which exporters can be efficient performance in terms of time. developed. Our approach is a starting point to rep- resent the structure and the semantics of social tag- ging. We will provide further information through the project web site (http://scot-project.org). Acknowledgments This material is based upon works supported by the Science Foundation Ireland under Grant No. SFI/02/CE1/I131. References [1] R. Newman, Tag Ontology design, available Figure 1. Runtime performance at: http://www.holygoat.co.uk/projects/tags/ (viewed 16/08/2007), 2005. 4. Related Work [2] T. Gruber, Ontology of Folksonomy: A Mash-up of Apples and Oranges, First on-Line conference on Metadata and Semantics Research (MTSR’05). There are several efforts that try to represent the http://www.metadata-semantics.org/. concept of tagging, the operation of tagging, and the tags themselves. Newman [1] describes the relation- [3] Knerr, T. (2006),Tagging Ontology- Towards a ship between an agent, an arbitrary resource, and one Common Ontology for Folksonomies, available at: or more tags. In his ontology, there are three core con- http://code.google.com/p/tagont/ cepts such as Taggers, Tagging, and Tags to represent tagging activity. Gruber [2] describes the core idea of [4] Clare Harries, N. H., Are absolute frequencies, tagging that consists of object, tag, tagger, and source. relative frequencies, or both effective in reducing Knerr [3] describes the concept of tagging in the Tag- cognitive biases?, Journal of Behavioral Decision ging Ontology. Since his approach is based on the ideas Making 13, Issue 4, 431-444, 2000. from [1, 2], the core element of the ontology is Tagging. [5] Tagcommons project website: The ontology consists of time, user, domain, visibility, http://tagcommons.org tag, resource, and type. There are some other projects [5, 6] related to tag sharing and semantic tagging. [6] Tagora Project website: http://www.tagora- The approaches in the related work are focused on project.eu the tagging activities or events that people use to tag 539 545 547