An Empirical Study on the Risks of Using Off-the-Shelf
          Techniques for Processing Mailing List Data
                       Nicolas Bettenburg, Emad Shihab, Ahmed E. Hassan
                                                    Queen’s University, Canada




                                                                                 1
Development Repositories

SOURCE CODE    COMMUNICATION ARCHIVES    BUG DATABASES




                                      2
The Importance of Mailing List Archives


 • Email popular form of communication
 • Mailing lists to distribute messages
 • Messages contain valuable information
     • Discussions of source code
     • Development decisions
     • Error reports
     • User support requests

                                                  4
Mining the Mailing Lists of
23 Open-Source Projects


       • Summarizing developer mailing lists
       • Using off-the-shelf tools
       • Data from around 500,000 emails
       • Unexpected results from experiments



                                               5
catter   !! things !! info !! mlw !! bool !! palloc(bufsize); !! symlinks !! configuration
! Nov !! --- !! && !! NULL, !! reasonable. !! -r !! argv, !! Added !! reasons !! A1 !! http://w

!   > !!   break; !! get   !! SIGNATURE !! -----END !! != !! symlinks. !! command !!

! char !! 1F !!    file !!   postgres   !! Dec !! 43 !! DataDir, !! pg_hba.conf !! 69 !! + SetData

 ! -- !! 093E !! stuff !! -D !! switch !! NULL !! +extern !! recursion !! admin !! setting !! 5B
path/default.conf !! see !! overides !! http://xyzzy.dhs.org/~drew/ !! + /* !! (char !! -p !! sizeof(ch

"datadir"   !! running !! PGP !! (GNU/Linux) !! "hbaconfig" !! file. !! ((opt !! 2001 !!

! bits !! simple !! databases !!   */ !!   servers   !! multiple !! /* !! share !! "A:a:B:b:c:D:d:Fh:ik:lm:M

#include !! *) !! vendors !! E3 !! people !! + { !! 08:27:06 !! 3B !! 16 !! +# !! explic
malloc(strlen(DataDir)!! +++ !! !! v1.0.6 !! having !! 56 !! /etc. !! 613-389-5481 !! extern !! case
diff !! easier !! certs !! given !! { !!

                                                                                                           6
[Figure, continued: further tokens of the same kind overlaid on the list above; the overlapping layout is not recoverable.]
While mining Mailing Lists of
        23 Open-Source Projects


• Don’t treat mail archives as textual data
• Changing technologies
• Up to 98% of messages contain noise




 Additional processing and cleaning needed!
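
One way to read the first bullet is to load an archive with a mail-aware parser instead of splitting it as raw text. The sketch below uses Python's standard `mailbox` module on a tiny mbox file. The two messages, their addresses, and the file name `list.mbox` are made up for illustration; this is not the tooling used in the study.

```python
import mailbox
import os
import tempfile

# Hypothetical two-message archive in mbox format ("From " lines separate messages).
ARCHIVE = """\
From a@example.org Wed Jan 21 08:11:26 1998
From: a@example.org
Subject: first

body one

From b@example.org Wed Jan 21 09:00:00 1998
From: b@example.org
Subject: Re: first

body two
"""

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "list.mbox")
    with open(path, "w", newline="\n") as fh:
        fh.write(ARCHIVE)

    box = mailbox.mbox(path)
    # Each item is a parsed message, so headers are fields, not free text.
    subjects = [message["Subject"] for message in box]
    senders = [message["From"] for message in box]
    box.close()
```

Working on parsed messages keeps headers, bodies, and attachments apart before any text mining starts.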



                                              8
From geek+@cmu.edu Wed Jan 21 08:11:26 1998
Date: Mon, 27 Jan 1997 12:50:44 -0500 (EST)
From: "Brian E. Gallew" <geek+@cmu.edu>
Subject: Re: [HACKERS] configure

- ---559023410-851401618-854387445=:824
Content-Type: TEXT/PLAIN; CHARSET=US-ASCII

>      If you can grab a copy and run it on your machine, and send me
> the output, that would help alot.

Here is a gzip'ed tar of the results.



=====================================================================
| Please do not shoot at the thermonuclear weapons! -- Deacon       |
=====================================================================
| Finger geek@andrew.cmu.edu for my public key.                     |
=====================================================================

- ---559023410-851401618-854387445=:824
Content-Type: APPLICATION/x-gzip
Content-Transfer-Encoding: BASE64
Content-Description: m88k-dg-dgux5.4R3.10.tar.gz

H4sIAHDq7DICA+xba3vaSLLOV/MrepzsGHiQuNomOJ4MdnDMrC8csB17HQ8W
UgM9FpJWFxsmyX8/Vd0tIYGwye5kP+w5fp4EaHW9XV1dXbdu6bY1ZCNV1/Qx
ffWD/sql0k6tRl4RUq6IT1KWn4R/L5UI2ansVirVSrlWhpZqqVZ5RUqv/gN/
gedrLiGvHNvzRy71VvUbUYu6mvnqv+zvNbkYM48MmUkJfGrEG1PTJJ7uMscn

                                                                        9
Resolving Multiple Sender Identities


•   Participants send mail from different addresses
•   Up to 21% of addresses are aliases
•   Such aliases bias identity-based analyses
•   Manual inspection and correction tedious
•   No fully automated approach to resolve identities
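
A common first pass is to group addresses that appear with the same normalized display name. The sketch below shows that heuristic only. It is an assumption for illustration, not the procedure used in the study, and the names and addresses are invented. Since distinct people can share a name, the groups it produces still need manual review.

```python
from collections import defaultdict
from email.utils import parseaddr

def normalize_name(name: str) -> str:
    """Lower-case a display name, drop punctuation and single-letter initials."""
    cleaned = "".join(c if c.isalnum() or c.isspace() else " " for c in name.lower())
    return " ".join(word for word in cleaned.split() if len(word) > 1)

def group_aliases(from_headers):
    """Map each normalized display name to the set of addresses seen with it."""
    groups = defaultdict(set)
    for header in from_headers:
        name, addr = parseaddr(header)
        # Fall back to the address itself when no display name is given.
        key = normalize_name(name) or addr.lower()
        groups[key].add(addr.lower())
    return dict(groups)

# Hypothetical From: headers.
headers = [
    '"Jane Q. Doe" <jdoe@example.org>',
    "Jane Doe <Jane.Doe@example.com>",
    "someone@example.org",
]
aliases = group_aliases(headers)
```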




                                                        11
Reconstructing Discussion Threads

•   Mail stored sequentially in archives
•   Logical grouping: discussion topics
•   Required information erroneous or missing
•   Essential for social network and topic analysis

    Linear Sequence       Thread Hierarchy

          A               A
          B                 B
          C                   C
          D                 D
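
Where headers are intact, the hierarchy can be rebuilt from each message's identifier and the identifier it replies to (the standard `Message-ID` and `In-Reply-To` header fields). The sketch below assumes every message has already been reduced to such a pair. The function names are hypothetical, and reply cycles are not handled. A message whose parent is missing or unknown becomes a root, which is how erroneous headers fragment a thread.

```python
from collections import defaultdict

def build_threads(messages):
    """messages: list of (message_id, in_reply_to or None) in archive order.

    Returns (roots, children), where children maps a message id to its replies.
    """
    known = {mid for mid, _ in messages}
    children = defaultdict(list)
    roots = []
    for mid, parent in messages:
        if parent in known and parent != mid:
            children[parent].append(mid)
        else:
            # Missing or dangling reply reference: start a new thread here.
            roots.append(mid)
    return roots, dict(children)

def render(ids, children, depth=0):
    """Return the thread hierarchy as indented lines."""
    lines = []
    for mid in ids:
        lines.append("  " * depth + mid)
        lines.extend(render(children.get(mid, []), children, depth + 1))
    return lines

# The four messages from the diagram: B and D reply to A, C replies to B.
archive = [("A", None), ("B", "A"), ("C", "B"), ("D", "A")]
roots, children = build_threads(archive)
outline = render(roots, children)
```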




                                                                                    12
Attachments


•   MIME standard defines extensions to email
•   Binary data encoded as text
•   Around 10% of messages have attachments
•   Extract attachments and store separately
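
The last bullet can be sketched with Python's standard `email` package, which decodes MIME parts and separates the text body from attachments. The message below is a made-up example: the addresses, the file name `results.tar.gz`, and the five-byte payload are placeholders, not data from the study.

```python
import email
from email import policy

# Hypothetical multipart message; the base64 payload decodes to b"hello".
RAW = b"""\
From: dev@example.org
Subject: Re: configure
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="XYZ"

--XYZ
Content-Type: text/plain; charset=us-ascii

Here is a gzip'ed tar of the results.

--XYZ
Content-Type: application/x-gzip
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="results.tar.gz"

aGVsbG8=
--XYZ--
"""

def split_attachments(raw: bytes):
    """Return (plain-text body, [(filename, content type, decoded bytes), ...])."""
    msg = email.message_from_bytes(raw, policy=policy.default)
    body = msg.get_body(preferencelist=("plain",))
    text = body.get_content() if body is not None else ""
    attachments = [
        (part.get_filename(), part.get_content_type(), part.get_content())
        for part in msg.iter_attachments()
    ]
    return text, attachments

text, attachments = split_attachments(RAW)
```

The decoded bytes can then be written to separate files, leaving only `text` for text mining.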




                                               15
Quotes and Signatures
•   Duplicate information
•   Unrelated to actual message
•   Removing signatures is challenging
•   Quoted text may or may not be desirable
•   Signatures impact text mining approaches
•   No perfect method for signature removal
=====================================================================
| Please do not shoot at the thermonuclear weapons! -- Deacon       |
=====================================================================
| Finger geek@andrew.cmu.edu for my public key.                     |
=====================================================================
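
A simple line-based heuristic illustrates both the idea and its limits. The sketch below drops lines quoted with `>` and cuts the body at the first conventional `--` separator or long ruler line. Both patterns are assumptions chosen for this example, not the method evaluated in the paper. A ruler inside the message text would truncate it too early, which is one reason no such method is perfect.

```python
import re

SIG_DELIM = re.compile(r"^-- ?$")              # conventional signature separator
RULER = re.compile(r"^[=\-_*|]{10,}$")         # long ruler lines often frame signatures

def strip_quotes_and_signature(body: str, keep_quotes: bool = False) -> str:
    """Remove quoted lines (optionally) and everything from the signature on."""
    kept = []
    for line in body.splitlines():
        stripped = line.strip()
        if SIG_DELIM.match(line) or RULER.match(stripped):
            break  # assume the rest of the message is signature
        if not keep_quotes and stripped.startswith(">"):
            continue
        kept.append(line)
    return "\n".join(kept).strip()

body = """\
>      If you can grab a copy and run it on your machine, and send me
> the output, that would help alot.

Here is a gzip'ed tar of the results.

=====================================================================
| Please do not shoot at the thermonuclear weapons! -- Deacon       |
=====================================================================
"""
cleaned = strip_quotes_and_signature(body)
with_quotes = strip_quotes_and_signature(body, keep_quotes=True)
```

The `keep_quotes` flag reflects the point that quoted text may or may not be wanted, depending on the analysis.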
                                                                                                          18
More Risks presented
         in the Paper




                        19
(1) Mailing lists contain valuable
    information on a project.


(2) Data needs pre-processing before
    applying traditional tools.


(3) Manual data processing is often not
    feasible or requires much effort.


(4) Off-the-shelf tools were not designed
    to prepare data for mining.

                                           20

Más contenido relacionado

Destacado

Mining Development Repositories to Study the Impact of Collaboration on Softw...
Mining Development Repositories to Study the Impact of Collaboration on Softw...Mining Development Repositories to Study the Impact of Collaboration on Softw...
Mining Development Repositories to Study the Impact of Collaboration on Softw...Nicolas Bettenburg
 
Automatic Identification of Bug Introducing Changes
Automatic Identification of Bug Introducing ChangesAutomatic Identification of Bug Introducing Changes
Automatic Identification of Bug Introducing ChangesNicolas Bettenburg
 
Cloning Considered Harmful Considered Harmful
Cloning Considered Harmful Considered HarmfulCloning Considered Harmful Considered Harmful
Cloning Considered Harmful Considered HarmfulNicolas Bettenburg
 
Finding Paths in Large Spaces - A* and Hierarchical A*
Finding Paths in Large Spaces - A* and Hierarchical A*Finding Paths in Large Spaces - A* and Hierarchical A*
Finding Paths in Large Spaces - A* and Hierarchical A*Nicolas Bettenburg
 
Think Locally, Act Gobally - Improving Defect and Effort Prediction Models
Think Locally, Act Gobally - Improving Defect and Effort Prediction ModelsThink Locally, Act Gobally - Improving Defect and Effort Prediction Models
Think Locally, Act Gobally - Improving Defect and Effort Prediction ModelsNicolas Bettenburg
 
Ph.D. Dissertation - Studying the Impact of Developer Communication on the Qu...
Ph.D. Dissertation - Studying the Impact of Developer Communication on the Qu...Ph.D. Dissertation - Studying the Impact of Developer Communication on the Qu...
Ph.D. Dissertation - Studying the Impact of Developer Communication on the Qu...Nicolas Bettenburg
 
The Quality of Bug Reports in Eclipse ETX'07
The Quality of Bug Reports in Eclipse ETX'07The Quality of Bug Reports in Eclipse ETX'07
The Quality of Bug Reports in Eclipse ETX'07Nicolas Bettenburg
 
Duplicate Bug Reports Considered Harmful ... Really?
Duplicate Bug Reports Considered Harmful ... Really?Duplicate Bug Reports Considered Harmful ... Really?
Duplicate Bug Reports Considered Harmful ... Really?Nicolas Bettenburg
 
Computing Accuracy Precision And Recall
Computing Accuracy Precision And RecallComputing Accuracy Precision And Recall
Computing Accuracy Precision And RecallNicolas Bettenburg
 

Destacado (10)

Mining Development Repositories to Study the Impact of Collaboration on Softw...
Mining Development Repositories to Study the Impact of Collaboration on Softw...Mining Development Repositories to Study the Impact of Collaboration on Softw...
Mining Development Repositories to Study the Impact of Collaboration on Softw...
 
Automatic Identification of Bug Introducing Changes
Automatic Identification of Bug Introducing ChangesAutomatic Identification of Bug Introducing Changes
Automatic Identification of Bug Introducing Changes
 
Cloning Considered Harmful Considered Harmful
Cloning Considered Harmful Considered HarmfulCloning Considered Harmful Considered Harmful
Cloning Considered Harmful Considered Harmful
 
Finding Paths in Large Spaces - A* and Hierarchical A*
Finding Paths in Large Spaces - A* and Hierarchical A*Finding Paths in Large Spaces - A* and Hierarchical A*
Finding Paths in Large Spaces - A* and Hierarchical A*
 
Think Locally, Act Gobally - Improving Defect and Effort Prediction Models
Think Locally, Act Gobally - Improving Defect and Effort Prediction ModelsThink Locally, Act Gobally - Improving Defect and Effort Prediction Models
Think Locally, Act Gobally - Improving Defect and Effort Prediction Models
 
Ph.D. Dissertation - Studying the Impact of Developer Communication on the Qu...
Ph.D. Dissertation - Studying the Impact of Developer Communication on the Qu...Ph.D. Dissertation - Studying the Impact of Developer Communication on the Qu...
Ph.D. Dissertation - Studying the Impact of Developer Communication on the Qu...
 
The Quality of Bug Reports in Eclipse ETX'07
The Quality of Bug Reports in Eclipse ETX'07The Quality of Bug Reports in Eclipse ETX'07
The Quality of Bug Reports in Eclipse ETX'07
 
Duplicate Bug Reports Considered Harmful ... Really?
Duplicate Bug Reports Considered Harmful ... Really?Duplicate Bug Reports Considered Harmful ... Really?
Duplicate Bug Reports Considered Harmful ... Really?
 
Computing Accuracy Precision And Recall
Computing Accuracy Precision And RecallComputing Accuracy Precision And Recall
Computing Accuracy Precision And Recall
 
Fuzzy Logic in Smart Homes
Fuzzy Logic in Smart HomesFuzzy Logic in Smart Homes
Fuzzy Logic in Smart Homes
 

Similar a An Empirical Study on the Risks of Using Off-the-Shelf Techniques for Processing Mailing List Data

A DevOps Perspective: MongoDB & MMF
A DevOps Perspective: MongoDB & MMFA DevOps Perspective: MongoDB & MMF
A DevOps Perspective: MongoDB & MMFMapMyFitness
 
Crunching data with go: Tips, tricks, use-cases
Crunching data with go: Tips, tricks, use-casesCrunching data with go: Tips, tricks, use-cases
Crunching data with go: Tips, tricks, use-casesSergii Khomenko
 
20th.陈晓鸣 百度海量日志分析架构及处理经验分享
20th.陈晓鸣 百度海量日志分析架构及处理经验分享20th.陈晓鸣 百度海量日志分析架构及处理经验分享
20th.陈晓鸣 百度海量日志分析架构及处理经验分享elevenma
 
CODE BLUE 2014 : マイクロソフトの脆弱性調査 : ベンダーでありながら発見者となるために by デイヴィッド・シードマン David Se...
CODE BLUE 2014 : マイクロソフトの脆弱性調査 : ベンダーでありながら発見者となるために by デイヴィッド・シードマン David Se...CODE BLUE 2014 : マイクロソフトの脆弱性調査 : ベンダーでありながら発見者となるために by デイヴィッド・シードマン David Se...
CODE BLUE 2014 : マイクロソフトの脆弱性調査 : ベンダーでありながら発見者となるために by デイヴィッド・シードマン David Se...CODE BLUE
 
What happens when firefox crashes?
What happens when firefox crashes?What happens when firefox crashes?
What happens when firefox crashes?Erik Rose
 
About Multiblock Reads v4
About Multiblock Reads v4About Multiblock Reads v4
About Multiblock Reads v4Enkitec
 
Cloud flare jgc bigo meetup rolling hashes
Cloud flare jgc   bigo meetup rolling hashesCloud flare jgc   bigo meetup rolling hashes
Cloud flare jgc bigo meetup rolling hashesCloudflare
 
Дмитрий Щадей "Что помогает нам писать качественный JavaScript-код?"
Дмитрий Щадей "Что помогает нам писать качественный JavaScript-код?"Дмитрий Щадей "Что помогает нам писать качественный JavaScript-код?"
Дмитрий Щадей "Что помогает нам писать качественный JavaScript-код?"Yandex
 
Groovy 1.8 and 2.0 at GR8Conf Europe 2012
Groovy 1.8 and 2.0 at GR8Conf Europe 2012Groovy 1.8 and 2.0 at GR8Conf Europe 2012
Groovy 1.8 and 2.0 at GR8Conf Europe 2012Guillaume Laforge
 
An Introduction to Go
An Introduction to GoAn Introduction to Go
An Introduction to GoCloudflare
 
Spark after Dark by Chris Fregly of Databricks
Spark after Dark by Chris Fregly of DatabricksSpark after Dark by Chris Fregly of Databricks
Spark after Dark by Chris Fregly of DatabricksData Con LA
 
Spark After Dark - LA Apache Spark Users Group - Feb 2015
Spark After Dark - LA Apache Spark Users Group - Feb 2015Spark After Dark - LA Apache Spark Users Group - Feb 2015
Spark After Dark - LA Apache Spark Users Group - Feb 2015Chris Fregly
 
Unleash your inner console cowboy
Unleash your inner console cowboyUnleash your inner console cowboy
Unleash your inner console cowboyKenneth Geisshirt
 
London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015Chris Fregly
 
NGS Informatics and Interpretation - Hardware Considerations by Michael McManus
NGS Informatics and Interpretation - Hardware Considerations by Michael McManusNGS Informatics and Interpretation - Hardware Considerations by Michael McManus
NGS Informatics and Interpretation - Hardware Considerations by Michael McManusKnome_Inc
 

Similar a An Empirical Study on the Risks of Using Off-the-Shelf Techniques for Processing Mailing List Data (20)

A DevOps Perspective: MongoDB & MMF
A DevOps Perspective: MongoDB & MMFA DevOps Perspective: MongoDB & MMF
A DevOps Perspective: MongoDB & MMF
 
Crunching data with go: Tips, tricks, use-cases
Crunching data with go: Tips, tricks, use-casesCrunching data with go: Tips, tricks, use-cases
Crunching data with go: Tips, tricks, use-cases
 
20091203gemini
20091203gemini20091203gemini
20091203gemini
 
20th.陈晓鸣 百度海量日志分析架构及处理经验分享
20th.陈晓鸣 百度海量日志分析架构及处理经验分享20th.陈晓鸣 百度海量日志分析架构及处理经验分享
20th.陈晓鸣 百度海量日志分析架构及处理经验分享
 
CODE BLUE 2014 : マイクロソフトの脆弱性調査 : ベンダーでありながら発見者となるために by デイヴィッド・シードマン David Se...
CODE BLUE 2014 : マイクロソフトの脆弱性調査 : ベンダーでありながら発見者となるために by デイヴィッド・シードマン David Se...CODE BLUE 2014 : マイクロソフトの脆弱性調査 : ベンダーでありながら発見者となるために by デイヴィッド・シードマン David Se...
CODE BLUE 2014 : マイクロソフトの脆弱性調査 : ベンダーでありながら発見者となるために by デイヴィッド・シードマン David Se...
 
Brasil Ross 2011
Brasil Ross 2011Brasil Ross 2011
Brasil Ross 2011
 
What happens when firefox crashes?
What happens when firefox crashes?What happens when firefox crashes?
What happens when firefox crashes?
 
About Multiblock Reads v4
About Multiblock Reads v4About Multiblock Reads v4
About Multiblock Reads v4
 
Cloud flare jgc bigo meetup rolling hashes
Cloud flare jgc   bigo meetup rolling hashesCloud flare jgc   bigo meetup rolling hashes
Cloud flare jgc bigo meetup rolling hashes
 
Дмитрий Щадей "Что помогает нам писать качественный JavaScript-код?"
Дмитрий Щадей "Что помогает нам писать качественный JavaScript-код?"Дмитрий Щадей "Что помогает нам писать качественный JavaScript-код?"
Дмитрий Щадей "Что помогает нам писать качественный JavaScript-код?"
 
Groovy 1.8 and 2.0 at GR8Conf Europe 2012
Groovy 1.8 and 2.0 at GR8Conf Europe 2012Groovy 1.8 and 2.0 at GR8Conf Europe 2012
Groovy 1.8 and 2.0 at GR8Conf Europe 2012
 
An Introduction to Go
An Introduction to GoAn Introduction to Go
An Introduction to Go
 
20091203gemini
20091203gemini20091203gemini
20091203gemini
 
Spark after Dark by Chris Fregly of Databricks
Spark after Dark by Chris Fregly of DatabricksSpark after Dark by Chris Fregly of Databricks
Spark after Dark by Chris Fregly of Databricks
 
Spark After Dark - LA Apache Spark Users Group - Feb 2015
Spark After Dark - LA Apache Spark Users Group - Feb 2015Spark After Dark - LA Apache Spark Users Group - Feb 2015
Spark After Dark - LA Apache Spark Users Group - Feb 2015
 
Unleash your inner console cowboy
Unleash your inner console cowboyUnleash your inner console cowboy
Unleash your inner console cowboy
 
Hadoop london
Hadoop londonHadoop london
Hadoop london
 
London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015
 
NGS Informatics and Interpretation - Hardware Considerations by Michael McManus
NGS Informatics and Interpretation - Hardware Considerations by Michael McManusNGS Informatics and Interpretation - Hardware Considerations by Michael McManus
NGS Informatics and Interpretation - Hardware Considerations by Michael McManus
 
PowerDNS Webinar
PowerDNS Webinar PowerDNS Webinar
PowerDNS Webinar
 

Más de Nicolas Bettenburg

10 Year Impact Award Presentation - Duplicate Bug Reports Considered Harmful ...
10 Year Impact Award Presentation - Duplicate Bug Reports Considered Harmful ...10 Year Impact Award Presentation - Duplicate Bug Reports Considered Harmful ...
10 Year Impact Award Presentation - Duplicate Bug Reports Considered Harmful ...Nicolas Bettenburg
 
Using Fuzzy Code Search to Link Code Fragments in Discussions to Source Code
Using Fuzzy Code Search to Link Code Fragments in Discussions to Source CodeUsing Fuzzy Code Search to Link Code Fragments in Discussions to Source Code
Using Fuzzy Code Search to Link Code Fragments in Discussions to Source CodeNicolas Bettenburg
 
Managing Community Contributions: Lessons Learned from a Case Study on Andro...
Managing Community Contributions:  Lessons Learned from a Case Study on Andro...Managing Community Contributions:  Lessons Learned from a Case Study on Andro...
Managing Community Contributions: Lessons Learned from a Case Study on Andro...Nicolas Bettenburg
 
Predictors of Customer Perceived Quality
Predictors of Customer Perceived QualityPredictors of Customer Perceived Quality
Predictors of Customer Perceived QualityNicolas Bettenburg
 
Extracting Structural Information from Bug Reports.
Extracting Structural Information from Bug Reports.Extracting Structural Information from Bug Reports.
Extracting Structural Information from Bug Reports.Nicolas Bettenburg
 

Más de Nicolas Bettenburg (7)

10 Year Impact Award Presentation - Duplicate Bug Reports Considered Harmful ...
10 Year Impact Award Presentation - Duplicate Bug Reports Considered Harmful ...10 Year Impact Award Presentation - Duplicate Bug Reports Considered Harmful ...
10 Year Impact Award Presentation - Duplicate Bug Reports Considered Harmful ...
 
Using Fuzzy Code Search to Link Code Fragments in Discussions to Source Code
Using Fuzzy Code Search to Link Code Fragments in Discussions to Source CodeUsing Fuzzy Code Search to Link Code Fragments in Discussions to Source Code
Using Fuzzy Code Search to Link Code Fragments in Discussions to Source Code
 
Managing Community Contributions: Lessons Learned from a Case Study on Andro...
Managing Community Contributions:  Lessons Learned from a Case Study on Andro...Managing Community Contributions:  Lessons Learned from a Case Study on Andro...
Managing Community Contributions: Lessons Learned from a Case Study on Andro...
 
Approximation Algorithms
Approximation AlgorithmsApproximation Algorithms
Approximation Algorithms
 
Predictors of Customer Perceived Quality
Predictors of Customer Perceived QualityPredictors of Customer Perceived Quality
Predictors of Customer Perceived Quality
 
Extracting Structural Information from Bug Reports.
Extracting Structural Information from Bug Reports.Extracting Structural Information from Bug Reports.
Extracting Structural Information from Bug Reports.
 
Metropolis Instant Radiosity
Metropolis Instant RadiosityMetropolis Instant Radiosity
Metropolis Instant Radiosity
 

An Empirical Study on the Risks of Using Off-the-Shelf Techniques for Processing Mailing List Data

  • 1. An Empirical Study on the Risks of Using Off-the-Shelf Techniques for Processing Mailing List Data. Nicolas Bettenburg, Emad Shihab, Ahmed E. Hassan, Queen’s University, Canada
  • 2. Development Repositories: Source Code, Communication Archives, Bug Databases
  • 4. The Importance of Mailing List Archives • Email popular form of communication • Mailing lists to distribute messages • Messages contain valuable information • Discussions of source code • Development decisions • Error reports • User support requests
  • 5. Mining the Mailing Lists of 23 Open-Source Projects • Summarizing developer mailing lists • Using off-the-shelf tools • Data from around 500,000 emails • Unexpected results from experiments
  • 6. [Figure: cloud of tokens taken from the mailing list text, in which code fragments and markup such as "palloc(bufsize);", "+extern", "-----END", "SIGNATURE", "pg_hba.conf" and "http://xyzzy.dhs.org/~drew/" appear alongside ordinary words such as "things", "reasons" and "people"]
  • 7. [Figure: the token cloud from the previous slide overlaid with further tokens such as "PGDATA", "postgresql", "'/etc/pgsql/pg_hba.conf'", "Debian" and "packager"]
  • 8. While mining Mailing Lists of 23 Open-Source Projects • Don’t treat mail archives as textual data • Changing technologies • Up to 98% of messages contain noise
  • 9. While mining Mailing Lists of 23 Open-Source Projects • Don’t treat mail archives as textual data • Changing technologies • Up to 98% of messages contain noise • Additional processing and cleaning needed!
  • 10.
      From geek+@cmu.edu Wed Jan 21 08:11:26 1998
      Date: Mon, 27 Jan 1997 12:50:44 -0500 (EST)
      From: "Brian E. Gallew" <geek+@cmu.edu>
      Subject: Re: [HACKERS] configure
      - ---559023410-851401618-854387445=:824
      Content-Type: TEXT/PLAIN; CHARSET=US-ASCII
      > If you can grab a copy and run it on your machine, and send me
      > the output, that would help alot.
      Here is a gzip'ed tar of the results.
      =====================================================================
      | Please do not shoot at the thermonuclear weapons! -- Deacon |
      =====================================================================
      | Finger geek@andrew.cmu.edu for my public key. |
      =====================================================================
      - ---559023410-851401618-854387445=:824
      Content-Type: APPLICATION/x-gzip
      Content-Transfer-Encoding: BASE64
      Content-Description: m88k-dg-dgux5.4R3.10.tar.gz
      H4sIAHDq7DICA+xba3vaSLLOV/MrepzsGHiQuNomOJ4MdnDMrC8csB17HQ8W
      UgM9FpJWFxsmyX8/Vd0tIYGwye5kP+w5fp4EaHW9XV1dXbdu6bY1ZCNV1/Qx
      ffWD/sql0k6tRl4RUq6IT1KWn4R/L5UI2ansVirVSrlWhpZqqVZ5RUqv/gN/
      gedrLiGvHNvzRy71VvUbUYu6mvnqv+zvNbkYM48MmUkJfGrEG1PTJJ7uMscn
  • 12. Resolving Multiple Sender Identities • Participants send mail from different addresses • Up to 21% of addresses are aliases • Such aliases bias identity-based analyses • Manual inspection and correction tedious • No fully automated approach to resolve identities
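
The alias problem on slide 12 can be illustrated with a small sketch. It is not the method used in the study: it assumes, purely for illustration, that two addresses used with the same normalized display name are candidate aliases. The function names and the second and third sample headers are hypothetical. Because no fully automated approach resolves identities, the output is best read as a list of candidates for manual review.

```python
from collections import defaultdict
from email.utils import parseaddr


def normalize_name(name):
    """Lowercase a display name and drop punctuation and extra spaces."""
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(kept.split())


def candidate_aliases(from_headers):
    """Group addresses that share a normalized display name.

    Returns a dict mapping each display name to the set of addresses
    seen with it, keeping only names used with more than one address.
    This is a heuristic (an assumption of this sketch), so unrelated
    people with the same name would be grouped together.
    """
    by_name = defaultdict(set)
    for header in from_headers:
        name, address = parseaddr(header)
        key = normalize_name(name)
        if key and address:
            by_name[key].add(address.lower())
    return {name: addrs for name, addrs in by_name.items() if len(addrs) > 1}


# The first header comes from the example message in the slides;
# the other two are made up for illustration.
headers = [
    '"Brian E. Gallew" <geek+@cmu.edu>',
    "Brian E. Gallew <geek@andrew.cmu.edu>",
    "Someone Else <someone@example.org>",
]
print(candidate_aliases(headers))
```

Matching on display names alone misses aliases that use different names and can merge distinct people who share a name, which is one reason manual correction remains necessary.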
  • 13. Reconstructing Discussion Threads • Mail stored sequentially in archives • Logical grouping: discussion topics • Required information erroneous or missing • Essential for social network and topic analysis [Figure: messages A, B, C, D shown once as a linear sequence and once as a thread hierarchy]
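
As a rough illustration of turning the linear sequence into a thread hierarchy, the sketch below links each message to its parent through the standard `Message-ID`, `In-Reply-To` and `References` headers, using Python's `email` package. The function name and the sample messages are hypothetical. The slide notes that the required information is often erroneous or missing, so a header-only approach like this one leaves some replies stranded as roots.

```python
from collections import defaultdict
from email import message_from_string


def build_threads(raw_messages):
    """Map each parent Message-ID to the IDs of its direct replies.

    Messages without a usable In-Reply-To or References header are
    listed under the key None, i.e. treated as thread roots.
    """
    children = defaultdict(list)
    for raw in raw_messages:
        msg = message_from_string(raw)
        msg_id = (msg.get("Message-ID") or "").strip()
        parent = (msg.get("In-Reply-To") or "").strip()
        if not parent:
            # Fall back to the last entry of References, if present.
            refs = (msg.get("References") or "").split()
            parent = refs[-1] if refs else None
        children[parent].append(msg_id)
    return dict(children)


# Hypothetical messages A to D, stored in arrival order.
archive = [
    "Message-ID: <A@example.org>\n\nFirst post",
    "Message-ID: <B@example.org>\nIn-Reply-To: <A@example.org>\n\nReply to A",
    "Message-ID: <C@example.org>\nIn-Reply-To: <A@example.org>\n\nAnother reply to A",
    "Message-ID: <D@example.org>\nReferences: <A@example.org> <C@example.org>\n\nReply to C",
]
for parent, replies in build_threads(archive).items():
    print(parent, "->", replies)
```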
  • 16. Attachments • MIME standard defines extensions to email • Binary data encoded as text • Around 10% of messages have attachments • Extract attachments and store separately
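
The "extract attachments and store separately" step could look roughly like the following sketch, which uses Python's standard `email` package to walk the MIME parts of one message. The function name and the sample message are hypothetical, and treating every non-`text/plain` leaf part as an attachment is a simplifying assumption.

```python
from email import message_from_string


def split_attachments(raw_message):
    """Return (text_parts, attachments) for one raw message.

    text_parts is a list of decoded text/plain bodies; attachments is a
    list of (content_type, decoded_bytes) tuples for every other leaf part.
    """
    msg = message_from_string(raw_message)
    text_parts, attachments = [], []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers hold no payload of their own
        # decode=True undoes the base64 or quoted-printable encoding.
        payload = part.get_payload(decode=True) or b""
        if part.get_content_type() == "text/plain":
            charset = part.get_content_charset() or "ascii"
            text_parts.append(payload.decode(charset, errors="replace"))
        else:
            attachments.append((part.get_content_type(), payload))
    return text_parts, attachments


# Hypothetical two-part message: a short text body plus a base64 attachment.
sample = (
    "From: a@example.org\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    "Content-Type: text/plain; charset=us-ascii\n"
    "\n"
    "Here is the file.\n"
    "--XYZ\n"
    "Content-Type: application/octet-stream\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    "aGVsbG8=\n"
    "--XYZ--\n"
)
texts, files = split_attachments(sample)
print(texts)
print(files)
```

Only the text parts would then be passed on to text mining, while the decoded attachments can be written to separate storage.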
  • 19. Quotes and Signatures • Duplicate information • Unrelated to actual message • Removing signatures is challenging • Quoted text may or may not be desirable • Signatures impact text mining approaches • No perfect method for signature removal [Figure: the signature block from the example message]
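
To make the difficulty concrete, here is a minimal sketch of a common heuristic, not the approach taken in the paper. It assumes that quoted lines start with ">" and that a line consisting of "--" or "-- " opens the signature. The function name is hypothetical.

```python
def strip_quotes_and_signature(body):
    """Drop quoted lines and a trailing signature from a message body.

    Heuristics (both are assumptions, not rules every mailer follows):
    lines starting with '>' are quotes, and a line equal to '--' or '-- '
    starts the signature.
    """
    kept = []
    for line in body.splitlines():
        if line in ("--", "-- "):
            break  # everything after the delimiter is signature
        if line.lstrip().startswith(">"):
            continue  # quoted text from an earlier message
        kept.append(line)
    return "\n".join(kept).strip()


reply = "> If you can grab a copy\nHere are the results.\n-- \nJane Doe\n"
print(strip_quotes_and_signature(reply))
```

The example message in the slides shows why this is not enough: its signature is a box drawn with "=" characters and has no delimiter line, so the heuristic would keep it in the cleaned text.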
  • 20. More Risks Presented in the Paper
  • 21. (1) Mailing lists contain valuable information on a project. (2) Data needs pre-processing before applying traditional tools. (3) Manual data processing is often not feasible or requires much effort. (4) Off-the-shelf tools were not designed to prepare data for mining.