SlideShare una empresa de Scribd logo
1 de 19
Database Optimization for Large
Scale Web Access Log Management
Catalogue
•

Overview

•

Introduction

•

Performance Evaluation
•

Experiment Data

Related Concepts & Terms

•

Worst Case

Web Access Log Data

•

Log File

•

Hash

•

•

Experiment Environment

•

•

•

Bucket

•

•

•
•
•

•
JAY J. PATEL (UDIT, UNIVERSITY OF MUMBAI)

Run Time

•

Transaction Processing

Sorting Frame

•

Log Analysis Algorithm

Creation Of Statistical Information

•

Pre-Processor

Writing Data to File

•

System Analysis & Design

Data Input Time

•

Optimization Of Database Design
•

Experiment Result

Memory Usage

Conclusion
2
Overview
•

We will be talking about an optimal method of commercial DBMS for analysing largescale web access log. Also, we develop the pre-processor in memory structure to
improve performance and the user management tool for user friendliness.

•

We have three stages for this research. The first stage in research is data collection. In
this stage, we study characteristics of large-scale web access log data and investigate
commercial DBMS tuning techniques. The next stage, we design the system model. The
proposed system has three components including pre-processor, DBMS and
management tool. To improve the effectiveness the system, Pre-processor hash and
sort log data in memory. And we perform DBMS tuning to improve performance. The
final stage, we implement system and evaluate the performance. And we develop
system management tool for user friendliness.

JAY J. PATEL (UDIT, UNIVERSITY OF MUMBAI)

3
Introduction
•

Log generally collects some data for analysis, such as, click rates, type of users, how
many users access each web page, and time-of-use .

•

Commercial DBMS with high performance and high scalability is used by many
companies in the industry, such as Oracle, Microsoft, IBM, and etc.

•

In order to optimize the performance of DBMS with high performance, it needs to
widely and deeply understand OS, hardware system, DMBS, and application etc.
because the inappropriate optimization cause the performance degradation.

•

The main point of this study is about how to optimize DBMS to efficiently store and to
manage web access log data.

JAY J. PATEL (UDIT, UNIVERSITY OF MUMBAI)

4
Web Access Log Data

The system using only DB

JAY J. PATEL (UDIT, UNIVERSITY OF MUMBAI)

The proposal system using Pre-Processor

5
A. Hash
•

Hash is a value that is stored on a specific address using hash function. After a value is
processed by hash function, it is mapped to a hash table. A hash table is composed of a
bucket that store data and a hash address that corresponds to a key value. When a
hash address is formed by hash function, each different key might have the same hash
address, which is called Hash Collision.

•

ADVANTAGE : it’s possible to get specific data with only one access using a key .

B. Bucket
•

A value is mapped to a hash table. A hash table is composed of n buckets. Each bucket
has m slots and each slot can store only one value.

JAY J. PATEL (UDIT, UNIVERSITY OF MUMBAI)

6
JAY J. PATEL (UDIT, UNIVERSITY OF MUMBAI)

7
Optimization Of Database Design

The System Structure
JAY J. PATEL (UDIT, UNIVERSITY OF MUMBAI)

8
B. Pre-Processor:
•

Pre-Processor remove redundant data by reading the collected web log data and is the
role of preprocessing for storing data in Database. Pre-Processor transfers collected
data through log collection server and read data is stored in a hash table using a key. At
the moment, if redundant data about URL is in a hash table, then increase a count
value and make statistical information about collected URL.

JAY J. PATEL (UDIT, UNIVERSITY OF MUMBAI)

9
JAY J. PATEL (UDIT, UNIVERSITY OF MUMBAI)

10
C. System Analysis and Design:
•

With key/value pair, data is divided into a key and a value. Time and source IP are
stored into a key, on the other hand, other information and a count value are stored
into a value. Basically, we compare a key using key/value pair and increase a count
value if the key exits in a hash table. Because the hash table in this study uses 32-bit
hash value, collision probability is 1/4.284.967.296.

D. Transaction Processing:
•

Statistical information made by Pre-Processor has mostly a lot of URL that access
frequency is 1. In case of URL that access frequency is high is because it brings
information related to a web page without a direct request, such as, images,
advertisement, etc.

•

Transaction processing is a way to sort out some irrelevant data as one transaction is
considered all data related to a request page. Also, it processes log in n seconds, which
is considered a unit of one transaction. After processing, the processed log is split
among two results, such as, a result list and a filter list. The result list is a list without
filtering, that is, an aggregate of unfiltered URL in the whole log.

JAY J. PATEL (UDIT, UNIVERSITY OF MUMBAI)

11
Access Log

JAY J. PATEL (UDIT, UNIVERSITY OF MUMBAI)

12
Filter list & Result List

JAY J. PATEL (UDIT, UNIVERSITY OF MUMBAI)

Transaction Filter

13
Performance Evaluation
A.

Experiment Environment

B. Experiment Data
1. Zipf Distribution:
•

Zipf’s Law is that the frequency of a specific item is proportional to a priority item as
creating the optimized data for an experiment. That is, the frequency of n priority item
is that a proportion of some element is proportional to 1/priority with the law that the
frequency is 1/n.

JAY J. PATEL (UDIT, UNIVERSITY OF MUMBAI)

14
C. Worst Case:
•

Worst case (WC) is that collected web access log is not duplicated and there is no
constant distribution for web access. We expect the worst performance because all
collected data are stored in DB and data de-duplication is not carried out.

D. Log File:
•

A sample data for creating experiment data uses URL Snooper and is composed of
collected time, source IP, destination IP, and domain, URL. We set up to collect over a
million data in a server.

JAY J. PATEL (UDIT, UNIVERSITY OF MUMBAI)

15
EXPERIMENT RESULT
A.

•

Data Input Time

Zipf parameter 0.2, which includes the
most redundant data, takes 437.5520
seconds to store 376,934 logs to DB is

and the worst case, includes no redundant
data, takes 1044.4392 seconds.

Input Time

JAY J. PATEL (UDIT, UNIVERSITY OF MUMBAI)

16
B. Writing Data to a File

File Write Time

JAY J. PATEL (UDIT, UNIVERSITY OF MUMBAI)

C. Creation of Statistical Information:

Data input time to Hash Table

17
CONCLUSION:
•

We understand the architecture and the feature of DBMS through how to store, index
structure, and data storage structure of commercial DMBS. And, using the feature of DBMS,
we make the optimized DBMS and the tool for handling web access log data and providing
fundamental knowledge to develop software.

•

The input time from Pre-Processor to DB depends on how often de-duplication occurs in a
hash table. The memory usage is Pre-Processor uses the memory the twice time more than
DB when creating statistical information. However, run time is Pre-Processor, which operate
in-memory, is from 18 to 20 times better than DB

•

In summary, using Pre-Processor, works in-memory, than only using DB is expected to better
performance, run-time, and the result data than the traditional system. Furthermore, this
study is expected to be used from other area like network security.

JAY J. PATEL (UDIT, UNIVERSITY OF MUMBAI)

18
References
•

Database Optimization for Large Scale Web Access Log Management - Minjae Lee,
Jaehun Lee, Dohyung Kim, Taeil Kim, Sooyong Kang - Hanyang University {Dept. of
Electronics & Computer Engineering}

JAY J. PATEL (UDIT, UNIVERSITY OF MUMBAI)

19

Más contenido relacionado

La actualidad más candente

Education Data Warehouse System
Education Data Warehouse SystemEducation Data Warehouse System
Education Data Warehouse System
daniyalqureshi712
 
Indexing for Large DNA Database sequences
Indexing for Large DNA Database sequencesIndexing for Large DNA Database sequences
Indexing for Large DNA Database sequences
CSCJournals
 
Physical database design(database)
Physical database design(database)Physical database design(database)
Physical database design(database)
welcometofacebook
 

La actualidad más candente (13)

Comparison of data recovery techniques on master file table between Aho-Coras...
Comparison of data recovery techniques on master file table between Aho-Coras...Comparison of data recovery techniques on master file table between Aho-Coras...
Comparison of data recovery techniques on master file table between Aho-Coras...
 
The AMB Data Warehouse: A Case Study
The AMB Data Warehouse: A Case StudyThe AMB Data Warehouse: A Case Study
The AMB Data Warehouse: A Case Study
 
Education Data Warehouse System
Education Data Warehouse SystemEducation Data Warehouse System
Education Data Warehouse System
 
Database Migration Tool
Database Migration ToolDatabase Migration Tool
Database Migration Tool
 
A time efficient and accurate retrieval of range aggregate queries using fuzz...
A time efficient and accurate retrieval of range aggregate queries using fuzz...A time efficient and accurate retrieval of range aggregate queries using fuzz...
A time efficient and accurate retrieval of range aggregate queries using fuzz...
 
NOVEL FUNCTIONAL DEPENDENCY APPROACH FOR STORAGE SPACE OPTIMISATION IN GREEN ...
NOVEL FUNCTIONAL DEPENDENCY APPROACH FOR STORAGE SPACE OPTIMISATION IN GREEN ...NOVEL FUNCTIONAL DEPENDENCY APPROACH FOR STORAGE SPACE OPTIMISATION IN GREEN ...
NOVEL FUNCTIONAL DEPENDENCY APPROACH FOR STORAGE SPACE OPTIMISATION IN GREEN ...
 
A Survey on Different File Handling Mechanisms in HDFS
A Survey on Different File Handling Mechanisms in HDFSA Survey on Different File Handling Mechanisms in HDFS
A Survey on Different File Handling Mechanisms in HDFS
 
Abstract.DOCX
Abstract.DOCXAbstract.DOCX
Abstract.DOCX
 
Advanced database protocols
Advanced database protocolsAdvanced database protocols
Advanced database protocols
 
Indexing for Large DNA Database sequences
Indexing for Large DNA Database sequencesIndexing for Large DNA Database sequences
Indexing for Large DNA Database sequences
 
L017418893
L017418893L017418893
L017418893
 
Physical database design(database)
Physical database design(database)Physical database design(database)
Physical database design(database)
 
ADVANCE DATABASE MANAGEMENT SYSTEM CONCEPTS & ARCHITECTURE by vikas jagtap
ADVANCE DATABASE MANAGEMENT SYSTEM CONCEPTS & ARCHITECTURE by vikas jagtapADVANCE DATABASE MANAGEMENT SYSTEM CONCEPTS & ARCHITECTURE by vikas jagtap
ADVANCE DATABASE MANAGEMENT SYSTEM CONCEPTS & ARCHITECTURE by vikas jagtap
 

Destacado

Multi-Protocol Label Switching
Multi-Protocol Label SwitchingMulti-Protocol Label Switching
Multi-Protocol Label Switching
seanraz
 

Destacado (20)

Multi-Protocol Label Switching
Multi-Protocol Label SwitchingMulti-Protocol Label Switching
Multi-Protocol Label Switching
 
Weblogs
WeblogsWeblogs
Weblogs
 
Web Log Files
Web Log FilesWeb Log Files
Web Log Files
 
Web Proxy Log Analysis and Management 2007
Web Proxy Log Analysis and Management 2007Web Proxy Log Analysis and Management 2007
Web Proxy Log Analysis and Management 2007
 
Weblogs
Weblogs Weblogs
Weblogs
 
New Forms Of Communication: Harnessing Collective Knowledge through Web Logs
New Forms Of Communication: Harnessing Collective Knowledge through Web LogsNew Forms Of Communication: Harnessing Collective Knowledge through Web Logs
New Forms Of Communication: Harnessing Collective Knowledge through Web Logs
 
Clericalismo liderazgo del s.xxi y cambio de paradigmas parte 1
Clericalismo  liderazgo del s.xxi y cambio de paradigmas parte 1Clericalismo  liderazgo del s.xxi y cambio de paradigmas parte 1
Clericalismo liderazgo del s.xxi y cambio de paradigmas parte 1
 
Weblogs and the Public Sphere
Weblogs and the Public SphereWeblogs and the Public Sphere
Weblogs and the Public Sphere
 
MPLS Deployment Chapter 1 - Basic
MPLS Deployment Chapter 1 - BasicMPLS Deployment Chapter 1 - Basic
MPLS Deployment Chapter 1 - Basic
 
Mpls
MplsMpls
Mpls
 
Voice over MPLS
Voice over MPLSVoice over MPLS
Voice over MPLS
 
Multi-Protocol Label Switching
Multi-Protocol Label SwitchingMulti-Protocol Label Switching
Multi-Protocol Label Switching
 
Web log & clickstream
Web log & clickstream Web log & clickstream
Web log & clickstream
 
Multi-Protocol Label Switching: Basics and Applications
Multi-Protocol Label Switching: Basics and ApplicationsMulti-Protocol Label Switching: Basics and Applications
Multi-Protocol Label Switching: Basics and Applications
 
(WEB301) Operational Web Log Analysis | AWS re:Invent 2014
(WEB301) Operational Web Log Analysis | AWS re:Invent 2014(WEB301) Operational Web Log Analysis | AWS re:Invent 2014
(WEB301) Operational Web Log Analysis | AWS re:Invent 2014
 
MPLS
MPLSMPLS
MPLS
 
Clickstream Data Warehouse - Turning clicks into customers
Clickstream Data Warehouse - Turning clicks into customersClickstream Data Warehouse - Turning clicks into customers
Clickstream Data Warehouse - Turning clicks into customers
 
MPLS (Multi-Protocol Label Switching)
MPLS (Multi-Protocol Label Switching)MPLS (Multi-Protocol Label Switching)
MPLS (Multi-Protocol Label Switching)
 
MPLS Presentation
MPLS PresentationMPLS Presentation
MPLS Presentation
 
Blogging ppt
Blogging pptBlogging ppt
Blogging ppt
 

Similar a Web Access Log Management

BUILDING A DATA WAREHOUSE
BUILDING A DATA WAREHOUSEBUILDING A DATA WAREHOUSE
BUILDING A DATA WAREHOUSE
Neha Kapoor
 
Tips tricks to speed nw bi 2009
Tips tricks to speed  nw bi  2009Tips tricks to speed  nw bi  2009
Tips tricks to speed nw bi 2009
HawaDia
 

Similar a Web Access Log Management (20)

PHP/MySQL First Session Material
PHP/MySQL First Session MaterialPHP/MySQL First Session Material
PHP/MySQL First Session Material
 
F1803013034
F1803013034F1803013034
F1803013034
 
UNIT3 DBMS.pptx operation nd management of data base
UNIT3 DBMS.pptx operation nd management of data baseUNIT3 DBMS.pptx operation nd management of data base
UNIT3 DBMS.pptx operation nd management of data base
 
1-introduction to DB.pdf
1-introduction to DB.pdf1-introduction to DB.pdf
1-introduction to DB.pdf
 
BUILDING A DATA WAREHOUSE
BUILDING A DATA WAREHOUSEBUILDING A DATA WAREHOUSE
BUILDING A DATA WAREHOUSE
 
Advanced Database System
Advanced Database SystemAdvanced Database System
Advanced Database System
 
Tips tricks to speed nw bi 2009
Tips tricks to speed  nw bi  2009Tips tricks to speed  nw bi  2009
Tips tricks to speed nw bi 2009
 
Database performance management
Database performance managementDatabase performance management
Database performance management
 
Advance database system (part 2)
Advance database system (part 2)Advance database system (part 2)
Advance database system (part 2)
 
IRJET- Big Data Processes and Analysis using Hadoop Framework
IRJET- Big Data Processes and Analysis using Hadoop FrameworkIRJET- Big Data Processes and Analysis using Hadoop Framework
IRJET- Big Data Processes and Analysis using Hadoop Framework
 
Database part1-
Database part1-Database part1-
Database part1-
 
QUERY OPTIMIZATION IN OODBMS: IDENTIFYING SUBQUERY FOR COMPLEX QUERY MANAGEMENT
QUERY OPTIMIZATION IN OODBMS: IDENTIFYING SUBQUERY FOR COMPLEX QUERY MANAGEMENTQUERY OPTIMIZATION IN OODBMS: IDENTIFYING SUBQUERY FOR COMPLEX QUERY MANAGEMENT
QUERY OPTIMIZATION IN OODBMS: IDENTIFYING SUBQUERY FOR COMPLEX QUERY MANAGEMENT
 
A Study Review of Common Big Data Architecture for Small-Medium Enterprise
A Study Review of Common Big Data Architecture for Small-Medium EnterpriseA Study Review of Common Big Data Architecture for Small-Medium Enterprise
A Study Review of Common Big Data Architecture for Small-Medium Enterprise
 
Query optimization
Query optimizationQuery optimization
Query optimization
 
Unit 1: Introduction to DBMS Unit 1 Complete
Unit 1: Introduction to DBMS Unit 1 CompleteUnit 1: Introduction to DBMS Unit 1 Complete
Unit 1: Introduction to DBMS Unit 1 Complete
 
Comparison of In-memory Data Platforms
Comparison of In-memory Data PlatformsComparison of In-memory Data Platforms
Comparison of In-memory Data Platforms
 
Query processing
Query processingQuery processing
Query processing
 
Chapter 11 Enterprise Resource Planning System
Chapter 11 Enterprise Resource Planning SystemChapter 11 Enterprise Resource Planning System
Chapter 11 Enterprise Resource Planning System
 
Study on potential capabilities of a nodb system
Study on potential capabilities of a nodb systemStudy on potential capabilities of a nodb system
Study on potential capabilities of a nodb system
 
Overview of Data Base Systems Concepts and Architecture
Overview of Data Base Systems Concepts and ArchitectureOverview of Data Base Systems Concepts and Architecture
Overview of Data Base Systems Concepts and Architecture
 

Último

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 

Último (20)

Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 

Web Access Log Management

  • 1. Database Optimization for Large Scale Web Access Log Management
  • 2. Catalogue • Overview • Introduction • Performance Evaluation • Experiment Data Related Concepts & Terms • Worst Case Web Access Log Data • Log File • Hash • • Experiment Environment • • • Bucket • • • • • • JAY J. PATEL (UDIT, UNIVERSITY OF MUMBAI) Run Time • Transaction Processing Sorting Frame • Log Analysis Algorithm Creation Of Statistical Information • Pre-Processor Writing Data to File • System Analysis & Design Data Input Time • Optimization Of Database Design • Experiment Result Memory Usage Conclusion 2
  • 3. Overview • We will be talking about an optimal method of commercial DBMS for analysing largescale web access log. Also, we develop the pre-processor in memory structure to improve performance and the user management tool for user friendliness. • We have three stages for this research. The first stage in research is data collection. In this stage, we study characteristics of large-scale web access log data and investigate commercial DBMS tuning techniques. The next stage, we design the system model. The proposed system has three components including pre-processor, DBMS and management tool. To improve the effectiveness the system, Pre-processor hash and sort log data in memory. And we perform DBMS tuning to improve performance. The final stage, we implement system and evaluate the performance. And we develop system management tool for user friendliness. JAY J. PATEL (UDIT, UNIVERSITY OF MUMBAI) 3
  • 4. Introduction • Log generally collects some data for analysis, such as, click rates, type of users, how many users access each web page, and time-of-use . • Commercial DBMS with high performance and high scalability is used by many companies in the industry, such as Oracle, Microsoft, IBM, and etc. • In order to optimize the performance of DBMS with high performance, it needs to widely and deeply understand OS, hardware system, DMBS, and application etc. because the inappropriate optimization cause the performance degradation. • The main point of this study is about how to optimize DBMS to efficiently store and to manage web access log data. JAY J. PATEL (UDIT, UNIVERSITY OF MUMBAI) 4
  • 5. Web Access Log Data The system using only DB JAY J. PATEL (UDIT, UNIVERSITY OF MUMBAI) The proposal system using Pre-Processor 5
  • 6. A. Hash • Hash is a value that is stored on a specific address using hash function. After a value is processed by hash function, it is mapped to a hash table. A hash table is composed of a bucket that store data and a hash address that corresponds to a key value. When a hash address is formed by hash function, each different key might have the same hash address, which is called Hash Collision. • ADVANTAGE : it’s possible to get specific data with only one access using a key . B. Bucket • A value is mapped to a hash table. A hash table is composed of n buckets. Each bucket has m slots and each slot can store only one value. JAY J. PATEL (UDIT, UNIVERSITY OF MUMBAI) 6
  • 7. JAY J. PATEL (UDIT, UNIVERSITY OF MUMBAI) 7
  • 8. Optimization Of Database Design The System Structure JAY J. PATEL (UDIT, UNIVERSITY OF MUMBAI) 8
  • 9. B. Pre-Processor: • Pre-Processor remove redundant data by reading the collected web log data and is the role of preprocessing for storing data in Database. Pre-Processor transfers collected data through log collection server and read data is stored in a hash table using a key. At the moment, if redundant data about URL is in a hash table, then increase a count value and make statistical information about collected URL. JAY J. PATEL (UDIT, UNIVERSITY OF MUMBAI) 9
  • 10. JAY J. PATEL (UDIT, UNIVERSITY OF MUMBAI) 10
  • 11. C. System Analysis and Design: • With key/value pair, data is divided into a key and a value. Time and source IP are stored into a key, on the other hand, other information and a count value are stored into a value. Basically, we compare a key using key/value pair and increase a count value if the key exits in a hash table. Because the hash table in this study uses 32-bit hash value, collision probability is 1/4.284.967.296. D. Transaction Processing: • Statistical information made by Pre-Processor has mostly a lot of URL that access frequency is 1. In case of URL that access frequency is high is because it brings information related to a web page without a direct request, such as, images, advertisement, etc. • Transaction processing is a way to sort out some irrelevant data as one transaction is considered all data related to a request page. Also, it processes log in n seconds, which is considered a unit of one transaction. After processing, the processed log is split among two results, such as, a result list and a filter list. The result list is a list without filtering, that is, an aggregate of unfiltered URL in the whole log. JAY J. PATEL (UDIT, UNIVERSITY OF MUMBAI) 11
  • 12. Access Log JAY J. PATEL (UDIT, UNIVERSITY OF MUMBAI) 12
  • 13. Filter list & Result List JAY J. PATEL (UDIT, UNIVERSITY OF MUMBAI) Transaction Filter 13
  • 14. Performance Evaluation A. Experiment Environment B. Experiment Data 1. Zipf Distribution: • Zipf’s Law is that the frequency of a specific item is proportional to a priority item as creating the optimized data for an experiment. That is, the frequency of n priority item is that a proportion of some element is proportional to 1/priority with the law that the frequency is 1/n. JAY J. PATEL (UDIT, UNIVERSITY OF MUMBAI) 14
  • 15. C. Worst Case: • Worst case (WC) is that collected web access log is not duplicated and there is no constant distribution for web access. We expect the worst performance because all collected data are stored in DB and data de-duplication is not carried out. D. Log File: • A sample data for creating experiment data uses URL Snooper and is composed of collected time, source IP, destination IP, and domain, URL. We set up to collect over a million data in a server. JAY J. PATEL (UDIT, UNIVERSITY OF MUMBAI) 15
  • 16. EXPERIMENT RESULT A. • Data Input Time Zipf parameter 0.2, which includes the most redundant data, takes 437.5520 seconds to store 376,934 logs to DB is and the worst case, includes no redundant data, takes 1044.4392 seconds. Input Time JAY J. PATEL (UDIT, UNIVERSITY OF MUMBAI) 16
  • 17. B. Writing Data to a File File Write Time JAY J. PATEL (UDIT, UNIVERSITY OF MUMBAI) C. Creation of Statistical Information: Data input time to Hash Table 17
  • 18. CONCLUSION: • We understand the architecture and the feature of DBMS through how to store, index structure, and data storage structure of commercial DMBS. And, using the feature of DBMS, we make the optimized DBMS and the tool for handling web access log data and providing fundamental knowledge to develop software. • The input time from Pre-Processor to DB depends on how often de-duplication occurs in a hash table. The memory usage is Pre-Processor uses the memory the twice time more than DB when creating statistical information. However, run time is Pre-Processor, which operate in-memory, is from 18 to 20 times better than DB • In summary, using Pre-Processor, works in-memory, than only using DB is expected to better performance, run-time, and the result data than the traditional system. Furthermore, this study is expected to be used from other area like network security. JAY J. PATEL (UDIT, UNIVERSITY OF MUMBAI) 18
  • 19. References • Database Optimization for Large Scale Web Access Log Management - Minjae Lee, Jaehun Lee, Dohyung Kim, Taeil Kim, Sooyong Kang - Hanyang University {Dept. of Electronics & Computer Engineering} JAY J. PATEL (UDIT, UNIVERSITY OF MUMBAI) 19