Object storage optimization in Swift
Alexandre LECUYER
DevOps / irc: alecuyer
Romain LE DISEZ
DevOps / irc: rledisez
What’s the problem?
• Performance is bad
• Disks 100% busy
• Replication/reconstruction is very (very) slow
Replica in Swift
/srv/node/<device>/objects/<partition>/<suffix>/<hash>/<timestamp>.data
[Figure: the object "012345" stored as full copies, one per replica]
Erasure Coding in Swift
/srv/node/<device>/objects-1/<partition>/<suffix>/<hash>/<timestamp>#<fragment>#d.data
[Figure: the object "012345" split into the fragments "03", "14" and "25", plus an extra fragment "a9"]
Comparison
• Replica:
– Good performance
– High storage overhead
– 3 files per object (3 replicas)
• Erasure coding:
– Cost effective
– Slow-ish
– 15 files per object (12+3 fragments)
Where inodes join the party…
• XFS:
– one inode per file
– one inode per directory
• Inode:
– ctime/mtime/atime
– owner/group
– Permissions
Bad things happen
• One inode takes 300 bytes to 1 KB of memory
• Average: 2.4 inodes per fragment
– Data file: 1
– Object directory: 1
– Suffix directory + Partition directory: 0.4
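The figures on this slide can be turned into a rough memory estimate. The sketch below is plain arithmetic on the slide's numbers; the fragment count used in the demo (100 million per server) is a made-up illustration, not a figure from the talk, and "1k" is read as 1024 bytes.

```python
# Rough inode-cache sizing from the figures on this slide.
INODES_PER_FRAGMENT = 2.4   # 1 data file + 1 object dir + 0.4 suffix/partition dirs
INODE_BYTES_LOW = 300       # lower bound quoted on the slide
INODE_BYTES_HIGH = 1024     # upper bound ("1k"), read here as 1024 bytes


def inode_memory_bytes(fragments: int) -> tuple[float, float]:
    """Return (low, high) estimates of the memory needed to cache every inode."""
    inodes = fragments * INODES_PER_FRAGMENT
    return inodes * INODE_BYTES_LOW, inodes * INODE_BYTES_HIGH


if __name__ == "__main__":
    # Hypothetical server holding 100 million fragments (assumption, not from the talk).
    low, high = inode_memory_bytes(100_000_000)
    print(f"{low / 1e9:.0f} GB to {high / 1e9:.0f} GB of inode cache")
```

With numbers of that order, the estimate quickly exceeds the RAM of a typical storage server, which is the point the next slide makes.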
Memory issues
• Inodes no longer fit in cache
– But every inode along the path must be checked to open a data file
• Only top-level directories stay cached
– Only a 20% hit rate on the inode cache
– Up to 50% of device activity goes to reading inodes
Stability issues
• More filesystem corruption
• Inability to run xfs_repair
– It needs 1 KB of memory per inode
• Need dedicated servers just to repair filesystems
– About 48 hours to repair one filesystem
Let’s fix it!
(a.k.a. inodes are useless, right?)
We tried crazy things
• Storing objects in a K/V store (RocksDB, LevelDB, …)
– Not suited to synchronous I/O. Write amplification.
• Storing the file handles of data files in a K/V store
– No atomicity across two separate data structures
• Patching XFS to drop useless information
– It’s already well optimized; inodes may be compressed
• Storing in the ZFS DMU
– Lots of very cool features, but performance issues when full, and low-level development
Store multiple objects in large files
[Figure: a volume file laid out as Volume Header, Object Header, Object Data, Object Header, Object Data]
• Dedicated to a partition
• No concurrent writes
• Append only
Swift request path
[Figure: PUT / GET requests go through the proxy servers to the object servers]
How does Swift organize data?
• PUT: "photo.jpg" -> MD5 hash: bc6a624f493bf3042662064285f355c4
• Partition: bc6a -> 48234 (0xbc6a in decimal)
• Suffix: 5c4
• Timestamp: 1449519086.42102.data
• /srv/node/sda/objects/48234/5c4/bc6a624f493bf3042662064285f355c4/1449519086.42102.data
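The mapping from hash to path can be sketched in a few lines. This is a minimal illustration that takes the MD5 hex digest as input; it assumes a partition power of 16, which is what makes `bc6a` map to 48234 in the example, and the function name `object_path` is invented for this sketch. A real cluster hashes the full account/container/object path together with cluster-specific salts, which is not reproduced here.

```python
PART_POWER = 16  # assumption: 2**16 partitions, consistent with bc6a -> 48234


def object_path(hex_hash: str, timestamp: str, device: str = "sda") -> str:
    """Build the on-disk path of a replicated object from its MD5 hex digest."""
    # Partition: the top PART_POWER bits of the first 4 bytes of the hash.
    partition = int(hex_hash[:8], 16) >> (32 - PART_POWER)
    # Suffix: the last three hex characters of the hash.
    suffix = hex_hash[-3:]
    return (f"/srv/node/{device}/objects/{partition}/{suffix}/"
            f"{hex_hash}/{timestamp}.data")


if __name__ == "__main__":
    print(object_path("bc6a624f493bf3042662064285f355c4", "1449519086.42102"))
```

Every path component here is a directory or file with its own inode, which is why one small object costs several inodes.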
Example: writing an object
[Figure: a PUT goes from the proxy server to the object server, which works with the index server and the volumes]
• Obtain a write lock on a volume (fcntl)
• Write the object at the end of the volume
• Register the object with the index server
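The three write steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the 4-byte length header is a hypothetical stand-in for the real object header, a plain dict stands in for the index server, and `put_object` is a name invented here. It relies on `os.fdatasync`, so it targets Linux.

```python
import fcntl
import os
import struct
import tempfile

# Hypothetical object header: 4-byte big-endian length of the data that follows.
HEADER = struct.Struct(">I")


def put_object(volume_path: str, index: dict, key: str, data: bytes) -> int:
    """Append one object to a volume and register its offset in the index."""
    fd = os.open(volume_path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        # 1. Take an exclusive fcntl lock so only one writer appends at a time.
        fcntl.lockf(fd, fcntl.LOCK_EX)
        # 2. Write header + data at the current end of the volume.
        offset = os.lseek(fd, 0, os.SEEK_END)
        os.write(fd, HEADER.pack(len(data)) + data)
        os.fdatasync(fd)  # make the object durable before acknowledging it
        # 3. Register the object; a dict stands in for the index server.
        index[key] = (volume_path, offset)
        return offset
    finally:
        os.close(fd)  # closing the descriptor also releases the lock


if __name__ == "__main__":
    volume = os.path.join(tempfile.mkdtemp(), "volume-0001")
    index = {}
    put_object(volume, index, "first", b"hello")
    put_object(volume, index, "second", b"world!")
    print(index)
```

A production writer would also loop on short writes and handle lock contention; both are left out to keep the sketch short.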
Example: reading an object
[Figure: a GET goes from the proxy server to the object server, which works with the index server and the volumes]
• Get the object location from the index server
• Open the volume
• Read the object at the given offset
Index server
• Stores data in a key/value store: LevelDB
• Communication over gRPC
• Key: hash + filename
• Value: volume index + offset
• Keys are sorted on disk for efficient seeks
Index server – keys example
• ……
• bc6a46b909cf7a8e9529fac36f0669e31475194591.74265.data
• bc6a624f493bf3042662064285f355c41449519086.42102.data
• bc6b78b325b81b28fcfcdaef49dc87d11415965115.56792.data
• ……
What about directories?
• bc6a46b909cf7a8e9529fac36f0669e31475194591.74265.data
• bc6a624f493bf3042662064285f355c41449519086.42102.data
• bc6b78b325b81b28fcfcdaef49dc87d11415965115.56792.data
[Figure: the directory tree rebuilt from the sorted keys]
48234/9e3/bc6a46b.../1475194591.74265.data
48234/5c4/bc6a624.../1449519086.42102.data
48235/7d1/bc6b78b.../1415965115.56792.data
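Because the keys are sorted, directory listings can be emulated with prefix scans. The sketch below uses a sorted Python list and `bisect` in place of a LevelDB iterator; the function names are invented for illustration, and it assumes lowercase hex keys and a partition power of 16, so that a partition number is exactly the first four hex digits of the hash.

```python
import bisect

HASH_LEN = 32  # length of an MD5 hex digest


def keys_with_prefix(sorted_keys: list, prefix: str) -> list:
    """Return all keys starting with prefix, using binary search on sorted keys."""
    start = bisect.bisect_left(sorted_keys, prefix)
    # "g" sorts after every hex digit and decimal digit, so it bounds the range.
    end = bisect.bisect_left(sorted_keys, prefix + "g")
    return sorted_keys[start:end]


def list_partition(sorted_keys: list, partition: int) -> list:
    """Emulate listdir on a partition: return its suffix directories."""
    prefix = format(partition, "04x")  # assumption: partition power of 16
    matches = keys_with_prefix(sorted_keys, prefix)
    return sorted({key[HASH_LEN - 3:HASH_LEN] for key in matches})


def list_object_dir(sorted_keys: list, hex_hash: str) -> list:
    """Emulate listdir on a hash directory: return the file names it holds."""
    return [key[HASH_LEN:] for key in keys_with_prefix(sorted_keys, hex_hash)]


if __name__ == "__main__":
    keys = sorted([
        "bc6a46b909cf7a8e9529fac36f0669e31475194591.74265.data",
        "bc6a624f493bf3042662064285f355c41449519086.42102.data",
        "bc6b78b325b81b28fcfcdaef49dc87d11415965115.56792.data",
    ])
    print(list_partition(keys, 48234))
    print(list_object_dir(keys, "bc6a624f493bf3042662064285f355c4"))
```

No directory inode is touched at any point; the listing comes entirely from the key order.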
Deletion - Hole punching
[Figure: sparse file illustration, https://en.wikipedia.org/wiki/Sparse_file#/media/File:Sparse_file_(en).svg]
Deletion
• Hole punching with fallocate()
• Reclaims space without changing the file size!
[Figure: a volume (Volume Header, Object Header, Object Data, Object Header, Object Data) in which a deleted object's range is marked "Space reclaimed by the filesystem"]
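Python's standard library has no wrapper for hole punching, so the sketch below calls the C library's `fallocate()` through `ctypes`. It is a minimal illustration that assumes a 64-bit Linux system and a filesystem that supports `FALLOC_FL_PUNCH_HOLE`; the helper name `punch_hole` is invented here, and it returns False when the call is unavailable or fails.

```python
import ctypes
import ctypes.util
import os
import tempfile

# Flag values from <linux/falloc.h>.
FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02


def punch_hole(fd: int, offset: int, length: int) -> bool:
    """Deallocate [offset, offset + length) in fd; return False if unsupported."""
    libc_name = ctypes.util.find_library("c")
    if libc_name is None:
        return False
    libc = ctypes.CDLL(libc_name, use_errno=True)
    fallocate = getattr(libc, "fallocate", None)
    if fallocate is None:
        return False
    # Assumes a 64-bit Linux where off_t is 64 bits wide.
    fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                          ctypes.c_int64, ctypes.c_int64]
    fallocate.restype = ctypes.c_int
    mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE
    return fallocate(fd, mode, offset, length) == 0


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as volume:
        volume.write(b"A" * 4096 + b"B" * 4096 + b"C" * 4096)
        volume.flush()
        punched = punch_hole(volume.fileno(), 4096, 4096)
        print("punched:", punched, "size:", os.path.getsize(volume.name))
```

The file size stays the same, so the offsets of the surviving objects recorded in the index remain valid after a deletion.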
Implementation overview
[Figure: the patched Swift code (diskfile.py) uses the vfile.py module, which talks over gRPC to the index server, with LevelDB as the backing key-value store]
vfile.py
• Provides a file-like interface
• f = vfile.open("/path/to/file")
• f.read()
• vfile.listdir("/srv/node/<disk>/<partition>/")
Managing fragmentation
Dedicated volumes for short-lived files
[Figure: separate groups of volumes for ".data" files and ".ts" (tombstone) files]
Write performance
• We cannot afford two synchronous writes
• The large file write is synchronous (fdatasync)
• The large file is preallocated
• K/V writes are asynchronous
Recovery
• Scan the volumes backwards
• Add missing information to the key/value store
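A backward scan needs some way to find where the last record starts. The sketch below is one possible approach under loudly stated assumptions: the record layout (a header with key and data lengths, plus a trailer holding the total record length) is hypothetical and not the format used by the authors, the index is a plain dict, and the scan stops at the first record already present in the index on the assumption that everything older was registered.

```python
import io
import struct

HEADER = struct.Struct(">II")  # hypothetical: key length, data length
TRAILER = struct.Struct(">I")  # hypothetical: total record length, for backward scans


def append_record(volume, key: bytes, data: bytes) -> int:
    """Append one record to the volume and return its offset."""
    offset = volume.seek(0, io.SEEK_END)
    body = HEADER.pack(len(key), len(data)) + key + data
    volume.write(body + TRAILER.pack(len(body) + TRAILER.size))
    return offset


def recover(volume, index: dict) -> int:
    """Walk the volume from its end and register records missing from the index."""
    end = volume.seek(0, io.SEEK_END)
    added = 0
    while end > 0:
        # The trailer of the last unvisited record sits just before `end`.
        volume.seek(end - TRAILER.size)
        (record_len,) = TRAILER.unpack(volume.read(TRAILER.size))
        start = end - record_len
        volume.seek(start)
        key_len, _data_len = HEADER.unpack(volume.read(HEADER.size))
        key = volume.read(key_len)
        if key in index:
            break  # assumption: everything older was already registered
        index[key] = start
        added += 1
        end = start
    return added


if __name__ == "__main__":
    volume = io.BytesIO()
    index = {b"obj-1": append_record(volume, b"obj-1", b"payload")}
    append_record(volume, b"obj-2", b"written but never registered")
    print(recover(volume, index), index)
```

The sketch does not handle a truncated or corrupt tail, which a real recovery pass would have to detect before trusting the trailer.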
How does it perform?
• Bytes per object in the K/V: 42
• Latency: slightly worse when empty, much better when full
• REPLICATE: served from memory
• Saved space
• Room for improvement
Benchmarks
• PUT, single thread
– XFS: 17/s
– Volumes: 40/s
• PUT, 20 threads
– XFS: 4.7 s (99%)
– Volumes: 615 ms (99%)
• GET
– XFS: 39/s
– Volumes: 93/s
What’s next
• Upstream
• Store short-lived objects in dedicated volumes
• Replication of volumes
• Choose replica/erasure-coding on the fly
Credits
• Haystack (Facebook project)
• Openstack Swift community
Thank you
Metadata storage
• (extra slide if time)
• Previously stored as extended attributes
• Now serialized with protobuf and stored in the volume

Openstack Swift - Lots of small files

  • 1. Object storage optimization in Swift Alexandre LECUYER DevOps / irc: alecuyer Romain LE DISEZ DevOps / irc: rledisez
  • 2. What’s the problem? • Performance is bad • Disks 100% busy • Replication/reconstruction is very (very) slow 2
  • 4. Erasure Coding in Swift 4 012345 03 14 25 /srv/node/<device>/objects-1/<partition>/<suffix>/<hash>/<timestamp>#<fragment>#d.data a9
  • 5. Comparison • Replica: – Performance – Overhead – 3 files per object (3 replicas) • Erasure coding – Cost effective – Slow-ish – 15 files per object (12+3 fragments) 5
  • 6. Where inodes join the party… • XFS: – one inode per file – one inode per directory • Inode: – ctime/mtime/atime – owner/group – Permissions 6
  • 7. Bad things happen • One inode takes 300 bytes to 1k of memory • Average: 2.4 inodes per fragment – Data file: 1 – Object directory: 1 – Suffix directory + Partition directory: 0.4 7
  • 8. Memory issues • Inodes cannot fit in cache anymore – But every inode of the path must be checked to open a data file • Only top level directories are cached – Only 20% of hit on inode cache – Up to 50% of devices activity to read inodes 8
  • 9. Stability issues • More filesystem corruptions • Inability to run xfs_repair – 1K of memory per inode • Need a dedicated servers just to repair filesystems – About 48 hours to repair one filesystem 9
  • 10. Let’s fix it! (a.k.a. inodes are useless, right?) 10
  • 11. We tried crazy things • Storing objects in a K/V (RocksDB, LevelDB, …) – Not suited to synchronous IO. Write amplification. • Storing in a K/V the file handle of datafiles – Atomicity on two separate data structures • Patching XFS to drop useless information – It’s already well optimized, inodes may be compressed • Storing in ZFS DMU – Lots of very cool features, but performance issues if full, low level development 11
  • 12. 12 Object Header Volume Header Object Data Object Header Object Data Store multiple objects in large files
  • 13. 13 Object Header Volume Header Object Data Object Header Object Data Dedicated to a partition No concurrent writes Append only
  • 14. Swift request path 14 Proxy server Proxy server Object server Object server Object server PUT / GET requests
  • 15. How does Swift organize data ? • PUT: « photo.jpg » -> MD5 hash: bc6a624f493bf3042662064285f355c4 • Partition : bc6a -> 48234 • Suffix : 5c4 • Timestamp : 1449519086.42102.data • /srv/node/sda/objects/48234/5c4/bc6a624f493b f3042662064285f355c4/1449519086.42102.data 15
  • 16. Example : writing an object 16 Proxy server Object server Index server Volume Volume Volume Obtain a write lock on a volume (fcntl) Write the object at the end of the volume Register the objectPUT
  • 17. Example : reading an object 17 Proxy server Object server Index server Volume Volume Volume Open the volume Read the object at the given offset Get object locationGET
  • 18. Index server • Stores data in a key/value store: LevelDB • Communication with gRPC • Key: hash + filename • Value: volume index + offset • Keys are sorted on-disk for efficient seeks
  • 19. Index server – keys example • … • bc6a46b909cf7a8e9529fac36f0669e31475194591.74265.data • bc6a624f493bf3042662064285f355c41449519086.42102.data • bc6b78b325b81b28fcfcdaef49dc87d11415965115.56792.data • …
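
The key layout on slides 18 and 19 (32-character hash immediately followed by the file name) can be sketched like this. The key format follows the slides; the binary encoding of the value is purely hypothetical, chosen here as two big-endian integers.

```python
import struct

VALUE = struct.Struct(">IQ")  # hypothetical: volume index, offset


def make_key(obj_hash, filename):
    """Key: the 32-character object hash followed by the file name."""
    return (obj_hash + filename).encode("ascii")


def split_key(key):
    """Recover (hash, filename) from a key."""
    text = key.decode("ascii")
    return text[:32], text[32:]


def make_value(volume_index, offset):
    """Value: where the object lives."""
    return VALUE.pack(volume_index, offset)


key = make_key("bc6a624f493bf3042662064285f355c4", "1449519086.42102.data")
value = make_value(3, 1048576)
print(key)
print(split_key(key), VALUE.unpack(value))
```

Since the hash comes first, keys of objects in the same partition are adjacent in the sorted store.
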
  • 20. What about directories? • bc6a46b909cf7a8e9529fac36f0669e31475194591.74265.data • bc6a624f493bf3042662064285f355c41449519086.42102.data • bc6b78b325b81b28fcfcdaef49dc87d11415965115.56792.data (diagram: the keys map onto the directory tree) • 48234/9e3/bc6a46b.../1475194591.74265.data • 48234/5c4/bc6a624.../1449519086.42102.data • 48235/7d1/bc6b78b…/1415965115.56792.data
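
Directory listings can be emulated on top of the sorted keys: seek to the first possible key of a partition, then scan until the partition changes. The sketch below uses a sorted Python list and `bisect` in place of a LevelDB iterator, and assumes a partition power of 16 (consistent with "bc6a -> 48234"). The function names are hypothetical.

```python
import bisect

PART_POWER = 16  # assumption: matches "bc6a -> 48234" on slide 15

# Sorted keys, in the order a key/value store would iterate them.
KEYS = sorted([
    "bc6a46b909cf7a8e9529fac36f0669e31475194591.74265.data",
    "bc6a624f493bf3042662064285f355c41449519086.42102.data",
    "bc6b78b325b81b28fcfcdaef49dc87d11415965115.56792.data",
])


def partition_of(key, part_power=PART_POWER):
    """Partition number encoded in the first 32 bits of the hash."""
    return int(key[:8], 16) >> (32 - part_power)


def list_partition(keys, partition, part_power=PART_POWER):
    """Seek to the partition's first possible key, scan until the next one."""
    start = "%08x" % (partition << (32 - part_power))
    i = bisect.bisect_left(keys, start)
    found = []
    while i < len(keys) and partition_of(keys[i], part_power) == partition:
        found.append(keys[i])
        i += 1
    return found


def list_suffixes(keys, partition):
    """A suffix 'directory' is the last 3 hex characters of each hash."""
    return sorted({key[:32][-3:] for key in list_partition(keys, partition)})


print(list_suffixes(KEYS, 48234))
print(list_suffixes(KEYS, 48235))
```

The listing costs a scan over the partition's keys instead of a directory read, which is the CPU-for-memory trade mentioned in the speaker notes.
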
  • 21. Deletion - Hole punching (figure: https://en.wikipedia.org/wiki/Sparse_file#/media/File:Sparse_file_(en).svg)
  • 22. Deletion • Hole-punching with fallocate() • Reclaim space without changing the file size! (diagram: volume layout with a deleted object's range marked "Space reclaimed by the filesystem")
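
Hole punching as described on slide 22 can be tried from Python through `ctypes`, since the standard library does not expose the punch-hole mode of `fallocate()`. This sketch assumes 64-bit Linux with glibc and a filesystem that supports `FALLOC_FL_PUNCH_HOLE`; the helper name `punch_hole` is hypothetical, and it returns False rather than raising when the filesystem refuses.

```python
import ctypes
import os
import tempfile

FALLOC_FL_KEEP_SIZE = 0x01   # from <linux/falloc.h>
FALLOC_FL_PUNCH_HOLE = 0x02  # from <linux/falloc.h>

_libc = ctypes.CDLL(None, use_errno=True)
_libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                            ctypes.c_int64, ctypes.c_int64]
_libc.fallocate.restype = ctypes.c_int


def punch_hole(fd, offset, length):
    """Deallocate a byte range; returns False if the filesystem refuses."""
    mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE
    return _libc.fallocate(fd, mode, offset, length) == 0


path = os.path.join(tempfile.mkdtemp(), "volume_0001")
with open(path, "wb") as f:
    f.write(b"A" * 8192 + b"B" * 8192)  # two "objects"

fd = os.open(path, os.O_RDWR)
punched = punch_hole(fd, 0, 8192)       # delete the first object
size_after = os.fstat(fd).st_size       # unchanged thanks to KEEP_SIZE
first_bytes = os.pread(fd, 4, 0)        # a punched range reads back as zeros
second_object = os.pread(fd, 4, 8192)   # the neighbour is untouched
os.close(fd)
print(punched, size_after, first_bytes, second_object)
```

Offsets of the remaining objects do not move, so the index entries for them stay valid after a deletion.
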
  • 23. Implementation overview (diagram) • Swift code, patched: diskfile.py • vfile.py module • gRPC • Index server, with LevelDB as the backing key-value store
  • 24. vfile.py • Provides a file-like interface • f = vfile.open("/path/to/file") • f.read() • vfile.listdir("/srv/node/<disk>/<partition>/")
  • 25. Managing fragmentation • Dedicated volumes for short-lived files (diagram: separate groups of volumes for ".data" files and ".ts" files)
  • 26. Write performance • We cannot afford two synchronous writes • The large file write is synchronous (fdatasync) • The large file is preallocated • K/V writes are asynchronous
  • 27. Recovery • Scan the volumes backwards • Add missing information to the key/value store
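
Because K/V writes are asynchronous, a crash can leave objects in a volume that the index does not know about. The sketch below illustrates a backward scan under assumptions that are not in the slides: a hypothetical record format with a length trailer (so records can be walked from the end), an in-memory `bytes` object as the volume, and a stop condition that relies on index updates being applied in write order.

```python
import struct

HEADER = struct.Struct(">HI")   # hypothetical: key length, data length
TRAILER = struct.Struct(">I")   # hypothetical: total record length


def pack_record(key, data):
    """Serialize one object as header + key + data + length trailer."""
    body = HEADER.pack(len(key), len(data)) + key + data
    return body + TRAILER.pack(len(body) + TRAILER.size)


def recover(volume, index):
    """Walk records from the end, adding keys missing from the index.

    Stops at the first record the index already knows about.
    """
    end = len(volume)
    recovered = []
    while end > 0:
        (total,) = TRAILER.unpack_from(volume, end - TRAILER.size)
        start = end - total
        key_len, _data_len = HEADER.unpack_from(volume, start)
        key = volume[start + HEADER.size:start + HEADER.size + key_len]
        if key in index:
            break
        index[key] = start
        recovered.append(key)
        end = start
    return recovered


volume = (pack_record(b"k1", b"aaaa") + pack_record(b"k2", b"bb")
          + pack_record(b"k3", b"c"))
index = {b"k1": 0}  # the entries for k2 and k3 were lost in the crash
recovered = recover(volume, index)
print(recovered, index)
```

Scanning from the end means recovery only touches the tail of each volume, where the unregistered objects are.
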
  • 28. How does it perform? • Bytes per object in K/V: 42 bytes • Latency: slightly worse when empty, much better when full • REPLICATE: served from memory • Saved space • Room for improvement
  • 29. Benchmarks • PUT single thread – XFS: 17/s – Volumes: 40/s • PUT 20 threads – XFS: 4.7s (99%) – Volumes: 615ms (99%) • GET – XFS: 39/s – Volumes: 93/s
  • 30. What’s next • Upstream • Store short-lived objects in dedicated volumes • Replication of volumes • Choose replica/erasure-coding on the fly
  • 31. Credits • Haystack (Facebook project) • OpenStack Swift community
  • 33. Metadata storage • (extra slide if time) • Previously stored as extended attributes • Now serialized with protobuf and stored in the volume

Editor's notes

  1. I’m going to talk about optimization work carried out on OpenStack Swift. OVH operates several Swift clusters, known commercially as Hubic and PCS. Our customers tend to store huge numbers of small files on these infrastructures, particularly on Hubic. Look at the audience (computer between me and the audience). Don’t repeat too much (replica / EC). Explain vfile = file, on the implementation slide. Discuss afterwards at the booth.
  2. This is really the case on hubic. No problem on PCS, because there are more spindles
  3. I’m going to quickly recap some differences between replication and erasure coding in Swift. In a replica policy, each object is written several times, on different devices. The usual replication factor is 3, but this is configurable. The durability of the object depends on the replication factor. In this example, each object is written 3 times, which means that even if you lose 2 replicas, the object is still available. It is also a good way to increase download bandwidth by distributing the requests over the devices. The drawback of replication is the overhead: each byte is written N times. In this example, 6 bytes from the user become 18 bytes on the cluster. Each replica of an object is stored in a file; you can see the path at the top. The important parts are the hash, which is computed from the URL of the object, and the partition and suffix, which are extracted from the hash. The timestamp is the date of the upload of the object. It is set by the cluster during the upload; the user can’t set it. It is essential in the « eventual consistency » model of Swift: in case of an incident, by comparing the timestamps of the copies of a single object, Swift can decide which one is the good one (the latest, actually).
  4. Erasure coding is a bit different. I’m not going to go through the whole theoretical explanation, with Reed-Solomon and so on; there is a good introduction in the Swift documentation. Each object is split into N data fragments, and M parity fragments are added to provide redundancy, and therefore durability. In this example, the cluster is configured with 3 fragments of data and 1 fragment of parity. It means that if I lose 1 device, my object is still accessible. All the computation of fragmenting and calculating parity is done on the Swift proxies. The major benefit of erasure coding is that you can balance overhead and durability in your cluster. In this example, the overhead is 1.3, but durability is not that good (2 devices down and the object is unavailable). If you choose 10 fragments of data and 2 fragments of parity, you get the same level of durability as 3 replicas, but with an overhead of only 1.2. (Well, durability is not that simple, because the more devices, the more risk, it’s statistics, but I’m simplifying.) Compared to replication, you can’t scale the downloads: multiple fragments must be fetched to rebuild the object. Also, you have to anticipate the CPU consumption on the proxies. To summarize, you can think of replication as RAID-1, while erasure coding is like RAID-5 or RAID-6, but with more configuration possibilities. Looking at the file path, there is a new piece of information: the fragment number. As each fragment is unique, they must be accessed in the correct order to rebuild the object.
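
The overhead figures quoted in notes 3 and 4 follow from a single ratio, (data + parity) / data. A small worked example, treating a 3-replica policy as one data copy plus two redundant copies (a modelling convenience for this sketch, not Swift terminology):

```python
def overhead(data_fragments, parity_fragments):
    """Bytes stored on the cluster per byte uploaded by the user."""
    return (data_fragments + parity_fragments) / data_fragments


def tolerated_losses(parity_fragments):
    """Fragments (or replicas) that can be lost without losing the object."""
    return parity_fragments


policies = [("3 replicas", 1, 2), ("EC 3+1", 3, 1), ("EC 10+2", 10, 2)]
for name, n, m in policies:
    print(name, round(overhead(n, m), 2), tolerated_losses(m))
```

The 10+2 policy tolerates the same two losses as three replicas while storing 1.2 bytes per user byte instead of 3.
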
  5. It was even 30 files per object at the beginning because of the durable file. Thankfully, it has been dropped since then. A x5 factor in the number of files. -> The problem is most acute for erasure coding.
  6. 40M (to confirm?) inodes per device, 36 devices per server, for 64GB of RAM => would require 700+GB of RAM to have everything in cache. Bad choice at first: too many partitions per device. Reducing the number of partitions would tend towards 2 inodes per fragment (17% improvement).
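
The order of magnitude in note 6 can be checked with the per-inode sizes from slide 7. This is a back-of-the-envelope sketch: the 40M figure is the one the note itself flags as unconfirmed, and decimal gigabytes are assumed.

```python
INODES_PER_DEVICE = 40_000_000   # from note 6, flagged "to confirm"
DEVICES_PER_SERVER = 36
GB = 10 ** 9                     # assumption: decimal gigabytes

total_inodes = INODES_PER_DEVICE * DEVICES_PER_SERVER


def cache_size_gb(bytes_per_inode):
    """RAM needed to keep every inode of one server in cache."""
    return total_inodes * bytes_per_inode / GB


low = cache_size_gb(300)     # 300 bytes per inode (slide 7, lower bound)
high = cache_size_gb(1000)   # 1 KB per inode (slide 7, upper bound)
print(total_inodes, low, high)

# Fewer partitions per device: 2.4 -> 2 inodes per fragment.
saving = (2.4 - 2.0) / 2.4
print(round(saving * 100))
```

The 700+GB quoted in the note sits inside this range, at roughly 500 bytes per inode, and is an order of magnitude above the 64GB of RAM available per server.
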
  7. K/V not suited at all to synchronous IO, which is required before the proxy replies that the object is actually safe on disk. Explain write amp. Persistent file handle: open a file without having to walk through all inodes in the path. So what’s the solution? Too many inodes means we have too many files. Let’s have fewer files!
  8. Limiting inodes means limiting the number of files. Obvious! We call them « volumes ». What are their characteristics?
  9. Three important characteristics : Dedicated to a partition : Not one large volume the size of the disk !  Make a volume dedicated to a partition. It makes it easier to move a partition to another node (ring change) Append-only : we only append new objects at the end of the file. Nothing is ever overwritten. We don’t want to write a space allocator No concurrent writes : We must support concurrent writes to the same partition. Create multiple volumes. Now, we need a way to locate the objects we write in those large files. Let’s take a step back first
  10. Very simplified overview, for a replica configuration. not discussing authentication or container server, etc.. An object-server may have multiple disks with multiple object server processes. Explain PUT, GET (one server only) The request will arrive on one proxy server, which will contact specific object-servers based on the ring. Won’t go in details about that, but just to explain that we are modifying the object server code only, nothing above. We are at the bottom of the stack. The problem which we described is on the object server. This is where we are working, let’s zoom in.
  11. Explain consistent hashing. We calculate an MD5 hash from the object name. Then the partition is extracted from the hash, given the cluster configuration. The ring tells us which object-servers will store a partition. The suffix is used to limit the number of entries in a directory (XFS developers unhappy about that). Timestamp: to manage versions, e.g. the user uploads a new version of photo.jpg. Now, let’s see in practice how this works with the new system.
  12. Take care to explain again the request : Object server receives something like PUT toto.jpg Will calculate the object hash, and then PUT that to the object server
  13. Explain the get Now let’s zoom on the index server
  14. A few details about the index server. It is written in Go. There is one instance per disk: 1 database + 1 process.
  15. Explain key, value We are now able to find our files. What about directories ? Files are stored below multiple directories : partition, suffix These are necessary for the cluster (replicator, reconstructor)
  16. Give examples of operations happening: per partition (placement through the ring configuration), per suffix (replication). Explain the partition power and its relation to the partition. Explain how we seek to the prefix and continue scanning until the next partition number. For suffixes, just take the end of the name. We trade CPU for memory. OK, we can write, read, and listdir. What about deletion?
  17. Explain hole punching mechanism. Reclaim space without changing the file size Extent count will increase
  18. Explain hole punching mechanism. Reclaim space without changing the file size
  19. Explain the flow One golang process and database per disk : avoid hanging or slowing down everyone if a disk is being slow I left out a few details
  20. Explain the flow One golang process and database per disk : avoid hanging or slowing down everyone if a disk is being slow I left out a few details
  21. Hole punching is great, but there is still a small cost: more extents in the file. Tombstone volumes can be closed and deleted once all files have been deleted. Also planned for files with an X-Delete-At header. Not a problem until you have lots of extents. Not expected to be needed often.
  22. Explain why we can’t sync the KV Describe the recovery procedure in case we crashed
  23. Explain why we can’t sync the KV Describe the recovery procedure in case we crashed
  24. For 10 million files: 400MB, vs 3 to 8GB with inodes. Explain REPLICATE (non-intuitive name). Improvement: smaller keys.
  25. Better performance expected now (fdatasync)
  26. Add hybrid access