The New File System API: FileContext & AbstractFileSystem
Sanjay Radia, Cloud Computing, Yahoo Inc.

Agenda
- Overview: old vs. new
- Motivation
- The New APIs
- What is next

In a Nutshell: Old vs New

                     Old: 1 layer                   New: 2 layers
User API             FileSystem                     FileContext
FS Impl API          FileSystem                     AbstractFileSystem
FS implementations   S3, DistributedFS, LocalFS     S3Fs, Hdfs, LocalFs

Motivation (1): First-class URI file system namespace
- Old: the shell offers a URI namespace view, e.g. cp uri1 uri2
- But at the user-API layer, you must create a FileSystem instance for each target scheme-authority
  - Incorrect: FileSystem.create(uriPath, ...); the path must be in the FileSystem instance
  - Correct:
      fs = FileSystem.get(uri1, ...)
      fs.create(pathInUri1, ...)
- The original patch for symbolic links depicts the problem
  - FileSystem.open(uriPath, ...) is invalid if uriPath is foreign
  - But one can fool the FileSystem into following a symlink to the same uriPath
- Need a layer that provides a first-class URI file namespace

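The restriction described in this slide can be sketched with a toy model. This is not Hadoop code: ToyFs and ToyUriLayer are hypothetical names, the host names are placeholders, and the only library used is java.net.URI. ToyFs mimics an old-style instance that rejects a path from a foreign scheme-authority, while ToyUriLayer mimics a layer that picks the right instance for whatever fully qualified URI it is given.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Toy stand-in (hypothetical, not a Hadoop class) for one file system instance. */
final class ToyFs {
    private final String scheme;
    private final String authority;

    ToyFs(String scheme, String authority) {
        this.scheme = scheme;
        this.authority = authority;
    }

    /** Accepts paths inside this instance's namespace; rejects foreign URIs. */
    String create(URI path) {
        boolean foreign = path.getScheme() != null
                && !(path.getScheme().equals(scheme)
                     && Objects.equals(path.getAuthority(), authority));
        if (foreign) {
            throw new IllegalArgumentException("Wrong FS: " + path);
        }
        // Report the fully qualified name of what would have been created.
        return scheme + "://" + (authority == null ? "" : authority) + path.getPath();
    }
}

/** Toy URI-namespace layer: one entry point for every scheme-authority. */
final class ToyUriLayer {
    private final Map<String, ToyFs> instances = new HashMap<>();

    String create(URI path) {
        if (path.getScheme() == null) {
            throw new IllegalArgumentException("Expected a fully qualified URI: " + path);
        }
        // One instance per scheme-authority, created on first use.
        String key = path.getScheme() + "://" + path.getAuthority();
        ToyFs fs = instances.computeIfAbsent(
                key, k -> new ToyFs(path.getScheme(), path.getAuthority()));
        return fs.create(path);
    }
}
```

With this model, a ToyFs built for hdfs://nn1 throws on hdfs://nn3/bar, whereas ToyUriLayer accepts both, which is the behaviour the slide asks of the new layer.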
Motivation (2): Separate layers
- Two layers are merged in the old FileSystem
  - User API, which provides the notion of a default file system and working dir
  - Implementation API for implementers of file systems
- Why separate?
  - Simpler to implement file systems
    - An API for implementing file systems (like VFS in Unix)
    - It does not need to deal with Slash, wd, umask, etc.
    - Each file system instance is limited to its own namespace
  - The User API layer provides a natural place for
    - The context: Slash, wd, umask, ...
    - The URI namespace, which cuts across the namespaces of all file system instances; hence a natural place to implement symbolic links to foreign namespaces

Motivation (3): Clean up API and semantics
- The FileSystem API and some of its semantics are not very good
  - We should have adopted Unix APIs where appropriate
    - Ask yourself: are you smarter than Ritchie & Thompson, and do you understand the issues well enough to be different?
  - Semantics: recursive parent creation
    - This convenience can cause accidental creation of parents, e.g. a problem for speculative execution
  - Semantics: rename method, etc.
  - Too many overloaded methods (e.g. create)
  - The cache has leaked through: FileSystem.getNewInstance()
  - Ugliness: e.g. copyLocal(), copy(), ...
  - FileSystem leaked into Path
  - Adding InterruptedException
- Some of these could have been fixed in the FileSystem class, but providing compatibility during the transition was getting messy
- A clean break made things much easier

Motivation (4): The config
- The client-side config is too complex
  - The client should only need Slash, wd, and umask; nothing more
  - But Hadoop needs server-side defaults in the client-side config
    - An unnecessary burden on the client and the admin
      - The cluster admin cannot be expected to copy the config to the desktops
    - Does not work for a federated environment, where a client connects to many file systems, each with its own defaults
- Solution: the client grabs and uses the needed properties from the target server
  - A transition to this solution from the current config is challenging if one needs to maintain compatibility within the existing APIs
- A common complaint is that the Hadoop config is way too complicated

The New File System APIs
HADOOP-4952, HADOOP-6223

First: Some naming fundamentals
- Addresses, routes, and names are ALL names (identifiers)
  - Numbers, strings, paths, addresses, or routes are chosen based on the audience or how they are processed
- ALL names are relative to some context
  - Even absolute or global names have a context in which they are resolved
  - Names appear to be global/absolute because you have simply chosen a frame of reference and are excluding the world outside it
  - When two worlds that each have "global" names collide, names become ambiguous unless you manage the closure/context
  - There is always an implicit context; if you make that implicit context explicit by naming it, you need a context for the name of the context
- A more local context makes apps portable across similar environments
  - A program can move from one Unix machine to another as long as the names relative to the machine's root refer to the "same" objects
  - A Unix process's context: root and working dir, plus default domain, etc.

We have URIs, so why do we need Slash-relative names?
- Our world: a forest of file systems, each referenced by its URI
- Why isn't the URI namespace good enough?
  - URIs bind your application to the very specific servers that provide that URI namespace
  - An application may run on cluster 1 today and be moved to cluster 2 in the future
    - If you move the data to the second cluster, the app should still work
  - Better to let each cluster have its own default fs (i.e. Slash)
- Also need the convenience of a working dir

Enter FileContext: A focus point on a forest of file systems
- A FileContext is a focus point on a forest of file systems
  - In general, it is set for you in your environment (just like your DNS domain)
  - It lets you access the common files in your cluster using location-independent names
    - Your home, tmp, your project's data, ...
  - You can still access the files in other clusters or file systems
    - In Unix you had to mount remote file systems
    - But we have URIs, which are fully qualified and automatically mounted
- A fully qualified URI is to a Slash-relative name as a Slash-relative name is to a wd-relative name
  - ... it's just contexts ...
- [Diagram labels: /, /foo, wd, hdfs://nn3/foo]

Examples
- Use the default config, which has your default FS
    myFC = FileContext.getFileContext();
- Access files in your default file system
    myFC.create("/foo", ...);
    myFC.setWorkingDir("/foo");
    myFC.open("bar", ...);
- Access files in other clusters
    myFC.open("hdfs://nn3/bar", ...);
- You can even set your wd to another fs!
    myFC.setWorkingDir("hdfs://nn3/foo");
- Variations on getting your context
  - A specific URI as the default FS
      myFC = FileContext.getFileContext(URI);
  - Local file system as the default FS
      myFC = FileContext.getLocalFSFileContext();
  - Use a specific config, ignoring $HADOOP_CONFIG
    - Generally you should not need to use a config unless you are doing something special
      configX = someConfigPassedToYou;
      myFC = FileContext.getFileContext(configX); // configX is not changed, but is passed down

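How the three kinds of name in these examples relate can be sketched without Hadoop at all. The toy below is an assumption-laden model, not the FileContext implementation: ToyContext and qualify are hypothetical names, hdfs://nn1/ is a placeholder default file system, and resolution is delegated to java.net.URI. A name with a scheme is taken as fully qualified, a name starting with "/" is resolved against the default file system, and anything else is resolved against the working directory.

```java
import java.net.URI;

/** Toy model (hypothetical) of a context: a default file system plus a working dir. */
final class ToyContext {
    private final URI defaultFs;   // "Slash", e.g. hdfs://nn1/ (placeholder name)
    private URI workingDir;        // kept fully qualified, with a trailing slash

    ToyContext(URI defaultFs) {
        this.defaultFs = defaultFs;
        this.workingDir = defaultFs.resolve("/");
    }

    /** The working dir may be Slash-relative or a URI on another file system. */
    void setWorkingDir(String dir) {
        String qualified = qualify(dir).toString();
        // A trailing slash makes later relative names resolve inside the directory.
        this.workingDir = URI.create(qualified.endsWith("/") ? qualified : qualified + "/");
    }

    /** Turns any of the three kinds of name into a fully qualified URI. */
    URI qualify(String name) {
        URI u = URI.create(name);
        if (u.getScheme() != null) {
            return u;                          // already fully qualified
        }
        if (name.startsWith("/")) {
            return defaultFs.resolve(u);       // Slash-relative
        }
        return workingDir.resolve(u);          // wd-relative
    }
}
```

In this model, after setWorkingDir("hdfs://nn3/foo") the name "bar" qualifies to hdfs://nn3/foo/bar, while "/bar" still qualifies to hdfs://nn1/bar, because Slash stays bound to the default file system.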
So what is in the FileContext?
- The default file system (Slash), obtained from config
  - A pointer to the file system object is kept
- The working dir (lack of Slash)
  - Stored as a path, which is prefixed to relative path names
- Umask, obtained from config
  - Absolute permissions, after applying the mask, are sent to the layer below
- Any other file systems accessed are simply created
  - 0.21: uses the FileSystem, which has a cache
  - 0.22: uses the new AbstractFileSystem
    - Do we need to add a cache? HADOOP-6356

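The umask step can be illustrated with a few lines of arithmetic. This is a minimal sketch assuming the usual Unix convention, in which bits set in the mask are cleared from the requested permission; UmaskSketch, applyUmask, and toOctal are illustrative names and not part of FileContext.

```java
/** Illustrative helper (hypothetical) for the Unix-style umask convention. */
final class UmaskSketch {
    /** Clears the umask bits from the requested permission bits. */
    static int applyUmask(int requested, int umask) {
        return requested & ~umask & 0777;
    }

    /** Formats permission bits as three octal digits, e.g. 644. */
    static String toOctal(int mode) {
        return String.format("%03o", mode);
    }
}
```

For example, a requested mode of 0666 with a umask of 022 yields 0644, and that absolute value is what a user-level layer would hand to the layer below.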
HDFS config: client and server side
- Client-side config:
  - Default file system
  - Umask
- Default values for blocksize, buffersize, and replication are obtained at runtime from the specific file system in question
  - Finally, federation can work
- Server-side config:
  - What used to be there before (except the above two items)
  - Plus cleaned-up config variables for server-side (SS) defaults for blocksize, etc.

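Why runtime defaults help federation can be sketched with a toy model. Everything here is hypothetical: ToyServerDefaults, ToyFileSystem, and ToyClient are illustrative names, and the numbers used with them are made up. The point is only that the client carries no per-cluster defaults; it asks the target file system unless the caller requests a value explicitly.

```java
/** Hypothetical bundle of defaults that a file system reports to its clients. */
final class ToyServerDefaults {
    final long blockSize;
    final int bufferSize;
    final short replication;

    ToyServerDefaults(long blockSize, int bufferSize, short replication) {
        this.blockSize = blockSize;
        this.bufferSize = bufferSize;
        this.replication = replication;
    }
}

/** Toy file system (hypothetical) that knows its own defaults. */
final class ToyFileSystem {
    private final ToyServerDefaults defaults;

    ToyFileSystem(ToyServerDefaults defaults) {
        this.defaults = defaults;
    }

    ToyServerDefaults getServerDefaults() {
        return defaults;
    }
}

/** Toy client: holds no cluster-specific defaults of its own. */
final class ToyClient {
    /** An explicit request wins; otherwise the target file system's default is used. */
    static long chooseBlockSize(ToyFileSystem target, Long requested) {
        return requested != null ? requested : target.getServerDefaults().blockSize;
    }
}
```

Two ToyFileSystem instances with different block sizes give the same client different defaults, with nothing copied into the client-side config.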
Abstract File System (0.22)
- FileContext: User API
- AbstractFileSystem: FS Impl API
  - Does not deal with ..., URIs, umask
- Classes shown in the diagram: DelegateToFileSystem, Hdfs, LocalFs, FilterFs, ChecksumFs, RawLocalFs, RawLocalFileSystem

More Related Content

What's hot

Hadoop, Hbase and Hive- Bay area Hadoop User Group
Hadoop, Hbase and Hive- Bay area Hadoop User GroupHadoop, Hbase and Hive- Bay area Hadoop User Group
Hadoop, Hbase and Hive- Bay area Hadoop User GroupHadoop User Group
 
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...Yahoo Developer Network
 
Apache Spark talk @ The Amsterdam Applied Machine Learning meetup group
Apache Spark talk @ The Amsterdam Applied Machine Learning meetup groupApache Spark talk @ The Amsterdam Applied Machine Learning meetup group
Apache Spark talk @ The Amsterdam Applied Machine Learning meetup groupfvanvollenhoven
 
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache HadoopTez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache HadoopDataWorks Summit
 
Hadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesHadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesKelly Technologies
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?sudhakara st
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSAmazon Web Services
 
Hadoop training in hyderabad-kellytechnologies
Hadoop training in hyderabad-kellytechnologiesHadoop training in hyderabad-kellytechnologies
Hadoop training in hyderabad-kellytechnologiesKelly Technologies
 
Deep Learning with Spark and GPUs
Deep Learning with Spark and GPUsDeep Learning with Spark and GPUs
Deep Learning with Spark and GPUsDataWorks Summit
 
Hadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersHadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersRahul Jain
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkDataWorks Summit
 
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - ClouderaHadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - ClouderaCloudera, Inc.
 
Introduction to Data Analyst Training
Introduction to Data Analyst TrainingIntroduction to Data Analyst Training
Introduction to Data Analyst TrainingCloudera, Inc.
 
Hadoop architecture meetup
Hadoop architecture meetupHadoop architecture meetup
Hadoop architecture meetupvmoorthy
 
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...Hadoop / Spark Conference Japan
 
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...DataWorks Summit/Hadoop Summit
 
Hadoop & cloud storage object store integration in production (final)
Hadoop & cloud storage  object store integration in production (final)Hadoop & cloud storage  object store integration in production (final)
Hadoop & cloud storage object store integration in production (final)Chris Nauroth
 

What's hot (20)

Hadoop, Hbase and Hive- Bay area Hadoop User Group
Hadoop, Hbase and Hive- Bay area Hadoop User GroupHadoop, Hbase and Hive- Bay area Hadoop User Group
Hadoop, Hbase and Hive- Bay area Hadoop User Group
 
January 2011 HUG: Howl Presentation
January 2011 HUG: Howl PresentationJanuary 2011 HUG: Howl Presentation
January 2011 HUG: Howl Presentation
 
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
 
Upgrading hadoop
Upgrading hadoopUpgrading hadoop
Upgrading hadoop
 
Apache Spark talk @ The Amsterdam Applied Machine Learning meetup group
Apache Spark talk @ The Amsterdam Applied Machine Learning meetup groupApache Spark talk @ The Amsterdam Applied Machine Learning meetup group
Apache Spark talk @ The Amsterdam Applied Machine Learning meetup group
 
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache HadoopTez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop
 
Hadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesHadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologies
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
 
Hadoop training in hyderabad-kellytechnologies
Hadoop training in hyderabad-kellytechnologiesHadoop training in hyderabad-kellytechnologies
Hadoop training in hyderabad-kellytechnologies
 
Deep Learning with Spark and GPUs
Deep Learning with Spark and GPUsDeep Learning with Spark and GPUs
Deep Learning with Spark and GPUs
 
Hadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersHadoop & HDFS for Beginners
Hadoop & HDFS for Beginners
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache Flink
 
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - ClouderaHadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
 
Rds data lake @ Robinhood
Rds data lake @ Robinhood Rds data lake @ Robinhood
Rds data lake @ Robinhood
 
Introduction to Data Analyst Training
Introduction to Data Analyst TrainingIntroduction to Data Analyst Training
Introduction to Data Analyst Training
 
Hadoop architecture meetup
Hadoop architecture meetupHadoop architecture meetup
Hadoop architecture meetup
 
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
 
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
 
Hadoop & cloud storage object store integration in production (final)
Hadoop & cloud storage  object store integration in production (final)Hadoop & cloud storage  object store integration in production (final)
Hadoop & cloud storage object store integration in production (final)
 

Viewers also liked

Hadoop Record Reader In Python
Hadoop Record Reader In PythonHadoop Record Reader In Python
Hadoop Record Reader In PythonHadoop User Group
 
Twitter Protobufs And Hadoop Hug 021709
Twitter Protobufs And Hadoop   Hug 021709Twitter Protobufs And Hadoop   Hug 021709
Twitter Protobufs And Hadoop Hug 021709Hadoop User Group
 
Hadoop at Yahoo! -- Hadoop World NY 2009
Hadoop at Yahoo! -- Hadoop World NY 2009Hadoop at Yahoo! -- Hadoop World NY 2009
Hadoop at Yahoo! -- Hadoop World NY 2009yhadoop
 
Hadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University TalksHadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University Talksyhadoop
 
2 hadoop@e bay-hug-2010-07-21
2 hadoop@e bay-hug-2010-07-212 hadoop@e bay-hug-2010-07-21
2 hadoop@e bay-hug-2010-07-21Hadoop User Group
 
1 content optimization-hug-2010-07-21
1 content optimization-hug-2010-07-211 content optimization-hug-2010-07-21
1 content optimization-hug-2010-07-21Hadoop User Group
 
Karmasphere Studio for Hadoop
Karmasphere Studio for HadoopKarmasphere Studio for Hadoop
Karmasphere Studio for HadoopHadoop User Group
 
The Bixo Web Mining Toolkit
The Bixo Web Mining ToolkitThe Bixo Web Mining Toolkit
The Bixo Web Mining ToolkitTom Croucher
 
Nov 2010 HUG: Business Intelligence for Big Data
Nov 2010 HUG: Business Intelligence for Big DataNov 2010 HUG: Business Intelligence for Big Data
Nov 2010 HUG: Business Intelligence for Big DataYahoo Developer Network
 
Upgrading To The New Map Reduce API
Upgrading To The New Map Reduce APIUpgrading To The New Map Reduce API
Upgrading To The New Map Reduce APITom Croucher
 
Public Terabyte Dataset Project: Web crawling with Amazon Elastic MapReduce
Public Terabyte Dataset Project: Web crawling with Amazon Elastic MapReducePublic Terabyte Dataset Project: Web crawling with Amazon Elastic MapReduce
Public Terabyte Dataset Project: Web crawling with Amazon Elastic MapReduceHadoop User Group
 
Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...
Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...
Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...Hadoop User Group
 

Viewers also liked (19)

Hadoop Record Reader In Python
Hadoop Record Reader In PythonHadoop Record Reader In Python
Hadoop Record Reader In Python
 
Twitter Protobufs And Hadoop Hug 021709
Twitter Protobufs And Hadoop   Hug 021709Twitter Protobufs And Hadoop   Hug 021709
Twitter Protobufs And Hadoop Hug 021709
 
Ordered Record Collection
Ordered Record CollectionOrdered Record Collection
Ordered Record Collection
 
Hadoop Release Plan Feb17
Hadoop Release Plan Feb17Hadoop Release Plan Feb17
Hadoop Release Plan Feb17
 
Hadoop at Yahoo! -- Hadoop World NY 2009
Hadoop at Yahoo! -- Hadoop World NY 2009Hadoop at Yahoo! -- Hadoop World NY 2009
Hadoop at Yahoo! -- Hadoop World NY 2009
 
Hadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University TalksHadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University Talks
 
2 hadoop@e bay-hug-2010-07-21
2 hadoop@e bay-hug-2010-07-212 hadoop@e bay-hug-2010-07-21
2 hadoop@e bay-hug-2010-07-21
 
1 content optimization-hug-2010-07-21
1 content optimization-hug-2010-07-211 content optimization-hug-2010-07-21
1 content optimization-hug-2010-07-21
 
Karmasphere Studio for Hadoop
Karmasphere Studio for HadoopKarmasphere Studio for Hadoop
Karmasphere Studio for Hadoop
 
The Bixo Web Mining Toolkit
The Bixo Web Mining ToolkitThe Bixo Web Mining Toolkit
The Bixo Web Mining Toolkit
 
Nov 2010 HUG: Business Intelligence for Big Data
Nov 2010 HUG: Business Intelligence for Big DataNov 2010 HUG: Business Intelligence for Big Data
Nov 2010 HUG: Business Intelligence for Big Data
 
Upgrading To The New Map Reduce API
Upgrading To The New Map Reduce APIUpgrading To The New Map Reduce API
Upgrading To The New Map Reduce API
 
HUG Nov 2010: HDFS Raid - Facebook
HUG Nov 2010: HDFS Raid - FacebookHUG Nov 2010: HDFS Raid - Facebook
HUG Nov 2010: HDFS Raid - Facebook
 
Cloudera Desktop
Cloudera DesktopCloudera Desktop
Cloudera Desktop
 
3 avro hug-2010-07-21
3 avro hug-2010-07-213 avro hug-2010-07-21
3 avro hug-2010-07-21
 
Public Terabyte Dataset Project: Web crawling with Amazon Elastic MapReduce
Public Terabyte Dataset Project: Web crawling with Amazon Elastic MapReducePublic Terabyte Dataset Project: Web crawling with Amazon Elastic MapReduce
Public Terabyte Dataset Project: Web crawling with Amazon Elastic MapReduce
 
Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...
Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...
Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...
 
하둡관리
하둡관리하둡관리
하둡관리
 
January 2011 HUG: Pig Presentation
January 2011 HUG: Pig PresentationJanuary 2011 HUG: Pig Presentation
January 2011 HUG: Pig Presentation
 

Similar to File Context

Distributed File Systems
Distributed File SystemsDistributed File Systems
Distributed File Systemsawesomesos
 
CS9222 ADVANCED OPERATING SYSTEMS
CS9222 ADVANCED OPERATING SYSTEMSCS9222 ADVANCED OPERATING SYSTEMS
CS9222 ADVANCED OPERATING SYSTEMSKathirvel Ayyaswamy
 
Presentation on nfs,afs,vfs
Presentation on nfs,afs,vfsPresentation on nfs,afs,vfs
Presentation on nfs,afs,vfsPrakriti Dubey
 
Lecture-20.pptx
Lecture-20.pptxLecture-20.pptx
Lecture-20.pptxmohaaalsa
 
dotCloud (now Docker) Paas under the_hood
dotCloud (now Docker) Paas under the_hood dotCloud (now Docker) Paas under the_hood
dotCloud (now Docker) Paas under the_hood Susan Wu
 
Distributed Filesystems Review
Distributed Filesystems ReviewDistributed Filesystems Review
Distributed Filesystems ReviewSchubert Zhang
 
Distributed file systems
Distributed file systemsDistributed file systems
Distributed file systemsSri Prasanna
 
Big data with HDFS and Mapreduce
Big data  with HDFS and MapreduceBig data  with HDFS and Mapreduce
Big data with HDFS and Mapreducesenthil0809
 
Chapter 8 distributed file systems
Chapter 8 distributed file systemsChapter 8 distributed file systems
Chapter 8 distributed file systemsAbDul ThaYyal
 
Dfs (Distributed computing)
Dfs (Distributed computing)Dfs (Distributed computing)
Dfs (Distributed computing)Sri Prasanna
 
AliEnFS - A Linux File System For The AliEn Grid Services
AliEnFS - A Linux File System For The AliEn Grid ServicesAliEnFS - A Linux File System For The AliEn Grid Services
AliEnFS - A Linux File System For The AliEn Grid ServicesNathan Mathis
 
Locus Distributed Operating System
Locus Distributed Operating SystemLocus Distributed Operating System
Locus Distributed Operating SystemTamer Rezk
 
best presentation ever by tayyab.pptx
best presentation ever by tayyab.pptxbest presentation ever by tayyab.pptx
best presentation ever by tayyab.pptxHAIDERALICH3
 

Similar to File Context (20)

Distributed File Systems
Distributed File SystemsDistributed File Systems
Distributed File Systems
 
DFSNov1.pptx
DFSNov1.pptxDFSNov1.pptx
DFSNov1.pptx
 
CS9222 ADVANCED OPERATING SYSTEMS
CS9222 ADVANCED OPERATING SYSTEMSCS9222 ADVANCED OPERATING SYSTEMS
CS9222 ADVANCED OPERATING SYSTEMS
 
Unit 1
Unit 1Unit 1
Unit 1
 
Presentation on nfs,afs,vfs
Presentation on nfs,afs,vfsPresentation on nfs,afs,vfs
Presentation on nfs,afs,vfs
 
Hadoop
HadoopHadoop
Hadoop
 
Lecture-20.pptx
Lecture-20.pptxLecture-20.pptx
Lecture-20.pptx
 
dotCloud (now Docker) Paas under the_hood
dotCloud (now Docker) Paas under the_hood dotCloud (now Docker) Paas under the_hood
dotCloud (now Docker) Paas under the_hood
 
Distributed Filesystems Review
Distributed Filesystems ReviewDistributed Filesystems Review
Distributed Filesystems Review
 
Distributed file systems
Distributed file systemsDistributed file systems
Distributed file systems
 
Big data with HDFS and Mapreduce
Big data  with HDFS and MapreduceBig data  with HDFS and Mapreduce
Big data with HDFS and Mapreduce
 
Chapter 8 distributed file systems
Chapter 8 distributed file systemsChapter 8 distributed file systems
Chapter 8 distributed file systems
 
Dfs (Distributed computing)
Dfs (Distributed computing)Dfs (Distributed computing)
Dfs (Distributed computing)
 
Kosmos Filesystem
Kosmos FilesystemKosmos Filesystem
Kosmos Filesystem
 
AliEnFS - A Linux File System For The AliEn Grid Services
AliEnFS - A Linux File System For The AliEn Grid ServicesAliEnFS - A Linux File System For The AliEn Grid Services
AliEnFS - A Linux File System For The AliEn Grid Services
 
Introduction to HDFS
Introduction to HDFSIntroduction to HDFS
Introduction to HDFS
 
Locus Distributed Operating System
Locus Distributed Operating SystemLocus Distributed Operating System
Locus Distributed Operating System
 
Hadoop
HadoopHadoop
Hadoop
 
best presentation ever by tayyab.pptx
best presentation ever by tayyab.pptxbest presentation ever by tayyab.pptx
best presentation ever by tayyab.pptx
 
Lec3 Dfs
Lec3 DfsLec3 Dfs
Lec3 Dfs
 

More from Hadoop User Group

Karmasphere hadoop-productivity-tools
Karmasphere hadoop-productivity-toolsKarmasphere hadoop-productivity-tools
Karmasphere hadoop-productivity-toolsHadoop User Group
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopHadoop User Group
 
1 hadoop security_in_details_hadoop_summit2010
1 hadoop security_in_details_hadoop_summit20101 hadoop security_in_details_hadoop_summit2010
1 hadoop security_in_details_hadoop_summit2010Hadoop User Group
 
Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...
Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...
Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...Hadoop User Group
 
Yahoo! Hadoop User Group - May 2010 Meetup - Apache Hadoop Release Plans for ...
Yahoo! Hadoop User Group - May 2010 Meetup - Apache Hadoop Release Plans for ...Yahoo! Hadoop User Group - May 2010 Meetup - Apache Hadoop Release Plans for ...
Yahoo! Hadoop User Group - May 2010 Meetup - Apache Hadoop Release Plans for ...Hadoop User Group
 
Yahoo! Mail antispam - Bay area Hadoop user group
Yahoo! Mail antispam - Bay area Hadoop user groupYahoo! Mail antispam - Bay area Hadoop user group
Yahoo! Mail antispam - Bay area Hadoop user groupHadoop User Group
 
Flightcaster Presentation Hadoop
Flightcaster  Presentation  HadoopFlightcaster  Presentation  Hadoop
Flightcaster Presentation HadoopHadoop User Group
 

More from Hadoop User Group (16)

Common crawlpresentation
Common crawlpresentationCommon crawlpresentation
Common crawlpresentation
 
Hdfs high availability
Hdfs high availabilityHdfs high availability
Hdfs high availability
 
Cascalog internal dsl_preso
Cascalog internal dsl_presoCascalog internal dsl_preso
Cascalog internal dsl_preso
 
Karmasphere hadoop-productivity-tools
Karmasphere hadoop-productivity-toolsKarmasphere hadoop-productivity-tools
Karmasphere hadoop-productivity-tools
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with Hadoop
 
Hdfs high availability
Hdfs high availabilityHdfs high availability
Hdfs high availability
 
Pig at Linkedin
Pig at LinkedinPig at Linkedin
Pig at Linkedin
 
1 hadoop security_in_details_hadoop_summit2010
1 hadoop security_in_details_hadoop_summit20101 hadoop security_in_details_hadoop_summit2010
1 hadoop security_in_details_hadoop_summit2010
 
Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...
Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...
Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...
 
Yahoo! Hadoop User Group - May 2010 Meetup - Apache Hadoop Release Plans for ...
Yahoo! Hadoop User Group - May 2010 Meetup - Apache Hadoop Release Plans for ...Yahoo! Hadoop User Group - May 2010 Meetup - Apache Hadoop Release Plans for ...
Yahoo! Hadoop User Group - May 2010 Meetup - Apache Hadoop Release Plans for ...
 
Yahoo! Mail antispam - Bay area Hadoop user group
Yahoo! Mail antispam - Bay area Hadoop user groupYahoo! Mail antispam - Bay area Hadoop user group
Yahoo! Mail antispam - Bay area Hadoop user group
 
Hadoop Security Preview
Hadoop Security PreviewHadoop Security Preview
Hadoop Security Preview
 
Flightcaster Presentation Hadoop
Flightcaster  Presentation  HadoopFlightcaster  Presentation  Hadoop
Flightcaster Presentation Hadoop
 
Map Reduce Online
Map Reduce OnlineMap Reduce Online
Map Reduce Online
 
Hadoop Security Preview
Hadoop Security PreviewHadoop Security Preview
Hadoop Security Preview
 
Hadoop Security Preview
Hadoop Security PreviewHadoop Security Preview
Hadoop Security Preview
 

Recently uploaded

08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
File Context

  • 1. The New File System API: FileContext & AbstractFileSystem. Sanjay Radia, Cloud Computing, Yahoo Inc.
  • 2. Agenda: Overview (old vs new); Motivation; The New APIs; What is next
  • 3. In a Nutshell: Old vs New. Old: 1 layer. FileSystem serves as both the User API and the FS Impl API; FS implementations: S3, DistributedFS, LocalFS. New: 2 layers. FileContext is the User API and AbstractFileSystem is the FS Impl API; FS implementations: S3Fs, LocalFs, Hdfs.
  • 4. Motivation (1): First-class URI file system namespace. Old: the shell offers a URI namespace view, e.g. cp uri1 uri2. But at the user-API layer, one must create a FileSystem instance for each target scheme-authority. Incorrect: FileSystem.create(uriPath..); the call must take a path in the FileSystem instance. Correct: Fs = FileSystem.get(Uri1, …); Fs.create(pathInUri1, …). The original patch for symbolic links depicts the problem: FileSystem.open(UriPath, …) is invalid if UriPath is foreign, but one can fool the FileSystem into following a symlink to the same UriPath. Need a layer that provides a first-class URI file namespace.
  • 5. Motivation (2): Separate Layers. Two layers are merged in the old FileSystem: a User API, which provides the notion of a default file system and working dir, and an Implementation API for implementers of the file systems. Why separate? It is simpler to implement file systems: an API for implementing file systems (like VFS in Unix) does not need to deal with Slash, wd, umask, etc., and each file system instance is limited to its namespace. The User API layer provides a natural place for the context (Slash, wd, umask, …) and for the URI namespace, which cuts across the namespaces of all file system instances; hence it is a natural place to implement symbolic links to foreign namespaces.
  • 6. Motivation (3): Cleanup API and Semantics. The FileSystem API and some of its semantics are not very good. We should have adopted Unix APIs where appropriate. Ask yourself: are you smarter than Ritchie & Thompson, and do you understand the issues well enough to be different? Semantics: the recursive parent creation. This convenience can cause accidental creation of parents, e.g. a problem for speculative executions. Semantics: Rename method, etc. Too many overloaded methods … (e.g. Create). The cache has leaked through: FileSystem.getNewInstance(). Ugliness: e.g. copyLocal(), copy(), … FileSystem leaked into Path. Adding InterruptedException. Some of these could have been fixed in the FileSystem class, but it was getting messy to provide compatibility in the transition. A clean break made things much easier.
  • 7. Motivation (4): The Config. The client-side config is too complex. The client should only need Slash, wd, umask; that's it, nothing more. But Hadoop needs server-side defaults in the client-side config. This is an unnecessary burden on the client and admin: the cluster admin cannot be expected to copy the config to the desktops, and it does not work for a federated environment where a client connects to many file systems, each with its own defaults. Solution: the client grabs/uses the needed properties from the target server. A transition to this solution from the current config is challenging if one needs to maintain compatibility within the existing APIs. A common complaint is that the Hadoop config is way too complicated.
  • 8. The New File System APIs: HADOOP-4952, HADOOP-6223
  • 9. First: Some Naming Fundamentals. Addresses, routes, and names are ALL names (identifiers). Numbers, strings, paths, addresses, or routes are chosen based on the audience or how they are processed. ALL names are relative to some context. Even absolute or global names have a context in which they are resolved. Names appear to be global/absolute because you have simply chosen a frame of reference and are excluding the world outside that frame of reference. When two worlds that each have "global" names collide, names get ambiguous unless you manage the closure/context. There is always an implicit context; if you make that implicit context explicit by naming it, you need a context for the name of the context. A more local context makes apps portable across similar environments: a program can move from one Unix machine to another as long as the names relative to the machine's root refer to the "same" objects. A Unix process's context: root and working dir plus default domain, etc.
  • 10. We have URIs, why do we need Slash-relative names? Our world: a forest of file systems, each referenced by its URI. Why isn't the URI namespace good enough? The URIs will bind your application to the very specific servers that provide that URI namespace. An application may run on cluster 1 today and be moved to cluster 2 in the future. If you move the data to the second cluster, the app should work. It is better to let each cluster have its own default fs (i.e. Slash). We also need the convenience of a working dir.
  • 11. Enter FileContext: A focus point on a forest of file systems. In general, it is set for you in your environment (just like your DNS domain). It lets you access the common files in your cluster using location-independent names: your home, tmp, your project's data. You can still access the files in other clusters or file systems. In Unix you had to mount remote file systems, but we have URIs, which are fully qualified and automatically mounted. A fully qualified URI is to a Slash-relative name as a Slash-relative name is to a wd-relative name … it's just contexts …. [Figure labels: /, wd, /foo, hdfs://nn3/foo]
  • 12. Examples. Use the default config, which has your default FS: myFC = FileContext.getFileContext(); Access files in your default file system: myFC.create("/foo", ...); myFC.setWorkingDir("/foo"); myFC.open("bar", ...); Access files in other clusters: myFC.open("hdfs://nn3/bar", ..); You can even set your wd to another fs: myFC.setWorkingDir("hdfs://nn3/foo"); Variations on getting your context. A specific URI as the default FS: myFC = FileContext.getFileContext(URI); Local file system as the default FS: myFC = FileContext.getLocalFSFileContext(); Use a specific config, ignoring $HADOOP_CONFIG (generally you should not need to use a config unless you are doing something special): configX = someConfigPassedToYou; myFC = FileContext.getFileContext(configX); // configX is not changed but passed down
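The three kinds of names used in these examples (fully qualified URI, Slash-relative, wd-relative) can be modelled without Hadoop at all. The following is a minimal sketch using only java.net.URI; the class NameResolutionSketch, its nested Context, and the method names are hypothetical and are not the FileContext implementation. It only illustrates the resolution order the slides describe.

```java
import java.net.URI;

// Hypothetical model of name resolution; not Hadoop's FileContext.
final class NameResolutionSketch {

    /** A context: a default file system (Slash) plus a working directory. */
    static final class Context {
        private final URI defaultFs; // e.g. hdfs://nn1
        private URI workingDir;      // always kept fully qualified

        Context(URI defaultFs, String initialWd) {
            this.defaultFs = defaultFs;
            this.workingDir = defaultFs.resolve(initialWd);
        }

        /** Resolve a name: URI as-is, then Slash-relative, then wd-relative. */
        URI resolve(String name) {
            URI u = URI.create(name);
            if (u.getScheme() != null) {
                return u; // fully qualified: no context needed
            }
            if (name.startsWith("/")) {
                return defaultFs.resolve(name); // relative to the default FS
            }
            // wd-relative: prefix the working dir's path, keep its scheme/authority
            String wdPath = workingDir.getPath();
            String base = wdPath.endsWith("/") ? wdPath : wdPath + "/";
            return workingDir.resolve(base + name);
        }

        /** The working dir may itself be any kind of name, even on another FS. */
        void setWorkingDir(String name) {
            workingDir = resolve(name);
        }
    }

    public static void main(String[] args) {
        Context ctx = new Context(URI.create("hdfs://nn1"), "/user/me");
        System.out.println(ctx.resolve("/foo"));
        ctx.setWorkingDir("/foo");
        System.out.println(ctx.resolve("bar"));
        ctx.setWorkingDir("hdfs://nn3/foo");
        System.out.println(ctx.resolve("bar"));
    }
}
```

Keeping the working dir fully qualified is what allows it to point at another file system: a wd-relative name then inherits the scheme and authority of the wd rather than of the default FS.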
  • 13. So what is in the FileContext? The default file system (Slash), obtained from config; a pointer to the file system object is kept. The working dir (lack of Slash), stored as a path which is prefixed to relative path names. Umask, obtained from config; absolute permissions after applying the mask are sent to the layer below. Any other file systems accessed are simply created: 0.21 uses the FileSystem, which has a cache; 0.22 uses the new AbstractFileSystem. Do we need to add a cache? HADOOP-6356
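The point about the umask is that the user-facing layer applies it once, so the layer below only ever sees absolute permissions. A minimal sketch of that step, assuming plain Unix umask semantics on 9-bit permission values (the class and method names are hypothetical, not Hadoop's FsPermission API):

```java
// Hypothetical helper; illustrates Unix-style umask application only.
final class UmaskSketch {

    /** Clear every permission bit that is set in the umask. */
    static int applyUmask(int requested, int umask) {
        return requested & ~umask & 0777;
    }

    /** Render a permission value as three octal digits. */
    static String toOctal(int perm) {
        return String.format("%03o", perm);
    }

    public static void main(String[] args) {
        // A file requested as rw-rw-rw- under umask 022 becomes rw-r--r--.
        System.out.println(toOctal(applyUmask(0666, 0022)));
        // A directory requested as rwxrwxrwx under umask 077 becomes rwx------.
        System.out.println(toOctal(applyUmask(0777, 0077)));
    }
}
```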
  • 14. HDFS config: client & server side. Client-side config: default file system and umask. Default values for blocksize, buffersize, and replication are obtained at runtime from the specific file system in question. Finally, federation can work. Server-side config: what used to be there before (except the above two items) plus cleaned config variables for SS defaults for blocksize, etc.
  • 17. [Diagram: AbstractFileSystem (FS Impl API) class hierarchy. Labels: Umask, DelegateToFileSystem, Hdfs, LocalFs, FilterFs, ChecksumFs, RawLocalFs, RawLocalFileSystem]
  • 18. The Jira: approx. 9 months. FileContext: three main discussions. (1) Use Java's new IO: the main issue is that their default and wd are tied to the file system of the JVM in order to provide compatibility with older APIs. (2) Config: a simpler per-process environment? Deferred to Doug's view that a per-process env is not sufficient for a MT Java application; besides, it is out of scope for this Jira, being a radical change. (3) SS config. What I would have liked to do, but didn't: 2 APIs, a Java interface and an abstract class. Not the Hadoop way! For example, if we had done this in the old FileSystem, it would have facilitated dynamic class loading for protocol compatibility as an interim solution.
  • 19. What is next? Exceptions: adding InterruptedException and declaring the sub-exceptions of IOException. Issue: apps need to manage an interrupt differently than other exceptions. IO streams throw InterruptedIOException. FileContext has three choices: InterruptedIOException, InterruptedException, or an unchecked exception. The Cache: do we need it? How do we deal with Facebook's application that forced the cache to leak through? Other AbstractFileSystem impls (can use the delegator).
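The concern about interrupts can be illustrated with the standard library alone: java.io.InterruptedIOException is a subclass of IOException, so a caller that wants to treat an interrupt differently has to catch it before the general case. The sketch below is one possible handling pattern, not the choice made for FileContext; the class, enum, and method names are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InterruptedIOException;

// Hypothetical caller-side pattern; not part of the FileContext API.
final class InterruptHandlingSketch {

    enum Outcome { OK, INTERRUPTED, FAILED }

    /** Read one byte and classify the result, treating interrupts separately. */
    static Outcome readOnce(InputStream in) {
        try {
            in.read();
            return Outcome.OK;
        } catch (InterruptedIOException e) {
            // Must come before IOException: restore the thread's interrupt
            // status so code further up the stack can still observe it.
            Thread.currentThread().interrupt();
            return Outcome.INTERRUPTED;
        } catch (IOException e) {
            // Any other I/O failure is handled as an ordinary error.
            return Outcome.FAILED;
        }
    }

    public static void main(String[] args) {
        InputStream interrupted = new InputStream() {
            @Override public int read() throws IOException {
                throw new InterruptedIOException("simulated interrupt");
            }
        };
        System.out.println(readOnce(interrupted));
        // Clear the flag set by readOnce so the demo exits cleanly.
        Thread.interrupted();
    }
}
```

Declaring the sub-exceptions explicitly, as the slide proposes, would let callers write this distinction against the API's signature instead of relying on knowledge of what the streams throw.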