I can’t really talk about Hortonworks without first taking a moment to talk about the history of Hadoop.

What we now know as Hadoop really started back in 2005, when Eric Baldeschwieler (known as “E14”) started work on a project to build a large-scale data storage and processing technology that would allow Yahoo to store and process massive amounts of data to underpin its most critical application, Search. The initial focus was on building out the technology, the key components being HDFS and MapReduce, that would become the core of what we think of as Hadoop today, and on continuing to innovate to meet the needs of this specific application.

By 2008, Hadoop usage had greatly expanded inside Yahoo, to the point that many applications were using this data management platform. As a result, the team’s focus extended to include operations: now that applications were propagating around the organization, sophisticated capabilities for operating Hadoop at scale were necessary. It was also at this time that usage began to expand well beyond Yahoo, with many notable organizations (including Facebook and others) adopting Hadoop as the basis of their large-scale data processing and storage applications, which necessitated a focus on operations to support what was by now a large variety of critical business applications.

In 2011, recognizing that more mainstream adoption of Hadoop was beginning to take off, and with the objective of facilitating it, the core team left, with the blessing of Yahoo, to form Hortonworks. The goal of the group was to facilitate broader adoption by addressing the Enterprise capabilities that would enable a larger number of organizations to adopt and expand their usage of Hadoop.

[note: if useful as a talk track, Cloudera was formed in 2008, well BEFORE the operational expertise of running Hadoop at scale was established inside of Yahoo]
In that capacity, Arun allows Hortonworks to be instrumental in working with the community to drive the roadmap for Core Hadoop, where the focus today is on things like YARN, MapReduce2, HDFS2 and more.

For Core Hadoop, in absolute terms, Hortonworkers have contributed more than twice as many lines of code as the next closest contributor, and even more if you include Yahoo, our development partner. Taking such a prominent role also enables us to ensure that our distribution integrates deeply with the ecosystem: both in the choice of deployment platforms, such as Windows, Azure and more, and in deeply engineered solutions with key partners such as Teradata.

And consistent with our approach, all of this is done in 100% open source.
Across all of our user base, we have identified just three usage patterns: Refine, Explore and Enrich. Sometimes more than one is used in concert during a complex project, but the patterns are distinct nonetheless.

The first of these, the Refine case, is probably the most common today. It is about taking very large quantities of data and using Hadoop to distill the information down into a more manageable data set that can then be loaded into a traditional data warehouse for use with existing tools. This is relatively straightforward and allows an organization to harness a much larger data set for its analytics applications while leveraging its existing data warehousing and analytics tools.

Using the graphic here: in step 1 data is pulled from a variety of sources, in step 2 it is brought into the Hadoop platform, and in step 3 it is loaded into a data warehouse for analysis by existing BI tools.
A second use case is what we would refer to as Data Exploration. This is the use case most commonly in question when people talk about “Data Science”.

In simplest terms, it is about using Hadoop as the primary data store rather than performing the secondary step of moving data into a data warehouse. To support this use case you’ve seen all the BI tool vendors rally to add support for Hadoop (and most commonly HDP) as a peer to the database, and in so doing allow for rich analytics on extremely large datasets that would be both unwieldy and costly in a traditional data warehouse. Hadoop allows for interaction with a much richer dataset and has spawned a whole new generation of analytics tools that rely on Hadoop (HDP) as the data store.

To use the graphic: in step 1 data is pulled into HDP, in step 2 it is stored and processed, and in step 3 it is surfaced directly into the analytics tools for the end user.
The final use case is called Application Enrichment.

This is about incorporating data stored in HDP to enrich an existing application. This could be an online application in which we want to surface custom information to a user based on their particular profile. For example: if a user has been searching the web for information on home renovations, in the context of your application you may want to use that knowledge to surface a custom offer for a product that you sell related to that category. Large web companies such as Facebook and others are very sophisticated in the use of this approach.

In the diagram, this is about pulling data from disparate sources into HDP in step 1, storing and processing it in step 2, and then interacting with it directly from your applications in step 3, typically in a bi-directional manner (e.g. request data, return data, store response).
It is for that reason that we focus on HDP interoperability across all of these categories:

Data systems
- HDP is endorsed by and embedded with SQL Server, Teradata and more

BI tools
- HDP is certified for use with the packaged applications you already use: Microsoft, Tableau, MicroStrategy, Business Objects and more

Development tools
- For .NET developers: Visual Studio, used to build more than half the custom applications in the world, certifies with HDP to enable Microsoft app developers to build custom apps with Hadoop
- For Java developers: Spring for Apache Hadoop enables Java developers to quickly and easily build Hadoop-based applications with HDP

Operational tools
- Integration with System Center and with Teradata Viewpoint
At its core, Hadoop is about HDFS and MapReduce, two projects that provide the distributed storage and data processing that underpin Hadoop.

In addition to Core Hadoop, we must identify and include the requisite “Platform Services” that are central to any piece of enterprise software. These include High Availability, Disaster Recovery, Security, etc., which enable use of the technology for a much broader (and mission-critical) problem set.

This is accomplished not by introducing new open source projects, but rather by ensuring that these aspects are addressed within existing projects.

- HDFS: Self-healing, distributed file system for multi-structured data; breaks files into blocks and stores them redundantly across the cluster
- MapReduce: Framework for running large data processing jobs in parallel across many nodes and combining the results
- YARN: New application management framework that enables Hadoop to go beyond MapReduce apps
- Enterprise-ready services: High availability, disaster recovery, snapshots, security, …
Beyond Core and Platform Services, we must add a set of Data Services that enable the full data lifecycle. This includes capabilities to:
- Store data
- Process data
- Access data

For example: how do we maintain the consistent metadata required to determine how best to query data stored in HDFS? The answer: a project called Apache HCatalog. Or how do we access data stored in Hadoop from SQL-oriented tools? The answer: with projects such as Hive, which is the de facto standard for accessing data stored in HDFS.

All of these are broadly captured under the category of “data services”.

Apache HCatalog: Metadata & Table Management
- Metadata service that enables users to access Hadoop data as a set of tables without needing to be concerned with where or how their data is stored
- Enables consistent data sharing and interoperability across data processing tools such as Pig, MapReduce and Hive
- Enables deep interoperability and data access with systems such as Teradata, SQL Server, etc.

Apache Hive: SQL Interface for Hadoop
- The de facto SQL-like interface for Hadoop that enables data summarization, ad-hoc query, and analysis of large datasets
- Connects to Excel, MicroStrategy, PowerPivot, Tableau and other leading BI tools via the Hortonworks Hive ODBC Driver
- Hive currently serves batch and non-interactive use cases; in 2013, Hortonworks is working with the Hive community to extend use cases to interactive query. Cloudera, on the other hand, has chosen to abandon Hive in favor of Cloudera Impala (a Cloudera-controlled technology aimed at the analytics market and solely focused on non-operational interactive query use cases)

Apache HBase: NoSQL DB for Interactive Apps
- Non-relational, columnar database that provides a way for developers to create, read, update, and delete data in Hadoop in a way that performs well for interactive applications
- Commonly used for serving “intelligent applications” that predict user behavior, detect shifting usage patterns, or recommend ways for users to engage

WebHDFS: Web service interface for HDFS
- Scalable REST API that enables easy and scalable access to HDFS
- Move files in and out of HDFS and delete them; leverages the parallelism of the cluster
- Perform file and directory functions
- webhdfs://<HOST>:<HTTP PORT>/PATH
- Included in versions 1.0 and 2.0 of Hadoop; created and driven by Hortonworkers

Talend Open Studio for Big Data: open source ETL tool available as an optional download with HDP
- Intuitive graphical data integration tools for HDFS, Hive, HBase, HCatalog and Pig
- Oozie scheduling allows you to manage and stage jobs
- Connectors for any database, business application or system
- Integrated HCatalog storage
HCatalog: metadata shared across the whole platform
- File locations become abstract (not hard-coded)
- Data types become shared (not redefined per tool)
- Partitioning and HDFS-optimized
Any data management platform that is operated at any reasonable scale requires a management technology: for example, SQL Server Management Studio for SQL Server, or Oracle Enterprise Manager for Oracle DB. Hadoop is no exception, and for Hadoop that means Apache Ambari, which is increasingly being recognized as foundational to the operation of Hadoop infrastructures. It allows users to provision, manage and monitor a cluster, and provides a set of tools to visualize and diagnose operational issues. There are other projects in this category (such as Oozie), but Ambari is really the most influential.

Apache Ambari: Management & Monitoring
- Makes Hadoop clusters easy to operate
- Simplified cluster provisioning with a step-by-step install wizard
- Pre-configured operational metrics for insight into the health of Hadoop services
- Visualization of job and task execution for visibility into performance issues
- Complete RESTful API for integrating with existing operational tools
- Intuitive user interface that makes controlling a cluster easy and productive
So how does this get brought together into our distribution? It is really pretty straightforward, but also unique:

We start with this group of open source projects that I described and that we are continually driving in the OSS community. [CLICK] We then package the appropriate versions of those open source projects, integrate and test them using a full suite, including all the IP for regression testing contributed by Yahoo, and [CLICK] contribute all of the bug fixes back to the open source tree. From there, we package and certify a distribution in the form of the Hortonworks Data Platform (HDP), which includes both Hadoop Core and the related projects required by the Enterprise user, and provide it to our customers.

Through this application of an Enterprise Software development process to the open source projects, the result is a 100% open source distribution that has been packaged, tested and certified by Hortonworks. It is also 100% in sync with the open source trees.
And finally, because any enterprise runs a heterogeneous set of infrastructures, we ensure that HDP runs on your choice of infrastructure. Whether this is Linux, Windows (HDP is the only distribution certified for Windows), on a cloud platform such as Azure or Rackspace, or in an appliance, we ensure that all of them are supported and that this work is all contributed back to the open source community.
Without the namenode, the filesystem cannot be used. In fact, if the machine running the namenode were obliterated, all the files on the filesystem would be lost since there would be no way of knowing how to reconstruct the files from the blocks on the datanodes. For this reason, it is important to make the namenode resilient to failure, and Hadoop provides two mechanisms for this. The first way is to back up the files that make up the persistent state of the filesystem metadata. Hadoop can be configured so that the namenode writes its persistent state to multiple filesystems. These writes are synchronous and atomic. The usual configuration choice is to write to local disk as well as a remote NFS mount.
Despite its name, the secondary namenode (SNN) does not act as a namenode. Its main role is to periodically merge the namespace image with the edit log to prevent the edit log from becoming too large. The secondary namenode usually runs on a separate physical machine, since it requires plenty of CPU and as much memory as the namenode to perform the merge. It keeps a copy of the merged namespace image, which can be used in the event of the namenode failing. However, the state of the secondary namenode lags that of the primary, so in the event of total failure of the primary, data loss is almost certain. The usual course of action in this case is to copy the namenode’s metadata files that are on NFS to the secondary and run it as the new primary.

Edit Log
When a filesystem client performs a write operation (such as creating or moving a file), it is first recorded in the edit log. The namenode also has an in-memory representation of the filesystem metadata, which it updates after the edit log has been modified. The in-memory metadata is used to serve read requests. The edit log is flushed and synced after every write before a success code is returned to the client. For namenodes that write to multiple directories, the write must be flushed and synced to every copy before returning successfully. This ensures that no operation is lost due to machine failure.

fsimage
The fsimage file is a persistent checkpoint of the filesystem metadata. However, it is not updated for every filesystem write operation, since writing out the fsimage file, which can grow to be gigabytes in size, would be very slow. This does not compromise resilience, however, because if the namenode fails, then the latest state of its metadata can be reconstructed by loading the fsimage from disk into memory, then applying each of the operations in the edit log. In fact, this is precisely what the namenode does when it starts up.

According to Apache, SecondaryNameNode is deprecated. See: http://hadoop.apache.org/hdfs/docs/r0.22.0/hdfs_user_guide.html#Secondary+NameNode (Accessed May 2012). – LW

SNN configuration options:
- dfs.namenode.checkpoint.period: value
- dfs.namenode.checkpoint.size: value
- dfs.http.address: value; points to the namenode's port 50070; gets the fsimage and edit log
- dfs.namenode.checkpoint.dir: value; else defaults to /tmp
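The checkpoint-and-replay recovery described in the edit log and fsimage notes can be sketched in a few lines of Python. This is a toy model, not HDFS code: the operation names (`create`, `rename`, `delete`), the dict-based image, and the `replay` and `checkpoint` helpers are all assumptions made for illustration.

```python
# Toy model of namenode recovery: load the last checkpoint (fsimage),
# then re-apply every operation recorded in the edit log.
# The record layout and operation names are illustrative assumptions.

def replay(fsimage, edit_log):
    """Return the namespace obtained by applying edit_log on top of fsimage."""
    namespace = dict(fsimage)  # copy, so the checkpoint itself is untouched
    for op, *args in edit_log:
        if op == "create":
            path, size = args
            namespace[path] = size
        elif op == "rename":
            src, dst = args
            namespace[dst] = namespace.pop(src)
        elif op == "delete":
            (path,) = args
            namespace.pop(path, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return namespace


def checkpoint(fsimage, edit_log):
    """Merge the log into a new image and start a fresh, empty log."""
    return replay(fsimage, edit_log), []


fsimage = {"/data/a.txt": 100}
edit_log = [
    ("create", "/data/b.txt", 250),
    ("rename", "/data/a.txt", "/data/old.txt"),
    ("delete", "/data/b.txt"),
]

recovered = replay(fsimage, edit_log)               # what a restart would do
new_image, new_log = checkpoint(fsimage, edit_log)  # what a periodic merge would do
```

The design point the sketch tries to show is that the image can stay stale for a long time without losing anything, as long as every write reaches the log before it is acknowledged.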
Data Disk Failure, Heartbeats and Re-Replication
Each DataNode sends a Heartbeat message to the NameNode periodically. A network partition can cause a subset of DataNodes to lose connectivity with the NameNode. The NameNode detects this condition by the absence of a Heartbeat message. The NameNode marks DataNodes without recent Heartbeats as dead and does not forward any new IO requests to them. Any data that was registered to a dead DataNode is not available to HDFS any more. DataNode death may cause the replication factor of some blocks to fall below their specified value. The NameNode constantly tracks which blocks need to be replicated and initiates replication whenever necessary. The necessity for re-replication may arise for many reasons: a DataNode may become unavailable, a replica may become corrupted, a hard disk on a DataNode may fail, or the replication factor of a file may be increased.

Cluster Rebalancing
The HDFS architecture is compatible with data rebalancing schemes. A scheme might automatically move data from one DataNode to another if the free space on a DataNode falls below a certain threshold. In the event of a sudden high demand for a particular file, a scheme might dynamically create additional replicas and rebalance other data in the cluster. These types of data rebalancing schemes are not yet implemented.

Data Integrity
It is possible that a block of data fetched from a DataNode arrives corrupted. This corruption can occur because of faults in a storage device, network faults, or buggy software. The HDFS client software implements checksum checking on the contents of HDFS files. When a client creates an HDFS file, it computes a checksum of each block of the file and stores these checksums in a separate hidden file in the same HDFS namespace. When a client retrieves file contents, it verifies that the data it received from each DataNode matches the checksum stored in the associated checksum file. If not, then the client can opt to retrieve that block from another DataNode that has a replica of that block.

Metadata Disk Failure
The FsImage and the EditLog are central data structures of HDFS. A corruption of these files can cause the HDFS instance to be non-functional. For this reason, the NameNode can be configured to support maintaining multiple copies of the FsImage and EditLog. Any update to either the FsImage or EditLog causes each of the FsImages and EditLogs to get updated synchronously. This synchronous updating of multiple copies of the FsImage and EditLog may degrade the rate of namespace transactions per second that a NameNode can support. However, this degradation is acceptable because even though HDFS applications are very data intensive in nature, they are not metadata intensive. When a NameNode restarts, it selects the latest consistent FsImage and EditLog to use.

The NameNode machine is a single point of failure for an HDFS cluster. If the NameNode machine fails, manual intervention is necessary. Currently, automatic restart and failover of the NameNode software to another machine is not supported.
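As a rough illustration of the Data Integrity behaviour above, the following Python sketch verifies a block against a stored checksum and falls back to another replica on a mismatch. It is not the HDFS client: `zlib.crc32` stands in for whatever checksum the client uses, and the `store_block` and `read_block` helpers and the datanode names are hypothetical.

```python
import zlib


def store_block(data):
    """Return (data, checksum), as a client might record at write time."""
    return data, zlib.crc32(data)


def read_block(replicas, expected_checksum):
    """Try each replica in turn; return the first whose checksum matches."""
    for node, data in replicas:
        if zlib.crc32(data) == expected_checksum:
            return node, data
    raise IOError("all replicas failed checksum verification")


original, checksum = store_block(b"hello hdfs")

# Hypothetical replicas: the first copy was corrupted in storage or transit.
replicas = [
    ("datanode-1", b"hellO hdfs"),
    ("datanode-2", b"hello hdfs"),
]

node, data = read_block(replicas, checksum)
```

Here the read is served from the second replica because the first one no longer matches the checksum recorded when the block was written.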
Use the put command
- If bar is a directory, then foo will be placed in the directory bar
- If no file or directory named bar exists, file foo will be named bar
- If bar is already a file in HDFS, then an error is returned
- If subName does not exist, then the file will be created in the directory
Note: -copyFromLocal can be used instead of -put
Can display using cat
Copy a file from HDFS to the local file system with the -get command
$ bin/hadoop fs -get foo LocalFoo
Other commands
$ hadoop fs -rmr directory|file
Recursively removes the directory or file
A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system (HDFS). The framework takes care of scheduling tasks, monitoring them and re-executing the failed tasks. The MapReduce framework and HDFS typically run on the same set of nodes. This configuration allows the framework to effectively schedule tasks on the nodes where data is already present, resulting in very high aggregate bandwidth across the cluster.

M/R supports multiple nodes processing contiguous data sets in parallel. Often, one node will “Map”, while another “Reduces”. In between is the “shuffle”. We’ll cover all these in more detail.

Basic “phases”, or steps, are displayed here. Other custom phases may be added.

In MapReduce, data is passed around as key/value pairs.

QUESTION: is the “Reduce” phase required? NO

INSTRUCTOR SUGGESTION: At the start of the second day, have one or more students re-draw this on the white board to reiterate its importance.
MapReduce consists of Java classes that can process big data in parallel. It receives three inputs (a source collection, a Map function and a Reduce function) and returns a new result data collection.

The algorithm is composed of a few steps. The first applies the map() function to each item within the source collection. Map’s responsibility is to convert an item from the source collection into zero or many Key/Value pairs.

In the next step, the framework sorts all Key/Value pairs and creates new object instances in which all values are grouped by Key.

reduce() is then called for each grouped Key/Values instance. It returns a new instance to be included in the result collection.
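The three steps above (map, sort and group, reduce) can be mimicked in plain Python. This is a single-process sketch of the data flow only, not the Hadoop Java API; `map_reduce`, `first_letter` and `count_words` are names invented for the example.

```python
from collections import defaultdict


def map_reduce(source, map_fn, reduce_fn):
    # Map step: each item yields zero or more (key, value) pairs.
    pairs = []
    for item in source:
        pairs.extend(map_fn(item))

    # Sort/group step: collect all values under their key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)

    # Reduce step: one call per key, visited in sorted key order.
    return [reduce_fn(key, groups[key]) for key in sorted(groups)]


def first_letter(word):
    # Emits nothing for an empty string, one pair otherwise.
    return [(word[0], word)] if word else []


def count_words(letter, words):
    return (letter, len(words))


result = map_reduce(["pig", "hive", "", "hbase", "pig"], first_letter, count_words)
```

Note that the empty string produces no pairs at all, which is the "zero or many" case mentioned in the notes.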
A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner.

An InputFormat determines how the data is split across map tasks
- TextInputFormat is the default InputFormat
- It divides the input data into an InputSplit[] array
- Each InputSplit is given an individual map
- Each InputSplit is associated with a destination node that should be used to read the data, so Hadoop can try to assign the computation to the local data location

The RecordReader class:
- Reads input data and produces key-value pairs to be passed into the Mapper
- May control how data is decompressed
- Converts data to Java types that MapReduce can work with

MapReduce relies on the InputFormat to:
- Validate the input-specification of the job.
- Split up the input file(s) into logical InputSplit instances, each of which is then assigned to an individual Mapper.
- Provide the RecordReader implementation used to glean input records from the logical InputSplit for processing by the Mapper.

The default behavior of InputFormat is to split the input into logical InputSplit instances based on the total size, in bytes, of the input files. [However, the HDFS blocksize of the input files is treated as an upper bound for input splits. A lower bound on the split size can be set via the config parameter mapred.min.split.size.] * validate this

Logical splits based on input size are insufficient for many applications, since record boundaries must be respected. In such cases, the application should implement a RecordReader, which is responsible for respecting record boundaries and presents a record-oriented view of the logical InputSplit to the individual task.

If the default TextInputFormat is used for a given job, the framework detects compressed input files (e.g., with the .gz extension) and automatically decompresses them using an appropriate CompressionCodec. Note that, depending on the codec used, compressed files cannot be split, and each such file is processed in its entirety by a single mapper. gzip cannot be split; bzip2 and LZO can.
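One reading of the bracketed split-size note above (which the notes themselves flag for validation) is a rule of the form max(minimum, min(goal, block size)). The Python sketch below works through that arithmetic only; the function names are made up, the ceiling-based split count is a simplification, and the rule itself is an assumption about the older file-based input formats rather than something these notes confirm.

```python
import math


def split_size(goal_size, min_size, block_size):
    # The block size caps the split; a configured minimum can push it back up.
    return max(min_size, min(goal_size, block_size))


def num_splits(file_size, size):
    # Simplified: real input formats may tolerate a slightly larger last split.
    return math.ceil(file_size / size)


MB = 1024 * 1024
block = 128 * MB
file_size = 1024 * MB  # a hypothetical 1 GB input file

# With no meaningful minimum, splits follow the block size.
default = split_size(goal_size=file_size, min_size=1, block_size=block)

# Raising the minimum (cf. mapred.min.split.size) yields fewer, larger splits.
raised = split_size(goal_size=file_size, min_size=256 * MB, block_size=block)

default_count = num_splits(file_size, default)
raised_count = num_splits(file_size, raised)
```

Under these assumptions the 1 GB file yields eight splits by default and four once the minimum is raised to 256 MB.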
Mapping is often used for typical “ETL” processes, such as filtering, transformation, and categorization.

Typically the compute nodes and the storage nodes are the same, that is, the MapReduce framework and the Hadoop Distributed File System are running on the same set of nodes.

The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.

The OutputCollector is provided by the framework to collect data output by the Mapper or the Reducer.

The Reporter class is a facility for MapReduce applications to report progress, set application-level status messages and update Counters. Mapper and Reducer implementations can use the Reporter to report progress or just indicate that they are alive. In scenarios where the application takes a significant amount of time to process individual key/value pairs, this is crucial, since the framework might assume that the task has timed out and kill that task. Another way to avoid this is to set the configuration parameter mapred.task.timeout to a high enough value (or even set it to zero for no time-outs). Applications can also update Counters using the Reporter.

Parallel Map Processes
The number of maps is usually driven by the number of inputs, or the total number of blocks in the input files. The right level of parallelism for maps seems to be around 10-100 maps per node, although it has been set up to 300 maps for very CPU-light map tasks. Task setup takes a while, so it is best if the maps take at least a minute to execute.

If you expect 10TB of input data and have a block size of 128MB, you'll end up with about 82,000 maps, unless setNumMapTasks(int) (which only provides a hint to the framework) is used to set it even higher.
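The 82,000 figure above is simply the input size divided by the block size. A small Python check of that arithmetic, assuming binary units (1 TB = 1024^4 bytes) and one map per block; `estimated_maps` is a name chosen for this example:

```python
TB = 1024 ** 4
MB = 1024 ** 2


def estimated_maps(input_bytes, block_bytes):
    # One map per block, rounding up for a final partial block.
    return -(-input_bytes // block_bytes)


maps = estimated_maps(10 * TB, 128 * MB)  # 81920, which the notes round to 82,000
```

At one minute or more per map, as the notes recommend, that count is also a useful first estimate of total map-slot time for the job.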
The Partitioner controls the balancing of the keys of intermediate map-outputs. The key (or a subset of the key) is used to derive the partition, typically by a hash function on the key (or a portion of it).

All Mapper outputs are sorted and then partitioned per Reducer.

The total number of partitions is the same as the number of Reduce tasks for the job. The Partitioner is like a load balancer that controls which one of the Reduce tasks the intermediate key (and hence the record) is sent to for reduction.

Users can control which keys (and hence records) go to which Reducer by implementing a custom Partitioner.

All intermediate values associated with a given output key are subsequently grouped by the framework, and passed to the Reducer to determine the final output. Users can control that grouping by specifying a Comparator via JobConf.setOutputKeyComparatorClass(Class).

MapReduce comes bundled with a library of generally useful Mapper, Reducer, and Partitioner classes.
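A hash-based partitioner of the kind described above can be sketched in Python as "hash the key, take it modulo the number of reducers". This is a sketch of the idea, not Hadoop's Partitioner interface: `zlib.crc32` is a stand-in hash chosen because it is deterministic across runs, and `custom_partition` is a hypothetical custom rule.

```python
import zlib


def partition(key, num_reducers):
    # Deterministic stand-in for a key hash. Python's built-in hash() is
    # randomized per process for strings, so crc32 is used instead.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def custom_partition(key, num_reducers):
    # Hypothetical custom rule: route on the first character only, so all
    # keys sharing a first letter land on the same reducer.
    return ord(key[0]) % num_reducers


keys = ["apple", "avocado", "banana", "cherry", "apple"]
assignments = [(k, partition(k, 3)) for k in keys]
```

Whatever function is used, the property that matters is that equal keys always map to the same partition, so that all values for a key meet at a single reducer.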
MapReduce partitions data among individual Reducers, usually running on separate nodes.

The Shuffle phase is determined by how a Partitioner (embedded or custom) assigns key/value pairs to Reducers.

The shuffle is where refinements and improvements are continually being made. In many ways, the shuffle is the heart of MapReduce and is where the “magic” happens.
The Sort phase guarantees that the input to every Reducer is sorted by key.
A Reducer extends MapReduceBase and implements the Reducer interface
- Receives output from multiple mappers
- The reduce() method is called for each <key, (list of values)> pair in the grouped inputs
- It sorts data by key of key/value pair
- Typically iterates through the values associated with a key

It is perfectly legal to set the number of reduce-tasks to zero if no reduction is desired. In this case the outputs of the map-tasks go directly to HDFS, to the output path set by setOutputPath(Path). The framework does not sort the map-outputs before writing them out to HDFS.

Reducer has 3 primary phases: Shuffle, Sort and Reduce.

Shuffle
Input to the Reducer is the sorted output of the mappers. In this phase the framework fetches the relevant partition of the output of all the mappers, via HTTP.

Sort
The framework groups Reducer inputs by key (since different mappers may have output the same key) in this stage. The shuffle and sort phases occur simultaneously; while map-outputs are being fetched they are merged.

Secondary Sort
If equivalence rules for grouping the intermediate keys are required to be different from those for grouping keys before reduction, then one may specify a Comparator via JobConf.setOutputValueGroupingComparator(Class). Since JobConf.setOutputKeyComparatorClass(Class) can be used to control how intermediate keys are grouped, these can be used in conjunction to simulate secondary sort on values.

Reduce
In this phase the reduce() method is called for each <key, (list of values)> pair in the grouped inputs. The output of the reduce task is typically written to HDFS via OutputCollector.collect(WritableComparable, Writable). Applications can use the Reporter to report progress, set application-level status messages and update Counters, or just indicate that they are alive.

The output of the Reducer is not sorted.
OutputFormat has two responsibilities: determining where and how data is written. It determines where by examining the JobConf and checking to make sure it is a legitimate destination. The 'how' is handled by the getRecordWriter() function.

From Hadoop's point of view, these classes come together when the MapReduce job finishes and each reducer has produced a stream of key-value pairs. Hadoop calls checkOutputSpecs with the job's configuration. If the function runs without throwing an Exception, it moves on and calls getRecordWriter, which returns an object that can write that stream of data. When all of the pairs have been written, Hadoop calls the close function on the writer, committing that data to HDFS and finishing the responsibility of that reducer.

MapReduce relies on the OutputFormat of the job to:
- Validate the output-specification of the job; for example, check that the output directory doesn't already exist
- Provide the RecordWriter implementation used to write the output files of the job. Output files are stored in a FileSystem (usually HDFS)

TextOutputFormat is the default OutputFormat.

OutputFormats specify how to serialize data by providing an implementation of RecordWriter. RecordWriter classes handle the job of taking an individual key-value pair and writing it to the location prepared by the OutputFormat. A RecordWriter implements two main functions: 'write' and 'close'. The 'write' function takes key-values from the MapReduce job and writes the bytes to disk. The default RecordWriter is LineRecordWriter, part of the TextOutputFormat mentioned earlier. It writes:
- The key's bytes (returned by the getBytes() function)
- A tab character delimiter
- The value's bytes (again, produced by getBytes())
- A newline character

The 'close' function closes the Hadoop data stream to the output file.

We've talked about the format of output data, but where is it stored?
Again, you've probably seen the output of a job stored in many 'part' files under the output directory like so:

|-- output-directory
|   |-- part-00000
|   |-- part-00001
|   |-- part-00002
|   |-- part-00003
|   |-- part-00004
|   '-- part-00005

from http://www.infoq.com/articles/HadoopOutputFormat
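The record layout attributed to LineRecordWriter above (key bytes, a tab, value bytes, a newline) is easy to reproduce in a Python sketch. `LineWriter` is a hypothetical class written for this example and writes to an in-memory stream rather than HDFS.

```python
import io


class LineWriter:
    """Writes key<TAB>value<NEWLINE> records to a binary stream."""

    def __init__(self, stream):
        self.stream = stream

    def write(self, key, value):
        record = key.encode("utf-8") + b"\t" + value.encode("utf-8") + b"\n"
        self.stream.write(record)

    def close(self):
        # A real writer would close its output stream here; the in-memory
        # stream is only flushed so its contents can still be inspected.
        self.stream.flush()


out = io.BytesIO()
writer = LineWriter(out)
writer.write("hadoop", "3")
writer.write("pig", "1")
writer.close()

written = out.getvalue()
```

Each reducer would hold its own writer, which is why a job's output directory contains one part file per reducer.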
- Mostly "boilerplate" with fill-in-the-blank for datatypes
- Hadoop datatypes have well-defined methods to get/put values into them
- Standard pattern to declare your output objects outside the map() loop scope
- Standard pattern to 1) get Java datatypes 2) do your logic 3) put into Hadoop datatypes
- This map() method is called in a "loop" with successive key, value pairs
- Each time through, you typically write key, value pairs to the reduce phase
- The key, value pairs go to the Hadoop Framework, where the key is hashed to choose a reducer
In the Reducer:
- Reduce code is copied to multiple nodes in the cluster
- The copies are identical
- All key, value pairs with identical keys are placed into a pair of (key, Iterator<values>) when passed to the reducer
- Each invocation of the reduce() method is passed all values associated with a particular key
- The keys are guaranteed to arrive in sorted order
- Each incoming value is a numeric count of words in a particular line of text
- The multiple values represent multiple lines of text processed by multiple map() methods
- The values are in an Iterator, and are summed in a loop
- The sum, with its associated key, is sent to a disk file via the Hadoop Framework
- There are typically many copies of your reduce code
- Incoming key, value pairs are of the datatypes that map emits
- Output key, value pairs go to HDFS, which writes them to disk
- The Hadoop Framework constructs the Iterator from map values that are sent to this Reducer
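The reducer behaviour listed above (keys arriving in sorted order, values delivered as an iterator and summed in a loop) can be imitated in Python for the word-count case. This is a single-process sketch, not the Java Mapper/Reducer code from the slides; it assumes the classic variant in which the mapper emits (word, 1) per occurrence, and `map_line`, `reduce_word` and `run` are names invented here.

```python
from itertools import groupby
from operator import itemgetter


def map_line(line):
    # Emit one (word, 1) pair per word occurrence in the line.
    for word in line.split():
        yield word, 1


def reduce_word(key, values):
    # 'values' is an iterator over every count emitted for this key.
    total = 0
    for value in values:
        total += value
    return key, total


def run(lines):
    emitted = [pair for line in lines for pair in map_line(line)]
    emitted.sort(key=itemgetter(0))  # stands in for the framework's sort
    return [
        reduce_word(key, (value for _, value in group))
        for key, group in groupby(emitted, key=itemgetter(0))
    ]


counts = run(["pigs eat anything", "pigs eat"])
```

The `run` helper plays the role of the framework: the map and reduce functions never see each other, only the pairs passed between them.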
An inverted index is a data structure that stores a mapping from content to its locations in a database file, or in a document or a set of documents.

Can provide what to display for a search.

NOTE: need to deal with OutputCollector() here.
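A minimal Python sketch of an inverted index, mapping each word to the documents that contain it. The document ids and text are made-up sample data, and the sketch builds the index in a single process rather than as a MapReduce job.

```python
from collections import defaultdict


def map_doc(doc_id, text):
    # Emit (word, doc_id) once per distinct word in the document.
    for word in set(text.lower().split()):
        yield word, doc_id


def build_index(docs):
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for word, location in map_doc(doc_id, text):
            index[word].append(location)
    # Sort each posting list so the result does not depend on input order.
    return {word: sorted(ids) for word, ids in index.items()}


docs = {
    "doc1": "Hadoop stores data",
    "doc2": "Pig queries data in Hadoop",
}
index = build_index(docs)
```

A search for a word then becomes a single dictionary lookup that returns the list of matching documents.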
- Pig Latin is a data flow language: a sequence of steps where each step is an operation or command
- During execution each statement is processed by the Pig interpreter and checked for syntax errors
- If a statement is valid, it gets added to a logical plan built by the interpreter
- However, the step does not actually execute until the entire script is processed, unless the step is a DUMP or STORE command, in which case the logical plan is compiled into a physical plan and executed
- After the LOAD statement, point out that nothing is actually loaded! The LOAD command only defines a relation
- The FILTER and GROUP commands are fairly self-explanatory
- HDFS is not even touched until the STORE command executes, at which point a MapReduce application is built from the Pig Latin statements shown here
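The "nothing is loaded until STORE" behaviour can be illustrated with a small lazy pipeline in Python. This is an analogy for how a plan is built up and executed late, not how Pig is implemented; the `Relation` class, the `load` function and the sample rows are all invented for the sketch.

```python
class Relation:
    """Records steps in a plan; nothing runs until store() is called."""

    def __init__(self, source, steps=()):
        self.source = source
        self.steps = tuple(steps)

    def filter(self, predicate):
        # Returns a new relation with one more step in its plan.
        return Relation(self.source, self.steps + (("filter", predicate),))

    def store(self):
        rows = self.source()  # the data is only read here
        for kind, fn in self.steps:
            if kind == "filter":
                rows = [row for row in rows if fn(row)]
        return rows


reads = []


def load():
    reads.append("read")  # lets us observe when the source is touched
    return [("alice", 30), ("bob", 17), ("carol", 45)]


a = Relation(load)                   # like LOAD: only defines a relation
b = a.filter(lambda r: r[1] >= 18)   # like FILTER: extends the plan
reads_before_store = len(reads)
result = b.store()                   # like STORE: the plan finally executes
```

Deferring execution this way is what lets the whole script be turned into a single job rather than one job per statement.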
Grunt
- Pig’s interactive shell
- Enables users to enter Pig Latin interactively
- Grunt will do basic syntax and semantic checking as you enter each line
- Provides a shell for users to interact with HDFS
- To enter Grunt and use the local file system instead: $ pig -x local

In the example script:
- A is called a relation or alias; it is immutable, and is recreated if reused
- myfile is read from your home directory in HDFS
- A has no schema associated with it, so its fields are of type bytearray
- LOAD uses the default function PigStorage() to load data, which assumes TAB-delimited data in HDFS
- The entire file 'myfile' will be read
- "pigs eat anything"
- The elements in A can be referenced by position if no schema is associated: $0, $1, $2, ...
Pig return codes: (type in page 18 of Pig)
The complex types can contain data of any type
The DESCRIBE output is:
describe employees
employees: {name: chararray,age: bytearray,zip: int,salary: bytearray}
Pig includes the concept of a data element being null. Data of any type can be null. It is important to understand that in Pig the concept of null is the same as in SQL, which is completely different from the concept of null in C, Java, Python, etc. In Pig a null data element means the value is unknown. This might be because the data is missing, an error occurred in processing it, etc. In most procedural languages, a data value is said to be null when it is unset or does not point to a valid address or object. This difference in the concept of null is important and affects the way Pig treats null data, especially when operating on it.

from Programming Pig, Alan Gates 2011
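To make the "null means unknown" idea concrete, here is a Python sketch that uses None as a stand-in for a SQL-style null. It assumes the usual SQL-like rules (an operation on an unknown value gives an unknown result, and a filter keeps a row only when its condition is definitely true); the helper names and sample rows are invented.

```python
import operator


def null_safe(op):
    # Wrap a binary operator so that any unknown input gives an unknown result.
    def wrapped(a, b):
        return None if a is None or b is None else op(a, b)
    return wrapped


add = null_safe(operator.add)
gt = null_safe(operator.gt)
le = null_safe(operator.le)


def filter_rows(rows, condition):
    # Keep a row only when the condition is definitely true.
    return [row for row in rows if condition(row) is True]


rows = [("alice", 30), ("bob", None), ("carol", 12)]

over_18 = filter_rows(rows, lambda r: gt(r[1], 18))
at_most_18 = filter_rows(rows, lambda r: le(r[1], 18))
```

The row with the unknown age appears in neither result, which is the behaviour that tends to surprise people who expect null to act like an ordinary unset value.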
- twokey.pig collects all records with the same value for the provided key together into a bag
- Within a single relation, GROUP groups together tuples with the same group key
- Can use keywords BY and ALL with GROUP
- The result can optionally be passed to an aggregate function (count.pig)
- Can group on multiple keys if they are surrounded by parentheses
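What GROUP produces, one (group key, bag of tuples) pair per distinct key, can be sketched in Python. This only models the shape of the result, not Pig itself; `group_by` and the employee tuples are hypothetical, and the sorted output order is a convenience of the sketch.

```python
def group_by(relation, key_fn):
    """Return (group_key, bag) pairs, one per distinct key."""
    bags = {}
    for tup in relation:
        bags.setdefault(key_fn(tup), []).append(tup)
    return sorted(bags.items(), key=lambda pair: pair[0])


employees = [
    ("alice", "sales", 10),
    ("bob", "eng", 20),
    ("carol", "sales", 30),
]

# Group on a single key (the department field).
by_dept = group_by(employees, lambda t: t[1])

# Pass each bag to an aggregate, here a simple count.
counts = [(group, len(bag)) for group, bag in by_dept]

# Group on multiple keys by using a tuple as the group key.
by_dept_and_level = group_by(employees, lambda t: (t[1], t[2]))
```

Grouping on a tuple of fields corresponds to the parenthesized multi-key form mentioned in the notes.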