32. 4 Main Characteristics of Scala
• Runs on the JVM
• Statically Typed
• Object Oriented
• Functional Programming
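The four characteristics above can be sketched in a few lines of Scala (the names here are illustrative, not from the slides):

```scala
// Object oriented: data and behavior live together in a class.
// Statically typed: every signature is checked at compile time.
case class Point(x: Int, y: Int) {
  def moved(dx: Int, dy: Int): Point = Point(x + dx, y + dy)
}

object Demo {
  def main(args: Array[String]): Unit = {
    // Functional: a function is an ordinary value we can pass around.
    val shift: Point => Point = p => p.moved(1, 1)
    // Runs on the JVM: this compiles to bytecode and interoperates with Java.
    println(shift(Point(0, 0))) // prints Point(1,1)
  }
}
```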
33. def map[B](f: (A) ⇒ B): List[B]
Builds a new collection by applying a function to all elements of this list.
def reduce[A1 >: A](op: (A1, A1) ⇒ A1): A1
Reduces the elements of this list using the specified associative binary operator.
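A minimal use of the two signatures above on an ordinary `List` (a sketch, not from the slides):

```scala
object MapReduceDemo {
  def main(args: Array[String]): Unit = {
    val xs = List(1, 2, 3, 4)

    // map: apply a function to every element, building a new list
    val doubled = xs.map(x => x * 2)             // List(2, 4, 6, 8)

    // reduce: combine all elements with an associative binary operator
    val total = doubled.reduce((a, b) => a + b)  // 20

    println(s"$doubled -> $total")
  }
}
```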
37. Recap
Map/Reduce
• Programming paradigm that employs concepts from Functional Programming
Scala
• Functional Language that runs on the JVM
Hadoop
• Open Source Implementation of Map/Reduce on the JVM
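The recap's Map/Reduce idea can be sketched on in-memory Scala collections rather than a Hadoop cluster (the splitting on whitespace is an assumption of this sketch):

```scala
object WordCount {
  // "Map" phase: tokenize every line into words;
  // "shuffle": group identical words together;
  // "Reduce" phase: count the words in each group.
  def count(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("\\s+"))              // map: line -> words
      .filter(_.nonEmpty)
      .groupBy(identity)                     // shuffle: group by word
      .map { case (w, ws) => (w, ws.size) }  // reduce: count per group

  def main(args: Array[String]): Unit =
    println(count(Seq("to be or not", "to be")))
}
```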
43. High level approaches
input_lines = LOAD 'myfile.txt' AS (line:chararray);
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
filtered_words = FILTER words BY word MATCHES '\w+';
word_groups = GROUP filtered_words BY word;
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';
44. User defined functions (UDF)
Pig
-- myscript.pig
REGISTER myudfs.jar;
A = LOAD 'student_data' AS (name: chararray, age: int, gpa: float);
B = FOREACH A GENERATE myudfs.UPPER(name);
DUMP B;
Java
package myudfs;
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.util.WrappedIOException;
public class UPPER extends EvalFunc<String>
{
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0)
      return null;
    try {
      String str = (String)input.get(0);
      return str.toUpperCase();
    } catch (Exception e) {
      throw WrappedIOException.wrap("Caught exception processing input row ", e);
    }
  }
}
47. WordCount in Cascading
package impatient;
import java.util.Properties;
import cascading.flow.Flow;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexFilter;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.Scheme;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class Main {
  public static void main( String[] args )
  {
    String docPath = args[ 0 ];
    String wcPath = args[ 1 ];

    Properties properties = new Properties();
    AppProps.setApplicationJarClass( properties, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

    // create source and sink taps
    Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
    Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

    // specify a regex operation to split the "document" text lines into a token stream
    Fields token = new Fields( "token" );
    Fields text = new Fields( "text" );
    RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
    // only returns "token"
    Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

    // determine the word counts
    Pipe wcPipe = new Pipe( "wc", docPipe );
    wcPipe = new GroupBy( wcPipe, token );
    wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef()
      .setName( "wc" )
      .addSource( docPipe, docTap )
      .addTailSink( wcPipe, wcTap );

    // write a DOT file and run the flow
    Flow wcFlow = flowConnector.connect( flowDef );
    wcFlow.writeDOT( "dot/wc.dot" );
    wcFlow.complete();
  }
}
75. Fail-fast with type errors
(Int,Int,Int,Int)
TypedPipe[(Meters,Miles,Celsius,Fahrenheit)]
76. Fail-fast with type errors
(Int,Int,Int,Int)
TypedPipe[(Meters,Miles,Celsius,Fahrenheit)]
val w = 5
val x = 5
val y = 5
val z = 5
w + x + y + z = 20
77. Fail-fast with type errors
(Int,Int,Int,Int)
TypedPipe[(Meters,Miles,Celsius,Fahrenheit)]
val w = 5
val x = 5
val y = 5
val z = 5
w + x + y + z = 20
val w = Meters(5)
val x = Miles(5)
val y = Celsius(5)
val z = Fahrenheit(5)
w + x + y + z
=> type error
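The fail-fast idea on the slides above can be sketched with plain Scala case classes; `Meters` and `Miles` here are hypothetical unit wrappers written for this sketch, not Scalding's own types:

```scala
// Each unit gets its own type, and + is only defined between like units,
// so mixing units is rejected by the compiler instead of producing a
// silently wrong sum of raw Ints at run time.
case class Meters(value: Int) { def +(o: Meters): Meters = Meters(value + o.value) }
case class Miles(value: Int)  { def +(o: Miles): Miles   = Miles(value + o.value) }

object Units {
  def main(args: Array[String]): Unit = {
    val a = Meters(5)
    val b = Meters(3)
    println(a + b)           // same units: compiles and runs
    // Meters(5) + Miles(5)  // does not compile: type mismatch, caught early
  }
}
```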