Wednesday, May 15, 2013

Football zero, Apache Pig hero – the essence of hundreds of posts from the Apache Pig user mailing list.


I am a big fan of football and I really like reading football news. Last week, however, I definitely overdid it (because Poland played against England in a World Cup 2014 qualifying match). Fortunately, I realized that it is not the best use of my time, so today I decided that my next 7 days will be different. Instead, I will read posts from the Apache Pig user mailing list!
The idea is simple: anytime I feel like reading about football, I will read a post from the mailing list instead. It means Football zero, Apache Pig hero for me this week ;)

In this post I will share all the interesting information, links, tricks, suggestions and maybe exceptions and bugs related to Pig that I have learnt by reading the posts. I started by reviewing archived posts from April until today, which sums up to 234 threads.
  1. RANK operator like in SQL will be available in Pig 0.11
  2. As described in PIG-2353, the RANK BY operator prepends a consecutive integer to each tuple in the relation, starting from 1, based on some ordering. My quick fancy example:
    Runner = LOAD 'runners.dat' AS (name, timeInSeconds);
    Reward = LOAD 'rewards.dat' AS (position, prize);
     
    RunnerRanked = RANK Runner BY timeInSeconds ASC;
    RunnerRewardJoin = JOIN RunnerRanked BY rank*, Reward BY position;
    RunnerReward = FOREACH RunnerRewardJoin GENERATE RunnerRanked::name, Reward::prize;
    * I have not verified the name of the prepended field containing the ranking position, but without loss of generality I simply assumed rank.
  3. Macros can be imported from jars in Pig 0.11
  4. It is described in PIG-2850. The goal is to distribute macros in jars in the same way as UDFs. Then, if you REGISTER a jar, you can easily IMPORT the macros that it contains. Example:
    REGISTER my_udfs_and_macros.jar;
    IMPORT 'some_path/my_macros.pig';
    If my_udfs_and_macros.jar contains some_path/my_macros.pig, the macros will be imported.
  5. Machine Learning in Pig
  6. There is an interesting publication by Jimmy Lin and Alek Kolcz from Twitter where the authors present a case study of Twitter’s integration of machine learning tools into its Pig-centric analytics platform.
    Btw, there is another (both practical and controversial) paper by Jimmy Lin titled “MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That’s Not a Nail!” that I definitely recommend reading (Pig is mentioned as well).
  7. Analyzing Big Data with Twitter
  8. I highly recommend visiting the Analyzing Big Data with Twitter course (a special UC Berkeley iSchool course) that contains awesome pre-recorded videos about Big Data, Hadoop, Pig and real-world use cases!
    Additionally, the series of blog posts titled “Analyzing Twitter Data with Hadoop” by Jonathan Natkins is available at Cloudera’s blog (part 1 and part 2).
  9. Berkeley lecture on Pig
  10. Continuing the topic, there is a really awesome presentation about Pig given by Jon Coveney (Twitter) at Berkeley. Here are the slides and here is the full presentation.
    I really liked the simple explanation of FLATTEN: it turns Tuples into columns (because Tuples contain columns) and turns Bags into rows (because Bags contain rows).
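    To make this concrete, here is a quick sketch (the file and field names are just my assumptions, not taken from the lecture):

    A = LOAD 'data.dat' AS (t: tuple(x: int, y: int), b: bag {tp: tuple(z: int)});
    -- FLATTEN(t) turns the tuple into columns: one row with fields x and y
    Cols = FOREACH A GENERATE FLATTEN(t);
    -- FLATTEN(b) turns the bag into rows: one output row per tuple in b
    Rows = FOREACH A GENERATE FLATTEN(b);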
  11. Algebraic and Accumulator Interfaces
  12. There were a couple of questions about implementing Pig’s UDFs using the Algebraic and Accumulator interfaces. Both of them are nicely described in “Programming Pig” by Alan Gates (freely available to read at O’Reilly OFPS).
  13. Pig datatypes and memory issues
  14. The Berkeley lecture, Chapter 10 of “Programming Pig” by Alan Gates and one post from the mailing list mention memory issues when dealing with Pig’s datatypes and UDFs. In a few words:
    • Tuples and maps must fit in memory
    • A bag is the only datatype that Pig knows how to spill, so it does not have to fit in memory
    • However, a bag that is too large to fit in memory can still be referenced in a tuple or a map
  15. Ambrose – a platform for visualization and real-time monitoring of data workflows
  16. Twitter’s Ambrose is a really impressive platform for visualization and real-time monitoring of Hadoop workflows (current support is limited to Apache Pig). Since a picture is worth a thousand words, please look at these amazing screenshots from Ambrose’s GitHub:
  17. Running scripts inside grunt
  18. It is possible to execute a Pig script inside Grunt and pass parameters to it:
    grunt> run -param key=value script.pig
  19. Overriding the getArgToFuncMapping method to give a Pig UDF multiple incarnations
  20. According to EvalFunc’s Javadoc:
    getArgToFuncMapping allows a UDF to specify type-specific implementations of itself. For example, an implementation of arithmetic sum might have int and float implementations, since integer arithmetic performs much better than floating point arithmetic. Pig’s typechecker will call this method and, using the returned list plus the schema of the function’s input data, decide which implementation of the UDF to use.
    How does it look in the code? Here is a snippet from SUM where two implementations are used: DoubleSum and LongSum.
    @Override
    public List<FuncSpec> getArgToFuncMapping() throws FrontendException {
       List<FuncSpec> funcList = new ArrayList<FuncSpec>();
       funcList.add(new FuncSpec(this.getClass().getName(), Schema.generateNestedSchema(DataType.BAG, DataType.BYTEARRAY)));
       // DoubleSum works for both Floats and Doubles
       funcList.add(new FuncSpec(DoubleSum.class.getName(), Schema.generateNestedSchema(DataType.BAG, DataType.DOUBLE)));
       funcList.add(new FuncSpec(DoubleSum.class.getName(), Schema.generateNestedSchema(DataType.BAG, DataType.FLOAT)));
       // LongSum works for both Ints and Longs.
       funcList.add(new FuncSpec(LongSum.class.getName(), Schema.generateNestedSchema(DataType.BAG, DataType.INTEGER)));
       funcList.add(new FuncSpec(LongSum.class.getName(), Schema.generateNestedSchema(DataType.BAG, DataType.LONG)));
       return funcList;
    }
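    With such a mapping in place, Pig’s typechecker picks the implementation from the input schema. A quick usage sketch (the file name is my assumption):

    A = LOAD 'nums.dat' AS (n: long);
    G = GROUP A ALL;
    -- with a long input column, the typechecker should choose the LongSum implementation
    S = FOREACH G GENERATE SUM(A.n);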
  21. Comparing Pig to SQL and Hive
  22. There is a thread that contains links to two informative posts comparing Pig with SQL and Hive:
    Comparing Pig Latin and SQL for Constructing Data Processing Pipelines by Alan Gates
    Hive vs. Pig by Lars George (plus great comments by Jeff Hammerbacher).
  23. Streaming data to an external script or program
  24. There was a snippet of code that tried to use the STREAM statement in a nested FOREACH. Actually, a nested FOREACH does not support STREAM (however, CROSS, DISTINCT, FILTER, FOREACH, LIMIT, and ORDER BY are supported).
    This question caught my attention, because I do not usually use STREAM in my daily work (I rather implement Java UDFs) and I wanted to learn more about it. STREAM sends data to an external script or program, so that it is possible to integrate your own code with Pig. You may read more about it in … Alan Gates’s book (Section “Stream” in Chapter 6).
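    A minimal sketch of STREAM (the script name my_filter.py and its behaviour are just my assumptions):

    -- ship the script to the cluster and pipe each tuple through it
    DEFINE my_cmd `my_filter.py` SHIP('my_filter.py');
    Data = LOAD 'data.dat' AS (name: chararray, value: int);
    Filtered = STREAM Data THROUGH my_cmd AS (name: chararray, value: int);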
  25. Pig eats bags with an appetite
  26. How do you read a file in the following format (which contains a bag that probably should have been flattened before storing it in a file)?
    doc1    {(doc1,1),(doc1,2),(doc1,3),(doc1,4)}
    doc2    {(doc2,1),(doc2,2),(doc2,3)}
    Pig’s philosophy says that “Pigs Eat Anything” (Pig can operate on data whether it has metadata or not; it can operate on data that is relational, nested, or unstructured).
    Pig eats bags with an appetite!
    doc_grp = load 'doc-grouped.dat' as (doc: chararray, b: bag {t: tuple(doc: chararray, no: int)});
    More examples of reading complex types (bags, tuples and maps) are presented in Pig’s documentation.
  27. Embedding instructions to make code more compact
  28. You can find excellent tutorials (with slightly different syntax) on TF-IDF in Pig by Jacob Perkins and Russell Jurney.
    However, what interested me the most was the syntax for grouping (as well as joining, crossing, sorting etc.) on the fly. Here is my example that uses this “flat-wide” syntax ;) to find the pairs of authors who contributed to the largest number of documents:
    Doc = LOAD 'coansys.dat' AS (docId: chararray, personId: chararray);
    Doc2 = FOREACH Doc GENERATE *;
     
    Pair = FILTER (JOIN Doc BY docId, Doc2 BY docId) BY Doc::personId < Doc2::personId;
    Counted = FOREACH (GROUP Pair BY (Doc::personId, Doc2::personId)) GENERATE FLATTEN(group), COUNT(Pair) AS docCnt;
    Top = LIMIT (ORDER Counted BY docCnt DESC, Doc::personId ASC) 3;
     
    DUMP Top;
  29. Mock Loader and Storer to simplify unit testing of Pig scripts
  30. Patch PIG-2650, available in Pig 0.10.1, gives a developer more flexibility when testing Pig Latin scripts, providing convenient access to the output produced by the script.
  31. Difference between COUNT and COUNT_STAR
  32. COUNT_STAR includes NULL values in the count computation (unlike COUNT, which ignores NULL values).
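    A quick sketch of the difference (the input file is my assumption):

    -- values.dat contains records (1), (), (3); the second one has a NULL field
    V = LOAD 'values.dat' AS (v: int);
    G = GROUP V ALL;
    -- COUNT(V) should ignore the NULL and return 2, while COUNT_STAR(V) returns 3
    Cnt = FOREACH G GENERATE COUNT(V), COUNT_STAR(V);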
  33. Hadoop counters in Apache Pig
  34. Hadoop counters are available in Apache Pig. In Pig 0.9.2, you may use them in the following way:
    // TEMPERATURE is a user-defined enum, e.g. enum TEMPERATURE { NEGATIVE, POSITIVE }
    PigStatusReporter reporter = PigStatusReporter.getInstance();
    if (reporter != null) {
      reporter.getCounter(TEMPERATURE.NEGATIVE).increment(1);
    }
    There is an informative post about Hadoop counters in Pig that points to ElephantBird’s class called PigCounterHelper.

Credits

The community around the Pig user mailing list is really active and helpful. As a result, each question is answered really quickly and one can learn a lot just by reading the list (I hope that my post proves it, because I have learnt a lot). A large contribution comes from Pig’s committers; however, there is also a great number of non-committers who have been making awesome contributions to Pig, and they are nicely rewarded in this cool and touching message ;)
