I am a big fan of football and I really like reading football news. Last week, however, I definitely overdid it (because Poland played against England in a World Cup 2014 qualifying match). Fortunately, I realized that it is not the best way to spend my time, and today I decided that my next 7 days will be different. Instead, I will read posts from the Apache Pig user mailing list!
The idea is simple: read posts from the mailing list anytime I feel like reading about football. It means Football zero, Apache Pig hero for me this week ;)
In this post I will share all the interesting information, links, tricks, suggestions and maybe exceptions and bugs related to Pig that I have learnt by reading the posts. I will start by reviewing archived posts from April until today, which sums up to 234 threads.
- RANK operator like in SQL will be available in Pig 0.11
- Macros can be imported from jars in Pig 0.11
- Machine Learning in Pig
- Analyzing Big Data with Twitter
- Berkeley lecture on Pig
- Algebraic and Accumulator Interfaces
- Pig datatypes and memory issues
- Tuple and map must fit in memory
- Bag is the only datatype that Pig knows how to spill, so that it does not have to fit in memory
- However, bags that are too large to fit in memory can still be referenced in a tuple or a map
- Ambrose – a platform for visualization and real-time monitoring of data workflows
- Running scripts inside grunt
- Overriding the getArgToFuncMapping method to give a Pig UDF multiple incarnations
- Comparing Pig to SQL and Hive
- Streaming data to an external script or program
- Pig eats bags with an appetite
- Embedding instructions to make code more compact
- Mock Loader and Storer to simplify unit testing of Pig scripts
- Difference between COUNT and COUNT_STAR
- Hadoop counters in Apache Pig
As described in PIG-2353, the RANK BY operator prepends a consecutive integer to each tuple in the relation, starting from 1, based on some ordering. My quick fancy example:
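Here, I rank football players by the number of goals they scored (a minimal sketch; the input file, field names and data are all made up):

```pig
-- sample data assumed: one (name, goals) pair per line
players = LOAD 'players.tsv' AS (name:chararray, goals:int);
ranked  = RANK players BY goals DESC;
-- each tuple now has its ranking position prepended as the first field
DUMP ranked;
```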
* I have not verified the name of the prepended field containing the ranking position, but without loss of generality I simply assumed rank.
It is described in PIG-2850. The goal is to distribute macros in jars in the same way as UDFs. Then, if you REGISTER a jar, you can easily IMPORT the macros that this jar contains. Example:
If my_udfs_and_macros.jar contains some_path/my_macros.pig, then the macros will be imported.
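A minimal sketch of how this is supposed to work (using the hypothetical jar and macro file mentioned above):

```pig
REGISTER my_udfs_and_macros.jar;
-- the macro file is resolved inside the registered jar
IMPORT 'some_path/my_macros.pig';
-- from here on, the macros defined in my_macros.pig can be invoked in the script
```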
There is an interesting publication by Jimmy Lin and Alek Kolcz from Twitter where the authors present a case study of Twitter’s integration of machine learning tools into its Pig-centric analytics platform.
Btw, there is another (both practical and controversial) paper by Jimmy Lin titled “MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That’s Not a Nail!” that I definitely recommend reading (Pig is mentioned as well).
I highly recommend the Analyzing Big Data with Twitter course (a special UC Berkeley iSchool course), which contains awesome pre-recorded videos about Big Data, Hadoop, Pig and real-world use cases!
Additionally, the series of blog posts titled “Analyzing Twitter Data with Hadoop” by Jonathan Natkins is available at Cloudera’s blog (part 1 and part 2).
Continuing the topic, there is a really awesome presentation about Pig given by Jon Coveney (Twitter) at Berkeley. Here are the slides and here is the full presentation.
I really liked the simple explanation of FLATTEN: it turns tuples into columns (because tuples contain columns) and turns bags into rows (because bags contain rows).
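A minimal sketch of the “bags into rows” part (the relation and field names below are my own assumptions):

```pig
-- each document carries a bag of authors; FLATTEN turns the bag into separate rows
docs  = LOAD 'docs.tsv' AS (doc_id:int, authors:bag{t:tuple(author:chararray)});
pairs = FOREACH docs GENERATE doc_id, FLATTEN(authors) AS author;
-- a document with 3 authors now produces 3 rows of (doc_id, author)
DUMP pairs;
```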
There were a couple of questions about implementing Pig’s UDFs using the Algebraic and Accumulator interfaces. Both of them are nicely described in “Programming Pig” by Alan Gates (freely available to read at O’Reilly OFPS).
The Berkeley lecture, Chapter 10 from “Programming Pig” by Alan Gates and one post from the mailing list mention memory issues when dealing with Pig’s datatypes and UDFs. In just a few words:
Twitter’s Ambrose is a really impressive platform for visualization and real-time monitoring of Hadoop workflows (current support is limited to Apache Pig). Since a picture is worth a thousand words, please look at these amazing screenshots from Ambrose’s GitHub:
It is possible to execute a Pig script inside Grunt and pass parameters to it:
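For example (the script name and the parameters below are hypothetical):

```
grunt> run -param input=plays.tsv -param output=ranked_players my_script.pig
```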
According to EvalFunc’s Java-doc:
How does it look in the code? Here is a snippet from SUM where two implementations are used: DoubleSum and LongSum.
There is a thread that contains links to two informative posts comparing Pig to SQL and Hive:
* Comparing Pig Latin and SQL for Constructing Data Processing Pipelines by Alan Gates
* Hive vs. Pig by Lars George (plus great comments by Jeff Hammerbacher).
There is a snippet of code that tries to use the STREAM statement in a nested FOREACH. Actually, a nested FOREACH does not support STREAM (however CROSS, DISTINCT, FILTER, FOREACH, LIMIT, and ORDER BY are supported).
This question caught my attention, because I do not usually use STREAM in my daily work (I rather implement Java UDFs) and I wanted to learn more about it. STREAM sends data to an external script or program, so that it is possible to integrate your own code with Pig. You may read more about it in… Alan Gates’ book (Section “Stream” in Chapter 6).
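A minimal sketch of STREAM used at the top level (not inside a nested FOREACH); the external Python script, input file and field names are my own assumptions:

```pig
-- ship a local script to the cluster and pipe every tuple through it
DEFINE goal_filter `python filter_goals.py` SHIP('filter_goals.py');
plays    = LOAD 'plays.tsv' AS (player:chararray, goals:int);
filtered = STREAM plays THROUGH goal_filter AS (player:chararray, goals:int);
DUMP filtered;
```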
How to read a file in the following format (which contains a bag that probably should have been flattened before being stored in a file)?
Pig’s philosophy says that “Pigs Eat Anything” (Pig can operate on data whether it has metadata or not. It can operate on data that is relational, nested, or unstructured).
Pig eats bags with an appetite!
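A minimal sketch of loading a file that contains a bag (the schema and sample line below are my assumptions, not the exact format from the thread):

```pig
-- e.g. a tab-separated line:  1  doc_1  {(alice),(bob)}
docs = LOAD 'docs.tsv' AS (doc_id:int, title:chararray, authors:bag{t:tuple(author:chararray)});
DUMP docs;
```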
More examples of eating (reading) complex types (bags, tuples and maps) are presented in Pig’s documentation.
You can find excellent tutorials (with slightly different syntax) on TF-IDF in Pig by Jacob Perkins and Russell Jurney.
However, what interested me the most was the syntax for grouping (as well as joining, crossing, sorting etc.) on the fly. My example that uses this “flat-wide” syntax ;) to find pairs of authors who contributed to the largest number of documents:
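A minimal sketch of such a query (the input relation and field names are my assumptions); the trick is that the GROUP happens on the fly inside the FOREACH instead of going through a separate alias:

```pig
pairs   = LOAD 'author_pairs.tsv' AS (doc_id:int, author1:chararray, author2:chararray);
-- group on the fly: the GROUP expression is embedded directly in the FOREACH
counted = FOREACH (GROUP pairs BY (author1, author2))
          GENERATE FLATTEN(group) AS (author1, author2), COUNT(pairs) AS docs;
ordered = ORDER counted BY docs DESC;
top     = LIMIT ordered 10;
DUMP top;
```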
Patch PIG-2650, available in Pig 0.10.1, gives developers more flexibility when testing Pig Latin scripts, with convenient access to the output produced by the script.
Hadoop counters are available in Apache Pig. In Pig 0.9.2, you may use them in the following way:
There is an informative post about Hadoop counters in Pig that points to ElephantBird’s class called PigCounterHelper.
Credits
The community around the Pig user mailing list is really active and helpful. As a result, each question is answered really quickly and one can learn a lot by reading the list (I hope this post proves it, because I have learnt a lot). A large contribution comes from Pig’s committers; however, there is also a great number of non-committers who have been making awesome contributions to Pig, and they are nicely rewarded in this cool and touching message ;)