From Data Analysis with Python and PySpark by Jonathan Rioux
This chapter covers using transformers and estimators to turn data into ML features.
From Data Analysis with Python and PySpark by Jonathan Rioux
This article covers
· Using pandas Series UDFs to accelerate column transformations compared to Python UDFs.
· Addressing the cold start of some UDFs using the Iterator of Series UDF.
From Data Analysis with Python and PySpark by Jonathan Rioux
This article covers window functions and the kind of data transformation they enable.
By Chuck Lam, author of Hadoop in Action, Second Edition
In this article, we’ll talk about the challenges of scaling a data processing program and the benefits of using a framework such as MapReduce to handle the tedious chores for you.
By Alex Holmes, author of Hadoop in Practice, Second Edition
YARN was created so that Hadoop clusters could run any type of work. This meant MapReduce had to become a YARN application and required the Hadoop developers to rewrite key parts of MapReduce. This article will demystify how MapReduce works in Hadoop 2.
By Alex Holmes
MapReduce patterns help you write effective code and make efficient use of your data and your Hadoop cluster. It can be just as useful to learn from anti-patterns, which are patterns that are commonly used but are either ineffective or, worse, detrimental in practice. In this article, based on chapter 13 of Hadoop in Practice, you can learn from, and laugh at, mistakes the author made with MapReduce on production clusters, ranging from loading too much data into memory in tasks to going crazy with counters and bringing down the JobTracker.
MapReduce Anti-patterns (PDF)