Tag: hadoop

Robust Machine Learning with ML Pipelines

From Data Analysis with Python and PySpark by Jonathan Rioux

This chapter covers using transformers and estimators to turn data into ML features.

Big Data is Just a Lot of Small Data: using pandas UDF, part 2

From Data Analysis with Python and PySpark by Jonathan Rioux

This article covers

·   Using pandas Series UDFs to speed up column transformations compared to Python UDFs.

·   Addressing the cold start of some UDFs with Iterator of Series UDFs.

Big Data is Just a Lot of Small Data: using pandas UDF

From Data Analysis with Python and PySpark by Jonathan Rioux

This article covers

·   Using pandas Series UDFs to speed up column transformations compared to Python UDFs.

·   Addressing the cold start of some UDFs with Iterator of Series UDFs.

Your Data under a Different Lens: window functions

From Data Analysis with Python and PySpark by Jonathan Rioux

This article covers window functions and the kinds of data transformations they enable.
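
As a rough illustration of the idea behind window functions, the sketch below computes a running total per partition while keeping every input row, which is what sets a window computation apart from a plain group-by aggregate. It uses plain Python rather than PySpark, and the data, the column names, and the `with_running_total` helper are all hypothetical, not taken from the article.

```python
from collections import defaultdict

# Hypothetical sample data: daily sales per store.
rows = [
    {"store": "A", "day": 1, "sales": 10},
    {"store": "A", "day": 2, "sales": 30},
    {"store": "B", "day": 1, "sales": 5},
    {"store": "A", "day": 3, "sales": 20},
    {"store": "B", "day": 2, "sales": 15},
]


def with_running_total(rows, partition_key, order_key, value_key):
    """Add a running total computed within each partition, keeping every row."""
    # Split the rows into partitions (the "partition by" step).
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_key]].append(row)

    result = []
    for key in sorted(partitions):
        running = 0
        # Order the rows inside the partition (the "order by" step).
        for row in sorted(partitions[key], key=lambda r: r[order_key]):
            running += row[value_key]
            # Each output row keeps its original fields plus the window result.
            result.append({**row, "running_total": running})
    return result


windowed = with_running_total(rows, "store", "day", "sales")
for row in windowed:
    print(row)
```

The output has as many rows as the input; a group-by aggregate over `store` would instead return one row per store.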

Hadoop: Understanding MapReduce

By Chuck Lam, author of Hadoop in Action, Second Edition
In this article, we’ll talk about the challenges of scaling a data processing program and the benefits of using a framework such as MapReduce to handle the tedious chores for you.
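
To make the programming model concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps, using a word count as the example. It is plain Python, not Hadoop code; the function names are hypothetical, and the shuffle step written out by hand here stands in for the kind of chore a framework such as MapReduce takes care of across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all the values emitted for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    # Each document could be mapped on a different machine; here it is one loop.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data is a lot of small data", "small data is fine"])
print(counts)
```

Because `map_phase` looks at one document at a time and `reduce_phase` looks at one key at a time, both steps can in principle be spread over many workers without changing the program's logic.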

Hadoop: Understanding MapReduce (PDF)

YARN and how MapReduce works in Hadoop

By Alex Holmes, author of Hadoop in Practice, Second Edition
YARN was created so that Hadoop clusters could run any type of work. This meant MapReduce had to become a YARN application and required the Hadoop developers to rewrite key parts of MapReduce. This article will demystify how MapReduce works in Hadoop 2.

YARN and how MapReduce works in Hadoop (PDF)

MapReduce Anti-patterns

By Alex Holmes
MapReduce patterns help you write effective code and make efficient use of your data and your Hadoop cluster. It can be just as useful to learn from anti-patterns: patterns that are commonly used but are either ineffective or, worse, detrimental in practice. In this article, based on chapter 13 of Hadoop in Practice, you can learn from (and laugh at) mistakes the author made with MapReduce on production clusters, ranging from loading too much data into memory in tasks to going crazy with counters and bringing down the JobTracker.

MapReduce Anti-patterns (PDF)

© 2022 Manning