Tag

scale

Robust Machine Learning with ML Pipelines

From Data Analysis with Python and PySpark by Jonathan Rioux

This chapter covers using transformers and estimators to turn data into ML features.

Collective Communication Pattern: Improving Performance When Parameter Servers Become a Bottleneck

From Distributed Machine Learning Patterns by Yuan Tang

In this article, we introduce the collective communication pattern, a strong alternative to parameter servers when the machine learning model we are building is not too large. It avoids having to tune the ratio between the number of workers and parameter servers.

Big Data is Just a Lot of Small Data: using pandas UDF, part 2

From Data Analysis with Python and PySpark by Jonathan Rioux

This article covers

·   Using pandas Series UDFs to accelerate column transformations compared to Python UDFs.

·   Addressing the cold start of some UDFs using Iterator of Series UDFs.

Big Data is Just a Lot of Small Data: using pandas UDF

From Data Analysis with Python and PySpark by Jonathan Rioux

This article covers

·   Using pandas Series UDFs to accelerate column transformations compared to Python UDFs.

·   Addressing the cold start of some UDFs using Iterator of Series UDFs.

Parameter Server Pattern: Tagging Entities in 8 Millions of YouTube Videos

From Distributed Machine Learning Patterns by Yuan Tang

In this article, we introduce the parameter server pattern, which comes in handy when the model is too large to fit on a single machine, such as the one we would have to build for tagging entities in 8 million YouTube videos.

Your Data under a Different Lens: window functions

From Data Analysis with Python and PySpark by Jonathan Rioux

This article covers window functions and the kind of data transformation they enable.

© 2022 Manning