Machine Learning on Tabular Data by Mark Ryan and Luca Massaron

Imagine a world where every bank transaction, every retail purchase, and every click on a website creates ripples in a vast ocean of data. Now picture this ocean as tables upon tables of organized information: this is tabular data.

As we traverse our digital landscape, we leave trails in these tables that, if deciphered correctly, can unveil countless insights. Welcome to our book, Machine Learning on Tabular Data, an insightful guide to unlocking these mysteries. Dive in and navigate the ever-evolving dynamics of machine learning and deep learning, exclusively focused on this crucial data form. By the time you turn the last page, you’ll be prepared to transform columns and rows into actionable strategies and insights.


 

Who is this book for?

Machine Learning on Tabular Data is more than just a manual; it’s the gateway to transforming raw numbers into actionable insights. But who stands to gain the most from its pages?

This book caters to a wide audience. Data enthusiasts, both beginners and experts, will find it a refreshing dive into the depths of tabular data. Professionals across industries, from tech and finance to healthcare and retail, will discover tools to make their structured datasets work harder for them. Academics, from eager students to seasoned researchers, will appreciate its thorough yet accessible approach to the latest methodologies. Even tech entrepreneurs looking to stand out in today’s competitive market will find insights to give them that edge.

Although the book is designed to be comprehensive, a basic understanding of data concepts and a general familiarity with programming will help readers fully grasp its contents. Whether you’re just curious about the world of data or actively seeking to master it, Machine Learning on Tabular Data is here to guide your journey.

 




 

Decoding Tabular Data

When we talk about data, images, text, and video often dominate the conversation. However, a significant portion of the data that powers industries worldwide is in tabular format. Tabular data, simply put, is structured data organized into tables of rows and columns: our good old Excel sheets, CSV files, and the larger datasets held in relational databases such as Google Cloud Spanner or Amazon Aurora.

Let’s dig a bit deeper.

 

Understanding Tabular Data:
It’s essential to differentiate between tabular data and other structured formats like JSON. Both hold information in a structured form, but JSON can nest records inside one another, while tabular data is organized strictly into rows and columns. Think of an Excel spreadsheet where each row represents an individual and each column denotes one of their attributes, such as age, name, or address.
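To make the contrast concrete, here is a minimal sketch that flattens a few nested JSON records into rows and columns using only the Python standard library. The records, field names, and column choices are hypothetical and purely illustrative.

```python
import csv
import io
import json

# Hypothetical nested JSON records (illustrative data, not from the book).
raw = """
[
  {"name": "Ana", "age": 34, "address": {"city": "Lisbon"}},
  {"name": "Ben", "age": 27, "address": {"city": "Leeds"}}
]
"""
records = json.loads(raw)

# Flatten each nested record into one row with a fixed set of columns.
columns = ["name", "age", "city"]
rows = [
    {"name": r["name"], "age": r["age"], "city": r["address"]["city"]}
    for r in records
]

# Write the rows out as CSV, a common tabular format.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
table_csv = buffer.getvalue()
print(table_csv)
```

Each individual becomes exactly one row, and each attribute becomes exactly one column, which is the shape most tabular models expect.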

Machine Learning or Deep Learning?
The debate is real and longstanding. Strictly speaking, deep learning is a branch of machine learning, and both fall under artificial intelligence; here, “machine learning” refers to classical, non-neural methods. The two approaches handle tabular data differently:

  • Machine Learning: Models like those in scikit-learn or XGBoost are designed to handle structured tabular data efficiently and effectively. They can quickly discern patterns across rows and columns and are particularly effective when the dataset is moderate in size.
  • Deep Learning: Deep learning methods are based on neural networks. For tabular data, specialized architectures such as SAINT and toolkits such as DeepTables come into play. These are powerful when handling massive datasets or when the data exhibits complex nonlinear patterns.

The Art of Feature Engineering:
One of the foundational steps in preparing tabular data for machine learning or deep learning is feature engineering. This involves converting data into the format most suitable for your model. For instance, categorical data (like ‘male’ or ‘female’) can be transformed into numerical data using techniques like one-hot encoding. Domain-specific knowledge can also lead to new features that give a model more information than the original features do.
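As a minimal sketch of both ideas, the snippet below one-hot encodes a categorical column with pandas and then derives a new feature from two existing ones. The table, its column names, and the debt_to_income feature are hypothetical examples chosen for illustration, not taken from the book.

```python
import pandas as pd

# Hypothetical toy table (illustrative values only).
df = pd.DataFrame(
    {
        "age": [34, 27, 45],
        "gender": ["female", "male", "female"],
        "income": [52000, 48000, 61000],
        "debt": [13000, 24000, 6100],
    }
)

# One-hot encoding: replace the categorical column with one 0/1 column
# per category (gender_female, gender_male).
encoded = pd.get_dummies(df, columns=["gender"], dtype=int)

# Domain-driven feature: a ratio that may carry more signal than
# either raw column alone (an assumption made for this example).
encoded["debt_to_income"] = encoded["debt"] / encoded["income"]

print(encoded)
```

Whether a derived feature like this actually helps depends on the problem, so it is worth validating against a held-out set rather than assuming it does.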

Myths Surrounding Deep Learning:
There’s a common misconception that deep learning is always superior. In reality, its effectiveness is context-dependent. For certain tabular data problems, traditional machine learning models like decision trees or gradient-boosted machines might outperform deep neural networks, especially when data is scarce or when interpretability is key.

 

Understanding and processing tabular data is a nuanced affair. Whether you’re leaning towards traditional machine learning techniques or exploring the depths of neural networks, recognizing the nature of your data and the problem at hand is crucial. The era of data-driven decision-making is here, and tabular data remains its unsung hero. Embrace it, understand it, and let it guide your next business move.

 


Out Now! Only on Manning.com


Key lessons:

  • Every digital action, from bank transactions to website clicks, populates a vast sea of organized tables known as tabular data.
  • Machine Learning on Tabular Data delves into decoding this data type, turning it into actionable insights for various professionals, from tech experts to retail specialists.
  • Tabular data dominates industries, consisting of structured information arranged in rows and columns, similar to Excel sheets or CSV files.
  • Machine learning and deep learning handle tabular data differently: classical machine learning is efficient with moderate datasets, while neural network-based deep learning excels with massive, complex data.
  • Proper understanding and feature engineering are pivotal in processing tabular data, and choosing between machine learning and deep learning depends on the data’s nature and the problem’s specifics.