From Software Engineering for Data Scientists by Andrew Treadway
Software Engineering for Data Scientists presents important software engineering principles that will radically improve the performance and efficiency of data science projects.
As a data scientist, being able to hand your code to someone else (a software engineer, or even another data scientist) is often essential to getting your code or model used by others. Messy code spread across Jupyter Notebook files, or scattered across several programming languages or tools, makes transitioning a code base to someone else much more painful and frustrating. It also takes more time and resources to rewrite a data scientist’s code into something that is more readable, maintainable, and ready to be put into production. Learning to think like a software engineer can greatly help a data scientist minimize these frustrations.
That’s where this book comes in. A very important part of data science is effectively conveying your work to your coworkers and clients, and Software Engineering for Data Scientists will teach you software engineering skills that will help you work with developers and other colleagues to do just this—and it will teach you to code better! Learning about software engineering will help you do your job better, both individually and within a larger organization.
This book is written for readers looking to learn how to apply software engineering concepts to data science.
The book is split into four parts:
- Part 1 – Getting started
o This part covers topics such as source control, exception handling, better structuring your code, object-oriented programming (OOP) for data science, and monitoring the progress of your code (such as model training or data extraction).
- Part 2 – Scaling
o Part 2 covers scaling your code effectively. For example, how do you deal with larger datasets? We’ll cover both the computational and memory components of scaling.
- Part 3 – Scheduling, testing, and deployment into production
o Part 3 details how to rigorously test your code, protect your credentials (for example, when connecting to a database to query data), schedule models and data pipelines to run automatically, and package data analytics code into a portable library that can be shared with and downloaded by others.
- Part 4 – Monitoring your data processing and modeling code
o Lastly, Part 4 will teach you how to effectively monitor your code in production. This is especially relevant when you deploy a machine learning model to make predictions on a recurring or automated basis. We’ll cover logging, automated reporting, and how to build dashboards with Python.
In addition to the topics covered in the book, you’ll also get hands-on experience with the code examples. The code examples are meant to be runnable on your own with downloadable datasets, and you’ll find the corresponding files in the GitHub repository. Beyond the examples laid out in the book, you’ll also find “Practice on your own” sections at the end of most chapters so that you can delve further into the material in a practical way.
Who this book is for
The minimally qualified reader has at least introductory Python knowledge and a basic understanding of data science principles. Readers could range from a student majoring in a data science-related field to a professional data scientist with years of experience. A basic foundation in Python is necessary to get the most out of the exercises and sample projects in the book. The data science prerequisites below provide helpful background for learning how to scale data processing techniques and how to put models into production.
Prerequisites
Python programming
- Basic understanding of packages (i.e. what is a package) and how to import and install them
- Python syntax fundamentals, like if statements, loops, and data structures, including lists and dictionaries
- How to create functions
- Introductory pandas knowledge:
o Importing data
o Manipulating data frames (e.g. filtering, creating new columns)
o Getting summary statistics from data frames
Data science
- Understanding of basic machine learning terminology, including:
o What is machine learning
o Familiarity with some ML modeling techniques, such as logistic regression and random forests (the reader is not expected to be an expert in these techniques)
o Model evaluation metrics, such as precision, recall (sensitivity), and accuracy
- Understanding of data pre-processing principles, including:
o Handling missing values
o Statistical correlation
o Outliers
What readers will learn in this book
The purpose of this book is to teach the reader about software engineering principles that are critically useful for data science. These principles can be used in many ways on the job. For example, if you’re working on a modeling project with other teammates, being able to easily share, track, and back up your codebase is crucial. This can be handled using source control, which is covered in the second chapter of the book.
If you want to deploy your model in production so that it’s used by others, being able to handle various issues or errors that arise is critical. This is another key topic that’s covered in the book.
This should help the reader with the following:
- Implementing a machine learning model in production
- Scheduling Python code to automatically run
- “Hardening” code by handling exceptions, restricting inputs, and making code better structured
- Refactoring code so that it is more scalable, better organized, and more efficient
- Applying OOP principles to data science-related code
- Effectively monitoring a model in production
- Creating better-structured code to minimize errors (both bugs in the code and bad inputs to functions, such as the features being fed into a model)
- Scaling code to be able to process large datasets efficiently and effectively, which is an important skill in modern data science
- Effectively testing your code (such as a model or data pipeline) to reduce future issues
- Packaging Python code to be distributed to others, including non-Python users
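To make the idea of “hardening” code a little more concrete, here is a minimal sketch, not taken from the book, of what restricting inputs and handling exceptions might look like around a prediction function. The feature names (`age`, `income`), the helper names (`validate_features`, `safe_predict`), and the toy scoring rule are all hypothetical and exist only for illustration.

```python
import math

# Hypothetical feature names, chosen only for this illustration.
REQUIRED_FEATURES = ("age", "income")


def validate_features(features):
    """Return a cleaned copy of `features`, or raise an error describing the problem."""
    if not isinstance(features, dict):
        raise TypeError("features must be a dict of feature name -> value")

    missing = [name for name in REQUIRED_FEATURES if name not in features]
    if missing:
        raise ValueError(f"missing features: {missing}")

    cleaned = {}
    for name in REQUIRED_FEATURES:
        value = features[name]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError(f"{name} must be numeric, got {type(value).__name__}")
        if math.isnan(value) or value < 0:
            raise ValueError(f"{name} must be a non-negative number, got {value}")
        cleaned[name] = float(value)
    return cleaned


def safe_predict(predict_fn, features, default=None):
    """Validate inputs, then call `predict_fn`; return `default` if the inputs are invalid."""
    try:
        cleaned = validate_features(features)
    except (TypeError, ValueError) as err:
        # Report the problem instead of letting bad input reach the model.
        print(f"Rejected input: {err}")
        return default
    return predict_fn(cleaned)


def toy_model(features):
    """Stand-in for a trained model; the scoring rule is made up."""
    return 1 if features["income"] > 50_000 else 0


print(safe_predict(toy_model, {"age": 41, "income": 72_000}))
print(safe_predict(toy_model, {"age": 41}))
```

Keeping validation in its own function means the same checks can be reused in tests and in a scheduled pipeline, while `safe_predict` decides what happens when a check fails. In this sketch it prints a message and returns a default value; a production system would more likely log the rejection.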
Readers will also learn:
- How to use source control
- How to handle exceptions/errors that arise in your code
- How to effectively test your modeling and data processing code prior to deployment
- How to schedule a model to automatically run on a recurring basis
- How to generate automated reports for monitoring a model in production
The book covers an extensive set of topics, and I hope you find it helpful in your technical journey.
If you want to see more, check out the book here.