Data science preflight checklist.

July 06, 2022 · 9 mins · 1780 words

In aviation, a preflight checklist is a list of tasks that pilots and aircrew perform before takeoff. Its purpose is to improve flight safety by ensuring that no important task is forgotten. The same procedure can be applied before starting a data science project. In this post, I’ll go through all the checks that I run before starting a new project. I assume that you already know how to train and evaluate a model (there are already plenty of resources explaining the typical data->EDA->clean->train->evaluate->repeat loop), so I’m going to focus on all the other parts that you need to succeed. Of course, you don’t need to implement each check yourself; you can delegate the responsibility to other team members. But to succeed, you should be able to answer the question “who is going to take care of this part?” for each one. You don’t need to do everything yourself, just surround yourself with people who know how to do it.

This post covers a list of safety checks to run before starting a new data science project. I’m not saying that you need to implement each check yourself, only that you should know who is going to take responsibility for each step.

Why is it important to run a preflight checklist?

And without further delay, let me present my personal checklist.

Prediction frequency

One of the first things you want to know is whether your model is going to be used in real time or in batch. The requirements and technologies you’ll need in each case can be completely different, so it’s an important question to answer before you start coding.

Real time

Batch

Model execution

Data

Integration

After rollout

Model retraining

Model monitoring

Once the model is deployed, you’ll want to know how good your predictions are, so you’ll need to monitor multiple metrics for your model.
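As a minimal sketch of what such monitoring could look like, the snippet below tracks accuracy over a rolling window of predictions whose true labels have already arrived. The class name `RollingAccuracy`, the window size, and the alert threshold are all assumptions made for illustration; in a real project you would pick the metrics that matter for your model and probably send them to your existing monitoring stack.

```python
from collections import deque


class RollingAccuracy:
    """Tracks accuracy over the most recent `window` labelled predictions.

    Hypothetical helper for illustration, not part of any library.
    """

    def __init__(self, window: int, alert_below: float):
        # Each entry is 1 for a correct prediction and 0 for a wrong one.
        self.outcomes = deque(maxlen=window)
        self.alert_below = alert_below

    def record(self, predicted, actual) -> None:
        """Store the outcome of one prediction once its true label is known."""
        self.outcomes.append(1 if predicted == actual else 0)

    def accuracy(self) -> float:
        if not self.outcomes:
            raise ValueError("no labelled predictions recorded yet")
        return sum(self.outcomes) / len(self.outcomes)

    def should_alert(self) -> bool:
        # Only alert once the window is full, to avoid noise from a few samples.
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.accuracy() < self.alert_below
```

You would call `record` every time a ground-truth label becomes available and check `should_alert` on a schedule. The same pattern works for other metrics (precision, mean absolute error, and so on) by changing what gets stored in the window.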

Feature monitoring

Conclusions

Thanks for sticking with me so far. I covered the basic checks I follow before starting a new project, but I’m sure this list is biased, so let me know if you have a different set of checks or if you believe the list is incomplete.

To finish, let me say that I don’t think a Data Scientist should be able to do all the things in this list. I think of this list more as a team effort than as an individual effort. The point is to be prepared, because as the Latin saying goes, Amat victoria curam: victory loves preparation.


  1. I’ve experienced this problem and it can be a pain in the ass. The solution we ended up with was a common library to compute the features, used both to generate the features for training and at inference time.
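
To make the footnote above more concrete, here is a rough sketch of what a shared feature library can look like. The field names (`amount`, `timestamp`) and the function names are hypothetical, not the ones we actually used; the only point is that training and inference both go through the same `compute_features` function, so the two code paths cannot drift apart.

```python
from __future__ import annotations

import math
from datetime import datetime


def compute_features(raw: dict) -> dict:
    """Single source of truth for the feature logic (hypothetical fields)."""
    ts = datetime.fromisoformat(raw["timestamp"])
    return {
        "log_amount": math.log1p(raw["amount"]),
        "hour_of_day": ts.hour,
        "is_weekend": int(ts.weekday() >= 5),  # Saturday=5, Sunday=6
    }


def build_training_rows(records: list[dict]) -> list[dict]:
    """Training path: apply the shared logic to a batch of historical records."""
    return [compute_features(record) for record in records]


def features_for_request(payload: dict) -> dict:
    """Inference path: apply the same shared logic to one incoming request."""
    return compute_features(payload)
```

Because both entry points delegate to one function, a change to a feature definition is made in one place and picked up by training and serving alike.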