Simplify Your Data Preparation with These Four Lesser-Known Scikit-Learn Classes
Data preparation is famously the least-loved aspect of Data Science. If done right, however, it needn’t be such a headache.

While scikit-learn has fallen out of vogue as a modelling library in recent years given the meteoric rise of PyTorch, LightGBM, and XGBoost, it’s still easily one of the best data preparation libraries out there.

And I’m not just talking about that old chestnut, train_test_split. If you’re prepared to dig a little deeper, you’ll find a treasure trove of helpful tools for more advanced data preparation techniques, all of which are perfectly compatible with other libraries like lightgbm, xgboost, and catboost for subsequent modelling.

Transformer: any object with fit() and transform() methods. You can think of a transformer as an object used for processing your data, and you will commonly chain multiple transformers in a data preparation workflow. For example, you might use one transformer to impute missing values, another to scale features, and another to one-hot encode your categorical variables. MinMaxScaler(), SimpleImputer(), and OneHotEncoder() are all examples of transformers.
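To make the shared interface concrete, here's a minimal sketch (the toy array is illustrative, not from the article) showing that two very different transformers are driven through exactly the same fit()/transform() calls:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# A toy numeric feature with one missing value.
X = np.array([[1.0], [2.0], [np.nan], [4.0]])

# Transformer 1: fill the NaN with the column mean.
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)  # fit_transform() = fit() then transform()

# Transformer 2: rescale the imputed column to the [0, 1] range.
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X_imputed)
```

The key point is that both objects expose the same API, which is what lets scikit-learn compose them into larger workflows regardless of what each transformer does internally.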
