Data Preprocessing in Machine Learning: Step-by-Step Guide

 



Data Preprocessing in Machine Learning

Data preprocessing involves transforming raw data into a format that a machine learning model can use effectively. It is an essential step in the development of machine learning models.


In machine learning projects, it is common to encounter disorganized data. Before carrying out any data-related task, the data must therefore be cleaned and structured appropriately. This is where data preprocessing comes in.


Why do we require data preprocessing? 

Real-world datasets often contain noise, missing values, or irregular formatting that make them unsuitable for direct use in machine learning. Data preprocessing removes these obstacles and prepares the dataset for use in machine learning models, which ultimately improves their accuracy and efficiency.

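Data preprocessing involves the following steps: getting the dataset, importing the libraries, importing the dataset, handling missing data, encoding categorical data, splitting the dataset into a training set and a test set, and feature scaling.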


Get the Dataset

To start building a machine learning model, you need a dataset, since machine learning models rely heavily on data. A dataset is a collection of data assembled for a particular task.


Datasets come in different forms depending on their purpose. For instance, the dataset required to develop a machine learning model for a business problem differs from the one needed to study liver diseases. Each dataset is unique. Typically, we save the dataset as a CSV file for programming purposes, although in some cases we might use other formats, such as HTML or XLSX files.


What exactly is a CSV file? 

CSV ("Comma-Separated Values") files are a plain-text file format for storing tabular data, such as the contents of a spreadsheet. The format works well for large datasets, and programs can read these datasets easily.
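
As a small illustration of the format, the sketch below parses a few lines of CSV text with Python's standard csv module. The column names and values are made up for the example and do not come from any particular dataset.

```python
import csv
import io

# Hypothetical CSV content: a header row followed by two data rows.
csv_text = "Country,Age,Salary\nFrance,44,72000\nSpain,27,48000\n"

# csv.DictReader maps the values in each data row to the header names.
rows = list(csv.DictReader(io.StringIO(csv_text)))

for row in rows:
    # Every value is read as a string; numeric conversion is up to the caller.
    print(row["Country"], row["Age"], row["Salary"])
```

Each line of the file is one record, and the commas mark the boundaries between columns.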

Datasets for real-world problems are available online from several sources, including Kaggle.


Importing the Libraries

We must import a few predefined Python libraries before we can use Python to preprocess data. Each library serves a particular purpose. We will use the following three libraries for data preprocessing:


NumPy:

The NumPy library adds support for mathematical operations to Python code and is essential for scientific computing. It also supports large multidimensional arrays and matrices.


Matplotlib:

Matplotlib is a Python 2D plotting library. To use it, we import its sub-library pyplot, which can create many kinds of charts from Python code.


Pandas:

Pandas is one of the most widely used Python libraries for importing and managing datasets. It is an open-source data manipulation and analysis library.
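
The three libraries are usually imported under short aliases. The snippet below shows the conventional aliases (np, plt, pd) and a quick check that the imports work; the small array and column names are only placeholders.

```python
import numpy as np                # arrays and numerical operations
import matplotlib.pyplot as plt   # pyplot, the plotting interface of Matplotlib
import pandas as pd               # importing and managing tabular data

# Quick check: build a 2x2 NumPy array and wrap it in a pandas DataFrame.
matrix = np.array([[1, 2], [3, 4]])
frame = pd.DataFrame(matrix, columns=["a", "b"])
print(frame)
```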


Importing the datasets 

For our machine learning project, we now need to import the dataset that we have gathered. Before we can import a dataset, the directory that contains it must be set as the working directory, for example in the Spyder IDE.


read_csv() function:

We use the pandas read_csv() function to read a CSV file. With this function, we can read a CSV file either from a local path or from a URL.
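
A minimal sketch of read_csv() follows. To keep the example self-contained, it reads from an in-memory string rather than a file; the data and the file name "Data.csv" mentioned in the comment are hypothetical.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file. In practice you would pass a local
# path such as "Data.csv" (an assumed name) or a URL to read_csv() instead.
csv_text = (
    "Country,Age,Salary,Purchased\n"
    "France,44,72000,No\n"
    "Spain,27,48000,Yes\n"
    "Germany,30,54000,No\n"
)

dataset = pd.read_csv(io.StringIO(csv_text))

# Inspect the result: the first rows and the (rows, columns) shape.
print(dataset.head())
print(dataset.shape)
```

read_csv() uses the first line as column names by default and infers a type for each column.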


Handling Missing Data:

The next step in data preprocessing is to handle missing data in the dataset. Missing data can cause serious problems for a machine learning model, so missing values must be dealt with before training.


Ways to handle missing data:

There are primarily two techniques to deal with missing data, which are: 


By removing the specific row: The first method simply removes the row or column that contains null values. However, this method is inefficient: deleting data can cause information loss and make the output less accurate.


By computing the mean: In this method, we compute the mean of the column or row that contains missing values and substitute it for each missing value. This approach suits numeric features such as age, income, or year. This is the technique we will use here.

To deal with missing values, we will utilize the Scikit-learn package in our code, which includes a variety of tools for developing machine learning models.
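
Both techniques can be sketched in a few lines. The example below assumes scikit-learn's SimpleImputer class for the mean replacement and pandas' dropna() for row removal; the Age and Salary values are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numeric data with one missing value in each column.
data = pd.DataFrame({
    "Age": [44.0, 27.0, np.nan, 38.0],
    "Salary": [72000.0, np.nan, 54000.0, 61000.0],
})

# Technique 1: drop every row that contains a missing value.
dropped = data.dropna()

# Technique 2: replace each missing value with the mean of its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputed = imputer.fit_transform(data)

print(dropped)   # only the complete rows remain
print(imputed)   # same number of rows, gaps filled with column means
```

Dropping rows here discards half of the data, while mean imputation keeps all four rows, which illustrates the information loss mentioned above.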


Encoding Categorical data:

Categorical data is data whose values fall into a fixed set of categories rather than being numeric.


Since machine learning models are based entirely on mathematics and numbers, a categorical variable in the dataset can cause problems when building the model. These categorical variables must therefore be encoded in numerical form.
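
The text does not name a specific encoder, so the sketch below shows two common scikit-learn options as an assumption: LabelEncoder, which maps categories to integers, and OneHotEncoder, which creates one 0/1 column per category. The category values are made up.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical categorical columns.
countries = np.array(["France", "Spain", "Germany", "Spain"])
purchased = np.array(["No", "Yes", "No", "Yes"])

# Label encoding: each category becomes an integer (categories are sorted).
label_encoder = LabelEncoder()
purchased_encoded = label_encoder.fit_transform(purchased)

# One-hot encoding: one 0/1 column per category, so no order is implied
# between countries. The encoder expects a 2D input, hence the reshape.
onehot_encoder = OneHotEncoder()
countries_encoded = onehot_encoder.fit_transform(
    countries.reshape(-1, 1)
).toarray()

print(purchased_encoded)
print(countries_encoded)
```

One-hot encoding is often preferred for input features with more than two categories, because plain integer codes could be misread by a model as a ranking.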


Splitting the Dataset into the Training set and Test set

During data preprocessing, we split our dataset into two parts: a training set and a test set. This is a critical step because it lets us measure how well the model performs on data it has not seen, which in turn helps us improve it.


Suppose we train our machine learning model on one dataset and then test it on a completely different one. The model will then struggle, because the relationships it learned from the first dataset may not hold in the second.

Even if we train our model very well and its training accuracy is high, its performance can drop when we give it a new dataset. So we always try to build a machine learning model that performs well on both the training set and the test set. These sets can be defined as:


Training Set: A subset of the dataset used to train the machine learning model, for which the output is already known.


Test Set: A subset of the dataset used to test the machine learning model; the model predicts the output for the test set.
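
One common way to make this split is scikit-learn's train_test_split function. The sketch below assumes an 80/20 split and a fixed random_state for reproducibility; both values, and the toy arrays, are choices made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each, and one label per sample.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold out 20% of the samples as the test set; the rest is the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)
```

The samples are shuffled before splitting, and each sample ends up in exactly one of the two sets.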


Feature Scaling

The final step in data preprocessing for machine learning is feature scaling. It is a technique for standardizing the dataset's independent variables within a specific range. In feature scaling, we bring our variables into the same range and scale so that no single variable dominates the others.
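
Two common ways to scale features are standardization (zero mean, unit variance) and min-max normalization (values in the range 0 to 1). The sketch below assumes scikit-learn's StandardScaler and MinMaxScaler; the age and salary values are invented.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features on very different scales: age and salary.
X_train = np.array([
    [44.0, 72000.0],
    [27.0, 48000.0],
    [30.0, 54000.0],
    [38.0, 61000.0],
])
X_test = np.array([[35.0, 58000.0]])

# Standardization: fit on the training set only, then apply the same
# transformation to the test set.
standard_scaler = StandardScaler()
X_train_std = standard_scaler.fit_transform(X_train)
X_test_std = standard_scaler.transform(X_test)

# Min-max normalization: rescale each training column to the range [0, 1].
minmax_scaler = MinMaxScaler()
X_train_minmax = minmax_scaler.fit_transform(X_train)

print(X_train_std)
print(X_train_minmax)
```

Fitting the scaler on the training set alone keeps information about the test set from leaking into the preprocessing step.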