
Data Management

Research suggests that cleaning and organizing data takes up around 60% of a data scientist's time.

Data Collection

Data comes in many forms and file types, which makes it difficult to collect the relevant data in the required format.

Data Preprocessing

Data preprocessing is the process of transforming raw data into an understandable, usable format: cleaning it, filling in any missing values, and organizing it so that it is easy to work with.

There are 4 steps in Data Preprocessing:

Data Cleaning
Data Integration
Data Reduction
Data Transformation

Data Cleaning

Data cleaning is the process of detecting and correcting or removing incorrect, incomplete, and inaccurate data, and of filling in missing data. It involves getting rid of problems in the data such as inconsistencies, missing values, and redundant variables.

Handling Missing Values

A missing value can be replaced with a placeholder such as "NA", with the mean of the attribute when the data is roughly normally distributed, or with the median when it is not. It can also be replaced with the most probable value, that is, the value with the highest chance of occurring. Missing values can be handled in two ways:

Manual: Works fine for smaller datasets.
Automatic: Efficient for larger datasets.
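As a minimal sketch of automatic handling, the function below fills missing entries with the mean or the median of the observed values. The function name `impute`, the use of `None` to mark a missing value, and the sample `ages` column are assumptions made for illustration only.

```python
from statistics import mean, median

def impute(values, strategy="mean"):
    """Return a copy of `values` with None entries replaced by a fill value.

    strategy="mean" suits roughly normal data; "median" is more robust
    to skewed data and outliers.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return [fill if v is None else v for v in values]

# Hypothetical column; None marks the missing value, 94 is an outlier.
ages = [20, 22, None, 24, 94]
mean_filled = impute(ages, "mean")      # missing value becomes 40
median_filled = impute(ages, "median")  # missing value becomes 23
```

In this sample the single outlier pulls the mean up to 40, while the median stays at 23, which is why the median is usually preferred for non-normal data.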

Handling Noisy Data

Noisy data is data that contains errors or inconsistencies.

Methods to handle the noisy data:

Binning: The data is first sorted and then distributed into bins. Once the data is in its bins, each bin is smoothed. Smoothing means removing or replacing erroneous values, here by replacing the values in a bin with a value that represents the bin.

Methods to handle data in bins:

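Smoothing by bin mean: Each value in a bin is replaced with the mean of the bin.
Smoothing by bin median: Each value in a bin is replaced with the median of the bin.
Smoothing by bin boundary: Each value in a bin is replaced with the nearer of the bin's minimum and maximum values.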

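Regression: A regression function is fitted to the data, and noisy values are smoothed by replacing them with the values the function predicts.
Clustering: Similar data items are grouped into clusters. Items that fall outside every cluster can be treated as outliers.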
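The three bin-smoothing methods listed above can be sketched as follows. This is an illustrative sketch only: it assumes equal-frequency bins (the same number of values in each bin), and the function names and the `prices` data are hypothetical.

```python
from statistics import mean, median

def equal_frequency_bins(values, bin_size):
    """Sort the values and split them into consecutive bins of `bin_size`."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_mean(bins):
    """Replace every value in a bin with the bin's mean."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_median(bins):
    """Replace every value in a bin with the bin's median."""
    return [[median(b)] * len(b) for b in bins]

def smooth_by_boundary(bins):
    """Replace each value with the nearer of its bin's min and max."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are min and max
        # Ties go to the lower boundary.
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # hypothetical data
bins = equal_frequency_bins(prices, 4)
by_mean = smooth_by_mean(bins)
by_median = smooth_by_median(bins)
by_boundary = smooth_by_boundary(bins)
```

With these inputs the first bin `[4, 8, 9, 15]` becomes `[9, 9, 9, 9]` under mean smoothing and `[4, 4, 4, 15]` under boundary smoothing.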

Lemmatization

Lemmatization is a linguistic technique that reduces words to their base or root form, known as the lemma. It determines the base form by considering the context and the morphological analysis of each word, which helps capture the word's essential meaning. It is a part of data cleaning.
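To illustrate why context matters, here is a toy sketch that looks up a lemma from a word together with its part of speech. The lookup table, the tag names, and the `lemmatize` function are all hypothetical; a real lemmatizer relies on a full dictionary and morphological rules rather than a handful of entries.

```python
# Toy lookup table: (word, part of speech) -> lemma. Entries are illustrative only.
LEMMAS = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("better", "ADJ"): "good",
    ("mice", "NOUN"): "mouse",
    ("running", "VERB"): "run",
}

def lemmatize(word, pos):
    """Return the lemma of `word` used as `pos`.

    Unknown (word, pos) pairs fall back to the lowercased word itself.
    """
    key = (word.lower(), pos)
    return LEMMAS.get(key, word.lower())

verb_lemma = lemmatize("saw", "VERB")  # "see": past tense of the verb
noun_lemma = lemmatize("saw", "NOUN")  # "saw": the cutting tool
```

The same surface form "saw" maps to two different lemmas, which is the part that a context-free approach such as stripping suffixes cannot get right.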

Data Integration

Multiple heterogeneous data sources are combined into a single dataset.

Types of Data Integration:

Tight Coupling: Data from the sources is combined and stored together in one physical location.
Loose Coupling: The data is not physically integrated. Instead, an interface is created, and the data is combined and accessed through that interface while remaining in its original sources.
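The difference between the two types can be sketched with two small in-memory sources. Everything here is hypothetical: the `crm_rows` and `billing_rows` sources, their field names, and the `CustomerView` interface are illustrative stand-ins for real systems.

```python
# Two hypothetical sources describing the same customers with different schemas.
crm_rows = [{"cust_id": 1, "name": "Ana"}, {"cust_id": 2, "name": "Ben"}]
billing_rows = [{"customer": 1, "balance": 120.0}, {"customer": 2, "balance": 0.0}]

def integrate_tight(crm, billing):
    """Tight coupling: copy both sources into one physical, unified dataset."""
    balances = {row["customer"]: row["balance"] for row in billing}
    return [
        {"id": row["cust_id"], "name": row["name"],
         "balance": balances.get(row["cust_id"])}
        for row in crm
    ]

class CustomerView:
    """Loose coupling: leave the sources in place and combine them per query."""

    def __init__(self, crm, billing):
        self._crm = crm
        self._billing = billing

    def get(self, customer_id):
        name = next(
            (r["name"] for r in self._crm if r["cust_id"] == customer_id), None)
        balance = next(
            (r["balance"] for r in self._billing if r["customer"] == customer_id), None)
        return {"id": customer_id, "name": name, "balance": balance}

warehouse = integrate_tight(crm_rows, billing_rows)  # a separate combined copy
view = CustomerView(crm_rows, billing_rows)          # reads the sources on demand
```

One consequence of the design choice: if a source changes later, `view.get(...)` reflects the change immediately, whereas `warehouse` stays as it was until the integration is run again.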

Data Reduction

The volume of data is reduced in order to make the analysis easier.

Methods of Data Reduction:

Dimensionality Reduction: The number of input variables is reduced, because a large number of input variables can lead to poor performance.
Data Cube Aggregation: Raw, individual pieces of data are combined into summaries to construct a data cube. Redundant or noisy data, if present, is removed in the process.
Attribute Subset Selection: Only the highly relevant attributes are used, and the others are discarded.
Numerosity Reduction: Instead of storing the entire dataset, only a model or a sample of the data is stored.
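Two of these methods are easy to sketch with the standard library: aggregation, shown here as rolling quarterly rows up to yearly totals, and numerosity reduction by simple random sampling. The `sales` records, the function names, and the fixed `seed` are assumptions for illustration.

```python
import random
from collections import defaultdict

# Hypothetical quarterly sales records: (year, quarter, amount).
sales = [
    (2022, "Q1", 200), (2022, "Q2", 250), (2022, "Q3", 300), (2022, "Q4", 350),
    (2023, "Q1", 220), (2023, "Q2", 260), (2023, "Q3", 310), (2023, "Q4", 400),
]

def aggregate_by_year(records):
    """Aggregation: roll quarterly rows up to one total per year."""
    totals = defaultdict(int)
    for year, _quarter, amount in records:
        totals[year] += amount
    return dict(totals)

def sample_rows(records, k, seed=0):
    """Numerosity reduction: simple random sample of k rows, without replacement."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    return rng.sample(records, k)

yearly = aggregate_by_year(sales)  # 8 rows reduced to 2 totals
subset = sample_rows(sales, 3)     # 3 of the 8 rows
```

Aggregation keeps exact totals but loses the quarterly detail, while sampling keeps full detail for fewer rows, so the right choice depends on what the later analysis needs.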

Data Transformation

Data is transformed into a form that is suitable for the mining process.

There are 4 methods of Data Transformation:

Normalization
Attribute Construction (sometimes listed as Attribute Selection): New attributes are created from the existing ones.
Discretization
Concept Hierarchy Generation
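As a rough sketch of two of these methods, the code below applies min-max normalization, one common form of normalization that rescales values into a target range, and discretization, which maps numeric values to labelled intervals. The function names, cut points, and labels are hypothetical choices for the example.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so that min -> new_min and max -> new_max."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("all values are equal; min-max scaling is undefined")
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def discretize(value, cut_points, labels):
    """Return the label of the first interval whose upper bound exceeds `value`.

    `cut_points` must be sorted, and `labels` needs one more entry than
    `cut_points`; the last label covers everything above the final cut point.
    """
    for upper, label in zip(cut_points, labels):
        if value < upper:
            return label
    return labels[-1]

ages = [18, 30, 42, 66]  # hypothetical attribute values
scaled = min_max_normalize(ages)
groups = [discretize(a, [30, 60], ["young", "middle", "senior"]) for a in ages]
```

Here `scaled` is `[0.0, 0.25, 0.5, 1.0]` and `groups` is `["young", "middle", "middle", "senior"]`.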

Data Augmentation

There are various reasons why data might need to be augmented, but one of the most challenging is the lack of labels. Real-world data is frequently unlabelled, and labelling it is a difficult task in itself.

If data is not correctly labelled, it can cause problems in training. One common way to deal with data labelling is to use a data annotation platform to manage your training data in one place.

How to clean and transform data
