1- Normalization:
Summary: Scaling data to a standard range (typically 0 to 1) so that variables can be compared fairly.
- Definition: Normalization is the process of rescaling numeric data to a common scale, usually the range 0 to 1, so that differences in units or magnitude do not distort the analysis.
- Methods:
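  - Min-Max Scaling: Rescales the data to a fixed range (usually 0 to 1) using the minimum and maximum values of the variable: x' = (x - min) / (max - min).
  - Z-Score Standardization: Rescales the data to have a mean of 0 and a standard deviation of 1: z = (x - mean) / std. Unlike min-max scaling, the result is not confined to a fixed range.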
- Purpose: Normalization puts variables on comparable scales so that those with larger numeric ranges do not dominate the analysis.
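
The two normalization methods can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a reference implementation: the function names `min_max_scale` and `z_score`, the sample `heights_cm` values, and the handling of constant columns are all assumptions made for the example. The z-score version uses the population standard deviation; some tools use the sample standard deviation instead.

```python
from statistics import mean, pstdev


def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column has no spread, so map every value to new_min.
        return [new_min for _ in values]
    span = hi - lo
    return [new_min + (v - lo) * (new_max - new_min) / span for v in values]


def z_score(values):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Assumption: a constant column is mapped to all zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical sample data for illustration only.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
scaled = min_max_scale(heights_cm)
standardized = z_score(heights_cm)
```

With this sample, `scaled` runs from 0 to 1 in equal steps, while `standardized` is centered on 0 and extends beyond the -1 to 1 interval.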
2- Aggregation:
Summary: Combining data points into summary statistics (e.g., sums, averages) to simplify and focus on essential information.
- Definition: Aggregation involves combining multiple data points into summary statistics to reduce the complexity of the dataset while retaining essential information.
- Methods:
  - Summation: Adding up values within groups or categories.
  - Averaging: Calculating the mean value within groups or categories.
  - Counting: Counting the number of occurrences within groups or categories.
  - Other Summary Statistics: Calculating other statistics such as median, mode, or standard deviation.
- Purpose: Aggregation simplifies large datasets, making them easier to analyze and interpret while preserving the main trends; the detail within each group is lost.
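
As a rough sketch of the aggregation methods listed above, the following groups records by a key and computes a sum, mean, count, and median per group using only the standard library. The `aggregate` helper and the `sales` records are hypothetical and exist only for this illustration; a dataframe library would normally do the same job with a group-by operation.

```python
from collections import defaultdict
from statistics import mean, median


def aggregate(records):
    """Group (key, value) pairs by key and compute summary statistics per group."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return {
        key: {
            "sum": sum(vals),
            "mean": mean(vals),
            "count": len(vals),
            "median": median(vals),
        }
        for key, vals in groups.items()
    }


# Hypothetical (region, amount) records for illustration only.
sales = [
    ("north", 120.0),
    ("south", 80.0),
    ("north", 100.0),
    ("south", 60.0),
    ("north", 140.0),
]
summary = aggregate(sales)
```

Five records collapse into two summary rows, one per region, which shows both the simplification and the loss of individual values.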
3- Encoding:
Summary: Converting categorical variables into numerical format (one-hot encoding, label encoding) for analysis and visualization.
- Definition: Encoding is the process of converting categorical variables into numerical format, which is necessary for many machine learning algorithms and statistical analyses.
- Methods:
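  - One-Hot Encoding: Creates one binary column per category in a categorical variable; each column is 1 when the row belongs to that category and 0 otherwise.
  - Label Encoding: Assigns a unique integer to each category in a categorical variable. Because integers are ordered, this can suggest a ranking among categories that does not actually exist.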
- Purpose: Encoding allows categorical variables to be used in mathematical models and analyses, so the information they carry is not discarded.
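
Both encoding methods can be illustrated with a short standard-library sketch. The helper names `label_encode` and `one_hot_encode` and the `colors` sample are assumptions for this example, as is the choice to order categories alphabetically; libraries may assign labels or column order differently.

```python
def label_encode(values):
    """Map each distinct category to an integer, in alphabetical order (an assumption)."""
    categories = sorted(set(values))
    mapping = {cat: i for i, cat in enumerate(categories)}
    return [mapping[v] for v in values], mapping


def one_hot_encode(values):
    """Return one binary row per value, with one column per distinct category."""
    categories = sorted(set(values))
    rows = [[1 if v == cat else 0 for cat in categories] for v in values]
    return rows, categories


# Hypothetical sample data for illustration only.
colors = ["red", "green", "blue", "green"]
labels, mapping = label_encode(colors)
one_hot, columns = one_hot_encode(colors)
```

Each one-hot row contains exactly one 1, whereas the label-encoded output is a single column of integers whose ordering carries no meaning for unordered categories such as colors.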