Day 3: End-to-End Generative AI Pipeline | Part 2
👋🏻 Welcome back to my blog! If you're new here, I recommend starting with Day 1: Introduction to get a solid foundation in generative AI.
On Day 2, we explored the crucial steps of data acquisition and data preparation. Today, we'll delve into the third step: feature engineering. This is where we extract meaningful information from our data to enhance model performance.
Full Playlist: ▶ Master GenAI Series
👋 Hello World, I'm Manav Paul, a 24-year-young spiritual developer on a mission to understand and leverage the power of generative AI. This blog serves as a roadmap, guiding you through my 30-day journey of discovering the fascinating world of generative AI. Let's continue our exploration together!
3. Feature Engineering
Feature Engineering: The Art of Transforming Data into Meaningful Insights
Feature engineering is a crucial step in the machine learning pipeline, involving the creation of new features from raw data to improve model performance. By carefully selecting and transforming features, we can enhance the model's ability to learn patterns and make accurate predictions.
Why Feature Engineering Matters
Improved Model Performance: Well-engineered features can significantly boost a model's accuracy and generalization capabilities.
Reduced Overfitting: By creating informative features, we can reduce the risk of overfitting, where the model becomes too tailored to the training data.
Enhanced Interpretability: Feature engineering can help us understand the underlying relationships between variables and the target outcome.
Before we begin, you need to understand a few concepts:
Understanding Corpus, Vocabulary, Documents, and Words
Corpus:
A large collection of texts.
It can be a single document, a set of documents, or a whole library of texts.
Example: A corpus could be the entire works of Shakespeare, a collection of news articles, or a set of scientific papers.
Vocabulary:
The unique set of words in a corpus.
It doesn't include duplicate words.
Example: If a corpus contains the sentence "The quick brown fox jumps over the lazy dog," the vocabulary (treating words case-insensitively) would be: {the, quick, brown, fox, jumps, over, lazy, dog}
Documents:
Individual units of text within a corpus.
A document can be a paragraph, a sentence, an entire article, or even a single word.
Example: In a corpus of news articles, each article would be a document.
Word:
The smallest unit of language.
It's a single unit of meaning.
Example: In the sentence "The quick brown fox jumps over the lazy dog," each individual token ("The," "quick," "brown," and so on) is a word.
To summarize:
Corpus: The entire text collection.
Vocabulary: The unique words in the corpus.
Documents: Individual units of text within the corpus.
Word: The smallest unit of language.
These concepts are fundamental to natural language processing (NLP) and text mining tasks, such as text classification, sentiment analysis, and machine translation.
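To make these terms concrete, here's a minimal Python sketch (the two example sentences are my own illustrative choices):

```python
# A minimal sketch of corpus, document, word, and vocabulary in plain Python.
corpus = [
    "The quick brown fox jumps over the lazy dog",  # document 1
    "The dog sleeps in the sun",                    # document 2
]

# Each string in the corpus is a document.
for i, document in enumerate(corpus, start=1):
    print(f"Document {i}: {document}")

# Words are the individual tokens that make up a document.
words = corpus[0].lower().split()
print("Words:", words)

# The vocabulary is the set of unique words across the whole corpus.
vocabulary = {word for document in corpus for word in document.lower().split()}
print("Vocabulary:", sorted(vocabulary))
```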
Key Techniques in Feature Engineering
Text Vectorization: Transforming Text into Numerical Representations
Text data, being inherently unstructured, cannot be directly fed into machine learning models. Text vectorization techniques convert text into numerical representations that models can understand.
1. TF-IDF (Term Frequency-Inverse Document Frequency)
- How it works: Assigns weights to words based on their frequency within a document and across the entire corpus.
Example: In a corpus about cars, a distinctive term like "turbocharger" might earn a high TF-IDF score in the few documents that discuss it, while words that appear everywhere, such as "the," "and," or even "car" itself, would score low.
Let's break down the components of TF-IDF:
Term Frequency (TF):
Measures how often a term appears in a document.
Formula:
TF(t,d) = (Number of times term t appears in document d) / (Total number of terms in document d)
Example: Consider the following document:
"The quick brown fox jumps over the lazy dog". Here the TF of the word "the" is 2/9 (counting "The" and "the" case-insensitively, with 9 terms in total).
Inverse Document Frequency (IDF):
Measures how rare a term is across all documents.
Formula:
IDF(t,D) = log(Total number of documents / Number of documents containing term t)
Example: If the word "quantum" appears in only 10 documents out of a total of 10,000, its IDF would be log(10,000 / 10) = log(1,000) ≈ 6.9 (using the natural log), a relatively high value.
TF-IDF:
Combines TF and IDF to give a score for each term in a document.
Formula:
TF-IDF(t,d,D) = TF(t,d) * IDF(t,D)
Example: If a term has a high TF-IDF score, it means that it is both frequent in the document and rare across the entire collection. This indicates that the term is important to the document's content.
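Here's a minimal from-scratch sketch of the TF, IDF, and TF-IDF formulas above (the toy corpus is illustrative):

```python
import math

corpus = [
    "the quick brown fox jumps over the lazy dog",
    "the dog barks at the quick fox",
    "quantum computing is a different topic",
]
documents = [doc.split() for doc in corpus]

def tf(term, doc):
    # TF(t, d) = (times t appears in d) / (total terms in d)
    return doc.count(term) / len(doc)

def idf(term, docs):
    # IDF(t, D) = log(total documents / documents containing t)
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "the" is frequent but appears in most documents -> low TF-IDF;
# "brown" appears only in the first document -> higher TF-IDF.
print(tf_idf("the", documents[0], documents))    # ~0.090
print(tf_idf("brown", documents[0], documents))  # ~0.122
```

In practice you would typically reach for scikit-learn's TfidfVectorizer, which adds smoothing and normalization, so its values differ slightly from this raw formula.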
2. Bag of Words
- How it works: Represents text as a bag of words, ignoring word order.
Example: With the vocabulary sorted alphabetically, the sentence "The quick brown fox jumps over the lazy dog" would be represented as a vector of word counts: [1, 1, 1, 1, 1, 1, 1, 2], where the final 2 counts both occurrences of "the" and word order is discarded.
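A minimal Bag-of-Words sketch with scikit-learn (assumes scikit-learn is installed):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["The quick brown fox jumps over the lazy dog"]

vectorizer = CountVectorizer()          # lowercases and tokenizes by default
counts = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
# ['brown' 'dog' 'fox' 'jumps' 'lazy' 'over' 'quick' 'the']
print(counts.toarray())
# [[1 1 1 1 1 1 1 2]]  <- "the" appears twice; word order is lost
```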
3. Word2Vec
- How it works: Learns word embeddings by considering the context of words in a large corpus of text.
Example: Words like "king" and "queen" might have similar embeddings due to their semantic relationships.
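A minimal Word2Vec sketch using the gensim library (4.x API assumed; the tiny toy corpus is illustrative, so the learned similarities won't be meaningful — real training needs a large corpus):

```python
from gensim.models import Word2Vec

# Each sentence is a list of tokens; Word2Vec learns from word contexts.
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "dog", "chases", "the", "cat"],
]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=100)

print(model.wv["king"][:5])                  # first 5 dimensions of the embedding
print(model.wv.similarity("king", "queen"))  # similar contexts -> similar vectors
```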
4. One-Hot Encoding
- How it works: Converts categorical variables into numerical representations using binary vectors.
Example: The categorical variable "color" with values "red," "green," and "blue" would be encoded as:
Red: [1, 0, 0]
Green: [0, 1, 0]
Blue: [0, 0, 1]
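A minimal one-hot encoding sketch with scikit-learn (note that the encoder orders columns alphabetically, so the column layout differs from the listing above):

```python
from sklearn.preprocessing import OneHotEncoder

colors = [["red"], ["green"], ["blue"], ["green"]]

# sparse_output requires scikit-learn >= 1.2; it returns a dense array.
encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(colors)

print(encoder.categories_)  # [array(['blue', 'green', 'red'], dtype=object)]
print(encoded)
# [[0. 0. 1.]   <- red
#  [0. 1. 0.]   <- green
#  [1. 0. 0.]   <- blue
#  [0. 1. 0.]]  <- green
```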
5. Transformers
- How it works: Advanced neural network architecture that processes sequences of data, such as text, by considering the context of each element.
Example: Transformers can be used to generate highly accurate text representations, leading to improved performance in various NLP tasks.
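A minimal sketch of extracting contextual representations with the Hugging Face transformers library (the model name is an illustrative choice, and the weights download on first run):

```python
from transformers import pipeline

extractor = pipeline("feature-extraction", model="distilbert-base-uncased")

features = extractor("The quick brown fox jumps over the lazy dog")
# One embedding vector per token, each reflecting the token in its context.
print(len(features[0]), "tokens,", len(features[0][0]), "dimensions per token")
```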
Beyond Text Vectorization
Feature engineering techniques extend beyond text data. Some common approaches are listed below, with a combined code sketch after the list:
Numerical Feature Engineering:
Normalization: Scaling numerical features to a common range (e.g., 0 to 1).
Standardization: Transforming features to have zero mean and unit variance.
Feature Creation: Deriving new features from existing ones (e.g., calculating ratios or differences).
Categorical Feature Engineering:
Label Encoding: Assigning numerical labels to categorical values.
Target Encoding: Encoding categorical features based on the target variable.
Time Series Feature Engineering:
Time-Based Features: Extracting features like day of the week, month, or time of day.
Lag Features: Incorporating past values of a variable as features.
Rolling Window Features: Calculating statistics (e.g., mean, median, standard deviation) over a rolling window of past values.
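Here's a combined sketch of these techniques using pandas and scikit-learn (column names and data are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler

df = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=6, freq="D"),
    "price":     [100.0, 102.0, 101.0, 105.0, 110.0, 108.0],
    "category":  ["a", "b", "a", "c", "b", "a"],
})

# Numerical: normalization (0..1) and standardization (zero mean, unit variance)
df["price_norm"] = MinMaxScaler().fit_transform(df[["price"]]).ravel()
df["price_std"] = StandardScaler().fit_transform(df[["price"]]).ravel()

# Categorical: label encoding
df["category_id"] = LabelEncoder().fit_transform(df["category"])

# Time series: time-based, lag, and rolling-window features
df["day_of_week"] = df["timestamp"].dt.dayofweek
df["price_lag_1"] = df["price"].shift(1)
df["price_roll_mean_3"] = df["price"].rolling(window=3).mean()

print(df)
```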
4. Modeling
Modeling: The Heart of Machine Learning
Once we have meticulously engineered our features, the next step is to select and train a suitable machine learning model. This section delves into the two primary approaches to modeling: open-source and paid models.
Open-Source Models
Open-source models offer flexibility and customization, but they require significant computational resources and technical expertise.
Advantages:
Customization: Tailored to specific use cases and datasets.
Cost-Effective: Free to use and train.
Community Support: Access to a large community of developers.
Challenges:
Computational Requirements: Demand for powerful hardware (GPUs, TPUs) and significant computational resources.
Model Training: Time-consuming and complex process.
Deployment: Requires infrastructure setup and maintenance on cloud platforms.
Scalability: Challenges in scaling models to handle large-scale workloads.
Popular Open-Source Frameworks:
TensorFlow: Versatile framework for building and training a wide range of models.
PyTorch: Known for its flexibility and ease of use.
Scikit-learn: Comprehensive library for classical machine learning algorithms.
Paid Models
Paid models, offered by cloud providers, provide a convenient and efficient way to leverage advanced machine learning capabilities without the need for extensive infrastructure and expertise.
Advantages:
Ease of Use: Simple API access and pre-trained models.
Scalability: Easily handle large-scale workloads.
Reduced Infrastructure Costs: No need to manage hardware and software.
Advanced Features: Access to cutting-edge techniques and models.
Challenges:
Cost: Potential for significant costs, especially for large-scale usage.
Vendor Lock-in: Reliance on specific cloud providers.
Data Privacy Concerns: Sharing data with third-party providers.
Popular Paid Model Providers:
OpenAI: Offers a range of powerful language models, including GPT-4.
Google Gemini AI: Provides access to advanced AI models and APIs.
Hugging Face: A platform for sharing and deploying machine learning models, offering paid inference endpoints alongside its open model hub.
Choosing the Right Model
The choice between open-source and paid models depends on various factors:
Budget: Consider the costs of hardware, cloud infrastructure, and model licensing fees.
Expertise: Assess the team's technical skills and resources.
Time Constraints: Evaluate the time required for model training and deployment.
Data Privacy: Determine the level of data sensitivity and compliance requirements.
Scalability: Consider the future growth and scalability needs of the application.
5. Evaluation: Assessing Model Performance
Evaluation is a critical step in the machine learning pipeline, allowing us to assess the quality and effectiveness of our models. It involves measuring how well a model performs on a given task and identifying areas for improvement.
5.1 Intrinsic Evaluation
Intrinsic evaluation focuses on measuring a model's performance on specific tasks using predefined metrics. It is typically conducted by machine learning engineers during the development and testing phases.
Common Intrinsic Evaluation Metrics:
Accuracy: Proportion of correct predictions.
Precision: Ratio of true positive predictions to total positive predictions.
Recall: Ratio of true positive predictions to total actual positive cases.
F1-score: Harmonic mean of precision and recall.
Mean Squared Error (MSE): Measures the average squared difference between predicted and actual values.
Mean Absolute Error (MAE): Measures the average absolute difference between predicted and actual values.
Root Mean Squared Error (RMSE): Square root of MSE.
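A minimal sketch of these metrics with scikit-learn, using made-up predictions:

```python
from sklearn.metrics import (accuracy_score, f1_score, mean_absolute_error,
                             mean_squared_error, precision_score, recall_score)

# Classification: 1 = positive class
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

print("Accuracy:", accuracy_score(y_true, y_pred))    # 4/6 ≈ 0.67
print("Precision:", precision_score(y_true, y_pred))  # 3/4 = 0.75
print("Recall:", recall_score(y_true, y_pred))        # 3/4 = 0.75
print("F1:", f1_score(y_true, y_pred))                # 0.75

# Regression
y_true_r = [3.0, 5.0, 2.5]
y_pred_r = [2.5, 5.0, 3.0]

mse = mean_squared_error(y_true_r, y_pred_r)
print("MSE:", mse)                                       # ≈ 0.167
print("MAE:", mean_absolute_error(y_true_r, y_pred_r))   # ≈ 0.333
print("RMSE:", mse ** 0.5)                               # square root of MSE
```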
👋🏻 Hey folks!
I have a whole Cloud Bootcamp project on machine learning using RMSE:
♥ Project → Predict Taxi Fare with a BigQuery ML Forecasting Model
5.2 Extrinsic Evaluation
Extrinsic evaluation involves assessing a model's performance in real-world scenarios, often through user feedback and usage data. It helps identify the model's limitations and areas for improvement.
Common Extrinsic Evaluation Techniques:
User Surveys and Feedback: Gathering qualitative and quantitative feedback from users to assess satisfaction and identify areas for improvement.
A/B Testing: Comparing the performance of different model versions or configurations.
Log Analysis: Analyzing user interactions and system logs to identify patterns and issues.
Continuous Learning: Using user feedback and new data to retrain and improve the model over time.
Example: Large Language Models
Large language models like GPT-4 often incorporate extrinsic evaluation techniques:
User Feedback: Collecting user ratings and comments on generated text to identify strengths and weaknesses.
Reinforcement Learning from Human Feedback (RLHF): Training models to align with human preferences by rewarding desired behaviors and penalizing undesirable ones.
Iterative Refinement: Continuously updating and refining the model based on user feedback and new data.
6. Deployment
Once a model is trained and evaluated, the final step is to deploy it into a production environment. Deployment involves making the model accessible to end-users or integrating it into other systems.
Deployment Strategies:
Cloud-Based Deployment:
Platform as a Service (PaaS): Deploy models to cloud platforms like Heroku, Google App Engine, or AWS Elastic Beanstalk.
Infrastructure as a Service (IaaS): Deploy models to virtual machines or containers on platforms like AWS EC2, Google Compute Engine, or Azure Virtual Machines.
Function as a Service (FaaS): Deploy models as serverless functions on platforms like AWS Lambda, Google Cloud Functions, or Azure Functions.
On-Premise Deployment:
Server-Based Deployment: Deploy models to servers within an organization's infrastructure.
Edge Deployment: Deploy models to edge devices like IoT devices or mobile phones.
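As a concrete example of server-based deployment, here's a minimal sketch that wraps a trained model in an HTTP endpoint with FastAPI (fastapi, uvicorn, and a saved scikit-learn model at "model.pkl" are all illustrative assumptions):

```python
import pickle

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Hypothetical path to a model trained and saved earlier in the pipeline.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

class Features(BaseModel):
    values: list[float]

@app.post("/predict")
def predict(features: Features):
    # Wrap the single feature row in a list: predict expects a 2D input.
    prediction = model.predict([features.values])
    return {"prediction": prediction.tolist()}

# Run locally with: uvicorn app:app --reload
# Then containerize or push to whichever PaaS/IaaS/FaaS option fits your stack.
```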
7. Monitoring and Retraining
Model Monitoring:
Performance Metrics: Continuously track key performance metrics like accuracy, precision, recall, and F1-score.
Data Drift: Monitor changes in the input data distribution to identify potential performance degradation.
Concept Drift: Detect changes in the underlying relationships between features and the target variable.
Retraining:
Scheduled Retraining: Retrain the model at regular intervals to incorporate new data and improve performance.
Trigger-Based Retraining: Retrain the model when performance metrics fall below a certain threshold or when significant data drift is detected.
Incremental Learning: Update the model with new data without retraining the entire model from scratch.
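One simple way to detect data drift is to compare a feature's training distribution against live traffic. Here's a minimal sketch using a two-sample Kolmogorov-Smirnov test (scipy assumed; the synthetic data is illustrative):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
train_feature = rng.normal(loc=0.0, scale=1.0, size=1_000)  # training data
live_feature = rng.normal(loc=0.5, scale=1.0, size=1_000)   # shifted live data

statistic, p_value = ks_2samp(train_feature, live_feature)

# A small p-value suggests the two distributions differ -> possible data
# drift, which could serve as a trigger for retraining.
if p_value < 0.01:
    print(f"Drift detected (p = {p_value:.3g}); consider retraining.")
else:
    print(f"No significant drift (p = {p_value:.3g}).")
```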
🎉 That's a wrap for Day 3!
Kudos to you if you reached the end! You're now all set to get your hands dirty with practical data preprocessing.
In our next post, we'll finally work through hands-on data preprocessing techniques. Stay tuned for more insights and exciting experiments!
▶ Next → Day 4: Hands-On Data Preprocessing
⭐ GitHub Repo: Journey Roadmap
Written by Manav Paul
24, Documenting my journey of DevOps | GenAI | Blockchain | NFTs