Simplifying Arabic Text Preprocessing for NLP: A Step-by-Step Guide

1. Introduction

Natural Language Processing (NLP) in Arabic poses unique challenges due to its rich morphology, diacritics, and dialectal variations. Proper preprocessing is essential for improving the accuracy of NLP models. This guide offers a practical overview of techniques for preprocessing Arabic text, with a focus on Egyptian Arabic, covering normalization, tokenization, stemming, and more.


2. Challenges of Arabic Text Preprocessing

1. Morphological Complexity

Egyptian Arabic words, like their MSA counterparts, have complex morphology: a single word can carry several prefixes and suffixes, which makes stemming and lemmatization difficult.

Example:

  • The verb "بيكتبوا" (they are writing) comes from the root "كتب" (write).

    • A stemmer might incorrectly return "بيكتب" (he is writing) instead of "كتب".

    • Proper lemmatization should return "كتب".

  • The word "مكتبات" (libraries) comes from the same root "كتب" (write).

    • A naive stemmer might return "مكتب" (office), which changes the meaning.
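
To see why surface-level stemming is risky here, a morphological analyzer can enumerate candidate lemmas instead of simply chopping affixes. Below is a minimal sketch using CAMeL Tools' built-in analyzer; it assumes the default (MSA) morphology database has been installed via camel_data, and dialectal forms such as "بيكتبوا" may need the Egyptian database to analyze well.

from camel_tools.morphology.database import MorphologyDB
from camel_tools.morphology.analyzer import Analyzer

# Load the built-in (MSA) morphology database and wrap it in an analyzer
db = MorphologyDB.builtin_db()
analyzer = Analyzer(db)

# Each analysis carries a lemma ('lex') and, where available, a root,
# so "مكتبات" can be linked to its lemma rather than truncated to "مكتب"
for analysis in analyzer.analyze("مكتبات"):
    print(analysis.get("lex"), analysis.get("root"))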

2. Diacritics and Ambiguity

In Egyptian Arabic, as in MSA, a word's meaning can depend on its diacritics, so removing them can introduce ambiguity.

Example:

  • The word "عين" can mean:

    • عِين (eyes)

    • عَيِّن (appoint/select someone)

    • عَيْن (a water spring)

  • The word "ملك" can mean:

    • مَلَك (angel)

    • مَلَكَ (he owned)

Without context, removing diacritics can lead to misinterpretation in NLP tasks.
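
A quick way to see this loss of information is to dediacritize the contrasting forms and compare the results. A minimal sketch using CAMeL Tools' dediac_ar:

from camel_tools.utils.dediac import dediac_ar

# Two different diacritized words collapse to the same bare string
for word in ["مَلَك", "مَلَكَ"]:  # angel vs. he owned
    print(word, "→", dediac_ar(word))

# Both lines print "ملك", so a downstream model must recover the
# intended sense from context alone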

3. Word Segmentation

Egyptian Arabic often combines words with clitics and particles, making segmentation challenging.

Example:

  • "وبالليل" (and at night) consists of:

    • و (and)

    • ب (in)

    • الليل (night)

  • "هقولك" (I will tell you) consists of:

    • ه (future marker)

    • قول (say/tell)

    • ك (you)

Proper tokenization is necessary to split these components correctly.
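
Note that a plain whitespace-and-punctuation tokenizer does not help with such clitics; it keeps them attached to the host word. A minimal sketch with CAMeL Tools' simple tokenizer (the clitic-aware alternative is shown in the tokenization section later in this guide):

from camel_tools.tokenizers.word import simple_word_tokenize

# Splits on whitespace and punctuation only; clitics stay attached
print(simple_word_tokenize("وبالليل هقولك"))
# ['وبالليل', 'هقولك']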

4. Dialectal Variations

Egyptian Arabic differs significantly from MSA, requiring dialect-specific preprocessing for NLP tasks.

Example:

  • "How are you?" in different Arabic dialects:

    • MSA (Modern Standard Arabic): "كيف حالك؟"

    • Egyptian: "إزيّك؟" (Ezzayyak? for a male, Ezzayyik? for a female.)

    • Gulf (Bahrain, Kuwait, Iraq, Oman, Qatar, Saudi Arabia, and the UAE): "شخبارك؟"

    • Levantine (Syria, Lebanon, Palestine, and Jordan): "كيفك؟"

A model trained on MSA alone might struggle to process Egyptian Arabic unless fine-tuned with dialectal data.
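
When the dialect of incoming text is unknown, one option is to run automatic dialect identification first and route the text to the right pipeline. The sketch below uses CAMeL Tools' dialect identification component as an illustration; it is an optional extra with its own pretrained model and dependencies, so treat it as a starting point rather than a drop-in step.

from camel_tools.dialectid import DialectIdentifier

# Load the pretrained dialect identification model
did = DialectIdentifier.pretrained()

sentences = ["كيف حالك؟", "إزيك عامل إيه؟"]

# Each prediction exposes the top-scoring dialect label
for sent, pred in zip(sentences, did.predict(sentences)):
    print(sent, "→", pred.top)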

5. Mixed-Language Text Processing (Code-Switching)

Egyptian Arabic speakers frequently mix Arabic with English and French, especially on social media.

Example:

  • "أنا رايح ال gym بعد الشغل." ( I’m going to the gym after work. )

  • "بجد ال movie دي كانت جامدة!" (Seriously, this movie was amazing! )

  • "أنا literally مش قادر." (I am literally exhausted.)


3. Preprocessing Egyptian Arabic Text for NLP

1. Text Normalization

Egyptian Arabic has spelling variations due to dialectal differences. Normalization helps unify the text by:

  • Removing diacritics: "مُستَنيّك" → "مستنيك"

    • Dediacritization

    •     from camel_tools.utils.dediac import dediac_ar
      
          sentence = "هَلْ ذَهَبْتَ إِلَى المَكْتَبَةِ؟"
          print(sentence) # هَلْ ذَهَبْتَ إِلَى المَكْتَبَةِ؟
      
          sent_dediac = dediac_ar(sentence)
          print(sent_dediac) # هل ذهبت إلى المكتبة؟
      
  • Normalizing Alef variants: "إزاي" → "ازاي"

  • Converting Teh Marbuta (ة) to Heh (ه): "مدرسة" → "مدرسه"

    • Orthographic Normalization with camel_tools

    •     from camel_tools.utils.normalize import normalize_alef_maksura_ar
          from camel_tools.utils.normalize import normalize_alef_ar
          from camel_tools.utils.normalize import normalize_teh_marbuta_ar
      
          sentence = "هل ذهبت إلى المكتبة؟"
          print(sentence) # هل ذهبت إلى المكتبة؟
      
          # Normalize alef variants to 'ا'
          sent_norm = normalize_alef_ar(sentence)
          print(sent_norm) # هل ذهبت الى المكتبة؟
      
          # Normalize alef maksura 'ى' to yeh 'ي'
          sent_norm = normalize_alef_maksura_ar(sent_norm)
          print(sent_norm) # هل ذهبت الي المكتبة؟
      
          # Normalize teh marbuta 'ة' to heh 'ه'
          sent_norm = normalize_teh_marbuta_ar(sent_norm)
          print(sent_norm) # هل ذهبت الي المكتبه؟
      
  • Eliminating elongation: removing Tatweel/Kashida (ـــ) and collapsing repeated letters, e.g. "عاااااش" → "عاش"

    •         # Remove Tatweel (Kashida) and digits
              text = re.sub(r'[\d_\u0640]', '', text)

              # Collapse a letter repeated three or more times: "عاااااش" → "عاش"
              text = re.sub(r'(.)\1{2,}', r'\1', text)


2. Stopword Removal

Removing common words like "أنا", "كنت", "في" helps reduce noise.

  • Example: "أنا كنت في الجامعة" → "جامعة"

Full Preprocessing Code

import logging
import re
import string
import emoji
import nltk
from camel_tools.utils.dediac import dediac_ar
from nltk.corpus import stopwords

# Setup logging
logging.basicConfig(level=logging.INFO)

# Download stopwords if not available
nltk.download('stopwords')

# Load Arabic stopwords
stop_words = set(stopwords.words('arabic'))
excluded_stopwords = {"لم", "لن", "لا", "لكن"}  # Negation words to keep
stop_words -= excluded_stopwords

# Clean text by removing URLs, emojis, punctuations, and normalizing Arabic
def data_cleaning(text: str) -> str:
    if not isinstance(text, str) or not text.strip():
        logging.warning("Empty or invalid input for data_cleaning.")
        return ""

    # Remove URLs
    text = re.sub(r'https?://\S+|www\.\S+', '', text)

    # Remove extra spaces
    text = re.sub(r'\s+', ' ', text).strip()

    # Remove diacritics
    text = dediac_ar(text)

    # Remove Tatweel (Kashida) and digits
    text = re.sub(r'[\d_\u0640]', '', text)

    # Normalize Arabic letters
    replacements = {
        "\u0622": "\u0627",  # Alef with madda
        "\u0625": "\u0627",  # Alef with hamza below
        "\u0623": "\u0627",  # Alef with hamza above
        "\u0624": "\u0648",  # Waw with hamza
        "\u0626": "\u064a",  # Yeh with hamza
        "\u0629": "\u0647",  # Teh Marbuta to Heh
    }
    text = normalize_arabic(text, replacements)

    # Remove punctuation (ASCII plus common Arabic marks)
    arabic_punct = "؟،؛«»"
    text = text.translate(str.maketrans('', '', string.punctuation + arabic_punct))

    # Remove emojis
    text = emoji.replace_emoji(text, replace="")

    # Remove stopwords (directly on text)
    text = ' '.join(word for word in text.split() if word not in stop_words)

    return text


# Helper function for normalizing Arabic letters
def normalize_arabic(text: str, replacements: dict) -> str:
    for old_char, new_char in replacements.items():
        text = text.replace(old_char, new_char)
    return text
"أهلا وسهلا بحضرتك! 🌟 معاك أحمد من شركة X، حابين نقدملك عرض خاص جدًا على خدمات الإنترنت فائق السرعة. 🚀🔥 العرض متاح لفترة محدودة، تحب تعرف التفاصيل؟ 📞😊"

Processing Steps:

  • Remove emojis: "🌟🚀🔥📞😊" are removed.

  • Normalize Arabic letters: "أهلا" → "اهلا", "شركة" → "شركه"; "حضرتك" stays the same.

  • Remove stopwords: common MSA stopwords such as "من" and "على" are dropped, while dialectal words like "معاك" are not in the stopword list and remain.

  • Remove punctuation: "!", "،", and "؟" are removed.

Expected Output (Cleaned Text):

"اهلا وسهلا بحضرتك معاك احمد من شركة X حابين نقدملك عرض خاص جدا على خدمات الانترنت فائق السرعة العرض متاح لفترة محدودة تحب تعرف التفاصيل"

3. Stemming vs. Lemmatization

  • Stemming: Reduces a word to a stem or root-like form, which may not itself be a real word.

    • Further reading on stemming vs. lemmatization for information retrieval (Zeroual and Lakhouaja): https://www.semanticscholar.org/paper/Information-Retrieval-%3A-Stemming-or-Lemmatization-Zeroual-Lakhouaja/b5d0def90c59d6d697b5e59f4e5cff7870dbb124

  • Lemmatization: Converts words to their dictionary form.

Example:

  • "كتبتله" → Stem: "كتب", Lemma: "يكتب"

📌 Tools: ISRI Stemmer, Tashaphyne, Farasa Lemmatizer
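
As a quick illustration of the stemming side, NLTK ships an implementation of the ISRI stemmer listed above. A minimal sketch (the outputs are root-like stems and can be over-aggressive, which is exactly the caveat discussed earlier):

from nltk.stem.isri import ISRIStemmer

stemmer = ISRIStemmer()

# Strips prefixes/suffixes down to a root-like stem; distinct words
# such as "مكتبات" and "كتبتله" may collapse to the same stem
for word in ["مكتبات", "كتبتله", "بيكتبوا"]:
    print(word, "→", stemmer.stem(word))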


4. Tokenization

Egyptian Arabic words contain prefixes and suffixes, making proper word segmentation essential.

  • Example: "هروحلك بكرة" → ["هروح", "لك", "بكرة"]

📌 Tools: Farasa, MADAMIRA, CamelTools

We talked about that in a previous article: Arabic Language Tokenization Explained: Key Concepts and Methods
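
For a runnable starting point on clitic-aware segmentation, here is a minimal sketch using CAMeL Tools' morphological tokenizer. It assumes the pretrained Egyptian disambiguation model (calima-egy-r13) has been downloaded with camel_data; the exact token boundaries and markers depend on that model and the chosen scheme.

from camel_tools.disambig.mle import MLEDisambiguator
from camel_tools.tokenizers.word import simple_word_tokenize
from camel_tools.tokenizers.morphological import MorphologicalTokenizer

# Pretrained Egyptian disambiguator ('calima-msa-r13' is the MSA equivalent)
mle = MLEDisambiguator.pretrained('calima-egy-r13')

# The d3tok scheme separates proclitics and enclitics from the base word
tokenizer = MorphologicalTokenizer(mle, scheme='d3tok', split=True)

words = simple_word_tokenize("هروحلك بكرة")
print(tokenizer.tokenize(words))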


5. Handling Mixed-Language Text

1. Removing Non-Arabic Words

Egyptian Arabic often includes English or French words. To maintain linguistic consistency, it’s essential to remove non-Arabic words.

Example:
Original: "أنا رايح mall بكرة"
Processed: "أنا رايح بكرة"

📌 Methods:
Regex-based filtering:

import re

def remove_non_arabic(text):
    # Drop Latin-script words, then collapse the leftover double spaces
    text = re.sub(r'[a-zA-Z]+', '', text)
    return re.sub(r'\s+', ' ', text).strip()

text = "أنا رايح mall بكرة"
print(remove_non_arabic(text))  # Output: "أنا رايح بكرة"

📌 Tools:

  • re (Regex) for pattern-based filtering

  • langdetect or fasttext for language identification
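
Regex filtering removes every Latin-script token, including names or terms you might want to keep, so a softer alternative is to identify the language first and decide what to do with each piece. A minimal sketch with the langdetect package (assuming it is installed with pip install langdetect; detection on very short or mixed snippets is noisy, so this is illustrative only):

from langdetect import detect, detect_langs

# Whole-sentence detection works reasonably well for clean text
print(detect("أنا رايح النادي بعد الشغل"))         # usually 'ar'
print(detect("I'm going to the gym after work"))    # usually 'en'

# For code-switched text, inspect the probability distribution instead
print(detect_langs("أنا رايح ال gym بعد الشغل"))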

2. Multilingual Word Embeddings

Some word embedding models support multiple languages. We discussed this in detail in another article.


Conclusion

Arabic text preprocessing is a crucial step for effective NLP models. By applying the right techniques—normalization, tokenization, stopword removal, stemming, and lemmatization—we can significantly improve model performance. Future advancements in deep learning and NLP tools will continue to refine Arabic text processing for better language understanding.

Written by Mahmoud Hamed Namnam