Step by Step: How to Use CAMeL Tools for Arabic Language Processing - Tokenization

Sam Shamsan
Jul 28, 2020

In this article we will explore a new Python package that compiles years of diligent work in the domain of Arabic natural language processing. This is an advanced topic, and it assumes you have some prerequisites:

. Python programming language

. Natural language processing concepts, practices, tasks, and enabling technologies.

Introduction:

Natural language processing for Arabic is relatively new; the majority of the domain knowledge base has been developed only in the last 20 years. Great milestones have been achieved during this time, thanks to the early pioneers in this field who established resources that made it possible to approach Arabic processing in a computational manner. Those pioneers include, but are not limited to, Nizar Habash, Houda Bouamor, and many other researchers across the globe.

Every language has a degree of complexity that introduces challenges. Arabic has a great many of them, such as the Arabic script (cursive writing style, right-to-left direction), a very rich morphology (22,400 tags vs. 48 tags for English), orthographic ambiguity (a word has on average 12.3 analyses and 6.8 diacritizations), and dialectal variation [1].

Previous work in this domain has been done mostly in Java; this is the first time such a package is available for Python.

Table 1: Feature comparison of CAMeL Tools, MADAMIRA, Stanford CoreNLP and Farasa. [2]

Installation:

Before installing, please be aware that, at the time of writing, this package is compatible with Python 3.4 and above. It can be found at this link: https://pypi.org/project/camel-tools/

You can install the package with pip by typing: pip install camel-tools. pip also accepts pip install camel_tools.
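To confirm that the installation worked, a quick check like the following can help. This is a minimal sketch using only the standard library; the helper name is_installed is an illustrative assumption, not part of CAMeL Tools. Note that the package is imported as camel_tools (with an underscore), whichever spelling was used with pip.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located, without importing it.

    Hypothetical helper for illustration; not part of CAMeL Tools.
    """
    return importlib.util.find_spec(module_name) is not None


# The import name uses an underscore: camel_tools
print("camel_tools installed:", is_installed("camel_tools"))
```

If this prints False, re-run the pip command in the same Python environment you use to run your scripts.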

Once it is installed successfully, let's explore its contents. The tools provide both a command-line interface and a Python API. We will start with the API and cover the command-line tools in later articles.

We will approach the exploration and testing based on the task we need to accomplish, starting with tokenization. Prior to this step, you need to include the following line in your code to be able to process Arabic script; this should be helpful with any language other than English. (In Python 3, source files are UTF-8 by default, so the line is only strictly required on Python 2, but it is harmless to keep.) Just add it to the top of your Python file:

# -*- coding: utf-8 -*-

Before we start, we prepared a short text [3] to process. It is stored as a plain-text (.txt) file; read it and store its contents in a variable named text.
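A minimal sketch of this step follows. The file name sample_arabic.txt, its contents, and the helper read_text are assumptions made for illustration; the article's actual text file is not reproduced here. The snippet writes a short Arabic sentence to a temporary file first so that it runs on its own. The important detail is passing encoding="utf-8" explicitly, so that Arabic characters are decoded correctly regardless of the platform default.

```python
import tempfile
from pathlib import Path


def read_text(path):
    """Read a whole text file as UTF-8 and return it as a single string."""
    return Path(path).read_text(encoding="utf-8")


# Stand-in for the article's text file (hypothetical name and contents):
# write a short Arabic sentence to a temporary file, then read it back.
sample_path = Path(tempfile.gettempdir()) / "sample_arabic.txt"
sample_path.write_text("مرحبا بالعالم", encoding="utf-8")

text = read_text(sample_path)
print(text)
```

With your own file, only the call to read_text with the path to that file is needed.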

We will continue with data cleaning before applying tokenization: we need to remove all noise from the text, such as punctuation marks and links.
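One possible way to do this cleaning with the standard library is sketched below. The function name clean_text, the URL pattern, and the choice of Arabic punctuation marks are assumptions for illustration, not the article's original code; adjust them to the noise found in your own data.

```python
import re
import string

# Links: anything starting with http://, https:// or www. up to the next space
URL_RE = re.compile(r"https?://\S+|www\.\S+")

# ASCII punctuation plus a few common Arabic marks (comma, semicolon,
# question mark, guillemets); each one is mapped to a space
PUNCT_TABLE = str.maketrans(
    {ch: " " for ch in string.punctuation + "،؛؟«»"}
)


def clean_text(text):
    """Remove links and punctuation, then collapse repeated whitespace."""
    text = URL_RE.sub(" ", text)        # drop links first, while "://" is intact
    text = text.translate(PUNCT_TABLE)  # replace punctuation with spaces
    return " ".join(text.split())       # normalize whitespace


print(clean_text("مرحبا، بالعالم! https://example.com"))
```

Links are removed before punctuation on purpose: stripping punctuation first would break the URLs into fragments that the pattern could no longer match.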

To tokenize Arabic text using CAMeL Tools, it is as easy as importing the module and calling its tokenize function.
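A minimal sketch of that call is shown here. The import path camel_tools.tokenizers.word and the function simple_word_tokenize come from CAMeL Tools; the sample sentence is an assumption for illustration. The except branch is also an assumption, added only so the snippet still runs when the package is not installed; it falls back to plain whitespace splitting, which is not equivalent to the real tokenizer on text that contains punctuation.

```python
try:
    from camel_tools.tokenizers.word import simple_word_tokenize
except ImportError:
    # Illustration-only fallback (an assumption, not part of CAMeL Tools):
    # plain whitespace splitting so the example runs without the package.
    def simple_word_tokenize(sentence):
        return sentence.split()


# Sample sentence for illustration; in the article this is the cleaned text
text = "مرحبا بالعالم"

tokens = simple_word_tokenize(text)
print(tokens)
```

The result is a regular Python list of strings, so it can be passed directly to later processing steps.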

The simple_word_tokenize function converts any text into a list of tokens. So far we have only used the package's word-level tokenization. In the next article, we will discuss and explore morphological tokenization. Stay tuned.

— — — — — — — — — — — — — — — — — — — — — — —

[1] Introduction to Arabic Natural Language Processing (Synthesis Lectures on Human Language Technologies). Nizar Habash

[2] CAMeL Tools: An Open Source Python Toolkit for Arabic Natural Language Processing. Ossama Obeid, Nasser Zalmout, Salam Khalifa, Dima Taji, Mai Oudah, Bashar Alhafni, Go Inoue, Fadhl Eryani, Alexander Erdmann, Nizar Habash

[3] The text was randomly selected from the Moudo3 website.
