
I have large TXT files in Arabic with Tashkil (diacritics), and I'm trying to find lines that contain a specific pattern vocalized (mashkula) with َ ً ُ ٌ ّ ْ ٍ. I tried the following grep command:

cat file.txt | grep "اهلا"

This returns nothing until I insert Tashkil marks:

cat file.txt | grep "أهْلاً"

Then I get the correct output:

أهْلاً

I also tried

grep -P "[ُ\ ّ\ َ\ ً\ ِ\ ٍ\ ٌ\ ْ\ \~]|[اهلا]" file.txt

And this returns all matching characters in different patterns:

أهْلاً أ ... هْ.. لًا أنْتَ لَيْلاً ..

How can I match Arabic diacritical marks with grep? Is it possible to remove the Tashkil marks from the text before using grep? My OS is Ubuntu 18.04.

UPDATE: For now, I remove the Tashkil marks from the text with sed "s/[ُ ّ َ ً ِ ٍ ٌ ْ]//g", and then I can grep for what I want. But with this approach, the sed command also removes the spaces from the whole text!

Pablo Bianchi
  • 17,552
s3idani
  • 443

2 Answers


Assuming a UTF-8 source file and locale, you can remove the U+064B-U+065B range with Perl:

$ echo "أَهْلاً وَ سَهْلاً" | perl -CSAD -pe 's/[\x{064B}-\x{065B}]//g'

أهلا و سهلا

Source: This works because vowel diacritics in Arabic are combining characters, meaning that a simple search and remove of these should be enough.
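If the goal is to search without diacritics but still see the original lines, the same character range can be applied to a copy of each line while the untouched line is printed. The following is a minimal sketch, not a polished tool: argrep is a hypothetical helper name, and the U+064B-U+065B range is simply the one used in the Perl command above.

```shell
# Hypothetical helper: argrep WORD [FILE...]
# Prints the original lines (diacritics intact) whose stripped form contains WORD.
argrep() {
    word=$1
    shift
    perl -CSAD -ne '
        BEGIN { $word = shift @ARGV }                # undiacritized search word
        ($plain = $_) =~ s/[\x{064B}-\x{065B}]//g;   # strip marks from a copy of the line
        print if index($plain, $word) >= 0;          # print the untouched original line
    ' "$word" "$@"
}
```

Called as argrep أهلا file.txt, it should print every line whose undiacritized form contains the word, with the Tashkil still in place. With no file argument it reads standard input.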

GNU sed also seems to work (note that based on these answers, there are other diacritics):

$ echo "أَهْلاً وَ سَهْلاً" | sed -e 's/َ//g;s/ُ//g;s/ِ//g;s/ّ//g;s/ً//g;s/ٌ//g;s/ٍ//g;s/ْ//g'

أهلا و سهلا
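Another option is to leave the text alone and make the pattern tolerate diacritics: after every letter of the search word, allow zero or more combining marks. A rough sketch under that assumption follows; argrep_o is a made-up name, and the mark range is again the U+064B-U+065B range from the Perl command in this answer.

```shell
# Hypothetical helper: argrep_o WORD [FILE...]
# Prints each match exactly as it appears in the text, diacritics included.
argrep_o() {
    word=$1
    shift
    perl -CSAD -ne '
        BEGIN {
            $word  = shift @ARGV;                  # undiacritized search word
            $marks = "[\\x{064B}-\\x{065B}]*";     # zero or more combining marks
            # Build a pattern that allows marks after every letter of the word.
            $re = join "", map { quotemeta($_) . $marks } split //, $word;
        }
        print "$1\n" while /($re)/g;               # one match per output line
    ' "$word" "$@"
}
```

For example, argrep_o أهلا file.txt should print the word as written in the file, whatever marks it carries. Hamza variants such as أ versus ا are not folded here, so the search word has to use the same base letters as the text.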

uconv might also work.

Check the comments area of this and s3idani's answer for more info.


Pablo Bianchi
  • 17,552
  • Nice answer :) ... that worked directly like so sed -e 's/َ//g;s/ُ//g;s/ِ//g;s/ّ//g;s/ً//g;s/ٌ//g;s/ٍ//g;s/ْ//g' file | grep -n أهلا and in a function containing this grep -n "$1" <(sed -e 's/َ//g;s/ُ//g;s/ِ//g;s/ّ//g;s/ً//g;s/ٌ//g;s/ٍ//g;s/ْ//g' "$2") where $1 is the first argument: word and $2 is the second argument: filename like so argrep أهلا file... I wonder if such function can be improved to return the original matching lines with diacritics in them by for example reading lines by numbers that grep -n returns from the original file again? – Raffa Apr 16 '22 at 22:58
  • @Pablo Bianchi The Perl example needs the hex code points of all 11 Arabic diacritical characters, one by one: x064B x064C x064D x064E x064F x0650 x0651 x0652 x0653 x0654 x0655 ً ٌ ٍ َ َُ ّ ْ ˜ ٓ ٕ ٔ – s3idani Apr 16 '22 at 23:24
  • @Pablo Bianchi And yes, maybe I should accept your answer, since the grep command can't match diacritics directly except with your sed/Perl examples. – s3idani Apr 16 '22 at 23:42
  • Well done! ... both methods now give the same expected result ... they are both working and are identical. – Raffa Apr 17 '22 at 17:17
  • @Raffa I don't know anything about Arabic, but reading this, should the range be extended up to U+065F, to include the "Other combining marks"? – Pablo Bianchi Apr 18 '22 at 05:54
  • I like the way you do your research ... but not all characters in your linked document are used in the Arabic language (we only use 28 letters and around 11 diacritics) ... Diacritics, however, are used lightly in everyday writing, for example in newspaper articles ... Arabic script is the 3rd most used script ... It is widely used by non-Arabs, sometimes with a few modifications/additions including letters/diacritics. ... – Raffa Apr 18 '22 at 18:16
  • ... Thus your answer will help computer users in all those Languages/countries ... Arabic alphabet has undergone many changes throughout its history ... IMO, the set of diacritics in your answer "especially the perl one" are enough for general use everyday Arabic script for Arabic language speakers ... That said, I think if you chose to enrich your answer with some of the information in my comments, that would make it Global and address a much wider audience. This is only a suggestion but your answer is very useful as it is. – Raffa Apr 18 '22 at 18:16

Based on Pablo Bianchi's answer, here's the workaround:

Text: أَهْلاً وَ سَهْلاً

Command: cat Text | sed -e 's/َ//g;s/ُ//g;s/ِ//g;s/ّ//g;s/ً//g;s/ٌ//g;s/ٍ//g;s/ْ//g;s/أ/ا/g;s/آ/ا/g;s/إ/ا/g' | grep -o "اهلا"

Output: اهلا
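The same normalization can be written with code points only, so that no Arabic characters need to be typed in the command itself. This is a sketch that assumes the same three replacements as the sed command above (U+0622, U+0623 and U+0625 all become the bare alef U+0627) and the U+064B-U+065B range from Pablo Bianchi's answer; arnorm is a hypothetical name.

```shell
# Hypothetical helper: arnorm [FILE...]
# Strips combining marks and folds three alef variants to the bare alef.
arnorm() {
    perl -CSAD -pe '
        s/[\x{064B}-\x{065B}]//g;                # remove combining marks
        tr/\x{0622}\x{0623}\x{0625}/\x{0627}/;   # alef with madda / hamza above / hamza below -> alef
    ' "$@"
}
```

It can stand in for the sed stage of the pipeline, for example cat Text | arnorm | grep -o "اهلا".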

s3idani
  • 443
  • I wouldn't use tr with non-ASCII characters. – Pablo Bianchi Apr 17 '22 at 01:26
  • @Pablo Bianchi I replaced tr command with full sed syntax and now all diacritical marks are removed. – s3idani Apr 25 '22 at 21:32
  • I don't know why you replace those three characters at the end. Why isn't removing the vowel diacritics (combining characters) enough? – Pablo Bianchi Apr 25 '22 at 22:02
  • @Pablo Bianchi In Arabic, أ إ آ ؤ ء ئ are called "Hamza"; they are not Tashkil marks. – s3idani Apr 25 '22 at 22:43
  • I assume removing the maddah and hamza characters (U+0653 U+0654 U+0655) is not enough (BTW, that's three symbols, not six; is that list wrong?). So is the quoted claim wrong, and not all vowel diacritics in Arabic are combining characters? Maybe that's why here there are more ranges. – Pablo Bianchi Apr 26 '22 at 01:52
  • I'm talking about simple Arabic Tashkil text from ISO 8859-6. Here is a Quran example: أَتَى أَمْرُ اللَّهِ فَلَا تَسْتَعْجِلُوهُ سُبْحَانَهُ وَتَعَالَى عَمَّا يُشْرِكُو In this case the sed command does what I need: اتى امر الله فلا تستعجلوه سبحانه وتعالى عما يشركو And yes, with other Quranic annotation it does not. – s3idani Apr 26 '22 at 18:51
  • Here's another example with أ إ آ: إِنْ أَحْسَنْتُمْ أَحْسَنْتُمْ لِأَنْفُسِكُمْ وَإِنْ أَسَأْتُمْ فَلَهَا فَإِذَا جَاءَ وَعْدُ الْآخِرَةِ لِيَسُوءُوا وُجُوهَكُمْ وَلِيَدْخُلُوا الْمَسْجِدَ كَمَا دَخَلُوهُ أَوَّلَ مَرَّةٍ وَلِيُتَبِّرُوا مَا عَلَوْا تَتْبِيرًا And this is the sed output: ان احسنتم احسنتم لانفسكم وان اساتم فلها فاذا جاء وعد الاخرة ليسوءوا وجوهكم وليدخلوا المسجد كما دخلوه اول مرة وليتبروا ما علوا تتبيرا – s3idani Apr 26 '22 at 19:21
  • Sorry, but the examples don't help me much. I only wonder whether there are other ranges of Unicode characters I should remove with the Perl command (which has the advantage that no non-ASCII characters need to be typed in the terminal), and whether it's absolutely necessary to replace, rather than delete, certain combining characters. – Pablo Bianchi Apr 26 '22 at 22:18
  • If you mean the combining hamza (U+0655 ٕ, U+0654 ٔ) and maddah (U+0653 ٓ), I don't think they can be replaced with other characters, but you can remove them (and all diacritics), as in handwriting; the text is still readable for Arabic speakers. Full Tashkil is used strictly in Quranic readings. – s3idani Apr 26 '22 at 23:42