Sometimes I need to search for files with accented characters (diacritics in general), usually with locate (mlocate flavor, Merging Locate; see the warning about plocate below). I would like to set it up (maybe in /etc/updatedb.conf) so that it lets me search for these special characters using a certain language mapping, for example:

a == âàáäÂÀÂÄ
e == êèéëÊÈÉË
i == îïíÎÏ
o == ôöóÔÖ
u == ûùüÛÜÙ
c == çÇ
n == ñ

So locate -i liberación should also search for file names containing the string liberacion and even liberaciòn.

Notes and assumptions

  • And maybe others: ÂÃÄÀÁÅÆ ÇÈÉÊËÌÍÎÏ ÐÑÒÓÔÕÖØÙÚÛÜÝÞ ßàáâãäåæç èéêëìíîïðñòóôõö øùúûüýþÿ.
  • This is a common situation in European languages such as Spanish, French, and German.
  • I always use a 100% UTF-8 locale.
  • I would rather not have to use regular expressions.
  • A patch might use ASCII transliterations of Unicode, as Unidecode/cUnidecode does. Most of mlocate is written in C.

Pablo Bianchi

2 Answers

If we take a look at updatedb.conf(5), we'll find that there is not much we can do with configuration items.

So we are going to write a script using locate. In the end we will be able to run something like my-locate.sh liberacion or my-locate.sh liberâciòn, and it will bring us all the possible combinations.


Let's start

First create a simple file as our database anywhere you want it to be, e.g. ~/.mydb; then add your accented characters to that file, one line per base letter, like this:

aâàáäÂÀÂÄ
eêèéëÊÈÉË
iîïíÎÏ
uûùüÛÜÙ
cçÇ
oôöóÔÖóòòò
...
...

Then we need a small script that does the job for us. I wrote a simple one:

#!/bin/bash

# Final search term 
STR=""

# Loop through all characters of the desired string
for (( i=0; i<${#1}; i++ )); do

  # Split the string in one char
  CH="${1:$i:1}"

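  # Find the line of ~/.mydb that lists the variants of this char
  # (-F: treat the char literally, -m1: use only the first matching line)
  CHARS=$(grep -F -m1 -- "$CH" ~/.mydb)

  # Add an "or" operator between characters, without a trailing "|"
  # (an empty last alternative would match the empty string)
  REG=$(echo "$CHARS" | sed 's/./&|/g; s/|$//')
  REG="($REG)"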

  # Append all possible combination of this character
  # to our final search term as an or statement
  if [ "$REG" == '()' ];
  then
   STR=$STR$CH
  else
   STR=$STR$REG
  fi

done

# locate it using regex
locate --regex "$STR$"

Now save it under a name of your choice in a directory on your PATH, e.g. ~/bin (which is usually already in your PATH), and make it executable.

After that, simply run something like this to search for all possible combinations:

my-locate.sh liberacion

It finds all of these for me:

~/lab/liberacion
~/lab/liberaciòn
~/lab/liberación
~/lab/liberâciòn
~/lab/liberäciòn
~/lab/libÈrâciòn
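
The same idea can also be sketched with bracket expressions instead of alternations. This is only a sketch under some assumptions: build_pattern and its optional second argument (the path of the mapping file) are names made up for this example, the mapping file is assumed not to contain ], ^ or -, and a UTF-8 locale is assumed for non-ASCII search terms.

```shell
#!/bin/bash
# build_pattern TERM [DBFILE]  (hypothetical helper, not part of locate)
# Prints an extended regex in which every character of TERM that has an
# entry in the mapping file becomes a bracket expression of its variants.
build_pattern() {
  local term=$1 db=${2:-$HOME/.mydb} pattern="" ch variants i
  for (( i = 0; i < ${#term}; i++ )); do
    ch=${term:i:1}
    # -F: match the character literally; -m1: stop at the first matching line
    variants=$(grep -F -m1 -- "$ch" "$db")
    if [ -n "$variants" ]; then
      pattern+="[$variants]"
    else
      # No entry: keep the character, escaping regex metacharacters
      case $ch in
        '.'|'*'|'+'|'?'|'('|')'|'['|']'|'{'|'}'|'|'|'^'|'$'|'\')
          pattern+="\\$ch" ;;
        *)
          pattern+=$ch ;;
      esac
    fi
  done
  printf '%s\n' "$pattern"
}
```

It could then be used as locate --regex "$(build_pattern liberacion)$", which keeps the end-of-path anchor of the script above.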
David Foerster
Ravexina
  • You can use grep -F or fgrep to avoid the interpretation of "$CH" as a special character, e.g. grep ^ would match any line but grep -F ^ only matches those that contain the character ^. It may also be easier to use character classes to craft the regular expression, i.e. REG="[$CHARS]" is probably easier than your sed command. Watch out for special characters though! Otherwise a good approach. +1 – David Foerster May 22 '17 at 09:13
  • This seems a good alternative if we have to use plocate, but for some reason I'm getting too many apparently false positives. Also, do you know if it's possible to avoid SC2001 on line 18? – Pablo Bianchi Nov 22 '22 at 23:25
mlocate

Now, with mlocate 0.26, we have the -t/--transliterate option (see the man page) on Ubuntu 18.04+, without the need for workarounds.

Creating some test files:

$ touch liberación liberacion liberaciôn

Update and search:

$ updatedb
$ locate --transliterate liberacion 
/home/pablo/liberacion
/home/pablo/liberación
/home/pablo/liberaciôn

So now locate -t liberación also finds files containing the string liberacion and even liberaciòn!

Finally, creating an alias... :-)

$ alias locate="locate --transliterate"
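
For some intuition about what transliteration does: as far as I can tell from the man page, --transliterate relies on iconv transliteration, and the effect on a single string can be approximated with the iconv command-line tool. This is only an illustration, not how mlocate is invoked internally. to_ascii is a hypothetical helper name, and the result depends on the iconv implementation and the current locale (glibc in a UTF-8 locale is assumed here).

```shell
#!/bin/bash
# to_ascii STRING  (hypothetical helper, not part of mlocate)
# Converts a UTF-8 string to ASCII, asking iconv to transliterate
# characters that have no ASCII equivalent (//TRANSLIT).
to_ascii() {
  printf '%s\n' "$1" | iconv -f UTF-8 -t 'ASCII//TRANSLIT'
}

# Example call (result depends on the iconv implementation and locale):
# to_ascii 'liberación'
```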

plocate

plocate is a much faster locate with a smaller index, but it doesn't have a --transliterate option and won't have one anytime soon, according to its only maintainer (Steinar Gunderson):

This is highly nontrivial, unfortunately. mlocate can get away with it because it scans through each and every filename linearly; plocate doesn't, which is why it is so much faster than mlocate, but also makes such locale-dependent searches much harder. So no, this is unlikely to happen anytime soon, unfortunately.

I couldn't find an issue tracker related to this project.

However, it has --regex, so it seems it could internally use something similar to Ravexina's approach.

On distros that ship plocate by default, I remove it, install mlocate, and put it on hold so it isn't replaced with plocate:

sudo apt update && sudo apt upgrade
sudo apt remove plocate
wget "http://archive.ubuntu.com/ubuntu/pool/main/m/mlocate/mlocate_0.26-3ubuntu3_amd64.deb"
sudo dpkg -i mlocate_0.26-3ubuntu3_amd64.deb
sudo apt install -f
sudo apt-mark hold mlocate
Pablo Bianchi