21

I'm developing a bash script and came across the following strange behaviour!

$ echo £ |cut -c 1
�

The £ sign is piped to the next command, cut, whose filter picks one character only.

When I modify the filter in the cut command to pick 2 characters, then the £ is passed through!

$ echo £ |cut -c 1-2
£

Not a severe problem, I have a workaround solution in the script, but why does the filter in the cut command require 2 positions instead of 1 when picking a £ sign?

αғsнιη
  • 36,400

2 Answers

44

The cut command in Ubuntu is not multi-byte character aware. Characters are the same as bytes for this version of the cut command.

The pound sign (£) is a single character, but its UTF-8 encoding consists of two bytes (c2 and a3):

$ echo £ | od -t x1
0000000 c2 a3 0a
0000003

Note: The 0a character is the "New Line" (ASCII "Line Feed" character).

When you cut the first character from the line, you are selecting only the c2 part of £, and this is not a valid UTF-8 character. As a result you get the strange question mark (the replacement character) on screen:

$ echo £ | cut -c 1 | od -t x1
0000000 c2 0a
0000002

Note: The above was tested with the latest version of cut in Ubuntu 20.10 (GNU coreutils version 8.32).
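
One way to check the "not a valid UTF-8 character" claim without relying on how the terminal renders the output is to ask iconv to decode the bytes. The following is only a sketch under a few assumptions: is_valid_utf8 is a hypothetical helper name, an iconv binary is assumed to be on the PATH, and £ is written as the octal escapes \302\243 so the bytes do not depend on the terminal's encoding. cut -b is used to make the byte-wise selection explicit.

```shell
# Hypothetical helper: succeeds if stdin decodes cleanly as UTF-8.
# Assumes an iconv binary is available on the PATH.
is_valid_utf8() {
  iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Both bytes of £ (c2 a3) plus a newline: should be accepted.
printf '\302\243\n' | is_valid_utf8 && echo "valid"

# Only the first byte (c2) plus a newline: should be rejected.
printf '\302\243\n' | cut -b 1 | is_valid_utf8 || echo "invalid"
```

If the second check reports "invalid", cut has split the character in the middle rather than selecting it whole.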

If you want to select multi-byte characters, you can use the grep (GNU grep version 3.4) command like this:

$ echo x£β | grep -o '^.'
x
$ echo x£β | grep -o '^..'
x£
$ echo x£β | grep -o '^...'
x£β
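
The grep approach above can be wrapped in a small function so the number of characters becomes a parameter. This is a minimal sketch, not a drop-in replacement for cut: first_chars is a hypothetical name, and it assumes GNU grep running in a UTF-8 locale. In the C locale, "." matches a single byte, so the result would be the same as with cut.

```shell
# Hypothetical helper: print the first N characters of each input line.
# Assumes GNU grep; multi-byte characters are only kept whole in a UTF-8 locale.
first_chars() {
  grep -o "^.\{$1\}"
}

printf 'xyz\n' | first_chars 2   # ASCII input: prints xy
echo 'x£β' | first_chars 2       # in a UTF-8 locale this should give x£
```

Note that a line shorter than N characters does not match the pattern at all, so it produces no output, whereas cut would print the whole short line.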

This answer was improved with the help of the comments.

FedKad
  • 13,900
  • 3
    "The cut command is not multi-byte character aware." - Interestingly, (GNU) cut has both options for selecting bytes (-b), and for selecting characters (-c). One would hope it would know how to deal with multi-byte characters then... – marcelm Nov 03 '20 at 23:16
  • You might want to change echo to echo -n in your first example, so that there's no extra 0a – Grzegorz Oledzki Nov 04 '20 at 06:03
  • 2
    Initially I did that way @GrzegorzOledzki . However, since the second example with cut had it already, I removed the -n in the first example, for consistency. – FedKad Nov 04 '20 at 09:09
  • 8
    @marcelm Some cuts do actually make a distinction between -b and -c. My cut (GNU coreutils) 8.32 does the right thing with -c in an UTF-8 locale, but it turns out that it's due to a downstream Fedora patch. Upstream coreutils still handle -b and -c as aliases of the same thing at the moment. – TooTea Nov 04 '20 at 09:42
  • 3
    Note, that strange question mark is known in Unicode as the replacement character. It’s officially supposed to be used when a character or byte cannot be translated to a Unicode code point in the currently selected encoding (and in some cases it may also be used to represent characters that the current font does not include glyphs for). – Austin Hemmelgarn Nov 04 '20 at 12:11
20

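In UTF-8 encoding, £ is stored as the two bytes 0xC2 0xA3 (c2a3), which is 11000010 10100011 in binary.

So it is one character made of two bytes. cut -c treats each byte as a character, so picking one "character" yields only the first byte, which is displayed as � (the replacement character).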


$ echo -n £ | xxd
00000000: c2a3                                     ..

$ echo -n £ | wc --bytes
2
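
The binary form above also shows why a tool can tell how long a character is from its first byte: in UTF-8 the leading bits of the first byte encode the sequence length (110xxxxx starts a 2-byte sequence, and continuation bytes look like 10xxxxxx). As a rough sketch, with utf8_seq_len being a hypothetical helper name and no validation of overlong or out-of-range sequences:

```shell
# Hypothetical helper: given a byte as two hex digits, print how many bytes
# a UTF-8 sequence starting with that byte has (0 if it cannot start one).
utf8_seq_len() {
  b=$((0x$1))
  if   [ $((b & 0x80)) -eq 0 ];            then echo 1   # 0xxxxxxx: ASCII
  elif [ $((b & 0xE0)) -eq $((0xC0)) ];    then echo 2   # 110xxxxx
  elif [ $((b & 0xF0)) -eq $((0xE0)) ];    then echo 3   # 1110xxxx
  elif [ $((b & 0xF8)) -eq $((0xF0)) ];    then echo 4   # 11110xxx
  else echo 0                                            # 10xxxxxx or invalid
  fi
}

utf8_seq_len c2   # first byte of £: prints 2
utf8_seq_len a3   # second byte of £, a continuation byte: prints 0
utf8_seq_len 78   # ASCII "x": prints 1
```

A character-aware cut would use this kind of rule to keep c2 and a3 together, while a byte-oriented cut -c splits them.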

Ravexina
  • 57,426
  • 1
    Characters starting from U+0080 (Latin-1 Supplement) usually show similar behaviour. You can find Unicode table on https://unicode-table.com/ – Kulfy Nov 03 '20 at 11:37
  • 3
    UTF-8 can have up to 4 bytes, which is not very intuitive. It's a gotcha, as it includes 7-bit ASCII but extends it. – mckenzm Nov 04 '20 at 08:12
  • Curiously, echo -n £ | wc --char returns 1 so wc knows a different definition of char than cut. – Criggie Nov 04 '20 at 19:38
  • 2
    To be clear GNU cut considers each byte to be a character with -c — other versions of cut will treat characters correctly. – Tim Nov 04 '20 at 22:09