Why is the character "£" in a string interpreted strangely by the cut command?

Question

I'm developing a bash script and came across the following strange behaviour!

$ echo £ |cut -c 1
�

The £ sign is piped to the cut command, whose filter picks one character only.

When I modify the filter in the cut command to pick 2 characters, then the £ is passed through!

$ echo £ |cut -c 1-2
£

Not a severe problem, I have a workaround solution in the script, but why does the filter in the cut command require 2 positions instead of 1 when picking a £ sign?

Potential duplicate of this Unix.SE question. – marcelm Nov 03 '20 at 23:18

FedKad · Answer 1 · 2020-11-04T14:07:58.000

44

The cut command in Ubuntu is not multi-byte character aware. For this version of cut, the -c option treats every byte as one character.

The pound sign (£) is encoded in UTF-8 as two bytes (c2 and a3):

$ echo £ | od -t x1
0000000 c2 a3 0a
0000003

Note: The 0a character is the "New Line" (ASCII "Line Feed" character).

When you cut the first character from the line, you select only the c2 byte of £, and a lone c2 is not a valid UTF-8 sequence. As a result, you get the strange question mark � (the replacement character) on screen:

$ echo £ | cut -c 1 | od -t x1
0000000 c2 0a
0000002

Note: The above was tested with the latest version of cut in Ubuntu 20.10 (GNU coreutils version 8.32).
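
The same effect can be reproduced without relying on the terminal's locale. The following is a minimal sketch, assuming a POSIX shell with printf, wc, cut and od available; it builds £ from its two bytes with octal escapes and uses cut -b, which selects bytes in every implementation. The variable names are only illustrative.

```shell
# "£" built from its two UTF-8 bytes, written as octal escapes (302 243 = c2 a3)
pound=$(printf '\302\243')

# Number of bytes in the encoded character (wc -c always counts bytes)
byte_count=$(printf '%s' "$pound" | wc -c | tr -d ' ')

# cut -b selects bytes, so this keeps c2 and drops a3; cut keeps the newline (0a)
first_byte_hex=$(printf '%s\n' "$pound" | cut -b 1 | od -An -t x1 | tr -d ' \n')

echo "bytes: $byte_count"
echo "after cut -b 1: $first_byte_hex"
```

This should print "bytes: 2" and "after cut -b 1: c20a", which matches the od output shown above for cut -c 1 on this version of cut.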

If you want to select multi-byte characters, you can use the grep (GNU grep version 3.4) command like this:

$ echo x£β | grep -o '^.'
x
$ echo x£β | grep -o '^..'
x£
$ echo x£β | grep -o '^...'
x£β
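
Note that grep only matches whole multi-byte characters with '.' when it runs in a UTF-8 locale. As a hedged sketch, the pattern above could be wrapped in a small function; first_chars and utf8_locale are hypothetical names, and the code assumes a grep that supports -o and a locale command that lists installed locales.

```shell
# Hypothetical helper: print the first N characters of each input line.
# $1 = number of characters, $2 = name of a UTF-8 locale (assumed to be installed)
first_chars() {
    LC_ALL=$2 grep -o "^.\{$1\}"
}

# Pick the first installed UTF-8 locale; stays empty if none is listed.
utf8_locale=$(locale -a 2>/dev/null | sed -n '/[Uu][Tt][Ff]-\{0,1\}8/{p;q;}')

if [ -n "$utf8_locale" ]; then
    # x, £ (c2 a3) and β (ce b2), written as octal byte escapes
    printf 'x\302\243\316\262\n' | first_chars 2 "$utf8_locale"
else
    echo "no UTF-8 locale found; grep would match single bytes" >&2
fi
```

Passing the locale explicitly keeps the result from depending on whatever LANG or LC_ALL the calling script happens to inherit.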

This answer was improved with the help of the comments.

edited Nov 04 '20 at 14:07

answered Nov 03 '20 at 10:46

FedKad

13,900

3

"The cut command is not multi-byte character aware." - Interestingly, (GNU) cut has both options for selecting bytes (-b), and for selecting characters (-c). One would hope it would know how to deal with multi-byte characters then... – marcelm Nov 03 '20 at 23:16
You might want to change echo to echo -n in your first example, so that there's no extra 0a – Grzegorz Oledzki Nov 04 '20 at 06:03
2

Initially I did that way @GrzegorzOledzki . However, since the second example with cut had it already, I removed the -n in the first example, for consistency. – FedKad Nov 04 '20 at 09:09
8

@marcelm Some cuts do actually make a distinction between -b and -c. My cut (GNU coreutils) 8.32 does the right thing with -c in a UTF-8 locale, but it turns out that it's due to a downstream Fedora patch. Upstream coreutils still handles -b and -c as aliases of the same thing at the moment. – TooTea Nov 04 '20 at 09:42
3

Note, that strange question mark is known in Unicode as the replacement character. It’s officially supposed to be used when a character or byte cannot be translated to a Unicode code point in the currently selected encoding (and in some cases it may also be used to represent characters that the current font does not include glyphs for). – Austin Hemmelgarn Nov 04 '20 at 12:11
2

@marcelm, cut is specified to have both -b and -c. The GNU implementation just treats them as identical... – ilkkachu Nov 04 '20 at 13:58

score 20 · Answer 2 · answered Nov 03 '20 at 10:47

In UTF-8, £ is encoded as the two bytes 0xC2 0xA3 (c2a3), which is 11000010 10100011 in binary.

So it's two bytes (which cut sees as two characters). cut -c treats each byte as a character, so cutting only the first one produces �.

$ echo -n £ | xxd
00000000: c2a3                                     ..

$ echo -n £ | wc --bytes
2
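
The binary form also shows how the two bytes map back to the code point U+00A3. In UTF-8, a lead byte starting with 110 announces a two-byte sequence and carries 5 payload bits, and a continuation byte starts with 10 and carries 6. A minimal sketch of that decoding in POSIX shell arithmetic (b1, b2 and code_point are illustrative names):

```shell
# The two bytes of "£" in UTF-8
b1=0xC2   # 11000010: lead byte, prefix 110 means a 2-byte sequence
b2=0xA3   # 10100011: continuation byte, prefix 10

# Keep the 5 payload bits of the lead byte and the 6 payload bits of the
# continuation byte, then join them into one number.
code_point=$(( (($b1 & 0x1F) << 6) | ($b2 & 0x3F) ))

printf 'U+%04X\n' "$code_point"
```

This prints U+00A3, the Unicode code point of the pound sign.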

Ravexina

57,426

1

Characters starting from U+0080 (Latin-1 Supplement) usually show similar behaviour. You can find Unicode table on https://unicode-table.com/ – Kulfy Nov 03 '20 at 11:37
3

UTF-8 can have up to 4 bytes, which is not very intuitive. It's a gotcha, as it includes 7-bit ASCII but extends it. – mckenzm Nov 04 '20 at 08:12
Curiously, echo -n £ | wc --char returns 1 so wc knows a different definition of char than cut. – Criggie Nov 04 '20 at 19:38
2

To be clear GNU cut considers each byte to be a character with -c — other versions of cut will treat characters correctly. – Tim Nov 04 '20 at 22:09
