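BOM is short for Byte Order Mark, the Unicode character U+FEFF.
Many programs on Windows use the BOM as a magic number to identify a file's character encoding and byte order.
The design is clever, but it causes developers a great deal of trouble when they process text files.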
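As a small illustration of that trouble, the sketch below decodes the same bytes twice. It assumes a UTF-8 file that an editor saved with a BOM; the sample content `name,age` is made up for the example. Python's `utf-8` codec keeps the BOM as an invisible first character, while `utf-8-sig` drops it.

```python
import codecs

# Bytes as an editor might save them: UTF-8 BOM followed by the text.
# The content 'name,age' is an invented sample, not taken from a real file.
raw = codecs.BOM_UTF8 + 'name,age'.encode('utf-8')

plain = raw.decode('utf-8')      # keeps U+FEFF as the first character
clean = raw.decode('utf-8-sig')  # strips a leading BOM if one is present

print(repr(plain))               # '\ufeffname,age'
print(repr(clean))               # 'name,age'
print(plain.startswith('name'))  # False: the hidden BOM breaks the comparison
```

The same pitfall appears when the first CSV header or the first key of a config file silently carries the BOM.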
Encoding | Hexadecimal | Decimal | CP1252 |
---|---|---|---|
UTF-8 | EF BB BF | 239 187 191 | ï»¿ |
UTF-16 (BE) | FE FF | 254 255 | þÿ |
UTF-16 (LE) | FF FE | 255 254 | ÿþ |
UTF-32 (BE) | 00 00 FE FF | 0 0 254 255 | ^@^@þÿ |
UTF-32 (LE) | FF FE 00 00 | 255 254 0 0 | ÿþ^@^@ |
UTF-7 | 2B 2F 76 | 43 47 118 | +/v |
UTF-1 | F7 64 4C | 247 100 76 | ÷dL |
UTF-EBCDIC | DD 73 66 73 | 221 115 102 115 | Ýsfs |
SCSU | 0E FE FF | 14 254 255 | ^Nþÿ |
BOCU-1 | FB EE 28 | 251 238 40 | ûî( |
GB18030 | 84 31 95 33 | 132 49 149 51 | „1•3 |
PS: `^@` is the null character (0x00), and `^N` is the "shift out" character (0x0E).
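The signatures in the table can drive a simple encoding sniffer. The following is only a sketch: `detect_bom` and `BOM_TABLE` are hypothetical names introduced here, and only the UTF-8, UTF-16 and UTF-32 rows are covered, using the BOM constants from Python's standard `codecs` module. Longer signatures are checked first because the UTF-32 (LE) BOM begins with the same two bytes as the UTF-16 (LE) BOM.

```python
import codecs

# Longest signatures first: FF FE 00 00 (UTF-32 LE) starts with FF FE (UTF-16 LE).
BOM_TABLE = [
    (codecs.BOM_UTF32_BE, 'utf-32-be'),
    (codecs.BOM_UTF32_LE, 'utf-32-le'),
    (codecs.BOM_UTF8, 'utf-8'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
]


def detect_bom(data):
    """Return (encoding, bom_length), or (None, 0) if no known BOM is found."""
    for bom, encoding in BOM_TABLE:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0


sample = b'\xff\xfeh\x00i\x00'  # UTF-16 (LE) BOM followed by "hi"
encoding, size = detect_bom(sample)
print(encoding, sample[size:].decode(encoding))  # utf-16-le hi
```

A sniffer like this is a heuristic: a UTF-16 (LE) file whose first character is U+0000 would be reported as UTF-32 (LE), and a file without a BOM tells it nothing.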
The byte sequences can be regenerated with a short Python 3 script that encodes U+FEFF in each encoding and prints a Markdown table:

    # U+FEFF, the BOM character
    bom = '\ufeff'
    encodings = ('utf-8', 'utf-16-le', 'utf-16-be', 'utf-32-le', 'utf-32-be', 'utf-7', 'gb18030')
    row = '| %-15s | %-22s | %-16s | %-15s |'

    # Header and separator rows
    print(row % ('Encoding', 'Bytes', 'Hexadecimal', 'CP1252'))
    print('| ' + ' | '.join(['-' * 15, '-' * 22, '-' * 16, '-' * 15]) + ' |')
    for encoding in encodings:
        encoded = bom.encode(encoding)
        print(row % (
            '**%s**' % encoding,
            '`%s`' % encoded,                                      # bytes literal
            '`%s`' % ' '.join('%02x' % byte for byte in encoded),  # hex dump
            '`%r`' % encoded.decode('cp1252'),                     # bytes misread as CP1252
        ))

Output (column padding omitted):

Encoding | Bytes | Hexadecimal | CP1252 |
---|---|---|---|
**utf-8** | `b'\xef\xbb\xbf'` | `ef bb bf` | `'ï»¿'` |
**utf-16-le** | `b'\xff\xfe'` | `ff fe` | `'ÿþ'` |
**utf-16-be** | `b'\xfe\xff'` | `fe ff` | `'þÿ'` |
**utf-32-le** | `b'\xff\xfe\x00\x00'` | `ff fe 00 00` | `'ÿþ\x00\x00'` |
**utf-32-be** | `b'\x00\x00\xfe\xff'` | `00 00 fe ff` | `'\x00\x00þÿ'` |
**utf-7** | `b'+/v8-'` | `2b 2f 76 38 2d` | `'+/v8-'` |
**gb18030** | `b'\x841\x953'` | `84 31 95 33` | `'„1•3'` |