String operations on byte strings
Problem
You want to perform common text operations (for example, stripping, searching, and replacing) on byte strings.
Solution
Byte strings support most of the same built-in operations as text strings. For example:
>>> data = b'Hello World'
>>> data[0:5]
b'Hello'
>>> data.startswith(b'Hello')
True
>>> data.split()
[b'Hello', b'World']
>>> data.replace(b'Hello', b'Hello Cruel')
b'Hello Cruel World'
>>>
These operations also work with byte arrays. For example:
>>> data = bytearray(b'Hello World')
>>> data[0:5]
bytearray(b'Hello')
>>> data.startswith(b'Hello')
True
>>> data.split()
[bytearray(b'Hello'), bytearray(b'World')]
>>> data.replace(b'Hello', b'Hello Cruel')
bytearray(b'Hello Cruel World')
>>>
You can apply regular expression pattern matching to byte strings, but the pattern itself must also be specified as bytes. For example:
>>> data = b'FOO:BAR,SPAM'
>>> import re
>>> re.split('[:,]',data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.3/re.py", line 191, in split
    return _compile(pattern, flags).split(string, maxsplit)
TypeError: can't use a string pattern on a bytes-like object
>>> re.split(b'[:,]',data) # Notice: pattern as bytes
[b'FOO', b'BAR', b'SPAM']
>>>
Discussion
Most of the time, operations that work on text strings also work on byte strings. However, there are some notable differences to be aware of. First, indexing a byte string produces an integer, not a single character. For example:
>>> a = 'Hello World' # Text string
>>> a[0]
'H'
>>> a[1]
'e'
>>> b = b'Hello World' # Byte string
>>> b[0]
72
>>> b[1]
101
>>>
This difference in semantics can affect programs that try to process byte-oriented data one character at a time.
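To make the integer behavior concrete, here is a minimal sketch (not part of the original recipe; the variable names are only illustrative). It shows that iteration also yields integers, and that a one-element slice or bytes([n]) is one way to get back a length-1 byte string:

```python
data = b'Hello World'

# Indexing and iteration both produce integers (byte values 0-255)
first = data[0]
values = list(data[:5])

# A one-element slice keeps the bytes type instead of returning an int
first_byte = data[0:1]

# bytes([n]) turns a single integer back into a length-1 byte string
rebuilt = bytes([first])

print(first, values, first_byte, rebuilt)
```

If code needs to compare "characters", comparing slices such as data[0:1] == b'H' avoids accidentally comparing an integer with a byte string.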
Second, byte strings do not provide a nice string representation and do not print cleanly unless they are first decoded into a text string. For example:
>>> s = b'Hello World'
>>> print(s)
b'Hello World' # Observe b'...'
>>> print(s.decode('ascii'))
Hello World
>>>
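Similarly, there are no string formatting operations available to byte strings in Python 3.3, the version used in this session (Python 3.5 and later accept the % operator on bytes, but bytes still has no format() method):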
>>> b'%10s %10d %10.2f' % (b'ACME', 100, 490.1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for %: 'bytes' and 'tuple'
>>> b'{} {} {}'.format(b'ACME', 100, 490.1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'bytes' object has no attribute 'format'
>>>
If you want to apply formatting to a byte string, format a normal text string first and then encode the result as bytes. For example:
>>> '{:10s} {:10d} {:10.2f}'.format('ACME', 100, 490.1).encode('ascii')
b'ACME              100     490.10'
>>>
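The format-then-encode step can be wrapped in a small helper. The following is only a sketch: format_bytes is a hypothetical name, not a standard library function, and it assumes the arguments are ordinary text and numbers rather than bytes.

```python
def format_bytes(template, *args, encoding='ascii'):
    """Format with a text template, then encode the result to bytes.

    format_bytes is a hypothetical helper, not part of the standard library.
    """
    return template.format(*args).encode(encoding)


# Same field widths as the example above
record = format_bytes('{:10s} {:10d} {:10.2f}', 'ACME', 100, 490.1)
print(record)
```

Any value that is already a byte string would need to be decoded before being passed in; otherwise its b'...' representation ends up in the output.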
Finally, be aware that using a byte string can change the semantics of certain operations, especially those related to the filesystem. For instance, if you supply a filename encoded as bytes instead of a normal text string, filename encoding and decoding are usually disabled:
>>> # Write a UTF-8 filename
>>> with open('jalape\xf1o.txt', 'w') as f:
...     f.write('spicy')
...
>>> # Get a directory listing
>>> import os
>>> os.listdir('.') # Text string (names are decoded)
['jalapeño.txt']
>>> os.listdir(b'.') # Byte string (names left as bytes)
[b'jalapen\xcc\x83o.txt']
>>>
Notice in the last part of this example how passing a byte string as the directory name causes the resulting filenames to be returned as undecoded bytes. The filename shown in the directory listing contains the raw UTF-8 encoding.
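If you do end up with byte filenames, the standard os.fsencode() and os.fsdecode() functions convert between the two forms using the filesystem encoding. The sketch below is an illustration rather than part of the original recipe; it uses a temporary directory and a plain ASCII filename (spam.txt, chosen arbitrarily) so that it behaves the same on different platforms:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    # Create a file with a plain ASCII name to keep the sketch portable
    with open(os.path.join(tmpdir, 'spam.txt'), 'w') as f:
        f.write('spicy')

    text_names = os.listdir(tmpdir)                # names decoded to str
    byte_names = os.listdir(os.fsencode(tmpdir))   # names left as bytes

# os.fsdecode applies the filesystem encoding to the raw byte names
decoded = [os.fsdecode(name) for name in byte_names]
print(text_names, byte_names, decoded)
```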
As a final comment, some programmers are inclined to use byte strings instead of text strings for the sake of performance. Although manipulating bytes does tend to be slightly more efficient than manipulating text (because of the overhead inherent in Unicode), doing so usually leads to very messy and non-idiomatic code. You will often find that byte strings do not play well with much of the rest of Python, and that you end up performing all kinds of encoding and decoding operations by hand to get things to work. Frankly, if you are working with text, use normal text strings in your program, not byte strings.
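One common way to follow this advice is to decode once where the bytes enter the program, work with text in between, and encode once on the way out. The following is a minimal sketch of that pattern, assuming ASCII data; the raw value simply stands in for bytes read from a file or socket:

```python
raw = b'FOO:BAR,SPAM'   # stands in for bytes read from a file or socket

# Decode once on input
text = raw.decode('ascii')

# Do the actual work with ordinary text operations
fields = [field.lower() for field in text.replace(':', ',').split(',')]
joined = ';'.join(fields)

# Encode once on output
out = joined.encode('ascii')
print(out)
```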