Changing ByteStr REPR


A rebuttal against Python 3 was recently written by the (in)famous Zed Shaw, and it drew many responses, with arguments and counterarguments on both sides.

One particular topic which caught my eye was the bytearray vs unicodearray debate. I'll try to explicitly avoid the str/string/bytes/unicode naming as it is (IMHO) confusing, but that's a debate for another time. If you pay attention to the above debates, you might see that there are roughly two camps:

  • bytearray and unicodearray are two different things, and we should never convert from one to the other (that's roughly the pro-Python-3 camp).
  • bytearray and unicodearray are similar enough in most cases that we should do the magic for users.

I'm greatly exaggerating here, and the following argues for neither side. I have my personal preference about what I think is good, but that's irrelevant for now. Note that both sides argue that their preference is better for beginners.

You can often find posts trying to clear up the string/str/bytes misconception, like this one, which keeps insisting that str in Python 3 is far different from bytes.

The mistake in the REPR

I have a theory that the bytes/str issue is not in their behavior, but in their repr. The repr is, in the end, the main communication channel between the object and the brain of the programmer or user. Also, Python is duck-typed, and you have to admit that bytes and str look rather similar when printed, so assuming they should behave in a similar way is not far-fetched. I'm not saying that users will consciously assume bytes and str are the same. I'm saying that the human brain may inherently make such an association.

Off the top of your head, what does requests.get(url).content return?

In [1]:
import requests_cache
import requests
In [2]:
b'{"name":"Luke Skywalker","height":"172","mass":"77","hair_color":"blond","skin_color":"fair","eye_color":"blue","birth_year":"19BBY","gender":"male","homeworld":"","films":["","","","",""],"species":[""],"vehicles":["",""],"starships":["",""],"created":"2014-12-09T13:50:51.644000Z","edited":"2014-12-20T21:17:56.891000Z","url":""}'

... bytes...

I'm pretty sure you glanced ahead in this post and probably thought it was "text", probably even JSON in this case. It might be invalid JSON; I'm pretty sure you cannot tell at a glance.

Why does it return bytes? Because it could fetch an image:

In [3]:
b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x01\xcc\x00\x00\x01\xcc\x08\x06\x00\x00\x00X\xdb\x98\x86\x00\x00 \x00IDATx\xda\xac\xbdy\x93\x1b\xb9\xb2\xf6\xf7K\x00\xb5\x90\xbdH\xa3\x99\xb9s7\xbf\xf1:\x1c\x0e/\xdf\xff\xdb8\xec\xb0}\xd79g4Rw\xb3IV\x15\x80\xf4\x1f@\xedUl\xea\\w\x84\xa65-6Y\x85\x02ry\xf2\xc9'\xa5\xfe\x9f\xfeGE\x04#\x821\x061\x16c\x0c\xc6XD\x0c\x02\xa0\x8a\x8a\x801\xa4\x1f\x08\x880\xfdRUD\x04\xd5\xfe\xff#6z\x8c*\xaa\x82\x88\xe0C \x84@\xf7~\xa6yy\xc5=>Q>~\xe6\xe1\xf3g~\xfd\xa7\x7f\xc28\x07\xb6\x00\x84h-\x88A1(\xe0U\xd2\xfb\xb8t\r1("

And what if you decode the first response?

In [4]:

Well, that looks the same (except for the leading b...). Go explain to a beginner that the two above are totally different things, while they already struggle with zero-based indexing, iterators, and the syntax of the language.
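To make the point concrete, here is a minimal sketch using a made-up payload (the `raw` value is an assumption for illustration, not the response shown above). It suggests that the two reprs differ by a single character, while the objects themselves never compare equal:

```python
# Hypothetical payload standing in for a response body.
raw = b'{"name": "Luke Skywalker"}'
text = raw.decode("utf-8")  # bytes -> str, assuming the content is UTF-8

# The two reprs differ only by the leading "b".
print(repr(raw))
print(repr(text))
same_apart_from_prefix = repr(raw)[1:] == repr(text)

# Yet the objects have different types and do not compare equal in Python 3.
are_equal = raw == text
print(same_apart_from_prefix, are_equal)
```

So the only visual hint that these are different kinds of objects is one easily overlooked letter.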

Changing the repr

Let's change the repr of bytes to better represent what they are. IPython makes it easy to change the repr of an object:

In [5]:
text_formatter = get_ipython().display_formatter.formatters['text/plain']
In [6]:
def _print_bytestr(arg, p, cycle):
    p.text('<bytesbytesbytes>')
text_formatter.for_type(bytes, _print_bytestr)
<function IPython.lib.pretty._repr_pprint>
In [7]:

Make a useful repr

<bytesbytesbytes> may not be a useful repr, so let's try to make a repr that:

  • Conveys that bytes are, in general, not text.
  • Lets us peek into the content to guess what it is.
  • Pushes the user to .decode() if necessary.

Generally, Python objects have a repr which starts with <, followed by the class name, a quoted representation of the object, the memory location of the object, and a closing >.
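For reference, the default repr that CPython gives to a class without its own `__repr__` follows this angle-bracket shape, with the class name and the memory address but without the quoted part. A small sketch, where `Thing` is a hypothetical class used only for illustration:

```python
import re

class Thing:
    """Hypothetical class that does not define __repr__."""

default_repr = repr(Thing())
print(default_repr)  # shape: <module.Thing object at 0x...>

# Check the conventional shape: "<", the class name, "object at", an address, ">".
looks_conventional = (
    re.fullmatch(r"<.*Thing object at 0x[0-9a-fA-F]+>", default_repr) is not None
)
print(looks_conventional)
```

The exact address differs on every run, and the shape itself is a CPython convention rather than a language guarantee.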

As the quoted representation of the object may be really long, we can elide it.

A common representation of bytes could be binary, but it's not really compact. Hex is compact but more difficult to read, and it makes peeking at the content hard when it could be ASCII. So let's go with an ASCII representation where we escape non-ASCII characters.
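To compare the three options side by side, here is a short sketch on the eight-byte PNG signature (the variable names are chosen for illustration only):

```python
sample = b"\x89PNG\r\n\x1a\n"  # the first eight bytes of a PNG file

# Binary: one group of eight bits per byte, separated by spaces.
as_binary = " ".join(format(byte, "08b") for byte in sample)
# Hex: two characters per byte.
as_hex = sample.hex()
# Escaped ASCII: what the built-in repr of bytes already produces.
as_escaped_ascii = repr(sample)

print(as_binary)
print(as_hex)
print(as_escaped_ascii)
```

Only the escaped ASCII form leaves the letters PNG visible, which is what makes it convenient for guessing the content.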

In [8]:
ellide = lambda s: s if (len(s) < 75) else  s[0:50]+'...'+s[-16:]
In [9]:
def _print_bytestr(arg, p, cycle):
    p.text('<bytes '+ellide(repr(arg))+' at {}>'.format(hex(id(arg))))       
text_formatter.for_type(bytes, _print_bytestr)
<function __main__._print_bytestr>
In [10]:
<bytes b'{"name":"Wilhuff Tarkin","height":"180","mass":"...pi/people/12/"}' at 0x107299228>
In [11]:
'{"name":"Wilhuff Tarkin","height":"180","mass":"unknown","hair_color":"auburn, grey","skin_color":"fair","eye_color":"blue","birth_year":"64BBY","gender":"male","homeworld":"","films":["",""],"species":[""],"vehicles":[],"starships":[],"created":"2014-12-10T16:26:56.138000Z","edited":"2014-12-20T21:17:50.330000Z","url":""}'

Advantage: it is not gobbledygook anymore when getting binary resources!

In [12]:
<bytes b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x01\...0IEND\xaeB`\x82' at 0x107e0c000>
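The same idea can be tried outside IPython as a plain function. The following is a minimal sketch: `elide` mirrors the lambda above with its thresholds turned into parameters, and `bytes_summary` is a hypothetical helper name, not part of any library:

```python
def elide(s, limit=75, head=50, tail=16):
    """Shorten long strings, keeping the beginning and the end."""
    return s if len(s) < limit else s[:head] + "..." + s[-tail:]

def bytes_summary(data):
    """Return a repr-like summary of a bytes object (hypothetical helper)."""
    return "<bytes {} at {}>".format(elide(repr(data)), hex(id(data)))

short = bytes_summary(b"abc")
long = bytes_summary(b"x" * 200)
print(short)
print(long)
```

Short values are shown in full, while long ones keep their first 50 and last 16 characters, which is usually enough to guess whether the content is text or binary.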