# Anatomy of python strings
From the docs: "Strings are immutable sequences of Unicode code points". This requires a bit of unpacking ...
## Terminology
From the sea of technical lingo, I will mostly use three concepts (and often abuse terminology):

Symbol

:   A symbol is an entity that conveys meaning in a given context. It can be seen as a "meme" in that it represents an idea or recognized concept. For example, it can be a single character or unit of text as perceived by a human reader (regardless of the underlying primitive blocks from which it is formed). The digit `1` is a symbol, so is the letter `é`, and so is the emoji 👨‍👩‍👧‍👦.

Character

:   A primitive building block for symbols. It is common to refer to a visible (i.e., a user-perceived) character as a grapheme.

Code point

:   Unicode code points are unsigned integers[^1] that map one-to-one with (primitive) characters. That is, to each character in the Unicode character set there is a corresponding integer code point as its index. For example, the code point 101 corresponds to the grapheme `e`. Every (primitive) character can be seen as a symbol, but the opposite is not true because there are many symbols that do not have an assigned code point. That is, some symbols are defined in terms of a sequence of characters (and thus, of code points). Such symbols are commonly referred to as grapheme clusters. An example of a grapheme cluster is 👨‍👩‍👧‍👦 (as we will see shortly, it consists of 7 characters, 4 of which are graphemes).
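
To make the character/code point mapping concrete, python exposes it through the built-ins `ord` and `chr`:

```python
assert ord("e") == 101       # character -> code point
assert chr(101) == "e"       # code point -> character
assert chr(0x1F468) == "👨"  # code points go well beyond ASCII
```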
## Redundancy of representation
```mermaid
flowchart LR
    subgraph G0 [symbol]
        symbol{"é"}
    end
    subgraph G1 [as one code point]
        one_code_point["é (U+00E9)"]
    end
    subgraph G2 [as two code points]
        dispatch@{ shape: framed-circle }
        dispatch --> two_code_point_1["e (U+0065)"]
        dispatch --> two_code_point_2["́ (U+0301)"]
    end
    symbol --> dispatch
    symbol --> one_code_point
    style symbol font-size:20px
    style one_code_point font-size:18px
    style two_code_point_1 font-size:18px
    style two_code_point_2 font-size:18px
```
In Unicode, the symbol é can be encoded in two ways (see Unicode equivalence). First, it has a dedicated code point (which defines it as a "primitive" grapheme). Second, it can be represented as a combination of e and an acute accent (which makes it a grapheme cluster as well).
```python
s1 = "é"  # using one code point (U+00E9)
s2 = "é"  # using two code points (equivalent to s2 = "e\u0301")

assert s1 != s2
assert len(s1) == 1
assert len(s2) == 2

for char in s2:
    code_point = ord(char)
    print(f"{code_point} ({hex(code_point)})")
```
Output: (1)

```text
101 (0x65)
769 (0x301)
```

1.  `M-x describe-char` in emacs gives (for `s1`, and for the first character of `s2`, respectively):

    ```text
    position: 1 of 1 (0%), column: 0
    character: é (displayed as é) (codepoint 233, #o351, #xe9)
    charset: iso-8859-1 (Latin-1 (ISO/IEC 8859-1))
    code point in charset: 0xE9
    script: latin
    syntax: w which means: word
    category: .:Base, L:Strong L2R, c:Chinese, j:Japanese, l:Latin, v:Viet
    to input: type "C-x 8 RET e9" or "C-x 8 RET LATIN SMALL LETTER E WITH ACUTE"
    buffer code: #xC3 #xA9
    file code: #xC3 #xA9 (encoded by coding system utf-8-unix)
    display: terminal code #xC3 #xA9

    Character code properties: customize what to show
      name: LATIN SMALL LETTER E WITH ACUTE
      old-name: LATIN SMALL LETTER E ACUTE
      general-category: Ll (Letter, Lowercase)
      decomposition: (101 769) ('e' '́')
    ```

    ```text
    position: 1 of 2 (0%), column: 0
    character: e (displayed as e) (codepoint 101, #o145, #x65)
    charset: ascii (ASCII (ISO646 IRV))
    code point in charset: 0x65
    script: latin
    syntax: w which means: word
    category: .:Base, L:Strong L2R, a:ASCII, l:Latin, r:Roman
    to input: type "C-x 8 RET 65" or "C-x 8 RET LATIN SMALL LETTER E"
    buffer code: #x65
    file code: #x65 (encoded by coding system utf-8-unix)
    display: composed to form "é" (see below)

    Composed with the following character(s) "́" by these characters:
      e (#x65) LATIN SMALL LETTER E
      ́ (#x301) COMBINING ACUTE ACCENT

    Character code properties: customize what to show
      name: LATIN SMALL LETTER E
      general-category: Ll (Letter, Lowercase)
      decomposition: (101) ('e')
    ```
## Example: a family
```mermaid
flowchart TD
    family1["👩‍👧"]
    family2["👩‍👩‍👧"]
    family3["👨‍👩‍👧‍👦"]
    family4["👪️"]
    family5["👨‍👦‍👦"]
    C@{ shape: framed-circle, label: "Stop" }
    C --> cp1["👨"]
    C --> cp2["U+200d"]
    C --> cp3["👩"]
    C --> cp4["U+200d"]
    C --> cp5["👧"]
    C --> cp6["U+200d"]
    C --> cp7["👦"]
    family3 --> C
    cp1 -.-> cp1-hex["U+1f468"]
    cp3 -.-> cp3-hex["U+1f469"]
    cp5 -.-> cp5-hex["U+1f467"]
    cp7 -.-> cp7-hex["U+1f466"]
    style family1 font-size:50px
    style family2 font-size:50px
    style family3 font-size:50px
    style family4 font-size:50px
    style family5 font-size:50px
    style cp1 font-size:30px
    style cp2 font-size:30px
    style cp3 font-size:30px
    style cp4 font-size:30px
    style cp5 font-size:30px
    style cp6 font-size:30px
    style cp7 font-size:30px
    style cp1-hex font-size:30px
    style cp3-hex font-size:30px
    style cp5-hex font-size:30px
    style cp7-hex font-size:30px
```
There are various emoji symbols that portray a family. They have different semantics, which is reflected by the code points used to form them. In the representation of the middle one (depicted on the lower levels), there are 4 primitive graphemes glued together with the zero-width joiner character `U+200D`. We can use `list("👨‍👩‍👧‍👦")` to get a list of the characters associated with the code points that form 👨‍👩‍👧‍👦.
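
A minimal check of this structure (the hex values match the diagram above):

```python
family = "👨‍👩‍👧‍👦"
assert len(family) == 7  # 4 emoji graphemes + 3 zero-width joiners
print([hex(ord(char)) for char in family])
# ['0x1f468', '0x200d', '0x1f469', '0x200d', '0x1f467', '0x200d', '0x1f466']
```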
## Indexing
Consider the string `sentence = "This 👨‍👩‍👧‍👦 is my family!"`. As python strings are (stored as) sequences of code points, `sentence[:6]` would give `"This 👨"` because 👨 corresponds to the first (also called a base) code point of 👨‍👩‍👧‍👦. As can be expected, `sentence[:8]` returns `"This 👨‍👩"`, where the zero-width joiner is not visible[^2].
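
A quick check of this slicing behavior (writing the joiner as `\u200d` to make it explicit):

```python
sentence = "This 👨‍👩‍👧‍👦 is my family!"
assert sentence[:6] == "This 👨"          # cuts inside the grapheme cluster
assert sentence[:8] == "This 👨\u200d👩"  # the joiner is there, just invisible
```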
The situation can get tricky with symbols that may have different Unicode representations. For example, `len("L'idée a été réévaluée.")` is 23 when each é is the single code point `U+00E9`, while the same sentence with each é written as `"e\u0301"` has length 29, because those symbols are then encoded using two code points each. One can imagine strings with a mix of representations for the same symbols, which can be difficult to handle in an ad hoc manner.
## Grapheme clustering
The Unicode standard defines rules for identifying sequences of code points that are meant to form a particular symbol (i.e., a grapheme cluster). Finding symbol boundaries is a common problem, e.g., in text editors and terminal emulators. As an example, consider the following functionality from the grapheme[^3] package:
```python
import grapheme

sentence = "This 👨‍👩‍👧‍👦 is my family!"
assert len(sentence) == 26
assert grapheme.length(sentence) == 20
assert not grapheme.startswith(sentence, sentence[:6])
```
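
The package also offers grapheme-aware alternatives to plain indexing; a short sketch using its `slice` and `graphemes` helpers, which keep clusters intact:

```python
import grapheme

sentence = "This 👨‍👩‍👧‍👦 is my family!"
# the sixth grapheme is the whole family cluster, so nothing is cut in half
assert grapheme.slice(sentence, 0, 6) == "This 👨‍👩‍👧‍👦"
# iterate over grapheme clusters rather than code points
assert list(grapheme.graphemes("👨‍👩‍👧‍👦!")) == ["👨‍👩‍👧‍👦", "!"]
```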
## Normalization
The `unicodedata` module is a part of python's standard library and can be used to normalize a string. That is, to detect symbols for which alternative Unicode encodings exist and to convert them to a given canonical form.
```python
import unicodedata

s1 = "L'idée a été réévaluée."
assert len(s1) == 23

# each "é" becomes "e\u0301"
s2 = unicodedata.normalize("NFD", s1)  # canonical decomposition
assert len(s2) == 29  # (1)!

s3 = unicodedata.normalize("NFC", s2)  # canonical composition
assert len(s3) == 23
assert s1 == s3
assert s1 != s2
```
1.  While the representation of symbols resulting from the NFD canonical decomposition may contain more code points, it allows for greater flexibility of text processing in many contexts, e.g., string pattern matching.
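
One practical use is comparing strings by canonical equivalence rather than by raw code points. The `nfc_equal` helper below is illustrative (it is not part of `unicodedata`):

```python
import unicodedata

def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after canonical composition (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

assert "é" != "e\u0301"           # code-point comparison: different
assert nfc_equal("é", "e\u0301")  # canonical equivalence: the same symbol
```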
## Memory footprint
The above discussion is mostly abstract in that it makes no assumptions on how code points (ranging from `0` to `1114111`) are to be stored in memory. Starting from PEP 393, python addresses the memory storage problem in a pragmatic way by handling four cases which depend only on one parameter: the largest code point occurring in the string.
```python
import sys
import unicodedata

s1 = "L'idée a été réévaluée."
s2 = unicodedata.normalize("NFD", s1)
m1, m2 = max(s1), max(s2)
print(f"[s1]: {ord(m1)} ( {m1} ) #bytes = {sys.getsizeof(s1)}")
print(f"[s2]: {ord(m2)} ( {m2} ) #bytes = {sys.getsizeof(s2)}")
```
The largest code point for the `s2` string corresponds to the combining acute accent, while for the `s1` string it corresponds to `é`.
The four cases are:

$$
c(s) =
\begin{cases}
1 & \text{if } m(s) < 2^{7} \text{ (the ASCII case)} \\
1 & \text{if } 2^{7} \le m(s) < 2^{8} \\
2 & \text{if } 2^{8} \le m(s) < 2^{16} \\
4 & \text{if } 2^{16} \le m(s) \le 1114111,
\end{cases}
$$

where $m(s)$ denotes the largest code point in the string $s$ and $c(s)$ is the number of bytes dedicated to each code point. The memory required to store $s$ is

$$
\mathrm{bytes}(s) = \mathrm{struct\_bytes}(s) + (n + 1) \cdot c(s),
$$

where $n$ is the number of code points in $s$ and the size of the C-struct that holds the data is given by[^4]

$$
\mathrm{struct\_bytes}(s) =
\begin{cases}
40 & \text{in the ASCII case} \\
56 & \text{otherwise.}
\end{cases}
$$
The above logic is implemented in the `string_bytes` function below[^5].
```python
def string_bytes(s):
    numb_code_points, max_code_point = len(s), ord(max(s))
    # C-structs in cpython/Objects/unicodeobject.c
    # ----------------------------------------------
    # ASCII (use PyASCIIObject):
    #   2 x ssize_t = 16
    #   6 x unsigned int = 24
    # otherwise (use PyCompactUnicodeObject):
    #   1 x PyASCIIObject = 40
    #   1 x ssize_t = 8
    #   1 x char * = 8
    # assuming an x86_64 architecture
    struct_bytes = 56
    if max_code_point < 2**7:
        code_point_bytes = 1
        struct_bytes = 40
    elif max_code_point < 2**8:
        code_point_bytes = 1
    elif max_code_point < 2**16:
        code_point_bytes = 2
    else:
        code_point_bytes = 4
    # `+ 1` for zero termination
    # the result is identical to sys.getsizeof(s)
    return struct_bytes + (numb_code_points + 1) * code_point_bytes
```
For the above example, `s1` is `56 + (23 + 1) * 1 = 80` bytes because it falls in the second case, as its largest code point is 233. The string `s2`, on the other hand, falls in the third case because the acute accent has a code point above 255 (so its size is `56 + (29 + 1) * 2 = 116` bytes).
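
We can cross-check these numbers against `sys.getsizeof` (reusing `string_bytes` from above; the exact values assume CPython 3.12 on `x86_64`, as in the footnotes):

```python
import sys
import unicodedata

s1 = "L'idée a été réévaluée."
s2 = unicodedata.normalize("NFD", s1)

# our model agrees with CPython's own accounting
assert string_bytes(s1) == sys.getsizeof(s1) == 80
assert string_bytes(s2) == sys.getsizeof(s2) == 116
```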
Three clear advantages of the PEP 393 approach:

- an optimized ASCII implementation can be used for the most common (ASCII) case
- the constant number of bytes per code point[^6] results in constant-time indexing and facilitates other operations
- strings containing non-BMP characters, i.e., code points greater than $2^{16} - 1$, can be handled natively.
On the flip side, concatenating a single emoji to an ASCII string quadruples the storage per code point: every code point in the result is then stored in 4 bytes.
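
A quick way to see this effect (again, the exact numbers assume CPython 3.12 on `x86_64`):

```python
import sys

ascii_str = "a" * 100
assert sys.getsizeof(ascii_str) == 40 + (100 + 1) * 1  # case 1: 141 bytes

# one appended emoji forces 4 bytes for *every* code point
emoji_str = ascii_str + "👨"
assert sys.getsizeof(emoji_str) == 56 + (101 + 1) * 4  # case 4: 464 bytes
```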
## Code units
The building block used to actually store a code point in memory is often called a code unit. For example, consider the acute accent (`U+0301`):
```mermaid
flowchart TD
    s["U+0301"]
    s --> utf8["UTF-8"]
    s --> utf16["UTF-16"]
    s --> utf32["UTF-32"]
    C@{ shape: framed-circle, label: "Stop" }
    C -.-> utf8-1["CC"]
    C -.-> utf8-2["81"]
    utf8 -.-> C
    utf16 -.-> utf16-1["0103"]
    utf32 -.-> utf32-1["01030000"]
    style utf8 stroke-width:2px,stroke-dasharray: 5 5
    style utf16 stroke-width:2px,stroke-dasharray: 5 5
    style utf32 stroke-width:2px,stroke-dasharray: 5 5
```
- with a `utf-8` encoding there are two 8-bit code units (`0xCC` and `0x81`)
- with a `utf-16` encoding there is one 16-bit code unit (`0x0301`)
- with a `utf-32` encoding there is one 32-bit code unit (`0x00000301`).
Note that, in the above example, the code units for `utf-16` and `utf-32` are stored using little-endian byte order.
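
This can be verified directly with `str.encode` (using the `-le` codec variants, which are little-endian and omit the byte order mark):

```python
accent = "\u0301"  # combining acute accent
assert accent.encode("utf-8").hex() == "cc81"          # two 8-bit code units
assert accent.encode("utf-16-le").hex() == "0103"      # one 16-bit code unit
assert accent.encode("utf-32-le").hex() == "01030000"  # one 32-bit code unit
```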
## Four string encodings
A different encoding is used in each of the four cases discussed above:

- case 1: ASCII (which is equivalent to UTF-8 in this range)
- case 2: UCS1 (i.e., LATIN-1)
- case 3: UCS2 (which coincides with UTF-16 here, as only BMP code points occur)
- case 4: UCS4 (i.e., UTF-32).
For example, the string `mess` in the snippet below has 8 code points and $2^{8} \le m(\mathrm{mess}) = 65039 < 2^{16}$, hence we are in case 3, in which the UTF-16 encoding should be used. At the end, the encoding computed manually is compared[^7] with the actual memory occupied by our string.
```python
mess = "I♥️日本ГО©"
assert len(mess) == 8
assert ord(max(mess)) == 65039  # case 3: 255 < 65039 < 65536

# utf-16-le stands for UTF-16 with little-endian byte order (and no BOM)
encoding = b''.join([char.encode("utf-16-le") for char in mess]).hex()
assert string_bytes(mess) == 74  # 56 + (8 + 1) * 2
assert len(encoding) == 32  # i.e., 16 bytes as it is in hex
assert encoding == "490065260ffee5652c6713041e04a900"

# --------------------------------------------------------------------------
# compare to ground truth (this is a hack!)
# --------------------------------------------------------------------------
import ctypes
import sys

def memory_dump(string):
    address = id(string)  # assuming CPython, where id gives the address
    buffer = (ctypes.c_char * sys.getsizeof(string)).from_address(address)
    return bytes(buffer)

# [56:] removes what we called struct_bytes above (in CPython they come first)
# [:-2] removes the zero-termination bytes
assert memory_dump(mess)[56:-2].hex() == encoding
# --------------------------------------------------------------------------
```
## Bytes objects
As we have seen, the code units used to store a python string in memory depend on the string itself and are abstracted away from the user. While this is a good thing in many cases, sometimes we need more fine-grained control. To this end, python provides the bytes object (an immutable sequence of single bytes). Actually, we already used it in the previous example, as it is the return type of `str.encode`.
Let us consider the string `a_man = "a👨"`. By now we know that it is stored using 4 bytes per code point. Using `a_man.encode("utf-32-le")` (little-endian, without a byte order mark) we obtain:

- `"a"`: 97, 0, 0, 0
- `"👨"`: 104, 244, 1, 0.

If we relax the constraint of a constant number of bytes per code point, we can dedicate less space to our string. Using `a_man.encode("utf-16-le")` we obtain:

- `"a"`: 97, 0
- `"👨"`: 61, 216, 104, 220 (a surrogate pair)

or, using `a_man.encode("utf-8")`:

- `"a"`: 97
- `"👨"`: 240, 159, 145, 168.
All of the above representations have their applications. For example, UTF-8 provides compatibility with ASCII and efficient data storage, while UTF-16 and UTF-32 allow for faster processing of a larger range of characters. Being able to switch between representations easily and efficiently is convenient.
Bytes do not necessarily have to be associated with individual code points, as is the case when using `str.encode`. For example, suppose we want to express the string `"a1b1"` as a bytes object, where each pair of characters represents a byte in hex (i.e., `0xA1` followed by `0xB1`). In this case, using `list("a1b1".encode())` is not appropriate, as it would return `[97, 49, 98, 49]`, which are the ASCII codes for the characters `a`, `1`, `b`, and `1`, respectively. Instead, we should consider the additional structure and use `list(bytes.fromhex("a1b1"))`, which results in `[161, 177]`.
Bytes objects can also be used in other contexts. For instance, `(1).to_bytes(4, byteorder='little')` returns `b'\x01\x00\x00\x00'`, the byte representation of the integer `1` (in little-endian).
## Immutability
The design decision to have immutable strings in python has far-reaching implications related to, e.g., hashing, performance optimizations, garbage collection, and thread safety. In addition to all this, having immutable strings was a prerequisite for the approach in PEP 393.
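
A minimal illustration: in-place modification is rejected at runtime, and it is precisely this guarantee that makes strings safely hashable:

```python
s = "immutable"
try:
    s[0] = "I"  # strings cannot be modified in place
except TypeError as err:
    print(err)  # 'str' object does not support item assignment

# immutability makes strings hashable, hence usable as dict keys
lookup = {s: 42}
assert lookup["immutable"] == 42
```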
[^1]: Often expressed as a hexadecimal number.
[^2]: The string might be rendered as `"This 👨\u200d👩"`.
[^3]: `pip install grapheme`
[^4]: Assuming an `x86_64` architecture (see the `string_bytes` function for more details).
[^5]: Based on `PyObject * PyUnicode_New(Py_ssize_t size, Py_UCS4 maxchar)` in `unicodeobject.c`.
[^6]: The smallest possible size is always chosen.
[^7]: We used a CPython implementation of python 3.12.