Strings/Unicode
Both Python and C++ have core types to represent text and these are expected
to be freely interchangeable.
cppyy
makes it easy to do just that for the most common cases, while
allowing customization where necessary to cover the full range of diverse use
cases (such as different codecs).
In addition to these core types, there is a range of other character types,
from const char*
and std::wstring
to bytes
, that see much less
use, but are also fully supported.
std::string
The C++ core type std::string
is considered the equivalent of Python’s
str
, even as purely implementation-wise, it is more akin to bytes
:
as a practical matter, a C++ programmer would use std::string
where a
Python developer would use str
(and vice versa), not bytes
.
A Python str
is unicode, however, whereas an std::string
is character
based, thus conversions require encoding or decoding.
To allow for different encodings, cppyy
defers implicit conversions
between the two types until forced, at which point it will default to seeing
std::string
as ASCII based and str
to use the UTF-8 codec.
To support this, the bound std::string
has been pythonized to allow it to
be a drop-in for a range of uses as appropriate within the local context.
In particular, it is sometimes necessary (e.g. for function arguments that
take a non-const reference or a pointer to non-const std::string
variables), to use an actual std::string
instance to allow in-place
modifications.
The pythonizations then allow their use where str
is expected.
For example:
>>> cppyy.cppexec("std::string gs;") True >>> cppyy.gbl.gs = "hello" >>> type(cppyy.gbl.gs) # C++ std::string type <class cppyy.gbl.std.string at 0x7fbb02a89880> >>> d = {"hello": 42} # dict filled with str >>> d[cppyy.gbl.gs] # drop-in use of std::string -> str 42 >>>
To handle codecs other than UTF-8, the std::string
pythonization adds a
decode
method, with the same signature as the equivalent method of
bytes
.
If it is known that a specific C++ function always returns an std::string
representing unicode with a codec other than UTF-8, it can in turn be
explicitly pythonized to do the conversion with that codec.
std::string_view
It is possible to construct a (char-based) std::string_view
from a Python
str
, but it requires the unicode object to be encoded and by default,
UTF-8 is chosen.
This will give the expected result if all characters in the str
are from
the ASCII set, but otherwise it is recommend to encode on the Python side and
pass the resulting bytes
object instead.
std::wstring
C++’s “wide” string, std::wstring
, is based on wchar_t
, a character
type that is not particularly portable as it can be 2 or 4 bytes in size,
depending on the platform.
cppyy supports std::wstring
directly, using the wchar_t
array
conversions provided by Python’s C-API.
const char*
The C representation of text, const char*
, is problematic for two
reasons: it does not express ownership; and its length is implicit, namely up
to the first occurrence of '\0'
.
The first can, up to an extent, be ameliorated: there are a range of cases
where ownership can be inferred.
In particular, if the C string is set from a Python str
, it is the latter
that owns the memory and the bound proxy of the former that in turn owns the
(unconverted) str
instance.
However, if the const char*
’s memory is allocated in C/C++, memory
management is by necessity fully manual.
Length, on the other hand, can only be known in the case of a fixed array.
However even then, the more common case is to use the fixed array as a
buffer, with the actual string still only extending up to the '\0'
char,
so that is assumed.
(C++’s std::string
suffers from none of these issues and should always be
preferred when you have a choice.)
char*
The C representation of a character array, char*
, has all the problems of
const char*
, but in addition is often used as “data array of 8-bit int”.
character types
cppyy directly supports the following character types, both as single
variables and in array form: char
, signed char
, unsigned char
,
wchar_t
, char16_t
, and char32_t
.