In the previous article we walked through the ascii_decode step of string object initialization. We said that when ascii_decode cannot consume the entire C-level character pointer (char*) passed in, that is, when the fast ASCII path fails, unicode_decode_utf8 goes on to call _PyUnicodeWriter_InitWithBuffer. A partial snippet of unicode_decode_utf8 is shown below, followed by a small sketch of the fast-path/fallback pattern it implements.
static PyObject *
unicode_decode_utf8(const char *s, Py_ssize_t size,
_Py_error_handler error_handler, const char *errors,
Py_ssize_t *consumed)
{
...
s += ascii_decode(s, end, PyUnicode_1BYTE_DATA(u));
if (s == end) {
return u;
}
// Use _PyUnicodeWriter after fast path is failed.
_PyUnicodeWriter writer;
_PyUnicodeWriter_InitWithBuffer(&writer, u);
writer.pos = s - starts;
Py_ssize_t startinpos, endinpos;
const char *errmsg = "";
PyObject *error_handler_obj = NULL;
PyObject *exc = NULL;
...
}
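The flow above is a classic fast path with a fallback: consume the leading ASCII run first, and only if a non-ASCII byte shows up does the slower, stateful writer path take over from that exact offset. The following standalone sketch is not CPython code; ascii_prefix is a made-up stand-in for ascii_decode, used only to illustrate the pattern.
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for ascii_decode(): copy the leading ASCII bytes to
   dest and report how many bytes were consumed. */
static size_t ascii_prefix(const char *s, const char *end, unsigned char *dest)
{
    const char *p = s;
    while (p < end && (unsigned char)*p < 0x80)
        *dest++ = (unsigned char)*p++;
    return (size_t)(p - s);
}

int main(void)
{
    const char *input = "abc\xe6\x88\x91";   /* "abc" followed by UTF-8 for U+6211 */
    const char *end = input + strlen(input);
    unsigned char buf[16];

    const char *s = input + ascii_prefix(input, end, buf);
    if (s == end) {
        printf("pure ASCII, fast path is enough\n");        /* mirrors `return u;` */
    }
    else {
        /* mirrors `writer.pos = s - starts;`: the slow path resumes from here */
        printf("fast path stopped at offset %td, slow path takes over\n", s - input);
    }
    return 0;
}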
Recall the flow chart of this call chain from the previous article.
After returning from ascii_decode, unicode_decode_utf8 holds a PyASCIIObject whose memory layout is shown in the figure below. During PyUnicodeObject initialization, ascii_decode always starts from the assumption that the input is an ASCII string, which is why the kind field inside the state struct is 1.
The pointer (reference) to this PyASCIIObject is then passed as the second argument to _PyUnicodeWriter_InitWithBuffer for further processing.
The _PyUnicodeWriter interface
Before unicode_decode_utf8 calls _PyUnicodeWriter_InitWithBuffer, it declares a variable of type _PyUnicodeWriter and passes its address to the inline function _PyUnicodeWriter_InitWithBuffer. So what role does _PyUnicodeWriter play in the overall PyUnicode object initialization? If you are curious, it is worth digging into its history.
In 2010, PEP 393 gave Python 3.3 a brand-new Unicode implementation, the Python str type, which is still in use today. The first implementation of PEP 393 used many 32-bit character buffers (Py_UCS4), which wasted a lot of memory and spent too much time converting to 8-bit (Py_UCS1: ASCII and Latin-1) or 16-bit (Py_UCS2: BMP) characters. The internal structure that today's CPython 3.x uses for Unicode strings is quite complex, and when building a new string it tries hard to avoid storing redundant copies, so string memory has to be budgeted carefully. The _PyUnicodeWriter API reduces these expensive memory copies and, in the best case, avoids them entirely.
Below is the source code of the _PyUnicodeWriter struct; its fields are largely self-explanatory.
/* --- _PyUnicodeWriter API ----------------------------------------------- */
typedef struct {
// the object allocated by PyUnicode_New
PyObject *buffer;
void *data;
enum PyUnicode_Kind kind;
Py_UCS4 maxchar;
Py_ssize_t size;
Py_ssize_t pos;
/* minimum number of allocated characters (default: 0) */
Py_ssize_t min_length;
/* minimum character (default: 127, ASCII) */
Py_UCS4 min_char;
/* If non-zero, overallocate the buffer (default: 0). */
unsigned char overallocate;
/* If readonly is 1, the buffer is a shared string
   (cannot be modified) and size is set to 0. */
unsigned char readonly;
} _PyUnicodeWriter;
To be clear, the Objects/unicodeobject.c source file makes heavy use of a family of functions prefixed with _PyUnicodeWriter_; the one introduced here, _PyUnicodeWriter_InitWithBuffer, is the inline function tied to string object initialization. Its substantive work is done by another inline function, _PyUnicodeWriter_Update. If your C fundamentals are solid you will notice that neither function incurs any C runtime stack push/pop overhead, because after compilation their code simply becomes part of the unicode_decode_utf8 function's body.
static inline void
_PyUnicodeWriter_Update(_PyUnicodeWriter *writer)
{
writer->maxchar = PyUnicode_MAX_CHAR_VALUE(writer->buffer);
writer->data = PyUnicode_DATA(writer->buffer);
if (!writer->readonly) {
writer->kind = PyUnicode_KIND(writer->buffer);
writer->size = PyUnicode_GET_LENGTH(writer->buffer);
}
else {
/* use a value smaller than PyUnicode_1BYTE_KIND() so
_PyUnicodeWriter_PrepareKind() will copy the buffer. */
writer->kind = PyUnicode_WCHAR_KIND;
assert(writer->kind <= PyUnicode_1BYTE_KIND);
/* Copy-on-write mode: set buffer size to 0 so
* _PyUnicodeWriter_Prepare() will copy (and enlarge) the buffer on
* next write. */
writer->size = 0;
}
}
// Initialize _PyUnicodeWriter with initial buffer
static inline void
_PyUnicodeWriter_InitWithBuffer(_PyUnicodeWriter *writer, PyObject *buffer)
{
// zero-initialize every field of the writer
memset(writer, 0, sizeof(*writer));
writer->buffer = buffer;
_PyUnicodeWriter_Update(writer);
writer->min_length = writer->size;
}
Continuing the example from the previous article: after _PyUnicodeWriter_InitWithBuffer has executed, the memory state of the _PyUnicodeWriter and the PyASCIIObject is shown in the figure below. One important distinction to keep in mind is that the _PyUnicodeWriter object lives on the C-level stack, not on the heap.
CPython's default encoding is UTF-8, and the defining feature of UTF-8 is that different scripts and symbols are stored with different byte widths: a single UTF-8 byte sequence can mix ASCII characters (1 byte each), many European characters (2 bytes each) and many Asian characters (3 bytes each). How does CPython know where to read a character one byte at a time, and where to switch to reading the next character as 2 or 3 bytes? This is exactly what the _PyUnicodeWriter object is for; while decoding, it tracks the state of the output buffer through its internal fields (a small sketch follows the list below):
- pos: the current write position in the output buffer, i.e. how many characters have been produced so far.
- kind: the byte width used to interpret each character stored in the buffer.
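Here is a minimal sketch of what kind means in practice, with a made-up helper read_char (this is not CPython's PyUnicode_READ, only an illustration): a raw buffer is reinterpreted with a 1-byte or 2-byte element width depending on the kind value.
#include <stdint.h>
#include <stdio.h>

/* Made-up helper: read character number `index` from `data`, interpreting the
   buffer with the element width given by `kind` (1 or 2 bytes). */
static uint32_t read_char(int kind, const void *data, size_t index)
{
    if (kind == 1)
        return ((const uint8_t *)data)[index];
    return ((const uint16_t *)data)[index];
}

int main(void)
{
    uint8_t  latin_buf[] = { 'a', 'b', 'c' };
    uint16_t ucs2_buf[]  = { 0x6211, 0x662F };   /* "我是" as 2-byte code points */

    printf("kind=1, index 1 -> U+%04X\n", (unsigned)read_char(1, latin_buf, 1)); /* 'b' */
    printf("kind=2, index 0 -> U+%04X\n", (unsigned)read_char(2, ucs2_buf, 0));  /* 我 */
    return 0;
}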
Curiously, the data field of the _PyUnicodeWriter points exactly at the address ascii_decode wrote to, i.e. the first byte immediately after the PyASCIIObject header, which is where the inline character data lives.
static Py_ssize_t
ascii_decode(const char *start, const char *end, Py_UCS1 *dest)
{
const char *p = start;
const char *aligned_end = (const char *) _Py_ALIGN_DOWN(end, SIZEOF_LONG);
....
}
The key is the _Py_ALIGN_DOWN macro: it rounds the pointer passed in down to the nearest address aligned to a, which is never greater than p. Don't dig too deeply into why CPython does this just yet; a small standalone demonstration follows the macro.
/* Round pointer "p" down to the closest "a"-aligned address <= "p". */
#define _Py_ALIGN_DOWN(p, a) ((void *)((uintptr_t)(p) & ~(uintptr_t)((a) - 1)))
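Here is a quick standalone demonstration of what the macro computes, reusing the definition above; SIZEOF_LONG is defined by hand only for the sake of the example.
#include <stdint.h>
#include <stdio.h>

#define SIZEOF_LONG sizeof(long)
/* Round pointer "p" down to the closest "a"-aligned address <= "p". */
#define _Py_ALIGN_DOWN(p, a) ((void *)((uintptr_t)(p) & ~(uintptr_t)((a) - 1)))

int main(void)
{
    char buf[32];
    const char *end = buf + 27;   /* e.g. the end of a 27-byte UTF-8 sequence */
    const char *aligned_end = (const char *)_Py_ALIGN_DOWN(end, SIZEOF_LONG);

    /* aligned_end is the last address <= end that is a multiple of
       sizeof(long), so a word-at-a-time fast loop can stop there and never
       read past the real end of the data. */
    printf("end=%p aligned_end=%p trailing bytes=%td\n",
           (void *)end, (void *)aligned_end, end - aligned_end);
    return 0;
}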
Back to the main topic. The entire _PyUnicodeWriter_ function family leans heavily on the following macros, so understanding them is essential if you want to keep the details of string initialization, and of string operations in general, straight in your head; a small layout sketch follows the macros.
/* Fast check to determine whether an object is ready. Equivalent to
PyUnicode_IS_COMPACT(op) || ((PyUnicodeObject*)(op))->data.any) */
#define PyUnicode_IS_READY(op) (((PyASCIIObject*)op)->state.ready)
#define PyUnicode_Check(op) \
PyType_FastSubclass(Py_TYPE(op), Py_TPFLAGS_UNICODE_SUBCLASS)
/* Return true if the string is compact or 0 if not.
No type checks or Ready calls are performed. */
#define PyUnicode_IS_COMPACT(op) \
(((PyASCIIObject*)(op))->state.compact)
/* Return a void pointer to the raw unicode buffer. */
#define _PyUnicode_COMPACT_DATA(op) \
(PyUnicode_IS_ASCII(op) ? \
((void*)((PyASCIIObject*)(op) + 1)) : \
((void*)((PyCompactUnicodeObject*)(op) + 1)))
#define _PyUnicode_NONCOMPACT_DATA(op) \
(assert(((PyUnicodeObject*)(op))->data.any), \
((((PyUnicodeObject *)(op))->data.any)))
/* Return one of the PyUnicode_*_KIND values defined above. */
#define PyUnicode_KIND(op) \
(assert(PyUnicode_Check(op)), \
assert(PyUnicode_IS_READY(op)), \
((PyASCIIObject *)(op))->state.kind)
/* Returns the length of the unicode string. The caller has to make sure that
the string has it's canonical representation set before calling
this macro. Call PyUnicode_(FAST_)Ready to ensure that. */
#define PyUnicode_GET_LENGTH(op) \
(assert(PyUnicode_Check(op)), \
assert(PyUnicode_IS_READY(op)), \
((PyASCIIObject *)(op))->length)
#define PyUnicode_DATA(op) \
(assert(PyUnicode_Check(op)), \
PyUnicode_IS_COMPACT(op) ? _PyUnicode_COMPACT_DATA(op) : \
_PyUnicode_NONCOMPACT_DATA(op))
// defined in Include/object.h
static inline int
PyType_HasFeature(PyTypeObject *type, unsigned long feature) {
return ((PyType_GetFlags(type) & feature) != 0);
}
// defined in Include/object.h
#define PyType_FastSubclass(type, flag) PyType_HasFeature(type, flag)
/* Return a maximum character value which is suitable for creating another
string based on op. This is always an approximation but more efficient
than iterating over the string. */
#define PyUnicode_MAX_CHAR_VALUE(op) \
(assert(PyUnicode_IS_READY(op)), \
(PyUnicode_IS_ASCII(op) ? \
(0x7f) : \
(PyUnicode_KIND(op) == PyUnicode_1BYTE_KIND ? \
(0xffU) : \
(PyUnicode_KIND(op) == PyUnicode_2BYTE_KIND ? \
(0xffffU) : \
(0x10ffffU)))))
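To make the pointer arithmetic in _PyUnicode_COMPACT_DATA concrete, here is a self-contained sketch with a made-up toy_header struct standing in for PyASCIIObject: for a compact object the character data lives in the same allocation, starting immediately after the header, which is exactly what ((PyASCIIObject*)(op) + 1) expresses.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for PyASCIIObject; the real CPython headers are not used. */
typedef struct {
    long length;
    int  kind;
} toy_header;

int main(void)
{
    const char *text = "hello";
    size_t n = strlen(text);

    /* One allocation: header followed by the inline character data + NUL. */
    toy_header *op = malloc(sizeof(toy_header) + n + 1);
    if (op == NULL)
        return 1;
    op->length = (long)n;
    op->kind = 1;

    /* Same trick as ((PyASCIIObject*)(op) + 1): the data area starts right
       after the header. */
    char *data = (char *)(op + 1);
    memcpy(data, text, n + 1);

    printf("length=%ld kind=%d data=\"%s\"\n", op->length, op->kind, data);
    free(op);
    return 0;
}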
Once _PyUnicodeWriter_InitWithBuffer has done its work, we turn back to the part of unicode_decode_utf8 not yet shown (Objects/unicodeobject.c, roughly lines 5020-5034). This fragment walks the C-level byte sequence s in a while loop and, based on the current character width (kind) of the writer, dispatches to the matching *_utf8_decode function to perform the decoding.
unicode_decode_utf8(const char *s, Py_ssize_t size,
_Py_error_handler error_handler, const char *errors,
Py_ssize_t *consumed)
...
writer.pos = s - starts;
Py_ssize_t startinpos, endinpos;
const char *errmsg = "";
PyObject *error_handler_obj = NULL;
PyObject *exc = NULL;
while (s < end) {
Py_UCS4 ch;
int kind = writer.kind;
if (kind == PyUnicode_1BYTE_KIND) {
if (PyUnicode_IS_ASCII(writer.buffer)){
ch = asciilib_utf8_decode(&s, end, writer.data, &writer.pos);
}
else{
ch = ucs1lib_utf8_decode(&s, end, writer.data, &writer.pos);
}
} else if (kind == PyUnicode_2BYTE_KIND) {
ch = ucs2lib_utf8_decode(&s, end, writer.data, &writer.pos);
} else {
assert(kind == PyUnicode_4BYTE_KIND);
ch = ucs4lib_utf8_decode(&s, end, writer.data, &writer.pos);
}
....
}
}
The functions asciilib_utf8_decode, ucs1lib_utf8_decode, ucs2lib_utf8_decode and ucs4lib_utf8_decode all share a single function body declared by the signature below. That function runs to more than 200 lines, so I will not reproduce it in full here; the complete code lives in Objects/stringlib/codecs.h, lines 22-253. The function name is built from a macro, which means that at compile time a separate copy of the function is generated for each character width; a simplified sketch of this macro-template mechanism appears after the signature.
Py_LOCAL_INLINE(Py_UCS4) STRINGLIB(utf8_decode)(
const char**,
const char*,
STRINGLIB_CHAR*,
Py_ssize_t*)
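The one-body-many-copies trick works through token pasting: each stringlib configuration defines the name-building macro so that the shared body expands once per character width. The sketch below uses hypothetical names (MAKE_COUNT_FN, ucs1lib_count_nonzero, ...) and is far simpler than the real Objects/stringlib setup, but it shows the mechanism.
#include <stdio.h>

/* A tiny "template" body parameterized by a name prefix and a character type,
   mimicking how Objects/stringlib/codecs.h is reused for several widths. */
#define MAKE_COUNT_FN(PREFIX, CHAR)                                   \
    static size_t PREFIX##_count_nonzero(const CHAR *s, size_t n) {   \
        size_t count = 0;                                             \
        for (size_t i = 0; i < n; i++)                                \
            if (s[i] != 0)                                            \
                count++;                                              \
        return count;                                                 \
    }

/* Expanding the macro once per width yields independent compiled functions,
   just as asciilib_/ucs1lib_/ucs2lib_/ucs4lib_utf8_decode are generated. */
MAKE_COUNT_FN(ucs1lib, unsigned char)
MAKE_COUNT_FN(ucs2lib, unsigned short)

int main(void)
{
    unsigned char  a[] = { 1, 0, 2 };
    unsigned short b[] = { 0x6211, 0, 0x662F, 0 };

    printf("%zu %zu\n", ucs1lib_count_nonzero(a, 3), ucs2lib_count_nonzero(b, 4));
    return 0;
}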
From the _PyUnicodeWriter memory diagram above, kind is 1, so the loop calls asciilib_utf8_decode. Inside its while loop, ch picks up the byte at the start of the C-level UTF-8 byte sequence, so its initial value is ch = 230 (0xE6).
Py_LOCAL_INLINE(Py_UCS4) STRINGLIB(utf8_decode)(
const char** inptr,
const char* end,
STRINGLIB_CHAR* dest,
Py_ssize_t* outpos){
Py_UCS4 ch;
const char *s = *inptr;
const char *aligned_end = (const char *) _Py_ALIGN_DOWN(end, SIZEOF_LONG);
STRINGLIB_CHAR *p = dest + *outpos;
while(s<end){
ch=(unsigned char)*s; //230
if(ch<0x80){ //0x80=128
.....
}
if(ch<0xE0){ //0xe0=224
.....
}
if(ch<0xF0){ // 0xF0 = 240
/* \xE0\xA0\x80-\xEF\xBF\xBF -- 0800-FFFF */
Py_UCS4 ch2, ch3;
if (end - s < 3) {
/* unexpected end of data: the caller will decide whether
it's an error or not */
if (end - s < 2)
break;
ch2 = (unsigned char)s[1];
if (!IS_CONTINUATION_BYTE(ch2) ||
(ch2 < 0xA0 ? ch == 0xE0 : ch == 0xED))
/* for clarification see comments below */
goto InvalidContinuation1;
break;
}
ch2 = (unsigned char)s[1];
ch3 = (unsigned char)s[2];
if (!IS_CONTINUATION_BYTE(ch2)) {
/* invalid continuation byte */
goto InvalidContinuation1;
}
if (ch == 0xE0) {
if (ch2 < 0xA0)
/* invalid sequence
\xE0\x80\x80-\xE0\x9F\xBF -- fake 0000-0800 */
goto InvalidContinuation1;
} else if (ch == 0xED && ch2 >= 0xA0) {
/* Decoding UTF-8 sequences in range \xED\xA0\x80-\xED\xBF\xBF
will result in surrogates in range D800-DFFF. Surrogates are
not valid UTF-8 so they are rejected.
See https://www.unicode.org/versions/Unicode5.2.0/ch03.pdf
(table 3-7) and http://www.rfc-editor.org/rfc/rfc3629.txt */
goto InvalidContinuation1;
}
if (!IS_CONTINUATION_BYTE(ch3)) {
/* invalid continuation byte */
goto InvalidContinuation2;
}
ch = (ch << 12) + (ch2 << 6) + ch3 -
((0xE0 << 12) + (0x80 << 6) + 0x80);
assert ((ch > 0x07FF) && (ch <= 0xFFFF));
s += 3;
if (STRINGLIB_MAX_CHAR <= 0x07FF ||
(STRINGLIB_MAX_CHAR < 0xFFFF && ch > STRINGLIB_MAX_CHAR))
/* Out-of-range */
goto Return;
*p++ = ch;
continue;
}
if(ch<0xF5){
....
}
goto InvalidStart;
}
ch=0;
Return:
*inptr = s;
*outpos = p - dest;
return ch;
InvalidStart:
ch = 1;
goto Return;
InvalidContinuation1:
ch = 2;
goto Return;
InvalidContinuation2:
ch = 3;
goto Return;
InvalidContinuation3:
ch = 4;
goto Return;
}
Only after the code inside the if (ch < 0xF0) block has run does ch hold 25105. asciilib_utf8_decode hands this ch back to unicode_decode_utf8, and the pointer s has already advanced past the three bytes (s += 3), so it now points at the next byte \xe6, as shown in the memory diagram below.
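To check where 25105 comes from, the three bytes \xe6\x88\x91 can be pushed through the same arithmetic as the if (ch < 0xF0) branch. The standalone sketch below also shows why the asciilib variant gives up on this character: 0x6211 is far above the ASCII maximum of 0x7F, so the decoded value is handed back to unicode_decode_utf8 instead of being written into the 1-byte buffer.
#include <stdio.h>

int main(void)
{
    const unsigned char s[] = { 0xE6, 0x88, 0x91 };   /* UTF-8 bytes of "我" */
    unsigned int ch  = s[0];
    unsigned int ch2 = s[1];
    unsigned int ch3 = s[2];

    /* Same computation as the 3-byte branch of STRINGLIB(utf8_decode). */
    ch = (ch << 12) + (ch2 << 6) + ch3 -
         ((0xE0 << 12) + (0x80 << 6) + 0x80);

    printf("code point: %u (0x%04X)\n", ch, ch);      /* 25105 / 0x6211 */

    /* asciilib is generated with STRINGLIB_MAX_CHAR == 0x7F, so this value is
       out of range for the current 1-byte buffer: decoding returns it to the
       caller, which then widens the writer. */
    if (ch > 0x7F)
        printf("out of range for an ASCII buffer, the writer must widen\n");
    return 0;
}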
Next, ch = 25105 flows into the switch/case block, which clearly falls through to the default branch and calls _PyUnicodeWriter_WriteCharInline. That inline function goes through the _PyUnicodeWriter_Prepare macro, and in this case the real work is done by _PyUnicodeWriter_PrepareInternal.
unicode_decode_utf8(const char *s, Py_ssize_t size,
_Py_error_handler error_handler, const char *errors,
Py_ssize_t *consumed)
switch (ch) {
case 0:
.....
case 1:
.....
case 2:
.....
case 3:
case 4:
.....
default:
if (_PyUnicodeWriter_WriteCharInline(&writer, ch) < 0)
goto onError;
continue;
}
.....
}
static inline int
_PyUnicodeWriter_WriteCharInline(_PyUnicodeWriter *writer, Py_UCS4 ch)
{
assert(ch <= MAX_UNICODE);
if (_PyUnicodeWriter_Prepare(writer, 1, ch) < 0)
return -1;
PyUnicode_WRITE(writer->kind, writer->data, writer->pos, ch);
writer->pos++;
return 0;
}
/* Prepare the buffer to write 'length' characters
with the specified maximum character.
Return 0 on success, raise an exception and return -1 on error. */
#define _PyUnicodeWriter_Prepare(WRITER, LENGTH, MAXCHAR) \
(((MAXCHAR) <= (WRITER)->maxchar \
&& (LENGTH) <= (WRITER)->size - (WRITER)->pos) \
? 0 \
: (((LENGTH) == 0) \
? 0 \
: _PyUnicodeWriter_PrepareInternal((WRITER), (LENGTH), (MAXCHAR))))
int
_PyUnicodeWriter_PrepareInternal(_PyUnicodeWriter *writer,
Py_ssize_t length, Py_UCS4 maxchar)
{
Py_ssize_t newlen;
PyObject *newbuffer;
assert(maxchar <= MAX_UNICODE);
/* ensure that the _PyUnicodeWriter_Prepare macro was used */
assert((maxchar > writer->maxchar && length >= 0)
|| length > 0);
if (length > PY_SSIZE_T_MAX - writer->pos) {
PyErr_NoMemory();
return -1;
}
newlen = writer->pos + length;
maxchar = Py_MAX(maxchar, writer->min_char);
if (writer->buffer == NULL) {
.....
}
else if (newlen > writer->size) {
.....
}
else if (maxchar > writer->maxchar) {
assert(!writer->readonly);
newbuffer = PyUnicode_New(writer->size, maxchar);
if (newbuffer == NULL)
return -1;
_PyUnicode_FastCopyCharacters(newbuffer, 0,
writer->buffer, 0, writer->pos);
Py_SETREF(writer->buffer, newbuffer);
}
_PyUnicodeWriter_Update(writer);
return 0;
#undef OVERALLOCATE_FACTOR
}
When _PyUnicodeWriter_PrepareInternal runs here it clearly takes the else if (maxchar > writer->maxchar) branch. In our example PyUnicode_New is called a second time, allocating fresh memory and returning a brand-new compact unicode object (this time with a 2-byte kind rather than plain ASCII), which is held by the temporary variable newbuffer. Before _PyUnicode_FastCopyCharacters runs, the memory state looks like this:
When _PyUnicode_FastCopyCharacters executes, the writer's pos field is still 0, so the how_many argument it forwards to _copy_characters is a copy of writer->pos, i.e. 0. As a result, in this example _copy_characters performs no actual copying and returns to _PyUnicode_FastCopyCharacters immediately (a simplified sketch of this widen-and-copy step follows the snippet below).
void
_PyUnicode_FastCopyCharacters(
PyObject *to, Py_ssize_t to_start,
PyObject *from, Py_ssize_t from_start, Py_ssize_t how_many)
{
(void)_copy_characters(to, to_start, from, from_start, how_many, 0);
}
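Here is a simplified standalone sketch of what this widening step boils down to (made-up variables, no reference counting, no error handling): allocate a wider buffer for the same number of characters, copy the pos characters already written, which is zero here so nothing is copied, and then continue decoding into the new buffer.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t size = 9;     /* number of characters the string will hold */
    size_t pos  = 0;     /* characters written so far */

    /* Old 1-byte buffer allocated under the initial ASCII assumption. */
    uint8_t *old_buf = calloc(size + 1, sizeof(uint8_t));
    /* New 2-byte buffer allocated once a character above 0xFF is seen. */
    uint16_t *new_buf = calloc(size + 1, sizeof(uint16_t));
    if (old_buf == NULL || new_buf == NULL)
        return 1;

    /* Equivalent of _PyUnicode_FastCopyCharacters(new, 0, old, 0, pos):
       with pos == 0 the loop body never runs, so nothing is copied. */
    for (size_t i = 0; i < pos; i++)
        new_buf[i] = old_buf[i];

    printf("copied %zu characters before switching buffers\n", pos);

    free(old_buf);       /* the narrow buffer is discarded */
    /* ... decoding would now continue into new_buf ... */
    free(new_buf);
    return 0;
}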
Before returning, _PyUnicodeWriter_WriteCharInline calls the PyUnicode_WRITE macro. From the memory diagram above, the kind argument is 2, data is writer->data, index is writer->pos = 0, and value is ch = 25105. Let's look at what happens here.
- First, PyUnicode_WRITE takes the case PyUnicode_2BYTE_KIND branch.
- Then it writes the value 25105 into the first two bytes of the data region of the PyCompactUnicodeObject; that is exactly what ((Py_UCS2 *)(data))[(index)] = (Py_UCS2)(value) does. The complete macro is shown below.
#define PyUnicode_WRITE(kind, data, index, value) \
do { \
switch ((kind)) { \
case PyUnicode_1BYTE_KIND: { \
assert((kind) == PyUnicode_1BYTE_KIND); \
((Py_UCS1 *)(data))[(index)] = (Py_UCS1)(value); \
break; \
} \
case PyUnicode_2BYTE_KIND: { \
assert((kind) == PyUnicode_2BYTE_KIND); \
((Py_UCS2 *)(data))[(index)] = (Py_UCS2)(value); \
break; \
} \
default: { \
assert((kind) == PyUnicode_4BYTE_KIND); \
((Py_UCS4 *)(data))[(index)] = (Py_UCS4)(value); \
} \
} \
} while (0)
The memory state after the PyUnicode_WRITE macro has executed is shown in the figure below.
You may still be wondering what ch = 25105 actually represents; Python's chr and ord functions make it obvious. Decimal 25105, hexadecimal 0x6211, binary 01100010 00010001 and the UTF-8 byte sequence \xe6\x88\x91 all stand for the Chinese character "我".
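Since the examples in this article stay in C, the same check can be done without chr and ord: the sketch below prints the decimal, hexadecimal and UTF-8 forms of the code point 25105 by applying the standard 3-byte UTF-8 encoding rule in the opposite direction of the decoder.
#include <stdio.h>

int main(void)
{
    unsigned int cp = 25105;                        /* U+6211, "我" */

    /* Standard 3-byte UTF-8 encoding for code points in 0x0800-0xFFFF. */
    unsigned char b0 = 0xE0 | (cp >> 12);
    unsigned char b1 = 0x80 | ((cp >> 6) & 0x3F);
    unsigned char b2 = 0x80 | (cp & 0x3F);

    printf("decimal: %u\n", cp);
    printf("hex:     0x%04X\n", cp);
    printf("utf-8:   \\x%02x\\x%02x\\x%02x\n", b0, b1, b2);   /* \xe6\x88\x91 */
    return 0;
}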
Once _PyUnicodeWriter_WriteCharInline completes and control returns to the while loop in unicode_decode_utf8, the C-level char pointer s already points at position s[3] of the byte sequence, and the first character of "我是一个自由开发者" has been written, as a 2-byte value, into the first two bytes of the 60-byte region pointed to by the data pointer of the PyCompactUnicodeObject.
From the current memory diagram the PyCompactUnicodeObject's kind is now 2, so this time unicode_decode_utf8 calls ucs2lib_utf8_decode. Control lands in the while loop inside ucs2lib_utf8_decode, which mostly executes the code in the if (ch < 0xF0) block, advancing the char pointer s on each iteration until s reaches the end pointer.
- First, on each iteration the local variables ch, ch2 and ch3 hold the bytes at s, s[1] and s[2]; these three bytes together encode one UTF-8 Chinese character, which corresponds to a 2-byte Unicode code point.
- Then the following statement performs the actual UTF-8-to-Unicode conversion:
ch = (ch << 12) + (ch2 << 6) + ch3 - ((0xE0 << 12) + (0x80 << 6) + 0x80);
- Finally, the 2-byte-wide code point held in ch is stored at the location p points to inside the data region of the PyCompactUnicodeObject, and p is then advanced one slot towards higher addresses:
*p++ = ch;
continue;
The GIF below shows, as a sequence of memory-state frames, how the while loop inside ucs2lib_utf8_decode copies data from the C-level byte sequence into the data region of the PyCompactUnicodeObject.
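In place of the animation, here is a compact standalone loop that mimics what the ucs2lib path does for input consisting only of 3-byte UTF-8 sequences: decode three bytes at a time and append the 2-byte code point to the destination through *p++ = ch. Error handling, the other byte widths and the surrogate checks are deliberately left out of this sketch.
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* "我是" encoded as UTF-8: two characters, three bytes each. */
    const unsigned char src[] = "\xe6\x88\x91\xe6\x98\xaf";
    const unsigned char *s = src;
    const unsigned char *end = src + 6;

    uint16_t dest[8];
    uint16_t *p = dest;                   /* plays the role of writer.data + pos */

    while (s < end) {
        uint32_t ch  = s[0];
        uint32_t ch2 = s[1];
        uint32_t ch3 = s[2];
        ch = (ch << 12) + (ch2 << 6) + ch3 -
             ((0xE0 << 12) + (0x80 << 6) + 0x80);
        s += 3;                           /* step over the 3-byte sequence */
        *p++ = (uint16_t)ch;              /* the `*p++ = ch; continue;` step */
    }

    printf("decoded %td code points: U+%04X U+%04X\n",
           p - dest, (unsigned)dest[0], (unsigned)dest[1]);   /* U+6211 U+662F */
    return 0;
}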
When s reaches the end pointer, the while loop inside ucs2lib_utf8_decode terminates and ch = 0 is returned to the caller, unicode_decode_utf8. As you may have guessed, this ch = 0 corresponds to the terminating NUL ('\0') of the whole string, which also sits at the end of the PyCompactUnicodeObject's data region.
The memory layout of our example Chinese string after CPython has finished initializing it is shown in the figure below. As noted earlier, the _PyUnicodeWriter object is stack-based: before the stack frame of unicode_decode_utf8 is torn down, _PyUnicodeWriter_Finish hands the PyASCIIObject (or subclass instance) managed by the writer back to the calling Python code, and the _PyUnicodeWriter itself is destroyed along with the unicode_decode_utf8 stack frame.
Looking back over the whole PyASCIIObject/PyUnicodeObject initialization, even a simple string object goes through a long gauntlet of function calls, which makes the common claim that CPython's internal string initialization is blazingly fast sound like quite a stretch.
Initializing a PyASCIIObject/PyUnicodeObject is, in essence, a wrapping process: decode the byte sequence behind a C-level char pointer and copy its data into a PyASCIIObject or one of its subclasses. Along the way it performs anywhere from one to three malloc allocations. Why?
CPython first assumes the incoming C-level byte sequence is plain ASCII, so the first pass sizes the heap memory for the string under the PyUnicode_1BYTE_KIND assumption. When the subsequent byte-width detection discovers that the sequence is not ASCII after all, it recomputes the required memory under PyUnicode_2BYTE_KIND and calls malloc again. Which obviously means the earlier allocation was wasted work; so much for highly efficient string initialization.
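A rough sketch of the sizing decision described above; kind_for_maxchar is a hypothetical helper that mirrors the thresholds used by PyUnicode_MAX_CHAR_VALUE and PyUnicode_New, and the two calls show the mismatch that forces the second allocation in our example.
#include <stdio.h>

/* Hypothetical helper: map a maximum code point to the storage width CPython
   would pick for the string's data area. */
static int kind_for_maxchar(unsigned int maxchar)
{
    if (maxchar <= 0xFF)
        return 1;            /* PyUnicode_1BYTE_KIND (ASCII or Latin-1) */
    if (maxchar <= 0xFFFF)
        return 2;            /* PyUnicode_2BYTE_KIND */
    return 4;                /* PyUnicode_4BYTE_KIND */
}

int main(void)
{
    /* The first pass assumes ASCII (maxchar 0x7F), but once 0x6211 shows up
       the buffer has to be reallocated with a wider kind. */
    printf("assumed kind: %d\n", kind_for_maxchar(0x7F));
    printf("actual kind:  %d\n", kind_for_maxchar(0x6211));
    return 0;
}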
Summary
This article is, so far, my most detailed analysis of how CPython 3.9 internally initializes PyASCIIObject and its subclass objects.
To be continued...