第2篇:CPython實現(xiàn)原理:整數(shù)對象(前篇)

在CPython中的整數(shù)對象的堆內(nèi)存分配并非在即時對某個需要使用的整數(shù)分配內(nèi)存的，因為這樣勢必對CPython的內(nèi)存利用率非常底下锣咒。而是有一套非常高效的內(nèi)存管理方案就是針對整數(shù)對象-緩沖池機制（高效嗎奇徒，得跟什么參照物對比？那是Python編程技術圈很官腔的褒贊而已）逛钻。我們知道在CPython的內(nèi)存管理模型中长搀，每個內(nèi)建對象都有自己獨有的對象池機制。而本篇我們恰好講解整數(shù)對象緩存池匿辩。

首先針對單個整數(shù)PyLongObject對象的表示法腰耙，CPython3.9有明確的定義

....
/* Long integer representation.
   The absolute value of a number is equal to
        SUM(for i=0 through abs(ob_size)-1) ob_digit[i] * 2**(SHIFT*i)
   Negative numbers are represented with ob_size < 0;
   zero is represented by ob_size == 0.
   In a normalized number, ob_digit[abs(ob_size)-1] (the most significant
   digit) is never zero.  Also, in all cases, for all valid i,
        0 <= ob_digit[i] <= MASK.
   The allocation function takes care of allocating extra memory
   so that ob_digit[0] ... ob_digit[abs(ob_size)-1] are actually available.

   CAUTION:  Generic code manipulating subtypes of PyVarObject has to
   aware that ints abuse  ob_size's sign bit.
*/
struct _longobject {
    PyObject_VAR_HEAD
    digit ob_digit[1];
};
....
typedef struct _longobject PyLongObject;

其最終形式，等價如下

....
struct _longobject {
    PyObject ob_base;
    Py_ssize_t ob_size; /* Number of items in variable part */
    digit ob_digit[1];
};
....
typedef struct _longobject PyLongObject;

我們從源代碼的注釋中得到一些關鍵的信息,整數(shù)的絕對值等于如下表達式

SUM（for i = 0 through abs（ob_size）-1）ob_digit [i] * 2 **（SHIFT * i）

如果上面的表達式铲球，我們知道ob_size的絕對值是控制ob_digit數(shù)組的長度沟优，而SHIFT有一個宏常量PyLong_SHIFT定義(下文會提到)。

負數(shù)用ob_size <0表示睬辐；
0由ob_size == 0表示
以標準化數(shù)字ob_digit [abs(ob_size)-1]（最高有效數(shù)字）永遠不會為0。而且宾肺，在所有情況下溯饵，對于所有有效的i，0 <= ob_digit [i] <=掩碼
內(nèi)存分配函數(shù)負責分配額外的內(nèi)存锨用，因此ob_digit [0]到ob_digit [abs(ob_size)-1]實際上可用的有效負載部分丰刊。

綜上所述，對于大整數(shù)在CPython3.x中的有效負載的存儲形式如下圖

而對于整數(shù)對象的存儲方式增拥，CPython3.x中就規(guī)定使用2**PyLong_SHIFT進制的字符串來表示大整數(shù)啄巧，而PyLong_SHIFT的定義在Include/longintrepr.h中找到寻歧，PyLong_SHIFT的可能的默認值是30或15,分別表示

把30位大小的數(shù)值存儲在uint32_t類型的ob_digit數(shù)組中
把15位大小的數(shù)值存儲在uint32_t類型的ob_digit數(shù)組中


#if PYLONG_BITS_IN_DIGIT == 30
typedef uint32_t digit;
typedef int32_t sdigit; /* signed variant of digit */
typedef uint64_t twodigits;
typedef int64_t stwodigits; /* signed variant of twodigits */
#define PyLong_SHIFT    30
#define _PyLong_DECIMAL_SHIFT   9 /* max(e such that 10**e fits in a digit) */
#define _PyLong_DECIMAL_BASE    ((digit)1000000000) /* 10 ** DECIMAL_SHIFT */
#elif PYLONG_BITS_IN_DIGIT == 15
typedef unsigned short digit;
typedef short sdigit; /* signed variant of digit */
typedef unsigned long twodigits;
typedef long stwodigits; /* signed variant of twodigits */
#define PyLong_SHIFT    15
#define _PyLong_DECIMAL_SHIFT   4 /* max(e such that 10**e fits in a digit) */
#define _PyLong_DECIMAL_BASE    ((digit)10000) /* 10 ** DECIMAL_SHIFT */
#else
#error "PYLONG_BITS_IN_DIGIT should be 15 or 30"
#endif
#define PyLong_BASE     ((digit)1 << PyLong_SHIFT)
#define PyLong_MASK     ((digit)(PyLong_BASE - 1))

#if PyLong_SHIFT % 5 != 0
#error "longobject.c requires that PyLong_SHIFT be divisible by 5"
#endif

那么CPython3.x究竟如何實現(xiàn)上面提到的存儲整數(shù)的算法呢？這個可以查看Objects/longobject.c源文件的PyLong_FromLong函數(shù)秩仆，從下面的代碼我們知道码泛，在PyLong_FromLong函數(shù)中，CPython還會調(diào)用IS_SMALL_INT這個宏函數(shù)對傳來C類型的長整形區(qū)分是小整數(shù)還是大整數(shù)澄耍，關于小整數(shù)的話題噪珊，后文再展開。

PyObject *
PyLong_FromLong(long ival)
{
    PyLongObject *v;
    unsigned long abs_ival;
    unsigned long t;  /* unsigned so >> doesn't propagate sign bit */
    int ndigits = 0;
    int sign;

    if (IS_SMALL_INT(ival)) {
        return get_small_int((sdigit)ival);
    }

    if (ival < 0) {
        /* negate: can't write this as abs_ival = -ival since that
           invokes undefined behaviour when ival is LONG_MIN */
        abs_ival = 0U-(unsigned long)ival;
        sign = -1;
    }
    else {
        abs_ival = (unsigned long)ival;
        sign = ival == 0 ? 0 : 1;
    }

    /* Fast path for single-digit ints */
    if (!(abs_ival >> PyLong_SHIFT)) {
        v = _PyLong_New(1);
        if (v) {
            Py_SET_SIZE(v, sign);
            v->ob_digit[0] = Py_SAFE_DOWNCAST(
                abs_ival, unsigned long, digit);
        }
        return (PyObject*)v;
    }

#if PyLong_SHIFT==15
    /* 2 digits */
    if (!(abs_ival >> 2*PyLong_SHIFT)) {
        v = _PyLong_New(2);
        if (v) {
            Py_SET_SIZE(v, 2 * sign);
            v->ob_digit[0] = Py_SAFE_DOWNCAST(
                abs_ival & PyLong_MASK, unsigned long, digit);
            v->ob_digit[1] = Py_SAFE_DOWNCAST(
                  abs_ival >> PyLong_SHIFT, unsigned long, digit);
        }
        return (PyObject*)v;
    }
#endif

    /* Larger numbers: loop to determine number of digits */
    t = abs_ival;
    while (t) {
        ++ndigits;
        t >>= PyLong_SHIFT;
    }
    v = _PyLong_New(ndigits);
    if (v != NULL) {
        digit *p = v->ob_digit;
        Py_SET_SIZE(v, ndigits * sign);
        t = abs_ival;
        while (t) {
            *p++ = Py_SAFE_DOWNCAST(
                t & PyLong_MASK, unsigned long, digit);
            t >>= PyLong_SHIFT;
        }
    }
    return (PyObject *)v;
}

整數(shù)對象的初始化

整數(shù)對象的創(chuàng)建齐莲，我們在前一篇已經(jīng)說過創(chuàng)建一個PyLongObject需要引用與該類型匹配的PyLong_Type實例痢站，我們查看一下PyLong_Type實例的初始化代碼,它位于Objects/longobject.c源文件，

PyTypeObject PyLong_Type = {
    PyVarObject_HEAD_INIT(&PyType_Type, 0)
    "int",                                      /* tp_name */
    offsetof(PyLongObject, ob_digit),           /* tp_basicsize */
    sizeof(digit),                              /* tp_itemsize */
    ....
    long_to_decimal_string,                     /* tp_repr */
    &long_as_number,                            /* tp_as_number */
    ....
    (hashfunc)long_hash,                        /* tp_hash */
    ....
    PyObject_GenericGetAttr,                    /* tp_getattro */
    ....
    Py_TPFLAGS_DEFAULT | Py_TPFLAGS_BASETYPE |
        Py_TPFLAGS_LONG_SUBCLASS,               /* tp_flags */
    long_doc,                                   /* tp_doc */
    ....
    long_richcompare,                           /* tp_richcompare */
    ....
    long_methods,                               /* tp_methods */
    0,                                          /* tp_members */
    long_getset,                                /* tp_getset */
    ....
    long_new,                                   /* tp_new */
    PyObject_Del,                               /* tp_free */
};

注意PyLong對象的實例化需要調(diào)用PyLong_Type實例的tp_new字段相關的參數(shù)选酗，它是一個long_new的函數(shù)指針阵难，那么long_new函數(shù)的具體定義

static PyObject *
long_new(PyTypeObject *type, PyObject *args, PyObject *kwargs)
{
    PyObject *return_value = NULL;
    static const char * const _keywords[] = {"", "base", NULL};
    static _PyArg_Parser _parser = {NULL, _keywords, "int", 0};
    PyObject *argsbuf[2];
    PyObject * const *fastargs;
    Py_ssize_t nargs = PyTuple_GET_SIZE(args);
    Py_ssize_t noptargs = nargs + (kwargs ? PyDict_GET_SIZE(kwargs) : 0) - 0;
    PyObject *x = NULL;
    PyObject *obase = NULL;

    fastargs = _PyArg_UnpackKeywords(_PyTuple_CAST(args)->ob_item, nargs, kwargs, NULL, &_parser, 0, 2, 0, argsbuf);
    if (!fastargs) {
        goto exit;
    }
    if (nargs < 1) {
        goto skip_optional_posonly;
    }
    noptargs--;
    x = fastargs[0];
skip_optional_posonly:
    if (!noptargs) {
        goto skip_optional_pos;
    }
    obase = fastargs[1];
skip_optional_pos:
    return_value = long_new_impl(type, x, obase);

exit:
    return return_value;
}

而long_new函數(shù)其實最終是調(diào)用long_new_impl函數(shù)，其具體定義在源文件的Objects/longobject.c源文件芒填，

static PyObject *
long_new_impl(PyTypeObject *type, PyObject *x, PyObject *obase)
/*[clinic end generated code: output=e47cfe777ab0f24c input=81c98f418af9eb6f]*/
{
    Py_ssize_t base;

    if (type != &PyLong_Type)
        return long_subtype_new(type, x, obase); /* Wimp out */
   //當x為NULL呜叫，底數(shù)非NULL返回,以0為參數(shù)調(diào)用PyLong_FromLong函數(shù)
    if (x == NULL) {
        if (obase != NULL) {
            PyErr_SetString(PyExc_TypeError,
                            "int() missing string argument");
            return NULL;
        }
        return PyLong_FromLong(0L);
    }
    //當x非NULL，obase為NULL,調(diào)用PyNumber_Long函數(shù)
    if (obase == NULL)
        return PyNumber_Long(x);

    base = PyNumber_AsSsize_t(obase, NULL);
    if (base == -1 && PyErr_Occurred())
        return NULL;
    //base只能在屬于0或區(qū)間[2,36]等整數(shù)
    if ((base != 0 && base < 2) || base > 36) {
        PyErr_SetString(PyExc_ValueError,
                        "int() base must be >= 2 and <= 36, or 0");
        return NULL;
    }

    if (PyUnicode_Check(x))
        return PyLong_FromUnicodeObject(x, (int)base);
    else if (PyByteArray_Check(x) || PyBytes_Check(x)) {
        const char *string;
        if (PyByteArray_Check(x))
            string = PyByteArray_AS_STRING(x);
        else
            string = PyBytes_AS_STRING(x);
        return _PyLong_FromBytes(string, Py_SIZE(x), (int)base);
    }
    else {
        PyErr_SetString(PyExc_TypeError,
                        "int() can't convert non-string with explicit base");
        return NULL;
    }
}

當使用Python層面的內(nèi)置類型class int(object)在實例化int(),事實上會調(diào)用到C底層的函數(shù)接口,會經(jīng)歷如下過程：
int(...) → PyObject → PyLong_Type → long_new → long_new_impl

C底層long_new_impl函數(shù)很大程度上反映了Python層面的類int在實例化時的行為氢烘。因為針對參數(shù)x的不同情況怀偷，long_new_impl根據(jù)相應的條件去調(diào)用PyLong_前綴的函數(shù)族中對應的函數(shù)來實例化PyLongObject

例如將數(shù)字或字符串轉(zhuǎn)換為整數(shù)，如果沒有參數(shù)播玖，則返回0椎工。如果x是數(shù)字，則返回x._int_()蜀踏。對于傳入的位置參數(shù)是浮點數(shù)字维蒙，這會截斷為零。

如果x不是數(shù)字或給出base果覆，則x必須是字符串颅痊，字節(jié)或字節(jié)數(shù)組，表示給base中的整數(shù)文字局待。文字可以以“ +”或“-”開頭斑响，并用空格包圍。 base默認為10钳榨。有效base為0和2-36舰罚。base為0表示將字符串的基數(shù)解釋為整數(shù)文字。

備注:我這里并沒打算羅列所有PyLong_函數(shù)族

小型整數(shù)

在Python中就引入了小型整數(shù)對象池(Small Integer Object Pool),那到底多小的整數(shù)為小型整數(shù)呢薛耻？营罢。還記得，上文我們提到PyLong_FromLong函數(shù)時饼齿，有提到IS_SMALL_INT的宏函數(shù)嗎饲漾？該函數(shù)是用于判斷當前傳入PyLong_FromLong函數(shù)的參數(shù)是否為小型整數(shù)蝙搔。

是什么原因?qū)е拢珻Python底層需要區(qū)分小型整數(shù)(Small Integer)和大型整數(shù)(Big Integer)呢考传？什么是小型整數(shù)吃型？顧名思義就是數(shù)值較小的整數(shù)。比如1,7,47,52等伙菊。我們Python編程中,和小型整數(shù)打交道的最多败玉。在CPython中一切對象都是堆中對應的內(nèi)存數(shù)據(jù)實體，試想一下我們不太可能為某段整數(shù)區(qū)間內(nèi)頻繁使用的整數(shù)分配N次堆內(nèi)存镜硕，然后再釋放堆內(nèi)存运翼，這樣勢必令到Python的內(nèi)存管理效率大大降低。并且會給系統(tǒng)內(nèi)核的虛擬內(nèi)存管理帶來嚴重的性能負擔兴枯。嚴重拖慢操作系統(tǒng)的性能血淌。

默認情況下，CPython3.9中關于小型整數(shù)的相關源代碼比較分散财剖，我們先查看一下IS_SMALL_INT宏函數(shù)的定義悠夯，如下代碼片段所示

//位于Objects/longobject.c文件
#define NSMALLPOSINTS           _PY_NSMALLPOSINTS
#define NSMALLNEGINTS           _PY_NSMALLNEGINTS
....

//位于Include/internal/pycore_interp.h文件

//小型整數(shù)的右開區(qū)間，最大值256
#define _PY_NSMALLPOSINTS           257
//小型整數(shù)的左閉區(qū)間躺坟，為-5
#define _PY_NSMALLNEGINTS           5

// The PyInterpreterState typedef is in Include/pystate.h.
struct _is {
  .....
#if _PY_NSMALLNEGINTS + _PY_NSMALLPOSINTS > 0
    /* Small integers are preallocated in this array so that they
       can be shared.
       The integers that are preallocated are those in the range
       -_PY_NSMALLNEGINTS (inclusive) to _PY_NSMALLPOSINTS (not inclusive).
    */
    PyLongObject* small_ints[_PY_NSMALLNEGINTS + _PY_NSMALLPOSINTS];
#endif
};

//位于Objects/longobject.c文件
#if NSMALLNEGINTS + NSMALLPOSINTS > 0

#define IS_SMALL_INT(ival) (-NSMALLNEGINTS <= (ival) && (ival) < NSMALLPOSINTS)
#define IS_SMALL_UINT(ival) ((ival) < NSMALLPOSINTS)

從上面的代碼片段我們的知道沦补，CPython預設的小型整數(shù)的區(qū)間為[-5，257)咪橙，該區(qū)間內(nèi)的所有整數(shù)為填充到一個有small_ints的數(shù)組當中,small_int數(shù)組聲明位于python核心C代碼位于Include/internal/pycore_interp.h文件中夕膀，它是結(jié)構(gòu)體_is的一個字段。

顯然在Python解釋器啟動時會調(diào)用到_PyLong_Init函數(shù)完成small_inits數(shù)組的初始化美侦，如下代碼所示产舞。

int
_PyLong_Init(PyThreadState *tstate)
{
#if NSMALLNEGINTS + NSMALLPOSINTS > 0
    for (Py_ssize_t i=0; i < NSMALLNEGINTS + NSMALLPOSINTS; i++) {
        sdigit ival = (sdigit)i - NSMALLNEGINTS;
        int size = (ival < 0) ? -1 : ((ival == 0) ? 0 : 1);

        PyLongObject *v = _PyLong_New(1);
        if (!v) {
            return -1;
        }

        Py_SET_SIZE(v, size);
        v->ob_digit[0] = (digit)abs(ival);

        tstate->interp->small_ints[i] = v;
    }
#endif

    if (_Py_IsMainInterpreter(tstate)) {
        _PyLong_Zero = PyLong_FromLong(0);
        if (_PyLong_Zero == NULL) {
            return 0;
        }

        _PyLong_One = PyLong_FromLong(1);
        if (_PyLong_One == NULL) {
            return 0;
        }

        /* initialize int_info */
        if (Int_InfoType.tp_name == NULL) {
            if (PyStructSequence_InitType2(&Int_InfoType, &int_info_desc) < 0) {
                return 0;
            }
        }
    }

    return 1;
}

請思考一個問題：小型數(shù)據(jù)真的有意義嗎？

你是否為認為小型整數(shù)的對象池對于實際的生產(chǎn)環(huán)境有實際意義菠剩？其實就我個人而言易猫，其實沒luan用！你試想一下稍微大型的計算用到整數(shù)具壮，它們的字面量不會超過256嗎准颓！只不過CPython源代碼是這么定義，那就照本宣課說一下棺妓。那有更高效的整數(shù)初始化方案嗎瞬场？答案是有的，那就是Cython,通過Cython在Python代碼中靜態(tài)指定需要初始化的整數(shù)變量涧郊，甚至是整數(shù)的數(shù)組或整數(shù)指針。Cython的整數(shù)對象之所有高性能眼五。

因為Cython語法聲明的變量妆艘，都是原始的C級別的數(shù)據(jù)類型彤灶，它們默認是基于C運行時系統(tǒng)的棧，而非CPython的數(shù)據(jù)椗或堆幌陕。
C運行時的棧(stack)提供了原始級別的數(shù)據(jù)訪問，因此存取速度會比基于堆汽煮、或CPython內(nèi)部stack指針構(gòu)建的數(shù)據(jù)棧要快不是一兩個數(shù)量級的問題搏熄，而是快十幾個數(shù)量級。
再者暇赤，Cython語法聲明的整數(shù)對象在編譯后的對象初始化的時間開銷并是恒定時間開銷O(1),Python實例化一個PyLongObject需要的時間開銷心例，最起碼也是O(n^2),因為實例化一個PyLongObject的Python語句等價于CPython內(nèi)部執(zhí)行5-6個指令碼，我們知道每執(zhí)行一個指令碼都要遍歷一次Python的解釋循環(huán)鞋囊，并執(zhí)行其中內(nèi)部C函數(shù)止后。還沒有算上CPython運行時棧和堆的開銷呢！溜腐！

我是基于事實分析問題译株，不像某些Python的極端分子極力吹捧⊥σ妫回歸正題吧歉糜，查看一下代碼吧，當我們碰巧在Python中碰到一個整數(shù)字面量剛好落入small_ints數(shù)據(jù)所指定的區(qū)間內(nèi)望众，那么初始化一個PyLongObject時匪补，PyLong_FromLong函數(shù)會以O(1)時間開銷返回，因為其調(diào)用了get_small_int函數(shù)黍檩。Python極端狂熱分子叉袍，還不趕快找個心靈安慰～～哈哈。

.....
static PyObject *
get_small_int(sdigit ival)
{
    assert(IS_SMALL_INT(ival));
    PyThreadState *tstate = _PyThreadState_GET();
    PyObject *v = (PyObject*)tstate->interp->small_ints[ival + NSMALLNEGINTS];
    Py_INCREF(v);
    return v;
}
....
static PyLongObject *
maybe_small_long(PyLongObject *v)
{
    if (v && Py_ABS(Py_SIZE(v)) <= 1) {
        sdigit ival = MEDIUM_VALUE(v);
        if (IS_SMALL_INT(ival)) {
            Py_DECREF(v);
            return (PyLongObject *)get_small_int(ival);
        }
    }
    return v;
}
#else
#define IS_SMALL_INT(ival) 0
#define IS_SMALL_UINT(ival) 0
#define get_small_int(ival) (Py_UNREACHABLE(), NULL)
#define maybe_small_long(val) (val)
#endif

結(jié)語

基于控制篇幅的原因刽酱，這是《CPython實現(xiàn)原理：整數(shù)對象》的前篇喳逛，關于大型整數(shù)的內(nèi)容，我會放到下篇再說棵里。