分享 | SortedList 源码浅析 (python)

草莓奶昔🍓

5277

2023.06.23

2023.06.25

发布于未知归属地

有序集合 Python3 技术交流

前言：S 代表着什么？新一代有序容器 SortedList
代码整体关系
出发点：短列表的快速插入删除
实现细节 1：分块+线段树
实现细节 2：动态扩容与收缩策略
源码中的技巧
结尾
参考
后记：issue摘录

前言：S 代表着什么？新一代有序容器 SortedList

与基于树的实现相比，使用列表具有一些优势：

大多数插入/删除不需要分配或释放内存。
指向元素的指针密集排列。这有利于硬件的内存架构，并更有效地利用缓存。

代码整体关系

SortedList 根据传不传 key 分为 SortedList、SortedKeyList，不传 key 时降低了函数调用的开销。这里 _new_ 为工厂函数。
SortedDict 和 SortedSet 基于 SortedList 封装实现。

出发点：短列表的快速插入删除

传统的基于树的设计具有更好的理论复杂度，但这忽略了当今软件和硬件的现实。
python 的 list 很快，有利于 memory management 和 random access.
bisect.insort 很快.(其实 js 的 Array.prototype.splice 也很快).
在不太长(1000 到 2000)的 list 上做插入删除操作，常数极小.
在硬件和软件方面，现代处理器已经花费了大量时间来优化类似 mem-cpy/mem-move 的操作。
因此，在这里把插入删除当作了 O(1).
但是长度达到阈值，例如超过 1e4 时，bisect.insort 插入会很慢。
因此需要分块。
插入或删除最常在短的 list 中执行。很少需要添加或删除新列表。

实现细节 1：分块+线段树

两种查询：
- getRankByValue -> 查询块内最大值 -> 在_maxes上二分
- getValueByRank -> 查询位置 -> _index+_offset. 维护这个更难一些,需要在线段树上二分.
  也可以不要求 $O (l o g n)$ ，而是将块的 size 设置大一点，查找 pos 时直接遍历，时间复杂度 $O (s q r t (n))$ ，空间复杂度 $O (n)$ .
三个关键：_maxes, _lists, _index+_offset
- _lists 是每个块内部的元素(有序)
- _maxes 是每个块的最大值(用于二分定位).其实是冗余信息，可以通过 _lists[i][-1] 得到.这样做的好处是避免二级访问，降低 access 的开销.
- _index 一颗线段树，叶子结点维护每个块的长度.
  这个信息对于插入和删除操作非常有用，因为它提供了一种快速定位某个位置在整个有序列表中的准确位置，加速定位操作。
  作者把块的索引统一命名为pos，把块内的索引统一命名为idx。
  _loc(pos,idx)->int: 树上二分，根据(块的索引，块内的索引)返回元素在整个列表中的索引
  _pos(idx)->Tuple[int,int]: 树上二分，返回 idx 对应的(块的索引，块内的索引)
  https://github.com/grantjenks/python-sortedcontainers/blob/92ef500158f87f7684023823d689cfd7bef892a1/src/sortedcontainers/sortedlist.py#L520C19-L520C19
  _offset 是叶子节点的起始索引
Q&A:

Q:为什么不用前缀和+二分来定位
A:因为是动态的，更新前缀和需要整个更新 O(n)，而更新线段树只需要 O(logn)
Q:为什么不用树状数组+二分来定位
~~A:类似的原因，更新复杂度太大，难以做到 O(logn) 维护~~
A:理论上可行，可能作者觉得线段树上二分比树状数组上二分容易写一些

实现细节 2：动态扩容与收缩策略

_expand函数与_delete函数
https://github.com/grantjenks/python-sortedcontainers/blob/92ef500158f87f7684023823d689cfd7bef892a1/src/sortedcontainers/sortedlist.py#L289
https://github.com/grantjenks/python-sortedcontainers/blob/92ef500158f87f7684023823d689cfd7bef892a1/src/sortedcontainers/sortedlist.py#L465
增加和删除元素时还需要根据负载调整_lists 的子列表
如果子列表的长度超过负载(DEFAULT_LOAD_FACTOR)的两倍，则将其一分为二
如果减少到负载的一半，则与邻居合并
系数为 1000 时，可以支持到 1e7 到 1e8 级别的数据.
作者建议设置立方根级别.

See :doc:implementation and :doc:performance-scale for more information.

Q：线段树是怎么处理插入删除的？块分裂时需要在序列中插入一个新的块，合并时则需要删除块，这两个操作用线段树似乎不好处理。
A：是一个(懒的) $O (s q r t (n))$ 的暴力重构

源码中的技巧

缓存类变量/类方法来减少属性查找的开销(大量使用)

_lists = self._lists
_maxes = self._maxes
_add = self.add

update 批量添加元素的处理

如果添加的元素太多if len(values) * 4 >= self._len:,
全部重构一遍，否则直接添加到_lists中.
这里的 4 可能是指线段树最坏时需要的 4 倍空间.

values = sorted(iterable)

if _maxes:
    if len(values) * 4 >= self._len:
        _lists.append(values)
        values = reduce(iadd, _lists, [])
        values.sort()
        self._clear()
    else:
        _add = self.add
        for val in values:
            _add(val)
        return

重写 list 的几个方法，抛出错误

    def append(self, value):
    """Raise not-implemented error.

    Implemented to override `MutableSequence.append` which provides an
    erroneous default implementation.

    :raises NotImplementedError: use ``sl.add(value)`` instead

    """
    raise NotImplementedError("use ``sl.add(value)`` instead")

__getitem__和pop 方法对删除开头和删除结尾做了特殊判断的优化,这两种场合是 O(1)的

对边界和负索引的处理

def index(self, value, start=None, stop=None):
   ...

   if start is None:
       start = 0
   if start < 0:
       start += _len
   if start < 0:
       start = 0

   if stop is None:
       stop = _len
   if stop < 0:
       stop += _len
   if stop > _len:
       stop = _len

   if stop <= start:
       raise ValueError("{0!r} is not in list".format(value))

View 利用 proxy 实现了延迟求值.


class SortedItemsView(ItemsView, Sequence):
   """Sorted items view is a dynamic view of the sorted dict's items.
   When the sorted dict's items change, the view reflects those changes.
   The items view implements the set and sequence abstract base classes.
   """
   __slots__ = ()

   @classmethod
   def _from_iterable(cls, it):
       return SortedSet(it)


   def __getitem__(self, index):
       _mapping = self._mapping
       _mapping_list = _mapping._list

       if isinstance(index, slice):
           keys = _mapping_list[index]
           return [(key, _mapping[key]) for key in keys]

       key = _mapping_list[index]
       return key, _mapping[key]


   __delitem__ = _view_delitem

结尾

看到这里，不由感叹 SortedList 的作者 Grant Jenks 大叔功力之深厚，创作出了这么优秀的模板。
最后，我想以日本著名游戏制作人加藤·惠（1998 ~) 的一句名言结束本文：

参考

https://www.zhihu.com/question/593450942/answer/2966795898
https://grantjenks.com/docs/sortedcontainers/implementation.html

后记：issue摘录

If SortedList is given a sequence, figure out load automatically based on length # 4
有人建议负载因子应该随长度变化动态调整
在作者的本地测试中，以列表长度的立方根作为加载因子似乎效果最好

import timeit

def benchmark(func):
    def wrapper(*args, **kwargs):
        times = [func(*args, **kwargs) for repeat in xrange(5)]
        return sum(times) / len(times)
    return wrapper

@benchmark
def insert(size, load, repeat=100):
    command = 'sl.add(random.randrange({0}))'.format(size)
    setup = ';'.join(
        ['import random',
         'from sortedcontainers import SortedList',
         'sl = SortedList(xrange({0}), load={1})'.format(size, load)])
    return timeit.timeit(command, setup=setup, number=repeat)

def test():
    for size in [int(1e5), int(1e6), int(5e6), int(1e7), int(2e7)]:
        times = [(insert(size, load), load)
                 for load in [50, 100, 150, 200, 1000]]
        times.sort()
        print 'Size:', size, 'Load:', times[0][1]

test()

答案可能是“视情况而定”，因此基准测试是最好的方法。

Implement key argument to all __init__ methods. #5
有人建议增加 key 参数
作者认为作为一组新的类型会更好。对于那些不需要该功能的用户，它会降低性能
Have .add(value) return the index for value #6
有人建议 add 返回插入的下标，这样可以避免再次查找
Make SortedDict and SortedSet inherit from dict/set for speed improvements #17
作者发现继承 dict 和 set 而不是组合，会有性能提升
Support thread-safety #105
有人希望支持线程安全
但作者表示不方便。reduce 就不是线程安全的(需要一个防止删除的锁)
The performance of add method could be improved #162
有人表明不用bisect.insort，而是切片插入的方式会更快
```
p = bisect_right(_lists[pos], value)
_lists[pos][p:p] = [value]
```
经过更多测试，这个技巧对小列表没有帮助。由于 SortedList 中的 DEFAULT_LOAD_FACTOR 是 1000，因此无法提高 add 方法的性能。
Can not distinguish two different items with the same value #161
SortedList 无法区分两个具有相同值的不同对象
作者在这里提到了这个问题
排序容器数据类型有三个要求：
1. The comparison value or key must have a total ordering.
  比较值或键必须具有总排序。
2. The comparison value or key must not change while the value is stored in the sorted container.
  当比较值或键存储在已排序的容器中时，该值不得更改。
3. If the key-function parameter is used, then equal values must have equal keys.
  如果使用键函数参数，则相等的值必须具有相等的键。
如果违反了这三个要求中的任何一个，则分类容器的保修无效，并且将无法正常运行。
"lower_bound()" function same as that of C++ STL's "map" library #172
SortedDict.irange类似于 C++ STL 的 map 的 lower_bound 函数
Feature Proposal: Introduce Higher Level APIs like ceil / floor #207
有人建议：引入更高级别的 API，如 ceil / floor
这之前，在How to do floor and ceiling lookups? #87中作者提到了性能更好的 floor 和 ceiling 方法的实现
作者认为irange可以替代这两个功能
但是，irange使用生成器，效率并不高。特别是对于仅选择一个值的场合，它将引入额外的计算开销