Skip to content

Overview

Source

Success

If not explicitly noted, a sequence refers to not only list, tuple, string, or array, but any iterable.

If not explicitly noted, a function is lazy, where each sequence is processed incrmentally only when necessary and returned incrmentally as a generator.

Info

Empty sequence is always a sub-sequence of any other sequence.

A sequence is always a sub-sequence of itself.

Warning

For tools related to specific tasks, please go to the respective documentation:

Sequence Basics

Tools for basic sequence usages.

productcmp

productcmp(x, y) compares two sequences x and y with equal length according to product order. Returns -1 if smaller, 0 if equal, 1 if greater, and None if not comparable.

  • Throw exception if x and y have different lengths.
productcmp(
    [1, 2, 3],
    [4, 5, 6]
)
# 1

productcmp(
    [1, 2, 3],
    [4, 3, 2]
)
# None

iter2seq

iter2seq(iterable, target=tuple) converts any iterable sequence iterable to an indexable and sizable sequence with type target=tuple if necessary, defaults to tuple.

Warning

This function reads the sequence at once.

Tip

Useful if you need to scan the sequence more than once.

Sequence Transformation

Tools for transforming sequence.

takeat

takeat(poss, seq) takes the elements of seq at the positions in poss.

Danger

poss must already be sorted.

Info

Let be the length of poss and be the length of seq. The complexity of this function is .

Tip

If seq supports random access, it is more efficient to use [seq[i] for i in poss], with the complexity of .

list(takeat(range(0, 10, 3), range(100, 110)))
# [100, 103, 106, 109]

mergeseqs

mergeseqs(seqs, default=None, key=None) merges the sequences of equal length in seqs into a single sequences. Returns None if there is conflict in any position.

  • A position has conflict if there are more than one valid values, where each valid value is not None in default or judged by a key function key.

  • default is a placeholder when there are no valid value in one position.

seqs = [
    (0   , 0   , None, 0   ),
    (None, 1   , 1   , None),
    (2   , None, None, None),
    (None, None, None, None)
]

list(mergeseqs(seqs[1:]))
# [2,
#  1,
#  1,
#  None]

list(mergeseqs(seqs))
# None

sortedbyrank

sortedbyrank(data, ranks, reverse=False) returns the sorted list of data, by the respective rank of each individual element in ranks.

sortedbyrank(
    ['a', 'b', 'c', 'd'],
    [0.2, 0.3, 0.1, 0.0]
)
# ['d', 'c', 'a', 'b']

sortedtorank

sortedtorank(data, ranks, reverse=False) returns the list of data rearranged w.r.t. the sorted positions, by the respective rank of each individual element in ranks.

sortedtorank(
    ['a', 'b', 'c', 'd'],
    [0.2, 0.3, 0.1, 0.0]
)
# ['c', 'd', 'b', 'a']

filterbyother

filterbyother(func, seq) drops the element in seq that fails the pairwise test func between the current element and any other element.

# Drop any number that is not the largest
list(filterbyother(lambda x, y: x >= y, [1, 2, 2, 3, 3, 3]))
# [3, 3, 3]

# Drop any set that is a superset of another set
list(filterbyother(
    lambda x, y: not x >= y,
    [{1,}, {1, 2}, {1, 2, 3}, {2, 3}, {2, 3, 4}]
))
# [{1}, {2, 3}]

Sequence Alignment and Join

Tools for aligning and joining sequences.

matchingfrequencies

matchingfrequencies(*seqs, key=None) returns each item and the respective number of sequences in seqs contains it.

  • Optional key function key can be specified.

Tip

If each sequence is sorted, then optimized sortedtools.matchingfrequencies with the same API should be used for better efficiency.

For the frequency of each item within a single sequence, use toolz.itertoolz.frequencies.

list(matchingfrequencies(
    [0, 1, 2, 3, 4],
    [1, 1, 1, 3, 4],
    [2, 1, 2, 2, 0],
    [1, 1, 1, 2, 2]
))
# [(0, 2), (1, 4), (2, 3), (3, 2), (4, 2)]

match

match(a, b, default=None) matches two sequences a and b in pairs, such that the total number of matching pairs is maximized.

  • If there are multiple alignments having the same number, the leftmost one is returned.

Warning

This function reads all sequences at once.

Tip

If both two sequences are sorted, respectively, then optimized sortedtools.sortedmatch with the same API should be used for better efficiency.

list(match(
    [0,    1, 1, 0, 1],
    [0, 0, 1, 1,    1]
))
# [(0, 0),
#  (None, 0),
#  (1, 1),
#  (1, 1),
#  (0, None),
#  (1, 1)]

align

align(a, b, cost=None, bound=inf, default=None) computes the alignment two sequences a and b, such that the total cost of the aligned sequences given the pair-wise cost function cost(x, y) is minimized.

  • Assume the sequences after alignment are a' and b'. The total cost is sum(cost(x, y) for x, y in zip(a', b')).

  • Both the minimum total cost and the respective aligned sequences are returned as a tuple.

  • In default, the cost function cost(x, y) returns 0 when x == y and 1 when not. This is equal to the edit distance.

  • To speedup the computation, a threshold of maximum cost bound=inf can be specified. When there is no satisfying result, None is returned.

  • If there are multiple alignments having the same cost, the leftmost one is returned.

Warning

This function reads all sequences at once.

align(
    [0,    1, 1, 0, 1],
    [0, 0, 1, 1,    1]
)
# (2, ([0, None, 1, 1,    0, 1],
#      [0,    0, 1, 1, None, 1]))

align(
    [0,    1, 1, 0, 1],
    [0, 0, 1, 1,    1],
    bound=1
)
# None

join

join(leftseq, rightseq, leftkey=None, rightkey=None, leftdefault=no_default, rightdefault=no_default) joins two sequences, optionally according to leftkey and rightkey, respectively. Outer join is also supported.

Warning

This function reads the first sequence leftseq at once. Thus, make leftseq the shorter one for best efficiency.

Tip

If both two sequences are sorted according to leftkey and rightkey, respectively, then optimized sortedtools.sortedjoin with the same API should be used for better efficiency.

Info

join is just a wrapper of toolz.itertools.join with the same more friendly API of sortedtools.sortedjoin.

list(join(
    [   -1, -1, -2,    -4, -5,    -6],
    [0,  1,  1,  2, 3,  4,  5, 5],
    leftkey=abs, leftdefault=None
))
# [(None, 0),
#  (-1, 1),
#  (-1, 1),
#  (-1, 1),
#  (-1, 1),
#  (-2, 2),
#  (None, 3),
#  (-4, 4),
#  (-5, 5),
#  (-5, 5)]

cmpjoin

cmpjoin(leftseq, rightseq, func=operator.eq, leftdefault=no_default, rightdefault=no_default) joins two sequences, optionally according to the condition function func, respectively. Outer join is also supported.

Warning

This function reads the first sequence leftseq at once. Thus, make leftseq the shorter one for best efficiency.

Tip

If both two sequences are sorted according to leftkey and rightkey, respectively, then optimized sortedtools.sortedjoin with the same API should be used for better efficiency.

This function is more flexable than join, and can be used in more scenarios.

list(cmpjoin(
    [   1, 1, 2,    4, 5,    -6],
    [0, 1, 1, 2, 3, 4, 5, 5]
))
# [(1, 1),
#  (1, 1),
#  (1, 1),
#  (1, 1),
#  (2, 2),
#  (4, 4),
#  (5, 5),
#  (5, 5)]

list(cmpjoin(
    [   1, 1, 2,    4, 5,    -6],
    [0, 1, 1, 2, 3, 4, 5, 5],
    lambda x, y: abs(x - y) == 1
))
# [(1, 0),
#  (1, 0),
#  (2, 1),
#  (2, 1),
#  (1, 2),
#  (1, 2),
#  (2, 3),
#  (4, 3),
#  (5, 4),
#  (4, 5),
#  (4, 5)]

templateseq

templateseq(seqs, default=None, simple=True) finds the common template of all the sequences seqs. default=None is used to denote any placeholder sub-sequence.

  • For better performance, option simple is enabled in default for only one scan of the sequences. However, it may work incorrectly for more complex template, which is:

    • Any part of the template appears more than once in the same sequence,

    • Any part of the template does not always appear before or after than another according to the same order part among sequences.

Warning

This function reads all the sequences at once.

This function reads all the sequences more than once when simple = False.

Unlike seqtools.commonsubseq(a, b), which finds the longest common sub-sequence, this function does not guarantee finding the longest template.

Tip

Essentially, each template is a closed sequential pattern covering all the sequences. Please check PrefixSpan-py for more advanced scenarios.

list(templateseq((
    s.split() for s in [
        "Alice likes tea and coffee !",
        "Bob likes sushi and ramen !",
        "Elisa or Anna likes icecream and cake and cookie !"
    ]
), default='*'))
# ['*', 'likes', '*', 'and', '*', '!']

list(templateseq((
    s.split() for s in [
        "Alice likes tea and coffee !",
        "Bob likes sushi and ramen !",
        "Elisa or Anna likes icecream and cake and cookie !"
    ]
), default='*', simple=False))
# ['*', 'likes', '*', 'and', '*', '!']

# For more complex data.
list(templateseq((
    s.split() for s in [
        "! Alice likes tea and coffee ! !",
        "Bob likes sushi and ramen ! !",
        "Elisa or Anna likes icecream and cake and cookie ! !"
    ]
), default='*'))
# Incorrect template.
# ['*', 'likes', '*', 'and', '*']

list(templateseq((
    s.split() for s in [
        "! Alice likes tea and coffee ! !",
        "Bob likes sushi and ramen ! !",
        "Elisa or Anna likes icecream and cake and cookie ! !"
    ]
), default='*', simple=False))
# ['*', 'likes', '*', 'and', '*', '!', '!']