

Overview

Success

If not explicitly noted, a sequence refers to not only list, tuple, string, or array, but any iterable.

If not explicitly noted, a function is lazy, where each sequence is processed incrmentally only when necessary and returned incrmentally as a generator.

Info

Empty sequence is always a sub-sequence of any other sequence.

A sequence is always a sub-sequence of itself.

Warning

For tools related to specific tasks, please go to the respective documentation:

Sequence Basics¶

Tools for basic sequence usages.

`productcmp`¶

productcmp(x, y) compares two sequences x and y with equal length according to product order. Returns -1 if smaller, 0 if equal, 1 if greater, and None if not comparable.

Throw exception if x and y have different lengths.

productcmp(
    [1, 2, 3],
    [4, 5, 6]
)
# 1

productcmp(
    [1, 2, 3],
    [4, 3, 2]
)
# None

`iter2seq`¶

iter2seq(iterable, target=tuple) converts any iterable sequence iterable to an indexable and sizable sequence with type target=tuple if necessary, defaults to tuple.

Warning

This function reads the sequence at once.

Tip

Useful if you need to scan the sequence more than once.

Sequence Transformation¶

Tools for transforming sequence.

`takeat`¶

takeat(poss, seq) takes the elements of seq at the positions in poss.

Danger

poss must already be sorted.

Info

Let $m$ be the length of poss and $n$ be the length of seq. The complexity of this function is $O(n)$ .

Tip

If seq supports random access, it is more efficient to use [seq[i] for i in poss], with the complexity of $O(m)$ .

list(takeat(range(0, 10, 3), range(100, 110)))
# [100, 103, 106, 109]

`mergeseqs`¶

mergeseqs(seqs, default=None, key=None) merges the sequences of equal length in seqs into a single sequences. Returns None if there is conflict in any position.

A position has conflict if there are more than one valid values, where each valid value is not None in default or judged by a key function key.
default is a placeholder when there are no valid value in one position.

seqs = [
    (0   , 0   , None, 0   ),
    (None, 1   , 1   , None),
    (2   , None, None, None),
    (None, None, None, None)
]

list(mergeseqs(seqs[1:]))
# [2,
#  1,
#  1,
#  None]

list(mergeseqs(seqs))
# None

`sortedbyrank`¶

sortedbyrank(data, ranks, reverse=False) returns the sorted list of data, by the respective rank of each individual element in ranks.

sortedbyrank(
    ['a', 'b', 'c', 'd'],
    [0.2, 0.3, 0.1, 0.0]
)
# ['d', 'c', 'a', 'b']

`sortedtorank`¶

sortedtorank(data, ranks, reverse=False) returns the list of data rearranged w.r.t. the sorted positions, by the respective rank of each individual element in ranks.

sortedtorank(
    ['a', 'b', 'c', 'd'],
    [0.2, 0.3, 0.1, 0.0]
)
# ['c', 'd', 'b', 'a']

`filterbyother`¶

filterbyother(func, seq) drops the element in seq that fails the pairwise test func between the current element and any other element.

# Drop any number that is not the largest
list(filterbyother(lambda x, y: x >= y, [1, 2, 2, 3, 3, 3]))
# [3, 3, 3]

# Drop any set that is a superset of another set
list(filterbyother(
    lambda x, y: not x >= y,
    [{1,}, {1, 2}, {1, 2, 3}, {2, 3}, {2, 3, 4}]
))
# [{1}, {2, 3}]

Sequence Alignment and Join¶

Tools for aligning and joining sequences.

`matchingfrequencies`¶

matchingfrequencies(*seqs, key=None) returns each item and the respective number of sequences in seqs contains it.

Optional key function key can be specified.

Tip

If each sequence is sorted, then optimized sortedtools.matchingfrequencies with the same API should be used for better efficiency.

For the frequency of each item within a single sequence, use toolz.itertoolz.frequencies.

list(matchingfrequencies(
    [0, 1, 2, 3, 4],
    [1, 1, 1, 3, 4],
    [2, 1, 2, 2, 0],
    [1, 1, 1, 2, 2]
))
# [(0, 2), (1, 4), (2, 3), (3, 2), (4, 2)]

`match`¶

match(a, b, default=None) matches two sequences a and b in pairs, such that the total number of matching pairs is maximized.

If there are multiple alignments having the same number, the leftmost one is returned.

Warning

This function reads all sequences at once.

Tip

If both two sequences are sorted, respectively, then optimized sortedtools.sortedmatch with the same API should be used for better efficiency.

list(match(
    [0,    1, 1, 0, 1],
    [0, 0, 1, 1,    1]
))
# [(0, 0),
#  (None, 0),
#  (1, 1),
#  (1, 1),
#  (0, None),
#  (1, 1)]

`align`¶

align(a, b, cost=None, bound=inf, default=None) computes the alignment two sequences a and b, such that the total cost of the aligned sequences given the pair-wise cost function cost(x, y) is minimized.

Assume the sequences after alignment are a' and b'. The total cost is sum(cost(x, y) for x, y in zip(a', b')).
Both the minimum total cost and the respective aligned sequences are returned as a tuple.
In default, the cost function cost(x, y) returns 0 when x == y and 1 when not. This is equal to the edit distance.
To speedup the computation, a threshold of maximum cost bound=inf can be specified. When there is no satisfying result, None is returned.
If there are multiple alignments having the same cost, the leftmost one is returned.

Warning

This function reads all sequences at once.

align(
    [0,    1, 1, 0, 1],
    [0, 0, 1, 1,    1]
)
# (2, ([0, None, 1, 1,    0, 1],
#      [0,    0, 1, 1, None, 1]))

align(
    [0,    1, 1, 0, 1],
    [0, 0, 1, 1,    1],
    bound=1
)
# None

`join`¶

join(leftseq, rightseq, leftkey=None, rightkey=None, leftdefault=no_default, rightdefault=no_default) joins two sequences, optionally according to leftkey and rightkey, respectively. Outer join is also supported.

Warning

This function reads the first sequence leftseq at once. Thus, make leftseq the shorter one for best efficiency.

Tip

If both two sequences are sorted according to leftkey and rightkey, respectively, then optimized sortedtools.sortedjoin with the same API should be used for better efficiency.

Info

join is just a wrapper of toolz.itertools.join with the same more friendly API of sortedtools.sortedjoin.

list(join(
    [   -1, -1, -2,    -4, -5,    -6],
    [0,  1,  1,  2, 3,  4,  5, 5],
    leftkey=abs, leftdefault=None
))
# [(None, 0),
#  (-1, 1),
#  (-1, 1),
#  (-1, 1),
#  (-1, 1),
#  (-2, 2),
#  (None, 3),
#  (-4, 4),
#  (-5, 5),
#  (-5, 5)]

`cmpjoin`¶

cmpjoin(leftseq, rightseq, func=operator.eq, leftdefault=no_default, rightdefault=no_default) joins two sequences, optionally according to the condition function func, respectively. Outer join is also supported.

Warning

This function reads the first sequence leftseq at once. Thus, make leftseq the shorter one for best efficiency.

Tip

If both two sequences are sorted according to leftkey and rightkey, respectively, then optimized sortedtools.sortedjoin with the same API should be used for better efficiency.

This function is more flexable than join, and can be used in more scenarios.

list(cmpjoin(
    [   1, 1, 2,    4, 5,    -6],
    [0, 1, 1, 2, 3, 4, 5, 5]
))
# [(1, 1),
#  (1, 1),
#  (1, 1),
#  (1, 1),
#  (2, 2),
#  (4, 4),
#  (5, 5),
#  (5, 5)]

list(cmpjoin(
    [   1, 1, 2,    4, 5,    -6],
    [0, 1, 1, 2, 3, 4, 5, 5],
    lambda x, y: abs(x - y) == 1
))
# [(1, 0),
#  (1, 0),
#  (2, 1),
#  (2, 1),
#  (1, 2),
#  (1, 2),
#  (2, 3),
#  (4, 3),
#  (5, 4),
#  (4, 5),
#  (4, 5)]

`templateseq`¶

templateseq(seqs, default=None, simple=True) finds the common template of all the sequences seqs. default=None is used to denote any placeholder sub-sequence.

For better performance, option simple is enabled in default for only one scan of the sequences. However, it may work incorrectly for more complex template, which is:
- Any part of the template appears more than once in the same sequence,
- Any part of the template does not always appear before or after than another according to the same order part among sequences.

Warning

This function reads all the sequences at once.

This function reads all the sequences more than once when simple = False.

Unlike seqtools.commonsubseq(a, b), which finds the longest common sub-sequence, this function does not guarantee finding the longest template.

Tip

Essentially, each template is a closed sequential pattern covering all the sequences. Please check PrefixSpan-py for more advanced scenarios.

list(templateseq((
    s.split() for s in [
        "Alice likes tea and coffee !",
        "Bob likes sushi and ramen !",
        "Elisa or Anna likes icecream and cake and cookie !"
    ]
), default='*'))
# ['*', 'likes', '*', 'and', '*', '!']

list(templateseq((
    s.split() for s in [
        "Alice likes tea and coffee !",
        "Bob likes sushi and ramen !",
        "Elisa or Anna likes icecream and cake and cookie !"
    ]
), default='*', simple=False))
# ['*', 'likes', '*', 'and', '*', '!']

# For more complex data.
list(templateseq((
    s.split() for s in [
        "! Alice likes tea and coffee ! !",
        "Bob likes sushi and ramen ! !",
        "Elisa or Anna likes icecream and cake and cookie ! !"
    ]
), default='*'))
# Incorrect template.
# ['*', 'likes', '*', 'and', '*']

list(templateseq((
    s.split() for s in [
        "! Alice likes tea and coffee ! !",
        "Bob likes sushi and ramen ! !",
        "Elisa or Anna likes icecream and cake and cookie ! !"
    ]
), default='*', simple=False))
# ['*', 'likes', '*', 'and', '*', '!', '!']

Overview

Sequence Basics¶

productcmp¶

iter2seq¶

Sequence Transformation¶

takeat¶

mergeseqs¶

sortedbyrank¶

sortedtorank¶

filterbyother¶

Sequence Alignment and Join¶

matchingfrequencies¶

match¶

align¶

join¶

cmpjoin¶

templateseq¶

`productcmp`¶

`iter2seq`¶

`takeat`¶

`mergeseqs`¶

`sortedbyrank`¶

`sortedtorank`¶

`filterbyother`¶

`matchingfrequencies`¶

`match`¶

`align`¶

`join`¶

`cmpjoin`¶

`templateseq`¶