Overview
Success
If not explicitly noted, a sequence refers to not only list, tuple, string, or array, but any iterable.
If not explicitly noted, a function is lazy, where each sequence is processed incrmentally only when necessary and returned incrmentally as a generator.
Info
Empty sequence is always a sub-sequence of any other sequence.
A sequence is always a sub-sequence of itself.
Warning
For tools related to specific tasks, please go to the respective documentation:
Sequence Basics¶
Tools for basic sequence usages.
productcmp
¶
productcmp(x, y)
compares two sequences x
and y
with equal length according to product order. Returns -1
if smaller, 0
if equal, 1
if greater, and None
if not comparable.
- Throw exception if
x
andy
have different lengths.
productcmp( [1, 2, 3], [4, 5, 6] ) # 1 productcmp( [1, 2, 3], [4, 3, 2] ) # None
iter2seq
¶
iter2seq(iterable, target=tuple)
converts any iterable sequence iterable
to an indexable and sizable sequence with type target=tuple
if necessary, defaults to tuple.
Warning
This function reads the sequence at once.
Tip
Useful if you need to scan the sequence more than once.
Sequence Transformation¶
Tools for transforming sequence.
takeat
¶
takeat(poss, seq)
takes the elements of seq
at the positions in poss
.
Danger
poss
must already be sorted.
Info
Let be the length of poss
and be the length of seq
. The complexity of this function is .
Tip
If seq
supports random access, it is more efficient to use [seq[i] for i in poss]
, with the complexity of .
list(takeat(range(0, 10, 3), range(100, 110))) # [100, 103, 106, 109]
mergeseqs
¶
mergeseqs(seqs, default=None, key=None)
merges the sequences of equal length in seqs
into a single sequences. Returns None
if there is conflict in any position.
-
A position has conflict if there are more than one valid values, where each valid value is not
None
in default or judged by a key functionkey
. -
default
is a placeholder when there are no valid value in one position.
seqs = [ (0 , 0 , None, 0 ), (None, 1 , 1 , None), (2 , None, None, None), (None, None, None, None) ] list(mergeseqs(seqs[1:])) # [2, # 1, # 1, # None] list(mergeseqs(seqs)) # None
sortedbyrank
¶
sortedbyrank(data, ranks, reverse=False)
returns the sorted list of data
, by the respective rank of each individual element in ranks
.
sortedbyrank( ['a', 'b', 'c', 'd'], [0.2, 0.3, 0.1, 0.0] ) # ['d', 'c', 'a', 'b']
sortedtorank
¶
sortedtorank(data, ranks, reverse=False)
returns the list of data
rearranged w.r.t. the sorted positions, by the respective rank of each individual element in ranks
.
sortedtorank( ['a', 'b', 'c', 'd'], [0.2, 0.3, 0.1, 0.0] ) # ['c', 'd', 'b', 'a']
filterbyother
¶
filterbyother(func, seq)
drops the element in seq
that fails the pairwise test func
between the current element and any other element.
# Drop any number that is not the largest list(filterbyother(lambda x, y: x >= y, [1, 2, 2, 3, 3, 3])) # [3, 3, 3] # Drop any set that is a superset of another set list(filterbyother( lambda x, y: not x >= y, [{1,}, {1, 2}, {1, 2, 3}, {2, 3}, {2, 3, 4}] )) # [{1}, {2, 3}]
Sequence Alignment and Join¶
Tools for aligning and joining sequences.
matchingfrequencies
¶
matchingfrequencies(*seqs, key=None)
returns each item and the respective number of sequences in seqs
contains it.
- Optional key function
key
can be specified.
Tip
If each sequence is sorted, then optimized sortedtools.matchingfrequencies
with the same API should be used for better efficiency.
For the frequency of each item within a single sequence, use toolz.itertoolz.frequencies
.
list(matchingfrequencies( [0, 1, 2, 3, 4], [1, 1, 1, 3, 4], [2, 1, 2, 2, 0], [1, 1, 1, 2, 2] )) # [(0, 2), (1, 4), (2, 3), (3, 2), (4, 2)]
match
¶
match(a, b, default=None)
matches two sequences a
and b
in pairs, such that the total number of matching pairs is maximized.
- If there are multiple alignments having the same number, the leftmost one is returned.
Warning
This function reads all sequences at once.
Tip
If both two sequences are sorted, respectively, then optimized sortedtools.sortedmatch
with the same API should be used for better efficiency.
list(match( [0, 1, 1, 0, 1], [0, 0, 1, 1, 1] )) # [(0, 0), # (None, 0), # (1, 1), # (1, 1), # (0, None), # (1, 1)]
align
¶
align(a, b, cost=None, bound=inf, default=None)
computes the alignment two sequences a
and b
, such that the total cost of the aligned sequences given the pair-wise cost function cost(x, y)
is minimized.
-
Assume the sequences after alignment are
a'
andb'
. The total cost issum(cost(x, y) for x, y in zip(a', b'))
. -
Both the minimum total cost and the respective aligned sequences are returned as a tuple.
-
In default, the cost function
cost(x, y)
returns0
whenx == y
and1
when not. This is equal to the edit distance. -
To speedup the computation, a threshold of maximum cost
bound=inf
can be specified. When there is no satisfying result,None
is returned. -
If there are multiple alignments having the same cost, the leftmost one is returned.
Warning
This function reads all sequences at once.
align( [0, 1, 1, 0, 1], [0, 0, 1, 1, 1] ) # (2, ([0, None, 1, 1, 0, 1], # [0, 0, 1, 1, None, 1])) align( [0, 1, 1, 0, 1], [0, 0, 1, 1, 1], bound=1 ) # None
join
¶
join(leftseq, rightseq, leftkey=None, rightkey=None, leftdefault=no_default, rightdefault=no_default)
joins two sequences, optionally according to leftkey
and rightkey
, respectively. Outer join is also supported.
Warning
This function reads the first sequence leftseq
at once. Thus, make leftseq
the shorter one for best efficiency.
Tip
If both two sequences are sorted according to leftkey
and rightkey
, respectively, then optimized sortedtools.sortedjoin
with the same API should be used for better efficiency.
Info
join
is just a wrapper of toolz.itertools.join
with the same more friendly API of sortedtools.sortedjoin
.
list(join( [ -1, -1, -2, -4, -5, -6], [0, 1, 1, 2, 3, 4, 5, 5], leftkey=abs, leftdefault=None )) # [(None, 0), # (-1, 1), # (-1, 1), # (-1, 1), # (-1, 1), # (-2, 2), # (None, 3), # (-4, 4), # (-5, 5), # (-5, 5)]
cmpjoin
¶
cmpjoin(leftseq, rightseq, func=operator.eq, leftdefault=no_default, rightdefault=no_default)
joins two sequences, optionally according to the condition function func
, respectively. Outer join is also supported.
Warning
This function reads the first sequence leftseq
at once. Thus, make leftseq
the shorter one for best efficiency.
Tip
If both two sequences are sorted according to leftkey
and rightkey
, respectively, then optimized sortedtools.sortedjoin
with the same API should be used for better efficiency.
This function is more flexable than join
, and can be used in more scenarios.
list(cmpjoin( [ 1, 1, 2, 4, 5, -6], [0, 1, 1, 2, 3, 4, 5, 5] )) # [(1, 1), # (1, 1), # (1, 1), # (1, 1), # (2, 2), # (4, 4), # (5, 5), # (5, 5)] list(cmpjoin( [ 1, 1, 2, 4, 5, -6], [0, 1, 1, 2, 3, 4, 5, 5], lambda x, y: abs(x - y) == 1 )) # [(1, 0), # (1, 0), # (2, 1), # (2, 1), # (1, 2), # (1, 2), # (2, 3), # (4, 3), # (5, 4), # (4, 5), # (4, 5)]
templateseq
¶
templateseq(seqs, default=None, simple=True)
finds the common template of all the sequences seqs
. default=None
is used to denote any placeholder sub-sequence.
-
For better performance, option
simple
is enabled in default for only one scan of the sequences. However, it may work incorrectly for more complex template, which is:-
Any part of the template appears more than once in the same sequence,
-
Any part of the template does not always appear before or after than another according to the same order part among sequences.
-
Warning
This function reads all the sequences at once.
This function reads all the sequences more than once when simple = False
.
Unlike seqtools.commonsubseq(a, b)
, which finds the longest common sub-sequence, this function does not guarantee finding the longest template.
Tip
Essentially, each template is a closed sequential pattern covering all the sequences. Please check PrefixSpan-py for more advanced scenarios.
list(templateseq(( s.split() for s in [ "Alice likes tea and coffee !", "Bob likes sushi and ramen !", "Elisa or Anna likes icecream and cake and cookie !" ] ), default='*')) # ['*', 'likes', '*', 'and', '*', '!'] list(templateseq(( s.split() for s in [ "Alice likes tea and coffee !", "Bob likes sushi and ramen !", "Elisa or Anna likes icecream and cake and cookie !" ] ), default='*', simple=False)) # ['*', 'likes', '*', 'and', '*', '!'] # For more complex data. list(templateseq(( s.split() for s in [ "! Alice likes tea and coffee ! !", "Bob likes sushi and ramen ! !", "Elisa or Anna likes icecream and cake and cookie ! !" ] ), default='*')) # Incorrect template. # ['*', 'likes', '*', 'and', '*'] list(templateseq(( s.split() for s in [ "! Alice likes tea and coffee ! !", "Bob likes sushi and ramen ! !", "Elisa or Anna likes icecream and cake and cookie ! !" ] ), default='*', simple=False)) # ['*', 'likes', '*', 'and', '*', '!', '!']