Schema
Warning
For all functions except hasheader, remove the header row first for best results.
Column Types
Tools for processing column types.
inferschema
inferschema(data) infers the schema of the table data, as a tuple of column types. Currently available types are listed as follows.
Info
Utilizes the RegexOrder library to infer the type of each column. RegexOrder is part of a research project. Thus, when using this function for research purposes, please cite both RegexOrder and extratools accordingly.
inferschema([
    ['Los Angeles'  , '34°03′'   , '118°15′'  ],
    ['New York City', '40°42′46″', '74°00′21″'],
    ['Paris'        , '48°51′24″', '2°21′03″' ]
])
# ('title_words', 'formated_pos_ints', 'formated_pos_ints')
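To make the idea of a per-column type concrete, here is a minimal sketch of regex-based schema inference. It does not use RegexOrder and is not how inferschema is implemented; the pattern list `PATTERNS`, its type names, and the functions `infer_column_type` and `infer_schema_sketch` are hypothetical stand-ins chosen for illustration.

```python
import re

# Hypothetical (name, pattern) pairs, ordered from most to least specific.
# These are NOT the types reported by inferschema.
PATTERNS = [
    ("int", re.compile(r"[+-]?\d+")),
    ("float", re.compile(r"[+-]?\d+\.\d+")),
    ("word", re.compile(r"[A-Za-z]+")),
    ("text", re.compile(r".*", re.DOTALL)),
]


def infer_column_type(values):
    """Return the first type name whose pattern fully matches every value."""
    for name, pattern in PATTERNS:
        if all(pattern.fullmatch(v) for v in values):
            return name
    return "text"


def infer_schema_sketch(data):
    """Infer one type name per column of a header-less table (list of rows)."""
    columns = zip(*data)  # transpose rows into columns
    return tuple(infer_column_type(col) for col in columns)


table = [
    ["Alice", "30", "1.75"],
    ["Bob", "25", "1.80"],
]
print(infer_schema_sketch(table))  # ('word', 'int', 'float')
```

The ordering of the pattern list matters in this sketch: the first pattern that matches every value in a column wins, so more specific patterns come first.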
hasheader
hasheader(data) returns the confidence (between 0 and 1) that the first row of the table data is a header.
Info
It works by comparing, for each column, the type inferred with vs. without the first row, using the RegexOrder library. RegexOrder is part of a research project. Thus, when using this function for research purposes, please cite both RegexOrder and extratools accordingly.
t = [
    ['Los Angeles'  , '34°03′'   , '118°15′'  ],
    ['New York City', '40°42′46″', '74°00′21″'],
    ['Paris'        , '48°51′24″', '2°21′03″' ]
]
hasheader(t)
# 0.0
hasheader([
    ['City', 'Latitude', 'Longitude']
] + t)
# 0.6666666666666666
hasheader([
    ['C1', 'C2', 'C3']
] + t)
# 1.0
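The with-vs-without comparison can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the coarse type function `kind` replaces RegexOrder, `header_confidence_sketch` is a hypothetical name, and scoring by the fraction of columns whose type changes is an inference from the outputs shown above (0.0, 2/3, 1.0), not a confirmed detail.

```python
def kind(values):
    """Very coarse, hypothetical column type: 'digits', 'alpha', or 'mixed'."""
    if all(v.isdigit() for v in values):
        return "digits"
    if all(v.isalpha() for v in values):
        return "alpha"
    return "mixed"


def header_confidence_sketch(data):
    """Fraction of columns whose coarse type changes when row 0 is dropped."""
    if len(data) < 2:
        return 0.0  # nothing to compare the first row against
    columns = list(zip(*data))  # transpose rows into columns
    if not columns:
        return 0.0
    changed = sum(1 for col in columns if kind(col) != kind(col[1:]))
    return changed / len(columns)


rows = [["Tokyo", "37", "2194"], ["Delhi", "32", "1484"]]
print(header_confidence_sketch(rows))  # 0.0
# Here two of the three columns change type once the first row is removed.
print(header_confidence_sketch([["city", "pop", "area"]] + rows))
```

A header cell such as "city" sitting above other alphabetic values does not change the column's type, which is why a header row can score below 1.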
Primary/Foreign-Key of Table
Tools for processing the primary and foreign keys of a table.
candidatekeys
candidatekeys(data, maxcols) finds the candidate keys of a table data.
- By default, the maximum number of columns maxcols in each candidate key is limited to 1 for efficiency. Specify a larger number for more accurate results.
- Each candidate key is a tuple of column IDs.
Note
A proper primary key is further selected from the candidate keys.
t1 = [
    ["a1", "b1", "c1", "d1"],
    ["a2", "b1", "c2", "d1"],
    ["a3", "b1", "c1", "d1"],
]
list(candidatekeys(t1))
# [(0,)]
list(candidatekeys(t1, maxcols=4))
# [(0,)]

t2 = [
    ["a1", "b1", "c1", "d1"],
    ["a1", "b1", "c2", "d1"],
    ["a2", "b1", "c1", "d1"],
]
list(candidatekeys(t2))
# []
list(candidatekeys(t2, maxcols=4))
# [(0, 2)]
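One way to read these results is as a brute-force search for minimal unique column combinations. The sketch below works under that assumption; `candidate_keys_sketch` is a hypothetical name, and the actual search strategy of candidatekeys may differ.

```python
from itertools import combinations


def candidate_keys_sketch(data, maxcols=1):
    """Yield minimal column-ID tuples whose values are unique across all rows."""
    if not data:
        return
    ncols = len(data[0])
    found = []  # keys found so far, used to skip non-minimal supersets
    for size in range(1, min(maxcols, ncols) + 1):
        for cols in combinations(range(ncols), size):
            if any(set(key) <= set(cols) for key in found):
                continue  # a subset is already a key, so cols is not minimal
            projected = [tuple(row[c] for c in cols) for row in data]
            if len(set(projected)) == len(projected):
                found.append(cols)
                yield cols


t2 = [
    ["a1", "b1", "c1", "d1"],
    ["a1", "b1", "c2", "d1"],
    ["a2", "b1", "c1", "d1"],
]
print(list(candidate_keys_sketch(t2)))             # []
print(list(candidate_keys_sketch(t2, maxcols=4)))  # [(0, 2)]
```

The number of combinations grows quickly with maxcols, which is consistent with the default of 1 being chosen for efficiency.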
foreignkeys
foreignkeys(primarydata, primarykey, foreigndata) finds the foreign keys of the foreign table foreigndata, according to the primary key primarykey of the primary table primarydata.
- Each foreign key is a tuple of column IDs.
pt = [
    ["a1", "b1", "c1", "d1"],
    ["a1", "b1", "c2", "d1"],
    ["a2", "b1", "c1", "d1"],
]
# Primary key of table pt
pk = list(candidatekeys(pt, maxcols=4))[0]
# (0, 2)

ft = [
    ["c1", "b1", "a2", "d1"],
    ["c2", "b1", "a1", "d1"],
]
# Foreign keys of table ft
list(foreignkeys(pt, pk, ft))
# [(2, 0)]
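The result (2, 0) is consistent with a containment check: columns 2 and 0 of ft, taken in that order, only hold value pairs that also occur in the primary key columns (0, 2) of pt. The sketch below implements that check by brute force. It is an assumption about the behavior, not the library's implementation, and `foreign_keys_sketch` is a hypothetical name.

```python
from itertools import permutations


def foreign_keys_sketch(primarydata, primarykey, foreigndata):
    """Yield column-ID tuples of foreigndata whose values all occur in the primary key."""
    # All value tuples that the primary key takes in the primary table.
    key_values = {tuple(row[c] for c in primarykey) for row in primarydata}
    if not foreigndata:
        return
    ncols = len(foreigndata[0])
    # Column order matters, so try ordered selections of columns.
    for cols in permutations(range(ncols), len(primarykey)):
        if all(tuple(row[c] for c in cols) in key_values for row in foreigndata):
            yield cols


pt = [
    ["a1", "b1", "c1", "d1"],
    ["a1", "b1", "c2", "d1"],
    ["a2", "b1", "c1", "d1"],
]
ft = [
    ["c1", "b1", "a2", "d1"],
    ["c2", "b1", "a1", "d1"],
]
print(list(foreign_keys_sketch(pt, (0, 2), ft)))  # [(2, 0)]
```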