Linking
Overview
The linking functions connect instances in a dataset to their respective URIs in a knowledge graph. They create one or more additional columns in the pandas DataFrame that hold the URIs. These URIs can later be used to generate additional features, as explained in Generators.
The input to a linking function is a column of a pandas DataFrame containing the entities for which additional information is desired. The following example seeks additional information about book authors.
import pandas as pd

df = pd.DataFrame({
    'author': ['Stephen King', 'Joanne K. Rowling', 'Dan Brown']
})
author |
---|
Stephen King |
Joanne K. Rowling |
Dan Brown |
The linking functions connect each entity, here the author, to its knowledge graph URI. The URIs appear in a newly created column in the DataFrame.
author | new_link |
---|---|
Stephen King | |
Joanne K. Rowling | |
Dan Brown | |
Pattern Linker
kgextension.linking.pattern_linker
The pattern linker takes the strings from the column that contains the entities to be linked and appends them to a base URL. For example, the base_url “www.examplegraph.com/resource” and the entity “New York City” are combined into the URI “www.examplegraph.com/resource/New_York_City”. The option url_encoding automatically applies URL encoding. The option DBpedia_link_format converts the strings to the format used in DBpedia links.
from kgextension.linking import pattern_linker

df_pattern_linked = pattern_linker(
    df, column='author', new_attribute_name="new_link",
    base_url="http://dbpedia.org/resource/", url_encoding=True,
    DBpedia_link_format=True
)
author | new_link |
---|---|
Stephen King | |
Joanne K. Rowling | |
Dan Brown | |
- Advantage: fast
- Disadvantage: requires correctly spelled data
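The string combination described above can be sketched with the standard library alone. This is a minimal illustration, not the implementation of pattern_linker: the helper name build_pattern_uri is hypothetical, and the assumption that the DBpedia link format simply replaces spaces with underscores is ours.

```python
from urllib.parse import quote


def build_pattern_uri(base_url, label, url_encoding=True, dbpedia_link_format=True):
    """Join a base URL and an entity label into a candidate URI (illustrative only)."""
    local_name = label.strip()
    if dbpedia_link_format:
        # Assumption: DBpedia-style names use underscores instead of spaces.
        local_name = local_name.replace(" ", "_")
    if url_encoding:
        # Percent-encode characters that are not safe inside a URL path segment.
        local_name = quote(local_name, safe="")
    # Avoid a doubled slash when the base URL already ends with one.
    return base_url.rstrip("/") + "/" + local_name


authors = ["Stephen King", "Joanne K. Rowling", "Dan Brown"]
uris = [build_pattern_uri("http://dbpedia.org/resource/", a) for a in authors]
print(uris)
```

Because the URI is built purely from the string, a misspelled entity yields a URI that may not exist in the knowledge graph, which is the disadvantage noted above.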
Label Linker
kgextension.linking.label_linker
The label linker generates URIs by running a SPARQL query with the entity from the column and a label predicate (rdfs:label by default). This query is sent to the selected endpoint.
Default
from kgextension.endpoints import DBpedia
from kgextension.linking import label_linker

df_label_linked = label_linker(
    df, column='author', new_attribute_name="new_link",
    endpoint=DBpedia, result_filter=None, language="en", max_hits=1,
    label_property="rdfs:label"
)
author | new_link_1 |
---|---|
Stephen King | |
Joanne K. Rowling | |
Dan Brown | |
Language
The language argument is used to specify the language the labels are in. When querying, the label linker will add the specified language to the label as a language tag (e.g., “ExampleLabel”@en).
For example, suppose we query DBpedia for the URI corresponding to the label “Sommer” (German for “summer”). With the language argument set to “de”, DBpedia returns the resource whose German label is “Sommer” (http://dbpedia.org/resource/Summer). With the language argument set to “en”, DBpedia returns the resource whose English label is “Sommer” (http://dbpedia.org/resource/Sommer).
Note
If the endpoint you query does not use language tags, the language argument has to be set to None.
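To make the role of the language tag concrete, the following sketch builds a label lookup query as a plain string. It is an assumption about what such a query could look like, not the query that label_linker actually sends; the helper build_label_query and the PREFIXES table are hypothetical.

```python
# Namespace URIs for the two label properties mentioned in this section.
PREFIXES = {
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "foaf": "http://xmlns.com/foaf/0.1/",
}


def build_label_query(label, label_property="rdfs:label", language="en", max_hits=1):
    """Return a SPARQL query string that looks up URIs by label (illustrative only)."""
    prefix = label_property.split(":", 1)[0]
    # Escape backslashes and double quotes so the label is a valid SPARQL literal.
    literal = '"' + label.replace("\\", "\\\\").replace('"', '\\"') + '"'
    if language is not None:
        # Append the language tag, e.g. "Sommer"@de.
        literal += "@" + language
    return (
        f"PREFIX {prefix}: <{PREFIXES[prefix]}>\n"
        f"SELECT DISTINCT ?uri WHERE {{ ?uri {label_property} {literal} . }}\n"
        f"LIMIT {max_hits}"
    )


print(build_label_query("Sommer", language="de"))
print(build_label_query("Sommer", language=None))
```

With language=None the literal carries no tag, which corresponds to the case of endpoints that do not use language tags.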
Number of Links
By increasing max_hits, several URI columns are created whenever at least one of the entities matches more than one resource.
df_label_linked = label_linker(
df, column='identity', max_hits=2
)
identity | new_link_1 | new_link_2 |
---|---|---|
President | | |
Aruba | | |
Apple | | |
Paris | | |
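How several hits per entity end up in numbered columns can be sketched as follows. This is a simplified model under our own assumptions (columns are only created up to the longest result list, and missing hits are padded with None); the helper spread_hits and the example.org URIs are placeholders, not library code or real query results.

```python
def spread_hits(hits_per_row, new_attribute_name="new_link", max_hits=2):
    """Turn a list of URI lists into one dict of numbered columns per row."""
    # Only create as many columns as the longest result list actually needs.
    n_columns = min(max_hits, max((len(h) for h in hits_per_row), default=0))
    rows = []
    for hits in hits_per_row:
        row = {}
        for i in range(n_columns):
            # Rows with fewer hits are padded with None (shown as NaN in pandas).
            row[f"{new_attribute_name}_{i + 1}"] = hits[i] if i < len(hits) else None
        rows.append(row)
    return rows


# Placeholder URIs, not real query results.
hits = [["http://example.org/A1", "http://example.org/A2"], ["http://example.org/B1"], []]
records = spread_hits(hits, max_hits=2)
print(records)
```

The resulting list of dicts could be passed to pandas.DataFrame to obtain the extra columns next to the original ones.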
Endpoint
By changing the endpoint, the resources are connected to another knowledge graph. Some endpoints are already predefined, such as DBpedia, WikiData and EUOpenData (EU Open Data Portal). For more information on endpoints, see Endpoints.
from kgextension.endpoints import WikiData
df_label_linked = label_linker(
df, column='identity', endpoint=WikiData
)
identity | new_link_1 |
---|---|
President | |
Apple | |
Benjamin Franklin | |
Label Property
Label properties other than rdfs:label can also be used to find URIs. Consider, for example, the property foaf:name for named entities.
df_label_linked = label_linker(
df, column='identity', label_property='foaf:name'
)
identity | new_link_1 |
---|---|
Titanic | |
Marie Curie | |
Florence Nightingale | |
DBpedia Lookup Linker
kgextension.linking.dbpedia_lookup_linker
This linker accesses the DBpedia Lookup web service that can be used to look up DBpedia URIs by related keywords. Related means that either the label of a resource matches, or an anchor text that was frequently used in Wikipedia to refer to a specific resource matches (for example the resource http://dbpedia.org/resource/United_States can be looked up by the string “USA”). The results are ranked by the number of inlinks pointing from other Wikipedia pages at a result page. See the DBpediaLookupAPI.
Default
df_lookup_linked = dbpedia_lookup_linker(
df, column="identity", new_attribute_name="new_link",
query_class="", max_hits=1, lookup_api="KeywordSearch"
)
identity | new_link |
---|---|
Germany | |
Italy | |
United States of America | |
Number of Links
Because the Lookup API also finds URIs of related concepts, many different URIs can be found per entity, as can be seen in the following example. While the first link has the strongest connection to the original string, each new link deviates more from the original meaning but is related to it.
df_lookup_linked = dbpedia_lookup_linker(
df, column="identity", max_hits=5
)
identity | new_link_1 | new_link_2 | new_link_3 | new_link_4 | new_link_5 |
---|---|---|---|---|---|
Germany | | | | | |
Italy | | | | | |
United States of America | | | | | |
Query Class
The results can be restricted to a class from the DBpedia Ontology, specified without its prefix (e.g., dbo:place as place).
df_lookup_linked = dbpedia_lookup_linker(
df, column="car", query_class='Automobile'
)
car | new_link |
---|---|
Audi A8 | |
Porsche Cayenne | |
Tesla Model S | |
Mercedes S Klasse | |
Search Mode: Prefix Search
In addition to the default keyword search, a prefix search is available, which can be used to implement autocomplete input boxes.
df_lookup_linked = dbpedia_lookup_linker(
df, column="president", lookup_api="PrefixSearch"
)
president | new_link |
---|---|
Bill C | |
George B | |
Barac | |
Donal | |
- Advantage: typo-insensitive
- Disadvantage: DBpedia-specific
DBpedia Spotlight Linker
kgextension.linking.dbpedia_spotlight_linker
This linker connects to the annotation tool DBpedia Spotlight. Using named entity recognition and related methods, it identifies DBpedia resources in a text and allows the results to be filtered by confidence, support, and similarity score.
Default
df_spotlight_linked = dbpedia_spotlight_linker(
    df, column="animal", new_attribute_name="new_link", max_hits=1,
    language="en", selection="first", confidence=0.3, support=5,
    min_similarity_score=0.5
)
animal | new_link |
---|---|
Anaconda | |
Bonobo | |
Jellyfish | |
Eagle | |
Number of Links and Selection Method
When more than one entity is identified in a cell, the selection method determines the order of the resulting URIs. Three methods are available: first (the default) orders the URIs by their occurrence in the text, support orders them by descending support, and similarityScore orders them by descending similarity score.
The following example shows how the ordering of the URI columns can change with the chosen selection method.
selection='first'
df_spotlight_linked = dbpedia_spotlight_linker(
df, 'sentence', max_hits=5, selection="first",
)
sentence | new_link_1 | new_link_2 | new_link_3 | new_link_4 | new_link_5 |
---|---|---|---|---|---|
The Anaconda hides behind a cactus to catch the mouse. | | | | | |
The Bonobo awaits the gorillas to visit the rain forest. | | | | | NaN |
selection='support'
df_spotlight_linked = dbpedia_spotlight_linker(
df, 'sentence', max_hits=5, selection="support",
)
sentence | new_link_1 | new_link_2 | new_link_3 | new_link_4 | new_link_5 |
---|---|---|---|---|---|
The Anaconda hides behind a cactus to catch the mouse. | | | | | |
The Bonobo awaits the gorillas to visit the rain forest. | | | | | NaN |
selection='similarityScore'
df_spotlight_linked = dbpedia_spotlight_linker(
df, 'sentence', max_hits=5, selection="similarityScore",
)
sentence | new_link_1 | new_link_2 | new_link_3 | new_link_4 | new_link_5 |
---|---|---|---|---|---|
The Anaconda hides behind a cactus to catch the mouse. | | | | | |
The Bonobo awaits the gorillas to visit the rain forest. | | | | | NaN |
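The effect of the three selection methods can be illustrated with a small sorting sketch. The helper order_candidates, the dictionary keys, and all numbers below are made up for illustration; they are neither real DBpedia Spotlight output nor the linker's actual code.

```python
def order_candidates(candidates, selection="first", max_hits=5):
    """Order annotation candidates by the chosen selection method (illustrative only)."""
    if selection == "first":
        # Order of occurrence in the text.
        ordered = sorted(candidates, key=lambda c: c["offset"])
    elif selection == "support":
        ordered = sorted(candidates, key=lambda c: c["support"], reverse=True)
    elif selection == "similarityScore":
        ordered = sorted(candidates, key=lambda c: c["similarity"], reverse=True)
    else:
        raise ValueError(f"unknown selection method: {selection}")
    return [c["uri"] for c in ordered[:max_hits]]


# Made-up values for illustration; they are not real DBpedia Spotlight output.
candidates = [
    {"uri": "ex:Anaconda", "offset": 4, "support": 120, "similarity": 0.91},
    {"uri": "ex:Cactus", "offset": 28, "support": 800, "similarity": 0.99},
    {"uri": "ex:Mouse", "offset": 48, "support": 450, "similarity": 0.75},
]
for method in ("first", "support", "similarityScore"):
    print(method, order_candidates(candidates, selection=method))
```

The same set of URIs is returned in each case; only the assignment to new_link_1, new_link_2, and so on changes.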
Filtering
Three thresholds are available to filter results: confidence, support, and minimum similarity score. Raising a threshold means that URIs have to meet a stricter standard to be selected. The following two examples show the same inputs with different filter settings.
Laissez-faire Filtering
df_spotlight_linked = dbpedia_spotlight_linker(
df, 'sentence', confidence=0, support=0, min_similarity_score=0, max_hits=4
)
sentence | new_link_1 | new_link_2 | new_link_3 | new_link_4 |
---|---|---|---|---|
The eucalyptus tree grows in Australia. | | | | |
Anna goes shopping in the mall of Western Chicago. | | | | |
In this case, the loose filtering rules allow nonsensical URIs, such as http://dbpedia.org/resource/Anna_Windass and http://dbpedia.org/resource/Population_growth, to appear.
Strict Filtering
df_spotlight_linked = dbpedia_spotlight_linker(
df, 'sentence', confidence=0.9, support=7, min_similarity_score=0.9, max_hits=4
)
sentence | new_link |
---|---|
The eucalyptus tree grows in Australia. | |
Anna goes shopping in the mall of Western Chicago. | |
In this case, the filter is very strict; while both http://dbpedia.org/resource/Australia and http://dbpedia.org/resource/Chicago are correct, other URIs that are correct as well such as http://dbpedia.org/resource/Eucalyptus are missing.
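The trade-off between loose and strict thresholds can be sketched as a simple filter. This is a hedged illustration: the helper filter_candidates is hypothetical, the values are invented, the inclusive comparison is an assumption, and the confidence threshold is left out because we assume it is evaluated by the annotation service itself.

```python
def filter_candidates(candidates, support=5, min_similarity_score=0.5):
    """Keep only candidates that reach both thresholds (illustrative only)."""
    return [
        c["uri"]
        for c in candidates
        # Assumption: a candidate passes when it meets or exceeds each threshold.
        if c["support"] >= support and c["similarity"] >= min_similarity_score
    ]


# Invented values for illustration; they are not real DBpedia Spotlight output.
candidates = [
    {"uri": "ex:Australia", "support": 9000, "similarity": 0.99},
    {"uri": "ex:Eucalyptus", "support": 300, "similarity": 0.85},
    {"uri": "ex:Population_growth", "support": 3, "similarity": 0.20},
]
loose = filter_candidates(candidates, support=0, min_similarity_score=0)
strict = filter_candidates(candidates, support=7, min_similarity_score=0.9)
print(loose)
print(strict)
```

In this toy data the loose setting keeps the unrelated candidate, while the strict setting also drops a correct one, mirroring the two examples above.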
- Advantage: works on large textual data, typo-insensitive
- Disadvantage: DBpedia-specific, relatively slow
sameAs Linker
kgextension.linking.sameas_linker
The sameAs linker takes URIs from a DataFrame column and queries a given SPARQL endpoint for resources that are connected to these URIs via the owl:sameAs predicate, i.e., URIs with the same meaning as the original one. The resources found are added to the DataFrame as new columns, and the DataFrame is returned. Unlike the other four linkers, this linker needs at least one URI column to be present already, so it can be used on top of any of them.
The example shows the input dataframe to the sameas_linker: a dataframe containing entities and their already linked URIs, in this example case from WikiData.
word | uri |
---|---|
they | |
they | |
she | |
she | |
he | |
df_same_as_linked = sameas_linker(
df, column='uri', new_attribute_name="new_link", endpoint=WikiData,
result_filter=None, uri_data_model=False, bundled_mode=True
)
word | uri | new_link_1 |
---|---|---|
they | | |
they | | |
she | | |
she | | |
he | | |
Result Filtering
Since the sameAs linker can in some cases generate a large number of sameAs links, filtering can be applied. The result_filter argument takes a list of regular expressions that specify which URI patterns may be returned.
df_same_as_linked = sameas_linker(
df, "uri", endpoint=DBpedia, result_filter=["yago", "freebase", "wiki"]
)
word | uri | new_link_1 | new_link_2 | new_link_3 |
---|---|---|---|---|
University of Mannheim | | | | |
University of Bremen | | | | |
The newly generated links from the sameAs-Linker follow the pattern defined by the filter. Other links that may potentially be found are not included in the extra URI columns but excluded because of the filter.
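A regex-based result filter of this kind can be sketched with the standard re module. The helper filter_uris is hypothetical, the URIs are placeholders, and matching anywhere in the URI (search semantics) is an assumption rather than documented behavior of sameas_linker.

```python
import re


def filter_uris(uris, result_filter=None):
    """Keep URIs that match at least one of the given regexes (illustrative only)."""
    if result_filter is None:
        # No filter: everything is returned.
        return list(uris)
    patterns = [re.compile(p) for p in result_filter]
    # Assumption: a pattern may match anywhere in the URI.
    return [u for u in uris if any(p.search(u) for p in patterns)]


# Placeholder URIs, not real sameAs results.
uris = [
    "http://example.org/yago/Thing",
    "http://example.org/wikidata/Thing",
    "http://example.org/other/Thing",
]
print(filter_uris(uris, result_filter=["yago", "freebase", "wiki"]))
```

A pattern that matches nothing, such as "freebase" here, has no effect on the result.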
URI Data Model
If the URI data model is chosen, each URI is queried directly instead of going through a SPARQL endpoint. While this option is slower, it is less dependent on the endpoint itself.
Bundled Mode
In bundled mode, which is the default, all URIs to be queried are bundled into a single query using the SPARQL VALUES clause. Because this requires a SPARQL 1.1 implementation, bundled mode can be turned off, at the cost of slower performance.
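The difference between the bundled and the unbundled case can be sketched by building the query strings by hand. This is an assumption about the general shape of such a query, not the one sameas_linker generates; the helper build_sameas_query and the variable names ?value and ?sameas are hypothetical.

```python
def build_sameas_query(uris):
    """Build one owl:sameAs query for all given URIs using VALUES (illustrative only)."""
    # Each URI becomes one row of the inline VALUES table.
    values = " ".join(f"(<{uri}>)" for uri in uris)
    return (
        "PREFIX owl: <http://www.w3.org/2002/07/owl#>\n"
        "SELECT ?value ?sameas WHERE {\n"
        f"  VALUES (?value) {{ {values} }}\n"
        "  ?value owl:sameAs ?sameas .\n"
        "}"
    )


uris = [
    "http://dbpedia.org/resource/University_of_Mannheim",
    "http://dbpedia.org/resource/University_of_Bremen",
]
# Bundled: a single query covers every URI in the column.
bundled = build_sameas_query(uris)
# Unbundled: one query per URI, hence one endpoint request per row.
unbundled = [build_sameas_query([uri]) for uri in uris]
print(bundled)
print(len(unbundled))
```

The unbundled variant sends as many requests as there are URIs, which is why it is expected to be slower.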