Dataset releases Archives - DBpedia Association

Recap 2023: A Year with DBpedia

Mon, 04 Dec 2023 11:50:24 +0000

Can you believe it? Sixteen years ago, the first DBpedia dataset was released. Sixteen years of development, improvements and growth. Now more than 4,100 GByte of data is hosted on the DBpedia Databus. We want to take this as an opportunity to send out a big “Thank you!” to all contributors, developers, members, hosters, funders, believers and DBpedia enthusiasts who made that possible. Thank you for your support!

In the upcoming blog series, we will take you on a retrospective tour through 2023. Furthermore, we will give you insights into a year with DBpedia. In the following we will also highlight our past events.

Snapshot Release

We are pleased to announce the immediate availability of a new edition of the free and publicly accessible SPARQL Query Service Endpoint and Linked Data Pages, for interacting with the new Snapshot Dataset. In 2023, we released version 2022-12, which includes all the features added since version 2022-09. The current Snapshot Release contains more than 850 million facts (triples). Please check more details on our website.

Google Summer of Code (GSoC)

For the 12th year in a row, we have been able to support and guide young, ambitious developers who have joined us as an open source organization. We encouraged them to work on a programming project this summer. Each year we have been inspired by new project ideas, many amazing contributors, and mostly great project results that have shaped the future of DBpedia. If you want deeper insights into our GSoC contributors’ work, you can find their blogs and repos on the DBpedia blog.

DBpedia @ Leipzig Semantic Web Day

On June 28, 2023, Sebastian Hellmann presented the DBpedia Databus 2.1. at Data Week Leipzig. Data Week is the networking and exchange event for highlighting scientific, economic, and social perspectives of data and its use, where industry, citizens, science, and public authorities can enter into dialogue. Data Week Leipzig took place June 26-30, 2023. Please find Sebastian’s slides here.

In the upcoming blog post after the holidays, we will give you more insights into the past events and technical achievements. We are now looking forward to the year 2024. The DBpedia team plans to hold a tutorial at the LREC-COLING 2024 conference and the DBpedia Day at the SEMANTiCS 2024 conference in Amsterdam, Netherlands.

Above all, we wish you a merry Christmas and a happy new year. In the meantime, stay tuned and check our Twitter, Instagram or LinkedIn channels. You can subscribe to our Newsletter for the latest news and information around DBpedia.

Julia & Maria,

on behalf of the DBpedia Association

The post Recap 2023: A Year with DBpedia appeared first on DBpedia Association.

DBpedia Snapshot 2022-12 Release

Mon, 27 Mar 2023 09:36:32 +0000

News since DBpedia Snapshot 2022-09

New Abstract Extractor due to GSOC 2022 (credits to Celian Ringwald)

Work in progress: Smoothing the community issue reporting and fixing at Github

What is the “DBpedia Snapshot” Release?

Historically, this release has been associated with many names: “DBpedia Core”, “EN DBpedia”, and — most confusingly — just “DBpedia”. In fact, it is a combination of —

EN Wikipedia data — A small, but very useful, subset (~ 1 Billion triples or 14%) of the whole DBpedia extraction using the DBpedia Information Extraction Framework (DIEF), comprising structured information extracted from the English Wikipedia plus some enrichments from other Wikipedia language editions, notably multilingual abstracts in ar, ca, cs, de, el, eo, es, eu, fr, ga, id, it, ja, ko, nl, pl, pt, sv, uk, ru, zh.
Links — 62 million community-contributed cross-references and owl:sameAs links to other linked data sets on the Linked Open Data (LOD) Cloud, which make it possible to effectively find and retrieve further information from the largest decentralized, change-sensitive knowledge graph on earth, which has formed around DBpedia since 2007.
Community extensions — Community-contributed extensions such as additional ontologies and taxonomies.

Release Frequency & Schedule

Going forward, releases will be scheduled for the 1st of February, May, August, and November (with +/- 5 days tolerance), and are named using the same date convention as the Wikipedia Dumps that served as the basis for the release. An example of the release timeline is shown below:

December 6–8: Wikipedia dumps for June 1 become available on https://dumps.wikimedia.org/
December 8–20: Download and extraction with DIEF
Dec 20–Jan 1: Post-processing and quality-control period
Jan 1–Feb 15: Linked Data and SPARQL endpoint deployment

Data Freshness

Given the timeline above, the EN Wikipedia data of DBpedia Snapshot has a lag of 1-4 months. We recommend the following strategies to mitigate this:

DBpedia Snapshot as a kernel for Linked Data: Following the Linked Data paradigm, we recommend using the Linked Data links to other knowledge graphs to retrieve high-quality and recent information. DBpedia’s network consists of the best knowledge engineers in the world, working together, using linked data principles to build a high-quality, open, decentralized knowledge graph network around DBpedia. Freshness and change-sensitivity are two of the greatest data-related challenges of our time, and can only be overcome by linking data across data sources. The “Big Data” approach of copying data into a central warehouse is inevitably challenged by issues such as co-evolution and scalability.
DBpedia Live: Wikipedia is unmistakably the richest, most recent body of human knowledge and source of news in the world. DBpedia Live is just minutes behind edits on Wikipedia, which means that as soon as any of the 120k Wikipedia editors presses the “save” button, DBpedia Live will extract fresh data and update. DBpedia Live consists of the DBpedia Live Sync API (for syncing into any kind of on-site databases), Linked Data and SPARQL endpoint.
Latest-Core is a dynamically updating Databus Collection. Our automated extraction robot “MARVIN” publishes monthly dev versions of the full extraction, which are then refined and enriched to become Snapshot.

Data Quality & Richness

We would like to acknowledge the excellent work of Wikipedia editors (~46k active editors for EN Wikipedia), who are ultimately responsible for collecting information in Wikipedia’s infoboxes, which are refined by DBpedia’s extraction into our knowledge graphs. Wikipedia’s infoboxes are steadily growing each month and according to our measurements grow by 150% every three years. EN Wikipedia’s infoboxes even doubled in this time frame. This richness of knowledge drives the DBpedia Snapshot knowledge graph and is further potentiated by synergies with linked data cross-references. Statistics are given below.

Data Access & Interaction Options

Linked Data

Linked Data is a principled approach to publishing RDF data on the Web that enables interlinking data between different data sources, courtesy of the built-in power of Hyperlinks as unique Entity Identifiers.

HTML pages comprising Hyperlinks that conform to Linked Data Principles are one of the methods of interacting with data provided by the DBpedia Snapshot, be it manually via the web browser or programmatically using REST interaction patterns via the https://dbpedia.org/resource/{entity-label} pattern. Naturally, we encourage Linked Data interactions, while also expecting user-agents to honor the cache-control HTTP response header for massive crawl operations. Instructions for accessing Linked Data, available in 10 formats.
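A minimal sketch of the programmatic access pattern described above, in Python with only the standard library. The helper names `resource_uri` and `build_request` are hypothetical. The sketch assumes that the entity label follows the Wikipedia page-title convention (spaces written as underscores) and that the server selects a format based on the HTTP Accept header; `text/turtle` is used here only as an example media type. The request is constructed but not sent.

```python
from urllib.parse import quote
from urllib.request import Request

BASE = "https://dbpedia.org/resource/"  # pattern given in the post


def resource_uri(label: str) -> str:
    """Build a resource URI from a human-readable label (assumed convention)."""
    # Assumption: spaces become underscores, as in Wikipedia page titles.
    title = label.strip().replace(" ", "_")
    # Percent-encode characters that are unsafe in a URL path segment.
    return BASE + quote(title, safe="_(),'")


def build_request(label: str, media_type: str = "text/turtle") -> Request:
    """Prepare (but do not send) a GET request asking for an RDF serialization."""
    return Request(resource_uri(label), headers={"Accept": media_type})


req = build_request("Leipzig University")
print(req.full_url)
print(req.get_header("Accept"))
```

To fetch the data, the prepared request could be passed to `urllib.request.urlopen`; for larger crawls, the cache-control header of each response should be honored as requested above.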

SPARQL Endpoint

This service enables some astonishing queries against Knowledge Graphs derived from Wikipedia content. The Query Services Endpoint that makes this possible is identified by http://dbpedia.org/sparql, and it currently handles 7.2 million queries daily on average. See powerful queries and instructions (incl. rates and limitations).

An effective Usage Pattern is to filter a relevant subset of entity descriptions for your use case via SPARQL and then combine it with the power of Linked Data by looking up (or de-referencing) data via owl:sameAs property links en route to retrieving specific and recent data from other Knowledge Graphs across the massive Linked Open Data Cloud.
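The usage pattern above can be sketched roughly as follows. This is an illustration under assumptions, not an official client: the function names are hypothetical, the query is sent with the standard `query` parameter of the SPARQL 1.1 Protocol, and the response is assumed to be in the SPARQL JSON results format. The sample response is hand-written to show the parsing step and is not real endpoint output.

```python
import json
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"  # endpoint named in the post


def same_as_query(resource: str, limit: int = 50) -> str:
    """SPARQL query listing the owl:sameAs targets of one resource."""
    return (
        "PREFIX owl: <http://www.w3.org/2002/07/owl#>\n"
        f"SELECT ?same WHERE {{ <{resource}> owl:sameAs ?same }} LIMIT {limit}"
    )


def query_url(query: str) -> str:
    """Encode the query as a GET URL using the protocol's 'query' parameter."""
    return ENDPOINT + "?" + urlencode({"query": query})


def extract_uris(results_json: str, var: str = "same") -> list:
    """Collect the URI values bound to one variable in a SPARQL JSON result."""
    bindings = json.loads(results_json)["results"]["bindings"]
    return [b[var]["value"] for b in bindings if var in b and b[var]["type"] == "uri"]


url = query_url(same_as_query("http://dbpedia.org/resource/Leipzig"))

# Hand-written sample response (not real data) to demonstrate the parsing step.
sample = json.dumps({
    "head": {"vars": ["same"]},
    "results": {"bindings": [
        {"same": {"type": "uri", "value": "http://example.org/entity/1"}},
        {"same": {"type": "uri", "value": "http://example.org/entity/2"}},
    ]},
})
print(url)
print(extract_uris(sample))
```

Each URI returned this way can then be de-referenced as Linked Data to retrieve the more specific or more recent description held by the other knowledge graph.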

Additionally, DBpedia Snapshot dumps and additional data from the complete collection of datasets derived from Wikipedia are provided by the DBpedia Databus for use in your own SPARQL-accessible Knowledge Graphs.

DBpedia Ontology

This Snapshot Release was built with DBpedia Ontology (DBO) version: https://databus.dbpedia.org/ontologies/dbpedia.org/ontology--DEV/2021.11.08-124002. We thank all DBpedians for their contributions to the ontology and the mappings. See documentation and visualizations, class tree and properties, wiki.

DBpedia Snapshot Statistics

Overview. Overall the current Snapshot Release contains more than 850 million facts (triples).

The DBpedia ontology is the heart of DBpedia. Our community is continuously contributing to the DBpedia ontology schema and the DBpedia infobox-to-ontology mappings by actively using the DBpedia Mappings Wiki.

The current Snapshot Release utilizes a total of 55 thousand properties, of which 1377 are defined by the DBpedia ontology.

Classes. Knowledge in Wikipedia is constantly growing at a rapid pace. We use the DBpedia Ontology Classes to measure the growth: Total number in this release (in brackets we give: a) growth to the previous release, which can be negative temporarily and b) growth compared to Snapshot 2016-10):

Persons: 1792308 (1.01%, 1.13%)
Places: 748372 (1.00%, 1820.86%), including but not limited to 590481 (1.00%, 5518.51%) populated places
Works: 610589 (1.00%, 619.89%), including, but not limited to
- 157566 (1.00%, 1.38%) music albums
- 144415 (1.01%, 15.94%) films
- 24829 (1.01%, 12.53%) video games
Organizations: 345523 (1.01%, 109.31%), including but not limited to
- 87621 (1.01%, 2.25%) companies
- 64507 (1.00%, 64507.00%) educational institutions
Species: 1933436 (1.01%, 322239.33%)
Plants: 7718 (0.82%, 1.71%)
Diseases: 10591 (1.00%, 8.54%)

Detailed Growth of Classes: The image below shows the detailed growth for one class. Click on the links for other classes: Place, PopulatedPlace, Work, Album, Film, VideoGame, Organisation, Company, EducationalInstitution, Species, Plant, Disease. For further classes, adapt the query by replacing the <http://dbpedia.org/ontology/CLASS> URI. Note that 2018 was a development phase with some failed extractions. The stats were generated with the Databus VOID Mod.

Links. Linked Data cross-references between decentral datasets are the foundation and access point to the Linked Data Web. The latest Snapshot Release provides over 130.6 million links from 7.62 million entities to 179 external sources.

Top 11

33,975,305 http://www.wikidata.org

7,206,254 https://global.dbpedia.org

4,308,772 http://yago-knowledge.org

3,855,108 http://de.dbpedia.org

3,731,002 http://fr.dbpedia.org

2,991,921 http://viaf.org

2,929,808 http://it.dbpedia.org

2,925,530 http://es.dbpedia.org

2,788,703 http://fa.dbpedia.org

2,587,004 http://ru.dbpedia.org

2,580,398 http://sr.dbpedia.org

Top 10 without DBpedia namespaces

33,975,305 http://www.wikidata.org

4,308,772 http://yago-knowledge.org

2,991,921 http://viaf.org

1,708,533 http://d-nb.info

612,227 http://sws.geonames.org

596,134 http://umbel.org

537,602 http://data.bibliotheken.nl

430,839 http://www.w3.org

422,989 http://musicbrainz.org

104,433 http://linkedgeodata.org

DBpedia Extraction Dumps on the Databus

All extracted files are reachable via the DBpedia account on the Databus. The Databus has two main structures:

In groups, artifacts, version, similar to software in Maven: https://databus.dbpedia.org/$user/$group/$artifact/$version/$filename
In Databus collections that are SPARQL queries over https://databus.dbpedia.org/yasui filtering and selecting files from these artifacts.
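To make the Maven-like layout concrete, here is a small sketch that splits a Databus file URL into the five components named above. The class `DatabusFile` and its methods are hypothetical helpers, and the example URL is made up for illustration; it does not refer to a real file.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class DatabusFile:
    user: str
    group: str
    artifact: str
    version: str
    filename: str

    @classmethod
    def from_url(cls, url: str) -> "DatabusFile":
        """Split a URL of the form /$user/$group/$artifact/$version/$filename."""
        parts = urlsplit(url).path.strip("/").split("/")
        if len(parts) != 5:
            raise ValueError(f"expected 5 path segments, got {len(parts)}: {url}")
        return cls(*parts)

    def version_url(self) -> str:
        """URL of the version that contains this file."""
        return "https://databus.dbpedia.org/" + "/".join(
            [self.user, self.group, self.artifact, self.version]
        )


# Hypothetical file URL, for illustration only.
f = DatabusFile.from_url(
    "https://databus.dbpedia.org/dbpedia/mappings/example-artifact/2022.12.01/example-file.ttl.bz2"
)
print(f.group, f.artifact, f.version)
print(f.version_url())
```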

Snapshot Download. For downloading DBpedia Snapshot, we prepared this collection, which also includes detailed release notes:

https://databus.dbpedia.org/dbpedia/collections/dbpedia-snapshot-2022-03

The collection is roughly equivalent to http://downloads.dbpedia.org/2016-10/core/

Collections can be downloaded in many different ways. Some download modalities, such as bash script, SPARQL, and plain URL list, are found in the tabs at the collection. Files are provided as bzip2-compressed N-Triples files. In case you need a different format or compression, you can also use the “Download-As” function of the Databus Client (GitHub), e.g. -s $collection -c gzip would download the collection and convert it to GZIP during download.
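Once a file has been downloaded, it can be read without unpacking it first. The following is a simplified sketch using Python's standard `bz2` module: `iter_triples` and `open_dump` are hypothetical helper names, and the parser is deliberately naive (it splits on whitespace and does not validate or unescape terms), so a proper RDF library is the better choice for production use. The demo runs on a small in-memory sample rather than a real dump.

```python
import bz2
import io
from typing import Iterator, TextIO, Tuple


def iter_triples(stream: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (subject, predicate, object) strings from N-Triples text.

    Simplified: relies on subject and predicate containing no whitespace
    and does not validate or unescape terms.
    """
    for line in stream:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        subject, predicate, rest = line.split(None, 2)
        obj = rest.rstrip()
        if obj.endswith("."):
            obj = obj[:-1].rstrip()  # drop the terminating " ."
        yield subject, predicate, obj


def open_dump(path: str) -> TextIO:
    """Open a bzip2-compressed dump as UTF-8 text, decompressing on the fly."""
    return bz2.open(path, mode="rt", encoding="utf-8")


# Self-contained demo on an in-memory sample instead of a real dump file.
sample = (
    "<http://example.org/s> <http://example.org/p> <http://example.org/o> .\n"
    '<http://example.org/s> <http://example.org/label> "Hello world"@en .\n'
)
compressed = bz2.compress(sample.encode("utf-8"))
with bz2.open(io.BytesIO(compressed), mode="rt", encoding="utf-8") as fh:
    triples = list(iter_triples(fh))
print(len(triples))
```

Because the file is streamed line by line, memory use stays flat even for large dumps.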

Replicating DBpedia Snapshot on your server can be done via Docker, see https://hub.docker.com/r/dbpedia/virtuoso-sparql-endpoint-quickstart

git clone https://github.com/dbpedia/virtuoso-sparql-endpoint-quickstart.git

cd virtuoso-sparql-endpoint-quickstart

COLLECTION_URI=https://databus.dbpedia.org/dbpedia/collections/dbpedia-snapshot-2022-09 VIRTUOSO_ADMIN_PASSWD=password docker-compose up

Download files from the whole DBpedia extraction. The whole extraction consists of approx. 20 Billion triples and 5000 files created from 140 languages of Wikipedia, Commons and Wikidata. They can be found in https://databus.dbpedia.org/dbpedia/(generic|mappings|text|wikidata)

You can copy-edit a collection and create your own customized collections via “Actions” -> “Copy Edit”. For example, you can Copy Edit the snapshot collection above, remove files that you do not need, and add files from other languages. Please see the Rhizomer use case: Best way to download specific parts of DBpedia. Of course, this only refers to the archived dumps on the Databus for users who want to bulk download and deploy into their own infrastructure. Linked Data and SPARQL allow for filtering the content using a small data pattern.

Acknowledgments

First and foremost, we would like to thank our open community of knowledge engineers for finding & fixing bugs and for supporting us by writing data tests. We would also like to acknowledge the DBpedia Association members for constantly innovating the areas of knowledge graphs and linked data and pushing the DBpedia initiative with their know-how and advice. OpenLink Software supports DBpedia by hosting SPARQL and Linked Data; University Mannheim, the German National Library of Science and Technology (TIB) and the Computer Center of University Leipzig provide persistent backups and servers for extracting data. We thank Marvin Hofer and Mykola Medynskyi for technical preparation. This work was partially supported by grants from the Federal Ministry for Economics and Climate Action (BMWK) for the LOD-GEOSS Project (03EI1005E), PenFLaaS (100594042) as well as for the PLASS Project (01MD19003D).

The post DBpedia Snapshot 2022-12 Release appeared first on DBpedia Association.

A year with DBpedia – Retrospective Part 2/2022

Thu, 12 Jan 2023 09:59:46 +0000

This is the final part of our journey through 2022. In the previous blog post we already presented DBpedia highlights, events and tutorials. Now we want to take a look at the second half of 2022 and give an outlook for 2023.

DBpedia Day @ Semantics Conference in Vienna

Like last year, the DBpedia Day was part of the SEMANTiCS conference 2022 and was held on the 13th of September at the ARCOTEL Wimberger Wien. Up to 100 DBpedians joined the DBpedia Day. This year, too, our CEO Sebastian Hellmann opened DBpedia Day by presenting the Linkmaster 3000 project. Afterwards, Olaf Hartig from Linköping University gave his fantastic keynote presentation “Towards Querying Heterogeneous Federations of Interlinked Knowledge Graphs”. Furthermore, we organized a member presentation session, DBpedia Science: Linking and Consumption and a community session. In case you missed the event, all slides are available on our event page, and you can read our blog post.

Linkmaster 3000

If it’s not linked — does it even exist? On September 13, 2022, Sebastian Hellmann introduced the Linkmaster 3000 at the SEMANTiCS Conference in Vienna, Austria. It’s an online tool in development to manage Linked Data spaces and help make them better integrated into the global database formed by Linked Data principles. Two milestones are pending: a large-scale evaluation and the creation of documentation for the launch.

New DBpedia Member

Since 2007, our DBpedia family has been growing steadily – and this year is no exception. We are once again pleased to welcome organizations from science and industry to the DBpedia family. Beginning in February, we were able to welcome Anhalt University of Applied Sciences into our midst. This was followed in June by the Leipzig University of Applied Sciences (HTWK Leipzig) and the Technical University of Madrid. And last but not least, DALICC joined the DBpedia family in November, bringing it to 34 members.

Right now we are excited to see who will become a new member this year and we are already looking forward to it.

DBpedia Tutorial @ the KGSWC 2022

On November 17, 2022, we organized a free DBpedia Knowledge Graph tutorial at the Knowledge Graph and Semantic Web Conference (KGSWC) 2022. As part of the conference, this year’s International Winter School was held under the theme “KnowledgeGraphs: Third Wave of AI” in Madrid, Spain. Around 33 participants joined the tutorial, which was organized by Milan Dojchinovski and Jan Forberg. If you want to know more about the tutorial, you can find the slides here: https://tinyurl.com/WinterSchool22.

DBpedia Snapshot 2022-09 Release

On October 28, 2022, we announced the immediate availability of a new edition of the free and publicly accessible SPARQL Query Service Endpoint and Linked Data Pages, for interacting with the new Snapshot Dataset. Since the last release we made a few changes. Thanks to GSoC 2022 (credits to Celian Ringwald), it includes the New Abstract Extractor. In addition, work is still in progress on smoothing community issue reporting and fixing on GitHub. The full release description, including further statistics, can be found at https://www.dbpedia.org/blog/dbpedia-snapshot-2022-09-release/.

We do hope we will meet you and some new faces during our events next year. The DBpedia Association wants to get to know you because DBpedia is a community effort and would not continue to develop, improve and grow without you. We plan to have meetings or tutorials at the Data Week in Leipzig, the LDK conference, and SEMANTiCS’23 conference. We wish you a happy New Year!

Stay safe and check Twitter, Instagram and LinkedIn or subscribe to our Newsletter for the latest news and information.

Yours,

Emma & Julia

on behalf of the DBpedia Association

The post A year with DBpedia – Retrospective Part 2/2022 appeared first on DBpedia Association.

DBpedia Snapshot 2022-09 Release

Fri, 28 Oct 2022 12:28:10 +0000

News since DBpedia Snapshot 2022-03

New Abstract Extractor due to GSOC 2022 (credits to Celian Ringwald)
Work in progress: Smoothing the community issue reporting and fixing at Github

What is the “DBpedia Snapshot” Release?

Historically, this release has been associated with many names: “DBpedia Core”, “EN DBpedia”, and — most confusingly — just “DBpedia”. In fact, it is a combination of —

EN Wikipedia data — A small, but very useful, subset (~ 1 Billion triples or 14%) of the whole DBpedia extraction using the DBpedia Information Extraction Framework (DIEF), comprising structured information extracted from the English Wikipedia plus some enrichments from other Wikipedia language editions, notably multilingual abstracts in ar, ca, cs, de, el, eo, es, eu, fr, ga, id, it, ja, ko, nl, pl, pt, sv, uk, ru, zh.
Links — 62 million community-contributed cross-references and owl:sameAs links to other linked data sets on the Linked Open Data (LOD) Cloud, which make it possible to effectively find and retrieve further information from the largest decentralized, change-sensitive knowledge graph on earth, which has formed around DBpedia since 2007.
Community extensions — Community-contributed extensions such as additional ontologies and taxonomies.

Release Frequency & Schedule

September 6–8: Wikipedia dumps for June 1 become available on https://dumps.wikimedia.org/
September 8–20: Download and extraction with DIEF
September 20–November 1: Post-processing and quality-control period
November 1–November 15: Linked Data and SPARQL endpoint deployment

Data Freshness

Given the timeline above, the EN Wikipedia data of DBpedia Snapshot has a lag of 1-4 months. We recommend the following strategies to mitigate this:

DBpedia Snapshot as a kernel for Linked Data: Following the Linked Data paradigm, we recommend using the Linked Data links to other knowledge graphs to retrieve high-quality and recent information. DBpedia’s network consists of the best knowledge engineers in the world, working together, using linked data principles to build a high-quality, open, decentralized knowledge graph network around DBpedia. Freshness and change-sensitivity are two of the greatest data-related challenges of our time, and can only be overcome by linking data across data sources. The “Big Data” approach of copying data into a central warehouse is inevitably challenged by issues such as co-evolution and scalability.
DBpedia Live: Wikipedia is unmistakably the richest, most recent body of human knowledge and source of news in the world. DBpedia Live is just minutes behind edits on Wikipedia, which means that as soon as any of the 120k Wikipedia editors presses the “save” button, DBpedia Live will extract fresh data and update. DBpedia Live is currently in tech preview status and we are working towards a highly available and reliable business API with support. DBpedia Live consists of the DBpedia Live Sync API (for syncing into any kind of on-site databases), Linked Data and SPARQL endpoint.
Latest-Core is a dynamically updating Databus Collection. Our automated extraction robot “MARVIN” publishes monthly dev versions of the full extraction, which are then refined and enriched to become Snapshot.

Data Quality & Richness

Data Access & Interaction Options

Linked Data

SPARQL Endpoint

DBpedia Ontology

DBpedia Snapshot Statistics

Overview. Overall the current Snapshot Release contains more than 850 million facts (triples).

The current Snapshot Release utilizes a total of 55 thousand properties, of which 1377 are defined by the DBpedia ontology.

Persons: 1833500 (1.02%, 1.15%)
Places: 757436 (1.01%, 1842.91%), including but not limited to 597548 (1.01%, 5584.56%) populated places
Works: 619280 (1.01%, 628.71%), including, but not limited to
- 158895 (1.01%, 1.40%) music albums
- 147192 (1.02%, 16.25%) films
- 25182 (1.01%, 12.71%) video games
Organizations: 350329 (1.01%, 110.83%), including but not limited to
- 88722 (1.01%, 2.28%) companies
- 64094 (0.99%, 64094.00%) educational institutions
Species: 1955775 (1.01%, 325962.50%)
Plants: 4977 (0.64%, 1.10%)
Diseases: 10837 (1.02%, 8.74%)
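A note on reading the bracketed figures: although they carry a percent sign, the first value in each bracket appears to be the ratio of the current count to the count in the previous release, not a percentage change. As a rough check, and assuming the comparison is against the 2022-03 counts, dividing the counts above by the 2022-03 counts reproduces the first bracketed value for every class listed. The Python sketch below shows three of them; the function names are illustrative only.

```python
# Instance counts as listed in the 2022-03 and 2022-09 release posts.
PREVIOUS = {"Persons": 1792308, "Plants": 7718, "Diseases": 10591}
CURRENT = {"Persons": 1833500, "Plants": 4977, "Diseases": 10837}


def growth_ratio(current: int, previous: int) -> float:
    """Ratio of the current count to the previous one (1.0 means no change)."""
    return current / previous


def percent_change(current: int, previous: int) -> float:
    """Conventional percentage change relative to the previous count."""
    return (current - previous) / previous * 100


for name in CURRENT:
    ratio = growth_ratio(CURRENT[name], PREVIOUS[name])
    change = percent_change(CURRENT[name], PREVIOUS[name])
    print(f"{name}: ratio {ratio:.2f}, change {change:+.1f}%")
```

Read this way, a value such as 0.64 for Plants means the class shrank to about 64% of its previous size (a change of roughly -35.5%), while 1.02 for Persons corresponds to growth of about 2.3%.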

Top 11

33,860,047 http://www.wikidata.org

7,147,970 https://global.dbpedia.org

4,308,772 http://yago-knowledge.org

3,832,100 http://de.dbpedia.org

3,704,534 http://fr.dbpedia.org

2,971,751 http://viaf.org

2,912,859 http://it.dbpedia.org

2,903,130 http://es.dbpedia.org

2,754,466 http://fa.dbpedia.org

2,571,787 http://sr.dbpedia.org

2,563,793 http://ru.dbpedia.org

Top 10 without DBpedia namespaces

33,860,047 http://www.wikidata.org

4,308,772 http://yago-knowledge.org

2,971,751 http://viaf.org

1,687,386 http://d-nb.info

609,604 http://sws.geonames.org

596,134 http://umbel.org

533,320 http://data.bibliotheken.nl

430,839 http://www.w3.org

417,034 http://musicbrainz.org

104,433 http://linkedgeodata.org

DBpedia Extraction Dumps on the Databus

All extracted files are reachable via the DBpedia account on the Databus. The Databus has two main structures:

In groups, artifacts, version, similar to software in Maven: https://databus.dbpedia.org/$user/$group/$artifact/$version/$filename
In Databus collections that are SPARQL queries over https://databus.dbpedia.org/yasui filtering and selecting files from these artifacts.

Snapshot Download. For downloading DBpedia Snapshot, we prepared this collection, which also includes detailed releases notes: https://databus.dbpedia.org/dbpedia/collections/dbpedia-snapshot-2022-03

The collection is roughly equivalent to http://downloads.dbpedia.org/2016-10/core/

Replicating DBpedia Snapshot on your server can be done via Docker, see https://hub.docker.com/r/dbpedia/virtuoso-sparql-endpoint-quickstart

git clone https://github.com/dbpedia/virtuoso-sparql-endpoint-quickstart.git

cd virtuoso-sparql-endpoint-quickstart

COLLECTION_URI=https://databus.dbpedia.org/dbpedia/collections/dbpedia-snapshot-2022-09 VIRTUOSO_ADMIN_PASSWD=password docker-compose up

Acknowledgments

First and foremost, we would like to thank our open community of knowledge engineers for finding & fixing bugs and for supporting us by writing data tests. We would also like to acknowledge the DBpedia Association members for constantly innovating the areas of knowledge graphs and linked data and pushing the DBpedia initiative with their know-how and advice. OpenLink Software supports DBpedia by hosting SPARQL and Linked Data; University Mannheim, the German National Library of Science and Technology (TIB) and the Computer Center of University Leipzig provide persistent backups and servers for extracting data. We thank Marvin Hofer and Mykola Medynskyi for technical preparation. This work was partially supported by grants from the Federal Ministry for Economic Affairs and Energy of Germany (BMWi) for the LOD-GEOSS Project (03EI1005E), as well as for the PLASS Project (01MD19003D).

The post DBpedia Snapshot 2022-09 Release appeared first on DBpedia Association.

DBpedia Snapshot 2022-06 Release

Thu, 28 Jul 2022 10:57:09 +0000

The DBpedia team is heartbroken to inform you that there will be NO new Snapshot Release for version 2022-06.

You can still use the last Snapshot Release of version 2022-03.

We want to address the current problem and future solutions in the following.

We encountered several new issues, but the major problem is that the current version of DBpedia’s Abstract Extractor is no longer working. Wikimedia/Wikipedia seems to have tightened the requests-per-second restrictions on their old API.

Current API https://en.wikipedia.org/w/api.php

As a result, we could not extract any version of English abstracts for April, May, or June 2022 (not even thinking about the other 138 languages). We decided not to publish a mixed version with overlapping core data that is older than three months (e.g., abstracts and mapping-based data).
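For readers unfamiliar with requests-per-second limits, the sketch below shows generic client-side throttling in Python. It is not how the DBpedia Abstract Extractor works, and the post does not state the actual limits imposed by Wikimedia; the class name `Throttle` and the rate of 2 requests per second are arbitrary choices for illustration. The clock and sleep functions are injectable so the demo runs instantly with a fake clock.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(
        self,
        max_per_second: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time the last request was allowed

    def wait(self) -> float:
        """Block until the next request is allowed; return the time slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self._last + self.min_interval - now)
        if delay > 0:
            self._sleep(delay)
        self._last = now + delay
        return delay


# Demo with a fake clock so the example does not actually sleep.
fake_now = [0.0]
slept = []


def fake_sleep(seconds: float) -> None:
    slept.append(seconds)
    fake_now[0] += seconds


throttle = Throttle(max_per_second=2, clock=lambda: fake_now[0], sleep=fake_sleep)
for _ in range(3):
    throttle.wait()  # a real client would issue one API request after each wait
print(slept)
```

At rates this low, extracting abstracts for millions of pages across many languages takes a long time, which is why a stricter server-side limit can stall a full extraction run.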

As a solution, a GSOC project that specifies the task of improving abstract extraction was already promoted in early 2022. The project was accepted and is currently running. We tested several promising new strategies to implement a new Abstract Extractor.

New API https://en.wikipedia.org/api/rest_v1/
local deployment of Mediawiki for Wikitext-template parsing

We will give further status updates on this project in the future.

The announcement of the next Snapshot Release (2022-09) is scheduled for November the 1st.

The post DBpedia Snapshot 2022-06 Release appeared first on DBpedia Association.

DBpedia Snapshot 2022-03 Release

Mon, 02 May 2022 14:28:13 +0000

News since DBpedia Snapshot 2021-12

Release notes are now maintained in the Databus Collection (2022-03)
Image and Abstract Extractor was improved
Work in progress: Smoothing the community issue reporting and fixing at Github

What is the “DBpedia Snapshot” Release?

Historically, this release has been associated with many names: “DBpedia Core”, “EN DBpedia”, and — most confusingly — just “DBpedia”. In fact, it is a combination of —

EN Wikipedia data — A small, but very useful, subset (~ 1 Billion triples or 14%) of the whole DBpedia extraction using the DBpedia Information Extraction Framework (DIEF), comprising structured information extracted from the English Wikipedia plus some enrichments from other Wikipedia language editions, notably multilingual abstracts in ar, ca, cs, de, el, eo, es, eu, fr, ga, id, it, ja, ko, nl, pl, pt, sv, uk, ru, zh.
Links — 62 million community-contributed cross-references and owl:sameAs links to other linked data sets on the Linked Open Data (LOD) Cloud, which make it possible to effectively find and retrieve further information from the largest decentralized, change-sensitive knowledge graph on earth, which has formed around DBpedia since 2007.
Community extensions — Community-contributed extensions such as additional ontologies and taxonomies.

Release Frequency & Schedule

Going forward, releases will be scheduled for the 15th of February, May, July, and October (with +/- 5 days tolerance), and are named using the same date convention as the Wikipedia Dumps that served as the basis for the release. An example of the release timeline is shown below:

March 6–8: Wikipedia dumps for June 1 become available on https://dumps.wikimedia.org/
March 8–20: Download and extraction with DIEF
March 20–April 15: Post-processing and quality-control period
April 15–May 1: Linked Data and SPARQL endpoint deployment

Data Freshness

Given the timeline above, the EN Wikipedia data of DBpedia Snapshot has a lag of 1-4 months. We recommend the following strategies to mitigate this:

DBpedia Snapshot as a kernel for Linked Data: Following the Linked Data paradigm, we recommend using the Linked Data links to other knowledge graphs to retrieve high-quality and recent information. DBpedia’s network consists of the best knowledge engineers in the world, working together, using linked data principles to build a high-quality, open, decentralized knowledge graph network around DBpedia. Freshness and change-sensitivity are two of the greatest data-related challenges of our time, and can only be overcome by linking data across data sources. The “Big Data” approach of copying data into a central warehouse is inevitably challenged by issues such as co-evolution and scalability.
DBpedia Live: Wikipedia is unmistakably the richest, most recent body of human knowledge and source of news in the world. DBpedia Live is just minutes behind edits on Wikipedia, which means that as soon as any of the 120k Wikipedia editors presses the “save” button, DBpedia Live will extract fresh data and update. DBpedia Live is currently in tech preview status and we are working towards a highly available and reliable business API with support. DBpedia Live consists of the DBpedia Live Sync API (for syncing into any kind of on-site databases), Linked Data and SPARQL endpoint.
Latest-Core is a dynamically updating Databus Collection. Our automated extraction robot “MARVIN” publishes monthly dev versions of the full extraction, which are then refined and enriched to become Snapshot.

Data Quality & Richness

We would like to acknowledge the excellent work of Wikipedia editors (~46k active editors for EN Wikipedia), who are ultimately responsible for collecting information in Wikipedia’s infoboxes, which are refined by DBpedia’s extraction into our knowledge graphs. Wikipedia’s infoboxes are steadily growing each month and according to our measurements grow by 150% every three years. EN Wikipedia’s infoboxes even doubled in this time frame. This richness of knowledge drives the DBpedia Snapshot knowledge graph and is further potentiated by synergies with linked data cross-references. Statistics are given below.

Data Access & Interaction Options

Linked Data

HTML pages comprising Hyperlinks that conform to Linked Data Principles are one of the methods of interacting with data provided by the DBpedia Snapshot, be it manually via the web browser or programmatically using REST interaction patterns via the https://dbpedia.org/resource/{entity-label} pattern. Naturally, we encourage Linked Data interactions, while also expecting user-agents to honor the cache-control HTTP response header for massive crawl operations. Instructions for accessing Linked Data, available in 10 formats.

SPARQL Endpoint

DBpedia Ontology

DBpedia Snapshot Statistics

Overview. Overall the current Snapshot Release contains more than 850 million facts (triples).

The current Snapshot Release utilizes a total of 55 thousand properties, of which 1377 are defined by the DBpedia ontology.

Persons: 1792308 (1.01%, 1.13%)
Places: 748372 (1.00%, 1820.86%), including but not limited to 590481 (1.00%, 5518.51%) populated places
Works: 610589 (1.00%, 619.89%), including, but not limited to
- 157566 (1.00%, 1.38%) music albums
- 144415 (1.01%, 15.94%) films
- 24829 (1.01%, 12.53%) video games
Organizations: 345523 (1.01%, 109.31%), including but not limited to
- 87621 (1.01%, 2.25%) companies
- 64507 (1.00%, 64507.00%) educational institutions
Species: 1933436 (1.01%, 322239.33%)
Plants: 7718 (0.82%, 1.71%)
Diseases: 10591 (1.00%, 8.54%)

Top 11

33,573,926 http://www.wikidata.org

7,005,750 https://global.dbpedia.org

4,308,772 http://yago-knowledge.org

3,768,764 http://de.dbpedia.org

3,642,704 http://fr.dbpedia.org

2,946,265 http://viaf.org

2,872,878,344 http://it.dbpedia.org

2,853,081 http://es.dbpedia.org

2,651,369 http://fa.dbpedia.org

2,552,761 http://sr.dbpedia.org

2,517,456 http://ru.dbpedia.org

Top 10 without DBpedia namespaces

33,573,926 http://www.wikidata.org

4,308,772 http://yago-knowledge.org

2,946,265 http://viaf.org

1,633,862 http://d-nb.info

601,227 http://sws.geonames.org

596,134 http://umbel.org

528,123 http://data.bibliotheken.nl

430,839 http://www.w3.org

373,293 http://musicbrainz.org

104,433 http://linkedgeodata.org

DBpedia Extraction Dumps on the Databus

All extracted files are reachable via the DBpedia account on the Databus. The Databus has two main structures:

In groups, artifacts, version, similar to software in Maven: https://databus.dbpedia.org/$user/$group/$artifact/$version/$filename
In Databus collections that are SPARQL queries over https://databus.dbpedia.org/yasui filtering and selecting files from these artifacts.
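The Maven-like path layout can be sketched as a small helper that joins the five segments in the order given above. The function name and the placeholder values are hypothetical and only show the shape of the resulting URL; they do not refer to real Databus files.

```python
DATABUS = "https://databus.dbpedia.org"


def databus_file_url(user: str, group: str, artifact: str,
                     version: str, filename: str) -> str:
    """Join the five path segments in the order $user/$group/$artifact/$version/$filename."""
    parts = [user, group, artifact, version, filename]
    if any(not p or "/" in p for p in parts):
        raise ValueError("each segment must be non-empty and contain no '/'")
    return "/".join([DATABUS, *parts])


# Placeholder values, chosen only to illustrate the layout.
url = databus_file_url("some-user", "some-group", "some-artifact",
                       "2021.12.01", "some-file.ttl.bz2")
print(url)
```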

Snapshot Download. For downloading DBpedia Snapshot, we prepared this collection, which also includes detailed release notes: https://databus.dbpedia.org/dbpedia/collections/dbpedia-snapshot-2022-03

The collection is roughly equivalent to http://downloads.dbpedia.org/2016-10/core/

Replicating DBpedia Snapshot on your server can be done via Docker, see https://hub.docker.com/r/dbpedia/virtuoso-sparql-endpoint-quickstart

git clone https://github.com/dbpedia/virtuoso-sparql-endpoint-quickstart.git

cd virtuoso-sparql-endpoint-quickstart

COLLECTION_URI=https://databus.dbpedia.org/dbpedia/collections/dbpedia-snapshot-2022-03 VIRTUOSO_ADMIN_PASSWD=password docker-compose up

Acknowledgments

First and foremost, we would like to thank our open community of knowledge engineers for finding & fixing bugs and for supporting us by writing data tests. We would also like to acknowledge the DBpedia Association members for constantly innovating the areas of knowledge graphs and linked data and pushing the DBpedia initiative with their know-how and advice. OpenLink Software supports DBpedia by hosting SPARQL and Linked Data; University Mannheim, the German National Library of Science and Technology (TIB) and the Computer Center of University Leipzig provide persistent backups and servers for extracting data. We thank Marvin Hofer and Mykola Medynskyi for technical preparation. This work was partially supported by grants from the Federal Ministry for Economic Affairs and Energy of Germany (BMWi) for the LOD-GEOSS Project (03EI1005E), as well as for the PLASS Project (01MD19003D).

The post DBpedia Snapshot 2022-03 Release appeared first on DBpedia Association.

DBpedia Snapshot 2021-12 Release

Wed, 09 Feb 2022 17:17:44 +0000

News since DBpedia Snapshot 2021-09

+ Release notes are now maintained in the Databus Collection (2021-12)

+ Image and Abstract Extractor was improved

+ Work in progress: Smoothing the community issue reporting and fixing at Github

What is the “DBpedia Snapshot” Release?

Historically, this release has been associated with many names: “DBpedia Core”, “EN DBpedia”, and — most confusingly — just “DBpedia”. In fact, it is a combination of —

EN Wikipedia data — A small, but very useful, subset (~ 1 Billion triples or 14%) of the whole DBpedia extraction using the DBpedia Information Extraction Framework (DIEF), comprising structured information extracted from the English Wikipedia plus some enrichments from other Wikipedia language editions, notably multilingual abstracts in ar, ca, cs, de, el, eo, es, eu, fr, ga, id, it, ja, ko, nl, pl, pt, sv, uk, ru, zh.
Links — 62 million community-contributed cross-references and owl:sameAs links to other linked data sets on the Linked Open Data (LOD) Cloud that allow to effectively find and retrieve further information from the largest, decentral, change-sensitive knowledge graph on earth that has formed around DBpedia since 2007.
Community extensions — Community-contributed extensions such as additional ontologies and taxonomies.

Release Frequency & Schedule

Going forward, releases will be scheduled for the 15th of January, April, June and September (with +/- 5 days tolerance), and are named using the same date convention as the Wikipedia Dumps that served as the basis for the release. An example of the release timeline is shown below:

December 6–8	Dec 8–20	Dec 20–Jan 15	Jan 15–Feb 8
Wikipedia dumps for December 1 become available on https://dumps.wikimedia.org/	Download and extraction with DIEF	Post-processing and quality-control period	Linked Data and SPARQL endpoint deployment

Data Freshness

Given the timeline above, the EN Wikipedia data of DBpedia Snapshot has a lag of 1-4 months. We recommend the following strategies to mitigate this:

DBpedia Snapshot as a kernel for Linked Data: Following the Linked Data paradigm, we recommend using the Linked Data links to other knowledge graphs to retrieve high-quality and recent information. DBpedia’s network consists of the best knowledge engineers in the world, working together, using linked data principles to build a high-quality, open, decentralized knowledge graph network around DBpedia. Freshness and change-sensitivity are two of the greatest data-related challenges of our time, and can only be overcome by linking data across data sources. The “Big Data” approach of copying data into a central warehouse is inevitably challenged by issues such as co-evolution and scalability.
DBpedia Live: Wikipedia is unmistakably the richest, most recent body of human knowledge and source of news in the world. DBpedia Live is just minutes behind edits on Wikipedia, which means that as soon as any of the 120k Wikipedia editors presses the “save” button, DBpedia Live will extract fresh data and update. DBpedia Live is currently in tech preview status and we are working towards a highly available and reliable business API with support. DBpedia Live consists of the DBpedia Live Sync API (for syncing into any kind of on-site databases), Linked Data and SPARQL endpoint.
Latest-Core is a dynamically updating Databus Collection. Our automated extraction robot “MARVIN” publishes monthly dev versions of the full extraction, which are then refined and enriched to become Snapshot.

Data Quality & Richness

Data Access & Interaction Options

Linked Data

SPARQL Endpoint

DBpedia Ontology

DBpedia Snapshot Statistics

Overview. Overall the current Snapshot Release contains more than 850 million facts (triples).

The current Snapshot Release utilizes a total of 55 thousand properties, of which 1377 are defined by the DBpedia ontology.

Persons: 1768613 (1.02%, 1.11%)
Places: 745010 (1.01%, 1812.68%), including but not limited to 588283 (1.01%, 5497.97%) populated places
Works: 607852 (1.01%, 617.11%), including, but not limited to
- 157266 (1.00%, 1.38%) music albums
- 143193 (1.01%, 15.80%) films
- 24658 (1.01%, 12.44%) video games
Organizations: 343109 (1.01%, 108.54%), including but not limited to
- 86950 (1.01%, 2.23%) companies
- 64539 (1.00%, 64539.00%) educational institutions
Species: 1918094 (11.95%, 319682.33%)
Plants: 9369 (0.89%, 2.07%)
Diseases: 10553 (1.00%, 8.51%)

Detailed Growth of Classes: The image below shows the detailed growth for one class. Click on the links for other classes: Place, PopulatedPlace, Work, Album, Film, VideoGame, Organisation, Company, EducationalInstitution, Species, Plant, Disease. For further classes, adapt the query by replacing the <http://dbpedia.org/ontology/CLASS> URI. Note that 2018 was a development phase with some failed extractions. The stats were generated with the Databus VOID Mod.

Links. Linked Data cross-references between decentral datasets are the foundation and access point to the Linked Data Web. The latest Snapshot Release provides over 129.9 million links from 7.54 million entities to 179 external sources.

Top 11

33,522,731 http://www.wikidata.org

6,924,866 https://global.dbpedia.org

4,308,772 http://yago-knowledge.org

3,742,122 http://de.dbpedia.org

3,617,112 http://fr.dbpedia.org

3,067,339 http://viaf.org

2,860,142 http://it.dbpedia.org

2,841,468 http://es.dbpedia.org

2,609,236 http://fa.dbpedia.org

2,549,967 http://sr.dbpedia.org

2,496,247 http://ru.dbpedia.org

Top 10 without DBpedia namespaces

33,522,731 http://www.wikidata.org

4,308,772 http://yago-knowledge.org

3,067,339 http://viaf.org

1,831,098 http://d-nb.info

641,602 http://sws.geonames.org

596,134 http://umbel.org

524,267 http://data.bibliotheken.nl

430,839 http://www.w3.org

370,844 http://musicbrainz.org

106,498 http://linkedgeodata.org

DBpedia Extraction Dumps on the Databus

All extracted files are reachable via the DBpedia account on the Databus. The Databus has two main structures:

In groups, artifacts, version, similar to software in Maven: https://databus.dbpedia.org/$user/$group/$artifact/$version/$filename
In Databus collections that are SPARQL queries over https://databus.dbpedia.org/yasui filtering and selecting files from these artifacts.

Snapshot Download. For downloading DBpedia Snapshot, we prepared this collection, which also includes detailed release notes: https://databus.dbpedia.org/dbpedia/collections/dbpedia-snapshot-2021-12

The collection is roughly equivalent to http://downloads.dbpedia.org/2016-10/core/.

Collections can be downloaded in many different ways; some download modalities, such as bash script, SPARQL, and plain URL list, are found in the tabs on the collection page. Files are provided as bzip2-compressed N-Triples files. If you need a different format or compression, you can also use the “Download-As” function of the Databus Client (GitHub), e.g. -s $collection -c gzip would download the collection and convert it to gzip during download.

Replicating DBpedia Snapshot on your server can be done via Docker, see https://hub.docker.com/r/dbpedia/virtuoso-sparql-endpoint-quickstart.

git clone https://github.com/dbpedia/virtuoso-sparql-endpoint-quickstart.git

cd virtuoso-sparql-endpoint-quickstart

COLLECTION_URI=https://databus.dbpedia.org/dbpedia/collections/dbpedia-snapshot-2021-12 VIRTUOSO_ADMIN_PASSWD=password docker-compose up

Acknowledgments

OpenLink Software supports DBpedia by hosting SPARQL and Linked Data; University Mannheim, the German National Library of Science and Technology (TIB) and the Computer Center of University Leipzig provide persistent backups and servers for extracting data. We thank Marvin Hofer and Mykola Medynskyi for technical preparation. This work was partially supported by grants from the Federal Ministry for Economic Affairs and Energy of Germany (BMWi) for the LOD-GEOSS Project (03EI1005E), as well as for the PLASS Project (01MD19003D).

The post DBpedia Snapshot 2021-12 Release appeared first on DBpedia Association.

DBpedia Snapshot 2021-09 Release

Fri, 22 Oct 2021 14:07:07 +0000

News since DBpedia Snapshot 2021-06

Release notes are now maintained in the Databus Collection (2021-09)
Image and Abstract Extractor was improved
Work in progress: Smoothing the community issue reporting and fixing at Github

What is the “DBpedia Snapshot” Release?

Historically, this release has been associated with many names: “DBpedia Core”, “EN DBpedia”, and — most confusingly — just “DBpedia”. In fact, it is a combination of —

EN Wikipedia data — A small, but very useful, subset (~ 1 Billion triples or 14%) of the whole DBpedia extraction using the DBpedia Information Extraction Framework (DIEF), comprising structured information extracted from the English Wikipedia plus some enrichments from other Wikipedia language editions, notably multilingual abstracts in ar, ca, cs, de, el, eo, es, eu, fr, ga, id, it, ja, ko, nl, pl, pt, sv, uk, ru, zh.
Links — 62 million community-contributed cross-references and owl:sameAs links to other linked data sets on the Linked Open Data (LOD) Cloud that allow to effectively find and retrieve further information from the largest, decentral, change-sensitive knowledge graph on earth that has formed around DBpedia since 2007.
Community extensions — Community-contributed extensions such as additional ontologies and taxonomies.

Release Frequency & Schedule

September 6–8	Sep 8–20	Sep 20–Oct 10	Oct 10–20
Wikipedia dumps for September 1 become available on https://dumps.wikimedia.org/	Download and extraction with DIEF	Post-processing and quality-control period	Linked Data and SPARQL endpoint deployment

Data Freshness

Given the timeline above, the EN Wikipedia data of DBpedia Snapshot has a lag of 1-4 months. We recommend the following strategies to mitigate this:

DBpedia Snapshot as a kernel for Linked Data: Following the Linked Data paradigm, we recommend using the Linked Data links to other knowledge graphs to retrieve high-quality and recent information. DBpedia’s network consists of the best knowledge engineers in the world, working together, using linked data principles to build a high-quality, open, decentralized knowledge graph network around DBpedia. Freshness and change-sensitivity are two of the greatest data-related challenges of our time, and can only be overcome by linking data across data sources. The “Big Data” approach of copying data into a central warehouse is inevitably challenged by issues such as co-evolution and scalability.
DBpedia Live: Wikipedia is unmistakably the richest, most recent body of human knowledge and source of news in the world. DBpedia Live is just minutes behind edits on Wikipedia, which means that as soon as any of the 120k Wikipedia editors presses the “save” button, DBpedia Live will extract fresh data and update. DBpedia Live is currently in tech preview status and we are working towards a highly available and reliable business API with support. DBpedia Live consists of the DBpedia Live Sync API (for syncing into any kind of on-site databases), Linked Data and SPARQL endpoint.
Latest-Core is a dynamically updating Databus Collection. Our automated extraction robot “MARVIN” publishes monthly dev versions of the full extraction, which are then refined and enriched to become Snapshot.

Data Quality & Richness

Data Access & Interaction Options

Linked Data

SPARQL Endpoint

DBpedia Ontology

This Snapshot Release was built with DBpedia Ontology (DBO) version: https://databus.dbpedia.org/ontologies/dbpedia.org/ontology--DEV/2021.07.09-070001. We thank all DBpedians for their contributions to the ontology and the mappings. See documentation and visualizations, class tree and properties, wiki.

DBpedia Snapshot Statistics

Overview. Overall the current Snapshot Release contains more than 850 million facts (triples).

The current Snapshot Release utilizes a total of 55 thousand properties, of which 1377 are defined by the DBpedia ontology.

Persons: 1,730,033 (2.28%, 8.85%)
Places: 737,512 (-25.64%, -11.42%), including but not limited to 582,191 (-0.14%, 13.35%) populated places
Works: 603,110 (1.34%, 21.58%), including, but not limited to
- 157,137 (1.94%, 38.02%) music albums
- 142,135 (0.75%, 1466.74%) films
- 24,452 (0.85%, 1133.70%) video games
Organizations: 339,927 (-0.13%, -42.35%), including but not limited to
- 85,726 (1.20%, 59.79%) companies
- 64,474 (1.01%, 17.68%) educational institutions
Species: 160,535 (-3.71%, -46.47%)
Plants: 10,509 (-9.41%, 8.10%)
Diseases: 10,512 (-9.39%, 747.74%)

Links. Linked Data cross-references between decentral datasets are the foundation and access point to the Linked Data Web. The latest Snapshot Release provides over 127.8 million links from 7.47 million entities to 179 external sources.

Top 11

33,403,279 www.wikidata.org

6,847,067 global.dbpedia.org

4,308,772 yago-knowledge.org

3,712,468 de.dbpedia.org

3,589,032 fr.dbpedia.org

2,917,799 viaf.org

2,841,527 it.dbpedia.org

2,816,382 es.dbpedia.org

2,567,507 fa.dbpedia.org

2,542,619 sr.dbpedia.org

Top 10 without DBpedia namespaces

33,403,279 www.wikidata.org

4,308,772 yago-knowledge.org

2,917,799 viaf.org

1,614,381 d-nb.info

596,134 umbel.org

581,558 sws.geonames.org

521,985 data.bibliotheken.nl

430,839 www.w3.org

369,309 musicbrainz.org

104,433 linkedgeodata.org

DBpedia Extraction Dumps on the Databus

All extracted files are reachable via the DBpedia account on the Databus. The Databus has two main structures:

In groups, artifacts, version, similar to software in Maven: https://databus.dbpedia.org/$user/$group/$artifact/$version/$filename
In Databus collections that are SPARQL queries over https://databus.dbpedia.org/yasui filtering and selecting files from these artifacts.

Snapshot Download. For downloading DBpedia Snapshot, we prepared this collection, which also includes detailed release notes: https://databus.dbpedia.org/dbpedia/collections/dbpedia-snapshot-2021-06. The collection is roughly equivalent to http://downloads.dbpedia.org/2016-10/core/

Collections can be downloaded in many different ways; some download modalities, such as bash script, SPARQL, and plain URL list, are found in the tabs on the collection page. Files are provided as bzip2-compressed N-Triples files. If you need a different format or compression, you can also use the “Download-As” function of the Databus Client (GitHub), e.g. -s $collection -c gzip would download the collection and convert it to gzip during download.

Replicating DBpedia Snapshot on your server can be done via Docker, see https://hub.docker.com/r/dbpedia/virtuoso-sparql-endpoint-quickstart

git clone https://github.com/dbpedia/virtuoso-sparql-endpoint-quickstart.git

cd virtuoso-sparql-endpoint-quickstart

COLLECTION_URI=https://databus.dbpedia.org/dbpedia/collections/dbpedia-snapshot-2021-09 VIRTUOSO_ADMIN_PASSWD=password docker-compose up

Acknowledgments

First and foremost, we would like to thank our open community of knowledge engineers for finding & fixing bugs and for supporting us by writing data tests. We would also like to acknowledge the DBpedia Association members for constantly innovating the areas of knowledge graphs and linked data and pushing the DBpedia initiative with their know-how and advice. OpenLink Software supports DBpedia by hosting SPARQL and Linked Data; University Mannheim, the German National Library of Science and Technology (TIB) and the Computer Center of University Leipzig provide persistent backups and servers for extracting data. We thank Marvin Hofer and Mykola Medynskyi for technical preparation. This work was partially supported by grants from the Federal Ministry for Economic Affairs and Energy of Germany (BMWi) for the LOD-GEOSS Project (03EI1005E), as well as for the PLASS Project (01MD19003D).

The post DBpedia Snapshot 2021-09 Release appeared first on DBpedia Association.

New DBpedia Usage Report for July 2017 through January 2021

Fri, 13 Aug 2021 08:08:35 +0000

Our partner OpenLink Software recently published a new DBpedia usage report on the SPARQL endpoint and associated Linked Data deployment.

Introduction

This document shows some of the statistics from the DBpedia 2016-10 dataset collected between July 2017 and January 2021, spanning more than three and a half years of logs from the DBpedia web service operated by OpenLink Software at http://dbpedia.org/sparql/.

The log files used to prepare this document include data from the following DBpedia release:

DBpedia 2016-10 dataset

Infrastructure

The DBpedia service consists of:

two or more Virtuoso Universal Server Instances — facilitating Linked Data Deployment including providing a SPARQL endpoint delivering RDF data in a variety of document formats subject to content-negotiation.
a Reverse Proxy Server — which redirects client requests to an available Virtuoso instance and caches the results in case another client repeats the same request within a specified timeframe
a physical computer — hosted in OpenLink Software’s datacenter

Currently the DBpedia service is hosted on two virtual machines running CentOS 6, each using 8 Intel Xeon E5-2630 2.30 GHz cores with 200 GB SSD and 64 GB memory, hosting Virtuoso 7.2 Enterprise Edition with the Column Store Module.

Rate and Connection limits

To maintain equitable access to the DBpedia service for everyone, OpenLink Software limits connections by rate and concurrent connection, limiting disruption by faulty or misbehaving applications.

Current limit rates are:

Connection limit of 50 parallel connections per IP address. This number is fairly high to permit multiple clients in networks using Network Address Translation (NAT) to appear as one network IP. Without the use of tracking cookies, it is impossible to distinguish between machines inside a NAT network, and for privacy and legal reasons, OpenLink Software has decided not to use such cookies at this point in time.
Rate limit of 100 requests per second per IP address, with an initial burst of 120 requests.

As part of monitoring the DBpedia service, OpenLink Software performs frequent traffic analysis to make sure the service is running smoothly.

Ideally, applications should be written to check the HTTP status code of each request, and in case of a 503 (Service Unavailable) or 429 (Too Many Requests) code, perform a 1–2 second sleep before retrying the request.
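The retry advice above can be sketched as follows. This is only an illustration under stated assumptions: the function name fetch_with_retry, the cap of five attempts, and the fixed 1.5 second pause are our own choices, and the request itself is abstracted behind a send callable so that the example runs without network access.

```python
import time
from typing import Callable, Tuple

RETRYABLE = {429, 503}  # the two status codes named in the text

Response = Tuple[int, bytes]  # (HTTP status code, body)


def fetch_with_retry(send: Callable[[], Response],
                     max_attempts: int = 5,
                     pause_seconds: float = 1.5,
                     sleep: Callable[[float], None] = time.sleep) -> Response:
    """Call `send` until the status is not retryable or attempts are used up."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status, body = send()
        if status not in RETRYABLE or attempt == max_attempts:
            break
        sleep(pause_seconds)  # pause within the suggested 1-2 second range
    return status, body


# Simulated responses (assumption): two refusals, then success.
responses = iter([(503, b""), (429, b""), (200, b"ok")])
pauses = []
status, body = fetch_with_retry(lambda: next(responses), sleep=pauses.append)
print(status, body, pauses)  # 200 b'ok' [1.5, 1.5]
```

In a real client, send would wrap the actual HTTP call; capping the number of attempts keeps a misbehaving loop from running into the temporary ban described below.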

OpenLink Software may alter these parameters at any time to make sure the service remains reachable to the general public.

In case of misuse, OpenLink Software may temporarily block an offender’s IP address from accessing the DBpedia service. This temporary ban will be automatically lifted once such a blocked IP address refrains from making any request to the DBpedia service for at least 5 minutes.

Configured Virtuoso limits on the DBpedia endpoint

The Virtuoso configuration for the DBpedia endpoint includes:

Query Execution Timeout of 120 seconds. This is the query solution preparation threshold. If the timeout stops execution before the solution is complete — i.e., if the solution is partial — this is indicated to the query client via HTTP response headers.
Maximum SPARQL query solution (aka result set) size of 10,000 rows. This is the maximum number of solution rows (for SELECT queries) or triple/quad statements (for CONSTRUCT or DESCRIBE queries) returned per query-solution-retrieval round-trip.

Virtuoso “Anytime Query” Functionality

The “Anytime Query” is a core feature of Virtuoso that enables it to handle the challenges inherent in providing a publicly accessible interface for ad-hoc querying at Web scale. This feature allows an application compliant with the SPARQL and HTTP protocols to issue long-running and/or large-solution queries, for which finding the complete solution would exceed the configured query timeout and/or result set limits, and to receive partial solutions conforming to those thresholds rather than being rebuffed with no solution. Further, this feature enables the use of LIMIT and OFFSET (typically combined with ORDER BY and/or GROUP BY) to create windows (also known as sliding windows or cursors) to iterate through the complete query solution without being adversely affected by inserts or deletions.

Note: Even while paging through a partial query solution, Virtuoso continues to work towards a complete solution in the background.
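The windowing idea can be sketched by generating one query per page. The helper name paged_queries and the example query are hypothetical, and the sketch assumes the total row count is known up front; in practice a client would more likely keep requesting pages until one comes back with fewer rows than the page size.

```python
from typing import Iterator

MAX_ROWS = 10_000  # result-set limit quoted in the text


def paged_queries(base_query: str, order_by: str,
                  total_rows: int, page_size: int = MAX_ROWS) -> Iterator[str]:
    """Yield one query per window; ORDER BY keeps the windows stable."""
    if page_size < 1 or total_rows < 0:
        raise ValueError("page_size must be >= 1 and total_rows >= 0")
    for offset in range(0, total_rows, page_size):
        yield f"{base_query} ORDER BY {order_by} LIMIT {page_size} OFFSET {offset}"


# Example query (assumption): all resources typed as dbo:Film.
base = "SELECT ?s WHERE { ?s a <http://dbpedia.org/ontology/Film> }"
queries = list(paged_queries(base, "?s", total_rows=25_000))
for q in queries:
    print(q)
```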

Custom HTTP headers

As the W3C SPARQL standard currently does not specify an authoritative status code or header response to report a partial result set, OpenLink Software has opted to have Virtuoso return a status code of 200 to denote a successful request and add a custom header to the result to indicate that the result was limited to what could be returned within the settings enforced by the server.

If full execution of the query would return more than the configured maximum number of rows, the X-SPARQL-MaxRows line is added, as shown below:

HTTP/1.1 200 OK
Date: Tue, 01 Jan 2018 12:00:00 GMT
Content-Type: text/html; charset=UTF-8
Content-Length: 1427536
Connection: keep-alive
Vary: Accept-Encoding
Server: Virtuoso/07.20.3224 (Linux) i686-generic-linux-glibc212-64 VDB
X-SPARQL-default-graph: http://dbpedia.org
X-SPARQL-MaxRows: 10000
Expires: Tue, 07 Jan 2018 12:00:00 GMT
Cache-Control: max-age=604800
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true
Access-Control-Allow-Methods: HEAD, GET, POST, OPTIONS
Access-Control-Allow-Headers: DNT,X-CustomHeader,Keep-Alive,User-Agent,X-Requested-With,If-Modified-Since,Cache-Control,Content-Type,Accept-Encoding
Accept-Ranges: bytes

If the AnyTime Query timeout is reached, several headers are added:

HTTP/1.1 200 OK
Date: Tue, 01 Jan 2018 12:00:00 GMT
Content-Type: text/html; charset=UTF-8
Content-Length: 80
Connection: keep-alive
Server: Virtuoso/07.20.3224 (Linux) i686-generic-linux-glibc212-64 VDB
X-SPARQL-default-graph: http://dbpedia.org
X-SQL-State: S1TAT
X-SQL-Message: RC...: Returning incomplete results, query interrupted by result timeout. Activity: 7 rnd 64.87M seq 0 same seg 1 same pg 0 same par 0 disk 0 spec disk 0B / 0 mess
X-Exec-Milliseconds: 30000
X-Exec-DB-Activity: 7 rnd 64.87M seq 0 same seg 1 same pg 0 same par 0 disk 0 spec disk 0B / 0 messages 0 fork
Expires: Tue, 07 Jan 2018 12:00:00 GMT
Cache-Control: max-age=604800
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true
Access-Control-Allow-Methods: HEAD, GET, POST, OPTIONS
Access-Control-Allow-Headers: DNT,X-CustomHeader,Keep-Alive,User-Agent,X-Requested-With,If-Modified-Since,Cache-Control,Content-Type,Accept-Encoding
Accept-Ranges: bytes
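Since both cases return status 200, a client has to inspect the headers to notice a partial result. The following sketch checks for the two markers shown above; the function name partial_result_reason and its return values are illustrative assumptions, and only the header names and the S1TAT state are taken from the examples.

```python
from typing import Mapping, Optional


def partial_result_reason(headers: Mapping[str, str]) -> Optional[str]:
    """Return why a 200 response may be incomplete, or None if no marker is found."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    if lowered.get("x-sql-state") == "S1TAT":
        return "timeout"
    if "x-sparql-maxrows" in lowered:
        return "max-rows"
    return None


maxrows_headers = {"X-SPARQL-MaxRows": "10000", "Content-Type": "text/html; charset=UTF-8"}
timeout_headers = {"X-SQL-State": "S1TAT", "X-Exec-Milliseconds": "30000"}
print(partial_result_reason(maxrows_headers))               # max-rows
print(partial_result_reason(timeout_headers))               # timeout
print(partial_result_reason({"Content-Type": "text/html"}))  # None
```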

Hosting Independent DBpedia Instances

The restrictions described above may impair some complex analytical queries. Users who frequently encounter these limits are advised to use one of the following methods:

set up a private DBpedia instance in their own datacenter or in the cloud
use a Docker image
use an instance of OpenLink Software’s Pay-As-You-Go DBpedia AMIs (either the DBpedia 2016-10 Snapshot or the DBpedia-Live Mirror [starts from 2016-04 Snapshot]) on Amazon EC2.

HTTP logs

The HTTP server log files used in this report exclude traffic generated by:

IP addresses that were temporarily rate-limited after their burst period
IP addresses that were banned after misuse
applications, spiders, and other crawlers that were blocked after frequently hitting the rate-limiter or which generally claimed too many resources

The system uses a combination of firewall rules and Access Control Lists (ACLs) to quickly drop such connections, so legitimate users of the DBpedia service can continue to connect and execute queries.

To save time, these dropped connections are not recorded in the log files.

The data for this document was extracted from reports generated by Webalizer v2.21.

HTTP Usage Historical Overview

The first table shows the average numbers of Visits and Hits per day during the time each DBpedia dataset was live on the http://dbpedia.org/sparql endpoint.

DBpedia	From	Until	Days	Visits per day	Hits per day	Total Hits
3.3	2009-06-30	2009-11-05	128	9,602	733,811	94,661,592
3.4	2009-11-06	2010-04-07	152	11,100	1,212,549	185,519,930
3.5	2010-04-08	2011-01-17	284	16,381	1,122,612	282,898,279
3.6	2011-01-18	2011-06-30	163	19,288	1,328,355	219,178,587
3.7	2011-07-01	2012-06-19	354	23,408	2,052,660	594,338,675
3.8	2012-06-20	2013-09-19	456	16,614	2,925,335	570,440,410
3.9	2013-09-20	2014-09-02	347	22,026	3,035,428	1,062,399,840
2014	2014-09-03	2015-07-05	305	27,927	3,423,490	1,051,011,401
2015-04	2015-07-06	2016-03-31	269	24,689	3,516,936	953,089,788
2015-10	2016-04-01	2016-10-13	195	110,745	6,581,217	1,263,593,686
2016-04	2016-10-14	2017-07-03	262	231,735	7,646,447	2,003,369,014
2016-10	2017-07-04	2021-01-07	1283	257,994	7,542,623	9,501,427,081

For detailed information on the specific usage numbers, please visit the original report by OpenLink Software published here. Also, older reports are available through their site. Read the previous usage report 2020 on the DBpedia blog.

Further Links

For the latest news, subscribe to the DBpedia Newsletter, check our DBpedia Website and follow us on Twitter or LinkedIn.

Thanks for reading and keep using DBpedia!

Yours, DBpedia Association

The post New DBpedia Usage Report for July 2017 through January 2021 appeared first on DBpedia Association.

Announcement: DBpedia Snapshot 2021-06 Release

Fri, 23 Jul 2021 08:50:06 +0000

DBpedia Snapshot 2021-06 Release

What is the “DBpedia Snapshot” Release?

Historically, this release has been associated with many names: “DBpedia Core”, “EN DBpedia”, and — most confusingly — just “DBpedia”. In fact, it is a combination of —

EN Wikipedia data — A small, but very useful, subset (~ 1 Billion triples or 14%) of the whole DBpedia extraction using the DBpedia Information Extraction Framework (DIEF), comprising structured information extracted from the English Wikipedia plus some enrichments from other Wikipedia language editions, notably multilingual abstracts in ar, ca, cs, de, el, eo, es, eu, fr, ga, id, it, ja, ko, nl, pl, pt, sv, uk, ru, zh .
Links — 62 million community-contributed cross-references and owl:sameAs links to other linked data sets on the Linked Open Data (LOD) Cloud that allow to effectively find and retrieve further information from the largest, decentral, change-sensitive knowledge graph on earth that has formed around DBpedia since 2007.
Community extensions — Community-contributed extensions such as additional ontologies and taxonomies.

Release Frequency & Schedule

June 6–8	June 8–20	June 20–July 10	July 10–20
Wikipedia dumps for June 1 become available on https://dumps.wikimedia.org/	Download and extraction with DIEF	Post-processing and quality-control period	Linked Data and SPARQL endpoint deployment

Data Freshness

Given the timeline above, the EN Wikipedia data of DBpedia Snapshot has a lag of 1-4 months. We recommend the following strategies to mitigate this:

DBpedia Snapshot as a kernel for Linked Data: Following the Linked Data paradigm, we recommend using the Linked Data links to other knowledge graphs to retrieve high-quality and recent information. DBpedia’s network consists of the best knowledge engineers in the world, working together, using linked data principles to build a high-quality, open, decentralized knowledge graph network around DBpedia. Freshness and change-sensitivity are two of the greatest data-related challenges of our time, and can only be overcome by linking data across data sources. The “Big Data” approach of copying data into a central warehouse is inevitably challenged by issues such as co-evolution and scalability.
DBpedia Live: Wikipedia is unmistakably the richest, most recent body of human knowledge and source of news in the world. DBpedia Live is just minutes behind edits on Wikipedia, which means that as soon as any of the 120k Wikipedia editors presses the “save” button, DBpedia Live will extract fresh data and update. DBpedia Live is currently in tech preview status and we are working towards a highly available and reliable business API with support. DBpedia Live consists of the DBpedia Live Sync API (for syncing into any kind of on-site databases), Linked Data and SPARQL endpoint.
Latest-Core is a dynamically updating Databus Collection. Our automated extraction robot “MARVIN” publishes monthly dev versions of the full extraction, which are then refined and enriched to become Snapshot.

Data Quality & Richness

Data Access & Interaction Options

Linked Data

SPARQL Endpoint

An effective usage pattern is to filter a relevant subset of entity descriptions for your use case via SPARQL and then combine it with the power of Linked Data: look up (de-reference) the owl:sameAs links of those entities to retrieve specific and recent data from other Knowledge Graphs across the Linked Open Data Cloud.
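The usage pattern above can be sketched in Python as follows. This is a minimal sketch, not an official client: it assumes the public endpoint at https://dbpedia.org/sparql, the `format` request parameter as accepted by Virtuoso-based endpoints, and the standard SPARQL JSON results layout. The helper names (`build_query`, `request_url`, `same_as_links`) and the default limit are hypothetical choices for illustration.

```python
from urllib.parse import urlencode

# Assumption: the public DBpedia SPARQL endpoint.
ENDPOINT = "https://dbpedia.org/sparql"


def build_query(class_uri: str, limit: int = 10) -> str:
    """Select entities of one class together with their owl:sameAs links."""
    return (
        "PREFIX owl: <http://www.w3.org/2002/07/owl#>\n"
        "SELECT ?entity ?link WHERE {\n"
        f"  ?entity a <{class_uri}> ;\n"
        "          owl:sameAs ?link .\n"
        f"}} LIMIT {limit}"
    )


def request_url(query: str, endpoint: str = ENDPOINT) -> str:
    """Encode the query as a GET request that asks for SPARQL JSON results."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return f"{endpoint}?{urlencode(params)}"


def same_as_links(results: dict) -> dict:
    """Group owl:sameAs targets by entity from a SPARQL JSON result document."""
    links: dict = {}
    for row in results["results"]["bindings"]:
        links.setdefault(row["entity"]["value"], []).append(row["link"]["value"])
    return links
```

To try it against a live endpoint, the URL returned by `request_url` can be fetched with `urllib.request.urlopen` and decoded with `json.load`; each link returned by `same_as_links` can then be de-referenced as Linked Data.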

Additionally, DBpedia Snapshot dumps and additional data from the complete collection of datasets derived from Wikipedia are provided by the DBpedia Databus for use in your own SPARQL-accessible Knowledge Graphs.

DBpedia Ontology

DBpedia Snapshot Statistics

Overview. Overall the current Snapshot Release contains more than 850 million facts (triples).

The current Snapshot Release uses a total of 55 thousand properties, of which 1372 are defined by the DBpedia ontology.

Classes. Knowledge in Wikipedia is constantly growing at a rapid pace. We use the DBpedia Ontology Classes to measure the growth: Total number in this release (in brackets we give: a) growth to previous release, which can be negative temporarily and b) growth compared to Snapshot 2016-10):

Persons: 1682299 (1.69%, 5.84%)
Places: 992381 (0.44%, 2413%), including but not limited to 584938 (-0.0%, 5465%) populated places
Works: 593689 (0.54%, 6017%), including but not limited to
- 153984 (-0.7%, 35.2%) music albums
- 140766 (0.79%, 1453%) films
- 24198 (1.17%, 1120%) video games
Organizations: 341609 (0.69%, 1070%), including but not limited to
- 84464 (0.61%, 116.%) companies
- 63809 (0.17%, 6380%) educational institutions
Species: 169874 (-5.9%, 2831%)
Plants: 11952 (-15.%, 164.%)
Diseases: 8637 (0.16%, 596.%)
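
To read the bracketed percentages, it can help to work one of them backwards. The sketch below assumes growth is computed as (current - previous) / previous, which this post does not state explicitly, and uses the Persons figures from the list above. Since the published percentages are rounded, the implied counts are only approximate. The function name `implied_previous` is hypothetical.

```python
def implied_previous(current: int, growth_percent: float) -> float:
    """Invert growth = (current - previous) / previous * 100."""
    return current / (1 + growth_percent / 100)


persons = 1682299  # Persons in the current release (from the list above)

# Approximate count in the previous release (growth 1.69%).
previous_release = implied_previous(persons, 1.69)

# Approximate count in Snapshot 2016-10 (growth 5.84%).
snapshot_2016_10 = implied_previous(persons, 5.84)

print(round(previous_release), round(snapshot_2016_10))
```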

Detailed Growth of Classes: The image below shows the detailed growth for one class. Click on the links for other classes: Place, PopulatedPlace, Work, Album, Film, VideoGame, Organisation, Company, EducationalInstitution, Species, Plant, Disease. For further classes, adapt the query by replacing the <http://dbpedia.org/ontology/CLASS> URI. Note that 2018 was a development phase with some failed extractions. The stats were generated with the Databus VOID Mod.

Links. Linked Data cross-references between decentralized datasets are the foundation and access point to the Linked Data Web. The latest Snapshot Release provides over 61.7 million links from 6.67 million entities to 178 external sources.

Top 11

6,672,052 global.dbpedia.org
5,380,836 www.wikidata.org
4,308,772 yago-knowledge.org
2,561,963 viaf.org
1,989,632 fr.dbpedia.org
1,851,182 de.dbpedia.org
1,563,230 it.dbpedia.org
1,495,866 es.dbpedia.org
1,283,672 pl.dbpedia.org
1,274,002 ru.dbpedia.org
1,203,247 d-nb.info

Top 10 without DBpedia namespaces

5,380,836 www.wikidata.org
4,308,772 yago-knowledge.org
2,561,963 viaf.org
1,203,247 d-nb.info
596,134 umbel.org
559,821 sws.geonames.org
431,830 data.bibliotheken.nl
430,839 www.w3.org (wn20)
301,483 musicbrainz.org
104,433 linkedgeodata.org

DBpedia Extraction Dumps on the Databus

All extracted files are reachable via the DBpedia account on the Databus. The Databus has two main structures:

In groups, artifacts, version, similar to software in Maven: https://databus.dbpedia.org/$user/$group/$artifact/$version/$filename
In Databus collections that are SPARQL queries over https://databus.dbpedia.org/yasui filtering and selecting files from these artifacts.
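
The Maven-like file layout above can be sketched as a pair of small helpers that assemble and take apart such a URL. This is only an illustration of the path pattern, not part of any Databus tooling; the function names are hypothetical, and any artifact, version or file names passed in are up to the caller.

```python
from urllib.parse import quote, urlparse

DATABUS = "https://databus.dbpedia.org"
FIELDS = ("user", "group", "artifact", "version", "filename")


def file_url(user: str, group: str, artifact: str, version: str, filename: str) -> str:
    """Build $user/$group/$artifact/$version/$filename below the Databus root."""
    parts = (user, group, artifact, version, filename)
    return "/".join([DATABUS] + [quote(part, safe="") for part in parts])


def split_file_url(url: str) -> dict:
    """Split a Databus file URL back into its five named path segments."""
    segments = urlparse(url).path.strip("/").split("/")
    if len(segments) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} path segments, got {len(segments)}")
    return dict(zip(FIELDS, segments))
```

For example, `split_file_url(file_url("dbpedia", "mappings", "some-artifact", "some-version", "some-file.ttl.bz2"))` returns the five components again, where the last three values are placeholders rather than real Databus names.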

Snapshot Download. For downloading DBpedia Snapshot, we prepared this collection, which also includes detailed release notes:

https://databus.dbpedia.org/dbpedia/collections/dbpedia-snapshot-2021-06

The collection is roughly equivalent to http://downloads.dbpedia.org/2016-10/core/

Collections can be downloaded in many different ways; download modalities such as a bash script, SPARQL and a plain URL list are found in the tabs at the collection. Files are provided as bzip2-compressed N-Triples files. In case you need a different format or compression, you can also use the “Download-As” function of the Databus Client (GitHub), e.g. -s$collection -c gzip would download the collection and convert it to GZIP during download.
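
Once a bzip2-compressed N-Triples file is downloaded, it can be streamed line by line without unpacking it first. The following is a minimal sketch that counts owl:sameAs link targets per host, similar in spirit to the link statistics above. It is not the method used to produce those statistics, it uses a simplified regular expression instead of a full N-Triples parser, and the sample triples are made up for illustration.

```python
import bz2
import io
import re
from collections import Counter
from urllib.parse import urlparse

SAME_AS = "<http://www.w3.org/2002/07/owl#sameAs>"
# Simplification: matches only triples whose subject, predicate and object are IRIs.
IRI_TRIPLE = re.compile(r"^<([^>]*)>\s+(<[^>]*>)\s+<([^>]*)>\s*\.\s*$")


def count_link_hosts(lines) -> Counter:
    """Count owl:sameAs objects per target host in an iterable of N-Triples lines."""
    hosts = Counter()
    for line in lines:
        match = IRI_TRIPLE.match(line)
        if match and match.group(2) == SAME_AS:
            hosts[urlparse(match.group(3)).netloc] += 1
    return hosts


# Made-up sample data, compressed in memory to stand in for a downloaded file.
sample = (
    "<http://dbpedia.org/resource/A> <http://www.w3.org/2002/07/owl#sameAs> <http://www.wikidata.org/entity/Q0> .\n"
    "<http://dbpedia.org/resource/B> <http://www.w3.org/2002/07/owl#sameAs> <http://www.wikidata.org/entity/Q00> .\n"
    "<http://dbpedia.org/resource/A> <http://www.w3.org/2002/07/owl#sameAs> <http://viaf.org/viaf/0> .\n"
    '<http://dbpedia.org/resource/A> <http://www.w3.org/2000/01/rdf-schema#label> "A"@en .\n'
)
compressed = bz2.compress(sample.encode("utf-8"))

# For a real file, pass its path to bz2.open instead of the BytesIO object.
with bz2.open(io.BytesIO(compressed), mode="rt", encoding="utf-8") as handle:
    hosts = count_link_hosts(handle)

for host, count in hosts.most_common():
    print(count, host)
```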

Replicating DBpedia Snapshot on your server can be done via Docker, see https://hub.docker.com/r/dbpedia/virtuoso-sparql-endpoint-quickstart

git clone https://github.com/dbpedia/virtuoso-sparql-endpoint-quickstart.git

cd virtuoso-sparql-endpoint-quickstart

COLLECTION_URI=https://databus.dbpedia.org/dbpedia/collections/dbpedia-snapshot-2021-06 VIRTUOSO_ADMIN_PASSWD=password docker-compose up

Download files from the whole DBpedia extraction. The whole extraction consists of approx. 20 billion triples and 5000 files created from 140 languages of Wikipedia, Commons and Wikidata. They can be found in https://databus.dbpedia.org/dbpedia/(generic|mappings|text|wikidata)

Acknowledgements

First and foremost, we would like to thank our open community of knowledge engineers for finding & fixing bugs and for supporting us by writing data tests. We would also like to acknowledge the DBpedia Association members for constantly innovating the areas of knowledge graphs and linked data and pushing the DBpedia initiative with their know-how and advice. OpenLink Software supports DBpedia by hosting SPARQL and Linked Data; University Mannheim, the German National Library of Science and Technology (TIB) and the Computer Center of University Leipzig provide persistent backups and servers for extracting data. We thank Marvin Hofer and Mykola Medynskyi for the technical preparation, testing and execution of this release. This work was partially supported by grants from the Federal Ministry for Economic Affairs and Energy of Germany (BMWi) for the LOD-GEOSS Project (03EI1005E), as well as for the PLASS Project (01MD19003D).

The post Announcement: DBpedia Snapshot 2021-06 Release appeared first on DBpedia Association.