New DBpedia Usage Report for July 2017 through January 2021
https://www.dbpedia.org/blog/new-dbpedia-usage-report-2021/
Fri, 13 Aug 2021

Our partner OpenLink Software recently published a new DBpedia usage report on the SPARQL endpoint and associated Linked Data deployment.

Copyright © 2021 OpenLink Software

Introduction

This document presents some of the statistics from the DBpedia 2016-10 dataset, collected between July 2017 and January 2021, spanning more than three and a half years of logs from the DBpedia web service operated by OpenLink Software at http://dbpedia.org/sparql/ .

The log files used to prepare this document cover the DBpedia 2016-10 release.

Infrastructure

The DBpedia service consists of:

  • two or more Virtuoso Universal Server instances, facilitating Linked Data deployment, including a SPARQL endpoint that delivers RDF data in a variety of document formats subject to content negotiation
  • a reverse proxy server, which redirects client requests to an available Virtuoso instance and caches the results in case another client repeats the same request within a specified timeframe
  • a physical computer, hosted in OpenLink Software’s datacenter

Currently, the DBpedia service is hosted on two virtual machines running CentOS 6, each with 8 Intel Xeon E5-2630 2.30 GHz cores, a 200 GB SSD, and 64 GB of memory, hosting Virtuoso 7.2 Enterprise Edition with the Column Store Module.

Rate and Connection limits

To maintain equitable access to the DBpedia service for everyone, OpenLink Software limits both the request rate and the number of concurrent connections per client, which limits the disruption caused by faulty or misbehaving applications.

Current limit rates are:

  • Connection limit of 50 parallel connections per IP address. This number is fairly high to permit multiple clients in networks using Network Address Translation (NAT), which appear to come from a single IP address. Without the use of tracking cookies, it is impossible to distinguish between machines inside a NAT network, and for privacy and legal reasons, OpenLink Software has decided not to use such cookies at this point in time.
  • Rate limit of 100 requests per second per IP address, with an initial burst of 120 requests.

As part of monitoring the DBpedia service, OpenLink Software performs frequent traffic analysis to make sure the service is running smoothly.

Ideally, applications should check the HTTP status code of each request and, in case of a 503 (Service Unavailable) or 429 (Too Many Requests) code, sleep for 1–2 seconds before retrying the request.
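
A minimal sketch of such a client in Python, using the requests library; the query and the retry budget are illustrative:

import time
import requests

ENDPOINT = "https://dbpedia.org/sparql"

def sparql_get(query, retries=5):
    """Run a SPARQL query, sleeping briefly on 429/503 before retrying."""
    for attempt in range(retries):
        resp = requests.get(
            ENDPOINT,
            params={"query": query, "format": "application/sparql-results+json"},
            timeout=130,  # a little above the server-side 120-second query timeout
        )
        if resp.status_code in (429, 503):
            time.sleep(2)  # the 1-2 second sleep recommended above
            continue
        resp.raise_for_status()
        return resp.json()
    raise RuntimeError("still rate-limited after %d attempts" % retries)

rows = sparql_get("SELECT ?s WHERE { ?s a <http://dbpedia.org/ontology/City> } LIMIT 5")
print(rows["results"]["bindings"])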

OpenLink Software may alter these parameters at any time to make sure the service remains reachable to the general public.

In case of misuse, OpenLink Software may temporarily block an offender’s IP address from accessing the DBpedia service. This temporary ban will be automatically lifted once such a blocked IP address refrains from making any request to the DBpedia service for at least 5 minutes.

Configured Virtuoso limits on the DBpedia endpoint

The Virtuoso configuration for the DBpedia endpoint includes:

  • Query Execution Timeout of 120 seconds. This is the query solution preparation threshold. If the timeout stops execution before the solution is complete — i.e., if the solution is partial — this is indicated to the query client via HTTP response headers.
  • Maximum SPARQL query solution (aka result set) size of 10,000 rows. This is the maximum number of solution rows (for SELECT queries) or triple/quad statements (for CONSTRUCT or DESCRIBE queries) returned per query-solution-retrieval round-trip.

Virtuoso “Anytime Query” Functionality

The Anytime Query is a core feature of Virtuoso that addresses the challenges inherent in providing a publicly accessible interface for ad-hoc querying at Web scale. It allows a SPARQL- and HTTP-protocol-compliant application to issue long-running and/or large-solution queries whose complete solutions would exceed the configured query timeout and/or result set limits; rather than being rebuffed with no solution, the application receives a partial solution conforming to those thresholds. Further, this feature enables the use of LIMIT and OFFSET (typically combined with ORDER BY and/or GROUP BY) to create windows (also known as sliding windows or cursors) to iterate through the complete query solution without being adversely affected by inserts or deletions.

Note: Even while paging through a partial query solution, Virtuoso continues to work towards a complete solution in the background.
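
As a sketch, this is what the windowing idiom looks like in practice, reusing the sparql_get helper from the earlier sketch; the page size matches the configured 10,000-row maximum:

def fetch_all(where_clause, page_size=10000):
    """Page through a complete query solution with ORDER BY/LIMIT/OFFSET windows."""
    offset = 0
    while True:
        query = ("SELECT ?s WHERE { %s } ORDER BY ?s LIMIT %d OFFSET %d"
                 % (where_clause, page_size, offset))
        rows = sparql_get(query)["results"]["bindings"]
        if not rows:
            break  # an empty window means the full solution has been consumed
        yield from rows
        offset += page_size

cities = list(fetch_all("?s a <http://dbpedia.org/ontology/City>"))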

Custom HTTP headers

As the W3C SPARQL standard does not currently specify an authoritative status code or response header for reporting a partial result set, OpenLink Software has opted to have Virtuoso return a status code of 200 to denote a successful request, and to add a custom header indicating that the result was limited to what could be returned within the settings enforced by the server.

If full execution of the query would return more than the configured maximum number of rows, the X-SPARQL-MaxRows header is added, as shown below:

HTTP/1.1 200 OK
Date: Tue, 1 Jan 2018 12:00:00 GMT
Content-Type: text/html; charset=UTF-8
Content-Length: 1427536
Connection: keep-alive
Vary: Accept-Encoding
Server: Virtuoso/07.20.3224 (Linux) i686-generic-linux-glibc212-64 VDB
X-SPARQL-default-graph: http://dbpedia.org
X-SPARQL-MaxRows: 10000
Expires: Tue, 07 Jan 2018 12:00:00 GMT
Cache-Control: max-age=604800
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true
Access-Control-Allow-Methods: HEAD, GET, POST, OPTIONS
Access-Control-Allow-Headers: DNT,X-CustomHeader,Keep-Alive,User-Agent,X-Requested-With,If-Modified-Since,Cache-Control,Content-Type,Accept-Encoding
Accept-Ranges: bytes

If the Anytime Query timeout is reached, several headers are added:

HTTP/1.1 200 OK
Date: Tue, 01 Jan 2018 12:00:00 GMT
Content-Type: text/html; charset=UTF-8
Content-Length: 80
Connection: keep-alive
Server: Virtuoso/07.20.3224 (Linux) i686-generic-linux-glibc212-64 VDB
X-SPARQL-default-graph: http://dbpedia.org
X-SQL-State: S1TAT
X-SQL-Message: RC...: Returning incomplete results, query interrupted by result timeout. Activity: 7 rnd 64.87M seq 0 same seg 1 same pg 0 same par 0 disk 0 spec disk 0B / 0 mess
X-Exec-Milliseconds: 30000
X-Exec-DB-Activity: 7 rnd 64.87M seq 0 same seg 1 same pg 0 same par 0 disk 0 spec disk 0B / 0 messages 0 fork
Expires: Tue, 07 Jan 2018 12:00:00 GMT
Cache-Control: max-age=604800
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true
Access-Control-Allow-Methods: HEAD, GET, POST, OPTIONS
Access-Control-Allow-Headers: DNT,X-CustomHeader,Keep-Alive,User-Agent,X-Requested-With,If-Modified-Since,Cache-Control,Content-Type,Accept-Encoding
Accept-Ranges: bytes
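
Given these conventions, a client can detect both kinds of partial results by inspecting the response headers; a small sketch:

import requests

query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o }"  # deliberately broad: will hit the limits
resp = requests.get(
    "https://dbpedia.org/sparql",
    params={"query": query, "format": "application/sparql-results+json"},
)
if "X-SPARQL-MaxRows" in resp.headers:
    print("Solution truncated to", resp.headers["X-SPARQL-MaxRows"], "rows")
if resp.headers.get("X-SQL-State") == "S1TAT":
    print("Partial solution: interrupted by the Anytime Query timeout after",
          resp.headers.get("X-Exec-Milliseconds"), "ms")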

Hosting Independent DBpedia Instances

The restrictions described above may impair some complex analytical queries. Users who frequently run into these limits are advised to host their own independent DBpedia instance, for example by loading the DBpedia datasets into a local Virtuoso installation, where these limits can be configured freely.

HTTP logs

The HTTP server log files used in this report exclude traffic generated by:

  • IP addresses that were temporarily rate-limited after their burst period
  • IP addresses that were banned after misuse
  • applications, spiders, and other crawlers that were blocked after frequently hitting the rate limiter or that generally claimed too many resources

The system uses a combination of firewall rules and Access Control Lists (ACLs) to quickly drop such connections, so legitimate users of the DBpedia service can continue to connect and execute queries.

To save time, these dropped connections are not recorded in the log files.

The data for this document was extracted from reports generated by Webalizer v2.21.

HTTP Usage Historical Overview

The first table shows the average numbers of Visits and Hits per day during the time each DBpedia dataset was live on the http://dbpedia.org/sparql endpoint.

DBpedia | From       | Until      | Days | Visits per day | Hits per day | Total Hits
--------|------------|------------|------|----------------|--------------|--------------
3.3     | 2009-06-30 | 2009-11-05 |  128 |          9,602 |      733,811 |    94,661,592
3.4     | 2009-11-06 | 2010-04-07 |  152 |         11,100 |    1,212,549 |   185,519,930
3.5     | 2010-04-08 | 2011-01-17 |  284 |         16,381 |    1,122,612 |   282,898,279
3.6     | 2011-01-18 | 2011-06-30 |  163 |         19,288 |    1,328,355 |   219,178,587
3.7     | 2011-07-01 | 2012-06-19 |  354 |         23,408 |    2,052,660 |   594,338,675
3.8     | 2012-06-20 | 2013-09-19 |  456 |         16,614 |    2,925,335 |   570,440,410
3.9     | 2013-09-20 | 2014-09-02 |  347 |         22,026 |    3,035,428 | 1,062,399,840
2014    | 2014-09-03 | 2015-07-05 |  305 |         27,927 |    3,423,490 | 1,051,011,401
2015-04 | 2015-07-06 | 2016-03-31 |  269 |         24,689 |    3,516,936 |   953,089,788
2015-10 | 2016-04-01 | 2016-10-13 |  195 |        110,745 |    6,581,217 | 1,263,593,686
2016-04 | 2016-10-14 | 2017-07-03 |  262 |        231,735 |    7,646,447 | 2,003,369,014
2016-10 | 2017-07-04 | 2021-01-07 | 1283 |        257,994 |    7,542,623 | 9,501,427,081

For detailed information on the specific usage numbers, please visit the original report published by OpenLink Software. Older reports are also available through their site, and you can read the previous usage report (2020) on the DBpedia blog.

Further Links

For the latest news, subscribe to the DBpedia Newsletter, check our DBpedia Website and follow us on Twitter or LinkedIn.

Thanks for reading and keep using DBpedia!

Yours DBpedia Association

New DBpedia Usage Report
https://www.dbpedia.org/blog/dbpedia-usage-report/
Fri, 30 Oct 2020

Our partner OpenLink Software recently published a new DBpedia usage report on the SPARQL endpoint and associated Linked Data deployment.

Copyright © 2020 OpenLink Software

Introduction

Just recently, DBpedia Association member and hosting specialist OpenLink released the DBpedia Usage Report, a periodic report on the DBpedia SPARQL endpoint and associated Linked Data deployment.

The report not only gives some historical insight into DBpedia’s usage and its numbers of visits and hits per day, but in particular shows statistics collected between July 2017 and September 2020, spanning more than 3 years of logs from the DBpedia web service operated by our partner OpenLink Software at http://dbpedia.org/sparql/.

Before we highlight a few aspects of DBpedia’s usage, we would like to thank OpenLink for continuously hosting the DBpedia endpoint and for creating this report.

DBpedia Usage Report: Historical Overview

The first table shows the average numbers of Visits and Hits per day during the time each DBpedia dataset was live on the http://dbpedia.org/sparql endpoint. As with the hits, we also see a huge increase in visits coinciding with the DBpedia 2015-10 release on April 1st, 2016.

Historical overview of visits and hits per day over the course of the last 10 years.

This boost was attributed to intensive promotion of DBpedia via community meetings and exchange with various partners in the Linked Data community. In addition, our social media activity in the community increased backlinks. Since then, not only have the numbers of hits risen, but DBpedia has also provided better data quality. We are constantly working on improving the accessibility, data quality and stability of the SPARQL endpoint.

Kudos to OpenLink for maintaining the technical baseline for DBpedia.

The next graph shows the percentage of the total number of hits in a given time period that can be attributed to the /sparql endpoint. Looking at the historical data from 2014-09 onward, requests to /sparql accounted for about 60.16% of the total number of hits.

DBpedia Usage Report: Current Statistics

If we focus on the last 12 months, we see a slightly lower average of 48.10%, as shown in the graph below. This means that around 50% of the traffic uses Linked Data constructs to view the information available through DBpedia. To put this into perspective: of the average 7.2 million hits to the endpoint on a given day, about 3.6 million are Linked Data Deployment hits.

The following table shows the information on visits, sites and hits for each month between September 2019 and September 2020.

Statistical overview of the last 12 months.

For detailed information on the specific usage numbers, please visit the original report published by OpenLink. Older reports are also available through their site.

Further Links

For the latest news, subscribe to the DBpedia Newsletter, check our DBpedia Website and follow us on Twitter, Facebook, and LinkedIn.

Thanks for reading and keep using DBpedia!

Yours DBpedia Association

New Prototype: Databus Collection Feature
https://www.dbpedia.org/blog/databus-collections-feature/
Thu, 14 Nov 2019

We are thrilled to announce that our Databus Collection Feature for the DBpedia Databus has been developed and is now available as a prototype. It simplifies the way you bundle your data and use it in your applications.

A new Databus Collection Feature? How come, and how does it work? Read below and find out how using the DBpedia Databus becomes easier by the day and with each new tool.

Motivation

With more and more data being uploaded to the Databus, we started to develop test applications using that data. The SPARQL endpoint offers a central hub to access all metadata for datasets uploaded to the Databus, provided you know how to write SPARQL queries. The metadata includes the download links of the data files; it was therefore possible to pass a SPARQL query to an application, download the actual data and then use it for whatever purpose the app had.
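
For illustration, a hedged sketch of that early pattern in Python; the endpoint URL, the metadata properties and the artifact IRI are assumptions based on the Databus deployment at the time, not guaranteed specifics:

import requests

DATABUS_SPARQL = "https://databus.dbpedia.org/repo/sparql"  # assumed endpoint URL

# Hypothetical metadata query: list the download URLs of one artifact's files.
QUERY = """
PREFIX dataid: <http://dataid.dbpedia.org/ns/core#>
PREFIX dcat:   <http://www.w3.org/ns/dcat#>
SELECT DISTINCT ?file WHERE {
  ?dataset dataid:artifact <https://databus.dbpedia.org/dbpedia/mappings/mappingbased-literals> ;
           dcat:distribution ?dist .
  ?dist dcat:downloadURL ?file .
}
"""

resp = requests.get(DATABUS_SPARQL,
                    params={"query": QUERY, "format": "application/sparql-results+json"})
for row in resp.json()["results"]["bindings"]:
    print(row["file"]["value"])  # hand these URLs to your application's downloader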

The Databus Collection Editor

The DBpedia Databus now provides an editor for collections. A collection is essentially a labelled SPARQL query that is retrievable via a URI. With the collection editor, you can bundle Databus groups and artifacts and publish your selection using your Databus account. It is now a breeze to select the data you need, share the exact selection with others, and/or use it in existing or self-made applications.

If you are not familiar with SPARQL and data queries, you can think of the feature as a shopping cart for data: You create a new cart, put data in it and tell your friends or applications where to find it. Quite neat, right?

In the following section, we will cover the user interface of the collection editor.

The Editor UI

Firstly, you can find the collection editor by going to the DBpedia Databus and following the Collections link at the top or you can get there directly by clicking here.

What you will see is the following:

General Collection Info

Secondly, since you do not have any collections yet, the editor has already created an empty collection named “Unnamed” for you. On the right side, next to the label and description, you will find a pen icon. By clicking the icon or the label itself, you can edit its content. The collection has not been published yet, so the Collection URI is blank.

Whenever you are not logged in or the collection has not been published yet, the editor will also notify you that your changes are only saved in your local browser cache and NOT remotely on our server. Keep that in mind when clearing your cache. Publishing the collection, however, is easy: simply log into (or create) your Databus account and hit the publish button in the action bar. This will open a modal where you can pick your unique collection id and hit publish again. That’s it!

The Collection Info section will now show the collection URI. Following the link will take you to the HTML representation of your collection that will be visible to others. Hitting the Edit button in the action bar will bring you back to the editor.

Collection Hierarchy

Let’s have a look at the core piece of the collection editor: the hierarchy view. A collection can be a bundle of different Databus groups and artifacts, but is not limited to that. If you know how to write a SPARQL query, you can easily extend your collection with more powerful selections. The hierarchy is therefore split into two nodes:

  • Generated Queries: Contains all queries that are generated from your selection in the UI
  • Custom Queries: Contains all custom written SPARQL queries

Both hierarchy nodes have a “+” icon. Clicking this button lets you add generated or custom queries, respectively.

Custom Queries

If you hit the “+” icon on the Custom Queries node, a new node called “Custom Query” will appear in the hierarchy. You can remove a custom query by clicking on the trashcan icon in the hierarchy. If you click the node it will take you to a SPARQL input field where you can edit the query.

To make your collection more understandable for others, you can even document the query by adding a label and description.

Writing Your Own Custom Queries

A collection query is a SPARQL query of the form:

SELECT DISTINCT ?file WHERE {
    {
        [SUBQUERY]
    }
    UNION
    {
        [SUBQUERY]
    }
    UNION
    ...
    UNION
    {
        [SUBQUERY]
    }
}

All selections made by generated and custom queries are joined into a single result set with a single column called “file”. It is therefore important that your custom query also binds its results to a variable called “file”.
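
For example, a hypothetical [SUBQUERY] body could select files by matching their download URL; the property IRI and the filter are invented for illustration, and the essential part is the binding to ?file:

# One hypothetical [SUBQUERY] body for the template above; spliced into the
# UNION blocks, it must bind its selection to ?file, like every other subquery.
SUBQUERY = """
SELECT DISTINCT ?file WHERE {
  ?dist <http://www.w3.org/ns/dcat#downloadURL> ?file .
  FILTER(CONTAINS(STR(?file), "mappingbased"))
}
"""

# Splice the subquery into the collection-query template from above.
collection_query = "SELECT DISTINCT ?file WHERE {\n  {\n%s  }\n}" % SUBQUERY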

Generated Queries

Clicking the “+” icon on the Generated Queries node will take you to a search field. Make use of the indexed search on the Databus to find and add the groups and artifacts you need. If you want to refine your search, don’t worry: you can do that in the next step!

Once the artifact or group has been added to your collection, the Add to Collection button will turn green. When you are done, you can return to the editor with the Back to Hierarchy button.

Your hierarchy will now contain several new nodes.

Group Facets, Artifact Facets and Overrides

Groups and artifacts that have been added to the collection show up as nodes in the hierarchy. Clicking a node opens a filter where you can refine your dataset selection. Setting a filter on a group node applies it to all its artifact nodes unless you override that setting in an artifact node manually. The filter set in the group node is shown in the artifact facets in dark grey; any overrides in the artifact facets are highlighted in green:

Group Nodes

A group node will provide a list of filters that will be applied to all artifacts of that group:

Artifact Nodes

Artifact nodes will then actually select data files which will be visible in the faceted view. The facets are generated dynamically from the available variants declared in the metadata.

Example: here we selected the latest version of the Databus dump as N-Triples. This collection is already in use: the collection URI is passed to the new generic lookup application, which then provides the search function for the Databus website. If you are interested in how to configure the lookup application, see https://github.com/dbpedia/lookup-application. There will also be another blog post about the lookup within the next few weeks.

Use Cases

The DBpedia Databus Collections are useful in many ways.

  • You can share a specific dataset with your community or colleagues.
  • You can re-use datasets others created.
  • You can plug collections into Databus-ready applications and avoid spending time on the download and setup process.
  • You can point to a specific piece of data (e.g. for testing) with a single URI in your publications.
  • You can help others to create data queries more easily.

We hope you enjoy the Databus Collection Feature and we would love to hear your feedback! You can leave your thoughts and suggestions in the new DBpedia Forum. Feedback of any kind is highly appreciated, since we want to improve the prototype as quickly and as user-driven as possible! Cheers!

A big thanks goes to DBpedia developer Jan Forberg who finalized the Databus Collection Feature and compiled this text.

Yours

DBpedia Association

More than 50 DBpedia enthusiasts joined the Community Meeting in Karlsruhe
https://www.dbpedia.org/blog/community-meeting-in-karlsruhe/
Thu, 19 Sep 2019

SEMANTiCS is THE leading European conference in the field of semantic technologies and the platform for professionals who make semantic computing work, understand its benefits and know its limitations.

In the following, we give you a brief retrospective of the presentations.

Opening Session

Katja Hose – “Querying the web of data”

… on the search for the killer app.

The concept of Linked Open Data and the promise of the Web of Data have been around for over a decade now. Yet the great potential of free access to a broad range of data that these technologies offer has not yet been fully exploited. This talk will therefore review the current state of the art, highlight the main challenges from a query processing perspective, and sketch potential ways to solve them. Slides are available here.

Dan Weitzner – “timbr-DBpedia – Exploration and Query of DBpedia in SQL”

The timbr SQL Semantic Knowledge Platform enables the creation of virtual knowledge graphs in SQL. The DBpedia version of timbr supports querying DBpedia in SQL and seamless integration of DBpedia data into data warehouses and data lakes. We already published a detailed blog post about timbr where you can find all relevant information about this amazing new DBpedia service.

Showcase Session

Maribel Acosta – “A closer look at the changing dynamics of DBpedia mappings”

Her presentation looked at the mappings wiki and how different language chapters use and edit it. Slides are available here.

Mariano Rico – “Polishing a diamond: techniques and results to enhance the quality of DBpedia data”

DBpedia is more than a source for writing papers. It is also used by companies as a remarkable data source. This talk focused on how we can detect errors and improve the data, from the perspective of both academic researchers and private companies. We show the case for the Spanish DBpedia (the second-largest DBpedia after the English chapter) through a set of techniques, paying attention to results and further work. Slides are available here.

Guillermo Vega-Gorgojo – “Clover Quiz: exploiting DBpedia to create a mobile trivia game”

Clover Quiz is a turn-based multiplayer trivia game for Android devices with more than 200K multiple-choice questions (in English and Spanish) about different domains, generated out of DBpedia. Questions are created off-line through a data extraction pipeline and a versatile template-based mechanism. A back-end server manages the question set and the associated images, while a mobile app has been developed and released on Google Play. The game is available free of charge and has been downloaded by more than 10K users, who have answered more than 1M questions. Clover Quiz thus demonstrates the advantages of semantic technologies for collecting data and automating the generation of multiple-choice questions in a scalable way. Slides are available here.
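
As a toy illustration of template-based question generation out of DBpedia (this is not Clover Quiz’s actual pipeline; the query and the question template are invented for the example):

import random
import requests

QUERY = """
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?country ?capital WHERE {
  ?c dbo:capital ?cap ; rdfs:label ?country .
  ?cap rdfs:label ?capital .
  FILTER(LANG(?country) = "en" && LANG(?capital) = "en")
} LIMIT 100
"""

rows = requests.get("https://dbpedia.org/sparql",
                    params={"query": QUERY, "format": "application/sparql-results+json"}
                    ).json()["results"]["bindings"]
pairs = [(r["country"]["value"], r["capital"]["value"]) for r in rows]

country, answer = random.choice(pairs)                                # pick one fact
distractors = random.sample([c for _, c in pairs if c != answer], 3)  # wrong options
print("What is the capital of %s?" % country, sorted(distractors + [answer]))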

Fabian Hoppe and Tabea Tiez – “The Return of German DBpedia”

Fabian and Tabea will present the latest news on the German DBpedia chapter as it returns to the language chapter family after an extended offline period. They will talk about the data set, discuss a few challenges along the way and give insights into future perspectives of the German chapter. Slides are available here.

Wlodzimierz Lewoniewski and Krzysztof Węcel – “References extraction from Wikipedia infoboxes”

In Wikipedia’s infoboxes, some facts have references, which can be useful for checking the reliability of the provided data. We present challenges and methods connected with the extraction of metadata about Wikipedia’s sources. We used the DBpedia Extraction Framework along with our own extensions in Python to provide statistics about citations in 10 language versions. The provided methods can be used to verify and synchronize facts depending on the quality assessment of sources. Slides are available here.

In his talk, Wlodzimierz also gave insight into the process of extracting references for Wikipedia infoboxes, which we will use in our GFS project.

Afternoon Session

Sebastian Hellmann, Johannes Frey, Marvin Hofer – “The DBpedia Databus – How to build a DBpedia for each of your Use Cases”

The DBpedia Databus is a platform that is intended for data consumers. It will enable users to build an automated DBpedia-style Knowledge Graph for any data they need. The big benefit is that users not only have access to data, but are also encouraged to apply improvements and, therefore, will enhance the data source and benefit other consumers. We want to use this session to officially introduce the Databus, which is currently in beta, and demonstrate its power as a central platform that captures decentrally created client-side value by consumers.

We will give insight on how the new monthly DBpedia releases are built and validated to copy and adapt for your use cases. Slides are available here.

Interactive session, moderator: Sebastian Hellmann – “DBpedia Connect & DBpedia Commerce – Discussing the new Strategy of DBpedia”

In order to keep growing and improving, DBpedia has been undergoing a growth hack for the last couple of months. As part of this process, we developed two new subdivisions of DBpedia: DBpedia Connect and DBpedia Commerce. The former is a low-code platform to interconnect your public or private databus data with the unified, global DBpedia graph and export the interconnected and enriched knowledge graph into your infrastructure. DBpedia Commerce is an access and payment platform to transform Linked Data into a networked data economy. It will allow DBpedia to offer any data, mod, application or service on the market. During this session, we will provide more insight into these as well as an overview of how DBpedia users can best utilize them. Slides are available here.

In case you missed the event, all slides and presentations are also available on our website. Further insights, feedback and photos about the event are available on Twitter via #DBpediaDay.

We are now looking forward to more DBpedia meetings next year. So, stay tuned and check Twitter, Facebook and the Website or subscribe to our Newsletter for the latest news and information.

If you want to organize a DBpedia Community meeting yourself, just get in touch with us via dbpedia@infai.org regarding program and organization.

Yours

DBpedia Association

DBpedia Live Restart – Getting Things Done
https://www.dbpedia.org/blog/dbpedia-live-restart-getting-things-done/
Thu, 01 Aug 2019

Part VI of the DBpedia Growth Hack series (View all)

DBpedia Live is a long-term core project of DBpedia that immediately extracts fresh triples from all changed Wikipedia articles. After a long hiatus, fresh and live updated data is available once again, thanks to our former co-worker Lena Schindler, whose work we feature in this blog post. Before we dive into Lena’s report, let’s have a look at some general info about DBpedia Live:

Live Enterprise Version

OpenLink Software provides a scalable, dedicated, live Virtuoso instance, built on Lena’s remastering. Kingsley Idehen announced the dedicated business service in our new DBpedia forum.
On the Databus, we collect publicly shared and business-ready dedicated services in the same place where you can download the data. The Databus allows you to download the data, build a service, and offer that service, all in one place. Data uploaders can also see who builds something with their data.

Remastering the DBpedia Live Module

Contribution by Lena Schindler

After developing the DBpedia REST API as part of a student project in 2018, I worked as a student research assistant for DBpedia. My task was to analyze and patch severe issues in the DBpedia Live instance. I will briefly describe the purpose of DBpedia Live, the reasons it went out of service, what I did to fix these issues, and finally the changes needed to support multi-language abstract extraction.


Overview

The DBpedia Extraction Framework is Scala-based software with numerous features that have evolved around extracting knowledge (as RDF) from Wikis. One part is the DBpedia Live module in the “live-deployed” branch, which is intended to provide a continuously updated version of DBpedia by processing Wikipedia pages on demand, immediately after they have been modified by a user. The backbone of this module is a queue that is filled with recently edited Wikipedia pages, combined with a relational database, called Live Cache, that handles the diff between two consecutive versions of a page. The module that fills the queue, called Feeder, needs some kind of connection to a Wiki instance that reports changes to a Wiki Page. The processing then takes place in four steps: 

  1. A wiki page is taken out of the queue. 
  2. Triples are extracted from the page, with a given set of extractors. 
  3. The new triples from the page are compared to the old triples from the Live Cache.
  4. The triple sets that have been deleted and added are published as text files, and the Cache is updated (see the sketch below).
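
A minimal sketch of the diff performed in steps 3 and 4 (the real module computes this against the Live Cache database; the set representation and names here are illustrative):

def diff_triples(old_triples, new_triples):
    """Steps 3-4: compare fresh triples with the cached ones and split the diff."""
    old_set, new_set = set(old_triples), set(new_triples)
    added = new_set - old_set    # publish as the 'added' change-set file
    deleted = old_set - new_set  # publish as the 'removed' change-set file
    return added, deleted

old = {("dbr:Berlin", "dbo:populationTotal", '"3500000"')}
new = {("dbr:Berlin", "dbo:populationTotal", '"3645000"')}
added, deleted = diff_triples(old, new)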

Background

DBpedia Live had been out of service since May 2018, due to the termination of the Wikimedia RCStream service, on which the old DBpedia Live Feeder module relied. This socket-based service provided information about changes to an existing Wikimedia instance and was replaced by the EventStreams service, which runs over a single HTTP connection using chunked transfer encoding and follows the Server-Sent Events (SSE) protocol. It provides a stream of events, each of which contains information about the title, id, language, author, and time of every page edit across all Wikimedia instances.
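
Because EventStreams is plain HTTP following the SSE protocol, it can be consumed from any SSE client, not only Akka. A sketch in Python with the sseclient package (the stream URL is Wikimedia’s public recentchange stream; the field names follow its event schema):

import json
from sseclient import SSEClient  # pip install sseclient

STREAM = "https://stream.wikimedia.org/v2/stream/recentchange"

for event in SSEClient(STREAM):
    if event.event != "message" or not event.data:
        continue  # skip keep-alive and comment frames
    change = json.loads(event.data)
    # A Feeder would enqueue matching page edits instead of printing them.
    if change.get("server_name") == "en.wikipedia.org" and change.get("type") == "edit":
        print(change["title"], change["timestamp"])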

Fix

Starting in September 2018, my first task was to implement a new Feeder for DBpedia Live based on this new Wikimedia EventStreams service. For the Java world, the Akka framework provides an implementation of an SSE client. Akka is a toolkit developed by Lightbend that simplifies the construction of concurrent and distributed JVM applications, enabling access from both Java and Scala. The Akka SSE client and the Akka Streams module are used in the new EventStreamsFeeder (Akka Helper) to extract and process the data stream. I decided to use Scala instead of Java because it is a more natural fit for Akka.

After I was able to process events, I had the problem that frequent interruptions in the upstream connection were causing the processing stream to fail. Luckily, Akka provides a fallback mechanism with back-off, similar to the binary exponential back-off of the Ethernet protocol, which I could use to restart the stream (called a “Graph” in Akka terminology).
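
Outside Akka, the same restart-with-back-off behaviour can be sketched in a few lines (the delay values are illustrative; in the actual Feeder, Akka’s restart facilities handle this declaratively):

import random
import time

def run_with_backoff(consume_stream, min_delay=1.0, max_delay=64.0):
    """Restart a failing stream consumer with capped exponential back-off and jitter."""
    delay = min_delay
    while True:
        try:
            consume_stream()   # blocks while the connection is healthy
            delay = min_delay  # clean return: reset the back-off window
        except ConnectionError:
            time.sleep(delay + random.uniform(0, delay))  # jittered wait
            delay = min(delay * 2, max_delay)             # exponential growth, capped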

Another problem was that there were often many changes to a page within a short time interval, and if events were processed quickly enough, each change would be processed separately, putting unnecessary load on the Live instance. A simple “thread sleep” reduced the number of change-sets being published every hour from thousands to a few hundred.

Multi-language abstracts

The next task was to prepare the Live module for the extraction of abstracts (typically the first paragraph of a page, or the text before the table of contents). The extractors used for this task were re-implemented in 2017. The work turned out to be, first, a configuration issue, and second, a candidate for long debugging sessions, fixing issues in the dependencies between the “live” and “core” modules. Then, in order to allow the extraction of abstracts in multiple languages, the “live” module needed many small changes at places spread across the code base, and care had to be taken not to slow down the extraction in the single-language case compared to the performance before the change. Deployment was delayed by an issue with the remote management unit of the production server, but was accomplished by May 2019.

Summary

I also collected my knowledge of the Live module in detailed documentation, addressed to developers who want to contribute to the code. This includes an explanation of the architecture as well as installation instructions. After 400 hours of work, DBpedia Live is alive and kicking, and now supports multi-language abstract extraction. Being responsible for many aspects of Software Engineering, like development, documentation, and deployment, I was able to learn a lot about DBpedia and the Semantic Web, hone new skills in database development and administration, and expand my programming experience using Scala and Akka. 

“Thanks a lot to the whole DBpedia Team who always provided a warm and supportive environment!”

Thank you Lena, it is people like you who help DBpedia improve and develop further, and help to make data networks a reality.

Follow DBpedia on LinkedIn, Twitter or Facebook and stop by the DBpedia Forum to check out the latest discussions.

Yours DBpedia Association
