Project Archives - DBpedia Association
https://www.dbpedia.org/project/

10 years ago DBpedia decided to join the Google Summer of Code program
https://www.dbpedia.org/blog/10-years-ago-dbpedia-decided-to-join-the-google-summer-of-code-program/
Wed, 23 Jun 2021

After almost 10 years of being part of the Google Summer of Code (GSoC) program, it’s time for a recap! We started with only a couple of projects and mentors, and over the years we grew to a total of 65 successful projects, 65 students and 59 mentors. In 2021 alone, 10 projects by students from all over the world have already been accepted, and they are supervised by more than 20 mentors.

DBpedia owes a great deal of its continued growth to Google Summer of Code. We are happy to be part of the open source community and to help students hone their skills and make an impact with their talent. None of this would have been possible without the generosity and hard work of our mentors. We would like to take this opportunity to thank all mentors and students and to recognize them in this post. Here is a list of DBpedia’s GSoC projects and people from recent years. Have fun reading!

2021

Even though this season of Google Summer of Code has just begun, we want to mention the current projects, students and mentors here.

Projects, students and mentors:

• Towards a neural extraction framework by Ziwei XU. Mentors: Tommaso Soru (now at Scalpel / Liber AI), Thiago Castro Ferreira (now at Universidade Federal de Minas Gerais), Zheyuan BAI (now at Huawei)
• Neural QA Model for DBpedia by Siddhant Jain. Mentors: Tommaso Soru (now at Scalpel / Liber AI), Anand Panchbhai (now at Logy.AI)
• Lifecycle Management of DBpedia Neural QA Models by Sahan Dilshan. Mentors: Edgard Marx (now at eccenca & HTWK), Lahiru Hinguruduwa (now at CERN), Nausheen Fatma (now at Tealbook)
• DBpedia Spotlight Dashboard: an integrated statistical information tool from the Wikipedia dumps and the DBpedia Extraction Framework artifacts by José Manuel Díaz Urraco. Mentors: Said Polanco-Martagón (now at Universidad Politécnica de Victoria), Maribel Angelica Marin Castro (now at Universidad Politécnica de Victoria), Beyza Yaman (now at ADAPT Centre), Julio Hernandez (now at ADAPT Centre), Jan Forberg (now at InfAI)
• Update DBpedia Sparql for newly updated wiki resources and specifically related to pandemic, healthcare, and health AI fields by Guang Zhang. Mentors: Marvin Hofer (now at ScaDS.AI), Sebastian Hellmann (now at InfAI/DBpedia), Alexander Winter (now at InfAI)
• Social Knowledge Graph: Employing SNA measures to Knowledge Graph by Zhipeng Zhao. Mentor: Luca Virgili (now at Polytechnic University of Marche)
• Modular DBpedia Chatbot by Jayesh Desai. Mentors: Andreas Both (now at DATEV eG), Aleksandr Perevalov (Anhalt University of Applied Sciences), Ram G Athreya (now at Samsung Research America), Ricardo Usbeck (now at Hamburg University)
• DBpedia Live Neural Question Answering Chatbot by Ashutosh Kumar. Mentors: Edgard Marx (now at eccenca GmbH), Diego Moussallem (now at Globo), Thiago Castro Ferreira (now at Universidade Federal de Minas Gerais), Nausheen Fatma (now at Tealbook)
• User Centric Knowledge Engineering and Data Visualization by Karan Kharecha. Mentors: Jan Forberg (now at InfAI), Luca Virgili (now at Polytechnic University of Marche), Krishanu Konar (now at Media.net)
• Web app to generate RDF from DBpedia abstracts by Fernando Casabán Blasco. Mentor: Mariano Rico (now at UPM and Dylan Q)

2020

Oh, what a year! A new record was set: DBpedia received 45 proposals for the Google Summer of Code. For the 9th year in a row, we were part of this incredible journey of young, ambitious developers who joined us as an open source organization to work on a GSoC coding project all summer. Read an interesting post about DBpedia’s participation on the DBpedia blog.

Projects, students and mentors:

• A multilingual Neural RDF Verbalizer by Marco Sobrevilla (now at University of Sao Paulo). Mentors: Stuart Chan, Diego Moussallem (now at Globo), Thiago Castro Ferreira (now at Universidade Federal de Minas Gerais)
• A Neural QA model for DBpedia: Compositionality by Zheyuan BAI (now at Huawei). Mentors: Tommaso Soru (now at Scalpel / Liber AI), Anand Panchbhai (now at Logy.AI)
• Combine DBpedia/Databus with IPFS by Kirill Yankov (now at InfAI). Mentors: Amandeep Srivastava (now at Goldman Sachs), Sebastian Hellmann (InfAI/DBpedia)
• Dashboard for Language/National Knowledge Graphs by Karan Kharecha. Mentors: Luca Virgili (now at Polytechnic University of Marche), Jan Forberg (now at InfAI), Sebastian Hellmann (now at InfAI/DBpedia)
• DBpedia Neural Multilingual QA by Lahiru Hinguruduwa (now at CERN/DBpedia). Mentors: Edgard Marx (now at eccenca), Renato Fabbri, Akshay Jagatap (now at Amazon)
• Extending Extraction Framework with Citations, Commons and Lexeme Extractors by Mykola Medynskyi (now at InfAI). Mentors: Beyza Yaman (now at ADAPT Centre), Julio Hernandez (now at ADAPT Centre), Sebastian Hellmann (now at InfAI/DBpedia)
• RDF-to-Text using Generative Adversarial Networks by Niloy Purkait (now at MindLabs). Mentors: Diego Moussallem (now at Globo), Thiago Castro Ferreira (now at Universidade Federal de Minas Gerais), Mariana Dias da Silva

2019

And for the 8th time, DBpedians were coding as part of the GSoC program. For more information about 2019, have a look at the six projects below or check more details on the DBpedia blog.

Projects, students and mentors:

• A neural QA Model for DBpedia by Anand Panchbhai (now at Logy.AI). Mentors: Tommaso Soru (now at Scalpel / Liber AI), Aashay Singhal (now at Indeed.com), Aman Mehta (now at Scaler Academy)
• Multilingual Neural RDF Verbalizer for DBpedia by Dwaraknath Gnaneshwar (now at aixplain, inc.). Mentors: Diego Moussallem (now at Globo), Thiago Castro Ferreira (now at Universidade Federal de Minas Gerais)
• Predicate Detection using Word Embeddings for Question Answering over Linked Data by Yajing Bian. Mentors: Ram G Athreya (now at Samsung Research America), Rricha Jalota (now at Paderborn University), Ricardo Usbeck (now at Hamburg University)
• Tool to generate RDF triples from DBpedia abstract by Jayakrishna Sahit (now at Stealth Mode). Mentors: Mariano Rico (now at UPM and Dylan Q), Amandeep Srivastava (now at Goldman Sachs)
• Transformer of Attention Mechanism for Long-text QA by Stuart Chan. Mentors: Bharat (now at Brandvise: Business and Finance), Nausheen Fatma (now at Tealbook), Rricha Jalota (now at Paderborn University)
• Workflow for linking External datasets by Jaydeep Chakraborty (now at Arizona State University). Mentors: Krishanu Konar (now at Media.net), Beyza Yaman (now at ADAPT Centre), Luca Virgili (now at Polytechnic University of Marche)

2018

In 2018 we started out with six students who committed to GSoC projects. However, in the course of the summer, some dropped out or did not pass the midterm evaluation. In the end, we had three finalists who made it through the program. Read Sandra’s recap on the DBpedia blog.

Projects, students and mentors:

• A neural QA Model for DBpedia by Aman Mehta (now at Tavisca). Mentors: Tommaso Soru (now at Scalpel), Ricardo Usbeck (now at Hamburg University)
• Complex Embeddings for OOV Entities by Bharat Suri (now at Salesforce). Mentors: Tommaso Soru (now at Scalpel / Liber AI), Thiago Galery (now at Optimizely), Peng Xu (now at BorealisAI)
• Web application to detect incorrect mappings across DBpedias in different languages by Víctor Fernández. Mentors: Mariano Rico (now at UPM and Dylan Q), Dimitris Kontokostas (now at Diffbot), Nandana (now at Logy.AI)

2017

“GSoC is the perfect opportunity to learn from experts, get to know new communities, design principles and workflows.” (Ram G Athreya) On the DBpedia blog you can read more details about this year’s 7 GSoC project ideas.  

Projects, students and mentors:

• DBpedia Mapping Front-End Administration by Ismael Rodriguez (now at Hedyla). Mentors: Anastasia Dimou (now at imec), Dimitris Kontokostas (now at Diffbot), Wouter Maroy (now at Deloitte Belgium)
• First Chatbot for DBpedia by Ram Ganesan Athreya (now at Samsung Research America). Mentor: Ricardo Usbeck (now at Hamburg University)
• Knowledge base embeddings for DBpedia by Nausheen Fatma (now at Tealbook). Mentors: Sandro Athaide Coelho (now at Translucent Computing Inc), Tommaso Soru (now at Scalpel / Liber AI)
• Knowledge Base Embeddings for DBpedia by Akshay Jagatap (now at Amazon). Mentors: Pablo Mendes (now at Apple Inc), Sandro Athaide Coelho (now at Translucent Computing Inc), Tommaso Soru (now at Scalpel / Liber AI)
• The table extractor by Luca Virgili (now at Polytechnic University of Marche). Mentors: Emanuele Storti (now at Polytechnic University of Marche), Domenico Potena (now at Polytechnic University of Marche)
• Unsupervised Learning of DBpedia Taxonomy by Shashank Motepalli (University of Toronto). Mentors: Marco Fossati (now at Wikimedia Foundation), Dimitris Kontokostas (now at Diffbot)
• Wikipedia List-Extractor by Krishanu Konar (now at Media.net). Mentor: Marco Fossati (now at Wikimedia Foundation)

2016

DBpedia participated for a fourth time in the Google Summer of Code program. It was quite a competitive year (like every year), with more than forty students applying for a DBpedia project. In the end, 8 great students from all around the world were selected and worked on their projects during the summer of 2016.

Projects, students and mentors:

• A Hybrid Classifier/Rule-based Event Extractor for DBpedia Proposal by Vincent Bohlen (Fraunhofer FOKUS). Mentor: Marco Fossati (now at Wikimedia Foundation)
• Automatic mappings extraction by Aditya Nambiar (now at Facebook). Mentor: Markus Freudenberg (now at Datadrivers GmbH)
• Combining DBpedia and Topic Modelling by wojtuch. Mentor: Alexandru Todor (now at FU Berlin)
• DBpedia Lookup Improvements by Kunal.Jha. Mentor: Axel Ngonga (now at University of Paderborn)
• Inferring infobox template class mappings from Wikipedia + Wikidata by Peng_Xu (now at BorealisAI). Mentor: Nilesh Chakraborty (now at Fraunhofer IAIS)
• Integrating RML in the DBpedia extraction framework by Wouter Maroy (now at Deloitte Belgium). Mentors: Dimitris Kontokostas (now at Diffbot), Anastasia Dimou (now at imec)
• The List Extractor by FedBai. Mentor: Marco Fossati (now at Wikimedia Foundation)
• The Table Extractor by s.papalini. Mentor: Marco Fossati (now at Wikimedia Foundation)

2015

This year, for the first time, a former student became one of our mentors! Dimitris Kontokostas, Marco Fossati, Thiago Galery, Joachim Daiber and Ruben Verborgh, members of the DBpedia community, mentored the following great students from around the world. Check more details on the DBpedia blog!

Projects, students and mentors:

• Fact Extraction from Wikipedia Text by Emilio Dorigatti (now at LMU Munich). Mentor: Marco Fossati (now at Wikimedia Foundation)
• Better context vectors for disambiguation by Philipp Dowling (now at Bain & Company). Mentors: Thiago Galery (now at Optimizely), Joachim Daiber (now at Apple Inc), David Przybilla (now at Optimizely)
• Wikipedia Stats Extractor by Naveen Madhire (now at Amazon Web Services (AWS)). Mentors: Thiago Galery (now at Optimizely), Joachim Daiber (now at Apple Inc), David Przybilla (now at Optimizely)
• DBpedia Live scaling and new interface by Andre Pereira. Mentors: Dimitris Kontokostas (now at Diffbot), Magnus Knuth (now at eccenca GmbH)
• Adding live-ness to the Triple Pattern Fragments server by Pablo Estrada (now at Google). Mentors: Ruben Verborgh (now at Ghent University – imec), Dimitris Kontokostas (now at Diffbot)
• DBpedia Schema Enrichment on Web Protege by laparmakerli. Mentor: Alexandru Todor (now at FU Berlin)
• Keyword Search on DBpedia by Kartik Singhal. Mentor: Axel Ngonga (now at University of Paderborn)
• Parallel processing in DBpedia extraction Framework by Ritesh Kumar Singh. Mentor: Nilesh Chakraborty (now at Fraunhofer IAIS)
• Remodeling pignlproc for generating Named entity recognition models by Naveen Madhire (now at Amazon Web Services (AWS)). Mentor: Joachim Daiber (now at Apple)

2014

For a third time, DBpedia was part of the Google Summer of Code because as you know, all good things come in threes.

Projects, students and mentors:

• Abbreviation Base – A multilingual knowledge base for abbreviations by Divyum Rastogi (now at Amazon). Mentors: Martin Brümmer (now at GitLab Inc.), Sebastian Hellmann (now at InfAI/DBpedia)
• Distributed extraction of Wikipedia data dumps for DBpedia by Nilesh Chakraborty (now at Fraunhofer IAIS). Mentors: Sang Venkatraman (now at Accenture), Nicolas Torzec (now at Verizon Media), Dimitris Kontokostas (now at Diffbot), Andrea Di Menna (now at Immobiliare.it)
• Fine grained massive extraction of Wikipedia content by Roberto Bampi (now at RISE SICS). Mentors: Michel Dumontier (now at University Maastricht), Marco Fossati (now at Wikimedia Foundation)
• Media Extractor for DBpedia by Leandro Doctors (now at University of Mons). Mentors: Alexandru Todor (now at FU Berlin), Magnus Knuth (now at eccenca GmbH), Marco Fossati (now at Wikimedia Foundation)
• Natural language question answering engine with DBpedia by Wencan. Mentors: Axel Ngonga (now at University of Paderborn), Marco Fossati (now at Wikimedia Foundation), Michel Dumontier (now at University Maastricht)
• New DBpedia Interfaces: Resource Widgets by Jorge Cruz (now at Hipmunk). Mentors: Magnus Knuth (now at eccenca GmbH), Patrick Westphal (now at Leipzig University), Dimitris Kontokostas (now at Diffbot)
• Wikimedia Commons extraction by Gaurav Vaidya (now at RENCI, an institute of UNC at Chapel Hill). Mentors: Dimitris Kontokostas (now at Diffbot), Andrea Di Menna (now at Immobiliare.it), Jimmy O’Regan (now at Trinity College Dublin)

2013

After the first year, in which only DBpedia Spotlight took part, in 2013 the whole DBpedia family participated in GSoC, including DBpedia, DBpedia Spotlight and DBpedia Wiktionary. Check more details on the DBpedia blog.

Projects, students and mentors:

• DBpedia: Design a better / interactive display page (+ Search) by El Denis. Mentors: Claus Stadler (now at Smart Data Analytics (SDA) Research), Dimitris Kontokostas (now at Diffbot)
• HadyElsahar – WikiData+DBpedia Idea proposal by Hady Elsahar (now at NAVER LABS Europe). Mentor: Dimitris Kontokostas (now at Diffbot)
• Input Formats Generalization and Graph-Based Disambiguation Integration and Improvements by Zhiwei Cai. Mentors: Max Jakob (now at Ecosia), Chris Hokamp (now at AYLIEN)
• Interface / Power tool for DBpedia testing metadata by Lazaros Ioannidis (now at POWERPHARM). Mentor: Dimitris Kontokostas (now at Diffbot)
• Type inference to extend coverage by Kasun Perera. Mentor: Marco Fossati (now at Wikimedia Foundation)

2012

2012 was DBpedia’s first year in the Google Summer of Code. All projects focused on DBpedia Spotlight and were co-mentored by Pablo N. Mendes (now at Apple Inc) and Max Jakob (now at Ecosia). Get more insights on the DBpedia blog.

ImageSnippets and DBpedia
https://www.dbpedia.org/blog/imagesnippets-and-dbpedia/
Wed, 18 Dec 2019

by Margaret Warren

The following post introduces ImageSnippets and shows how this tool benefits from the use of DBpedia.

ImageSnippets – A Tool for Image Curation

For over two decades, ImageSnippets has been evolving as an ontology- and data-driven framework for image annotation research. Representing the informal knowledge people have about the context and provenance of images as RDF/linked data is challenging, but it has also been an enlightening and engaging journey, not only in applying formal semantic web theory to building image graphs but also in weaving together our interests with what others have been doing in the fields of semantic annotation and knowledge graph building over these many years.

DBpedia provides the entities for our RDF descriptions

Since the beginning, we have always made use of DBpedia and other publicly available datasets to provide the entities for use in our RDF descriptions. Though ImageSnippets can be used to build special vocabularies around niche domains, our primary research is around relation ontology building, and we prefer to avoid creating new entities unless we absolutely cannot find them through any other service.
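
To make the idea of reusing entities concrete, here is a minimal sketch (in Python with rdflib) of what a small image-description graph might look like. The `ISNIP` namespace and the image URI are hypothetical placeholders, not ImageSnippets’ actual vocabulary; only the subject entity is an existing DBpedia resource.

```python
# Minimal sketch of an image-description graph: the described entities come
# from DBpedia instead of being newly minted. The ISNIP namespace and the
# image URI are hypothetical placeholders, not ImageSnippets' real schema.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

DBR = Namespace("http://dbpedia.org/resource/")
ISNIP = Namespace("http://example.org/imagesnippets#")  # hypothetical

g = Graph()
img = URIRef("http://example.org/images/eiffel-at-dusk.jpg")  # hypothetical

g.add((img, RDF.type, ISNIP.Image))
g.add((img, DCTERMS.subject, DBR.Eiffel_Tower))      # reused DBpedia entity
g.add((img, DCTERMS.creator, Literal("Jane Doe")))   # informal provenance

print(g.serialize(format="turtle"))
```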

When we first went live with our basic system in 2013, we began hand-building tens of thousands of triples using terms primarily from DBpedia (the core of the linked data cloud). While there would often be an overlap of terms with other datasets – almost a case of too many choices – we formed a best practice of preferentially using DBpedia terms as often as possible, because they gave us the most utility for reasoning using the SKOS concepts built into the DBpedia service. We have also made extensive use of DBpedia Spotlight for named-entity extraction.
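
As an illustration of that named-entity extraction step, here is a hedged sketch against the hosted DBpedia Spotlight REST endpoint; the endpoint URL and the response fields (`Resources`, `@URI`, `@surfaceForm`) are as commonly documented for the public service and may differ for other deployments or versions.

```python
# Sketch of named-entity extraction with the hosted DBpedia Spotlight API.
# Endpoint and response fields follow the commonly documented public
# service; self-hosted deployments may differ.
import requests

resp = requests.get(
    "https://api.dbpedia-spotlight.org/en/annotate",
    params={"text": "Berlin is the capital of Germany.", "confidence": 0.5},
    headers={"Accept": "application/json"},
    timeout=30,
)
resp.raise_for_status()
for resource in resp.json().get("Resources", []):
    print(resource["@surfaceForm"], "->", resource["@URI"])
```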

How to combine DBpedia & Wikidata and make it useful for ImageSnippets

But the addition of the Wikidata Query Service over the past 18 months or so has given us a new challenge: how to work with both! Since DBpedia and Wikidata both have class relationships that we can reason from, we found ourselves in a position to examine DBpedia and Wikidata in concert with each other through the use of mapping techniques between the two datasets.

How it works: ImageSnippets & DBpedia

When an image is saved, we build inference graphs over results from both DBpedia and Wikidata. These graphs can be revealed with simple SPARQL queries at our endpoint, and queries over subclasses, taxa and SKOS concepts can find image results in our custom search tool. We have also just recently added a pathfinder utility – highly useful for semantic explainability, as it returns the precise path of connections from an originating source entity to the target entity that was used in our custom image search.
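
The ImageSnippets endpoint itself is not given in this post, so the sketch below runs the same style of SKOS-based expansion against the public DBpedia SPARQL endpoint instead: it walks one level of skos:broader links in DBpedia’s category hierarchy, the kind of relationship the search tool reasons over.

```python
# Sketch of SKOS-based expansion of the kind described above, run against
# the public DBpedia SPARQL endpoint (nothing is assumed here about the
# ImageSnippets endpoint or schema, which are not shown in the post).
from SPARQLWrapper import JSON, SPARQLWrapper

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    SELECT ?narrower WHERE {
      ?narrower skos:broader <http://dbpedia.org/resource/Category:Towers> .
    }
    LIMIT 10
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["narrower"]["value"])
```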

Sometimes a query will produce very unintuitive results, and the pathfinder tool enables us to quickly locate semantic errors that lead to clearly erroneous misclassifications (for example, a search for the Wikidata subclass of ‘communication medium’ reveals images of restaurants and hotels because of misclassifications in Wikidata). In this way we can quickly troubleshoot the results of queries, using the images as visual cues to explore the accuracy of the semantic modelling in both datasets.

We are very excited about the new directions that can come from knitting the two knowledge graphs together through our visual interface, and we believe there is great potential for ImageSnippets to serve a more complex role in cleaning and aligning the two datasets, using the images as our guides.

A big thank you to Margaret Warren for providing some insights into her work at ImageSnippets.

Yours,

DBpedia Association

GlobalFactSync and WikiDataCon2019
https://www.dbpedia.org/blog/globalfactsync-and-wikidatacon2019/
Thu, 24 Oct 2019

We will be spending the next three days in Berlin at WikidataCon 2019, the conference for open data enthusiasts. From October 24th till 26th we will be presenting the latest developments and first results of our work in the GlobalFactSyncRE project.

Short Project Intro

Funded by the Wikimedia Foundation, the project started in June 2019 and has two goals:

  • Answer the following questions:
    • How is data edited in Wikipedia and Wikidata?
    • Where does it come from?
    • How can we synchronize it globally?
  • Build an information system to synchronize facts between all Wikipedia language-editions, Wikidata, DBpedia and eventually multiple external sources, while also providing respective references. 

In order to help Wikipedians to maintain their infoboxes, check for factual correctness, and also improve data in Wikidata, we use data from Wikipedia infoboxes of different languages, Wikidata, and DBpedia and fuse them into our PreFusion dataset (in JSON-LD). More information on the fusion process, which is the engine behind GFS, can be found in the FlexiFusion paper.

Can’t join the conference or want to find out more about GlobalFactSync?

No problem, the poster we are presenting at the conference is currently available here and will soon be available here. Additionally, why not go through our project timeline, follow up on our progress so far and find out what’s coming up next.

In case you have specific questions regarding GlobalFactSync or even some helpful feedback, just ping us via dbpedia@infai.org. We also have our new DBpedia Forum, home to the DBpedia Community, which is just waiting for you to start a discussion around GlobalFactSync. Why not start one now?

For general DBpedia news and updates follow us on Twitter.

…And if you are in Berlin at WikiDataCon2019 stop by our poster and talk to our developers. They are looking forward to vital exchanges with you.

All the best

yours,


DBpedia Association


Global Fact Sync – Synchronizing Wikidata & Wikipedia’s infoboxes
https://www.dbpedia.org/blog/global-fact-sync-synchronizing-wikidatas-wikipedias-infoboxes/
Thu, 25 Jul 2019

How is data edited in Wikipedia/Wikidata? Where does it come from? And how can we synchronize it globally?  

The GlobalFactSync (GFS) Project — funded by the Wikimedia Foundation — started in June 2019 and has two goals:

  • Answer the above-mentioned three questions.
  • Build an information system to synchronize facts between all Wikipedia language-editions and Wikidata. 

Now we are seven weeks into the project (10+ more months to go) and we are releasing our first prototypes to gather feedback. 

How – Synchronization vs Consensus

We follow an absolute “Human(s)-in-the-loop” approach when we talk about synchronization. The final decision whether to synchronize a value or not should rest with a human editor who understands consensus and the implications. There will be no automatic imports. Our focus is to drastically reduce the time to research all references for individual facts.

A trivial example to illustrate our reasoning is the release date of the single “Boys Don’t Cry” (March 16th, 1989) in the English, Japanese, and French Wikipedia, in Wikidata, and finally in the external open database MusicBrainz. A human editor might need 15-30 minutes to find and open all the different sources, while our current prototype can spot differences and display them in 5 seconds.
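
The prototype’s internals are not spelled out in this post; as a rough illustration of the consensus check, the sketch below groups per-source values for a single fact and reports any disagreement. The source values are hard-coded stand-ins for what per-source extractors would return, and the frwiki entry is an invented discrepancy for demonstration.

```python
# Rough illustration of the consensus check: group the per-source values
# for one fact and flag disagreement. Values are hard-coded stand-ins for
# what per-source extractors would return; frwiki is an invented outlier.
from collections import defaultdict

values_by_source = {
    "enwiki":      "1989-03-16",
    "jawiki":      "1989-03-16",
    "frwiki":      "1989-04-16",  # hypothetical discrepancy
    "wikidata":    "1989-03-16",
    "musicbrainz": "1989-03-16",
}

sources_by_value = defaultdict(list)
for source, value in values_by_source.items():
    sources_by_value[value].append(source)

if len(sources_by_value) > 1:
    print("Discrepancy found:")
    for value, sources in sorted(sources_by_value.items(),
                                 key=lambda kv: -len(kv[1])):
        print(f"  {value} asserted by {', '.join(sources)}")
```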

We already had our first successful edit where a Wikipedia editor fixed the discrepancy with our prototype: “I’ve updated Wikidata so that all five sources are in agreement.” We are now working on the following tasks:

  • Scaling the system to all infoboxes, Wikidata and selected external databases (see below on the difficulties there)
  • Making the system:
    •  “live” without stale information
    • “reliable” with less technical errors when extracting and indexing data
    • “better referenced” by not only synchronizing facts but also references 

Contributions and Feedback

To ensure that GlobalFactSync will serve and help the Wikiverse, we encourage everyone to try our data and microservices and leave us some feedback, either on our Meta-Wiki page or via email. In the following 10+ months, we intend to improve and build upon these initial results. At the same time, these microservices are available for every developer to exploit and to build useful applications with. The most promising contributions will be rewarded and receive the book “Engineering Agile Big-Data Systems”. Please post feedback or any tool or GUI here. In case you need changes to be made to the API, please let us know, too.
For the ambitious future developers among you, we have some budget left that we will dedicate to an internship. In order to apply, just mention it in your feedback post. 

Finally, to talk to us and other GlobalFactSync users, you may want to visit WikidataCon and Wikimania, where we will present the latest developments and the progress of our project.

Data, APIs & Microservices (Technical prototypes) 

Data Processing and Infobox Extraction

For GlobalFactSync we use data from Wikipedia infoboxes of different languages, as well as Wikidata and DBpedia, and fuse them into one big, consolidated dataset – a PreFusion dataset (in JSON-LD). More information on the fusion process, which is the engine behind GFS, can be found in the FlexiFusion paper. One of our next steps is to integrate MusicBrainz into this process as an external dataset. We hope to implement even more such external datasets to increase the amount of available information and references.
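
The precise PreFusion schema is defined by the FlexiFusion work; purely as an illustration of the idea, a simplified, guessed-at shape might keep every source’s value for one (subject, property) pair side by side, so that a human can later pick the consensus value:

```python
# Simplified, guessed-at shape of a PreFusion-style record: all source
# values for one (subject, property) pair are kept side by side instead of
# merged, so a human editor can pick the consensus value later. The real
# schema is JSON-LD and defined by the FlexiFusion paper; it differs in
# detail, and the subject ID below is hypothetical.
import json

prefusion_record = {
    "subject": "https://global.dbpedia.org/id/ExampleEntity",  # hypothetical
    "property": "http://dbpedia.org/ontology/populationTotal",
    "values": [
        {"value": "28919", "source": "enwiki"},
        {"value": "28914", "source": "dewiki"},
        {"value": "28919", "source": "wikidata"},
    ],
}
print(json.dumps(prefusion_record, indent=2))
```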

First microservices 

We deployed a set of microservices to show the current state of our toolchain.

  • [Initial User Interface] The GlobalFactSync UI prototype (available at http://global.dbpedia.org) shows all extracted information available for one entity for different sources. It can be used to analyze the factual consensus between different Wikipedia articles for the same thing. Example: Look at the variety of population counts for Grimma.
  • [Reference Data Download] We ran the Reference Extraction Service over 10 Wikipedia languages. Download dumps here.
  • [ID service] Last but not least, we offer the Global ID Resolution Service. It ties together all available identifiers for one thing (i.e. at the moment all DBpedia/Wikipedia and Wikidata identifiers – MusicBrainz coming soon…) and shows their stable DBpedia Global ID. 

Finding sync targets

In order to test out our algorithms, we started by looking at various groups of subjects, our so-called sync targets. Based on the different subjects, a set of problems was identified, with varying layers of complexity:

  • identity check/check for ambiguity — Are we talking about the same entity? 
  • fixed vs. varying property — Some properties vary depending on nationality (e.g., release dates), or point in time (e.g., population count).
  • reference — Depending on the entity’s identity check and the property’s fixed or varying state the reference might vary. Also, for some targets, no query-able online reference might be available.
  • normalization/conversion of values — Depending on the language/nationality of the article, properties can have varying units (e.g., currency, metric vs imperial system); see the sketch after this list.
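
As a toy example of the normalization problem from the last bullet, the sketch below brings length values extracted in different units onto a single metric scale before comparison. It covers conversion factors only; real extraction would also have to parse the unit out of the infobox text.

```python
# Toy normalization step: convert lengths in different units to meters so
# values from different language editions become comparable. Real
# extraction would additionally need to parse the unit out of the markup.
UNIT_TO_METERS = {"m": 1.0, "km": 1000.0, "mi": 1609.344, "ft": 0.3048}

def to_meters(value: float, unit: str) -> float:
    return value * UNIT_TO_METERS[unit]

# e.g. an English infobox says 5 mi, a German one says 8.05 km:
print(round(to_meters(5, "mi")))     # 8047
print(round(to_meters(8.05, "km")))  # 8050 -> close, but not identical
```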

The check for ambiguity is the most crucial step to ensure that the infoboxes being compared do refer to the same entity. We found instances where the Wikipedia page and the infobox shown on that page were presenting information about different subjects (e.g., see here).

Examples

As a good sync target to start with, the group ‘NBA players’ was identified. There are no ambiguity issues, it is a clearly defined group of persons, and the number of varying properties is very limited. Information seems to be derived mainly from two websites (nba.com and basketball-reference.com), and normalization is only a minor issue. ‘Video games’ also proved to be an easy sync target, with the main problem being varying properties such as different release dates for different platforms (Microsoft Windows, Linux, MacOS X, XBox) and different regions (NA vs EU).

More difficult topics, such as ‘cars’, ‘music albums’, and ‘music singles’, showed more potential for ambiguity as well as property variability. A major concern we found was Wikipedia pages that contain multiple infoboxes (often seen for pages referring to a certain type of car, such as this one). Reference and fact extraction can be done for each infobox, but currently we run into trouble once we fuse this data.

Further information about sync targets and their challenges can be found on our Meta-Wiki discussion page, where Wikipedians that deal with infoboxes on a regular basis can also share their insights on the matter. Some issues were also found regarding the mapping of properties. In order to make GlobalFactSync as applicable as possible, we rely on the DBpedia community to help us improve the mappings. If you are interested in participating, we will connect with you at http://mappings.dbpedia.org and in the DBpedia forum.  

Bottom line

We value your feedback.

Your DBpedia Association
