Wikidata:Lexicographical data/Documentation
This documentation page is currently being reworked. Some important changes may occur.
This is the main documentation page for lexicographical data on Wikidata. Since the new data system is not deployed yet, this documentation is incomplete and mostly based on the test system.
See also the technical documentation on extension WikibaseLexeme.
Introduction
Data Model
The data model of WikibaseLexeme describes the structure of the data that is handled as "Lexemes" in Wikibase. The text below is a summary; for more detailed information, see Extension:WikibaseLexeme/Data Model.
A Lexeme is a lexical element of a language, such as a word, a phrase, or a prefix (see Lexeme on Wikipedia). Lexemes are Entities in the sense of the Wikibase data model. A Lexeme is described using the following information:
- An ID. Lexemes have IDs starting with an "L" followed by a natural number in decimal notation, e.g. L3746552. These IDs are unique within the repository that manages the Lexeme. The ID can be combined with a repository's concept base URI to form a unique URI for the Lexeme.
- A Lemma for use as a human-readable representation of the lexeme, e.g. "run".
- The Language to which the lexeme belongs. This is a reference to a concrete Item, e.g. English (Q1860).
- The Lexical category to which the lexeme belongs. This is given as a reference to a concrete Item, e.g. adjective (Q34698).
- A list of Lexeme Statements to describe properties of the lexeme that are not specific to a Form or Sense (e.g. derived from, grammatical gender, or syntactic function).
- A list of Forms, typically one for each relevant combination of grammatical features, such as 2nd person / singular / past tense. A Form is described using the following information:
  - An ID. Forms have IDs starting with the ID of the Lexeme they belong to, followed by a hyphen ("-") and an "F", followed by a natural number in decimal notation, e.g. L3746552-F7.
  - A representation, spelling out the Form as a string.
  - A list of grammatical features that define for which syntactic role the given Form applies. These are given as references to concrete Items, e.g. participle (Q814722).
  - A list of Form Statements further describing the Form or its relations to other Forms or Items (e.g. IPA transcription (P898), pronunciation audio, rhymes with, used until, used in region).
- A list of Senses, describing the different meanings of the lexeme (e.g. "financial institution" and "edge of a body of water" for the English noun bank). A Sense is described using the following information:
  - An ID. Senses have IDs starting with the ID of the Lexeme they belong to, followed by a hyphen ("-") and an "S", followed by a natural number in decimal notation, e.g. L3746552-S4. These IDs are unique within the repository that manages the Lexeme. The ID can be combined with a repository's concept base URI to form a unique URI for the Sense.
  - A Gloss, defining the meaning of the Sense using natural language.
  - A list of Sense Statements further describing the Sense and its relations to Senses and Items (e.g. translation, synonym, antonym, connotation, register, denotes, evokes).
This data model is further extended by the set of properties typically used for Lexeme statements, Form statements, and Sense statements. See Wikidata:Lexicographical data/Properties for an overview of these properties and Wikidata:Property proposal/Lexemes for current proposals of additional properties.
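The ID formats described above can be illustrated with a short Python sketch. It is not part of WikibaseLexeme: the function names `parse_entity_id` and `entity_uri` are hypothetical, and the concept base URI is assumed to be the one used by Wikidata (other repositories would use their own).

```python
import re

# Pattern for the three ID formats described above: "L" + number,
# optionally followed by "-F" + number (Form) or "-S" + number (Sense).
_ID_PATTERN = re.compile(r"^(L[0-9]+)(?:-(F|S)([0-9]+))?$")

# Assumption: Wikidata's concept base URI; other repositories use their own.
CONCEPT_BASE_URI = "http://www.wikidata.org/entity/"


def parse_entity_id(entity_id):
    """Return (kind, lexeme_id) for a Lexeme, Form or Sense ID.

    Raises ValueError if the string matches none of the three formats.
    """
    match = _ID_PATTERN.match(entity_id)
    if match is None:
        raise ValueError(f"not a Lexeme, Form or Sense ID: {entity_id!r}")
    lexeme_id, sub_kind, _number = match.groups()
    kind = {None: "lexeme", "F": "form", "S": "sense"}[sub_kind]
    return kind, lexeme_id


def entity_uri(entity_id, base=CONCEPT_BASE_URI):
    """Combine a valid ID with a concept base URI to form a URI."""
    parse_entity_id(entity_id)  # validate before building the URI
    return base + entity_id
```

For example, `parse_entity_id("L3746552-F7")` identifies a Form that belongs to the Lexeme L3746552, while an Item ID such as "Q1860" is rejected.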
| Language | verb | noun | pronoun | adjective | adverb | preposition | postposition | conjunction | interjection | numeral | determiner | grammatical particle |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Arabic | ذهب | كتاب | انا | جميل | عادةً | في | | لكن (بس) | يعني | واحد | هذا | |
| English | go | book | I | beautiful | usually | in | | but | oh | one | this | |
| German | wissen | Zukunft | ich | ausgezeichnet | querbeet | in | | aber | ach | eins | dieser | |
| Korean | 먹다 | 사람 | 나 | 괴롭다 | 함께 | | | | 가만 | 극 | 고전적 | |
| French | aller | livre | je | beau | toujours | dans | | mais | merci | un | ce | |
| Pashto | تلل | کتاب | زه | ښکلی | | په | | خو | | یو | | |
| Persian | رفتن | کتاب | من | زیبا | | در | را | اما | آخ | یک | این | |
| Russian | быть | вода | я | хороший | хорошо | в | - | и | всё | три | - | не |
In some cases or languages, there may be multiple entities for related words; in others, just one. The table below gives an overview of how they may be linked:
| difference in | 1 lexeme | | 2+ lexemes | | |
|---|---|---|---|---|---|
| sense | add several senses | | add applicable sense to lexeme | link other(s) with homograph lexeme | duplicate forms on each |
| etym. | add etym. to each sense | | add etym. to lexeme base | link other(s) with homograph lexeme | duplicate forms on each |
| gender | add gender to each sense | | add gender to lexeme base | link other(s) with homograph lexeme | duplicate forms on each |
| common/proper | add several senses | use lexical category "noun" | add applicable sense to lexeme | link other(s) with homograph lexeme | duplicate forms on each |
| caps/lowercase | add several forms | qualify forms to applicable senses | add applicable sense to lexeme | link other(s) with homograph lexeme | add only applicable forms |
| singular/plural | add several forms | qualify forms to applicable senses | add applicable sense | if possible link other(s) with homograph lexeme | add only applicable forms |
| pronunciation | add the same form twice | qualify forms to applicable senses, add pronunciation | add applicable sense | if possible link other(s) with homograph lexeme | add form and applicable pronunciation |
| forms/spelling | add several forms or alternate forms | qualify forms to applicable senses | add applicable sense | if possible link other(s) with homograph lexeme | add only applicable forms |
For a given language and criterion (first column), only one of the two approaches (1 lexeme or 2+ lexemes) might apply.
Interface
Lexeme
- Create a new Lexeme
  - Go to Special:NewLexeme
  - Enter a lemma (dictionary form of a word)
  - Enter the language of the lexeme by typing the name of the language or its Q-ID
  - In the field that appears above, enter the language code of the lemma
  - Enter the lexical category by typing its name or its Q-ID (example: verb, noun, adjective...)
  - Click on "Create"
  - The Lexeme is now created with this basic information; you can continue editing it
- Edit a Lexeme
  - Click on the edit button, next to the lemma
  - Edit the content of the different fields
    - Lemma
    - Language code of the lemma
    - Language of the Lexeme
    - Lexical category
  - Click on "publish"
- Add, edit or delete statements of a Lexeme
  - To add a statement of a Lexeme, click on "add statement"
    - Enter a property: start typing its name in the property field (example: derived-from) and select it in the suggester
    - Enter a value
    - Just like on Items, you can add qualifiers and references
    - Save by clicking "publish"
  - To edit a statement, click on "edit"
  - To delete a statement, click on "edit", then "remove"
- Delete a Lexeme
  - Go to WD:RFD
- Search for a Lexeme
Here's how you can look for Lexemes, Lemmas, Forms or Senses, via Special:Search or the search box on any page:
- look for a Lexeme by its L-number
  - by typing "Lexeme:L123"
  - by typing "L123" and selecting the Lexeme namespace
- look for a Lexeme by the name of its lemma
  - by typing "Lexeme:sandbox"
  - by typing "sandbox" and selecting the Lexeme namespace
- use the L shortcut: "L:L123" or "L:sandbox"
- look for a Form (e.g. "Lexeme:mangeant") with any of the methods described above
Note that the selector (drop-down menu popping up to suggest results) is not working yet. But if you press Enter or search after typing your keyword, you'll access the results.
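The search terms listed above can be assembled programmatically. The following Python sketch is only an illustration: the helper names `lexeme_search_term` and `lexeme_search_url` are hypothetical, and the URL assumes the standard MediaWiki entry point `index.php` on www.wikidata.org with the `title` and `search` parameters.

```python
from urllib.parse import urlencode

# Assumption: the standard MediaWiki entry point on wikidata.org.
SEARCH_URL = "https://www.wikidata.org/w/index.php"


def lexeme_search_term(keyword, use_shortcut=False):
    """Prefix a keyword (L-number, lemma or Form) with the Lexeme namespace.

    With use_shortcut=True the short "L:" prefix is used instead of "Lexeme:".
    """
    prefix = "L:" if use_shortcut else "Lexeme:"
    return prefix + keyword


def lexeme_search_url(keyword, use_shortcut=False):
    """Build a Special:Search URL for the given keyword."""
    params = {
        "title": "Special:Search",
        "search": lexeme_search_term(keyword, use_shortcut),
    }
    return SEARCH_URL + "?" + urlencode(params)
```

Calling `lexeme_search_term("sandbox", use_shortcut=True)` gives the term "L:sandbox", which can also be typed directly into the search box.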
Form
- Create a new Form
  - In the Forms section, click on "add Form"
  - Fill in the representation (mandatory)
  - Fill in the language code of the representation (mandatory)
  - Enter one or several grammatical features, by typing their names and selecting them in the list of Items
- Edit a Form
  - Click on the edit button next to the representation
  - Modify the content in the fields
  - Click on "publish"
- Delete a Form
  - Click on the edit button next to the representation
  - Click on "remove"
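As a rough illustration of which fields are mandatory when creating a Form, the sketch below models the "add Form" dialog in Python. The class `FormDraft` and its method `missing_fields` are hypothetical names invented for this example; they do not correspond to any WikibaseLexeme API.

```python
from dataclasses import dataclass, field


@dataclass
class FormDraft:
    """Hypothetical container mirroring the fields of the "add Form" dialog."""

    representation: str  # mandatory
    language_code: str   # mandatory
    # Optional list of Item IDs, e.g. "Q814722" (participle).
    grammatical_features: list = field(default_factory=list)

    def missing_fields(self):
        """Return the names of mandatory fields that are still empty."""
        missing = []
        if not self.representation.strip():
            missing.append("representation")
        if not self.language_code.strip():
            missing.append("language_code")
        return missing
```

A draft such as `FormDraft("mangeant", "fr", ["Q814722"])` has no missing fields, whereas a draft with an empty representation would be reported as incomplete.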
Sense
- Create a new Sense
  - In the Senses section of a Lexeme, click on "add Sense"
  - Enter a language code (for example: en, fr, zh)
  - Enter a gloss (very short phrase defining the meaning)
  - You can add new glosses by clicking on "add"
  - Click on "publish"
  - Now that the Sense is created, you can add statements
- Edit a Sense
  - Click on the edit button, next to the Sense ID
  - Edit the content of the different fields
  - Click on "publish"
- Remove a Sense
  - Click on the edit button, next to the Sense ID
  - Click on "remove"
Features
See also: Wikidata:Lexicographical data/Development
What is included in the first version
- New datatypes: Lexeme, Form
- Add, edit, delete Lexemes
- Add, edit, delete Forms
- Add, edit, delete statements
- Add, edit, delete qualifiers
- Add, edit, delete references
- Linking to an Item from a Lexeme or a Form
- Linking to another Lexeme from a Lexeme, a Form or an Item
- Search and suggestions when entering a value
- Basic internal APIs (used for UI, you should not use them)
What will be added in the future
Ordered from near-term to long-term plans.
- Search for content with Special:Search (Done)
- Display the lemma in the history pages, recent changes and watchlist (Done)
- Add, edit, delete Senses (Done)
- RDF support and ability to query the data on query.wikidata.org (Done)
- Better API support
- Automatic generation of Forms
- Data access on clients (other Wikimedia projects)
- Editing data directly from Wiktionary
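To give an idea of what querying Lexemes on query.wikidata.org can look like, here is a minimal Python sketch that only builds a SPARQL query and a request URL, without sending anything. The helper names `lexeme_query` and `query_url` are hypothetical. The predicates used (`ontolex:LexicalEntry`, `dct:language`, `wikibase:lexicalCategory`, `wikibase:lemma`) and the endpoint path are assumptions based on the RDF mapping of Lexemes and should be checked against the current query service documentation.

```python
from urllib.parse import urlencode

# Assumption: the public SPARQL endpoint of the Wikidata Query Service.
ENDPOINT = "https://query.wikidata.org/sparql"


def lexeme_query(language_item, category_item, limit=10):
    """Build a SPARQL query listing Lexemes of one language and lexical category.

    language_item and category_item are Item IDs such as "Q1860" (English)
    and "Q34698" (adjective).
    """
    return f"""
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX ontolex: <http://www.w3.org/ns/lemon/ontolex#>
PREFIX dct: <http://purl.org/dc/terms/>

SELECT ?lexeme ?lemma WHERE {{
  ?lexeme a ontolex:LexicalEntry ;
          dct:language wd:{language_item} ;
          wikibase:lexicalCategory wd:{category_item} ;
          wikibase:lemma ?lemma .
}}
LIMIT {int(limit)}
""".strip()


def query_url(sparql):
    """Build a GET URL for the query, asking for JSON results."""
    return ENDPOINT + "?" + urlencode({"query": sparql, "format": "json"})
```

For instance, `query_url(lexeme_query("Q1860", "Q34698"))` yields a URL requesting up to ten English adjectives; the query text can also be pasted into the query service interface.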

