Wiktionary:Scribunto

Wiktionary supports server-side scripting to generate content for pages, using the Scribunto extension. It is used as a complement to templates, in particular to parser functions such as #if and #switch. Scripts are divided into modules located in the Module: namespace, and are written in the Lua programming language.

Getting started
Here are some helpful links to get you started with Lua and Scribunto.
 * Learning Lua – If you're not yet familiar with the language, or with programming in general, this is a good place to start. It covers only "generic" Lua, not the parts that are specific to using Lua within a wiki.
 * Scribunto/Lua tutorial – A short tutorial to explain how to use Scribunto/Lua within the wiki.
 * Scribunto/Lua reference manual – A reference manual for Lua as it applies to the Scribunto extension. This also lists Wiki-specific things that don't exist in normal Lua.
 * Official reference manual for Lua 5.1 – A quick reference to the language, for more experienced programmers. Again, this is generic Lua, and does not cover specific details about using it on Wiktionary, but it has information that the Scribunto-specific manual lacks.
 * lua-users wiki – A user-written wiki with many articles on various aspects of Lua.
 * Programming in Lua by Roberto Ierusalimschy, one of the creators of Lua – in-depth discussion of the basics of Lua 5.0.

Information about using Scribunto on Wiktionary specifically:
 * /Converting templates – Some tips and tricks for "translating" common practices from templates to Lua.
 * Coding conventions

How Scribunto interacts with the wiki
A Scribunto module is in effect one large function: it runs from top to bottom and is expected to return a value. Normally, the return value of a module is a table of functions keyed by their names, which can then be called from another module or from "wikispace". In theory, a module could return something else: a table of strings, a table containing other tables, or even a single value. However, only a module that returns a table of functions can be invoked from wikitext; anything else can only be imported and used from within Scribunto itself.
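As a minimal sketch of the usual shape of a module: the names p and greet are hypothetical, and the body is wrapped in a function here only so that the snippet can run outside the wiki. In a real module the body sits at the top level of the page and ends with the return statement.

```lua
-- Stand-in for the body of a hypothetical module page.
local function module_body()
	local p = {}  -- export table: function names mapped to functions

	-- A function that could be called from wikitext or from another module.
	function p.greet(frame)
		return "Hello from Lua"
	end

	return p  -- the module's return value
end

local p = module_body()
print(p.greet())  -- prints "Hello from Lua"
```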

A function in a Scribunto module is called from wikitext as if it were a parser function, using the notation {{#invoke:module_name|function_name}}. The function then returns wikitext as its output. This wikitext can contain HTML-style markup (such as <span> and <br />) and wiki-specific markup (such as ''' for bolding and [[...]] for links), but it cannot invoke templates, parser functions, magic words or parser extension tags: something like {{PAGENAME}} in the returned text is interpreted as the literal string {{PAGENAME}} rather than being expanded to the page name. If a module needs to invoke a template or a parser function, it has to use the functions that the frame object provides for this purpose, such as frame:expandTemplate, frame:callParserFunction and frame:preprocess. It is hoped that these functions will not be needed too often.

Since, in order to be useful, the function needs information about the context in which it was invoked, the Scribunto extension passes it a single argument, customarily named frame. This argument can be used to obtain various key bits of information. In particular, if a named parameter was passed to the template that is invoking the Scribunto function, then inside the function its value is available as a string in frame:getParent().args, indexed by the parameter's name. (Numbered/unnamed parameters can be accessed by number; for example, the template's {{{1}}} becomes the module's frame:getParent().args[1].) Some values can be obtained by Scribunto code directly even when the corresponding parameter is absent, so functions do not need to pass them around to each other. A module can call the process function in Module:parameters to clean up the parameters and to check their validity.

It is also possible (though not usually necessary) for a template to pass further arguments as part of the invocation, in which case these will be available via frame.args. For example, if the module was invoked with a named argument after the function name, then frame.args indexed by that name will be the string that was given as its value. (Numbered/unnamed parameters work analogously.)

Debugging and error reporting
When Lua encounters an error in a script, it aborts the script and shows "Module error" in large red, clickable text on the page. Click on this text in order to see what caused the error.

A module error also adds the page to Category:Pages with module errors. When writing modules or converting templates, it is a good idea to check this category to see whether any pages that use them are triggering errors. It is also possible to trigger errors yourself, by calling Lua's error function with a message. This can be used to check whether a module and its accompanying template(s) are being used correctly, and to show an error to the user otherwise. It is highly recommended that you use this whenever possible, to make your modules more robust and to make it easier to find mistakes.
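A small sketch of raising an error on bad input. The helper name get_lang and the parameter name lang are hypothetical; pcall is used here only so that the message can be shown without aborting the script.

```lua
-- Hypothetical helper: returns the language code or raises an error.
local function get_lang(args)
	local lang = args.lang
	if lang == nil or lang == "" then
		error("The parameter 'lang' is required.")
	end
	return lang
end

-- Catch the error to inspect it; on the wiki an uncaught error
-- is what produces the red "Module error" text.
local ok, message = pcall(get_lang, {})
print(ok, message)
```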

While you are working on a script, it may occasionally be useful to generate debug messages so that you can see what is going on at particular points in your script. You can do this with the mw.log function.

This function outputs its argument to the Scribunto debug console when you run the module there, for example by calling the function you want to test through the console's p variable. It automatically adds a newline to the end of the message.

The function os.clock can be used for simple benchmarking of a given function: record its value before and after the call and take the difference.
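A rough benchmarking sketch, assuming os.clock is the timing function meant here. The function build_string is a hypothetical workload, and the fallback to print is only there so that the snippet also runs outside the wiki, where mw.log does not exist.

```lua
-- Use mw.log on the wiki; fall back to print elsewhere (portability assumption).
local log = (mw and mw.log) or print

-- Hypothetical function to be timed.
local function build_string(n)
	local parts = {}
	for i = 1, n do
		parts[i] = tostring(i)
	end
	return table.concat(parts, ",")
end

local start = os.clock()            -- CPU time before the call
local result = build_string(10000)
local elapsed = os.clock() - start  -- CPU seconds used by the call
log(("build_string took %.4f s"):format(elapsed))
```

A single run is noisy; repeating the call in a loop and dividing by the number of iterations usually gives a steadier figure.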

An error also occurs when the time allocated for running scripts expires before all scripts on a page can be run. If you are making a complex and potentially time-consuming edit to a module, you can use the "Preview page with this template" to preview a very large, module-heavy page like [[a]] to check if your script slows it down too much.

English Wiktionary also has its own purpose-made debugging module, aptly named Module:debug. Its tracking function can be used to track entries that fulfill a particular condition without interfering with the operation of a function or template. It is similar in purpose to Category:Template tracking.

"Frame" and "parent frame"
There are actually two ways that values can be passed to a Scribunto module. The first is the one shown above, in which the values are passed as parameters directly to the module invocation. So for example, if there is a Lua function that generates some output from its arguments, a template will invoke it with {{#invoke:}} and pass its own parameters on explicitly, and other pages will access the function by transcluding that template. With this approach, the Lua function needs to access the arguments that were passed to the invocation, which it reads from frame.args.
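A sketch of the first approach. The function name plural and its behavior are hypothetical, and the frame is a hand-made mock table standing in for the object that Scribunto would supply.

```lua
local p = {}

-- Entry point meant to be reached through an explicit invocation,
-- so the values arrive in frame.args.
function p.plural(frame)
	local word = frame.args[1]
	return word .. "s"
end

-- Mock frame: the invocation passed one unnamed argument.
local mock_frame = { args = { "dog" } }
print(p.plural(mock_frame))  -- prints "dogs"
```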

However, there is another way, which is recommended because it is faster. Every module also has access to its so-called "parent frame", which contains the collection of arguments passed not to the module, but to the template that called it. So rather than invoking the module and passing the values on explicitly, the template invokes the module with no parameters. The module itself can then access the parameters that were passed to the template, using the parent frame, which it obtains with frame:getParent().
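A sketch of the parent-frame approach under the same assumptions: plural is a hypothetical function, and both frames are mock tables that imitate only the args field and the getParent method.

```lua
local p = {}

function p.plural(frame)
	-- Arguments given to the template that contains the invocation.
	local args = frame:getParent().args
	return args[1] .. "s"
end

-- Mock objects: the invocation has no arguments, the template call has one.
local parent_frame = { args = { "cat" } }
local mock_frame = {
	args = {},
	getParent = function(self) return parent_frame end,
}
print(p.plural(mock_frame))  -- prints "cats"
```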

The only real difference between the two approaches is the use of frame:getParent() to get the arguments of the "parent frame" (i.e., the template call), rather than the arguments of the module invocation itself.

It's possible to write a function that supports both approaches (for example, by checking both frame.args and frame:getParent().args). This may occasionally be useful for simple functions that can be called more or less intuitively from a template as well as from another module. But for more complicated functions, it's better to write the "main" code in one function, and write another function that can be invoked from a template, which then gathers the parameters and calls the main function.
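One possible way to arrange this split, as a sketch: the names export, plural and show are hypothetical, and the frames are mock tables.

```lua
local export = {}

-- "Main" function: takes plain Lua values, so other modules can call it.
function export.plural(word)
	return word .. "s"
end

-- Entry point for templates: gathers parameters, then calls the main function.
function export.show(frame)
	local parent = frame.getParent and frame:getParent()
	-- Prefer the invocation's own argument; fall back to the template's.
	local word = frame.args[1] or (parent and parent.args[1])
	if not word or word == "" then
		error("Parameter 1 is required.")
	end
	return export.plural(word)
end

-- Mock frame for a template call that passes "cat" to the template.
local template_call = {
	args = {},
	getParent = function(self) return { args = { "cat" } } end,
}
print(export.show(template_call))  -- prints "cats"
print(export.plural("dog"))        -- prints "dogs"
```

Keeping the main function free of frame handling means that other modules can call it directly with ordinary values.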

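Note that an empty parameter passed on from a template "counts": when a template is called with an empty first parameter, a condition that merely tests whether the first argument exists is satisfied, because in Lua an empty string is interpreted as true (only nil and false are interpreted as false). A condition that also compares the argument against the empty string, on the other hand, will only respond to a non-empty first argument.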
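The difference between the two kinds of condition can be sketched in plain Lua. The args table here is a stand-in for the arguments of a template called with an empty first parameter.

```lua
-- Simulates the arguments of a template called with an empty first parameter.
local args = { "" }

local exists = false
if args[1] then                    -- satisfied: "" is neither nil nor false
	exists = true
end

local non_empty = false
if args[1] and args[1] ~= "" then  -- not satisfied: the value is the empty string
	non_empty = true
end

print(exists, non_empty)
```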

Efficiency
The efficiency of Lua can be checked through the template preview feature. After pressing "preview", right-click on the preview of the page and request the page source. In the page source, search for "NewPP" to see how much time the execution of the Lua module took (example: Lua time usage: 0.004s). Search for "served" to see how long it took to render the entire page (example: Served by mw1035 in 0.498 secs). This latter time can be used to compare how long it takes to render the same page with and without Lua modules.

Various techniques can be used to increase efficiency. The following come from a chapter in Lua Programming Gems:
 * Avoid creating functions and tables inside loops.
 * Use local variables.
 * Memoize expensive functions.
 * Avoid a large number of separate string concatenation operations by inserting strings into a table with table.insert and creating the final string with table.concat.

Each individual concatenation operation, whether it involves two strings (a .. b) or several (a .. b .. c), generates a single new string, which is stored in Lua memory. Many concatenation operations (for instance, in loops) can use a lot of memory because many intermediate strings are created.
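The two styles of building a string can be compared in a short sketch (the variable names are illustrative only):

```lua
local words = { "alpha", "beta", "gamma" }

-- Slower pattern: every pass through the loop allocates a new intermediate string.
local slow = ""
for _, word in ipairs(words) do
	slow = slow .. word .. ", "
end

-- Faster pattern: collect the pieces, then join them once.
local parts = {}
for _, word in ipairs(words) do
	table.insert(parts, word)
end
local fast = table.concat(parts, ", ")

print(fast)  -- prints "alpha, beta, gamma"
```

Besides saving memory, table.concat places the separator only between items, so no trailing separator has to be trimmed afterwards.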

To memoize a function that has one argument and one return value, you may be able to use the memoize function from Module:fun. Beware that its result does not work as the third argument (the replacement) to gsub, though, because it returns a table.
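A minimal sketch of the idea behind memoization. This is not the code of Module:fun: unlike the function described above, it returns a plain function rather than a table, and the names are hypothetical.

```lua
-- Returns a function that caches the results of func by argument.
-- Limitations of this sketch: nil arguments are not supported,
-- and nil results are not cached.
local function memoize(func)
	local cache = {}
	return function(arg)
		local result = cache[arg]
		if result == nil then
			result = func(arg)
			cache[arg] = result
		end
		return result
	end
end

local calls = 0
local function slow_square(n)
	calls = calls + 1  -- counts how often the real work is done
	return n * n
end

local square = memoize(slow_square)
local first = square(12)
local second = square(12)  -- served from the cache
print(first, second, calls)
```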

For Scribunto specifically, it increases efficiency to use the basic string functions instead of the mw.ustring ones when performing a large number of string operations. The Ustring functions are implemented mostly in PHP (see UstringLibrary.php on Phabricator); they must parse the string (a sequence of bytes) into codepoints, and the pattern-matching functions must also convert the Lua pattern into a PHP regex, before they are able to do their job. The basic string functions operate on bytes, so they eliminate this intermediate step. Read more on this below.

The mw.text.gsplit function uses the mw.ustring functions (source code). In many cases, the function string.gmatch can be used instead, and will be much faster. For example, to iterate over lines of wikitext, loop over string.gmatch with the pattern [^\n]+ instead of looping over mw.text.gsplit with a newline separator. The function mw.text.split can be replicated by using this same loop to insert the items into a table.
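A sketch of the line-by-line loop using only the basic string library. The sample text is made up, and the pattern is one reasonable choice rather than the only one.

```lua
local wikitext = "first line\nsecond line\nthird line"

-- Iterate over lines with the byte-based string library.
-- Note: [^\n]+ skips empty lines, which a true split would keep.
local lines = {}
for line in wikitext:gmatch("[^\n]+") do
	lines[#lines + 1] = line
end

print(#lines, lines[2])  -- prints 3 and "second line"
```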

Methods of creating an array (a table with consecutive integer keys starting at 1) vary in speed. The following three methods are ranked from fastest to slowest: keeping a separate counter variable and assigning to that index, assigning to the index one greater than the current length of the table, and calling table.insert. The first method is used in Module:table, which is used frequently enough that it needs to be as efficient as possible. The first method is fastest because all that must be done in each iteration is an addition operation and creation of an index. In the second, the length of the table must be newly calculated for each iteration. The function table.insert operates in a similar way to the second method, but it must first determine whether it has been supplied a third argument or not.
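The three methods can be written out as follows. This is an illustration with made-up data; it shows the shape of each method, not a timing measurement.

```lua
local source = { "a", "b", "c" }

-- Method 1: keep a separate counter.
local t1, i = {}, 0
for _, v in ipairs(source) do
	i = i + 1
	t1[i] = v
end

-- Method 2: recompute the length on every iteration.
local t2 = {}
for _, v in ipairs(source) do
	t2[#t2 + 1] = v
end

-- Method 3: table.insert.
local t3 = {}
for _, v in ipairs(source) do
	table.insert(t3, v)
end

print(table.concat(t1), table.concat(t2), table.concat(t3))
```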

Lua tables contain an array and a hash part. The array part takes less memory per field than the hash part, so using arrays rather than hashes is a good idea when memory is an issue. According to World of Warcraft wiki, array fields use 16 bytes, while hash fields use 40 bytes.

Both the array and the hash part have a size that is a power of two. For the array part, the size is the smallest power of two that is greater than or equal to the greatest index in the array part; for the hash part, the size is the smallest power of two that is greater than or equal to the number of elements in the hash part. The array part can only contain fields that are indexed by a positive integer (1, 2, 3, and so on), while the hash part contains fields with any type of index.

Array fields are added when an element in a table literal does not have an explicit index (for example, { "a", "b", "c", "d" } creates a table with an array part with a size of four), or, under certain conditions, when an element is added with the indexing operator (for example, t[5] = "e"). When a table literal contains explicit numerical indices, hash fields are added: for example, { [1] = "a", [2] = "b", [3] = "c", [4] = "d" } creates a table whose hash part contains four fields. (These fields may be shifted to the array part of the table if more fields are added to the table.)

In vanilla Lua and in Scribunto, there is no way to check how many fields in a table are in the array part and the hash part. The length operator doesn't check if positive integer–indexed fields are in the array part or the hash part.

Unicode
Lua itself does not understand Unicode; whereas there are more than a million possible Unicode characters, a "string" in Lua is just a sequence of bytes in the range 0–255. (Unfortunately, the Lua documentation refers to these bytes as "characters", but don't be deceived.)

To address this lack, the Scribunto extension does (at least) four things for us:


 * whenever any text is passed into a Lua module (e.g., as a template parameter), the original character-string is transformed into a byte-string using UTF-8. UTF-8 is a variable-width encoding: ASCII characters are transformed into just a single byte, while other Unicode characters are transformed into two, three, or four bytes.
 * the text returned by a Lua module is interpreted as UTF-8, and transformed back into a Unicode-character string. This means, for example, that if a module receives a bit of text and returns it unmodified, then all will be well.
   * Technical notes:
     * In the event that the string passed back from Lua is not valid UTF-8, invalid sequences will be replaced by the replacement character U+FFFD (�). The same is also done for some valid UTF-8 characters, such as many of the control characters in the range U+0000 to U+0020.
     * In addition to being UTF-8-decoded, the characters in the string will be modified so that they conform with the Normalization Form Canonical Composition (NFC). For further explanation, see below.
 * the source-code of a Scribunto module is encoded using UTF-8, so we can use Unicode characters inside Lua string literals.
 * the Scribunto extension includes a mw.ustring ("Unicode string") module, which is always available. This module provides UTF-8-aware analogues of Lua's built-in string functions. In essence, the functions in this module allow you to operate on a UTF-8-encoded byte-string as though it were still the original Unicode character-string.

Even so, when using the mw.ustring library, there are some caveats that you need to pay attention to. Although the library is capable of interpreting a sequence of several bytes as a single Unicode character, there may still be more than one Unicode character in a single logical character. For example, although я́ appears to us as a single logical character, it is really encoded as two distinct Unicode characters: the Cyrillic letter я (U+044F) followed by a combining acute accent (U+0301). Therefore, mw.ustring.len applied to я́ will actually return 2, not 1. More subtly, searching the unaccented letter я with mw.ustring.find for the set pattern [я́] also returns a valid result. This happens because the character class in the pattern actually contains two characters (the Cyrillic letter and the accent mark); the function searches for each character individually, and finds the first one (the Cyrillic letter).

MediaWiki converts Unicode characters to the canonical composition normalization form (NFC) when they are entered into a textbox or displayed on a page (see the MediaWiki page on Unicode normalization considerations). Among other things, this means that a bare letter plus a combining character changes to the composed form, if possible, and some individual characters are changed to a character with a similar appearance. For example, the two-character sequence a (U+0061) + ◌́ (combining acute, U+0301) becomes á (U+00E1), and a CJK Compatibility Ideograph changes into the corresponding character from one of the CJK Unified Ideograph blocks (U+F900 → 豈, U+8C48). To display characters that would otherwise be transformed, use HTML character references such as &#xF900;.

Beware of normalization forms when testing the output of module functions that return decomposed forms (NFD) using a module such as Module:UnitTests. Even if the "actual" and "expected" fields are identical when displayed on the page, they may be different in the module, in which case the tests will fail. (For instance, the "actual" field may have letter–combining character sequences while the "expected" field has the corresponding precomposed letter-with-diacritic characters.) Convert them to the same normalization form (NFC or NFD) using mw.ustring.toNFC or mw.ustring.toNFD to make sure that the comparison is done correctly.

Generating Unicode characters
There are several ways to type a Unicode character in a Lua module: add the character itself to a Lua string, add a decimal escape sequence representing the bytes in the UTF-8 encoding to a Lua string, or place the codepoint (in hexadecimal or decimal base) into mw.ustring.char. For example, the letter á (Latin small letter a with acute, codepoint U+00E1) can be entered as:


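 * 1) "á" (the character itself)
 * 2) "\195\161" (decimal escapes for the two UTF-8 bytes)
 * 3) mw.ustring.char(0xE1) (the codepoint in hexadecimal)
 * 4) mw.ustring.char(225) (the codepoint in decimal)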

The Scribunto extension currently uses Lua version 5.1 (with a few features from 5.2), so hexadecimal escape sequences (added in Lua 5.2) and Unicode escape sequences (added in Lua 5.3) are not supported. In Lua 5.3, the escape sequences "\xC3\xA1" and "\u{E1}" both yield the character á, while in Scribunto the backslashes are simply dropped, yielding the literal text xC3xA1 and u{E1}.

Byte sequences (method 2) should be avoided, because they are hard to read and write and susceptible to errors. They are different from codepoints: for instance, the byte sequence for the combining acute accent (displayed over a dotted circle: ◌́) is \204\129, or 0xCC 0x81 in hexadecimal base, while the codepoint is U+0301 (769 in decimal). There is no visible correspondence unless one looks at the individual bits. The byte sequence can be converted to the codepoint and vice versa, but that is difficult to do without a program.
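To make the relationship between codepoints and bytes concrete, here is a sketch of the conversion in plain Lua 5.1 arithmetic. The function name utf8_encode is hypothetical; on the wiki, mw.ustring.char already does this job, so the sketch is for illustration only.

```lua
-- Encodes a Unicode codepoint as a UTF-8 byte string.
-- Sketch only: no check for surrogates or values above 0x10FFFF.
local function utf8_encode(cp)
	local floor, char = math.floor, string.char
	if cp < 0x80 then
		return char(cp)
	elseif cp < 0x800 then
		return char(0xC0 + floor(cp / 0x40), 0x80 + cp % 0x40)
	elseif cp < 0x10000 then
		return char(0xE0 + floor(cp / 0x1000),
			0x80 + floor(cp / 0x40) % 0x40,
			0x80 + cp % 0x40)
	else
		return char(0xF0 + floor(cp / 0x40000),
			0x80 + floor(cp / 0x1000) % 0x40,
			0x80 + floor(cp / 0x40) % 0x40,
			0x80 + cp % 0x40)
	end
end

print(utf8_encode(0x301):byte(1, -1))  -- prints 204 and 129
```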

Although codepoints can be entered into mw.ustring.char using decimal base (method 4), hexadecimal base (method 3) is more recognizable, because that is the way codepoints are usually represented. For instance, U+00E1 stands for the letter á, and corresponds to the Lua code mw.ustring.char(0xE1).

Combining characters are best not entered on their own. For example, a combining acute accent typed directly between single or double quotation marks is very hard to read, because it displays on top of one of the quotes.

Strings are fed through Unicode composition normalization before being given to the invoked function as arguments, and also when returned as output. Consequently, strings may be modified on the way in and on the way out. For example, the two codepoints U+0061 and U+0301 (Latin lowercase a followed by combining acute accent) are automatically converted to the single codepoint U+00E1 (Latin lowercase a with acute accent, a single character). To analyse strings on a character-by-character basis, you need to do it within Lua using the mw.ustring functions; you cannot rely on the on-page output containing exactly the characters you returned.

String functions
Scribunto contains the basic Lua string functions and the mw.ustring functions. Some of the Ustring functions are copies of the basic string functions, others are equivalent functions that are modified to work with strings containing Unicode characters beyond the basic ASCII character set, and there are some new functions.

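The modified functions include mw.ustring.find, mw.ustring.match, mw.ustring.gmatch, mw.ustring.gsub, mw.ustring.len, mw.ustring.sub, mw.ustring.lower, and mw.ustring.upper.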

The basic Lua string functions look at bytes, while the Ustring functions look at codepoints encoded in UTF-8.

For the basic Lua functions, length means the number of bytes. Anything beyond basic ASCII will have a length greater than the number of displayed characters.
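The difference between counting bytes and counting codepoints can be sketched in plain Lua. The codepoint count below only approximates what a UTF-8-aware length function reports, and it assumes that the string is valid UTF-8.

```lua
-- я (U+044F) followed by a combining acute accent (U+0301),
-- written with decimal byte escapes so that the bytes are visible.
local s = "\209\143\204\129"

-- The basic length operator counts bytes.
local byte_length = #s

-- Counting every byte that is not a continuation byte (128 to 191)
-- gives the number of codepoints in valid UTF-8.
local _, codepoint_count = s:gsub("[^\128-\191]", "")

print(byte_length, codepoint_count)  -- prints 4 and 2
```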

Patterns
Note: The following section only holds true for the UTF-8 encoding, which is used on Wiktionary (as well as other MediaWiki projects). Other encodings follow different rules.

In the discussion below, ASCII refers to Unicode characters in the codepoint range U+0000 to U+007F. They are encoded as one byte each: bytes 0 to 127 (0xxxxxxx in binary). Non-ASCII refers to Unicode characters in the codepoint range U+0080 to U+10FFFF. They are encoded using two, three, or four bytes. The first byte (the leading byte) is in the range 194 to 244 (0xC2 to 0xF4) and the following one to three bytes (continuation bytes) are in the range 128 to 191 (0x80 to 0xBF). Hence ASCII is synonymous with single-byte, and non-ASCII with multi-byte.

Note also that different bytes are used in ASCII and non-ASCII, so it is easy to determine whether an arbitrary byte belongs to one or the other.

Basic string patterns
A pattern will behave identically in both the basic string and the Ustring functions if it fulfills certain conditions: it must only contain ASCII or simple sequences of non-ASCII characters. Thus patterns that consist entirely of ASCII (including ASCII sets and quantifiers), and patterns in which non-ASCII characters appear only as literal sequences, will all work correctly whether they are used in the basic string function or the Ustring function.

But quantifiers or sets containing non-ASCII characters will fail. They act on individual bytes, not characters. A set containing a non-ASCII character will match any one of the bytes in the encoding of the character. A quantifier will act on the last byte immediately before it.

For instance, in the basic Lua string functions, the quantified item á+ does not match a sequence of one or more of the character á (á, áá, ááá, and so on). The character á is a two-byte sequence, equivalent to the byte escape sequence \195\161, so the pattern á+ is really \195\161+, and it matches the byte \195 plus one or more of the byte \161: \195\161, \195\161\161, \195\161\161\161, and so on. (Only the first of these options is valid UTF-8. The rest contain stray continuation bytes, which would display as replacement characters (�) if they were published on a Wiktionary page, and they are unlikely ever to occur in a module.)

Similarly, the set [áé] does not match "the character á or the character é". Rather, it matches just one of the bytes used to encode á or é in UTF-8 (that is, it behaves as the byte set [\195\161\169]), and if it is applied to á (= \195\161), it will match only the first byte, \195.
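Both pitfalls can be reproduced with the basic string functions. In this sketch the patterns are assembled from byte escapes so that the bytes involved are explicit; the variable names are illustrative.

```lua
local a_acute = "\195\161"   -- á (U+00E1) in UTF-8
local e_acute = "\195\169"   -- é (U+00E9) in UTF-8

-- "á+" is really "\195\161+": the quantifier applies to the last byte only.
local quantified = a_acute .. "+"
local text = a_acute .. a_acute  -- "áá"
local m1 = text:match(quantified)
-- m1 is "\195\161": one á, not two

-- "[áé]" is really a set of the bytes 195, 161 and 169.
local set = "[" .. a_acute .. e_acute .. "]"
local m2 = a_acute:match(set)
-- m2 is "\195": only the leading byte of á

print(#m1, #m2)  -- prints 2 and 1
```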

See Module:User:Erutuon/patterns for a function that determines whether a pattern will behave in the same way in the basic string functions as in the Ustring functions.

Ustring patterns
The ustring functions fix these problems. They deal with codepoints rather than bytes. So any multi-byte sequence that encodes a Unicode character is considered as a unit.

The Ustring functions must be used if the pattern contains quantifiers acting on non-ASCII characters, character classes that are meant to find Unicode characters, or sets with non-ASCII characters. Using the basic string functions in these cases is likely to return incorrect results.

In the basic string functions, a set containing non-ASCII characters is equivalent to a set of the distinct bytes that encode those characters, once duplicates are removed and the bytes are sorted. When two characters in the set share a leading byte, adding the second one contributes only one "new" byte to the set: the second byte of its encoding.

To match a single UTF-8 character in the basic string functions, you can use the pattern [%z\1-\127\194-\244][\128-\191]*. For valid UTF-8, a basic string function used with this pattern gives the same result as the corresponding Ustring function used with the pattern . (a single dot). The basic string function will be faster, because it has no processing to do before it compares the string to the pattern, while the Ustring function has to parse both the string and the pattern into codepoints before matching them.
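A sketch of splitting a string into UTF-8 characters with the basic string library. It uses a variant of such a pattern that leaves out the NUL byte, so that the snippet does not depend on how a particular Lua version handles NUL in patterns; the sample string is made up.

```lua
-- One leading (or ASCII) byte followed by any continuation bytes.
-- NUL (byte 0) is left out of this sketch.
local char_pattern = "[\1-\127\194-\244][\128-\191]*"

local s = "a\195\161\204\129"  -- "a", "á", combining acute accent
local chars = {}
for ch in s:gmatch(char_pattern) do
	chars[#chars + 1] = ch
end

print(#chars)  -- prints 3
```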

Organizing Lua modules
Document Lua modules on a /documentation subpage. The documentation will appear at the top of the module page.

Categories cannot be entered into modules directly. Put a category on the documentation page, separated from the documentation by tags on the top and bottom:

(documentation)