For integration with duckling please visit Builtin Duckling. Duckling support more languages, but you will need to have an instance of duckling up and running.
Those are the languages supported using Duckling or not using it:
Language | Locale | Without Duckling | With Duckling |
---|---|---|---|
Arabic | ar | X | |
Armenian | hy | ||
Basque | eu | ||
Bengali | bn | X | |
Catala | ca | ||
Chinese | zh | X | X |
Czech | cs | X | |
Danish | da | X | |
Dutch | nl | X | |
English | en | X | X |
Farsi | fa | ||
Finnish | fi | X | |
French | fr | X | X |
Galician | gl | ||
German | de | X | |
Greek | el | X | |
Hindi | hi | X | |
Hungarian | hu | X | |
Indonesian | id | X | |
Italian | it | X | |
Irish | ga | X | |
Japanese | ja | X | X |
Korean | ko | X | |
Lithuanian | lt | ||
Malay | ms | X | |
Nepali | ne | X | |
Norwegian | no | X | |
Polish | pl | X | |
Portuguese | pt | X | X |
Romanian | ro | X | |
Russian | ru | X | |
Serbian | sr | ||
Spanish | es | X | X |
Swedish | sv | X | |
Tamil | ta | X | |
Thai | th | *1 | |
Tagalog | tl | ||
Turkish | tr | X | |
Ukrainian | uk | X |
*1: Thai is not supported by duckling, but there exists a repo in github with an implementation of the thai rules of duckling: https://github.com/pantuwong/thai_duckling
The NER Manager includes by default a builtin entity extraction with different bundles available for different languages. The entity extraction is done even if the utterance is not matched to an intent.
Builtin | English | French | Spanish | Portuguese | Chinese | Japanese | Other |
---|---|---|---|---|---|---|---|
X | X | X | X | X | X | X | |
Ip | X | X | X | X | X | X | X |
Hashtag | X | X | X | X | X | X | X |
Phone Number | X | X | X | X | X | X | X |
URL | X | X | X | X | X | X | X |
Number | X | X | X | X | X | X | see 1 |
Ordinal | X | X | X | X | X | X | |
Percentage | X | X | X | X | X | X | see 2 |
Dimension | X | X | X | X | X | X | see 3 |
Age | X | X | X | X | X | X | |
Currency | X | X | X | X | X | X | |
Date | X | X | X | X | see 4 | see 4 | see 4 |
Duration | X |
- 1: Only for non text numbers
- 2: Only for % symbol non text numbers
- 3: Only for dimension acronyms (km, s, km/h...) non text numbers
- 4: Only dd/MM/yyyy formats or similars, non text
- Email Extraction
- IP Extraction
- Hashtag Extraction
- Phone Number Extraction
- URL Extraction
- Number Extraction
- Ordinal Extraction
- Percentage Extraction
- Age Extraction
- Currency Extraction
- Date Extraction
- Duration Extraction
It can identify and extract valid emails accounts, this works for any language.
"utterance": "My email is [email protected] please write me",
"entities": [
{
"start": 12,
"end": 33,
"len": 22,
"accuracy": 0.95,
"sourceText": "[email protected]",
"utteranceText": "[email protected]",
"entity": "email",
"resolution": {
"value": "[email protected]"
}
}
]
It can identify and extract valid IPv4 and IPv6 addresses, this works for any language.
"utterance": "My ip is 8.8.8.8",
"entities": [
{
"start": 9,
"end": 15,
"len": 7,
"accuracy": 0.95,
"sourceText": "8.8.8.8",
"utteranceText": "8.8.8.8",
"entity": "ip",
"resolution": {
"value": "8.8.8.8",
"type": "ipv4"
}
}
]
"utterance": "My ip is ABEF:452::FE10",
"entities": [
{
"start": 9,
"end": 22,
"len": 14,
"accuracy": 0.95,
"sourceText": "ABEF:452::FE10",
"utteranceText": "ABEF:452::FE10",
"entity": "ip",
"resolution": {
"value": "ABEF:452::FE10",
"type": "ipv6"
}
}
]
It can identify and extract hashtags from the utterances, this works for any language.
"utterance": "Open source is great! #proudtobeaxa",
"entities": [
{
"start": 22,
"end": 34,
"len": 13,
"accuracy": 0.95,
"sourceText": "#proudtobeaxa",
"utteranceText": "#proudtobeaxa",
"entity": "hashtag",
"resolution": {
"value": "#proudtobeaxa"
}
}
]
It can identify and extract phone numbers from the utterances, this works for any language.
"utterance": "So here is my number +1 541-754-3010 callme maybe",
"entities": [
{
"start": 21,
"end": 35,
"len": 15,
"accuracy": 0.95,
"sourceText": "+1 541-754-3010",
"utteranceText": "+1 541-754-3010",
"entity": "phonenumber",
"resolution": {
"value": "+1 541-754-3010"
}
}
]
It can identify and extract URLs from the utterances, this works for any language.
"utterance": "The url is https://something.com",
"entities": [
{
"start": 11,
"end": 31,
"len": 21,
"accuracy": 0.95,
"sourceText": "https://something.com",
"utteranceText": "https://something.com",
"entity": "url",
"resolution": {
"value": "https://something.com"
}
}
]
It can identify and extract numbers. This works for any language, and the numbers can be integer or floats.
"utterance": "This is 12",
"entities": [
{
"start": 8,
"end": 9,
"len": 2,
"accuracy": 0.95,
"sourceText": "12",
"utteranceText": "12",
"entity": "number",
"resolution": {
"strValue": "12",
"value": 12,
"subtype": "integer"
}
}
]
The numbers can be also be text written, but this only works for: English, French, Spanish and Portuguese.
"utterance": "This is twelve",
"entities": [
{
"start": 8,
"end": 13,
"len": 6,
"accuracy": 0.95,
"sourceText": "twelve",
"utteranceText": "twelve",
"entity": "number",
"resolution": {
"strValue": "12",
"value": 12,
"subtype": "integer"
}
}
]
The text feature also works for fractions.
"utterance": "one over 3",
"entities": [
{
"start": 0,
"end": 9,
"len": 10,
"accuracy": 0.95,
"sourceText": "one over 3",
"utteranceText": "one over 3",
"entity": "number",
"resolution": {
"strValue": "0.333333333333333",
"value": 0.333333333333333,
"subtype": "float"
}
}
]
It can identify and extract ordinal numbers. This works only for English, Spanish, French and Portuguese.
"utterance": "He was 2nd",
"entities": [
{
"start": 7,
"end": 9,
"len": 3,
"accuracy": 0.95,
"sourceText": "2nd",
"utteranceText": "2nd",
"entity": "ordinal",
"resolution": {
"strValue": "2",
"value": 2,
"subtype": "integer"
}
}
]
The numbers can be written by text.
"utterance": "one hundred twenty fifth",
"entities": [
{
"start": 0,
"end": 23,
"len": 24,
"accuracy": 0.95,
"sourceText": "one hundred twenty fifth",
"utteranceText": "one hundred twenty fifth",
"entity": "ordinal",
"resolution": {
"strValue": "125",
"value": 125,
"subtype": "integer"
}
}
]
It can identify and extract percentages. If the percentage is indicated with the symbol % it works for any language.
"utterance": "68.2%",
"entities": [
{
"start": 0,
"end": 4,
"len": 5,
"accuracy": 0.95,
"sourceText": "68.2%",
"utteranceText": "68.2%",
"entity": "percentage",
"resolution": {
"strValue": "68.2%",
"value": 68.2,
"subtype": "float"
}
}
]
The percentage can be indicated by text, but it only works for English, French, Spanish and Portuguese.
"utterance": "68.2 percent",
"entities": [
{
"start": 0,
"end": 11,
"len": 12,
"accuracy": 0.95,
"sourceText": "68.2 percent",
"utteranceText": "68.2 percent",
"entity": "percentage",
"resolution": {
"strValue": "68.2%",
"value": 68.2,
"subtype": "float"
}
}
]
It can understand text numbers but only works for English, French, Spanish and Portuguese.
"utterance": "thirty five percentage",
"entities": [
{
"start": 0,
"end": 21,
"len": 22,
"accuracy": 0.95,
"sourceText": "thirty five percentage",
"utteranceText": "thirty five percentage",
"entity": "percentage",
"resolution": {
"strValue": "35%",
"value": 35,
"subtype": "integer"
}
}
]
It can identify and extract different dimensions, like length, distance, speed, volume, area,... If the international acronym of the dimension is used then it works in any language.
"utterance": "120km",
"entities": [
{
"start": 0,
"end": 4,
"len": 5,
"accuracy": 0.95,
"sourceText": "120km",
"utteranceText": "120km",
"entity": "dimension",
"resolution": {
"strValue": "120",
"value": 120,
"unit": "Kilometer",
"localeUnit": "Kilometer"
}
}
]
In instead of the acronym, the text of the dimension is used in a language, then it works in English, French, Spanish and Portuguese.
"utterance": "Está a 325 kilómetros de Bucarest",
"entities": [
{
"start": 7,
"end": 20,
"len": 14,
"accuracy": 0.95,
"sourceText": "325 kilómetros",
"utteranceText": "325 kilómetros",
"entity": "dimension",
"resolution": {
"strValue": "325",
"value": 325,
"unit": "Kilometer",
"localeUnit": "Kilómetro"
}
}
]
It can identify and extract ages. It works in English, French, Spanish and Portuguese. Take into account that several ways to say an age can be also confused with a duraction ("It will be 10 years" can be an age or a duration), so two overlaped entities, one age and one duration, can be returned.
"utterance": "This saga is ten years old",
"entities": [
{
"start": 13,
"end": 25,
"len": 13,
"accuracy": 0.95,
"sourceText": "ten years old",
"utteranceText": "ten years old",
"entity": "age",
"resolution": {
"strValue": "10",
"value": 10,
"unit": "Year",
"localeUnit": "Year"
}
},
{
"start": 13,
"end": 21,
"len": 9,
"accuracy": 0.95,
"sourceText": "ten years",
"utteranceText": "ten years",
"entity": "duration",
"resolution": {
"values": [
{
"timex": "P10Y",
"type": "duration",
"value": "315360000"
}
]
}
}
]
It can identify and extract currency values. It works in English, French, Spanish and Portuguese.
"utterance": "420 million finnish markka",
"entities": [
{
"start": 0,
"end": 25,
"len": 26,
"accuracy": 0.95,
"sourceText": "420 million finnish markka",
"utteranceText": "420 million finnish markka",
"entity": "currency",
"resolution": {
"strValue": "420000000",
"value": 420000000,
"unit": "Finnish markka",
"localeUnit": "Finnish markka"
}
}
]
It the used language is not english, the localeUnit contains the locale name of the currency.
"utterance": "420 millones de marcos finlandeses",
"entities": [
{
"start": 0,
"end": 33,
"len": 34,
"accuracy": 0.95,
"sourceText": "420 millones de marcos finlandeses",
"utteranceText": "420 millones de marcos finlandeses",
"entity": "currency",
"resolution": {
"strValue": "420000000",
"value": 420000000,
"unit": "Finnish markka",
"localeUnit": "Marco finlandés"
}
}
]
It can identify and extract dates, if provided in numeric format can work in any language, but take into account that the localization also affect to the date format.
"utterance": "28/10/2018",
"entities": [
{
"start": 0,
"end": 9,
"len": 10,
"accuracy": 0.95,
"sourceText": "28/10/2018",
"utteranceText": "28/10/2018",
"entity": "date",
"resolution": {
"type": "date",
"timex": "2018-10-28",
"strValue": "2018-10-28",
"date": "2018-10-28T00:00:00.000Z"
}
}
]
It can understand written date formats in English, French, Spanish and Portuguese.
"utterance": "Volveré el 12 de enero del 2019",
"entities": [
{
"start": 11,
"end": 30,
"len": 20,
"accuracy": 0.95,
"sourceText": "12 de enero del 2019",
"utteranceText": "12 de enero del 2019",
"entity": "date",
"resolution": {
"type": "date",
"timex": "2019-01-12",
"strValue": "2019-01-12",
"date": "2019-01-12T00:00:00.000Z"
}
}
]
It can understand partial dates. Then the timex contains the resolution, example, if I provide the day but not the month neither the year, then both year and month will be filled with X. Also, in this case, two possible dates will be returned: the past and the future. Also take into account that in cases like that, the resolution can also include a number, like in this example:
"utterance": "I'll go back on 15",
"entities": [
{
"start": 16,
"end": 17,
"len": 2,
"accuracy": 0.95,
"sourceText": "15",
"utteranceText": "15",
"entity": "number",
"resolution": {
"strValue": "15",
"value": 15,
"subtype": "integer"
}
},
{
"start": 16,
"end": 17,
"len": 2,
"accuracy": 0.95,
"sourceText": "15",
"utteranceText": "15",
"entity": "date",
"resolution": {
"type": "interval",
"timex": "XXXX-XX-15",
"strPastValue": "2018-08-15",
"pastDate": "2018-08-15T00:00:00.000Z",
"strFutureValue": "2018-09-15",
"futureDate": "2018-09-15T00:00:00.000Z"
}
}
]
When the grain resolution is not a day, it can be resolved not only with a past and future date, but also each date is an interval. Example: if we are resolving a date that is a month, like January, it will return the past and future januaries, but also each january is an interval from the day 1 of January until the day 1 of February, like in this example:
"utterance": "I'll be out in Jan",
"entities": [
{
"start": 15,
"end": 17,
"len": 3,
"accuracy": 0.95,
"sourceText": "Jan",
"utteranceText": "Jan",
"entity": "daterange",
"resolution": {
"type": "interval",
"timex": "XXXX-01",
"strPastStartValue": "2018-01-01",
"pastStartDate": "2018-01-01T00:00:00.000Z",
"strPastEndValue": "2018-02-01",
"pastEndDate": "2018-02-01T00:00:00.000Z",
"strFutureStartValue": "2019-01-01",
"futureStartDate": "2019-01-01T00:00:00.000Z",
"strFutureEndValue": "2019-02-01",
"futureEndDate": "2019-02-01T00:00:00.000Z"
}
}
]
It also identifies expecial dates, like Christmas:
"utterance": "I will return in Christmas",
"entities": [
{
"start": 17,
"end": 25,
"len": 9,
"accuracy": 0.95,
"sourceText": "Christmas",
"utteranceText": "Christmas",
"entity": "date",
"resolution": {
"type": "interval",
"timex": "XXXX-12-25",
"strPastValue": "2017-12-25",
"pastDate": "2017-12-25T00:00:00.000Z",
"strFutureValue": "2018-12-25",
"futureDate": "2018-12-25T00:00:00.000Z"
}
}
]
It can identify and extract duration intervals. It works currently in English only. The resolution is done in seconds, with a timex indicator. Example: "It will take me 5 minutes" the timex is "PT5M" meaning "Present Time 5 Minutes".
"utterance": "It will take me 5 minutes",
"entities": [
{
"start": 13,
"end": 21,
"len": 9,
"accuracy": 0.95,
"sourceText": "5 minutes",
"utteranceText": "5 minutes",
"entity": "duration",
"resolution": {
"values": [
{
"timex": "PT5M",
"type": "duration",
"value": "300"
}
]
}
}
]