Home
Welcome to the String-Transformers wiki!
String Transformers is a collection of Java classes which implement a transform method, taking a character string and changing it into another. They are useful as part of a process for matching character strings against each other and deciding whether the things they represent are the same. For example, in a messy database we probably want “Royal Botanic Gardens, Kew” to match against “royal botanic gardens kew”, “ROYAL BOTANIC GARDENS KEW” and maybe even “R.B.G. Kew”.
Some of the transformers are generic: CapitalLettersExtractor removes non-capital letters from a string. Others are geared towards handling scientific names, like StripBasionymAuthorTransformer.
To make these three strings match
- “Royal Botanic Gardens, Kew”
- “royal botanic gardens kew”
- “ROYAL BOTANIC GARDENS KEW”
we can apply a LowerCaseTransformer followed by a StripNonAlphanumericCharacters, which turns all three strings into “royal botanic gardens kew”.
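The chain above can be sketched in plain Java. This is only an illustration of the described behaviour, not the library's own code: the class NormaliseSketch and its methods are hypothetical stand-ins, and the exact handling of punctuation and whitespace (punctuation becomes a space, runs are collapsed) is an assumption.

```java
// Illustrative stand-ins only: these are NOT the String-Transformers classes.
final class NormaliseSketch {

    // Hypothetical stand-in for LowerCaseTransformer.
    static String lowerCase(String s) {
        return s.toLowerCase(java.util.Locale.ROOT);
    }

    // Hypothetical stand-in for StripNonAlphanumericCharacters.
    // Assumption: each run of non-alphanumeric characters becomes one space.
    static String stripNonAlphanumeric(String s) {
        return s.replaceAll("[^\\p{Alnum}]+", " ").trim();
    }

    // Apply the two steps in the order described in the text.
    static String normalise(String s) {
        return stripNonAlphanumeric(lowerCase(s));
    }

    public static void main(String[] args) {
        String[] inputs = {
            "Royal Botanic Gardens, Kew",
            "royal botanic gardens kew",
            "ROYAL BOTANIC GARDENS KEW"
        };
        for (String input : inputs) {
            // Each line printed is: royal botanic gardens kew
            System.out.println(normalise(input));
        }
    }
}
```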
If we also want to match “R.B.G. Kew”, we could use StripNonAlphanumericCharacters, then TitleCaseTransformer, then CapitalLettersExtractor, to end with “R B G K” in each case. This will also match “Rather big grey koala”, though, so we must be careful!
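A rough sketch of this three-step chain, again in plain Java rather than with the library's classes, shows both the intended matches and the koala collision. InitialsSketch and its methods are hypothetical names, and the details (punctuation replaced by spaces, capitals joined by single spaces) are assumptions chosen to reproduce the “R B G K” result quoted above.

```java
// Illustrative stand-ins only: these are NOT the String-Transformers classes.
final class InitialsSketch {

    // Assumption: each run of non-alphanumeric characters becomes one space.
    static String stripNonAlphanumeric(String s) {
        return s.replaceAll("[^\\p{Alnum}]+", " ").trim();
    }

    // Upper-case the first letter of every word and lower-case the rest.
    static String titleCase(String s) {
        StringBuilder out = new StringBuilder();
        for (String word : s.split(" ")) {
            if (word.isEmpty()) {
                continue;
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(Character.toUpperCase(word.charAt(0)))
               .append(word.substring(1).toLowerCase(java.util.Locale.ROOT));
        }
        return out.toString();
    }

    // Keep only the capital letters, separated by single spaces.
    static String capitalLetters(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (Character.isUpperCase(c)) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(c);
            }
        }
        return out.toString();
    }

    // Apply the three steps in the order described in the text.
    static String initials(String s) {
        return capitalLetters(titleCase(stripNonAlphanumeric(s)));
    }

    public static void main(String[] args) {
        System.out.println(initials("Royal Botanic Gardens, Kew"));
        System.out.println(initials("R.B.G. Kew"));
        System.out.println(initials("Rather big grey koala")); // the false match
    }
}
```

Because initials discard almost all of the original string, a chain like this is probably best reserved for cases where other evidence also supports the match.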
To transform these scientific names
- “Coffea sapinii”
- “Coffea sapini”
into the same string we might use an A2BTransformer with search pattern (A) (\\w)\\1, which means any letter (strictly, any word character) followed by the same character, and the replace pattern (B) $1, which means the first captured group (the character matched inside the brackets). This replaces each doubled letter with a single letter, so both names become “Cofea sapini”.
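The effect of this pattern can be checked with the standard String.replaceAll method. This sketch assumes A2BTransformer behaves like a plain regular-expression replace; DoubleLetterSketch is a hypothetical name. It also shows that (\\w)\\1 shortens a run of three identical letters to two, so collapsing longer runs in one pass would need a pattern such as (\\w)\\1+.

```java
// Illustrative only: assumes A2BTransformer acts like String.replaceAll.
final class DoubleLetterSketch {

    // Search pattern (A) and replacement (B) as given in the text.
    static String collapseDoubles(String s) {
        return s.replaceAll("(\\w)\\1", "$1");
    }

    // Variant (not from the text): collapse a run of any length to one character.
    static String collapseRuns(String s) {
        return s.replaceAll("(\\w)\\1+", "$1");
    }

    public static void main(String[] args) {
        System.out.println(collapseDoubles("Coffea sapinii")); // Cofea sapini
        System.out.println(collapseDoubles("Coffea sapini"));  // Cofea sapini
        System.out.println(collapseDoubles("aaa"));            // aa
        System.out.println(collapseRuns("aaa"));               // a
    }
}
```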
To transform these author strings
- “(De Wild.) A.P.Davis”
- “A.Davis”
we first use a RemoveBracketedTextTransformer to remove the often-absent basionym author. We could add an A2BTransformer to remove the initials, but there are specific transformers geared towards botany: here we want a SurnameExtractor, which should convert both author strings to “Davis”.
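As a minimal sketch of the two steps on these author strings, the following plain-Java code uses simple regular expressions. It is not how RemoveBracketedTextTransformer or SurnameExtractor are implemented: AuthorSketch and its methods are hypothetical, and the surname rule (drop leading initials of the form "X.") is a deliberately naive assumption that only covers the two examples here.

```java
// Illustrative stand-ins only: these are NOT the String-Transformers classes.
final class AuthorSketch {

    // Remove any text in round brackets, e.g. a basionym author.
    static String removeBracketedText(String s) {
        return s.replaceAll("\\([^)]*\\)", "").trim();
    }

    // Naive assumption: a surname is what remains after dropping leading
    // initials, each a single capital letter followed by a full stop.
    static String surname(String s) {
        return s.replaceAll("^(?:\\p{Lu}\\.\\s*)+", "");
    }

    // Apply the two steps in the order described in the text.
    static String normaliseAuthor(String s) {
        return surname(removeBracketedText(s));
    }

    public static void main(String[] args) {
        System.out.println(normaliseAuthor("(De Wild.) A.P.Davis")); // Davis
        System.out.println(normaliseAuthor("A.Davis"));              // Davis
    }
}
```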