opensearch-project · leanneeliatra · Sep 10, 2024 · Sep 17, 2024 · Sep 17, 2024 · Sep 18, 2024
@@ -0,0 +1,105 @@
+---
+layout: default
+title: HTML Character Filter
+parent: Character Filters
+nav_order: 100
+---
+
+# HTML strip character filter
+The `html_strip` character filter removes HTML elements from the input text and returns only the visible text.
+
+The filter identifies and removes all HTML tags, such as `<div>`, `<p>`, and `<a>`, and decodes HTML entities, such as `&nbsp;`, into the characters they represent. It can also be configured to preserve specific tags by using the `escaped_tags` parameter.
+
+## Example of the `html_strip` character filter
+```
+GET /_analyze
+{
+  "tokenizer": "keyword",
+  "char_filter": [
+    "html_strip"
+  ],
+  "text": "<p>Commonly used calculus symbols include &alpha;, &beta; and &theta; </p>"
+}
+```
+The `html_strip` character filter removes the `<p>` tags and converts the HTML character entity references into their corresponding symbols. The processed text reads as follows:
+
+```
+Commonly used calculus symbols include α, β and θ 
+```
+
+## Example of a custom analyzer 
+
+The following request creates a custom analyzer that strips HTML tags by using the `html_strip` character filter and then converts the remaining text to lowercase by using the `lowercase` token filter:
+```
+PUT /html_strip_and_lowercase_analyzer
+{
+  "settings": {
+    "analysis": {
+      "char_filter": {
+        "html_filter": {
+          "type": "html_strip"
+        }
+      },
+      "analyzer": {
+        "html_strip_analyzer": {
+          "type": "custom",
+          "char_filter": ["html_filter"],
+          "tokenizer": "standard",
+          "filter": ["lowercase"]
+        }
+      }
+    }
+  }
+}
+```
+### Testing our `html_strip_and_lowercase_analyzer`
+```
+GET /html_strip_and_lowercase_analyzer/_analyze
+{
+  "analyzer": "html_strip_analyzer",
+  "text": "<h1>Welcome to <strong>OpenSearch</strong>!</h1>"
+}
+```
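+Because the analyzer uses the `standard` tokenizer, the stripped text is split into the following tokens:
+```
+welcome
+to
+opensearch
+```
+The HTML tags have been removed, the tokens are in lowercase, and the `standard` tokenizer has dropped the exclamation mark.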
+
+## Example of a custom analyzer preserving HTML tags
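+The following request creates a custom analyzer that strips HTML tags but preserves the `<b>` and `<i>` tags listed in `escaped_tags`: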
+```
+PUT /html_strip_preserve_analyzer
+{
+  "settings": {
+    "analysis": {
+      "char_filter": {
+        "html_filter": {
+          "type": "html_strip",
+          "escaped_tags": ["b", "i"]
+        }
+      },
+      "analyzer": {
+        "html_strip_analyzer": {
+          "type": "custom",
+          "char_filter": ["html_filter"],
+          "tokenizer": "keyword"
+        }
+      }
+    }
+  }
+}
+```
+### Testing the `html_strip_preserve_analyzer`  
+```
+GET /html_strip_preserve_analyzer/_analyze
+{
+  "analyzer": "html_strip_analyzer",
+  "text": "<p>This is a <b>bold</b> and <i>italic</i> text.</p>"
+}
+```
+The `<b>` and `<i>` tags are retained because they are listed in `escaped_tags` in the custom analyzer, while the `<p>` tags are removed:
+```
+This is a <b>bold</b> and <i>italic</i> text.
+```
@@ -0,0 +1,23 @@
+---
+layout: default
+title: Character Filters
+nav_order: 90
+has_children: true
+has_toc: false
+---
+
+# Character filters
+
+Character filters process the text before tokenization, modifying or cleaning the input to prepare it for further analysis. 
+
+Unlike token filters, which operate on tokens (words or terms), character filters work on the raw input text before tokenization. They are especially useful for cleaning or transforming structured text with unwanted characters, like HTML tags or special symbols. Character filters help strip or replace these elements, ensuring the text is properly formatted for analysis.
+
+Use cases for character filters include:
+## HTML stripping
+Removing HTML tags from content, ensuring only the visible text is indexed. See [HTML stripping]({{site.url}}{{site.baseurl}}/analyzers/html-character-filter) for more information.
+
+## Pattern replacement
+Replacing or removing unwanted characters or patterns in text (for example, converting hyphens to spaces).
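+
+As a rough illustration of this use case, the following request is a minimal sketch. It assumes the built-in `pattern_replace` character filter and its `pattern` and `replacement` parameters, none of which are named on this page, and the sample text is made up. The `keyword` tokenizer keeps the filtered text as a single token so that the effect of the filter is easy to see:
+```
+GET /_analyze
+{
+  "tokenizer": "keyword",
+  "char_filter": [
+    {
+      "type": "pattern_replace",
+      "pattern": "-",
+      "replacement": " "
+    }
+  ],
+  "text": "state-of-the-art"
+}
+```
+With these settings, the returned token should read `state of the art`.
+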
+## Custom mappings
+Substituting specific characters or sequences with other values, such as converting currency symbols into their textual equivalents.
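+
+A comparable sketch for this use case assumes the built-in `mapping` character filter and its `mappings` parameter, which are not named on this page. The two mapping rules and the sample text are illustrative only:
+```
+GET /_analyze
+{
+  "tokenizer": "keyword",
+  "char_filter": [
+    {
+      "type": "mapping",
+      "mappings": [
+        "$ => dollar",
+        "€ => euro"
+      ]
+    }
+  ],
+  "text": "Price in € or $"
+}
+```
+With these rules, the returned token should read `Price in euro or dollar`.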
+