Allow parsers to define parameters for URL normalization
Allow parsers to define URL parameters for normalization. The provided
parameters will be stripped from the URL.
jayjay-w committed Jun 11, 2024
1 parent 09bcc8c commit 6b5e9e5
Showing 4 changed files with 54 additions and 0 deletions.
10 changes: 10 additions & 0 deletions README.md
@@ -454,6 +454,16 @@ To enable sampling for Honeycomb, set the following configuration (either in `co

**Note**: If sampling behavior is changed in Pender, we will also need to update the behavior to match in any other application reporting to Honeycomb. More [here](https://docs.honeycomb.io/getting-data-in/opentelemetry/ruby/#sampling)

### URL Parameters Normalization

Some service providers include URL parameters for tracking purposes that can be safely removed. Pender parsers can define a list of such parameters to be removed during the URL normalization process.

To define them, a parser class implements the `urls_parameters_to_remove` method, which returns an array of strings. Any query parameter whose name contains one of these strings is stripped from the URL. For example:

```ruby
def urls_parameters_to_remove
  ['igsh']
end
```
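
The effect of such a list on a URL can be sketched outside Pender with Ruby's standard `uri` library. This is a minimal illustration, not the project's implementation: `ExampleParser` and `strip_parameters` are hypothetical names, and only `urls_parameters_to_remove` comes from this change. It follows the same substring matching on parameter names.

```ruby
require 'uri'

# Hypothetical stand-in for a parser class; only the method name
# `urls_parameters_to_remove` is taken from this change.
class ExampleParser
  def self.urls_parameters_to_remove
    ['igsh']
  end
end

# Hypothetical helper: drops every query parameter whose name contains
# one of the given strings and returns the resulting URL.
def strip_parameters(url, params_to_remove)
  uri = URI.parse(url)
  pairs = URI.decode_www_form(uri.query || '')
  kept = pairs.reject do |key, _value|
    params_to_remove.any? { |param| key.include?(param) }
  end
  # A nil query removes the '?' from the URL entirely.
  uri.query = kept.empty? ? nil : URI.encode_www_form(kept)
  uri.to_s
end

puts strip_parameters('https://www.instagram.com/p/xyz/?igshid=1&hl=en',
                      ExampleParser.urls_parameters_to_remove)
# prints https://www.instagram.com/p/xyz/?hl=en
```

Because matching is by substring, the entry `'igsh'` removes both `igsh` and `igshid`, while unrelated parameters such as `hl` are kept.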

#### Environment overrides

32 changes: 32 additions & 0 deletions app/models/media.rb
@@ -64,6 +64,7 @@ def initialize(attributes = {})
    self.follow_redirections
    self.url = RequestHelper.normalize_url(self.url) unless self.get_canonical_url
    self.try_https
    self.remove_parser_specific_parameters
    self.parser = nil
  end

@@ -275,6 +276,37 @@ def try_https
    end
  end

  # Strips query parameters that the matching parser marks as removable
  # (for example, tracking parameters) from the media URL.
  def remove_parser_specific_parameters
    parser_class = self.class.find_parser_class(self.url)
    return unless parser_class&.respond_to?(:urls_parameters_to_remove)

    params_to_remove = parser_class.urls_parameters_to_remove
    # Skip URI parsing when none of the strings occur anywhere in the URL.
    return unless params_to_remove.any? { |param| self.url.include?(param) }

    uri = URI.parse(self.url)
    query_params = URI.decode_www_form(uri.query || '').to_h

    # Matching is by substring, so 'igsh' also removes 'igshid'.
    params_to_remove.each do |param|
      query_params.keys.each do |key|
        query_params.delete(key) if key.include?(param)
      end
    end

    new_query = query_params.empty? ? nil : URI.encode_www_form(query_params)
    uri.query = new_query

    result_url = uri.to_s
    # Keep the trailing slash if the original URL ended with one.
    result_url += '/' if url.end_with?('/') && !result_url.end_with?('/')
    self.url = result_url
  end

  # Returns the first parser whose patterns match the URL, or nil if none does.
  def self.find_parser_class(url)
    PARSERS.each do |parser|
      return parser if parser.patterns.any? { |pattern| pattern.match?(url) }
    end
    nil
  end

def get_html(header_options = {}, force_proxy = false)
RequestHelper.get_html(self.url, self.method(:set_error), header_options, force_proxy)
end
4 changes: 4 additions & 0 deletions app/models/parser/instagram_item.rb
@@ -13,6 +13,10 @@ def type
      def patterns
        [INSTAGRAM_ITEM_URL, INSTAGRAM_HOME_URL]
      end

      def urls_parameters_to_remove
        ['igsh']
      end
    end

private
8 changes: 8 additions & 0 deletions test/models/media_test.rb
@@ -618,4 +618,12 @@ def teardown
    assert_equal "201", response.code
    assert_equal 'fake response body', response.body
  end

  test 'should remove parser specific URL parameters' do
    url = 'https://www.instagram.com/p/xyz/?igshid=1'
    WebMock.stub_request(:any, url).to_return(status: 200, body: 'fake response body')

    media = Media.new(url: url)
    assert_not_includes media.url, 'igshid=1'
  end
end
