Releases: crawler-commons/url-frontier
URLFrontier 2.4
What's Changed
This release fixes some bugs (see detailed list below) and adds the following notable changes:
- Ignite backend has been removed
- Two new API methods have been added:
  - GetURLStatus to retrieve information about a particular URL in the frontier
  - ListURLs to retrieve all URLs known in the frontier
Detailed list of changes
- PutURLs calls failing with java.lang.IllegalStateException #77 by @michaeldinzinger in #78
- Modifying the delete functionality for ignite.purge (Issue #74) by @michaeldinzinger in #81
- Modifying functionality for rocksdb.purge to use abstract method by @michaeldinzinger in #82
- Preventing IgniteHeartbeat race condition on the start of IgniteService (Issue #72) by @michaeldinzinger in #79
- Fix mistake for config forwarding in constructor DistributedFrontierS… by @michaeldinzinger in #83
- Making the metrics variables for GetURLs protected by @michaeldinzinger in #84
- Bump grpc-protobuf from 1.50.2 to 1.53.0 in /API by @dependabot in #85
- Bump ch.qos.logback:logback-classic from 1.4.4 to 1.4.12 in /service by @dependabot in #89
- Ability to set URL limit for specific domain by @zaibacu in #91
- Add method to get URL Status (returns an URLItem) by @klockla in #92
- adds dependabot by @jnioche in #97
- Fix #87, similar behaviour re-blocking whether specifying a key or not by @jnioche in #99
- Add method ListURLs to list all URLs known in the frontier with their next fetch date by @klockla in #93
- Remove Ignite, implements #96 by @jnioche in #100
- Added test case for discussion #94 by @klockla in #101
- Better organisation of Maven plugins and update their version by @jnioche in #102
- Reformat some files which were not passing mvn verify by @klockla in #104
- Improved handling of dependencies and updated their versions by @jnioche in #105
- Reverted back to GRPC 1.66.0 and updated protoc to 3.25.5 by @klockla in #106
- Prerelease 2.4 by @jnioche in #107
New Contributors
- @michaeldinzinger made their first contribution in #78
- @zaibacu made their first contribution in #91
- @klockla made their first contribution in #92
Full Changelog: 2.3.1...2.4
URLFrontier 2.3.1
What's new in URLFrontier 2.3
What's Changed
- Use multiple threads for putting URLs by @jnioche in #65
- Multithread reading from queues by @jnioche in #66
- Bugfix Exception when trying to delete a non existent crawl #70
- Dependency updates #68
- Bugfix ShardedRocksDBService does not return ack when identical URLs are sent in short succession #67
- Batch write operations #64
Multithreading the writes gives a gain of up to 54%, while querying the queues in parallel improves performance by up to 8x.
Full Changelog: 2.2...2.3
What's new in URLFrontier 2.2
What's Changed
- ShardedRocksDBService by @jnioche in #56
- RocksDB backend - faster restarts, implements #54 by @jnioche in #60
- DeleteCrawl - remove default value and need explicit crawlID
- Bugfix ID not returned by AbstractFrontierService
- PutURLs uses crawlid and URL as id for the URLInfo
- Getting a stack dump when closing RocksDB at the end of the service #17
Full Changelog: 2.1...2.2
What's new in URLFrontier 2.1
This release fixes some bugs (see below) and adds the following changes.
- Java 11 #51
- Faster recovery for RocksDB service implementation, fix #52
- Add messageID to URLItem #53
Dependency upgrades
- Prometheus 0.15.0
- Ignite 2.13.0
- RocksDB 7.2.2
- Lucene 9.1.0
Bugfixes
- Exception caught when deleting queue #55
- add missing removal listener to DistributedFrontierService
- ListNodes throwing NPE
Full Changelog: urlfrontier-2.0...2.1
What's new in URLFrontier 2.0
This is the first release named 2.x and a major step towards URL Frontier 2, which is being funded through the NGI0 Discovery Fund.
The main goal of this release was to introduce the concept of a distributed frontier in the API and to provide an implementation of the service that can work in a distributed fashion. For the latter, we implemented a service based on Apache Ignite. Ignite handles the detection of nodes, replication and failure management, as well as key-value storage. In addition, we used Apache Lucene for ordering and accessing the URLs within a queue.
The main changes to the API are the addition of the listNodes endpoint, which returns the list of nodes in the Frontier cluster, and the addition of a local field in most of the messages used by the API. This field determines whether the corresponding action (e.g. GetStats) should be applied to the cluster as a whole (the default) or only to the targeted node.
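The effect of the local field can be pictured with a small, self-contained sketch. It does not use the URLFrontier API: the class LocalFlagSketch, its methods and the node addresses are all hypothetical, and the per-node queue counts are made-up values that only illustrate the "whole cluster by default, targeted node when local is set" behaviour described above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical model of a frontier cluster; NOT the URLFrontier implementation. */
final class LocalFlagSketch {

    /** Number of queues held by each node, keyed by an illustrative node address. */
    private final Map<String, Integer> queuesPerNode = new LinkedHashMap<>();

    void addNode(String address, int queueCount) {
        queuesPerNode.put(address, queueCount);
    }

    /**
     * Counts queues as seen through the targeted node.
     * local == true  : only the targeted node is consulted.
     * local == false : the whole cluster is consulted (the default in the text above).
     */
    int countQueues(String targetNode, boolean local) {
        if (local) {
            Integer count = queuesPerNode.get(targetNode);
            if (count == null) {
                throw new IllegalArgumentException("Unknown node: " + targetNode);
            }
            return count;
        }
        int total = 0;
        for (int count : queuesPerNode.values()) {
            total += count;
        }
        return total;
    }

    public static void main(String[] args) {
        LocalFlagSketch cluster = new LocalFlagSketch();
        cluster.addNode("node-a:7071", 3);
        cluster.addNode("node-b:7071", 5);
        System.out.println("local only: " + cluster.countQueues("node-a:7071", true));
        System.out.println("cluster:    " + cluster.countQueues("node-a:7071", false));
    }
}
```

In this model the same request sent to the same node gives a different answer depending only on the flag, which is the distinction the local field is meant to express.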
The other two implementations of the service (Memory and RocksDB) work as previously.
The next releases will be focusing on robustness and resilience.
What's new in URLFrontier 1.2
This is the second step of work towards URL Frontier 2, which is being funded through the NGI0 Discovery Fund.
This release fixes a bug introduced in version 1.1 and adds the following functionality.
The service implementation takes a parameter -s, the value of which is used as the port number on which metrics are exposed for Prometheus.
See README for instructions on how to set up monitoring for URLFrontier with Grafana, Loki and Prometheus.
The API and client code remain unchanged from the previous version. Only the service implementation is affected.
What's new in URLFrontier 1.1
This is the initial work towards URL Frontier 2, which is being funded through the NGI0 Discovery Fund.
Please note that the service implementation is now available from Maven, making it easier to write standalone service implementations to extend it.
Logging configuration
Logging is done with Logback. The default configuration writes logs to the console at INFO level and above. It can be overridden by specifying a configuration file when launching a frontier service, e.g.
java -Dlogback.configurationFile=log-conf.xml ...
The API also has a new endpoint SetLogLevel, which allows changing the level of the logs generated by a running frontier service dynamically. The changes are not persisted between runs of the service.
This is typically done using the CLI:
Usage: Client SetLogLevel [-l=STRING] -p=STRING
Change the log level of a package in the Frontier service
  -l, --level=STRING     Log level [TRACE, DEBUG, INFO, WARN, ERROR]
  -p, --package=STRING   package name
For instance, the command
java -jar ~/urlfrontier-client-*.jar SetLogLevel -p crawlercommons.urlfrontier.service -l DEBUG
asks the Frontier to generate logs at level DEBUG for any class within the crawlercommons.urlfrontier.service package.
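Setting a level on a package name affects every class below it because loggers form a hierarchy by name. The sketch below illustrates that idea only. It deliberately uses java.util.logging from the JDK instead of Logback so that it runs without extra dependencies, and it is not how the Frontier service implements SetLogLevel. The class LogLevelSketch, the level mapping and the class name SomeClass are assumptions made for the example.

```java
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Illustration with java.util.logging (a stand-in for Logback); NOT the service code. */
final class LogLevelSketch {

    // java.util.logging only keeps weak references to loggers, so the
    // configured ones are pinned here to keep their level from being lost.
    private static final Map<String, Logger> PINNED = new ConcurrentHashMap<>();

    /** Assumed mapping from the CLI level names to java.util.logging levels. */
    static Level toJulLevel(String name) {
        switch (name.toUpperCase(Locale.ROOT)) {
            case "TRACE": return Level.FINEST;
            case "DEBUG": return Level.FINE;
            case "INFO":  return Level.INFO;
            case "WARN":  return Level.WARNING;
            case "ERROR": return Level.SEVERE;
            default: throw new IllegalArgumentException("Unknown level: " + name);
        }
    }

    /** Sets the level on the logger named after the package and returns that logger. */
    static Logger setLogLevel(String packageName, String levelName) {
        Logger logger = Logger.getLogger(packageName);
        logger.setLevel(toJulLevel(levelName));
        PINNED.put(packageName, logger);
        return logger;
    }

    public static void main(String[] args) {
        setLogLevel("crawlercommons.urlfrontier.service", "DEBUG");
        // A logger below the package has no level of its own and inherits the package level.
        Logger child = Logger.getLogger("crawlercommons.urlfrontier.service.SomeClass");
        System.out.println("DEBUG enabled for child: " + child.isLoggable(Level.FINE));
    }
}
```

As in the service, a change made this way lives only in the running process and is gone after a restart.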
Multi-tenancy with crawlIDs
A Frontier instance can now support multi-tenancy through the concept of a crawlID, which lets it handle logical crawls separately, e.g. a generic crawl vs. specific ones. This affects nearly every endpoint in the API as well as the service implementation.
Please note that these changes are not backward compatible; as a result, an existing frontier generated with a version < 1.1 cannot be loaded with URLFrontier 1.1 and above.
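One way to picture the separation that crawlIDs provide is a two-level lookup: first the crawl, then the queue within that crawl. The sketch below is a minimal in-memory model under that assumption. The class CrawlIdSketch and its methods put, next and dropCrawl are hypothetical names chosen for the example and do not correspond to the URLFrontier API or its storage layout.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory model of crawlID-based multi-tenancy; NOT the service code. */
final class CrawlIdSketch {

    // crawlID -> (queue key -> URLs waiting in that queue)
    private final Map<String, Map<String, Deque<String>>> crawls = new HashMap<>();

    /** Adds a URL to the queue identified by the pair (crawlId, queueKey). */
    void put(String crawlId, String queueKey, String url) {
        crawls.computeIfAbsent(crawlId, id -> new HashMap<>())
              .computeIfAbsent(queueKey, key -> new ArrayDeque<>())
              .addLast(url);
    }

    /** Returns and removes the next URL of a queue, or null if there is none. */
    String next(String crawlId, String queueKey) {
        Map<String, Deque<String>> queues = crawls.get(crawlId);
        if (queues == null) {
            return null;
        }
        Deque<String> queue = queues.get(queueKey);
        return queue == null ? null : queue.pollFirst();
    }

    /** Removes a whole crawl and returns how many URLs were dropped with it. */
    int dropCrawl(String crawlId) {
        Map<String, Deque<String>> removed = crawls.remove(crawlId);
        if (removed == null) {
            return 0;
        }
        int dropped = 0;
        for (Deque<String> queue : removed.values()) {
            dropped += queue.size();
        }
        return dropped;
    }

    public static void main(String[] args) {
        CrawlIdSketch frontier = new CrawlIdSketch();
        frontier.put("generic", "example.com", "https://example.com/a");
        frontier.put("news", "example.com", "https://example.com/b");
        // The same queue key under two crawlIDs yields two independent queues.
        System.out.println(frontier.next("generic", "example.com"));
        System.out.println(frontier.next("news", "example.com"));
    }
}
```

Because the crawlID is part of every lookup in this model, data written without one has no place in the structure, which gives an intuition for why the change is not backward compatible.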
Two new endpoints have been added to the API in order to deal with crawls as a whole: