[pytx][ncmec] NCMEC fetch implementation unable to make progress #1679
Labels
ncmec
Pertaining to the NCMEC Hash API or cybertips
python-threatexchange
Items related to the threatexchange python tool / library
Some NCMEC environments contain data with a large number of records that all share the same one-second timestamp. The current logic keeps fetching until it has enough data to advance the checkpoint, whose smallest granularity is currently one second. Unfortunately, the amount of data that must be fetched is quite large in some cases, and it frequently overwhelms storage solutions (especially on HMA).
The fix is to store the "next" URL in the checkpoint when the fetch granularity is already down to one second, so that a later fetch can resume from the middle of that second. It's unclear how the NCMEC database will behave in this circumstance (if pagination is based on an offset, this may cause records to be skipped in some cases).
If we want to be doubly defensive, we can invalidate the stored next URL if it was saved more than about a day ago.
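A rough sketch of what the proposal might look like, not the actual python-threatexchange code. The names `SketchCheckpoint`, `resume_url`, `checkpoint_after_page`, and the one-day `ONE_DAY_SECONDS` cutoff are all hypothetical and exist only for illustration; the real checkpoint type and fetch loop may be shaped differently.

```python
import time
from dataclasses import dataclass
from typing import Optional

ONE_DAY_SECONDS = 24 * 60 * 60  # assumed staleness cutoff ("~1 day")


@dataclass
class SketchCheckpoint:
    """Hypothetical checkpoint; not the real library class."""

    max_timestamp: int  # last fully fetched timestamp, in seconds
    next_url: Optional[str] = None  # pagination URL saved mid-second
    next_url_saved_at: Optional[int] = None  # when next_url was stored

    def resume_url(
        self, now: Optional[int] = None, max_age: int = ONE_DAY_SECONDS
    ) -> Optional[str]:
        """Return the stored next URL, or None if it is absent or stale."""
        if self.next_url is None or self.next_url_saved_at is None:
            return None
        current = int(time.time()) if now is None else now
        if current - self.next_url_saved_at > max_age:
            return None  # too old: fall back to refetching the window
        return self.next_url


def checkpoint_after_page(
    prev: SketchCheckpoint,
    window_start: int,
    window_end: int,
    page_next_url: Optional[str],
    now: int,
) -> SketchCheckpoint:
    """Decide what to persist after fetching one page of a time window."""
    if page_next_url is None:
        # The window is fully consumed, so the timestamp can advance.
        return SketchCheckpoint(max_timestamp=window_end)
    if window_end - window_start <= 1:
        # Already at one-second granularity: save the next URL instead
        # of buffering every remaining record for this second.
        return SketchCheckpoint(window_start, page_next_url, now)
    # Wider window with more pages: no progress can be recorded yet.
    return prev
```

With this shape, a fetch that is interrupted partway through a crowded second can persist its position and pick up from `resume_url()` on the next run, rather than holding the whole second in storage. Whether resuming from a saved URL is safe still depends on how the NCMEC API paginates.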