Core_API
JSON-RPC API
Results can take two forms :
- success : {code: 'success', result: json_object}
- error : {code: 'fail', message: error_string}
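A client usually has to branch on the `code` field before using an answer. The helper below is a minimal sketch of that check, assuming the answer has already been decoded into a Python dict; `unwrap_result` and `CoreAPIError` are hypothetical names and not part of the core API.

```python
class CoreAPIError(Exception):
    """Raised when an answer carries code 'fail' (hypothetical helper class)."""


def unwrap_result(response):
    """Return the `result` of a successful answer, raise on a failed one."""
    code = response.get("code")
    if code == "success":
        return response.get("result")
    if code == "fail":
        # The error form carries a human-readable string in `message`.
        raise CoreAPIError(response.get("message", "unknown error"))
    # Anything else does not match either documented form.
    raise ValueError("unexpected answer format: %r" % (response,))
```

For instance, `unwrap_result({"code": "success", "result": "pong"})` returns `"pong"`, while an answer with `code: 'fail'` raises `CoreAPIError` with the message as its text.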
-
ping : answers pong when core API is alive
-
get_status : returns statistics and information on the core's general status and loops
-
reinitialize : reinitializes all databases, crawl jobs and the memory structure
-
listjobs : returns the list of crawling jobs past, running and pending
-
refreshjobs : updates and returns the list of crawling jobs past, running and pending
-
lookup_httpstatus : checks a webpage's existence and returns its http code status
- url : string url of the looked up webpage
- timeout : integer number of seconds to allow for lookup (default : 2)
-
lookup : checks a webpage's existence and returns the boolean True when the webpage exists or is a redirection (httpstatus = 200, or > 300 and < 400), False otherwise
- url : string url of the looked up webpage
- timeout : integer number of seconds to allow for lookup (default : 2)
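The rule `lookup` applies to the http status can be written as a one-line predicate. This is only a sketch of the rule as worded above, not the core's actual code; `status_means_page_exists` is a hypothetical name. Read literally, the bounds are strict, so 300 and 400 themselves yield False.

```python
def status_means_page_exists(http_status):
    """True for 200 or for a status strictly between 300 and 400."""
    return http_status == 200 or 300 < http_status < 400
```

With this reading, 200, 301 and 302 count as existing pages, while 404 and 500 do not.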
-
declare_pages : adds pages to the memory structure and, if necessary, creates webentities from these pages based on the default creation rule. Returns the webentities
- list_urls : array of strings of the webpages urls
-
declare_page : adds a page to the memory structure and, if necessary, creates a webentity from this page based on the default creation rule. Returns the webentity
- url : string url of the declared webpage
-
crawl_webentity : schedules the future crawl of a webentity
- webentity_id : string id of the webentity to crawl from Memory Structure
- maxdepth : integer maximum depth crawling value (default : None, will apply main config's mongo-scrapy default maxdepth value)
- all_pages_as_startpoints : boolean, True to use all existing pages from the webentity in the memory structure as crawl's starting points instead of the webentity's startpages (default : False)
-
crawl.reinitialize : cleans the list of jobs, empties the crawled results in the mongo database and cancels all pending crawls
-
crawl.start : schedules a crawl for a webentity
- webentity_id : string id of the webentity to crawl from memory structure
- starts : array of strings of the crawl's starting points urls
- follow_prefixes : array of strings of LRU prefixes to follow within the crawled links until maxdepth is reached (usually set to the webentity's LRU prefixes)
- nofollow_prefixes : array of strings of LRU prefixes to not follow within the crawled links (usually set to the list of LRU prefixes of a webentity's subwebentities)
- discover_prefixes : array of strings of LRU prefixes to follow for redirections (default : config['discoverPrefixes'])
- maxdepth : maximum depth of links to follow from the starting points (default : config['mongo-scrapy']['maxdepth'])
- download_delay : integer number of seconds to wait between two consecutive requests on the same domain name (default : config['mongo-scrapy']['download_delay'])
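To make the parameter list concrete, here is a sketch of how a `crawl.start` call might be serialized. Several things are assumptions rather than facts from this page: the envelope shape (`method`, `params`, `id`), passing parameters positionally in the order listed above, the placeholder webentity id, the example.com URL and the illustrative LRU string. `build_request` is a hypothetical helper; the trailing parameters are omitted so that their documented defaults would apply.

```python
import json


def build_request(method, params=(), request_id=1):
    """Serialize a call as a JSON string (assumed envelope: method, params, id)."""
    return json.dumps({"method": method, "params": list(params), "id": request_id})


payload = build_request(
    "crawl.start",
    [
        "WEBENTITY_ID",                # webentity_id (placeholder value)
        ["http://example.com/"],       # starts (hypothetical URL)
        ["s:http|h:com|h:example|"],   # follow_prefixes (illustrative LRU prefix)
        [],                            # nofollow_prefixes
    ],
)
```

The resulting `payload` string would then be sent to the core's JSON-RPC endpoint with whatever HTTP client is at hand.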
-
crawl.cancel : cancels a running or pending crawl job
- job_id : string id of the job given by listjobs
-
crawl.cancel_all : cancels all running or pending crawl jobs
-
crawl.list : returns the list of past, running and pending crawls
-
crawl.get_job_logs : returns a time ordered list of logs relative to a specific crawl job
- job_id : string id of the job given by listjobs
-
crawl.get_webentity_logs : returns a time ordered list of logs for all crawls of a specific webentity
- webentity_id : string id of the webentity in the memory structure
# Memory Structure functions
-
store.reinitialize : empties the memory structure and redefines the default webentity creation rule
-
store.declare_webentity_by_lru : creates if necessary a webentity for a specific LRUprefix and returns it
- lru_prefix : string lru_prefix
-
store.get_webentity_by_url : tries to find the webentity corresponding to a url and returns it if it exists or returns None
- url : string url to look for
-
store.get_webentities : returns the list of all webentities in the memory structure or of those whose IDs are given as input
- list_ids : optional list of string ids of the webentities looked for (default : None)
-
store.get_webentity_pages : returns the list of all webpages stored in the memory structure corresponding to a specific webentity
- webentity_id : string id of the webentity
-
store.get_webentity_subwebentities : returns a list of webentities having LRU prefixes starting with one of the webentity's prefixes
- webentity_id : string id of the webentity
-
store.get_webentity_parentwebentities : returns a list of webentities having a shorter LRU prefix that one of the webentity's prefixes starts with
- webentity_id : string id of the webentity
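The sub/parent relation described by these two methods comes down to string prefix comparison on LRU prefixes. The sketch below illustrates it on plain dicts with `id` and `lru_prefixes` keys; that data layout, the function names and the sample LRU strings are all assumptions made for illustration, and the real memory structure is not queried this way.

```python
def extends_prefix(candidate, prefix):
    """True when `candidate` starts with `prefix` and is strictly longer."""
    return candidate != prefix and candidate.startswith(prefix)


def find_subwebentities(webentity, others):
    """Webentities having a prefix that extends one of `webentity`'s prefixes."""
    return [
        other for other in others
        if any(extends_prefix(theirs, ours)
               for theirs in other["lru_prefixes"]
               for ours in webentity["lru_prefixes"])
    ]


def find_parentwebentities(webentity, others):
    """Webentities having a shorter prefix that one of `webentity`'s prefixes extends."""
    return [
        other for other in others
        if any(extends_prefix(ours, theirs)
               for theirs in other["lru_prefixes"]
               for ours in webentity["lru_prefixes"])
    ]
```

With a webentity on `s:http|h:com|h:example|` and another on `s:http|h:com|h:example|p:blog|`, the second is a subwebentity of the first, and the first is a parent of the second.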
-
store.get_precision_exceptions : returns the list of string LRU prefixes defined as precision exceptions
-
store.remove_precision_exceptions : removes a list of string LRU prefixes from precision exceptions if existing
- list_exceptions : array of string lru prefixes
-
store.merge_webentity_into_another : Merges a webentity into another by adding all of its lru prefixes to the other one before removing it
- old_webentity_id : string id of the webentity to merge into the other
- good_webentity_id : string id of the webentity to host the merged one
- include_tags : boolean True to add all tags from old_webentity to other one (default : False)
- include_home_and_startpages_as_startpages : boolean True to add all startpages and the homepage from old_webentity as startpages of the other one (default : False)
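The effect of the two boolean options can be sketched on plain dicts. Everything about the data layout here is an assumption for illustration (keys `lru_prefixes`, `startpages`, `homepage`, and `tags` as a namespace to key to values mapping), and `merge_webentity` is a hypothetical function; it returns a merged copy and leaves the removal of the old webentity to the caller.

```python
def merge_webentity(old, good, include_tags=False,
                    include_home_and_startpages_as_startpages=False):
    """Return a copy of `good` enriched with `old`, as the options request."""
    merged = {
        "id": good["id"],
        "lru_prefixes": list(good["lru_prefixes"]),
        "startpages": list(good.get("startpages", [])),
        "homepage": good.get("homepage"),
        "tags": {ns: {key: list(values) for key, values in keys.items()}
                 for ns, keys in good.get("tags", {}).items()},
    }
    # All LRU prefixes of the old webentity always move to the hosting one.
    for prefix in old["lru_prefixes"]:
        if prefix not in merged["lru_prefixes"]:
            merged["lru_prefixes"].append(prefix)
    if include_tags:
        for ns, keys in old.get("tags", {}).items():
            for key, values in keys.items():
                target = merged["tags"].setdefault(ns, {}).setdefault(key, [])
                for value in values:
                    if value not in target:
                        target.append(value)
    if include_home_and_startpages_as_startpages:
        pages = list(old.get("startpages", []))
        if old.get("homepage"):
            pages.append(old["homepage"])
        for url in pages:
            if url not in merged["startpages"]:
                merged["startpages"].append(url)
    return merged
```

With both options left at False, only the LRU prefixes are carried over, which matches the defaults listed above.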
-
store.delete_webentity : Removes a webentity from the memory structure (all its webpages will be associated with the default OUTSIDE WEB webentity for LRU prefix "s:http" or "s:https")
- webentity_id : string id of the webentity
-
store.rename_webentity : Defines a webentity's name field
- webentity_id : string id of the webentity
- new_name : string name
-
store.set_webentity_status : Defines a webentity's status
- webentity_id : string id of the webentity
- status : string status (UNDECIDED, IN, OUT or DISCOVERED)
-
store.set_webentity_homepage : Defines the homepage to display for a specific webentity
- webentity_id : string id of the webentity
- homepage : string url of the webentity's homepage
-
store.add_webentity_lruprefix : Adds an LRU prefix to a specific webentity. If the prefix was already defined on another webentity, it is removed from that one
- webentity_id : string id of the webentity
- lru_prefix : string lru prefix to add to the webentity
-
store.rm_webentity_lruprefix : Removes an LRU prefix from a specific webentity. Also removes the webentity if it has no LRU prefix left
- webentity_id : string id of the webentity
- lru_prefix : string lru prefix to remove from the webentity
-
store.add_webentity_startpage : Adds a starting point URL to a specific webentity
- webentity_id : string id of the webentity
- startpage_url : string url to add to the webentity's startpoints
-
store.rm_webentity_startpage : Removes a starting point URL from a specific webentity if existing
- webentity_id : string id of the webentity
- startpage_url : string url to remove from the webentity's startpoints
-
store.add_webentity_tag_value : Adds a namespace:key=value tag to a specific webentity
- webentity_id : string id of the webentity
- tag_namespace : string namespace (should not contain any ":")
- tag_key : string key (should not contain any "=")
- tag_value : string value
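The `namespace:key=value` form explains the two restrictions on the parameters: a ":" in the namespace or a "=" in the key would make the tag ambiguous. A minimal sketch of that formatting and check follows; `format_tag` is a hypothetical helper, and the core may validate differently or not at all.

```python
def format_tag(tag_namespace, tag_key, tag_value):
    """Build a 'namespace:key=value' string, rejecting ambiguous parts."""
    if ":" in tag_namespace:
        raise ValueError('tag_namespace should not contain ":"')
    if "=" in tag_key:
        raise ValueError('tag_key should not contain "="')
    return "%s:%s=%s" % (tag_namespace, tag_key, tag_value)
```

The value itself is unrestricted, since everything after the first "=" that follows the namespace separator can be read as the value.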
-
store.rm_webentity_tag_key : Removes all namespace:key tag values from a specific webentity if existing.
- webentity_id : string id of the webentity
- tag_namespace : string namespace (should not contain any ":")
- tag_key : string key (should not contain any "=")
-
store.rm_webentity_tag_value : Removes a namespace:key=value tag from a specific webentity if existing.
- webentity_id : string id of the webentity
- tag_namespace : string namespace (should not contain any ":")
- tag_key : string key (should not contain any "=")
- tag_value : string value
-
store.set_webentity_tag_values : Removes all values for a specific namespace:key and replaces them with a specific list of tag_values.
- webentity_id : string id of the webentity
- tag_namespace : string namespace (should not contain any ":")
- tag_key : string key (should not contain any "=")
- tag_values : list of string values
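The four tag methods differ only in how they touch the values stored under a namespace and key. The sketch below mimics them on a nested dict (namespace to key to list of values). That layout is an assumption for illustration, the function names are hypothetical local stand-ins for the `store.*` methods, and dropping duplicate values on add is a guess rather than documented behaviour.

```python
def add_tag_value(tags, namespace, key, value):
    """Add one value under namespace:key (duplicates skipped, an assumption)."""
    values = tags.setdefault(namespace, {}).setdefault(key, [])
    if value not in values:
        values.append(value)


def rm_tag_value(tags, namespace, key, value):
    """Remove one value under namespace:key if it exists."""
    values = tags.get(namespace, {}).get(key, [])
    if value in values:
        values.remove(value)


def rm_tag_key(tags, namespace, key):
    """Remove every value under namespace:key if the key exists."""
    tags.get(namespace, {}).pop(key, None)


def set_tag_values(tags, namespace, key, values):
    """Replace all values under namespace:key with the given list."""
    tags.setdefault(namespace, {})[key] = list(values)
```

In this sketch `set_tag_values` is equivalent to `rm_tag_key` followed by one `add_tag_value` per item, which is how the description above reads.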
-
store.get_webentities_network_json : Returns a json representation of the whole network between linked webentities
-
store.generate_webentities_network_gexf : Generates a GEXF local file representing the whole network between linked webentities
-
store.get_webentity_nodelinks_network_json : Returns a json representation of the network between linked nodes for a specific webentity if set, or for the whole memory structure otherwise
- webentity_id : string id of the webentity whose nodes to represent (default : None)
- include_frontier : boolean True to include foreign links to nodes from other webentities or False to get only links within the webentity (default : False)