Crawler and Searchengine

Installation prerequisites#

Those two services works together : prismeai-crawler and prismeai-searchengine, if you wish to use one of them, you have to install the second one.

They need access to:

An ElasticSearch, it can be the same as the one used for the core deployment
A Redis, it can also be the same as the one used for the core deployment

Environment variables#

Name	Description	Default value	Affected services
REDIS_URL	Allow communications between both services	redis://localhost:6379	BOTH
ELASTIC_SEARCH_URL	Saves documents content within an Elastic database	localhost	BOTH
MAX_CONTENT_LEN	Maximum length in characters of documents crawled. The service will drop documents exceeding this limit. This limit exists to prevent crawling huge webpages that doesn't contain any real information.	150000	prismeai-crawler

ELASTIC_SEARCH_URL might be set to an empty string '', in which case no webpage content would be saved, thus deactivating searches.

Microservice testing#

Once you configured and started the microservice (following the generic guide) you can verify everything is in order.

Create a searchengine :

curl --location 'http://localhost:8000/monitor/searchengine/test/test' \
--header 'Content-Type: application/json' \
--data '{
    "websites": [
        "https://docs.eda.prisme.ai/en/workspaces/"
    ]
}'

If successful, a complete searchengine object including an id field should be received.

After a few seconds, look at the crawl history:

curl --location --request GET 'http://localhost:8000/monitor/searchengine/test/test/stats' \
--header 'Content-Type: application/json' \
--data '{
    "urls":  ["http://quotes.toscrape.com"]
}'

The fields metrics.indexed_pages and metrics.pending_requests should be greater than 0, and pages already indexed should appear in crawl_history.

Try a search:

curl --location 'http://localhost:8000/search/test/test' \
--header 'Content-Type: application/json' \
--data '{
    "query": "workspace"
}'

In the answer, a results table should indicate one or more pages of the https://docs.eda.prisme.ai documentation dealing with workspaces.

Congratulations, you service is up and running!