
Crawler and Searchengine

Installation prerequisites

These two services, prismeai-crawler and prismeai-searchengine, work together: if you want to use one of them, you must install the other as well.

They need access to:

  • An Elasticsearch instance, which can be the same one used for the core deployment
  • A Redis instance, which can also be the same one used for the core deployment
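
Before starting the services, it can help to confirm that both backends are reachable from the host that will run them. The sketch below is only an illustration: check_elasticsearch and check_redis are hypothetical helper names (not part of Prisme.ai), and it assumes curl and redis-cli are installed and that Elasticsearch answers unauthenticated HTTP requests. A secured cluster would need credentials added.

```shell
#!/bin/sh
# Hypothetical helpers: they are not shipped with prismeai-crawler or prismeai-searchengine.

check_elasticsearch() {
  # $1: base URL of the Elasticsearch HTTP API (assumption: e.g. http://localhost:9200,
  # 9200 being Elasticsearch's default HTTP port)
  # Succeeds only if the server answers with a 2xx status within 5 seconds.
  curl --fail --silent --show-error --max-time 5 "$1" > /dev/null
}

check_redis() {
  # $1: Redis URL, in the same form as REDIS_URL (e.g. redis://localhost:6379)
  # Succeeds only if the server replies PONG to a PING.
  redis-cli -u "$1" ping | grep -q PONG
}
```

For example, `check_elasticsearch http://localhost:9200 && check_redis redis://localhost:6379 && echo "backends reachable"` prints the message only when both checks pass.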

Environment variables

| Name | Description | Default value | Affected services |
|---|---|---|---|
| REDIS_URL | Allows communication between both services | redis://localhost:6379 | Both |
| ELASTIC_SEARCH_URL | Elasticsearch instance in which document content is saved | localhost | Both |
| MAX_CONTENT_LEN | Maximum length, in characters, of crawled documents. The service drops documents exceeding this limit. The limit exists to avoid crawling huge webpages that don't contain any real information. | 150000 | prismeai-crawler |

ELASTIC_SEARCH_URL can be set to an empty string (''), in which case no webpage content is saved, which disables search.
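
As an illustration, the variables could be exported in the shell that launches the services. This is a minimal sketch that reuses the default values listed above. How the variables actually reach the services (shell, container environment, orchestration manifest) depends on your deployment method and is an assumption here.

```shell
#!/bin/sh
# Assumed values: these mirror the documented defaults; adjust them to your deployment.
export REDIS_URL="redis://localhost:6379"   # read by both services
export ELASTIC_SEARCH_URL="localhost"       # read by both services; '' disables content storage
export MAX_CONTENT_LEN="150000"             # only read by prismeai-crawler

# Print the effective configuration for a quick visual check.
printf 'REDIS_URL=%s\nELASTIC_SEARCH_URL=%s\nMAX_CONTENT_LEN=%s\n' \
  "$REDIS_URL" "$ELASTIC_SEARCH_URL" "$MAX_CONTENT_LEN"
```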

Microservice testing

Once you have configured and started the microservice (following the generic guide), you can verify that everything is in order.

  1. Create a searchengine:

    curl --location 'http://localhost:8000/monitor/searchengine/test/test' \
    --header 'Content-Type: application/json' \
    --data '{
        "websites": [
            "https://docs.eda.prisme.ai/en/workspaces/"
        ]
    }'
    

    If successful, the response should be a complete searchengine object including an id field.

  2. After a few seconds, look at the crawl history:

    curl --location --request GET 'http://localhost:8000/monitor/searchengine/test/test/stats' \
    --header 'Content-Type: application/json' \
    --data '{
        "urls":  ["http://quotes.toscrape.com"]
    }'
    

    The fields metrics.indexed_pages and metrics.pending_requests should be greater than 0, and pages already indexed should appear in crawl_history.

  3. Try a search:

    curl --location 'http://localhost:8000/search/test/test' \
    --header 'Content-Type: application/json' \
    --data '{
        "query": "workspace"
    }'
    
    In the response, a results array should contain one or more pages of the https://docs.eda.prisme.ai documentation dealing with workspaces.
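
Step 2 relies on waiting "a few seconds", which can be flaky in scripted checks. The sketch below polls the stats endpoint until metrics.indexed_pages becomes positive. It is only an illustration: extract_metric and wait_for_indexing are hypothetical helper names, the request mirrors the curl call from step 2, and the extraction assumes the metric appears in the JSON response as `"indexed_pages": <integer>`.

```shell
#!/bin/sh
# Hypothetical helpers: they are not part of the Prisme.ai tooling.

extract_metric() {
  # $1: metric name. Reads JSON on stdin and prints the first integer value found for that key.
  grep -o "\"$1\"[[:space:]]*:[[:space:]]*[0-9][0-9]*" | head -n 1 | grep -o '[0-9][0-9]*$'
}

wait_for_indexing() {
  # $1: stats URL, $2: maximum number of attempts (default 10), 3 seconds apart
  wfi_url="$1"
  wfi_attempts="${2:-10}"
  wfi_i=0
  while [ "$wfi_i" -lt "$wfi_attempts" ]; do
    # Same request as in step 2 of the guide.
    wfi_pages="$(curl --fail --silent --show-error --location --request GET "$wfi_url" \
      --header 'Content-Type: application/json' \
      --data '{"urls": ["http://quotes.toscrape.com"]}' | extract_metric indexed_pages)"
    if [ "${wfi_pages:-0}" -gt 0 ]; then
      echo "indexed_pages=$wfi_pages"
      return 0
    fi
    wfi_i=$((wfi_i + 1))
    sleep 3
  done
  echo "no indexed pages after $wfi_attempts attempts" >&2
  return 1
}
```

For example, `wait_for_indexing 'http://localhost:8000/monitor/searchengine/test/test/stats'` returns successfully as soon as at least one page has been indexed, after which the search in step 3 can be run.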

Congratulations, your service is up and running!