1.4. Tutorial

Configuration of crawler

When the crawler extension is installed, its backend administration can be accessed through the Web > Info module:

This shows you the entries that can be submitted for crawling. Currently there are none, because you first need to configure how the page tree is crawled. This is done using Page TSconfig. For the page “Testsite”, enter this configuration in the TSconfig field:

tx_crawler.crawlerCfg.paramSets {
  language = &L=[|_TABLE:pages_language_overlay;_FIELD:sys_language_uid]
  language.procInstrFilter = tx_indexedsearch_reindex, tx_cachemgm_recache
  language.baseUrl = http://localhost:8888/typo3/dummy_4.0/
  mininews = &L=[|_TABLE:pages_language_overlay;_FIELD:sys_language_uid]&tx_mininews_pi1[showUid]=[_TABLE:tx_mininews_news]
  mininews.procInstrFilter = tx_indexedsearch_reindex
  mininews.cHash = 1
  mininews.baseUrl = http://localhost:8888/typo3/dummy_4.0/
}

This code contains two “sets”, namely “language” and “mininews”. The result is displayed in the “Start Crawling” screen:

Each set describes variations of the URL for each page. The “language” set checks whether translations exist for a page and, if so, asks to visit the page both with and without the L GET variable.

It's the same with the “mininews” set; it looks up mininews items on the page and, if any are found, generates one URL to be crawled for each mininews item that exists. This is even combined with the L parameter, so each news display is visited once for each language!

In addition, you can set the “baseUrl” for the request and whether a cHash value should be calculated (for “mininews” this is necessary to have it indexed or cached).

Summary: The configuration describes which parameter variations pages are visited with during crawling. In other words, the configuration defines the URLs to visit.
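To make the idea of “parameter variations” concrete, here is a minimal Python sketch of how such a set could be expanded into URLs. It is not the crawler's actual implementation (the extension is written in PHP). The function name expand_urls, the page id 42, the record uids and the index.php?id= URL form are assumptions made for illustration, and no cHash is calculated.

```python
from itertools import product
from urllib.parse import urlencode


def expand_urls(base_url, page_id, variations):
    """Return one URL per combination of parameter values.

    `variations` maps a GET parameter name to the list of values to try.
    An empty string means "leave the parameter out", which mirrors the
    blank alternative before the first "|" in the TSconfig expression.
    """
    names = list(variations)
    urls = []
    for combo in product(*(variations[name] for name in names)):
        params = [("id", page_id)]
        params += [(n, v) for n, v in zip(names, combo) if v != ""]
        urls.append(base_url + "index.php?" + urlencode(params))
    return urls


# Hypothetical data: one translation (sys_language_uid 1) and two
# mininews records (uids 10 and 11) on page 42.
urls = expand_urls(
    "http://localhost:8888/typo3/dummy_4.0/",
    42,
    {"L": ["", 1], "tx_mininews_pi1[showUid]": [10, 11]},
)
for url in urls:
    print(url)
```

With two language variants and two news records, the sketch yields four URLs, which is the “each news display once per language” behaviour described above.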

Submit URLs to queue

When the configuration is finished you can submit the URLs to the queue by pressing “Crawl URLs”:

You can use the “Scheduled” setting to define when the URLs are crawled. After submitting them, you can view the queue by changing the view from “Start Crawling” to “Crawler log”.

This view shows you the queue, which is the same list as “the log”; the only difference is whether a queue item has been processed or not.

It will look like this:

This clearly shows which URLs are ready to be processed.
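The relationship between “queue” and “log” can be sketched as a single list of entries that is filtered in two ways. This is an illustrative model only: the class QueueEntry and its field names are hypothetical and do not describe the extension's database schema, and treating the “Scheduled” time as the earliest moment an entry may be crawled is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueueEntry:
    url: str
    scheduled: int                      # earliest crawl time (Unix timestamp), assumed meaning
    processed_at: Optional[int] = None  # stays None until the entry has been processed


def pending(entries, now):
    """Entries that are due and still waiting: the "queue" part."""
    return [e for e in entries if e.processed_at is None and e.scheduled <= now]


def log(entries):
    """Entries that have already been processed: the "log" part."""
    return [e for e in entries if e.processed_at is not None]


entries = [
    QueueEntry("index.php?id=42", scheduled=100),
    QueueEntry("index.php?id=42&L=1", scheduled=100, processed_at=160),
    QueueEntry("index.php?id=43", scheduled=500),
]

print([e.url for e in pending(entries, now=200)])
print([e.url for e in log(entries)])
```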

Explanation of processing instructions

Before we look at how the queue is processed, we need to take a look at processing instructions. When the crawler requests one of these URLs from the TYPO3 frontend, it can add a TYPO3-specific request header which asks the frontend to perform a special action; for instance, this header can ask to re-index the page, to re-cache the page, or to process the request with certain frontend usergroups initialized.

If you look at the configuration code you can see how each set is assigned processing instructions. When you submit URLs you must select which processing instructions to send in the request:

The available processing instructions are defined by third-party extensions using an API in the crawler extension. In this case the extensions “indexed_search” and “cachemgm” are installed and provide processing instructions. If you select “Re-indexing”, all configuration sets that list this processing instruction are used to generate URLs, and each request passes the processing instruction on to the frontend. In the frontend there are hooks which take care of processing according to the processing instruction. In the case of “tx_indexedsearch_reindex” it will ask to have pages re-indexed!

The same is the case with “Re-cache pages”; this will re-generate the cached version of a page.
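The way a selected processing instruction picks configuration sets can be sketched as a simple filter over the “procInstrFilter” values from the example configuration. The function name sets_for_instruction and the dictionary layout are assumptions for illustration; the extension itself does this in PHP.

```python
def sets_for_instruction(param_sets, selected):
    """Return the names of the sets whose procInstrFilter lists `selected`."""
    matching = []
    for name, cfg in param_sets.items():
        # procInstrFilter is a comma-separated list of instruction keys
        allowed = [key.strip() for key in cfg["procInstrFilter"].split(",")]
        if selected in allowed:
            matching.append(name)
    return matching


# Values copied from the TSconfig example in this tutorial
param_sets = {
    "language": {"procInstrFilter": "tx_indexedsearch_reindex, tx_cachemgm_recache"},
    "mininews": {"procInstrFilter": "tx_indexedsearch_reindex"},
}

print(sets_for_instruction(param_sets, "tx_indexedsearch_reindex"))
print(sets_for_instruction(param_sets, "tx_cachemgm_recache"))
```

In this model, selecting the re-indexing instruction uses both sets, while selecting the re-cache instruction uses only the “language” set.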

Run via backend

To process the queue you must either set up a cron-job on your server or use the backend to execute the queue:

This view, “CLI status”, shows you the status of the processing. If the status is “Start”, a process is already running and therefore cannot be started again. If the status is “end”, as you see here, you can press “Run now”, and that will activate the processing from the backend:

During processing you can see in the “Crawler log” view how the queue is processed:

This shows at what time each queue entry was processed and what its exit status was. If you see “..” in the status column, the item is currently being processed.

In the “CLI status” view you will see that the status has changed:

Run via cron

The best option is to have a cron-script running the processing. This is done by adding the command shown in the “CLI status” view to the crontab:

So, in this case the cronjob will look like:

This will run the script every minute. Before you add it as a cron-job, check from the command line that it executes correctly! It will only do so after you have added a user called “_cli_crawler”. You must also have PHP installed as a CGI script in /usr/bin/.

After adding the cronjob you can look at the “Last seen” time:

This is just after the cronjob was added; it has not been activated yet. After waiting a minute, this is what I see:

So this tells me that the queue is being processed correctly.