
1.4. Indexing configurations

Setting up the “crawler” extension

Before you can work with “Indexing configurations” you must set up the “crawler” extension and have a cron job running that processes the crawler queue as it is filled. Please refer to the documentation of the “crawler” extension for this.

Generally about indexing configurations

Indexing configurations set up indexing jobs that are performed by a cron script, independently of frontend requests. The “crawler” extension is used as a service to execute the queue entries that control the indexing.

An Indexing configuration contains two parts:

  1. Definition of execution time and periodicality.

  2. Definition of indexing type and settings.

Below you see what all Indexing Configurations have in common:

These settings are described in the context-sensitive help, so please refer to that for more information.

The “Session ID” requires a short introduction: when an indexing job is started, this value is set to a unique number which is used as the ID for that process, and all indexed entries are tagged with it. When the processing of an indexing configuration is done, it is reset to zero again.

Periodic indexing of the website (“Page tree”)

You can have the whole page tree indexed overnight using this indexing configuration of type “Page tree”:

This defines that the page tree is to be crawled to a depth of 3 levels from the root point “Testsite”. For each page a combination of parameters is calculated based on the “crawler” configurations for the “Re-index” processing instruction (see the “crawler” extension for more information). Those URLs are committed to the crawler log, plus entries for all subpages of the processed page, so that each of those pages is indexed as well.
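The queueing described above can be pictured with a small Python sketch. This is not the extension's code: the function names, the dictionary used as a page tree, and the reading of “depth” (the root counts as level 0) are all assumptions made for illustration, and the real process adds subpage entries step by step as each queue entry is executed rather than in one loop.

```python
from itertools import product


def expand_parameters(param_sets):
    """Return every combination of the configured GET parameter values."""
    if not param_sets:
        return [{}]
    names = sorted(param_sets)
    value_lists = [param_sets[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


def queue_page_tree(tree, root_id, depth, param_sets):
    """Collect URLs for root_id and its subpages down to `depth` levels.

    `tree` maps a page id to the ids of its subpages (hypothetical structure).
    """
    queue = []
    pending = [(root_id, 0)]  # (page id, level below the root point)
    while pending:
        page_id, level = pending.pop(0)
        # One URL per parameter combination for this page.
        for params in expand_parameters(param_sets):
            query = "".join(f"&{key}={value}" for key, value in params.items())
            queue.append(f"index.php?id={page_id}{query}")
        # Subpages are only added while the configured depth is not exceeded.
        if level < depth:
            pending.extend((child, level + 1) for child in tree.get(page_id, []))
    return queue
```

With two language values configured as a parameter set, every page in range yields two URLs, and pages deeper than the configured depth are never queued.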

This is what the crawler log may look like after processing:

Here you can see that the visited URLs have additional parameters added - those are combined based on the “crawler” extension's configuration in Page TSconfig.

Also notice the special crawler log entries found in the “Storage folder”. These are the “meta-entries” which call an indexed search hook, which in turn generates the URL entries and pushes them to the queue.

On the far right in this view you can see that noted as well, including the “set_id”:

Finally, in the Web>Info, “Indexed search” you will see that these visited URLs were re-indexed:

Location: Indexing configurations for indexing of the page tree should be placed in a SysFolder, since their location in the page tree is not relevant to their function.

Periodic indexing of records (“Database Records”)

You can also use the Indexing Configuration to index single records.

Location: You must place the indexing configuration on the page where you want the search results to be displayed - typically on the page where a plugin exists that can process the parameters pointing to the record. In the case below the Indexing Configuration is placed on the same page as the frontend plugin (“Morbi diam enim...”) that can display the search results:

 

The configuration record looks like this:

If the records you want to index are not located on the page where the indexing configuration and frontend plugin are, then you can point to their location. Notice how the field with “GET parameters” is used to define how the search results are shown - this must correspond with the parameters the plugin accepts.

A fancy option is the “Index Records immediately when saved” - which will index records as they are saved through “TCEmain”!

In the crawler log you will see the entries for record indexing like this:

After processing, the Web>Info “Indexed search” view will show this:

Notice how the GET parameters are nicely added and how the “CfgUid” column contains the UID of the indexing configuration / the “set_id” of the processing.

In fact, if a record is removed its indexing entry will also be removed upon next indexing - simply because the “set_id” is used to finally clear out old entries after a re-index!
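The clean-up by “set_id” can be sketched in a few lines of Python. The data structures and the function name here are hypothetical and only illustrate the idea; the actual tables and queries of the extension are not shown in this manual.

```python
def reindex_records(index, cfg_uid, records, set_id):
    """Re-index `records` for one configuration and purge leftover entries.

    `index` maps (cfg_uid, record_uid) to an entry dict (hypothetical layout).
    """
    # Every record that still exists is tagged with the set_id of this run.
    for record_uid, text in records.items():
        index[(cfg_uid, record_uid)] = {"text": text, "set_id": set_id}

    # Entries of this configuration that still carry an older set_id were not
    # seen during this run, so their records must have been removed.
    stale = [
        key
        for key, entry in index.items()
        if key[0] == cfg_uid and entry["set_id"] != set_id
    ]
    for key in stale:
        del index[key]
    return index
```

If a first run indexes records 1 and 2 and a later run only finds record 1, the entry for record 2 keeps the old “set_id” and is dropped at the end of the later run.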

Indexing External websites (“External URL”)

You can index external websites using Indexing Configurations. They can actually crawl an external URL! Configuration looks like this:

The configuration is largely self-explanatory. The Context Sensitive Help will provide enough information to complete it.

Location: You should place the Indexing Configuration on a “Not-in-menu” page in the root of the site for instance. The page must be “searchable” since the external URL results are bound to a page in the page tree, namely the page where the configuration is found.

 

This is how the crawler log looks immediately after the crawling has begun:

The initial entry is “http://typo3.org/” which is already processed. When this process was executed it added entries for all found subpages to the queue as well. When their execution time comes the crawler will request those URLs as well and if subpages are found on them, entries for those subpages are added until the configured depth is reached.
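As a rough illustration of this queue behaviour, here is a Python sketch of a depth-limited crawl. It is a simplification under stated assumptions: `fetch_links` is a hypothetical callback standing in for the HTTP request and link extraction, and treating only URLs that begin with the start URL as “subpages” is an assumption, not something this manual specifies.

```python
from collections import deque


def crawl_external(start_url, max_depth, fetch_links):
    """Process start_url, then the subpages found on it, up to max_depth levels."""
    queue = deque([(start_url, 0)])
    seen = {start_url}
    processed = []
    while queue:
        url, depth = queue.popleft()
        processed.append(url)  # this is where the page would be indexed
        if depth >= max_depth:
            continue  # configured depth reached: do not queue further subpages
        for link in fetch_links(url):
            # Assumption: a "subpage" is any not yet queued URL below the start URL.
            if link.startswith(start_url) and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return processed
```

The `seen` set keeps a URL from being queued twice when several pages link to it.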

After a few minutes you see more entries processed like this:

In Web>Info, “Indexed search” the indexed entries look like this:

Indexing directories of files (“Filepath on server”)

You can also have directories of files on your server indexed periodically, using the type “Filepath on server”.

Again, the options are either easy to understand or you can read more about them in the Context Sensitive Help.

Location: The Indexing Configuration should be located on a not-in-menu page, just as for the “External URL” type, and for the same reason: results are bound to a page in the page tree.

The process of indexing a directory of files is the same as for the external URL: For each directory a) all files are indexed and b) all sub-directories added to the crawler queue for later processing. This is shown in the crawler log:

When processing is done the result is shown in the Web>Info, “Indexed search”:

Showing the search results

By default the search results are shown with no distinction between those from local TYPO3 pages, indexed records, the file path and external URLs. The only division follows the page on which the result is found:

However, you can have the search results divided into categories that follow the indexing configurations:

To obtain this categorization you must set TypoScript configuration in the Setup field like this:

plugin.tx_indexedsearch.search.defaultFreeIndexUidList = 0,6,7,8
plugin.tx_indexedsearch.blind.freeIndexUid = 0

The “defaultFreeIndexUidList” is the list of uid numbers of the indexing configurations to show in the categorization. The order determines which are shown at the top. Changing it could bring the results from TYPO3.org and TYPO3.com to the top:

The categorization happens when the “Category” selector in the “Advanced” search form is set like this:

(Notice, you can preset this value from TypoScript as well!)

Searching a specific category from URL

If you want search forms on the site to look up directly in the results belonging to one or more indexing configurations, you can use a set of GET variables like these, here using UID values 7 and 8 since they look up in the TYPO3.org and TYPO3.com results:

index.php?id=78&tx_indexedsearch[sword]=level&tx_indexedsearch[_freeIndexUid]=7,8
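If such URLs are generated by a script, the parameters can be assembled programmatically. The following Python sketch is only an illustration: the helper name `search_url` is made up, while the parameter names are taken from the URL above.

```python
from urllib.parse import urlencode


def search_url(page_id, sword, free_index_uids):
    """Build a search URL limited to the given indexing configuration UIDs."""
    params = [
        ("id", page_id),
        ("tx_indexedsearch[sword]", sword),
        ("tx_indexedsearch[_freeIndexUid]", ",".join(str(uid) for uid in free_index_uids)),
    ]
    # Keep brackets and commas readable instead of percent-encoding them.
    return "index.php?" + urlencode(params, safe="[],")


print(search_url(78, "level", [7, 8]))
# index.php?id=78&tx_indexedsearch[sword]=level&tx_indexedsearch[_freeIndexUid]=7,8
```

The search word is still escaped by `urlencode`, so words with spaces or special characters end up correctly encoded in the URL.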

Grouping more indexing configurations in one search category

You might find that you want to group the results from multiple indexing configurations in the same category. For instance, I have an indexing configuration for both “TYPO3.org” and “TYPO3.com” but I want all search results to appear under the category “External URLs”. This can be done by creating a special type of indexing configuration which only points to other indexing configurations:

This indexing configuration is not used during indexing but during searching. So a reconfiguration of the TypoScript to use uid 9 instead of 7,8 will yield this result:

TypoScript:

plugin.tx_indexedsearch.search.defaultFreeIndexUidList = 9,6,0
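Conceptually, the grouping configuration is expanded to its members at search time. The Python sketch below shows that idea only; the function name and the `groups` mapping are hypothetical and do not reflect how the extension stores or resolves the grouping internally.

```python
def resolve_categories(uid_list, groups):
    """Map each category uid to the indexing configuration uids it covers.

    `groups` maps the uid of a grouping configuration to the uids it points to.
    A uid that is not a grouping configuration simply stands for itself.
    """
    return {uid: list(groups.get(uid, [uid])) for uid in uid_list}


# uid 9 is the grouping configuration pointing to 7 and 8.
categories = resolve_categories([9, 6, 0], {9: [7, 8]})
print(categories)
# {9: [7, 8], 6: [6], 0: [0]}
```

The order of the list is preserved, so the category for uid 9 comes first, matching the order given in “defaultFreeIndexUidList”.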

Disable frontend initiated indexing

If you choose to index your site using Indexing Configurations you can disable indexing through the user requests in the frontend. This is easily done via the configuration of the Indexed Search extension in the Extension Manager:

Indexing files on pages separately

If enabled, links to local files found on pages will initiate indexing of those files. However, this often has the unpleasant effect that too many files are indexed during the same page request. Using the crawler extension you can configure the indexer to add a queue entry instead of indexing the files immediately. Thus the indexing will happen outside the frontend user request, using the cron script!

This behaviour is configured in the Extension Manager's configuration for “Indexed search”: