Configure Indexed Search and Crawler Easily in TYPO3

By Siju Raju on November 9, 2016

The system extension Indexed Search is the engine that indexes content and provides a frontend plugin for searching that content and displaying the results. It adds two major elements to TYPO3:

1. Indexing: An indexing engine that indexes TYPO3 pages on the fly as they are rendered by TYPO3’s frontend. Indexing a page means that all words from the page (or from specifically defined areas of the page) are registered, counted, weighted and finally inserted into a database table of words. A second table is then filled with relation records between the word table and the page.
2. Searching: A plugin you can insert on your website that lets visitors search it for information. When a search is run, the plugin first checks whether the word exists in the word table; if it does, all pages that have a relation to that word are considered for the search results. The results are ordered by factors such as where on the page the word was found and how frequently it occurs on the page.

This article gives step-by-step instructions on how to install and configure these extensions to index your TYPO3 content efficiently.
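
To make the two tables more concrete, here is a minimal Python sketch of the idea, not the actual Indexed Search implementation. The class `TinyIndex`, its method names and the sample pages are all hypothetical, and ranking is reduced to word frequency only (the real engine also weights where the word was found).

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """Hypothetical stand-in for the word table and the relation table."""

    def __init__(self):
        self.words = {}                     # word -> word id ("word table")
        self.relations = defaultdict(dict)  # word id -> {page id: frequency}

    def index_page(self, page_id, text):
        # Register and count every word, then store the word/page relation.
        for word, count in Counter(tokenize(text)).items():
            word_id = self.words.setdefault(word, len(self.words) + 1)
            self.relations[word_id][page_id] = count

    def search(self, word):
        # Look the word up first; only pages related to it are candidates.
        word_id = self.words.get(word.lower())
        if word_id is None:
            return []
        hits = self.relations[word_id]
        # Order by frequency (highest first), then by page id for stability.
        return sorted(hits, key=lambda page_id: (-hits[page_id], page_id))


index = TinyIndex()
index.index_page(1, "TYPO3 search and TYPO3 crawler")
index.index_page(2, "Search plugin for TYPO3")
print(index.search("typo3"))  # page 1 mentions the word twice, so it ranks first
```

A search for a word that was never indexed simply returns an empty list, which mirrors the "does the word exist in the word table" check described above.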

Configuring the Server

If you want to index external documents referenced on your web pages in addition to standard text elements, you have to make sure a few third-party binaries are properly installed:

  • catdoc for Microsoft Word documents (does not support .docx files)
  • xlhtml for Microsoft Excel spreadsheets (does not support .xlsx files)
  • ppthtml for Microsoft PowerPoint presentations (does not support .pptx files)
  • pdftotext and pdfinfo for PDF files
  • unzip for OpenOffice documents
  • unrtf for RTF
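
Before touching the extension settings, it can help to check which of these tools are actually present on the server. The following Python sketch is one way to do that; the helper name `find_tools` is made up for illustration, and the `/usr/bin` search path is an assumption taken from the configuration step in the next section. Adjust it if your binaries live elsewhere.

```python
import shutil

# Tool names as listed above.
REQUIRED_TOOLS = ["catdoc", "xlhtml", "ppthtml", "pdftotext", "pdfinfo", "unzip", "unrtf"]


def find_tools(names, search_path=None):
    """Map each tool name to its resolved path, or None if it is not found.

    search_path=None means "use the PATH environment variable".
    """
    return {name: shutil.which(name, path=search_path) for name in names}


# Assumption: the binaries are expected in /usr/bin.
found = find_tools(REQUIRED_TOOLS, search_path="/usr/bin")
for name, location in found.items():
    print(f"{name}: {location or 'NOT FOUND'}")
```

Any tool reported as NOT FOUND has to be installed before the matching file type can be indexed.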

Configuring Indexed Search and Crawler

Log in to the TYPO3 backend, go to Admin Tools > Extension Manager and find the extension Indexed Search Engine. Open its configuration section.
Make sure the paths to PDF parsers, unzip, WORD parser, EXCEL parser, POWERPOINT parser and RTF parser all contain /usr/bin/

Indexed search configuration typo3

Make sure that content is not indexed automatically when a page is shown in the frontend, and enable the option that lets the crawler index external files.

Indexed search


Crawler requires a backend user _cli_crawler. Go to SYSTEM > Backend Users and create this backend user with a random password. This user must not be an administrator and should not be part of any backend user group.

Backend user

TypoScript Setup

Open your TypoScript template and add the following lines.

config.index_enable = 1
config.index_externals = 1

How the Crawler Works

The crawler mainly performs two jobs:
1. Generate the URLs of pages to be processed (with any GET parameter required, e.g., “L” for language or “tx_ttnews[tt_news]” to show the details of a tt_news record) and enqueue them for processing by the other job.
2. Process the queue of URLs and take the appropriate action (in our case, invoke Indexed Search to index the page or the document).
When generating URLs, the crawler can automatically crawl your website and enqueue the different pages (with /index.php?id=…). But if your site is multilingual, you have to tell it to generate variations for each and every page (with /index.php?id=…&L=0 and /index.php?id=…&L=1, for instance).
When a link to a document is encountered while indexing the content of a page, Indexed Search does not index it right away but adds it to the queue of pages and documents to be indexed (because the option “Use crawler extension to index external files” was ticked in the Indexed Search configuration).
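
The two jobs can be sketched in a few lines of Python. This is only an illustration of the flow, not the crawler's real code: `generate_urls`, `process_queue` and `fake_indexer` are hypothetical names, the page IDs and the PDF path are invented, and the URLs assume TYPO3's standard `id` page parameter.

```python
from collections import deque


def generate_urls(page_ids, languages=(0,)):
    """Job 1: build one URL per page and per language variation."""
    return [f"/index.php?id={page_id}&L={lang}"
            for page_id in page_ids
            for lang in languages]


def process_queue(queue, index_one):
    """Job 2: work through the queue; documents found on the way are enqueued."""
    processed = []
    seen = set(queue)
    while queue:
        url = queue.popleft()
        processed.append(url)
        # index_one returns links to external documents found on that page.
        for found in index_one(url):
            if found not in seen:
                seen.add(found)
                queue.append(found)
    return processed


def fake_indexer(url):
    # Hypothetical stand-in for Indexed Search: one page links to a PDF.
    return ["/fileadmin/manual.pdf"] if url == "/index.php?id=2&L=0" else []


queue = deque(generate_urls([1, 2], languages=(0, 1)))
processed = process_queue(queue, fake_indexer)
for url in processed:
    print(url)
```

Note that the PDF is not indexed at the moment it is discovered; it is appended to the queue and handled last, which is the deferred behaviour described above.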

Crawler Configuration

Let us start with a basic crawler configuration that allows the whole page tree to be indexed.
Step 1: Go to Web > List
Step 2: Select the root page of your site
Step 3: Create a new record of type “Crawler Configuration”, which is under the section “Site Crawler”

Crawler Configuration


Now we can use this configuration to index our website. Configure the scheduler to run different crawler tasks.

Typo3 scheduler

Adding Search Plugin To a Page

Select the page into which you want to integrate the search option. Create a new content element and, under the ‘Plugin’ tab, select ‘Indexed Search’.

Indexing News Articles

Suppose we have a latest-news list section that contains news teasers and links to a detail page. The detail page contains a tt_news plugin whose output mode is SINGLE. As such, this plugin expects a GET parameter in the URL:
&tx_ttnews[tt_news]= (id)
Our test configuration:

  • Sysfolder [uid #19] is our tt_news storage folder
  • Page [uid #35] contains a tt_news plugin for SINGLE view

We want the crawler to dynamically generate a list of URLs with the additional tx_ttnews[tt_news] parameter when it crawls page #35.

Crawler Configuration

We are creating this configuration for the subtree of page #35.
Step 1: Go to Web > List in the TYPO3 backend
Step 2: Click on page #35
Step 3: Create a new record of type “Crawler Configuration”, which is under the section “Site Crawler”

Crawler configuration

The _TABLE field in the configuration defines the lookup table (tt_news here), and _PID defines the news storage folder ID (#19 here).
When creating the crawler configuration, tick “Append cHash”; otherwise the first news item will be indexed N times because of TYPO3’s caching mechanism.
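
The effect of the _TABLE and _PID settings can be pictured with a small Python sketch. It only imitates the lookup; the crawler does this against the database. The rows in `TT_NEWS_ROWS` are invented sample data, `news_detail_urls` is a hypothetical helper, and the URL form assumes TYPO3's standard `id` page parameter.

```python
# Hypothetical rows standing in for the tt_news table: (uid, pid).
TT_NEWS_ROWS = [(101, 19), (102, 19), (103, 42)]


def news_detail_urls(rows, storage_pid, detail_page_id):
    """Build one detail-page URL per news record stored in the given folder."""
    return [
        f"/index.php?id={detail_page_id}&tx_ttnews[tt_news]={uid}"
        for uid, pid in rows
        if pid == storage_pid  # same role as _PID: only records in this folder
    ]


# Storage folder #19 and detail page #35, as in the test configuration above.
urls = news_detail_urls(TT_NEWS_ROWS, storage_pid=19, detail_page_id=35)
for url in urls:
    print(url)
```

Record 103 is skipped because it lives in a different folder, so only the news items stored in sysfolder #19 end up in the crawler queue.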

