Tomcat, Solr And Special Characters

Friday, 11 January 2013

If Solr does not return any results when looking for words with special characters, this post could explain why.
Solr's example schema.xml comes with a charFilter element that maps special characters to their ASCII equivalents. Look for the following in the "index" and "query" analyzer sections of the solr.TextField field type:

<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>

You can open the mapping-ISOLatin1Accent.txt file to see what gets mapped to what.
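To get a feel for what the char filter does, here is a minimal Python sketch of the same idea. The ACCENT_MAP table and the fold_accents helper are hypothetical names used only for illustration, and the table is a small hand-picked subset, not the contents of the real mapping file.

```python
# Hand-picked subset of accent mappings, for illustration only.
# The real mapping-ISOLatin1Accent.txt covers many more characters.
ACCENT_MAP = {
    "\u00eb": "e",  # e with diaeresis
    "\u00e9": "e",  # e with acute
    "\u00ef": "i",  # i with diaeresis
    "\u00f6": "o",  # o with diaeresis
    "\u00fc": "u",  # u with diaeresis
}


def fold_accents(text):
    """Replace every mapped character with its ASCII equivalent."""
    return text.translate(str.maketrans(ACCENT_MAP))


print(fold_accents("bacteri\u00ebn"))
print(fold_accents("financi\u00ebn"))
```

Because the same filter runs at index time and at query time, the accented and unaccented spellings end up as the same term.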

What often gets overlooked is that Tomcat, which receives the search request URI, also has to decode those special characters correctly, or they will arrive at Solr as gibberish. The fix is simple: add the URIEncoding="UTF-8" attribute to the Connector element in Tomcat's conf/server.xml.

This attribute is not set in a standard Tomcat installation, which is what most people use when setting up Solr. Without it, Tomcat decodes the request URI as ISO-8859-1.
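The gibberish is easy to reproduce outside Tomcat. The sketch below is plain Python and not Tomcat's actual code path: it assumes the client percent-encodes the search term as UTF-8 and then decodes it once with ISO-8859-1 and once with UTF-8. The variable names are made up for the example.

```python
from urllib.parse import quote, unquote

term = "bacteri\u00ebn"  # the search term, containing an e with diaeresis

# Assumption: the client percent-encodes the UTF-8 bytes of the term.
encoded = quote(term)

# Decoding the two UTF-8 bytes as ISO-8859-1 yields two wrong characters.
wrong = unquote(encoded, encoding="iso-8859-1")

# Decoding as UTF-8 restores the original term.
right = unquote(encoded, encoding="utf-8")

print(encoded)
print(wrong)
print(right)
```

The wrongly decoded term no longer contains an 'ë' at all, so the char filter has nothing to map to a plain 'e' and the query no longer matches the indexed word.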

This is what the Connector should look like:

<Connector port="8080" protocol="HTTP/1.1" connectionTimeout="20000" redirectPort="8443" URIEncoding="UTF-8" />

Now, when searching for "bacteriën" (bacteria) or "financiën" (finance), Tomcat won't mess up the 'ë', and Solr will properly map the words and look for "bacterien" and "financien".

At the time of this writing, a patch for this issue has just been applied to Solr 4.1 that will take care of the encoding for us. It will ask the HTTP request for its character encoding and convert the parameters accordingly.