Published Tuesday 4th March 2014
Sphinx is a search engine that you can use instead of querying your databases directly. The concept is quite simple: you tell Sphinx how to build indexes from your database, then you run searches against those indexes and get back arrays of IDs ready to look up in your real database, instead of asking the real database to perform a slower search itself. When dealing with millions of records, this approach can give a huge performance boost.
Setting Sphinx up, however, is a little more complicated. I recently did a fresh installation on one of our CentOS 6.5 servers, so I thought I'd note down the steps for others to follow:
CentOS uses the YUM package manager. On some distributions you may find Sphinx is already in the repositories, in which case the below would bring in everything you need.
yum install sphinxsearch
On CentOS 6.5, however, I didn't find it in the standard repositories. If the above finds nothing for you either, download the relevant RPM for your distribution and architecture from http://sphinxsearch.com/downloads. We use AMD64 and the current Sphinx build is 2.1.6, so the below worked for me.
wget http://sphinxsearch.com/files/sphinx-2.1.6-1.rhel6.x86_64.rpm
yum localinstall sphinx-2.1.6-1.rhel6.x86_64.rpm
After installation, YUM should tell you where the config file is located. For most people this will be /etc/sphinx/sphinx.conf and should already contain some basic setup lines. Open this file and we'll go through the important blocks. If, like me, you're planning to use Sphinx for multiple services, you should create individual .conf files for each service, for example, /etc/sphinx/website1/sphinx.conf and /etc/sphinx/website2/sphinx.conf
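If you go down that route, something like the following would create the directories and seed each one with a copy of the default config to edit (the paths are just examples):
mkdir -p /etc/sphinx/website1 /etc/sphinx/website2
cp /etc/sphinx/sphinx.conf /etc/sphinx/website1/sphinx.conf
cp /etc/sphinx/sphinx.conf /etc/sphinx/website2/sphinx.conf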
nano /etc/sphinx/website1/sphinx.conf
The first block should look something like this:
source jobs
{
    type = mysql
    sql_host = localhost
    sql_user =
    sql_pass =
    sql_db =
    sql_port = 3306
    sql_query = \
        SELECT `fldID`, `fldTitle` \
        FROM `tblJobs`
}
This tells Sphinx where to source the data from. You can have as many of these blocks as you like, one for each index you'd like to create. I've called mine 'jobs' because that's the name of the database table I'll be working on. Obviously, you'll want to populate the connection details.
The query part tells Sphinx what to index. Essentially, you're giving it a SELECT query to run, and it will build its own table based on those results. That table is your index, and it's what you'll be searching against, so make sure this query contains all of the fields you need to search on. The first field defined must always be an unsigned, unique integer, as this is the field Sphinx will return to you as a search result, and you'll be using it to look up the real database rows.
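If your data is UTF-8, you can also add a sql_query_pre line to the same block, just before sql_query, to set the connection character set (you'd also want charset_type = utf-8 in the index block further down), and list any additional searchable columns in the SELECT itself. For example:
    sql_query_pre = SET NAMES utf8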
The next block tells Sphinx where to store the results of the above.
index jobs
{
    source = jobs
    path = /var/lib/sphinx/jobs
    docinfo = extern
    charset_type = sbcs
}
Again, if you plan to use Sphinx for multiple services, you should set the path of this block to something more manageable, like /var/lib/sphinx/website1/jobs. I'm calling my index 'jobs' again, because that's what my database table is called.
In the sample sphinx.conf you'll likely then have an 'index testrt' block. This is a relatively new type of index that allows on-the-fly updates. It's safe to remove these blocks, or if you think they might be useful, check out the Sphinx documentation on real-time indexes.
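For reference, a minimal real-time index block looks something like the below (the field and attribute names here are just placeholders), but it isn't needed for the approach described in this post:
index testrt
{
    type = rt
    path = /var/lib/sphinx/testrt
    rt_field = title
    rt_field = content
    rt_attr_uint = group_id
}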
The next block contains your indexer settings. Most of the defaults are good to go so I'm just going to up the memory limit as the default is only 32MB.
indexer
{
    mem_limit = 256M
}
The final block contains the searchd settings. Searchd is the daemon that runs in the background, waiting for and processing your searches.
searchd
{
    listen = 9312
    listen = 9306:mysql41
    log = /var/log/sphinx/searchd.log
    query_log = /var/log/sphinx/query.log
    read_timeout = 5
    max_children = 30
    pid_file = /var/run/sphinx/searchd.pid
    max_matches = 1000000
    seamless_rotate = 1
    preopen_indexes = 1
    unlink_old = 1
    workers = threads # for RT to work
    binlog_path = /var/lib/sphinx/
}
Again, most of the defaults are good to go, but if your end application needs to receive a lot of results you'll want to increase max_matches from its default of 1000, otherwise Sphinx will stop after finding that many results. I find 1000000 still performs reasonably.
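Note that the PHP client used later in this post also caps results on its own side, so if you need more than the default you'll likely want a matching setLimits() call in your application too, something along these lines (the numbers are only examples):
$s->setLimits(0, 1000, 1000000); // offset, results per page, maximum matches to retrieve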
That initial listen setting is the port number the daemon will be running on. Your end application connects to this port and you can only have one daemon running on each, so to use Sphinx for multiple services you'll want each of your configuration files to define a different port number here. That way website1 can use 9312 and website2 can use 9313, for example.
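So in /etc/sphinx/website2/sphinx.conf, for example, you might change those lines to something like:
    listen = 9313
    listen = 9307:mysql41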
With Sphinx configured, we then need to tell it to build these indexes. When you update your real database these indexes will need rebuilding, so you should decide whether to do this every time the database updates or at specific times each day, and build a cron script to action it. If you're dealing with large databases, your updates are probably scheduled already, so the best idea is to schedule a new cron job that runs shortly after those updates. Again, if running multiple services you'll want to do this for each of the config files you've created. Note that if any of your index paths in the above configuration point to directories that don't yet exist, you'll first want to mkdir and chmod 777 them so that Sphinx has access, otherwise the indexer will fail.
mkdir /var/lib/sphinx/website1
chmod 777 /var/lib/sphinx/website1
sudo -usphinx indexer --config /etc/sphinx/website1/sphinx.conf --all --rotate
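As a rough example, a file at /etc/cron.d/sphinx-website1 containing the line below would rebuild the website1 indexes at 3am every day as the sphinx user (adjust the schedule to suit your own update times):
0 3 * * * sphinx indexer --config /etc/sphinx/website1/sphinx.conf --all --rotate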
Finally, start a searchd daemon for each of the configuration files you create.
searchd --config /etc/sphinx/website1/sphinx.conf
You'll want this to start on reboot too. If you didn't move the sphinx.conf and only need one daemon to run, you can tell your init system to manage this for you. On CentOS this is done with chkconfig:
chkconfig searchd on
However, I want to run multiple daemons and have moved my .conf files to non-standard locations, so I edited /etc/rc.local instead.
searchd --config /etc/sphinx/website1/sphinx.conf
searchd --config /etc/sphinx/website2/sphinx.conf
searchd --config /etc/sphinx/website3/sphinx.conf
Sphinx is now all set up for multiple websites or services to use, and if you created a cron script it will keep its own indexes up to date. All that's left is to use it! For my own needs I'll be using this PHP Sphinx client class, and it's quite intuitive.
include('sphinxapi.php');
$s = new SphinxClient;
$s->setServer("localhost", 9312); // Remember to change this to the right port for this service's searchd daemon.
$s->setMaxQueryTime(3);
$s->setMatchMode(SPH_MATCH_BOOLEAN);
$results = $s->query(rawurldecode('+lorem +ipsum'), 'jobs');
print_r($results); // This now contains, among other things, $results['matches'] full of record IDs.
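From there, a rough sketch of how you might pull the real rows out of MySQL using those IDs (assuming a PDO connection in $db, which isn't part of the Sphinx client itself):
$ids = empty($results['matches']) ? array() : array_keys($results['matches']); // Sphinx keys the matches array by document ID.
if (!empty($ids)) {
    $placeholders = implode(',', array_fill(0, count($ids), '?'));
    $stmt = $db->prepare("SELECT `fldID`, `fldTitle` FROM `tblJobs` WHERE `fldID` IN ($placeholders)");
    $stmt->execute($ids);
    foreach ($stmt->fetchAll() as $row) {
        echo $row['fldTitle'] . "\n";
    }
}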
Udit, Wednesday 9th July 2014 19:40
Hello Ric,
Thanks for the step by step Guidance.
Could you please let us know the location of the searchd.log file.
I am facing an issue where I am not able to start the searchd daemon, because it gives me this strange error:
*************
using config file ‘/etc/sphinx/sphinx.conf’…
WARNING: compat_sphinxql_magics=1 is deprecated; please update your application and config
FATAL: failed to open log file ‘searchd.log’: Permission denied
*************
I am not able to locate the file, hence
1. I created a data/searchd.log file
2. Gave 777 permissions and sphinx:sphinx ownership to this directory and file
But alas it did not solve my problem.
I went to the /var/lib/sphinx and did the same, but same issue.
I have been stuck on this issue for the past 2 days and it's eating my head.
Help would be appreciated.
Awaiting response.
Ric, Wednesday 9th July 2014 21:11
Hi Udit,
Your /etc/sphinx/sphinx.conf file should contain a searchd {} block, which in turn should contain a path to this log file. For example: log = /var/log/sphinx/searchd.log
If you don’t have this line, just add one yourself and restart Sphinx. You may need to create this file manually if Sphinx doesn’t do so itself.
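If the directory in that path doesn't exist, or exists but isn't writable by the user searchd runs as, creating it and handing ownership to the sphinx user usually sorts it, for example:
mkdir -p /var/log/sphinx
chown sphinx:sphinx /var/log/sphinx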
I hope this helps!
JP, Sunday 24th May 2015 12:28
Hello Ric,
I have installed the RPM package and I can see the conf file in /etc/sphinx. I want to use Sphinx in MySQL, so when I go to MySQL and run “show engines;” I can't see Sphinx in it. Do I need to configure it? How do I solve it? Kindly reply soon if possible.
Ric, Wednesday 24th June 2015 17:04
Hi JP,
I’ve not used Sphinx as a storage engine before, only as an indexer, but I had a quick look into this for you.
This may be of use: http://sphinxsearch.com/docs/current.html#sphinxse-overview
It seems the storage engine (SphinxSE) requires at least MySQL 5.0.22 or 5.1.12, and doesn’t work at all in MySQL 4. It also looks like you need to compile SphinxSE into MySQL instead of using the vanilla MySQL RPM. Assuming you’re running CentOS / Redhat, this may be of use here: http://sphinxsearch.com/wiki/doku.php?id=sphinx_sphinxse_on_rhel
Other RPM based distributions may provide alternative MySQL packages that already contain SphinxSE. I’d advise checking before compiling this in yourself.
Will, Monday 19th December 2016 12:10
Thanks for the informative post. When setting up for multiple websites, do you also have to separate out paths for ‘pid_file’ and ‘binlog_path’ or can they share the same path?
Ric, Monday 19th December 2016 18:18
Hi Will,
I believe all instances can share a single log but administration could then be difficult and I’m sure you’d regret that configuration in the future!
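To be safe, I'd give each site its own set of paths in its searchd block, along these lines (the directories are only examples and need to exist and be writable by the sphinx user):
    pid_file = /var/run/sphinx/website1/searchd.pid
    binlog_path = /var/lib/sphinx/website1/
    log = /var/log/sphinx/website1/searchd.log
    query_log = /var/log/sphinx/website1/query.log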