Friday, February 5, 2010

Assuring Sophisticated Search with CouchDB Lucene and Cucumber


With a good handle on some of the new couchdb-lucene features, it is time to get them working in my application. Fortunately, I have very good Cucumber coverage over my search features. So best to dive right in...
cstrom@whitefall:~/repos/eee-code$ cucumber ./features/recipe_search.feature:7
Feature: Search for recipes

So that I can find one recipe among many
As a web user
I want to be able search recipes

Scenario: Matching a word in the ingredient list in full recipe search # ./features/recipe_search.feature:7
Given a "pancake" recipe with "chocolate chips" in it # features/step_definitions/recipe_search.rb:1
And a "french toast" recipe with "eggs" in it # features/step_definitions/recipe_search.rb:25
And a 0.5 second wait to allow the search index to be updated # features/step_definitions/recipe_search.rb:212
When I search for "chocolate" # features/step_definitions/recipe_search.rb:216
Then I should see the "pancake" recipe in the search results # features/step_definitions/recipe_search.rb:250
expected following output to contain a <a href='/recipes/2009/04/12/pancake'>pancake</a> tag:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<title>EEE Cooks: Recipes</title>
<link href="/stylesheets/style.css" rel="stylesheet" type="text/css">
<link href="/main.rss" rel="alternate" title="EEE Cooks RSS" type="application/rss+xml">
<link href="/recipes.rss" rel="alternate" title="EEE Cooks Recipe RSS" type="application/rss+xml">
</head>
<html><body>
<div id="header">
<div id="eee-header-logo">
<a href="/">
<img alt="Home" src="/images/eee_corner.png"></a>
</div>
</div>
<ul id="eee-categories">
<li><a href="/recipes/search?q=category:italian">Italian</a></li>
<li><a href="/recipes/search?q=category:asian">Asian</a></li>
<li><a href="/recipes/search?q=category:latin">Latin</a></li>
...
</ul>
<div id="refine-search">
<form action="/recipes/search" id="search-form" method="get">
<input maxlength="2048" name="q" size="31" type="text" value=""><input name="s" type="submit" value="Search">
</form>
</div>
<p class="no-results">
No results matched your search. Please refine your search
</p>

<div id="footer">
...
</div>
</body></html>
</html>
(Spec::Expectations::ExpectationNotMetError)
./features/step_definitions/recipe_search.rb:251:in `/^I should see the "(.+)" recipe in the search results$/'
./features/recipe_search.feature:13:in `Then I should see the "pancake" recipe in the search results'
And I should not see the "french toast" recipe in the search results # features/step_definitions/recipe_search.rb:256
And I should see the search field for refining my search # features/step_definitions/recipe_search.rb:345

Failing Scenarios:
cucumber ./features/recipe_search.feature:7 # Scenario: Matching a word in the ingredient list in full recipe search

1 scenario (1 failed)
7 steps (1 failed, 2 skipped, 4 passed)
0m4.677s
I am somewhat surprised that the application returned an empty result set instead of crashing. The couchdb-lucene resource in the CouchDB database has moved, so I expected an outright failure. Looking in the CouchDB log during the request, I find:
[Fri, 05 Feb 2010 23:11:23 GMT] [debug] [<0.23316.0>] OAuth Params: [{"limit","20"},{"q","chocolate"},{"skip","0"}]

[Fri, 05 Feb 2010 23:11:23 GMT] [info] [<0.23319.0>] EXTERNAL: Starting process for: fti

[Fri, 05 Feb 2010 23:11:23 GMT] [info] [<0.23319.0>] COMMAND: /usr/bin/python /home/cstrom/local/couchdb-lucene/tools/couchdb-external-hook.py

[Fri, 05 Feb 2010 23:11:25 GMT] [info] [<0.23316.0>] 127.0.0.1 - - 'GET' /eee-test/_fti?limit=20&q=chocolate&skip=0 405
Hmm... A 405 (Method Not Allowed) error. That makes sense given that the couchdb-lucene resource is different, but how does that translate into a nice empty result set rather than a crash? The answer:
#...
begin
  data = RestClient.get couchdb_url
  @results = JSON.parse(data)
rescue Exception
  #puts Rack::Utils.escape(@query)
  @query = ""
  @results = { 'total_rows' => 0, 'rows' => [] }
end
#...
Ah! Any exception results in an empty result set being presented to the user. I am not sure that I care for that (perhaps I ought to at least log it), but that explains it. To work with the new version of couchdb-lucene, I can no longer query the _fti resource directly. Instead, I must query a named index underneath it, addressed by design document name and index name. The base URL then becomes:
#...
couchdb_url = "#{@@db}/_fti/recipes/all?limit=20" +
  "&q=#{Rack::Utils.escape(@query)}" +
  "&skip=#{skip}"
#...
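Returning to the rescue clause above for a moment: a minimal sketch of how the failure might be logged rather than silently swallowed. The `fetch_results` helper, the injectable `fetcher`, and the `logger` argument are all assumptions made for illustration (the real action calls `RestClient.get` directly), and the URL in the demo is a placeholder. The sketch also rescues `StandardError` instead of `Exception` so that interrupts and similar signals still propagate.

```ruby
require 'json'
require 'logger'
require 'stringio'

EMPTY_RESULTS = { 'total_rows' => 0, 'rows' => [] }.freeze

# Hypothetical helper: `fetcher` is anything that responds to #call(url)
# and returns a JSON string (in the application it would wrap RestClient.get).
def fetch_results(couchdb_url, fetcher, logger)
  JSON.parse(fetcher.call(couchdb_url))
rescue StandardError => e
  # Record what went wrong, then fall back to an empty result set
  logger.warn("search failed for #{couchdb_url}: #{e.class}: #{e.message}")
  EMPTY_RESULTS
end

# Simulate a failed request with a fetcher that raises (placeholder URL)
log_output = StringIO.new
failing = ->(_url) { raise IOError, '405 Method Not Allowed' }
results = fetch_results('http://localhost:5984/eee-test/_fti', failing, Logger.new(log_output))

puts results.inspect
puts log_output.string
```

The user still sees the friendly "no results" page, but the log now says why.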
Now I need to define the "all" index for "recipes", which is accomplished (like almost everything else in CouchDB) via a design document, here named "_design/recipes":
{
  "_id": "_design/recipes",
  "_rev": "1-c4eab7fa459c9aa4d265e57e5d9c3eb8",
  "fulltext": {
    "all": {
      "index": "function(rec) {//...}"
    }
  }
}
The index function, which is called on each document in the DB, needs to include only published recipe documents. I had this working last year (in production now), so I can copy this from last year's indexing function:
function(rec) {
  if (rec.type == 'Recipe' && rec.published) {
    var doc = new Document();
    // still need to add record fields to the lucene document
    return doc;
  }
}
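Since the index function lives inside the design document as a JSON string, escaping it by hand gets tedious. One possible approach, sketched here under assumptions, is to keep the JavaScript readable in a heredoc and let the JSON library do the escaping. The `design_doc` helper and the `INDEX_FUNCTION` constant are names invented for this example, and the `_rev` is left out on the assumption that this is a brand-new document.

```ruby
require 'json'

# JavaScript source for the index function, kept readable in a heredoc
INDEX_FUNCTION = <<~JS
  function(rec) {
    if (rec.type == 'Recipe' && rec.published) {
      var doc = new Document();
      return doc;
    }
  }
JS

# Hypothetical helper: builds the design document body as a Ruby hash.
# No _rev is included, on the assumption that the document is new.
def design_doc(index_source)
  {
    '_id' => '_design/recipes',
    'fulltext' => {
      'all' => { 'index' => index_source }
    }
  }
end

# Newlines and quotes in the function source are escaped by the generator
puts JSON.pretty_generate(design_doc(INDEX_FUNCTION))
```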
Indexing ingredient names, also copied from last year's implementation, looks like:
function(rec) {
  if (rec.type == 'Recipe' && rec.published) {
    var doc = new Document();

    var ingredients = [];
    for (var i=0; i<rec.preparations.length; i++) {
      ingredients.push(rec.preparations[i]['ingredient']['name']);
    }
    doc.add(ingredients.join(', '));

    return doc;
  }
}
That is still very similar to last year's implementation. The big difference, as I found last night, is how attributes are passed when adding fields to the index document. Here I need to add the recipe date, title, preparation time, and the list of ingredients, and store these values in the index. Stored values are returned along with each search result, so if 20 results come back for a query, each one carries its own title, date, preparation time, and ingredients. The index function that accomplishes this is:
function(rec) {
  if (rec.type == 'Recipe' && rec.published) {
    var doc = new Document();

    doc.add(rec.title, {"store":"yes","field":"title"});
    doc.add(rec.date, {"store":"yes","field":"date"});
    doc.add(rec.prep_time, {"store":"yes","field":"prep_time"});

    var ingredients = [];
    for (var i=0; i<rec.preparations.length; i++) {
      ingredients.push(rec.preparations[i]['ingredient']['name']);
    }
    doc.add(ingredients.join(', '));
    doc.add(ingredients.join(', '), {"store":"yes","field":"ingredient"});
    return doc;
  }
}
With that, I have my Cucumber scenario passing:
cstrom@whitefall:~/repos/eee-code$ cucumber ./features/recipe_search.feature:7
Feature: Search for recipes

So that I can find one recipe among many
As a web user
I want to be able search recipes

Scenario: Matching a word in the ingredient list in full recipe search # ./features/recipe_search.feature:7
Given a "pancake" recipe with "chocolate chips" in it # features/step_definitions/recipe_search.rb:1
And a "french toast" recipe with "eggs" in it # features/step_definitions/recipe_search.rb:25
And a 0.5 second wait to allow the search index to be updated # features/step_definitions/recipe_search.rb:212
When I search for "chocolate" # features/step_definitions/recipe_search.rb:216
Then I should see the "pancake" recipe in the search results # features/step_definitions/recipe_search.rb:250
And I should not see the "french toast" recipe in the search results # features/step_definitions/recipe_search.rb:256
And I should see the search field for refining my search # features/step_definitions/recipe_search.rb:345

1 scenario (1 passed)
7 steps (7 passed)
0m1.786s
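The stored fields are what make it possible to render a results page without fetching each recipe document. A minimal sketch of reading them back on the Ruby side follows. The sample response is hand-written rather than captured from couchdb-lucene: only `total_rows` and `rows` appear in the application code above, so the `fields` key, the document id, and the values are assumptions, as is the `summarize` helper.

```ruby
require 'json'

# Hand-written sample, NOT real couchdb-lucene output. The "fields" key,
# the id, and the values are assumptions about how stored data comes back.
SAMPLE_RESPONSE = <<~SAMPLE
  {
    "total_rows": 1,
    "rows": [
      {
        "id": "2009-04-12-pancake",
        "fields": {
          "title": "pancake",
          "date": "2009/04/12",
          "prep_time": "10",
          "ingredient": "flour, chocolate chips"
        }
      }
    ]
  }
SAMPLE

# Hypothetical helper: pull the stored title and ingredients out of each row
def summarize(results)
  results['rows'].map do |row|
    fields = row.fetch('fields', {}) # tolerate rows without stored fields
    "#{fields['title']} (#{fields['ingredient']})"
  end
end

puts summarize(JSON.parse(SAMPLE_RESPONSE))
```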
I get another scenario or two passing relatively easily until I run into a problem with stemming (matching the search term "whisk" when "whisking" is in the document):
cstrom@whitefall:~/repos/eee-code$ cucumber ./features/recipe_search.feature:26
Feature: Search for recipes

So that I can find one recipe among many
As a web user
I want to be able search recipes

Scenario: Matching a word stem in the recipe instructions # ./features/recipe_search.feature:26
Given a "pancake" recipe with instructions "mixing together dry ingredients" # features/step_definitions/recipe_search.rb:67
And a "french toast" recipe with instructions "whisking the eggs" # features/step_definitions/recipe_search.rb:67
And a 0.5 second wait to allow the search index to be updated # features/step_definitions/recipe_search.rb:212
When I search for "whisk" # features/step_definitions/recipe_search.rb:216
Then I should not see the "pancake" recipe in the search results # features/step_definitions/recipe_search.rb:256
And I should see the "french toast" recipe in the search results # features/step_definitions/recipe_search.rb:250
expected following output to contain a <a href='/recipes/2009/04/12/french_toast'>french toast</a> tag:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
...
<div id="refine-search">
<form action="/recipes/search" id="search-form" method="get">
<input maxlength="2048" name="q" size="31" type="text" value="whisk"><input name="s" type="submit" value="Search">
</form>
</div>
<p class="no-results">
No results matched your search. Please refine your search
</p>
...
</div>
</body></html>
</html>
(Spec::Expectations::ExpectationNotMetError)
./features/step_definitions/recipe_search.rb:251:in `/^I should see the "(.+)" recipe in the search results$/'
./features/recipe_search.feature:33:in `And I should see the "french toast" recipe in the search results'

Failing Scenarios:
cucumber ./features/recipe_search.feature:26 # Scenario: Matching a word stem in the recipe instructions

1 scenario (1 failed)
6 steps (1 failed, 5 passed)
0m2.071s
To get stemming to work last year, I had to fork my own couchdb-lucene. Happily, the latest version of couchdb-lucene has this feature built in as the "porter" stem analyzer. I only need to add the "analyzer" option to the design document:
{
  "_id": "_design/recipes",
  "_rev": "1-c4eab7fa459c9aa4d265e57e5d9c3eb8",
  "fulltext": {
    "analyzer": "perfield:{default:\"porter\"}",
    "all": {
      "index": "function(rec) {//...}"
    }
  }
}
With that, I have my Cucumber scenario passing:
cstrom@whitefall:~/repos/eee-code$ cucumber ./features/recipe_search.feature:26
Feature: Search for recipes

So that I can find one recipe among many
As a web user
I want to be able search recipes

Scenario: Matching a word stem in the recipe instructions # ./features/recipe_search.feature:26
Given a "pancake" recipe with instructions "mixing together dry ingredients" # features/step_definitions/recipe_search.rb:67
And a "french toast" recipe with instructions "whisking the eggs" # features/step_definitions/recipe_search.rb:67
And a 0.5 second wait to allow the search index to be updated # features/step_definitions/recipe_search.rb:212
When I search for "whisk" # features/step_definitions/recipe_search.rb:216
Then I should not see the "pancake" recipe in the search results # features/step_definitions/recipe_search.rb:256
And I should see the "french toast" recipe in the search results # features/step_definitions/recipe_search.rb:250

1 scenario (1 passed)
6 steps (6 passed)
0m1.667s
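For anyone wondering why a stemming analyzer makes "whisk" match "whisking": the indexed text and the query are both reduced to word stems, so they meet on the same term. The toy below illustrates only that idea. It is a naive suffix stripper, not the Porter algorithm that couchdb-lucene uses, and the `naive_stem`, `terms`, and `match?` helpers are invented for this sketch.

```ruby
# Toy stand-in for a stemming analyzer. This is NOT the Porter algorithm;
# it only strips a few common English suffixes to illustrate the idea.
def naive_stem(word)
  word.downcase.sub(/(ing|ed|s)\z/, '')
end

# Hypothetical helper: stem every word of a text into a list of index terms
def terms(text)
  text.scan(/[a-z]+/i).map { |word| naive_stem(word) }
end

# Hypothetical helper: a query matches when its stem is among the text's terms
def match?(query, text)
  terms(text).include?(naive_stem(query))
end

puts match?('whisk', 'whisking the eggs')               # french toast instructions
puts match?('whisk', 'mixing together dry ingredients') # pancake instructions
```

Without the stemming step, the index would hold "whisking" and the query "whisk" would find nothing, which is exactly the failure in the scenario above.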
That is a good stopping point for tonight. I likely still have some other searching scenarios that are not yet passing. I will pick up there tomorrow.

Day #5
