Some WordPress Search Problems are Due to Stopwords

I’m liking WordPress and may switch my main blog over here.  But the search engine has a big problem.

screenshot of no-matches page for the the

If you try to search for an 80s band named “The The”, you will get 0 hits.

That’s because “the” is so common that WordPress simply ignores it — a traditional stopword.
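To make the failure concrete, here is a minimal sketch of what stopword stripping does to a query. It is not WordPress's actual code: the stopword list, the `search` function, and the sample posts are all made up for illustration.

```python
# Hypothetical stopword list; real lists are longer and language-specific.
STOPWORDS = {"a", "an", "the", "of", "to", "be", "or", "not"}

def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()

def search(query, posts, drop_stopwords=True):
    """Return titles of posts whose body contains every query term."""
    terms = tokenize(query)
    if drop_stopwords:
        terms = [t for t in terms if t not in STOPWORDS]
    if not terms:
        # Every term was a stopword, so there is nothing left to look up.
        return []
    return [
        title
        for title, body in posts.items()
        if all(term in tokenize(body) for term in terms)
    ]

# Made-up sample posts, for illustration only.
posts = {
    "Soul Mining": "my favourite album by the the",
    "Gardening": "how to prune the roses",
}

print(search("the the", posts))                        # stopwords dropped
print(search("the the", posts, drop_stopwords=False))  # stopwords kept
```

With stopwords dropped, the query for the band is empty before any post is examined, so the post about the band can never match. With stopwords kept, the results are noisy, but the right post is at least among them.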

screenshot of the the page

But it’s the wrong approach: there really are posts about the band (1, 2, 3, 4), but no one can find them.

And other search engines are smarter, or at least more complete: a search for “the the” on Bing finds 78 million matches.  Granted, most of them are typos where someone typed the word twice, but at least they do match.

Stopwords were invented in early search and information retrieval research, which started with document-processing computer systems.  In the late 20th century, computers had little memory and small storage areas, so system designers were diligent about pruning the data they indexed.  But things have changed now.

Professor Marti Hearst, in her definitive book, Search User Interfaces, points out that ignoring stopwords is opaque to users, who expect that if they type “a”, “an”, or “the”, the search engine will find them.

From the book: In a famous example in the early days of Web search, a searcher who typed “to be or not to be” in a search engine would be shocked to be served empty results. In 1996, a review of eight major search engines found that only AltaVista could handle the Hamlet quote; all others ignored stopwords (Peterson, 1997, Sherman, 2001).

Since then, web search engines have coped with the demands on indexing, processing, and storage, because they’ve discovered that even these seemingly meaningless words have value.
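One way to give those words value is to index every word along with its position, so that an exact phrase can be matched even when it consists entirely of stopwords. The sketch below is a toy illustration of that idea, not a description of how any particular search engine works; `build_index`, `phrase_search`, and the sample documents are hypothetical names and data.

```python
from collections import defaultdict

def build_index(docs):
    """Map each word (stopwords included) to {doc_id: [positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index

def phrase_search(index, phrase):
    """Return ids of documents that contain the words of phrase consecutively."""
    words = phrase.lower().split()
    if not words or any(w not in index for w in words):
        return []
    # Candidate documents must contain every word of the phrase.
    candidates = set(index[words[0]])
    for w in words[1:]:
        candidates &= set(index[w])
    hits = []
    for doc_id in sorted(candidates):
        # The phrase starts at `start` if word i occurs at position start + i.
        for start in index[words[0]][doc_id]:
            if all(start + i in index[w][doc_id] for i, w in enumerate(words)):
                hits.append(doc_id)
                break
    return hits

# Made-up sample documents, for illustration only.
docs = {
    "hamlet": "to be or not to be that is the question",
    "band": "the the played a great show",
    "garden": "how to prune the roses",
}
index = build_index(docs)

print(phrase_search(index, "to be or not to be"))
print(phrase_search(index, "the the"))
```

The cost is a larger index, since the most frequent words carry the longest position lists. That is exactly the memory and storage pressure that made stopwords attractive in the first place.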

This lesson needs to spread to everyone with smaller search engines: improvements in relevance algorithms are less important than knowing our users and giving them what they think they want. Hear me, WordPress?   I’m available if you want to talk about it.



