Search Engine Scorecard
Tuesday, January 01, 2013
Copyright © 2002-2013 Simpatico.blogspot.com. All rights reserved.
The State of Internet Search
By the early 2000s, Internet search engines had seemingly matured. It turns out they are more like the stock market: their performance can be quite volatile. Since late 2003, all three top search engines have had periods of downtime, by which we mean stretches when they returned unacceptable results. The low point came in late 2007, when none of the Big Three earned an A rating for three consecutive months.
Google was the first engine to deteriorate; it finally recovered by mid-2007—after more than three years. Microsoft’s engine was useless for the first half of 2005 and produced disappointing results again between early 2007 and Bing’s 2009 debut. Even Yahoo, the most consistent of the top engines, had a couple of under-performing months in 2007. (As for Cuil, the new search engine that came on-line in July 2008, don’t bother.)
For the first time in at least four years, all three search engines earned an A rating in June 2009. Enjoy it while you can because if history is any indication, you’ll soon find yourself in….
Search Engine Hell
Here are the longest stretches of downtime since 2003, that is, periods when a search engine consistently performed poorly in our testing.
28 months: Google (January 2005-April 2007)
27 months: Microsoft (February 2007-April 2009)
6 months: Microsoft (January 2005-June 2005)
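As a quick check on the arithmetic, here is a minimal Python sketch that counts months inclusively, so that January through June counts as six months. The helper name months_inclusive is our own, made up for illustration.

```python
def months_inclusive(start_year, start_month, end_year, end_month):
    """Count calendar months from start to end, including both endpoints."""
    return (end_year - start_year) * 12 + (end_month - start_month) + 1


# The three periods listed above, as (engine, start, end) tuples.
periods = [
    ("Google", (2005, 1), (2007, 4)),
    ("Microsoft", (2007, 2), (2009, 4)),
    ("Microsoft", (2005, 1), (2005, 6)),
]

for engine, (sy, sm), (ey, em) in periods:
    print(engine, months_inclusive(sy, sm, ey, em), "months")
```

Counting both the first and last month is an assumption about how the periods were measured, but it reproduces the 28, 27, and 6 months shown.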
Search User’s Bill of Rights
We expect a search engine to do two things:
The cardinal sin of a search engine is the failure to include a Web page in its database. We obviously do not expect a search engine to read a new page the second it’s published. But any Web page that has been in existence for a period of time (two weeks?) should be in a search engine’s index. Unless a Web page’s host is constantly out of service, there’s no reason why a page should disappear from a search engine’s index. Its ranking may vary—not fluctuate wildly—but a search engine should never remove it indiscriminately. The ability to scan the entire Internet is only limited by a company’s time and storage capacity. There are some Web pages that we wish were not indexed (more on that later).
New Search Order
Almost as critical as what is in a search engine’s index is the order in which it returns the results of a query. Different companies use different algorithms to determine the relevance of each match found. If search engines were interchangeable, this Web page wouldn’t be necessary. Click on Return to Simpatico Stock Watch for more background information and analysis.
This Test Bed’s for You
If you work in the QA department of a search engine company, we feel your pain. How can you tell if a Web page is missing from the search index if you’ve never seen it before? As an independent publisher, we want to make sure Internet users can see our pages when they turn to their favorite search engines. Since 2003, we have devised a series of queries to test the reliability and accuracy of search engines.
Our queries range from the general to the specific. They return anywhere from fewer than 50 hits to over two million. We use no more than three words in a query (we count anything within quotations as one word) because studies have shown people normally don’t enter long queries. Some of our targeted Web pages are updated weekly and some less frequently. In a perfect world, we’d like to see one of these specific pages returned among the first 10 results of a search. But we’ll settle for another one of our own pages. Yes, we have a biased perspective. Our visitors can judge for themselves whether our Web pages deserve to be placed near the top.
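The three-word limit described above, with anything inside quotation marks counted as one word, can be sketched in a few lines of Python. This is only an illustration of the counting rule; the function names count_query_terms and within_limit are hypothetical, and the sketch assumes plain double quotes.

```python
import re


def count_query_terms(query):
    """Count query terms, treating a double-quoted phrase as a single term."""
    # Match a complete quoted phrase first; otherwise match a run of non-space characters.
    return len(re.findall(r'"[^"]*"|\S+', query))


def within_limit(query, max_terms=3):
    """Return True if the query uses no more than max_terms terms."""
    return count_query_terms(query) <= max_terms


print(count_query_terms('"search engine" scorecard 2009'))
print(within_limit("one two three four"))
```

The first query counts as three terms because the quoted phrase is a single term; the second has four terms and exceeds the limit.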
While our pages are Web crawler-friendly (as far as we know), their relative anonymity really tests a search engine’s coverage. For one thing, we’re not some well-known corporate entity. We don’t even have our own domain name, which means our “home” page may not be apparent to less discerning software. But since all Web pages are equal on the Internet—for now anyway—search engines should be able to find and index our pages with no trouble at all (or so we thought).
The Big Three
Search engines have been ranking Web pages for years; it’s time for us to rate their performance. Once a month we perform the above searches using each search engine’s default settings. To keep things simple, we award a ballpark grade to each search engine. That is, a C may be as high as a C+ or as low as a C-. Here’s how the three most popular search engines stack up against each other. Note that we now report the “real” number of hits for each query. The next time your search engine returns hundreds of thousands of results, just keep clicking on the next highest page and soon you’ll reach the “actual” end minus all the “omitted results.” Both Yahoo and Microsoft have a maximum of 1,000 hits, and Google’s ceiling seems to be at least 700.
* Highest ranking Web page from our family of sites, excluding this one and any Blogger archive.
? – Ranking below No. 200.
M – Target page missing from search index and none of our other related pages in top 200.
** Complete data available upon request.
It’s Not Brain Surgery
We concede that scoring search engines based solely on the rankings of our own Web pages does not provide a complete picture of their performance. If we had the time (we imagine search companies must already be doing this), we would click on each of the first 200 returned results and assign a value as follows.
+3 – extremely relevant
+1 – somewhat relevant
0 – irrelevant but at least related to query topic
-1 – irrelevant and totally unrelated to query topic
-3 – spam site/mirror site (see below)
Then we’d adjust the total score by examining the distribution of the most relevant results. Are they found among the first 50 hits or after that? But this scoring system still wouldn’t address those relevant Web pages that are not among the first 200 results, assuming they are indexed at all. That’s why our evaluation method is telling—however limited it may be. If you expect to see something, then you can tell if it’s missing.
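The point scale above can be sketched in Python as follows. The label names and the function score_results are our own inventions for illustration. Because we never spelled out how the total would be adjusted, the sketch does not invent an adjustment; it simply reports, alongside the raw total, what share of the extremely relevant results fall within the first 50 hits.

```python
# Point values taken from the scale above; the label names are hypothetical.
SCORES = {
    "extremely_relevant": 3,
    "somewhat_relevant": 1,
    "irrelevant_related": 0,
    "irrelevant_unrelated": -1,
    "spam_or_mirror": -3,
}


def score_results(labels, cutoff=50):
    """Return (total score, share of extremely relevant hits ranked within cutoff).

    labels is a list of hand-assigned labels in ranked order, first result first.
    """
    total = sum(SCORES[label] for label in labels)
    best = [rank for rank, label in enumerate(labels, start=1)
            if label == "extremely_relevant"]
    early = sum(1 for rank in best if rank <= cutoff)
    share = early / len(best) if best else 0.0
    return total, share


labels = ["extremely_relevant", "spam_or_mirror", "somewhat_relevant"]
print(score_results(labels))
```

For the three sample labels the total is 3 - 3 + 1 = 1, and the single extremely relevant result sits inside the first 50, so the share is 1.0.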
More Transparency, Please
Search engine companies like to tout the billions of Web pages in their databases. Here are two questions they don’t like to address.
Ideally, a search engine should read every Web page at least once a week. And it should take no more than a week to build an index. With any other kind of software, you can easily find out which version you’re running. Search engine companies don’t release this information except when they’re promoting a major upgrade (as was the case with the new MSN Search in 2005).
We would also like to see search companies be a little more open about their ranking methodologies (we understand there are legitimate reasons for being guarded). For example, it seems frequently updated pages are ranked much higher than static ones. Just because a page doesn’t get updated all the time doesn’t mean the information is no good. It all depends on the query.
More Money, More Problems
In addition to the aforementioned issues, here are a few more common problems that plague most of the search engines.
Throwing the Baby out…
Sometimes we really question the rationale behind some design changes. One search company decided a few years ago to downplay the importance of a Web page’s title. We agree the title alone is not that significant. But if the title, content, and URL all refer to the same topic, then it is noteworthy. Libraries classify books with the Dewey Decimal system. The title and URL are an Internet publisher’s way of “tagging” a Web page (this is especially true for bloggers who rely on third-party hosts such as blogspot.com).
A spam site may use a document name that matches its plagiarized content. The main part of the URL, however, seldom relates to the content. The following is a sample of spam sites that excerpted or linked to our pages at one time.
Note that in most cases, the spam site’s URL is a dead giveaway. With a little bit of analysis, any search engine should be able to disregard these bogus sites. Ignoring the title of a Web page is not the way to go.
Do you suppose search companies tolerate or even support spam sites because they carry so many ads? Spam sites clog up a search index—not to mention the Internet—and could wreck the rankings of legitimate Web pages. They’re as insidious as junk e-mail. The fact that we found more than 800 spam pages using one of the Big Three engines (for one of our target pages above) should give anyone pause. Perhaps search engines should provide a spam filter option. Filtering spam is more essential than filtering adult content (we already have software that takes care of the latter).
Here’s another idea: Search engines should allow users to specify how much weight to give to titles and URLs on the preference page. Letting search users have a say in how results should be ordered is probably inevitable. Instead of viewing it as a sign of weakness or defeat, search companies should embrace this modification because it offers users the option to choose among competing factors.
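To make the idea concrete, here is a rough Python sketch of user-adjustable weights. Everything in it is an assumption for illustration: the Result record, the base_score field standing in for whatever an engine already computes, and the simple rule of adding one weighted point per query word found in the title or URL. No search engine is known to work this way.

```python
from dataclasses import dataclass


@dataclass
class Result:
    url: str
    title: str
    base_score: float  # hypothetical stand-in for the engine's own relevance score


def rerank(results, query, title_weight=1.0, url_weight=1.0):
    """Reorder results using user-chosen weights for title and URL matches."""
    words = query.lower().split()

    def score(result):
        title_hits = sum(word in result.title.lower() for word in words)
        url_hits = sum(word in result.url.lower() for word in words)
        return result.base_score + title_weight * title_hits + url_weight * url_hits

    return sorted(results, key=score, reverse=True)


a = Result("example.com/x", "Search Engine Scorecard", 1.0)
b = Result("scorecard.example.com", "Home", 1.5)

# A user who trusts titles sees page a first; one who trusts URLs sees page b first.
print([r.url for r in rerank([a, b], "scorecard", title_weight=2.0, url_weight=0.0)])
print([r.url for r in rerank([a, b], "scorecard", title_weight=0.0, url_weight=2.0)])
```

The same two pages swap places depending on the weights, which is the kind of choice among competing factors we have in mind.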
The Worst Is Yet to Come
Just when we’re getting tired of spam sites that plagiarize snippets of other people’s Web pages, we’ve also seen Web sites that brazenly co-opt entire pages—text and template. We know of two sites, started in 2006 or earlier, that act like real-time mirror sites for blogspot.com. All links to blogspot.com are converted into links that point to a target domain, including news feed and blogger.com references. For instance, a page with the URL abc.blogspot.com would become accessible as mirrorsite.com/abc, and every blogspot.com link on this mirror page would appear as mirrorsite.com/whatever. It’s a thin line between an official proxy server and information superhighway robbery.
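The rewriting described above is simple enough to sketch in Python, and so is the reverse mapping that a search engine could use to trace a mirror page back to its original. The function names are hypothetical, mirrorsite.com is the placeholder used above, and carrying the rest of the path over unchanged is our assumption, since we have described only the home-page case.

```python
from urllib.parse import urlsplit

BLOG_SUFFIX = ".blogspot.com"


def _split(url):
    # urlsplit needs a scheme or a leading "//" to recognize the host.
    return urlsplit(url if "//" in url else "//" + url)


def to_mirror(url, mirror_host="mirrorsite.com"):
    """Rewrite abc.blogspot.com/path as mirrorsite.com/abc/path."""
    parts = _split(url)
    host = parts.hostname or ""
    if not host.endswith(BLOG_SUFFIX):
        return url
    blog = host[: -len(BLOG_SUFFIX)]
    return f"{mirror_host}/{blog}{parts.path}"


def from_mirror(url, mirror_host="mirrorsite.com"):
    """Recover the original blogspot.com address from a mirror URL."""
    parts = _split(url)
    if parts.hostname != mirror_host:
        return url
    blog, _, rest = parts.path.lstrip("/").partition("/")
    if not blog:
        return url
    return f"{blog}{BLOG_SUFFIX}" + (f"/{rest}" if rest else "")


print(to_mirror("abc.blogspot.com"))
print(from_mirror("mirrorsite.com/abc/2009/06/post.html"))
```

If the mapping really is this mechanical, an engine that knows a mirror’s host name could fold each mirror page back into the original’s index entry rather than rank the two separately.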
The creators of these sites claim they did it because blogspot.com is blocked in certain countries, but fighting alleged censorship seems to be a pretext (it would be easy for any government to block their sites). At best these dubious mirror sites violate copyrights, confuse search engines and Web surfers, and conveniently divert traffic to them. (Some search engines penalize similarly worded pages. We’ve seen one company place mirrorsite.com/abc in its search index but not abc.blogspot.com! You can imagine the impact these sites have on page ranking.) At worst they could sabotage contextual advertising from Google, Yahoo, and others. If companies are worried about click fraud now, get ready for click theft.
The Internet is just too irresistible to con artists. No, we’re not blaming the Net—that would be like blaming the postal service and the telephone for the proliferation of junk mail and telemarketers. Web publishers must remain vigilant.
Instead of meaningless features such as index size, we’re more interested in important ones that are either shrouded in mystery or long overdue.
Page handling: The most important thing about a Web crawler is how often it revisits a typical page. We also need to know how much of a Web page is analyzed each time. Is it 50 percent? Is it the beginning, the middle, and the end? Is it the first 100K bytes? What happens if a page has two photos at the top that take up 100K bytes? Will anything else beyond that get indexed? We’re not convinced search engines can read tables reliably because we’ve never seen a highlighted table cell in a page cache. Some search engines disregard meta tags since they don’t always match the content of a page. We recommend that they be ignored only when they seem totally unrelated to the content.
Index building: How long does it take a search engine to build its database? Or how often does a company release a new index? We would find the answer to that question if search companies provided version tracking (see below). For some unknown reason, search engines continue to include archived pages from Blogger but not the current pages themselves. Whether they are legitimate or questionable, all mirror sites should be ignored by search engines.
User preferences: If search engines are filtering spam sites, they’re not doing a very good job of it. The last thing we want to see is censorship on the Internet, so this filter should be another option under search preferences. Not only should search companies filter spam pages per user request, but they also need to make sure these sites don’t adversely affect the rankings of other Web pages. As we stated earlier, search users should have more control over the ordering of results. Some of the competing factors include matching titles, matching URLs, established versus newly published pages, frequently updated versus static documents, domestic versus international sites, business/commercial versus non-shopping sites, forums/newsgroups versus regular pages, and official versus unofficial sites. Microsoft and Ask have added the user’s city on their preference pages.
Version tracking: Powerful search engines have been around since the mid-1990s. It’s odd that search companies still do not provide a version number for the engine and a version number (or date) for the index. What are they trying to hide?
E-Mail to Nowhere
The sorting of search results is not an exact science; it’s not surprising the companies involved don’t always get it right. What is inexcusable is the lack of interest in customer input—we can’t decide if it’s arrogance, apathy, or incompetence. One company’s modus operandi is to always deny there’s any problem. You should at least be able to send feedback with just one or two clicks.
* Sent e-mail to Yahoo’s search blog admin since we couldn’t find any contact info in main search area.
** Sent e-mail to Google’s AdSense group.
Keeping an eye on search engines is only a by-product of our Web publishing activities. Here are some sites that focus on the business of Internet search.
Cool Web Surfers Don't Cut and Paste
Would you like to share this Web page with friends? Don't cut and paste. Provide a Web link to this page or refer to its Web address. We invite all content providers to join our "Don't Cut and Paste" campaign.
Copyright © 2002-2013 Calba Media LLC. All rights reserved.