Search Engine Scorecard

Wednesday, January 01, 2014
Copyright © 2002-2014 Simpatico.blogspot.com. All rights reserved.
 
The State of Internet Search

By the early 2000s, Internet search engines had seemingly matured. It turns out they are more like the stock market: their performance can be quite volatile. Since late 2003, all three top search engines have had periods of downtime, stretches when they returned unacceptable results. The low point came in late 2007, when none of the Big Three earned an A rating for three consecutive months.

Google was the first engine to deteriorate; it finally recovered by mid-2007—after more than three years. Microsoft’s engine was useless for the first half of 2005 and produced disappointing results again between early 2007 and Bing’s 2009 debut. Even Yahoo, the most consistent of the top engines, had a couple of under-performing months in 2007. (As for Cuil, the new search engine that came on-line in July 2008, don’t bother.)

For the first time in at least four years, all three search engines earned an A rating in June 2009. Enjoy it while you can because if history is any indication, you’ll soon find yourself in….

Search Engine Hell

Here are the longest stretches since 2003 during which a search engine consistently performed poorly in our testing.

28 months: Google (January 2005-April 2007)
27 months: Microsoft (February 2007-April 2009)
6 months: Microsoft (January 2005-June 2005)
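
These stretches can be recomputed from the monthly grades in the report cards further down this page. The sketch below assumes that "performed poorly" means a grade of C or D; that threshold is an assumption inferred from the grades, not a stated rule, and the function names are illustrative only.

```python
from itertools import groupby

# Monthly grades, January 2005 through June 2009, transcribed from the
# report cards and the 2009 ranking tables on this page.
GRADES = {
    "Google": "DDCDDCDCDCCC" "CCCCCCCDCCCC" "CCCCBBBABBBA" "ABAAAAAAAAAA" "AAAAAA",
    "Yahoo": "AAABAAAAAABA" "AAAAAADAAAAA" "AAAAAAADBDBB" "BBBAAAABAABB" "AAAAAA",
    "Microsoft": "DDDDDDAAAAAA" "AAAACAAAAAAA" "ACDDDDDDDDDD" "DDDDDDDDDDDD" "DDDDBA",
}
START_YEAR = 2005
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def month_label(index):
    """Convert a 0-based month offset from January 2005 into 'Mon YYYY'."""
    year, month = divmod(index, 12)
    return f"{MONTHS[month]} {START_YEAR + year}"


def poor_streaks(grades, poor="CD"):
    """Yield (length, first_index, last_index) for each run of poor grades."""
    index = 0
    for is_poor, run in groupby(grades, key=lambda grade: grade in poor):
        length = len(list(run))
        if is_poor:
            yield length, index, index + length - 1
        index += length


def longest_streaks(all_grades, top=3):
    """Return the `top` longest poor streaks across all engines."""
    found = [
        (length, engine, month_label(first), month_label(last))
        for engine, grades in all_grades.items()
        for length, first, last in poor_streaks(grades)
    ]
    return sorted(found, key=lambda streak: streak[0], reverse=True)[:top]


for length, engine, first, last in longest_streaks(GRADES):
    print(f"{length} months: {engine} ({first}-{last})")
```

Under that assumption, the three longest streaks come out as 28 months for Google (Jan 2005-Apr 2007), 27 months for Microsoft (Feb 2007-Apr 2009), and 6 months for Microsoft (Jan 2005-Jun 2005), matching the list above.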

Search User’s Bill of Rights

We expect a search engine to do two things:

  1. Include every accessible Web page in its index.
  2. Return the most relevant results first.

The cardinal sin of a search engine is the failure to include a Web page in its database. We obviously do not expect a search engine to read a new page the second it’s published. But any Web page that has been in existence for a period of time (two weeks?) should be in a search engine’s index. Unless a Web page’s host is constantly out of service, there’s no reason why a page should disappear from a search engine’s index. Its ranking may vary—not fluctuate wildly—but a search engine should never remove it indiscriminately. The ability to scan the entire Internet is only limited by a company’s time and storage capacity. There are some Web pages that we wish were not indexed (more on that later).

New Search Order

Almost as critical as what is in a search engine’s index is the order of returned results of a query. Different companies use different algorithms to determine the relevance of each match found. This Web page wouldn’t be necessary if search engines were interchangeable. Click on Return to Simpatico Stock Watch for more background information and analysis.

This Test Bed’s for You

If you work in the QA department of a search engine company, we feel your pain. How can you tell if a Web page is missing from the search index if you’ve never seen it before? As an independent publisher, we want to make sure Internet users can see our pages when they turn to their favorite search engines. Since 2003, we have devised a series of queries to test the reliability and accuracy of search engines.

Our queries range from the general to the specific. They return anywhere from fewer than 50 hits to over two million. We use no more than three words in a query (we count anything within quotations as one word) because studies have shown people normally don’t enter long queries. Some of our targeted Web pages are updated weekly and some less frequently. In a perfect world, we’d like to see one of these specific pages returned among the first 10 results of a search. But we’ll settle for another one of our own pages. Yes, we have a biased perspective. Our visitors can judge for themselves whether our Web pages deserve to be placed near the top.

Sample Queries
| No. | Query | Target Page |
| --- | --- | --- |
| Q1 | 92.7 kngy | simkpti.blogspot.com |
| Q2 | 92.7 energy “bay area” | simkpti.blogspot.com |
| Q3 | 92.7 kpti playlist | simkpti.blogspot.com |
| Q4 | 92.7 kpti | simkpti.blogspot.com |
| Q5 | “fantasy playlist” | hadiscs.blogspot.com |
| Q6 | “death of modern rock” | simcomments.blogspot.com |
| Q7 | “dance radio” directory stations | danceradio.blogspot.com or simdanceradio.blogspot.com |
| Q8 | “dance radio” playlist | simdancemix.blogspot.com |
| Q9 | “dance radio” | danceradio, simdanceradio, or simdancemix |
| Q10 | “bay area radio” directory history | bayradio.blogspot.com |
| Q11 | “bay area radio” directory | simbayradio.blogspot.com |
| Q12 | “bay area radio” | bayradio or simbayradio |

While our pages are Web crawler-friendly (as far as we know), their relative anonymity really tests a search engine’s coverage. For one thing, we’re not some well-known corporate entity. We don’t even have our own domain name, which means our “home” page may not be apparent to less discerning software. But since all Web pages are equal on the Internet—for now anyway—search engines should be able to find and index our pages with no trouble at all (or so we thought).

The Big Three

Search engines have been ranking Web pages for years; it’s time for us to rate their performance. Once a month we perform the above searches using each search engine’s default settings. To keep things simple, we award a ballpark grade to each search engine. That is, a C may be as high as a C+ or as low as a C-. Here’s how the three most popular search engines stack up against each other. Note that we now report the “real” number of hits for each query. The next time your search engine returns hundreds of thousands of results, just keep clicking through to the highest-numbered results page and soon you’ll reach the “actual” end minus all the “omitted results.” Both Yahoo and Microsoft have a maximum of 1,000 hits, and Google’s ceiling seems to be somewhere above 700.

2009 Simpatico Rankings* by Google
(position/no. of hits [for latest month only])
QueryJanFebMarAprMayJun
Q14497911/537
Q2?196192?115?/687
Q3111111/64
Q4111111/181
Q5144154141124128184/206
Q6313131303336/40
Q7194119100224645/488
Q810371654/641
Q9??????/626
Q10544224/237
Q11111111/453
Q1279981011/708
GradeAAAAAA

2009 Simpatico Rankings* by Yahoo
(position/no. of hits [for latest month only])
QueryJanFebMarAprMayJun
Q1744455/997
Q2??????/1,000
Q3111111/32
Q4111111/119
Q539982423/293
Q66106456/54
Q7221720232813/1,000
Q81341422/1,000
Q91823565215765/1,000
Q10111111/695
Q11533333/1,000
Q121066766/1,000
GradeAAAAAA

2009 Simpatico Rankings* by Microsoft
(position/no. of hits [for latest month only])
QueryJanFebMarAprMayJun
Q1MMMM514/675
Q2MMMM559/1,000
Q3MMMM41/43
Q4MMMM1311/164
Q5MMMM9354/238
Q6MMMMMM/44
Q7MMMM446/1,000
Q8MMMM?7/1,000
Q9MMMM??/1,000
Q10MMMM11/604
Q11MMMM11/962
Q12MMMM55/999
GradeDDDDBA

* Highest ranking Web page from our family of sites, excluding this one and any Blogger archive.
? – Ranking below No. 200.
M – Target page missing from search index and none of our other related pages in top 200.
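
The legend above amounts to a small decision rule for each table cell. A minimal sketch follows, assuming the result URLs for a query are already available as an ordered list. The function name, the `is_excluded` hook (standing in for "this page and any Blogger archive"), and the `target_indexed` flag are hypothetical; the source describes these checks but does not say how they are carried out.

```python
from urllib.parse import urlsplit


def table_entry(result_urls, family_hosts, target_indexed,
                is_excluded=lambda url: False, cutoff=200):
    """Return the cell value for one query: a position, '?' or 'M'."""
    for position, url in enumerate(result_urls[:cutoff], start=1):
        # Accept URLs with or without a scheme.
        host = urlsplit(url if "//" in url else "//" + url).hostname
        if host in family_hosts and not is_excluded(url):
            return str(position)
    # Nothing from the family of sites appeared in the top `cutoff` results.
    return "?" if target_indexed else "M"


family = {"simkpti.blogspot.com", "simbayradio.blogspot.com"}
results = ["example.com/radio", "http://simbayradio.blogspot.com/", "example.org"]
print(table_entry(results, family, target_indexed=True))
```

With the made-up result list shown, the call prints `2`, since the first page from the family of sites sits in second place.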

2008 Report Card**
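| Engine | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Google | A | B | A | A | A | A | A | A | A | A | A | A |
| Yahoo | B | B | B | A | A | A | A | B | A | A | B | B |
| Microsoft | D | D | D | D | D | D | D | D | D | D | D | D |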

2007 Report Card**
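| Engine | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Google | C | C | C | C | B | B | B | A | B | B | B | A |
| Yahoo | A | A | A | A | A | A | A | D | B | D | B | B |
| Microsoft | A | C | D | D | D | D | D | D | D | D | D | D |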

2006 Report Card**
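| Engine | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Google | C | C | C | C | C | C | C | D | C | C | C | C |
| Yahoo | A | A | A | A | A | A | D | A | A | A | A | A |
| Microsoft | A | A | A | A | C | A | A | A | A | A | A | A |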

2005 Report Card**
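| Engine | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Google | D | D | C | D | D | C | D | C | D | C | C | C |
| Yahoo | A | A | A | B | A | A | A | A | A | A | B | A |
| Microsoft | D | D | D | D | D | D | A | A | A | A | A | A |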

** Complete data available upon request.

It’s Not Brain Surgery

We concede our method of scoring search engines based solely on the rankings of our Web pages does not provide a complete picture of their performance. If we had the time (we imagine search companies must already be doing this), we would click on each of the first 200 returned results and assign a value as follows.

+3 – extremely relevant
+1 – somewhat relevant
0 – irrelevant but at least related to query topic
-1 – irrelevant and totally unrelated to query topic
-3 – spam site/mirror site (see below)

Then we’d adjust the total score by examining the distribution of the most relevant results. Are they found among the first 50 hits or after that? But this scoring system still wouldn’t address those relevant Web pages that are not among the first 200 results, assuming they are indexed at all. That’s why our evaluation method is telling—however limited it may be. If you expect to see something, then you can tell if it’s missing.
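
The scoring idea above can be sketched in a few lines. The point values come straight from the list; the short label names, the function name, and the choice to report the "first 50" share as a separate number are assumptions, since the text does not say how the total would actually be adjusted.

```python
# Point values from the list above; the short label names are made up here.
POINTS = {
    "extremely_relevant": 3,
    "somewhat_relevant": 1,
    "related": 0,
    "unrelated": -1,
    "spam_or_mirror": -3,
}


def score_results(labels, early_cutoff=50):
    """Score an ordered list of hand-assigned labels (best-ranked first).

    Returns the raw total and the share of 'extremely_relevant' hits that
    appear within the first `early_cutoff` positions.
    """
    total = sum(POINTS[label] for label in labels)
    best = [pos for pos, label in enumerate(labels, start=1)
            if label == "extremely_relevant"]
    early = sum(1 for pos in best if pos <= early_cutoff)
    early_share = early / len(best) if best else 0.0
    return total, early_share


labels = ["extremely_relevant", "spam_or_mirror", "somewhat_relevant", "related"]
print(score_results(labels))
```

For the four made-up labels shown, the total is 1 and the only extremely relevant hit falls inside the first 50, so the call prints `(1, 1.0)`. As the paragraph above notes, a score like this still says nothing about relevant pages that never make the first 200 results.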

More Transparency, Please

Search engine companies like to tout the billions of Web pages in their databases. Here are two questions they don’t like to address.

  1. How often does an engine revisit a typical Web page?
  2. How long does it take an engine to build its index?

Ideally, a search engine should read every Web page at least once a week. And it should take no more than a week to build an index. With any other kind of software, you can easily find out which version you’re running. Search engine companies don’t release this information except when they’re promoting a major upgrade (as was the case with the new MSN Search in 2005).

We would also like to see search companies be a little more open about their ranking methodologies (we understand there are legitimate reasons for being guarded). For example, it seems frequently updated pages are ranked much higher than static ones. Just because a page doesn’t get updated all the time doesn’t mean the information is no good. It all depends on the query.

More Money, More Problems

In addition to the aforementioned issues, here are a few more common problems that plague most of the search engines.

  • The ability to read tables remains inconsistent.
  • Archived Web pages (courtesy of Blogger and others) are indexed instead of the current pages themselves.
  • Spam sites, the ones that include excerpts from various Web pages or masquerade as directories, continue to get indexed and are sometimes ranked much higher than the pages they raid.
  • Not all links to a specific page are counted (Technorati has the same problem) because some people use redirection.
  • Only the first few paragraphs of a Web page are read.
  • The word “directory” seems to have special meaning to at least one search engine.
  • Mirror sites are indexed instead of the master sites.

Throwing the Baby out…

Sometimes we really question the rationale behind some design changes. One search company decided a few years ago to downplay the importance of a Web page’s title. We agree the title alone is not that significant. But if the title, content, and URL all refer to the same topic, then it is noteworthy. Libraries use the Dewey Decimal system to classify books. The title and URL are an Internet publisher’s way of “tagging” a Web page (this is especially true for bloggers who rely on third-party hosts such as blogspot.com).

A spam site may use a document name that matches its plagiarized content. The main part of the URL, however, seldom relates to the content. The following is a sample of spam sites that excerpted or linked to our pages at one time.

Search Engine Traps
| Web Page | Referenced on Spam Sites |
| --- | --- |
| simkpti | beat-road.info, yourlimosine.info, limosineservice.info, awlnk.com, ketchup.craiga92.info, agoner.com, mpmusik2.info |
| danceradio | 19278.info, fncsnet.info, foradance.com, dancesteps.info, romanian.bivgaaci.info, baselib.info |
| simdancemix | webvideos.it, gwnet.com, mariah-carey.iwantmo.info, highspeedinternetnews.com, backstreet-boys.iwantmo.info, 1woodfloors27.info, 1marketingmix81.info, foradance.com, dancesteps.info, en3.wm99.com, jasonnevins.formjason.com, romanian.bivgaaci.info, childsafety.lostmoney.be, evanescencemylastbreath.grandmothers.be, table-furniture-guide.com, mp3musicdownloads.bestmp3musicdownloadsfast.info, gorillaz.amsea.info, seanpaulwebeburning.dallasmorningnews.be, dailysoftware.net, …only-direct-source.info, …extreme-links.info, …leap-links-resource.info, …full-steam-links-resource.info, …links-connect.info, …only-the-best-links-51.info, …ultimate-resource-links-5.info, …top-results-links-13.info, eatas.com, …public-resource-link-repository-89.info, …online-links-resource.info, radio-blogs.biz, …enterprise-links-resource.info, …immense-links-resource.info, …greatest-links-resource-13.info, …ultimate-resource-links-23.info, …world-reknown-resource-links-37.info, …ultimate-resource-links-99.info, …right-minded-free-links.info, …only-the-best-links-63.info, the-best-links-resource-43.info, …incredible-free-links.info, 3086.ighlu5o.org, …momentous-links-resource.info, …only-the-best-links-13.info, …summer.digipointhost.com, asbestospoisoning.be, …the-best-links-resource-17.info, …distinguished-free-links.info, …only-the-best-links-93.info, …online-links-resource.info, …the-best-links-resource-83.info, wedding.kixen.com, …online-links-resource.info |
| bayradio | bigstanradio.com, everything-radio.info, chicagoradiostations.justmysize.be, chicagoradiostations.ceilingfan.be, radio.deephqah.info, onlineradiostations.anorexiaandbulimia.be, onlineradiostations.digitalrecorder.be, pxn.com, mp-musik4.info, sandiegoradiostations.dallasmorningnews.be, radionstations.dallasmorningnews.be |
| simbayradio | 1videokilledtheradiostar32.info, everything-radio.info, fishtank.dk, theedge.mildedge.com, sanfranciscoradio.queryfrancisco.com, radio.haukvaka.info, bradwurst.dk, en10.rongsou.com.cn, extreme-forum.dk, onlinradiostations.digitalrecorder.be, radio-stations.celebrity-fan.be, radiostations.motorcyclebatteries.be, teachingchristianwomentoliveinendtimes.justmysize.be |

Note that in most cases, the spam site’s URL is a dead giveaway. With a little bit of analysis, any search engine should be able to disregard these bogus sites. Ignoring the title of a Web page is not the way to go.

Do you suppose search companies tolerate or even support spam sites because they carry so many ads? Spam sites clog up a search index—not to mention the Internet—and could wreck the rankings of legitimate Web pages. They’re as insidious as junk e-mail. The fact that we found more than 800 spam pages using one of the Big Three engines (for one of our target pages above) should give anyone pause. Perhaps search engines should provide a spam filter option. Filtering spam is more essential than filtering adult content (we already have software that takes care of the latter).

Here’s another idea: Search engines should allow users to specify how much weight to give to titles and URLs on the preference page. Letting search users have a say in how results should be ordered is probably inevitable. Instead of viewing it as a sign of weakness or defeat, search companies should embrace this modification because it offers users the option to choose among competing factors.
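
To make the idea concrete, here is a rough sketch of how user-set title and URL weights could reorder a result list. Everything in it is hypothetical: the `Result` fields, the `base_score` standing in for whatever an engine computes internally, the simple term-overlap measure, and the example titles and scores. It is not a description of how any real engine ranks pages.

```python
from dataclasses import dataclass


@dataclass
class Result:
    url: str
    title: str
    base_score: float  # the engine's own relevance estimate (hypothetical)


def term_overlap(terms, text):
    """Fraction of query terms that occur somewhere in `text`."""
    text = text.lower()
    return sum(term in text for term in terms) / len(terms) if terms else 0.0


def rerank(results, query, title_weight=0.0, url_weight=0.0):
    """Order results by base score plus user-weighted title and URL matches."""
    terms = query.lower().split()

    def score(result):
        return (result.base_score
                + title_weight * term_overlap(terms, result.title)
                + url_weight * term_overlap(terms, result.url))

    return sorted(results, key=score, reverse=True)


# Made-up titles and scores for two hosts mentioned on this page.
results = [
    Result("radio.deephqah.info", "Radio links", 1.0),
    Result("simbayradio.blogspot.com", "Bay Area Radio Directory", 0.8),
]
print([r.url for r in rerank(results, "bay area radio")])
print([r.url for r in rerank(results, "bay area radio",
                             title_weight=1.0, url_weight=1.0)])
```

With both weights at zero the order follows the base scores alone; once the user turns the title and URL weights up to 1.0, the page whose title and address both match the query moves to the top.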

The Worst Is Yet to Come

Just when we’re getting tired of spam sites that plagiarize snippets of other people’s Web pages, we’ve also seen Web sites that brazenly co-opt entire pages—text and template. We know of two sites, started in 2006 or earlier, that act like real-time mirror sites for blogspot.com. All links to blogspot.com are converted into links that point to a target domain, including news feed and blogger.com references. For instance, a page with the URL abc.blogspot.com would become accessible as mirrorsite.com/abc, and every blogspot.com link on this mirror page would appear as mirrorsite.com/whatever. It’s a thin line between an official proxy server and information superhighway robbery.

The creators of these sites claim they did it because blogspot.com is blocked in certain countries, but fighting alleged censorship seems to be a pretext (it would be easy for any government to block their sites). At best these dubious mirror sites violate copyrights, confuse search engines and Web surfers, and conveniently divert traffic to themselves. (Some search engines penalize similarly worded pages. We’ve seen one company place mirrorsite.com/abc in its search index but not abc.blogspot.com! You can imagine the impact these sites have on page ranking.) At worst they could sabotage contextual advertising from Google, Yahoo, and others. If companies are worried about click fraud now, get ready for click theft.
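
Because the rewriting is mechanical, an indexer that knows a mirror's host name could fold mirror URLs back onto the originals. The sketch below is only an illustration: `mirrorsite.com` is the placeholder used above, the function name is made up, and the assumption that anything after the first path segment maps to the same path on the original blog goes beyond the single mirrorsite.com/abc example given here.

```python
from urllib.parse import urlsplit


def canonical_blogspot_url(url, mirror_host="mirrorsite.com"):
    """Map mirrorsite.com/<blog>/<path> back to <blog>.blogspot.com/<path>.

    URLs on any other host are returned unchanged.
    """
    # Accept URLs with or without a scheme.
    parts = urlsplit(url if "//" in url else "//" + url)
    if parts.hostname != mirror_host:
        return url
    blog, _, rest = parts.path.lstrip("/").partition("/")
    if not blog:
        return url  # the mirror's own front page has no blog to map to
    canonical = f"{blog}.blogspot.com"
    return f"{canonical}/{rest}" if rest else canonical


print(canonical_blogspot_url("mirrorsite.com/abc"))
print(canonical_blogspot_url("abc.blogspot.com"))
```

Both calls print `abc.blogspot.com`, so the mirror copy and the original would collapse into a single index entry that credits the original page.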

The Internet is just too irresistible to con artists. No, we’re not blaming the Net—that would be like blaming the postal service and the telephone for the proliferation of junk mail and telemarketers. Web publishers must remain vigilant.

Competitive Matrix

We are less interested in meaningless features such as index size than in important ones that are either shrouded in mystery or long overdue.

Search Engine Features
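| Features | Google | Yahoo | Microsoft | Ask |
| --- | --- | --- | --- | --- |
| Page Handling | | | | |
| Page crawl frequency | ? | ? | ? | ? |
| Page analysis/coverage | ? | ? | ? | ? |
| Parsing tables | ? | ? | ? | ? |
| Meta tags used | ? | Description, keyword | ? | ? |
| Index Building | | | | |
| Index cycle/release | ? | ? | ? | ? |
| Scrubbing archived pages | N | N | N | ? |
| Scrubbing mirror sites | N | N | N | ? |
| User Preferences | | | | |
| Spam filter | N | N | N | N |
| User-defined keys | N | N | N | N |
| Location (city) | N | N | Y | Y |
| Version Tracking | | | | |
| Engine version | N | N | N | N |
| Index version | N | N | N | N |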

Page handling: The most important thing about a Web crawler is how often it revisits a typical page. We also need to know how much of a Web page is analyzed each time. Is it 50 percent? Is it the beginning, the middle, and the end? Is it the first 100K bytes? What happens if a page has two photos at the top that take up 100K bytes? Will anything else beyond that get indexed? We’re not convinced search engines can read tables reliably because we’ve never seen a highlighted table cell in a page cache. Some search engines disregard meta tags since they don’t always match the content of a page. We recommend that they be ignored only when they seem totally unrelated to the content.

Index building: How long does it take a search engine to build its database? Or how often does a company release a new index? We would find the answer to that question if search companies provided version tracking (see below). For some unknown reason, search engines continue to include archived pages from Blogger but not the current pages themselves. Whether they are legitimate or questionable, all mirror sites should be ignored by search engines.

User preferences: If search engines are filtering spam sites, they’re not doing a very good job of it. The last thing we want to see is censorship on the Internet, so this filter should be another option under search preferences. Not only should search companies filter spam pages per user request, but they also need to make sure these sites don’t adversely affect the rankings of other Web pages. As we stated earlier, search users should have more control over the ordering of results. Some of the competing factors include matching titles, matching URLs, established versus newly published pages, frequently updated versus static documents, domestic versus international sites, business/commercial versus non-shopping sites, forums/newsgroups versus regular pages, and official versus unofficial sites. Microsoft and Ask have added the user’s city on their preference pages.

Version tracking: Powerful search engines have been around since the mid-1990s. It’s odd that search companies still do not provide a version number for the engine and a version number (or date) for the index. What are they trying to hide?

E-Mail to Nowhere

The sorting of search results is not an exact science; it’s not surprising the companies involved don’t always get it right. What is inexcusable is the lack of interest in customer input—we can’t decide if it’s arrogance, apathy, or incompetence. One company’s modus operandi is to always deny there’s any problem. You should at least be able to send feedback with just one or two clicks.

Contact Log
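| Company | 2003 | 2004 | 2005 | 2006 | 2007 |
| --- | --- | --- | --- | --- | --- |
| Google | Nov | - | - | Aug | Jan** |
| Yahoo | - | - | May | Aug* | Jan* |
| Microsoft | - | Nov/Dec | Jan/Feb | May, Aug | Jan, Jul, Sep |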

* Sent e-mail to Yahoo’s search blog admin since we couldn’t find any contact info in main search area.
** Sent e-mail to Google’s AdSense group.

Resources

Keeping an eye on search engines is only a by-product of our Web publishing activities. Here are some sites that focus on the business of Internet search.

  • Search Engine Land (searchengineland.com)
  • Search Engine Watch (searchenginewatch.com)
  • Search Engine Showdown (searchengineshowdown.com)
  • Battelle Media (battellemedia.com)

Cool Web Surfers Don't Cut and Paste

Would you like to share this Web page with friends? Don't cut and paste. Provide a Web link to this page or refer to its Web address. We invite all content providers to join our "Don't Cut and Paste" campaign.

Copyright © 2002-2014 Calba Media LLC. All rights reserved.
