<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>remylabs &#187; Blogosphere</title>
	<atom:link href="http://remylabs.com/blog/category/blogosphere/feed/" rel="self" type="application/rss+xml" />
	<link>http://remylabs.com/blog</link>
	<description>the remylabs blog</description>
	<lastBuildDate>Wed, 10 Feb 2010 01:49:04 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Umbria &#8211; Market Intelligence from Blogs</title>
		<link>http://remylabs.com/blog/2005/12/fortune-on-umbria/</link>
		<comments>http://remylabs.com/blog/2005/12/fortune-on-umbria/#comments</comments>
		<pubDate>Thu, 08 Dec 2005 16:10:52 +0000</pubDate>
		<dc:creator>martin</dc:creator>
				<category><![CDATA[Blogosphere]]></category>
		<category><![CDATA[Search]]></category>

		<guid isPermaLink="false">http://www.remylabs.com/blog/?p=41</guid>
		<description><![CDATA[FORTUNE has an article (&#8220;Blogging for Dollars&#8221;) that covers Umbria, a company based here in Colorado that tracks what bloggers are saying about its clients (aka mining blogs for market intelligence).
Economically, this market is finally starting to take shape &#8212; the ideas and attempts have been out there for a few years, but consumer companies [...]]]></description>
			<content:encoded><![CDATA[<p><!--nosphereit-->FORTUNE has an <a href="http://www.fortune.com/fortune/smallbusiness/articles/0,15114,1134129,00.html">article (&#8220;Blogging for Dollars&#8221;)</a> that covers <a href="http://www.umbrialistens.com/home">Umbria</a>, a company based here in Colorado that tracks what bloggers are saying about its clients (aka mining blogs for market intelligence).</p>
<p>Economically, this market is finally starting to take shape &#8212; the ideas and attempts have been out there for a few years, but consumer companies have been on the fence about whether the blogosphere is worth listening in on.  Until recently, that is.  Umbria claims they&#8217;ll have $2M revenue this year and will be profitable next year, but the overall market for this kind of service is still only $20M according to the article (<a href="http://www.intelliseek.com/">Intelliseek</a> has about 1/3rd of that market).</p>
<p>Technologically, Umbria also sounds pretty interesting.  They claim to have a competitive edge in automating most of the process:</p>
<blockquote><p>Umbria&#8217;s solution is entirely software-based. [Umbria's] competitors also meet with clients to interpret the data and suggest strategic responses. &#8220;Ultimately we rely on both technology and humans for analysis,&#8221; says Max Kalehoff, marketing director for <a href="http://www.buzzmetrics.com">BuzzMetrics</a> [another Umbria competitor]. &#8220;Umbria takes an extremely automated approach.&#8221;</p></blockquote>
<p>Umbria&#8217;s technology sounds like a pipeline of parsers that generates features that in turn drive product and sentiment classifiers (and those drive reporting):</p>
<blockquote><p>Every few hours Umbria sends an application called a spider out over the web to scour the blogosphere for postings about the firm&#8217;s clients, most of which are big consumer companies, such as Electronic Arts, SAP, and Sprint. By analyzing keywords in blogs, Umbria can classify each citation thematically. In the case of Sprint, for example, Umbria&#8217;s software can tell whether a blogger is talking about customer service, the company&#8217;s advertisements, or a particular calling plan.</p>
<p>Another big challenge is to decipher what&#8217;s on a blogger&#8217;s mind. To figure out whether an opinion is strong or tepid, for example, it helps to know that &#8220;awesome&#8221; is a stronger endorsement than &#8220;pretty cool,&#8221; and that &#8220;shoddy&#8221; is less damning than &#8220;abominable.&#8221; Umbria has several employees with Ph.D.s in linguistics and artificial intelligence who are forever tweaking the software to make it better at categorizing opinions.</p></blockquote>
<p>I can&#8217;t help thinking that more manual tweaking goes into each client&#8217;s setup than this description lets on, but still, I&#8217;m glad they&#8217;re seeing success, and I bet those linguists are having fun with the blogosphere, even if they have to do a bit of slumming to come up with their rules:</p>
<blockquote><p>The software can also estimate the author&#8217;s age and gender. Elongated spellings (&#8220;soooooooo&#8221;), multiple exclamation marks (!!!), and acronyms such as POS (&#8220;parent over shoulder&#8221;) suggest a teenage female member of Generation Y (born after 1979). The blogger is probably a teenage boy if a posting is rife with hip-hop terminology such as &#8220;aight&#8221; (translation: &#8220;all right&#8221;) and &#8220;true dat&#8221; (&#8220;I agree!&#8221;).</p></blockquote>
<p>There you have it: you don&#8217;t even have to know the language to have your voice heard by the people who want to sell you more stuff.  Now that&#8217;s power.  On one side of that function, at least.</p>
]]></content:encoded>
			<wfw:commentRss>http://remylabs.com/blog/2005/12/fortune-on-umbria/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Tagyu</title>
		<link>http://remylabs.com/blog/2005/10/tagyu/</link>
		<comments>http://remylabs.com/blog/2005/10/tagyu/#comments</comments>
		<pubDate>Wed, 12 Oct 2005 18:36:25 +0000</pubDate>
		<dc:creator>martin</dc:creator>
				<category><![CDATA[Blogosphere]]></category>

		<guid isPermaLink="false">http://www.remylabs.com/blog/?p=35</guid>
		<description><![CDATA[Now this is cool.  Tagyu is an &#8220;auto-tagging&#8221; service of sorts, created by Adam Kalsey.  You paste in some text (or submit via their REST API) and it suggests tags, using some kind of a similarity metric between your text and already tagged texts in Tagyu&#8217;s index (gathered from del.icio.us etc.).  
So [...]]]></description>
			<content:encoded><![CDATA[<p>Now <a href="http://tagyu.com">this</a> is cool.  Tagyu is an &#8220;auto-tagging&#8221; service of sorts, created by Adam Kalsey.  You paste in some text (or submit via <a href="http://tagyu.com/tools/rest">their REST API</a>) and it suggests tags, using some kind of a similarity metric between your text and already tagged texts in Tagyu&#8217;s index (gathered from del.icio.us etc.).  </p>
<p>So far, I&#8217;ve tried a few different texts, and about half the time the returned tags are great.  This is impressive, because this is not an easy problem to solve, but 50% precision is not quite enough for prime time.  If someone (sploggers?) unleashes Tagyu to auto-tag a large volume of posts that feed back into the del.icio.us and Tagyu system, that would work against improving the system&#8217;s precision, unless you could assign some kind of a score to the quality of tags (yes, that&#8217;s a chicken/egg thing).</p>
<p>Maybe we need some kind of a large-scale tag-quality feedback system.  Some clever piece of javascript that lets you click &#8220;this tag is right on&#8221; or &#8220;this tag is a cruel joke&#8221; when reading someone&#8217;s blog or feed.  Of course, if you&#8217;re an idiot at tagging, you&#8217;re not going to install that piece of javascript.  An aggregator might be the best place to do that, where attention.xml lives (eventually).</p>
<p>This is the first service of this kind that I&#8217;m aware of, and there are lots of applications of this kind of thing in blog search.   There could be an ad-matching app in there, too.  And, an intermediate step in Tagyu is matching content to other content (and then to tags).  I hope Adam Kalsey keeps up the R&#038;D effort on this.  Tagyu has a super-clean looking site.  Very nice. </p>
<p>btw, for this post&#8217;s text, Tagyu returned the following tags: tagging del.icio.us tools.  Looks good to me.</p>
<p>-----<br />
(Via <a href="http://www.buzzmachine.com/index.php/2005/10/10/the-best-thing-i-saw-at-web-20">BuzzMachine</a>)</p>
]]></content:encoded>
			<wfw:commentRss>http://remylabs.com/blog/2005/10/tagyu/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Yahoo! briefly launches &#8230; Feedsterati?</title>
		<link>http://remylabs.com/blog/2005/07/yahoo-briefly-launches-feedsterati/</link>
		<comments>http://remylabs.com/blog/2005/07/yahoo-briefly-launches-feedsterati/#comments</comments>
		<pubDate>Fri, 08 Jul 2005 22:43:25 +0000</pubDate>
		<dc:creator>martin</dc:creator>
				<category><![CDATA[Blogosphere]]></category>
		<category><![CDATA[Search]]></category>

		<guid isPermaLink="false">http://www.remylabs.com/blog/?p=29</guid>
		<description><![CDATA[Steve Rubel and Niall Kennedy are reporting on a Yahoo RSS search service which was briefly public this morning.  Seems to combine feed search (not just blogs, apparently, but other feed content, too, like Feedster) and several ranking options (date, relevance, and popularity).  I&#8217;m curious about the popularity ranking, but I&#8217;d guess the [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.micropersuasion.com/2005/07/yahoo_unveils_b.html">Steve Rubel</a> and <a href="http://www.niallkennedy.com/blog/archives/2005/07/yahoo_rss_searc.html">Niall Kennedy</a> are reporting on a Yahoo RSS search service which was briefly public this morning.  Seems to combine feed search (not just blogs, apparently, but other feed content, too, like Feedster) and several ranking options (date, relevance, and <i>popularity</i>).  I&#8217;m curious about the popularity ranking, but I&#8217;d guess the initial version will resemble a Technorati-like tally of incoming-links. </p>
<p>Greg Linden <a href="http://glinden.blogspot.com/2005/07/yahoo-and-being-underfoot.html">wonders</a> whether the small blog/feed search engines will survive the entry of the giants into the field:</p>
<blockquote><p>
&#8230; it is good for a startup to see the entry of a big company into its area since it attracts attention and legitimizes the field &#8230; but competing directly against these giants is scary if you have no differentiator.
</p></blockquote>
<p>While the small players have driven innovation and broad acceptance of concepts like link popularity and tagging, they continue to <a href="http://www.micropersuasion.com/2005/07/technorati_and_.html">struggle</a> with scalability.  Also, the most compelling products to come out of the blog search startups, while they&#8217;ve been exciting and even revolutionary from a user&#8217;s point of view, have not been technologically deep in the sense of difficult to duplicate by the search giants.  There have been exceptions, of course, but no really deep technology is in evidence among those services that have made the biggest splashes (technorati, bloglines, flickr, del.icio.us).</p>
<p>So, when a search giant comes in with equal-or-better features, scalability, and a huge engineering team that can relatively quickly merge ideas emerging from the programming part of the blogosphere into the vast search toolkit that the giants already have, that might just cast a bit of a cloud over the little guys.  </p>
<p>Having said that, I believe there will continue to be a place for the little guys in the blog search ecosystem.  They&#8217;re the real innovators and they have their ears to the ground.  And even at the break-neck speed at which Yahoo and Google have been rolling out features lately, an army of little guys can still cover a lot more ground than the two giants in the search for the next cool thing that will make users&#8217; lives (even) better.</p>
]]></content:encoded>
			<wfw:commentRss>http://remylabs.com/blog/2005/07/yahoo-briefly-launches-feedsterati/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Whither Tagging?</title>
		<link>http://remylabs.com/blog/2005/07/whither-tagging/</link>
		<comments>http://remylabs.com/blog/2005/07/whither-tagging/#comments</comments>
		<pubDate>Wed, 06 Jul 2005 04:56:28 +0000</pubDate>
		<dc:creator>martin</dc:creator>
				<category><![CDATA[Blogosphere]]></category>

		<guid isPermaLink="false">http://www.remylabs.com/blog/?p=28</guid>
		<description><![CDATA[Susan Mernit asks some interesting questions about tagging&#8217;s scalability in this post:

&#8230;
1. How well will tagging work as an organizing and information retrieval method when there are millions of tags?&#8211;That&#8217;s where having additional filters, such as identity, trust or cohort group becomes relevant&#8211;becomes needed.
2. How can developers move tagging into a wider market? I describe [...]]]></description>
			<content:encoded><![CDATA[<p>Susan Mernit asks some interesting questions about tagging&#8217;s scalability in <a href="http://susanmernit.blogspot.com/2005/07/tagging-whats-next.html">this post</a>:</p>
<blockquote><p>
&#8230;<br />
1. How well will tagging work as an organizing and information retrieval method when there are millions of tags?&#8211;That&#8217;s where having additional filters, such as identity, trust or cohort group becomes relevant&#8211;becomes needed.</p>
<p>2. How can developers move tagging into a wider market? I describe tagging to non-geek friends and they are interested, but these folks aren&#8217;t blogging, don&#8217;t use tag-friendly photo services and are a world away, still&#8211;how can the tools bring them closer?<br />
&#8230;
</p></blockquote>
<p>I don&#8217;t think that tagging will turn out to be the emperor&#8217;s new clothes (which isn&#8217;t at all what Susan is suggesting, either).  But there&#8217;s a sense here that the honeymoon is over and it&#8217;s time for tagging to get serious about earning its keep for readers and searchers and to make stuff not just more broadcastable in flickr and Technorati, but also to make the good stuff more findable.</p>
]]></content:encoded>
			<wfw:commentRss>http://remylabs.com/blog/2005/07/whither-tagging/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Yahoo: linked blogs != searched blogs</title>
		<link>http://remylabs.com/blog/2005/07/yahoo-linked-blogs-searched-blogs/</link>
		<comments>http://remylabs.com/blog/2005/07/yahoo-linked-blogs-searched-blogs/#comments</comments>
		<pubDate>Wed, 06 Jul 2005 04:43:42 +0000</pubDate>
		<dc:creator>martin</dc:creator>
				<category><![CDATA[Blogosphere]]></category>

		<guid isPermaLink="false">http://www.remylabs.com/blog/?p=27</guid>
		<description><![CDATA[Interesting informal experiment in the Yahoo! Buzz Log:

Do you find blogs via links or through search? We wanted to know, so we lined up the top 20 blogs in the blogosphere according to Technorati &#8230; we did peek at the number of searches each received over the last week.
&#8230;
Turns out the lead blog on Technorati [...]]]></description>
			<content:encoded><![CDATA[<p>Interesting <a href="http://buzz.yahoo.com/buzz_log/entry/2005/07/05/1800/">informal experiment</a> in the Yahoo! Buzz Log:</p>
<blockquote><p>
Do you find blogs via links or through search? We wanted to know, so we lined up the top 20 blogs in the blogosphere according to Technorati &#8230; we did peek at the number of searches each received over the last week.<br />
&#8230;<br />
Turns out the lead blog on Technorati runs in the middle of the search pack. Fark ranked #5 on Technorati, but in terms of searches &#8212; it&#8217;s the top dog, er, blog.
</p></blockquote>
<p>Note that Yahoo! restricted its sample set to the top 20 blogs according to Technorati, so if all blogs were lined up according to search popularity on Yahoo, the top 20 might or might not include any of the Technorati top 20.</p>
<p>So, what&#8217;s this mean? It means that searches map to a different popularity ranking than links do.  They&#8217;re two different measures.  Technorati takes a sort of <i>intra-blogosphere</i> measurement of popularity, while Yahoo! is taking an <i>external</i> measurement of blog popularity, looking into the blogosphere from the web at large (or at least from <a href="http://search.yahoo.com/">search.yahoo.com</a>).  If you&#8217;re a marketer, you probably like Yahoo&#8217;s measurement better.</p>
<p>(via <a href="http://www.micropersuasion.com/2005/07/which_blog_is_t.html">Steve Rubel</a>)</p>
]]></content:encoded>
			<wfw:commentRss>http://remylabs.com/blog/2005/07/yahoo-linked-blogs-searched-blogs/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>See blogs near you on Google Earth with Blogdigger Local</title>
		<link>http://remylabs.com/blog/2005/06/blogdigger-local-on-google-earth/</link>
		<comments>http://remylabs.com/blog/2005/06/blogdigger-local-on-google-earth/#comments</comments>
		<pubDate>Wed, 29 Jun 2005 17:56:02 +0000</pubDate>
		<dc:creator>martin</dc:creator>
				<category><![CDATA[Blogosphere]]></category>
		<category><![CDATA[Cool Tech]]></category>
		<category><![CDATA[Search]]></category>

		<guid isPermaLink="false">http://www.remylabs.com/blog/?p=26</guid>
		<description><![CDATA[Greg Gershman has built a cool application of Google Earth.  You can jump from Blogdigger Local search results to Google Earth and see markers for all of the blogs in your geo neighborhood.  The result looks something like this:

(That&#8217;s Greg&#8217;s image.  Don&#8217;t have Google Earth running here, waiting for the OS X [...]]]></description>
			<content:encoded><![CDATA[<p>Greg Gershman has <a href="http://www.blogdigger.com/blog/2005/06/29/1120058467000.html">built a cool application</a> of Google Earth.  You can jump from Blogdigger Local search results to Google Earth and see markers for all of the blogs in your geo neighborhood.  The result looks something like this:</p>
<p><img src='/blog/uploads/NY_BDLocal_GEarth_sm.JPG' alt='' /></p>
<p>(That&#8217;s Greg&#8217;s image.  Don&#8217;t have Google Earth running here, waiting for the OS X version.  Impatiently.)  Blogdigger seems to have found its niche with Blogdigger Local, and it&#8217;s a good one.</p>
]]></content:encoded>
			<wfw:commentRss>http://remylabs.com/blog/2005/06/blogdigger-local-on-google-earth/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The newspaper of the future</title>
		<link>http://remylabs.com/blog/2005/06/the-newspaper-of-the-future/</link>
		<comments>http://remylabs.com/blog/2005/06/the-newspaper-of-the-future/#comments</comments>
		<pubDate>Mon, 27 Jun 2005 20:24:37 +0000</pubDate>
		<dc:creator>martin</dc:creator>
				<category><![CDATA[Blogosphere]]></category>

		<guid isPermaLink="false">http://www.remylabs.com/blog/?p=25</guid>
		<description><![CDATA[Interesting article in Sunday&#8217;s NYT about a Lawrence, Kansas paper including user-generated content in a big way:

&#8220;I don&#8217;t think of us as being in the newspaper business,&#8221; said Mr. Simons, the editor and publisher of The Journal-World and the chairman of the World Company, the newspaper&#8217;s parent. &#8220;Information is our business and we&#8217;re trying to [...]]]></description>
			<content:encoded><![CDATA[<p>Interesting <a href="http://www.nytimes.com/2005/06/26/business/yourmoney/26kansas.html">article in Sunday&#8217;s NYT</a> about a Lawrence, Kansas paper including user-generated content in a big way:</p>
<blockquote><p>
&#8220;I don&#8217;t think of us as being in the newspaper business,&#8221; said Mr. Simons, the editor and publisher of The Journal-World and the chairman of the World Company, the newspaper&#8217;s parent. &#8220;Information is our business and we&#8217;re trying to provide information, in one form or another, however the consumer wants it and wherever the consumer wants it, in the most complete and useful way possible.&#8221;<br />
&#8230;<br />
&#8220;We believe that journalism has been a monologue for so long and now is the perfect time for it to become a dialogue with our readers,&#8221; said Rob Curley, 34, the World Company&#8217;s director of new media. &#8220;We want readers to think of this as their paper, not our paper.&#8221;
</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://remylabs.com/blog/2005/06/the-newspaper-of-the-future/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>SearchEngineWatch joins the link counting fray</title>
		<link>http://remylabs.com/blog/2005/06/searchenginewatch-joins-the-link-counting-fray/</link>
		<comments>http://remylabs.com/blog/2005/06/searchenginewatch-joins-the-link-counting-fray/#comments</comments>
		<pubDate>Fri, 24 Jun 2005 00:13:47 +0000</pubDate>
		<dc:creator>martin</dc:creator>
				<category><![CDATA[Blogosphere]]></category>
		<category><![CDATA[Search]]></category>

		<guid isPermaLink="false">http://www.remylabs.com/blog/?p=24</guid>
		<description><![CDATA[Danny Sullivan is skeptical about the accuracy of Google&#8217;s and Yahoo&#8217;s results counts, used by Tristan Louis in two studies, which concluded that Yahoo has better coverage of blogs than Google, which in turn has better coverage than Technorati.  Danny posted an email conversation with Tristan about his study.  It&#8217;s a little hard [...]]]></description>
			<content:encoded><![CDATA[<p>Danny Sullivan is skeptical about the accuracy of Google&#8217;s and Yahoo&#8217;s results counts, used by Tristan Louis in <a href="http://www.tnl.net/blog/entry/Secrets_of_the_A-list_bloggers:_Technorati_vs._Google">two</a> <a href="http://www.tnl.net/blog/entry/Technorati_Yahoo_and_Google_Too">studies</a>, which concluded that Yahoo has better coverage of blogs than Google, which in turn has better coverage than Technorati.  Danny posted an <a href="http://blog.searchenginewatch.com/blog/050622-110917">email conversation</a> with Tristan about his study.  It&#8217;s a little hard to follow the lines of argument, but it&#8217;s well worth reading because it illuminates the difficulties in getting a handle on index size, and especially blog coverage, by the search giants.</p>
<p>Danny, from his exchange with Tristan:</p>
<blockquote><p>
Also, Google did say &#8220;of about&#8221; with the numbers it reports. That&#8217;s not an accident. They&#8217;re saying that this is an estimate. But no disagreement with me. If you put up a count, it would be nice if the count was as accurate as possible. Google&#8217;s have come under question.
</p></blockquote>
<p>Hmm.  From what I&#8217;ve seen in Tristan&#8217;s data and <a href="http://www.remylabs.com/blog/?p=23">my own testing</a>, it&#8217;s Yahoo&#8217;s counts that ought to come under question, specifically for <i>link:</i> queries.  </p>
<p>Danny to Tristan again:</p>
<blockquote><p>
The link: command is completely different than the site: command. The link command tells you nothing about the size of the index. As for a confirmation that all links aren&#8217;t reported, <a href="http://blog.searchenginewatch.com/blog/041119-071502">this past blog post from SEW</a> gives you confirmation and <a href="http://www.google.com/webmasters/4.html">this page</a> on Google mentions links are only a sampling of what Google knows although <a href="http://www.google.com/help/features.html#link">this other</a> Google page fails to make this clear.
</p></blockquote>
<p><i>link:</i> and <i>site:</i> are very different, that&#8217;s true enough.  And maybe the link command doesn&#8217;t tell you much about the size of an index, but if link collection methods are similar between Yahoo and Google (and why wouldn&#8217;t they be, it&#8217;s a relatively easy part of the whole game), then the counts ought to be similar.  But they&#8217;re not, <a href="http://www.remylabs.com/blog/?p=23">not by a long shot</a>.</p>
<p>By the way, a big thanks to Tristan for posting his studies and kicking off this discussion.  Most of us don&#8217;t take the time to do analysis of that depth to support our opinions, and to post the entire method and dataset so others can reproduce it, shoot holes in it, go off on tangents from it.</p>
<p>(I stumbled onto Danny&#8217;s post via <a href="http://battellemedia.com/archives/001649.php">John Battelle</a>)</p>
]]></content:encoded>
			<wfw:commentRss>http://remylabs.com/blog/2005/06/searchenginewatch-joins-the-link-counting-fray/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>What&#8217;s up with Yahoo&#8217;s link count estimates?</title>
		<link>http://remylabs.com/blog/2005/06/whats-up-with-yahoos-link-count-estimates/</link>
		<comments>http://remylabs.com/blog/2005/06/whats-up-with-yahoos-link-count-estimates/#comments</comments>
		<pubDate>Wed, 22 Jun 2005 22:10:07 +0000</pubDate>
		<dc:creator>martin</dc:creator>
				<category><![CDATA[Blogosphere]]></category>
		<category><![CDATA[Search]]></category>

		<guid isPermaLink="false">http://www.remylabs.com/blog/?p=23</guid>
		<description><![CDATA[Dave Sifry is chiming in on some analysis done by Tristan Louis about how well Google, Yahoo and Technorati are covering the blogosphere.  Briefly, here&#8217;s what Tristan did:  He ran link: queries on Google, Yahoo and Technorati for the blogs in the Technorati Top 100 and recorded the number of results reported by [...]]]></description>
			<content:encoded><![CDATA[<p>Dave Sifry is <a href="http://www.sifry.com/alerts/archives/000320.html">chiming in</a> on some <a href="http://www.tnl.net/blog/entry/Technorati_Yahoo_and_Google_Too">analysis</a> done by Tristan Louis about how well Google, Yahoo and Technorati are covering the blogosphere.  Briefly, here&#8217;s what Tristan did:  He ran <i>link:</i> queries on Google, Yahoo and Technorati for the blogs in the <a href="http://www.technorati.com/pop/blogs/">Technorati Top 100</a> and recorded the number of results reported by each search engine.  For example, taking BoingBoing, the 1st blog on that list:<br />
<span id="more-23"></span></p>
<ul>
<li>For the query <a href="http://www.google.com/search?q=link%3Aboingboing.net&#038;sourceid=mozilla-search&#038;start=0&#038;start=0&#038;ie=utf-8&#038;oe=utf-8&#038;client=firefox-a&#038;rls=org.mozilla:en-US:official">link:boingboing.net</a>, Google reports &#8220;about <b>40,700</b> results&#8221;, i.e. pages in its index that link to BoingBoing.net</li>
<li>Technorati <a href="http://www.technorati.com/search/boingboing.net">has indexed</a> <b>23,358</b> links to BoingBoing.</li>
<li>Yahoo&#8217;s <a href="http://search.yahoo.com/search?p=link%3Ahttp%3A%2F%2Fboingboing.net&#038;prssweb=Search&#038;ei=UTF-8&#038;fr=ush-help&#038;fl=0&#038;x=wrt">results</a> claim that it has indexed about <b>1,320,000</b> pages linking to BoingBoing.</li>
</ul>
<p>Interestingly, Technorati is the only one of the three that gives the same count whether the link query includes <i>www</i> before the domain name or not (I happen to think that&#8217;s the correct behavior).</p>
<p>So much for Tristan&#8217;s method, which is transparent and easily reproducible.  The picture that emerges is that Technorati&#8217;s coverage of the blogosphere is worse than Google&#8217;s, which in turn is [much] worse than Yahoo&#8217;s.  By the way, <a href="http://www.tnl.net/blog/entry/Technorati_Yahoo_and_Google_Too">Tristan&#8217;s post</a> has more depth than is relevant to this post, and it has some interesting statistics that pull apart this data.  Read it.  </p>
<p>Anyway, Dave smelled something fishy in Tristan&#8217;s data (he&#8217;s onto the right question, but he goes after a red herring and misses a different, interesting feature in the data):</p>
<blockquote><p>
&#8230;  I believe that Tristan&#8217;s analysis begs a question that hasn&#8217;t been asked yet: How accurate are the numbers that search engines report about the size of their result sets? &#8230; For example, when you search for all the results for &#8220;<a href="http://www.google.com/search?hl=en&amp;q=Tristan+Louis">Tristan Louis</a>&#8221; on Google, it reports &#8220;about 575,000&#8221;.
</p></blockquote>
<p>Whoa, hold it.  That&#8217;s a keyword query, which means Dave&#8217;s now running a different experiment from Tristan&#8217;s, which uses <i>link</i> queries.  I recommend you read Dave&#8217;s <a href="http://www.sifry.com/alerts/archives/000320.html">entire post</a>, but from this point forward, he&#8217;s on a different track, using keyword queries instead of link queries throughout.  </p>
<p>Dave&#8217;s objection is to the limit on &#8220;viewable results&#8221; that Yahoo and Google implement (Technorati doesn&#8217;t).  Both Yahoo and Google serve only about the first 1000 results of a result set.  Crunching through 1-N results not only gets more expensive for higher N, but the value for the user falls off pretty rapidly after a while.  And as a bonus, this limit keeps pranksters with robots from chewing up bandwidth by paging through millions of results, wreaking havoc on caching.  Not to mention that nobody <b>wants</b> to wade through more than a few pages of results anyway, instead of just rephrasing the query to get better results.  [Someone should do something about the excessive recall of these keyword search engines...]  Anyway, the 1000-result limit is an interesting discovery, but it&#8217;s an obvious optimization.  The results count given at the top of the page (results 1-10 of <b>40,700</b>) is of course an estimate, again as an optimization (getting exact counts from massively distributed indexes isn&#8217;t free, and who needs an exact count at this level of recall, anyway?).</p>
<p>So, the limit on viewable results is very straightforwardly explained as an optimization, benign to the user experience.  I don&#8217;t think there&#8217;s anything to get worked up about in only being able to see the first 1000 results for a query.  That&#8217;s what query refinement is for.  The interesting thing is the <i>estimated total number of results</i>, specifically of <i>link</i> queries.</p>
<p>When I went through Tristan&#8217;s original experiment and ran some link queries, it became pretty obvious (as if it wasn&#8217;t obvious in Tristan&#8217;s post) that there&#8217;s something weird about Yahoo&#8217;s method of estimating the total number of results.  The practice of estimating the total number of results (as opposed to computing it precisely) is a necessary optimization in a search engine that wants to scale to Google or Yahoo scale, and for most keyword queries that I tried, the estimated counts from both engines were plausible (and within an order of magnitude of each other).  But for <i>link</i> queries, that&#8217;s not the case.  Let&#8217;s look again at the estimated total results counts for pages linking to BoingBoing:</p>
<ul>
<li><a href="http://www.google.com/search?hs=P3o&#038;hl=en&#038;lr=&#038;c2coff=1&#038;client=firefox-a&#038;rls=org.mozilla%3Aen-US%3Aofficial&#038;biw=1152&#038;q=link%3Aboingboing.net&#038;btnG=Search">Google</a>, about 40,700</li>
<li><a href="http://www.technorati.com/search/boingboing.net">Technorati</a>, 23,358</li>
<li><a href="http://search.yahoo.com/search?p=link%3Ahttp%3A%2F%2Fboingboing.net&#038;prssweb=Search&#038;ei=UTF-8&#038;fr=ush-help&#038;fl=0&#038;x=wrt">Yahoo!</a>, about <b>1,320,000</b></li>
</ul>
<p>Now, the fact that Technorati found only half as many links to BoingBoing as Google isn&#8217;t a big deal and shouldn&#8217;t give Technorati an inferiority complex.  A sizeable chunk of the links may be from sites that Technorati isn&#8217;t indexing because those sites aren&#8217;t blogs or don&#8217;t use ping services that Technorati is monitoring.  Also, Technorati&#8217;s index isn&#8217;t as old as Google&#8217;s and other factors like multiple links per page (to the same blog) make the comparison even more difficult.  In any case, the difference between Google and Technorati is relatively small (if the Technorati team spends some time on the back end now that the new UI is up, they&#8217;ll narrow that gap).  What&#8217;s interesting, however, is Yahoo&#8217;s estimate for the number of results for this particular query.  At 1.3 million, it&#8217;s about 30x larger than Google&#8217;s count and 60x larger than Technorati&#8217;s.  That seems implausible to me, and it looks like some wacky calculations are happening in Yahoo&#8217;s estimation of results count for this query.  For several blogs I tried, Google&#8217;s results count is plausible and roughly 2-6x Technorati&#8217;s, whereas Yahoo&#8217;s is, well, <i>out there</i>.  Here are small, medium and medium-large examples (we covered extra-large above, with BoingBoing):</p>
<p><a href="http://annikrubens.de/">annikrubens.de</a>:</p>
<ul>
<li><a href="http://www.google.com/search?hl=en&#038;lr=&#038;c2coff=1&#038;biw=1152&#038;q=link%3Awww.annikrubens.de&#038;btnG=Search">Google</a>, about 52</li>
<li><a href="http://www.technorati.com/search/annikrubens.de">Technorati</a>, 29</li>
<li><a href="http://search.yahoo.com/search?p=link%3Ahttp%3A%2F%2Fwww.annikrubens.de&#038;prssweb=Search&#038;ei=UTF-8&#038;fr=ush-help&#038;fl=0&#038;x=wrt">Yahoo!</a>, about <b>318</b></li>
</ul>
<p><a href="http://blog.jackvinson.com">blog.jackvinson.com</a>:</p>
<ul>
<li><a href="http://www.google.com/search?hl=en&#038;lr=&#038;c2coff=1&#038;biw=1152&#038;q=link%3Ajackvinson.com&#038;btnG=Search">Google</a>, about 228</li>
<li><a href="http://www.technorati.com/search/blog.jackvinson.com">Technorati</a>, 47</li>
<li><a href="http://search.yahoo.com/search?p=link%3Ahttp%3A%2F%2Fblog.jackvinson.com&#038;prssweb=Search&#038;ei=UTF-8&#038;fr=ush-help&#038;fl=0&#038;x=wrt">Yahoo!</a>, about <b>2,760</b></li>
</ul>
<p><a href="http://battellemedia.com/">battellemedia.com</a>:</p>
<ul>
<li><a href="http://www.google.com/search?hl=en&#038;lr=&#038;c2coff=1&#038;biw=1152&#038;q=link%3Abattellemedia.com&#038;btnG=Search">Google</a>, about 10,300</li>
<li><a href="http://www.technorati.com/search/battellemedia.com">Technorati</a>, 1,723</li>
<li><a href="http://search.yahoo.com/search?p=link%3Ahttp%3A%2F%2Fbattellemedia.com&#038;prssweb=Search&#038;ei=UTF-8&#038;fr=ush-help&#038;fl=0&#038;x=wrt">Yahoo!</a>, about <b>647,000</b></li>
</ul>
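<p>For anyone who wants to check my arithmetic, here&#8217;s a quick Python sketch of the ratios, using only the counts quoted above.  The names (<code>counts</code>, <code>ratios</code>) are mine and purely illustrative; none of this talks to an actual search engine.</p>

```python
# Estimated link counts quoted above, as (Google, Technorati, Yahoo).
counts = {
    "boingboing.net": (40700, 23358, 1320000),
    "annikrubens.de": (52, 29, 318),
    "blog.jackvinson.com": (228, 47, 2760),
    "battellemedia.com": (10300, 1723, 647000),
}


def ratios(google, technorati, yahoo):
    """Return (Google / Technorati, Yahoo / Google), rounded to one decimal."""
    return round(google / technorati, 1), round(yahoo / google, 1)


for site, (google, technorati, yahoo) in counts.items():
    g_over_t, y_over_g = ratios(google, technorati, yahoo)
    print(f"{site}: Google/Technorati = {g_over_t}x, Yahoo/Google = {y_over_g}x")
```

<p>On these numbers, Yahoo&#8217;s estimate works out to roughly 32x, 6x, 12x and 63x Google&#8217;s, so the gap isn&#8217;t even a consistent multiple.</p>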
<p>I told you, it&#8217;s wacky.  Tristan&#8217;s conclusion is that Yahoo! is more focused on indexing the blogosphere and has more data.  That may be true.  But these counts are so far out there that I can&#8217;t help but think there&#8217;s a problem with the way they&#8217;re calculated.  So there.  Fix it.  I may check <img src='http://remylabs.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<p>And, to end on a nitpicky note &#8230; as I mentioned somewhere above, if you add <i>www</i> to the <i>link:</i> query or drop it, both Google&#8217;s and Yahoo&#8217;s total counts jump around like crazy.  I don&#8217;t know about you, but in this context of searching WWW content, I think <i>www</i> should be treated as a special hostname equivalent to the bare domain, i.e. <i>www.domain.com</i> == <i>domain.com</i>.  </p>
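<p>Here&#8217;s a minimal Python sketch of the equivalence I have in mind.  The helpers <code>canonical_host</code> and <code>same_site</code> are hypothetical names of my own; as far as I know, neither engine works this way.</p>

```python
def canonical_host(host):
    """Treat a leading 'www' label as equivalent to the bare domain."""
    labels = host.strip().lower().rstrip(".").split(".")
    # Drop a leading "www" label, but only when a full domain remains after it
    # (so "www.com" and a bare "www" are left alone).
    if labels[0] == "www" and len(labels) not in (1, 2):
        labels = labels[1:]
    return ".".join(labels)


def same_site(a, b):
    """True when two hostnames differ at most by a leading 'www' label."""
    return canonical_host(a) == canonical_host(b)


print(same_site("www.annikrubens.de", "annikrubens.de"))
print(same_site("blog.jackvinson.com", "jackvinson.com"))
```

<p>Other subdomains are deliberately left alone: <i>blog.jackvinson.com</i> and <i>jackvinson.com</i> could be entirely different sites, so only <i>www</i> gets the special treatment.</p>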
]]></content:encoded>
			<wfw:commentRss>http://remylabs.com/blog/2005/06/whats-up-with-yahoos-link-count-estimates/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Fallows on getting answers</title>
		<link>http://remylabs.com/blog/2005/06/fallows-on-getting-answers/</link>
		<comments>http://remylabs.com/blog/2005/06/fallows-on-getting-answers/#comments</comments>
		<pubDate>Mon, 13 Jun 2005 02:12:49 +0000</pubDate>
		<dc:creator>martin</dc:creator>
				<category><![CDATA[Blogosphere]]></category>
		<category><![CDATA[Search]]></category>

		<guid isPermaLink="false">http://www.remylabs.com/blog/?p=22</guid>
		<description><![CDATA[Great column on the state of search by James Fallows in today&#8217;s New York Times (online version here), entitled &#8220;Enough Keyword Searches, Just Answer My Question&#8221;.  Fallows doesn&#8217;t mince words.

His article starts:

Search engines are so powerful.  And they are so pathetically weak.

He goes on to lament how ill-suited today&#8217;s KWS engines are at [...]]]></description>
			<content:encoded><![CDATA[<p>Great column on the state of search by James Fallows in today&#8217;s New York Times (online version <a href="http://www.nytimes.com/2005/06/12/business/yourmoney/12techno.html">here</a>), entitled &#8220;Enough Keyword Searches, Just Answer My Question&#8221;.  Fallows doesn&#8217;t mince words.<br />
<span id="more-22"></span><br />
His article starts:</p>
<blockquote><p>
Search engines are so powerful.  And they are so pathetically weak.
</p></blockquote>
<p>He goes on to lament how ill-suited today&#8217;s keyword search (KWS) engines are to answering questions.  His use case is trying to find figures for California&#8217;s school spending in their historical context and/or relative to other states&#8217; school spending, and he gets no satisfaction from &#8220;normal search tools&#8221;.</p>
<blockquote><p>
&#8230; When I finally called an education expert on Monday, she gave me the answer off the top of her head &#8230; after I&#8217;d wasted what seemed like hours over the weekend with normal search tools &#8230;
</p></blockquote>
<p>Fallows casts the problem in terms of automated question answering and cites several projects working in that area, one of them a federal intelligence project named AQUAINT, and the others web-wide efforts ranging from shallow question-answering technologies like Ask.com (now augmented with search refinement tools) to meta-search clustering like Clusty.com.  </p>
<p>But there is another way to cast the problem, an alternative metaphor to the web as giant library or file cabinet. The web&#8217;s billions of pages of content are uploaded into the giant file cabinet by <i>people</i>, including an astounding number of experts in a vast array of subjects.  Even if the web as seen through the scratched lenses of a search index doesn&#8217;t turn up the answer, or even if the answer isn&#8217;t out there at all, there&#8217;s an expert out there who knows the answer, as Mr. Fallows realized.</p>
<p>Bloggers make up a rapidly growing population of experts who contribute content to the giant library.  There are now somewhere between 9 million and 12 million blogs, dealing with a mind-boggling diversity of subjects, and with real depth in many of those subjects.  We (our industry and our government) should push full-speed ahead in research on automatic question answering (our security may depend on it, as Mr. Fallows points out), but we should remember that the experts are already there.  We just don&#8217;t know how to find them yet, unless we have &#8220;education experts&#8221; in our address books.</p>
<p>Blog search is heading in the direction of web search, using the file cabinet metaphor and adding on top of that the <i>stream of web pages</i> metaphor, which is how RSS is most often treated and indexed.  The unique nature of blogs, e.g. the 1-to-1 relationship of author to content and the historical record of expertise amassed by each blogger over the course of many posts, should be mined to provide ways to find answerers, not just answers.</p>
]]></content:encoded>
			<wfw:commentRss>http://remylabs.com/blog/2005/06/fallows-on-getting-answers/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
