Over at the edgeio blog I have posted the first insight into where we are going with edgeio search. It has been about 9 months since we launched edgeio.
We now have a dedicated search team and this is their first push. It is not yet perfect, but it is a vast improvement on what was there before (and significantly better than Googlebase search, which is a primary comparison for us).
As the post says, we have decided to go with the flow to some extent. Many listings-based sites are uploading their listings to edgeio and we are providing search traffic back to them. We are being used as a listings search service by companies with listings and by users looking for listings. A “search engine for stuff”, if you will.
Based on our experience there seems to be demand for a search engine that indexes actual items/services/offers/wants/needs. edgeio wants to become that. Try a Google search for “Sony Vaio” and compare it to an edgeio search. We show “stuff” (Sony Vaios, actually) while they show sites about stuff, but no “stuff”. That’s the opening we see. Clearly Googlebase is focused there also, but the complexities of owning google.com and its algorithm clash with Googlebase’s need to have its data seen. edgeio actually does better on Google than Googlebase (see examples below).
Let me know what you think. Our first search algorithm is live on edgeio.com now. We have a lot more to do (we know) but it’s a good first step.
Oh, and as promised, here are some examples of edgeio’s Google performance. Basically, a secondary effect of the way edgeio is being used is that we have improved rank on google.com for searches for which we have lots of listings. The effect of this is that our listing partners get more traffic. As our listings grow from thousands of publishers (currently about 6,000), that trend should continue.
A spat has blown up over the weekend regarding Oodle and Vast.com “scraping” content from 3rd party sites and re-purposing it inside their environments. This essay is my reaction to the spat. As a founder of edgeio I clearly have an interest in the answer to the question. edgeio does not scrape or crawl. All of its content is permission-based (published using the “listing” tag, uploaded directly into edgeio, or published on edgeio directly to a personal listings blog that we host).
However, there is more at stake here than competitive issues between edgeio on the one hand and Vast/Oodle on the other. The wider issue is whether or not scraping (which is very like crawling and indexing, except that it reads displayed content, not files) constitutes stealing of data.
“This is called stealing content… there’s no advantage to me to have them steal,” commented Laurel Touby, founder and CEO of media industry site mediabistro.com, upon learning that Vast.com had linked from its search results to full mediabistro.com job listings pages, even though those pages require registration when accessed on the mediabistro.com site.
Vast.com CEO Naval Ravikant said Vast.com’s crawlers do not automatically register or login to sites, so they must have found passage through the mediabistro.com system via a legitimate entryway.
So let’s try and address this broader issue. Firstly, this is a new discussion. Nobody accuses Google of stealing the data that is in its index (except book publishers, of course). Why not? Well, because Google primarily indexes the “visible” web – that is to say, sites that are linked to from other sites and are not behind a password protection system of any kind – and even then it respects directives in a file called robots.txt, where a publisher can ask not to be indexed. And secondly, Google does not display entire documents (although its cache is getting very close to doing so and may give rise to similar discussions in future). Rather, it points to the original source for reading/viewing the content. Thus the business model of the original publisher is left intact.
With the emergence of vertical search aggregators, especially in the commerce space, the issues of ownership and permission become far more pronounced. Why? Because the data represents an inventory, and often an “invisible” web inventory – that is to say, one behind a password-protected site. The effort to aggregate that inventory into a central marketplace is made without the permission of the owner of the inventory. Password-protected or not, this is going to give rise to disputes like the one between Craigslist and Oodle a little while ago.
There is no need to invent new means of dealing with this. But there is a need for good behavior. Crawlers should always respect robots.txt. Scrapers are different. Their spiders read displayed content directly and do not crawl the file system. As such they can bypass robots.txt. If scrapers respected robots.txt then a publisher could effectively put its content out of the reach of the crawlers. It isn’t clear at this point whether the scrapers do respect robots.txt files. A better solution is to use RSS for syndication rather than crawling and scraping. More on this below.
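As a sketch of what “respecting robots.txt” means in practice, here is how a well-behaved crawler can check a publisher’s rules before fetching a page. This uses Python’s standard-library robots.txt parser; the site name and the listings path are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# In a real crawler you would fetch https://example.com/robots.txt;
# here we parse a sample file inline to keep the sketch self-contained.
rp = RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /listings/
""".splitlines())

# A polite crawler checks before fetching each URL.
print(rp.can_fetch("MyCrawler", "https://example.com/listings/123"))  # False
print(rp.can_fetch("MyCrawler", "https://example.com/about"))         # True
```

A scraper that reads the displayed pages never performs this check, which is exactly the problem described above.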
The second issue is whether the item level link from a result set points to the original source or to a hosted copy of the original. Oodle and Googlebase had a difference of opinion about this issue. Content publishers will care what the answer is.
The third issue with scraping is a quality issue. On its home page Vast.com states:
All results are automatically extracted by crawling the web
Vast.com cannot guarantee the accuracy or availability of the results
…only includes listings that are fresh and relevant: we keep track of all the listings we’ve seen and auto-expire old ones that are still online and exclude things that look like listings but aren’t (reviews, spam, etc.).
The issue here is twofold. To stay current with a live inventory of listings is hard. To even attempt to do so creates a need to crawl and index very aggressively, and the results are often not good. Craigslist’s gripe with Oodle was at least in part driven by its experience with Oodle’s crawlers, which were apparently polling and sucking content very aggressively – and needed to in order to stay current. If you do not poll aggressively your index gets even more out of sync with the original source than it already is.
It seems to me that RSS is a custom made solution to these problems. Scraping and Crawling are the wrong tools.
If publishers who wish their content to be syndicated to a third party publish an RSS feed, and the third party consumes the feed, we have a) a permission-based syndication system and b) a real-time ability to update inventory. edgeio made the decision to follow the Craigslist model, whereby a listing is explicitly requested by a publisher. Publishers of listings, from your Mom to a large site, are a community made up of many smaller communities. A central listings service (CLS) should be a service to that community. Permission-based, real-time publishing via RSS is the right tool for the job. Over time this is a highly scalable solution. Publishers can opt in and out at will.
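To make this concrete: consuming a listings feed is trivial compared with scraping. Here is a minimal sketch using Python’s standard library and a made-up RSS 2.0 feed (the channel, items and URLs are illustrative, not any publisher’s actual feed):

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 feed of listings, standing in for a publisher's real feed.
feed = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Listings</title>
    <link>http://example.com/listings</link>
    <item>
      <title>Sony Vaio laptop</title>
      <link>http://example.com/listings/1</link>
      <pubDate>Mon, 06 Mar 2006 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Apartment for rent</title>
      <link>http://example.com/listings/2</link>
      <pubDate>Tue, 07 Mar 2006 09:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""

# The consumer simply walks the items: no crawling, no scraping,
# and each poll of the feed picks up additions and removals.
root = ET.fromstring(feed)
listings = [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]
for title, link in listings:
    print(title, "->", link)
```

Because the publisher controls the feed, opting out is as simple as removing an item (or the feed itself) – the permission model is built in.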
I predict many more accusations of stealing as the industry continues to mine the “invisible” web, and the specialist web, via scraping and crawling.
And finally, edgeio publishes RSS feeds of every item (either individual items, or our entire inventory). Oodle and Vast are not competitors, but distribution partners. Our data is more valuable the more people see it. That will happen if the data is placed in more environments. So, take it, for free. But please, do not scrape it or crawl it. Just read the RSS feeds. That is why we have them.
Certainly Vast is not alone in convincing classified sites that they’re helping them bring new visitors, but if the classified search engines are to see a bright future, they’ll need to secure strong partnerships with their partner sites.
Russell Beattie at Yahoo has a lengthy post about RealNames. It’s a generous and thoughtful piece. Thanks for the link Russell.
There are a couple of things worth knowing.
Firstly, RealNames didn’t really crash in the bubble. At least not directly. We were profitable and growing fast (about 120% a quarter back in Q1 2002).
Secondly, we had an awesome business model. Resellers all over the world were selling Keywords. Most uptake was in China, Korea and Japan, where we were the only way to make local languages usable as navigational addresses. We had pretty strict controls on ownership but we were able to segment nations into separate namespaces. Today we would do local keywords too.
Thirdly, we were doing 1 billion resolutions a quarter in Q1 2002. That was page views that MSN lost to us because we were able to provide direct navigation to a web page from a keyword. Microsoft decided to close us down in order to regain those page views. Search this blog for the story.
There is a patent. You (Yahoo) own it through your acquisition of 3721.
Google had its first earnings call Thursday. If you are interested in either search or advertising online it is really worth an hour of your time. Lots of questions out there. Is Google overheated? Are they doomed to fail? Are they the eBay of data? Listen and decide for yourself.
Then it hit me. This is perfect for a podcast. So… here is my first “Earnings Cast” – with no editorial from me.
Charlene Li of Forrester has a new blog. I met Charlene briefly at Web 2.0. She has comments on the search panel – specifically noting the broad agreement among the panelists on what the future direction of search is from a features and functionality point of view. Then she hits us with a killer comment:
“I don’t think the war will be won on features and functionality. Instead, the victor will be the company that is more comfortable with opening up their index and data from personalized profiles. By turning these assets into platforms for other companies to build on, they will solidify their role as central to people and also, to businesses.”
Brilliantly put. The very concept of a search engine is changing from a destination into a platform. Those who understand this will prosper. From this perspective Exalead has a really interesting product. Check out the rich API and scripting language that sits atop the search capabilities at http://www.exalead.com
When keywords became available in Microsoft’s Internet Explorer two years ago, they offered companies a simpler way for their customers to find information about their products and services on the Web.
Now the recent closure of RealNames, the Internet’s premier provider of keywords, leaves the technology’s future in doubt even as the Internet Engineering Task Force (IETF) is finalizing a standard for resolving keywords.
Major corporations such as IBM, eBay, Ford, Bank of America and Xerox use RealNames’ keyword service, which will remain operational until June 28, to help customers find product information on their Web sites. For example, a user could type the keyword “ThinkPad” in the browser command line and go directly to the appropriate page of IBM’s Web site without needing to know the corresponding URL.
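Conceptually, a keyword service is just a resolution layer that maps a human-friendly name to a URL, with a fallback (typically search) for unregistered names. A toy sketch of that lookup – all registry entries and URLs here are invented for illustration, not RealNames’ actual data:

```python
# A toy keyword registry: maps human-friendly names to destination URLs.
# Entries and URLs are illustrative only.
REGISTRY = {
    "thinkpad": "http://www.ibm.com/thinkpad",
    "ebay": "http://www.ebay.com/",
}

def resolve(keyword, fallback="http://search.example.com/?q="):
    """Return the registered URL for a keyword, or fall back to search."""
    key = keyword.strip().lower()
    return REGISTRY.get(key, fallback + key)

print(resolve("ThinkPad"))     # registered: resolves directly to the site
print(resolve("flying cars"))  # unregistered: falls back to a search URL
```

The real service sat between the browser’s address bar and this lookup, which is why losing browser distribution was fatal.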
However, the technology’s greatest promise was in resolving queries for non-English language domain names, especially in such complex languages as Arabic, Japanese, Chinese and Korean. Among the Asian companies that used RealNames’ keyword service to ease user navigation were Sony, Nomura Security and Haier Group.
“The closure of RealNames, the company, is fundamentally a measure of business-level process, not a statement about keyword technologies,” says Leslie Daigle, chair of the Internet Architecture Board and leader of the IETF’s keyword resolution effort. “It has more to do with decisions made in various boardrooms than whether people want, or will supply, keyword services.”
RealNames ceased operations May 13 after Microsoft chose not to renew a contract to distribute the RealNames keyword service with its browser. Microsoft made this decision even though it owns 20% of RealNames, which is in the process of liquidating assets.
Also losing out in this surprise turn of events is VeriSign, which owns 10% of RealNames and was a reseller of its keywords. VeriSign plans to write off an $18 million investment in RealNames and must come up with another solution for resolving the 1 million foreign-language domain names it has registered.
RealNames founder and CEO Keith Teare says the company was at a break-even point and has $12 million in cash. The company owes Microsoft approximately $25 million and plans to sell its physical and technological assets related to directory services, multilingual domain name resolution and messaging services.
“We put all our eggs in one basket very consciously,” Teare says. “With Microsoft as a 20% shareholder and with significant revenue share going to Microsoft, we thought it was a pretty safe basket.”
RealNames had sold more than 200,000 brand-name keywords. During March, RealNames’ keywords were accessed 187 million times, company officials say.
“Keyword technology is being killed by Microsoft, and it has absolutely no chance of being re-created unless they’re prepared to help,” Teare says. “Their browser is used by a half-billion people, and it sits between the user and the content. If their applications don’t support keywords, then keywords don’t exist.”
Keywords are “the human-facing component that the geeks who built the Internet forgot to make,” says Michael Mealling, an engineer with VeriSign who helped author several IETF documents related to keyword resolution. “The Internet really became popular before we finished the thing. And it was used primarily by geeks who didn’t have a problem with [long HTTP addresses]. So now that our grandmothers want to use it, we have to build that human component into it.”
The demise of RealNames leaves only a few small keyword providers standing: the U.K.’s CommonName, a Chinese portal-based keyword system called 3721 and a Korean-language service called Netpia. Meanwhile, AOL Time Warner offers a proprietary keyword service to its users.
Companies spent $500 per year to register a RealNames keyword, with volume discounts available for large customers such as eBay, which had about 3,000 keywords in the RealNames system. These large customers paid RealNames from $50,000 to $500,000 per year.
Supporters of keyword technology say RealNames failed in part because it was a closed, proprietary service offered by a single company.
Ironically, RealNames is closing just as the IETF appears ready to approve a standard for keyword resolution. RealNames engineers helped create the standard, which is called the Common Name Resolution Protocol (CNRP). Daigle, who served as chair of the CNRP working group, says the IETF’s leadership has approved the CNRP documents and will publish them soon.
Whether a new player will enter the keyword market remains to be seen. RealNames executives were angered last week by the discovery that Microsoft recently was awarded its own patent for a keyword system. Microsoft declined to comment about the patent it has on keywords.
Some observers predict new keyword services will be hard to finance in today’s economic climate.
“Keywords is another layer above the basic resolution. It is an option. So the key question is: Do we really need it?” asks Marc Blanchet, co-chair of the IETF’s Internationalized Domain Name working group and an engineer with Viagenie. “This is a great technology but . . . some technologies that are not essential sometimes never get the market [support].”
Headquarters: San Carlos, Calif.
Business: Keyword registration and resolution service.
Funding: $133M in venture capital raised over four rounds; Microsoft owns 20%, VeriSign 10%.