Dear Lazy Web, Blog Search Engine
Dear Lazy Web,
We have a bunch of internal blogs at work, or we will soon. The problem is that we have many different ways for people to blog. We’re working on an official solution, but there are other ways to blog. For example, some of the wikis have blog-like features, some groups have set up their own servers, and many of the “collaboration” products out there have similar features. It would be really nice to allow people to post in the solution they like best, yet still have a central location for people to see what’s going on in our “blogosphere,” so to speak.
A traditional aggregation solution like Planet or Feedjack isn’t going to work because it won’t scale to the number of feeds we’d need to track. After a certain number of feeds are configured in the system, it’s going to spend almost as much time (if not more) crawling the feeds as it would displaying them. Since most of those feeds won’t have been updated, crawling all of them each time isn’t very efficient. It’s become very clear that a solution more like Technorati is the direction we’d want to go. By only indexing sites when they “ping” it to say they have been updated, the content can remain up to date without wasting time crawling pages that haven’t changed.
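To make the ping idea concrete, here is a minimal sketch of what I have in mind. It is not how Technorati actually works; `PingQueue`, `run_index_pass`, and the `fetch` callable are all hypothetical names I made up for illustration. The point is only that the indexer touches feeds that announced an update and nothing else.

```python
from collections import OrderedDict


class PingQueue:
    """Tracks feeds that have announced an update and are awaiting a crawl."""

    def __init__(self):
        # OrderedDict keeps first-ping order and gives cheap de-duplication.
        self._pending = OrderedDict()

    def ping(self, feed_url):
        # Repeated pings for the same feed collapse into one pending entry.
        self._pending.setdefault(feed_url, True)

    def drain(self):
        # Hand back everything pinged since the last pass, then reset.
        urls = list(self._pending)
        self._pending.clear()
        return urls


def run_index_pass(queue, fetch):
    """Fetch only the feeds that pinged since the last pass.

    `fetch` is any callable taking a URL and returning the feed content;
    it is a stand-in for whatever crawler would really be used.
    """
    return {url: fetch(url) for url in queue.drain()}
```

With this shape the cost of a pass depends on how many blogs were updated, not on how many blogs exist.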
I’m somewhat surprised that I wasn’t able to find something to accomplish this task very quickly. It seems like it should already exist, and a simple search over at Freshmeat should have turned up several options. I think I’m looking for the wrong things, though, because I haven’t found anything yet that does what I’d like it to. So, dear lazy web, what should I be looking for instead? I know it must be out there…
If the journals export RSS or similar feeds, planets work. However, adding feeds doesn’t really scale. For a while I tried a 400+ feed planet and it took 10 minutes to sync each time. The only other consideration you might want to keep in mind when you do have an “official” blog is allowing offline tools for folks to blog.
Yeah, that’s the problem I’m trying to address. I’m pretty sure we’ll be looking at blogs numbering well into the thousands. Obviously, most of those won’t actually be updated on a regular basis, which is why I’m looking for something that won’t scan every feed on every update.
But the downside is that pings require configuration. On the other hand, (i) it sounds like most of your feed sources will be local (fast), (ii) planet software does unchanged-feed detection, and (iii) surely you won’t need up-to-the-second aggregation.
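For anyone wondering what unchanged-feed detection can look like, one common approach is an HTTP conditional GET: send back the `ETag` and `Last-Modified` values from the previous fetch and let the server answer 304 if nothing changed. Whether this is exactly what a given planet install does is an assumption on my part, and `conditional_headers` and `fetch_if_changed` are names invented for this sketch.

```python
import urllib.error
import urllib.request


def conditional_headers(etag=None, last_modified=None):
    """Build the validators to send back from the previous fetch, if any."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(url, etag=None, last_modified=None, timeout=10):
    """Return (body, etag, last_modified); body is None when unchanged."""
    request = urllib.request.Request(
        url, headers=conditional_headers(etag, last_modified)
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return (
                response.read(),
                response.headers.get("ETag"),
                response.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as err:
        # urllib reports 304 Not Modified as an HTTPError.
        if err.code == 304:
            return None, etag, last_modified
        raise
```

Note that this still costs one request per feed per pass; it only saves the download and the parsing when the server supports it.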
If the aggregation process takes a few minutes a couple of times an hour, isn’t this still a win compared to configuring pings into thousands of blogs?
I should probably give you some sense of the scale we’re talking about here. Including employees, contractors, and partners working internally, we could be talking about upwards of 70,000 user blogs alone. That doesn’t include blogs for various teams, projects, or topics. I expect a much higher rate of actual bloggers than you’d see out on the net. Many of the pilot users are using their blog as a sort of weekly status report. I can see this becoming popular. That could mean several thousand regularly updated blogs. Even only updating once an hour, I just don’t see Planet scaling that far.
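As a rough back-of-the-envelope check, take the 400-feed, 10-minute sync mentioned earlier in the thread and assume sync time grows linearly with feed count. That is a big assumption (local feeds and unchanged-feed detection would both help), and `estimated_sync_minutes` is just a throwaway helper for the arithmetic.

```python
def estimated_sync_minutes(feed_count, sample_feeds=400, sample_minutes=10):
    """Linear extrapolation from one observed sync time; a crude estimate."""
    return feed_count * sample_minutes / sample_feeds


minutes = estimated_sync_minutes(70_000)
print(f"{minutes:.0f} minutes, or about {minutes / 60:.1f} hours per full sync")
```

Under that assumption a full pass over 70,000 feeds comes out at 1,750 minutes, well over a day, which is nowhere near an hourly update.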
For the official systems, pings can be configured to happen by default. Groups setting up their own systems can figure it out on their own. It’s the price of running unsupported. Planet feeds would also have to be configured by hand, and only by someone central. I suppose we could make a simple web interface to update it, but it still comes back to the scalability factor.
Planet works for us very well now.