Dynamic pages & submitting to SE's


supine
22-08-2001, 07:10/07:10AM
Hi, can anyone help with the following?
I'm trying to find information on submitting dynamic pages (database driven rather than flat, static HTML pages) to search engines. Are they read at all?

highman
22-08-2001, 07:16/07:16AM
By dynamic I presume you mean ? and = signs in the URLs. Some engines are starting to spider these db-driven sites, but a much more effective way is to remove these special characters from the URLs (this can be quite difficult and may require programming skills).
This makes the website far more 'spider friendly' and therefore more effective when SEO is applied.

usbnuts
24-08-2001, 03:09/03:09AM
Hi, I have experience in this area.

For Google, dynamic pages with query strings attached are the last thing to be spidered. I have tested this before. Say you have 500 content-rich pages generated from one program. Even though their keyword density is high, Google will wait for more than a month to index those pages. Its spider probably needs to investigate whether or not those pages are just trying to spam the search engine.

If you make them look static, Google will be more than happy to index them in the first month. I have tested this myself.

This answer applies to Google only.

ihelpyou
24-08-2001, 07:17/07:17AM
Very good post usbnuts and good info for all to know!

JuniorHarris
24-08-2001, 11:02/11:02AM
Thanks for your experience usbnuts!~ Google has been getting much better with indexing query string URLs (hate to call them dynamic as that can be misleading), however as you note they may take longer.

Well highman, I suppose it's time we start a thread regarding spider friendly pages. It's important to remember that dynamic pages in themselves are not the problem, but rather any page which uses query strings or special characters in the URL. The biggest obstacle for most engines is the "?" question mark used to delimit the beginning of query strings.

We have a fairly large database driven site, in fact 98% of all content is generated directly from the database itself!~ However, none of our pages use query strings; instead they use a full [static] spider friendly URL. Granted, some engines are getting better at indexing pages with query strings, however most spiders will crawl deeper and longer with spider friendly pages. Additionally, spider friendly URLs are much easier for the surfer to remember as well. For example, domain.com/product/apples/ is much easier to remember than domain.com/product.asp?ID=10000 or even domain.com/product.asp?prod=apples.

I have to give credit to SimpleEnigma and Ethan at SEF (there are some decent Mods there) for helping me to understand and develop a solution for dynamic pages. Our implementation utilizes what is commonly referred to as a 404 trap, which simply means it uses a custom 404 (asp script) to process and parse invalid page requests. These invalid page requests are invalid in the sense they do not physically exist on the server, however the 404 error script will intercept these requests and parse the URL to retrieve the necessary parameters to query the database and present the correct "page". The page status is also returned with a status OK, and the engines just seem to eat these pages up!
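The URL-parsing step of such a 404 trap can be sketched as follows. This is a minimal Python illustration of the idea, not the ASP script described in the post; the function name `parse_friendly_url` and the two-segment layout (section, then item) are assumptions made for the example.

```python
from urllib.parse import urlsplit


def parse_friendly_url(url):
    """Split a path such as /product/apples/ into (section, item).

    Returns None when the path does not have exactly two segments,
    so the caller can fall through to a real 404.
    """
    path = urlsplit(url).path
    # Drop the empty strings produced by leading and trailing slashes.
    segments = [s for s in path.split("/") if s]
    if len(segments) != 2:
        return None
    section, item = segments
    return section, item


print(parse_friendly_url("http://domain.com/product/apples/"))
```

The returned pair would then be used as the parameters for the database query that builds the page.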

glengara
25-08-2001, 03:59/03:59AM
I remember coming across that very interesting thread and trying to follow it...still haven't fully recovered.

Sharon & Roy
25-08-2001, 15:35/03:35PM
Originally posted by supine
are they read at all?

Hello supine,

We can only speak for Google on this and here is what they say.

Excerpt From: Google Interview by Fredrick Marckini with Craig Silverstein, Chief Technology Officer of Google - June 2001

Dynamic Content

Google has begun to crawl and index dynamically generated Web Sites. Craig mentioned that crawling dynamic pages is fairly new to Google's service and that it is still a work in progress. Silverstein also realizes that crawling these types of pages is dangerous in that dynamic pages are generated on the fly, leaving the potential to get caught up in an infinite space of page generation (there is the potential that a Search Engine spider could get caught in a recursive trap, indexing thousands of years of calendar months by continuing to follow the "next month" for example). Although Google is approaching dynamic pages slowly and cautiously, they are now included in the Google database and that's good news for many.

JuniorHarris
27-08-2001, 08:35/08:35AM
Glengara, it still makes my head spin too...even now with the technology implemented it can be confusing at times!~ The thread is not recommended for those with weak hearts or pacemakers!~ <lol>
:read:

It is important to note as S&R posted, Google has begun to crawl dynamic [query string] pages. However, it is equally important to recognize their approach is "slow and cautious", and would clearly indicate Google is much more apt to fully index a site which IS spider friendly.

From personal experience I know Google will index query string URLs, however I had to encourage Googlebot to index the pages by listing them [moderately] on a spider friendly URL page, as Googlebot does not tend to follow links from the query string URL pages well [if at all].

This experience was a few months ago, so Googlebot may now perform better with query string URLs. At *best* query string URLs will perform equal to spider friendly URLs, but spider friendly URLs have the advantage of being able to dynamically include keywords within the directory path and filename. ;)

nuzelonde
14-01-2002, 04:22/04:22AM
I work on many dynamic sites for my clients.

Firstly, the Google spider is getting a lot better at reading the dynamic URLs, but this is only half the answer. The spider is programmed to avoid getting caught in infinite request loops, which are a real problem on sites that are generated on-the-fly.

The spider essentially keeps generating a unique URL each time it tries to follow a link, gets stuck and, if you're lucky, times out. If you're not, your web server grinds to a halt.
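How a spider might protect itself from this kind of loop can be sketched with a page budget and a depth limit. This is only an illustration in Python; it is not how Googlebot is known to work, and `crawl`, `calendar_links` and the `/calendar.asp?month=N` URLs are hypothetical names invented for the example.

```python
from collections import deque


def crawl(start, get_links, max_pages=50, max_depth=10):
    """Breadth-first crawl that stops at a page budget and a depth limit."""
    seen = {start}
    queue = deque([(start, 0)])
    fetched = []
    while queue and len(fetched) < max_pages:
        url, depth = queue.popleft()
        fetched.append(url)
        if depth >= max_depth:
            continue  # Too deep: record the page but do not follow its links.
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return fetched


def calendar_links(url):
    """Hypothetical endless site: every month page links to the next month."""
    month = int(url.rsplit("=", 1)[1])
    return [f"/calendar.asp?month={month + 1}"]


pages = crawl("/calendar.asp?month=1", calendar_links, max_pages=1000, max_depth=12)
print(len(pages))
```

With a depth limit of 12 the crawl ends after 13 pages, even though the site itself never runs out of "next month" links.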

The workaround is to introduce static content. Consider blocking the spider from most dynamic pages with a robots.txt file and creating static representations of these pages. I find it best to build a mini theme site consisting of static pages for the spiders within the dynamic site.
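A robots.txt rule of this kind can be checked with Python's standard `urllib.robotparser` module. The sketch below assumes, purely for illustration, that the dynamic pages live under /cgi-bin/ and the static representations under /news/; the bot name `ExampleBot` is also made up.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep every spider out of the dynamic scripts.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path):
    """Return True if a spider obeying ROBOTS_TXT may fetch the path."""
    return parser.can_fetch("ExampleBot", path)


print(allowed("/news/256en23.html"))
print(allowed("/cgi-bin/news.cgi?ID=256&lg=en&cat=23"))
```

Disallow rules match by path prefix, so a single line covers every query-string variation under that directory.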

You can also cloak, but that's another story :)

crifer
14-01-2002, 10:27/10:27AM
I have read about dynamic pages on a site called spider-food.net and found a plugin for people who use the Apache web server. The plugin makes static pages out of dynamic pages. I didn't read exactly how it works, but it's an option if you have Apache as your server.

Here are some links:

http://spider-food.net

Direct link to the dynamic optimization page:
http://spider-food.net/dynamic-page-optimization.html

There are some tips for servers other than Apache too... It's a really good page, I recommend taking a look.

JuniorHarris
18-02-2002, 13:17/01:17PM
The Apache mod_rewrite documentation can be found here (http://httpd.apache.org/docs/mod/mod_rewrite.html). There are also products available for IIS such as QwerkSoft (http://www.qwerksoft.com/products/iisrewrite/usage.asp), for those without a savvy ASP programmer!~ ;)

BobbyK
07-04-2002, 22:16/10:16PM
So do I have to use some ASP component to convert mypage.asp?para1=val1 to mypage.asp/para1/val1, or is there any way I can do it without components? Either way, please give me some tips.

Thanks in advance

Kal
08-04-2002, 00:40/12:40AM
Hi supine - Danny at Search Engine Watch has an excellent section on making dynamic pages spider friendly here:
http://searchenginewatch.com/subscribers/more/dynamic.html
but you have to be a subscriber to access (worth it!). It lists resources for ASP, Apache, ColdFusion & Domino sites.

For ASP users, these URLs may help:
http://www.webanalyst.com.au/Products/ASPSpiderBait.htm
http://www.xde.net/

For Lotus Domino users, there is a useful workaround resource here: http://www.keysolutions.com/notesfaq/howsite.html

hope this helps!

JuniorHarris
09-04-2002, 12:09/12:09PM
You can create dynamic pages which appear static in ASP without any additional components. This can be accomplished by creating static links which are invalid URLs, and creating a custom 404 error page which will process and parse the invalid URL and return the correct results. Since all the processing occurs server side, they appear exactly as static pages to the search engines as well as the users!~ ;)
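The conversion asked about earlier in the thread (mypage.asp?para1=val1 to mypage.asp/para1/val1 and back) comes down to simple string handling. Here is a rough Python sketch of both directions; the function names are invented for the example, and it assumes parameter names and values contain no slashes.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit


def to_path_style(url):
    """Turn mypage.asp?para1=val1 into mypage.asp/para1/val1."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query)
    return parts.path + "".join(f"/{key}/{value}" for key, value in pairs)


def to_query_style(url, script="mypage.asp"):
    """Reverse the mapping: read the segments after the script as name/value pairs."""
    head, _, tail = url.partition(script)
    segments = [s for s in tail.split("/") if s]
    pairs = list(zip(segments[0::2], segments[1::2]))
    query = "?" + urlencode(pairs) if pairs else ""
    return head + script + query


print(to_path_style("mypage.asp?para1=val1"))
print(to_query_style("mypage.asp/para1/val1"))
```

The first function would be used when writing links into pages, the second inside the error handler that receives the "invalid" request.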

Bobby, I sent you a PM....

chopsticks
02-08-2002, 00:06/12:06AM
just my $.02 worth (from personal experience).

if you're going to go through all the trouble of making a custom 404 page (OK, so it's not really any trouble, just a normal web page and the web server configured to hand that page out on error = file not found).....

make sure you utilize the custom 404 page MOSTLY for the human visitor.

a custom 404 page will, in most cases, still generate a header that identifies it as an error (404 - file not found).

so although the page might look spiffy and be optimized for search engines, I doubt it's going to really be considered too valuable by the spider. (but it'll be hecka valuable to a human who accidentally gets the error page)

JuniorHarris
02-08-2002, 08:34/08:34AM
Chops I'm not sure you fully understand the custom 404 error trap method. Our entire site is all driven from a single 404 error script, even the root page uses the same script!~:eyes:

We have hundreds of pages indexed by the engines without any problem. The key is to change the status code [Response.Status = "200 OK"] for valid pages (those in the database), and still pass a "valid" 404 for real 404 pages. This is typically done server side, before any output [including headers] is returned to the browser.

Given a proper implementation, neither engines nor users can tell the difference between a "regular" page and one that was created dynamically through a 404 trap. Anything that occurs server side is invisible in a sense, and only the resulting output [from server side operations] sent to the client is available for indexing.
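The status logic of a 404 trap can be sketched like this. It is a Python illustration rather than the ASP script discussed here; the `PAGES` dictionary is a hypothetical stand-in for the database and `handle_missing_file` is an invented name.

```python
# Hypothetical stand-in for the database: friendly path -> page body.
PAGES = {
    "/product/apples/": "<h1>Apples</h1>",
    "/product/pears/": "<h1>Pears</h1>",
}


def handle_missing_file(path):
    """Return (status, body) for a request that matched no physical file."""
    body = PAGES.get(path)
    if body is not None:
        # The URL maps to real content, so the client sees a normal page.
        return "200 OK", body
    # Nothing in the data store: bottom out with a genuine 404.
    return "404 Not Found", "<h1>Page not found</h1>"


print(handle_missing_file("/product/apples/")[0])
print(handle_missing_file("/product/kiwis/")[0])
```

The final return is the important part: every path that cannot be resolved must still end in a true 404, otherwise the site appears to have an unlimited number of pages.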

From my commercial experience, using a custom 404 trap has not created any problems for search engines or users. In fact it has actually solved problems!~ Look Ma, No query strings!~

Whether an engine/spider values a page depends more on the content or browser output than on how the page was constructed!~:read:

Alan Perkins
02-08-2002, 09:04/09:04AM
There are two sets of advice regarding 404 handlers and it's easy to mix these up.

1) Generally, it's a good idea to replace the default 404 page your Web server spits out with a prettier, more useful one. This has nothing to do with SEO. The pretty page should still be returned with a 404 error status.
2) For SEO purposes, you may choose to deliberately create references to "missing" pages, then use a 404 handler to translate the missing URL into a DB lookup. Where the URL will translate, the HTTP response will not be a 404. However, where the URL will NOT translate a 404 should be returned.

Often 404 handlers fail to ever return a 404! This creates an infinite-sized site (think about it ;)). When you write an interrupt handler, you have to be sure you fulfil every function of the default handler you are replacing.

chopsticks
02-08-2002, 10:28/10:28AM
It 'tis true that one can have a lot of fun with the error pages if you can control the headers (that's why I love playing with Apache... and *nix servers in general).

Unfortunately, some of my past clients and I have had virtually hosted Windows NT domains. And while the hosting company often says things like "you get a custom 404 page", they just slap a template into the IIS interface. (Windows 2k & IIS5 solved a lot of those limitations.)

***

So, at least with Windows I've had a pain in the rear. I eventually just used ASP to send the headers manually (since the dang hosting firm just wouldn't budge on their "policy").

That worked fairly well, except for the stats reporting that we were getting from our hosting firm. *d0h!* Since it was officially an error page, all of our visitors were tracked by their stats program as 404s. (It was possible to track everything, but it was darn strange! Since we were using the custom error page as a 'catch-all' page, everything was backwards! Whenever someone DID get a real page, that was essentially an error, and whenever they got the 404 page [for which I had the ASP code send back the "OK" (200) headers], it was actually a valid hit!)

Ahhh... I felt like Charlie Brown trying to kick the football with that hosting company.....

JuniorHarris
02-08-2002, 12:11/12:11PM
Good Points Alan! Knowing the intent helps considerably.

A replacement [or pretty] 404 page can be used in place of the default 404 page. But its performance or function with the engines should be no different than the default, since it still returns a status 404.

The post by Chops may have been in reference to using a replacement 404. And since the thread discusses dynamic content generation utilizing a 404 error trap, it may have created some confusion. Especially the part regarding the performance with the engines.

It is really not that hard for a 404 handler to return a true 404 error. The logic just needs to ensure a "bottom-out" execution path, so that if data is not returned from the data store, a true 404 error page (including status 404) would be generated!~;)

FYI: I have tracked several spiders which seem to request randomly generated [looks like checksum] page names, as if testing for an infinite loop site. Which is really not that hard to test for:

See: http://www.ihelpyouservices.com/forums/foo.
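A test of that kind can be sketched in a few lines of Python. What the spiders actually do is not known from this thread; the sketch only shows the idea of requesting page names that cannot exist and checking for a 404. The handlers `always_ok` and `honest` are hypothetical stand-ins for a site's responses.

```python
import secrets


def looks_infinite(handler, attempts=3):
    """Request random page names; a site that never answers 404 looks infinite."""
    for _ in range(attempts):
        probe = f"/{secrets.token_hex(8)}.html"
        status, _body = handler(probe)
        if status.startswith("404"):
            return False
    return True


def always_ok(path):
    """Badly written 404 trap: every URL comes back as a valid page."""
    return "200 OK", "<h1>Something</h1>"


def honest(path):
    """Well-behaved site: only the home page exists."""
    if path == "/":
        return "200 OK", "<h1>Home</h1>"
    return "404 Not Found", ""


print(looks_infinite(always_ok))
print(looks_infinite(honest))
```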

potato
03-08-2002, 15:29/03:29PM
It appears that the more complicated the QUERY_STRING behind the ? mark is, the lower the chance of being indexed.

If your pages are hosted on an Apache server, you are lucky.
Let's say you want ../news.cgi?ID=256&lg=en&cat=23 to look more spider friendly, e.g. like /news/256en23.html (which would certainly be indexed). You could try one of the various methods of server-based URL rewriting. The easier things can be done with mod_alias; the trickier issues may need mod_rewrite.

You have to work out a way to put your URLs in one simple string, as I did in the above sample.

Then you have to write a one-line text file (named .htaccess) and put it in the /news directory. The following will *not* work as written, as the paths are not correct; I just write it down as an example:
RedirectMatch ^(...)(..)(..)\.html /cgi-bin/news.cgi?ID=$1&lg=$2&cat=$3
Easy, ain't it?

The actual naming depends on your filenames and file structure. The pattern follows the syntax of POSIX regular expressions.
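The mapping that the example rule is meant to express can be tried out in Python before touching the server configuration. This is only a sketch of the pattern matching, not an Apache configuration; the /news/ prefix, the three-digit ID, two-letter language and two-digit category widths are assumptions taken from the sample URL.

```python
import re

# Assumed layout: /news/<3-digit ID><2-letter language><2-digit category>.html
FRIENDLY = re.compile(r"^/news/(\d{3})([a-z]{2})(\d{2})\.html$")


def to_script_url(path):
    """Map a friendly path to the script URL, or return None if it does not match."""
    match = FRIENDLY.match(path)
    if match is None:
        return None
    article, lang, cat = match.groups()
    return f"/cgi-bin/news.cgi?ID={article}&lg={lang}&cat={cat}"


print(to_script_url("/news/256en23.html"))
```

Fixed-width groups keep the pattern unambiguous; if IDs can vary in length, a separator between the parts would be easier to match reliably.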

There are various ways of achieving different HTTP header responses, but for spiders a 302 (temporarily moved) header works well. An HTTP 200 can also be sent by the server if needed. If you want to learn it yourself, you have to dig into the Apache manuals. If you need just one redirection, post it here and we can work out a solution.