Symbols in URL's + Deep Content Indexing


Kal
13-02-2002, 02:02/02:02AM
Hi there - I have a two part question for everyone:

1) Are there any search engines that still cannot index "+" or "-" symbols within URL's such as This One? (http://www.abbi.com.au/dire/direct/directpublishing.nsf/Content/Insurance+Solutions+-+Tools)

2) Is there any spidering disadvantage to having content many levels deep (such as in URL above). For example, are there any spiders that will not crawl past a certain level below the top domain? Does my client risk content not being indexed if it is placed too deep?

Thanks in advance.

pageoneresults
13-02-2002, 02:47/02:47AM
2) Is there any spidering disadvantage to having content many levels deep (such as in URL above). For example, are there any spiders that will not crawl past a certain level below the top domain? Does my client risk content not being indexed if it is placed too deep?

Hello Kal! I've given this method of optimization my own term...

The Chain of Command Theory

One very important element in optimization is the directory structure of the site. Content relative to your main theme should be in the root directory.

Content relative to your secondary theme should be in your 1st level sub-directories.

Content after the secondary theme might go in a 2nd level sub-directory.

I like to keep everything within 3 levels. Your Admirals at the top, Vice Admirals at the 1st level and then the Captains at the 2nd level. I try to keep the Petty Officers out of there. This is strictly Scrambled Eggs material!

For those spiders that do deep crawls like Google and a few others, you could probably go 10 levels and still get indexed. I think dilution of content occurs as you travel down each directory level.

Check this one out from Google (http://www.tema.ru/p/h/o/t/o/s/h/o/p/)
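
To see how deep a given URL actually sits, the levels can be counted mechanically. This is a minimal Python sketch, not anything the engines are known to do. It assumes the last path segment is a file name unless the path ends in a slash, and the function name directory_depth is made up for illustration.

```python
from urllib.parse import urlparse


def directory_depth(url: str) -> int:
    """Count directory levels below the domain root (a trailing file name is not counted)."""
    path = urlparse(url).path
    # Split on "/" and drop the empty pieces left by leading/trailing slashes.
    segments = [s for s in path.split("/") if s]
    # Assumption: a path that does not end in "/" ends in a file name.
    if segments and not path.endswith("/"):
        segments = segments[:-1]
    return len(segments)


client_url = ("http://www.abbi.com.au/dire/direct/directpublishing.nsf/"
              "Content/Insurance+Solutions+-+Tools")
deep_url = "http://www.tema.ru/p/h/o/t/o/s/h/o/p/"

print(directory_depth(client_url))  # 4
print(directory_depth(deep_url))    # 9
```

By this count the client URL from the first post sits 4 levels down, and the tema.ru example sits 9 levels down.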


Kal
13-02-2002, 03:11/03:11AM
Great analogy P1 :thumb: I'll use that when discussing the issue with my client. Anyone else?

ihelpyou
13-02-2002, 06:05/06:05AM
hey Kal, maybe it's me but I avoid those types of url's because I do believe the spiders have a hard time with them. I know they have a hard time with this:

www.thatdomain.com/that domain.htm

Notice the ( %20 ) put into the url with just a space separating the words? The spiders stop dead in their tracks on it.

IMO

Kal
13-02-2002, 06:20/06:20AM
Originally posted by ihelpyou
Notice the ( %20 ) put into the url with just a space separating the words? The spiders stop dead in their tracks on it.
IMO
Couldn't see the %20 in your example, but I have seen this happen before, esp when I quote URL's in emails. So are you saying I should rename all MY pages from this (http://www.high-search-engine-ranking.com/search_engine_optimization_benefits.htm) to this? (http://www.high-search-engine-ranking.com/search-engine-optimization-benefits.htm) I've never had a problem with this before.

Advisor
13-02-2002, 09:11/09:11AM
Kal,

I think Alan can probably best answer your questions about dynamic URLs. He has a program that can spider through sites and determine which URLs can and can't be read by the engines. He may also know which engines are good (or bad) at indexing those kinds of URLs. I think he's said here before that the engines can index those dynamic URLs, but they're often reluctant to.

I would also be interested in knowing which engines are still having a hard time with them (if any), because I'm getting less afraid to optimize sites that are dynamic. I don't worry about them too much in Google, and we can always pay to get them in AV and Inktomi, if necessary. What else is really left after that?

Jill

Advisor
13-02-2002, 09:12/09:12AM
Doug, the %20 thing is different than a dynamic page, however. As you said, it's just a space in the file name. I think that's a whole other problem than dynamic sites, although I could be wrong about that. I always tell my clients to fix that space problem if they have it.

Jill

Alan Perkins
13-02-2002, 09:39/09:39AM
A "+" in a URL represents a space, as does a %20. An underscore "_" is not a space. URLs that contain spaces are not necessarily dynamic URLs.

I define static URLs as not containing a "?", dynamic URLs as containing a "?" and personalised URLs as being designed only for one person to see. Personalised URLs are usually (but not always) dynamic. Note these are my definitions, there's no definitive terminology for this area.
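
The static/dynamic split above is purely a test on the URL string, so it can be sketched in a couple of lines of Python. The function name classify_url is hypothetical. Personalised URLs are left out because nothing in the URL text alone identifies them.

```python
def classify_url(url: str) -> str:
    """Apply the working definition above: a URL containing "?" is dynamic, otherwise static."""
    return "dynamic" if "?" in url else "static"


print(classify_url("http://www.abbi.com.au/dire/direct/directpublishing.nsf/"
                   "Content/Insurance+Solutions+-+Tools"))  # static
print(classify_url("http://www.google.com/search?num=100&hl=en&q=+site%3Awww.lotus.com+nsf"))  # dynamic
```

Note that this says nothing about how the content is produced. A URL classed as static here may still be served from a database.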

Robots can crawl dynamic URLs but:

Whether they do or not is another question.
Whether they index the content or not is yet another question
Whether the content ranks well or not is yet another question

These questions have to be answered on an engine-by-engine basis. Taking Google, yes it crawls some dynamic URLs (especially if they are linked to by a static URL) but not all, and its cut-off algo changes regularly. Yes, it indexes dynamic URLs. And they rank according to their PageRank (often very low since they are deeply buried) and their content.

You have to be very careful about duplicate content being crawled under different URLs, and about temporary or personalised URLs becoming indexed and the side effects that may have.

I specialise in clients that use systems like Broadvision, ATG Dynamo, Intershop, Vignette, and other systems that weren't designed to market anything but the home page :rolleyes:

Kal, the .nsf in the URL you posted is Lotus Domino/Notes. I've come across it once or twice and you can get it indexed, if not ranking well. Take a look at the URLs from this search:

http://www.google.com/search?num=100&hl=en&q=+site%3Awww.lotus.com+nsf

Advisor
13-02-2002, 10:02/10:02AM
I specialise in clients that use systems like Broadvision, ATG Dynamo, Intershop, Vignette, and other systems that weren't designed to market anything but the home page

So, Alan...when you work with clients that use the above systems, do you somehow tweak their programming so that it creates more user friendly URLs or is it something else? I know that you and I have discussed this stuff a bit privately, but I'm still not clear on what/how you do this. Here's your chance for a little self promotion, as I'm sure many of us have clients and potential clients that utilize those kinds of systems. To be able to offer them a method to get their question mark-filled URLs definitely indexed is a huge selling point.

Jill

highman
13-02-2002, 10:23/10:23AM
I will chuck my 2p's worth in here;

You have to be careful when referring to dynamic URL's. We deliver all our dynamic website pages as htm files (or whatever extension we fancy) with no weird characters in the URL. Just because a URL has no ? or % etc does NOT mean it is not dynamic.
Dynamic means delivered from a db and not from a file residing on the server hard disc.... well that's my def. anyway.

All the engines out there prefer static LOOKING url's. Some will index URL's with ? or % in them, but they certainly do so with less eagerness than URL's without these characters.

To be able to offer them a method to get their question mark-filled URLs definitely indexed is a huge selling point.

Indeed..... it is :)

Jill, if you have a programmer on board, mail me and I can explain how to set this up, but not being a programmer myself I can't write the code for you.....

Getting back to the original post, I would do everything you can to remove any strange characters in the URL that may indicate a database is involved.

And im with Pageone on the directory structure although i tend to keep the Captains in with the Vice Admirals on the good ship 'keyword directory names' ;)

Alan Perkins
13-02-2002, 10:48/10:48AM
Do you somehow tweak their programming so that it creates more user friendly URLs or is it something else

I prefer to tweak the programming if possible, but when it isn't there are other techniques that can be used to get the content indexed. Personalised URLs are the worst to deal with.

Dynamic means delivered from a db and not from a file residing on the server hard disc.... well that's my def. anyway

I try not to use "dynamic" on its own. I use these defs:

Dynamic URLs: URL's containing a "?"
Dynamic content: content delivered from a database and/or created on-the-fly

Blue
13-02-2002, 11:28/11:28AM
Sidebar:

I don't know if it holds true for the latest version(s) of Netscape, but files (ie: that domain.htm) with spaces (%20) will not render in some versions (IE was OK with it).

I found this out a few years ago when I did a redesign of a client's site that had spaces in ALL their pages' file names. Big headache as there was no global way I could change them. Sheesh....
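
For anyone facing the same cleanup today, the renaming rule itself is easy to script. This is a minimal Python sketch; the function name safe_name is hypothetical, and it only computes the new names. Renaming the files on disk and updating every link that points at them would still have to be done separately.

```python
import re


def safe_name(filename: str, separator: str = "-") -> str:
    """Collapse each run of whitespace in a file name into a single separator."""
    return re.sub(r"\s+", separator, filename.strip())


old_names = ["that domain.htm", "prepare to be amazed.htm", "index.htm"]
# Map every old name to its space-free replacement.
rename_plan = {name: safe_name(name) for name in old_names}

for old, new in rename_plan.items():
    print(f"{old!r} -> {new!r}")
```

Passing separator="_" gives the underscore style instead, so the same sketch works whichever separator you prefer.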

ihelpyou
13-02-2002, 12:05/12:05PM
hey kal, yes, I was referring to a 'space' in the file name/page name. It always yields a '%20' if you click on it. You can see what is in the address bar of the url I posted if you have a space in the file name.

Google seems to have a hard time following those kinds of url's. Whether they be dynamic or static, if they have a %20 in them, Google does hesitate. A space always seems to mean hesitation.

Further;
I find that Google likes file names with an underscore( _ ) separating the words rather than a hyphen( - ) .

For a file name, an underscore works best.

markymark
13-02-2002, 19:34/07:34PM
I find that Google likes file names with an underscore( _ ) separating the words rather than a hyphen( - ) .

Doug, on what do you base this? I have never seen a difference. In fact, most SEOs use the hyphen (easier to type?) and seem to get good results.

Going back to something Alan said about duplicate content crawled under different URLs. Are you referring to duplicate content in the literal sense - ie: two or more separate pages with the same content or are you referring to one page that can be accessed via two different URLs.

IE: mydomain.com/mypage.html and mydomain.com/mypage.asp?somerubbish=somedifferentrubbish.

This has got me a little concerned as one of the sites I work on has pages that can be accessed via three different extensions, though they are all the same page.

ihelpyou
13-02-2002, 19:44/07:44PM
hey Mark, that is simply my observation of experimenting with both a hyphen and an underscore. Nothing scientific and not lots of research, believe me, but simply my belief. :)

As to your other question, your server setup thing determines all of this as per PageOne's problem in another thread. If Google would see those url's as a different page in a different space, then problems could occur. If the setup is correct and the url's are pointing correctly to the same space/page, then all should be just fine and Google will see them as one page/space.

Hope that makes sense. Not sure now as I am re-reading it. :)

Kal
13-02-2002, 21:02/09:02PM
Um Doug - that example I typed uses an underscore _ . Most of my site pages use underscores instead of hyphens, but now I'm unsure if perhaps a hyphen would be better. I know you find an underscore personally better, but I'm not so sure. Can anyone let me know if they think using underscores as in domain/page_name.htm is ok or should I change it to domain/page-name.htm or perhaps I'm obsessing?

Also, thanks Alan for your response. But I'm still confused. As you stated, the client sample page is not dynamic (not sure why everyone is talking about dynamic pages actually), but if the "+" in the URL represents a space, does this mean some engines will not be able to index such URL's? I couldn't see any "+" symbols in the Google example you gave. What is the purpose of using "+" symbols anyway, or is that just something the Domino server creates automatically?

Alan Perkins
14-02-2002, 11:44/11:44AM
Originally posted by markymark
Are you referring to duplicate content in the literal sense - ie: two or more separate pages with the same content or are you referring to one page that can be accessed via two different URLs.

I don't see any difference between those two. :)

Originally posted by markymark
This has got me a little concerned as one of the sites I work on has pages that can be accessed via three different extensions, though they are all the same page.

That's something to be concerned about, but not excessively. They are duplicate pages. If detected as such, the spider has a choice:

index none of them
index one of them (i.e. pick one of the three URLs)
index more than one of them (similar concept to parked domains)
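
How a real spider detects duplicates is not known here. As an illustration only, the simplest possible check is to hash each fetched page body and group the URLs that produce the same hash. The function name group_duplicates, the sample URLs and the page bodies below are all made up.

```python
import hashlib
from collections import defaultdict


def group_duplicates(pages: dict[str, str]) -> list[list[str]]:
    """Group URLs whose fetched body text is exactly identical."""
    by_digest = defaultdict(list)
    for url, body in pages.items():
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        by_digest[digest].append(url)
    # Only groups with more than one URL are duplicates.
    return [sorted(urls) for urls in by_digest.values() if len(urls) > 1]


# Hypothetical crawl result: the same page reachable under three extensions.
pages = {
    "http://mydomain.com/mypage.html": "<html>same content</html>",
    "http://mydomain.com/mypage.asp": "<html>same content</html>",
    "http://mydomain.com/mypage.php": "<html>same content</html>",
    "http://mydomain.com/other.html": "<html>different content</html>",
}

print(group_duplicates(pages))
```

From the outside, this check sees only URLs and bodies. It cannot tell whether three URLs are three copies of a file or one file served three ways.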

Originally posted by Kal
does this mean some engines will not be able to index such URL's? I couldn't see any "+" symbols in the Google example you gave. What is the purpose of using "+" symbols anyway, or is that just something the Domino server creates automatically?

I think URLs with spaces should be indexed OK, but they are crappy URLs. They HAVE to be quoted or escaped or look what happens:

http://www.searchmechanics.com/prepare to be amazed.htm

The "+" is an automated thing. It's not Domino, it's the HTTP specification. A space should always be represented by a + or a %20.
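
The two escapes can be compared using Python's standard urllib.parse module. In that library %20 is the escape produced by quote, while "+" is the form-style escape produced by quote_plus and only decoded back to a space by unquote_plus. Whether a particular server or spider treats "+" as a space inside the path is not something this sketch settles. The whitespace split at the end is only a stand-in for a naive auto-linker.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

name = "prepare to be amazed.htm"

path_style = quote(name)       # spaces become %20
form_style = quote_plus(name)  # spaces become +

print(path_style)                # prepare%20to%20be%20amazed.htm
print(form_style)                # prepare+to+be+amazed.htm
print(unquote(form_style))       # prepare+to+be+amazed.htm (the + is left alone)
print(unquote_plus(form_style))  # prepare to be amazed.htm

# Assumption: a naive auto-linker stops at the first whitespace character.
raw = "http://www.searchmechanics.com/" + name
truncated = raw.split()[0]
print(truncated)                 # http://www.searchmechanics.com/prepare
```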

markymark
14-02-2002, 11:54/11:54AM
I don't see any difference between those two.

Well, in the first case there are two different pages (albeit with the same content). In the second case, there is only the one page.

I have been trying to get the programmer to phase out the other extensions, but with 2,000 plus pages this is taking time.

pageoneresults
14-02-2002, 11:56/11:56AM
Can anyone let me know if they think using underscores as in domain/page_name.htm is ok or should I change it to domain/page-name.htm or perhaps I'm obsessing?

Okay, this is a debate that I've been involved with many times. Hyphens and underscores represent a space to the SEs. Do not, I repeat, do not use file names with spaces. As Alan says, it is bad practice, and from my understanding there are some problems with indexing those.

Now on to the hyphen vs. underscore debate. The first thing I bring people's attention to when discussing this is what the user sees when that url becomes a hyperlink. If you have an underscore and the link underline is just right, you don't see the underscore. Many people will think it is a space and may type it that way. Hyphens eliminate that concern and I just think they look better than underscores.

I'm attracted to hyphens!;)

Alan Perkins
14-02-2002, 11:58/11:58AM
That's an insider's point of view, Mark. How can an "outsider" (such as a spider) differentiate the two?

pageoneresults
14-02-2002, 12:01/12:01PM
I have been trying to get the programmer to phase out the other extensions, but with 2,000 plus pages this is taking time.

Is there something you can do with the robots.txt file to keep those extensions from being indexed?

highman
14-02-2002, 12:05/12:05PM
That's an insider's point of view, Mark. How can an "outsider" (such as a spider) differentiate the two.

different file size / page layout? maybe

ihelpyou
14-02-2002, 12:05/12:05PM
Hey Kal, I'm not saying that underscores are better, just saying that I prefer them to hyphens based on my 'not so substantial' research.

It's just a preference. I actually think it makes no difference in the long run. PageOne makes a good point but how many times will someone look at a link and then type that exact link including the extension into the address bar?

Normally a user will remember the domain only and type that in and not the whole url, so I think the fact that an underscore is hidden behind the hyperlink is a minor point.

markymark
14-02-2002, 14:33/02:33PM
Alan,

Good point. Hadn't thought of that. As for PageOne's robots.txt suggestion, I'm gonna try that - it will be the biggest robots.txt file in the universe, but it may work. Thanks all.
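
A file that size is better generated than typed. This is a minimal Python sketch, assuming Disallow values are matched as plain path prefixes, so every unwanted path has to be listed individually. The function name build_robots_txt, the domain and the path list are hypothetical. The result is checked with the standard urllib.robotparser module.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(paths, blocked_extensions):
    """Emit one Disallow line for every path ending in a blocked extension."""
    lines = ["User-agent: *"]
    for path in paths:
        if path.endswith(tuple(blocked_extensions)):
            lines.append(f"Disallow: {path}")
    return "\n".join(lines) + "\n"


# Hypothetical site: each page is reachable under three extensions.
paths = ["/mypage.html", "/mypage.asp", "/mypage.php", "/about.html", "/about.asp"]
robots_txt = build_robots_txt(paths, [".asp", ".php"])
print(robots_txt)

# Check the generated rules the way a well-behaved robot would read them.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
print(parser.can_fetch("*", "http://mydomain.com/mypage.asp"))   # False
print(parser.can_fetch("*", "http://mydomain.com/mypage.html"))  # True
```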