Can't block index.html with robots.txt
Connie
18-08-2007, 14:50/02:50PM
I was using the robots.txt checker in Webmaster Tools today and discovered that you cannot block /folder/index.htm or /folder/index.html.
User-agent: *
Disallow: /folder/index.html
However, you can block:
User-agent: *
Disallow: /folder/index.php
This only affects the index.html file in a folder.
I'm just curious why you can block the index.php extension, but not the index.html extension?
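For comparison, here is a minimal sketch that runs the two rules above through Python's standard `urllib.robotparser`, which uses plain prefix matching. It does not reproduce Google's checker; the `is_allowed` helper and the `www.domain.com` host are assumptions made for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Placeholder host, assumed for illustration.
BASE = "http://www.domain.com"

HTML_RULE = "User-agent: *\nDisallow: /folder/index.html\n"
PHP_RULE = "User-agent: *\nDisallow: /folder/index.php\n"


def is_allowed(rules: str, path: str, agent: str = "Googlebot") -> bool:
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())  # parse rules from memory, no network access
    return parser.can_fetch(agent, BASE + path)


if __name__ == "__main__":
    # Under prefix matching the extension makes no difference:
    print(is_allowed(HTML_RULE, "/folder/index.html"))  # False (blocked)
    print(is_allowed(PHP_RULE, "/folder/index.php"))    # False (blocked)
    print(is_allowed(HTML_RULE, "/folder/page.html"))   # True (not covered by the rule)
```

If a checker reports /folder/index.html as allowed while this kind of prefix match reports it as blocked, the difference probably comes from how that checker normalizes the URL rather than from the file extension itself.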
SEFL
18-08-2007, 15:02/03:02PM
I've never known an extension to make a difference. How long did you give it after you put that rule in?
WebSavvy
18-08-2007, 15:19/03:19PM
Connie, that's probably because the default folder file extension for your server (set up by your host) is .html.
If you add a line to .htaccess to set the default folder file extension to .php, you'll then be able to disallow the .html file in robots.txt.
It's been a while since I've written the .htaccess code for that, so let me have a look through my code snippets library and I'll find it and post it here for you.
It might be a few hours before I can get back here, though, because today is really chaotic.
Comeran
18-08-2007, 15:40/03:40PM
I had never heard of this! Are you sure that the tool you are using to validate isn't mistaken? I can't imagine that an extension affects the robots.txt file.
Com-
Connie
18-08-2007, 17:13/05:13PM
Originally posted by Comeran
I had never heard of this! Are you sure that the tool you are using to validate isn't mistaken? I can't imagine that an extension affects the robots.txt file.
Com-
It's Google's tool. It tells you how Googlebot will treat your robots.txt file. Don't you use Webmaster Tools?
Adam, time has nothing to do with it. Everything is done in the Webmaster Tools area, in real time. You enter the rules you want to check in one section, enter a URL in another section, click the button, and you will see how Googlebot will respond to that entry.
Google is not checking your actual robots.txt file. This is all done in real time.
I was just curious as to why Google would treat an index.htm file differently from an index.php file.
Here's the situation I was checking.
I have one folder on one site where every file should be crawlable except index.html. The structure (tree) is /folder/index.html. If Googlebot found /folder/index.html and then followed the link on that page, Googlebot would get banned.
Since there are no actual links to /folder/index.html, Googlebot will probably never find the page. As a precaution, I wanted to block Googlebot from it.
Changing /folder/index.html to /folder/index.php solves the problem.
I'm just curious as to the difference.
Deb, the default server setup might have something to do with this, but that gets way beyond my understanding. :D
Danny
18-08-2007, 20:12/08:12PM
Connie,
Could it be that you are redirecting both folder/index.html and folder/index.htm to folder/?
If so, then Googlebot will never see those pages.
Googlebot requests folder/index.html, gets redirected to folder/, and checks the robots.txt file for any occurrences of /folder/ (without the index.html).
If index.php can be blocked, then that would mean there is no such redirect for that file.
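A rough sketch of that hypothesis in Python, using the standard `urllib.robotparser`: the rule is evaluated against the path the request ends up on after the redirect, not the path originally requested. The `REDIRECTS` table, the helper names, and the `www.domain.com` host are all hypothetical, and this only models the theory described above; it is not a statement of how Googlebot actually works.

```python
from urllib.robotparser import RobotFileParser

RULES = "User-agent: *\nDisallow: /folder/index.html\n"

# Hypothetical redirect table modelling the theory above.
REDIRECTS = {
    "/folder/index.html": "/folder/",
    "/folder/index.htm": "/folder/",
}


def final_path(path: str, redirects: dict) -> str:
    """Follow redirects until the path no longer changes (guarding against loops)."""
    seen = set()
    while path in redirects and path not in seen:
        seen.add(path)
        path = redirects[path]
    return path


def allowed_after_redirect(path: str, rules: str, redirects: dict,
                           agent: str = "Googlebot") -> bool:
    """Check the robots.txt rules against the post-redirect path."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    url = "http://www.domain.com" + final_path(path, redirects)
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # With the redirect, the rule for /folder/index.html never matches /folder/.
    print(allowed_after_redirect("/folder/index.html", RULES, REDIRECTS))  # True
    # Without any redirect, the same rule blocks the file.
    print(allowed_after_redirect("/folder/index.html", RULES, {}))         # False
```

Under this model, index.php stays blockable simply because no redirect entry exists for it.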
Connie
19-08-2007, 00:13/12:13AM
Good thought, Danny.
The index.htm in the root is redirected to www.domain.com.
The index file in the folder is not redirected.
I can type www.domain.com/folder/index.html into my browser and that is the page I get.
If I block the entire folder, then the index.html file is blocked, according to Google.
In this particular situation, I did not want to block the folder. I only wanted to block index.html.
Normally I block folders. This was a rare exception. I could block other files in the folder, like /folder/page.html. I just could not block /folder/index.html.
Google kept saying /folder/index.html was allowed.
I could block /folder/index.php. For all practical purposes that accomplishes the same thing.
I changed index.html to index.php. Problem solved.
I probably did a poor job of explaining.
However, if you have never tried it, I would suggest you try the robots.txt validator in Webmaster Tools.