Another Scraper Site


grungee
21-12-2006, 19:15/07:15PM
I just found another scraper site, zta(dot)net. They have scraped my site; luckily the SSL warning tells users that the site isn't real. I know there is a way to use mod_rewrite on their IP so I can send them back to themselves, but I can't seem to find the thread. Any ideas, please?

grungee
21-12-2006, 20:33/08:33PM
I found it. I used:

RewriteCond %{REMOTE_HOST} !123\.45\.67\.89+
RewriteRule (.*) http://www.scraper.com/$1 [R]

grungee
21-12-2006, 21:03/09:03PM
Nope, it seems to redirect everybody. Back to the drawing board.

grungee
21-12-2006, 21:38/09:38PM
Now I think I've got it:

RewriteCond %{REMOTE_ADDR} ^123.456.78.9
RewriteCond %{HTTP_HOST} !123\.456\.78\.9+
RewriteRule /* http://www.scrapersite.com/ [R,L]

anyone care to comment?
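
A note on the two attempts above: the first one redirected everybody because the leading "!" negates the condition, so the rule fired for every visitor except the listed address. In the second, the HTTP_HOST line is effectively always true (the Host header normally carries the site's name, not an IP), so REMOTE_ADDR does all the work. A tightened sketch might look like this, assuming mod_rewrite is enabled; 203.0.113.45 and www.example.com are placeholders, not the real scraper details:

```apache
# Sketch only: 203.0.113.45 and www.example.com are placeholders.
RewriteEngine On

# Match only requests coming from the scraper's address.
# No "!" here: a negated condition matches every visitor EXCEPT that address.
# Dots are escaped and the pattern is anchored so only this exact IP matches.
RewriteCond %{REMOTE_ADDR} ^203\.0\.113\.45$

# "^" matches any request path; [R] sends a 302 by default, [L] stops further rewriting.
RewriteRule ^ http://www.example.com/ [R,L]
```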

grungee
21-12-2006, 21:40/09:40PM
Incredibill, I thought you might want to know they scraped your site as well.

Dave Hawley
22-12-2006, 00:47/12:47AM
I thought Incredibill had taken measures to stop any such thing?

Arnabme
22-12-2006, 01:21/01:21AM
Hi Incredibill - Not a post for a long time.

grungee
22-12-2006, 01:27/01:27AM
They run it as a subdomain, and maybe he doesn't know about it yet. In the end my site looked like www{dot}flatpackitchens{dot}com{dot}zta{dot}net

Arnabme
22-12-2006, 02:23/02:23AM
Came across this one - news.com.com. Is this another scraper site?

grungee
23-12-2006, 05:01/05:01AM
I see they're adding to their list of scraped sites. They have added Doug's www.ihelpyou.com, www.google.com.tw, Deb's www.websavvy.cc, Quad's www.seo2seo.com and, I am assuming, many others. They don't display Google adverts or the like; they change your words into links to their other sites.

grungee
23-12-2006, 05:05/05:05AM
Originally posted by Arnabme
Came across this one - news.com.com. Is this another scraper site?

No, CNET owns .com.com. They just seem to have a funny way of doing their site: some things are .cnet.com, others are .cnet.com.com. Who knows what their reasons are.

grungee
23-12-2006, 05:11/05:11AM
Originally posted by Dave Hawley
I thought Incredibill had taken measures to stop any such thing?

Dave, it looks like they got your site, and I can't be quite sure, but it looks like they have hijacked some of your adverts.

Quadrille
23-12-2006, 06:06/06:06AM
Does this site really matter?

You found the stealing subdomains because you specifically searched - is anyone going to find them via a search engine (or any other route) in a million years?

In fact, they scrape EVERY SITE ON THE WEB - just rewrite the URL, and you'll see.

It is obviously done on the fly, with certain words being automatically made into links, and certain links being removed and/or substituted.

Technically clever - but I find it hard to worry about it.

Or am I missing something?

ihelpyou
23-12-2006, 06:07/06:07AM
What site are we talking about?? What the heck is the url anyway? I cannot decipher anything from above.

Quadrille
23-12-2006, 06:10/06:10AM
www.xxx.com.zta.net

One sneaky thing is every outgoing link has zta.net added on the end - so sooner or later, you will click on one of their advertisers.

But that's assuming you got caught in the first place!

How did you find zta in the first place?

[added:] It's a Chinese proxy server: www.zta.net - to help them evade their local censorship.

They could lose their heads for that, which means a little bit extra for sick Chinese people - two kidneys, one liver, one heart, two corneas ...

grungee
23-12-2006, 07:01/07:01AM
Yeah, it may not matter at all to you, but I found it when someone went through their site to access mine. I thought, this URL looks weird, what's going on? So someone did find it, and they could believe that the links on that page are given by us, so it's association by default, I guess. I found the other people's sites only because they were already indexed in Google; otherwise I wouldn't have mentioned them. I realise you can just change the domain name and any site is captured on the fly; it was the listings in Google that bothered me.

grungee
23-12-2006, 07:14/07:14AM
The main reason I started this thread was to ask for help on the redirect. I had wondered if I should send them to scrape the FBI site or some other site, but I had second thoughts because technically you become an accessory by helping them steal someone else's content.

Quadrille
23-12-2006, 08:07/08:07AM
I suspect that there will be the occasional visitor from China, who has no access except through a proxy - and when they surf, they are kept on the proxy server, wherever they go.

I suspect it'll only get in Google where someone places a link to the 'wrong' url, and Google will then have a duplicate site issue. So I do see a potential problem.

Is there any way to block a proxy server, that's the question. I dunno about such things!

grungee
23-12-2006, 08:44/08:44AM
Yeah, the thing that worried me was that they add their own links to my content, so it looks like I am linking to their sites. I reported it to Google through webmaster apps, so I wonder if they will remove it?

WebSavvy
23-12-2006, 13:30/01:30PM
OK, I did a little testing with a PHP file. Even though I have all of the links in my domain hardcoded with the full domain name and file name, they're still rewriting it to make it look like it's on their server.

However, they're serving all of these scraped sites off of subdomains located on this IP address:

65.110.17.34

If you add the following in your .htaccess file it will stop it.

# Let everyone fetch the 403 error page itself
<Files 403.shtml>
order allow,deny
allow from all
</Files>
# Refuse requests coming from the scraper's IP
deny from 65.110.17.34

Access websavvy through that zta.net address and you'll see what they're now being shown.

:D
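
The Order/Allow/Deny directives above are the older Apache 2.2-style access controls. On Apache 2.4 and later the same idea would usually be written with Require. A rough equivalent, assuming the host lets .htaccess files set authorization directives (AllowOverride AuthConfig), and not tested against this scraper:

```apache
# Sketch for Apache 2.4+ (mod_authz_core); not the code posted in the thread.

# Always let the 403 error page itself be served.
<Files "403.shtml">
    Require all granted
</Files>

# Allow everyone except the scraper's IP address.
<RequireAll>
    Require all granted
    Require not ip 65.110.17.34
</RequireAll>
```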

WebSavvy
23-12-2006, 14:13/02:13PM
This query: site:zta.net google

http://www.google.com/search?hl=en&lr=&q=site%3Azta.net+google&btnG=Search

Lists the first SERP with the url of:
www.google.ca.jpark.com.zta.net

That jpark.com is another scraper (proxy). So here we have one proxy eating another proxy and then indexed by google.

That's pathetic. No wonder so many domains are ending up with duplicate content issues, and why google's (and the other majors) serps are so polluted.

Remember when Google's indexed page count almost doubled (or tripled?) overnight? This might be why. Their bot hit a scraper pot.

Seems to me they'd be better off if they figured out some way to filter out whether or not it's a proxy and if it is, don't index it.

This can't be considered a quality serp for their searchers.

grungee
24-12-2006, 11:24/11:24AM
Nice message, Deb, hehe. The way they had some pages in the non-supplemental index, and the fact that a user came to my site through it, is what had me worried. It also seemed they were able to get around other forms of blocking, such as the JavaScript frame popout and stuff like that. I knew you had some measures in place and they got around them; that's why I thought it might interest some people.

Curt
27-12-2006, 03:41/03:41AM
Doug, Check out that domain at zta.net:

http://www.ihelpyou.com.zta.net/

Load of CRAP!! They copied your site content verbatim. Dupe content is what google and other engines will find. Wonder if my sites are being scraped... off to see...

Curt
27-12-2006, 03:53/03:53AM
That site is grabbing the exact code of whatever page is queried and simply displays that page with images. Even the images are somehow retrieved and served up through that server. It's a dynamic site scraper and does not actually store the pages on the actual server, but retrieves the content from the original site. I'm off to do what Websavvy suggested...

Curt
27-12-2006, 04:15/04:15AM
WebSavvy, your code did not work for me on my Linux server with Apache.

However, the code provided by grungee did work with the modification of IP address.

Use the following code:

RewriteCond %{REMOTE_ADDR} ^65.110.17.34
RewriteCond %{HTTP_HOST} !65\.110\.17\.34+
RewriteRule /* http://www.zta.net/ [R,L]

That works for me. That said, WebSavvy's code may work for others, so try one method and, if it does not work, try the other.

Again, either try WebSavvy's code:

<Files 403.shtml>
order allow,deny
allow from all
</Files>
deny from 65.110.17.34

...or grungee's code with IP address modified:

RewriteCond %{REMOTE_ADDR} ^65.110.17.34
RewriteCond %{HTTP_HOST} !65\.110\.17\.34+
RewriteRule /* http://www.zta.net/ [R,L]

DANG SCRAPER SITES!!!
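
If the goal is simply to turn the scraper away rather than bounce it back to its own site, the mod_rewrite route can also answer with a 403. A minimal sketch, assuming mod_rewrite is enabled and that 65.110.17.34 is still the address the scraper fetches from:

```apache
# Sketch only: refuse the scraper instead of redirecting it.
RewriteEngine On

# Dots are escaped and the pattern is anchored so only this exact address matches.
RewriteCond %{REMOTE_ADDR} ^65\.110\.17\.34$

# "-" means no substitution; [F] answers with 403 Forbidden.
RewriteRule ^ - [F]
```

Unlike the redirect, this gives the proxy nothing to follow; anyone browsing through it would most likely just see a Forbidden response.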

Curt
27-12-2006, 04:28/04:28AM
Do search engines actually spider content on web proxy servers? Just wondering because of this particular site being a web proxy. If nobody actually links to your site via the proxy web site (www.{yoursite.com}.zta.net) then it would seem the engines would not spider it.

Any thoughts on that?

Arnabme
27-12-2006, 07:54/07:54AM
Even my company's site came up for the url

http://www.domain.com.zta.net/en/index.asp

Need to stop this. I'll probably need to use WebSavvy's code and check.

grungee
27-12-2006, 11:27/11:27AM
Curt, Google has indexed 900-odd sites from this scraper. I only found it because a user came through from its link. Well, not even a user: their bot came through and showed the address as www.flatpackitchens.com.zta.net, and that just made me curious enough to see what's going on. I filed a spam report from the Google webmaster site, which seems to have got my site out of the list, but not the others yet.

grungee
27-12-2006, 11:33/11:33AM
Actually, I just did a check: they also have www.sitename.com.marketdata.org, but only the subdomains of scraped sites seem to be listed in Google. They also seem to use the same IP, so no need to change any code.

grungee
27-12-2006, 11:39/11:39AM
What a mess that place seems to be. This page is indexed, but if you look carefully they have scraped their own scraped site, which scraped the original site:

www.zta.net.marketdata.org.zta.net/services.html

I can see their database filling up quickly.


Here's a technical question: if I add a link on my site to their scraped site/site/site and keep adding the scraped site to the link, could I then, by sheer number of repetitions, crash their database?

WebSavvy
27-12-2006, 17:13/05:13PM
It's not a database, Tony.

There's a script I know of (PHP) where you can set it up to do dynamic sub-domains on the fly.

What they've probably done is, write something that combines a dynamic sub-domain on the fly with a content scraper (proxy). Which means it takes up ZERO space on their site except for the two scripts running this scheme which would total less than about 1 MB in size.

So, that means -- your site might not actually be "proxied/scraped" ... but as it is DYNAMIC (done on the fly) it will be shown to you simply because you've requested it to be shown to you.

Understand now?

The part about this that I find odd is, with traditional "proxies" though they may get indexed there's nothing cached in Google to look at. (shows a blank page)

BUT with this one, there are pages BOTH indexed and cached.

So, there's a bit more than the usual proxy taking place with this -- though at this point I'm not sure what, as I haven't really had the time to fully look into it.

Busy programming some stuff of my own at the moment. They're blocked through .htaccess -- so, later when I have some time I'll look into it a bit more.

grungee
29-12-2006, 04:27/04:27AM
I have done a little more looking into it. They host a couple of other sites and have the www version of the site go to the real site and the non-www version go to their scraper site.

Check out www.blueflame.net and blueflame.net: they're ripping off their own customers.

grungee
04-01-2007, 07:48/07:48AM
Just curious: does Google still have the problem with 302 hijackers?

This scraper has turned off its scraping and now 302-redirects the indexed sites to the original domains. Is this an attempt to hijack the SERPs, or just redirection done badly? Any thoughts? I haven't seen any of the SERPs being hijacked yet, but they only just turned the 302 redirects on.

ihelpyou
04-01-2007, 07:57/07:57AM
They are spammers. They think that by 302'ing to other domains, it will improve their PageRank. It's an idea going around right now that is simply false. Spammers will try anything.

grungee
04-01-2007, 08:10/08:10AM
Thanks, Doug. So they're only trying to grab the PR, and Google doesn't have the hijack problem any more?

ihelpyou
04-01-2007, 08:32/08:32AM
Nope. No hijack prob for over a year now.

Curt
04-01-2007, 11:21/11:21AM
The 302 redirect hijacking problem has been fixed by Google? About time. That problem took way, way too long to fix and messed over a good many sites while it lasted. I can't fathom why that problem existed in the first place or why it was allowed to persist for so long.

Quadrille
04-01-2007, 12:13/12:13PM
I suspect it was not so easy to separate genuine 302s from hijackers.

Even now, 302s are misused by many, many sites - and these folk were all at risk from a quick fix.

But it's a great relief that they [finally] cracked it!
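
For anyone checking their own redirects: in mod_rewrite a bare [R] sends a temporary 302, and a permanent move has to be requested explicitly. A small sketch with made-up paths and example.com as a placeholder domain:

```apache
# Sketch with placeholder names; none of these paths come from the thread.
RewriteEngine On

# A bare [R] issues a temporary 302 redirect.
RewriteRule ^temp-page\.html$ http://www.example.com/holding-page.html [R,L]

# For a move that is meant to be permanent, ask for a 301 explicitly.
RewriteRule ^old-page\.html$ http://www.example.com/new-page.html [R=301,L]
```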

Curt
04-01-2007, 12:29/12:29PM
I wonder what would make that bug in Google so difficult to fix. I guess that's what happens when the algo is so blasted complex that an SE can't do what would seem simple without goofing up some other aspect of the SERPs. I'd still like to know why it took so long...

Arnabme
11-01-2007, 06:55/06:55AM
Hi Quadrille - I heard that these are done for AdSense purposes. But when I try www.domain.com.zta.net, I find most of the pages are scraped without any AdSense.

Quadrille
11-01-2007, 06:59/06:59AM
It's often for adsense - they want your content to earn money for them.

In this case, it's a Chinese site allowing people to work around their government's censorship (I think), while promoting a few sites along the way.

Or it just might be a way of selling 'sponsored links'.

Quadrille
30-01-2007, 13:59/01:59PM
... And it seems to have bitten the dust.

Curt
30-01-2007, 21:18/09:18PM
Quadrille said:

... And it seems to have bitten the dust.
hmmm, I still see http://www.ihelpyou.com.zta.net/ going strong :(.

Doesn't seem to have bitten the dust yet. Perhaps you can elaborate on your comment.

Quadrille
31-01-2007, 05:26/05:26AM
Originally posted by Curt
Perhaps you can elaborate on your comment.

Gone yesterday.
Back today.

That's about as elaborate as it gets :)

Dave Hawley
31-01-2007, 05:32/05:32AM
There seem to be loads of sites that come and go with the zta.net extension. I can't see why they bother, as none I have seen ever rank anywhere.

Curt
31-01-2007, 11:10/11:10AM
Quadrille said:

... That's about as elaborate as it gets
That is elaborate enough for me :D

cthun
02-02-2007, 11:07/11:07AM
Report ZTA.NET to google using the form on http://www.google.com/contact/spamreport.html

If there are enough complaints, I’m sure Google will take action, hopefully blacklisting ZTA.NET from google.com results. That would remove part of the incentive ZTA.NET has for scraping/copying websites.

Here’s what I wrote: “The entire domain ZTA.NET is merely copies/scrapes of legitimate websites. Type in any DOMAINNAMEHERE.TLD.zta.net in a browser, and you’ll see that ZTA.NET is merely copying/scraping the actual websites while spamming their own/clients websites in links which are not in the original/copied websites.”

P.S. I found this site when I went to google "zta.net" after their "copied/scraped/spammed" version of my website showed up on a Google search.

ihelpyou
02-02-2007, 11:18/11:18AM
Welcome cthun! :hi:

Connie
02-02-2007, 11:34/11:34AM
Hi cthun and welcome. :hi: