Checking Content for Duplicate

Started by Drastic, November 09, 2010, 12:39:51 AM

Drastic

Some new content writers are using a bit too much of other authors' original work.

What do you use to check for dupes? Copyscape is a good first pass, but tweaking a word or two here and there will often fool it, and I doubt G is fooled.

Peter

I'm pretty much stuck on Copyscape. Would be interested in any alternatives.

Drastic

What I found is that a sentence with one or two words altered gets by CS. But if you paste 8-12 word sections into G, you find stuff everywhere.

What I'd want is a script that takes each sentence, halves it if it's over, say, 15 words, puts quotes around it, searches G, and spits out any fragment that returns results in the SERP. It would need proxies.

With Scrapebox, all you would need is the input: sentences/fragments, one per line, each surrounded by quotes. I think it would do the rest, but I need to check.

I guess an Excel macro could do the chopping up and quoting bit?
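
The chopping and quoting step could also be done outside Excel. Below is a minimal Python sketch of just that part, not a tested tool. The function names (`split_sentences`, `to_fragments`, `quoted_lines`) are made up for illustration, the sentence splitter is a naive assumption (it breaks on ., ! or ? followed by whitespace), and the 15-word threshold is the figure suggested above. It does no searching and no proxy handling.

```python
import re

MAX_WORDS = 15  # threshold suggested in the post; adjust to taste


def split_sentences(text):
    # Naive assumption: a sentence ends at ., ! or ? followed by whitespace.
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]


def to_fragments(text, max_words=MAX_WORDS):
    # Keep short sentences whole; cut long ones into two halves.
    fragments = []
    for sentence in split_sentences(text):
        # Drop double quotes so they cannot break the quoted query.
        words = sentence.replace('"', '').split()
        if len(words) > max_words:
            mid = len(words) // 2
            fragments.append(" ".join(words[:mid]))
            fragments.append(" ".join(words[mid:]))
        elif words:
            fragments.append(" ".join(words))
    return fragments


def quoted_lines(text):
    # One quoted fragment per line, ready to paste into a search tool.
    return ['"%s"' % fragment for fragment in to_fragments(text)]


if __name__ == "__main__":
    article = "The cat sat. a b c d e f g h i j k l m n o p."
    print("\n".join(quoted_lines(article)))
```

The output is one quoted fragment per line, which is the input format described above for Scrapebox.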

4Eyes

"Dupe free pro" lets me compare an original to a variation for uniqueness.

It's not great, it's a bit buggy, and it probably uses nothing like the same dupe detection routines as G, but it is good enough for checking my spun crap against the original.

Drastic

Yeah, I've seen that. What I'm getting from elves is an article put together from different sources. 2 paras here, 1 there, 2 more from here. Some words added, some deleted and some changed. Might be fine, but I'm not feeling it.

Drastic

Well, scratch my idea. Scrapebox still sees the results after "Results for ***my content***. (without quotes):"

So even though the content was unique, it lists the results for the unquoted search.

4Eyes

I have one of my elves do the overseeing of the article writing and distribution.

She uses many local writers, and part of her job is to do manual checking for plagiarism and blatant copying. All she really does is run Copyscape, then take a few phrases from each article and search for them by hand. I am sure that some sneak through, but I just can't think of any better way.

TallTroll

Google allows search terms of up to 32 tokens now. I've found that's enough to catch even fairly extensive spinning, if it's been done one word >> one word. If there's been proper paraphrasing, replacing one word with many and vice versa, the technique fails, but then, that's not really what you're chasing, is it?

Woz

If I am looking for dupes, I often use Copernic to set up a custom set of SEs and then search for a target phrase, either as a "phrase" and/or just the words. Copernic then polls the SEs and aggregates the results, which I then export into a database. Having the results in a database then allows me to keep an eye on known offenders.

Drastic

>not really what you're chasing, is it?
Not exactly, need more of a way to semi automate it.

>In that article I am sure it said that  CopyScape were effectively taking the content at the URL, cleaning the mark up then splitting it into n word grams. Again from memory n = 6.
Well, from what I saw it doesn't work like that.

Copyscape results didn't show any of the results G showed for a 10 word snippet with quotes, which is why I'm looking for a solution.
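
For reference, the n-word-gram idea quoted above might look like the Python sketch below. This is only an illustration of the general technique. Whether Copyscape does anything like it is unconfirmed, and n = 6 was quoted from memory. The names `shingles` and `overlap` are hypothetical.

```python
import re


def shingles(text, n=6):
    # Lowercase, keep only word-like tokens, then take every run of n words.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap(candidate, source, n=6):
    # Share of the candidate's n-word runs that also appear in the source.
    a = shingles(candidate, n)
    b = shingles(source, n)
    if not a:
        return 0.0
    return len(a & b) / len(a)


if __name__ == "__main__":
    original = "one two three four five six seven eight"
    tweaked = "one two three changed five six seven eight"
    print(overlap(original, original))
    print(overlap(tweaked, original))
```

A single changed word breaks every n-word run that contains it, so in a short passage one tweak can remove all the matches. That would be consistent with a word or two getting past a checker that works this way.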

I think I'll get someone on Fiverr to write a macro that will spit the article out into:
http://www.google.com/search?q="my 8 word snippet blah blah blah blah"
etc.

Then I can just click the links randomly using Tor in the browser, or figure out a way to feed them into Scrapebox.
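
A rough Python sketch of that macro, assuming non-overlapping 8-word snippets and the URL pattern shown above. The names `snippets` and `search_urls` and the `min_words` cutoff are made up for illustration. It only builds the links; it does not fetch anything or deal with proxies.

```python
from urllib.parse import quote_plus

BASE = "http://www.google.com/search?q="


def snippets(text, size=8):
    # Strip double quotes from the text, then cut it into runs of `size` words.
    words = text.replace('"', '').split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def search_urls(text, size=8, min_words=4):
    # Skip a short leftover tail; very short phrases match too much to be useful.
    urls = []
    for snippet in snippets(text, size):
        if len(snippet.split()) < min_words:
            continue
        # quote_plus encodes the surrounding quotes as %22 and spaces as +.
        urls.append(BASE + quote_plus('"%s"' % snippet))
    return urls


if __name__ == "__main__":
    for url in search_urls("my 8 word snippet blah blah blah blah and a tail"):
        print(url)
```

The snippet length is a parameter, so the same function covers the 8-12 word range mentioned earlier in the thread.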

littleman

This wouldn't be hard to write in Perl or a lot of other scripting languages. The main issue would be keeping a good list of proxies.

Drastic

How much time, lm?

I got three bites on Fiverr to spit out URLs that search G for snippet lengths of your choosing, using an Excel macro. Just need a way to check them through a proxy.

Drastic

>Actually all you care about is the number returned. IE how many results for each query.
The problem I had with SB is when there are no results, yet G shows results for no quotes.

Example:
http://www.google.com/search?q="Curabitur dui velit, vehicula quis lacinia eu, mollis ut lectus"

>Leave it with me for a day or 2
Wow, thanks!

Drastic

>I will see if I can knock something up anyway

Any luck? (I'll be glad to pay you.)

Drastic

Sweet!

No sweat on timeframe, I was just going to hire out to someone if you didn't have time to mess with it.