Topic: Checking Content for Duplicate (Read 16973 times)

Drastic

  • Need a bigger hammer...
  • Global Moderator
  • Hero Member
  • *****
  • Posts: 3084
  • Resident Redneck
Checking Content for Duplicate
« on: November 09, 2010, 12:39:51 AM »
Some of my new content writers are using a bit too much of other authors' original work.

What do you use to check for dupes? Copyscape is a good first pass, but tweaking a word or two here and there will often fool it, and I doubt G is fooled.

Peter

  • Inner Core
  • Full Member
  • *
  • Posts: 118
Re: Checking Content for Duplicate
« Reply #1 on: November 09, 2010, 12:46:38 AM »
I'm pretty much stuck on Copyscape. I'd be interested in any alternatives.

Drastic

Re: Checking Content for Duplicate
« Reply #2 on: November 09, 2010, 12:52:24 AM »
What I found is that altering one or two words in a sentence gets it past CS. But if you paste 8-12 word sections into G, you find stuff everywhere.

What I need is a script that would take each sentence, halve it if it were over, say, 15 words, put quotes around it, search G, and spit out any query that returned results in the SERP. It would need proxies.

With Scrapebox, all you would need is the input: sentences/fragments, one per line, each surrounded by quotes. I think it would do the rest, but I need to check.

I guess an Excel macro could do the chopping up and quoting bit?
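
The chopping and quoting step described above does not need Excel. Here is a minimal Python sketch of it, not a finished tool. The function names, the naive sentence splitter, and the 15-word threshold default are assumptions for illustration; the searching and proxy parts are left out.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def chop(sentence, max_words=15):
    """Return the sentence whole, or as two halves if it exceeds max_words."""
    words = sentence.split()
    if len(words) <= max_words:
        return [" ".join(words)]
    mid = len(words) // 2
    return [" ".join(words[:mid]), " ".join(words[mid:])]

def quoted_fragments(text, max_words=15):
    """One quoted fragment per item, ready to write out one per line."""
    fragments = []
    for sentence in split_sentences(text):
        for piece in chop(sentence, max_words):
            # Drop trailing sentence punctuation before quoting.
            fragments.append('"' + piece.rstrip(".!?") + '"')
    return fragments

if __name__ == "__main__":
    sample = "Short one here. " + " ".join(["word"] * 20) + "."
    print("\n".join(quoted_fragments(sample)))
```

The output is one quoted fragment per line, which is the input format described for Scrapebox above.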

4Eyes

  • Hero Member
  • *****
  • Posts: 817
Re: Checking Content for Duplicate
« Reply #3 on: November 09, 2010, 12:58:45 AM »
"Dupe free pro" lets me compare an original to a variation for uniqueness.

It's not great, it's a bit buggy, and it probably uses nothing like the same dupe detection routines as G, but it is good enough for checking my spun crap against the original.
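
For a rough idea of what an original-versus-variation comparison can look like, here is a small sketch using Python's standard difflib. This is not how "Dupe free pro" or G works (nobody in the thread knows that); the function name and the word-level comparison are assumptions for illustration.

```python
from difflib import SequenceMatcher

def word_similarity(original, variation):
    """Ratio in [0, 1] of matching word runs; 1.0 means identical word sequences."""
    a = original.lower().split()
    b = variation.lower().split()
    # autojunk=False keeps common words from being ignored in long texts.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()

if __name__ == "__main__":
    print(word_similarity("the quick brown fox", "the fast brown fox"))
```

A score near 1.0 means the spun version is still close to the original word for word; what counts as "unique enough" is a judgment call.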

Drastic

Re: Checking Content for Duplicate
« Reply #4 on: November 09, 2010, 01:04:34 AM »
Yeah, I've seen that. What I'm getting from elves is an article put together from different sources. 2 paras here, 1 there, 2 more from here. Some words added, some deleted and some changed. Might be fine, but I'm not feeling it.

Drastic

Re: Checking Content for Duplicate
« Reply #5 on: November 09, 2010, 01:25:46 AM »
Well, scratch my idea. Scrapebox still sees the results listed after "Results for ***my content***. (without quotes):"

So even though the content was unique, it picks up the results G lists for the search without quotes.
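
One way around this would be to check each results page for that notice before counting its results. A minimal sketch, assuming you already have the page text and that the notice reads like the line quoted above (G's real wording may differ or change). The function names and the dict input are hypothetical.

```python
import re

# Marker based on the notice quoted in the post above; treat it as an assumption.
FALLBACK_NOTICE = re.compile(r"Results for .*?\(without quotes\)", re.IGNORECASE)

def is_unquoted_fallback(page_text):
    """True when the page says it fell back to searching without quotes."""
    return FALLBACK_NOTICE.search(page_text) is not None

def exact_hits(pages):
    """pages maps phrase -> results page text; keep phrases with no fallback notice."""
    return [phrase for phrase, text in pages.items() if not is_unquoted_fallback(text)]
```

Phrases that trigger the notice had no exact match, so their listed results should not be counted as dupes.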


4Eyes

Re: Checking Content for Duplicate
« Reply #6 on: November 09, 2010, 08:43:34 AM »
I have one of my elves do the overseeing of the article writing and distribution.

She uses many local writers, and part of her job is to check manually for plagiarism and blatant copying. All she really does is run Copyscape, then take a few phrases from each article and search for them by hand. I am sure that some copying sneaks through, but I can't think of any better way.

TallTroll

  • Sr. Member
  • ****
  • Posts: 272
Re: Checking Content for Duplicate
« Reply #7 on: November 09, 2010, 09:47:10 AM »
Google allows 32 token search terms now. I've found that's enough to catch even fairly extensive spinning, if it's been done one word >> one word. If there's been proper paraphrasing, replacing one word with many and vice versa, the technique fails, but then, that's not really what you're chasing, is it?
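
To use the full 32-token allowance, an article can be cut into windows of up to 32 words. A minimal sketch; it treats each whitespace-separated word as one token, which is an assumption about how G counts, and the function name and the optional overlap (step) are illustrative.

```python
def token_windows(text, size=32, step=32):
    """Split text into chunks of up to `size` words, advancing `step` words each time."""
    words = text.split()
    windows = []
    for start in range(0, len(words), step):
        windows.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return windows

if __name__ == "__main__":
    article = " ".join("w%d" % i for i in range(70))
    for chunk in token_windows(article):
        print(len(chunk.split()), chunk[:40])
```

A step smaller than the size gives overlapping windows, so a copied passage that straddles a window boundary still shows up whole in at least one query.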

Woz

  • Tea! Black! Strong! Hot! Now!
  • Global Moderator
  • Full Member
  • *****
  • Posts: 214
Re: Checking Content for Duplicate
« Reply #8 on: November 09, 2010, 10:21:56 AM »
If I am looking for dupes, I often use Copernic to set up a custom set of SEs and then search for a target phrase, either as a "phrase" and/or just the words. Copernic then polls the SEs and aggregates the results, which I then export into a database. Having the results in a database then allows me to keep an eye on known offenders.
Courage, Courtesy and Service.
Constant and True.
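
The database side of Woz's approach can be sketched with Python's built-in sqlite3. The table layout, the function names, and the "matched at least two phrases" rule for a known offender are assumptions for illustration; how Copernic exports its results is not covered here.

```python
import sqlite3
from urllib.parse import urlparse

def open_db(path=":memory:"):
    """Open (or create) the hits database; defaults to an in-memory database."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hits ("
        " phrase TEXT NOT NULL,"
        " url TEXT NOT NULL,"
        " domain TEXT NOT NULL,"
        " UNIQUE (phrase, url))"
    )
    return conn

def record_hits(conn, phrase, urls):
    """Store each result URL for a phrase; repeated (phrase, url) pairs are ignored."""
    rows = [(phrase, u, urlparse(u).netloc.lower()) for u in urls]
    conn.executemany(
        "INSERT OR IGNORE INTO hits (phrase, url, domain) VALUES (?, ?, ?)", rows
    )
    conn.commit()

def known_offenders(conn, min_phrases=2):
    """Domains that matched at least min_phrases distinct phrases, worst first."""
    cur = conn.execute(
        "SELECT domain, COUNT(DISTINCT phrase) FROM hits"
        " GROUP BY domain HAVING COUNT(DISTINCT phrase) >= ?"
        " ORDER BY COUNT(DISTINCT phrase) DESC, domain",
        (min_phrases,),
    )
    return cur.fetchall()
```

Passing a file path to open_db keeps the history between runs, which is what makes repeat offenders visible over time.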

Drastic

Re: Checking Content for Duplicate
« Reply #9 on: November 09, 2010, 12:35:12 PM »
>not really what you're chasing, is it?
Not exactly. I need more of a way to semi-automate it.

>In that article I am sure it said that Copyscape was effectively taking the content at the URL, cleaning the markup, then splitting it into n-word grams. Again from memory, n = 6.
Well, from what I saw it doesn't work like that.

Copyscape results didn't show any of the results G showed for a 10 word snippet with quotes, which is why I'm looking for a solution.
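
For reference, this is roughly what splitting text into 6-word grams and comparing them looks like. It is only a sketch of the technique in the quoted description, which Drastic's test suggests may not be what Copyscape actually does. The function names and the simple word tokenizer are assumptions.

```python
import re

def shingles(text, n=6):
    """Set of all runs of n consecutive words (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def shared_shingle_ratio(candidate, source, n=6):
    """Fraction of the candidate's n-word grams that also occur in the source."""
    a, b = shingles(candidate, n), shingles(source, n)
    if not a:
        return 0.0
    return len(a & b) / len(a)

if __name__ == "__main__":
    src = "one two three four five six seven eight"
    print(shared_shingle_ratio("one two three CHANGED five six seven eight", src))
```

Note that one changed word in the middle of an 8-word passage breaks every 6-word gram that contains it, so a matcher built this way can miss lightly edited copy.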

I think I'll get someone on Fiverr to write a macro that will spit the article out as a list of search links like:
http://www.google.com/search?q="my 8 word snippet blah blah blah blah"
etc.

Then I can just click the links at random using Tor in the browser, or figure out a way to feed them into Scrapebox.
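
The link-generating macro could also be a few lines of Python. A minimal sketch, assuming fixed 8-word snippets and the URL pattern shown above; the function names are illustrative. The quotes and spaces are percent-encoded so the links survive copy and paste.

```python
from urllib.parse import quote_plus

def snippets(text, words_per_snippet=8):
    """Cut the text into consecutive snippets of words_per_snippet words."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_snippet])
        for i in range(0, len(words), words_per_snippet)
    ]

def search_urls(text, words_per_snippet=8, base="http://www.google.com/search?q="):
    """One exact-phrase search URL per snippet."""
    return [base + quote_plus('"' + s + '"') for s in snippets(text, words_per_snippet)]

if __name__ == "__main__":
    for url in search_urls("my 8 word snippet blah blah blah blah and a bit more"):
        print(url)
```

The last snippet may be shorter than the others; very short ones are likely to match innocently, so they may be worth dropping.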

littleman

  • Administrator
  • Hero Member
  • *****
  • Posts: 6531
Re: Checking Content for Duplicate
« Reply #10 on: November 10, 2010, 02:37:57 AM »
This wouldn't be hard to write in Perl or a lot of other scripting languages. The main issue would be keeping a good list of proxies.
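
The proxy bookkeeping itself is simple. Here is a small Python sketch of a round-robin pool that drops proxies after repeated failures. The class, its method names, and the failure threshold are all hypothetical, and nothing here makes a network request; it only tracks which proxies are still worth trying.

```python
class ProxyPool:
    """Round-robin over proxies, dropping any that fail max_failures times in a row."""

    def __init__(self, proxies, max_failures=3):
        self._failures = {p: 0 for p in proxies}
        self._order = list(self._failures)
        self._next = 0
        self.max_failures = max_failures

    def get(self):
        """Return the next proxy in rotation."""
        if not self._order:
            raise RuntimeError("no working proxies left")
        proxy = self._order[self._next % len(self._order)]
        self._next += 1
        return proxy

    def report_failure(self, proxy):
        """Count a failure and retire the proxy once it hits the limit."""
        if proxy not in self._failures:
            return
        self._failures[proxy] += 1
        if self._failures[proxy] >= self.max_failures:
            del self._failures[proxy]
            self._order.remove(proxy)

    def report_success(self, proxy):
        """Reset the consecutive-failure count after a good response."""
        if proxy in self._failures:
            self._failures[proxy] = 0
```

The caller reports the outcome of each request, so dead proxies fall out of the list on their own and the list only needs topping up now and then.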
« Last Edit: November 10, 2010, 04:13:25 AM by littleman »

Drastic

Re: Checking Content for Duplicate
« Reply #11 on: November 10, 2010, 04:24:22 PM »
How much time, lm?

I got three bites on Fiverr to spit out URLs that search G for snippet lengths of your choosing, using an Excel macro. I just need a way to check them through a proxy.

Drastic

Re: Checking Content for Duplicate
« Reply #12 on: November 10, 2010, 04:59:01 PM »
>Actually all you care about is the number returned. IE how many results for each query.
The problem I had with SB is that when there are no results for the quoted phrase, G still shows results for the same search without quotes, and SB picks those up.

Example:
http://www.google.com/search?q="Curabitur dui velit, vehicula quis lacinia eu, mollis ut lectus"

>Leave it with me for a day or 2
! Wow, thanks.

Drastic

Re: Checking Content for Duplicate
« Reply #13 on: November 18, 2010, 09:30:02 PM »
>I will see if I can knock something up anyway

Any luck? (I'll be glad to pay you.)

Drastic

Re: Checking Content for Duplicate
« Reply #14 on: November 22, 2010, 05:58:47 PM »
Sweet!

No sweat on timeframe, I was just going to hire out to someone if you didn't have time to mess with it.