Author Topic: Need some scraping-fu help  (Read 5408 times)

Drastic

  • Need a bigger hammer...
  • Global Moderator
  • Hero Member
  • *****
  • Posts: 3084
  • Resident Redneck
    • View Profile
Need some scraping-fu help
« on: April 21, 2015, 02:20:18 PM »
I need a way to grab URLs from HTML source that is dynamically generated. Specifically, I want to extract all of the reblog post URLs from a Tumblr post that has hundreds of reblogs.

Example (not mine):

http://fishkillkidskarate.tumblr.com/post/116776397487/benefits-of-karate-for-pre-school-age-kids

Here is the code for one reblog link:
Code:
<li class="note reblog tumblelog_beirl1962pb without_commentary">
  <a rel="nofollow" class="avatar_frame" target="_blank" href="http://beirl1962pb.tumblr.com/" title="Little Rebel">
    <img src="http://assets.tumblr.com/images/default_avatar/octahedron_closed_64.png" class="avatar " alt="" /></a>
  <span class="action" data-post-url="http://beirl1962pb.tumblr.com/post/116835393441">
    <a rel="nofollow" href="http://beirl1962pb.tumblr.com/" class="tumblelog" title="Little Rebel">beirl1962pb</a> reblogged this from
    <a rel="nofollow" href="http://fishkillkidskarate.tumblr.com/" class="source_tumblelog" title="Developing Focus in Kids Through Martial Arts">fishkillkidskarate</a>
  </span>
  <div class="clear">
  </div>
</li>

-------

Two issues - 1) to get all URLs loaded in, you have to keep scrolling down to the bottom until they are all loaded.
2) The url I need is in the span tag:
 <span class="action" data-post-url="http://beirl1962pb.tumblr.com/post/116835393441">
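Pulling that attribute out of the markup can be sketched in Python with only the standard library. This is a minimal illustration, not the Web Developer plugin's method or anything Tumblr provides: the class name `PostUrlCollector` and the `sample` string are made up for the example, and it assumes every reblog note carries a `span` with a `data-post-url` attribute like the snippet above.

```python
from html.parser import HTMLParser


class PostUrlCollector(HTMLParser):
    """Collects the data-post-url attribute of every <span> it sees."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased.
        if tag != "span":
            return
        for name, value in attrs:
            if name == "data-post-url" and value:
                self.urls.append(value)


# Sample markup trimmed from the reblog note quoted above.
sample = '''
<li class="note reblog tumblelog_beirl1962pb without_commentary">
  <span class="action" data-post-url="http://beirl1962pb.tumblr.com/post/116835393441">
    <a rel="nofollow" href="http://beirl1962pb.tumblr.com/" class="tumblelog">beirl1962pb</a> reblogged this from
    <a rel="nofollow" href="http://fishkillkidskarate.tumblr.com/" class="source_tumblelog">fishkillkidskarate</a>
  </span>
</li>
'''

collector = PostUrlCollector()
collector.feed(sample)
print(collector.urls)
```

The `href` values on the `a` tags are ignored on purpose, since those point at the blog home pages rather than at the reblog posts.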

I used to be able to do this with the Web Developer plugin, but Tumblr has changed and the plugin no longer picks up the URL in the span tag.

Is there a way I can get this done fairly inexpensively? Tool, or script maybe?

JasonD

  • Inner Core
  • Hero Member
  • *
  • Posts: 1420
  • Look at THAT!!!!
    • AOL Instant Messenger - JasonDDuke
    • View Profile
    • Domain Names
    • Email
Re: Need some scraping-fu help
« Reply #1 on: April 21, 2015, 02:39:53 PM »
A relatively simple JS Bookmarklet should do it.

Quickly looking at the page - this URL seems relevant, which may make the task easier

http://fishkillkidskarate.tumblr.com/notes/116776397487/Iw7M6PTwi?from_c=1429462202&large=true
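For anyone scripting against that URL, here is a small Python sketch (standard library only) that splits it into its parts. The second path segment matches the post ID in the original post URL. What the third segment and the `from_c` parameter mean is not confirmed anywhere in this thread, so treat any interpretation of them as a guess.

```python
from urllib.parse import urlsplit, parse_qs

notes_url = ("http://fishkillkidskarate.tumblr.com/notes/116776397487/"
             "Iw7M6PTwi?from_c=1429462202&large=true")

parts = urlsplit(notes_url)

# Path segments without the empty strings produced by leading slashes.
segments = [s for s in parts.path.split("/") if s]

# parse_qs returns each query parameter as a list of values.
query = parse_qs(parts.query)

print(parts.netloc)
print(segments)
print(query)
```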

Drastic

Re: Need some scraping-fu help
« Reply #2 on: April 21, 2015, 02:58:59 PM »
Thanks Jason, that does look like it might be an easier way to get the full display of all URLs. It still puts the post URL in the span tag, though, so my off-the-shelf stuff can't extract them.

How should I go about getting a bookmarklet built for this?

BoL

  • Inner Core
  • Hero Member
  • *
  • Posts: 1205
    • View Profile
Re: Need some scraping-fu help
« Reply #3 on: April 21, 2015, 05:19:22 PM »
Here's some PHP that'd do it



Drastic

Re: Need some scraping-fu help
« Reply #4 on: April 21, 2015, 06:31:36 PM »
Awesome, BoL! That works great!

Hacked it around a bit and made a form = perfect!!!


How much do I owe you?

Drastic

Re: Need some scraping-fu help
« Reply #5 on: April 21, 2015, 06:34:45 PM »
Actually I spoke too soon, it is extracting the same first 50 urls repeatedly until the page is finished loading them.

Easy fix?

BoL

Re: Need some scraping-fu help
« Reply #6 on: April 21, 2015, 06:41:24 PM »
Ah, yes, I forgot a line in the while loop. Add this as the last line inside the while() loop, so the match runs again on each newly fetched page:

preg_match_all("'data-post-url=\"(.+)\"'Uims",$file,$m);
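Here is a rough Python sketch of why the loop kept returning the same first 50 URLs and why re-running the match fixes it. It is not the PHP from this thread. `collect_post_urls`, `fetch_page`, the `data-next` marker, and the fake pages are all hypothetical stand-ins, and how Tumblr actually links one batch of notes to the next is an assumption here.

```python
import re

# Python counterpart of the PHP pattern above: the U modifier (ungreedy)
# becomes the lazy quantifier +?, and i/m/s map to re.I / re.M / re.S.
POST_URL_RE = re.compile(r'data-post-url="(.+?)"', re.I | re.M | re.S)

# Hypothetical marker for the next batch of notes; the real page may differ.
NEXT_RE = re.compile(r'data-next="(.+?)"')


def collect_post_urls(start_url, fetch_page, max_pages=100):
    """Walk the notes pages and return every post URL once, in order."""
    seen = set()
    urls = []
    url = start_url
    pages = 0
    while url and pages < max_pages:
        html = fetch_page(url)
        pages += 1
        # The important step: match against the page fetched in THIS pass.
        # Matching only once before the loop reuses the first page's results.
        for post_url in POST_URL_RE.findall(html):
            if post_url not in seen:
                seen.add(post_url)
                urls.append(post_url)
        nxt = NEXT_RE.search(html)
        url = nxt.group(1) if nxt else None
    return urls


# Fake two-page site so the sketch runs without touching the network.
FAKE_PAGES = {
    "page1": '<span data-post-url="http://a.example/post/1"></span>'
             '<span data-post-url="http://b.example/post/2"></span>'
             '<a data-next="page2">more</a>',
    "page2": '<span data-post-url="http://c.example/post/3"></span>',
}

result = collect_post_urls("page1", FAKE_PAGES.__getitem__)
print(result)
```

The `seen` set and the `max_pages` cap are extra safety nets in case a page repeats itself or links back to an earlier one. They are not part of the one-line fix above.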

You don't owe me anything, though you're welcome to get a domain at account . the . domain . name :)

If you're looking for deactivated tumblr accounts I may have a list based on some old rank checking.


JasonD

Re: Need some scraping-fu help
« Reply #7 on: April 21, 2015, 06:53:52 PM »
> get a domain at account . the . domain . name
:)

Drastic

Re: Need some scraping-fu help
« Reply #8 on: April 21, 2015, 07:03:20 PM »
Quick work man, thanks so much, I needed this today and this got me sorted.

I set up an account, love the new system, you guys have come a long way!

Next domains I buy I'm trying you guys out for sure, for at least a domain or two. I've got about 30 in the hopper waiting to be built atm.
https://account.the.domain.name/

Thanks again guys.

added:
>If you're looking for deactivated tumblr accounts I may have a list based on some old rank checking.
Definitely! lmk what you have when you have time.
« Last Edit: April 21, 2015, 07:05:06 PM by Drastic »