Need some scraping-fu help

Drastic · April 21, 2015, 02:20:18 PM

I need a way to grab urls from html source that is dynamically generated. Specifically, I want to extract all of the reblog post urls from a tumblr post that has hundreds of reblogs.

Example (not mine):

http://fishkillkidskarate.tumblr.com/post/116776397487/benefits-of-karate-for-pre-school-age-kids

Here is the code for one reblog link:

<li class="note reblog tumblelog_beirl1962pb without_commentary">
  <a rel="nofollow" class="avatar_frame" target="_blank" href="http://beirl1962pb.tumblr.com/" title="Little Rebel">
    <img src="http://assets.tumblr.com/images/default_avatar/octahedron_closed_64.png" class="avatar " alt="" /></a>
  <span class="action" data-post-url="http://beirl1962pb.tumblr.com/post/116835393441">
    <a rel="nofollow" href="http://beirl1962pb.tumblr.com/" class="tumblelog" title="Little Rebel">beirl1962pb</a> reblogged this from 
    <a rel="nofollow" href="http://fishkillkidskarate.tumblr.com/" class="source_tumblelog" title="Developing Focus in Kids Through Martial Arts">fishkillkidskarate</a>
  </span>
  <div class="clear">
  </div>
</li>

-------

Two issues - 1) to get all urls loaded in, you have to keep scrolling down to the bottom until they are all loaded.
2) The url I need is in the span tag:
<span class="action" data-post-url="http://beirl1962pb.tumblr.com/post/116835393441">

Used to be able to do this with web developer plugin, but tumblr has changed and the plugin is not picking up the url in the span tag.

Is there a way I can get this done fairly inexpensively? Tool, or script maybe?

JasonD · April 21, 2015, 02:39:53 PM

A relatively simple JS Bookmarklet should do it.

Quickly looking at the page - this URL seems relevant, which may make the task easier

http://fishkillkidskarate.tumblr.com/notes/116776397487/Iw7M6PTwi?from_c=1429462202&large=true
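
A minimal sketch of what the core of such a bookmarklet could look like, assuming the reblog entries are already present in the page markup. The function name `extractPostUrls` is made up for illustration; it only relies on the `data-post-url` attribute shown in the snippet above.

```javascript
// Pull every data-post-url value out of a chunk of notes HTML.
// Regex-based, so it works on raw markup as well as on a live page.
function extractPostUrls(html) {
  const pattern = /data-post-url="([^"]+)"/g;
  const urls = [];
  let match;
  while ((match = pattern.exec(html)) !== null) {
    urls.push(match[1]);
  }
  // Drop duplicates while keeping first-seen order.
  return Array.from(new Set(urls));
}

// In a browser (for example as the body of a bookmarklet), run it
// against the markup of the page currently being viewed.
if (typeof document !== "undefined") {
  console.log(extractPostUrls(document.documentElement.outerHTML).join("\n"));
}
```

Run on the notes page, this would print one reblog post url per line to the console. It only sees what has been loaded so far, so the scrolling issue still applies.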

Drastic · April 21, 2015, 02:58:59 PM

Thanks Jason, that does look like it might be an easier way to get the full list of urls displayed. It still puts the post url in the span tag though, so my off-the-shelf stuff can't extract it.

How should I go about getting a bookmarklet built for this?

BoL · April 21, 2015, 05:19:22 PM

Here's some PHP that'd do it

Drastic · April 21, 2015, 06:31:36 PM

Awesome, BoL! That works great!

Hacked it around a bit and made a form = perfect!!!

How much do I owe you?

Drastic · April 21, 2015, 06:34:45 PM

Actually I spoke too soon, it is extracting the same first 50 urls repeatedly until the page is finished loading them.

Easy fix?

BoL · April 21, 2015, 06:41:24 PM

Ah, yes, I forgot a line in the while loop. Add this as the last line inside the while() loop:

preg_match_all("'data-post-url=\"(.+)\"'Uims",$file,$m);
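
The symptom reported above (the same first 50 urls coming back on every pass) is what happens when the match is run once and not repeated for each newly fetched page. Here is a rough sketch of the corrected loop shape in JavaScript; it is not the PHP script from this thread. `collectAllPostUrls`, `fetchPage` and `findNextUrl` are hypothetical names, and how tumblr links to the next page of notes is left to the caller rather than assumed here.

```javascript
// Walk a chain of pages, collecting data-post-url values from each one.
// fetchPage(url) returns an HTML string; findNextUrl(html) returns the
// next page's url or null. Both are hypothetical, caller-supplied callbacks.
function collectAllPostUrls(startUrl, fetchPage, findNextUrl) {
  const found = new Set();   // unique post urls, in first-seen order
  const visited = new Set(); // pages already fetched, to avoid cycles
  let url = startUrl;
  while (url && !visited.has(url)) {
    visited.add(url);
    const html = fetchPage(url);
    // Re-run the match on the page just fetched. Matching only once,
    // before the loop, would keep returning the first page's urls.
    for (const m of html.matchAll(/data-post-url="([^"]+)"/g)) {
      found.add(m[1]);
    }
    url = findNextUrl(html);
  }
  return Array.from(found);
}
```

The `visited` set is only a guard against a page that links back to one already fetched; the important part is that the extraction happens inside the loop, on the fresh page.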

You don't owe me anything, though you're welcome to get a domain at account . the . domain . name

If you're looking for deactivated tumblr accounts I may have a list based on some old rank checking.

JasonD · April 21, 2015, 06:53:52 PM

> get a domain at account . the . domain . name

Drastic · April 21, 2015, 07:03:20 PM

Quick work man, thanks so much, I needed this today and this got me sorted.

I set up an account, love the new system, you guys have come a long way!

Next domains I buy I'm trying you guys out for sure, for at least a domain or two. I've got about 30 in the hopper waiting to be built atm.
https://account.the.domain.name/

Thanks again guys.

added:
>If you're looking for deactivated tumblr accounts I may have a list based on some old rank checking.
Definitely! lmk what you have when you have time.