Need some scraping-fu help

Started by Drastic, April 21, 2015, 02:20:18 PM

Drastic

I need a way to grab URLs from HTML source that is dynamically generated. I am looking for a way to extract all of the reblog post URLs from a Tumblr post that has hundreds of reblogs.

Example (not mine):

http://fishkillkidskarate.tumblr.com/post/116776397487/benefits-of-karate-for-pre-school-age-kids

Here is the code for one reblog link:
<li class="note reblog tumblelog_beirl1962pb without_commentary">
  <a rel="nofollow" class="avatar_frame" target="_blank" href="http://beirl1962pb.tumblr.com/" title="Little Rebel">
    <img src="http://assets.tumblr.com/images/default_avatar/octahedron_closed_64.png" class="avatar " alt="" /></a>
  <span class="action" data-post-url="http://beirl1962pb.tumblr.com/post/116835393441">
    <a rel="nofollow" href="http://beirl1962pb.tumblr.com/" class="tumblelog" title="Little Rebel">beirl1962pb</a> reblogged this from
    <a rel="nofollow" href="http://fishkillkidskarate.tumblr.com/" class="source_tumblelog" title="Developing Focus in Kids Through Martial Arts">fishkillkidskarate</a>
  </span>
  <div class="clear">
  </div>
</li>


-------

Two issues:
1) To get all the URLs loaded, you have to keep scrolling down to the bottom until they are all loaded.
2) The URL I need is in the span tag:
<span class="action" data-post-url="http://beirl1962pb.tumblr.com/post/116835393441">
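
For illustration, here is a minimal sketch of pulling that attribute out of saved page source with Python's standard-library HTML parser. It assumes the markup matches the snippet above (a span with class "action" carrying data-post-url); the names PostUrlParser and extract_post_urls are made up for this example, and it does nothing about the scrolling issue.

```python
from html.parser import HTMLParser


class PostUrlParser(HTMLParser):
    """Collects the data-post-url value of every <span class="action">."""

    def __init__(self):
        super().__init__()
        self.post_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attr_map = dict(attrs)
        # A valueless attribute arrives as None, so fall back to "".
        classes = (attr_map.get("class") or "").split()
        url = attr_map.get("data-post-url")
        if "action" in classes and url:
            self.post_urls.append(url)


def extract_post_urls(html):
    """Return the reblog post URLs found in an HTML string, in page order."""
    parser = PostUrlParser()
    parser.feed(html)
    parser.close()
    return parser.post_urls


# Trimmed copy of the reblog note quoted above.
sample = '''<li class="note reblog tumblelog_beirl1962pb without_commentary">
  <span class="action" data-post-url="http://beirl1962pb.tumblr.com/post/116835393441">
    <a rel="nofollow" href="http://beirl1962pb.tumblr.com/" class="tumblelog">beirl1962pb</a> reblogged this from
  </span>
</li>'''

print(extract_post_urls(sample))
```

This only works on source that already contains the notes, e.g. a page saved after scrolling to the bottom.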

I used to be able to do this with the Web Developer plugin, but Tumblr has changed and the plugin no longer picks up the URL in the span tag.

Is there a way I can get this done fairly inexpensively? Tool, or script maybe?

JasonD

A relatively simple JS Bookmarklet should do it.

Quickly looking at the page - this URL seems relevant, which may make the task easier:

http://fishkillkidskarate.tumblr.com/notes/116776397487/Iw7M6PTwi?from_c=1429462202&large=true

Drastic

Thanks Jason, that does look like it might be an easier way to get the full display of all URLs. It still puts the post URL in the span tag though, so my off-the-shelf stuff can't extract them.

How should I go about getting a bookmarklet built for this?

BoL


Drastic

Awesome, BoL! That works great!

Hacked it around a bit and made a form = perfect!!!


How much do I owe you?

Drastic

Actually I spoke too soon: it is extracting the same first 50 URLs repeatedly until the page has finished loading them.

Easy fix?

BoL

Ah, yes, I forgot a line in the while loop. Add this as the last line inside the while() loop:

preg_match_all("'data-post-url=\"(.+)\"'Uims",$file,$m);
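
To show why that line matters, here is a rough Python sketch of the same idea: the match has to be re-run on each newly fetched batch, otherwise the loop keeps reporting the matches from the first batch. The surrounding script is not shown in this thread, so this is not a reconstruction of it. The function collect_post_urls, the chunks list, and the example.com-style URLs are all made up for illustration, and the de-duplication step is an extra that the PHP line does not do.

```python
import re

# Python counterpart of the PHP pattern: the "U" (ungreedy) modifier becomes
# ".+?", and i, m, s map to IGNORECASE, MULTILINE and DOTALL.
POST_URL_RE = re.compile(
    r'data-post-url="(.+?)"',
    re.IGNORECASE | re.MULTILINE | re.DOTALL,
)


def collect_post_urls(chunks):
    """Extract post URLs from each chunk in turn, skipping duplicates."""
    seen = set()
    urls = []
    for chunk in chunks:  # stands in for the while() loop
        # Re-run the match on the chunk just fetched. Reusing the matches
        # from the first chunk yields the same URLs over and over.
        for url in POST_URL_RE.findall(chunk):
            if url not in seen:
                seen.add(url)
                urls.append(url)
    return urls


# Hypothetical chunks standing in for successive batches of notes HTML.
chunks = [
    '<span class="action" data-post-url="http://a.example/post/1">',
    '<span class="action" data-post-url="http://b.example/post/2">',
    '<span class="action" data-post-url="http://a.example/post/1">',
]

print(collect_post_urls(chunks))
```

How each batch is fetched is deliberately left out, since that depends on how the notes page paginates.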

You don't owe me anything, though you're welcome to get a domain at account . the . domain . name :)

If you're looking for deactivated tumblr accounts I may have a list based on some old rank checking.


JasonD

> get a domain at account . the . domain . name
:)

Drastic

Quick work man, thanks so much, I needed this today and this got me sorted.

I set up an account, love the new system, you guys have come a long way!

Next domains I buy I'm trying you guys out for sure, for at least a domain or two. I've got about 30 in the hopper waiting to be built atm.
https://account.the.domain.name/

Thanks again guys.

added:
>If you're looking for deactivated tumblr accounts I may have a list based on some old rank checking.
Definitely! lmk what you have when you have time.