I've been meaning to post that I've been busy elsewhere.
Obviously, I have not posted.
Darius Kazemi's Two Headlines bot scrapes topics from google-news.
His bot still works over at https://twitter.com/TwoHeadlines.
But the code, freshly pulled from github, does not retrieve headlines.
The code seems to be searching for elements with a class named topic, but these no longer appear to be present.
What do appear to be topic links can be seen in the following sample line:
<a class="esc-topic-link" href="https://news.google.com/news/section?cf=all&ned=us&hl=en&q=Kobe+Bryant&ict=clu_bl"> Kobe Bryant »
The code used to build the main subject pages still seems accurate.
The above line was pulled from
The new topic-grabbing code is as follows:
var nbspre = /\xC2?\xA0/g; // remove non-breaking spaces (with or without a stray \xC2 from mis-decoded UTF-8)
var rdaqre = /\xBB/g; // remove right-double-angle quote
topic.name = this.text().replace(nbspre, '').replace(/\r?\n/g, '').replace(rdaqre, '').trim();
topic.url = baseUrl + this.attr('href');
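To poke at the cleanup without hitting the live page, the replace chain can be pulled out into a little standalone function. This is only a sketch: cleanTopicName is a made-up name (it is not in the TwoHeadlines code), and it assumes the link text arrives as an already-decoded string where the non-breaking space is U+00A0 and the trailing marker is U+00BB.

```javascript
// Hypothetical helper for experimenting; not part of the bot's source.
function cleanTopicName(raw) {
  return raw
    .replace(/\xA0/g, '')    // drop non-breaking spaces
    .replace(/\r?\n/g, '')   // drop line breaks (CRLF or LF)
    .replace(/\xBB/g, '')    // drop the right-double-angle quote
    .trim();                 // strip leading/trailing whitespace
}

// Sample input modeled on the link text shown earlier.
console.log(cleanTopicName(' Kobe Bryant\xA0\xBB\r\n')); // prints "Kobe Bryant"
```

Ordinary spaces inside the name are left alone, so multi-word topics stay intact.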
Checking for the topic name within a headline needed to be case-insensitive, so the replacement became:
var newTopic = topics.pick();
console.log('newtopic: ' + newTopic.name);
// should be a case-insensitive match
var nameRe = new RegExp(topic.name, 'gi');
var newHeadline = headline.replace(nameRe, newTopic.name);
console.log('orig: ' + headline + '\nnew: ' + newHeadline);
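One thing the replacement above does not handle: a topic name with regex metacharacters in it (say, "U.S.") gets interpreted as a pattern. A possible way around that, sketched below, is to escape the name before building the RegExp. The helper names escapeRegExp and swapTopic are my own inventions for illustration, not anything from the bot.

```javascript
// Hypothetical helpers; names are illustrative, not from the bot's source.

// Backslash-escape every character that is special inside a regex.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Replace every case-insensitive occurrence of oldName with newName.
function swapTopic(headline, oldName, newName) {
  var nameRe = new RegExp(escapeRegExp(oldName), 'gi');
  // A function replacer keeps "$" sequences in newName literal.
  return headline.replace(nameRe, function () { return newName; });
}

console.log(swapTopic('KOBE BRYANT scores 40 as Kobe Bryant returns', 'Kobe Bryant', 'Goats'));
// prints "Goats scores 40 as Goats returns"
```

Without the escaping step, "U.S." would also match something like "UXSX", since each dot matches any character.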
Wait.... why am I doing this?
Because I'm going to make another bot that will work with headlines.
And I hate spending weeks reinventing wheels.
I prefer to gank them off somebody else's car, and modify them as required.
Think the metaphor breaks down a bit near the end; never heard of somebody modifying a tire...
Anyway, I needed headlines for a bot, and TwoHeadlines was the logical place to look.
Now that I have it working, and understand the current state of google-news headline scraping, I can move on to modify it to work the way I need it to.
And how is that? Goats.
TODO: do it
I did it for my work-wiki; not sure why not here.
Because I was trying to leave this more "original" and untouched?