Hey, I'm back

I've been meaning to post that I've been busy elsewhere.

Obviously, I have not posted.

Scraping Google News

Darius Kazemi's Two Headlines bot scrapes topics from Google News.
His bot still works over at https://twitter.com/TwoHeadlines.
But the code, freshly pulled from GitHub, does not retrieve headlines.
The code searches for elements with a class named topic, but those no longer seem to be present in the page.

What do appear to be topic links can be found in markup like the following sample line:

<a class="esc-topic-link" href="https://news.google.com/news/section?cf=all&ned=us&hl=en&q=Kobe+Bryant&ict=clu_bl"> Kobe Bryant »

The code used to build the main subject pages still seems accurate.
The above line was pulled from https://news.google.com/news/section?ned=us&topic=s
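
To sanity-check that observation without running the whole bot, here is a rough sketch that pulls the href and link text out of markup shaped like the sample line. The function name extractTopicLinks is made up for illustration, and the regex assumes class comes before href with double-quoted attributes, exactly as in the sample. A real HTML parser would be sturdier than this.

```javascript
// Hypothetical helper: extractTopicLinks is an illustrative name, not part of the bot.
function extractTopicLinks(html) {
  // Assumption: class precedes href and both are double-quoted, as in the sample line.
  var linkRe = /<a\s+class="esc-topic-link"\s+href="([^"]*)"[^>]*>([^<]*)/g;
  var topics = [];
  var match;
  while ((match = linkRe.exec(html)) !== null) {
    // The link text still carries the trailing » at this point; only whitespace is trimmed.
    topics.push({ url: match[1], name: match[2].trim() });
  }
  return topics;
}

var sample = '<a class="esc-topic-link" href="https://news.google.com/news/section?cf=all&ned=us&hl=en&q=Kobe+Bryant&ict=clu_bl"> Kobe Bryant \u00BB';
console.log(extractTopicLinks(sample));
```

The name comes back with the » still attached, which is why it needs cleaning up afterwards.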

The new topic-grabbing code is as follows:

        // clean up name: ' Kaspersky Lab »\r\n'
        var nbspre = /\xA0|&nbsp;/g; // remove non-breaking spaces (literal character or entity)
        var rdaqre = /\xBB/g; // remove right-double-angled-quote
        topic.name = this.text().replace(nbspre, '').replace(/\r\n/g, '').replace(rdaqre, '').trim();
        topic.url = baseUrl + this.attr('href');
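
The name clean-up can be tried on its own, outside the scraper. This is a minimal sketch, assuming the raw link text looks like the ' Kaspersky Lab »\r\n' example in the comment above. cleanTopicName is a name invented here, and swapping a non-breaking space for a plain space (rather than deleting it) is a choice made for this sketch, so that words on either side do not get glued together.

```javascript
// Hypothetical helper: cleanTopicName is not part of the original bot.
function cleanTopicName(raw) {
  var nbspRe = /\u00A0|&nbsp;/g; // non-breaking space, as a literal character or an entity
  var newlineRe = /\r?\n/g;      // CRLF or bare LF
  var raquoRe = /\u00BB/g;       // right-pointing double angle quotation mark
  return raw
    .replace(nbspRe, ' ')        // assumption: keep a plain space in its place
    .replace(newlineRe, '')
    .replace(raquoRe, '')
    .trim();
}

console.log(JSON.stringify(cleanTopicName(' Kaspersky Lab \u00BB\r\n')));
```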

Checking for the topic name within a headline needed to be case-insensitive, and so did the replacement:

      if (headline.toLowerCase().indexOf(topic.name.toLowerCase()) > -1) {
        getTopics(categoryCodes.pickRemove()).then(function(topics) {
          var newTopic = topics.pick();
          console.log('newtopic: ' + newTopic.name);
          // should be a case-insensitive match; escape regex metacharacters in the name first
          var nameRe = new RegExp(topic.name.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'), 'gi');
          var newHeadline = headline.replace(nameRe, newTopic.name);
          console.log('orig: ' + headline + '\nnew: ' + newHeadline);
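
The swap itself is easy to test in isolation. Below is a small sketch of the case-insensitive replacement; escapeRegExp and swapTopic are names invented for this example and do not appear in the bot. It assumes a topic name may contain characters like . or ( that mean something in a regular expression, so they get escaped before the pattern is built.

```javascript
// Hypothetical helpers: escapeRegExp and swapTopic are illustrative names, not from the bot.
function escapeRegExp(text) {
  // Backslash-escape every character that has special meaning in a regular expression.
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

function swapTopic(headline, oldName, newName) {
  var nameRe = new RegExp(escapeRegExp(oldName), 'gi');
  // A function replacement keeps '$' sequences in newName from being interpreted.
  return headline.replace(nameRe, function () { return newName; });
}

console.log(swapTopic('KOBE BRYANT retires; Kobe Bryant thanks fans', 'Kobe Bryant', 'Goats'));
```

Without the escaping, a topic such as "A.I." would also match things like "AXIX", since each unescaped dot matches any character.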

Wait.... why am I doing this?

Because I'm going to make another bot that will work with headlines.
And I hate spending weeks reinventing wheels.
I prefer to gank them off somebody else's car, and modify them as required.
Think the metaphor breaks down a bit near the end; never heard of somebody modifying a tire...

Anyway, I needed headlines for a bot, and TwoHeadlines was the logical place to look.
Now that I have it working, and understand the current state of Google News headline scraping, I can move on to modifying it to work the way I need it to.

And how is that? Goats.

Blog Categories should be in a rough word-cloud

TODO: do it

I did it for my work-wiki; not sure why not here.
Because I was trying to leave this more "original" and untouched?