How to remove duplicate URLs and irrelevant pages from RSS feeds with SimplePie

At the end of my last post about using SimplePie, I mentioned that I did not expect to have any issues with removing duplicate URLs. As it turned out, both duplicate URLs and irrelevant pages showed up in my results. By irrelevant pages, I mean pages where reclaiming a brand mention would not make any sense. In this post, I’ll showcase the modifications I made to the code snippets from my last post. For those who wish to see both posts combined, there will be an excellent tutorial on branded link reclamation coming up in the next month.

Using Regex to Filter Out Irrelevant Titles

Most of the irrelevant results came from pages trying to sell products related to the brand. While I could have still used an array to look for specific substrings in the titles, I decided to use a regex pattern instead. Regex is not only more flexible, but also requires less code to filter out results. Instead of looping through each string in the array and comparing it to the item in the RSS feed, only one check per item is needed.

<?php $titlePattern = "/(U|u)sed|(f|F)or.(S|s)ale/"; ?>
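
To get a feel for what this pattern catches, here is a minimal sketch that runs it against a few made-up titles. The helper name isIrrelevantTitle and the sample titles are illustrations only and are not part of the original snippets or of SimplePie.

```php
<?php
// Same pattern as above: "Used"/"used", or "for"/"For" + any one character + "sale"/"Sale"
$titlePattern = "/(U|u)sed|(f|F)or.(S|s)ale/";

// Hypothetical helper (not part of SimplePie): true when a title should be filtered out
function isIrrelevantTitle(string $title, string $pattern): bool
{
    // preg_match returns 1 on a match, 0 on no match, false on error
    return preg_match($pattern, $title) === 1;
}

// Made-up sample titles for illustration
$titles = array(
    'Used Acme Widget in great condition',
    'Acme Widget For Sale',
    'Acme announces a new widget line',
);

foreach ($titles as $title) {
    echo $title . ' => ' . (isIrrelevantTitle($title, $titlePattern) ? 'filtered' : 'kept') . "\n";
}
```

Because the pattern only allows the first letter of each word to vary in case, an all-caps title such as "ACME WIDGET FOR SALE" would slip through. If that matters for your feed, adding the i modifier (for example /used|for.sale/i) is one way to make the whole pattern case-insensitive.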

As you can see, the filter looks quite different from the one created previously for domain names. If you look below, you’ll see that instead of checking the title against each string with stristr, you’ll need to use preg_match, which checks for all the combinations in the regex pattern in a single call.

<?php
// Flag: 1 means the item failed a filter and should be skipped
$filtration = 0;

// Title check: preg_match returns 1 when the pattern matches
if (preg_match($titlePattern, $item->get_title()) === 1) {
    $filtration = 1;
} else {
    // If the title check passes, check the domain
    foreach ($filter as $token) {
        if (stristr($item->get_permalink(), $token) !== false) {
            $filtration = 1;
            break;
        }
    }
}
?>
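
The same two-step check can be pulled into a small function so it is easier to try outside of the feed loop. This is only a sketch: shouldFilter is a hypothetical helper, and the contents of $filter (the domain list from the previous post) are placeholder values made up for this example.

```php
<?php
// Hypothetical helper: returns true when an item fails the title or domain filter
function shouldFilter(string $title, string $permalink, string $titlePattern, array $filter): bool
{
    // Title check first: a single regex call covers every unwanted phrase
    if (preg_match($titlePattern, $title) === 1) {
        return true;
    }
    // Title passed, so check the permalink against each unwanted domain token
    foreach ($filter as $token) {
        if (stristr($permalink, $token) !== false) {
            return true;
        }
    }
    return false;
}

$titlePattern = "/(U|u)sed|(f|F)or.(S|s)ale/";
$filter = array('ebay.com', 'amazon.com'); // placeholder domains, assumed for this example

var_dump(shouldFilter('Used Acme Widget', 'https://example.com/post', $titlePattern, $filter));
var_dump(shouldFilter('Acme launches a widget', 'https://www.ebay.com/itm/1', $titlePattern, $filter));
var_dump(shouldFilter('Acme launches a widget', 'https://news.example.org/acme', $titlePattern, $filter));
```

Inside the feed loop, you would call it with (string) $item->get_title() and (string) $item->get_permalink(); the casts are a precaution in case an item comes back without a title or link.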

How to Remove Duplicate URLs & Preserve Relevant Information

You’ll notice that there’s a new variable included in this filter called $filtration. It works as a flag that tells the rest of the loop that the item did not pass the title or domain filters. The next step is to remove all duplicate items while preserving only the information we want to display. Unlike the answer in this Stack Overflow question, the filtered feed stores items that include both the title and the permalink instead of just the title.

<?php
// Keep only items that passed the filters; keying by title removes duplicates
if ($filtration != 1) {
    $filteredFeed[$item->get_title()] = array(
        'title' => $item->get_title(),
        'permalink' => $item->get_permalink(),
    );
}
endforeach; ?>

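If you were wondering how this removes duplicate URLs, it’s because the items are stored in an array keyed by the title of the RSS feed item. Array keys are unique, so whenever a later item has the same title as an earlier one, it overwrites the earlier entry instead of being added again, and only one copy stays in the list! Keep in mind that this treats any two items with the same title as the same article.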
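
Here is a small standalone sketch of that overwrite behavior, using plain arrays in place of SimplePie items. The dedupeBy helper and the sample data are hypothetical; the second call shows that keying by permalink instead would deduplicate on the URL itself rather than on the title.

```php
<?php
// Hypothetical helper: keeps one entry per distinct value of $field (later entries win)
function dedupeBy(array $items, string $field): array
{
    $unique = array();
    foreach ($items as $item) {
        // Assigning to an existing key overwrites it, so duplicates collapse
        $unique[$item[$field]] = $item;
    }
    return $unique;
}

// Made-up items: the first two share a title but have different permalinks
$items = array(
    array('title' => 'Acme launches a widget', 'permalink' => 'https://example.com/a'),
    array('title' => 'Acme launches a widget', 'permalink' => 'https://example.org/b'),
    array('title' => 'Acme hires a new CEO', 'permalink' => 'https://example.com/c'),
);

echo count(dedupeBy($items, 'title')) . "\n";     // 2
echo count(dedupeBy($items, 'permalink')) . "\n"; // 3
```

Which key to use depends on what counts as a duplicate for you: the title catches the same story syndicated on several URLs, while the permalink only catches exact repeats of the same link.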
