Pagination leak - desperately in need of help

nroptions

46 posts

12 years ago

nroptions

I have a template that invokes some pagination:

{embed="embeds/.html-header" my_webpage_title="Websites of Interest"
 my_meta_description="A diverse list of websites focusing on issues of reproductive justice."}
{embed="embeds/.page-header" my_header="headerResources" my_navtab="resources"} 

<div id="mainContent">
 {exp:channel:entries channel="intros" url_title="websites-of-interest" disable="categories|member_data|pagination"}
  <h1>{ispec-bigicon-websites}{page_title}</h1>
  {if last_segment=="websites-of-interest"}  {!-- Only display the intro text on the first page of the list. --}
   {introductory_text}
  {/if}
 {/exp:channel:entries}
 

 {gv-sidemenu-resource-center}

 {exp:channel:entries channel="websites-of-interest" disable="categories|member_data" limit="8" orderby="title" sort="asc"}
  <div id="{switch='resourceA|resourceB'}">
   {title}
   <a href="http://{website_address}" target="_blank" rel="noopener">{website_address}</a>
   {description}
  </div>
  {snp-pagination}
 {/exp:channel:entries}
</div>

{snp-sidebar-nothome}
{snp-page-footer}

Here is {snp-pagination}:

{paginate}
        .....................................
        Page {current_page} of {total_pages} ::: {pagination_links}
 {/paginate}

Here’s what is being produced. It’s somehow grabbing the global variable “gv-donate-online-path”. Where is that coming from?!

.....................................
        Page 1 of 3 :::  <strong>1</strong> <a href="http://www.nroptions.org/resource-center/websites-of-interest/{gv-donate-online-path}/P8">2</a> <a href="http://www.nroptions.org/resource-center/websites-of-interest/{gv-donate-online-path}/P16">3</a> <a href="http://www.nroptions.org/resource-center/websites-of-interest/{gv-donate-online-path}/P8">></a>

nroptions

46 posts

12 years ago

nroptions

Additional comment to this problem:

The behavior occurs in Chrome, not Safari or Firefox. Unfortunately, Google is crawling the website, sees that junk, and is making a mess of how the page is reported in Google search results!

nroptions

46 posts

12 years ago

nroptions

I’ve been doing more research. Somehow, when Google did its crawl, it came up with a URL of:

 www.nroptions.org/resource-center/websites-of-interest/{gv-donate-online-path}

I have tried to navigate the website to find some path that would have tacked on the extra segment at the end. All the paths I found properly generate the url without the last segment.

There is nothing I can see when I “view source” that distinguishes the good vs the bad page except that pagination quirk, which obviously also picked up the bad url.

Any ideas what to look for in my EE code would be greatly appreciated.

Dan Decker

7,338 posts

12 years ago

Dan Decker

Hey nroptions,

That is certainly odd!

Is there any place in your templates you are using that global variable? In one of the embeds perhaps?

Let’s take this out of the snippet and see what the result is. Try placing the pagination tags directly in the template for a test.

Cheers,

nroptions

46 posts

12 years ago

nroptions

Yes, that global variable does exist of course. The problem is that I can’t recreate the result. Every way I generate that page within the website (from a navigation tab or a hotlink) currently creates clean code and clean URLs. Somehow, during a crawl by Google, that oddity was created or found. If I could reproduce it I’d have a fighting chance of solving the problem. I looked at the fetched code that Google found, but there is nothing in the code to distinguish it from the good pages except that junk in the pagination.

But it’s not really a pagination problem and my problem thread description is unfortunately misleading. Somehow a URL was created with that string in it. The EE pagination routine simply picks up that string and inserts it as the initial portion of the URL to get to other pages.
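To illustrate the point above, here is a minimal Python sketch of a paginator that builds its links from whatever URL was requested. The function `pagination_links` and the entry count of 24 are hypothetical stand-ins chosen to match the "Page 1 of 3" output at limit="8"; this is not EE's actual pagination code, only a model of the behaviour described.

```python
def pagination_links(current_url: str, total_entries: int, per_page: int) -> list[str]:
    """Build links to the later pages by appending a P<offset> segment.

    Hypothetical model of the behaviour described in the thread,
    not ExpressionEngine's real implementation.
    """
    base = current_url.rstrip("/")
    # Offsets for pages 2..N: per_page, 2 * per_page, ...
    offsets = range(per_page, total_entries, per_page)
    return [f"{base}/P{offset}" for offset in offsets]


clean = "http://www.nroptions.org/resource-center/websites-of-interest"
polluted = clean + "/{gv-donate-online-path}"

# 24 entries is an assumed count that gives 3 pages of 8.
clean_links = pagination_links(clean, 24, 8)
polluted_links = pagination_links(polluted, 24, 8)

for link in clean_links + polluted_links:
    print(link)
```

If the requested URL already carries the stray segment, every generated link inherits it, which matches the polluted output shown in the first post.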

nroptions

46 posts

12 years ago

nroptions

There is one method I used in my website that I am suspicious about. I have a template with essentially one line of code:

  <meta http-equiv="Refresh" content="0; url={gv-donate-online-path}">

This acts as a redirect (I don’t want to use JavaScript, .htaccess, etc. to accomplish it). Perhaps the way Google crawls causes a problem with this. Regardless, I still can’t recreate the result, so I can’t really test a change to see if it works!

Dan Decker

7,338 posts

12 years ago

Dan Decker

Hi nroptions,

“There is one method I used in my website that I am suspicious about. I have a template with essentially one line of code: <meta http-equiv="Refresh" content="0; url={gv-donate-online-path}">”

Have you tried using the ExpressionEngine method for redirection?

{redirect="{gv-donate-online-path}"}

Also, you should be able to exclude this URL in a robots.txt file to keep crawlers from accessing it.

Let me know if you have further questions!

Cheers,

nroptions

46 posts

12 years ago

nroptions

I did try EE’s redirect first, but it inserts the site_url and this address is an external site. If there is some way to get around that, I would prefer to use all EE methods. Perhaps I didn’t code it correctly. Now I’m wondering: if you enter the full address including “https://…”, will it not insert site_url?

I will do some research into the robots.txt approach. My concern is that the initial portion of the URL is a correct one; only the final segment being junk. I need to make sure that in disallowing the bad one I don’t somehow also disallow the good one.
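One way to check that concern before publishing a rule is to test it locally. The sketch below uses Python's standard `urllib.robotparser` as a stand-in for a crawler; the `Disallow` line is an assumed rule written for this example, and real crawlers (Googlebot included) may treat the braces or percent-encoding differently, so this is only a sanity check of prefix matching.

```python
from urllib.robotparser import RobotFileParser

# Assumed rule: block only paths that begin with the stray segment.
ROBOTS_TXT = """\
User-agent: *
Disallow: /resource-center/websites-of-interest/{gv-donate-online-path}
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

base = "http://www.nroptions.org/resource-center/websites-of-interest"

# The polluted URLs should be refused, the clean ones still allowed.
bad_allowed = parser.can_fetch("*", base + "/{gv-donate-online-path}/P8")
good_page_allowed = parser.can_fetch("*", base + "/P8")
good_root_allowed = parser.can_fetch("*", base)

print("polluted URL allowed:", bad_allowed)
print("clean paginated URL allowed:", good_page_allowed)
print("clean first page allowed:", good_root_allowed)
```

Because robots.txt rules match by path prefix, a rule that includes the junk segment should leave the shorter, correct URL untouched.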

I still wish I had an inkling as to how it got generated, since it could recur. Interestingly, the search keywords that uncover the offending URL in Google do not seem to produce the same results if I use Bing or my new favorite, DuckDuckGo. It seems to be a phantom occurrence or a Google-crawling artifact. Moreover, when I click on that link in the Google list, it takes me to the correct page (ignoring the last segment) rather than the 404 default, which is the home page.

Thanks for your suggestions.

Jonathan

nroptions

46 posts

12 years ago

nroptions

Here is my current theory, although I don’t actually understand how the Google crawler works within the EE environment. Template “X” has a link in it (via an HTML form tag) to the template donate-online/index. This latter template has only one line of code with the meta statement containing the redirect specified as url={gv-donate-online-path}. So, in some cases, the Google crawler used the name of the global variable found in the meta statement to generate its url.

In practice, when I navigate the site on the Internet via a browser, it all works fine and I never see an incorrect url because all the templates have been fully processed (all variables fully evaluated) and use the value of the global variable. And by implication, the bad urls are an artifact of the Google crawler.
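As a sketch of how a crawler could arrive at such a URL, the Python snippet below resolves a link target against the page address the way relative links are normally resolved. It assumes, purely for illustration, that the literal text `{gv-donate-online-path}` reached the crawler as a link target and that the page URL ended in a trailing slash; neither is confirmed by anything in this thread.

```python
from urllib.parse import urljoin

# Assumed page address (trailing slash is an assumption).
page = "http://www.nroptions.org/resource-center/websites-of-interest/"

# Assumed: the unparsed tag text leaked into a link target.
unparsed_target = "{gv-donate-online-path}"

# A target with no scheme and no leading slash is treated as relative,
# so it is appended to the directory part of the page URL.
resolved = urljoin(page, unparsed_target)
print(resolved)
```

Under those assumptions the resolved address is the page URL with the variable name tacked on as a final segment, the same shape as the URL Google reported.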

One problem with the theory is this: the link to the template donate-online/index is present in all templates (it is part of a sidebar on each web page), so my theory would suggest that many more occurrences of bad urls should exist – but I only found two.

Also, I don’t understand how the bad url was picked up by the EE pagination routine. It’s as if the crawler is constructing the pages by its rules (i.e., defining the url that gets passed to the pagination routine), rather than by EE’s rules.

Does any of this make sense? And please let me know where I might find an explanation of how web crawlers work in the EE environment.

Thanks

Dan Decker

7,338 posts

12 years ago

Dan Decker

Hi nroptions,

Whew! I’ve never encountered anything like this. I’ve not done a lot with robots.txt or google crawling, so I might not be the best resource 😉

If you supply the {redirect} a full url, it should use that and not append the site url. It tends to be intelligent about that.

If you want to get some broader insight, I can move this into Development and Programming where the Community there has had more experience with this.

Cheers,

nroptions

46 posts

12 years ago

nroptions

Yes, please go ahead and move the thread … I am curious to see if anybody has an answer.

Thanks.

nroptions

46 posts

12 years ago

nroptions

I think I found the cause: a misplaced quote mark in an <li> element in the footer. No page or URL is created by it, but somehow the Google crawler picks it up, oddly only in some cases (even though the footer is common to all pages).

Please close this thread … thanks.
