ExpressionEngine CMS

This is an archived forum and the content is probably no longer relevant, but is provided here for posterity.

A lot of strange malformed URLs

November 13, 2012 4:17pm

  • #1 / Nov 13, 2012 4:17pm

    lehrerfreund

    263 posts

    Hi,

    some strange things are happening on my site, according to Google Webmaster Tools.

    There are a lot of 404s and a lot of “other” errors (Webmaster Tools uses these categories: server errors, soft 404, forbidden, not found (404), other). Google tries to crawl those URLs and gets the errors. I wonder how Google is finding these URLs?!

    1) 404s - Example
    Google tries to crawl
    http://www.lehrerfreund.de/schule/1s/wordle/P1800

    The P1800 seems to be a pagination segment. However, the correct URL for the entry “wordle” is
    http://www.lehrerfreund.de/schule/1s/wordle-interpretation/
    or
    http://www.lehrerfreund.de/schule/1s/wordle-interpretation/3200

    Now I wonder why Googlebot is getting the wrong URL, and especially why it is getting the pagination segment. When I look in Webmaster Tools to see where the links are coming from, the list looks like this:
    http://www.lehrerfreund.de/schule/1s/wordle/P600
    http://www.lehrerfreund.de/schule/1s/wordle/P1080
    (and more)

    The structure of the 404s is obvious and in most cases similar: the template path is correct (schule/1s), but the url-title is malformed (e.g. wordle instead of wordle-interpretation) and is followed by a pagination segment.
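
To see which malformed url-titles account for most of the 404s, the exported error list can be grouped by template path and url-title. This is only a minimal sketch in Python: the function names `split_paginated` and `count_by_title` are made up for illustration, and the pattern assumes every affected URL has exactly the shape group/template/url-title/P-number described above.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Assumed URL shape (from the examples above):
#   /<template_group>/<template>/<url_title>/P<offset>
PAGINATED = re.compile(
    r"^/(?P<group>[^/]+)/(?P<template>[^/]+)/(?P<url_title>[^/]+)/P(?P<offset>\d+)/?$"
)


def split_paginated(url):
    """Return (template_path, url_title, offset), or None if the URL
    does not end in a P<number> pagination segment."""
    match = PAGINATED.match(urlsplit(url).path)
    if match is None:
        return None
    template_path = match["group"] + "/" + match["template"]
    return template_path, match["url_title"], int(match["offset"])


def count_by_title(urls):
    """Count paginated URLs per (template_path, url_title) pair."""
    counts = Counter()
    for url in urls:
        parts = split_paginated(url)
        if parts is not None:
            counts[(parts[0], parts[1])] += 1
    return counts
```

Feeding in the crawl-error URLs and looking at the most common pairs shows whether the errors cluster around a few truncated url-titles or are spread over the whole site.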

    2) Other errors
    I also get thousands of errors where the malformed URL has the structure
    in/technik/{title_permalink=schule/1s}/P100
    The error code returned is always 400.
    I cannot post the full URLs here because they break the entry, so here are 2 examples as code. I had to insert a lot of spaces, otherwise the entry breaks:

    http://www.lehrerfreund.de/technik/1s/uraltkran/ % 7b permalink= {tec_my_template_group % 7d/tell_friend % 7d
    http://www.lehrerfreund.de/schule/1s / % 3c/ dl % 3e % 3 cp % 3e % 3ca % 20 href= / P120


    As you can see, some EE tag (like title_permalink) does not seem to be parsed correctly. This is even stranger because I removed the segment /in/ (which was my replacement for index.php for a long time) via the htaccess.
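
Percent-decoding such URLs makes it easier to see what leaked into them. The following is a minimal sketch in Python, not an ExpressionEngine feature: `diagnose` is a hypothetical helper, and it assumes the real URLs look like the two examples above with the inserted spaces removed.

```python
from urllib.parse import unquote


def diagnose(url):
    """Percent-decode a crawled URL and report what kind of residue it contains.

    Returns (decoded_url, findings), where findings is a list of labels.
    """
    decoded = unquote(url)
    findings = []
    # Curly braces suggest a template tag that reached the output unparsed.
    if "{" in decoded or "}" in decoded:
        findings.append("template tag")
    # Angle brackets suggest raw HTML markup that ended up inside a link target.
    if "<" in decoded or ">" in decoded:
        findings.append("html markup")
    return decoded, findings
```

Under that assumption, the first example decodes to a URL containing `{permalink={tec_my_template_group}/tell_friend}`, and the second to one containing `</dl><p><a href=`, so one carries an unparsed tag and the other a piece of HTML markup.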


    I have double-checked my templates and there seem to be no syntax errors in them.

    I would really appreciate any idea how I could fix this.

    Thanks in advance!

  • #2 / Nov 15, 2012 2:10pm

    Shane Eckert

    7174 posts

    Hey there lehrerfreund,

    I am sorry to hear you are running into this snag with Google.

    A lot of this can be avoided with strict URLs.

    Do you have that enabled?

    Thank you,

  • #3 / Nov 15, 2012 6:04pm

    lehrerfreund

    263 posts

    Hi Shane,

    yes, they are enabled ...

  • #4 / Nov 20, 2012 4:08pm

    Shane Eckert

    7174 posts

    Hello lehrerfreund,

    Can you show me what your htaccess file contains?

    I would also love to see the actual template code that generates the links.

    You can just use [ code ] tags to show that, or use pastie.

    Thank you,

  • #5 / Nov 21, 2012 10:06am

    lehrerfreund

    263 posts

    Hi Shane,

    the problem is that I use some embeds and a lot of snippets and global variables. If you need it, I could also give you access to the live version; perhaps that would be easier for you.
    Anyway, the main parts are below, on pastie.

    htaccess
    http://pastie.org/5412374

    the template
    http://pastie.org/5412409

    embed “OBEN”:
    http://pastie.org/5412422

    embed “RECHTS”:
    http://pastie.org/5412426

    embed “UNTEN”:
    http://pastie.org/5412431

  • #6 / Nov 21, 2012 12:10pm

    Kurt Deutscher

    827 posts

    A few years back we had a client whose site would slow down every evening around 7:00 PM and then pick back up around 8:40 PM. It took the client about 3 evenings to catch the behavior. The next evening we were watching, and sure enough at the stroke of 7:00 the site slowed to a painful crawl right on cue, so we dug into the logs.

    Some bot was hitting EE and adding a question mark followed by its own tracking codes to URIs. It was attempting to assign a tracking code to each and every possible URI in the site. EE tried to do the friendly thing and respond to all these new URIs, and began paginating them in the image gallery and some other part of the site.

    The bot was set up to add tracking codes to Windows servers, and wasn’t well configured, so it would find a new URI on the site, then assign a tracking code to it, then attempt to record the content on the new page. Each time, EE would create a new page for it in response to the new request with the tracking code. You see, the tracking codes looked to EE like search query strings.

    The nightly dance consisted of 10,000 improvised URIs, and 10,000 improvised result pages served up by EE until the bot would give up for the night.

    We tracked down the IP the bot was coming from and contacted the datacenter requesting that they shut this client’s bot down.

    A couple of hours later, a guy from the datacenter emailed me and said he wanted to introduce me to the folks that owned the bot. I thought this was totally strange as I didn’t care who it was, I just wanted the abusive bot stopped.

    Turned out, my client’s national office’s head of marketing had secretly hired a company to track the content of all 67 of the organization’s affiliate websites. He wanted to know how often the affiliates were updating content, and what they were posting. He was basically spying on the affiliates and didn’t have permission to do so. He was highly embarrassed as his “research” was actually crashing some affiliate sites every evening.

    You are not alone with this sort of thing, and I hope your solution is something as simple, and quick to resolve as ours turned out to be.
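
For anyone who wants to dig into the logs in a similar way, here is a minimal sketch in Python that counts requests carrying a query string, per client IP and hour. It is not the tooling used in the story above: the function name `query_hits_per_hour` is hypothetical, and the pattern assumes the server writes Apache-style common or combined log lines.

```python
import re
from collections import Counter

# Assumed log line shape (Apache common/combined format):
#   <ip> <ident> <user> [<dd/Mon/yyyy:HH:MM:SS zone>] "<method> <target> <protocol>" <status> ...
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<target>\S+)[^"]*" (?P<status>\d{3})'
)


def query_hits_per_hour(lines):
    """Count requests whose target contains a query string, keyed by (ip, hour)."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None or "?" not in match["target"]:
            continue
        # The first 14 characters of the timestamp are "dd/Mon/yyyy:HH".
        hour = match["time"][:14]
        counts[(match["ip"], hour)] += 1
    return counts
```

Calling `query_hits_per_hour(open("access.log"))` (the file name is a placeholder) and then `.most_common(10)` on the result lists the IP and hour combinations with the most query-string requests, which is where a misbehaving bot would stand out.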

  • #7 / Nov 21, 2012 5:03pm

    lehrerfreund

    263 posts

    Kurt, this is very interesting and almost grotesque! I also thought about bots going crazy, but as my main analysis is based on Googlebot (all examples in my original post are taken from Google Webmaster Tools), this may be a more general concern. I am just unsure what is causing the problem; perhaps somebody (e.g. Shane 😊 ) could help me clear this up.
