Importing Cyrillic WordPress comments into Disqus

My wife has a blog on WordPress. She writes in Russian, which is important to mention (and it’s in the title). Some time ago I decided to improve the blog a bit, and one of the improvements was moving comments to Disqus. I like Disqus because it makes commenting easier, which potentially leads to more comments and discussions. And comments and discussions are good.

I installed the Disqus plugin and started exporting comments into Disqus. To be honest, I don’t remember how it ended (it was a while ago), but when I went to Disqus, I found no comments there.

Then I tried to move the comments semi-manually, which means exporting them from WordPress into an .xml file and importing that file into Disqus. It started well – the export succeeded. But when I tried to import the file into Disqus, the page just reloaded with no message about what was going on. The comments did not appear.

I started to investigate the issue and found an error report that was a bit more informative about the problem. The error was the following:

XML syntax error: Input is not proper UTF-8, indicate encoding !
Bytes: 0xD0 0xBE 0xD1 0x81, line 51, column 294 (line 51)

If the encoding had not been specified and that had been the problem, I would not be writing this post. Of course the encoding was indicated. I played with the file a lot – I inspected it with a hex editor and ran it through XML validators – and none of it helped. I was pretty sure the problem was connected to Cyrillic characters. In fact it might have been, but not directly.
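
For anyone who wants to repeat that check without a hex editor, here is a minimal sketch in Python. It is not the tool I used back then. The helper name `first_invalid_utf8` is made up for illustration. It decodes the four bytes quoted in the error message and offers a strict UTF-8 check for a whole file.

```python
def first_invalid_utf8(data: bytes):
    """Return (byte_offset, line_number) of the first invalid UTF-8
    sequence in data, or None if everything decodes cleanly."""
    try:
        data.decode("utf-8")  # decoding is strict by default
    except UnicodeDecodeError as err:
        # Count newlines before the failing byte to get a 1-based line number.
        line_number = data.count(b"\n", 0, err.start) + 1
        return err.start, line_number
    return None


# The four bytes quoted in the error message above.
reported = bytes([0xD0, 0xBE, 0xD1, 0x81])
print(reported.decode("utf-8"))      # two ordinary Cyrillic letters
print(first_invalid_utf8(reported))  # None: the bytes are valid UTF-8
```

To check an export, pass the file's raw bytes, for example `first_invalid_utf8(open("export.xml", "rb").read())`, where `export.xml` is a placeholder name. The quoted bytes are perfectly valid UTF-8 on their own, which fits the suspicion that Cyrillic was involved only indirectly.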

After a lot of googling, working with the XML file, drinking a lot of tea (I’m a tea drinker, yes) and saying a bunch of “good” words addressed to the Disqus importer, I found a way to get through the issue.

1. Wipe out post and page content from the XML

You are importing comments, not the posts and pages themselves. More content means more symbols that can cause trouble. Wipe the contents and excerpts; you can even use a regular expression to replace everything between the <content:encoded> and <excerpt:encoded> tags.

<!-- Was: -->
<content:encoded><![CDATA[Привет, тут какой-то текст!]]></content:encoded>
<excerpt:encoded><![CDATA[Тут тоже текст!]]></excerpt:encoded>
<!-- Is: -->
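
A regular expression for this step could look roughly like the sketch below. It is an illustration, not the exact expression I used. The function name `wipe_bodies` is hypothetical, and leaving an empty CDATA section behind (rather than a completely empty element) is an assumption.

```python
import re

# Matches a whole <content:encoded> or <excerpt:encoded> element,
# including bodies that span several lines (hence re.DOTALL).
BODY = re.compile(
    r"(<(content|excerpt):encoded>).*?(</\2:encoded>)",
    re.DOTALL,
)


def wipe_bodies(xml_text: str) -> str:
    """Empty every content/excerpt element but keep the tags themselves."""
    # Assumption: an empty CDATA section is an acceptable replacement body.
    return BODY.sub(r"\1<![CDATA[]]>\3", xml_text)


sample = (
    "<content:encoded><![CDATA[Привет, тут какой-то текст!]]></content:encoded>\n"
    "<excerpt:encoded><![CDATA[Тут тоже текст!]]></excerpt:encoded>"
)
print(wipe_bodies(sample))
```

The pattern only touches the two elements named in it, so the comment text elsewhere in the file stays as it was.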

2. Wipe out all the suspicious punctuation

Regular expressions are helpful for this too – I wrote one to search for everything except commonly used letters, digits and punctuation characters. It found some weird ellipses, strange dashes and crazy spaces. I removed all of them. You do the same.
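
I no longer have the original expression, so the following is only a sketch of the idea. The "safe" character set (printable ASCII, basic Russian letters, tab and line breaks) is an assumption you may need to widen for your own texts. The names `find_suspicious` and `remove_suspicious` are made up for illustration.

```python
import re
import unicodedata

# Assumed safe set: tab, LF, CR, printable ASCII, and the basic Russian
# alphabet (U+0410..U+044F plus the letters U+0401 and U+0451).
SUSPICIOUS = re.compile(r"[^\t\n\r\x20-\x7EА-яЁё]")


def find_suspicious(text: str):
    """Yield (line, column, char, unicode_name) for every character
    outside the safe set. Lines and columns are 1-based."""
    # Split on "\n" only, so exotic separators are reported, not swallowed.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for match in SUSPICIOUS.finditer(line):
            ch = match.group()
            yield line_no, match.start() + 1, ch, unicodedata.name(ch, "UNNAMED")


def remove_suspicious(text: str) -> str:
    """Drop every character outside the safe set."""
    return SUSPICIOUS.sub("", text)


demo = "Привет\u2026 мир\u00a0\u2014 ok"
for hit in find_suspicious(demo):
    print(hit)
```

Listing the hits before deleting anything is the safer order: some of them (a dash in the middle of a sentence, say) may be better replaced with a plain ASCII character than removed.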

3. Do something with the same meaning as shaman dances and hope it helps

Now that you are pretty sure your file is clean of garbage, it might import successfully. Or it might not. Mine was still causing an error, but on a different line and column. When I looked into it, it turned out that the error always pointed exactly between the opening bracket of CDATA[ and the first letter of the Cyrillic text. Here is what I did: I added a space there, and it worked. When it didn’t help, I removed the space and tried again. If that didn’t work either, I added the space again. It’s crazy. I know. Ask Disqus why. After a maximum of three attempts the error was gone and another one appeared, and so on until I went through all of them (I had about 150 comments and about 10 errors at this stage).
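
I did the add-a-space, remove-a-space dance by hand in a text editor. If you would rather script it, a minimal sketch could look like this. The function name `toggle_cdata_space` is hypothetical, and it assumes the line number from the error message points at a line containing a CDATA opening.

```python
def toggle_cdata_space(xml_text: str, line_no: int) -> str:
    """Add a space right after the first '<![CDATA[' on the given
    1-based line, or remove that space if it is already there."""
    lines = xml_text.split("\n")
    line = lines[line_no - 1]
    marker = "<![CDATA["
    pos = line.find(marker)
    if pos == -1:
        return xml_text  # no CDATA opening on this line, nothing to toggle
    after = pos + len(marker)
    if line[after:after + 1] == " ":
        line = line[:after] + line[after + 1:]   # remove the space
    else:
        line = line[:after] + " " + line[after:]  # insert a space
    lines[line_no - 1] = line
    return "\n".join(lines)


before = "<item>\n<![CDATA[Привет]]>\n</item>"
after_toggle = toggle_cdata_space(before, 2)
print(after_toggle)
```

Calling it twice on the same line brings the text back to where it started, which matches the manual routine: toggle, retry the import, toggle again if the same error comes back.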

Finally I was able to import the file. I hope my recipe is helpful.

P.S. Another small tip, which was indirectly mentioned above – don’t forget to move page comments, not only post comments. I forgot. Was punched for that. :)

P.P.S. A couple of days ago I was importing the page comments which I forgot to import the first time. During the import I broke the Disqus importer, so it started to give me 500 errors. The next day they emailed me saying that there was a bug on their side and that they had fixed it. I haven’t checked, but good reaction!

