
Bulk insert reports when archiving #2243

Closed
mattab opened this issue Mar 30, 2011 · 39 comments
Labels
Task Indicates an issue is neither a feature nor a bug and it's purely a "technical" change.
Comments

@mattab
Member

mattab commented Mar 30, 2011

Extract from a Custom Date Range processing run with the SQL profiler enabled (see global.ini.php):


Executed 5314 times in 5313.7ms (average = 1ms)

    INSERT IGNORE INTO piwik_archive_blob_2010_01 (idarchive, idsite, date1, date2, period, ts_archived, name, value) VALUES (?,?,?,?,?,?,?,?)

What happens is that the Page URL / Page Titles reports have many children, i.e. thousands, and Piwik will INSERT them one at a time, in this case 5313 times. Instead, Piwik should log all of these rows into a file and use LOAD DATA INFILE.

I'm pretty sure we'll hit some bugs with some user configurations, so plan for a setting in the config file to disable the bulk insert.
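The idea above can be sketched as follows; a minimal Python sketch, not Piwik's actual PHP helper (the function name is made up, the table and column names are taken from the profiler output): batch all rows into a temp file, then load them with a single statement.

```python
import csv
import os
import tempfile

def write_load_data_file(rows):
    # Write rows to a tab-separated temp file; the terminators here must
    # match the FIELDS/LINES clauses of the LOAD DATA statement below.
    fd, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerows(rows)
    return path

rows = [
    (42, 1, "2010-01-01", "2010-01-31", 3, "2010-02-01 00:00:00", "Actions_actions_url", "<blob>"),
    (42, 1, "2010-01-01", "2010-01-31", 3, "2010-02-01 00:00:00", "Actions_actions", "<blob>"),
]
path = write_load_data_file(rows)
with open(path) as f:
    content = f.read()
os.remove(path)

# One statement replaces the thousands of per-row INSERTs:
load_sql = (
    "LOAD DATA INFILE '%s' INTO TABLE piwik_archive_blob_2010_01 "
    "FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n' "
    "(idarchive, idsite, date1, date2, period, ts_archived, name, value)" % path
)
```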

@mattab
Member Author

mattab commented Mar 30, 2011

(In [4239]) Refs #2243

  • better logging
  • better performance when archiving Custom Date Range by changing cache key
  • remove unnecessary class hierarchy - ArchiveProcessing/Record*
  • Adding a helper function to bulk load data into MySQL + unit tests - tested with MySQL PDO but not mysqli (are webtests still running with mysqli?)
    If the bulk import fails for any reason, we fall back to loading results in a loop and doing plain INSERTs (rather than a config file setting to disable it)

@robocoder
Contributor

are webtests still running with mysqli

Only unit tests. I'll address this when we have parameterized builds running on Jenkins.

@mattab
Member Author

mattab commented Mar 30, 2011

OK cool, I actually meant unit tests, so that's perfect

@mattab
Member Author

mattab commented Mar 31, 2011

(In [4253]) Fixes #2243

  • Enabling the bulk loading on BLOB records.
  • Had to first run the blobs through bin2hex... indeed, even though it looked like it was working at first, the tests once again saved us from a VERY nasty bug which would have shown up only later and been extremely difficult to find. LOAD DATA INFILE was working fine most of the time, but a few tests were failing! Then I discovered that the file containing the blob wasn't loaded properly in all cases, unless the binary blob was hex'ed.
  • I ran a quick performance test and it seems to be approximately the same performance, and it uses the same disk space as well. However, in some cases it will definitely improve performance.
  • Memory usage is somehow much lower in my test, so that's definitely a great point.
  • It would have been nice to have versioning in the archive_blob table, since we now have 2 different data types (old blobs are pure binary, new blobs are hex'ed), but it works nonetheless (old blobs will fail the first gzuncompress(hex2bin()) and then fall back to gzuncompress() only).
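The read-path fallback described in the last point can be sketched like this; a minimal Python sketch (the function name is made up), with zlib standing in for PHP's gzcompress/gzuncompress:

```python
import binascii
import zlib

def read_archive_blob(raw):
    # New-format blobs are hex-encoded compressed data (hex'ed for
    # LOAD DATA INFILE); old-format blobs are raw compressed binary.
    # Try hex2bin + uncompress first, then fall back to uncompress only.
    try:
        return zlib.decompress(binascii.unhexlify(raw))
    except (ValueError, zlib.error):
        return zlib.decompress(raw)

payload = b"serialized DataTable"
old_blob = zlib.compress(payload)                    # pure binary (old format)
new_blob = binascii.hexlify(zlib.compress(payload))  # hex'ed (new format)
```

With no version marker in the table, the failed unhexlify/decompress attempt is the only way to tell the two formats apart, which is exactly the trade-off the comment above laments.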

@robocoder
Contributor

Can you use MySQL's native HEX() and UNHEX() functions? I would think this would give us backward compatibility on blobs while also reducing php's runtime memory usage (because the string conversion happens in the db server).

@mattab
Member Author

mattab commented Mar 31, 2011

To use HEX we would have to use LOAD DATA INFILE ... SET value = HEX(value), but the SET clause is only available as of MySQL 5.0.3.

For the SELECT part, I gave it a quick try, but it doesn't seem to work. Attached patch for reference.
In any case, if this code works perfectly as I hope it does, then it's fine ;)

@mattab
Member Author

mattab commented Mar 31, 2011

Attachment:
unhex test.txt

@robocoder
Contributor

That wasn't what I meant. Let me test something.

@robocoder
Contributor

I'm still investigating an alternative to hex2bin/bin2hex because the current method doubles the blob size, but I want to log some observations first:

re: r4259. Not sure why it passed (before commenting it out) on your box. The CI server runs 5.3.6, so it's falling back to databaseInsertIterate. I suspect the test fails here because it inserts binary data (e.g., containing NULs) into a varchar column.

MySQL documentation says LOAD DATA INFILE uses the database's character set but we haven't explicitly set this in the CREATE DATABASE. Also, the "CHARACTER SET" clause wasn't added to the LOAD DATA INFILE syntax until MySQL 5.0.38. (At minimum, we should set the database charset for consistency.)

One thought is conditionals to use the newer features if available, e.g., Zend_Db_Adapters have a getServerVersion() method.
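That conditional approach could look like the following; a hypothetical Python sketch (the function names are made up), analogous to checking the Zend_Db adapter's getServerVersion() before emitting newer syntax. The thresholds come from the discussion: the SET clause needs MySQL 5.0.3, the CHARACTER SET clause needs 5.0.38.

```python
def mysql_at_least(version, minimum):
    # Compare a dotted version string like "5.1.33" against a tuple.
    return tuple(int(p) for p in version.split(".")[:3]) >= minimum

def supports_set_clause(version):
    # LOAD DATA INFILE ... SET col = expr requires MySQL >= 5.0.3
    return mysql_at_least(version, (5, 0, 3))

def supports_charset_clause(version):
    # LOAD DATA INFILE ... CHARACTER SET requires MySQL >= 5.0.38
    return mysql_at_least(version, (5, 0, 38))
```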

@robocoder
Contributor

(In [4263]) refs #2243 - set database's default charset explicitly

@robocoder
Contributor

(In [4264]) refs #2243 - use Piwik_Exec() for MYSQLI compatibility, i.e., can't prepare LOAD DATA INFILE when using Piwik_Query(). Tested with 5.1.3, 5.1.6, 5.2.0, and 5.2.17.

@robocoder
Contributor

I'll add a test case that uses a table with a blob column. It appears the bin2hex/hex2bin is unnecessary, so long as the column's datatype is blob.

@mattab
Member Author

mattab commented Mar 31, 2011

You say it doubles the blob size, but my tests show that in fact it doesn't.
I ran the archiving on 2 full years with the batch insert, and without, and the DB size was the same in both cases.

OK for charset, good thinking.

I'm just slightly concerned about the added escaping function, but the tests are passing, so it should be fine.

When you test, it is good to not rely only on unit tests: the blob sizes & scope they generate are quite limited. Try to generate loads of data using the Visitor Generator, then archive the data with the old & new algorithm to see if both work as expected. You can also easily compare the piwik_archive_* db size.
To switch from one to the other, simply do:

### Eclipse Workspace Patch 1.0
#P trunk
Index: core/ArchiveProcessing.php
===================================================================
--- core/ArchiveProcessing.php  (revision 4260)
+++ core/ArchiveProcessing.php  (working copy)
@@ -696,7 +696,7 @@
    protected function insertBulkRecords($records)
    {
        // Using standard plain INSERT if there is only one record to insert
-       if($DEBUG_DO_NOT_USE_BULK_INSERT = false
+       if($DEBUG_DO_NOT_USE_BULK_INSERT = !false
            || count($records) == 1)
        {
            foreach($records as $record)

@robocoder
Contributor

But that doesn't disable the bin2hex/hex2bin. When I say "double the size", I'm comparing the old blobs (octets) vs the new blobs (nybbles as hexits). So DEBUG_DO_NOT_USE_BULK_INSERT doesn't affect the size, as it still uses hexits.

With the escaping function (used only when writing to the infile), blobs go back to being binary data.
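The escaping in question can be sketched like this; a hypothetical Python sketch (the function name is made up), assuming MySQL's default FIELDS ESCAPED BY '\' with tab field terminators and newline line terminators. With escaping in place, blobs can stay binary in the infile instead of being doubled into hexits.

```python
def escape_infile_field(value: bytes) -> bytes:
    # Escape the escape character itself first, so later replacements
    # don't double-escape; then the field/line terminators and NULs.
    out = value.replace(b"\\", b"\\\\")
    out = out.replace(b"\t", b"\\t")   # field terminator
    out = out.replace(b"\n", b"\\n")   # line terminator
    out = out.replace(b"\0", b"\\0")   # NUL bytes in binary blobs
    return out

escaped = escape_infile_field(b"a\tb\nc\\d\0e")
```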

@robocoder
Contributor

(In [4272]) refs #2243 - looks good so far

@mattab
Member Author

mattab commented Apr 1, 2011

But that doesn't disable the bin2hex/hex2bin.

Oh no, I messed up! And I thought the results were strange; well, I should have thought about it a bit more indeed.

The tests are failing for missing table: http://qa.piwik.org:8080/webtest/003_UnitTests/001_response_invoke.html

when you are confident all is working and tested properly, please ping me and I will test on the demo :)

@robocoder
Contributor

Yeah, I forgot to drop the database before running my unit test changes. I'll fix this in the a.m.

@robocoder
Contributor

Fixing the unit tests is taking longer than I thought ... encountering more test runner side-effects (similar to #896).

@robocoder
Contributor

(In [4285]) refs #2243 - move test code around

@robocoder
Contributor

(In [4287]) refs #2243, refs #2253

@robocoder
Contributor

(In [4289]) refs #2243, refs #2253

@robocoder
Contributor

(In [4290]) refs #2243, refs #2253 - use a pseudo test method since __destruct is called too late

@robocoder
Contributor

(In [4291]) refs #2243, refs #2253

@robocoder
Contributor

(In [4292]) refs #2243, refs #2253

@robocoder
Contributor

The non-integration tests are working but the integration tests are broken. I'm getting different errors depending on whether I run Main.test.php or all_tests.php.

I've been hacking this since r4287. I'm off for dinner, and will revisit later tonight, unless someone beats me to it. ;)

@robocoder
Contributor

blink blink Main.test.php passes from command line and browser. Only failing in the browser when run as part of all_tests.php.

@robocoder
Contributor

(In [4293]) fixes #2243, refs #2253

@robocoder
Contributor

(In [4294]) fixes #2255

  • When the LOCAL keyword is omitted from the statement, the unit tests fail on dev6 because the db user doesn't have FILE privilege granted. (Granted.)
  • When the LOCAL keyword is included in the statement, the unit tests fail on dev6 because MySQL was not built with --enable-local-infile. (Thus, PDO_MYSQL on dev6 is affected by PHP bug 54158.)

refs #2243

  • fixes problem where infile not being deleted when Piwik_Exec() threw an exception (MYSQLI)
  • set driver_options uniformly between MYSQLI and PDO_MYSQL
  • update unit test conditions (when batch insert should work)
  • renamed methods
  • the performance gain is worth the effort of trying both LOAD DATA LOCAL INFILE and LOAD DATA INFILE
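The last point can be sketched as a simple fallback chain; a hypothetical Python sketch (the function names are made up, and `execute` is a stand-in for whatever runs a single SQL statement, such as Piwik_Exec). The LOCAL variant fails when the server was built without --enable-local-infile; the non-LOCAL variant fails when the db user lacks the FILE privilege, so both are tried in turn.

```python
def load_data_with_fallback(execute, path, table):
    # Try LOAD DATA LOCAL INFILE first, then plain LOAD DATA INFILE;
    # return True as soon as one variant succeeds.
    for keyword in ("LOCAL ", ""):
        try:
            execute("LOAD DATA %sINFILE '%s' INTO TABLE %s" % (keyword, path, table))
            return True
        except Exception:
            continue  # this variant failed; try the next one
    return False

# Simulated driver: LOCAL is rejected, the plain variant succeeds.
attempts = []
def fake_execute(sql):
    attempts.append(sql)
    if "LOCAL" in sql:
        raise RuntimeError("local_infile disabled on server")

ok = load_data_with_fallback(fake_execute, "/tmp/archive.csv", "piwik_archive_blob_2011_01")
```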

@mattab
Member Author

mattab commented Apr 4, 2011

Reopening, as there are some issues on Windows at least:

ArchiveProcessing.test.php

Fail: Test_Piwik_ArchiveProcessing -> test_tableInsertBatch -> Record 4 bug, not  BUT ???? at [D:\piwik\svn\trunk\tests\core\ArchiveProcessing.test.php line 305]

Also, some Main.test.php tests fail with "no data" when there should be some, which I think is because the blobs don't uncompress properly and Piwik defaults to "no data" rather than throwing the error.

Maybe the ArchiveProcessing failure above can help: it looks like UTF-8 characters. Could it be some kind of charset issue, or a Windows-specific issue?

Alternatively, we could disable bulk loading on Windows, but only once we are 100% sure that the failing tests are due to Windows.

@robocoder
Contributor

I hope it's trivial ... I'll take a quick look in the a.m.

@mattab
Member Author

mattab commented Apr 4, 2011

Actually, only ArchiveProcessing.test.php is failing; Main.test.php seems fine.

@robocoder
Contributor

We may need some more users to test on Windows. ArchiveProcessing.test.php passes for me with r4314 on Windows using:

  • PHP 5.2.9 + Apache + MySQL 5.1.33
  • PHP 5.2.14 + IIS + MySQL 5.1.33

Can you run SHOW VARIABLES LIKE 'character_set%'; ?

On my Windows box, my.ini has:

| character_set_client     | latin1
| character_set_connection | latin1
| character_set_database   | latin1
| character_set_filesystem | binary
| character_set_results    | latin1
| character_set_server     | latin1
| character_set_system     | utf8

This is the same as my Ubuntu dev box:

| character_set_client     | latin1
| character_set_connection | latin1
| character_set_database   | latin1
| character_set_filesystem | binary
| character_set_results    | latin1
| character_set_server     | latin1
| character_set_system     | utf8

@mattab
Member Author

mattab commented Apr 5, 2011

Running the query on the piwik_tests database, I get:

Variable_name   Value
character_set_client    utf8
character_set_connection    utf8
character_set_database  utf8
character_set_filesystem    binary
character_set_results   utf8
character_set_server    latin1
character_set_system    utf8
character_sets_dir  C:\xampp\mysql\share\charsets\

and the values from mysql.exe

$ /cygdrive/c/xampp/mysql/bin/mysql.exe --verbose --help | grep set

character-sets-dir                (No default value)
default-character-set             latin1

@mattab
Member Author

mattab commented Apr 5, 2011

FYI the code is now running on demo.piwik.org

I had to DROP the archive tables for March & April since the ones built with the previous RC were HEXed and therefore not read properly. After dropping the tables, I ran archiving.

The problem is that apparently LOAD DATA INFILE isn't being used; I saw this in the SQL profiler output:

Executed 2 times in 1.2ms (average = 0.6ms)

    INSERT IGNORE INTO piwik_archive_blob_2011_01 (idarchive, idsite, date1, date2, period, ts_archived, name, value) VALUES (?,?,?,?,?,?,?,?)

investigating...

@robocoder
Contributor

Attachment: Hack for charset mismatch
2243.patch

@robocoder
Contributor

matt: can you try the attached patch?

@mattab
Member Author

mattab commented Apr 5, 2011

On demo.piwik.org: there is just one issue, in that somehow Piwik_Exec doesn't log the query execution in the profiler like the other functions do.

After adding lots of logging the bulk insert code is executed as expected, cool!!

@mattab
Member Author

mattab commented Apr 5, 2011

I applied the patch locally on Windows and it fixes ArchiveProcessing.test.php !!!

@robocoder
Contributor

(In [4320]) fixes #2243 - charset mismatch hack; also handle case where dbhost is blank (eg socket), or a pseudo-domain was added to the hostname

@mattab mattab added this to the Piwik 1.3 milestone Jul 8, 2014
This issue was closed.