
Dupe php, mysql, massive data set #386

Closed
mattab opened this issue Oct 14, 2008 · 17 comments

Labels: Major, Task, wontfix

Comments

@mattab
Member

mattab commented Oct 14, 2008

SEE NEW TICKET AT: #1999 (this one was closed)

This ticket is a placeholder for all performance-related notes, thoughts, and feedback.

There are already a number of interesting performance improvement tickets. Here is the list:
- Piwik creates one cookie per install, which is not scalable; we should build a server-side, DB-based cookie store: #409
- Piwik logs should be rotated into a yearly table: #5
- Bulk load Piwik logs (with a documented API), which would improve tracking performance and make it easy to do performance testing: #134
- Fix the memory leak error during the Piwik archiving task for larger Piwik setups: #766
- The All Websites dashboard should work when Piwik has thousands of registered websites (currently it only scales to a few hundred websites): #1077

Building a regression and performance testing environment for Piwik
Partly described in #134.

The objective of this project is to build an automated and reusable performance testing infrastructure that would make it easy to generate lots of hits and big data sets:

- Setup performance testing server (called “preprod”) with monitoring
- Replaying logs from previous days (see #5554)
- Use preprod to get a precise idea of the load Piwik can handle (pages per day, per month) and the size of the data in MySQL
- Use preprod to run Archiving and determine where Archiving is slow, or fails (via profiling).
- Use preprod to optimize the tracking mechanism (denormalize tables, review index strategy)
- Implement quick wins and plan for bigger changes

Once we have a system to assess performance, we could answer a few of the more common questions in a dedicated documentation page:
- Does it use a lot of bandwidth on your site?
- Does it take a lot of space/memory in the MySQL database?
- Does it make your site load (and work) slower?
- Expected DB size for a website with 1k, 10k, 100k visits and 1M pages.
- List the server configuration required to run Piwik: on a shared server / on a dedicated server.
- List archiving processing time examples for high-traffic sites.
- Suggest the best configuration settings for small and medium (default config) and higher-traffic installs.

Interesting reads on scalability & performance:

- [Rasmus Lerdorf on scaling web apps with PHP](http://www.niallkennedy.com/blog/2006/07/rasmus-lerdorf-php-web20.html)
- [Scaling PHP/MySQL, Hardware only, by Flickr](http://www.scribd.com/doc/21593/Scaling-PHPMySQL)
- [Scaling at Digg](http://eliw.crossbows.biz/conference/zendcon-2007-high-perf.pdf)
- [How Flickr Handles Moving You to Another Shard](http://highscalability.com/how-flickr-handles-moving-you-another-shard)
- [Real World Web: Performance & Scalability](http://develooper.com/talks/rww-mysql-2008.pdf)
- [Flickr architecture details](http://highscalability.com/flickr-architecture)
- [MySQL Performance Blog](http://www.mysqlperformanceblog.com/)

To study:

- [MySQL query profiler](http://dev.mysql.com/tech-resources/articles/using-new-query-profiler.html)
- [Optimizing PHP with APC (by a Facebook dev)](http://c7y.phparch.com/c/entry/1/art,apc_facebook)
- [Various MySQL performance reporting tools](http://www.mysqlperformanceblog.com/tools/)
- Monitoring tools: Nagios, MRTG, Puppet, Munin, Ganglia (ganglia.sourceforge.net; graphs across multiple servers)

Keywords: performance fast scalability high traffic

@mattab
Member Author

mattab commented Oct 14, 2008

- DB sharding
  - This is expected to be the most useful and powerful feature. It is a required step to build a scalable Piwik: being able to partition the data horizontally across different MySQL servers.
  - The first step would be to isolate a DB layer in Piwik and make sure that all database queries are located inside distinct classes. We can then easily change the target server of these SQL queries depending on the idsite. For example, all sites from id=1 to id=1000 would go to server1, and all sites from id=1001 to id=2000 would go to server2.
  - We want to keep all the logic in PHP/MySQL, so we have to do the partition/server lookup in the code (in the DB layer); see the sketch after this list.
- Tools to manage sharding
  - When a website is too big to fit on its existing shard, how do we move it to another shard? This will happen, and we can't afford downtime in the logging process.
  - Moving the partitions from one server to another.
- Parallelize the archiving process (for example, archiving siteA and siteB can be done in parallel).
- At some point, archiving data for a single website could take several hours. Can we split this process and run it every hour instead, then sum the data for the 24 hours? (This is quite a big change in the PHP logic code.)
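As a very rough illustration of that DB-layer lookup, here is a hypothetical PHP sketch. The `ShardedDb` class, the shard map, and the credentials are all invented for this example; none of this is existing Piwik code.

```php
<?php
// Hypothetical sketch: route every per-site query through one DB layer
// that picks a MySQL server based on the idsite range.
class ShardedDb
{
    /** @var array Assumed static map of idsite ranges to connection DSNs. */
    private $shards = array(
        array('min' => 1,    'max' => 1000, 'dsn' => 'mysql:host=server1;dbname=piwik'),
        array('min' => 1001, 'max' => 2000, 'dsn' => 'mysql:host=server2;dbname=piwik'),
    );

    private $connections = array();

    /** Return the PDO connection holding the data for this idsite. */
    public function getConnectionForSite($idSite)
    {
        foreach ($this->shards as $i => $shard) {
            if ($idSite >= $shard['min'] && $idSite <= $shard['max']) {
                if (!isset($this->connections[$i])) {
                    // Credentials are placeholders.
                    $this->connections[$i] = new PDO($shard['dsn'], 'piwik', 'secret');
                }
                return $this->connections[$i];
            }
        }
        throw new Exception("No shard configured for idsite=$idSite");
    }

    /** All single-site queries go through here, so the caller never sees shards. */
    public function fetchAllForSite($idSite, $sql, $bind = array())
    {
        $stmt = $this->getConnectionForSite($idSite)->prepare($sql);
        $stmt->execute($bind);
        return $stmt->fetchAll(PDO::FETCH_ASSOC);
    }
}
```

Keeping the lookup in one class is what makes the later "move a site to another shard" tooling possible: only the shard map has to change.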

@mattab
Member Author

mattab commented Oct 14, 2008

getDateStart() and getDateEnd() in the Period classes are not optimized; the results could somehow be cached, e.g. with lazy initialization as sketched below.
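A minimal memoization sketch of that idea; the class and the compute methods are hypothetical stand-ins, not the real Piwik Period implementation:

```php
<?php
// Hypothetical sketch: compute the period boundaries once and reuse them.
class CachedPeriod
{
    private $dateStart = null;
    private $dateEnd = null;

    public function getDateStart()
    {
        if ($this->dateStart === null) {
            $this->dateStart = $this->computeDateStart(); // expensive, done once
        }
        return $this->dateStart;
    }

    public function getDateEnd()
    {
        if ($this->dateEnd === null) {
            $this->dateEnd = $this->computeDateEnd();
        }
        return $this->dateEnd;
    }

    // Placeholders for the expensive subperiod walk.
    private function computeDateStart() { return date('Y-m-01'); }
    private function computeDateEnd()   { return date('Y-m-t'); }
}
```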

@mattab
Member Author

mattab commented Oct 29, 2008

When plugins are not used by the piwik.php logging script, don't load the related files (that was #19). A minimal sketch of this idea follows below.
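A sketch of that lazy loading, assuming a whitelist of tracker-relevant plugins (the list and the file layout here are illustrative; `PIWIK_INCLUDE_PATH` is Piwik's include-path constant):

```php
<?php
// Hypothetical sketch: only include plugin files the tracker actually needs,
// instead of loading every plugin on each piwik.php request.
$pluginsUsedByTracker = array('UserCountry', 'Referers'); // assumed whitelist

foreach ($pluginsUsedByTracker as $pluginName) {
    $file = PIWIK_INCLUDE_PATH . '/plugins/' . $pluginName . '/' . $pluginName . '.php';
    if (is_file($file)) {
        require_once $file; // other plugins are never touched during tracking
    }
}
```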

@mattab
Member Author

mattab commented Nov 17, 2008

Find memory leaks in PHP

If the image processing extension uses emalloc() style allocation, then you can compile PHP with --enable-debug and you will get a report of all leaks and their location at the end of the script. Note that you must finish with "return" or nothing to get the report, not with "exit".

But that won’t pick up malloc() leaks. For that though you can use http://www.valgrind.org if you are running under unix.

https://www.zend.com/forums/index.php?t=msg&goto=13062&S=7f75627561a92cfe442aaed40c3306eb

xdebug.show_mem_delta
Type: integer, Default value: 0
When this setting is set to something != 0 Xdebug’s human-readable generated trace files will show the difference in memory usage between function calls. If Xdebug is configured to generate computer-readable trace files then they will always show this information.
http://www.xdebug.org/docs/execution_trace

@mattab
Member Author

mattab commented Nov 19, 2008

One other idea would be to remove the COUNT(DISTINCT idvisitor) from the archiving query. Other products like GA don't give the count of uniques for each metric; this could eventually become a setting that decides whether uniques are counted or not (see the sketch below).

We would still count uniques for the global stats, per month, week, and day.
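A hedged sketch of what that setting could look like when building the archiving query. The setting name and the surrounding query are illustrative; `log_visit` and `idvisitor` are the real Piwik table/column names mentioned in this thread:

```php
<?php
// Hypothetical sketch: only pay for COUNT(DISTINCT idvisitor) when the
// (imagined) "count unique visitors" setting is enabled.
$countUniqueVisitors = true; // imagined config setting

$select = "COUNT(*) AS nb_visits, SUM(visit_total_actions) AS nb_actions";
if ($countUniqueVisitors) {
    $select .= ", COUNT(DISTINCT idvisitor) AS nb_uniq_visitors";
}
$sql = "SELECT $select
        FROM log_visit
        WHERE idsite = ?
          AND visit_last_action_time BETWEEN ? AND ?";
```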

@mattab
Member Author

mattab commented Dec 5, 2008

Optimizing Large Databases Using InnoDB Clustered Indexes
http://knielsen-hq.org/presentations/innodb-clustered-index/innodb-clustered-index.pdf

http://www.mysqlperformanceblog.com/2006/10/03/long-primary-key-for-innodb-tables/
http://www.mysqlperformanceblog.com/2006/10/03/mysql-optimizer-and-innodb-primary-key/

@anonymous-matomo-user

> I have set up a git branch at github to help port piwik to postgresql :
>
> http://github.com/klando/pgpiwik/tree/svn-merge
>
> Edit : I made a mistake with branch naming.

The GitHub repository is here: http://github.com/klando/pgpiwik/tree/master

To grab it: git clone git://github.com/klando/pgpiwik.git

@mattab
Member Author

mattab commented Mar 22, 2009

see also #620: Piwik should use autoload to automatically load all classes instead of using require_once. A sketch of what that could look like is below.
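A minimal sketch of what #620 proposes, assuming a one-class-per-file layout under core/ and a Piwik_ class prefix (both assumptions here, not confirmed conventions): register one autoloader instead of scattering require_once calls.

```php
<?php
// Hypothetical sketch: resolve "Piwik_ArchiveProcessing" to
// core/ArchiveProcessing.php and load it on first use.
function piwik_autoload($className)
{
    $file = PIWIK_INCLUDE_PATH . '/core/'
          . str_replace(array('Piwik_', '_'), array('', '/'), $className)
          . '.php';
    if (is_file($file)) {
        require_once $file;
    }
}
spl_autoload_register('piwik_autoload');
```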

@mattab
Member Author

mattab commented Apr 7, 2010

plandem, see the thread on piwik-hackers for some thinking around alternative nosql databases in piwik: http://lists.piwik.org/pipermail/piwik-hackers/2010-February/000829.html

@robocoder
Contributor

There's also an interesting FAQ/blog post about InfiniDB's column-oriented storage engine for MySQL vs. "NoSQL".

@mattab
Member Author

mattab commented Apr 7, 2010

InfiniDB sounds like something we should definitely investigate first, as it might be much (much) easier to use with the current Piwik architecture. Are there limitations when "dropping it in" instead of MySQL?

@robocoder
Contributor

There's now a migration guide for InfiniDB. The relevant section starts at page 17.

The only limitation of the open-source community edition is that it is restricted to a single machine (there are no limits on CPUs, RAM, or concurrent users). Theoretically, you can build a fairly powerful box (think multi-core, multi-processor boards) before you have to think about adding nodes (and license fees for the enterprise edition).

@anonymous-matomo-user

It looks like you have a good handle on where to start looking at InfiniDB. I took a quick look at your schema and don't see any fundamental problem with the log_visit fact table or the queries that reference it above. We do not (yet) support blob for your archive tables. The only other quick note I would add is that we aren't optimal for web/OLTP-style loads. However, you could easily SELECT * INTO OUTFILE from the existing schema and LOAD DATA INFILE into InfiniDB to get good load rates, or use SELECT INTO OUTFILE plus cpimport (our bulk load) to get excellent load rates. 100k or more rows/second are possible, but rates will vary significantly based on disk and table definition.

One additional note: our current parallelization distributes ranges of 8 million rows to each thread, so smaller tables won't show the same benefits from many cores as larger tables.

Anyway, I'm currently signed up to follow this discussion; let me know if you have any questions or comments. Thanks - Jim Tommaney
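To make that export/import path concrete, here is an illustrative PHP/PDO sketch of the flow described above. Host names, credentials, and file paths are all assumptions for the example:

```php
<?php
// Hypothetical sketch: dump the fact table from the existing MySQL server,
// then bulk-load it into InfiniDB (cpimport would be faster still).
$mysql    = new PDO('mysql:host=oldserver;dbname=piwik', 'piwik', 'secret');
$infinidb = new PDO('mysql:host=newserver;dbname=piwik', 'piwik', 'secret');

// Write the table to a flat file on the old server's filesystem.
$mysql->exec("SELECT * INTO OUTFILE '/tmp/log_visit.csv'
              FIELDS TERMINATED BY ',' ENCLOSED BY '\"'
              FROM log_visit");

// The flat file must be copied to the new server before this step,
// since LOAD DATA INFILE reads from the server's local disk.
$infinidb->exec("LOAD DATA INFILE '/tmp/log_visit.csv'
                 INTO TABLE log_visit
                 FIELDS TERMINATED BY ',' ENCLOSED BY '\"'");
```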

@mattab
Member Author

mattab commented May 5, 2010

See the Percona white paper about goal-driven performance optimization: http://www.percona.com/files/white-papers/goal-driven-performance-optimization.pdf

@mattab
Member Author

mattab commented Jul 23, 2010

We will tackle the critical issues (#409, probably #1077) and postpone the others to post-1.0.

@mattab
Member Author

mattab commented Dec 28, 2010

It might be good to look into storing JSON-encoded data tables rather than serialized PHP tables, which would improve portability. See http://stackoverflow.com/questions/1306740/json-vs-serialized-array-in-database as a reference. The speed of JSON-decoding large arrays vs. unserialize() should be tested; a quick benchmark sketch follows below.
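A quick benchmark sketch for that question. The table shape, row count, and iteration count are arbitrary assumptions; the point is only to compare the two decoders on the same large array:

```php
<?php
// Hypothetical benchmark: decode a large array from JSON vs. PHP's
// native serialization format, and compare wall-clock times.
$table = array();
for ($i = 0; $i < 100000; $i++) {
    $table[] = array('label' => "page $i", 'nb_visits' => $i);
}

$json       = json_encode($table);
$serialized = serialize($table);

$t0 = microtime(true);
for ($i = 0; $i < 10; $i++) { json_decode($json, true); }
$jsonTime = microtime(true) - $t0;

$t0 = microtime(true);
for ($i = 0; $i < 10; $i++) { unserialize($serialized); }
$serializeTime = microtime(true) - $t0;

printf("json_decode: %.3fs, unserialize: %.3fs\n", $jsonTime, $serializeTime);
```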

@mattab
Member Author

mattab commented Jan 11, 2011

I created a summary ticket from this one, as this ticket became unclear. See #1999

This issue was closed.