Database

Going to the O’Reilly MySQL Conference & Expo

As I’ve been doing the last couple of years, I will be going to the
O’Reilly MySQL Conference & Expo. In addition to
the tutorial and the replication sessions that I will be holding
together with Lars,
I will be holding a session about the binary log together with Chuck
from the Backup team which the Replication team normally works very
close with.

This year, O’Reilly also have a Friend of the Speaker
discount of 25% that you can use when you register using the code
mys10fsp.

The sessions that we are going to hold are listed below. Note that I
am using Microformats, which will
allow you to easily extract and add the events to your calendar using,
for example, the Operator
plugin for Firefox.

See you there!

Mysteries of the Binary Log

April 14th, 2010 10:50am -
11:50am
Room: Ballroom F
New Replication Features

April 13th, 2010 2:00pm -
3:00pm
Room: Ballroom A
Replication Tricks & Tips

April 14th, 2010 2:00pm -
3:00pm
Room: Ballroom B
The Replication Tutorial

April 12th, 2010 8:30am -
12:00pm
Room: Ballroom E

Tags:
By Jeffrey on March 7, 2010 | MySQL | A comment?

Automated deployment systems: push vs. pull

I’ve been immersed in the world of automated deployment systems for quite a while. Because I like Python, I’ve been using Fabric, but I also dabbled in Puppet. When people are asked about alternatives to Puppet in the Python world, many mention Fabric, but in fact these two systems are very different. Their main difference is the topic of this blog post.

Fabric is what I consider a ‘push’ automated deployment system: you install Fabric on a server, and from there you push deployments by running remote commands via ssh on a set of servers. In the Ruby world, an example of a push system is Capistrano.

The main advantages of a ‘push’ system are:

  • control: everything is synchronous, and under your control. You can see right away is something went wrong, and you can correct it immediately.
  • simplicity: in the case of Fabric, a ‘fabfile’ is just a collection of Python functions that copy files over to a remote server and execute commands over ssh on that server; it’s all very easy to set up and run

The main disadvantages of a ‘push’ system are:

  • lack of full automation: it’s not usually possible to boot a server and have it configure itself without some sort of client/server protocol which push systems don’t generally support (see ‘pull’ systems below for that)
  • lack of scalability: when you’re dealing with hundreds of servers, a push system starts showing its limits, unless it makes heavy use of threading or multi-processing

Puppet is what I consider a ‘pull‘ automated deployment system (actually to be more precise, it is a configuration management system). In such a system, you have a server which acts as a master, and clients which contact the master to find out what they need to do, thus pulling their configuration information from the master. In Puppet, configuration files are called manifests. They are written in a specific language and they are declarative, i.e. they tell each client what to do, not how to do it. The Puppet client software running on each server knows how to interpret the manifest files and how to translate them into actions specific to the operating system of that server. For example, you specify in your manifest file that you want a user created and you don’t need to say ‘run the adduser command on server X’. Other examples of ‘pull’ deployment/configuration management systems are bcfg2 (Python),Chef (Ruby) and slack (Perl). A newcomer in the Python world is a port of Chef called kokki (it looks like it’s very much in its infancy still, but I hope the author will continue to actively develop it).

The main advantages of a ‘pull’ system are:

  • full automation capabilities: it is possible, and indeed advisable, to fully automate the configuration of a newly booted server using a ‘pull’ deployment system (for details on how I’ve done it with Puppet, see this post)
  • increased scalability: in a ‘pull’ system, clients contact the server independently of each other, so the system as a whole is more scalable than a ‘push’ system

The main disadvantages of a ‘pull’ system are:

  • proprietary configuration management language: with the notable exception of Chef, which uses pure Ruby for its configuration ‘recipes’, most other pull system use their own proprietary way of specifying the configuration to be deployed (Puppet’s language looks like a cross between Perl and Ruby, while bcfg2 uses…gasp…XML); this turns out to be a pretty big drawback, because if you’re not using the system on a daily basis, you’re guaranteed to forget it (as happened to me with Puppet)
  • scalability is still an issue: unless you deploy several master servers and keep them in sync, that one master will start getting swamped as you add more and more clients and thus will become your bottleneck

My particular preference is to use a ‘pull’ system for the initial configuration of a server, including all the packages necessary to deploy my application (for example tornado). For the actual application code deployment, I prefer to use a ‘push’ system, because it gives me more control over how exactly I do the deployment. I can take a server out of the load balancer, deploy, test, then put it back, rinse and repeat.

In discussions with Holger Krekel at PyCon, I realized that execnet might be a good replacement for Fabric for my needs. It already provides remote command execution via ssh, and an rsync-like file transfer protocol. All it needs is a small library of functions on top to do common system administration tasks such as running commands as sudo, etc. I also want to look into kokki as a replacement for Puppet in my deployment architecture.

A parting thought: my colleague Dan Mesh suggested using a queuing mechanism for the client-server protocol in a ‘pull’ system. In fact, I am becoming more and more convinced that as far as scalability is concerned, when in doubt, use a queuing mechanism. In this deployment architecture, the master would post tasks to be done by a specific client to a central queue. The client would check the queue periodically for a task assigned to it, would execute it then would send a report back to the server when done. Of course, you need to worry about authentication in this scenario, but it seems that it would solve a lot of the scalability issues that both push and pull systems exhibit. Who knows, we may build it at Evite and open source it…so stay tuned ;-)


Tags:
By Jeffrey on March 5, 2010 | MySQL | A comment?

FromDual – The MySQL consulting company goes operational today!

Hello everybody,

One month earlier than planned we have the great pleasure to announce you that the company called FromDual goes operational today!

We are excited about this step and it is an new era in our personal evolution to get back in full-contact with customers and solve their real life day-to-day MySQL problems.

So we are happy hearing from you and to help you solving your individual MySQL problems…

You can find us at FromDual or you can drop us a line.

Regards,
Oli Sennhauser (aka Shinguz)
Senior MySQL Consultant at FromDual

About FromDual

FromDual provides neutral and vendor independent MySQL consulting, training and other services around MySQL and its derivatives. The company concentrates on the individual needs of its customers and achieves, in a close co-operation the best results for their problems.

Our consultants have been working in many projects in Europe. We were involved in small start-ups, medium size enterprises and huge world wide operating top-500 companies and solved their Performance Problems, developed Architecture & Design studies with them, answered their operation questions, and reviewed their Backup/Recovery concepts.

FromDual does on-site and remote consulting, remote emergency aid and helps its customers to fill MySQL staff gaps if needed.

The company is privately owned. Its HQ is close to Zurich in Switzerland.


Tags:
By Jeffrey on March 3, 2010 | MySQL | A comment?

Getting past hung remote processes in Fabric

This is a note for myself, but maybe it will be useful to other people too.

I’ve been using Fabric version 1.0a lately, and it’s been working very well, with an important exception: when launching remote processes that get daemonized, the ‘run’ Fabric command which launches those processes hangs, and needs to be forcefully killed on the server where I run the ‘fab’ commands.

I remembered vaguely reading on the Fabric mailing list something about the ssh channel not being closed properly, so I hacked the source code (operations.py) to close the channel before waiting for the stdout/stderr capture threads to exit.

Here’s my diff:

git diff fabric/operations.py diff --git a/fabric/operations.py b/fabric/operations.pyindex 9c567c1..fe12450 100644--- a/fabric/operations.py+++ b/fabric/operations.py@@ -498,12 +498,13 @@ def _run_command(command, shell=True, pty=False, sudo=False, user=None):     # Close when done     status = channel.recv_exit_status()     +    # Close channel+    channel.close()+     # Wait for threads to exit so we aren't left with stale threads     out_thread.join()     err_thread.join()-    # Close channel-    channel.close()     # Assemble output string     out = _AttributeString("".join(capture_stdout).strip())

I realize this is a hack, but it solved my particular problem…If you’ve seen this and have found a different solution, please leave a comment.


Tags:

Plays well with others

A few years ago MySQL+memcached and PostgreSQL+memcached were the only choices for high-scale applications. That has changed with the arrival of NoSQL. Change is good. Open-source monopolies are not much better than closed-source ones from the perspective of an end user. I expect MySQL to focus much more on the needs of high-scale applications to remain relevant. I also expect it to play better with others as it is no longer the only persistent data store for high-scale applications.

I think that MySQL+memcached is still the default choice and I don’t think it is going away in the high-scale market. But some high-scale applications either don’t need all of the features of a SQL RDBMS or are willing to go without those features to scale. This isn’t a blanket endorsement of NoSQL as the definition of NoSQL is weak. I am referring to the NoSQL systems that support high-scale.

I don’t believe all of the bad press that MySQL receives from high-scale applications. I know that some problems with MySQL are self-inflicted (seriously, I know this). It is hard to diagnose many problems for which the primary symptom is a slow MySQL server so it is also hard to identify self-inflicted problems. I also don’t think that some NoSQL systems will provide a different scale-out experience than MySQL given that some NoSQL systems scale-out by sharding (just like MySQL) and that I can deploy MySQL like NoSQL (disallow joins and secondary indexes, use HANDLER statements)

I also wonder whether affordable SSD/Flash reduces the need to migrate from MySQL to NoSQL. Many MySQL deployments that were IO bound when it was difficult to get more than a few thousand IOPs on a commodity server can now get 10,000 to 100,000 IOPs in that server at commodity prices.

MySQL and NoSQL are also at significantly different stages. MySQL is mature and maturity has its benefits. MySQL has amazing support and documentation.  There are client libraries for almost every language that you should use. There are even bindings for languages you shouldn’t use. The MySQL C API is easy to use. The JDBC driver is awesome, even if support for JDBC makes it much more complex than needed. There is a lot of MySQL expertise that can be hired or rented (MySQL, Monty Program, Percona, Open Query, Pythian, FromDual) and there is some innovation (not enough companies, but they are doing amazing things) from third-parties such as InfiniDB, InfoBright and TokuDB.

What happened?

NoSQL systems are improving faster than MySQL, MySQL has focused on features for the enterprise RDBMS market in the past two releases and the changes we need from MySQL are hard to implement.  Change is hard because MySQL is a complex server that supports many features. Change is also much harder than it should be because of the MySQL coding style. Parts of it are not modular and features are entangled. Some of the difficulty could be overcome were there interest from external contributors. There are external contributors willing and able to improve server code but they are working on other projects like NoSQL. The MySQL effort is also split (or diluted) between official MySQL, Drizzle and MariaDB.

What really happened?

I don’t know. It may have been better for the business of MySQL to focus on the enterprise market. I can describe some of the problems that need to be fixed in MySQL to make things easier for me. I think other high-scale applications share these problems:

  • Multi-master – high-scale applications have users around the world. Latency is reduced by distributing databases and application servers around the world. Databases are rarely sharded by location so the data store must support multi-master deployments with conflict resolution and eventual consistency for some database tables. There is no support for conflict resolution in MySQL. It might be possible to do something with the output of row-based replication.
  • SQL – this is a problem that MySQL cannot fix. SQL makes it easy to make mistakes. Mistakes include insert, update and delete statements that lock all rows in a table. Alas, the EXPLAIN statement in MySQL does not support insert, update and delete statements. Another serious mistake is a query that has a lousy response time when the database buffer cache is cold because it does many random disk reads. The EXPLAIN statement in MySQL does not provide an estimate for the worst-case number of random disk IOs and many people who write SQL don’t know how to interpret it to get an estimate. Worst-case performance is critical for queries run during web requests.
  • Write-optimization – several NoSQL systems are write-optimized including Cassandra, HBase and Bigtable. A write-optimized system makes it possible to use more indexes than an update-in-place data store. With more indexes it is more likely that there can be an index defined for every popular query and the index reduces the number of disk reads that must be done to evaluate the query.  This improves worst-case query response time and reduces the need to use memcached or a huge database buffer cache. Write-optimization has finally arrived for MySQL with the availability of TokuDB. I hope that RethinkDB provides a GA version in the future.
  • Monitoring – without good monitoring you will either spend too much time fixing performance problems or never find them and buy too much hardware. I suspect that monitoring in MySQL is much better than anything in a NoSQL system but MySQL is missing features that make it easy to understand current and new sources of workload. I need to aggregate the overhead (CPU time, disk operations, rows read, …) by database user, table and statement. It is extremely hard or not possible to do this by database user and table. It became possible in MySQL 5.1 to do this by statement for short periods of time by using the slow query log in MySQL 5.1. Prior to MySQL 5.1, the slow query log was limited to queries that ran for at least two seconds. The alternative is to use tcpdump with mk-query-digest. Despite all of the work that has gone into the performance schema, MySQL has yet to support anything like my favorite feature — user and table monitoring.
  • Crash proof slaves – replication slaves are not crash proof. The slave commits transactions to a storage engine and then updates a state file to maintain the replication offset. There is nothing to keep the state file and storage engine in sync. Until recently there wasn’t even an option to force the state file to disk after it was updated. Unplanned hardware reboots are frequent when there are hundreds or thousands of slaves. If you are clever and use the right version of InnoDB it is possible in some cases to figure out the correct offset for the slave after a crash and repair it manually. This isn’t a good use of DBA time. Otherwise DBAs must waste their time and network bandwidth to restore the slaves. The Google patch published two different fixes for this: rpl_transaction_enabled and global transaction IDs. MySQL is working on a fix.
  • Automated failover – for MySQL deployments that have many slaves connected to one master it isn’t possible to automate failover when a master crashes. Tungsten and DRBD might make this better. 
  • Resharding – sharding is an excellent way to scale MySQL. Sharding usually requires resharding. Resharding is hard and must be done with minimal downtime. It might be possible to build a tool that uses row-based replication output to reshard a database in the background with little downtime. No such tool exists today.
  • Replication lag – a slave with replication lag is useless for OLTP scaleout. The replication thread is single-threaded. MySQL is working on support for parallel execution on a slave. Until then we need to improve mk-slave-prefetch (Domas can you hear me).
  • Schema change – these are frequently needed for growing high-scale applications. Long running schema changes in MySQL require downtime unless first done on a slave (assuming you have a spare slave and that slave can become the master after the change). Users don’t like downtime. I think it is possible to do many of these on a master with minimal downtime using the output from row-based replication. Alas, there is no tool for that today.

What is NoSQL?

Do your homework when evaluating a NoSQL system as they differ greatly from each other:

  • Crash safety – most NoSQL systems are crash safe but a few are not. I would limit the use of systems that are not crash safe to supporting batch workload. Unplanned server reboots are frequent for high-scale applications when a large number of servers is used. At least two prominent members of the NoSQL family are not crash safe. That should be documented in bold text on their project pages. It is not.
  • Sharding – some NoSQL systems do sharding. BigTable and others do not. With sharding it is possible to support transactions and multiple-indexes on a table within the scope of a shard. That then requires support for resharding. It also requires that queries on secondary indexes to be run on all shards while queries on primary indexes can be limited to run on one shard which may limit the ability to use secondary indexes.
  • Index types – Many NoSQL systems are limited to hash indexes. You can’t do range scans on hash indexes. I wonder whether this leads to data redundancy when every query must be resolved by one index lookup.
  • Secondary indexes – NoSQL systems like BigTable not only do not support transactions, they also do not support secondary indexes. You can explicitly maintain a secondary index but there is no support to make multiple-changes atomic and there can be a failure between the primary and secondary index updates which results in data drift. It is also difficult to do consistent reads between the two.
  • Consistent reads – consistency is usually the responsibility of the client and done by using per-row timestamps.
  • Single-node performance – I know that performance != scale-out but scale-out is not a substitute for lousy single-node performance in the high-scale application market. It might be acceptable to use 5X as many nodes because your data store is slow when you end up using 40 nodes. This becomes a show-stopper when you end up using thousands of nodes. One NoSQL system has accepted this compromise. While I know it has many other use cases I think that will limit the use of it for high-scale applications.
  • Network efficiency – MySQL reduces use of the network because all query evaluation is done at the server. All NoSQL systems evaluate predicates implied by the index access. Only some NoSQL systems evaluate non-indexed predicates. This can result in more data returned to the client.
  • Technology or solution – MySQL is more mature than the NoSQL systems. A lot of work remains to grow NoSQL from technology into solution with support for audit, backup, monitoring and all of the other things required to scale in a large company.

Tags:

ReadyBoost

I didn’t know that Windows has a similar technology to ZFS L2ARC which is called ReadyBoost. Nice.

I’m building my new home NAS server and I’m currently seriously considering putting OS on an USB pen drive leaving all sata disks for data only. It looks like with modern USB drives OS should actually boot faster than from a sata disks thanks to much better seek times. I’m planning on doing some experiments first.

Tags:
By Jeffrey on March 1, 2010 | MySQL | A comment?

Use HAProxy 1.4 if you need MySQL health checks

I was using the latest version of HAProxy 1.3 and was load balancing backend MySQL servers while also checking their ports, so if one server went down it would be taken out of the load balancing pool. However, since the port checks in HAProxy happen at the TCP level, the MySQL instance which was being hit by the port checks wasn’t happy, because it wasn’t a proper MySQL connection. As a result, after a number of some checks, MySQL refused to allow clients to connect, with a message like this:

OperationalError: (1129, “Host ‘myhost’ is blocked because of many connection errors; unblock with ‘mysqladmin flush-hosts’”)

Solution: upgrade HAProxy to the newly released version 1.4 (at the date of this writing, the exact version is 1.4.0). Documentation is here.

For MySQL specific checks, you can specify ‘option mysql-check’ in a backend or ‘listen’ section of the configuration file. For example, I have something similar to this in my HAProxy configuration file:


listen mysql 0.0.0.0:3306
mode tcp
option mysql-check
balance roundrobin
server mysql1 10.0.1.1:3306 check port 3306
server mysql2 10.0.1.2:3306 check port 3306 backup



Tags:

Embedded PBXT is Cool

Martin Scholl (@zeit_geist) has started a new project based on the PBXT storage engine: EPBXT – Embedded PBXT! In his first blog he describes how you can easily build the latest version: Building Embedded PBXT from bzr.

The interesting thing about this project is that it exposes the “raw” power of the engine. Some basic performance tests show this really is the case.

At the lowest level, PBXT does not impose any format on the data stored in tables and indexes. When running as a MySQL storage engine it uses the MySQL native row and index formats. Theoretically it would be possible to expose this in an embedded API. The work Martin is doing goes in at this level. The wrapper around the engine determines the data types, data sizes, row and index format. Comparison operations for the data types are also supplied by the embedded code or user program.

This flexibility will make it possible for an application to store its own data very efficiently. As Martin suggested, it would also be possible to use Google’s protobuf’s for the row format. This would eliminate the need to use an ALTER TABLE for many types of changes to a table’s definition!

Of course, EPBXT is still a way from realizing this vision, and Martin has some very specific problems he wants to solve with the development. However, judging by his command of the code within such a short time, this is going to be a project to watch in the future!


Tags:
By Jeffrey on February 27, 2010 | MySQL | A comment?

ZVOLs – Write Cache

When you create a ZFS volume its write cache is disabled by default meaning that all writes to the volume will be synchronous. Sometimes it might be handy though to be able to enable a write cache for a particular zvol. I wrote a small C program which allows you to check if WC is enabled or not. It also allows you to enable or disable write cache for a specified zvol.

First lets check if write cache is disabled for a zvol rpool/iscsi/vol1

milek@r600:~/progs# ./zvol_wce /dev/zvol/rdsk/rpool/iscsi/vol1Write Cache: disabled

Now lets issue 1000 writes

milek@r600:~/progs# ptime ./sync_file_create_loop /dev/zvol/rdsk/rpool/iscsi/vol1 1000real 12.013566363user 0.003144874sys 0.104826470

So it took 12s and I also confirmed that writes were actually being issued to a disk drive. Lets enable write cache now and repeat 1000 writes
milek@r600:~/progs# ./zvol_wce /dev/zvol/rdsk/rpool/iscsi/vol1 1milek@r600:~/progs# ./zvol_wce /dev/zvol/rdsk/rpool/iscsi/vol1Write Cache: enabledmilek@r600:~/progs# ptime ./sync_file_create_loop /dev/zvol/rdsk/rpool/iscsi/vol1 1000real 0.239360231user 0.000949655sys 0.019019552

Worked fine.

The zvol_wce program is not idiot-proof and it doesn’t check if operation succeeded or not. You should be able to compile it by issuing: gcc -o zvol_wce zwol_wce.c

milek@r600:~/progs# cat zvol_wce.c/* Robert Milkowski http://milek.blogspot.com*/#include <unistd.h>#include <stropts.h>#include <sys/types.h>#include <sys/stat.h>#include <fcntl.h>#include <stropts.h>#include <sys/dkio.h>int main(int argc, char **argv){ char *path; int wce = 0; int rc; int fd; path = argv[1]; if ((fd = open(path, O_RDONLY|O_LARGEFILE)) == -1) exit(2); if (argc>2) { wce = atoi(argv[2]) ? 1 : 0; rc = ioctl(fd, DKIOCSETWCE, &wce); } else { rc = ioctl(fd, DKIOCGETWCE, &wce); printf("Write Cache: %s\n", wce ? "enabled" : "disabled"); } close(fd); exit(0);}

Tags:

NoSQL explained to DBAs

Brian Aker, a brilliant helpful duder, who I learn a lot from. Gives a great talk about what is NoSQL explained in a way for database guys. I warn you, there are some points in this video where you can’t hear Brian due to the audience “participation” but you should get the content.


Tags: