hn-classics/_stories/2010/11814002.md


created_at: 2016-06-01T13:02:24.000Z
title: When should you store serialized objects in the database? (2010)
url: https://www.percona.com/blog/2010/01/21/when-should-you-store-serialized-objects-in-the-database/
author: harshasrinivas
points: 54
num_comments: 68
created_at_i: 1464786144
_tags: story, author_harshasrinivas, story_11814002
objectID: 11814002
year: 2010

Source

When should you store serialized objects in the database? - Percona Database Performance Blog


Percona Database Performance Blog

When should you store serialized objects in the database?

Morgan Tocker  | January 21, 2010 |  Posted In: Insight for Developers, MySQL

A while back, FriendFeed posted a blog post explaining how they changed from storing data in MySQL columns to serializing data and just storing it inside TEXT/BLOB columns. It seems that since then the technique has gotten more popular, with Ruby gems now available to do this for you automatically.

So when is it a good idea to use this technique?

If the application really is schema-less and has a lot of optional parameters that do not appear in every record, serializing the data in one column can be a better idea than having many extra columns that are NULL. The restriction on this is that searching on these columns now becomes more difficult.¹ A good example of this optional nature of data is user preferences: you only really need to store the settings that differ from the default values.

The second situation where I can see this technique making sense is when a large percentage of the data is actually modified each time you update the TEXT/BLOB. If you have to write back the full blob for one small change, it is expensive.

Another potential pro for this technique is that ALTER TABLE commands are no longer required. Until the InnoDB plugin in 5.1, simple operations like adding a secondary index on an InnoDB table required the whole table to be rebuilt (now just the index is built). I don't really buy this pro, since using master-slave switching I've always been able to emulate online index adds. Projects like MMM and Flipper make this even easier.

In FriendFeed's case, they also used the serialization to be able to compress the objects. From the InnoDB plugin in 5.1 onwards, this is available natively.

¹ FriendFeed solved this indexing problem by creating separate index tables.
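To make both ideas concrete, here is a minimal sketch of a serialized preferences blob plus a separate index table for the one attribute that still needs to be searchable. It uses Python with sqlite3 standing in for MySQL and JSON as the serialization format; the table and column names are illustrative, not taken from the post or from FriendFeed's schema.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- one row per user; only the settings that differ from the defaults are serialized
    CREATE TABLE user_prefs (
        user_id INTEGER PRIMARY KEY,
        prefs   TEXT NOT NULL           -- JSON blob of optional settings
    );
    -- separate index table (the FriendFeed approach) so one attribute
    -- stays searchable without needing a column per setting
    CREATE TABLE idx_prefs_theme (
        theme   TEXT NOT NULL,
        user_id INTEGER NOT NULL,
        PRIMARY KEY (theme, user_id)
    );
""")

def save_prefs(user_id, prefs):
    """Serialize the non-default settings and maintain the index table."""
    conn.execute("REPLACE INTO user_prefs (user_id, prefs) VALUES (?, ?)",
                 (user_id, json.dumps(prefs)))
    conn.execute("DELETE FROM idx_prefs_theme WHERE user_id = ?", (user_id,))
    if "theme" in prefs:  # only index the attributes you actually search on
        conn.execute("INSERT INTO idx_prefs_theme (theme, user_id) VALUES (?, ?)",
                     (prefs["theme"], user_id))
    conn.commit()

def load_prefs(user_id, defaults):
    """Merge the stored overrides on top of the application defaults."""
    row = conn.execute("SELECT prefs FROM user_prefs WHERE user_id = ?",
                       (user_id,)).fetchone()
    return {**defaults, **json.loads(row[0])} if row else dict(defaults)

save_prefs(42, {"theme": "dark", "emails_per_day": 0})
print(load_prefs(42, {"theme": "light", "emails_per_day": 5, "timezone": "UTC"}))
```

Searches on the indexed attribute go against idx_prefs_theme, so the database never has to parse the blob; everything else rides along inside the JSON.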

What are the downsides?

I would say that the first serious downside is write amplification. If you are constantly making small updates to one piece of data in a very large blob, the effort MySQL has to go to is greatly increased.

The second downside I would mention is that this pattern tends to force you to read/write larger amounts of data at once. We've all been trained to know that SELECT * is bad. This creates a pattern where SELECT * is not optional. In practical terms, I would expect this to increase the size of the working set, since more data must be kept in the buffer pool.

The next obvious restriction is that there is a clear loss in functionality. You can no longer easily perform aggregation functions on the data (MIN, MAX, AVG). You are storing the data in a way that is very specific to one application, and you can't just point a BI tool at it for processing.

It can become difficult to apply even the simplest constraints on the data, such as a maximum character length, that an age must be a number, or that the age must be unsigned. MySQL doesn't go as far as having check constraints, but what it has is better than nothing.

Some smaller issues I have are that, if you are using a standard serialization format like JSON, it's difficult to store pieces of data like numbers or IP addresses in their most efficient form, and that technically this design breaks 1NF.

**What would I recommend?**

Josh Berkus (of PgExperts) calls this serialization E-blob. Much like EAV, he criticizes this as one of 10 ways to wreck your database (slides). I tend to side with Josh. I'm optimistic that this design has its place in a smaller component of your application, but I'm wary every time I hear someone decide to switch to it exclusively.

Before making such decisions for yourself, I would recommend reading Peter's KISS KISS KISS post.

Morgan Tocker

Morgan is a former Percona employee. He was the Director of Training at Percona. He was formerly a Technical Instructor for MySQL and Sun Microsystems. He has also previously worked in the MySQL Support Team, and provided DRBD support.

52 Comments

January 21, 2010 at 3:02 pm

A few comments…

> The next obvious restriction is that there is a clear loss in functionality.

You can easily run aggregations on the index tables if you like. Or, you can aggregate in process in the background and cache the values. I haven't heard of many people running aggregation queries at web scale.

> I don't really buy this pro, since using master-slave switching I've always been able to emulate online index adds.

If you actually run operations for any decent sized installation (or even read the FriendFeed article), you know that this process is extremely tedious and error prone. It just isn't worth it if you have a lot of tables that you're often modifying.

Sure, this is purportedly fixed in 5.1, but at least in our case, 5.1 is about 30% slower than 5.0. So, that's not a great solution.

Those are a few things I picked out of the article. One convenient omission from your article is the fact that companies like Facebook and Twitter do exactly this. Especially in Facebook's case, this strategy has been key to their scalability.

Treating MySQL like a key value store makes it dead simple to cache your objects. My Friendly library that you linked to will read and write through to cache without any extra effort. We see a 99.9% cache hit rate in production.

In practice, that means that even if the queries through to MySQL are somewhat slower, the machine is serving so many fewer queries that it's hardly an issue.

There's this DBA attitude that you should always write your applications to conform to the performance characteristics of the database. It stinks of a lack of understanding of how code actually gets written. It's like DBAs aren't aware that anything other than the database even exists.

Sometimes (actually, almost always) there are good reasons to trade a little performance for a lot of convenience.

Reply

January 21, 2010 at 3:05 pm

Oh, I should probably say that the database that we're doing this with has about 20GB of data growth per week, serves around 70 req/s at peak, and has shown absolutely stable performance characteristics since we deployed it.

Reply

January 21, 2010 at 3:06 pm

My first comment seems to have gone into moderation…

Reply

January 21, 2010 at 3:15 pm

James, I don't dispute that you are getting those sorts of numbers.

A better question would be: could you have achieved the same without doing it? The problem with going to too much complexity (as in the link to Peter's KISS KISS KISS post) is the operational overhead that can come with it.

Reply

January 21, 2010 at 3:27 pm

I submitted a much longer comment about it (before the other two), but for some reason, it's not here.

You claim that serializing objects is complex, but Friendly's code base is an order of magnitude smaller than most ORMs I've seen. Without its tests, it's 1200 lines. That's not a lot of code.

According to you, playing the operations hokey pokey to build indexes and add fields offline is simple? That's a manual process and it's error prone as hell. I should know, I've screwed it up.

Reply

  • bob says:

January 21, 2010 at 3:49 pm

If you don't need a database, why are you using a database?

Reply

January 21, 2010 at 3:54 pm

The same reason a lot of companies (Facebook, Twitter, FriendFeed, etc.) use this pattern. At this point, MySQL is stable and predictable. We know how to run it, scale it, perform and recover backups, etc.

We did have MongoDB in production for a while, but ran into problems with it as our dataset grew. It was fast as hell at the beginning, but got brutally slow near the end. We have no such problems with MySQL.

Reply

  • Jonny Armstrong says:

January 21, 2010 at 4:09 pm

@bob that's a really silly question. They are storing data that still requires some indexing. I think most people still use databases for storing indexed data.

Reply

  • Mark Callaghan says:

January 21, 2010 at 4:25 pm

Somewhat related to this, MySQL needs a storage engine that is crash safe and good at storing blobs. InnoDB is not it. Maybe PBXT is.

Reply

January 21, 2010 at 5:55 pm

Mark, I'd be curious what features you think are missing for InnoDB to store blobs better?

Reply

  • Mark Callaghan says:

January 21, 2010 at 6:01 pm

Morgan, the overflow pages are not shared. When a column must be spilled to overflow pages, it has exclusive use of those pages. Data from other columns in the same row and from other columns in other rows cannot use that space. This will waste a lot of space unless your average BLOB column size is many multiples of the page size. For large BLOB columns, too much IO will be wasted because the BLOB is written to the transaction log and the database file. It is better to avoid the transaction log in that case.

Reply

January 21, 2010 at 6:26 pm

No matter how you look at it, storing non-relational data in a relational database is a hack and nothing more. The solution is using a system designed for non-relational data. If it fails, then fix it.

Reply

January 21, 2010 at 6:51 pm

Hi,

Right, a lot of people do exactly this… because they know no other solution than MySQL, or trust it enough.

I think there are plenty of cases when storing serialized data makes sense, but I would view it as an EXCEPTION, for when "normal" database usage is not feasible for some reason.

The important requirement for storing serialized data is that it should be dealt with by the application, and you have to ensure that, if there are multiple applications, their serialization is really compatible. Manually messing with serialized data is very error prone.

It also helps when you typically need to access all (or most) of the serialized data at the same time. If you serialize a 1MB blob but you mostly need 50 bytes out of it, it will be a big loss. The same applies for writes: changing 1 byte in a blob is costly, as the whole blob has to be completely replaced.

Interestingly enough, the reverse is true as well. If you need to store 10000 IDs, retrieving a single blob which holds them will be a lot faster than reading 10000 rows, even if they only contain 1 column. It is even more true for writes: if you need to modify, say, 5000 of these 10000 rows, it will be orders of magnitude faster to change the blob.

This is why, for example, if sessions are stored in MySQL (which is typically a bad idea to start with), they are stored as blobs: you need the majority of the session data at once, and a large portion of it may change between requests.

Reply

January 21, 2010 at 8:14 pm

@Jonny Key-value store NoSQL databases don't have indexes, but MongoDB does:
http://www.mongodb.org/display/DOCS/Indexes

I don't know enough about MongoDB to guess at why it was slow for James. I don't believe one of the design goals of MongoDB was that all data needed to fit in memory. The wiki suggests it has a use case for analytics: http://www.mongodb.org/display/DOCS/Use+Cases

Reply

January 21, 2010 at 8:18 pm

I have notes on a handful of clients we've helped with situations that fall on both sides of the "continental divide" between the E-blob solution and "normal" relational usage. One of them, for example, was seeing terrible performance. When we dug into it, the issue was exactly what Peter mentions: big string, changing 1 or 2 characters, writing and reading the whole thing. I indexed it in my notes file under "absurd and ridiculous" because of the weird way they were working with the data. The solution? Not what I thought it would be: compress it before stuffing it into the database. Different strokes for different folks.

70 requests per second is not a lot in the most general terms, and 20GB/week doesn't mean much to me unless I know things like the total data size, schema, clustering characteristics, and nature of the queries. Without knowing much about James's system it's a total guess, but I'd be surprised if there's really only one way to accomplish the goal under that kind of workload. I don't dispute that it works for him, though.

I wonder if the MongoDB folks are tuned into the cases in which performance can drop off a cliff, as James wrote. I don't know much about MongoDB myself, so I wouldn't know why that is, but the classic problem is something like a b-tree not fitting into memory anymore.

Reply

January 21, 2010 at 9:43 pm

70 req/s isn't a lot. I just wanted to give a (very general) sense of what our environment and workload look like. My point is that using MySQL like a key/value store makes caching incredibly easy. So, there's actually very little read load imposed on the machine, and performance stays extremely consistent, even as the dataset grows far beyond available memory.

As you say, Baron, there are almost certainly other ways to accomplish this. But, for our purposes, the tradeoffs (flexible schema even with huge tables and dead simple caching, in exchange for somewhat reduced write performance) made perfect sense.

It's just a bit frustrating to read a "don't ever do this" post about this topic. I think there's a real tendency for DBAs to become so focused on the data store's esoteric performance characteristics that they miss the bigger picture. There are a lot of other, sometimes more important concerns when building real applications.

In our case, we gained back whatever write performance we lost 10-fold with easy caching.

Reply

January 21, 2010 at 10:06 pm

James, to be perfectly clear, I didn't say "don't ever do this". Our blog is about MySQL performance, and my aim was to inform readers of the measurable net wins/losses 🙂

I even said that I was optimistic that E-blob could be used efficiently in some parts of your application (if your application really is schema-less, or if you modify a large percentage of the data on each update).

A lot of applications are read heavy. I'm glad you found an easy way to leverage this.

Reply

January 22, 2010 at 1:43 am

Hi!

I've been reading the article and the discussion with great interest, and I am curious what serialization formats people are using. In particular @James: what format are you using? How large is a typical serialized string? What do you use in the application to serialize and deserialize? Do you use compression too?

I realize this blog and post are mostly about databases and database performance, so I hope that you don't feel I'm hijacking the thread. The reason I'm asking is mostly that I have seen some instances of serialization that were really expensive for the application, and I'm wondering what other people have found.

Cheers,

Roland

Reply

January 22, 2010 at 1:47 am

@Morgan: you are right, you didn't say not to use it.

We do use this technique sparingly, and it really does offer a huge gain in places where it literally is just meta-data. We classify it as data that will never be indexed or searched on.

It is also an extremely useful technique for rapid development at the start of a project, where you don't really know precisely what fields you need. If you use an ORM or any custom object wrapper around your database, it is easy to later refactor a field out of the blob and make it a first-class column citizen.

There is a trade-off without a doubt. Like many of the suggestions this blog makes, you have to carefully weigh up what works for your application, and not assume that just because someone said it was GREAT or POOR you have to follow suit.

Reply

January 22, 2010 at 7:23 am

I'm currently using this method with a few projects. I'm using PHP, and have a model called KV that I've built to access the data, stored as JSON. The model also reads and writes from memcached. It has 3 simple functions: get(key), set(key, value), and delete(key). In the set() function, it also checks if the value is an object, and if so, JSON-encodes it. And there's one more function called get_object(key) that simply JSON-decodes the value before returning it as an object.

As far as "blob size", I think some common sense should be used. I haven't done tests to figure out how much data you can store before taking a performance hit, but I imagine it's related to MySQL's block size, which I'm sure I'm under. Most JSON strings for these smaller applications aren't more than a few hundred characters. If they were larger, it might make sense to break them up with prefixes. But like James said above, going to the database less for reads makes the trade-off worth it in most of my applications.

Anyway, this allows me, as a PHP programmer, to write code insanely fast, and will allow my projects to scale and be ported easily in the future.
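For readers who don't use PHP, here is a rough sketch of the kind of key/value wrapper described above, in Python; an in-process dict stands in for memcached and sqlite3 stands in for MySQL, and the class and method names are illustrative rather than taken from the commenter's code.

```python
import json
import sqlite3

class KVStore:
    """Key/value wrapper over a two-column table, with write-through caching."""

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT NOT NULL)")
        self.cache = {}  # stand-in for memcached

    def set(self, key, value):
        blob = json.dumps(value)                  # objects are JSON-encoded
        self.db.execute("REPLACE INTO kv (k, v) VALUES (?, ?)", (key, blob))
        self.db.commit()
        self.cache[key] = blob                    # write through to the cache

    def get(self, key):
        blob = self.cache.get(key)
        if blob is None:                          # cache miss: fall back to the database
            row = self.db.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
            if row is None:
                return None
            blob = row[0]
            self.cache[key] = blob
        return blob

    def get_object(self, key):
        blob = self.get(key)
        return None if blob is None else json.loads(blob)

    def delete(self, key):
        self.db.execute("DELETE FROM kv WHERE k = ?", (key,))
        self.db.commit()
        self.cache.pop(key, None)

store = KVStore()
store.set("user:1:settings", {"layout": "two-column", "beta": True})
print(store.get_object("user:1:settings"))
```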

Reply

  • Robin says:

January 22, 2010 at 7:24 am

I agree with your post for most use cases. However, I've recently been working on a project where we had to deal with lots of "schema less" data, or better said, a mixture of multiple highly dynamic schemas. We've looked at the various key-value stores out there but didn't really feel happy and comfortable with any of them (the reasons for that being a bit too much to explain here). Instead we decided to store the data as XML blobs, and we do that in MySQL not because it makes sense but because our Site Operations department felt more comfortable with it. 😐 To deal with indexes (and aggregate functions etc.) we index (but do not store) the documents in a Solr index. This provides great flexibility for searching, and for full-text searches it does a whole lot better than MySQL's Full-Text Search. We now have faceted search, full-text search, aggregate functions, and a search engine that is easy to extend with plugins, and we let MySQL store and replicate the data. For our specific use case this works excellently. We now store more than 100 million documents without any hassle, and our customers have already changed their data schema several times without us having to worry about it.

I'm working on an article about our approach, but it may take some time still before it appears online.

Great article though; everyone should be fully aware of all the ups and downs before choosing an alternative (possibly very dangerous) route to solve their problems.

Reply

January 22, 2010 at 7:56 am

I think there is too much thought going into whether the technique is good or bad, and not enough focus on when to use serialized data in tables and when not to. In my opinion, this is another technique that, when used wisely, can be very powerful and, if used wrongly, disastrous.

WordPress has been using this technique for many years now. Though I'm not 100% in agreement with the way WordPress uses the technique, it does have advantages, depending on what your application needs the data for.

For example, if you have a web site where users can set specific account settings, and you know these settings values are never going to need to be searched using SQL queries, then you may want to consider the flexibility of storing these settings serialized into a TEXT or BLOB column. Combine this with associative arrays in PHP and you've got yourself a really powerful combination. One big advantage is that if you add or remove a setting, you don't have to add/remove a column from your database, because everything is stored in one field.

Here is my recommendation. If you do this, store your TEXT/BLOB into a separate database table with one other column indexed to map to another table. For example, have a table of accounts, make another table called account_settings that has the account_id and the account_settings, where account_id is INDEXED and account_settings is a TEXT.

One other important note: inserts/updates with TEXT/BLOBs are slower, so only do this if the design of your site will have a lot more selects of this data than inserts/updates. Usually for account settings the user only changes them every so often; most of the time the settings are simply being retrieved.

My take on this? Add it to the tool belt, use the technique if it makes sense, avoid it otherwise.

Reply

January 22, 2010 at 10:09 am

Hi Morgan,

Great topic. In addition to the pros and cons already listed for eBlobs, here are three more to consider:

Pro: Modeling issues. Blobs are sometimes necessary to store data that can't be effectively modeled in SQL. Replication and messaging systems often contain data that can't expand easily into tables with fixed types.

Con: Debugging. If something goes wrong, it's hard or even impossible to look directly at the data at the SQL level. This is a big issue for debugging production systems, which is where your nastiest problems tend to arise.

Con: Upgrades. Unless you have a very disciplined approach to serialization, application upgrades can make old data unreadable. You will at that point yearn sorely for SQL schema evolution, which may be inconvenient but tends to get things right. Only choose serialization that addresses this problem.

As previous posters have indicated, there's no right choice for everyone. You have to weigh the pros and cons very carefully. This can be hard because some of the cons happen far in the future (i.e., after next week), hence are hard to judge. However, they can be business-threatening when they do occur.

Reply

January 22, 2010 at 10:35 am

I somewhat frequently recommend this option (for game objects with potentially hundreds of characteristics and/or parameters), but with two caveats: the BLOBs must be serialized in a portable way (protobuf is a good thing here) and they must be versioned. Then you must write code to upgrade v1 objects to v2, etc., and it's very easy to migrate an object from v1 to v2 to v3 to v4 (latest) at read time. Then you go for a model of "read any version, write newest."
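A minimal sketch of that "read any version, write newest" model, using JSON for readability where the comment suggests protobuf; the version numbers, field names, and upgrade steps are hypothetical.

```python
import json

CURRENT_VERSION = 3

def upgrade_v1_to_v2(obj):
    obj["display_name"] = obj.pop("name", "")   # v2 renamed a field
    return obj

def upgrade_v2_to_v3(obj):
    obj.setdefault("inventory", [])             # v3 added a field with a default
    return obj

UPGRADES = {1: upgrade_v1_to_v2, 2: upgrade_v2_to_v3}

def load(blob):
    """Deserialize any version and walk it forward to the current one at read time."""
    record = json.loads(blob)
    version = record.get("_v", 1)
    while version < CURRENT_VERSION:
        record = UPGRADES[version](record)
        version += 1
    record["_v"] = version
    return record

def dump(record):
    """Always persist at the newest version."""
    record["_v"] = CURRENT_VERSION
    return json.dumps(record)

old_blob = json.dumps({"_v": 1, "name": "gnome", "hp": 12})
obj = load(old_blob)    # upgraded in memory: v1 -> v2 -> v3
new_blob = dump(obj)    # written back as the latest version
print(obj)
```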

Reply

January 22, 2010 at 11:00 am

@Roland I don't think you're hijacking. I'm going to extract from what Mark Callaghan posted and say the edge case you don't want to fit in is where blobs are located off-page, but cannot fill up a significant amount of that page. So, say, the range of ~370 bytes to ~14K? This is going to be version specific, with Barracuda offering fully off-page storage.

@Alan The ORM should be able to create columns as easily as it does serialized blobs. YMMV, but I'm a little worried that, Broken Window Theory being what it is, the prototype never gets converted to columns. The problem when you do choose to convert is that you can't easily get statistics such as how often a parameter is specified or what its average length is. All meaningful use of existing tools (like PROCEDURE ANALYSE()) is lost.

@Robin Great to see you're in one of the "when is it a good idea" cases.

@Angelo That indexing technique is described pretty well in the FriendFeed link in my article. WordPress is a completely different kettle of fish. They don't want performance, they want it to run anywhere. There are lots of optimizations possible with WordPress that will probably never be included. I don't think we should necessarily look to them when designing our own applications.

@Robert thanks for your comments 😉

Josh Berkus cited upgrades as one of the main reasons as well. Imagine that you had some application error which caused an unindexed column to store the word broccoli as brocoli [sic]. For me that's a very easy UPDATE statement that I can run transactionally, at the cost of a tablescan plus whatever I update. For the E-blob users that's a script that pulls everything out, examines it, and sends some data back.
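A hedged sketch of that contrast, with sqlite3 standing in for MySQL and illustrative table names: with a real column the fix is a single statement, while with an E-blob every row has to be pulled out, decoded, examined, and written back.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items_relational (id INTEGER PRIMARY KEY, vegetable TEXT)")
conn.execute("CREATE TABLE items_eblob (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO items_relational VALUES (1, 'brocoli')")
conn.execute("INSERT INTO items_eblob VALUES (1, ?)",
             (json.dumps({"vegetable": "brocoli"}),))

# Relational column: one transactional statement, at the cost of a table scan.
conn.execute("UPDATE items_relational SET vegetable = 'broccoli' "
             "WHERE vegetable = 'brocoli'")

# E-blob: pull everything out, examine it, and send the corrected data back.
for row_id, body in conn.execute("SELECT id, body FROM items_eblob").fetchall():
    obj = json.loads(body)
    if obj.get("vegetable") == "brocoli":
        obj["vegetable"] = "broccoli"
        conn.execute("UPDATE items_eblob SET body = ? WHERE id = ?",
                     (json.dumps(obj), row_id))
conn.commit()
```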

On debugging: particularly with compressed serialized objects, I lose all the tools I'm used to (the mysql CLI, etc.). I then have to create new tools (which is why I believe this technique has a higher operational cost, all other things being equal).

On modeling: I'd say I probably hinted at that in saying "if the application really is schema-less…", but thank you for the two additional examples. Replication and messaging systems are both interesting in that they are so specific.

I think one of the common mistakes people make is that they don't interpret the mixed requirements of their application. They use the justification that it needs to be this way for the messaging system to convert everything over to E-blob. That's a mistake. My message here is that E-blob is okay, but it's the EXCEPTION to the rule, and I question it every time I see a complete system designed that way.

Reply

  • Steven Roussey says:

January 22, 2010 at 11:26 am

I typically use a mixed form of this. Anything that we may want to query goes into columns, and the rest goes into a blob. However, that is tempered by the use case of the data, typical query loads, etc. Sometimes production systems have characteristics that require non-intuitive solutions. Well, until your intuition changes from your experience…

As for compression: it is typically better done on the application end. If you have 20 web servers for each DB server, you can spread the compression load over 20 servers rather than having one DB server do it. I proposed years ago to Monty to have compressed columns and change the client-server API to recognize them, such that the client API would do the decompression. This would also mean less network traffic. And the server could do it if the client version was old.

Unfortunately, MySQL's COMPRESS()/UNCOMPRESS() put a length header in front of the data, so you can't use them out of the box and have them interoperate with an app that is doing its own compression, unless the app adds the length in front. You can create a stored procedure to tease the data out, but it is a pain when using a CLI. If getting data from the mysql CLI is important, the app should add this header info so you can use UNCOMPRESS() in your queries easily.
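For illustration, a small sketch of application-side compression that stays readable by UNCOMPRESS() on the SQL side, following the documented COMPRESS() format (a 4-byte little-endian uncompressed length followed by the zlib stream); the function names are just for this example.

```python
import struct
import zlib

def mysql_compatible_compress(data: bytes) -> bytes:
    """Compress in the application but mirror MySQL's COMPRESS() layout."""
    if not data:
        return b""
    return struct.pack("<I", len(data)) + zlib.compress(data)

def mysql_compatible_uncompress(blob: bytes) -> bytes:
    if not blob:
        return b""
    (length,) = struct.unpack("<I", blob[:4])   # stated uncompressed length
    out = zlib.decompress(blob[4:])
    assert len(out) == length
    return out

blob = mysql_compatible_compress(b'{"vegetable": "broccoli"}')
print(mysql_compatible_uncompress(blob))
```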

The Drizzle folk are welcome to my suggestion to have compressed columns and integrate it at the client/server api level, especially with prepared statements.

Reply

  • Baron Schwartz says:

January 22, 2010 at 2:52 pm

Steven, I agree with you about where to do compression. Your idea about the client automatically handling it is great. Why not mention it on the Drizzle mailing list?

Reply

January 22, 2010 at 12:26 pm

E-blob is appropriate when you need variable attributes per row (which already breaks the relational paradigm), and you aren't allowed to use any NoSQL technology, and the only other option seems to be an EAV design.

E-blob is still a square peg in the round hole of an RDBMS, but at least it's less evil than EAV.

Note that you don't need to put all of a table's attributes in the E-blob. You can have a set of conventional columns for attributes that are the same on every row of the table, or for attributes where you need to use indexes or constraints or data types. Only the attributes that must be variable belong in an E-blob.

Reply

January 22, 2010 at 1:13 pm

@Bill
Good point. Tungsten Replicator uses E-blob for data to be replicated, with index and metadata stored in SQL types, hence exactly what you recommend. I have used that pattern many times. Another interesting issue turns out to be [de-]serialization performance for the blobs. We use Java, which is pretty fast, but it's still a significant tax on processing. This is undoubtedly an issue for many high-performance sites, and is a good reason to go for a NoSQL solution or start figuring out how to cache more effectively within process memory. We are looking at both of these for Tungsten.

Reply

January 22, 2010 at 1:26 pm

Robert, MongoDB uses a binary representation of JSON. They call it BSON:
http://www.mongodb.org/display/DOCS/BSON

If you're talking about fast formats for MySQL, I liked Jeremy's suggestion (Google protobuf).

Reply

  • Sean says:

January 24, 2010 at 4:46 am

Yup, this is a great technique for storing user preferences, e.g. the layout of a page or enabling/disabling certain features. We just store a serialized associative array in the database for several different things on our site, and when the user logs in, we grab it, unserialize it, and store it in a session. Works great. It would be a huge pain in the ass if every time we wanted to add a new option to the set, we had to change the table structure.

As long as your data doesn't need to be indexed or searchable, this can save a ton of time, short term and long term.

Reply

January 24, 2010 at 7:44 pm

I just thought I'd mention where PostgreSQL is on this: it's less costly to have null columns in PostgreSQL than in InnoDB, so having lots of null columns is not a reason to use E-blob. Postgres is also very good at storing blobs, and compresses them for offline storage automatically. Also, with GiST, GIN, expression indexes and HStore (and the upcoming HStore-to-JSON) there are indexing options for blobs. So that's two reasons TO use them.

Reasons to use E-Blob are:

a) the data stored varies wildly per entity, over time, or between customer installations;

b) the data in the blob is almost always retrieved, read, and updated all-at-once due to the application design;

c) the data is not going to be used for aggregation, ad-hoc querying, or response-time-sensitive filtering (since cheap indexes are impossible);

d) the data does not need to be constrained or used to enforce a constraint

Reasons not to use it are:

  1. the above-mentioned update cost: update an entire 1K blob to change one value;

  2. complete inability to enforce meaningful constraints on the data, thus allowing garbage to creep into the database;

  3. high cost of blanket updates to the data which might be required by application design changes.

Generally, I only consider e-blob for non-essential data which is going to vary by installation, or for specially structured data which is infrequently updated and thus works well with special index types which work with blobs.

Reply

  • Mark Callaghan says:

January 24, 2010 at 7:52 pm

Josh, what is the cost of a null column in Postgres? Do possibly-null columns require a bit in the row header?

Reply

  • Brian Cavanagh says:

January 26, 2010 at 8:53 am

Yeah, PostgreSQL might be a good choice if you can get around the lousy replication support, as it will let you store arrays and other data types natively in fields, so you don't have to worry about an inaccessible object model.

Reply

  • Rob Wultsch says:

January 26, 2010 at 5:01 pm

Mark:
“All table rows are structured in the same way. There is a fixed-size header (occupying 23 bytes on most machines), followed by an optional null bitmap, an optional object ID field, and the user data. The header is detailed in Table 53-4. The actual user data (columns of the row) begins at the offset indicated by t_hoff, which must always be a multiple of the MAXALIGN distance for the platform. The null bitmap is only present if the HEAP_HASNULL bit is set in t_infomask. If it is present it begins just after the fixed header and occupies enough bytes to have one bit per data column (that is, t_natts bits altogether). In this list of bits, a 1 bit indicates not-null, a 0 bit is a null. When the bitmap is not present, all columns are assumed not-null. The object ID is only present if the HEAP_HASOID bit is set in t_infomask. If present, it appears just before the t_hoff boundary. Any padding needed to make t_hoff a MAXALIGN multiple will appear between the null bitmap and the object ID. (This in turn ensures that the object ID is suitably aligned.) ”
http://www.postgresql.org/docs/8.4/interactive/storage-page-layout.html

Reply

January 26, 2010 at 6:50 pm

This is completely OT and not meant to be trolling, but one of the things I've always thought Postgres could improve is making their data files less architecture-dependent.

Reply

February 3, 2010 at 5:16 pm

Hi, I'm dropping by via the High Scalability blog.

This is a neat technique.

Something similar but much simpler can be done in IBM DB2. DB2 has a built-in, indexable XML column type. There's native support for XPath and XQuery, so you can store a CLOB with arbitrary fields serialized as XML and then do SQL queries with XPath to extract arbitrary fields, run aggregation functions on them, etc. Effectively, this lets you do exactly the same thing, except without the opaqueness and with potentially higher performance depending on what you want.

This is included in the free edition (ibm.com/db2/express/)

Disclaimer: I work on the DB2 team, the opinions are my own, etc.

Reply

  • Diego says:

February 6, 2010 at 1:20 pm

Great post and discussion. IMO, unless someone is really uncomfortable with trying new dbs (which are somewhat less proven than MySQL), they shouldn't use MySQL that way. It's harder to administer than any NoSQL db, and it's just not the best tool for storing key/value data.

Reply

  • Vinay Binny says:

January 17, 2011 at 2:02 am

Another downside to this approach is refactoring. If you refactor a class whose instances are persisted in the database in serialized form, you have to take care of that, usually by writing separate programs that need to be run outside the application's deployment environment. Imagine the plight of this refactoring when instances like Hibernate proxies are serialized inadvertently.

The only way to avoid this refactoring nightmare is to store serialized instances of native language classes like Java's String, or classes that are part of stable frameworks. Value objects are OK; store entity objects at your own risk, as they may give you state synchronization problems. One tool I see that does this is Quartz (or its Spring-supported bean), which is a scheduling framework written in Java.

Of course, you can probably avoid this if your language gives you a seamless way of refactoring/manipulating classes in the serialized data by just replacing the class names, which are usually stored as plain strings rather than binary. Still, that may not cover all the refactoring cases.

Reply

February 9, 2011 at 7:40 am

I have a startup that is in the very beginning stages of engineering a website that will process data between user accounts. The database will store user data, which in most cases is static data that includes a mug-shot of each user (I'm not that worried about that)!

Where the database gets complicated is that the user accounts each have a full-figured avatar that needs to be rendered on the fly, with an ongoing second-by-second manipulation of the data for each account!

And like any good business man, I'm planning for growth. What if the website ends up with 50 million user accounts (I know, I'm dreaming, BUT)… what if?

Trust me, if Facebook had to start over and knew what they know now, do you think they would have used MySQL again?

Based on what I have been reading, a lot of you don't like PostgreSQL, but has anyone looked over the latest version, 9.0 or greater, with regard to replication?

Streaming replication: this allows one or more databases to be replicated from a master server. It's an asynchronous transfer, but very quick.

In addition to the new streaming replication, Postgres 9.0 also features a hot standby ability. This allows a secondary database to be designated as a hot standby should the primary database fail.

Another big new feature is the addition of an in-place upgrade.

My question is would PostgreSQL be the way to go with its ability to handle geometric shapes?

Edward Wittlinger

Reply

  • Anish Sahare says:

December 7, 2011 at 5:08 am

How do I save JSON data into Oracle using a procedure, and which packages and types are required to run the procedure?

Please send an example to my email address.

Regards
Anish Sahare

Reply

  • Matt Young says:

February 13, 2012 at 1:36 am

Make a version of Json/Bson and call it Sqlson. I did that here: https://github.com/Matt-Young/Semantic-Machine/wiki

I can query the internals of a Bson stream by name or structure. I can join multiple Bson streams. What makes this all possible is a direct map from vertical store format to Bson/Json. It is built on sqlite3 and designed ultimately to support joins between Bson streams and Bson web bots. Sqlson is a small-footprint row format that holds one Bson item and any sub-items it may have, recursive of course, so one gets a complete nested store. The code rapidly translates between Bson and Sqlson, but retains all the power of SQL searches in Sqlson mode. As a bonus, the code can serialize square, standard SQL tables and make them look like Bson. Open source, cut-n-paste license, guaranteed patent free, full network IO, and it supports a Lazy Json console, including remote terminals.

Matt

Reply

  • Jerome says:

June 1, 2012 at 3:22 am

Hi there,
thanks for the info.
I am serializing part of my users' extra info, which is stored in a different table (address, contacts, name, surname, etc.), as these fields are optional and this saves me adding additional columns, BUT it gives me headaches when it comes to lookups, e.g. looking up users in a specific country, which I am now doing with a LIKE on the country name.

How it works: I have 3 cols (excluding my auto increment). Col 1 is a pointer (FK) to the user, col 2 is the identifier of the next col, and col 3 holds the value. For example, col 2 will be named "address" and col 3 will contain an object holding the full address line.

Can somebody suggest a better alternative, if any? I hope my explanation makes sense.

Thanks in advance

Jerome

Reply

  • Scott F says:

August 17, 2012 at 1:00 am

Using MongoDB is always a good option "if the application really is schema-less and has a lot of optional parameters that do not appear in every record."

If you're open to different databases, it's a possibility you should look into first, before storing large serializations in MySQL.

Reply

  • SleepyBoBos says:

December 9, 2012 at 7:37 pm

I am importing a large amount of data from text into SQL using the .NET SqlBulkCopy wrapper. Some entities have an obvious parent-child (one-to-many) relationship, which, if I put my DBA hat on, results in two tables, i.e. parent and child.

However, doing a bulk import into these two tables isn't possible: after bulk importing into the parent I don't know what my foreign keys are to import into the child table, and hence can't match up any children with parents.

The solution: serialize the data that would ordinarily go into the child table (saving human-readable XML to a TEXT type field) into its own column in the parent table instead. This allows me to do my bulk import (i.e. need for speed). Obviously there is a trade-off in doing this if you want to query the db directly.

Reply

  • Ssekirime Geofrey says:

February 16, 2013 at 2:46 pm

A lot of attention has been directed toward whether to use serialization or not, and a few recommendations have been put forward, but let's all agree that serialization is a necessary evil considering the context of MySQL, which is intended for relational data storage.

I wish to give this a new angle. I understand the two most common datatypes used for serialization are arrays and objects (classes), but what is not clear to me is which of the two is more efficient in terms of the final output size and the overhead of serializing/unserializing the data. On the current project I'm working on, we opted to use objects, since documentation and code maintenance is much more straightforward with classes than arrays, and design controls can easily be enforced, which is not the case for arrays, which are loosely handled. What's your view?

Reply

  • Eric says:

March 14, 2013 at 12:09 pm

I didn't read every last comment, but it appears no one has raised the issue of conflicting writes to a serialized object. If your application allows more than one actor to read, locally update, then write the serialized object, then you can easily have scenarios where someone's write gets lost.

Say I have an array of prices for given time periods (the price fluctuates over time and needs to be managed in real time). Next Tuesday the price is set to $220. Next Wednesday the price is currently set to $200. Actor A pulls the array and updates the price for next Wednesday to $240. Actor B pulls the array in the middle of that sequence, getting the old values. Actor B updates the value for next Tuesday to $300, because that's Actor B's job. Now Actor A writes the array to the database with the new Wednesday value. Now Actor B writes the array to the database, but it has the old Wednesday value and the new Tuesday value. Actor A lost his change to Wednesday's price because of B's write to Tuesday's price.

I have run into this problem when dealing with some legacy code, it was nasty to debug, and it reinforced my policy to never serialize fields that could be written by multiple actors in the same time period. If that case is not explicitly restricted in your app, then serializing can create painful bugs. In general it's best to normalize your data out into new tables and index it. It works, it performs, and integrity is bliss. With Redshift it even scales now.
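One common mitigation for the lost-update scenario described above (not something the comment itself proposes) is optimistic locking: keep a version counter next to the blob and make every write conditional on the version that was read. A minimal sketch, with sqlite3 standing in for the database and illustrative table and column names:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (id INTEGER PRIMARY KEY, body TEXT, version INTEGER)")
conn.execute("INSERT INTO prices VALUES (1, ?, 1)",
             (json.dumps({"tue": 220, "wed": 200}),))

def read(row_id):
    body, version = conn.execute(
        "SELECT body, version FROM prices WHERE id = ?", (row_id,)).fetchone()
    return json.loads(body), version

def write(row_id, obj, expected_version):
    """Only succeeds if nobody else has written since we read."""
    cur = conn.execute(
        "UPDATE prices SET body = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (json.dumps(obj), row_id, expected_version))
    conn.commit()
    return cur.rowcount == 1   # False: someone else wrote first, re-read and retry

# Actor A and Actor B both read version 1.
a_obj, a_ver = read(1)
b_obj, b_ver = read(1)
a_obj["wed"] = 240
b_obj["tue"] = 300
print(write(1, a_obj, a_ver))  # True: A's write lands
print(write(1, b_obj, b_ver))  # False: B must re-read and re-apply its change
```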

Reply

  • alexia says:

July 26, 2014 at 7:06 pm

I am wondering, in this case of JSON data stored in MySQL, how you would store "user comments" on a story, assuming you have a lot of comments per second written for each story!

Would you have some kind of:
1 table: Id_comment+json(comment,other params)
1 table : Id_story,json([Idcomment1,…..,Idcomment100])

which means that you constantly push 1 element in the array, along with an insert in the first table? Is this efficient when you have a huge flow of comments?

Or would you suggest another structure?

Reply

  • Vijay Raju says:

August 20, 2014 at 7:21 am

If I want to insert information for more than 30 columns into a table, how can I insert it? Normal insertion into the table requires 30 fields, so the table becomes complex. Is there any solution for inserting the form data into the database, using either PHP or JSP?

Reply

  • Jason Kistler says:

September 26, 2014 at 2:34 pm

I have a question regarding .NET object serialization. I have inherited a system that uses serialization to capture historical rate, product, and inventory data at the time of a reservation. Thus the serialized object is a larger object with multiple child objects. It is stored in the transaction table, which is the most active table in the database. The original author of the system and database did not seem concerned about performance, but now I have to deal with serious performance issues.

If I run a SELECT * against the transaction table, it takes a certain (lengthy) amount of time to retrieve all the data. If I remove the one column from the query that includes the XML serialization data, then the query literally goes from 4 minutes to around 15 seconds. Eliminating the use of the XML column would require significant re-architecture of the system to convert it to a table structure and rewrite the code that retrieves the object. It would also require adding quite a few more tables in order to capture the breadth of data contained in this single column.

So the question is: how can I make this work with acceptable performance, or am I stuck rewriting it into a tabular structure? It is written as standard XML into a TEXT field in SQL. There is no need for querying the data in SQL; this is simply to store a snapshot of pricing and availability, so the priority is to gain back the performance lost in this methodology of storing objects in the database. Would it help to normalize it out to a different table with just the index and text fields? Do I need to use compression? Should I use binary instead of straight XML? There is unfortunately not a lot of information pertaining to my situation specifically on the web, so hopefully someone here can help. Thanks in advance.

Reply

  • Fabian K says:

February 16, 2015 at 9:11 am

Nice article.
However, there are some parts I didn't agree with 100%.

Of course "SELECT *" patterns are not efficient, but which part of it do you want to point at here exactly? (Using * might not always be bad.) Also, interestingly, you mentioned the loss of aggregation functions but nothing regarding their efficiency.

In the end I guess it depends a whole lot on the environment and the structure it's being applied to. If the db is well modelled (indexes, etc.), I wouldn't see problems using MySQL for databases of whatever size. It is a bit hypothetical here; in the end it depends on the set of advantages and disadvantages that come with it, fitting the project's requirements more or less.

If, for example, an application has to receive and write data at millisecond intervals (networking devices), reads and writes to MySQL might become problematic. Workarounds doing this in a filesystem-only way may cause lots of IO in a minimal timeframe, up to a BSOD, depending on the repetition, of course. What could be much more comfortable here might be using a memory storage engine as the first layer, and a SQL storage engine to take snapshots of the memory-stored objects at the right time and release them at the same time.

Reply

June 2, 2016 at 12:13 pm

https://github.com/torodb/torodb is a good alternative to storing raw blobs. It gives you the best of both worlds.

Reply
