I've got a 7.4 GB CSV file. After converting it to an SQLite database with a Python script, the output DB is 4.7 GB, around 60% of the original size.
The CSV has around 150,000,000 rows, with this header:
tkey,ipaddr,healthtime,numconnections,policystatus,activityflag
And each row looks something like
261846,172.10.28.15,2012-02-03 16:15:00,22,1,1
The script uses healthtime to split the data into 192 tables.
When I first saw these numbers, I assumed I had made an error someplace. How big of a reduction in file size should I expect from the added efficiency of only writing the healthtime 192 times instead of 150,000,000 times?
EDIT: As soon as I posted this I realized the answer. I'm removing about 40% of the string, hence the 40% reduction in size.
Edit 2: Let's calculate the difference in size between the plain text:
"261846,172.10.28.15,2012-02-03 16:15:00,22,1,1"
And the data base entry:
db(261846,'172.10.28.15',22,1,1)
First of all, we drop from 46 to 26 characters in plain text representation.
The remaining characters are:
"261846,172.10.28.15,22,1,1"
or 26 bytes. If each integer needs to be stored as a 32-bit (4-byte) value, then we have:
12 bytes (ipaddr) + 4 bytes * 4 (integer fields) = 28 bytes.
So it looks like converting to integers makes the storage slightly less efficient, and all of my gains come from reducing the number of characters stored in each row.
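To make that arithmetic concrete, here is a small back-of-the-envelope sketch in Python (the 4-bytes-per-integer figure is an assumption; SQLite actually stores integers in 1 to 8 bytes depending on their magnitude, so small values can take even less space):

# Rough per-row size comparison using the numbers above.
# Assumes every integer takes 4 bytes; SQLite's record format actually
# uses 1-8 bytes per integer depending on the value.
text_row = "261846,172.10.28.15,2012-02-03 16:15:00,22,1,1"
text_without_healthtime = "261846,172.10.28.15,22,1,1"
binary_estimate = len("172.10.28.15") + 4 * 4  # ipaddr as text + four 4-byte integers

print(len(text_row))                 # 46
print(len(text_without_healthtime))  # 26
print(binary_estimate)               # 28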
Best Answer
SQLite is not running a compression algorithm, but it stores data in a binary file instead of a text file. That means the data can be stored more efficiently; for example, a 32-bit (4-byte) number can represent 10,000,000, instead of storing it as 8 bytes of text (or more if the file is Unicode).
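As a rough illustration (a minimal sketch of the idea, not how SQLite actually lays out its records), Python's struct module shows the difference between the text and a fixed-width binary encoding of that number:

import struct

as_text = b"10000000"                      # 8 bytes of ASCII text
as_binary = struct.pack("<i", 10_000_000)  # 4 bytes as a 32-bit integer

print(len(as_text))    # 8
print(len(as_binary))  # 4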
Here are more details on the SQLite Database File Format if you are interested.
Does that make sense?
SQLite, by default, does not compress data it writes to disk; however, SQLite does have a set of "Proprietary Extensions" for that and other purposes. Look for ZIPVFS in the following links:
http://www.sqlite.org/support.html and http://www.hwaci.com/sw/sqlite/prosupport.html
You can achieve a lot of "compression" in your data by encoding fields as integers. For example, an IPv4 address was engineered to fit into a word (4 bytes); each octet of the address can be represented in one byte of that word.
# Pack the four octets of an IPv4 address into a single 32-bit integer.
octets = '172.168.0.1'.split('.')
ip = int(octets[0]) << 24
ip |= int(octets[1]) << 16
ip |= int(octets[2]) << 8
ip |= int(octets[3])
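(In Python, the standard library can do the same packing for you: int(ipaddress.ip_address('172.168.0.1')) returns that 32-bit integer, which you can store directly in an INTEGER column.)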
Additionally, your timestamp may be represented in Unix time, which is the number of seconds since the epoch.
UPDATE mytable SET healthtime = CAST(strftime('%s',healthtime) AS INTEGER);
See the Date and Time functions
Note the CAST directive in the SQL above: SQLite does not enforce types on a column, so you could end up with a group of digits stored as a string, increasing your field size more than necessary (this will also make certain queries behave oddly).
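For completeness, here is a minimal sketch of inserting a row with integer fields from Python; the file name, table name, and column layout are illustrative, not your actual schema:

import sqlite3
from datetime import datetime, timezone

def pack_ip(addr):
    # Same octet packing as above.
    a, b, c, d = (int(x) for x in addr.split('.'))
    return (a << 24) | (b << 16) | (c << 8) | d

conn = sqlite3.connect('health.db')  # placeholder file name
conn.execute('''CREATE TABLE IF NOT EXISTS health (
    tkey INTEGER, ipaddr INTEGER, healthtime INTEGER,
    numconnections INTEGER, policystatus INTEGER, activityflag INTEGER)''')

row = ('261846', '172.10.28.15', '2012-02-03 16:15:00', '22', '1', '1')
healthtime = int(datetime.strptime(row[2], '%Y-%m-%d %H:%M:%S')
                 .replace(tzinfo=timezone.utc).timestamp())
conn.execute('INSERT INTO health VALUES (?, ?, ?, ?, ?, ?)',
             (int(row[0]), pack_ip(row[1]), healthtime,
              int(row[3]), int(row[4]), int(row[5])))
conn.commit()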
One more thing: field size is not the only part of the story. Remember that indexes take up space too, and indexes on integers are more efficient, in terms of both disk size and performance.
I have a rather big SQLite file in my Git repository and I was wondering why my total repository size was not growing that much; instead, it was even smaller than my SQLite .db file. It turns out Git compresses the repository by default. A quick check on my .db file also indicates this, since zipping the .db file resulted in a zip archive that was only ~20% the size of the .db file.
So, at least with the default settings, it doesn't look like SQLite is storing the data in a compressed way. However, regardless of that, it might be fine to add a big SQLite file to a Git repository, since Git performs compression automatically.
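If you want to run the same check on your own database, here is a quick sketch (the file name is just a placeholder):

import gzip, os, shutil

# Compress a copy of the database and compare the on-disk sizes.
with open('health.db', 'rb') as src, gzip.open('health.db.gz', 'wb') as dst:
    shutil.copyfileobj(src, dst)

print(os.path.getsize('health.db'), os.path.getsize('health.db.gz'))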