Reverse Engineering Evernote Penultimate (or: When is a picture not a picture?)

In this post Alex Caithness takes a look at “Penultimate” on the iPad and discovers that a picture paints a thousand words… but only once you work out how that picture is stored.

Recently we came up against an iPad note-taking application by the name of “Penultimate”. As a user it’s actually really quite a nice app to use: you can create multiple “notebooks” each with as many pages as you require on which you can jot down notes, draw diagrams or just doodle with a stylus or your finger. For a user it’s nice, simple and intuitive, for an intrepid forensic analyst, it’s not quite as straight-forward…

Penultimate Hello World Screenshot

Penultimate Screenshot

After installing the application on a test iPad, creating a couple of notebooks and having a doodle in each of them, I performed an extraction of the device. Poking through the application’s folder, this is the layout of the files which we can acquire with a backup-style acquisition:

com.cocoabox.penultimate
└───Library
 ├───Preferences
 │     com.cocoabox.penultimate.plist
 │ 
 └───Private Documents
 │     notebookList
 │ 
 └───notebooks
 │     notebookListBackup
 │ 
 ├───23FDA4FF-CE5E-4353-8D9A-B0C3E3E8AAE7
 │     100F53F2-BD1F-4046-8104-714E42264DAE
 │ 
 └───DD47D226-0CB3-460F-A0F9-3A7E2A795B3D
       BBDC9CD6-2685-4AFF-BCB4-D370F975A3F1
       BEB76173-1F62-472B-B0DC-F69EF039A59F
       C1766626-9BD2-44F9-A2DD-D7C4F23367A8

Now, ideally what we’d liked to have seen at this point are some image files, a nice friendly JPEG, GIF or PNG or two would have been just lovely. But no, it’s not going to be that easy (and, in fairness, if it had been I wouldn’t be bothering with this blog would I?).

So what do we actually have? Well, we can see the familiar “Library” folder in which we find the (also familiar) “Preferences” folder which contains a property list, the contents of which are mostly mundane (though we do have application launch counts and dates which may be of interest in some situations). More interesting is the “Private Documents” folder and the files and folders which are contained within it.

In the root of the Private Documents folder is a binary property list file named “notebookList”. It is a ‘NSKeyedArchiver’ plist, which means, as we discussed in a previous blog, it needs some additional processing in order for us to see the actual structure of the data. Luckily I had PIP and the “ccl_bplist” Python module to hand, both of which can perform that transformation. With the data structure untangled I proceeded with digging around to see what, if anything, we could extract from the “notebookList”.

Among some less interesting data, the root object contains a list of the notebook objects which are stored by the device, under the “notebooks” key.

notebookList in PIP

notebookList in PIP

Each of the notebook objects (which are represented by dictionaries in the property list) contain the keys: ‘name’, ‘title’, ‘pageNames’, ‘created’, ‘modified’,  ‘changeCount’, ‘blankPages’, ‘creatingDeviceId’, ‘editingPageName’, ‘pageDataIsCooked’, ‘versionMajor’, ‘versionMinor’, ‘pagePaperStyles’, ‘paperStyle’, ‘imported’ , ‘coverColor’, ‘originalName’.

The first few there are of immediate interest: the value of “title” does indeed contain the title of the notebooks that I created during my testing and the “created” and “modified” timestamps also ring true. The values of the “name” keys are familiar as well, albeit for a different reason; in my test data we see the “name” values “DD47D226-0CB3-460F-A0F9-3A7E2A795B3D” and “23FDA4FF-CE5E-4353-8D9A-B0C3E3E8AAE7” – matching the directory names that were extracted underneath the “notebooks” folder (see above). Beyond this, the “pageName” key contains a list of values which also match the names of files in each of the “name” directories.

So, with the “notebookList” file we have some useful metadata and a helpful guide to how the other files are organised, but there’s still no sign of the content of the notes themselves. Delving deeper into the folder structure, our next stop is one of the files which was named in the “pageName” list mentioned above.

Opening one of the “page” files we find another “NSKeyedArchiver” property list. After unravelling the structure of the file we find a top-level object containing further metadata (including a “blankDate” which appears to match the “created” timestamp reported in the “notebookList” and the dimensions of the note) along with a list of “layers”. Each of the “layer” objects (again represented by dictionaries) have keys for the layer’s colour (more on that later) the layer’s dimensions and a list of “layerRects” – sections of the layer where the user has drawn their notes; and that’s where we finally find the image itself.

Sort of.

Structure of the page object

Structure of the page object

Each of the “layerRects” objects are represented by (and this shouldn’t be a surprise by now) a dictionary. There are two keys: firstly “rect” which contains a string of two sets of “x,y” co-ordinates: the bounds of this layerRect on the page. The second key is “values” which requires a little extra explanation. As noted in my previous post on NSKeyedArchiver files, their function is to represent a serialisation of objects; the object under the “values” key is a “CBMutableFloatArray” which is programmer talk for a list of floating-point numbers. With this in mind we quite reasonably expect to see just that – a list of floating point numbers; but no, instead we get a data field containing binary data! As nothing else in this data had been straight-forward this was disappointing, but by this point didn’t surprise me. Floating-point numbers (like any numerical data) can be represented quite happily in binary, so I set about trying to turn the binary data into a list of floating-point numbers. I extracted one of these data fields and with a nice Python one-liner did just that:

>>> struct.unpack("<18f", data)
(282.0, 589.0, 5.506037712097168, 281.8666687011719, 588.2999877929688, 5.506037712097168, 281.73333740234375, 587.5999755859375, 5.506037712097168, 281.5999755859375, 586.9000244140625, 5.506037712097168, 281.4666442871094, 586.2000122070312, 5.506037712097168, 281.33331298828125, 585.5, 5.506037712097168)

So now we can see the numbers, but what did they mean? The meaning becomes clearer if we group them into threes:

282.0, 589.0, 5.506037712097168
281.8666687011719, 588.2999877929688, 5.506037712097168
281.73333740234375, 587.5999755859375, 5.506037712097168
281.5999755859375, 586.9000244140625, 5.506037712097168
281.4666442871094, 586.2000122070312, 5.506037712097168
281.33331298828125, 585.5, 5.506037712097168

What we have here is sets of x-y co-ordinates and another value (the meaning of which will become clearer later). So, here, finally is our drawing; stored as a list of co-ordinates rather than any conventional graphics format – the combination of the points in the “layerRects” in each of the layer objects gives you the full picture.

But how to present this data? A list of co-ordinates like this is of little to no use; we want to see the image itself. So I got to thinking: how does the application treat these co-ordinates? Well, when the user draws in the application they are essentially ‘painting’: moving a circular ‘brush’ around the screen making up the lines in the drawing, so if I was able to plot a circle at each of these co-ordinates perhaps I would see the picture? This thought process also gave me an idea as to what the third value in each of the co-ordinate groupings might represent: part of what makes the drawings look natural is that the “line” that is drawn is not of a uniform width, it grows and shrinks with the speed at which the finger or stylus moves, giving an impression of weight – perhaps the final value related to this?

So how to plot these co-ordinates? First of all I looked at various Python imaging libraries (PIL and PyGame both came on my radar), but there were issues with both (especially with PIL which still lacks a proper Py3k port) so I turned my attention to an alternative solution: SVG. SVG (Scalable Vector Graphics) is a graphics format specified by the W3C; it uses XML to describe 2D graphics. Because SVG just uses XML, no imaging libraries would be required; I could simply generate textual mark-up describing where each of my circular points should be plotted on the screen. Taking the data above extracted above as an example, the mark-up would be along the lines of:

<!DOCTYPE svg>
<svg height="865" version="1.1" viewBox="0 0 718 865" 
     width="718" xmlns="http://www.w3.org/2000/svg">
    <circle cx="282.0" cy="589.0" 
            fill="#000000" r="5.506037712097168"/>
    <circle cx="281.8666687011719" cy="588.2999877929688" 
            fill="#000000" r="5.506037712097168"/>
    <circle cx="281.73333740234375" cy="587.5999755859375" 
            fill="#000000" r="5.506037712097168"/>
    <circle cx="281.5999755859375" cy="281.5999755859375" 
            fill="#000000" r="5.506037712097168"/>
    <circle cx="281.4666442871094" cy="586.2000122070312" 
            fill="#000000" r="5.506037712097168"/>
    <circle cx="281.33331298828125" cy="585.5" 
            fill="#000000" r="5.506037712097168"/>
</svg>

Each “circle” tag has an x and y co-ordinate (“cx” and “cy”) a fill colour (here expressed as an html colour code – black in this case) and a radius (“r”). Running some tests gave some good output; but things weren’t quite right, firstly, and most importantly, the image was mirrored in the y-axis, presumably caused by a difference of opinion between Penultimate and SVG as to where the y-axis originates, an easy fix (just subtract the ‘y’ value from the height of the image and use that value instead). Also, the lines in the writing all looked very chubby, making writing very tricky to read. Theorising that this was caused by Penultimate storing diameters and SVG requiring radii, I halved the value which improved things, but comparing the output with what I could see on the screen things still weren’t quite looking right so on a whim I halved the value again which made things look right (I’m not entirely sure why it should be the case that you have to quarter the value – it may be to do with penultimate adding a ‘feathered’ fade to the brush which increases it’s diameter though).

I created a proof-of-concept Python script to check that my thinking was correct and I was pleased to see that the output now matched what I could see on the iPad’s screen, save for the fact that the iPad was in glorious Technicolor and my script’s output was monochrome. I mentioned previously that the “layer” objects contained colour information under the unsurprisingly named “color” key.

The object stored in the “color” key has values for red, blue and green – each one a value between 0 and 1. Those readers who have had even the briefest dalliance with graphic manipulation programs will be familiar with colour-sliders: combining red, green and blue in different proportions in order to generate different colours, which is what these values represented. My SVG generated output was working with HTML colour codes which are made up of 3 bytes, each representing an amount of red, green and blue, this time using values from 0x00 to 0x0FF, to get colour into my output, all I had to do is multiply 0xFF by the correct value from the “layer” object’s colour fields and recombine those values into an HTML colour code. I modified my script and now the output reflected both the form and colour displayed by the App on the iPad.

Comparison between iPad and script output 1

Comparison between iPad and script output (1)

Comparison between iPad and script output 2

Comparison between iPad and script output (2)

Working on extracting data from Penultimate was a particularly enjoyable experience as it required a combination of a number of different concepts and led to the use of a number of different technologies to create a solution to automate the extraction of the data, which when it all boiled down was satisfyingly simple.

And as to the question: when is a picture not a picture? Well, quite simply: when it’s a series of serialised floating-point co-ordinate triplets representing points on a page, broken up into rectangles on a layer on a page which is stored in a NSKeyedArchiver property list file (obviously!).

If you have any questions, comments or queries regarding this post, as always you can contact us by dropping an email to research@ccl-forensics.com or by leaving a comment below.

Update: As requested I’ve uploaded the script for research purposes. You can find it here: http://pastebin.com/VYenpXUi 

Alex Caithness, CCL Forensics

Advertisements

New Epilog Signature files released

Epilog Signature files allow users to add specific support for new databases they encounter and although they are designed so that Epilog’s users can create their own signatures when the need arises, CCL-Forensics are committed to updating and releasing a sets of signatures, pre-written and ready to use.

In this new release we have had a real focus on smartphones adding support for:
• iOS6
• Android 4.0 (Ice Cream Sandwich)
• Android 4.1 (Jelly Bean)
• Android 3rd Party Applications
• iOS 3rd Party Applications
• Skype

We always welcome suggestions for signatures that you’d like to see added to the signature collection so please get in touch on epilog@ccl-forensics.com

For more information on epilog please visit our website – www.cclgroupltd.com/Buy-Software/

Chrome Session and Tabs Files (and the puzzle of the pickle)

In this blog post Alex Caithness investigates the file format and contents of Chrome’s “Current Tabs”, “Current Session”, “Last Tabs” and “Last Session” files and discovers that, even with the original source code at your side, you can still end up getting yourself into a Pickle.

A link to a Python script for automating the process can be found at the end of the post.

I’ve been on a bit of a browser artefacts kick as of late, digging around both on desktop and mobile platforms for stuff I haven’t tackled before. Taking a peek in my preferred browser’s (Chrome) “AppData” folder revealed that the ubiquitous-ness of SQLite as a storage format means that inspecting the data for a lot of artefacts has been made pretty simple. I had also recently tackled the Chromium web-cache format for another project (the format is now also used both on Android and RIM Playbooks) and, with the pain that caused me still fresh in my mind I had no desire to revisit it. There were, however, four likely looking candidates for a quick probing in the form of the “Current Tabs”, “Current Session”, “Last Tabs” and “Last Session” files.

Chrome AppData Folder

Hello there…

Broadly speaking, these files store the state of the opened tabs, their back-forward lists and the sites displayed therein. The files can be used by Chrome to restore your previous browsing session when you restart the browser (if that’s how you have it set up) or in the event of a crash. It turns out that these files can contain some really rich data, but first you had to do battle with the file format…

In previous posts I’ve made mention of the usefulness of having access to the source code that governs the format in which the data is to be stored, and as Chrome is open source I was heartened. “This shouldn’t be too tricky,” I thought to myself as I set about finding the ‘few lines of code’ which would unlock the file’s secrets… Let me tell you now: the Chrome source is a sprawling behemoth and my journey across the codebase (and on one occasion, outside of it) was long and arduous, and, when it comes down to it, it all boils down to understanding the ‘Pickle’…

Header of the Session file

The file header

The file header was easy to track down, I headed over to the definition for session_backend (src/chrome/browser/sessions/session_backend.cc) where we confirm that “SNSS” is simply a file signature followed by a 32bit integer giving the version of the file, which, at the time of writing, should always be 1 (all data is stored little-endian). Also in this file we encounter a method named “AppendCommandsToFile” which appears to be responsible for writing the details into the files. The method describes that for each record, a 16-bit integer is written to the file giving the size in bytes of the record (not including this value), followed by an 8-bit “id” (which appears to relate to the ‘type’ of the record) and the contents of the “SessionCommand”.

Record structure overview

Record structure overview

So now I knew what the overview of the structure in the file was: a nice simple size, contents, size, contents, size, contents… etc. file format, with the records written sequentially, one after another. But I still had no information about the structure of those contents. SessionBackend was operating with a SessionComand object so I tracked down the source code describing this object (src/chrome/browser/sessions/session_command.h) but was disappointed to find the following explanation in the source code’s comments:

“SessionCommand contains a command id and arbitrary chunk of data. The id and chunk of data are specific to the service creating them.”

OK, so the information I wanted isn’t going to be here, but the comments go on to say:

“Both TabRestoreService and SessionService use SessionCommands to represent state on disk”

Aha! So although I hadn’t quite found what I was looking for here, I have found a useful signpost pointing in the right direction. Now, neither “TabRestoreService“ (src/chrome/browser/sessions/tab_restore_service.h) nor “SessionService” (src/chrome/browser/sessions/session_service.h) themselves give us the information we’re after, but both of them ‘inherit’ from a common base class called “BaseSessionService” (src/chrome/browser/sessions/base_session_service.cc) (I gave a brief overview of object oriented principals including inheritance in a previous blog post)  and it is in BaseSessionService where we finally get what we’re after…

BaseSessionService contains a method called “CreateUpdateTabNavigationCommand” which is responsible for writing that “arbitrary chunk of data” into the SessionCommand which eventually gets written to disk. The record starts with a 32 bit integer which gives the length of the data (this is in addition to the length value outside the SessionCommand). The rest of the SessionCommand’s contents structure is described in the table below.

SessionCommand serialisation

SessionCommand structure

Data type Meaning
32 bit Integer Tab ID
32 bit Integer Index in this tab’s back-forward list
ASCII String (32 bit Integer giving the length of the string in characters followed by an ASCII string of that length) Page URL
UTF-16 String (32 bit Integer giving the length of the string in characters followed by a UTF-16 string of that length) Page Title
Byte string (32 bit Integer giving the length of the string in bytes followed by a byte string of that length) “State” (A data structure provided by the WebKit engine describing the current state of the page. We will look at it in detail later)
32 bit Integer Transition type (explained below)
32 bit Integer 1 if the page has POST data, otherwise 0
ASCII String (see above) Referrer URL
32 bit Integer Referrer’s Policy
ASCII String Original Request URL (for example if a redirect took place)
32 bit Integer 1 if the user-agent was overridden, otherwise 0

As SessionCommands contents can be populated by other means, not every Session command contains data formatted as shown above. During testing it was shown that it is the SessionCommand’s 8-bit ID which identifies whether the record contains this kind of data (when the ID was 1 or 6 then this data format was found). Those with other IDs were typically much shorter (usually around16-32 bytes in length) and did not appear to contain information which was of so much interest.

There are a few fields in the table above which are worth taking a closer look at; the “State” field we’ll explore in detail later as it’s a complicated one. The “Transition type” is a little easier to explain; this field tells Chrome how the page was arrived at. The field will be an integer number, the meaning of which is described in the tables below. The value is essentially split into two sections: the least significant 8-bits of the integer give a type of transition and the most-significant 24-bits form a bit-mask which gives other details. These details are gathered from page_transition_types (content/public/common/page_transition_types.h).

Least Significant 8-bits Value Meaning
0 User arrived at this page by clicking a link on another page
1 User typed URL into the Omnibar, or clicked a suggested URL in the Omnibar
2 User arrived at page through a  bookmark or similar (eg. “most visited” suggestions on a new tab)
3 Automatic navigation within a sub frame (eg an embedded ad)
4 Manual navigation in a sub frame
5 User selected suggestion from Omnibar (ie. typed part of an address or search term then selected a suggestion which was not a URL)
6 Start page (or specified as a command line argument)
7 User arrived at this page as a result of submitting a form
8 Page was reloaded; either by clicking the refresh button, hitting F5 or hitting enter in the address bar. Also given this transition type if the tab was opened as a result of restoring a previous session.
9 Generated as a result of a keyword search, not using the default search provider (for example using tab-to-search on Wikipedia). Additionally a transition of type 10 (see below) may also be generated for the url: http:// + keyword
10 See above
Bit mask Meaning
0x01000000 User used the back or forward buttons to arrive at this page
0x02000000 User used the address bar to trigger this navigation
0x04000000 User is navigating to the homepage
0x10000000 The beginning of a navigation chain
0x20000000 Last transition in a redirect chain
0x40000000 Transition was a client-side redirect (eg. caused by JavaScript or a meta-tag redirect)
0x80000000 Transition was a server-side redirect (ie a redirect specified in the HTTP response header)

NB during testing, although the transition types looked correct in the “Current Session” and “Last Session” files, in the “Current Tabs” and “Last Tabs” files the transition type was always recorded as type 8 (Reloaded page).

When it comes to the record structure, there is still a little more to the story, and yes, this is where the Pickles come in.

This data structure is not being written directly to a file, but rather to what Chrome calls a “Pickle” (src/base/pickle.h). A Pickle is a sort of ‘managed buffer’; a way for Chrome to write (and read) a bunch of values, like those in the tables above, into an area of memory in a controlled way. Indeed, the “length-value” structure we see with the strings is down to the way Pickles write strings into memory, as is the, apparently superfluous, extra ‘length’ field at the start of the record structure. One other pickle-related side-effect which isn’t necessarily immediately obvious when you look at the data in a hex editor is that pickles will always write data so it is uint32-aligned. This means that data will always occupy blocks of 4 bytes and if needed (such as in the case of strings) will be padded to ensure that the next data begins at the start of the next 4-byte block.

It turns out that the contents of the mysterious “State” field are also governed by a Pickle. This field contains serialised data from the WebKit engine. The data is held in a “NavigationEntry” (content/public/browser/navigation_entry.h) “content state” field, but is originally populated by glue_serialize  (webkit/glue/glue_serialize.cc). It duplicates some of the data that we have already described from the outer record, but also contains some more detailed information regarding the state of the page, not least the contents of any forms on the page. The code describing the serialisation process is found in glue_serialize in the WriteHistoryItem method.

The state byte string begins with a 32 bit Integer giving the length of the rest of the record (this is in addition to the length defined in the outer record structure) and then continues with the “WebHistoryItem” structure shown in the table below:

WebHistoryItem structure

WebHistoryItem structure

Data type Meaning
32 bit Integer Format Version
String (see below) Page URL
String (see below) Original URL (for example if a redirect took place)
String (see below) Page target
String (see below) Page parent
String (see below) Page title
String (see below) Page alternative title
Floating point number (see below) Last visited time
32 bit Integer X scroll offset
32 bit Integer Y scroll offset
32 bit Integer 1 if this is a target item otherwise 0
32 bit Integer Visit count
String (see below) Referrer URL
String Vector (see below) Document state (form data) – explained in more detail below
Floating point number (see below) Page scale factor (Only present if the version field is greater than or equal to 11)
64 bit Integer “Item sequence number” (Only present if the version field is greater than or equal to 9)
64 bit Integer “Document sequence number” (Only present if the version field is greater than or equal to 6)
32 bit Integer 1 if there is a “state object” otherwise 0 (Only present if the version field is greater than or equal to 7)
String (see below) “State Object” (only present if the value above is 1 and the version field is greater than or equal to 7)
Form data (see below) Form data
String (see below) HTTP content type
String (see below) Referrer URL (again, for backwards compatibility apparently)
32 bit Integer Number of sub-items in the field below
WebHistoryItem Vector (see below) A number of sub items (for example embedded frames). Each record has the same structure as this one

That table has a lot of “See below” in it, so let’s get down to explaining some of the subtleties/oddities that this data structure provides.

Strings: strings are actually stored differently to those in the outer record. Despite the fact that the data is still being written into a Pickle, the source code uses a different mechanism to do so. The source code forsakes the Pickle’s built in string serialisation methods (for reasons best known to the Chrome programmers), instead taking a more direct route of writing the length of the string directly, followed by the in-memory representation of the string. Basically, this results in the string fields comprising a 32-bit Integer giving the length of the string followed by a UTF-16 string only, this time the length refers to the length in bytes, not the length in characters. To further confuse matters, if the length is -1 (0xFFFFFFFF) this indicates that the string is not present (or ‘null’ in programming terms) or un-initialised (and therefore empty). There is an exception to this structure: if the version field is 2, where, as the comments in the source code suggest, the format was “broken” and stored the number of characters, this was fixed in version 3 onwards.

String Vector: “Vector” in this case essentially means ‘List’. The vector begins with a 32-bit Integer giving the number of entries in the list which is then followed by that many strings in the format described above. In the data structure above this is used to serialise what is described as the “document state”. In testing this appeared to contain information regarding any form fields that may be present on the page (including hidden fields). The list of strings can be broken up into groups of 3 strings, the first of which gives the name of the form field, the second the type of field and the third the current contents of the field.

Floating Point Numbers: IEEE 754 double-precision floating point numbers are used as a representation, but Pickles do not directly support this data type. Because of this, the code uses the Pickle’s “WriteData” method, passing the internal, in-memory representation of the floating point number into the Pickle. The upshot of using the “WriteData” method is that the 64-bit floating point number is prefaced with a 32-bit integer giving the length of the data (which will always be 8 for a double-precision float).

Form Data: the (slightly convoluted) format for this data serialisation is detailed in the WriteFormData method in glue_serialize, however across testing this data was never populated so I can’t vouch for its contents.

Sub items: this contains further WebHistoryItems for any embedded pages or resources on the page. During testing I saw it used to store details of adverts, Facebook “like” buttons and so on. The structure for these sub items is identical to the structure described in the table (note, however, that unlike the top-level WebHistoryItem they do not begin with a size value).

So that’s the structure of the file – not the most pleasant file format I’ve ever dealt with and, even with the source code on hand, it was a lengthy task. So was it worth it?

Well first the case against: a lot of the data is duplicated in other places, not least the History database (which is SQLite so much nicer to work with), and between the “Current” and “Last” versions of the files you only have information regarding 2 sessions worth of browsing, although, increasingly in today’s “always-on” culture, this could still account for a significant period of browsing. Which brings me to the other significant disappointment for these files – timestamps (or rather the apparent lack of them); of course, this makes perfect sense when you consider what Chrome needs the files for – timestamps simply aren’t required for restoring sessions, all the same, it’d make the file more useful to us if they were there.

But it’s not all doom and gloom (which is lucky, otherwise this blog post would be a bit of a waste of time). Firstly, although we only have 2 sessions worth of browsing live on the system, colleagues have already demonstrated to me that there is plenty of scope for recovering previous examples of the files – especially from volume shadow copies, and the 8-byte long static header means that carving files from unallocated space may be possible (no footer though, so some judgement would need to be made regarding the length of the files). Probably more importantly these files give us access to information which it would be tricky to acquire otherwise (or at the very least another opportunity to recover information which may have been deleted); the form contents are obviously a nice additional source of intelligence, both in terms of user credentials, email addresses and possibly message contents (I was able to recover Facebook chat messages from the form data in the “document state” for example). Also, the presence of the transition types, referrer and requested URL fields means that you can build up detailed browsing behaviour profiles, tracking the movement between sites and tabs.

This is not a file format that I would want to parse by hand again, so to automate the process I have written a Python script which we’re happy to make available to the forensics community. The script is designed both as a command line tool which generates a simple HTML report and a class library in case anyone wishes to integrate it into other tools (or create a different reporting format). You can download the script from http://code.google.com/p/ccl-ssns/.

As always, if you have any comments or questions you can get in touch in the comments or by emailing research@ccl-forensics.com

Alex Caithness

Parsing Apple System Log (ASL) files on iOS and OSX for Fun and Evidence (and a Python script to do it for you)

(If you’re dying to get stuck in and are only after the links to the Python scripts, they can be found at the bottom of the post!)

After every update to iOS I like to take a file system dump of one of our test iDevices and have a poke around to see what’s changed and what’s new. Recently, on one of my excursions around the iOS file system, I came across something that looked promising that I hadn’t dug into before: a bunch of files with the “.asl” extension which were located on the data partition in “log/DiagnosticMessages”. There were lots of them too –each with a file name referring to a particular date – they went back months!

DiagnosticMessages file listing

Log Files!

“Loads of lovely log files!” I thought to myself as I excitedly dropped one of the files into my current text editor of choice (Notepad++ if you’re interested) only to be disappointed by what was clearly a binary file format.

ASLDB File in a text editor

Curses!

So I headed over to Google and entered some hopeful sounding search queries and came across a very useful blog post (http://crucialsecurityblog.harris.com/2011/06/22/the-apple-system-log-%E2%80%93-part-1/) which described the role of ASL files on OSX and listed some ways for accessing the logs from within OSX, but I was interested in gaining a better understanding of the file format (besides, the nearest Mac to me was on a different floor!).

A little more digging revealed that the code that governed the ASL logging, and the files it generated were part of the Open Source section of OSX, as a result I was able to view the code that was actually responsible for creating the file – my luck was looking up!

The two files I was particularly interested in were “asl.h” (most recent version at time of posting: http://opensource.apple.com/source/Libc/Libc-763.13/include/asl.h) and “asl_file.h” (most recent version at time of posting: http://opensource.apple.com/source/Libc/Libc-763.13/gen/asl_file.h). C header files are great; basically, their purpose is to define the data structures that are subsequently used in the functional code, so when it comes to understanding file formats, quite often they’ll tell you all you need to know without having to try and follow the flow of the actual program. Better yet, these files were pretty well commented. I know that not everyone reading this is going to want to read through a bunch of C code, so I’ll summarise the file format below (all numeric data is big endian):

First, the file Header:

Offset Length Data Type Description
0 12 String “ASL DB” followed by 6 bytes of 0x00
12 4 32bit Integer File version (current version is: 2)
16 8 64bit Integer File offset for the  first record in the file
24 8 64bit Integer Unix seconds timestamp, appears to be a file creation time
32 4 32bit Integer String cache size (not 100% sure what this refers to, may be maximum size for string entries in the records)
36 8 64bit Integer File offset for the last record in the file
44 36 Padding Should all be 0x00 bytes

So nothing too ominous there, although all of those pad-bytes at the end of the header suggest redundancy in the file spec in case apple ever fancy changing something. Indeed the fact that the header tells us that we’re on version 2 of the file format suggests that this has already happened.

The records in the file are arranged in a “doubly linked list”, that is, that every record in the file contains a reference (ie. the file offset of) the next and previous records.  From a high level, the records themselves are made up of a fixed length data section, followed by a variable length section which allows the storage of additional data in a key-value type structure, finally followed by the offset of the previous record. The table below explains the structure in detail.

NB: The string storage mechanism the records use is a little bit…interesting – I’ll explain in detail later in this post, but for now if you see a reference to an “ASL String”, I mean one of these “interesting” strings!

Offset Length Data Type Description
0 2 Padding 0x00 0x00
2 4 32bit Integer Length of this record (excluding this and the previous field)
6 8 64bit Integer File offset for next record
14 8 64bit Integer Numeric ID for this record
22 8 64bit Integer Record timestamp (as a Unix seconds timestamp)
30 4 32bit Integer Additional nanoseconds for timestamp
34 2 16bit Integer Level (see below)
36 2 16bit Integer Flags
38 4 32bit Integer Process ID that sent the log message
42 4 32bit Integer UID that sent the log message
46 4 32bit Integer GID that sent the log message
50 4 32bit Integer User read access
54 4 32bit Integer Group read access
58 4 32bit Integer Reference PID (for processes under the control of launchd)
62 4 32bit Integer Key-Value count: The total number of keys and values in the key-value storage of the record
66 8 ASL String Host that the sender belongs to (usually the name of the device)
74 8 ASL String Name of the sender (process) which send the log message
82 8 ASL String The sender’s facility
90 8 ASL String Log Message
98 8 ASL String The name of the reference process (for processes under control of launchd)
106 8 ASL String The session of the sender (set by launchd)
114 8 * Key-Value count ASL String[Key-Value count] The key-value storage: A key followed by a value, followed by a key followed by a value… and so on. All keys and values are strings

The level field mentioned above will have a numerical value which refers to the levels shown below:

Level Meaning
0 Emergency
1 Alert
2 Critical
3 Error
4 Warning
5 Notice
6 Info
7 Debug

As mentioned, the “ASL String” data type is a little odd. The ASL fields above take up 8 bytes, if the most significant bit in the 8 bytes is set (ie is 1), the rest of the most significant byte gives the length of the string, which occupies the remaining 7 bytes (unused bytes are set to 0x00). Conversely, if the top bit in the ASL String data type is not set (ie. Is 0) the entire 8 bytes should be interpreted as a 64bit Integer which gives the file offset where the string can be found. The string will be stored thusly:

Offset Length Data Type Meaning
0 2 Padding Padding bytes 0x00 0x01
2 4 32bit Integer String length
6 String length UTF8 String (nul-terminated) The string data

In order to get a better grip of what can be held in these files I decided to create a Python module to read these files and used it to dump out the contents of the ASL files I found on the iPhone.

Running the script

Running the script

Output from the script (iOS)

A snippet of the output produced by processing an iPhone’s ‘DiagnosticMessages’ folder

The first thing that struck me after running the script was the volume of messages: 16161 log messages spanning 10 months – and this was on a test handset which had lay idle for weeks at a time. The second thing was the prevalence of messages sent by the “powerd” daemon, over 87% of the messages had been sent by this process. The vast majority of these messages related to the device waking and sleeping – not through user interaction, but while the device was idle. Most of these “Wake” events occurred 2-5 minute apart, presumably to allow brief data connectivity to receive updates and push messages from apps.

Output from the script (iOS powerd messages)

Some powerd Wake and Sleep messages

The key thing that interested me about these messages was that they also noted the current battery-charge percentage in their text: this is the sort of data that just begs to be graphed, so I knocked up a little script which utilised the parsing module I had just written to extract just this data and present it in a graph-friendly manner.

Graph Friendly powerd Data

Graph Friendly Data

After graphing it (you want to use a scatter graph in Excel for this, not line as I discovered after some shouting at my screen) you are left with a graph which gives you some insight into the device’s use.

iOS Battery Use Graph

Some iPhone Power Usage (click for full-size)

The graph above shows around 3 weeks of battery usage data from the test handset. As noted previously, this test device would lay idle for days at a time (as suggested by the gentle downward gradients) but there were periods when the handset was in use, as shown by the steeper downward gradients on the 26th and 27th of April, which mostly took place within office hours. You can also clear see the points where the device was plugged in to be charged, suggested by the very steep upward gradients. As noted the power messages occur around ever 2-5 minutes, so the resolution is actually fairly good. The exception to this is while the device is plugged in as it no longer needs to sleep to preserve battery charge; typically I only saw an event when charging began and another when the device was unplugged and the battery began to discharge again.

There are a few other messages in the iOS ASL log that look interesting, but at this time I don’t have enough nice control data to make much of them. One thing that did hearten me somewhat was the fact that on the few extractions I’ve had the opportunity to take a look at from later revisions of iOS 5, there did seem to be some extra processes that were logging messages, so it’s my hope that we’ll see more and more useful data make its way into the ASL logs on iOS.

In addition to looking at iOS ASL files, I thought I’d take a look at some from an OSX installation. Pulling the logs from the “var/log/asl” on Lion (10.7.3) and running the parsing script across the whole directory brought back a far more varied selection of messages.

Output from the script (OSX)

Variety is the spice of life.

The number of records returned was actually far less than on iOS, partially due to the iOS “powerd” being so chatty, but more crucially because OSX tidies up its logs on a weekly basis. That’s not to say that you will only recover a week’s worth of logs though – on this test machine I recovered logs spanning 7 months. Rather, OSX has short-term log files (those with file names which begin with a timestamp) which have a shelf-life of a week and long term log files (those with file names which begin with “bb” followed by a timestamp). The “bb” in the long term log’s file name presumably stands for “best before” and the date, which is always in the future, is the date that the file should be cleared out. The short term log files tend to hold more “intimate” entries, often debug messages sent from 3rd party applications; the long term logs err more on the side of system messages. One particularly useful set of messages in the long term log are records pertaining to booting, shutting down, logins and logouts (hibernating, waking and failed logins are recorded too, but they end up in the short-term logs).

(As an aside:  one of my favourite things that I discovered when looking through these logs was the action of waking a laptop running OSX by wiggling a finger on the trackpad is recorded in the logs as a “HID Tickle”. Lovely.)

Like I did with the iOS power profiling, I put together a script which extracted these login and power records and timelines them.

OSX Login and power timeline

Login and power timeline

A couple of things worth noting beyond the basic boot/shutdown/login records: firstly when the device wakes it records why it happened – this can be quite specific: a USB device, the lid of a laptop being opened, the power button being pressed, etc. Secondly, you can see terminal windows (tty) being opened and closed as opening a terminal window involves logging in to a terminal session (OSX does this transparently, but it’s still logged).

We’ve released the scripts mentioned in this post to the community and they can be downloaded from https://code.google.com/p/ccl-asl/. The “ccl_asl” script is both a command line utility for dumping the contents of ASL files as well as a fully featured class module which you can use to write scripts along the lines of the battery profiler and login timeline scripts.

ASL files are, on the one hand, fairly dry system logs, but on the other, with a little work you can harvest some really insightful behavioural intelligence. As always if you have any questions, comments or suggestions you can contact us on research@ccl-forensics.com or leave a comment below.

Alex Caithness

The Forensic Implications of SQLite’s Write Ahead Log

By Alex Caithness, CCL-Forensics

SQLite is a popular free file-based database format which is used extensively both on desktop and mobile operating systems (it is one of the standard storage formats available on both Android and iOS). This article sets out to examine the forensic implications, both pitfalls and opportunities, of a relatively new feature of the database engine: Write Ahead Log.

Before we begin, it is worth taking a moment to describe the SQLite file format. Briefly, records in the database are stored in file which in SQLite parlance is called the ‘Database Image’. The database image is broken up into “pages” of a fixed size (the size is specified in the file header). Each page may have one of a number of roles, such as informing the structure of the database, and crucially holding the record data itself. The pages are numbered internally by SQLite starting from 1.

Historically SQLite used a mechanism called “Rollback Journals” for dealing with errors occurring during use of the database. Whenever any data on a page of the database was to be altered, the entire page was backed up in a separate journal file. At the conclusion of a successful transaction the journal file would be removed; conversely if the transaction was interrupted for any reason (crash, power cut, etc.) the journal remained. This means that if SQLite accesses a database and finds that a journal is still present something must have gone wrong and the engine will restore the database to its previous state using the copies of pages in the journal, avoiding corrupted data.

From version 3.7.0 of the SQLite engine an alternative journal mechanism was introduced called “Write Ahead Log” (ubiquitously shortened to “WAL”). WAL effectively turned the journal mechanism on its head: rather than backing up the original pages then making changes directly to the database file, the database file itself is untouched and the new or altered pages are written to a separate file (the Write Ahead Log). These altered or new pages will remain in the WAL file, the database engine reading data from the WAL in place of the historic version in the main database. This continues until a “Checkpoint” event takes place, finally copying the pages in the WAL file into the main database file. A Checkpoint may take place automatically when the WAL file reaches a certain size (by default this is 1000 pages) or performed manually by issuing an SQL command (“PRAGMA wal_checkpoint;”) or programmatically if an application has access to the SQLite engine’s internal API.

Initial state: No pages in the WAL

Initial state: No pages in the WAL

Page Altered - new version written to WAL

Page 3 is altered. The new version of the page is written to the WAL and the database engine uses this new version rather than the old version in the database file itself.

Checkpoint

A checkpoint operation takes place and the new version of the page is written into the database file.

It is possible to detect whether a database is in WAL mode in a number of ways: firstly this information is found in the database file’s header; examining the file in a hex editor, the bytes at file offset 18 and 19 will both be 0x01 if the database is using the legacy rollback journal or 0x02 if the database is in WAL mode. Secondly you can issue the SQL command “PRAGMA journal_mode;” which will return the value “wal” if the database is in WAL mode (anything else indicates rollback journal). However, probably the most obvious indication of a database in WAL mode is the presence of two files named as “<databasefilename>-wal” and “<databasefilename>-shm” in the same logical directory as the database (eg. if the database was called “sms.db” the two additional files would be “sms.db-wal” and “sms.db-shm”).

The “-wal” file is the actual Write Ahead Log which contains the new and updated database pages, its structure is actually fairly simplistic. The “-wal” file is made up of a 32 byte file header followed by zero or more “WAL frames”. The file header contains the following data:

Offset Size Description
0 4 bytes File signature (0x377F0682 or 0x377F0683)
4 4 bytes File format version (currently 0x002DE218 which interpreted as a big endian integer is 3007000)
8 4 bytes Associated database’s page size (32-bit big endian integer)
12 4 bytes Checkpoint sequence number (32-bit big endian integer which is incremented with every checkpoint, starting at 0)
16 4 bytes Salt-1 Random number, incremented with every checkpoint *
20 4 bytes Salt-2 Random number, regenerated with every checkpoint
24 4 bytes Checksum part 1 (for the first 24 bytes of the file)
28 4 bytes Checksum part 2 (for the first 24 bytes of the file)

* In testing it was found that although the official (and at the time of writing, up to date) command line version of SQLite v3.7.11 behaved correctly, when using SQLite Expert v3.2.2.2.2102 this value appeared to be regenerated after each checkpoint (which is assumed by the author to be incorrect behaviour)

The WAL Frames that follow the header consist of a 24 byte header followed by the number of bytes specified in the file header’s “page size” field which is the new or altered database page. The Frame Header takes the following form:

Offset Size Description
0 4 bytes Database page number (32-bit big endian integer)
4 4 bytes For a record that marks the end of a transaction (a commit record) this will be a 32-bit big endian integer giving the size of the database file in pages, otherwise 0.
8 4 bytes Salt-1, as found in the WAL header at the time that this Frame was written
12 4 bytes Salt-2, as found in the WAL header at the time that this Frame was written
16 4 bytes Checksum part 1 – cumulative checksum up through and including this page
20 4 bytes Checksum part 2 – cumulative checksum up through and including this page

There are a number of potential uses and abuses for the WAL file in the context of digital forensics, but first, the behaviour of SQLite while in WAL mode should examined. A number of operations were performed on a SQLite database in WAL mode. After each operation the database file along with its “-shm” and “-wal” files were copied, audited and hashed so that their states could be examined.

Step 1: Create empty database with a single table:

8a9938bc7252c3ab9cc3da64a0e0e06a *database.db
b5ad3398bf9e32f1fa3cca9036290774 *database.db-shm
da1a0a1519d973f4ab7935cec399ba58 *database.db-wal

1,024       database.db
32,768      database.db-shm
2,128       database.db-wal

WAL Checkpoint Number:  0
WAL Salt-1: 3046154441
WAL Salt-2: 220701676

Viewing the database file using a hex editor we find a single page containing the file header and nothing else. As noted as well as creating a database file, a table was also created, however this data was written to the WAL in the form of a new version of this page. The WAL contains two frames, this new version of the first page in addition to a second frame holding an empty table page. When accessing this database through the SQLite engine this information is read from the “-wal” file transparently and we see the empty table, even though the data doesn’t appear in the database file itself.

Step 2: Force a checkpoint using PRAGMA command:

dd376606c00867dc34532a44aeb0edb6 *database.db
1878dbcefc552cb1230fce65df13b8c7 *database.db-shm
da1a0a1519d973f4ab7935cec399ba58 *database.db-wal

2,048       database.db
32,768      database.db-shm
2,128       database.db-wal

WAL Checkpoint Number:  0
WAL Salt-1: 3046154441
WAL Salt-2: 220701676

Using the pragma command mentioned above, the database was “checkpointed”. Accessing the database through SQLite we see no difference to the data but examining the files involved, we can clearly see that the database file has changed (it has different hash) furthermore it has grown. Looking inside the database file we can see the two pages from the “-wal” file have now been written into the database file itself and SQLite will be reading this data from here rather than the “-wal” file.

The WAL Checkpoint number and salts were not changed at this point, as we will see they are altered the next time that the WAL is written to.

Another interesting observation is that the “-wal” file was left completely unchanged during the checkpoint process – a fact that will become extremely important in the next step.

Step 3: Insert a single row:

dd376606c00867dc34532a44aeb0edb6 *database.db
6dc09958989a6c0094a99a66531f126f *database.db-shm
e9fc939269dbdbfbc157d8c12be720ed *database.db-wal

2,048       database.db
32,768      database.db-shm
2,128       database.db-wal

WAL Checkpoint Number:  1
WAL Salt-1: 3046154442
WAL Salt-2: 534753839

A single row was inserted into the database using a SQL INSERT statement. Once again we arrive at a situation where the database file itself has been left untouched, evidenced by the fact that the database file’s hash hasn’t altered since the last step.

The “-wal” file hasn’t changed size (so still contains two WAL frames) but clearly the contents of the file have changed. Indeed, examining the file in a hex editor we find that the first frame in the file contains a table page containing the newly inserted record as we would expect. What is interesting is that the second frame in the file is the same second frame found in the file in the previous two steps. After a checkpoint the “-wal” file is not deleted or truncated, it is simply reused, frames being overwritten from the top of the file.

Examining the Frame’s headers we see the following:

Frame Page Number Commit Size Salt-1 Salt-2
1 2 2 3046154442 534753839
2 2 2 3046154441 220701676

Both frames relate to the same page in the database but their salt values differ. As previously noted these two salt values are copied from the WAL file header as they are at the time of writing. Salt-2 is regenerated upon each checkpoint, but key here is Salt-1 which is initialised when the WAL is first created and then incremented upon each checkpoint. Using this value we can show that the page held in second frame of the WAL is a previous version of page held in the first frame: we can begin to demonstrate a timeline of changes to the database.

Step 4: Force a checkpoint using PRAGMA command:

704c633fdceceb34f215cd7fe17f0e84 *database.db
a98ab9ed82393b728a91aacc90b1d788 *database.db-shm
e9fc939269dbdbfbc157d8c12be720ed *database.db-wal

2,048       database.db
32,768      database.db-shm
2,128       database.db-wal

WAL Checkpoint Number:  1
WAL Salt-1: 3046154442
WAL Salt-2: 534753839

Once again a checkpoint was forced using the PRAGMA command.  As before the updated pages in the WAL were written into the database file and this operation had no effect on the contents of the “-wal” itself. Viewing the database using the SQLite engine shows the same data as in the previous step.

Step 5: Insert a second row, Update contents of the first row:

704c633fdceceb34f215cd7fe17f0e84 *database.db
d17cf8f25deaa8dbf4811b4d21216506 *database.db-shm
ed5f0336c23aef476c656dd263849dd0 *database.db-wal

2,048       database.db
32,768      database.db-shm
2,128       database.db-wal

WAL Checkpoint Number:  2
WAL Salt-1: 3046154443
WAL Salt-2: 3543470737

A second row was added to the database using a SQL INSERT statement and the previously added row was altered using an UPDATE statement.

Once again, and as is now fully expected, the database file is unchanged, the new data has been written to the WAL. The WAL contains two frames: The first holds a table page containing the original record along with our newly added second record; the second frame holds a table page containing the updated version of our original record along with the new, second record. Examining the frame headers we see the following:

Frame Page Number Commit Size Salt-1 Salt-2
1 2 2 3046154443 3543470737
2 2 2 3046154443 3543470737

In this case both frames contain data belonging to the same page in the database and the same checkpoint (Salt-1 is the same for both frames); in this case the order of events is simply detected by the order in which the frames appear in the file – they are written to the file from the top, down.

Step 6: Insert a third row:

704c633fdceceb34f215cd7fe17f0e84 *database.db
5ac6d9e56e6bbb15981645cc6b4b4d6b *database.db-shm
672a97935722024aff4f1e2cf43d83ad *database.db-wal

2,048       database.db
32,768      database.db-shm
3,176       database.db-wal

WAL Checkpoint Number:  2
WAL Salt-1: 3046154443
WAL Salt-2: 3543470737

Next, a third row was added to the database using an INSERT statement. Viewing the database logicaly with the SQLite engine we see all three records. While database file remains unchanged, the “-wal” file now contains 3 frames: the first two are as in the previous step with the third and final new frame holding a table page with all three records. The frame headers contain the following information:

Frame Page Number Commit Size Salt-1 Salt-2
1 2 2 3046154443 3543470737
2 2 2 3046154443 3543470737
3 2 2 3046154443 3543470737

We now have three versions of the same page, as before the sequence of events is denoted by the order they occur in the file.

Step 7: Force a checkpoint using PRAGMA command:

04a16e75245601651853fd0457a4975c *database.db
05be4054f8e33505cc2cd7d98c9e7b31 *database.db-shm
672a97935722024aff4f1e2cf43d83ad *database.db-wal

2,048       database.db
32,768      database.db-shm
3,176       database.db-wal

WAL Checkpoint Number:  2
WAL Salt-1: 3046154443
WAL Salt-2: 3543470737

As we have observed before the checkpoint results in to the up-to-date records being written into the database, the “-wal” file is unaffected.

Step 8: Delete A Row:

04a16e75245601651853fd0457a4975c *database.db
dca5c61a689fe73b3c395fd857a9795a *database.db-shm
3b518081a5ab4a7be6449e86bb9c2589 *database.db-wal

2,048       database.db
32,768      database.db-shm
3,176       database.db-wal

WAL Checkpoint Number:  3
WAL Salt-1: 3046154444
WAL Salt-2: 2798791151

Finally in this test, the second record in the table (the record added in Step 5) was deleted using an SQL DELETE statement. Accessing the database using the SQLite engine shows that the record is no longer live in the database.

As per expectations the database file is unaffected by this operation, the changes instead being written to the WAL. The “-wal” file contains three frames:  the first frame holds a table page with the second record deleted (the data can still be seen, and could be recovered using a tool such as Epilog, however the metadata on the page shows that the record is not live). The remaining two pages are identical to the final two frames in the previous step. Examining the frame headers we see the following:

Frame Page Number Commit Size Salt-1 Salt-2
1 2 2 3046154444 2798791151
2 2 2 3046154443 3543470737
3 2 2 3046154443 3543470737

Here we once again see three frames all containing data from the same database page, this time the most recent version of the page is found in frame 1 as it has the highest Salt-1 value; the other two frames have a lower Salt-1 value and are therefore older revisions; as they both share the same Salt-1 value we apply the “position in file” rule, the later in the file the frame occurs, the newer it is. So in order of newest to oldest the frames are ordered: 1, 3, 2.

Summarising the findings in this experiment:

  • Altered or new pages are written to the WAL a frame at a time, rather than the database file
  • The most up-to-date pages in the WAL are written to the database file on a Checkpoint event – this operation leaves the “-wal” file untouched
  • After a Checkpoint, the “-wal” file is reused rather than deleted or truncated,  with new frames
  • Multiple frames for the same database page can exist in the WAL, their relative ages can be derived by first examining the frame header’s Salt-1 value with newer frames having higher values. Where multiple frames have the same Salt-1, their age is determined by their order in the WAL, with newer frames occurring later

Pitfalls and Opportunities

The most obvious opportunity afforded by the Write Ahead Log is the potential for time-lining of activity in database. To prove the concept, a small Python script was written which would automate the analysis of the frames in a WAL file and provide a chronology of the data; a sample output is shown below:

Header Info:
    Page Size: 1024
    Checkpoint Sequence: 3
    Salt-1: 3046154444
    Salt-2: 2798791151

Reading frames...

Frame 1 (offset 32)
    Page Number: 2
    Commit Size: 2
    Salt-1: 3046154444
    Salt-2: 2798791151

Frame 2 (offset 1080)
    Page Number: 2
    Commit Size: 2
    Salt-1: 3046154443
    Salt-2: 3543470737

Frame 3 (offset 2128)
    Page Number: 2
    Commit Size: 2
    Salt-1: 3046154443
    Salt-2: 3543470737

Unique Salt-1 values:
    3046154443
    3046154444

Chronology of frames (oldest first):

Page Number: 2
    Frame 2
    Frame 3
    Frame 1

With further work it should be possible to display a sequence of insertions, updates and deletions of records within a database – a feature which is a top priority for the next update of Epilog. Even without the ability to timeline, it is clear that deleted records can be stored and recovered from the WAL (functionality already present in Epilog).

One behaviour which hasn’t been described in full so far is that a database file in WAL mode isolated from its associated “-wal” file is, in almost all circumstances, a valid database in its own right. For example, consider the test database above as it is at the end of Step 8. If the database file was moved to another directory, as far as the SQLite database engine is concerned this is a complete database file. If this isolated database file was queried, the data returned will be that which was present at the last checkpoint (in our test case, this would be the 3 live records present at the checkpoint performed in step 7).

This raises an important consideration when working with a SQLite database contained in a disk image or other container (eg. a TAR archive): if the database file is extracted from the image or container without its associated WAL files, the data can be out-of-date or incomplete. The other side of the coin is that the “full up-to-date” version of the data (viewed with the WAL present) may lack records present in the isolated database file because of deletions pending a checkpoint. There is, then, an argument for examining databases both ways: complete with WAL files and isolated as it may be possible to obtain deleted records “for free”.

Summing Up

The Write Ahead Log introduced in SQLite 3.7 may afford digital forensics practitioners new opportunities to extract extra data and behaviour information from SQLite databases; however the mechanism should be understood to get the most of the new opportunities and avoid confusion when working with the databases.

If you have any comments or questions, please leave a comment below or get in touch directly at research@ccl-forensics.com.

References:

SQLite File Format

Write Ahead Log

Alex Caithness, CCL-Forensics

Geek post: NSKeyedArchiver files – what are they, and how can I use them?

If you have spent any time investigating iOS or OSX devices you will probably have come across files (usually property lists) making reference to NSKeyedArchiver. Certainly, in my experience working with iOS, these files can often contain really interesting data (chat data for example) but the data can appear at first glance unstructured and difficult to work with.

In this blog post I aim to explain where these files come from, how they are structured and how you can get the most out of them.

Remember, remember…

NSKeyedArchiver is a class in the Mac programming API which is designed to serialise data. Data serialisation is the process through which data belonging to a program currently held in memory is written to a file on disk, so that it may be reloaded into memory at some point in the future.

This process can be achieved in a number of ways depending on the requirements of the programmer; however, NSKeyedArchiver provides a convenient way for entire “objects” in memory to be serialised, so it is a widely-used method.

Let’s take a moment to consider what is meant by an “object” in terms of programming (don’t worry; I’m not going to get too programmery). Many modern programming languages allow for (or are entirely based upon) the Object Oriented Programming paradigm. Put very generally this means that they are based around data structures (objects) which contain both data fields and built-in functionality (usually known as “methods”).

So let’s imagine that we were writing the code to define a “Person” object: we might define data fields such as: “Name”; “Age”; “Height”; and “Weight” – but we might also want to give it functionality. For example: “Speak” and “Wave”.

Obviously, a “Car” object would be different from a “Person” – it would have data fields like: “Make”; “Model”; and “Fuel-Type”. It might also have a data field for “Driver” which would hold a reference to a “Person” object.

A “Road” object might have a data fields for: “Name”; “Length”; and “Speed-Limit” along with a data field containing a collection of “Car” objects (each having a reference to a “Person” object in their “Driver” data field).

Similar (and often far more complicated) data structures might be represented in a chat application: a “Message-List” object containing a collection of “Message” objects containing fields for “Sent-Time”, “Message-Text” and “Sender”, which itself contains a reference to a “Contact” object which contains fields for “Nickname”, “Email-Address” and so on.

It’s these kinds of data structures that NSKeyedArchiver converts from objects in memory and stores as a file which can subsequently be loaded back into the memory to rebuild the structure.

So what must NSKeyedArchiver store in order to achieve this? Well, there are two requirements: it has to store details of the type of object it’s serialising; and it has to store the data held in the objects (and the correct structure of the data).

NSKeyedArchiver property lists

NSKeyedArchiver serialises the object data into a binary property list file, the basic layout of which is always the same:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
    <dict>
        <key>$archiver</key>
        <string>NSKeyedArchiver</string>
        <key>$objects</key>
        <array>
            <null/>
            <string>Alex</string>
            <dict>
                <key>Name</key>
                <dict>
                    <key>CF$UID</key>
                    <integer>1</integer>
                </dict>
            </dict>
        </array>
        <key>$top</key>
        <dict>
        <key>root</key>
        <dict>
            <key>CF$UID</key>
            <integer>2</integer>
        </dict>
        </dict>
        <key>$version</key>
        <integer>100000</integer>
    </dict>
</plist>

Listing 1: Overview of an NSKeyedArchiver XML property list file.

In Listing 1 we have an example of an NSKeyedArchiver property list (converted to XML for ease of reading).  At the top level, every NSKeyedArchiver file is made up of a dictionary with four keys.

Two of the keys simply provide metadata about the file: the “$archiver” key should always be followed by a string giving the name of the archiver used to create this file, which should obviously always be “NSKeyedArchiver” and the “$version” key should be followed by an integer giving the version of the archiver (100000 appears to be the only valid value).

The other two keys (“$objects” and “$top”) contain the data that has been serialised and its structure.

The “$objects” key is followed by an array containing all the pieces of data involved in the serialisation, but stored flat with little or no structure. Each of these pieces of data can be understood as being enumerated starting from zero. Within this array of objects may be data which contains the structure shown in Listing 2:

<dict>
        <key>CF$UID</key>
        <integer>0</integer>
</dict>

Listing 2: an example of the CF$UID data type.

The CF$UID data type in Listing 2 is a dictionary with a single key (“CF$UID”) which is followed by an integer number (this layout is what you will see when the property list is represented in XML; in the raw binary format the “UID” data type is a separate entity which doesn’t require the dictionary structure).

These data types represent a reference to another entity in the “$objects” array. The number of the CF$UID gives the position of the array. Consider the snippet shown in Listing 3:

    <key>$objects</key>
    <array>
        <null/>
        <string>Alex</string>
        <dict>
            <key>Name</key>
            <dict>
                <key>CF$UID</key>
                <integer>1</integer>
            </dict>
        </dict>
    </array>

Listing 3: an example of a “$objects” array.

Listing 3 shows an “$objects” array containing three pieces of data. Indexing them starting from 0 we have:

  1. A null
  2. A string containing the value “Alex”
  3. A dictionary

The dictionary at index 2 contains a single key: “Name”. The following value is a “CF$UID” data type referencing index 1 in the “$objects” array so we could consider the data to be equivalent to Listing 4:

    <key>$objects</key>
    <array>
        <null/>
        <string>Alex</string>
        <dict>
            <key>Name</key>
            <string>Alex</string>
        </dict>
    </array>

Listing 4: the “$objects” array from Listing 3 “unpacked”.

This example is very simplistic; in reality the structure revealed by unpacking the object array can be extremely deeply nested with objects containing references to objects containing references to objects…and so on.

The observant among you may be thinking “this seems like a very inefficient way to represent the data”, and for this example you’d certainly be right! However, in most real-life cases the complex data held in these files contains many repeating values which, when arranged this way, only have to be stored once but can be referenced in the “$objects” array multiple times.

The “$top” key is our entry point to the data, so it is the data held at this key that represents the total structure of the object that has been serialised. This key will be followed by a single dictionary which again contains a single key “root”. The “root” key will be followed by a single CF$UID data type which will be a reference the top level object in the “$objects” array.

Returning to the example in Listing 1 the “root” is referencing the object at index 2 in the objects array. So expanding this, our complete data structure is shown in Listing 5:

    <key>$top</key>
    <dict>
        <key>root</key>
        <dict>
            <key>Name</key>
            <string>Alex</string>
        </dict>
    </dict>

Listing 5: Expanded “$top” object, showing complete data structure.

A sense of identity

So far we have only seen examples of basic data stored in this structure where the type of data is implicit but in most files you are likely to encounter you will see additional data relating to the type of the objects being stored.

Listing 6 shows an unpacked “$top” object from a “CHATS2.plist” file produced by the iOS application “PingChat”:

    <key>$top</key>
    <dict>
    <key>root</key>
    <dict>
        <key>$class</key>
        <dict>
            <key>$classes</key>
            <array>
                <string>NSMutableDictionary</string>
                <string>NSDictionary</string>
                <string>NSObject</string>
            </array>
            <key>$classname</key>
            <string>NSMutableDictionary</string>
        </dict>
        <key>NS.keys</key>
        <array>
            <string>pingchat</string>
        </array>
        <key>NS.objects</key>
        <array>
            <dict>
                <key>$class</key>
                <dict>
                    <key>$classes</key>
                    <array>
                        <string>NSMutableArray</string>
                        <string>NSArray</string>
                        <string>NSObject</string>
                    </array>
                    <key>$classname</key>
                    <string>NSMutableArray</string>
                </dict>
                <key>NS.objects</key>
                <array>
                    <dict>
                        <key>$class</key>
                        <dict>
                            <key>$classes</key>
                            <array>
                                <string>BubbleItem</string>
                                <string>NSObject</string>
                            </array>
                            <key>$classname</key>
                            <string>BubbleItem</string>
                        </dict>
                        <key>state</key>
                        <integer>1</integer>
                        <key>image</key>
                        <string>$null</string>
                        <key>msg</key>
                        <string>Yo</string>
                        <key>author</key>
                        <string>testingtesting</string>
                        <key>time</key>
                        <dict>
                            <key>$class</key>
                            <dict>
                                <key>$classes</key>
                                <array>
                                    <string>NSDate</string>
                                    <string>NSObject</string>
                                </array>
                                <key>$classname</key>
                                <string>NSDate</string>
                            </dict>
                            <key>NS.time</key>
                            <real>307828812.649871</real>
                        </dict>
                    </dict>
                </array>
            </dict>
        </array>
    </dict>
    </dict>

Listing 6: Expanded “$top” object taken from a PingChat “CHATS2.plist” file.

In Listing 6 we can begin to see how complex the serialised data can become (and this is a simpler example). However, if you keep your cool and realise that the data is still well-structured it is possible to parse the data into something more meaningful.

One new data structure we encounter for the first time here is the “$class” structure. “$class” isn’t part of the data itself, but rather information about which type of object has been serialised. This information is obviously important when the program that serialised the data comes to deserialise it, but we can also use it to give us clues about the meaning of the data; consider the snippet in Listing 7:

<dict>
    <key>$class</key>
    <dict>
        <key>$classes</key>
        <array>
            <string>NSMutableArray</string>
            <string>NSArray</string>
            <string>NSObject</string>
        </array>
        <key>$classname</key>
        <string>NSMutableArray</string>
    </dict>
    <key>NS.objects</key>
    <array>
        <dict>
            <key>$class</key>
            <dict>
                <key>$classes</key>
                <array>
                    <string>BubbleItem</string>
                    <string>NSObject</string>
                </array>
                <key>$classname</key>
                <string>BubbleItem</string>
            </dict>
            <key>state</key>
            <integer>1</integer>
            <key>image</key>
            <string>$null</string>
            <key>msg</key>
            <string>Yo</string>
            <key>author</key>
            <string>testingtesting</string>
            <key>time</key>
            <dict>
                <key>$class</key>
                <dict>
                    <key>$classes</key>
                    <array>
                        <string>NSDate</string>
                        <string>NSObject</string>
                    </array>
                    <key>$classname</key>
                    <string>NSDate</string>
                </dict>
                <key>NS.time</key>
                <real>307828812.649871</real>
            </dict>
        </dict>
    </array>
</dict>

Listing 7: Snippet of a single object in a PingChat “CHATS2.plist” file.

Let’s take a look at the objects involved here and what the “$class” sections can tell us about the data held. The “$class” structure takes the form of a dictionary containing two keys. The “$classname” section is fairly straightforward; it simply gives us the name of the type of object we’re dealing with.

So in the case of the first “$class” structure encountered, we find that the object is of type “NSMutableArray”, which a quick Google search tells us is a “Modifiable array of objects” – so the data held in this object is going to take the form of an array or list.

The other key in the “$class” structure is “$classes”; this is a little more subtle and requires a little more explanation of one of the key concepts in most object-oriented programming languages: inheritance.

Think back to the explanation of the “Person” object. A “Person” object had the fields: “Name”; “Age”; “Height”; and “Weight” and the functionality to “Speak” and “Wave”. Now imagine that we wanted to create a new type of object: “DigitalForensicAnalyst”. We would want this new type of object to have some specialised functionality: “Image”; “Analyse”, and so on.

However, a “DigitalForensicAnalyst” is a “Person” too – they have a name, they can speak and wave. Now, programmers are stereotyped as being lazy (because they are) so it is unlikely that after spending all that time writing and debugging the code to represent a “Person” that they are going to duplicate all that hard work when it comes to creating a “DigitalForensicAnalyst”.

Instead they would have the “DigitalForensicAnalyst” object inherit the functionality from “Person”; this means that only the new functionality of “Image” and “Analyse” need be created from scratch, all of the other functionality comes free thanks to this inheritance.

Coming back to the “$classes” key, this will be followed by an array containing the names of all of the types that this object inherits from. So in the case of our first “$class” structure we can see that the “NSMutableArray” inherits functionality from both “NSArray” and “NSObject” which may give us further  hints about what data might be held.

So, this “NSMutableArray” is going to contain a collection of other objects, and looking at the rest of this object’s structure we find a key “NS.Objects” which is followed by an array containing just that collection. This array has only one item, another object containing a “$class” definition, so let’s take a look. This time our “$classname” is particularly useful: “BubbleItem” – making reference to the “speech bubbles” displayed on screen; and indeed we find message details (author, message text, timestamp) in the object’s data fields.

There’s got to be an easier way…

NSKeyedArchiver files can contain really key evidence but even a data-sadist like me is going to lose their mind converting all of these files into a usable format by hand; so how can we speed things up?

Well our PIP tool has a “Unpack NSKeyedArchiver” feature which reveals the object structure so that you can use XPaths to parse the file (either by using one of many already in the included XPath library or by writing your own).

Also, if you are a Python fan (if not, why not?) I have updated our ccl_bplist python module, which you can get here, with a “deserialise_NsKeyedArchiver” function to unpack the “$top” object.

I hope you found this blog post useful. As always, if you have any questions or suggestions, leave a comment below or contact the R&D team at research@ccl-forensics.com.

Alex Caithness, Lazy Data-Sadist, CCL-Forensics

Special thanks to Arun Prasannan for assisting the BlogKeeper by rendering the code readable.

Free Python module for processing plist files – get ’em while they’re hot, they’re lovely!

As anyone who has examined an iOS device (or an OSX device for that matter) will know, property list files are a major source of potential evidence. Being one of the main data storage formats they might contain anything from configuration details to browsing history to chat logs.

We’ve examined property lists in detail previously, covering their file formats and the challenges they might present (you can download the white paper here). We have also released our tool PIP, which can be used to parse data from plists and other XML data in a structured way.

As regular readers might have gathered I’m pretty keen on the Python scripting language and while the built-in libraries do support reading from the XML property list format (available through the plistlib module) there is no support for the binary property list format which is increasingly becoming the standard format.

Recently I had a task where I needed to parse a large number of binary format property list files inside a script I was writing. It was theoretically possible to have the script export them all to a single location and then perhaps run PIP separately but I really wanted to have everything self-contained and besides, I needed to make use of the data extracted from the plist files later on in the script.

One of the beauties of Python is how quickly you can go from concept to “product”, so I decided that rather than waste time finding a work-around I would craft a proper solution. After a few hours’ work I had a fully functioning module for dealing with binary plist files in Python and today we’re excited to be releasing the code so that other practitioners and coders can make use of it.

You can get the source from our Google Code page (while you’re at it you can also download our module for dealing with BlackBerry IPD backup files here.

I designed the module to provide the parsed data in as native a format as possible (see Table 1) so when writing your code you do not have to deviate from the normal Pythonic constructs – the data structure returned contains everyday Python objects. The data structure returned is also vastly interchangeable with that returned by plistlib’s “readPlist” function (the exception being that plistlib wraps “data” fields in a “Data” object rather than giving direct access to the underlying “bytes” object).

Table 1: Converted data types returned by ccl_bplist

The module only has one function that you need to know about in order to use it: “load()”. This function takes a single argument which should be a binary file-like object* and it returns the python representation of the data in the property list.

In addition to the ccl_bplist module, the Google Code repository contains an example of the module in use, parsing the “IconState.plist” from an iOS device auditing the Apps and folders present on the Springboard home screen.

Icon state screenshot

Script results

We really hope that the community will be able to make use of this module and if you have any questions please leave a comment or email us at research@ccl-forensics.com.


*When working from a file on disk you should use the open() function with “b” in the mode string e.g.: open(file_name, “rb”).

There may be other times when the bplist has come from another source, e.g. a BLOB field in a database or even a property list embedded in another’s data field. In these cases you can wrap a bytes object in a BytesIO object found in the “io” module e.g.: io.BytesIO(some_data).