Chrome Session and Tabs Files (and the puzzle of the pickle)

In this blog post Alex Caithness investigates the file format and contents of Chrome’s “Current Tabs”, “Current Session”, “Last Tabs” and “Last Session” files and discovers that, even with the original source code at your side, you can still end up getting yourself into a Pickle.

A link to a Python script for automating the process can be found at the end of the post.

I’ve been on a bit of a browser artefacts kick as of late, digging around both on desktop and mobile platforms for stuff I haven’t tackled before. Taking a peek in my preferred browser’s (Chrome) “AppData” folder revealed that the ubiquitous-ness of SQLite as a storage format means that inspecting the data for a lot of artefacts has been made pretty simple. I had also recently tackled the Chromium web-cache format for another project (the format is now also used both on Android and RIM Playbooks) and, with the pain that caused me still fresh in my mind I had no desire to revisit it. There were, however, four likely looking candidates for a quick probing in the form of the “Current Tabs”, “Current Session”, “Last Tabs” and “Last Session” files.

Chrome AppData Folder

Hello there…

Broadly speaking, these files store the state of the opened tabs, their back-forward lists and the sites displayed therein. The files can be used by Chrome to restore your previous browsing session when you restart the browser (if that’s how you have it set up) or in the event of a crash. It turns out that these files can contain some really rich data, but first you have to do battle with the file format…

In previous posts I’ve made mention of the usefulness of having access to the source code that governs the format in which the data is to be stored, and as Chrome is open source I was heartened. “This shouldn’t be too tricky,” I thought to myself as I set about finding the ‘few lines of code’ which would unlock the file’s secrets… Let me tell you now: the Chrome source is a sprawling behemoth and my journey across the codebase (and on one occasion, outside of it) was long and arduous, and, when it comes down to it, it all boils down to understanding the ‘Pickle’…

Header of the Session file

The file header

The file header was easy to track down. I headed over to the definition for session_backend (src/chrome/browser/sessions/session_backend.cc), where we confirm that “SNSS” is simply a file signature followed by a 32-bit integer giving the version of the file, which, at the time of writing, should always be 1 (all data is stored little-endian). Also in this file we encounter a method named “AppendCommandsToFile”, which appears to be responsible for writing the details into the files. The method shows that, for each record, a 16-bit integer is written to the file giving the size in bytes of the record (not including this value), followed by an 8-bit “id” (which appears to relate to the ‘type’ of the record) and the contents of the “SessionCommand”.
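As a rough illustration of the layout just described, here is a minimal Python sketch that checks the header and then walks the size-prefixed records. The function name `read_session_records` is made up for this example, and the sketch assumes that the 16-bit size covers the 8-bit id plus the SessionCommand contents, which is how the description above reads.

```python
import io
import struct

SIGNATURE = b"SNSS"


def read_session_records(stream):
    """Yield (command_id, payload) tuples from an SNSS stream."""
    header = stream.read(8)
    if len(header) < 8 or header[:4] != SIGNATURE:
        raise ValueError("not an SNSS file")
    # The version is a little-endian 32-bit integer, expected to be 1.
    (version,) = struct.unpack("<i", header[4:8])
    if version != 1:
        raise ValueError("unexpected SNSS version: %d" % version)

    while True:
        size_bytes = stream.read(2)
        if len(size_bytes) < 2:
            break  # end of file (or a truncated trailing record)
        (size,) = struct.unpack("<H", size_bytes)
        record = stream.read(size)
        if size == 0 or len(record) < size:
            break  # malformed or truncated record; stop rather than guess
        # Assumption: the first byte of the record is the 8-bit id and
        # everything after it is the SessionCommand contents.
        yield record[0], record[1:]
```

In use, you would open one of the four files in binary mode (for example `open("Current Session", "rb")`) and loop over `read_session_records(f)`, passing each payload on to whatever decodes that record type.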

Record structure overview

So now I knew what the overview of the structure in the file was: a nice simple size, contents, size, contents, size, contents… etc. file format, with the records written sequentially, one after another. But I still had no information about the structure of those contents. SessionBackend was operating with a SessionCommand object so I tracked down the source code describing this object (src/chrome/browser/sessions/session_command.h) but was disappointed to find the following explanation in the source code’s comments:

“SessionCommand contains a command id and arbitrary chunk of data. The id and chunk of data are specific to the service creating them.”

OK, so the information I wanted isn’t going to be here, but the comments go on to say:

“Both TabRestoreService and SessionService use SessionCommands to represent state on disk”

Aha! So although I hadn’t quite found what I was looking for here, I had found a useful signpost pointing in the right direction. Now, neither “TabRestoreService” (src/chrome/browser/sessions/tab_restore_service.h) nor “SessionService” (src/chrome/browser/sessions/session_service.h) themselves give us the information we’re after, but both of them ‘inherit’ from a common base class called “BaseSessionService” (src/chrome/browser/sessions/base_session_service.cc) (I gave a brief overview of object oriented principles including inheritance in a previous blog post) and it is in BaseSessionService where we finally get what we’re after…

BaseSessionService contains a method called “CreateUpdateTabNavigationCommand” which is responsible for writing that “arbitrary chunk of data” into the SessionCommand which eventually gets written to disk. The record starts with a 32 bit integer which gives the length of the data (this is in addition to the length value outside the SessionCommand). The rest of the SessionCommand’s contents structure is described in the table below.

SessionCommand serialisation

SessionCommand structure

Data type | Meaning
32 bit Integer | Tab ID
32 bit Integer | Index in this tab’s back-forward list
ASCII String (32 bit Integer giving the length of the string in characters followed by an ASCII string of that length) | Page URL
UTF-16 String (32 bit Integer giving the length of the string in characters followed by a UTF-16 string of that length) | Page Title
Byte string (32 bit Integer giving the length of the string in bytes followed by a byte string of that length) | “State” (A data structure provided by the WebKit engine describing the current state of the page. We will look at it in detail later)
32 bit Integer | Transition type (explained below)
32 bit Integer | 1 if the page has POST data, otherwise 0
ASCII String (see above) | Referrer URL
32 bit Integer | Referrer’s Policy
ASCII String (see above) | Original Request URL (for example if a redirect took place)
32 bit Integer | 1 if the user-agent was overridden, otherwise 0

As a SessionCommand’s contents can be populated by other means, not every SessionCommand contains data formatted as shown above. During testing it was shown that it is the SessionCommand’s 8-bit ID which identifies whether the record contains this kind of data (when the ID was 1 or 6 then this data format was found). Those with other IDs were typically much shorter (usually around 16-32 bytes in length) and did not appear to contain information which was of so much interest.

There are a few fields in the table above which are worth taking a closer look at; the “State” field we’ll explore in detail later as it’s a complicated one. The “Transition type” is a little easier to explain; this field tells Chrome how the page was arrived at. The field will be an integer number, the meaning of which is described in the tables below. The value is essentially split into two sections: the least significant 8-bits of the integer give a type of transition and the most-significant 24-bits form a bit-mask which gives other details. These details are gathered from page_transition_types (content/public/common/page_transition_types.h).

Least significant 8 bits (value) | Meaning
0 | User arrived at this page by clicking a link on another page
1 | User typed URL into the Omnibar, or clicked a suggested URL in the Omnibar
2 | User arrived at page through a bookmark or similar (eg. “most visited” suggestions on a new tab)
3 | Automatic navigation within a sub frame (eg an embedded ad)
4 | Manual navigation in a sub frame
5 | User selected suggestion from Omnibar (ie. typed part of an address or search term then selected a suggestion which was not a URL)
6 | Start page (or specified as a command line argument)
7 | User arrived at this page as a result of submitting a form
8 | Page was reloaded; either by clicking the refresh button, hitting F5 or hitting enter in the address bar. Also given this transition type if the tab was opened as a result of restoring a previous session.
9 | Generated as a result of a keyword search, not using the default search provider (for example using tab-to-search on Wikipedia). Additionally a transition of type 10 (see below) may also be generated for the url: http:// + keyword
10 | See above

Bit mask | Meaning
0x01000000 | User used the back or forward buttons to arrive at this page
0x02000000 | User used the address bar to trigger this navigation
0x04000000 | User is navigating to the homepage
0x10000000 | The beginning of a navigation chain
0x20000000 | Last transition in a redirect chain
0x40000000 | Transition was a client-side redirect (eg. caused by JavaScript or a meta-tag redirect)
0x80000000 | Transition was a server-side redirect (ie a redirect specified in the HTTP response header)

NB during testing, although the transition types looked correct in the “Current Session” and “Last Session” files, in the “Current Tabs” and “Last Tabs” files the transition type was always recorded as type 8 (Reloaded page).
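To make the split between the low 8 bits and the upper bit-mask concrete, here is a small Python sketch. The short labels in the two dictionaries are shorthand invented for this example (they are not the names used in the Chrome source), and `decode_transition` is a hypothetical helper name.

```python
# Shorthand labels for the core transition types (low 8 bits).
CORE_TYPES = {
    0: "link",
    1: "typed",
    2: "bookmark",
    3: "auto subframe",
    4: "manual subframe",
    5: "omnibar suggestion",
    6: "start page",
    7: "form submit",
    8: "reload",
    9: "keyword",
    10: "keyword generated",
}

# Shorthand labels for the qualifier bits (upper 24 bits).
QUALIFIERS = {
    0x01000000: "back/forward",
    0x02000000: "from address bar",
    0x04000000: "home page",
    0x10000000: "chain start",
    0x20000000: "chain end",
    0x40000000: "client redirect",
    0x80000000: "server redirect",
}


def decode_transition(value):
    """Return (core_type_label, [qualifier_labels]) for a transition value."""
    value &= 0xFFFFFFFF  # tolerate values that were read as signed integers
    core_value = value & 0xFF
    core = CORE_TYPES.get(core_value, "unknown (%d)" % core_value)
    flags = [name for mask, name in QUALIFIERS.items() if value & mask]
    return core, flags
```

For example, `decode_transition(0x01000008)` gives a reload reached via the back or forward buttons.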

When it comes to the record structure, there is still a little more to the story, and yes, this is where the Pickles come in.

This data structure is not being written directly to a file, but rather to what Chrome calls a “Pickle” (src/base/pickle.h). A Pickle is a sort of ‘managed buffer’; a way for Chrome to write (and read) a bunch of values, like those in the tables above, into an area of memory in a controlled way. Indeed, the “length-value” structure we see with the strings is down to the way Pickles write strings into memory, as is the, apparently superfluous, extra ‘length’ field at the start of the record structure. One other pickle-related side-effect which isn’t necessarily immediately obvious when you look at the data in a hex editor is that pickles will always write data so it is uint32-aligned. This means that data will always occupy blocks of 4 bytes and if needed (such as in the case of strings) will be padded to ensure that the next data begins at the start of the next 4-byte block.
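The alignment rule is easier to see in code. The sketch below is a minimal reader for the length-prefixed, 4-byte-aligned values described above, followed by a function that pulls out the first few fields of the SessionCommand table. `PickleReader` and `parse_navigation` are names invented for this example, and the sketch assumes the payload handed to it starts at the extra 32-bit length field; it deliberately stops at the transition type rather than guessing how the remaining fields behave across versions.

```python
import struct


class PickleReader:
    """Minimal reader for 4-byte-aligned, length-prefixed values."""

    def __init__(self, data):
        self.data = data
        self.pos = 0

    def read_int32(self):
        (value,) = struct.unpack_from("<i", self.data, self.pos)
        self.pos += 4
        return value

    def _read_aligned(self, length):
        raw = self.data[self.pos:self.pos + length]
        if length < 0 or len(raw) < length:
            raise ValueError("bad length or truncated data")
        # Step over the padding that brings us back to a 4-byte boundary.
        self.pos += (length + 3) & ~3
        return raw

    def read_ascii(self):
        # Length prefix counts characters (one byte each).
        return self._read_aligned(self.read_int32()).decode("ascii", "replace")

    def read_utf16(self):
        # Length prefix counts characters, so two bytes per character.
        raw = self._read_aligned(self.read_int32() * 2)
        return raw.decode("utf-16-le", "replace")

    def read_bytes(self):
        # Length prefix counts bytes.
        return self._read_aligned(self.read_int32())


def parse_navigation(payload):
    """Decode the leading fields of a navigation SessionCommand payload."""
    reader = PickleReader(payload)
    reader.read_int32()  # the extra length field written by the Pickle
    return {
        "tab_id": reader.read_int32(),
        "index": reader.read_int32(),
        "url": reader.read_ascii(),
        "title": reader.read_utf16(),
        "state": reader.read_bytes(),
        "transition": reader.read_int32() & 0xFFFFFFFF,
    }
```

The remaining fields in the table follow the same pattern of `read_int32` and `read_ascii` calls.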

It turns out that the contents of the mysterious “State” field are also governed by a Pickle. This field contains serialised data from the WebKit engine. The data is held in a “NavigationEntry” (content/public/browser/navigation_entry.h) “content state” field, but is originally populated by glue_serialize (webkit/glue/glue_serialize.cc). It duplicates some of the data that we have already described from the outer record, but also contains some more detailed information regarding the state of the page, not least the contents of any forms on the page. The code describing the serialisation process is found in glue_serialize in the WriteHistoryItem method.

The state byte string begins with a 32 bit Integer giving the length of the rest of the record (this is in addition to the length defined in the outer record structure) and then continues with the “WebHistoryItem” structure shown in the table below:

WebHistoryItem structure

Data type | Meaning
32 bit Integer | Format Version
String (see below) | Page URL
String (see below) | Original URL (for example if a redirect took place)
String (see below) | Page target
String (see below) | Page parent
String (see below) | Page title
String (see below) | Page alternative title
Floating point number (see below) | Last visited time
32 bit Integer | X scroll offset
32 bit Integer | Y scroll offset
32 bit Integer | 1 if this is a target item otherwise 0
32 bit Integer | Visit count
String (see below) | Referrer URL
String Vector (see below) | Document state (form data) – explained in more detail below
Floating point number (see below) | Page scale factor (Only present if the version field is greater than or equal to 11)
64 bit Integer | “Item sequence number” (Only present if the version field is greater than or equal to 9)
64 bit Integer | “Document sequence number” (Only present if the version field is greater than or equal to 6)
32 bit Integer | 1 if there is a “state object” otherwise 0 (Only present if the version field is greater than or equal to 7)
String (see below) | “State Object” (only present if the value above is 1 and the version field is greater than or equal to 7)
Form data (see below) | Form data
String (see below) | HTTP content type
String (see below) | Referrer URL (again, for backwards compatibility apparently)
32 bit Integer | Number of sub-items in the field below
WebHistoryItem Vector (see below) | A number of sub items (for example embedded frames). Each record has the same structure as this one

That table has a lot of “See below” in it, so let’s get down to explaining some of the subtleties/oddities that this data structure provides.

Strings: strings are actually stored differently to those in the outer record. Despite the fact that the data is still being written into a Pickle, the source code forsakes the Pickle’s built-in string serialisation methods (for reasons best known to the Chrome programmers), instead taking a more direct route of writing the length of the string directly, followed by the in-memory representation of the string. Basically, this results in the string fields comprising a 32-bit Integer giving the length of the string followed by a UTF-16 string; only this time the length refers to the length in bytes, not the length in characters. To further confuse matters, if the length is -1 (0xFFFFFFFF) this indicates that the string is not present (or ‘null’ in programming terms) or un-initialised (and therefore empty). There is one exception to this structure: if the version field is 2 then, as the comments in the source code suggest, the format was “broken” and stored the number of characters instead. This was fixed from version 3 onwards.

String Vector: “Vector” in this case essentially means ‘List’. The vector begins with a 32-bit Integer giving the number of entries in the list which is then followed by that many strings in the format described above. In the data structure above this is used to serialise what is described as the “document state”. In testing this appeared to contain information regarding any form fields that may be present on the page (including hidden fields). The list of strings can be broken up into groups of 3 strings, the first of which gives the name of the form field, the second the type of field and the third the current contents of the field.
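The string and string-vector rules above can be sketched in Python as follows. The function names are invented for this example. The sketch assumes the version 3+ behaviour (lengths in bytes), assumes that each string is padded to a 4-byte boundary in line with the Pickle alignment described earlier, and assumes that offsets are counted from the start of the state data so that the padding lines up.

```python
import struct


def read_webkit_string(data, pos):
    """Return (text_or_None, new_pos) for a string starting at pos."""
    (length,) = struct.unpack_from("<i", data, pos)
    pos += 4
    if length == -1:
        return None, pos  # null / un-initialised string
    if length < 0 or pos + length > len(data):
        raise ValueError("bad string length: %d" % length)
    text = data[pos:pos + length].decode("utf-16-le", "replace")
    # Assumption: skip padding up to the next 4-byte boundary.
    return text, pos + ((length + 3) & ~3)


def read_string_vector(data, pos):
    """Return (list_of_strings, new_pos) for a vector starting at pos."""
    (count,) = struct.unpack_from("<i", data, pos)
    pos += 4
    items = []
    for _ in range(count):
        item, pos = read_webkit_string(data, pos)
        items.append(item)
    return items, pos


def group_form_fields(strings):
    """Split a document-state list into (name, type, value) triples."""
    usable = len(strings) - len(strings) % 3  # ignore any incomplete tail
    return [tuple(strings[i:i + 3]) for i in range(0, usable, 3)]
```

Returning the new position alongside each value keeps the helpers stateless, so they can be chained field by field while walking through the WebHistoryItem table.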

Floating Point Numbers: IEEE 754 double-precision floating point numbers are used as a representation, but Pickles do not directly support this data type. Because of this, the code uses the Pickle’s “WriteData” method, passing the internal, in-memory representation of the floating point number into the Pickle. The upshot of using the “WriteData” method is that the 64-bit floating point number is prefaced with a 32-bit integer giving the length of the data (which will always be 8 for a double-precision float).
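A short sketch of reading such a value in Python, assuming little-endian storage as elsewhere in the file (`read_webkit_double` is a name made up for this example):

```python
import struct


def read_webkit_double(data, pos):
    """Return (value, new_pos) for a length-prefixed double at pos."""
    (length,) = struct.unpack_from("<i", data, pos)
    if length != 8:
        raise ValueError("expected an 8-byte double, got length %d" % length)
    # The 64-bit IEEE 754 value follows the 32-bit length prefix.
    (value,) = struct.unpack_from("<d", data, pos + 4)
    return value, pos + 12
```

Checking that the prefix really is 8 is a cheap sanity test that the parser has not drifted out of step with the data.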

Form Data: the (slightly convoluted) format for this data serialisation is detailed in the WriteFormData method in glue_serialize, however across testing this data was never populated so I can’t vouch for its contents.

Sub items: this contains further WebHistoryItems for any embedded pages or resources on the page. During testing I saw it used to store details of adverts, Facebook “like” buttons and so on. The structure for these sub items is identical to the structure described in the table (note, however, that unlike the top-level WebHistoryItem they do not begin with a size value).

So that’s the structure of the file – not the most pleasant file format I’ve ever dealt with and, even with the source code on hand, it was a lengthy task. So was it worth it?

Well, first the case against: a lot of the data is duplicated in other places, not least the History database (which is SQLite, so much nicer to work with), and between the “Current” and “Last” versions of the files you only have information regarding two sessions’ worth of browsing, although, increasingly in today’s “always-on” culture, this could still account for a significant period of browsing. Which brings me to the other significant disappointment with these files: timestamps (or rather the apparent lack of them). Of course, this makes perfect sense when you consider what Chrome needs the files for; timestamps simply aren’t required for restoring sessions. All the same, it’d make the files more useful to us if they were there.

But it’s not all doom and gloom (which is lucky, otherwise this blog post would be a bit of a waste of time). Firstly, although we only have two sessions’ worth of browsing live on the system, colleagues have already demonstrated to me that there is plenty of scope for recovering previous examples of the files, especially from volume shadow copies, and the 8-byte long static header means that carving files from unallocated space may be possible (no footer though, so some judgement would need to be made regarding the length of the files). Probably more importantly, these files give us access to information which would be tricky to acquire otherwise (or at the very least another opportunity to recover information which may have been deleted). The form contents are obviously a nice additional source of intelligence in terms of user credentials, email addresses and possibly message contents (I was able to recover Facebook chat messages from the form data in the “document state”, for example). Also, the presence of the transition types, referrer and requested URL fields means that you can build up detailed browsing behaviour profiles, tracking the movement between sites and tabs.
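As a sketch of the carving idea, the snippet below looks for the 8-byte static header (the “SNSS” signature followed by a little-endian version of 1) in a block of raw data. `find_snss_offsets` is a hypothetical helper and it only locates candidate start offsets; deciding where each file ends is still a judgement call, as noted above.

```python
# "SNSS" followed by the 32-bit little-endian version number 1.
SNSS_HEADER = b"SNSS\x01\x00\x00\x00"


def find_snss_offsets(blob):
    """Return every offset in blob where the static header appears."""
    offsets = []
    pos = blob.find(SNSS_HEADER)
    while pos != -1:
        offsets.append(pos)
        pos = blob.find(SNSS_HEADER, pos + 1)
    return offsets
```

For a large image you would read it in overlapping chunks rather than all at once, so that a header straddling a chunk boundary is not missed.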

This is not a file format that I would want to parse by hand again, so to automate the process I have written a Python script which we’re happy to make available to the forensics community. The script is designed both as a command line tool which generates a simple HTML report and a class library in case anyone wishes to integrate it into other tools (or create a different reporting format). You can download the script from http://code.google.com/p/ccl-ssns/.

As always, if you have any comments or questions you can get in touch in the comments or by emailing research@ccl-forensics.com

Alex Caithness

An analyst enthuses about Python. No, not that one. The geeky stuff.

You’ll have to excuse me for a moment while I climb up onto my soapbox because this blog is going to be a preachy one. Today I want to evangelise on a subject very dear to my heart: the scripting language known as Python.

“But I’m not a programmer Alex, I’m a digital forensic analyst*!”

I know, and I’m not for one moment suggesting that you should be looking at a change of career, but just as EnCase, FTK, TSK, XRY, Cellebrite, Oxygen and their ilk are essential tools of our trade, which we keep clipped to our utility belt at all times, a scripting language like Python should also feature in the list of tools we are proficient at using.

And there surely are other scripting languages out there such as Ruby, JavaScript and Perl (which is a good language as long as you like to have code that looks like you’ve held the shift key down and head-butted the keyboard repeatedly), but for me Python has the perfect combination of power, expressiveness and ease of use that makes it so suitable.

“But why should I trouble myself with learning another tool when the off-the-shelf tools do so much?”

The answer is simple: because laziness is a virtue.

Allow me to explain my reasoning: with the best will in the world these tools cannot, and should not, be expected to do everything. When one of these tools has a gap in their capabilities we are faced with the prospect of completing the task manually. These tasks will all have a certain level of complexity, time-intensiveness and mundaneness, which, according to “Caithness’ Law” all increase exponentially with proximity to the task’s deadline.

So you grit your teeth, clench your fists and get down to it, derive the solution and pull the requisite all-nighters to get the case out the door. At this point the way I see it is you have three options: you sacrifice a goat to the dark gods of digital forensics in order for this problem to never rear its ugly head again; you resign yourself to a fate of repeating this task until whatever hellish application or system that created this artefact goes out of circulation; or you get lazy and automate the task so that neither you nor any of your colleagues ever have to go through that pain again.

And that’s when it’s so useful to have a scripting language available to you.

I’m not going to attempt to teach you to program in Python in a single blog post as that would be both arrogant and misguided, but I do want to give you an example of a simple Python script I wrote a while back to automate a boring but necessary task that saves me time on a day-to-day basis.

When examining an image of an iOS device, inevitably one of the most interesting areas of the file system is the “mobile/Applications” folder where all the third-party applications store their data. The folder contains a number of folders (one for each app installed) which are named, not with the application’s name, but rather with a UID string.

Applications folder in an iOS device

In order to find out which folder contains which application you have to dive inside each one in turn and look for the “.app” folder which gives you the name of the app.

Inside the applications folder

As you can imagine, even with a modest number of applications this is a needlessly time-consuming exercise and when faced with an iPad belonging to a real app-collector it can put you into a catatonic state. Therefore, to ease the tedium of trawling the application folder I knocked together a little script which would audit the folders automatically.

I can sense that at this stage you’re itching to take a look at some actual, honest-to-goodness Python code, but first let’s consider the algorithm that we want to express. We have a folder full of folders, and inside each of those folders is a folder named “ApplicationName.app” where ApplicationName is the name of (you guessed it) the application. So I would suggest that we want to express an algorithm along the lines of:

  • Accept the path of the “mobile/Applications” directory as input to our script
  • Get a list of the folders held in this directory
  • For each of these folders look inside and find the *.app folder
  • Output the ugly UID folder name alongside the friendly *.app folder name

OK, looks simple enough – let’s see how that looks as a Python script:

The script

The first thing to note about this script is that there are a lot of lines which begin with a hash symbol (#); these are “comments”. Comments are just notes left in the code by the programmer to help someone reading the script understand the code – they are completely ignored when the script is executed. This means that almost half of the code isn’t Python at all; in fact there are only nine lines of actual code here!

So, we know what algorithm is being expressed here; let’s take a quick look at what the code is doing line by line:

import sys
import os
import os.path

These lines are bringing extra functionality into our script. Python comes pre-installed with a number of modules which add functionality to your scripts. These modules include regular expressions, hashing, database handling, JSON, decoding of binary data, file archiving and compression and loads more – far too much to list here. If Python were to prepare all of this cool stuff at the start of every script it would take a long time to get started, so instead we use “import statements” to let Python know which modules we want to use in our script.

So what are we importing? Firstly “sys” contains system-specific functionality, some of it fairly low-level, but we are simply going to use it to get our command line arguments. Next up, “os” contains operating system functionality; in this script we’ll be using it to get a list of a folder’s contents. Finally “os.path” contains functionality for path manipulation; it’s used for joining paths together and checking whether a path leads to a file or a directory.

root_path = sys.argv[1]

“sys.argv” is a list of command line arguments. The number in the square brackets tells us which item in the list we’re interested in. In Python, lists are “zero-indexed” meaning that the first item is numbered “0”; the second is “1” and so on. The first item (index 0) in the “sys.argv” list will be the name of our script, followed by any other arguments we pass to it at the command line. That means that this line gets the first command line argument after the script’s filename and assigns it to a variable “root_path” so that we can use it later in our script.

for app_folder in os.listdir(root_path):

This line is starting a loop. There are two types of loops in Python; here we are using the “for” loop which takes the form:

“for each item in a sequence”

The code inside the loop takes place once for each item in the sequence. In our case, the sequence is provided by

os.listdir(root_path)

which gives a list of the contents of the folder we were provided by the command line argument.

One of my favourite things about Python is that good code layout is actually part of the language syntax. If you look at the listing above you can see how the code after our for loop is started is indented, which means that the indented code is taking place inside the loop. If we wanted code to run after the loop has finished we would simply remove the indent at that line.
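A tiny stand-alone illustration of that point (the variable names here are made up for the example and are not part of the script being discussed):

```python
names = ["alpha", "beta"]
output = []

for name in names:
    # Indented: this line runs once for every item in the list.
    output.append("inside: " + name)

# Back at the same indent as the "for" line: runs once, after the loop ends.
output.append("after the loop")
```

After this runs, `output` holds one “inside” entry per name followed by a single “after the loop” entry.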

    app_folder_path = os.path.join(root_path, app_folder)

Later on in the script we’re going to need the full path of the app’s folder so here we use some of the functionality in “os.path” to join the path we were supplied at the command line to the current app’s folder as served up by our for loop. We then store this complete path in a variable named “app_folder_path”.

    for app_folder_content in os.listdir(app_folder_path):

Here’s another for loop. Again we’re using “os.listdir” but this time we’re getting the contents of our current app’s directory, inside which we’re going to look for the “.app” folder.

        if app_folder_content.endswith(".app"):
            print(app_folder_content + ":\t" + app_folder)

Inside this for loop we check each of the files and folders in the application’s directory looking for one which ends with that magic “.app” extension. If we find one, we print the details out to the screen.
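Pulled together, the lines walked through above can be sketched as a single reusable function. This is a variation on the script rather than a copy of it: the function name `audit_app_folders` is invented here, it returns the results instead of printing them, it sorts the folder names, and it adds an `os.path.isdir` check so that stray files in the applications folder are skipped.

```python
import os
import os.path


def audit_app_folders(root_path):
    """Return (app_name, uid_folder) pairs for each *.app folder found."""
    results = []
    for app_folder in sorted(os.listdir(root_path)):
        app_folder_path = os.path.join(root_path, app_folder)
        if not os.path.isdir(app_folder_path):
            continue  # a stray file, not an application's folder
        for app_folder_content in os.listdir(app_folder_path):
            if app_folder_content.endswith(".app"):
                results.append((app_folder_content, app_folder))
    return results
```

To reproduce the behaviour of the script, you could import `sys`, call `audit_app_folders(sys.argv[1])` and print each pair as `app_name + ":\t" + uid_folder`.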

And that’s it, just nine lines of straight-forward code! So now we can run the script in a command window. Running the script we see the following output:

Script output

This shows us at-a-glance which application is found in each of the folders, a boring task which never has to be completed by hand again and just lets the analyst get on with actually analysing the data.

Obviously there’s scope to automate lots of other tasks, whether it’s parsing raw binary data untouched by other tools, reading information from databases and generating reports, moving files into a folder structure based on their content or any other task which is currently consuming more time than it needs to when performing it by hand. Building up a library of scripts to perform these tasks for you can make you a more efficient, and more importantly, a happier analyst.

If this post has whetted your appetite, you can download the newest version of Python from www.python.org which also gives a number of suggestions for learning resources. You can also download the presentation slides and annotated code examples (which include file reading and writing, parsing cookie files, processing SQLite databases and more) that I presented for F3 last year which relate more directly to digital forensics from here.

I hope this post has encouraged some of you to check Python out. If you have any questions then please leave a comment or you can contact me at acaithness@ccl-forensics.com.

Alex Caithness, Python fan at CCL-Forensics

* Or your preferred synonym.