The WikiLeaks embassy cables revelations cover a huge dataset of official documents: 251,287 dispatches, from more than 250 US embassies and consulates worldwide. It's a unique picture of US diplomatic language - including over 50,000 documents covering the current Obama administration. But what does the data include?
The cables themselves come via the huge Secret Internet Protocol Router Network, or SIPRNet. SIPRNet is the worldwide US military internet system, kept separate from the ordinary civilian internet and run by the Department of Defense in Washington. Since the attacks of September 2001, there has been a move in the US to link up archives of government information, in the hope that key intelligence no longer gets trapped in information silos or "stovepipes". An increasing number of US embassies have become linked to SIPRNet over the past decade, so that military and diplomatic information can be shared. By 2002, 125 embassies were on SIPRNet; by 2005, the number had risen to 180, and by now the vast majority of US missions worldwide are linked to the system - which is why the bulk of these cables are from 2008 and 2009.
An embassy dispatch marked SIPDIS is automatically downloaded on to its embassy's classified website. From there, it can be accessed not only by anyone in the state department, but also by anyone in the US military who has a security clearance up to the 'Secret' level, a password, and a computer connected to SIPRNet - which astonishingly covers over 3m people. There are several layers of data in here - ranging up to the "SECRET NOFORN" level, which means that they are designed never to be shown to non-US citizens. Instead, they are supposed to be read by officials in Washington up to the level of current Secretary of State Hillary Clinton. The cables are normally drafted by the local ambassador or subordinates. The "Top Secret" and above foreign intelligence documents cannot be accessed from SIPRNet.
We've broken down the data for you - and you can download the basic details of every cable (without the actual content) below. Each cable is essentially very structured data. This is what's included:
• Source - the embassy or body which sent it
• Recipients - normally cables were sent to a number of other embassies and bodies
• Subject - basically a summary of the cable
• Tags - each cable was tagged with a number of keyword abbreviations. We've put together a downloadable Google glossary spreadsheet of most of the important ones here
• Body text - the cable itself. We have opted not to publish these in full for obvious security reasons
• 251,287 dispatches
• The state department sent the most cables in this set, followed by Ankara in Turkey, then Baghdad and Tokyo
• 97,070 of the documents were classified as 'Confidential'
• 28,760 of them were given the tag 'PTER' which stands for prevention of terrorism
• The earliest of the cables is from 1966 - and the largest number, 56,813, are from 2009
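Because each cable is structured data, tallies like the ones above are simple to reproduce. What follows is only a rough sketch in Python: the column names, the semicolon separator and the three sample rows are all invented for illustration (they are not the headers or contents of the published file), so adjust them to match whatever the CSV you download actually contains.

```python
import csv
import io
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List

# Invented sample rows, NOT real cables. The column names and the ';'
# separator are assumptions made for this sketch only.
SAMPLE_CSV = """source,recipients,subject,tags,classification,date
Embassy A,Embassy B;Embassy C,Example subject one,PTER;EXAMPLE,CONFIDENTIAL,2009-03-01
Embassy B,Embassy A,Example subject two,EXAMPLE,SECRET,2008-11-15
Embassy A,Embassy C,Example subject three,PTER,CONFIDENTIAL,2009-07-20
"""


@dataclass
class Cable:
    """One cable's metadata, mirroring the fields listed in the article."""
    source: str
    recipients: List[str]
    subject: str
    tags: List[str]
    classification: str
    year: int


def parse_cables(text: str) -> List[Cable]:
    """Turn CSV text into Cable records (body text is not included)."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        Cable(
            source=row["source"],
            recipients=row["recipients"].split(";"),
            subject=row["subject"],
            tags=row["tags"].split(";"),
            classification=row["classification"],
            year=int(row["date"][:4]),  # assumes dates start with YYYY
        )
        for row in reader
    ]


def tally(cables: List[Cable]) -> Dict[str, Counter]:
    """Count cables by source, classification, tag and year."""
    return {
        "by_source": Counter(c.source for c in cables),
        "by_classification": Counter(c.classification for c in cables),
        "by_tag": Counter(t for c in cables for t in c.tags),
        "by_year": Counter(c.year for c in cables),
    }


if __name__ == "__main__":
    counts = tally(parse_cables(SAMPLE_CSV))
    for name, counter in counts.items():
        print(name, counter.most_common(3))
```

To try this on the real download, you would read the unzipped file with `open(...)` instead of `io.StringIO` and swap in its actual column names.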
What can you do with the data?
• DATA: every cable with date, time and tags, EXCLUDING BODY TEXT (via Google fusion tables, subject to heavy traffic)
• DATA: every cable with date, time and tags, EXCLUDING BODY TEXT (Zipped CSV file, 3.1MB)
• DATA: our analysis of the cables by location and tag
• DATA: glossary of keywords and tags