Archives For Cloud computing

    OK, I decided to revive this blog from hibernation. I guess it is more like a website and not a real blog that you can subscribe to. Anyway, I was recently evaluating one pretty good search engine and hit a performance issue that might be interesting to some people. The issue was with the speed of data indexing for search – after some basic perf tuning, I reached a certain speed (in “documents per second”), but it was still not sufficient for me. So I decided to do more parameter tweaking to see if it could be improved, but nothing helped. It looked like I had hit its upper perf limit.

    This graph shows how different resources were utilized during most of the indexing time (there were some minor variations, but this graph shows the most representative part):

Disk IO vs CPU During Data Indexing

    As you can see, the most used resource was the disk IO (not a surprise). Specifically, the engine was writing data most of the time. This makes sense, since it is creating the search index on the disk :). What’s interesting – even though the search indexer needs to do word breaking and some textual processing, which are processor intensive operations, the CPU was idle most of the time. The hardware that I had was: Intel QuadCore 2.3 GHz, 8GB RAM and two 500GB SATA hard drives with one hard drive dedicated to indexing (the OS ran on the other drive). So my computer had plenty of CPU power, medium size memory and pretty slow commodity hard drives. When I ran the same indexing on a better SCSI drive it worked faster (as expected).

    I started thinking about how to make it run faster on my existing SATA drive and tried different variations of parameters (increased memory caching, changed the number of threads, changed the number of documents per indexing batch, etc.), but they had almost zero effect on the speed of indexing. Then I stumbled on some parameters that controlled compression of index chunks and of some temporary text files used during indexing. The manual for this search engine said clearly that if I turn compression “off”, then indexing should run faster. This made sense, since I remembered that in the “old” days compression was expensive, so I turned it off. To my big surprise, I found that the indexing process became much slower without compression. When I turned it back on, the indexing performance improved again.

    At this point I realized that in this setup, where the biggest bottleneck is the disk IO and where CPU power/memory is in abundance, compression can actually help improve disk intensive operations – data gets compressed/decompressed in memory (indexing chunks in my case were pretty small) and then gets written/read to/from the disk much faster, because there are fewer bytes to move. The time to compress/decompress small chunks of data in memory is negligible compared to the time needed to write/read it to/from the disk, if you have plenty of CPU power and if your data compresses well. My intuition from the older days, when compression was costly and was all about saving hard disk space, was wrong. Many modern systems are bound by the disk IO and not by CPU/memory, so compression can improve the performance of disk intensive applications/servers.
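    To make the trade-off concrete, here is a rough back-of-the-envelope sketch in Python. It is not how the search engine works internally. The cost model, the function name and both throughput figures are my own assumptions for illustration. Only the compression ratio comes from actually running zlib on some repetitive text.

```python
import zlib


def estimated_write_seconds(raw_bytes, disk_mb_per_s, ratio=1.0, compress_mb_per_s=None):
    """Rough model: time to compress (if any) plus time to write the output.

    ratio is compressed_size / raw_size; compress_mb_per_s=None means
    the data is written uncompressed.
    """
    mb = raw_bytes / 1_000_000
    cpu_time = 0.0 if compress_mb_per_s is None else mb / compress_mb_per_s
    io_time = (mb * ratio) / disk_mb_per_s
    return cpu_time + io_time


# Repetitive text stands in for index chunks that compress well.
chunk = b"the quick brown fox jumps over the lazy dog\n" * 20_000
compressed = zlib.compress(chunk, 6)
ratio = len(compressed) / len(chunk)

# Hypothetical throughput figures, chosen only to illustrate the idea.
SLOW_DISK_MB_PER_S = 50
COMPRESS_MB_PER_S = 200

plain = estimated_write_seconds(len(chunk), SLOW_DISK_MB_PER_S)
packed = estimated_write_seconds(len(chunk), SLOW_DISK_MB_PER_S, ratio, COMPRESS_MB_PER_S)
print(f"ratio={ratio:.4f} uncompressed={plain:.4f}s compressed={packed:.4f}s")
```

    With a slow disk the compressed path wins in this model. If you plug in a storage throughput much higher than the compression throughput, the result flips, so the answer depends on the relative speeds of the hardware and on how well the data compresses.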

    Later, I found that other people knew this fact all along. For example, in this excellent book – Introduction to Information Retrieval, I found the following:

“The second more subtle advantage of compression is faster transfer of data from disk to memory … We can reduce input/output (IO) time by loading a much smaller compressed posting list, even when you add on the cost of decompression. So, in most cases, the retrieval system runs faster on compressed postings lists than on uncompressed postings lists.”

And later also:

“Choosing the optimal encoding for an inverted index is an ever-changing game for the system builder, because it is strongly dependent on underlying computer technologies and their relative speeds and sizes. Traditionally, CPUs were slow, and so highly compressed techniques were not optimal. Now CPUs are fast and disk is slow, so reducing disk postings list size dominates. However, if you’re running a search engine with everything in memory, the equation changes again.” 

    So, if you develop data intensive applications, then compression might be your friend if the disk IO is a bottleneck. This may change again with the arrival of solid state disks… 


As I posted earlier, my team at Microsoft (Windows Live Core) shipped the Tech Preview of the new online service – Live Mesh. I worked on the Account service that takes care of account management, and user/device authentication and authorization. There were many questions asked by early adopters about how their data is transmitted and stored in Live Mesh, and how access is controlled. In this post I’ll talk about Live Mesh security and authorization architecture, so that you understand its internals and feel better about trusting your data to the Mesh.

Here is the diagram that illustrates all communications between user devices and Live Mesh cloud services and encryption/security mechanisms used in these communication channels:

Live Mesh Security

Live Mesh security is rooted at the authentication provider (Windows Live ID, aka Microsoft Passport, is our provider today) which is used for initial user and device authentication. Once a user or a device is authenticated and a corresponding authentication token is obtained, the Live Mesh client passes this token to the Live Mesh Account service to access the root of the user’s mesh and to get the initial set of Live Mesh tickets. These tickets are used for further Mesh operations on other resources that this root is pointing to. All communications with the Live Mesh cloud services are done via HTTPS / SSL, so 3rd parties cannot intercept and read client-server communication.

All user (or device) related resources in Live Mesh are organized in a RESTful manner, i.e. they form a graph where each node is identified by a unique URL and represents a given resource. Nodes contain resource metadata and links to other resources. Mesh operations are essentially CRUD operations on the nodes of the user tree or nodes of other user trees if those users shared any data. Live Mesh cloud services check access rights in each operation by inspecting passed tickets and authorizing access only if a correct set of tickets is passed. Tickets can be obtained from the Account service or from responses to previous cloud operations.
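To picture the resource graph, here is a minimal Python sketch. Everything in it is hypothetical: the URLs, the field names and the helper functions are invented for illustration and are not the actual Live Mesh resource model. It only shows the idea of nodes keyed by URL that carry metadata plus links to other nodes.

```python
# Hypothetical resource graph: all URLs and field names are invented.
ROOT = "https://mesh.example.com/users/alice"

resources = {
    ROOT: {"meta": {"kind": "user-root"},
           "links": [ROOT + "/devices", ROOT + "/folders"]},
    ROOT + "/devices": {"meta": {"kind": "device-list"}, "links": []},
    ROOT + "/folders": {"meta": {"kind": "folder-list"},
                        "links": [ROOT + "/folders/photos"]},
    ROOT + "/folders/photos": {"meta": {"kind": "folder"}, "links": []},
}


def read(url):
    """Read one node; returns None if the resource does not exist."""
    return resources.get(url)


def delete(url):
    """Delete one node; links that pointed at it simply become dangling."""
    resources.pop(url, None)


def reachable(root):
    """Follow links depth-first from the root and list every node found."""
    seen, stack = [], [root]
    while stack:
        url = stack.pop()
        if url in seen or url not in resources:
            continue
        seen.append(url)
        # Reverse so that links are visited in the order they are listed.
        stack.extend(reversed(resources[url]["links"]))
    return seen


print(reachable(ROOT))
```

A client that only knows the root URL can discover everything else by following links, which is why access to the root is the first thing the Account service hands out.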

Live Mesh authorization tickets are standard SAML tickets. They are digitally signed with the Live Mesh private key to prevent spoofing and they expire after a limited lifetime. Some tickets are used to just authenticate users or devices, other tickets contain authorization information about user/device rights. Cloud services inspect each resource request and authorize access only if it contains valid tickets (correctly signed and not expired) and these tickets specify that the requestor indeed has access to the requested resource. For example, a device X can initiate P2P data synchronization with device Y only if it presents a ticket that is correctly signed by Live Mesh and contains a record saying that both device X and Y are claimed by the same user OR if it contains a record saying that X and Y have the same Live Mesh Folder mapped on them (in the case that the devices are claimed by different users that are members of this Live Mesh Folder). Tickets are passed to the cloud services in the Authorization header using HTTPS to prevent replay attacks.
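The P2P rule above can be sketched in a few lines of Python. This is only an illustration of the decision logic, not the service's code: the Ticket class and its field names are hypothetical, real tickets are SAML documents, and the signature check is reduced to a boolean that some real verification step (not shown) would have produced.

```python
from dataclasses import dataclass, field


@dataclass
class Ticket:
    signature_ok: bool   # outcome of verifying the service signature (not shown here)
    expires_at: float    # expiry as a Unix timestamp
    device_owner: dict = field(default_factory=dict)    # device id -> user id
    device_folders: dict = field(default_factory=dict)  # device id -> set of folder ids


def may_start_p2p_sync(ticket, x, y, now):
    """Allow X to sync with Y only for a valid ticket that links the two devices."""
    if not ticket.signature_ok or now >= ticket.expires_at:
        return False
    owner_x = ticket.device_owner.get(x)
    same_owner = owner_x is not None and owner_x == ticket.device_owner.get(y)
    shared = ticket.device_folders.get(x, set()) & ticket.device_folders.get(y, set())
    return same_owner or bool(shared)


ticket = Ticket(
    signature_ok=True,
    expires_at=1000.0,
    device_owner={"laptop": "alice", "phone": "alice", "tablet": "bob"},
    device_folders={"laptop": {"photos"}, "phone": set(), "tablet": {"photos"}},
)

print(may_start_p2p_sync(ticket, "laptop", "phone", now=500.0))   # same user
print(may_start_p2p_sync(ticket, "laptop", "tablet", now=500.0))  # shared folder
print(may_start_p2p_sync(ticket, "phone", "tablet", now=500.0))   # no relation
```

Note that validity is checked first: an expired or badly signed ticket is rejected no matter what it claims.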

Each device in Live Mesh (computers, PDAs, mobile phones) has a unique private key that is generated during Live Mesh installation and used to authenticate the device in P2P communications with other devices. When a P2P communication is being established between two devices, they first use asymmetric encryption (RSA algorithm) to exchange encryption keys and then use symmetric encryption (AES with 128 bit key) to transfer data/files over TCP/IP. The RSA exchange guards against leaking symmetric encryption keys. AES encryption protects actual data from prying eyes. Live Mesh also uses a keyed message authentication code (HMAC) to verify the integrity of the data exchanged on a P2P channel.
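Here is a small Python sketch of the integrity check only, using the standard library. The RSA key exchange and AES encryption are left out because the standard library does not provide them. The function names, the packet layout (message followed by its tag) and the choice of SHA-256 are my assumptions for illustration, not a description of the actual Live Mesh wire format.

```python
import hashlib
import hmac
import os

TAG_LEN = 32  # length of an HMAC-SHA256 tag in bytes


def seal(key, message):
    """Append a keyed MAC so the receiver can detect tampering."""
    tag = hmac.new(key, message, hashlib.sha256).digest()
    return message + tag


def open_sealed(key, packet):
    """Recompute the MAC and return the message only if the tags match."""
    message, tag = packet[:-TAG_LEN], packet[-TAG_LEN:]
    expected = hmac.new(key, message, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("integrity check failed")
    return message


# In a real channel this key would come out of the key exchange,
# and the message would be ciphertext rather than plain bytes.
key = os.urandom(32)
packet = seal(key, b"file chunk 1")
print(open_sealed(key, packet))
```

If even one bit of the packet changes in transit, the recomputed tag no longer matches and the receiver rejects the data instead of writing a corrupted file.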

If there is no direct connection between two devices (for example, if one device is behind a firewall), then the cloud communication relay located in the Microsoft data center is used to forward data packets from one device to another. All the traffic is encrypted in the same way as with a direct P2P link, i.e. first keys are exchanged with RSA and then traffic is encrypted with AES. The cloud relay cannot decrypt/read user data, since encryption keys are exchanged with the use of asymmetric encryption (RSA).

Live Mesh cloud services help devices find each other and establish communications. They cannot read synchronized user data/files relayed through the cloud, except for the case when user files are synchronized with the cloud storage (i.e. Live Desktop). At the moment, the limited Tech Preview of Live Mesh synchronizes your files not only between your devices, but also with your cloud storage (which you can access via Live Desktop) until you reach your storage quota (5GB as of today). So your files and the metadata that describes them are stored in the Microsoft datacenter. They are protected by strong access control mechanisms, but the data is not stored in encrypted form. After the storage quota has been reached, all files are synchronized only P2P and not stored in the cloud (only metadata is stored in the datacenter). In the future, Live Mesh will allow users to choose which files or Live Mesh Folders they want to synchronize with the cloud. If you choose to synchronize your data/files between your devices only, Live Mesh will not store your files in the cloud and will only store the metadata that lets the service operate.

UPDATE: Some of the Live Mesh services were later integrated into Microsoft’s SkyDrive, but it’s been a while since I worked there, so I am not sure which parts of this article are still valid for SkyDrive. Read it as a historical post if you wish 🙂


On April 22nd, my team at Microsoft (Windows Live Core) released a limited public beta of a new online service – Live Mesh. Since then, I have read several online reviews about it and got some feedback from people I invited into the system. I found one interesting common trait – not all users/reviewers actually use all of the present capabilities of the system. That’s why I decided to review its main user visible features and how I use it, so it may be helpful to people interested in Live Mesh. A more detailed discussion can be found here.

Live Mesh is a service that allows accessing your data and devices from anywhere. It consists of a set of cloud and client side services. Cloud services manage user accounts, help your devices to find each other and connect to each other, provide cloud storage for your files (5GB at the moment). Client side services synchronize data between your devices, between your devices and the cloud storage, and provide remote access to your devices. Also, Live Mesh is a platform and when its SDK is released it will help developers in creating distributed applications that use synchronized user data. Applications will just read/write from/to local data objects that will be automatically synchronized by the service in the background.

From the user experience perspective, Live Mesh is exposed at the moment as 3 main things: file synchronization between user devices (computers, PDAs, mobile phones), remote access to computers (aka “Live Remote Desktop“) and the web desktop (aka “Live Desktop“) where people can see their files stored in “the cloud”.

Here is a screenshot of my “Live Desktop” that I can access from any web browser. You can see my Live Mesh folders with files on the left. 2 folders are opened. One of the folders has a train picture rendered by the Silverlight viewer.

Live Desktop

Another opened folder has several subfolders and files. The companion window on its right shows news about activities in this folder, e.g. when stuff was added and messages from members of this folder (users that have access to it). On the right, you can see a pane window with my devices that are in my “Mesh”. I can remotely access these computers right from the web browser by clicking on “connect to device” links.

All the people that I invited started with their Live Desktops (it is the 1st feature users see when they sign up), but not all moved beyond it to the other 2 very interesting parts of Live Mesh – actual file sync and Live Remote. They are installed as part of the Live Mesh client (a set of client side services), which is available via the “Devices” button on your Live Desktop. This button leads to the famous device ring that for some reason all online reviews were focused on. If you click on the “Add device” icon on the ring, you can get to the client setup.

After you install the client portion of Live Mesh and claim your computer on first sign in (claiming adds it to your Mesh of devices), the service starts synchronizing your files between the cloud, your other computers and this newly added computer. Here is the screenshot of my desktop with several Live Mesh Folders opened:

Desktop with Live Mesh Folders

At the top part of my desktop, you can see the same folders as in my Live Desktop. 4 of them in solid colors are “mapped” to this computer and are synchronized by Live Mesh. 2 of them are shown as ghosted and even though they are available in the cloud they are not synchronized to this device. I can start synchronizing them by just double clicking on them. The window on the upper left shows all Live Mesh folders that I created or am a member of. The middle window is the opened Live Mesh folder with its members on its right (users that have access to it). I can invite more people to this folder by clicking on the “members” tab and entering their LiveID. In this case, they will get an email with a link that points to Live Desktop where they can accept my invitation and provision their user account if they don’t have one. Once they join this folder and install the Live Mesh client, they can map this folder to their desktops and get all its files and file changes. I share photos with my family this way – once I take new photos I just drop them into one of the synchronized folders and my family members can see them once they are synchronized to their computers. It is much more convenient than publishing photos one by one on online sites, the viewing experience is better, since photos can be nicely rendered by the desktop software in original resolutions, and I can argue that it is more secure and private. I think that the Live Mesh client UX clearly demonstrates the advantage of Rich Internet Applications over web applications (even if they are Web 2.0 apps like Live Desktop).

The pane window on the right is the Live Mesh client UI and shows all my devices where I installed the client. I can see their status (online/offline) or remotely access them. You can see that Live Desktop is just another device in my “Mesh”. If I click on one of my computers (“connect to device” link), Live Remote Desktop starts and I can actually work on this computer remotely. Here is the screenshot of Live Remote with my home computer in it:

Live Remote in Live Mesh

You can see the same Live Mesh Folders mapped on this remote computer (on the left), the train picture opened in the photo viewer, the opened Live Mesh folder and the Live Mesh client UI with my devices. I can work remotely on this computer as if I were there in front of it. If I drop a file into one of the Live Mesh folders, it will be automatically synchronized and appear on my other computers/devices where this folder is mapped and in my Live Desktop. My home computer works behind NAT, but I can still access it. Live Mesh makes it possible to synchronize and remotely access computers even if they are behind firewalls, NATs, etc.

If I need to get my photos from my home computer while I am at work I can remote to it, drop them into one of the Live Mesh folders and after they are synchronized browse them on my work computer. Or when I go home and need to continue working on something, I can just drop relevant files into a Live Mesh folder and when I arrive home use them. No more flash drives or sending emails to myself 🙂 In addition to usual file synchronization, you can drag and drop files between your desktop and a Live Remote window or between 2 Live Remote windows. In this case, files will be just copied to a remote computer or to your current computer. Recently, I was quite surprised that I was able to remote to my home computer while I was riding a bus with a WiFi connection. File synchronization also worked fine (it was a bit slow though). So you can really get to your data and devices from anywhere with this service.