English Pages 156 [144] Year 2006
CryptoGraphics Exploiting Graphics Cards for Security
Advances in Information Security

Sushil Jajodia
Consulting Editor
Center for Secure Information Systems
George Mason University
Fairfax, VA 22030-4444
email: jajodia@gmu.edu

The goals of the Springer International Series on ADVANCES IN INFORMATION SECURITY are, one, to establish the state of the art of, and set the course for future research in information security and, two, to serve as a central reference source for advanced and timely topics in information security research and development. The scope of this series includes all aspects of computer and network security and related areas such as fault tolerance and software assurance. ADVANCES IN INFORMATION SECURITY aims to publish thorough and cohesive overviews of specific topics in information security, as well as works that are larger in scope or that contain more detailed background information than can be accommodated in shorter survey articles. The series also serves as a forum for topics that may not have reached a level of maturity to warrant a comprehensive textbook treatment. Researchers, as well as developers, are encouraged to contact Professor Sushil Jajodia with ideas for books under this series.
Additional titles in the series:

UNDERSTANDING INTRUSION DETECTION THROUGH VISUALIZATION by Stefan Axelsson; ISBN-10: 0-387-27634-3
HOP INTEGRITY IN THE INTERNET by Chin-Tser Huang and Mohamed G. Gouda; ISBN-10: 0-387-22426-3
PRIVACY PRESERVING DATA MINING by Jaideep Vaidya, Chris Clifton and Michael Zhu; ISBN-10: 0-387-25886-8
BIOMETRIC USER AUTHENTICATION FOR IT SECURITY: From Fundamentals to Handwriting by Claus Vielhauer; ISBN-10: 0-387-26194-X
IMPACTS AND RISK ASSESSMENT OF TECHNOLOGY FOR INTERNET SECURITY: Enabled Information Small-Medium Enterprises (TEISMES) by Charles A. Shoniregun; ISBN-10: 0-387-24343-7
SECURITY IN E-LEARNING by Edgar R. Weippl; ISBN: 0-387-24341-0
IMAGE AND VIDEO ENCRYPTION: From Digital Rights Management to Secured Personal Communication by Andreas Uhl and Andreas Pommer; ISBN: 0-387-23402-0
INTRUSION DETECTION AND CORRELATION: Challenges and Solutions by Christopher Kruegel, Fredrik Valeur and Giovanni Vigna; ISBN: 0-387-23398-9
THE AUSTIN PROTOCOL COMPILER by Tommy M. McGuire and Mohamed G. Gouda; ISBN: 0-387-23227-3

Additional information about this series can be obtained from http://www.springeronline.com
CryptoGraphics Exploiting Graphics Cards for Security
by
Debra L. Cook
Angelos D. Keromytis
Columbia University
New York, USA
Springer
Debra L. Cook Department of Computer Science 450 Computer Science Building Columbia University 1214 Amsterdam Avenue, M.C. 0401 New York, NY 10027-7003
Angelos D. Keromytis Department of Computer Science 450 Computer Science Building Columbia University 1214 Amsterdam Avenue, M.C. 0401 New York, NY 10027-7003
Library of Congress Control Number: 2006925092
CRYPTOGRAPHICS: Exploiting Graphics Cards for Security by Debra L. Cook and Angelos D. Keromytis
ISBN-13: 978-0-387-29015-7
ISBN-10: 0-387-29015-X
e-ISBN-13: 978-0-387-34189-7
e-ISBN-10: 0-387-34189-7
Printed on acid-free paper.
© 2006 Springer Science+Business Media, LLC All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights. Printed in the United States of America. 9 8 7 6 5 4 3 2 1 springer.com
Contents

List of Figures
List of Tables
Preface
Acknowledgments

1. INTRODUCTION
1.1 Overview
1.2 GPUs
1.3 Motivation
1.4 Encryption in GPUs
1.5 Remotely Keyed CryptoGraphics
1.6 Related Issues
1.7 Extensions
1.8 Conclusions

2. GRAPHICAL PROCESSING UNITS
2.1 Overview
2.2 GPU Architecture
2.3 GPUs and General Purpose Programming
2.4 APIs
2.5 OpenGL and Pixel Processing
2.6 Representing Data with Vertices
2.7 Non-Graphic Uses of GPUs

3. MOTIVATION
3.1 Overview
3.2 Accelerating Cryptographic Processing
3.2.1 Issue
3.2.2 Previous Approaches
3.2.3 Summary of the GPU-Based Approach
3.3 Malware and Spyware
3.3.1 Issue
3.3.2 Motivating Applications
3.3.3 Other Related Work
3.3.4 Summary of the GPU-Based Approach
3.4 Side Channel and Differential Fault Analysis

4. ENCRYPTION IN GPUS
4.1 Overview
4.2 Feasibility of Asymmetric Key Ciphers
4.3 Feasibility of Symmetric Key Ciphers
4.4 Modes of Encryption
4.5 Example: AES
4.5.1 AES Background
4.5.2 AES in OpenGL
4.5.3 AES Experiments
4.5.4 Use of Parallel Processing in Attacks
4.6 GPUs and Stream Ciphers
4.6.1 Overview
4.6.2 Experiments
4.7 Conclusions

5. REMOTELY KEYED CRYPTOGRAPHICS
5.1 Overview
5.2 Keying of GPUs
5.3 Prototype
5.3.1 Purpose
5.3.2 Architecture
5.3.3 Implementation
5.4 Design Decisions
5.4.1 Remote Keying
5.4.2 Decryption of Data in the GPU
5.5 Experiments
5.6 Conclusions

6. RELATED ISSUES
6.1 Overview
6.2 Protecting User Input
6.3 Keying the GPU
6.4 Attacks
6.5 Trusted Platform Module
6.6 Data Compression

7. EXTENSIONS
7.1 Overview
7.2 Graphics-based Cipher
7.3 Encryption within DSPs

8. CONCLUSIONS
8.1 Summary
8.2 Suggested Projects

Appendices
A AES OpenGL Code for Encryption
A.1 Overview
A.2 Version Using the Red Pixel Component and the Back Buffer
A.3 Version Using the RGB Pixel Components and the Front Buffer

References
Index
List of Figures

2.1 High Level View of GPU Hardware
2.2 GPU's Main Processing Steps
2.3 OpenGL Version 2.0 General Pipeline
2.4 OpenGL Pipeline for Pixel Processing
3.1 Various Attack Points for Phishing
4.1 ECB Encryption Mode
4.2 CBC Encryption Mode
4.3 CTR Encryption Mode
4.4 OFB Encryption Mode
4.5 CFB Encryption Mode
4.6 Layout of Data in Pixel Coordinates used in the OpenGL Version of AES
4.7 Encryption of 300 Identical Blocks in RGB Components
5.1 Malware on Untrusted Client with OS-based Decryption
5.2 Malware on Untrusted Client with GPU-based Decryption
5.3 Architecture for Remotely Keyed Decryption in the GPU
5.4 Remotely Keyed Decryption in GPU Protocol
5.5 Encrypted Image Received by GPU
5.6 Decrypted Image Displayed in GPU
5.7 Decryption Rates: All Entities on a Single System
5.8 Decryption Rates: Dedicated LAN and Client 1
5.9 Decryption Rates: Shared LAN and Client 2
6.1 Graphical Keypad for Digits
6.2 Graphical Keypad for Hex Values
List of Tables

4.1 AES S-Box for Encryption
4.2 AES S-Box for Decryption
4.3 Encryption Rates for AES
4.4 XOR Rate Using System Resources (CPU)
4.5 XOR Rate Using GPUs - RGB Pixel Components
4.6 XOR Rate Using GPUs - RGBA Pixel Components
Preface
CryptoGraphics: Exploiting Graphics Cards for Security explores the potential for implementing ciphers within graphics processing units (GPUs), and describes the relevance of GPU-based encryption and decryption to the security of applications involving remote displays.

As the processing power of GPUs increases, researchers have started to study the use of GPUs for general purpose computing. While GPUs do not support the range of operations found in CPUs, their processing power has grown to exceed that of CPUs and their designs are evolving to increase their programmability. GPUs are especially attractive for applications requiring a large quantity of parallel processing. This work extends such research by considering the use of GPUs as a parallel processor for encrypting (and decrypting) data.

The authors examine the operations found in symmetric and asymmetric key ciphers to determine if encryption can be programmed in existing GPUs. While certain operations make it impossible to implement some ciphers in a GPU, the operations used in most block ciphers, including the Advanced Encryption Standard (AES), can be performed in GPUs. A detailed description and code for a GPU-based implementation of AES is provided.

The feasibility of GPU-based encryption allows the authors to explore the use of a GPU as a trusted system component, motivated by the use of thin-client and remote conferencing applications on untrusted or untrustworthy systems. By enabling encryption and decryption in GPUs, unencrypted display data can be confined to the GPU to avoid exposing it to any malware running on the operating system. The authors describe a prototype implementation of GPU-based decryption for protecting displays exported to untrusted clients. Issues and solutions related to fully securing data on untrusted clients, including the protection of user input, are also discussed.
Additional capabilities are constantly being added to GPUs: when the first experiments described in this book were performed, programmable pixel processors were a new feature. Improved programmability of GPUs will likely
remove some of the limitations encountered when implementing ciphers to run in GPUs within the next couple of years, while other limitations are not likely to be addressed as long as GPUs are not designed or marketed for general purpose processing. While the capabilities of GPUs are growing, the concepts and proposed architectures described within this book are independent of the changes in GPUs and will only become easier to implement as the general programmability of GPUs evolves.
Acknowledgments
The authors jointly wish to thank John Ioannidis for suggesting the idea of performing encryption in a GPU, which led to this work, and Ricardo Baratto for providing information on thin clients. Eran Tromer pointed out that moving encryption into GPUs can be a preventive measure against some existing side channel attacks on block ciphers. Angelos Keromytis also wishes to thank his wife Elizabeth for her patience and understanding, as well as her careful reading of drafts of this manuscript.
Chapter 1 INTRODUCTION
1.1 Overview

The focus of this book is the use of graphics processing units (GPUs) for cryptographic operations, hence the term CryptoGraphics. The computing power of GPUs has increased substantially over the past several years to the point that GPUs are more efficient than CPUs for certain tasks. As a result, even though GPUs are not intended to be general purpose processors, researchers have begun to study the use of GPUs for non-graphics applications. In most cases, the goal is to increase the rate at which computations can be performed by an application by using the GPU for specific types of calculations. Applications that are well suited to run in a GPU use data representations and types that are compatible with the GPU's abstraction of pixels. Compatible computations involve operations that take a single pixel's value, apply a simple function to it and output the result as a new pixel value. Parallel processing on multiple data sets can be performed by using multiple sets of pixels to represent the data sets and applying the algorithm simultaneously to each set, and/or by treating each color component of a pixel as a separate set of data and applying the algorithm in parallel to each color component.

The potential for increased processing power was the original reason for investigating the use of GPUs for cryptographic operations. As the work evolved, other benefits emerged, such as avoiding the exposure of unencrypted data to an untrusted operating system where spyware can access it, and designing ciphers based on operations commonly found in graphics processing. Another, less obvious, benefit is that executing cryptographic operations entirely in a GPU provides a preventive measure against some existing side channel attacks and differential fault analysis on ciphers.
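The pixel abstraction described above can be modeled on the CPU in a few lines. The following Python sketch is only an illustration under stated assumptions: it carries three independent data sets in the R, G and B components of a row of pixels and applies one simple function to every component in a single pass. The names pack_streams, apply_per_pixel and unpack_streams are hypothetical and do not belong to any graphics API, and the loop merely stands in for what a GPU would do in parallel.

```python
from typing import Callable, List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), one byte per component


def pack_streams(r: bytes, g: bytes, b: bytes) -> List[Pixel]:
    """Store three equal-length data sets in the R, G and B components."""
    if not len(r) == len(g) == len(b):
        raise ValueError("all three data sets must have the same length")
    return list(zip(r, g, b))


def apply_per_pixel(pixels: Sequence[Pixel], f: Callable[[int], int]) -> List[Pixel]:
    """Apply one simple function to every component of every pixel.

    A GPU would process the pixels (and components) in parallel; this
    loop only models the result, not the hardware.
    """
    return [(f(r) & 0xFF, f(g) & 0xFF, f(b) & 0xFF) for r, g, b in pixels]


def unpack_streams(pixels: Sequence[Pixel]) -> Tuple[bytes, bytes, bytes]:
    """Read the three data sets back out of the pixel components."""
    return (bytes(p[0] for p in pixels),
            bytes(p[1] for p in pixels),
            bytes(p[2] for p in pixels))


# Hypothetical usage: XOR every component with a constant, then undo it.
pixels = pack_streams(b"abc", b"def", b"ghi")
masked = apply_per_pixel(pixels, lambda c: c ^ 0x5A)
restored = apply_per_pixel(masked, lambda c: c ^ 0x5A)
assert unpack_streams(restored) == (b"abc", b"def", b"ghi")
```

The point of the sketch is that one pass over the pixels transforms all three data sets at once, which is the source of the parallelism discussed above.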
The work described within this book explores the possibility of implementing asymmetric key and symmetric key ciphers within GPUs, and describes the relevance of GPU-based encryption and decryption to applications involving remote displays, such as video conferencing and thin-client applications. An implementation of AES in OpenGL serves as an example of the feasibility of encrypting within a GPU. It also reflects the obstacles encountered due to limitations of GPUs and their APIs. A prototype application involving streaming video and GPU-based decryption is described to illustrate the benefits and issues of running a cipher within a GPU. Suggestions for GPU enhancements and a proposal for a GPU friendly cipher are included. In addition, methods for securing other data inputs relevant to the applications, such as keyboard input and audio, are briefly described. The relationship of this work to that of the Trusted Computing Group (TCG) is also discussed. GPU vendors are constantly increasing the capabilities of GPUs. When the first experiments described in this book were performed, programmable pixel (fragment) processors were just being added to GPUs. During the time this book was being written, the increase in supported pixel size has resulted in an increase in the amount of data that can be encrypted simultaneously, but no new capabilities became available to address the obstacles encountered when attempting to perform certain cryptographic operations within a GPU. In the next couple of years the growing programmability of GPUs and the introduction of an API that improves access to GPUs' capabilities will likely eliminate some of the obstacles encountered, but other limitations are not likely to be addressed as long as GPUs are not designed or marketed for general purpose processing. 
Chapter 2 provides background information on GPUs and their APIs, which will assist the reader in understanding the capabilities and limitations of using a GPU as a general purpose processor. The background information also clarifies why certain implementation decisions were made in the experiments described in Chapters 4 and 5. The motivation for the work is described in Chapter 3. The protection that GPU-based encryption and decryption provide against side channel attacks and differential fault analysis is also discussed. Chapter 4 discusses the implementation of encryption within a GPU, including an implementation of AES in OpenGL. The code for the OpenGL version of AES's encryption function is provided in Appendix A. Chapter 5 describes a prototype for encrypting displays sent to untrusted remote clients. Chapter 6 describes issues related to fully implementing a secure system based on the prototype described in Chapter 5. This includes protecting the user's inputs on the untrusted client, an option for conveying a secret key to the GPU, notes on compression of images, and the relevance of certain types of attacks to the prototype. An overview of the TCG's trusted platform module (TPM) and how GPU-based encryption can utilize the TPM is provided. Chapter 7 discusses related ideas and future work, including the encryption of audio in digital signal
processors (DSPs) and designing a stream cipher to run in a GPU. Chapter 8 summarizes the work and the insights gained from the experiments. The following is an overview of each chapter.
1.2 GPUs

This chapter provides background information on GPUs and their APIs, which will assist the reader in understanding both the motivation for the work and the implementation decisions made in the experiments described in later chapters. An overview of GPUs is provided and existing APIs to GPUs are discussed. Of the APIs, the lowest level that is publicly available and is independent of the operating system is OpenGL [58]. Direct3D [51] is at the same layer as OpenGL but is Microsoft-specific.¹ The experiments and implementations described within this book use only the OpenGL API. Other, more user-friendly, APIs exist that provide a user interface layer above OpenGL and Direct3D. However, these provide the programmer less control over which operations are executed in the GPU vs. in the CPU, and over the exact commands issued to the GPU.

Processing in GPUs is split between operating on vertices (vertex processor) and on pixels (pixel processor). The cryptographic operations under consideration require that data be stored in and processed as pixels as opposed to vertices. This is also the case for other types of applications that have experimented with using a GPU as a general purpose processor. An explanation for why vertices cannot be used to store and process data is provided. In order to provide an understanding of what operations are provided by a GPU for cryptographic algorithms, some details on OpenGL and pixel processing are included. Finally, a few non-graphic applications utilizing GPUs in areas other than cryptography and security are mentioned to illustrate the growing use of GPUs as general purpose processors.
1.3 Motivation

This chapter describes the motivation for experimenting with the use of GPUs for performing cryptographic operations. The main reasons are accelerating the execution of cryptographic operations using commodity hardware, and protecting data from spyware and certain types of phishing attacks. The use of GPUs also eliminates the possibility of existing side channel attacks and existing differential fault analysis.

Cryptographic operations serve a critical role in protecting data and in ensuring the authenticity and integrity of data. The need to perform such operations without consuming shared system resources in certain environments has led to the development of specialized cryptographic hardware. However, such hardware is not a common component of most systems. In contrast, GPUs are
a widely available and fairly inexpensive resource. GPUs offer a high level of parallel processing and have processing speeds that exceed those of CPUs (although GPUs do not offer the general processing capabilities provided by CPUs). Therefore, GPUs may serve as a viable alternative to dedicated hardware for performing cryptographic operations.

Aside from leveraging the GPU's processing power, GPU-based encryption assists in protecting displays sent to untrusted systems. Software that covertly monitors user actions, also known as spyware, has become a first-level security threat due to its ubiquity and the difficulty of detecting and removing it. Such software may be inadvertently installed by a user who is casually browsing the web, or may be purposely installed by an attacker or even the owner of a system. This is particularly problematic in the case of utility computing, early manifestations of which are Internet cafes and thin-client computing. Chapter 5 examines the problem of protecting a user accessing specific services in such an environment. As our two example applications, the focus is on secure video broadcasts and remote desktop access when using any convenient (possibly untrusted) terminal. For such applications, confining the trusted computing base to a suitably modified GPU prevents spyware running on the operating system from accessing the displayed data. This involves moving image decryption into GPUs.

A final benefit of GPU-based decryption is that it prevents some existing side channel attacks and existing differential fault analysis. A summary of such attacks is provided. Moving encryption into the GPU prevents attacks that require access to memory used by the encryption algorithm, attacks that measure CPU usage, or attacks that require the ability to introduce specific faults into the software or hardware.
Existing results concerning attacks that measure acoustics or power utilization, or attacks that inject flaws into the hardware, do not carry over directly to GPUs. Conceptually, however, some of these types of attacks can be applied to GPUs. Experimentation is needed to determine what types of measurements can be obtained from the GPU and whether they provide useful information to an adversary.
1.4 Encryption in GPUs

This chapter discusses the feasibility of performing encryption within GPUs based on the types of operations and data structures supported. Both asymmetric and symmetric key ciphers are considered. A summary of common public key methods is provided along with an explanation as to why they are not suitable for implementation in existing GPUs. In contrast, common operations found in symmetric key ciphers are implementable within GPUs, with a few exceptions. To illustrate the potential for providing encryption within a GPU using a symmetric key cipher, an OpenGL implementation of AES is described in detail. The common modes of encryption for block ciphers and how they
can be performed in a GPU while allowing for parallel processing of data are described. Finally, a partial implementation of a stream cipher within a GPU is discussed. This component of the work provides a basis for applying stream ciphers in GPUs.
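As a rough illustration of why the choice of mode matters for parallel processing, the following Python sketch shows counter (CTR) mode, in which the keystream for each block depends only on the key, a nonce and the block index, so blocks can be processed independently and in any order. This is a sketch under assumptions, not the book's implementation: HMAC-SHA256 is used purely as a stand-in for the block cipher so that the example needs only the standard library, and the names keystream_block and ctr_crypt are hypothetical.

```python
import hashlib
import hmac
from typing import Iterable, Optional

BLOCK = 16  # bytes per block (the AES block size)


def keystream_block(key: bytes, nonce: bytes, index: int) -> bytes:
    """Stand-in for E_K(nonce || index). This is HMAC-SHA256, NOT AES."""
    msg = nonce + index.to_bytes(8, "big")
    return hmac.new(key, msg, hashlib.sha256).digest()[:BLOCK]


def ctr_crypt(key: bytes, nonce: bytes, data: bytes,
              order: Optional[Iterable[int]] = None) -> bytes:
    """Encrypt or decrypt `data` in CTR mode, visiting blocks in `order`."""
    blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
    out = [b""] * len(blocks)
    for i in (order if order is not None else range(len(blocks))):
        ks = keystream_block(key, nonce, i)
        # zip() truncates the keystream for a short final block.
        out[i] = bytes(a ^ b for a, b in zip(blocks[i], ks))
    return b"".join(out)


key, nonce = b"k" * 16, b"n" * 8
message = b"sixteen byte blk" * 3 + b"tail"      # 4 blocks, last one short
ciphertext = ctr_crypt(key, nonce, message)
# Visiting the blocks in a different order gives the same ciphertext.
assert ctr_crypt(key, nonce, message, order=[3, 1, 0, 2]) == ciphertext
# In CTR mode the same operation decrypts.
assert ctr_crypt(key, nonce, ciphertext) == message
```

Because no block depends on the output of another, each block could in principle be assigned to a different pixel or set of pixels. Chained modes such as CBC encryption do not have this property, since each ciphertext block feeds into the next.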
1.5 Remotely Keyed CryptoGraphics

The applicability of GPU-based decryption to video broadcasts, such as found in desktop video conferencing applications, and remote desktop access, such as with thin-clients, is investigated. An architecture for providing GPU-based decryption for these applications is defined and a prototype for use with streaming video is described. In the prototype, a stream cipher is used for encrypting data at the server and decrypting data at the client. The secret key for the stream cipher must be known by both the server and the client's GPU. When performing decryption in a GPU, the issue of how to securely convey the secret key to the GPU must be addressed in order to avoid exposing plaintext to processes running on an untrusted operating system. Remote keying of the GPU is one solution. The secret key for the stream cipher is sent to the GPU via a proxy. A certificate stored in the GPU's memory contains a public/private key pair for the GPU. The proxy establishes a secure session with the server over which it receives the secret key. The proxy encrypts the secret key with the GPU's public key and sends it to the GPU via the client's operating system. The GPU decrypts the secret key and uses it for the stream cipher. Encrypted data is sent directly from the server to the client, where it is written to the GPU and XORed with the key stream.

The purpose of the prototype is to simulate the target architecture. A few of the operations in the prototype had to be performed in the CPU instead of the GPU due to GPU limitations. The capabilities of current GPUs in supporting both decryption and the keying of a symmetric key cipher using existing key establishment protocols, as well as GPU limitations in regards to such operations, are identified. Enhancements to future GPUs are proposed that will allow the full realization of the defined architecture.
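The message flow just described can be traced with a small Python model. Everything in it is an assumption made for illustration: textbook RSA without padding over two Mersenne primes stands in for the GPU's public/private key pair, SHAKE-256 stands in for the stream cipher, the secure session between server and proxy is modeled as a plain function call, and all function names are hypothetical. None of it is secure or taken from the prototype.

```python
import hashlib
import secrets

# Toy RSA key pair standing in for the GPU's certificate (assumption:
# textbook RSA, no padding; insecure and for illustration only).
P, Q = 2**61 - 1, 2**89 - 1          # two Mersenne primes
N, E = P * Q, 65537                  # GPU's public key
D = pow(E, -1, (P - 1) * (Q - 1))    # GPU's private exponent


def keystream(key: bytes, length: int) -> bytes:
    """Stand-in stream cipher: SHAKE-256 output keyed by `key`."""
    return hashlib.shake_256(key).digest(length)


def xor(data: bytes, ks: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(data, ks))


def proxy_wrap_key(secret_key: bytes) -> int:
    """Proxy: encrypt the stream cipher key under the GPU's public key."""
    return pow(int.from_bytes(secret_key, "big"), E, N)


def gpu_unwrap_key(wrapped: int, key_len: int = 16) -> bytes:
    """GPU: recover the stream cipher key with its private key."""
    return pow(wrapped, D, N).to_bytes(key_len, "big")


def server_encrypt(frame: bytes, secret_key: bytes) -> bytes:
    return xor(frame, keystream(secret_key, len(frame)))


def gpu_decrypt(ciphertext: bytes, secret_key: bytes) -> bytes:
    return xor(ciphertext, keystream(secret_key, len(ciphertext)))


secret_key = secrets.token_bytes(16)      # chosen by the server
wrapped = proxy_wrap_key(secret_key)      # all the untrusted OS relays
frame = b"pixel data for one video frame"
ciphertext = server_encrypt(frame, secret_key)
assert gpu_decrypt(ciphertext, gpu_unwrap_key(wrapped)) == frame
```

In this model the operating system only ever handles the wrapped key and the ciphertext, which is the property the architecture aims for. A 16-byte key fits because the toy modulus is roughly 150 bits long.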
1.6 Related Issues
This chapter contains solutions to issues relating to protecting user input on untrusted clients, an alternative to the method described in Chapter 5 for keying the GPU, a discussion of man-in-the-middle attacks and phishing attacks as they apply to the prototype described in Chapter 5, an overview of the TCG's TPM and a discussion on the compression of images. While GPU-based encryption protects data sent from a server to an untrusted client, applications deal with more than just display updates. In thin-client applications, the user will provide
inputs via the mouse and keyboard on the untrusted client. These inputs must be sent to the server and may contain information that must be protected. An alternative to the remote keying of the GPU used in the prototype described in Chapter 5 is presented. The method involves the user entering the key by selecting colors that the GPU displays. The applicability of two common types of attacks to the scenario in the prototype is also considered. The prototype's susceptibility to man-in-the-middle attacks when using a proxy is evaluated. The potential for phishing attacks when using the architecture described in the prototype is discussed. An overview of the TCG's TPM is included in this chapter because it relates to the use of the GPU as a trusted component when using GPU-based decryption to protect displays. The use of the GPU as a trusted module can be incorporated into the TCG architecture and the TPM can be used to generate keys for the GPU.

Another issue is the compression of images and video when decrypting in the GPU. Ideally, an encrypted image cannot be compressed since the encryption will result in a pseudorandom bit string representation of the image. Thus, images are compressed before being encrypted and must be decrypted then decompressed. When an image is decrypted in the GPU, it should not be written back to the operating system to allow for decompression but instead should be decompressed in the GPU.
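The ordering constraint on compression and encryption can be checked with a short experiment. The Python sketch below is illustrative only: SHAKE-256 is an assumed stand-in for a stream cipher, the "image" is synthetic and highly regular, and the name xor_encrypt is hypothetical.

```python
import hashlib
import zlib


def xor_encrypt(data: bytes, key: bytes) -> bytes:
    """XOR `data` with a SHAKE-256 keystream (stand-in stream cipher)."""
    ks = hashlib.shake_256(key).digest(len(data))
    return bytes(a ^ b for a, b in zip(data, ks))


image = bytes(range(16)) * 1024          # 16 KiB of highly regular data
key = b"demo key"

compress_then_encrypt = xor_encrypt(zlib.compress(image), key)
encrypt_then_compress = zlib.compress(xor_encrypt(image, key))

# Compressing first shrinks the data; compressing ciphertext does not.
assert len(compress_then_encrypt) < len(image)
assert len(encrypt_then_compress) >= len(image)

# The receiver must decrypt first and decompress second.
assert zlib.decompress(xor_encrypt(compress_then_encrypt, key)) == image
```

The last line mirrors the requirement stated above: decompression necessarily operates on plaintext, so it has to take place wherever decryption takes place.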
1.7 Extensions
This chapter presents extensions to the work and future areas of research. The concept of designing a cipher based on operations suitable for execution in a GPU is discussed. Ideas for how to create a GPU-based stream cipher are presented. A GPU-based cipher would not only be beneficial to applications requiring encrypted displays, but could also serve as a general purpose cipher in any system containing a GPU. The concept of encrypting and decrypting displays in a GPU to avoid exposing plaintext to the operating system can be extended to audio. The operations supported by programmable DSPs are typical of those supported by CPUs and those found in existing ciphers. This makes performing cryptographic processing of audio in a DSP substantially easier than performing such processing on images in GPUs.
1.8 Conclusions

This chapter provides a summary of the benefits and issues related to performing cryptographic operations in a GPU. Possible enhancements to GPUs that will assist in performing cryptographic operations are reviewed. A list of possible projects for students is included.
Notes

1. For its latest graphics cards, the X1K series, ATI has announced plans to provide a lower level API than OpenGL and Direct3D. The new API will provide more flexibility for programmers using GPUs as general purpose processors [59].
Chapter 2 GRAPHICAL PROCESSING UNITS
2.1 Overview
Knowledge of the operations supported by GPUs and how data is processed in GPUs is necessary in order to understand how GPUs can be leveraged for cryptographic processing and protecting data. This chapter provides an overview of GPUs and their APIs. While GPUs allow for significant levels of parallel processing, the capabilities supported for graphics processing do not allow for general purpose computing within a GPU equivalent to that of a CPU. GPU capabilities continue to expand and the APIs are evolving to improve programmers' access to these capabilities, some of which can potentially assist in performing cryptographic operations. For the most recent GPU capabilities, the reader should refer to vendors' GPU specifications. This chapter is organized as follows: Section 2.2 provides a summary of the general architecture and capabilities of GPUs. The steps a GPU performs on vertices and pixels when creating an image are described. Section 2.3 provides an overview of the types of operations supported by GPUs and the limitations of GPUs when used for general purpose programming. Section 2.4 lists the common APIs available for GPUs and explains why lower level APIs are more suitable for general purpose programming of a GPU compared to higher level languages that are more user friendly. Data can be processed in GPUs as either vertices or as pixels. For cryptographic applications discussed in this book, the data must be represented as pixels. Section 2.5 describes how pixels are processed in a GPU, focusing on the operations relevant to creating ciphers that can execute within a GPU. Section 2.6 discusses why vertex processing is not appropriate for existing cryptographic algorithms. The idea of using GPUs for cryptographic processing arose in part because of the processing power of GPUs and a growing number of other applications experimenting with using
GPUs in place of CPUs. A few examples of other non-graphics applications utilizing GPUs are listed in Section 2.7.
2.2 GPU Architecture

GPUs contain their own processors and memory. A GPU is connected to the system by either an AGP, PCI or PCI Express bus. The first programmable GPUs operated as a fixed pipeline. A program compiled on the CPU issued API commands to the GPU for execution. This allowed computationally expensive operations to be performed in the GPU to free up the CPU. When programming a GPU, operations are performed on either vertices or pixels. Vertices are specified as coordinates and are the most basic element for defining any line or object. A pixel is a string of bits interpreted according to a specific format to indicate which bits represent the red (R), green (G), blue (B) and alpha (A) components. A typical configuration uses 32-bit pixels with 8 bits for each of the components.

In the past two years, the flexibility in programming GPUs has substantially increased with the addition of programmable vertex and pixel (fragment) units that allow for certain programs to execute on the GPU. The term "pixel processor" will be used throughout this chapter to refer to the pixel unit. Some GPU specifications, articles and graphics books use the term "fragment processor" exclusively while others use the term "pixel processor". Vertex and pixel programs are commonly referred to as vertex and pixel (or fragment) shaders. Graphics programming generally uses vertex processing, whereas non-graphic applications using GPUs typically require the pixel processor [62].

The basic architecture of a GPU is shown in Figure 2.1. The number of vertex and pixel processors, and the way the components handling the operations and memory outside of these processors are arranged, will vary per graphics card. The main point to obtain from the figure is that the GPU contains a series of vertex processors and pixel processors working in parallel. The vertex data are received over the bus from the host processor, and are processed by the vertex shaders, which include a programmable unit.
Some fixed steps are performed, including rasterization, the output of which is the fragments given to the pixel processors. Both programmable and fixed steps are performed during pixel processing. The vertex processing steps are not applicable when operating on pixels directly. Bytes stored in the system's memory can be written directly to the GPU's memory and then processed as pixels. "Fixed" steps refer to steps outside the programmable units. These steps are controlled to some extent by the programmer. For example, the programmer defines stencils and depth, and parameters for the viewing angle and perspective, among others. Common components found within the vertex processor are a floating point unit, a floating point vector unit, a unit for fetching textures from the GPU's cache and a branch unit. One or more units for vertex assembly operations and viewport (mapping a 3D scene to the 2D viewing area) may be included.
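As a concrete illustration of bytes being treated as pixels, the following Python sketch lays a byte string out as 32-bit RGBA pixels with 8 bits per component, the typical configuration mentioned earlier in this section. It shows one possible layout only and is not necessarily the layout used by the AES implementation in Chapter 4; the names bytes_to_pixels and pixels_to_bytes are hypothetical.

```python
import struct
from typing import List, Tuple

Pixel = Tuple[int, int, int, int]  # (R, G, B, A), 8 bits per component


def bytes_to_pixels(data: bytes) -> List[Pixel]:
    """Lay bytes out as RGBA pixels, zero-padding the final pixel."""
    padded = data + b"\x00" * (-len(data) % 4)
    return [struct.unpack("4B", padded[i:i + 4])
            for i in range(0, len(padded), 4)]


def pixels_to_bytes(pixels: List[Pixel], length: int) -> bytes:
    """Recover the original bytes, discarding any padding."""
    return b"".join(struct.pack("4B", *p) for p in pixels)[:length]


# A 16-byte block, the size of one AES block, occupies four RGBA pixels.
block = bytes(range(16))
pixels = bytes_to_pixels(block)
assert len(pixels) == 4
assert pixels[0] == (0, 1, 2, 3)
assert pixels_to_bytes(pixels, len(block)) == block
```

With this layout, four bytes are processed per pixel. A format that uses only some of the components would need proportionally more pixels for the same data.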
[Figure 2.1. High Level View of GPU Hardware (VS = vertex shader, PS = pixel shader). Components shown: host system; vertex shaders; culling, clipping, transformations; rasterization; texture cache; pixel shaders; z-cull; DRAM (partitioned memory).]
Components found within the pixel processor include a texture processing unit that communicates with the cache, one or more floating point units, a branch unit and a fog arithmetic and logic operations unit (ALU). The processing speeds of GPUs have been increasing at a rate faster than those of CPUs. In the last two to three years, GPUs have evolved to contain more transistors than typical desktop CPUs. Although processing speeds of GPUs now surpass those of CPUs, their capabilities are narrower in scope than those of CPUs and do not offer the general programmability provided by the latter. This is due to both API limitations and GPU capabilities. Newer GPUs process at rates exceeding 40 billion floating point operations per second (GFlops). For example, peak performance of an Nvidia GeForce 6800 Ultra was listed as 40 GFlops in comparison to 6 GFlops on a Pentium 4 with a 3.2 GHz processor [57]. The RAM in GPUs is of smaller capacity than that commonly found in systems today, with newer GPUs containing a maximum of 256 MB or 512 MB of RAM compared to the 1 GB to 4 GB of RAM available for typical desktop PCs. However, as the number of transistors per GPU (or CPU) increases, the power consumption and heat dissipation become of greater concern.
Until the year 2005, most GPUs supported 32-bit pixel formats and 32-bit floating point precision while others only supported 16-bit precision. Recently, support for 64-bit pixel formats and 64-bit floating point precision has become more common. Graphics cards with 128-bit floating point precision are becoming available.
Figure 2.2. GPU's Main Processing Steps. Diagram blocks: host system (inputs); vertex processing (vertex program; transformations, lighting, clipping, projection, viewport); rasterization; pixel processing (z-culling, fragment program; tests, blending, logical operations); texture memory. Labeled paths: pack/unpack pixels; pixel processing commands/program; fetch.

The general flow for processing data in a GPU is shown in Figure 2.2. The general flow in OpenGL 2.0, a platform independent API for GPUs, is shown in Figure 2.3, taken from the OpenGL Version 2.0 Specification [75]. It is important to understand both the fixed pixel processing pipeline and the flow with programmable units because not all operations have been moved into the programmable units. For example, the rasterization and blending steps are outside the programmable units and most of the pixel processing in OpenGL still corresponds to the basic pipeline [39]. The implementation of the block cipher AES described in Chapter 4 uses the basic pixel processing of GPUs.

Figure 2.3. OpenGL Version 2.0 General Pipeline. Diagram blocks: display list; evaluator; per-vertex operations and primitive assembly; pixel operations; rasterization; per-fragment operations; framebuffer; texture memory.

GPUs can be viewed as processing data in two formats. The first and most used in graphics applications is vertex processing. Vertices are specified as sets of coordinates. Any shape or object is formed by a set of connected vertices. Once objects are defined, transformations concerning properties such as the angle and direction the scene is viewed from, lighting and intensity are applied. The resulting scene is converted into fragments (pixels) and undergoes pixel processing before being displayed. The coordinates and properties of vertices, including color and location, cannot be tracked and read back to system memory as data. As a result, vertices are not a suitable means of representing data to which cryptographic operations are applied when the intent is to offload work from the CPU and then supply the result to a process running on the operating system. Even when the intent is to decrypt data in the GPU and display it to the user, with no need to transfer the data back to the operating system, vertex processing cannot be used because of the floating point representation and rounding. The rounding in GPUs, even when considering the increasing precision of 64-bit and 128-bit floating point values, results in a lack of accuracy that is unacceptable for ciphers, where changing one bit will produce an incorrect decryption.

Processing vertex data involves the following steps (some of these steps may be performed within the vertex program instead of the traditional pipeline):
• The various data needed to construct the image (vertices, including their coordinates and colors, properties, parameters and any textures) are defined. The data is passed into the GPU through API commands.
• Transformations are applied to set the vertices in the scene. The vertex coordinates (including depth) are multiplied by model and view transformations. The model transformation indicates any rotation, translation or scaling of the scene; for example, rotating about the X axis by 30 degrees and doubling the scale of an object. The view transformation is the angle (or camera position) the scene is viewed from.
• Lighting is applied. This sets the angle and intensity of the light.
• Clipping, projection and the viewport are applied. Clipping removes areas outside the scene. Projection can be thought of as viewing the scene through a normal, telephoto or wide angle camera lens. It also determines whether all objects appear to be of the same size or whether objects that are further away are smaller than those at the front of the scene (as objects appear in real life). The viewport defines the shape and area of the screen where the objects will appear.
• Rasterization, the conversion of vertices into fragments (pixels), is performed. Texture coordinates are interpolated from the texture coordinates of the vertices.
• The fragments resulting from rasterization are tested to determine which pixels to keep and which pixels to discard. The scissor test discards portions of the image outside of a defined region. The alpha test discards pixels based on their alpha values. The stencil test discards portions of the image outside of a defined stencil. The depth test discards pixels based on their depth. When a pixel program is applied, fragments that will fail the depth test are discarded before the pixel program is run; the depth test and the other tests are applied after the pixel program.
• The pixels are combined with the current contents of the buffer. The default setting is for the new pixels to overwrite the current pixels in the buffer. The pixels may instead be combined in a few ways. The current and new values may be combined by blending. The resulting value of each color component is based on both the new and current values; for example, by multiplying both the old and new value by some factors then adding the results. By default, no blending is performed. Dithering may be applied. This averages pixels with neighboring pixels to eliminate abrupt color changes. By default, dithering occurs, but it can be disabled by an API command. Logical operations can also be applied, such as XORing the new and current pixel values together. Logical operations are off by default.

A vertex program can replace the model and view transformations, and any per-vertex lighting. The vertex program may define textures and their coordinates. Vertex programs work on a single vertex at a time, with the output continuing through the remainder of the standard pipeline. Operations that require knowledge of multiple vertices and/or of topology are performed according to the standard pipeline. A vertex program can read textures but cannot currently read from the framebuffer.

The second method for processing data in GPUs is pixel processing. Individual pixel values can be set and operated on, as opposed to drawing and manipulating objects. Pixels can be used to store and manipulate byte level data, as described later in this chapter. Pixel values can be transferred between the GPU's framebuffer and system memory (where they are stored as bytes) by executing commands from a program running on the CPU. Therefore, an application using a GPU to offload processing from the CPU can read the result from the GPU to use in a program running on the CPU. Pixel processing is described in more detail in Section 2.5. In Figure 2.2, the vertex processing steps are not applicable when dealing solely with pixels. The program executing on the system's CPU will write pixels to the framebuffer and possibly define textures, then the pixel processing steps will be performed.
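As a rough illustration of how byte data can travel through the pixel path, the following Python sketch models a framebuffer as a list of RGBA tuples and applies the XOR logical operation when new pixels are written over existing ones. It is a CPU-side simulation under stated assumptions, not OpenGL code; the names `pack_rgba`, `write_pixels`, `read_pixels` and `xor_via_framebuffer` are hypothetical.

```python
# CPU-side model of writing pixels with a logical operation enabled.
# All function names are hypothetical; this is not an OpenGL binding.

def pack_rgba(data: bytes) -> list:
    """Pack bytes into RGBA pixels, one byte per color component."""
    if len(data) % 4 != 0:
        raise ValueError("length must be a multiple of 4 for RGBA pixels")
    return [tuple(data[i:i + 4]) for i in range(0, len(data), 4)]

def write_pixels(framebuffer: list, pixels: list, logic_op=None) -> None:
    """Write pixels into the framebuffer; the default is to overwrite."""
    for pos, new in enumerate(pixels):
        if logic_op is None:
            framebuffer[pos] = new
        else:
            current = framebuffer[pos]
            framebuffer[pos] = tuple(logic_op(n, c) for n, c in zip(new, current))

def read_pixels(framebuffer: list) -> bytes:
    """Read the framebuffer back to system memory as plain bytes."""
    return bytes(component for pixel in framebuffer for component in pixel)

def xor_via_framebuffer(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings by drawing one over the other."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    framebuffer = [(0, 0, 0, 0)] * (len(a) // 4)
    write_pixels(framebuffer, pack_rgba(a))                               # overwrite
    write_pixels(framebuffer, pack_rgba(b), logic_op=lambda n, c: n ^ c)  # XOR on write
    return read_pixels(framebuffer)
```

In this model the whole second write is one operation from the caller's point of view, which is the property that makes XORing long byte sequences attractive on a GPU.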
2.3
GPUs and General Purpose Programming
The following is an overview of the types of operations supported by GPUs and the limitations of GPUs when used for general purpose programming. The types of applications best suited for GPUs are those that involve operations that take a single pixel's value, apply a function to it (with limitations on what the function can be) and output the result as a new pixel value. Parallel processing of data is performed by using multiple pixels and multiple color components of a pixel. Four streams of data can be operated on simultaneously in the GPU by using each of the four components of a pixel (red, green, blue and alpha: RGBA). As a general rule, data that is stored in an array when programming in the CPU should be represented as a texture in the GPU. Any loop running in the CPU should have the inside of the loop run as a kernel on the pixel processor. In some cases, complex functions that cannot be performed in the GPU can be computed in advance in the CPU and the results stored as tables (e.g., as colormaps) or textures to be used by the GPU. In order to use table lookups in the GPU to represent a function, the function must take only a single input value and the input value must fit in a color component of a pixel. If the function takes multiple inputs, it may not be possible to represent it as a table lookup on pixel values. Applications that take multiple inputs and produce a single output, applications that require pixels to be processed in a particular order, and applications that require using one pixel's value to determine which operation to perform on another pixel are not suitable for implementation in a GPU. Visiting pixels in a particular order in a single pass through the pixel processor is not possible because there is no way to control the order in which pixels are processed. It is also not possible to use the results from an already processed pixel when operating on a pixel that has yet to be processed.
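The table-lookup pattern described above can be sketched in Python as follows. This is a CPU-side model only: the function names and the example function (multiplication by 3 modulo 256) are hypothetical choices for illustration, and a real implementation would hold the table as a colormap or texture.

```python
# Hypothetical CPU-side model of a per-pixel kernel that uses a lookup table.

def build_table(func) -> list:
    """Precompute func for every value a color component can hold (0..255)."""
    return [func(x) & 0xFF for x in range(256)]

def kernel(pixel: tuple, table: list) -> tuple:
    """Runs once per pixel; the four components act as four parallel streams."""
    r, g, b, a = pixel
    return (table[r], table[g], table[b], table[a])

def run_pass(texture: list, table: list) -> list:
    """Apply the kernel to every pixel of the input texture."""
    # Each output depends only on the same pixel's input, so the order in
    # which pixels are visited does not matter.
    return [kernel(pixel, table) for pixel in texture]

# Example single-input function: multiply by 3 modulo 256.
TRIPLE = build_table(lambda x: 3 * x)
```

Because the table has only 256 entries, this approach works for functions of one byte-sized input; a function of two independent byte inputs would need 65,536 entries indexed by two components, which is the limitation noted above.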
Pixel processors cannot currently perform what is referred to as scatter. Scatter is the capability to output results to areas of the image other than those used as the input to the function. That is, pixel processors cannot support memory accesses such as a[i] = x where i is a computed address. The operation x = a[i] is possible. If a is a texture and i is a computed value, then a[i] is a texture fetch instruction, whereas a[i] = x requires a texture write instruction to a computed address, i. This is because in pixel processors the only writes allowed are to pre-computed fragment addresses that cannot be changed by a program running in the pixel processor. The GPU is also designed to only read texture data, whereas a CPU is designed for read and write operations. This limits how data can be processed in a GPU compared to a CPU. A way around this is to write intermediate results to system memory then read the data back into the GPU. This increases the number of data transfers between the GPU and operating system, increasing the overall execution time. Furthermore, it makes intermediate results available to the operating system, which must be avoided for the applications addressed in this book. A second option is to use the vertex processor, which supports such indexing. This is unsuitable for applications whose data cannot be represented and processed as vertices, including cryptographic processing.

Branching is also not readily supported in pixel processors, although a workaround (described by Pharr [62]) can exist for certain applications. The GPU's pixel processor is still often a single instruction, multiple data (SIMD) design without support for branching or with minimal support where both paths of the branch must be taken, slowing processing. In contrast, vertex processors are now multiple instruction, multiple data (MIMD) processors and support branching. Some GPUs, such as Nvidia's GeForce 6 Series, support different segments of the frame taking different paths. One segment is processed at a time, causing the other to wait. Processing after the branch does not begin until both segments have finished the branch. On Nvidia's FX Series, branching is supported by evaluating both possible paths then only writing the results of the path actually taken.

Another limitation is that GPUs treat all values as floating point values, including the values of pixel components. This must be carefully taken into account when the data being processed involves values that are to be interpreted as individual bits. The floating point representation limits the range of integers supported and introduces rounding error. The OpenGL version of AES in Appendix A illustrates the impact of rounding error, which must be taken into account when populating the tables used in the implementation. While the OpenGL shading language supports an integer data type, this is done for the benefit of the programmer. The values are actually stored and processed as floating point values in the hardware. In the OpenGL shading language, integers are currently limited to 16 bits plus a sign bit.
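The effect of storing integers as floating point values can be checked on the CPU. The sketch below, a minimal illustration that assumes IEEE 754 single precision (32-bit) storage, rounds a value to single precision with Python's `struct` module and tests whether an integer survives the round trip; the helper names are hypothetical.

```python
import struct

def to_float32(value: float) -> float:
    """Round a Python float (a double) to the nearest IEEE 754 single."""
    return struct.unpack("<f", struct.pack("<f", value))[0]

def survives_float32(n: int) -> bool:
    """True if integer n is unchanged after being stored as a 32-bit float."""
    return int(to_float32(float(n))) == n

# A 32-bit float has a 24-bit significand, so every integer up to 2**24 is
# exact, but 2**24 + 1 (0x01000001) loses its lowest bit.
LIMIT = 2 ** 24
```

Under this assumption a 32-bit word such as 0x01000001 cannot be carried in a single float without changing a bit, which is why the implementations in this book keep one byte per pixel component.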
The time to read and write data to and from the GPU must be considered. The processing power of GPUs has increased faster than the data transfer rates between the system's memory and the GPU. Therefore, it is best to limit reads and writes to system memory. When the time to transfer data between system memory and the GPU is considered, functions that can be computed faster in the GPU may take longer overall than when computed in the CPU. In general, functions that have a large ratio of arithmetic operations to memory accesses may perform better on a GPU (provided the arithmetic can be done on the GPU) than on a CPU, whereas those that require a large number of memory accesses will likely be slower on the GPU. Note that neither symmetric key ciphers nor asymmetric key ciphers fall into the category of having a large ratio of arithmetic operations. Symmetric key ciphers have simple operations repeatedly applied to the data. Asymmetric key ciphers have a few arithmetic operations requiring large data structures; for example, exponentiation involving large integers.
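The trade-off between transfer time and arithmetic can be made concrete with a deliberately simple cost model. The sketch below assumes that each byte is written to the GPU once and read back once, that computation and transfer do not overlap, and that both processors run at a fixed rate; the function names are hypothetical, and the bus bandwidth used in the usage note is an invented figure, not a measurement.

```python
def cpu_time(n_bytes, ops_per_byte, cpu_ops_per_s):
    """Seconds to process the data entirely on the CPU."""
    return n_bytes * ops_per_byte / cpu_ops_per_s

def gpu_time(n_bytes, ops_per_byte, gpu_ops_per_s, bus_bytes_per_s):
    """Seconds to send the data to the GPU, process it and read it back."""
    transfer = 2 * n_bytes / bus_bytes_per_s      # one write plus one read
    compute = n_bytes * ops_per_byte / gpu_ops_per_s
    return transfer + compute

def break_even_ops_per_byte(cpu_ops_per_s, gpu_ops_per_s, bus_bytes_per_s):
    """Operations per byte above which the GPU wins under this model."""
    if gpu_ops_per_s <= cpu_ops_per_s:
        raise ValueError("the GPU never wins if it is not faster than the CPU")
    return (2 / bus_bytes_per_s) / (1 / cpu_ops_per_s - 1 / gpu_ops_per_s)
```

With the peak rates quoted earlier in this chapter (6 GFlops for the CPU, 40 GFlops for the GPU) and a hypothetical 1 GB/s bus, `break_even_ops_per_byte(6e9, 40e9, 1e9)` is about 14 operations per byte; workloads with fewer operations per byte are dominated by the transfer.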
2.4
APIs
The two most common APIs for GPUs are OpenGL and Direct3D. OpenGL is an open, platform independent API. In contrast, Direct3D is specific to Microsoft Windows. These APIs are the lowest level, publicly available interfaces to GPUs. There are higher level languages built on top of OpenGL and Direct3D that provide a more user-friendly syntax and hide lower level details from the programmer, but such languages provide no additional capabilities in terms of what commands can be executed in the GPU since they rely on the OpenGL and Direct3D APIs. What can be executed within a GPU is restricted to the capabilities of the GPU, which are independent of the level or type of API used. The higher level languages result in code that compiles to a program (usually C or C++ code) that executes in the CPU and issues commands to the GPU. Such languages do not allow the developer control over which commands the code is translated into or even which commands are executed in the GPU. For example, code in a higher level language that XORs two bytes will likely be transformed into code executed on the CPU rather than converted into OpenGL commands that convert the bytes to pixels and XOR the pixels. Using pixels to XOR bytes produces the desired result but is an inefficient way to XOR only two bytes when the operation can easily be performed in the CPU. Using the GPU to XOR two long sequences of bytes in a single step is useful. Cg [25], Brook GPU [8] and Vertigo [10] are some examples of higher level languages, of which Cg is the oldest and the most well-known. Cg has a C-like syntax and compiles to either OpenGL or Direct3D code, depending on the platform. The Cg code must be included in a main program that compiles on the CPU, such as a C or C++
of the window system and providing a more user-friendly syntax for creating display windows than the APIs for the window systems. GLUT is closed source. Its executable is available from the OpenGL organization at: http://www.opengl.org/resources/libraries/glut.html There are several alternatives to GLUT, including open source versions such as Freeglut. A list of toolkits that provide wrappers for window systems' APIs, along with links to their downloads, is available at: http://www.opengl.org/resources/libraries/windowtoolkits.html. The experiments described within this book required using a low level API in order to issue commands directly to the GPU and required platform independence. Therefore, OpenGL was used in all experiments. GLUT was used to open the display windows. Further details regarding OpenGL pixel processing and vertex processing that are relevant to implementing ciphers within GPUs are provided in the next two sections. At the time this was being written, ATI announced plans to provide support for general purpose GPU programming by publishing an API that is at a lower level than OpenGL and Direct3D in order to provide more direct access to the GPU's capabilities [59].
2.5
OpenGL and Pixel Processing
The following is an overview of the OpenGL pixel processing pipeline and the OpenGL commands relevant to the experiments described in subsequent chapters. The implementations used in the experiments process data as 32-bit pixels treated as floating point values, with one byte of data stored in each pixel component. When using 32 bit pixels, 1 byte is typically dedicated to each of the RGBA components. Other formats, such as 10 bits for each of the red, green and blue components and 2 bits for the alpha component may also be supported. Since the time of the experiments, support for 64-bit pixels with 16 bits for each of the color components has become available. The following capabilities are not used in the experiments described in this book and therefore, are not described here: OpenGL's capabilities of processing pixels as color and stencil indices, and OpenGL's vertex processing (refer to [58] and [89] for a complete description). Figure 2.4 shows the components of the OpenGL pipeline that are relevant to pixel processing when pixels are treated as floating point values. While implementations are not required to adhere to the pipeline, it serves as a general guideline for how data is processed. The programmable pixel processor replaces part of the pipeline. Pixel shaders can access and apply textures, compute and set colors and depth, and apply fog. As with vertex shaders, the various tests at the end of the pipeline (scissor, stencil, alpha, etc.) are performed according to the pipeline and are not programmed within the pixel processor. OpenGL requires support for at least a front buffer (image is visible) and a back buffer (image is not visible) but does not require support for the alpha pixel component in the back buffer. This limits
[Figure 2.4 (partial diagram): Texture Memory; Unpack; Pixel Storage Modes; System Memory; Pack; Convert to [0,1]]
Figure 5.6. Decrypted Image Displayed in GPU
The prototype uses images encoded with 24 bits per pixel using 8 bits for each of the red, green and blue components. No alpha component is encoded because the image is written to the back buffer (which may not support the alpha component) to be decrypted. The pixel format is a parameter used by certain OpenGL commands, such as the Draw command for writing data to the GPU, and can easily be changed to accommodate other pixel formats. Figure 5.5 shows an encrypted image received by the GPU and Figure 5.6 shows the decrypted result.
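The mapping between bytes and 24-bit RGB pixels can be sketched on the CPU as follows. This is a minimal model that assumes components are held as 32-bit floats in [0, 1] and converted back to 8 bits by rounding to the nearest value; the helper names are hypothetical and no OpenGL calls are made.

```python
import struct

def to_float32(value: float) -> float:
    """Round a Python float (a double) to the nearest IEEE 754 single."""
    return struct.unpack("<f", struct.pack("<f", value))[0]

def byte_to_component(b: int) -> float:
    """Map an 8-bit value to a color component in [0, 1]."""
    return to_float32(b / 255.0)

def component_to_byte(c: float) -> int:
    """Map a color component back to 8 bits, rounding to the nearest value."""
    clamped = min(max(c, 0.0), 1.0)
    return int(clamped * 255.0 + 0.5)

def unpack_rgb(data: bytes) -> list:
    """Group bytes into 24-bit RGB pixels held as triples of floats."""
    if len(data) % 3 != 0:
        raise ValueError("length must be a multiple of 3 for RGB pixels")
    comps = [byte_to_component(b) for b in data]
    return [tuple(comps[i:i + 3]) for i in range(0, len(comps), 3)]

def pack_rgb(pixels: list) -> bytes:
    """Convert float RGB pixels back to bytes for system memory."""
    return bytes(component_to_byte(c) for px in pixels for c in px)
```

Under these assumptions every byte value from 0 to 255 survives the round trip, so a byte of ciphertext stored in a color component can be recovered exactly.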
5.4
Design Decisions
In this section some design and implementation decisions made when creating the prototype are discussed. These decisions were guided by the constraints of existing GPUs. First, the limitations on programming a GPU to perform general keying and decryption operations are described, and the current inability to provide data compression is discussed. As mentioned in Chapter 2, GPUs are not designed to perform general modular arithmetic and byte-level operations. There are no API commands for common operations such as modular addition and multiplication, and byte-level shifts and rotations. Some operations can be performed by a sequence of other commands under certain circumstances, by limiting values to a single byte and/or by reading intermediate results from the GPU to the operating system to allow the result to be a parameter in a subsequent command. The following subsections describe how these limitations impact the ability to remotely key the GPU and decrypt data within the GPU, and the workarounds used to create the prototype.

A conclusion is that three enhancements to OpenGL and GPUs are necessary to fully realize the architecture. First, a mechanism for using the contents of a pixel (or pixel component) as a parameter to an OpenGL command, without first reading the pixel value from the GPU, is required for the remote keying and key stream generation. An example is RC4, when the index of the next S array entry to write to the key stream is computed (e.g., the step output(S[(S[i] + S[j]) mod 256]) from the pseudocode in Section 4.3). Second, the ability to perform modular arithmetic directly on values less than 256 is desirable to efficiently implement certain ciphers, such as RC4, within the GPU. Modular arithmetic can currently be performed on values contained within single color components by using colormaps, provided one operand is constant so that a static colormap can be used and the map is applied to the second operand. Third, support for large integers is needed. Modular arithmetic on values of the magnitude found in public key ciphers is needed to securely implement remote keying of GPUs. While it is feasible that modular arithmetic on integers may be directly supported in GPU APIs as GPUs evolve, it is unlikely that the large integers needed for public key ciphers will be supported in GPUs anytime soon.
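The colormap technique for modular addition with one constant operand can be sketched as a table lookup. The Python below is a CPU-side model with hypothetical names; in the GPU the table would be an activated colormap and the lookup would happen as a pixel component passes through it.

```python
def make_add_colormap(a: int) -> list:
    """Colormap whose i-th entry is (a + i) mod 256, for a constant a."""
    return [(a + i) % 256 for i in range(256)]

# One static table per possible constant operand.
COLORMAPS = [make_add_colormap(a) for a in range(256)]

def add_mod_256(a: int, b: int) -> int:
    """Compute (a + b) mod 256 by looking b up in the colormap for a."""
    active = COLORMAPS[a]   # choosing the map requires knowing a in advance
    return active[b]        # b is the pixel component being mapped
```

The lookup itself needs no arithmetic, but selecting `COLORMAPS[a]` corresponds to a command issued from outside the GPU, which is why the first enhancement listed above matters when a is itself held in a pixel.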
5.4.1
Remote Keying
The lack of modular arithmetic and the limitations on the range of values in GPUs impact the implementation of the asymmetric key cipher used in the remote keying. The proxy conveys the secret keys to the GPU via the client's operating system using an asymmetric key cipher. Since existing public-key algorithms require exponentiation and modular arithmetic on large integers, the operations required cannot be emulated in the GPU with existing APIs, except when trivially small values are used or when the values involved can be viewed as a series of smaller values. For example, the exponents and modulus in RSA must each fit within the bits of a single color component of a pixel, making them entirely unsuitable for a security application. The remote keying of the GPU requires only that the GPU be able to perform the decryption function of the asymmetric key cipher. Note that unless the proxy and GPU share a secret key in advance or the user can securely enter the secret key into the GPU when needed, any protocol used to exchange information, whether by merely having the proxy encrypt information with the GPU's public key or by establishing a session key between them, requires use of an asymmetric key cipher.
Two options were considered for the prototype. First, the operations can be implemented in C code to represent a function that should be in the GPU. Second, restrictions can be imposed on the size of the asymmetric key cipher's components to allow it to be implemented to run in the GPU. However, in the case of RSA this requires that plaintext and ciphertext each fit within a single byte when using 32-bit pixels with one byte per color component. The modulus and exponents must then also each fit within a single byte, resulting in key components too small to be secure, since an exhaustive search for the private key and data is easily performed. In order to illustrate the concept of decryption using public key cryptography within the GPU, "toy" values less than 256 were used in the prototype for the private exponent, public exponent and modulus. The use of RSA with "toy" values will be referred to as mini-RSA. A series of 8-bit values was used to represent the data encrypted with RSA, specifically the secret key for RC4 in the prototype. Each 8-bit value is encrypted with mini-RSA by the proxy and sent to the GPU, where it is decrypted and used as a byte of the RC4 secret key. When using RC4 as the key stream generator, up to 256 single-byte values can be in the series for RC4's secret key.

A third possibility that is worth exploring is the integration of a decrypting GPU with a trusted platform module (TPM) such as the one proposed by the Trusted Computing Group. The TPM could provide certificate storage and handling, as well as remote attestation and key negotiation. The GPU can then handle image decryption using the TPM negotiated session key.
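To make the mini-RSA idea concrete, the sketch below encrypts a key one byte at a time with RSA parameters that all fit in a single byte. The primes, exponents and function names are hypothetical values chosen for illustration; the parameters actually used in the prototype are not given here. The last function shows the exhaustive search that makes such parameters insecure.

```python
# Toy parameters for illustration only; they are NOT taken from the prototype.
P, Q = 11, 23
N = P * Q      # 253, fits in one byte
E = 3          # public exponent
D = 37         # private exponent: (E * D) % 110 == 1, where 110 = lcm(P-1, Q-1)

def encrypt_byte(m: int) -> int:
    """Encrypt one value; it must be smaller than the modulus."""
    if not 0 <= m < N:
        raise ValueError("value must be smaller than the modulus")
    return pow(m, E, N)

def decrypt_byte(c: int) -> int:
    return pow(c, D, N)

def encrypt_key(key: bytes) -> list:
    """Encrypt a symmetric key one byte at a time, as the proxy would."""
    return [encrypt_byte(b) for b in key]

def decrypt_key(blocks: list) -> bytes:
    """Recover the key bytes, as the GPU-side function would."""
    return bytes(decrypt_byte(c) for c in blocks)

def find_private_exponent() -> int:
    """Exhaustive search for an exponent that decrypts every value."""
    for d in range(1, N):
        if all(pow(pow(m, E, N), d, N) == m for m in range(N)):
            return d
    raise RuntimeError("no exponent found")
```

With this particular modulus, byte values 253 to 255 cannot be encrypted, so key bytes would have to be restricted accordingly; the search in `find_private_exponent` finishes almost instantly, which illustrates why the text calls such values unsuitable for real use.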
5.4.2
Decryption of Data in the GPU
To decrypt the images received from the server, the GPU on the client must run a symmetric key cipher; as described previously, a stream cipher was used. Two options for the stream cipher were considered: using an existing stream cipher and designing a stream cipher suitable for a GPU. With respect to running an existing stream cipher within a GPU, operations found in stream ciphers make this infeasible either due to the nature and number of OpenGL commands required to emulate the operations or due to the infeasibility to convert the operations to execute within the GPU given limitations of the API. As explained in Section 4.3, existing stream ciphers, such as LILI, RC4, SEAL, SNOW and SOBER, are unsuitable for implementation in a GPU. RC4 was chosen because most of its operations can easily be implemented in OpenGL. However, it is not practical to do so because the specific OpenGL commands required result in poor performance. The use of irregularly clocked feedback shift registers in LILI and SOBER, and 32-bit words in SNOW and SEAL, among other operations, result in these stream ciphers having a lower percentage of operations that can be implemented in OpenGL when compared to RC4.
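For reference, the following is a minimal Python sketch of the standard RC4 algorithm, included to show that its steps are byte additions modulo 256, swaps and table reads. It is not the prototype's C code, and the function names are hypothetical.

```python
def rc4_key_schedule(key: bytes) -> list:
    """Initialize the 256-entry S array from the secret key."""
    if not 1 <= len(key) <= 256:
        raise ValueError("RC4 keys are 1 to 256 bytes long")
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]
    return s

def rc4_keystream(key: bytes, length: int) -> bytes:
    """Generate `length` key stream bytes."""
    s = rc4_key_schedule(key)
    i = j = 0
    out = bytearray()
    for _ in range(length):
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        # The index of the output byte is itself computed from table entries.
        out.append(s[(s[i] + s[j]) % 256])
    return bytes(out)

def rc4_xor(key: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt by XORing the data with the key stream."""
    ks = rc4_keystream(key, len(data))
    return bytes(d ^ k for d, k in zip(data, ks))
```

Every index used in the inner loop depends on values read from S in the same step, which is the data-dependent addressing that is awkward to express with OpenGL commands.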
The operations in RC4 consist entirely of adding two bytes modulo 256 and swapping two bytes. Thus, the only operation required by RC4 that is lacking in a GPU is modular arithmetic. Since the modulus is 256, all values can be represented by single bytes and can be stored as individual pixel components. Given two integers, a and b, in the range [0,255], (a + b) mod 256 can be computed using a colormap. This requires knowing either a or b in advance to determine which colormap to activate. For each integer, a, in the range [0,255], create a colormap where the i-th entry corresponds to (a + i) mod 256. To compute (a + b) mod 256, b is stored as a pixel component, the colormap for a is activated, then the pixel containing b is copied to a new location. The result written to the new location will be the b-th entry of the colormap. This poses two problems. First, while OpenGL is used, the command to activate a colormap must be issued by a program running on the operating system, requiring a to be exposed to the operating system. While this does not expose the key stream to the operating system, it does provide partial information to the operating system, which may be helpful in determining key stream values. Second, the copying of pixels between locations in the buffer is one of the slowest operations within GPUs. In addition to the copy needed to compute the sum, copies are needed to update the indices and move bytes into the appropriate pixel components and locations. As a result, implementing RC4 in OpenGL was not a practical option at the time the prototype was developed.

Therefore, the key stream generator of RC4 was implemented in C to represent a function that will eventually be moved into the GPU. The key stream bytes are written to the GPU as they are computed. This requires the C function computing the key stream to read the secret key from the GPU. Initially, each byte of RC4's output was written directly to the GPU as it was generated. However, the number of writes required (750,000 for a 500x500 image) resulted in poor performance. The prototype was changed to compute the key stream bytes for an entire row of pixels before writing them to the GPU, reducing the number of writes to the height of the image, with the tradeoff that a segment of the key stream is temporarily stored in the operating system's memory.

Due to the inability to efficiently generate a key stream within a GPU by using an existing stream cipher, a possibility is to design a stream cipher utilizing graphics operations for which GPUs are designed. This is described in Chapter 7. While creation of a new stream cipher suitable for current GPUs is feasible (and in fact may have wider applicability than the GPU-based encryption applications), the same is not true for asymmetric key ciphers, since this would require devising a new one-way function that does not depend on the hardness of factoring or of discrete logs, due to the need to avoid exponentiation and modular arithmetic on large numbers.

While the proposed approach protects the secrecy of the images sent to the untrusted system, the integrity of these images is not protected. This could allow an attacker to change parts of the image, although changing large portions of the image would be immediately detectable by the user, as it would produce corrupt output on the screen (since the attacker does not know the session key for the stream cipher). If single pixels or small areas of pixels are replaced, the alteration is not likely to be noticed by a user, but small, unnoticeable changes will also not be a useful attack because the meaning of the display or image seen by the user is not altered. Adding a message authentication code (MAC) to the scheme is technically feasible if the rate at which frames must be updated does not matter. A MAC is typically computed using a block cipher (such as the CBC-MAC) or with a hash function (HMAC). Whether or not the required operations can be performed in the GPU depends on the specific block cipher or hash function used. Since AES and the CBC mode of encryption can be performed in a GPU, at least the CBC-MAC using AES can be computed on the image. However, if a MAC is computed in the GPU on every frame in streaming video, this will noticeably degrade the rate at which frames can be displayed to the user, because small groups of pixels will be treated as blocks of data that have to be processed serially to compute the MAC, in contrast to the decryption step, which allows XORing all of a frame's pixels simultaneously with a segment of the key stream. If the display updates are small or there can be a slight delay before the update is visible to the user, as in the case of thin-client applications, then it may be possible to compute the MAC before displaying the update to the user. The GPU will have to be programmed to present an indication that a display update failed authentication for cases where the computed MAC does not match the value sent with the update.
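Returning to the key stream writes discussed earlier in this section, the difference between writing one byte at a time and one row at a time can be sketched as follows. The key stream generator below is a stand-in built on Python's `random` module purely so the example is self-contained; it is not RC4 and is not secure. All names are hypothetical.

```python
import random

def keystream_bytes(seed: int, length: int) -> bytes:
    # Stand-in generator for illustration only; NOT RC4 and NOT secure.
    rng = random.Random(seed)
    return bytes(rng.randrange(256) for _ in range(length))

def xor_by_rows(rows: list, seed: int):
    """XOR each row with the next key stream segment; count the writes."""
    total = sum(len(r) for r in rows)
    stream = keystream_bytes(seed, total)
    out_rows, offset, writes = [], 0, 0
    for row in rows:
        segment = stream[offset:offset + len(row)]
        offset += len(row)
        writes += 1  # one write of a whole row of key stream bytes
        out_rows.append(bytes(c ^ k for c, k in zip(row, segment)))
    return out_rows, writes

def writes_needed(width: int, height: int, bytes_per_pixel: int = 3):
    """Return (writes when sending each byte, writes when sending each row)."""
    return width * height * bytes_per_pixel, height
```

For a 500x500 image with 3 bytes per pixel, `writes_needed(500, 500)` gives 750,000 writes in the byte-at-a-time case and 500 in the row-at-a-time case, matching the figures quoted above.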
5.5
Experiments
To determine the feasibility of the architecture, two sets of experiments were conducted to measure the ability of current GPUs to sustain decryption rates compatible with the example applications. OpenGL was used as the API to the graphics card driver. No vendor specific OpenGL extensions were used, making the prototype GPU independent. GLUT was used to open the display window. The only requirement is that the GPU must support 32-bit "true color" mode, as the routine for decrypting the secret key requires representing bytes in a single pixel component. The code for the client consists of C, OpenGL and GLUT, compiled using Visual C++ version 6.0. The processes for the server and proxy are written in Java, using version 1.4.2_03 with the Java Cryptography Extension. The experiments utilized three different clients in order to test different GPUs. The environments were selected to represent a fairly current computing environment (at the time the experiments were performed), a laptop and a low-end GPU. In all cases, the display was set to use 32-bit true color with full hardware acceleration. The clients are:
1. A Pentium IV 1.8 GHz PC with 256KB RAM and an Nvidia GeForce3 Ti200 graphics card with 64MB of memory, running MS Windows XP. The GPU driver uses OpenGL version 1.4.0.
2. A Pentium Centrino 1.3 GHz laptop with 256KB RAM and an ATI Mobility Radeon 7500 graphics card with 32MB of memory, running MS Windows XP. The GPU driver uses OpenGL version 1.3.425.
3. A Pentium III 800 MHz PC with 256KB RAM and an Nvidia TNT2 M64 graphics card with 32MB of memory, running MS Windows 98. The GPU driver uses OpenGL version 1.4.0.

Streaming video applications, such as NetMeeting, were simulated by sending a stream of images from the server to the client. Tests were performed with frame sizes of 320x240 and 500x500 pixels. The frames were encrypted and stored in individual files on the server prior to starting the application. A small number of unique frames were created and the server repeatedly cycled through the set. This was done to provide a steady stream of images and to keep any delay in encrypting images on the server from affecting the measurements, which were focused on the client's performance. To measure thin-client performance, the average update size of 2,112 pixels (a 16x132 pixel area) was used. The average is from the distribution of update sizes in the standard i-Bench [86] web benchmark for thin clients. The update sizes in i-Bench range from 1x1 areas to 1,007x622 areas (626,354 pixels). All tests used images encoded as 24-bit RGB pixels, with 8 bits per color component.

For each image size, two types of tests were run. The first set of tests determined the delay due to the additional computation needed for the remote keying and decryption, compared to sending unencrypted images. In these tests, all three entities (server, proxy, and GPU) were run on the same PC or laptop. Each of the three clients was tested. The results of the first set of tests are shown in Figure 5.7.
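The frame and update sizes listed above translate into per-frame payloads that are easy to compute. The sketch below does the arithmetic for 24-bit RGB frames and, as an assumption for illustration, converts a chosen frame rate into a raw payload bit rate that ignores protocol headers and any compression; the names are hypothetical and the frame rate passed in is the caller's choice.

```python
BYTES_PER_PIXEL = 3  # 24-bit RGB, 8 bits per color component

def frame_bytes(width: int, height: int) -> int:
    """Size of one uncompressed frame in bytes."""
    return width * height * BYTES_PER_PIXEL

def required_mbps(width: int, height: int, fps: float) -> float:
    """Raw payload rate in megabits per second (10**6 bits), headers ignored."""
    return frame_bytes(width, height) * 8 * fps / 1_000_000

# The three sizes used in the experiments.
SIZES = {"16x132": (16, 132), "320x240": (320, 240), "500x500": (500, 500)}
```

For example, `frame_bytes(500, 500)` is 750,000 bytes, so ten such frames per second correspond to a raw payload of 60 megabits per second, while a 16x132 update is only 6,336 bytes.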
The second set of tests involved running each entity on separate systems on a local area network (LAN) to determine the overall performance when the data arrival rate was impacted by network delay. The first client with the Nvidia GeForce3 GPU was used for these tests. Figures 5.8 and 5.9 show the results of these experiments. Two tests were run using two different LANs. In one case, the server and proxy were dedicated to the experiment and there was no traffic leaving the server and proxy aside from that due to the experiment. In the second case, the tests were run on shared servers used for general purpose computing. In both cases, each element had a 100Mbps connection to the LAN. There were three hops between the client and server, and between the client and proxy; there were two hops from the proxy to the server. For all tests, the number of frames per second for both encrypted and unencrypted frames is provided. In video conferencing applications, the number
[Figure 5.7. Decryption Rates: All Entities on a Single System. Legend: A: client 1 unencrypted; B: client 1 encrypted; C: client 2 unencrypted; D: client 2 encrypted; E: client 3 unencrypted; F: client 3 encrypted. Horizontal axis: frame size in pixels (16x132, 320x240, 500x500).]
of frames supported per second is important: a minimum rate of 10 fps is required to obtain tolerable video and is typical in such applications, with 24 fps and higher rates required for better quality. In contrast, the rate of updates in thin-client applications is dependent on user requests and will be sporadic. The frames per second reflects the maximum burst rate supported. The intention of the experiments was not to build a robust streaming video application using the Real-Time Transport Protocol (RTP), which accounts for delay, rate of transmission and lost packets. Rather, the focus was to determine the feasibility of remote keying and decryption within the GPU, and to measure the resulting overhead. Therefore, TCP was used for all communication between the entities. When testing streaming images over the LAN, it was necessary for the client to signal the server when it was ready for the next frame to avoid synchronization problems. At least 99% of the delay when decrypting frames with RC4, compared to using unencrypted images, is due to the writing of the key stream bytes to the GPU. The key stream was written to the GPU one row at a time. When the test is run with the write eliminated (all other operations for the decryption
[Figure 5.8. Decryption Rates: Dedicated LAN and Client 1. Legend: A: client 1 unencrypted; B: client 1 encrypted. Horizontal axis: frame size in pixels (16x132, 320x240, 500x500).]
are still performed), the average time is the same as that for the unencrypted images. The actual computation of the key stream per frame, enabling the logical operation of XOR in the GPU and swapping of buffers, takes less than 1 ms for the 500x500 frames on all clients. When testing the average thin-client display size update (2,112 pixels), the times for the encrypted updates were the same as for the unencrypted updates because the key stream required only 16 writes to the GPU. In contrast, the 320x240 and 500x500 pixel frames required 240 and 500 writes per frame, respectively. The limiting factor in the processing of the 2,112-pixel updates is the time for the server to create the update (read the update from a file in the experiment). To determine the rate at which the client can process 2,112-pixel updates if creation of updates is not a limiting factor, an array containing 2,112 pixels was stored in memory on the server and repeatedly sent to the client. The server and client were running on the same system to eliminate network delays and bandwidth restrictions. The client can process over 500 updates per second on each of the three platforms, indicating that decryption overhead and the GPU
[Figure 5.9. Decryption Rates: Shared LAN and Client 2. Legend: A: client 1 unencrypted; B: client 1 encrypted. Horizontal axis: frame size in pixels (16x132, 320x240, 500x500).]
are not limiting factors for small updates. For larger updates in thin-client applications, an increased delay, e.g., when the entire display changes, is not considered to be an issue since such updates are typically infrequent and, from a human factors perspective, are no worse than loading of some web pages or opening of applications. When sending images over a LAN, the decreased rate for the 320x240 and 500x500 pixel frames compared to the case when all processes were on the same PC is due to the rate at which images are sent from the server to the client being limited by the bandwidth. Even if no bandwidth is consumed by protocols, a maximum of 16.66 uncompressed 500x500 RGB frames can be transmitted per second on a 100Mbps interface. The time for the remote keying is mainly dependent on the time to enter the password or insert the smart card into the proxy. This may take a few seconds if a password is entered. Aside from this, the time is dependent on the protocol used and on the transport delay between the entities. Using a public key encryption algorithm (RSA), generating random nonces and encrypting the
secret key with AES added approximately two seconds to the processing in each environment.
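The bandwidth ceiling of 16.66 frames per second quoted above follows directly from the frame size and the link rate. As a quick check, the bound can be computed as in the sketch below, which assumes 3 bytes per pixel (24-bit RGB, as used in the tests) and no protocol overhead; the function name is illustrative and not part of the prototype.

```python
def max_frames_per_second(width, height, link_bps=100_000_000, bytes_per_pixel=3):
    """Upper bound on uncompressed frames per second over a link.

    Assumes no protocol overhead and 24-bit RGB pixels by default.
    """
    bits_per_frame = width * height * bytes_per_pixel * 8
    return link_bps / bits_per_frame

# Frame sizes used in the experiments.
for w, h in [(16, 132), (320, 240), (500, 500)]:
    print(f"{w}x{h}: {max_frames_per_second(w, h):.2f} fps")
```

For the 500x500 case each frame is 6,000,000 bits, so the bound is 100/6 frames per second, which truncates to the 16.66 given in the text.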
5.6
Conclusions
The prototype addresses the feasibility of decrypting images and displays within a GPU as a way of combating the rising threat of spyware. The primary insight is that a suitably modified GPU can serve as a minimally trusted computing base for displays in certain types of widely used applications, such as video conferencing and remote desktop display access. The main mechanism in the scheme is decryption of frames exclusively inside the GPU, without storing either the key material or the plaintext in the system's main memory. The technique can protect against many types of spyware, as well as several attacks aimed at the human interface layer [44]. It was explained why this scheme cannot fully be realized due to current limitations of GPUs. Enhancements needed to GPUs to overcome these limitations were identified. The prototype demonstrates that the concept of GPU-based decryption is feasible for thin-client applications and the video broadcast in conferencing applications. To further improve performance when decrypting video in the GPU, image compression facilities will need to be implemented inside the GPU, a trend which is already occurring. In addition, the performance numbers show that for typical video conferencing frame rates, and web browsing and remote desktop access using thin clients, the lack of compression is not a bottleneck for the performance of the system.
Notes 1 The architecture and experiments for the remote keying of GPUs presented in Sections 5.3.2 and 5.5 were first presented in [16].
Chapter 6 RELATED ISSUES
6.1
Overview
In this chapter, topics related to the architecture and prototype presented in Chapter 5 are discussed. The architecture described in Chapter 5 focuses on securing images sent to an untrusted client. A complete system must also address the protection of user input on the client that is sent to the server and the protection of audio sent to the client. In addition, an alternative method for keying the GPU is provided. The architecture's susceptibility to man-in-the-middle attacks and phishing attacks is evaluated. The concept of executing cryptographic operations within a GPU can be used in conjunction with the trusted platform module (TPM) defined by the Trusted Computing Group (TCG). An overview of the TPM is provided and how the prototype can utilize the TPM is described. Another issue is where data compression is performed. Compression is unrelated to attacks against the client, but is impacted by moving encryption and decryption into the GPU.
6.2
Protecting User Input
The user responses on the untrusted client pose an interesting problem in that they require preventing input from the keyboard and mouse from being available to the untrusted operating system. One potential solution is to encrypt the keyboard inputs inside the keyboard itself (e.g., on the keyboard's USB controller). This requires a trusted keyboard, which is possible by using a portable folding keyboard that connects to a USB port, such as those available for several PDA devices. The mouse may be directly connected to the keyboard (e.g., a TrackPoint device, as is common with laptops), or input may only be taken from the keyboard. A pin can be used as the key to the cipher used for encrypting the inputs. The pin can be of sufficient length to thwart a brute force
attack. The server may either choose a pin for the user, displaying it securely to the user via GPU-based decryption, or have the user select a pin from a keypad displayed on the GPU. If the server selects the pin for use in the keyboard, the server merely sends the pin as an encrypted image to the client's GPU where it is decrypted and presented to the user who then enters it into the keyboard. The pin can be a relatively small, unpredictable area of the image. An attacker or malware attempting to modify the pin will at best have access to the encrypted image. Other possibilities include the use of graphical passwords [20,83] and shoulder-surfing-resistant PIN-entry methods [69]. Another option for conveying user input to an application on an untrusted client is the method described in [48] in which a trusted channel between a PDA (a cell phone) and the application requiring the input is used. The user's PDA provides a trusted device by which the user enters input. Graphically displayed keypads are used on some websites to allow a user to enter a pin to access his or her account by selecting values via mouse clicks. In some implementations the ordering of the values on the keypad changes after each mouse click. Variations of such displays can be used to set the pin for the keyboard and to provide a secret key to the GPU for use in a symmetric key cipher. The user can select a pin if the server displays a keypad to the user via the client's GPU. The keypad is sent encrypted from the server to the GPU where it is decrypted and displayed to the user. Then the user selects characters from the keypad by clicking on or entering a series of squares from the keypad, with the coordinates of the selections sent to the server. Even though the client's operating system can see the coordinates of the user's selections (since the keyboard and mouse inputs are not yet encrypted), it does not have access to the unencrypted keypad, making this information useless to an attacker.
To avoid guessing attacks based on the relative locations of the mouse pointer, the keypad configuration is changed every time a digit is selected as shown in Figure 6.1. The keypad can be spread across the display with each digit displayed in an arbitrary location determined by the server as shown on the right side of Figure 6.1 instead of in the traditional rectangle form. If an attacker or malware on the client attempts to alter the coordinates sent to the server, the altered values may not correspond to valid positions on the keypad. Minimizing the area of the display corresponding to digits will decrease the probability that malware can select coordinates that correspond to digits.
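The keypad exchange described above can be sketched from the server's point of view. The model below is simplified and rests on stated assumptions: the cell size, the display dimensions and the names `new_layout` and `digit_at` are hypothetical, rendering and encryption of the keypad image are omitted, and Python's `secrets.SystemRandom` stands in for whatever random source a server would actually use. The server draws a fresh layout before every selection, so a recorded click coordinate reveals nothing about the digit chosen, and a coordinate that misses every digit is rejected.

```python
import secrets

CELL = 40  # side length in pixels of each digit's square (assumed value)

def new_layout(display_w=640, display_h=480):
    """Place the digits 0-9 in distinct, randomly chosen grid cells.

    Returns a dict mapping digit -> (x, y) of the cell's lower-left corner.
    """
    cols, rows = display_w // CELL, display_h // CELL
    cells = [(c, r) for c in range(cols) for r in range(rows)]
    chosen = secrets.SystemRandom().sample(cells, 10)
    return {digit: (c * CELL, r * CELL) for digit, (c, r) in enumerate(chosen)}

def digit_at(layout, x, y):
    """Map a click coordinate to a digit, or None if it hits no digit."""
    for digit, (cx, cy) in layout.items():
        if cx <= x < cx + CELL and cy <= y < cy + CELL:
            return digit
    return None

# Simulate a user entering the digits 4 and 7.
pin = []
for wanted in (4, 7):
    layout = new_layout()              # the server changes the keypad each time
    x, y = layout[wanted]              # the user clicks inside that digit's cell
    pin.append(digit_at(layout, x + 1, y + 1))
print(pin)  # [4, 7]
```

Shrinking `CELL` relative to the display models the point made above: the smaller the area covered by digits, the less likely it is that coordinates altered by malware correspond to any digit.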
6.3
Keying the GPU
The idea of a user clicking on a keypad displayed on the screen can be used to convey the secret key used for decryption in the GPU in place of the remote keying protocol. If the key used for encrypting data changes after a certain
[Figure 6.1 panels: keypad when entering first digit; keypad when entering second digit. Each panel shows the digits 0 to 9 together with "enter" and "clear" in a different arrangement.]
Figure 6.1. Graphical Keypad for Digits: Each time the user selects a digit the key pad changes to prevent malware from associating coordinates with a specific digit.
number of frames, the keypad can be used to enter a session key that the server and GPU will use to establish the keys for encryption. If one key is used for encryption during the entire session, the user can enter it via the keypad. When the user is inputting the key, colored shapes can be displayed on the screen in place of traditional ASCII characters. The byte level representation of the color corresponds to key bytes. As the user clicks on shapes, the GPU copies the pixel value from the selected coordinate to the area of its memory where the key is stored. The positions of the shapes and how the color values are assigned to them will vary each time a key is entered or each time a selection is made to avoid the possibility of malware on the operating system recording the values, which can occur if the display is static and an adversary manually programs the display information into the malware. There is a human factors issue with using colors in place of ASCII characters, namely that a series of colors is more difficult to remember and distinguish (especially between shades that differ only in one bit) than characters. While the GPU needs the pixel value, the hex value of the pixel can be displayed within the shape to assist the user. Unlike an alpha-numeric pin, the secret key for a cipher can take on any value and typically consists of 128 or 256 bits. Entering a large segment of the key at once, such as by having the user select four 32-bit values, is infeasible
Figure 6.2. Graphical Keypad for Hex Values: The user enters a key by selecting the shape containing the hex value corresponding to the next four key bits. Color coding can be used to assist the user in locating a value. In this example, odd values are shaded using the blue color component with the other colors set to 0. Some shapes may not contain a value. Clicking on a shape will trigger the GPU to write a pixel whose color represents the hex value to the framebuffer. The GPU can be programmed to wait for multiple clicks before determining the pixel value.
because of the number (2^32) of colors that will need to be displayed to the user and a user's inability to distinguish and locate individual values from such a large number. In order to limit the number of colors that must be displayed, only one color component may be used (for example, have an 8-bit red value set and the green, blue and alpha components held constant as all 0's so they won't influence the color displayed). This still requires that 256 colors be displayed to the user, and these will be difficult for a user to search through, even with the hex values indicated, if they are in a random order. The green or blue color component can be used for some values instead of red to make it easier for a user to find specific values. For example, all values with the least significant bit set to 1 can be represented as a shade of red and all other values as a shade of blue. Fewer colors can be used to make it easier for the user to locate the correct value on the screen by increasing the number of values the user must select.
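The trade-off between the number of distinct values shown and the number of selections the user must make can be tabulated directly. The helper below is only an illustration (its name does not come from the text), and it assumes the key is entered in equal-sized chunks.

```python
def keypad_tradeoff(key_bits, bits_per_selection):
    """Return (values to display, selections needed) for a given chunk size."""
    if key_bits % bits_per_selection != 0:
        raise ValueError("key length must be a multiple of the chunk size")
    return 2 ** bits_per_selection, key_bits // bits_per_selection

# A 128-bit key entered 32, 8 or 4 bits at a time.
for bits in (32, 8, 4):
    shown, selections = keypad_tradeoff(128, bits)
    print(f"{bits:2d} bits per selection: {shown} values shown, {selections} selections")
```

Under these assumptions a 128-bit key needs 4 selections from 2^32 values, 16 selections from 256 values, or 32 selections from 16 values.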
Instead of the user selecting one byte at a time, only 16 values can be displayed to the user by using only the lower four bits. Now the user will enter the key as a series of 4-bit values instead of 8-bit values. Refer to Figure 6.2. GPUs can easily be programmed to fill in a pixel value based on a mouse click. The GPU can wait for multiple mouse clicks before populating the entire pixel value when the user is entering only 4 bits at a time. An alternative for how the GPU handles the 4-bit inputs is to alternate which 4 bits of a color component are used when creating the keypad. Then the pixels from two sequential selections can be XORed together to produce an 8-bit value that is stored in a single color component. For example, assume an 8-bit red pixel component is being used. Let kb refer to a pixel location that will store the first byte of the key the user enters. Display the hex values in the range of 0x00 to 0x0f to the user in shapes colored with pixels whose blue and green components are set to 0, and whose red component takes on the values 0x00 to 0x0f. Write the pixel value from the shape the user selects to kb. Then change the display so the hex values are now displayed using red values that are 0x00, 0x10, 0x20 ... 0xf0. Turn the logical operation of XOR on and write the pixel value from the shape the user selects to kb. If the user selected "7" and "4" sequentially from the display, the resulting pixel value in kb has a red component of 0x47, and blue and green components of 0.
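The two-step entry of one key byte can be modeled on the CPU as follows. This is a sketch only: `Pixel` and `write_pixel` are hypothetical stand-ins for a framebuffer location and a pixel write, and the `xor` flag plays the role of the XOR logical operation (in OpenGL, `glLogicOp(GL_XOR)` with `GL_COLOR_LOGIC_OP` enabled).

```python
from dataclasses import dataclass

@dataclass
class Pixel:
    red: int = 0
    green: int = 0
    blue: int = 0

def write_pixel(dest, src, xor=False):
    """Write src into dest; with xor=True, XOR it into the existing value."""
    if xor:
        dest.red ^= src.red
        dest.green ^= src.green
        dest.blue ^= src.blue
    else:
        dest.red, dest.green, dest.blue = src.red, src.green, src.blue

def enter_key_byte(first_digit, second_digit):
    """Combine two 4-bit selections into one 8-bit red component."""
    key_pixel = Pixel()
    # First keypad: shapes carry red components 0x00 .. 0x0f.
    write_pixel(key_pixel, Pixel(red=first_digit & 0x0F))
    # Second keypad: shapes carry red components 0x00, 0x10, ..., 0xf0,
    # and the write is performed with XOR turned on.
    write_pixel(key_pixel, Pixel(red=(second_digit & 0x0F) << 4), xor=True)
    return key_pixel

result = enter_key_byte(0x7, 0x4)
print(hex(result.red), result.green, result.blue)  # 0x47 0 0
```

Because the two keypads use disjoint halves of the red component, the XOR simply merges the two nibbles; selecting "7" and then "4" yields 0x47, as in the example in the text.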
6.4
Attacks
The scheme for remotely keying the GPU using the proxy as described in Chapter 5 is susceptible to a man-in-the-middle attack. This is because the proxy, server, and client are assumed to communicate over an untrusted network that includes the client's operating system, making it possible for an attacker to perform a man-in-the-middle attack using another system (a client #2 that has a GPU with a valid certificate) to perform the key exchange with the proxy device. Let client #1 refer to the client where the user wishes to see the display. Refer to Figure 5.4 for the protocol. Client #2, pretending to be client #1, sends the certificate to the proxy. The proxy may either be connected directly to client #1, such as by a USB port, or communicating with client #1 over a LAN or WAN. In the first case, malware running on the client will have to serve as an intermediate entity (i.e., a proxy, although this term is not used here to avoid confusion with the proxy involved in the protocol) in the communication between the proxy and client #2. Malware running on client #1 intercepts the certificate from client #1's GPU so it is not sent to the proxy (which instead receives client #2's certificate). The proxy, server and client #2 then complete the key establishment protocol, at which point client #2 will have the secret key needed to decrypt the displays. In addition, client #2 impersonates the proxy in communication with the GPU and establishes the same secret key with client #1's GPU. Client #1
and the server establish communication per the protocol, with client #1 sending the session request and request for images or display updates, and the server sending the images or display updates to client #1. The malware on client #1 copies the information received from the server and sends it to client #2 where it is decrypted and provided to the attacker. The images or display updates are also written to client #1's GPU as normally would occur so the user is unaware that the information has been copied and decrypted elsewhere. This attack is feasible because the proxy cannot verify that the GPU with which it is exchanging information resides on client #1. A possible solution is the use of packet leashes [33] in the context of the communication between the proxy and the GPU. A packet leash involves including an identification tag in each packet that allows the receiving entity to determine where the packet originated. However, this would place additional requirements on both the GPU and smart card, and increase their costs. The attack is not applicable when the GPU is keyed by the user selecting from a keypad as described in Section 6.3. Phishing attacks involving the redirection of web page requests are more difficult to perform in the architecture used in the remotely keyed GPU prototype. This is because, without a man-in-the-middle attack as described above, the phishing must be performed at the server, which will be referred to as server #1. Consider what happens if the phishing attack redirects requests from the client intended for server #1 to a web server, referred to as server #2, containing a fake web site. The display the user sees and the user's responses are normally encrypted. If the user's responses are redirected from the client, server #2 will not be able to decrypt them.
If server #2 attempts to send a web page or other display to the client, the client's GPU will attempt to decrypt it even though it is already plaintext and present a meaningless display to the user. Instead, the phishing attack must be able to redirect requests from the server intended for a valid web site to a fake web site provided by server #2. Server #1 will encrypt any web pages received from server #2 and send them to the client's GPU, where they will be properly decrypted and displayed to the user. Any information the user enters on a fake web page will be conveyed to server #2 via server #1. If server #1 is assumed to be trusted, in order to perform the redirection, the attacker must perform the redirection in the network, such as by false DNS entries, as opposed to methods that require access to server #1, such as modifying the host file or running malware on the server. Phishing attacks that operate by sending email to the user in an attempt to get the user to click on a url contained within the email for a fake web site can continue to work even if all displays the user sees pass through server #1 first to be encrypted. In a thin-client scenario where a user is reading email, server #1 will send display updates to the client reflecting the contents of the email. If the user clicks on a url in the email, server #1 will receive the request from the user, retrieve the web page from server #2, then send an encrypted display update
corresponding to the web page to the client, where it is decrypted in the GPU. Any information the user enters on the web page will be returned to server #2 via server #1.
6.5
Trusted Platform Module
It is worth considering how the use of a trusted GPU for protecting displays can be incorporated into or utilize the architecture defined by the TCG. The core component of the TCG's architecture is the TPM. First, an overview of the TPM is provided. Second, how the GPU-based decryption can utilize the TCG architecture is discussed. The TPM is a microcontroller that provides key generation and a mechanism for authentication of the platform or its components. Specifications of the TPM are available in [84]. The TPM functions include key storage, key generation and digital certificate storage. The intent is that the keys stored and generated by the TPM are more secure from both physical and software attacks than if this information was stored outside the TPM. It also stores digests of certain system measurements, referred to as integrity measurements, in its platform configuration registers (PCRs). The TPM contains a random number generator for use in key and nonce generation, support for RSA, HMAC and a hash function (the current specification indicates SHA-1). It may also contain code for measuring platform devices, but this function is allowed to be located outside the TPM if necessary for implementation reasons. Hardware implementations of TPMs must be tamper resistant and nonremovable. Hardware versions should be attached to the motherboard of the PC. Software implementations are required to have a level of tamper resistance equivalent to that of hardware implementations, although no recommendations or suggestions for how to obtain this are included in the TCG specifications. The presence of a TPM does not guarantee the security or safeness of all software executing on a platform because the TPM does not control what software can run or report the status of running software. Software does not have to be certified to run on a platform with a TPM and detection of most threats by applications is left to the operating system. 
When a system containing a TPM is started, the TPM will go through a self initialization phase then take the integrity measurements, but does not analyze these measurements. It is not the function of the TPM to detect potential threats from these measurements. Any analysis and action are left to components outside the TPM, such as the operating system: "The operating system program loader is the next logical soft component to measure a program prior to loading it. Since the operating system helps enforce system integrity, it is reasonable for the program loader to both measure and enforce policies describing unacceptable software configuration state." [84], page 25. An application can contain a policy defining what it considers to be a trusted platform configuration. The application may not
execute or may limit the permitted interaction with the platform if the platform fails to meet the policy. The TPM can be deactivated for a short period by an "operator" to allow interactions with the platform without the TPM. The TPM must have various credentials or certificates installed on it, and full realization of the TCG's architecture requires credentials installed in all system components and software. The TPM must be delivered with an endorsement credential embedded in it by the manufacturer that indicates the manufacturer, part number, TPM version number and that contains an endorsement public key, referred to as the endorsement key (EK). The EK is a unique value that identifies the TPM. Conformance credentials are added by an evaluator. These contain information about the platform and TPM manufacturers. A platform credential contains information about the manufacturer. An attestation identity credential (AIK) that has the private key used to sign PCR values and the AIK public key is also present in the TPM. The AIK also contains references or pointers to the manufacturer information in the other credentials. Each credential is signed by its creator. Manufacturers of individual components, such as the mouse, keyboard, adapters, GPU and software, can include a validation credential in their products that contains information about the manufacturer, some measured values and possibly a list of capabilities of the product. However, a validation credential is currently not required for a product to run on a platform with a TPM. Notice that the TPM requires the installation of several credentials, including a key unique to the TPM. The prototype in Chapter 5 requires a certificate be installed on the GPU, which is less difficult to implement than the credentials defined by the TCG. In fact, if GPU manufacturers fully participate in the TCG architecture, GPUs will have validation credentials added to them, at which point a certificate can be included. 
However, with the presence of a TPM and a GPU that utilizes the software interface to it, the TPM can generate the RSA keys required by the GPU for the remote keying protocol. Note that if the user can enter the key into the GPU via the keypad method, the RSA keys are not needed. The TCG architecture can be used to attest to whether or not the GPU has been modified, which provides a mechanism for a user to determine whether or not a GPU on a remote client can be trusted. This currently requires that the operating system itself be trusted (validated), which is not assumed when using the GPU-based decryption. If all software installed on a system must be validated and/or the operating system can be validated and does not allow copying of data by software not supplying credentials, this will significantly reduce the chance of spyware copying the display data if it is decrypted by a process running on the operating system. However, a shared, publicly available client is unlikely to have all software on it validated if the client's purpose is to serve as a general purpose client, because not all software, especially freeware, will adhere to the TCG architecture. Specifically, there will likely be software
without credentials that users will still want to use (and have valid reasons to run) on the client. In this case, GPU-based decryption is beneficial.
6.6
Data Compression
Traditionally, remote display and video conferencing systems have made extensive use of data compression in order to maximize network utilization and allow use of the application in bandwidth-limited environments. In most cases data compression is handled outside the GPU. Encrypted data, which ideally is pseudorandom bits, is not compressible. Any application involving encrypting data on a server and sending it to a client must compress then encrypt the data on the server. The client will decrypt the data then decompress it. Performing compression and decompression outside the GPU is not a concern when we are only trying to leverage the GPU's processing power as a cryptographic co-processor, since the data is returned to whatever application running on the CPU utilizes the data with no need to hide the plaintext from the operating system. However, when the goal is to protect the plaintext from spyware on an untrusted system, decrypting display data in a GPU serves no purpose if the data then needs to be read back to the untrusted system for decompression. A straightforward solution would be to add hardware decompression abilities to the GPU. This could be accomplished by using widely available data decoding chips, such as MPEG hardware decoders; indeed, several DVD-ready GPUs contain such logic already. An alternative approach, in particular for thin-client scenarios, would be to tailor the display protocol and its compression to use operations available in the GPU. More recent thin-client systems have proposed remote display protocols that employ different types of commands and compression algorithms for different kinds of display updates [74]. The advantage of this approach derives from the characteristics of the protocol commands that provide inherent compression, negating the need for additional, specialized compression algorithms.
For example, a command that instructs the client to fill a rectangular region with a particular color consumes very little bandwidth while compressing a potentially large region of the screen. For instance, draw(10,20,50,50) 0xef557777 draws a 50 by 50 pixel rectangle with the lower left corner at (x,y) coordinates (10,20) and fills it with the color 0xef557777, compared to sending 2500 32-bit pixel values of 0xef557777. Execution of such a command is clearly within the operations available in existing GPUs. By appropriately designing the remote display protocol to utilize similar operations, it is possible to improve the architecture to consume reasonable bandwidth without compromising security.
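To make the saving concrete, the sizes of the two encodings can be compared. The wire format below is purely hypothetical (a one-byte opcode, four 16-bit integers and one 32-bit color); the text does not specify how such a command would be encoded.

```python
import struct

def fill_command(x, y, width, height, color):
    """Encode a hypothetical fill-rectangle command: opcode, 4 x uint16, uint32."""
    return struct.pack("<BHHHHI", 0x01, x, y, width, height, color)

def raw_pixels(width, height, color):
    """Encode the same region as one 32-bit value per pixel."""
    return struct.pack("<I", color) * (width * height)

cmd = fill_command(10, 20, 50, 50, 0xEF557777)
raw = raw_pixels(50, 50, 0xEF557777)
print(len(cmd), "bytes as a command versus", len(raw), "bytes as pixels")
```

With this assumed encoding the command occupies 13 bytes, while the 2500 pixel values occupy 10,000 bytes.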
Chapter 7 EXTENSIONS
7.1
Overview
In this chapter extensions to the work described earlier are presented. The first topic is the design of a symmetric key cipher for use in a GPU. An overview of how a stream cipher may be created using graphics operations is presented. The second topic is the protection of audio from malware on an untrusted client through DSP-based encryption.
7.2
Graphics-based Cipher
A GPU-based cipher would not only benefit the thin-client applications and remote video display discussed in previous chapters, but could also serve as a general purpose cipher in any system containing a GPU. A key stream is generated by mapping a texture exhibiting sufficient randomness to a continuously morphing image while changing certain variables, such as viewpoint and lighting, and extracting pixels from the image. In this case the key stream is never within the client's memory unless it is read from the GPU for use in an application running on the CPU. Experiments with an initial version were performed in order to estimate the time to compute the key stream.

The first step is to generate the initial texture. Given a secret key, an initial pseudorandom texture can be generated that will serve as the seed texture for the GPU-based stream cipher. An existing stream cipher or random-bit generator (using the secret key as the seed) can be run to create enough bits to fill the targeted display size. The output is converted to pixels and used as the texture. A second option is to encrypt an image or any data of sufficient length using an existing block cipher and use the resulting ciphertext as the texture. In both options, the texture can be computed in advance and then treated as if it were the secret key, or the texture generation can be viewed as the first step in the GPU-based stream cipher. For applications using the GPU to generate and apply a key stream to data used in other applications (i.e., not for encrypting or decrypting displays), the time to generate the initial texture may be less of an issue than in applications involving real time display updates. When decryption in real time is required, the initial texture should be generated before the user expects the first image, to avoid any perceived delay in displaying it.

Once the initial texture is generated, it is mapped to one or more three dimensional objects whose surface encompasses at least the entire viewing area. The objects are then manipulated. The goal is to find a series of operations which produce pseudorandom pixels. The entire resulting image does not need to be pseudorandom; instead a subset of pixels from it can be added to the key stream after each iteration of the steps. The more pixels that can be extracted after each iteration, the faster the key stream is generated. If neighboring pixels are extracted for the key stream, dithering must be disabled when generating the image.

In order for the same pixel values to appear in the same location on multiple GPUs running the stream cipher with the same key, the pixel size and resolution of the display must be identical in the GPUs. Rounding must also be considered if the key stream is to be reproduced on different graphics cards. For example, if a vertex program is used that alters the location of an object's vertices as part of the manipulation, the resulting coordinates may be impacted by rounding. The coordinates will not be exact if an equation can produce a coordinate with a fractional part before rounding. If a vertex ends up with (x,y,z) coordinates of (100.3000,100.4999,0), the pixel at (x,y) coordinates (100,100) will take the color of that vertex in the object.
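The sensitivity of pixel placement to the conversion rule can be illustrated with a minimal sketch. The two helper functions below are hypothetical and model only the two simplest conversion rules (round to nearest and truncate); actual GPUs may use other rasterization rules, so this is an illustration of the portability concern rather than a description of any particular card.

```c
#include <assert.h>

/* Pixel index chosen when a non-negative coordinate is rounded to the
   nearest integer (halves round up). */
long pixel_by_rounding(double coord)
{
    assert(coord >= 0.0);
    return (long)(coord + 0.5);
}

/* Pixel index chosen when the fractional part is simply discarded. */
long pixel_by_truncation(double coord)
{
    assert(coord >= 0.0);
    return (long)coord;
}
```

Under rounding, a y coordinate of 100.4999 selects row 100 while 100.5000 selects row 101; under truncation, 100.999 still selects row 100. Two GPUs that differ in either the rule or the last digits of the coordinate can therefore place the same vertex on different pixels.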
Slight differences in precision between GPUs can result in small discrepancies in the images that are undetectable by the human eye but that cause some pixel values to differ when the image is generated in different GPUs. For example, if the y coordinate becomes 100.5000 instead of 100.4999, the pixel at location (100,101) on the display will be impacted instead of the one at (100,100). If the GPU truncates values instead of rounding them, the values 100.999 and 101 will produce coordinate values of 100 and 101, respectively. One idea that can eliminate the effects of vertices or shapes differing slightly in location due to rounding is to use a texture consisting of blocks of different, pseudorandom colors. After an iteration of the steps manipulating the objects, pixels from the center of blocks at certain locations are added to the key stream.

To estimate the time required for computing a key stream designed for the GPU, an initial image of a cube was loaded into the GPU with a random texture. The texture consisted of pixels formed from bytes generated by the RC4 stream cipher. The cube was rotated, and its position, orientation and the angle from which it was viewed were altered. The lighting and fog settings were also changed. The time to execute all of the OpenGL operations under consideration was
measured. After each series of executions, the resulting image is the key stream and is XORed with the current encrypted frame. The execution time per frame is less than 1 ms, indicating that any difference between the time to process encrypted frames and the time to process unencrypted frames will be imperceptible to the user.

In proposing to design a new stream cipher suitable for executing within GPUs, it must be ensured that the cipher can also be efficiently implemented on the server for the cases where a server is encrypting data before sending it to a client which uses a GPU for decryption. If the encryption algorithm is such that it must run in a GPU, the server can encrypt the update by writing the image to its GPU and reading the result; otherwise, the server can perform the encryption in its operating system. In video conferencing applications, the images being encrypted may appear on the monitor of the speaker and can be encrypted in the GPU before being sent to the server or other conference participants.
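As a CPU-side illustration of two steps described in this section, the sketch below generates pseudorandom bytes with RC4 (the cipher used to fill the seed texture in the experiment) and XORs a key stream with a frame. It is a minimal sketch, not the GPU-based cipher itself: the names Rc4State, rc4_init, rc4_fill and xor_frame are illustrative, and the RC4 output stands in here for the key stream that the GPU would produce by manipulating the textured objects.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t s[256];
    uint8_t i, j;
} Rc4State;

/* Standard RC4 key scheduling. */
void rc4_init(Rc4State *st, const uint8_t *key, size_t keylen)
{
    uint8_t j = 0, t;
    size_t n;

    assert(keylen > 0 && keylen <= 256);
    for (n = 0; n < 256; ++n) {
        st->s[n] = (uint8_t)n;
    }
    for (n = 0; n < 256; ++n) {
        j = (uint8_t)(j + st->s[n] + key[n % keylen]);
        t = st->s[n]; st->s[n] = st->s[j]; st->s[j] = t;
    }
    st->i = 0;
    st->j = 0;
}

/* Fill buf with RC4 output bytes, e.g. the bytes of a seed texture. */
void rc4_fill(Rc4State *st, uint8_t *buf, size_t len)
{
    uint8_t t;
    size_t n;

    for (n = 0; n < len; ++n) {
        st->i = (uint8_t)(st->i + 1);
        st->j = (uint8_t)(st->j + st->s[st->i]);
        t = st->s[st->i]; st->s[st->i] = st->s[st->j]; st->s[st->j] = t;
        buf[n] = st->s[(uint8_t)(st->s[st->i] + st->s[st->j])];
    }
}

/* XOR a frame in place with a key stream of the same length. Applying
   the same key stream a second time restores the original frame. */
void xor_frame(uint8_t *frame, const uint8_t *keystream, size_t len)
{
    size_t n;

    for (n = 0; n < len; ++n) {
        frame[n] ^= keystream[n];
    }
}
```

Because encryption and decryption are the same XOR, the server and the client only need to produce identical key stream bytes, which is why the reproducibility issues discussed above matter.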
7.3
Encryption within DSPs
Performing encryption within a GPU exemplifies the concept of performing cryptographic operations outside the CPU. This concept can be extended to audio when using programmable digital signal processors (DSPs). In addition to images, video conferencing and certain remote desktop applications exchange audio between the server and remote clients, or between clients. Audio can be encrypted and decrypted in the DSP so the CPU on the client only has access to the encrypted audio stream.

Implementing encryption within a programmable DSP is significantly easier than implementing GPU-based encryption because of the operations supported in programmable DSPs. Texas Instruments (TI) programmable DSPs, such as their TMS320C55X series, include a CPU and up to 16 MB of memory. The operations supported include the typical byte-level operations found in symmetric key ciphers. Bytes can be processed within the DSP as they would be within the system's CPU; as a result, there is no need to derive alternate representations of existing symmetric key ciphers or devise a new cipher to work in the DSP. Software development kits (SDKs) assist in moving encryption into a programmable DSP. In some SDKs, such as the SDK for the TI TMS320C55X series, code can be written in C or C++ and converted into assembly language for the DSP, as opposed to programming directly in the DSP's assembly language. Programming directly in assembly language may be needed to fully optimize the program.

Public key ciphers involving large integers are not directly supported in programmable DSPs because there is no support for large integers. For example, there is no equivalent of the C/C++ GMP library or the Java BigInteger class in programmable DSPs. Therefore, conveying a secret key for use in a symmetric key cipher to a DSP via a protocol using public key encryption is an issue.
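To give a sense of what large integer support involves, the following sketch adds two multi-word integers stored as arrays of 16-bit limbs. It is only an illustration of the kind of routine a library such as GMP supplies and that would have to be written by hand for a DSP; the function name bigint_add and the little-endian limb layout are assumptions made for this example, and a usable public key implementation would also need multiplication and modular reduction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Computes sum = a + b, where a, b and sum are arrays of nlimbs 16-bit
   limbs stored least significant limb first. Returns the carry out of
   the most significant limb (0 or 1). */
uint16_t bigint_add(uint16_t *sum, const uint16_t *a, const uint16_t *b,
                    size_t nlimbs)
{
    uint32_t carry = 0;
    size_t i;

    assert(sum != NULL && a != NULL && b != NULL);
    for (i = 0; i < nlimbs; ++i) {
        uint32_t t = (uint32_t)a[i] + (uint32_t)b[i] + carry;
        sum[i] = (uint16_t)(t & 0xffffu); /* low 16 bits stay in this limb */
        carry = t >> 16;                  /* overflow moves to the next limb */
    }
    return (uint16_t)carry;
}
```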
An alternative to using a remote keying protocol with DSPs is to convey the key via audio. The user could speak the key (assuming no one is within range to eavesdrop), although speech may be difficult to convert into precise key bits given variations in the human voice, and the logic to deal correctly with any fluctuations must fit into the DSP. A more realistic option is to play a series of tones using a PDA, which has substantially less variation than the human voice. This is similar to the idea of a user clicking on shapes to convey a key directly to the GPU; now the input is audio. Compared to the direct keying of the GPU via mouse clicks, this method increases the potential for an adversary in close proximity to the client to determine the key, because the key can be recorded by a hidden device in the vicinity of the client instead of requiring that the adversary see the user's keypad selections.
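One way the tone-based keying idea could be realized is sketched below, under the assumption of a hypothetical scheme in which each of 16 equally spaced frequencies encodes one 4-bit value and two tones form a key byte. The base frequency, the spacing and the function names are illustrative choices only; the text does not specify an encoding, and measuring the frequency of a received tone is outside the scope of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TONE_BASE_HZ 1000.0 /* assumed frequency encoding the value 0 */
#define TONE_STEP_HZ 100.0  /* assumed spacing between adjacent values */

/* Map a measured frequency to the nearest of the 16 tone values, which
   tolerates deviations of up to half a step in either direction. */
int nibble_from_tone(double measured_hz)
{
    double k = (measured_hz - TONE_BASE_HZ) / TONE_STEP_HZ;

    if (k < 0.0) {
        k = 0.0;
    }
    if (k > 15.0) {
        k = 15.0;
    }
    return (int)(k + 0.5);
}

/* Build nbytes key bytes from 2*nbytes measured tones, high nibble first. */
void key_from_tones(const double *tones_hz, size_t nbytes, uint8_t *key)
{
    size_t i;

    assert(tones_hz != NULL && key != NULL);
    for (i = 0; i < nbytes; ++i) {
        int hi = nibble_from_tone(tones_hz[2 * i]);
        int lo = nibble_from_tone(tones_hz[2 * i + 1]);
        key[i] = (uint8_t)((hi << 4) | lo);
    }
}
```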
Chapter 8 CONCLUSIONS
8.1
Summary
The use of GPUs for cryptographic processing was investigated to determine if GPUs can be used to offload processing from the CPU and if GPU-based encryption and decryption can assist in protecting data on untrusted clients. GPUs provide a significant amount of parallel processing compared to any existing multi-CPU configuration. Data can be stored in pixels that are processed in parallel. While the programmability of GPUs has been increasing, GPUs are not designed to be general purpose processors and the algorithms that can be implemented in them are limited. The addition of a programmable pixel processor and larger supported pixel sizes increases the potential for using GPUs as general purpose processors, but common capabilities of CPUs are still missing. This is partially due to the APIs for GPUs and partially due to hardware limitations.

The implementation of AES demonstrates that GPU-based encryption and decryption is possible with a symmetric key cipher. However, public key ciphers and some symmetric key ciphers involve data types and/or operations that cannot be programmed within existing GPUs. Other ciphers that can be programmed within a GPU require multiple steps to perform some basic operations, such as shifts.

The prototype of the remotely keyed GPU demonstrates the concept of decrypting images and displays within a GPU as a means of combating spyware. It is applicable to scenarios where untrusted clients are used to access remote desktops or to participate in video conferences. The primary insight is that a suitably modified GPU can serve as a minimally trusted computing base for displays in these applications. The main mechanism in the scheme is decryption of frames exclusively inside the GPU, without storing either the key material
or the plaintext in the system's main memory. Graphical keypads can be used as an alternative method for keying the GPU.

The following enhancements to GPUs and/or APIs are needed to easily program existing cryptographic operations within GPUs:
• Support for modular arithmetic for use in both asymmetric and symmetric key ciphers.
• Support for a data type of unsigned integer for use in symmetric key ciphers.
• Support for byte-level operations including rotations and shifts across single bytes, individual color components of pixels and entire pixels for use in symmetric key ciphers.
• Support for branching in the pixel processor for use in symmetric key ciphers.
• Support for using the value of a pixel and the value of a color component of a pixel as an argument in operations, which is required for the conditional statements found in most stream ciphers.
• Support for large integers to allow asymmetric key ciphers to be executed in the GPU.

Support for additional operations, including branching and byte-level operations, is feasible. Support for new data types, especially large integers, is less likely.

Aside from the above capabilities for programming the cryptographic algorithms in a GPU, the following new GPU capabilities are needed to fully realize the architecture presented in Chapter 5. The second item is not required when using the keypad method to convey a secret key to the GPU instead of a proxy and remote keying protocol.
• An easy mechanism for blocking malware on a system from reading unencrypted data from a GPU is required for the architecture to be useful for protecting displays on untrusted systems. This can be accomplished with a capability for a process to temporarily disable the GPU from responding to any write or read command issued by another process (i.e., disabling of the ability to read data from the GPU by all but one process).
• If a protocol involving the use of a public key for the GPU is used to establish the secret key, a defined location for storing a public key or certificate with additional information in the GPU would be useful instead of loading the public key into memory the CPU uses for general operations. Dedicated storage will be needed for the credentials required by the TCG architecture and a certificate can be included.
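For reference, the byte-level operations named in the first list are single expressions in C on a CPU, which is part of why their absence in GPU APIs is felt when porting ciphers. The sketch below shows a rotation within a single byte, a rotation across an entire 32-bit pixel, and modular addition on unsigned integers; the function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a single byte left by n bit positions (n is reduced mod 8). */
uint8_t rotl8(uint8_t v, unsigned n)
{
    n &= 7u;
    return (uint8_t)((v << n) | (v >> ((8u - n) & 7u)));
}

/* Rotate an entire 32-bit pixel left by n bit positions (n mod 32). */
uint32_t rotl32(uint32_t v, unsigned n)
{
    n &= 31u;
    return (v << n) | (v >> ((32u - n) & 31u));
}

/* Addition modulo 2^8, as used in byte-oriented stream ciphers. Unsigned
   arithmetic in C wraps around, so the cast performs the reduction. */
uint8_t add_mod256(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);
}
```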
8.2
Suggested Projects
Several extensions to the experiments described in Chapters 4 and 5 are possible. These may serve as exercises for students. The first set of exercises below builds upon the OpenGL version of AES. These will familiarize students with programming encryption in a GPU.
1 When AES was implemented in OpenGL, there was no support for 64-bit and 128-bit pixels. Using a graphics card that supports 64-bit or 128-bit pixels, modify the implementation to use 64-bit or 128-bit pixels to process more blocks of data simultaneously compared to what was achieved using 32-bit pixels.
2 An implementation that processes identical blocks of data, each with a different key, can be created. Given a plaintext, ciphertext pair and a partial key, use the GPU to perform an exhaustive search on the remainder of the key by encrypting the plaintext (or decrypting the ciphertext) with the possible keys and then checking the resulting ciphertext (or plaintext) against the known value. Compare the time it takes to find the remaining key bits to an exhaustive search using the CPU. Determine a reasonable number of keys to test in parallel based on the GPU being used and the display size, then set the number of known key bits accordingly.
3 Implement the modes of encryption described in Chapter 4 to run with AES in the GPU. The OpenGL version of AES was run in ECB mode. The other modes can easily be implemented in OpenGL.
4 Implement AES's key schedule in OpenGL.
5 Implement AES's decryption function in OpenGL. The code for the encryption function can be modified and the data layout shown in Figure 4.6 in Chapter 4 used. The tables for encryption provided in the appendix will have to be replaced with the tables needed for decryption. Decrypt the test value from FIPS 197 to verify the code is working. The test value is included in the encryption code in Appendix A.
The following exercises involve experimenting with ideas described in previous chapters.
1 It may be feasible to design a symmetric key cipher (most likely a stream cipher) for GPUs. Experiment with graphic operations to produce a key stream and use it to encrypt images in a GPU. Test the randomness of the key stream bits. Note: http://csrc.nist.gov/rng/ includes descriptions of tests for detecting non-randomness in binary sequences. Determine if the implementation works on different GPUs by encrypting the image on one GPU and decrypting it on another or by computing the key stream in different GPUs, reading the pixels to the system memory and comparing the resulting bytes. Recall that rounding within a GPU may impact the exact pixel values produced. Therefore, an algorithm may not be portable amongst different GPUs unless steps are taken to avoid error due to rounding.
2 Implement a version of the method for keying the GPU described in Chapter 6 involving a user selecting from colored squares on the display.
3 The concept of encryption within devices on PCs can be extended to encrypt audio in programmable DSPs. Using a programmable DSP, implement a symmetric key cipher within the DSP and demonstrate the encryption and decryption of audio within the DSP.
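As a starting point for the randomness testing mentioned in the first exercise of this list, the sketch below implements the simplest of the NIST tests, the frequency (monobit) test, on key stream bytes read back from the GPU. It is a minimal sketch with an illustrative function name. To avoid a dependency on erfc, it compares the squared statistic against 6.635 times the number of bits, which corresponds to the usual P-value threshold of 0.01 for this test; passing it only shows that ones and zeros are roughly balanced and says nothing about other forms of non-randomness.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Frequency (monobit) test. Each 1 bit counts +1 and each 0 bit counts
   -1; for random data the sum s stays small relative to sqrt(n).
   Returns 1 if s*s <= 6.635*n (equivalent to P-value >= 0.01), else 0.
   The test is intended for sequences of at least about 100 bits. */
int monobit_passes(const uint8_t *buf, size_t len)
{
    long ones = 0;
    long n, s;
    size_t i;
    int b;

    assert(buf != NULL && len > 0);
    for (i = 0; i < len; ++i) {
        for (b = 0; b < 8; ++b) {
            ones += (buf[i] >> b) & 1;
        }
    }
    n = (long)len * 8;
    s = 2 * ones - n;
    return (double)s * (double)s <= 6.635 * (double)n;
}
```

A buffer of all zero bytes fails the test, while a strictly alternating pattern such as repeated 0xAA bytes passes it despite being obviously non-random, which is why the remaining tests in the suite are also needed.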
Appendix A AES OpenGL Code for Encryption
A.1
Overview
This appendix contains code for an OpenGL version of the AES encryption function. The code encrypts identical copies of a 16-byte data block. The block of data and expanded key are predefined and written to the GPU. The pixel format used is 32 bits per pixel with 8 bits per color component. Two versions of the program are provided. The first version uses the red pixel component and the back buffer. It performs the operations in the back buffer then displays the final result to the front buffer. The second version uses the red, green and blue pixel components and the front buffer. It performs the operations in the front buffer, allowing the user to see the pixels being updated.
A.2
Version Using the Red Pixel Component and the Back Buffer
/*
 * THIS SOFTWARE IS PROVIDED BY THE AUTHORS "AS IS" AND ANY EXPRESS OR IMPLIED
 * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
 * MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO
 * EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
 * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
 * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA,
 * OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
 * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
 * NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE,
 * EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 *
 * Example AES implementation using OpenGL and the red pixel component.
 * All work is performed in the back buffer.
 * The key expansion is not performed in the GPU.
 * In this sample code, the data to be encrypted and the expanded key are
 * predefined. They are taken from FIPS 197.
 *
 * Press 'e' to trigger encryption. One block of the resulting ciphertext
 * will be printed to the window from which the program is executed to allow
 * the user to verify the data was correctly encrypted.
 *
 * This code runs with Microsoft Visual C++ and requires OpenGL and GLUT.
 *
 * The data block from FIPS 197 and its corresponding ciphertext:
 * data:       32 88 31 e0 43 5a 31 37 f6 30 98 07 a8 8d a2 34
 * ciphertext: 39 02 dc 19 25 dc 11 6a 84 09 85 0b 1d fb 97 32
 *
 * The expanded key is defined below in the array ekey.
 */

#include <stdio.h>
#include <stdlib.h>
#include <GL/glut.h>

/* # of blocks encrypted simultaneously in one pixel component */
#define NBLKS 500
/* the expanded key is loaded starting at pixel (KEY_START_POS,0) */
#define KEY_START_POS 16
/* 16 bytes in 128 bit block */
#define BYTES_PER_BLK 16
/* 176 bytes in expanded key */
#define EKEY_BYTES 176

/* contains data that will be encrypted */
GLubyte data[BYTES_PER_BLK*NBLKS];
/* contains ciphertext */
GLubyte out_data[BYTES_PER_BLK*NBLKS];

/* expanded key */
GLubyte ekey[EKEY_BYTES] = {
    /* initial whitening */
    0x2b,0x28,0xab,0x09,0x7e,0xae,0xf7,0xcf,
    0x15,0xd2,0x15,0x4f,0x16,0xa6,0x88,0x3c,
    /* 1st round key */
    0xa0,0x88,0x23,0x2a,0xfa,0x54,0xa3,0x6c,
    0xfe,0x2c,0x39,0x76,0x17,0xb1,0x39,0x05,
    /* 2nd round key */
    0xf2,0x7a,0x59,0x73,0xc2,0x96,0x35,0x59,
    0x95,0xb9,0x80,0xf6,0xf2,0x43,0x7a,0x7f,
    /* 3rd round key */
    0x3d,0x47,0x1e,0x6d,0x80,0x16,0x23,0x7a,
    0x47,0xfe,0x7e,0x88,0x7d,0x3e,0x44,0x3b,
    /* 4th round key */
    0xef,0xa8,0xb6,0xdb,0x44,0x52,0x71,0x0b,
    0xa5,0x5b,0x25,0xad,0x41,0x7f,0x3b,0x00,
    /* 5th round key */
    0xd4,0x7c,0xca,0x11,0xd1,0x83,0xf2,0xf9,
    0xc6,0x9d,0xb8,0x15,0xf8,0x87,0xbc,0xbc,
    /* 6th round key */
    0x6d,0x11,0xdb,0xca,0x88,0x0b,0xf9,0x00,
    0xa3,0x3e,0x86,0x93,0x7a,0xfd,0x41,0xfd,
    /* 7th round key */
    0x4e,0x5f,0x84,0x4e,0x54,0x5f,0xa6,0xa6,
    0xf7,0xc9,0x4f,0xdc,0x0e,0xf3,0xb2,0x4f,
    /* 8th round key */
    0xea,0xb5,0x31,0x7f,0xd2,0x8d,0x2b,0x8d,
    0x73,0xba,0xf5,0x29,0x21,0xd2,0x60,0x2f,
    /* 9th round key */
    0xac,0x19,0x28,0x57,0x77,0xfa,0xd1,0x5c,
    0x66,0xdc,0x29,0x00,0xf3,0x21,0x41,0x6e,
    /* 10th round key */
    0xd0,0xc9,0xe1,0xb6,0x14,0xee,0x3f,0x63,
    0xf9,0x25,0x0c,0x0c,0xa8,0x89,0xc8,0xa6,
};

/* The T tables are written as 3 tables (1*Sbox, 2*Sbox, 3*Sbox)
   in order to process data in 1 byte segments as a single pixel
   color component instead of processing 4 bytes.
   Values are converted to floating point by dividing by 255 then
   adding 0.000001. The addition of 0.000001 is because OpenGL stores
   the pixels as floating point values and truncates the values when
   converting from floating point to integer format. This conversion
   in format occurs when using a color component as an index into the
   color map. Therefore, each value needs to be >= the corresponding
   integer but less than the next integer to avoid errors due to
   rounding. 0, 1 are set to exactly 0 and 1.
*/
const GLfloat Te1[256] = {
    0.388237,0.486276,0.466668,0.482354,0.949021,0.419609,0.435295,0.772550,
    0.188236,0.003922,0.403923,0.168629,0.996080,0.843139,0.670590,0.462746,
    0.792158,0.509805,0.788236,0.490197,0.980394,0.349020,0.278433,0.941177,
    0.678432,0.831374,0.635296,0.686276,0.611766,0.643138,0.447059,0.752942,
    0.717648,0.992158,0.576472,0.142039,0.211766,0.247060,0.968629,0.800001,
    0.203923,0.647060,0.898040,0.945099,0.443138,0.847060,0.192158,0.082354,
    0.015687,0.780393,0.137256,0.764707,0.094119,0.588237,0.019609,0.603923,
    0.027452,0.070589,0.501962,0.886276,0.921570,0.152942,0.698041,0.458824,
    0.035295,0.513727,0.172550,0.101961,0.105884,0.431373,0.352942,0.627452,
    0.321569,0.231374,0.839217,0.701962,0.160785,0.890197,0.184315,0.517648,
    0.325491,0.819609,0.000000,0.929413,0.125491,0.988236,0.694118,0.356864,
    0.415687,0.796080,0.745100,0.223530,0.290197,0.298040,0.345099,0.811766,
    0.815687,0.937256,0.666668,0.984315,0.262746,0.301962,0.200001,0.521569,
    0.270588,0.976471,0.007844,0.498040,0.313727,0.235295,0.623531,0.658825,
    0.317648,0.639217,0.250980,0.560786,0.572551,0.615687,0.219609,0.960785,
    0.737256,0.713727,0.854903,0.129413,0.062746,1.000000,0.952942,0.823531,
    0.803922,0.047060,0.074511,0.925491,0.372550,0.592158,0.266668,0.090197,
    0.768628,0.654903,0.494118,0.239216,0.392158,0.364707,0.098040,0.450982,
    0.376472,0.505883,0.309805,0.862746,0.133334,0.164706,0.564707,0.533334,
    0.274511,0.933335,0.721570,0.078432,0.870590,0.368629,0.043139,0.858825,
    0.878432,0.196079,0.227451,0.039216,0.286275,0.023531,0.141177,0.360785,
    0.760786,0.827452,0.674511,0.384314,0.568628,0.584315,0.894119,0.474511,
    0.905884,0.784315,0.215688,0.427452,0.552942,0.835295,0.305884,0.662746,
    0.423530,0.337256,0.956864,0.917649,0.396079,0.478432,0.682354,0.031374,
    0.729413,0.470589,0.145099,0.180394,0.109805,0.650982,0.705883,0.776472,
    0.909805,0.866667,0.454903,0.121570,0.294119,0.741177,0.545099,0.541178,
    0.439217,0.243139,0.709805,0.400001,0.282354,0.011766,0.964707,0.054903,
    0.380393,0.207844,0.341178,0.725491,0.525492,0.756864,0.113726,0.619609,
    0.882354,0.972550,0.596079,0.066667,0.411765,0.850981,0.556864,0.580393,
    0.607844,0.117649,0.529413,0.913726,0.807845,0.333334,0.156863,0.874511,
    0.549021,0.631373,0.537256,0.050981,0.749021,0.901962,0.258824,0.407844,
    0.254903,0.600001,0.176471,0.058825,0.690197,0.329413,0.733335,0.086276
};

const GLfloat Te2[256] = {
    0.776472,0.972550,0.933335,0.964707,1.000000,0.839217,0.870590,0.568628,
    0.376472,0.007844,0.807845,0.337256,0.905884,0.709805,0.301962,0.925491,
    0.560786,0.121570,0.537256,0.980394,0.937256,0.698041,0.556864,0.984315,
    0.254903,0.701962,0.372550,0.270588,0.137256,0.325491,0.894119,0.607844,
    0.458824,0.882354,0.239216,0.298040,0.423530,0.494118,0.960785,0.513727,
    0.407844,0.317648,0.819609,0.976471,0.886276,0.670590,0.384314,0.164706,
    0.031374,0.584315,0.274511,0.615687,0.188236,0.215688,0.039216,0.184315,
    0.054903,0.141177,0.105884,0.874511,0.803922,0.305884,0.498040,0.917649,
    0.070589,0.113726,0.345099,0.203923,0.211766,0.862746,0.705883,0.356864,
    0.643138,0.462746,0.717648,0.490197,0.321569,0.866667,0.368629,0.074511,
    0.650982,0.725491,0.000000,0.756864,0.250980,0.890197,0.474511,0.713727,
    0.831374,0.552942,0.403923,0.447059,0.580393,0.596079,0.690197,0.521569,
    0.733335,0.772550,0.309805,0.929413,0.525492,0.603923,0.400001,0.066667,
    0.541178,0.913726,0.015687,0.996080,0.627452,0.470589,0.145099,0.294119,
    0.635296,0.364707,0.501962,0.019609,0.247060,0.129413,0.439217,0.945099,
    0.388237,0.466668,0.686276,0.258824,0.125491,0.898040,0.992158,0.749021,
    0.505883,0.094119,0.149020,0.764707,0.745100,0.207844,0.533334,0.180394,
    0.576472,0.333334,0.988236,0.478432,0.784315,0.729413,0.196079,0.901962,
    0.752942,0.098040,0.619609,0.639217,0.266668,0.329413,0.231374,0.043139,
    0.549021,0.780393,0.419609,0.152157,0.654903,0.737256,0.086276,0.678432,
    0.858825,0.392158,0.454903,0.078432,0.572551,0.047060,0.282354,0.721570,
    0.623531,0.741177,0.262746,0.768628,0.223530,0.192158,0.827452,0.949021,
    0.835295,0.545099,0.431373,0.854903,0.003922,0.694118,0.611766,0.286275,
    0.847060,0.674511,0.952942,0.811766,0.792158,0.956864,0.278433,0.062746,
    0.435295,0.941177,0.290197,0.360785,0.219609,0.341178,0.450982,0.592158,
    0.796080,0.631373,0.909805,0.243139,0.588237,0.380393,0.050981,0.058825,
    0.878432,0.486276,0.443138,0.800001,0.564707,0.023531,0.968629,0.109805,
    0.760786,0.415687,0.682354,0.411765,0.090197,0.600001,0.227451,0.152942,
    0.850981,0.921570,0.168629,0.133334,0.823531,0.662746,0.027452,0.200001,
    0.176471,0.235295,0.082354,0.788236,0.529413,0.666668,0.313727,0.647060,
    0.011766,0.349020,0.035295,0.101961,0.396079,0.843139,0.517648,0.815687,
    0.509805,0.160785,0.352942,0.117649,0.482354,0.658825,0.427452,0.172550
};

const GLfloat Te3[256] = {
    0.647060,0.517648,0.600001,0.552942,0.050981,0.741177,0.694118,0.329413,
    0.313727,0.011766,0.662746,0.490197,0.098040,0.384314,0.901962,0.603923,
    0.270588,0.615687,0.250980,0.529413,0.082354,0.921570,0.788236,0.043139,
    0.925491,0.403923,0.992158,0.917649,0.749021,0.968629,0.588237,0.356864,
    0.760786,0.109805,0.682354,0.415687,0.352942,0.254903,0.007844,0.309805,
    0.360785,0.956864,0.203923,0.031374,0.576472,0.450982,0.325491,0.247060,
    0.047060,0.321569,0.396079,0.368629,0.152157,0.631373,0.058825,0.709805,
    0.035295,0.211766,0.607844,0.239216,0.142039,0.411765,0.803922,0.623531,
    0.105884,0.619609,0.454903,0.180394,0.176471,0.698041,0.933335,0.984315,
    0.964707,0.301962,0.380393,0.807845,0.482354,0.243139,0.443138,0.592158,
    0.960785,0.407844,0.000000,0.172550,0.376472,0.121570,0.784315,0.929413,
    0.745100,0.274511,0.850981,0.294119,0.870590,0.831374,0.909805,0.290197,
    0.419609,0.164706,0.898040,0.086276,0.772550,0.843139,0.333334,0.580393,
    0.811766,0.062746,0.023531,0.505883,0.941177,0.266668,0.729413,0.890197,
    0.952942,0.996080,0.752942,0.541178,0.678432,0.737256,0.282354,0.015687,
    0.874511,0.756864,0.458824,0.388237,0.188236,0.101961,0.054903,0.427452,
    0.298040,0.078432,0.207844,0.184315,0.882354,0.635296,0.800001,0.223530,
    0.341178,0.949021,0.509805,0.278433,0.674511,0.905884,0.168629,0.584315,
    0.627452,0.596079,0.819609,0.498040,0.400001,0.494118,0.670590,0.513727,
    0.792158,0.160785,0.827452,0.235295,0.474511,0.886276,0.113726,0.462746,
    0.231374,0.337256,0.305884,0.117649,0.858825,0.039216,0.423530,0.894119,
    0.364707,0.431373,0.937256,0.650982,0.658825,0.643138,0.215688,0.545099,
    0.196079,0.262746,0.349020,0.717648,0.549021,0.392158,0.823531,0.878432,
    0.705883,0.980394,0.027452,0.145099,0.686276,0.556864,0.913726,0.094119,
    0.835295,0.533334,0.435295,0.447059,0.141177,0.945099,0.780393,0.317648,
    0.137256,0.486276,0.611766,0.129413,0.866667,0.862746,0.525492,0.521569,
    0.564707,0.258824,0.768628,0.666668,0.847060,0.019609,0.003922,0.070589,
    0.639217,0.372550,0.976471,0.815687,0.568628,0.345099,0.152942,0.725491,
    0.219609,0.074511,0.701962,0.200001,0.733335,0.439217,0.537256,0.654903,
    0.713727,0.133334,0.572551,0.125491,0.286275,1.000000,0.470589,0.478432,
    0.560786,0.972550,0.501962,0.090197,0.854903,0.192158,0.776472,0.721570,
    0.764707,0.690197,0.466668,0.066667,0.796080,0.988236,0.839217,0.227451
};

/* creates NBLKS copies of 16 byte test data */
void maketestdata(void)
{
    int i=0;
    int cnt=0;

    for (cnt=0; cnt < NBLKS; ++cnt) {
        i = 16*cnt;
        data[i+0] = (GLubyte) 0x32;
        data[i+1] = (GLubyte) 0x88;
        data[i+2] = (GLubyte) 0x31;
        data[i+3] = (GLubyte) 0xe0;
        data[i+4] = (GLubyte) 0x43;
        data[i+5] = (GLubyte) 0x5a;
        data[i+6] = (GLubyte) 0x31;
        data[i+7] = (GLubyte) 0x37;
        data[i+8] = (GLubyte) 0xf6;
        data[i+9] = (GLubyte) 0x30;
        data[i+10] = (GLubyte) 0x98;
        data[i+11] = (GLubyte) 0x07;
        data[i+12] = (GLubyte) 0xa8;
        data[i+13] = (GLubyte) 0x8d;
        data[i+14] = (GLubyte) 0xa2;
        data[i+15] = (GLubyte) 0x34;
    } // end of for i
} /* end of maketestdata */

/* helper function for encryption */
void add_layer(int dx1,int dy1,int sx1,int sy1,int w1,int h1,
               int dx2,int dy2,int sx2,int sy2,int w2,int h2)
{
    glRasterPos2i(dx1,dy1);
    glCopyPixels(sx1,sy1,w1,h1,GL_COLOR);
    glRasterPos2i(dx2,dy2);
    glCopyPixels(sx2,sy2,w2,h2,GL_COLOR);
}

/* encryption */
void encrypt(void)
{
    int r = 0;
    int k;
    int key_ind = KEY_START_POS;
    int num_rnds = 9;
    int cnt = 0; /* index used in print statements */

    /* disable logical operations and color maps
       when reading in data and key */
    glDisable(GL_COLOR_LOGIC_OP);
    glPixelTransferi(GL_MAP_COLOR,0);

    /* load expanded key at (KEY_START_POS,0) in RED pixel component
       NBLKS copies (rows) of expanded key are needed */
    for (k = 0; k < NBLKS; ++k) {
        glRasterPos2i(KEY_START_POS,k);
        glDrawPixels(EKEY_BYTES,1,GL_RED,GL_UNSIGNED_BYTE,ekey);
    } // end of for k

    /* load data at (0,0) into RED pixel component */
    glRasterPos2i(0,0);
    glDrawPixels(BYTES_PER_BLK,NBLKS,GL_RED,GL_UNSIGNED_BYTE,data);

    /* perform first xor with key */
    glEnable(GL_COLOR_LOGIC_OP);
    glLogicOp(GL_XOR);
    glRasterPos2i(0,0);
    glCopyPixels(KEY_START_POS,0,16,NBLKS,GL_COLOR);
    glDisable(GL_COLOR_LOGIC_OP);

    /* start of round */
    /* compute 1*,2*,3* Sbox of each byte */
    num_rnds = 9;
    for (r = 0; r < num_rnds; ++r) {
        glPixelTransferi(GL_MAP_COLOR,1);
        glPixelMapfv(GL_PIXEL_MAP_R_TO_R,256,Te1);
        glRasterPos2i(192,0); /* destination of copy */
        glCopyPixels(0,0,16,NBLKS,GL_COLOR);
        glPixelMapfv(GL_PIXEL_MAP_R_TO_R,256,Te2);
        glRasterPos2i(208,0); /* destination of copy */
        glCopyPixels(0,0,16,NBLKS,GL_COLOR);
        glPixelMapfv(GL_PIXEL_MAP_R_TO_R,256,Te3);
        glRasterPos2i(224,0); /* destination of copy */
        glCopyPixels(0,0,16,NBLKS,GL_COLOR);
        glPixelTransferi(GL_MAP_COLOR,0); /* turn mapping off */

        /* 1st term of XOR */
        /* CopyPixels create rows 1,2,3,4 in order corresponding to
           2*,1*,1*,3* S-Box entries respectively */
        glRasterPos2i(0,0);
        glCopyPixels(208,0,4,NBLKS,GL_COLOR);
        glRasterPos2i(4,0);
        glCopyPixels(192,0,4,NBLKS,GL_COLOR);
        glRasterPos2i(8,0);
        glCopyPixels(192,0,4,NBLKS,GL_COLOR);
        glRasterPos2i(12,0);
        glCopyPixels(224,0,4,NBLKS,GL_COLOR);

        /* turn xor on */
        glEnable(GL_COLOR_LOGIC_OP);
        glLogicOp(GL_XOR);

        /* 2nd term of XOR */
        /* creates rows 1,2,3,4 in order corresponding to
           3*,2*,1*,1* S-Box entries respectively */
        add_layer(0,0,229,0,3,NBLKS,3,0,228,0,1,NBLKS);
        add_layer(4,0,213,0,3,NBLKS,7,0,212,0,1,NBLKS);
        add_layer(8,0,197,0,3,NBLKS,11,0,196,0,1,NBLKS);
        add_layer(12,0,197,0,3,NBLKS,15,0,196,0,1,NBLKS);

        /* 3rd term of XOR */
        /* creates rows 1,2,3,4 in order corresponding to
           1*,3*,2*,1* S-Box entries respectively */
        add_layer(0,0,202,0,2,NBLKS,2,0,200,0,2,NBLKS);
        add_layer(4,0,234,0,2,NBLKS,6,0,232,0,2,NBLKS);
        add_layer(8,0,218,0,2,NBLKS,10,0,216,0,2,NBLKS);
        add_layer(12,0,202,0,2,NBLKS,14,0,200,0,2,NBLKS);

        /* 4th term of XOR */
        /* creates rows 1,2,3,4 in order corresponding to
           1*,1*,3*,2* S-Box entries respectively */
        add_layer(0,0,207,0,1,NBLKS,1,0,204,0,3,NBLKS);
        add_layer(4,0,207,0,1,NBLKS,5,0,204,0,3,NBLKS);
        add_layer(8,0,239,0,1,NBLKS,9,0,236,0,3,NBLKS);
        add_layer(12,0,223,0,1,NBLKS,13,0,220,0,3,NBLKS);

        /* xor with round key */
        key_ind = key_ind + 16;
        glRasterPos2i(0,0);
        glCopyPixels(key_ind,0,16,NBLKS,GL_COLOR);

        /* turn XOR off before starting next round */
        glDisable(GL_COLOR_LOGIC_OP);
    } /* end for r */

    /* last round Sbox, ShiftRows and XOR with round key */
    glDisable(GL_COLOR_LOGIC_OP);

    /* SBox */
    glPixelTransferi(GL_MAP_COLOR,1);
    glPixelMapfv(GL_PIXEL_MAP_R_TO_R,256,Te1);
    glRasterPos2i(192,0); /* destination of copy */
    glCopyPixels(0,0,16,NBLKS,GL_COLOR);

    /* ShiftRows */
    glPixelTransferi(GL_MAP_COLOR,0);
    glRasterPos2i(0,0);
    glCopyPixels(192,0,4,NBLKS,GL_COLOR);
    add_layer(4,0,197,0,3,NBLKS,7,0,196,0,1,NBLKS);
    add_layer(8,0,202,0,2,NBLKS,10,0,200,0,2,NBLKS);
    add_layer(12,0,207,0,1,NBLKS,13,0,204,0,3,NBLKS);

    /* xor with round key */
    glEnable(GL_COLOR_LOGIC_OP);
    glLogicOp(GL_XOR);
    key_ind = key_ind + 16;
    glRasterPos2i(0,0);
    glCopyPixels(key_ind,0,16,NBLKS,GL_COLOR);

    /* read buffer to system memory */
    glReadPixels(0,0,BYTES_PER_BLK,NBLKS,GL_RED,GL_UNSIGNED_BYTE,out_data);

    /* print 1 line of results */
    for (cnt=0; cnt < 16; ++cnt) {
        printf("%X ",out_data[cnt]);
    }
    printf("\n");
} /* end of encrypt */

void init(void)
{
    /* dithering needs to be off to prevent pixels from being
       averaged with neighbors and set all pixels to 0 */
    glDisable(GL_DITHER);
    glClearColor(0.0,0.0,0.0,0.0);
    glClearDepth(1.0);

    /* to simplify indexing: set raster positions to correspond
       to pixels, 0,0 = lower left */
    glMatrixMode(GL_PROJECTION);
    glLoadIdentity();
    gluOrtho2D(0.0,300.0, 0.0, 510.0);
    glMatrixMode(GL_MODELVIEW);
    glLoadIdentity();

    /* set data transfers from/to system to use back buffer */
    glDrawBuffer(GL_BACK);
    glReadBuffer(GL_BACK);

    /* create the test data */
    maketestdata();

    /* set alignment for data storage */
    glPixelStorei(GL_UNPACK_ALIGNMENT,1);
} /* end of init */

/* display just clears the buffer */
void display(void)
{
    glClear(GL_COLOR_BUFFER_BIT);
    glFlush();
} /* end of display */

/* pressing "e" will run the encryption function */
void Key(unsigned char pressedkey,int x, int y)
{
    switch(pressedkey) {
    case 'e':
        encrypt();
        glFlush();
        break;
    default:
        break;
    }
}
/* end of Key */

int main(int argc, char **argv) {
    const GLubyte *ver_str;

    glutInit(&argc, argv);
    glutInitDisplayMode(GLUT_DOUBLE | GLUT_RGB);
    glutInitWindowSize(300,510);
    glutInitWindowPosition(50,10);
    glutCreateWindow("aes");
    glutKeyboardFunc(Key);
    init();

    /* print OpenGL version used */
    ver_str = glGetString(GL_VERSION);
    fprintf(stderr,"version %s \n",ver_str);
    fprintf(stderr,"Press e to encrypt.\n");

    glutDisplayFunc(display);
    glutMainLoop();
    return 0;
}
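The ShiftRows step of the last round is carried out with pairs of pixel copies: for row r of the state, add_layer first copies the 4-r trailing bytes of the row to the front and then copies the r leading bytes to the back. The following is a minimal CPU-side sketch of that idea, assuming a state stored row by row in 16 bytes; the function names shift_row_by_copies and shift_rows are illustrative and do not appear in the listing, and memcpy stands in for glCopyPixels.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Rotate one 4-byte state row left by `shift` positions using two block
   copies, mirroring the two glCopyPixels calls issued by add_layer.
   (Hypothetical helper, not part of the original program.) */
void shift_row_by_copies(const uint8_t src[4], uint8_t dst[4], int shift) {
    assert(shift >= 0 && shift < 4);
    /* first copy: the tail of the row moves to the front */
    memcpy(dst, src + shift, (size_t)(4 - shift));
    /* second copy: the head of the row wraps around to the back */
    memcpy(dst + (4 - shift), src, (size_t)shift);
}

/* Apply ShiftRows to a state stored row by row: row r is rotated by r. */
void shift_rows(const uint8_t src[16], uint8_t dst[16]) {
    for (int row = 0; row < 4; ++row) {
        shift_row_by_copies(src + 4 * row, dst + 4 * row, row);
    }
}
```

For row 1 this corresponds to the call add_layer(4,0,197,0,3,NBLKS,7,0,196,0,1,NBLKS): three bytes starting at column 197 go to column 4, and the single byte at column 196 goes to column 7.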
A.3 Version Using the RGB Pixel Components and the Front Buffer
THIS SOFTWARE IS PROVIDED BY THE AUTHORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Example AES implementation using OpenGL and the RGB pixel components. All work is performed in the front buffer. The key expansion is not performed in the GPU. In this sample code the data to be encrypted and the expanded key are predefined; they are taken from FIPS 197. For each color component, one block of the resulting ciphertext is printed to the window from which the program is executed, so that the user can verify that the data was correctly encrypted.
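Because each pixel carries one byte in each of its red, green and blue components, three independent blocks share the same 16 pixels. The sketch below shows one way the interleaving could be done on the CPU before the data is handed to glDrawPixels with GL_RGB and GL_UNSIGNED_BYTE. It assumes tightly packed RGB bytes (unpack alignment 1); the names pack_rgb and unpack_component are illustrative and are not taken from the listing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BYTES_PER_BLK 16
#define COMPONENTS 3 /* R, G, B */

/* Interleave three 16-byte blocks so that pixel i carries byte i of each
   block: pixels[3*i + 0] = red, [3*i + 1] = green, [3*i + 2] = blue.
   (Hypothetical helper, not part of the original program.) */
void pack_rgb(const uint8_t *r, const uint8_t *g, const uint8_t *b,
              uint8_t *pixels) {
    for (size_t i = 0; i < BYTES_PER_BLK; ++i) {
        pixels[COMPONENTS * i + 0] = r[i];
        pixels[COMPONENTS * i + 1] = g[i];
        pixels[COMPONENTS * i + 2] = b[i];
    }
}

/* Extract the block stored in component c (0 = red, 1 = green, 2 = blue). */
void unpack_component(const uint8_t *pixels, size_t c, uint8_t *out) {
    assert(c < COMPONENTS);
    for (size_t i = 0; i < BYTES_PER_BLK; ++i) {
        out[i] = pixels[COMPONENTS * i + c];
    }
}
```

With NBLK rows of such pixels, 3*NBLK blocks are processed by the same sequence of OpenGL calls.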
This code runs with Microsoft Visual C++ and requires OpenGL and GLUT.

The data block from FIPS 197 and its corresponding ciphertext (bytes listed row by row of the 4x4 AES state):
data: 32 88 31 e0 43 5a 31 37 f6 30 98 07 a8 8d a2 34
ciphertext: 39 02 dc 19 25 dc 11 6a 84 09 85 0b 1d fb 97 32
The expanded key is defined below in the array ekey.
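FIPS 197 lists a block column by column, whereas the test data and the ekey array in this listing are stored row by row, so that each state row occupies four adjacent pixels. The following sketch, under that assumption, transposes a FIPS 197 block into the row order and derives the 176-byte expanded key on the CPU in the same order. The names to_row_major, expand_key_row_major and init_sbox are illustrative; the listing itself uses a hard-coded key, and this is not the authors' key-setup code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ROTL8(x, s) ((uint8_t)(((x) << (s)) | ((x) >> (8 - (s)))))

/* Build the AES S-box at run time: GF(2^8) inverse followed by the affine map. */
static void init_sbox(uint8_t sbox[256]) {
    uint8_t p = 1, q = 1;
    do {
        /* p <- p * 3 in GF(2^8) */
        p = (uint8_t)(p ^ (p << 1) ^ ((p & 0x80) ? 0x1B : 0));
        /* q <- q / 3, so that p * q == 1 after every iteration */
        q ^= (uint8_t)(q << 1);
        q ^= (uint8_t)(q << 2);
        q ^= (uint8_t)(q << 4);
        q ^= (q & 0x80) ? 0x09 : 0;
        sbox[p] = (uint8_t)(q ^ ROTL8(q, 1) ^ ROTL8(q, 2) ^ ROTL8(q, 3)
                              ^ ROTL8(q, 4) ^ 0x63);
    } while (p != 1);
    sbox[0] = 0x63; /* 0 has no inverse */
}

/* Transpose a block from FIPS 197 order (column by column) to row order. */
void to_row_major(const uint8_t in[16], uint8_t out[16]) {
    assert(in != NULL && out != NULL);
    for (int row = 0; row < 4; ++row)
        for (int col = 0; col < 4; ++col)
            out[4 * row + col] = in[4 * col + row];
}

/* AES-128 key expansion; round key r is stored row by row, i.e.
   ekey[16*r + 4*row + col] holds byte `row` of word w[4*r + col]. */
void expand_key_row_major(const uint8_t key[16], uint8_t ekey[176]) {
    static const uint8_t rcon[10] = {0x01, 0x02, 0x04, 0x08, 0x10,
                                     0x20, 0x40, 0x80, 0x1b, 0x36};
    uint8_t sbox[256], w[44][4];
    assert(key != NULL && ekey != NULL);
    init_sbox(sbox);
    for (int i = 0; i < 4; ++i)
        for (int b = 0; b < 4; ++b)
            w[i][b] = key[4 * i + b];
    for (int i = 4; i < 44; ++i) {
        uint8_t t[4] = {w[i - 1][0], w[i - 1][1], w[i - 1][2], w[i - 1][3]};
        if (i % 4 == 0) {
            /* RotWord, SubWord, then XOR the round constant into byte 0 */
            uint8_t t0 = t[0];
            t[0] = (uint8_t)(sbox[t[1]] ^ rcon[i / 4 - 1]);
            t[1] = sbox[t[2]];
            t[2] = sbox[t[3]];
            t[3] = sbox[t0];
        }
        for (int b = 0; b < 4; ++b)
            w[i][b] = (uint8_t)(w[i - 4][b] ^ t[b]);
    }
    for (int r = 0; r < 11; ++r)
        for (int row = 0; row < 4; ++row)
            for (int col = 0; col < 4; ++col)
                ekey[16 * r + 4 * row + col] = w[4 * r + col][row];
}
```

Applied to the FIPS 197 example key 2b 7e 15 16 28 ae d2 a6 ab f7 15 88 09 cf 4f 3c, the first four bytes produced are 2b 28 ab 09, which is the order used for the ekey entries in this listing.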
#include <stdio.h>   /* printf, fprintf */
#include <GL/glut.h> /* GLUT; also includes the OpenGL and GLU headers */

/* number of blocks encrypted simultaneously in one pixel component
   total # of blocks encrypted simultaneously is 3*NBLK */
#define NBLK 300
/* the expanded key is loaded starting at pixel (KEY_START_POS,0) */
#define KEY_START_POS 16
#define BYTES_PER_BLK 16 /* 16 bytes in 128 bit block */
#define EKEY_BYTES 176 /* 176 bytes in expanded key */

/* temp array used to create multiple data copies */
GLubyte tmpdata[BYTES_PER_BLK*NBLK];
/* data[i][j] contains input bytes and is read into the jth
   component of the pixels */
GLubyte data[BYTES_PER_BLK*NBLK][3];
/* expanded key */
GLubyte ekey[EKEY_BYTES];
/* expanded key, 1 copy in each of RGB */
GLubyte rgba_ekey[EKEY_BYTES][3];
/* 16 bytes of output per block */
GLubyte out_data[BYTES_PER_BLK*NBLK][3];
/* contains one data block of output, used to verify the ciphertext */
GLubyte out_red[BYTES_PER_BLK];
GLubyte out_green[BYTES_PER_BLK];
GLubyte out_blue[BYTES_PER_BLK];

/* The T tables are written as 3 tables (1*Sbox, 2*Sbox, 3*Sbox) in order
   to process data in 1 byte segments as a single pixel
   color component instead of processing 4 bytes. Values are converted to
   floating point by dividing by 255 then adding 0.000001. The addition of
   0.000001 is because OpenGL stores the pixels as floating point values
   and truncates the values when converting from floating point to integer
   format. This conversion in format occurs when using a color component
   as an index into the color map. Therefore, each value needs to be >= the
   corresponding integer but less than the next integer to avoid errors due
   to rounding. 0, 1 are set to exactly 0 and 1. */
static const GLfloat Te1[256] = {
0.388237,0.486276,0.466668,0.482354,0.949021,0.419609,0.435295,0.772550,
0.188236,0.003922,0.403923,0.168629,0.996080,0.843139,0.670590,0.462746,
0.792158,0.509805,0.788236,0.490197,0.980394,0.349020,0.278433,0.941177,
0.678432,0.831374,0.635296,0.686276,0.611766,0.643138,0.447059,0.752942,
0.717648,0.992158,0.576472,0.142039,0.211766,0.247060,0.968629,0.800001,
0.203923,0.647060,0.898040,0.945099,0.443138,0.847060,0.192158,0.082354,
0.015687,0.780393,0.137256,0.764707,0.094119,0.588237,0.019609,0.603923,
0.027452,0.070589,0.501962,0.886276,0.921570,0.152942,0.698041,0.458824,
0.035295,0.513727,0.172550,0.101961,0.105884,0.431373,0.352942,0.627452,
0.321569,0.231374,0.839217,0.701962,0.160785,0.890197,0.184315,0.517648,
0.325491,0.819609,0.000000,0.929413,0.125491,0.988236,0.694118,0.356864,
0.415687,0.796080,0.745100,0.223530,0.290197,0.298040,0.345099,0.811766,
0.815687,0.937256,0.666668,0.984315,0.262746,0.301962,0.200001,0.521569,
0.270588,0.976471,0.007844,0.498040,0.313727,0.235295,0.623531,0.658825,
0.317648,0.639217,0.250980,0.560786,0.572551,0.615687,0.219609,0.960785,
0.737256,0.713727,0.854903,0.129413,0.062746,1.000000,0.952942,0.823531,
0.803922,0.047060,0.074511,0.925491,0.372550,0.592158,0.266668,0.090197,
0.768628,0.654903,0.494118,0.239216,0.392158,0.364707,0.098040,0.450982,
0.376472,0.505883,0.309805,0.862746,0.133334,0.164706,0.564707,0.533334,
0.274511,0.933335,0.721570,0.078432,0.870590,0.368629,0.043139,0.858825,
0.878432,0.196079,0.227451,0.039216,0.286275,0.023531,0.141177,0.360785,
0.760786,0.827452,0.674511,0.384314,0.568628,0.584315,0.894119,0.474511,
0.905884,0.784315,0.215688,0.427452,0.552942,0.835295,0.305884,0.662746,
0.423530,0.337256,0.956864,0.917649,0.396079,0.478432,0.682354,0.031374,
0.729413,0.470589,0.145099,0.180394,0.109805,0.650982,0.705883,0.776472,
0.909805,0.866667,0.454903,0.121570,0.294119,0.741177,0.545099,0.541178,
0.439217,0.243139,0.709805,0.400001,0.282354,0.011766,0.964707,0.054903,
0.380393,0.207844,0.341178,0.725491,0.525492,0.756864,0.113726,0.619609,
0.882354,0.972550,0.596079,0.066667,0.411765,0.850981,0.556864,0.580393,
0.607844,0.117649,0.529413,0.913726,0.807845,0.333334,0.156863,0.874511,
0.549021,0.631373,0.537256,0.050981,0.749021,0.901962,0.258824,0.407844,
0.254903,0.600001,0.176471,0.058825,0.690197,0.329413,0.733335,0.086276
};

static const GLfloat Te2[256] = {
0.776472,0.972550,0.933335,0.964707,1.000000,0.839217,0.870590,0.568628,
0.376472,0.007844,0.807845,0.337256,0.905884,0.709805,0.301962,0.925491,
0.560786,0.121570,0.537256,0.980394,0.937256,0.698041,0.556864,0.984315,
0.254903,0.701962,0.372550,0.270588,0.137256,0.325491,0.894119,0.607844,
0.458824,0.882354,0.239216,0.298040,0.423530,0.494118,0.960785,0.513727, 0.407844,0.317648,0.819609,0.976471,0.886276,0.670590,0.384314,0.164706, 0.031374,0.584315,0.274511,0.615687,0.188236,0.215688,0.039216,0.184315, 0.054903,0.141177,0.105884,0.874511,0.803922,0.305884,0.498040,0.917649, 0.070589,0.113726,0.345099,0.203923,0.211766,0.862746,0.705883,0.356864, 0.643138,0.462746,0.717648,0.490197,0.321569,0.866667,0.368629,0.074511, 0.650982,0.725491,0.000000,0.756864,0.250980,0.890197,0.474511,0.713727, 0.831374,0.552942,0.403923,0.447059,0.580393,0.596079,0.690197,0.521569, 0.733335,0.772550,0.309805,0.929413,0.525492,0.603923,0.400001,0.066667, 0.541178,0.913726,0.015687,0.996080,0.627452,0.470589,0.145099,0.294119, 0.635296,0.364707,0.501962,0.019609,0.247060,0.129413,0.439217,0.945099, 0.388237,0.466668,0.686276,0.258824,0.125491,0.898040,0.992158,0.749021, 0.505883,0.094119,0.149020,0.764707,0.745100,0.207844,0.533334,0.180394, 0.576472,0.333334,0.988236,0.478432,0.784315,0.729413,0.196079,0.901962, 0.752942,0.098040,0.619609,0.639217,0.266668,0.329413,0.231374,0.043139, 0.549021,0.780393,0.419609,0.152157,0.654903,0.737256,0.086276,0.678432, 0.858825,0.392158,0.454903,0.078432,0.572551,0.047060,0.282354,0.721570, 0.623531,0.741177,0.262746,0.768628,0.223530,0.192158,0.827452,0.949021, 0.835295,0.545099,0.431373,0.854903,0.003922,0.694118,0.611766,0.286275, 0.847060,0.674511,0.952942,0.811766,0.792158,0.956864,0.278433,0.062746, 0.435295,0.941177,0.290197,0.360785,0.219609,0.341178,0.450982,0.592158, 0.796080,0.631373,0.909805,0.243139,0.588237,0.380393,0.050981,0.058825, 0.878432,0.486276,0.443138,0.800001,0.564707,0.023531,0.968629,0.109805, 0.760786,0.415687,0.682354,0.411765,0.090197,0.600001,0.227451,0.152942, 0.850981,0.921570,0.168629,0.133334,0.823531,0.662746,0.027452,0.200001, 0.176471,0.235295,0.082354,0.788236,0.529413,0.666668,0.313727,0.647060, 0.011766,0.349020,0.035295,0.101961,0.396079,0.843139,0.517648,0.815687, 
0.509805,0.160785,0.352942,0.117649,0.482354,0.658825,0.427452,0.172550 }; static const GLfloat Te3[256] = { 0.647060,0.517648,0.600001,0.552942,0.050981,0.741177,0.694118,0.329413, 0.313727,0.011766,0.662746,0.490197,0.098040,0.384314,0.901962,0.603923, 0.270588,0.615687,0.250980,0.529413,0.082354,0.921570,0.788236,0.043139, 0.925491,0.403923,0.992158,0.917649,0.749021,0.968629,0.588237,0.356864, 0.760786,0.109805,0.682354,0.415687,0.352942,0.254903,0.007844,0.309805, 0.360785,0.956864,0.203923,0.031374,0.576472,0.450982,0.325491,0.247060, 0.047060,0.321569,0.396079,0.368629,0.152157,0.631373,0.058825,0.709805, 0.035295,0.211766,0.607844,0.239216,0.142039,0.411765,0.803922,0.623531, 0.105884,0.619609,0.454903,0.180394,0.176471,0.698041,0.933335,0.984315, 0.964707,0.301962,0.380393,0.807845,0.482354,0.243139,0.443138,0.592158, 0.960785,0.407844,0.000000,0.172550,0.376472,0.121570,0.784315,0.929413, 0.745100,0.274511,0.850981,0.294119,0.870590,0.831374,0.909805,0.290197, 0.419609,0.164706,0.898040,0.086276,0.772550,0.843139,0.333334,0.580393, 0.811766,0.062746,0.023531,0.505883,0.941177,0.266668,0.729413,0.890197, 0.952942,0.996080,0.752942,0.541178,0.678432,0.737256,0.282354,0.015687, 0.874511,0.756864,0.458824,0.388237,0.188236,0.101961,0.054903,0.427452, 0.298040,0.078432,0.207844,0.184315,0.882354,0.635296,0.800001,0.223530, 0.341178,0.949021,0.509805,0.278433,0.674511,0.905884,0.168629,0.584315, 0.627452,0.596079,0.819609,0.498040,0.400001,0.494118,0.670590,0.513727,
0.792158,0.160785,0.827452,0.235295,0.474511,0.886276,0.113726,0.462746,
0.231374,0.337256,0.305884,0.117649,0.858825,0.039216,0.423530,0.894119,
0.364707,0.431373,0.937256,0.650982,0.658825,0.643138,0.215688,0.545099,
0.196079,0.262746,0.349020,0.717648,0.549021,0.392158,0.823531,0.878432,
0.705883,0.980394,0.027452,0.145099,0.686276,0.556864,0.913726,0.094119,
0.835295,0.533334,0.435295,0.447059,0.141177,0.945099,0.780393,0.317648,
0.137256,0.486276,0.611766,0.129413,0.866667,0.862746,0.525492,0.521569,
0.564707,0.258824,0.768628,0.666668,0.847060,0.019609,0.003922,0.070589,
0.639217,0.372550,0.976471,0.815687,0.568628,0.345099,0.152942,0.725491,
0.219609,0.074511,0.701962,0.200001,0.733335,0.439217,0.537256,0.654903,
0.713727,0.133334,0.572551,0.125491,0.286275,1.000000,0.470589,0.478432,
0.560786,0.972550,0.501962,0.090197,0.854903,0.192158,0.776472,0.721570,
0.764707,0.690197,0.466668,0.066667,0.796080,0.988236,0.839217,0.227451
};

/* creates NBLK copies of 16 byte test data */
void maketestdata(void) {
    int i=0;
    int cnt,j;
    for (cnt=0; cnt < NBLK; ++cnt) {
        i = 16*cnt;
        for (j=0; j < 3; ++j) {
            data[i+0][j] = (GLubyte) 0x32;
            data[i+1][j] = (GLubyte) 0x88;
            data[i+2][j] = (GLubyte) 0x31;
            data[i+3][j] = (GLubyte) 0xe0;
            data[i+4][j] = (GLubyte) 0x43;
            data[i+5][j] = (GLubyte) 0x5a;
            data[i+6][j] = (GLubyte) 0x31;
            data[i+7][j] = (GLubyte) 0x37;
            data[i+8][j] = (GLubyte) 0xf6;
            data[i+9][j] = (GLubyte) 0x30;
            data[i+10][j] = (GLubyte) 0x98;
            data[i+11][j] = (GLubyte) 0x07;
            data[i+12][j] = (GLubyte) 0xa8;
            data[i+13][j] = (GLubyte) 0x8d;
            data[i+14][j] = (GLubyte) 0xa2;
            data[i+15][j] = (GLubyte) 0x34;
        }
    } // end of for cnt
} /* end of maketestdata */

/* expanded key written 1 entry per line for readability */
void maketestekey(void) {
    int i,j;

    /* initial whitening */
    ekey[0] = (GLubyte) 0x2b;
    ekey[1] = (GLubyte) 0x28;
    ekey[2] = (GLubyte) 0xab;
    ekey[3] = (GLubyte) 0x09;
    ekey[4] = (GLubyte) 0x7e;
    ekey[5] = (GLubyte) 0xae;
    ekey[6] = (GLubyte) 0xf7;
    ekey[7] = (GLubyte) 0xcf;
    ekey[8] = (GLubyte) 0x15;
    ekey[9] = (GLubyte) 0xd2;
    ekey[10] = (GLubyte) 0x15;
    ekey[11] = (GLubyte) 0x4f;
    ekey[12] = (GLubyte) 0x16;
    ekey[13] = (GLubyte) 0xa6;
    ekey[14] = (GLubyte) 0x88;
    ekey[15] = (GLubyte) 0x3c;

    /* 1st round key */
    ekey[16] = (GLubyte) 0xa0;
    ekey[17] = (GLubyte) 0x88;
    ekey[18] = (GLubyte) 0x23;
    ekey[19] = (GLubyte) 0x2a;
    ekey[20] = (GLubyte) 0xfa;
    ekey[21] = (GLubyte) 0x54;
    ekey[22] = (GLubyte) 0xa3;
    ekey[23] = (GLubyte) 0x6c;
    ekey[24] = (GLubyte) 0xfe;
    ekey[25] = (GLubyte) 0x2c;
    ekey[26] = (GLubyte) 0x39;
    ekey[27] = (GLubyte) 0x76;
    ekey[28] = (GLubyte) 0x17;
    ekey[29] = (GLubyte) 0xb1;
    ekey[30] = (GLubyte) 0x39;
    ekey[31] = (GLubyte) 0x05;

    /* 2nd round key */
    ekey[32] = (GLubyte) 0xf2;
    ekey[33] = (GLubyte) 0x7a;
    ekey[34] = (GLubyte) 0x59;
    ekey[35] = (GLubyte) 0x73;
    ekey[36] = (GLubyte) 0xc2;
    ekey[37] = (GLubyte) 0x96;
    ekey[38] = (GLubyte) 0x35;
    ekey[39] = (GLubyte) 0x59;
    ekey[40] = (GLubyte) 0x95;
    ekey[41] = (GLubyte) 0xb9;
    ekey[42] = (GLubyte) 0x80;
    ekey[43] = (GLubyte) 0xf6;
    ekey[44] = (GLubyte) 0xf2;
    ekey[45] = (GLubyte) 0x43;
    ekey[46] = (GLubyte) 0x7a;
    ekey[47] = (GLubyte) 0x7f;

    /* 3rd round key */
    ekey[48] = (GLubyte) 0x3d;
    ekey[49] = (GLubyte) 0x47;
    ekey[50] = (GLubyte) 0x1e;
    ekey[51] = (GLubyte) 0x6d;
    ekey[52] = (GLubyte) 0x80;
    ekey[53] = (GLubyte) 0x16;
    ekey[54] = (GLubyte) 0x23;
    ekey[55] = (GLubyte) 0x7a;
    ekey[56] = (GLubyte) 0x47;
    ekey[57] = (GLubyte) 0xfe;
    ekey[58] = (GLubyte) 0x7e;
    ekey[59] = (GLubyte) 0x88;
    ekey[60] = (GLubyte) 0x7d;
    ekey[61] = (GLubyte) 0x3e;
    ekey[62] = (GLubyte) 0x44;
    ekey[63] = (GLubyte) 0x3b;

    /* 4th round key */
    ekey[64] = (GLubyte) 0xef;
    ekey[65] = (GLubyte) 0xa8;
    ekey[66] = (GLubyte) 0xb6;
    ekey[67] = (GLubyte) 0xdb;
    ekey[68] = (GLubyte) 0x44;
    ekey[69] = (GLubyte) 0x52;
    ekey[70] = (GLubyte) 0x71;
    ekey[71] = (GLubyte) 0x0b;
    ekey[72] = (GLubyte) 0xa5;
    ekey[73] = (GLubyte) 0x5b;
    ekey[74] = (GLubyte) 0x25;
    ekey[75] = (GLubyte) 0xad;
    ekey[76] = (GLubyte) 0x41;
    ekey[77] = (GLubyte) 0x7f;
    ekey[78] = (GLubyte) 0x3b;
    ekey[79] = (GLubyte) 0x00;

    /* 5th round key */
    ekey[80] = (GLubyte) 0xd4;
    ekey[81] = (GLubyte) 0x7c;
    ekey[82] = (GLubyte) 0xca;
    ekey[83] = (GLubyte) 0x11;
    ekey[84] = (GLubyte) 0xd1;
    ekey[85] = (GLubyte) 0x83;
    ekey[86] = (GLubyte) 0xf2;
    ekey[87] = (GLubyte) 0xf9;
    ekey[88] = (GLubyte) 0xc6;
    ekey[89] = (GLubyte) 0x9d;
    ekey[90] = (GLubyte) 0xb8;
    ekey[91] = (GLubyte) 0x15;
    ekey[92] = (GLubyte) 0xf8;
    ekey[93] = (GLubyte) 0x87;
    ekey[94] = (GLubyte) 0xbc;
    ekey[95] = (GLubyte) 0xbc;

    /* 6th round key */
    ekey[96] = (GLubyte) 0x6d;
    ekey[97] = (GLubyte) 0x11;
    ekey[98] = (GLubyte) 0xdb;
    ekey[99] = (GLubyte) 0xca;
    ekey[100] = (GLubyte) 0x88;
    ekey[101] = (GLubyte) 0x0b;
    ekey[102] = (GLubyte) 0xf9;
    ekey[103] = (GLubyte) 0x00;
    ekey[104] = (GLubyte) 0xa3;
    ekey[105] = (GLubyte) 0x3e;
    ekey[106] = (GLubyte) 0x86;
    ekey[107] = (GLubyte) 0x93;
    ekey[108] = (GLubyte) 0x7a;
    ekey[109] = (GLubyte) 0xfd;
    ekey[110] = (GLubyte) 0x41;
    ekey[111] = (GLubyte) 0xfd;

    /* 7th round key */
    ekey[112] = (GLubyte) 0x4e;
    ekey[113] = (GLubyte) 0x5f;
    ekey[114] = (GLubyte) 0x84;
    ekey[115] = (GLubyte) 0x4e;
    ekey[116] = (GLubyte) 0x54;
    ekey[117] = (GLubyte) 0x5f;
    ekey[118] = (GLubyte) 0xa6;
    ekey[119] = (GLubyte) 0xa6;
    ekey[120] = (GLubyte) 0xf7;
    ekey[121] = (GLubyte) 0xc9;
    ekey[122] = (GLubyte) 0x4f;
    ekey[123] = (GLubyte) 0xdc;
    ekey[124] = (GLubyte) 0x0e;
    ekey[125] = (GLubyte) 0xf3;
    ekey[126] = (GLubyte) 0xb2;
    ekey[127] = (GLubyte) 0x4f;

    /* 8th round key */
    ekey[128] = (GLubyte) 0xea;
    ekey[129] = (GLubyte) 0xb5;
    ekey[130] = (GLubyte) 0x31;
    ekey[131] = (GLubyte) 0x7f;
    ekey[132] = (GLubyte) 0xd2;
    ekey[133] = (GLubyte) 0x8d;
    ekey[134] = (GLubyte) 0x2b;
    ekey[135] = (GLubyte) 0x8d;
    ekey[136] = (GLubyte) 0x73;
    ekey[137] = (GLubyte) 0xba;
    ekey[138] = (GLubyte) 0xf5;
    ekey[139] = (GLubyte) 0x29;
    ekey[140] = (GLubyte) 0x21;
    ekey[141] = (GLubyte) 0xd2;
    ekey[142] = (GLubyte) 0x60;
    ekey[143] = (GLubyte) 0x2f;

    /* 9th round key */
    ekey[144] = (GLubyte) 0xac;
    ekey[145] = (GLubyte) 0x19;
    ekey[146] = (GLubyte) 0x28;
    ekey[147] = (GLubyte) 0x57;
    ekey[148] = (GLubyte) 0x77;
    ekey[149] = (GLubyte) 0xfa;
    ekey[150] = (GLubyte) 0xd1;
    ekey[151] = (GLubyte) 0x5c;
    ekey[152] = (GLubyte) 0x66;
    ekey[153] = (GLubyte) 0xdc;
    ekey[154] = (GLubyte) 0x29;
    ekey[155] = (GLubyte) 0x00;
    ekey[156] = (GLubyte) 0xf3;
    ekey[157] = (GLubyte) 0x21;
    ekey[158] = (GLubyte) 0x41;
    ekey[159] = (GLubyte) 0x6e;

    /* 10th round key */
    ekey[160] = (GLubyte) 0xd0;
    ekey[161] = (GLubyte) 0xc9;
    ekey[162] = (GLubyte) 0xe1;
    ekey[163] = (GLubyte) 0xb6;
    ekey[164] = (GLubyte) 0x14;
    ekey[165] = (GLubyte) 0xee;
    ekey[166] = (GLubyte) 0x3f;
    ekey[167] = (GLubyte) 0x63;
    ekey[168] = (GLubyte) 0xf9;
    ekey[169] = (GLubyte) 0x25;
    ekey[170] = (GLubyte) 0x0c;
    ekey[171] = (GLubyte) 0x0c;
    ekey[172] = (GLubyte) 0xa8;
    ekey[173] = (GLubyte) 0x89;
    ekey[174] = (GLubyte) 0xc8;
    ekey[175] = (GLubyte) 0xa6;

    for (i=0; i < 176; ++i) {
        for (j=0; j < 3; ++j) {
            rgba_ekey[i][j] = ekey[i];
        }
    }
} /* end of maketestekey */

/* helper function - performs 2 copies */
void add_layer(int dx1,int dy1,int sx1,int sy1,int w1,int h1,
               int dx2,int dy2,int sx2,int sy2,int w2,int h2) {
    glRasterPos2i(dx1,dy1);
    glCopyPixels(sx1,sy1,w1,h1,GL_COLOR);
    glRasterPos2i(dx2,dy2);
    glCopyPixels(sx2,sy2,w2,h2,GL_COLOR);
}

/* encryption function */
void encrypt(void) {
    int r = 0;
    int ri = 0;
    int k;
    int key_ind = KEY_START_POS;
    int num_rnds = 9;
    int cnt=0; /* index used in print statements */

    glDisable(GL_COLOR_LOGIC_OP);
    glPixelTransferi(GL_MAP_COLOR,0);

    /* load expanded key at (KEY_START_POS,0)
       NBLK copies (rows) of expanded key are needed */
    for (k = 0; k < NBLK; ++k) {
        glRasterPos2i(KEY_START_POS,k);
        glDrawPixels(EKEY_BYTES,1,GL_RGB,GL_UNSIGNED_BYTE,rgba_ekey);
    } // end of for k

    /* load data at (0,0) */
    glRasterPos2i(0,0);
    glDrawPixels(BYTES_PER_BLK,NBLK,GL_RGB,GL_UNSIGNED_BYTE,data);

    /* perform first xor with key */
    glEnable(GL_COLOR_LOGIC_OP);
    glLogicOp(GL_XOR);
    glRasterPos2i(0,0);
    glCopyPixels(KEY_START_POS,0,16,NBLK,GL_COLOR);
    glDisable(GL_COLOR_LOGIC_OP);

    /* start of round */
    /* compute 1*,2*,3* Sbox of each byte */
    for (r = 0; r < 9; ++r) {
        glPixelTransferi(GL_MAP_COLOR,1);
        glPixelMapfv(GL_PIXEL_MAP_R_TO_R,256,Te1);
        glPixelMapfv(GL_PIXEL_MAP_G_TO_G,256,Te1);
        glPixelMapfv(GL_PIXEL_MAP_B_TO_B,256,Te1);
        glRasterPos2i(192,0); /* destination of copy */
        glCopyPixels(0,0,16,NBLK,GL_COLOR);

        glPixelMapfv(GL_PIXEL_MAP_R_TO_R,256,Te2);
        glPixelMapfv(GL_PIXEL_MAP_G_TO_G,256,Te2);
        glPixelMapfv(GL_PIXEL_MAP_B_TO_B,256,Te2);
        glRasterPos2i(208,0); /* destination of copy */
        glCopyPixels(0,0,16,NBLK,GL_COLOR);

        glPixelMapfv(GL_PIXEL_MAP_R_TO_R,256,Te3);
        glPixelMapfv(GL_PIXEL_MAP_G_TO_G,256,Te3);
        glPixelMapfv(GL_PIXEL_MAP_B_TO_B,256,Te3);
        glRasterPos2i(224,0); /* destination of copy */
        glCopyPixels(0,0,16,NBLK,GL_COLOR);

        glPixelTransferi(GL_MAP_COLOR,0); /* turn mapping off */

        /* create "T0[row1]" */
        /* 1st of 4 layers of 1st row 2* entry */
        glRasterPos2i(0,0);
        glCopyPixels(208,0,4,NBLK,GL_COLOR);
        /* 1st of 4 layers of 2nd row 1* entry */
        glRasterPos2i(4,0);
        glCopyPixels(192,0,4,NBLK,GL_COLOR);
        /* 1st of 4 layers of 3rd row 1* entry */
        glRasterPos2i(8,0);
        glCopyPixels(192,0,4,NBLK,GL_COLOR);
        /* 1st of 4 layers of 4th row 3* entry */
        glRasterPos2i(12,0);
        glCopyPixels(224,0,4,NBLK,GL_COLOR);

        /* turn xor on */
        glEnable(GL_COLOR_LOGIC_OP);
        glLogicOp(GL_XOR);

        /* create "T1[row2]" */
        /* 2nd of 4 layers of 1st row 3* entry */
        add_layer(0,0,229,0,3,NBLK,3,0,228,0,1,NBLK);
        /* 2nd of 4 layers of 2nd row 2* entry */
        add_layer(4,0,213,0,3,NBLK,7,0,212,0,1,NBLK);
        /* 2nd of 4 layers of 3rd row 1* entry */
        add_layer(8,0,197,0,3,NBLK,11,0,196,0,1,NBLK);
        /* 2nd of 4 layers of 4th row 1* entry */
        add_layer(12,0,197,0,3,NBLK,15,0,196,0,1,NBLK);

        /* create "T2[row3]" */
        /* 3rd of 4 layers of 1st row 1* entry */
        add_layer(0,0,202,0,2,NBLK,2,0,200,0,2,NBLK);
        /* 3rd of 4 layers of 2nd row 3* entry */
        add_layer(4,0,234,0,2,NBLK,6,0,232,0,2,NBLK);
        /* 3rd of 4 layers of 3rd row 2* entry */
        add_layer(8,0,218,0,2,NBLK,10,0,216,0,2,NBLK);
        /* 3rd of 4 layers of 4th row 1* entry */
        add_layer(12,0,202,0,2,NBLK,14,0,200,0,2,NBLK);

        /* create "T3[row4]" */
        /* 4th of 4 layers of 1st row 1* entry */
        add_layer(0,0,207,0,1,NBLK,1,0,204,0,3,NBLK);
        /* 4th of 4 layers of 2nd row 1* entry */
        add_layer(4,0,207,0,1,NBLK,5,0,204,0,3,NBLK);
        /* 4th of 4 layers of 3rd row 3* entry */
        add_layer(8,0,239,0,1,NBLK,9,0,236,0,3,NBLK);
        /* 4th of 4 layers of 4th row 2* entry */
        add_layer(12,0,223,0,1,NBLK,13,0,220,0,3,NBLK);

        /* xor with round key */
        key_ind = key_ind + 16;
        glRasterPos2i(0,0);
        glCopyPixels(key_ind,0,16,NBLK,GL_COLOR);

        /* turn off XOR before starting the next round */
        glDisable(GL_COLOR_LOGIC_OP);
    } /* end of for r */

    /* last round Sbox, ShiftRows and XOR with round key */
    glDisable(GL_COLOR_LOGIC_OP);

    /* SBox */
    glPixelTransferi(GL_MAP_COLOR,1);
    glPixelMapfv(GL_PIXEL_MAP_R_TO_R,256,Te1);
    glPixelMapfv(GL_PIXEL_MAP_G_TO_G,256,Te1);
    glPixelMapfv(GL_PIXEL_MAP_B_TO_B,256,Te1);
    glRasterPos2i(192,0); /* destination of copy */
    glCopyPixels(0,0,16,NBLK,GL_COLOR);

    /* ShiftRows */
    glPixelTransferi(GL_MAP_COLOR,0);
    glRasterPos2i(0,0);
    glCopyPixels(192,0,4,NBLK,GL_COLOR);
    add_layer(4,0,197,0,3,NBLK,7,0,196,0,1,NBLK);
    add_layer(8,0,202,0,2,NBLK,10,0,200,0,2,NBLK);
    add_layer(12,0,207,0,1,NBLK,13,0,204,0,3,NBLK);

    /* xor with round key */
    glEnable(GL_COLOR_LOGIC_OP);
    glLogicOp(GL_XOR);
    key_ind = key_ind + 16;
    glRasterPos2i(0,0);
    glCopyPixels(key_ind,0,16,NBLK,GL_COLOR);

    /* read buffer to system memory */
    // glReadPixels(0,0,BYTES_PER_BLK,NBLK,GL_RGB,GL_UNSIGNED_BYTE,out_data);
    /* Uncomment the above line to read all pixels to a single array
       which can then be written to a file. The following prints one row
       (since all blocks being encrypted are identical in this example,
       just check one row) of each pixel component so the user can verify
       the ciphertext. */

    /* 1 line of each pixel color */
    glReadPixels(0,0,16,1,GL_RED,GL_UNSIGNED_BYTE,out_red);
    for (ri=0; ri < 16; ++ri) {
        printf("%X ", out_red[ri]);
    }
    printf("\n");

    glReadPixels(0,0,16,1,GL_GREEN,GL_UNSIGNED_BYTE,out_green);
    for (ri=0; ri < 16; ++ri) {
        printf("%X ", out_green[ri]);
    }
    printf("\n");

    glReadPixels(0,0,16,1,GL_BLUE,GL_UNSIGNED_BYTE,out_blue);
    for (ri=0; ri < 16; ++ri) {
        printf("%X ", out_blue[ri]);
    }
    printf("\n");
} /* end of encrypt */

void init(void) {
    /* dithering needs to be off
       Initialize all pixels to 0 */
    glDisable(GL_DITHER);
    glClearColor(1.0,1.0,1.0,1.0);
    glClearDepth(1.0);

    /* to simplify indexing: set raster positions to correspond to
       pixels, 0,0 = lower left */
    glMatrixMode(GL_PROJECTION);
    glLoadIdentity();
    gluOrtho2D(0.0,300.0, 0.0, 410.0);
    glMatrixMode(GL_MODELVIEW);
    glLoadIdentity();

    glDrawBuffer(GL_FRONT);
    glReadBuffer(GL_FRONT);

    maketestdata();
    maketestekey();

    glPixelStorei(GL_UNPACK_ALIGNMENT,1);
} /* end of init */

void display(void) {
    glClear(GL_COLOR_BUFFER_BIT);
    encrypt();
    glFlush();
} /* end of display */

int main(int argc, char **argv) {
    const GLubyte *ver_str;

    glutInit(&argc, argv);
    glutInitDisplayMode(GLUT_SINGLE | GLUT_RGB);
    glutInitWindowSize(300,410);
    glutInitWindowPosition(50,10);
    glutCreateWindow("aes");
    init();

    ver_str = glGetString(GL_VERSION);
    fprintf(stderr,"OpenGL version %s \n",ver_str);

    glutDisplayFunc(display);
    glutMainLoop();
    return 0;
}
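Two ideas in the listing above can be checked on the CPU. The first is the color-map rule stated in the table comment: a byte is stored as byte/255 plus 0.000001 so that a truncating conversion recovers the same byte. The second is the layering in the round loop, where each output byte is the XOR of a 2*Sbox, a 3*Sbox and two 1*Sbox entries. The sketch below illustrates both under stated assumptions: it uses double precision rather than the GPU's internal format, the names xtime, map_entry, map_index and mix_first_row are illustrative, and it applies the stated rule rather than regenerating the printed tables digit for digit.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply by 2 in GF(2^8) modulo the AES polynomial x^8+x^4+x^3+x+1. */
uint8_t xtime(uint8_t b) {
    return (uint8_t)((b << 1) ^ ((b & 0x80) ? 0x1B : 0));
}

/* Color-map entry for a byte, following the rule in the table comment:
   divide by 255, add 0.000001, and keep 0 and 1 exact. */
double map_entry(uint8_t b) {
    if (b == 0) return 0.0;
    if (b == 255) return 1.0;
    return b / 255.0 + 0.000001;
}

/* Truncating conversion from a color value back to an integer index. */
uint8_t map_index(double v) {
    assert(v >= 0.0 && v <= 1.0);
    return (uint8_t)(v * 255.0);
}

/* First output byte of MixColumns for one column, built as the XOR of four
   layers: 2*s0 ^ 3*s1 ^ 1*s2 ^ 1*s3, where s0..s3 are the S-box outputs
   of the column after ShiftRows. */
uint8_t mix_first_row(uint8_t s0, uint8_t s1, uint8_t s2, uint8_t s3) {
    uint8_t two_s0 = xtime(s0);
    uint8_t three_s1 = (uint8_t)(xtime(s1) ^ s1);
    return (uint8_t)(two_s0 ^ three_s1 ^ s2 ^ s3);
}
```

For the S-box output 0x63, xtime gives 0xc6 and the 3* multiple is 0xa5, which correspond to the first entries of Te2 (198/255) and Te3 (165/255). The column d4 bf 5d 30 from the FIPS 197 example yields 04 as its first MixColumns byte.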
129
References
[1] W. A. Arbaugh. Chaining Layered Integrity Checks. PhD thesis, University of Pennsylvania, Philadelphia, 1999. [2] W. A. Arbaugh, D. J. Farber, and J. M. Smith. A secure and reliable bootstrap architecture. In IEEE Security and Privacy Conference, pages 65-71, May 1997. [3] P. Biddle, M. Peinado, and D. Flanagan. Privacy, Security and Content Protection. http://download.microsoft.eom/download/a/f/c/ afcf8195-0eda-4190-a46d-aa60b45e0740/Secure.ppt. 14] E. Biham. A Fast New DES Implementation in Software. In Workshop on Fast Software Encryption (FSE), pages 260-272, 1997. [5] E. Biham and A. Shamir. Differential Fault Analysis of Secret Key Cryptosystems. Computer Science Technical Report CS0910, Technion, 1997. [6] Boneh, Demillo, and Lipton. On the Importance of Checking Cryptgraphic Protocols for Faults. In Proceedings of Advances in Cryptology - Eurocrypt, pages 37-51, 1997. [7] D. Boneh and N. Shacham. Improving SSL Handshake Performance via Batching. In Proceedings of the RSA Conference, January 2001. [8] I. Buck. BrookGPU. i n d e x . h t m l , 2003.
http://graphics.stanford.edu/projects/brookgpu/
[9] J. Butler and S. Sparks. Spy ware and Rootkits - The Future Convergence. USENIX ;login:, 29(6):8-15, December 2004. [10] C.Elliot. Vertigo. h t t p : / / w w w . c o n a l . n e t / V e r t i g o . [11] A. Carroll, M. Juarez, J. Polk, and T. Leininger. Overview. White paper, Microsoft, August 2002.
Microsoft Palladium: A Business
[12] N. Chou, R. Ledesma, Y. Teraguchi, and J. C. Mitchell. Client-Side Defense Against WebBased Identity Theft. In Proceedings of the ISOC Symposium on Network and Distributed Systems Security (SNDSS), February 2004.
132
REFERENCES
[13] M. Christodorescu and S. Jha. Testing Malware Detectors. In Proceedings of the International Symposium on Software Testing and Analysis (ISSTA), July 2004. [14] P. C. Clark. BITS: A Smartcard Protected Operating System. PhD thesis, George Washington University, 1994. [15] C. Coarfa, P. Druschel, and D. Wallach. Performance Analysis of TLS Web Servers. In Proceedings of the ISOC Symposium on Network and Distributed Systems Security (SNDSS), February 2002. [16] D. Cook, R. Baretto, and A. Keromytis. Remotely Keyed Cryptographies - Secure Remote Display Access Using (Mostly) Untrusted Hardware. In Proceedings of ICICS, pages 363-375, December 2005. [17] D. Cook, J. loannidis, A. Keromytis, and J. Luck. CryptoGraphics: Secret Key Cryptography Using Graphics Cards. In Proceedings of the RSA Conference, Cryptographer's Track (CT-RSA), pages 334-350, February 2005. [18] D. Coppersmith, et.al. The MARS Cipher, security/mars.html, 1999.
http://www.research.ibm.com/
[19] J. Daemon and V. Rijmen. The Design ofRijndael: AES the Advanced Encryption Standard. Springer-Verlag, Berlin, 2002. [20] D. Davis, F. Monrose, and M. K. Reiter. On User Choice in Graphical Password Schemes. In Proceedings of the 13*^ USENIX Security Symposium, pages 151-163, August 2004. [21] T. Dierks and C. Allen. The TLS protocol version 1.0. Request for Comments (Proposed Standard) 2246, Jan. 1999. [22] P. Druschel, M. Abbott, M. Pagels, and L. Peterson. Network subsystem design. IEEE Network, 7(4):8-17, July 1993. [23] P. Ekdahl and T. Johansson. A New Version of the Stream Cipher SNOW. In Proceedings of SAC, 2002. [24] W. Feghali, B. Burres, G. Wolrich, and D. Carrigan. Security: Adding Protection to the Network via the Network Processor. Intel Technology Journal, 6, August 2002. [25] R. Fernando and M. Kilgard. The Cg Tutorial. Addison-Wesley, 2003. [26] N. Galoppo, N. Govindoraju, M. Henson, and D. Manocha. LU-GPU: Efficient Algorithms for Solving Dense Linear Systems on Graphics Hardware. In Proceedings of ACM/IEEE Super Computing Conference, 2005. [27] A. Goldberg, R. Buff, and A. Schmitt. Secure Web Server Performance Dramatically Improved By Caching SSL Session Keys. In Workshop on Internet Server Performance, held in conjunction with SIGMETRICS, June 1998. [28] V. Gupta, D. Stebila, S. Fung, S. C. Shantz, N. Gura, and H. Eberle. Speeding up Secure Web Transactions Using Elliptic Curve Cryptography. In Proceedings of the ISOC Symposium on Network and Distributed System Security (SNDSS), pages 231-239, February 2004.
REFERENCES
133
[29] P. Gutmann. The Design of a Cryptographic Security Architecture. In Proceedings of the 8*^ USENIX Security Symposium, August 1999. [30] P. Gutmann. An Open-source Cryptographic Coprocessor. In Proceedings of the 9*^ USENIX Security Symposium, August 2000. [31] H. Gobioff and S. Smith and J. Tygar and B. Yee. Smart Cards in Hostile Environments. In 2"^"^ USENIX Workshop on Electronic Commerce, 1996. [32] Helion Technology Limited. High Performance Solutions in Silicon, AES (Rijndael) Core, http://www.heliontech.com/core2.htm, 2003. [33] Y.-C. Hu, A. Perrig, and D. B. Johnson. Paclcet Leashes: A Defense against Wormhole Attacks in Wireless Networks. In Proceedings of IEEE Infocomm, April 2003. [34] N. L. P. Jr., T. Fraser, J. Molina, and W. A. Arbaugh. Copilot - a Coprocessor-based Kernel Runtime Integrity Monitor. In Proceedings of the 13*^ USENIX Security Symposium, pages 179-194, August 2004. [35] J. Kay and J. Pasquale. The Importance of Non-Data Touching Processing Overheads in TCP/IP. In Proceedings ACM SIGCOMM Conference, pages 259-269, September 1993. [36] J. Kelsey, B. Schneier, D. Wagner, and C. Hall. Side Channel Cryptanalysis of Product Ciphers. Journal of Computer Security, 8(2-3):141-158, 2000. [37] S. Kent and R. Atkinson. Security Architecture for the Internet Protocol. Request for Comments (Proposed Standard) 2401, Internet Engineering Task Force, Nov. 1998. [38] A. D. Keromytis, J. L. Wright, and T. de Raadt. The Design of the OpenBSD Cryptographic Framework. In Proceedings of the USENIX Annual Technical Conference, pages 181-196, June 2003. [39] J. Kessenich, D. Baldwin, and R. Rost. The OpenGL Shading Language Version 1.10. h t t p : //www. opengl. org, April 2004. [40] P. Kocher. Timing Attacks on Implementations of Diffie-Hellman, RSA, DSS and Other Systems. In Proceedings of Advances in Cryptology - Crypto, pages 104—113, 1996. [41] D. KoUer, M. Turitzin, M. Levoy, M. Tarini, G. Croccia, P. Cignoni, and R. Scopigno. 
Protected Interactive 3D Graphics Via Remote Rendering. In Proceedings of ACM SIGGRAPH, 2004. [42] H. Kuo and I. Verbauwhede. Architectual Optimization for 1.82 Gbits/sec VLSI Implementation of Rijndael Algorithm. In Proceedings ofCHES, pages 51-64, 2001. [43] X. Lai and J. Massey. A Proposal for a New Block Encryption Standard. In Proceedings ofEUROCRYPT1990, pages 389-404, 1991. [44] E. Levy. Interface Illusions. IEEE Security & Privacy, 2(6):66-69, November/December 2004. [45] A. Lutz, J. Treichler, F. Gurkeynak, H. Kaeslin, G. Bosler, A. Erni, S. Reichmuth, P. Rommens, S. Oetiker, and W. Fichtner. 2G bits/s Hardware Realizations of Rijndael and Serpent: A Comparative Analysis. In Proceedings ofCHES, pages 144-158, 2002.
134
REFERENCES
[46] M. Abadi and M. Burrows and C. Kaufman and B. Lampson. Authentication and Delegation with Smart-cards. In Theoretical Aspects of Computer Software, 1991. [47] M. Macedonia. The GPU Enters Computing's Mainstream. IEEE Computer Magazine, pages 106-108, October 2003. [48] J. McCune, J. Perrig, and M. Reiter. Bump in Ether: Mobile Phones as Proxies for Sensitive Input. Computer Science Technical Report CyLab-05-007, Carnigie Mellon University, 2005. [49] J. P. McGregor and R. B. Lee. Protecting Cryptographic Keys and Computations via Virtual Secure Coprocessing. In Proceedings of the Workshop on Architectural Support for Security and Anti-virus (WASSA), pages 11-21, October 2004. [50] M. McLoone and J. McConny. High Performance Single Chip FPGA Rijndael Algorithms Implementations. In Proceedings ofCHES, pages 65-76, 2001. [51] Microsoft. Microsoft DirectX, default.aspx.
http://www.microsoft.com/windows/directx/
[52] Microsoft. Windows 9 Media Series Digital Rights Management. microsoft.com/windows/windowsmedia/drm.aspx.
http://www.
[53] S. Miltchev, S. Ioannidis, and A. D. Keromytis. A Study of the Relative Costs of Network Security Protocols. In Proceedings of USENIX Annual Technical Conference, Freenix Track, pages 41-48, June 2002.
[54] J. Nieh, S. J. Yang, and N. Novik. Measuring Thin-Client Performance Using Slow-Motion Benchmarking. ACM Transactions on Computer Systems (TOCS), 21(1):87-115, Feb. 2003.
[55] NIST. FIPS 46-3 Data Encryption Standard (DES), 1999.
[56] NIST. FIPS 197 Advanced Encryption Standard (AES), 2001.
[57] Nvidia. GPGPU Presentation, 2005.
[58] OpenGL Organization. OpenGL. http://www.opengl.org, 2005.
[59] GPGPU Organization. General Purpose Computation Using Graphics Hardware. http://www.gpgpu.org.
[60] D. Osvik, A. Shamir, and E. Tromer. Cache Attacks and Countermeasures: The Case of AES. In Proceedings of RSA Conference Cryptographers Track (CT-RSA), 2006.
[61] P. Rogaway. A Software Optimized Encryption Algorithm, pages 273-287, 1998.
[62] M. Pharr, editor. GPU Gems 2. Addison-Wesley, 2005.
[63] C. Pu, H. Massalin, J. Ioannidis, and P. Metzger. The Synthesis System. Computing Systems, 1(1), 1988.
[64] R. Iannella. Digital Rights Management (DRM) Architectures. D-Lib Magazine, 7(6), June 2001.
[65] V. Rijmen, A. Bosselaers, and P. Barreto. AES Optimized ANSI C Code. http://www.esat.kuleuven.ac.be/~rijmen/rijndael/rijndael-fst-3.0.zip, 2002.
[66] Rivest, Robshaw, Sidney, and Yin. RC6 Block Cipher. http://www.rsasecurity.com/rsalabs/node.asp?id=2512, 1998.
[67] R. Rivest. The RC5 Encryption Algorithm. CryptoBytes, 1(1), 1995.
[68] G. Rose. A Stream Cipher Based on Linear Feedback Over GF(2^8). In Information Security and Privacy, LNCS 1438, page 135ff, 1998.
[69] V. Roth, K. Richter, and R. Freidinger. A PIN-Entry Method Resilient Against Shoulder Surfing. In Proceedings of the 11th ACM Conference on Computer and Communications Security (CCS), pages 236-245, October 2004.
[70] RSA Laboratories. PKCS #1: RSA Encryption Standard, Version 1.5, November 1993.
[71] C. B. S. Traw and J. M. Smith. Hardware/Software Organization of a High-Performance ATM Host Interface. IEEE Journal on Selected Areas in Communications (Special Issue on High Speed Computer/Network Interfaces), 11(2):240-253, February 1993.
[72] R. Sailer, X. Zhang, T. Jaeger, and L. van Doorn. Design and Implementation of a TCG-based Integrity Measurement Architecture. In Proceedings of the 13th USENIX Security Symposium, pages 223-238, August 2004.
[73] S. Saroiu, S. D. Gribble, and H. M. Levy. Measurement and Analysis of Spyware in a University Environment. In Proceedings of the ACM/USENIX Symposium on Networked Systems Design and Implementation (NSDI), March 2004.
[74] B. K. Schmidt, M. S. Lam, and J. D. Northcutt. The Interactive Performance of SLIM: A Stateless, Thin-Client Architecture. In Proceedings of the 17th ACM Symposium on Operating Systems Principles (SOSP), pages 32-47, Kiawah Island Resort, SC, December 1999.
[75] M. Segal and K. Akeley. The OpenGL Graphics System: A Specification, Version 2.0. http://www.opengl.org, Silicon Graphics, Inc., October 2004.
[76] A. Shamir and E. Tromer. Acoustic Cryptanalysis: On Nosy People and Noisy Machines. Eurocrypt rump session presentation, 2004.
[77] M. Shirase and Y. Hibino. An architecture for elliptic curve cryptograph computation. In Proceedings of the Workshop on Architectural Support for Security and Anti-virus (WASSA), pages 120-129, October 2004.
[78] Simpson, Dawson, Golic, and Millar. LILI Keystream Generator. In Selected Areas in Cryptography, LNCS 2012, page 248ff, 2000.
[79] J. M. Smith and C. B. S. Traw. Giving Applications Access to Gb/s Networking. IEEE Network, 7(4):44-52, July 1993.
[80] J. M. Smith, C. B. S. Traw, and D. J. Farber. Cryptographic Support for a Gigabit Network. In Proceedings of INET, pages 229-237, June 1992.
[81] S. Smith. Magic Boxes and Boots: Security in Hardware. IEEE Computer, 37(10):106-109, October 2004.
[82] C. Thompson, S. Hahn, and M. Oskin. Using Modern Graphics Architectures for General-Purpose Computing: A Framework and Analysis. In 35th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-35), pages 306-317, 2002.
[83] J. Thorpe and P. C. van Oorschot. Graphical Dictionaries and the Memorable Space of Graphical Passwords. In Proceedings of the 13th USENIX Security Symposium, pages 135-150, August 2004.
[84] Trusted Computing Group. TCG Specification Architecture Overview, version 1.2. https://www.trustedcomputinggroup.org/home, April 2004.
[85] J. Tygar and B. Yee. DYAD: A System for Using Physically Secure Coprocessors. Technical Report CMU-CS-91-140R, Carnegie Mellon University, May 1991.
[86] Veritest. i-Bench version 1.5, Ziff-Davis, Inc., 2004. http://www.veritest.com/benchmarks/i-bench/.
[87] T. J. Walsh and D. R. Kuhn. Challenges in Securing Voice over IP. IEEE Security & Privacy Magazine, 3(3):44-49, May/June 2005.
[88] S. Wasson. NVIDIA's GeForce 7800 GTX graphics processor. The Tech Report, http://techreport.com, June 2005.
[89] M. Woo, J. Neider, T. Davis, and D. Shreiner. The OpenGL Programming Guide, 3rd edition. Addison-Wesley, 1999.
[90] Z. Ye, S. Smith, and D. Anthony. Trusted Paths for Browsers. ACM Transactions on Information and System Security (TISSEC), 8(2):153-186, May 2005.
[91] B. Yee. Using Secure Coprocessors. PhD thesis, Carnegie Mellon University, 1994.
[92] Q. Yu, C. Chen, and Z. Pan. Parallel Genetic Algorithms in Programmable Graphics Hardware. In Proceedings of ICNC, pages 1051-1059, 2005.
About the Authors
Debra Cook is a Ph.D. student with the Department of Computer Science at Columbia University in New York. She is completing her doctorate in 2006. Her research interests focus on applied cryptography and security. She has a B.S. and M.S.E. in mathematical sciences from the Johns Hopkins University in Baltimore, Maryland and an M.S. in computer science from Columbia University. After graduating from Johns Hopkins, she was a senior technical staff member at Bell Labs and AT&T Labs before pursuing her Ph.D.

Angelos Keromytis is an Associate Professor of Computer Science at Columbia University in New York. His research interests include design and analysis of network and cryptographic protocols, software security and reliability, and operating system design. He received his Ph.D. and M.Sc. in computer science from the University of Pennsylvania, Philadelphia, PA in 2001. He received his B.S. in computer science from the University of Crete, Heraclion, Greece in 1996.
Index
AES, 27, 34, 39-41, 48-64, 105
  experiments, 58-64
  key schedule, 52
  OpenGL code, 107-129
  OpenGL implementation, 53-58
asymmetric key ciphers, 38
block ciphers, 24, 34, 40, 82, 99
BrookGPU, 17
Cg, 17
cryptographic accelerators, 25
data compression and CPUs, 97
DES, 34, 42
differential fault analysis, 33-35
Diffie-Hellman, 39
digital rights management, 29
digital signal processors, 101, 106
Direct3D, 17
elliptical curve cryptography, 39
GLUT, 18, 58, 62, 82
GLX, 62
GPU, 9-24
  APIs, 17
  architecture, 10
  pixel processor, 10
  vertex processor, 10
GPUs and general purpose programming, 15, 23
graphical keypad, 90, 106
graphics based stream cipher, 99, 105
keying of GPUs, 69, 90
  experiments, 82
  remote keying protocol, 75
MAC, 82
malware, 28, 30, 32, 42, 69, 90, 93
man-in-the-middle attack, 93
modes of encryption, 45-48, 105
OpenGL, 12, 16-22, 48, 78
phishing, 28, 94
pixel processing, 10, 12, 15, 19-22
projects, 105
RC4, 43, 79, 81
RC6, 42
remotely keyed CryptoGraphics, 69
RSA, 39, 80
side channel attacks, 33-35
spyware, 28, 30, 32, 37, 71, 87, 96, 97
stream ciphers, 40, 44
  experiments, 64-67
symmetric key ciphers, 40
thin-clients, 28, 69, 83
Trusted Computing Group, 29, 95
trusted platform module, 30, 95
untrusted clients, 69
user input - protecting, 89
vertex processing, 10, 13, 22
Vertigo, 17
video conferencing, 28, 69, 83
window toolkits - wrappers for APIs, 19