CS 111, Fall 2005

Lecture 19: Security II

Buffer Overflow


Description


One rich source of attacks stems from the fact that virtually all operating systems and most systems programs are written in the C programming language. Unfortunately, C does not require array bounds checking, and C compilers generally do not insert it. Consequently, if a program fails to ensure that the length of the data entered is no larger than the buffer allocated for its storage, any overflow data will simply be written over whatever happens to come after the buffer. The following function is one example of such a program.


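#include <stdio.h>

void f(void)
{
	char buf[1024];
	char *p = buf;
	int c;

	/* BUG (intentional): nothing stops p from running past buf + 1024. */
	while ((c = getchar()) != EOF)
		*p++ = (char) c;
}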

      

This property of C leads to attacks of the following kind. In Fig. 1(a), we see the main program running, with its local variables on the stack. At some point it calls a procedure A, as shown in Fig. 1(b). The standard calling sequence starts out by pushing the return address onto the stack. It then transfers control to A, which decrements the stack pointer to allocate storage for its local variables. Suppose A has a fixed-size buffer B of 1024 bytes. If the amount of data provided by the user of the program is more than 1024 bytes, a buffer overflow occurs and memory is overwritten, as shown in the gray area of Fig. 1(c). Worse yet, if the data is large enough, it also overwrites the return address. Now suppose the buffer contains a malicious program, and the layout has been made very, very carefully so that the word overlaying the return address just happens to be the address of the start of that program. Then, when A returns, the program now in B will start executing. In effect, the attacker has inserted code into the program and gotten it executed.
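The overwrite described above can be imitated in a small simulation. This is only a sketch: it models A's stack frame as a Python bytearray holding the buffer followed by a 4-byte saved return address, and the buffer size, addresses, and helper names are all made up for illustration. Real stack frames contain more than this.

```python
import struct

BUF_SIZE = 8  # stand-in for the 1024-byte buffer in A (kept small for display)


def make_frame(return_address):
    """Model A's frame: the buffer, then the saved return address."""
    return bytearray(BUF_SIZE) + bytearray(struct.pack("<I", return_address))


def unchecked_copy(frame, data):
    """Copy input the way f() does: no comparison against BUF_SIZE."""
    for i, byte in enumerate(data):
        frame[i] = byte  # indices >= BUF_SIZE land on the return address


def saved_return_address(frame):
    """Read back the word that A would jump to when it returns."""
    return struct.unpack_from("<I", frame, BUF_SIZE)[0]


frame = make_frame(0x08048000)  # hypothetical legitimate return address
payload = b"A" * BUF_SIZE + struct.pack("<I", 0xDEADBEEF)  # attacker's input
unchecked_copy(frame, payload)
print(hex(saved_return_address(frame)))
```

The last word of the payload plays the role of the carefully chosen address: after the copy, the saved return address is whatever the attacker supplied.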


Avoiding Buffer Overflow

  • Do not write outside bounds of any object.
  • Generally, the buffer overflow problem is caused by careless programming. Therefore, one solution is simply to check the length of the input data and ensure that it is no larger than the allocated buffer. Another choice is to use a programming language that does not allow the programmer to write a buffer overflow bug, although doing so does not necessarily prevent buffer overflows. For example, Java, C#, and Perl have mechanisms that make buffer overflows difficult, but they all rely on many libraries written in C.

  • Remove the predictability on which the attack depends.
  • For the attack to succeed, the word overwriting the return address has to be the starting address of the malicious program. The attacker therefore has to know the position of the stack pointer beforehand in order to lay out the input precisely. Another solution is thus to randomize the stack pointer, thereby minimizing the exploitable invariants.
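The first remedy, checking the input length before copying, might look like the following sketch. Python already checks bounds on its own, so the explicit test here only illustrates the comparison a C programmer would have to write by hand; the function name and the choice to reject oversized input (rather than truncate it) are assumptions.

```python
BUF_SIZE = 1024


def checked_copy(buf, data):
    """Copy data into buf only if it fits; otherwise refuse the input."""
    if len(data) > len(buf):
        raise ValueError("input is larger than the allocated buffer")
    buf[:len(data)] = data  # same-length slice assignment, buf keeps its size
    return len(data)


buf = bytearray(BUF_SIZE)
copied = checked_copy(buf, b"x" * BUF_SIZE)  # exactly fills the buffer
```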



    SQL Injection


    SQL injection is a security vulnerability that occurs in the database layer of an application. It is best explained with an example. Suppose that, on a database-backed web site, a user enters a username on a form and the server replies with that user's address. Assume the following code is embedded in the server application.

    string username = dom.getForm("username");
    string query = "SELECT address FROM usertable WHERE uname = '" + username + "';";
    database.execute(query);

    If the user types "eddie'; SELECT bankaccount FROM usertable WHERE uname = 'ahnuld" as the username (note that the final quote is left off, because the code supplies it), the following SQL would be built by the code above:

    SELECT address FROM usertable WHERE uname = 'eddie'; SELECT bankaccount FROM usertable WHERE uname = 'ahnuld';

    When sent to a database that accepts several statements in one request, both statements would be executed and Arnold's bank account would be sent to the user. I am sure our governor wouldn't be happy about it.
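A runnable sketch of the same problem, using Python's sqlite3 module and an in-memory table with made-up rows. One deviation from the example above is an assumption of this sketch: sqlite3's execute() refuses to run two statements at once, so the attack string uses UNION to smuggle in the second SELECT instead of a semicolon. The fix shown, a parameterized query, passes the username as data so that it is never parsed as SQL.

```python
import sqlite3


def make_db():
    """Build a throwaway usertable with two made-up users."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE usertable (uname TEXT, address TEXT, bankaccount TEXT)")
    db.executemany(
        "INSERT INTO usertable VALUES (?, ?, ?)",
        [("eddie", "123 Example St", "ACCT-1111"),
         ("ahnuld", "456 Sample Ave", "ACCT-2222")],
    )
    return db


def lookup_unsafe(db, username):
    """Vulnerable: the username is pasted into the SQL text."""
    query = "SELECT address FROM usertable WHERE uname = '" + username + "'"
    return [row[0] for row in db.execute(query)]


def lookup_safe(db, username):
    """Parameterized: the username is bound as a value, not parsed as SQL."""
    query = "SELECT address FROM usertable WHERE uname = ?"
    return [row[0] for row in db.execute(query, (username,))]


db = make_db()
attack = "eddie' UNION SELECT bankaccount FROM usertable WHERE uname = 'ahnuld"
print(lookup_unsafe(db, attack))  # includes ahnuld's bank account
print(lookup_safe(db, attack))    # no user has that literal name
```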



    UNIX Access Control Model


    User Identifier (UID)

  • Set of users identified by a UID
  • Root user has a UID of 0

    Group Identifier (GID)

  • Set of groups identified by a GID
  • Each group contains a set of UIDs

    Each process has a UID and a GID, which are inherited by its children. Only a process running as root can change its UID.

    Each file in the file system has a UID and a GID.

    Permissions

    Permissions tell the user and the group what they can do and not do.

  • Can the owner (same UID) read, write, or execute?
  • Can the group (same GID) read, write, or execute?
  • Can anyone else read, write, or execute?
    In Unix, when you type the command "ls -l", you will see the permissions on each file.

    cs111> ls -l
    drwxrw-r-- 1 username groupname 100 Dec 25 00:00 filename

    The first character indicates the type of file (d for a directory, - for a regular file, and other letters for special files such as devices and sockets). The next three characters ("rwx") describe the permissions of the owner of the file. In this case, the owner can read, write, and execute. The next three characters ("rw-") describe the permissions for those in the same group. The last three characters ("r--") describe the permissions for all others.
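As a sketch of how the ten characters break down, the following Python splits a mode string into the file type and the three permission triples. The describe() helper is hypothetical; stat.filemode() is the standard-library function that renders numeric mode bits in ls -l style, used here to build the example string from the octal mode 764 on a directory.

```python
import stat


def describe(mode_string):
    """Split an ls -l mode string such as 'drwxrw-r--' into its parts."""
    kinds = {"d": "directory", "-": "regular file"}
    names = {"r": "read", "w": "write", "x": "execute"}

    def rights(triple):
        # Keep only the plain r/w/x letters; '-' means the right is absent.
        return [names[c] for c in triple if c in names]

    return {
        "type": kinds.get(mode_string[0], "other"),
        "owner": rights(mode_string[1:4]),
        "group": rights(mode_string[4:7]),
        "others": rights(mode_string[7:10]),
    }


mode_string = stat.filemode(stat.S_IFDIR | 0o764)  # 7 = rwx, 6 = rw-, 4 = r--
print(mode_string)
print(describe(mode_string))
```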

    Access Matrix

    An important question is how the system keeps track of which principal may do what to which object. Conceptually, one can envision a large matrix, with the rows being principals and the columns being objects. Each box lists the access rights that the principal holds for the object. The following matrix is one example.



                   Objects
    Principals     /home/eddie       /home/eddie/grades   /home/bob/file    /etc/motd
    Eddie          read, write, -    read, write          -, -, -           read, -, -
    Bob            read, write, -    -, -, -              read, write, -    read, -, -
    Lucifer        -, -, -           -, -, -              -, -, -           read, -, -
    Mike/Chris     -, -, -           -, -, -              -, -, -           read, -, -

    One potential problem with this particular access matrix is that although Bob does not have permission to access the grades file, he does have write access to the directory that contains it. Since removing a file is a directory operation and requires only directory privileges, Bob can delete the grades file and create a new one in its place. If he knows the format of the file, he can modify the grades without the professor knowing it.
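The matrix can be written down directly as a nested dictionary, which makes Bob's situation easy to check. This is a minimal sketch: the dictionary layout and the allowed() helper are illustrative choices, and empty boxes are simply left out.

```python
# Rows are principals, columns are objects, each box is a set of rights.
ACCESS_MATRIX = {
    "Eddie": {
        "/home/eddie": {"read", "write"},
        "/home/eddie/grades": {"read", "write"},
        "/etc/motd": {"read"},
    },
    "Bob": {
        "/home/eddie": {"read", "write"},
        "/home/bob/file": {"read", "write"},
        "/etc/motd": {"read"},
    },
    "Lucifer": {"/etc/motd": {"read"}},
    "Mike/Chris": {"/etc/motd": {"read"}},
}


def allowed(principal, obj, right):
    """Look up one box of the matrix; a missing box means no rights."""
    return right in ACCESS_MATRIX.get(principal, {}).get(obj, set())


# Bob cannot touch the grades file itself...
print(allowed("Bob", "/home/eddie/grades", "write"))
# ...but he can write the directory that holds it.
print(allowed("Bob", "/home/eddie", "write"))
```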

    The Login Program

    The login program is responsible for authenticating a user and granting him or her access to a machine. On Linux, the login program works by first searching the /etc/passwd file for the particular user's password hash (modern systems keep the hashes in a separate file, /etc/shadow). It then computes the hash of the password the user entered and compares this value to the hash stored in the password file. If the two hashed values match, it can then fork off the user's preferred login shell. Because this new login shell must run as though it were a process owned by the authenticated user, the login program must change the new process's ownership field (its UID). This tells us that the login program must be a privileged application run as root, because only root has the ability to change process ownership.

    Implementing An Access Matrix

    Access matrices can be implemented in two different ways:

    Access Control Lists

    With access control lists, each object lists all of its associated access rights. Going back to the access matrix, this means that access is defined by grouping the matrix by columns. As an example, UNIX file permissions are a form of access control list, because each file in the system defines three access rights: read, write, and execute. In order to associate these access rights with principals, each file on the system is owned by a certain user and a certain group. Given these ownerships, every file on UNIX defines access rights for three principals: the user that owns the file, the group that owns the file, and everyone else.

  • The clear advantage of access control lists is that rights and policies are clearly defined and very easy to track.
  • The big disadvantage of this approach is that transferring rights to others is difficult. For example, in order for Professor Kohler to give Mike and Chris access to the grades file, he must either let them use his login to access the file or define a special group associated with the grades file.

    Capabilities

    With capabilities, each principal has a list of the access rights it is allowed to exercise. Thus, in this scenario the access matrix is grouped by rows. A good example of capabilities is file descriptors, because they are objects that belong directly to a principal (in this case, a process acting on behalf of a user). The pros and cons of capabilities are precisely the opposite of those for access control lists:

  • It is easy to transfer rights and policies. For example, when a process forks a child process, the child immediately has access to all the file descriptors the parent had.
  • It is hard to understand policies. Going back to the file descriptor example, it is hard to understand what particular access rights a file descriptor has as well as what kind of object it acts upon.
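The two implementations are two ways of slicing the same matrix. The sketch below, with a cut-down matrix and hypothetical helper names, derives a per-object access control list (one column each) and a per-principal capability list (one row each).

```python
MATRIX = {
    "Eddie": {"/home/eddie/grades": {"read", "write"}, "/etc/motd": {"read"}},
    "Bob": {"/home/bob/file": {"read", "write"}, "/etc/motd": {"read"}},
}


def access_control_lists(matrix):
    """Group by columns: each object lists who may do what to it."""
    acls = {}
    for principal, row in matrix.items():
        for obj, rights in row.items():
            acls.setdefault(obj, {})[principal] = rights
    return acls


def capability_lists(matrix):
    """Group by rows: each principal carries (object, rights) tickets."""
    return {
        principal: sorted((obj, sorted(rights)) for obj, rights in row.items())
        for principal, row in matrix.items()
    }


print(access_control_lists(MATRIX)["/etc/motd"])  # everyone who can use motd
print(capability_lists(MATRIX)["Bob"])            # everything Bob can use
```

Answering "who can read /etc/motd?" is a single lookup in the first structure but a scan over every principal in the second, which mirrors the trade-off in the bullets above.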

    Cryptography

    Cryptography is used in order to ensure that a secret is kept a secret. Here is some basic terminology associated with cryptography:

    Encryption - a process used in order to keep the contents of a document secret from attackers.

    Authentication - verifying that a party in a message exchange knows some secret, hence verifying their identity.

    The Login Program: As mentioned, the login program is a process running as root that starts up a user shell based on the user who is trying to gain access to the machine. The problem is how to verify that the person who typed in a username is in fact that user.

    Authentication Problem

    To solve this problem, we must expect the user to know a secret. Then we can check the user’s secret against a version known to the system, and thus decide whether to grant this user access to the system.

    On UNIX systems, these secrets along with other user related data are stored in the /etc/passwd file:

    /etc/passwd

    eddie:radprofessorguy
    lucifer:cold

    The problem is that we don’t want to store these passwords in plaintext on the disk, because anyone with sufficient privileges on the system could read them. Instead we want a function H that converts arbitrarily long messages into fixed-size hashes, such that the original message is difficult to recover from the hash value. The hash of the password is then stored on the disk. Each time a user is authenticated, the login program computes the hash of the password just entered and compares it with the version on the disk. Reading the file no longer reveals the password directly, because the hash function is very difficult to invert. Better still is a cryptographic hash, for which, given the hash H(x) of a password x, it is also difficult to find any input y such that:

    H(y) = H(x)

    (NOTE: Here difficult means exponential time.)
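A minimal sketch of the hashed check, using SHA-256 from Python's hashlib as the function H (the notes do not name a hash, so that choice is an assumption). The table reuses the two sample entries from /etc/passwd above; real systems additionally salt the hash, which is left out here to stay close to the description.

```python
import hashlib
import hmac


def H(password):
    """Stand-in for the hash function H: SHA-256 over the UTF-8 bytes."""
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


# What the password file would hold: hashes, never the passwords themselves.
PASSWD = {
    "eddie": H("radprofessorguy"),
    "lucifer": H("cold"),
}


def authenticate(user, typed_password):
    """Hash what was typed and compare it with the stored hash."""
    stored = PASSWD.get(user)
    if stored is None:
        return False
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(stored, H(typed_password))


print(authenticate("eddie", "radprofessorguy"))
print(authenticate("eddie", "cold"))
```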

    Users need to be very careful when choosing passwords. It has been found that users often choose passwords badly, and are subject to what is known as a dictionary attack. A dictionary attack is the general technique of trying to guess some secret by running through a list of likely possibilities. This list of possibilities is often a list of words from a dictionary. The attack works because users often choose easy-to-guess passwords! Go figure!
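A dictionary attack needs nothing more than the stored hash and a word list, as this sketch shows. SHA-256 again stands in for H, and the three-word list is a made-up stand-in for a real dictionary.

```python
import hashlib


def H(password):
    """Stand-in for the hash function H (an assumption: SHA-256)."""
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


def dictionary_attack(target_hash, candidates):
    """Hash each likely password and return the first one that matches."""
    for word in candidates:
        if H(word) == target_hash:
            return word
    return None  # the password was not in the list


wordlist = ["hot", "warm", "cold"]
print(dictionary_attack(H("cold"), wordlist))             # a dictionary word
print(dictionary_attack(H("radprofessorguy"), wordlist))  # not in the list
```

The hash never has to be inverted: the attacker only runs H forward on guesses, which is why a hard-to-invert hash does not protect a badly chosen password.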

    Example: Network Login

    Let’s say a user named Betty wants to log in to a server. How should the server authenticate Betty’s identity? A naive approach would simply be to send the server a hashed password:

    Betty -> server: H(password)

    This is a very simple and bad protocol design. Remember that any message sent over a network can be intercepted by an attacker. In this protocol, the attacker does not need to know Betty’s password. The attacker simply needs to know the hash of the password, since the server only checks the hashed password! This is an example of a replay attack. A replay attack is a form of network attack in which an attacker intercepts and records messages and sends them out at a later time. The receiver unknowingly thinks the bogus traffic is legitimate and authentic. Note that the attacker can benefit without having to understand the message contents.
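The replay can be sketched in a few lines. The Server class, the recorded-traffic list, and Betty's password are all hypothetical; SHA-256 stands in for H.

```python
import hashlib


def H(password):
    """Stand-in for the hash function H (an assumption: SHA-256)."""
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


class Server:
    """Accepts a login if the message equals the stored hash."""

    def __init__(self):
        self.stored = {"betty": H("bettys-password")}

    def login(self, user, message):
        return self.stored.get(user) == message


server = Server()
recorded = []  # what an eavesdropper on the network captures

# Betty -> server: H(password)
message = H("bettys-password")
recorded.append(message)
betty_ok = server.login("betty", message)

# Later, the attacker resends the captured bytes without knowing the password.
attacker_ok = server.login("betty", recorded[0])
print(betty_ok, attacker_ok)
```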

    Another approach would be to use some form of key exchange. For instance:

    Betty -> server: Hi

    server -> Betty: KBS

    Betty -> server: {H(password)}KBS


    Here the message is within the curly braces and is encrypted with the key KBS. This, however, is not a good protocol either, because the key itself is still transferred over the network in the clear, and the attacker can intercept it. A better way would be to implement the following protocol.

    Key Distribution Protocol

    In this protocol, Betty and the server agree on a key without transferring that key in plain text. Let’s define a couple of terms:

    Plaintext – message data.

    Cipher Text – encrypted version of the plaintext; only those who possess the secret can read it.

    Plaintext and cipher text are related as follows:

    encrypt(plaintext) -> cipher text

    decrypt(cipher text) -> plaintext

    Symmetric-Key Cryptography

    This was the cryptographic method of choice for a long time. It relies on a key distribution protocol in which the encryptor (Betty) and the decryptor (server) agree to use the same key (K):

    encrypt (P, K) = C

    decrypt (C, K) = P

    The advantage of this method is that it is fast. The disadvantage is that it is difficult to set up the shared key over an insecure channel.
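The defining property, the same K for both directions, can be seen in a toy cipher. The sketch below XORs the message with a random key of the same length (a one-time pad); it is chosen only because it fits in a few lines, not because it is what real systems use, and the function names simply mirror the notation above.

```python
import secrets


def encrypt(plaintext, key):
    """encrypt(P, K) = C: XOR each plaintext byte with a key byte."""
    if len(key) < len(plaintext):
        raise ValueError("this toy cipher needs a key as long as the message")
    return bytes(p ^ k for p, k in zip(plaintext, key))


def decrypt(ciphertext, key):
    """decrypt(C, K) = P: XOR with the same key undoes the encryption."""
    return bytes(c ^ k for c, k in zip(ciphertext, key))


P = b"meet at noon"
K = secrets.token_bytes(len(P))  # both parties must somehow share this
C = encrypt(P, K)
print(decrypt(C, K) == P)
```

The hard part is not in the code at all: K has to reach both Betty and the server without the attacker seeing it.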

    Asymmetric-Key Cryptography (Public Key Cryptography)

    This is essentially the same as symmetric-key cryptography, except that encrypt and decrypt use different keys:

    encrypt (P, Ke) = C

    decrypt (C, Kd) = P

    It is very difficult in this case to figure out Kd given Ke. Because of this, it is perfectly fine to make Ke public. It can be published.
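A toy instance of the two-key idea, using textbook RSA with very small primes. RSA is one possible public-key scheme and is an assumption here, since the notes do not name one; the primes are far too small to be secure, and real RSA also pads messages. The three-argument pow() with exponent -1 (modular inverse) needs Python 3.8 or later.

```python
# Toy key generation with tiny primes (illustrative values only).
p, q = 61, 53
n = p * q                  # public modulus
phi = (p - 1) * (q - 1)
ke = 17                    # public exponent, plays the role of Ke
kd = pow(ke, -1, phi)      # private exponent, plays the role of Kd


def encrypt(P, key, modulus):
    """encrypt(P, Ke) = C, with messages encoded as integers below n."""
    return pow(P, key, modulus)


def decrypt(C, key, modulus):
    """decrypt(C, Kd) = P."""
    return pow(C, key, modulus)


P = 65
C = encrypt(P, ke, n)
print(decrypt(C, kd, n) == P)
```

Publishing (ke, n) is harmless only as long as recovering kd from them is hard; with primes this small it is trivial, which is why real keys are thousands of bits long.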

    Example: Public Key Remote Login

    Is the following a safe protocol?

    Betty -> server: Hi

    Server -> Betty: Kse

    Betty -> server: {H(password)} Kse


    No, it is not! You should convince yourself that this protocol is vulnerable to the replay attack mentioned above. This can be solved by using a random number that is used exactly once. Let’s call this number NONCE.

    Betty -> server: Hi

    Server -> Betty: Kse, NONCE

    Betty -> server: {H(password), NONCE} Kse

    Do you see why this solves the problem? The attacker gains nothing by intercepting the message and replaying it, because the server will generate a different NONCE when the attacker tries to log in. This solves the replay attack problem, but the protocol is vulnerable to another form of attack called a man-in-the-middle attack. This is an attack in which an attacker is able to read, insert, and modify at will messages exchanged between two parties, without either party ever realizing that the link between them has been compromised. For example:
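The effect of the nonce on replays can be sketched as follows. One substitution is made and should be kept in mind: Python's standard library has no public-key encryption, so instead of encrypting {H(password), NONCE} under Kse, this sketch has Betty return an HMAC of the nonce keyed with H(password). That is a stand-in that also binds the reply to the nonce, not the protocol from the notes. The class and function names are hypothetical.

```python
import hashlib
import hmac
import secrets


def H(password):
    """Stand-in for the hash function H (an assumption: SHA-256)."""
    return hashlib.sha256(password.encode("utf-8")).digest()


class Server:
    def __init__(self, password_hash):
        self.password_hash = password_hash
        self.outstanding = set()  # nonces issued but not yet used

    def challenge(self):
        """Server -> Betty: a fresh random NONCE."""
        nonce = secrets.token_bytes(16)
        self.outstanding.add(nonce)
        return nonce

    def verify(self, nonce, response):
        """Accept a reply only for a nonce issued here and not yet used."""
        if nonce not in self.outstanding:
            return False
        self.outstanding.discard(nonce)  # each nonce works exactly once
        expected = hmac.new(self.password_hash, nonce, hashlib.sha256).digest()
        return hmac.compare_digest(expected, response)


def respond(password, nonce):
    """Betty -> server: a reply that depends on both password and NONCE."""
    return hmac.new(H(password), nonce, hashlib.sha256).digest()


server = Server(H("bettys-password"))
nonce = server.challenge()
reply = respond("bettys-password", nonce)
first_try = server.verify(nonce, reply)            # Betty's real login
same_replay = server.verify(nonce, reply)          # attacker resends both
stale_reply = server.verify(server.challenge(), reply)  # old reply, new nonce
print(first_try, same_replay, stale_reply)
```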

    Betty -> server: Hi

    (Say attacker, E, intercepts the message at this point.)

                E -> server: Hi

                server -> E: Kse, NONCE

    (E keeps Kse and passes its own public key, Kee, on to Betty. Betty believes she is talking to the server.)

    E -> Betty: Kee, NONCE

    Betty -> E: {H(password), NONCE} Kee

    The attacker in this case can decrypt the message. How can we avoid such attacks? We need a protocol that avoids sending unauthenticated public keys over the network! Such a protocol would require compiling trusted public keys (e.g. Kse) into every computer in the world. Perhaps you have seen or are already familiar with this: it is roughly how SSL establishes trust, with a set of trusted public keys shipped with the client software. The difference in the message exchange would then be something like:

    Betty -> server: Hi

    Server -> Betty: {NONCE} Kbe

    Betty -> server: NONCE


    Here Kbe is Betty’s public key. The last message proves Betty’s identity to the server, because only Betty, who holds the matching private key, could have decrypted the server’s message to recover NONCE. A variant of this is for the server to send its public key as the message:

    Betty -> server: Hi

    Server -> Betty: { Kse } Kbe

    Betty -> server: NONCE