UID and GID generation

Discussion:

Martin Bammer

2016-08-12 14:29:54 UTC

Hi,

I've got an issue with the generation of UIDs and GIDs when new users are added. By default UIDs and GIDs for users and user groups are values starting from 1000 (on Red Hat from 500). When a user is added the next free value is chosen.
The issue arises when the same user names are added on different machines in a different order. A very common example is a family where each family member has their own computer. So for example on computer A the users are added in the order john, mary, dave. On computer B the order is mary, dave, john.
Now John buys an external drive for backups and data sharing and formats it with ext4. Then John copies several private files to the external drive. Then Mary wants to do the same on her computer, but when she connects the external drive she can see John's files with user and group mary and she has full access to these files. A very bad design issue!
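The mismatch described above can be reproduced with a small simulation. This is only a sketch: allocate_sequential is a hypothetical stand-in for the "next free value" behaviour and is not the actual adduser code; the starting value 1000 is the default mentioned above.

```python
def allocate_sequential(names, first_uid=1000):
    """Assign each name the next free UID, in creation order (hypothetical helper)."""
    return {name: first_uid + i for i, name in enumerate(names)}

machine_a = allocate_sequential(["john", "mary", "dave"])
machine_b = allocate_sequential(["mary", "dave", "john"])

# Reverse lookup: which name does machine B show for a given numeric UID?
owner_on_b = {uid: name for name, uid in machine_b.items()}

johns_uid = machine_a["john"]            # numeric UID stored on the ext4 drive
print(johns_uid, owner_on_b[johns_uid])  # prints: 1000 mary
```

The file system stores only the number, so the name shown depends entirely on the machine that mounts the drive.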
So my suggestion would be to change the default behavior of UID and GID generation to hash value calculation. Hash values are computed from the user and group names as 32-bit values on Debian (31-bit on Red Hat). The minimum and maximum values should be configurable.
Here is an example of how UIDs and GIDs can be generated:

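import hashlib

def generate_uid(username, lower_limit, upper_limit, used_uids):
    # sha256 is only a placeholder; the choice of hash function is left open.
    uid_gen = hashlib.sha256()
    while True:
        uid_gen.update(username.encode("utf-8"))
        # Use the first 4 bytes of the digest as a 32-bit candidate.
        uid = int.from_bytes(uid_gen.digest()[:4], "big")
        # We have to check here if the uid is in the valid range and if no other user already has this uid.
        if lower_limit <= uid <= upper_limit and uid not in used_uids:
            return uid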

If we have no conflicts with other users before the while loop then the UIDs should be the same on all computers independent of the generation order.
IMHO the current implementation is a design bug which must be fixed.

Regards, Martin

Dimitris Kazantzas

2016-08-12 15:25:22 UTC

This is a nice idea. I think it is somewhat important that this design issue is fixed.

Philipp Kern

2016-08-12 17:37:43 UTC

Post by Martin Bammer
The issue arises when the same user names are added on different
machines in a different order. A very common example is a family where
each family member has their own computer. So for example on computer A
the users are added in the order john, mary, dave. On computer B mary,
dave, john.
Now John buys an external drive for backups and data sharing and
formats it with ext4. Then John copies several private files to the
external drive. Then Mary wants to do the same on her computer, but
when she connects the external drive she can see John's files with
user and group mary and she has full access to these files. A very bad
design issue!

I waited for you to complain that this is not the case and that files
can't be accessed, but you did it the other way around and complain that
they can be. If you want to keep files private on external drives (or
drives in general), you use encryption. POSIX file permissions and ACLs
do not help you there as anyone with root (say, on their personal device
like a laptop) can just look at all of the files anyway. That assumption
is as true on Windows with NTFS, by the way (unless you use EFS, which
people generally don't).

Post by Martin Bammer
So my suggestion would be to change the default behavior of UID and
GID generation to hash value calculation.

I think that's a terrible idea. It does not solve the problem you are
trying to solve and it creates even more of a mess with user and group
IDs.

Kind regards
Philipp Kern

Adam Borowski

2016-08-13 04:03:30 UTC

Post by Philipp Kern
I waited for you to complain that this is not the case and that files can't
be accessed, but you did it the other way around and complain that they can
be. If you want to keep files private on external drives (or drives in
general), you use encryption. POSIX file permissions and ACLs do not help
you there as anyone with root (say, on their personal device like a laptop)
can just look at all of the files anyway. That assumption is as true on
Windows with NTFS, by the way (unless you use EFS, which people generally
don't).

Actually, I do like this idea; obviously with reasoning contrary to the
original report. In any small organization or a family, where you have an
ad hoc set of machines without centralized user management, it is nice to
have consistent uids.

This helps with cases like moving a disk around. Or, most of the times you
need rsync --numeric-ids (or it will cause irreversible metadata loss),
except for the times when you need the opposite. With uids being the same
no matter which of you originally set that box up, this problem is avoided.

Obviously, it doesn't scale well past a handful of users, but by then anyone
sane will keep things organized in some way.

Post by Martin Bammer
So my suggestion would be to change the default behavior of UID and GID
generation to hash value calculation. Hash values are computed from the
user and group names as 32bit values on Debian (31bit on Red Hat). The
minimum and maximum values should be configurable.

I'd make the hash 16 bits rather than 32, to make it possible to copy them
by hand without resorting to copy&paste or a piece of paper. A compromise
between hash collisions and readable numbers. And there's no security loss,
as there's no security beyond physical possession when moving an unencrypted
disk -- or when mounting an untrusted filesystem that's not something dead
simple like FAT.
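
The trade-off between 16-bit and 32-bit hashes can be estimated with the usual birthday approximation. The sketch below assumes an ideal, uniformly distributed hash and ignores the part of the range reserved for system IDs; collision_probability is a hypothetical helper name.

```python
import math

def collision_probability(n_users, hash_bits):
    """Approximate chance that at least two of n_users share a hash value."""
    space = 2 ** hash_bits
    return 1.0 - math.exp(-n_users * (n_users - 1) / (2.0 * space))

for n in (5, 50, 300):
    p16 = collision_probability(n, 16)
    p32 = collision_probability(n, 32)
    print(f"{n:4d} users: 16-bit {p16:.6f}, 32-bit {p32:.9f}")
```

Under these assumptions a handful of users is very unlikely to collide even with 16 bits, while a few hundred users already reach roughly even odds of at least one collision.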

--
An imaginary friend squared is a real enemy.

Martin Bammer

2016-08-13 09:23:03 UTC

Post by Adam Borowski
Actually, I do like this idea; obviously with reasoning contrary to the
original report. In any small organization or a family, where you have an
ad hoc set of machines without centralized user management, it is nice to
have consistent uids.
This helps with cases like moving a disk around. Or, most of the times you
need rsync --numeric-ids (or it will cause irreversible metadata loss),
except for the times when you need the opposite. With uids being the same
no matter which of you originally set that box up, this problem is avoided.
Obviously, it doesn't scale well past a handful of users, but by then anyone
sane will keep things organized in some way.

You always have the option to define a UID or GID manually when creating a new
user or group. In addition the old algorithm should be kept so the admin has
the option to choose how UIDs and GIDs are generated. The algorithm I've put
in my first post is only a first suggestion. I'm sure this algorithm can be
improved.

Post by Adam Borowski
I'd make the hash 16 bits rather than 32, to make it possible to copy them
by hand without resorting to copy&paste or a piece of paper. A compromise
between hash collisions and readable numbers. And there's no security loss,
as there's no security beyond physical possession when moving an
unencrypted disk -- or when mounting an untrusted filesystem that's not
something dead simple like FAT.

The value range of generated UIDs and GIDs should be configurable. So if
someone wants only small values he can configure this.

What would also be needed if hash IDs were implemented is a migration tool
which helps to migrate the UIDs and GIDs of all files in a file tree which
have a specific UID or GID.
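
Such a migration tool could look roughly like the following. This is a minimal sketch, not an existing utility: remap_ownership and its parameters are hypothetical names, it defaults to a dry run, and actually changing ownership with os.lchown normally requires root.

```python
import os
import tempfile

def remap_ownership(root, uid_map, gid_map, dry_run=True):
    """Remap numeric owners below root; returns (path, new_uid, new_gid) tuples."""
    paths = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        paths.extend(os.path.join(dirpath, n) for n in dirnames + filenames)
    changes = []
    for path in paths:
        st = os.lstat(path)  # lstat: look at symlinks themselves, not their targets
        new_uid = uid_map.get(st.st_uid, st.st_uid)
        new_gid = gid_map.get(st.st_gid, st.st_gid)
        if (new_uid, new_gid) != (st.st_uid, st.st_gid):
            changes.append((path, new_uid, new_gid))
            if not dry_run:
                os.lchown(path, new_uid, new_gid)
    return changes

# Dry-run demonstration on a throwaway directory owned by the current user.
demo_root = tempfile.mkdtemp()
open(os.path.join(demo_root, "example.txt"), "w").close()
planned = remap_ownership(demo_root, {os.getuid(): os.getuid() + 1}, {})
for change in planned:
    print(change)
```

Running it first as a dry run makes it possible to review the planned changes before touching any metadata.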

Regards,
Martin

Martin Bammer

2016-08-13 09:24:18 UTC

Ian Jackson

2016-08-15 12:59:53 UTC

<plug>
Maybe you want sync-accounts, from chiark-utils. Centralized user
management on the cheap.

Ian.

Christoph Biedl

2016-08-14 16:04:59 UTC

Martin Bammer wrote...

Post by Martin Bammer
I've got an issue with the generation of UIDs and GIDs when new
users are added. By default UIDs and GIDs for users and user groups
are values starting from 1000 (on Red Hat from 500). When a user is
added the next free value is chosen.

Yes, also NFS has a problem here unless you use some additional ID
mapping.

The same applies to system user IDs: if you want to migrate to a new
installation but there are a lot of files that should be preserved, think
/var/lib/munin/.

For all such situations a workaround exists. Still I've been wondering
for years why apparently nobody else considers this a problem. So I
patched adduser to determine the user (also: group) ID from a static
"account name"<->"ID" mapping. It's in the BTS somewhere from eight years
ago, and I still use an updated version today. Migration of existing
installations was painful but worth it, YMMV.
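
The idea of a static name-to-ID mapping can be sketched as follows. This is not the patch from the BTS: STATIC_IDS, its contents and pick_id are hypothetical, and the fallback simply mimics the usual next-free allocation.

```python
# Hypothetical site-wide table, identical on every machine.
STATIC_IDS = {"munin": 201, "john": 2001, "mary": 2002}

def pick_id(name, used_ids, static_ids=STATIC_IDS, first_free=1000):
    """Return the statically assigned ID for name, else the next free one."""
    wanted = static_ids.get(name)
    if wanted is not None:
        if wanted in used_ids:
            raise ValueError(f"static ID {wanted} for {name!r} is already in use")
        return wanted
    # Fall back to sequential allocation, skipping IDs reserved in the table.
    reserved = set(static_ids.values())
    candidate = first_free
    while candidate in used_ids or candidate in reserved:
        candidate += 1
    return candidate

print(pick_id("john", used_ids={1000}))          # listed in the table
print(pick_id("alice", used_ids={1000, 1001}))   # not listed, next free ID
```

Raising an error on a conflict, rather than silently picking another number, keeps the IDs identical across machines at the cost of manual cleanup.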

Post by Martin Bammer
So my suggestion would be to change the default behavior of UID and
GID generation to hash value calculation. Hash values are computed from
the user and group names as 32bit values on Debian (31bit on Red
Hat). The minimum and maximum values should be configurable.

Given Murphy and the birthday paradox, this will bite you much sooner
than you'd expect.

Post by Martin Bammer
IMHO the current implementation is a design bug which must be fixed.

I wouldn't use the b word here. The implementation is simple but
introduces problems once you have more than one machine.

Christoph

Christoph Biedl

2016-08-14 16:18:00 UTC

Christoph Biedl wrote...

So I patched adduser to determine the user (also: group) ID from a
static "account name"<->"ID" mapping. It's in the BTS somewhere from
eight years ago,

FTR: #243929

Martin Bammer

2016-08-14 23:09:20 UTC

Post by Christoph Biedl
For all such situations a workaround exists. Still I've been wondering
for years why apparently nobody else considers this a problem. So I
patched adduser to determine the user (also: group) ID from a static
"account name"<->"ID" mapping. It's in the BTS somewhere from eight years
ago, and I use an updated version still today. Migration of existing
installations was painful but worth it, YMMV.

Good that I'm not the only one who is not satisfied with the current
implementation. I didn't know that the same algorithm is used for system
services. I thought that the IDs for system services were fixed. Static ID
mapping is a nice idea which would be helpful for system services. It's sad
that the patch didn't find its way into the official release. Nevertheless
your idea does not solve the problem of different user IDs.
I fear that my idea will also be ignored, because someone can always find a
reason not to change a proven algorithm.

Post by Christoph Biedl
Given Murphy and the birthday paradox, this will bite you much sooner
than you'd expect.

Yes, collisions must be expected. Thus the algorithm has to deal with them.
To get an idea of the chance of collisions I've used a file of 281249 common
user names I found on the net and checked the number of collisions with
different algorithms. The valid value range is from 1000 to 2^32-1, because I
would not use the hash algorithm for system services. Here are the results.
Collisions for 281249 usernames:

Collisions  Algorithm
        15  xxh32
         7  md5 (only the first 4 bytes of the hash value used)
     73980  adler32
         7  crc32
         9  fnv1_32
         9  fnv1a_32
        14  murmur1_32
        14  murmur1_aligned_32
         8  murmur2_32
         8  murmur2a_32
         8  murmur2_aligned_32
         8  murmur2_neutral_32
         8  murmur3_32
         9  spooky_32
        72  super_fast_hash
         5  gencrc

It seems that gencrc is the best, but it has to be considered that the list
used is not meant to be complete in any way. The conclusion IMHO is that it
is good enough for a family and small communities where you only have a few
users.
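
A comparable experiment can be sketched with the Python standard library alone. The helper names below are hypothetical, only crc32 and truncated md5 are covered, and the exclusion of values below 1000 is ignored when hashing; the expectation formula assumes an ideal uniform hash.

```python
import hashlib
import zlib

def crc32_id(name):
    """32-bit CRC of the UTF-8 encoded name."""
    return zlib.crc32(name.encode("utf-8"))

def md5_id(name):
    """First 4 bytes of the MD5 digest, read as a big-endian integer."""
    return int.from_bytes(hashlib.md5(name.encode("utf-8")).digest()[:4], "big")

def count_collisions(names, hash_fn):
    """Number of distinct names whose hash value is already taken by another name."""
    unique = set(names)
    return len(unique) - len({hash_fn(n) for n in unique})

def expected_collisions(n_names, id_space):
    """Rough expected number of colliding pairs for an ideal uniform hash."""
    return n_names * (n_names - 1) / (2.0 * id_space)

sample = ["john", "mary", "dave"]
print(count_collisions(sample, crc32_id), count_collisions(sample, md5_id))
print(round(expected_collisions(281249, 2**32 - 1000), 1))
```

For 281249 names in the range 1000 to 2^32-1 the estimate is about 9.2, so the single-digit results in the table are about what an ideal 32-bit hash would give, and small differences between algorithms may be noise.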

Here is the code for gencrc:

gencrctab = [
0x46d1e192, 0x66edf9aa, 0x927fc9e5, 0xa53baacc, 0x29b47658, 0x5a411a01,
0x0e66d5bd, 0x0dd5b1db, 0xcb38340e, 0x04d4ebb6, 0x98bc4f54, 0x36f20f2c,
0x4a3047ed, 0x1ec1e0eb, 0x568c0c1f, 0x6a731432, 0x81367fc6, 0xe3e25237,
0xe7f64884, 0x0fa59f64, 0x4f3109de, 0xf02d61f5, 0x5daec03b, 0x7f740e83,
0x056ff2d8, 0x2026cc0a, 0x7ac2112d, 0x82c55605, 0xb0911ef2, 0xa7b88e4c,
0x89dca282, 0x4b254d27, 0x7694a6d3, 0xd229eadd, 0x8e8f3738, 0x5bee7a55,
0x012eb6ab, 0x08dd28c8, 0xb5abc274, 0xbc7931f0, 0xf2396ed5, 0xe4e43d97,
0x943f4b7f, 0x85d0293d, 0xaed83a88, 0xc8f932fc, 0xc5496f20, 0xe9228173,
0x9b465b7d, 0xfda26680, 0x1ddeab35, 0x0c4f25cb, 0x86e32faf, 0xe59fa13a,
0xe192e2c4, 0xf147da1a, 0x67620a8d, 0x5c9a24c5, 0xfe6afde2, 0xacad0250,
0xd359730b, 0xf35203b3, 0x96a4b44d, 0xfbcacea6, 0x41a165ec, 0xd71e53ac,
0x835f39bf, 0x6b6bde7e, 0xd07085ba, 0x79064e07, 0xee5b20c3, 0x3b90bd65,
0x5827aef4, 0x4d12d31c, 0x9143496e, 0x6c485976, 0xd9552733, 0x220f6895,
0xe69def19, 0xeb89cd70, 0xc9bb9644, 0x93ec7e0d, 0x2ace3842, 0x2b6158da,
0x039e9178, 0xbb5367d7, 0x55682285, 0x4315d891, 0x19fd8906, 0x7d8d4448,
0xb4168a03, 0x40b56a53, 0xaa3e69e0, 0xa25182fe, 0xad34d16c, 0x720c4171,
0x9dc3b961, 0x321db563, 0x8b801b9e, 0xf5971893, 0x14cc1251, 0x8f4ae962,
0xf65aff1e, 0x13bd9dee, 0x5e7c78c7, 0xddb61731, 0x73832c15, 0xefebdd5b,
0x1f959aca, 0xe801fb22, 0xa89826ce, 0x30b7165d, 0x458a4077, 0x24fec52a,
0x849b065f, 0x3c6930cd, 0xa199a81d, 0xdb768f30, 0x2e45c64a, 0xff2f0d94,
0x4ea97917, 0x6f572acf, 0x653a195c, 0x17a88c5a, 0x27e11fb5, 0x3f09c4c1,
0x2f87e71b, 0xea1493e4, 0xd4b3a55e, 0xbe6090be, 0xaf6cd9d9, 0xda58ca00,
0x612b7034, 0x31711dad, 0x6d7db041, 0x8ca786b7, 0x09e8bf7a, 0xc3c4d7ea,
0xa3cd77a8, 0x7700f608, 0xdf3de559, 0x71c9353f, 0x9fd236fb, 0x1675d43e,
0x390d9e9a, 0x21ba4c6b, 0xbd1371e8, 0x90338440, 0xd5f163d2, 0xb140fef9,
0x52f50b57, 0x3710cf67, 0x4c11a79c, 0xc6d6624e, 0x3dc7afa9, 0x34a69969,
0x70544a26, 0xf7d9ec98, 0x7c027496, 0x1bfb3ba3, 0xb3b1dc8f, 0x9a241039,
0xf993f5a4, 0x15786b99, 0x26e704f7, 0x51503c04, 0x028bb3b8, 0xede5600c,
0x9cb22b29, 0xb6ff339b, 0x7e771c43, 0xc71c05f1, 0x604ca924, 0x695eed60,
0x688ed0bc, 0x3e0b232f, 0xf8a39c11, 0xbae6e67c, 0xb8cf75e1, 0x970321a7,
0x5328922b, 0xdef3df2e, 0x8d0443b0, 0x2885e3ae, 0x6435eed1, 0xcc375e81,
0xa98495f6, 0xe0bff114, 0xb2da3e4f, 0xc01b5adf, 0x507e0721, 0x6267a36a,
0x181a6df8, 0x7baff0c0, 0xfa6d6c13, 0x427250b2, 0xe2f742d6, 0xcd5cc723,
0x2d218be7, 0xb91fbbb1, 0x9eb946d0, 0x1c180810, 0xfc81d602, 0x0b9c3f52,
0xc2ea456f, 0x1165b2c9, 0xabf4ad75, 0x0a56fc8c, 0x12e0f818, 0xcadbcba1,
0x2586be56, 0x952c9b46, 0x07c6a43c, 0x78967df3, 0x477b2e49, 0x2c5d7b6d,
0x8a637272, 0x59acbcb4, 0x74a0e447, 0xc1f8800f, 0x35c015dc, 0x230794c2,
0x4405f328, 0xec2adba5, 0xd832b845, 0x6e4ed287, 0x48e9f7a2, 0xa44be89f,
0x38cbb725, 0xbf6ef4e6, 0xdc0e83fa, 0x54238d12, 0xf4f0c1e3, 0xa60857fd,
0xc43c64b9, 0x00c851ef, 0x33d75f36, 0x5fd39866, 0xd1efa08a, 0xa0640089,
0x877a978b, 0x99175d86, 0x57dfacbb, 0xceb02de9, 0xcf4d5c09, 0x3a8813d4,
0xb7448816, 0x63fa5568, 0x06be014b, 0xd642fa7b, 0x10aa7c90, 0x8082c88e,
0x1afcba79, 0x7519549d, 0x490a87ff, 0x8820c3a0 ]

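def gencrc(s):
    """Table-driven 32-bit hash of a user or group name."""
    h = 0
    # Hash the UTF-8 bytes so every table index stays within 0..255.
    for b in s.encode("utf-8"):
        h = (h >> 8) ^ gencrctab[(h & 0xff) ^ b]
    return h

uid = gencrc(username)
# IDs below 1000 are left to system accounts, so extend the name and retry.
while 0 <= uid < 1000:
    username += "X"
    uid = gencrc(username)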

Post by Christoph Biedl
I wouldn't use the b word here. The implementation is simple but
introduces problems once you have more than one machine.

You are right. Declaring a proven algorithm a bug is an exaggeration, but
from the view of a user with no knowledge of UIDs and GIDs and how the
mapping works, the behavior I described in my first post seems like a bug.
I'm wondering how Windows behaves in such situations. Does Windows save
usernames for access rights? Does anybody know?

Regards,
Martin

Jeremy Stanley

2016-08-15 18:44:51 UTC

On 2016-08-15 01:09:20 +0200 (+0200), Martin Bammer wrote:
[...]

Post by Martin Bammer
I thought that the IDs for system services are fixed. Static ID
mapping is a nice idea which would be helpful for system services.

[...]

Some of them are fixed: https://wiki.debian.org/SystemGroups

There are compiled-in defaults for SYS_GID_MIN and SYS_UID_MIN of
100 so GIDs and UIDs you see lower than that are fixed assignments.
Red Hat/Fedora have been continuing to add to the list on their
distros over the years, and have to explicitly bump these minimums
in /etc/login.defs as they eventually grew past 100:
https://git.fedorahosted.org/cgit/setup.git/tree/uidgid
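
As an illustration of how these limits partition the ID space, the sketch below parses login.defs-style text and classifies a UID. The sample values are illustrative only and not taken from any particular distribution; parse_login_defs and classify_uid are hypothetical helpers.

```python
def parse_login_defs(text):
    """Parse 'KEY value' lines, skipping blank lines and lines starting with '#'."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) == 2:
            settings[parts[0]] = parts[1].strip()
    return settings

# Illustrative sample, not a real system file.
SAMPLE = """
# ID ranges
UID_MIN      1000
SYS_UID_MIN  100
"""

limits = parse_login_defs(SAMPLE)

def classify_uid(uid, limits=limits):
    """Classify a UID relative to SYS_UID_MIN and UID_MIN."""
    if uid < int(limits["SYS_UID_MIN"]):
        return "fixed"
    if uid < int(limits["UID_MIN"]):
        return "dynamic system"
    return "regular user"

print(classify_uid(33), classify_uid(150), classify_uid(1000))
```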

--
Jeremy Stanley

Michael Lustfield

2016-08-15 23:18:28 UTC

[...]

In the original scenario, the concern was with shared media having
uid/gid numbers that don't match what's on the system. In that
scenario, this was viewed as a security concern. This is not a
security concern because once someone has physical access to your
unencrypted data, it's no longer your data. That's just an unfortunate
truth. You can give me a HD w/ Windows on it, tell me you set up the
file system permissions to be really super duper secure, but... I'm
just going to walk around the file system as if you gave me higher
than administrator access. Whatever isn't encrypted is now something I
have access to.

For the above scenario, let's say the data is encrypted. If you're
giving the other party the key so they can open it... it's no longer
your data.

The next scenario I recall from this thread was about the small
business scenario. The typical response to that is obviously
centralized authentication. I know scenarios where that's not possible
or the logistics are absurd. The next best thing is configuration
management utilities. In my personal opinion, if your company is large
enough to have servers, it's large enough that config management is no
longer optional.

If you can go along with that, you can get something like this (salt example):
local_users:
  - michael:
      uid: 4001
      gid: 4001
      pwd: <password hash>
      keys:
        - ssh_key1
        - ssh_key2
  - timmy:
      uid: 4002
      gid: 4002
      pwd: <pwd_hash>
  - freddy:
      terminated: True

When tools like sssd take a remote uid/gid and translate that to a
local translated uid/gid, I don't believe that's a security concern so
much as a concern of things breaking if you start getting collisions.
Ya, that's a security concern if sssd generates uid/gid numbers that
collide with numbers that other tools that want to use those
specifically, but I'm convinced this behavior has nothing to do with
security. This behavior only makes collisions unlikely, it does not,
in any way, guarantee that collisions will never happen.

In fact... Story time! One of the first times I started playing with
sssd, I was rolling it out in a mid-size enterprise (~24,000
employees). On a few servers, the uid/gid numbers that sssd came up
with collided with over 80% of the existing local system users. This
was a design issue that needed to be resolved, not something that
needed a band-aid. Because centralized user management already
existed, miscellaneous uid/gid numbers were sent up river to AD and
then every system was migrated to using those numbers. End result...
collisions can't happen because we don't let them.

tl;dr -- Randomizing uid/gid numbers does not improve security, it
just decreases the probability of that security hole being accessible.
Enforcing the same uid/gid everywhere *will* prevent collisions.

Sorry, I got a bit long winded. That's what I get when I write fun
emails on my break. :(

Andrey Rahmatullin

2016-08-16 05:43:57 UTC

Post by Martin Bammer
Does Windows save usernames for access rights?

No, NTFS permissions are tied to SIDs, which are like UNIX UIDs/GIDs but
unique for non-default entities, because they include the computer SID
(those may be non-unique if a system was cloned, though). Still, NTFS
permissions are no better than UNIX permissions for removable storage.

https://en.wikipedia.org/wiki/Security_Identifier

--
WBR, wRAR