Discussion:
[PATCH 1/5] axonram: Fix bug in direct_access
(too old to reply)
Boaz Harrosh
2014-08-27 15:25:57 UTC
Permalink
From: Matthew Wilcox <***@intel.com>

The 'pfn' returned by axonram was completely bogus, and has been since
2008.

Signed-off-by: Matthew Wilcox <***@intel.com>
Reviewed-by: Jan Kara <***@suse.cz>
Reviewed-by: Boaz Harrosh <***@plexistor.com>
---
arch/powerpc/sysdev/axonram.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index 47b6b9f..830edc8 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -156,7 +156,7 @@ axon_ram_direct_access(struct block_device *device, sector_t sector,
}

*kaddr = (void *)(bank->ph_addr + offset);
- *pfn = virt_to_phys(kaddr) >> PAGE_SHIFT;
+ *pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;

return 0;
}
--
1.9.3
Boaz Harrosh
2014-08-27 15:27:11 UTC
Permalink
From: Matthew Wilcox <***@intel.com>

In order to support accesses to larger chunks of memory, pass in a
'size' parameter (counted in bytes), and return the amount available at
that address.

Add a new helper function, bdev_direct_access(), to handle common
functionality including partition handling, checking the length requested
is positive, checking for the sector being page-aligned, and checking
the length of the request does not pass the end of the partition.

Signed-off-by: Matthew Wilcox <***@intel.com>
Reviewed-by: Jan Kara <***@suse.cz>
Reviewed-by: Boaz Harrosh <***@plexistor.com>
---
Documentation/filesystems/xip.txt | 15 +++++++++------
arch/powerpc/sysdev/axonram.c | 17 ++++-------------
drivers/block/brd.c | 12 +++++-------
drivers/s390/block/dcssblk.c | 21 +++++++++-----------
fs/block_dev.c | 40 +++++++++++++++++++++++++++++++++++++++
fs/ext2/xip.c | 31 +++++++++++++-----------------
include/linux/blkdev.h | 6 ++++--
7 files changed, 84 insertions(+), 58 deletions(-)

diff --git a/Documentation/filesystems/xip.txt b/Documentation/filesystems/xip.txt
index 0466ee5..b774729 100644
--- a/Documentation/filesystems/xip.txt
+++ b/Documentation/filesystems/xip.txt
@@ -28,12 +28,15 @@ Implementation
Execute-in-place is implemented in three steps: block device operation,
address space operation, and file operations.

-A block device operation named direct_access is used to retrieve a
-reference (pointer) to a block on-disk. The reference is supposed to be
-cpu-addressable, physical address and remain valid until the release operation
-is performed. A struct block_device reference is used to address the device,
-and a sector_t argument is used to identify the individual block. As an
-alternative, memory technology devices can be used for this.
+A block device operation named direct_access is used to translate the
+block device sector number to a page frame number (pfn) that identifies
+the physical page for the memory. It also returns a kernel virtual
+address that can be used to access the memory.
+
+The direct_access method takes a 'size' parameter that indicates the
+number of bytes being requested. The function should return the number
+of bytes that can be contiguously accessed at that offset. It may also
+return a negative errno if an error occurs.

The block device operation is optional, these block devices support it as of
today:
diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index 830edc8..8709b9f 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -139,26 +139,17 @@ axon_ram_make_request(struct request_queue *queue, struct bio *bio)
* axon_ram_direct_access - direct_access() method for block device
* @device, @sector, @data: see block_device_operations method
*/
-static int
+static long
axon_ram_direct_access(struct block_device *device, sector_t sector,
- void **kaddr, unsigned long *pfn)
+ void **kaddr, unsigned long *pfn, long size)
{
struct axon_ram_bank *bank = device->bd_disk->private_data;
- loff_t offset;
-
- offset = sector;
- if (device->bd_part != NULL)
- offset += device->bd_part->start_sect;
- offset <<= AXON_RAM_SECTOR_SHIFT;
- if (offset >= bank->size) {
- dev_err(&bank->device->dev, "Access outside of address space\n");
- return -ERANGE;
- }
+ loff_t offset = (loff_t)sector << AXON_RAM_SECTOR_SHIFT;

*kaddr = (void *)(bank->ph_addr + offset);
*pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;

- return 0;
+ return bank->size - offset;
}

static const struct block_device_operations axon_ram_devops = {
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 3598110..78fe510 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -370,25 +370,23 @@ static int brd_rw_page(struct block_device *bdev, sector_t sector,
}

#ifdef CONFIG_BLK_DEV_XIP
-static int brd_direct_access(struct block_device *bdev, sector_t sector,
- void **kaddr, unsigned long *pfn)
+static long brd_direct_access(struct block_device *bdev, sector_t sector,
+ void **kaddr, unsigned long *pfn, long size)
{
struct brd_device *brd = bdev->bd_disk->private_data;
struct page *page;

if (!brd)
return -ENODEV;
- if (sector & (PAGE_SECTORS-1))
- return -EINVAL;
- if (sector + PAGE_SECTORS > get_capacity(bdev->bd_disk))
- return -ERANGE;
page = brd_insert_page(brd, sector);
if (!page)
return -ENOSPC;
*kaddr = page_address(page);
*pfn = page_to_pfn(page);

- return 0;
+ /* If size > PAGE_SIZE, we could look to see if the next page in the
+ * file happens to be mapped to the next page of physical RAM */
+ return PAGE_SIZE;
}
#endif

diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 0f47175..96bc411 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -28,8 +28,8 @@
static int dcssblk_open(struct block_device *bdev, fmode_t mode);
static void dcssblk_release(struct gendisk *disk, fmode_t mode);
static void dcssblk_make_request(struct request_queue *q, struct bio *bio);
-static int dcssblk_direct_access(struct block_device *bdev, sector_t secnum,
- void **kaddr, unsigned long *pfn);
+static long dcssblk_direct_access(struct block_device *bdev, sector_t secnum,
+ void **kaddr, unsigned long *pfn, long size);

static char dcssblk_segments[DCSSBLK_PARM_LEN] = "\0";

@@ -866,25 +866,22 @@ fail:
bio_io_error(bio);
}

-static int
+static long
dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
- void **kaddr, unsigned long *pfn)
+ void **kaddr, unsigned long *pfn, long size)
{
struct dcssblk_dev_info *dev_info;
- unsigned long pgoff;
+ unsigned long offset, dev_sz;

dev_info = bdev->bd_disk->private_data;
if (!dev_info)
return -ENODEV;
- if (secnum % (PAGE_SIZE/512))
- return -EINVAL;
- pgoff = secnum / (PAGE_SIZE / 512);
- if ((pgoff+1)*PAGE_SIZE-1 > dev_info->end - dev_info->start)
- return -ERANGE;
- *kaddr = (void *) (dev_info->start+pgoff*PAGE_SIZE);
+ dev_sz = dev_info->end - dev_info->start;
+ offset = secnum * 512;
+ *kaddr = (void *) (dev_info->start + offset);
*pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;

- return 0;
+ return dev_sz - offset;
}

static void
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 6d72746..ffe0761 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -427,6 +427,46 @@ int bdev_write_page(struct block_device *bdev, sector_t sector,
}
EXPORT_SYMBOL_GPL(bdev_write_page);

+/**
+ * bdev_direct_access() - Get the address for directly-accessibly memory
+ * @bdev: The device containing the memory
+ * @sector: The offset within the device
+ * @addr: Where to put the address of the memory
+ * @pfn: The Page Frame Number for the memory
+ * @size: The number of bytes requested
+ *
+ * If a block device is made up of directly addressable memory, this function
+ * will tell the caller the PFN and the address of the memory. The address
+ * may be directly dereferenced within the kernel without the need to call
+ * ioremap(), kmap() or similar. The PFN is suitable for inserting into
+ * page tables.
+ *
+ * Return: negative errno if an error occurs, otherwise the number of bytes
+ * accessible at this address.
+ */
+long bdev_direct_access(struct block_device *bdev, sector_t sector,
+ void **addr, unsigned long *pfn, long size)
+{
+ long avail;
+ const struct block_device_operations *ops = bdev->bd_disk->fops;
+
+ if (size < 0)
+ return size;
+ if (!ops->direct_access)
+ return -EOPNOTSUPP;
+ if ((sector + DIV_ROUND_UP(size, 512)) >
+ part_nr_sects_read(bdev->bd_part))
+ return -ERANGE;
+ sector += get_start_sect(bdev);
+ if (sector % (PAGE_SIZE / 512))
+ return -EINVAL;
+ avail = ops->direct_access(bdev, sector, addr, pfn, size);
+ if (!avail)
+ return -ERANGE;
+ return min(avail, size);
+}
+EXPORT_SYMBOL_GPL(bdev_direct_access);
+
/*
* pseudo-fs
*/
diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
index e98171a..bbc5fec 100644
--- a/fs/ext2/xip.c
+++ b/fs/ext2/xip.c
@@ -13,18 +13,12 @@
#include "ext2.h"
#include "xip.h"

-static inline int
-__inode_direct_access(struct inode *inode, sector_t block,
- void **kaddr, unsigned long *pfn)
+static inline long __inode_direct_access(struct inode *inode, sector_t block,
+ void **kaddr, unsigned long *pfn, long size)
{
struct block_device *bdev = inode->i_sb->s_bdev;
- const struct block_device_operations *ops = bdev->bd_disk->fops;
- sector_t sector;
-
- sector = block * (PAGE_SIZE / 512); /* ext2 block to bdev sector */
-
- BUG_ON(!ops->direct_access);
- return ops->direct_access(bdev, sector, kaddr, pfn);
+ sector_t sector = block * (PAGE_SIZE / 512);
+ return bdev_direct_access(bdev, sector, kaddr, pfn, size);
}

static inline int
@@ -53,12 +47,13 @@ ext2_clear_xip_target(struct inode *inode, sector_t block)
{
void *kaddr;
unsigned long pfn;
- int rc;
+ long size;

- rc = __inode_direct_access(inode, block, &kaddr, &pfn);
- if (!rc)
- clear_page(kaddr);
- return rc;
+ size = __inode_direct_access(inode, block, &kaddr, &pfn, PAGE_SIZE);
+ if (size < 0)
+ return size;
+ clear_page(kaddr);
+ return 0;
}

void ext2_xip_verify_sb(struct super_block *sb)
@@ -77,7 +72,7 @@ void ext2_xip_verify_sb(struct super_block *sb)
int ext2_get_xip_mem(struct address_space *mapping, pgoff_t pgoff, int create,
void **kmem, unsigned long *pfn)
{
- int rc;
+ long rc;
sector_t block;

/* first, retrieve the sector number */
@@ -86,6 +81,6 @@ int ext2_get_xip_mem(struct address_space *mapping, pgoff_t pgoff, int create,
return rc;

/* retrieve address of the target data */
- rc = __inode_direct_access(mapping->host, block, kmem, pfn);
- return rc;
+ rc = __inode_direct_access(mapping->host, block, kmem, pfn, PAGE_SIZE);
+ return (rc < 0) ? rc : 0;
}
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 518b465..ac25166 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1615,8 +1615,8 @@ struct block_device_operations {
int (*rw_page)(struct block_device *, sector_t, struct page *, int rw);
int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
- int (*direct_access) (struct block_device *, sector_t,
- void **, unsigned long *);
+ long (*direct_access)(struct block_device *, sector_t,
+ void **, unsigned long *pfn, long size);
unsigned int (*check_events) (struct gendisk *disk,
unsigned int clearing);
/* ->media_changed() is DEPRECATED, use ->check_events() instead */
@@ -1634,6 +1634,8 @@ extern int __blkdev_driver_ioctl(struct block_device *, fmode_t, unsigned int,
extern int bdev_read_page(struct block_device *, sector_t, struct page *);
extern int bdev_write_page(struct block_device *, sector_t, struct page *,
struct writeback_control *);
+extern long bdev_direct_access(struct block_device *, sector_t, void **addr,
+ unsigned long *pfn, long size);
#else /* CONFIG_BLOCK */

struct block_device;
--
1.9.3
Boaz Harrosh
2014-08-27 15:28:25 UTC
Permalink
From: Boaz Harrosh <***@plexistor.com>

Some programs like fdisk, require HDIO_GETGEO to work, which requires we
implement getgeo.

We set all hd_geometry members to 1, because this way fdisk
math will not try its crazy geometry math and get stuff totally wrong.

I was trying to get some values that will make fdisk Want to align
first sector on 4K (like 8, 16, 20, ... sectors) but nothing worked,
I searched the net the math is not your regular simple multiplication
at all.

If you managed to get these please tell me. I would love to solve
this.

But for now we use 4k physical sectors for fixing fdisk alignment
issues, and setting these here to something that will not make
fdisk serve us with crazy numbers.

Signed-off-by: Boaz Harrosh <***@plexistor.com>
---
drivers/block/brd.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 78fe510..f841d9e 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -19,6 +19,7 @@
#include <linux/radix-tree.h>
#include <linux/fs.h>
#include <linux/slab.h>
+#include <linux/hdreg.h>

#include <asm/uaccess.h>

@@ -424,6 +425,23 @@ static int brd_ioctl(struct block_device *bdev, fmode_t mode,
return error;
}

+static int brd_getgeo(struct block_device *bd, struct hd_geometry *geo)
+{
+ /* Just tell fdisk to get out of the way. The math here is so
+ * convoluted and does not make any sense at all. With all 1s
+ * The math just gets out of the way.
+ * NOTE: I was trying to get some values that will make fdisk
+ * Want to align first sector on 4K (like 8, 16, 20, ... sectors) but
+ * nothing worked, I searched the net the math is not your regular
+ * simple multiplication at all. If you managed to get these please
+ * fix here. For now we use 4k physical sectors for this
+ */
+ geo->heads = 1;
+ geo->sectors = 1;
+ geo->cylinders = 1;
+ return 0;
+}
+
static const struct block_device_operations brd_fops = {
.owner = THIS_MODULE,
.rw_page = brd_rw_page,
@@ -431,6 +449,7 @@ static const struct block_device_operations brd_fops = {
#ifdef CONFIG_BLK_DEV_XIP
.direct_access = brd_direct_access,
#endif
+ .getgeo = brd_getgeo,
};

/*
--
1.9.3


--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Matthew Wilcox
2014-08-27 17:53:27 UTC
Permalink
Post by Boaz Harrosh
We set all hd_geometry members to 1, because this way fdisk
math will not try its crazy geometry math and get stuff totally wrong.
I was trying to get some values that will make fdisk Want to align
first sector on 4K (like 8, 16, 20, ... sectors) but nothing worked,
I searched the net the math is not your regular simple multiplication
at all.
If you managed to get these please tell me. I would love to solve
this.
But for now we use 4k physical sectors for fixing fdisk alignment
issues, and setting these here to something that will not make
fdisk serve us with crazy numbers.
Are you saying that fdisk ignores the 4k physical sectors (that you set up
in patch 5/5) in favour of the geometry exposed here? That doesn't make
sense to me, since it would misalign 4k-physical ATA drives if it did.

I don't see anywhere else in the kernel reporting (1,1,1). The most common
form to fake a geometry uses (64, 32, x), including SCSI.
Boaz Harrosh
2014-08-28 07:26:31 UTC
Permalink
Post by Matthew Wilcox
Post by Boaz Harrosh
We set all hd_geometry members to 1, because this way fdisk
math will not try its crazy geometry math and get stuff totally wrong.
I was trying to get some values that will make fdisk Want to align
first sector on 4K (like 8, 16, 20, ... sectors) but nothing worked,
I searched the net the math is not your regular simple multiplication
at all.
If you managed to get these please tell me. I would love to solve
this.
But for now we use 4k physical sectors for fixing fdisk alignment
issues, and setting these here to something that will not make
fdisk serve us with crazy numbers.
Are you saying that fdisk ignores the 4k physical sectors (that you set up
in patch 5/5) in favour of the geometry exposed here? That doesn't make
sense to me, since it would misalign 4k-physical ATA drives if it did.
No with patch 5/5 the 4k stuff is good.

What I'm saying is that with (64, 32, x) fdisk offers a very high first
sector and with all 1(s) it will allow a low value like 4k

For example with (64, 32, x) + the 4k patch in 5/5 with a 4M brd disk it
will offer 40 (20K) as first possible sector.

With this patch applied it will offer 8 (4K) as first sector, which is what
I want
Post by Matthew Wilcox
I don't see anywhere else in the kernel reporting (1,1,1). The most common
form to fake a geometry uses (64, 32, x), including SCSI.
But is it wrong? why?

I guess that until now no one cared about wasted space at the beginning
but with brd I do. Your words this is all "fake" then why at all.
With all 1(s) the CHS convoluted math just disappears and gets out of the
way.

This is clearly the case of copy/paste so long in the history of the code
that no one knows why it was done originally, and it has become religion
that if you dare touch it ghost will come out of closets.

But I looked at the code's math and I found that this is better.

And why not let it be a flag to UM that this is RAM based? If legacy
code likes it just fine, why do we think otherwise?

Thanks
Boaz

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Matthew Wilcox
2014-08-28 15:11:26 UTC
Permalink
Post by Boaz Harrosh
Post by Matthew Wilcox
Post by Boaz Harrosh
We set all hd_geometry members to 1, because this way fdisk
math will not try its crazy geometry math and get stuff totally wrong.
I was trying to get some values that will make fdisk Want to align
first sector on 4K (like 8, 16, 20, ... sectors) but nothing worked,
I searched the net the math is not your regular simple multiplication
at all.
If you managed to get these please tell me. I would love to solve
this.
But for now we use 4k physical sectors for fixing fdisk alignment
issues, and setting these here to something that will not make
fdisk serve us with crazy numbers.
Are you saying that fdisk ignores the 4k physical sectors (that you set up
in patch 5/5) in favour of the geometry exposed here? That doesn't make
sense to me, since it would misalign 4k-physical ATA drives if it did.
No with patch 5/5 the 4k stuff is good.
What I'm saying is that with (64, 32, x) fdisk offers a very high first
sector and with all 1(s) it will allow a low value like 4k
For example with (64, 32, x) + the 4k patch in 5/5 with a 4M brd disk it
will offer 40 (20K) as first possible sector.
With this patch applied it will offer 8 (4K) as first sector, which is what
I want
That makes for a much better changelog entry than what you wrote above :-)
Post by Boaz Harrosh
Post by Matthew Wilcox
I don't see anywhere else in the kernel reporting (1,1,1). The most common
form to fake a geometry uses (64, 32, x), including SCSI.
But is it wrong? why?
I guess that until now no one cared about wasted space at the beginning
but with brd I do. Your words this is all "fake" then why at all.
With all 1(s) the CHS convoluted math just disappears and gets out of the
way.
It is all fake. Nobody's reported their real CHS geometry in twenty
years. A drive's geometry has been more complex than a fixed number
of sectors per track, and anyway the hd_geometry struct tops out at
describing a 500GB drive (with 63 heads and 255 sectors).

My concern is that clearly (64, 32, x) works since other drivers are
doing it. (1, 1, 1) is stepping into the unknown, and we don't know
what applications are going to make of this value.

This really needs to be something that's handled by the block midlayer.
Forcing drivers to make this stuff up is only leading to pain and
suffering. Drivers that don't have a getgeo method should be given a
default geometry that makes all known users of getgeo do the right thing.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Boaz Harrosh
2014-08-28 15:43:48 UTC
Permalink
<>
Post by Matthew Wilcox
Post by Boaz Harrosh
No with patch 5/5 the 4k stuff is good.
What I'm saying is that with (64, 32, x) fdisk offers a very high first
sector and with all 1(s) it will allow a low value like 4k
For example with (64, 32, x) + the 4k patch in 5/5 with a 4M brd disk it
will offer 40 (20K) as first possible sector.
With this patch applied it will offer 8 (4K) as first sector, which is what
I want
That makes for a much better changelog entry than what you wrote above :-)
OK Will do thanks
Post by Matthew Wilcox
Post by Boaz Harrosh
But is it wrong? why?
I guess that until now no one cared about wasted space at the beginning
but with brd I do. Your words this is all "fake" then why at all.
With all 1(s) the CHS convoluted math just disappears and gets out of the
way.
It is all fake. Nobody's reported their real CHS geometry in twenty
years. A drive's geometry has been more complex than a fixed number
of sectors per track, and anyway the hd_geometry struct tops out at
describing a 500GB drive (with 63 heads and 255 sectors).
Right?
Post by Matthew Wilcox
My concern is that clearly (64, 32, x) works since other drivers are
doing it. (1, 1, 1) is stepping into the unknown, and we don't know
what applications are going to make of this value.
But you are ignoring my "it does not work" complain above?

I have tested with fdisk cfdisk gparted (and parted) and they all do
the same as before or better in the case of fdisk. I have reviewed the
source code to fdisk, and was satisfied that 1,1,1 is actually good and
fixes my problem above.

Why can't we put this in Linux-next for a while and let imaginary
"applications are going to make of this value" complain in do time.

Does anyone know of any application other then above list that I should test
with?

Let me remind you that partitions was completely broken on brd since its
inception, and no one knew about it until I started testing, so its not
like we can do *any kind* of harm with this.
Post by Matthew Wilcox
This really needs to be something that's handled by the block midlayer.
Forcing drivers to make this stuff up is only leading to pain and
suffering. Drivers that don't have a getgeo method should be given a
default geometry that makes all known users of getgeo do the right thing.
Fine, I promise to do this once this test driver is to run for a cycle or
two I will do it. I will clean the all stack and have two defaults for
faking drivers and let be those old drivers that actually go to hardware
to read it. Let us have this brd driver as a test bed and I promise to
do this.

This is regular Kernel procedure. I have audited and tested to the best of
my, or ML readers ability, and we have in place a staging procedure that should
catch any problems soon enough.

Since when have we become so defensive on ZERO risk code in the Kernel. We are
changing core and revolutionary systems every day for breakfast? With far more
out reaching implications. And problems come up and we fix them.

Can we move forward please
Boaz
Boaz Harrosh
2014-08-27 15:30:26 UTC
Permalink
From: Boaz Harrosh <***@plexistor.com>

This patch fixes up brd's partitions scheme, now enjoying all worlds.

The MAIN fix here is that currently, if one fdisks some partitions,
a BAD bug will make all partitions point to the same start-end sector
ie: 0 - brd_size And an mkfs of any partition would trash the partition
table and the other partition.

Another fix is that "mount -U uuid" will not work if show_part was not
specified, because of the GENHD_FL_SUPPRESS_PARTITION_INFO flag.
We now always load without it and remove the show_part parameter.

So NOW the logic goes like this:
* max_part - Just says how many minors to reserve between devices
But in any way, there can be as many partition as requested.
If minors between devices ends, then dynamic 259-major ids will
be allocated on the fly.
The default is now max_part=1, which means all partitions devt
will be from the dynamic major-range.
(If persistent partition minors is needed use max_part=)

* Creation of new devices on the fly still/always work:
mknod /path/devnod b 1 X
fdisk -l /path/devnod
Will create a new device if (X / max_part) was not already
created before. (Just as before)

partitions on the dynamically created device will work as well
Same logic applies with minors as with the pre-created ones.

TODO: dynamic grow of device size, maybe through sysfs. So each
device can have it's own size.

CC: Dmitry Monakhov <***@openvz.org>
Tested-by: Ross Zwisler <***@linux.intel.com>
Signed-off-by: Boaz Harrosh <***@plexistor.com>
---
drivers/block/brd.c | 100 ++++++++++++++++++++--------------------------------
1 file changed, 38 insertions(+), 62 deletions(-)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index f841d9e..83e3d22 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -455,19 +455,18 @@ static const struct block_device_operations brd_fops = {
/*
* And now the modules code and kernel interface.
*/
-static int rd_nr;
-int rd_size = CONFIG_BLK_DEV_RAM_SIZE;
-static int max_part;
-static int part_shift;
-static int part_show = 0;
+static int rd_nr = CONFIG_BLK_DEV_RAM_COUNT;
module_param(rd_nr, int, S_IRUGO);
MODULE_PARM_DESC(rd_nr, "Maximum number of brd devices");
+
+int rd_size = CONFIG_BLK_DEV_RAM_SIZE;
module_param(rd_size, int, S_IRUGO);
MODULE_PARM_DESC(rd_size, "Size of each RAM disk in kbytes.");
+
+static int max_part = 1;
module_param(max_part, int, S_IRUGO);
-MODULE_PARM_DESC(max_part, "Maximum number of partitions per RAM disk");
-module_param(part_show, int, S_IRUGO);
-MODULE_PARM_DESC(part_show, "Control RAM disk visibility in /proc/partitions");
+MODULE_PARM_DESC(max_part, "Num Minors to reserve between devices");
+
MODULE_LICENSE("GPL");
MODULE_ALIAS_BLOCKDEV_MAJOR(RAMDISK_MAJOR);
MODULE_ALIAS("rd");
@@ -513,16 +512,15 @@ static struct brd_device *brd_alloc(int i)
brd->brd_queue->limits.discard_zeroes_data = 1;
queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, brd->brd_queue);

- disk = brd->brd_disk = alloc_disk(1 << part_shift);
+ disk = brd->brd_disk = alloc_disk(max_part);
if (!disk)
goto out_free_queue;
disk->major = RAMDISK_MAJOR;
- disk->first_minor = i << part_shift;
+ disk->first_minor = i * max_part;
disk->fops = &brd_fops;
disk->private_data = brd;
disk->queue = brd->brd_queue;
- if (!part_show)
- disk->flags |= GENHD_FL_SUPPRESS_PARTITION_INFO;
+ disk->flags = GENHD_FL_EXT_DEVT;
sprintf(disk->disk_name, "ram%d", i);
set_capacity(disk, rd_size * 2);

@@ -544,10 +542,11 @@ static void brd_free(struct brd_device *brd)
kfree(brd);
}

-static struct brd_device *brd_init_one(int i)
+static struct brd_device *brd_init_one(int i, bool *new)
{
struct brd_device *brd;

+ *new = false;
list_for_each_entry(brd, &brd_devices, brd_list) {
if (brd->brd_number == i)
goto out;
@@ -558,6 +557,7 @@ static struct brd_device *brd_init_one(int i)
add_disk(brd->brd_disk);
list_add_tail(&brd->brd_list, &brd_devices);
}
+ *new = true;
out:
return brd;
}
@@ -573,70 +573,46 @@ static struct kobject *brd_probe(dev_t dev, int *part, void *data)
{
struct brd_device *brd;
struct kobject *kobj;
+ bool new;

mutex_lock(&brd_devices_mutex);
- brd = brd_init_one(MINOR(dev) >> part_shift);
+ brd = brd_init_one(MINOR(dev) / max_part, &new);
kobj = brd ? get_disk(brd->brd_disk) : NULL;
mutex_unlock(&brd_devices_mutex);

- *part = 0;
+ if (new)
+ *part = 0;
+
return kobj;
}

static int __init brd_init(void)
{
- int i, nr;
- unsigned long range;
struct brd_device *brd, *next;
+ int i;

/*
* brd module now has a feature to instantiate underlying device
* structure on-demand, provided that there is an access dev node.
- * However, this will not work well with user space tool that doesn't
- * know about such "feature". In order to not break any existing
- * tool, we do the following:
*
- * (1) if rd_nr is specified, create that many upfront, and this
- * also becomes a hard limit.
- * (2) if rd_nr is not specified, create CONFIG_BLK_DEV_RAM_COUNT
- * (default 16) rd device on module load, user can further
- * extend brd device by create dev node themselves and have
- * kernel automatically instantiate actual device on-demand.
+ * (1) if rd_nr is specified, create that many upfront. else
+ * it defaults to CONFIG_BLK_DEV_RAM_COUNT
+ * (2) User can further extend brd devices by create dev node themselves
+ * and have kernel automatically instantiate actual device
+ * on-demand. Example:
+ * mknod /path/devnod_name b 1 X # 1 is the rd major
+ * fdisk -l /path/devnod_name
+ * If (X / max_part) was not already created it will be created
+ * dynamically.
*/

- part_shift = 0;
- if (max_part > 0) {
- part_shift = fls(max_part);
-
- /*
- * Adjust max_part according to part_shift as it is exported
- * to user space so that user can decide correct minor number
- * if [s]he want to create more devices.
- *
- * Note that -1 is required because partition 0 is reserved
- * for the whole disk.
- */
- max_part = (1UL << part_shift) - 1;
- }
-
- if ((1UL << part_shift) > DISK_MAX_PARTS)
- return -EINVAL;
-
- if (rd_nr > 1UL << (MINORBITS - part_shift))
- return -EINVAL;
-
- if (rd_nr) {
- nr = rd_nr;
- range = rd_nr << part_shift;
- } else {
- nr = CONFIG_BLK_DEV_RAM_COUNT;
- range = 1UL << MINORBITS;
- }
-
if (register_blkdev(RAMDISK_MAJOR, "ramdisk"))
return -EIO;

- for (i = 0; i < nr; i++) {
+ if (unlikely(!max_part))
+ max_part = 1;
+
+ for (i = 0; i < rd_nr; i++) {
brd = brd_alloc(i);
if (!brd)
goto out_free;
@@ -648,10 +624,10 @@ static int __init brd_init(void)
list_for_each_entry(brd, &brd_devices, brd_list)
add_disk(brd->brd_disk);

- blk_register_region(MKDEV(RAMDISK_MAJOR, 0), range,
+ blk_register_region(MKDEV(RAMDISK_MAJOR, 0), 1UL << MINORBITS,
THIS_MODULE, brd_probe, NULL, NULL);

- printk(KERN_INFO "brd: module loaded\n");
+ pr_info("brd: module loaded\n");
return 0;

out_free:
@@ -661,21 +637,21 @@ out_free:
}
unregister_blkdev(RAMDISK_MAJOR, "ramdisk");

+ pr_info("brd: module NOT loaded !!!\n");
return -ENOMEM;
}

static void __exit brd_exit(void)
{
- unsigned long range;
struct brd_device *brd, *next;

- range = rd_nr ? rd_nr << part_shift : 1UL << MINORBITS;
-
list_for_each_entry_safe(brd, next, &brd_devices, brd_list)
brd_del_one(brd);

- blk_unregister_region(MKDEV(RAMDISK_MAJOR, 0), range);
+ blk_unregister_region(MKDEV(RAMDISK_MAJOR, 0), 1UL << MINORBITS);
unregister_blkdev(RAMDISK_MAJOR, "ramdisk");
+
+ pr_info("brd: module unloaded\n");
}

module_init(brd_init);
--
1.9.3


--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Boaz Harrosh
2014-08-27 15:32:32 UTC
Permalink
From: Boaz Harrosh <***@plexistor.com>

Because of the direct_access() API which returns a PFN. partitions
better start on 4K boundary, else offset ZERO of a partition will
not be aligned and blk_direct_access() will fail the call.

By setting blk_queue_physical_block_size(PAGE_SIZE) we can communicate
this to fdisk and friends.
Note that blk_queue_physical_block_size() also trashes io_min, but
we can leave this one to be 512.

Signed-off-by: Boaz Harrosh <***@plexistor.com>
---
drivers/block/brd.c | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 83e3d22..6cc3034 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -503,10 +503,17 @@ static struct brd_device *brd_alloc(int i)
brd->brd_queue = blk_alloc_queue(GFP_KERNEL);
if (!brd->brd_queue)
goto out_free_dev;
+
blk_queue_make_request(brd->brd_queue, brd_make_request);
blk_queue_max_hw_sectors(brd->brd_queue, 1024);
blk_queue_bounce_limit(brd->brd_queue, BLK_BOUNCE_ANY);

+ /* This is so fdisk will align partitions on 4k, because of
+ * direct_access API needing 4k alignment, returning a PFN
+ */
+ blk_queue_physical_block_size(brd->brd_queue, PAGE_SIZE);
+ brd->brd_queue->limits.io_min = 512; /* Don't use the accessor */
+
brd->brd_queue->limits.discard_granularity = PAGE_SIZE;
brd->brd_queue->limits.max_discard_sectors = UINT_MAX;
brd->brd_queue->limits.discard_zeroes_data = 1;
--
1.9.3
Dmitry Monakhov
2014-08-27 15:45:48 UTC
Permalink
Jens Hi
What do you intend to do with these fixes? These are real bugs on devices
shipped for a while now. I think they need to go into current 3.17-rcX Kernel.
If not then lets please put them in for-next
[This set is for linux-block.git/for-next, tell me if you need one ontop
of for-linus]
It looks like I've missed your context. Please elaborate your use-case
[v2]
Based on Jens's linux-next [30e996a] incorporating the brd patch by Dmitry Monakhov.
Dmitry has introduced a new part_show parameter, this parameter is now
removed
In which commit or kernel-tree this parameter was removed?
and we always "part_show=1".
Scripts that did part_show=1 will work just the same but will display a
message in logs. This is harmless. (And scripts can be modified to
remove this parameter)
Which script are you talking about? Why you want use it?
[v1]
Current situation is that any attempt to use partitions with brd device would
create the partition but then any use will trash the data.
See: http://www.spinics.net/lists/linux-scsi/msg76737.html
So these patches fixes up all the problems we saw with the code, but not sacrificing
any of the old fixtures. See [patch 4/5] for more explanations.
[PATCH 1/5] axonram: Fix bug in direct_access
[PATCH 2/5] Change direct_access calling convention
These are Matthew's patches from the DAX series which fixes the interface to
direct_access taking into account the partition offset. It must be applied
here for partitions to work with direct_access() API.
[PATCH 3/5] brd: Add getgeo to block ops
This one is needed by fdisk, otherwise it just asks extra questions
[PATCH 4/5] brd: Fix all partitions BUGs
[PATCH 5/5] brd: Request from fdisk 4k alignment
Thanks
Boaz
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Boaz Harrosh
2014-08-27 15:57:04 UTC
Permalink
<>
Post by Dmitry Monakhov
[v2]
Based on Jens's linux-next [30e996a] incorporating the brd patch by Dmitry Monakhov.
Dmitry has introduced a new part_show parameter, this parameter is now
removed
In which commit or kernel-tree this parameter was removed?
I mean after this patchset, see patch
[PATCH 4/5] brd: Fix all partitions BUGs

It will remove this param
Post by Dmitry Monakhov
and we always "part_show=1".
Scripts that did part_show=1 will work just the same but will display a
message in logs. This is harmless. (And scripts can be modified to
remove this parameter)
Which script are you talking about? Why you want use it?
I meant that if any user-mode code was doing
modprobe brd part_show=1
Then this "part_show=1" is now deprecated and can be removed

Though the part_show=1 has not yet been released in any kernel
and it is a very low chance that any one had a chance to
use it

Thanks
Boaz


--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Boaz Harrosh
2014-09-01 10:15:53 UTC
Permalink
Jens Hi
What do you intend to do with these fixes? These are real bugs on devices
shipped for a while now. I think they need to go into current 3.17-rcX Kernel.
Jens hi

I have seen that the brd patch went into rc3.

I should insist then, that these patches go into this rc cycle ASAP.
For one they fix real *hard* unexpected bugs with brd that make it trash data.

But specially because the patch to brd that just went into Kernel introduces
a wrong module parameter "show_part" which must be removed. If we release a Kernel
with it, it will be pain, we will need to deprecate it which is long and annoying
procedure. It is wrong because it must be always on. Pleas see below

This patchset is based on what just went into Kernel.
Also on git here: git://git.open-osd.org/pmem.git brd-partitions branch

[http://git.open-osd.org/gitweb.cgi?p=pmem.git;a=shortlog;h=refs/heads/brd-partitions]

Thanks
Boaz
If not then lets please put them in for-next
[This set is for linux-block.git/for-next, tell me if you need one ontop
of for-linus]
[v2]
Based on Jens's linux-next [30e996a] incorporating the brd patch by Dmitry Monakhov.
Dmitry has introduced a new part_show parameter, this parameter is now removed
and we always "part_show=1".
Scripts that did part_show=1 will work just the same but will display a
message in logs. This is harmless. (And scripts can be modified to
remove this parameter)
[v1]
Current situation is that any attempt to use partitions with brd device would
create the partition but then any use will trash the data.
See: http://www.spinics.net/lists/linux-scsi/msg76737.html
So these patches fixes up all the problems we saw with the code, but not sacrificing
any of the old fixtures. See [patch 4/5] for more explanations.
[PATCH 1/5] axonram: Fix bug in direct_access
[PATCH 2/5] Change direct_access calling convention
These are Matthew's patches from the DAX series which fixes the interface to
direct_access taking into account the partition offset. It must be applied
here for partitions to work with direct_access() API.
[PATCH 3/5] brd: Add getgeo to block ops
This one is needed by fdisk, otherwise it just asks extra questions
[PATCH 4/5] brd: Fix all partitions BUGs
[PATCH 5/5] brd: Request from fdisk 4k alignment
Thanks
Boaz
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Boaz Harrosh
2014-09-09 12:10:20 UTC
Permalink
Post by Boaz Harrosh
Jens Hi
What do you intend to do with these fixes? These are real bugs on devices
shipped for a while now. I think they need to go into current 3.17-rcX Kernel.
Jens hi
I have seen that the brd patch went into rc3.
I should insist then, that these patches go into this rc cycle ASAP.
For one they fix real *hard* unexpected bugs with brd that make it trash data.
But specially because the patch to brd that just went into Kernel introduces
a wrong module parameter "show_part" which must be removed. If we release a Kernel
with it, it will be pain, we will need to deprecate it which is long and annoying
procedure. It is wrong because it must be always on. Pleas see below
This patchset is based on what just went into Kernel.
Also on git here: git://git.open-osd.org/pmem.git brd-partitions branch
[http://git.open-osd.org/gitweb.cgi?p=pmem.git;a=shortlog;h=refs/heads/brd-partitions]
Thanks
Boaz
<>
Post by Boaz Harrosh
[PATCH 1/5] axonram: Fix bug in direct_access
[PATCH 2/5] Change direct_access calling convention
<>
Post by Boaz Harrosh
[PATCH 3/5] brd: Add getgeo to block ops
<>
Post by Boaz Harrosh
[PATCH 4/5] brd: Fix all partitions BUGs
[PATCH 5/5] brd: Request from fdisk 4k alignment
Ping Jens?

Please tell me what you want to do with these patches. They must go into this Kernel
cycle before brd is released with wrong user-mode visible nub

If you want you can drop the [PATCH 3/5] which is not important and can go in next
cycle, but the others are real BUG fixes.

Thanks
Boaz
Jens Axboe
2014-09-09 14:25:35 UTC
Permalink
Post by Boaz Harrosh
Post by Boaz Harrosh
Jens Hi
What do you intend to do with these fixes? These are real bugs on devices
shipped for a while now. I think they need to go into current 3.17-rcX Kernel.
Jens hi
I have seen that the brd patch went into rc3.
I should insist then, that these patches go into this rc cycle ASAP.
For one they fix real *hard* unexpected bugs with brd that make it trash data.
But specially because the patch to brd that just went into Kernel introduces
a wrong module parameter "show_part" which must be removed. If we release a Kernel
with it, it will be pain, we will need to deprecate it which is long and annoying
procedure. It is wrong because it must be always on. Pleas see below
This patchset is based on what just went into Kernel.
Also on git here: git://git.open-osd.org/pmem.git brd-partitions branch
[https://urldefense.proofpoint.com/v1/url?u=http://git.open-osd.org/gitweb.cgi?p%3Dpmem.git%3Ba%3Dshortlog%3Bh%3Drefs/heads/brd-partitions&k=ZVNjlDMF0FElm4dQtryO4A%3D%3D%0A&r=3JMVyziIyZtZ5cv9eWNLwQ%3D%3D%0A&m=XOvuGmlhiuRy6LRMabkkeXBBXOBMB9BXeBpwsjyR98Y%3D%0A&s=c5919437c1a82b2bf63f9ca95ae6d96ebf2f0fbc2ed85bb6766df0add4af40b3]
Thanks
Boaz
<>
Post by Boaz Harrosh
[PATCH 1/5] axonram: Fix bug in direct_access
[PATCH 2/5] Change direct_access calling convention
<>
Post by Boaz Harrosh
[PATCH 3/5] brd: Add getgeo to block ops
<>
Post by Boaz Harrosh
[PATCH 4/5] brd: Fix all partitions BUGs
[PATCH 5/5] brd: Request from fdisk 4k alignment
Ping Jens?
Please tell me what you want to do with these patches. They must go into this Kernel
cycle before brd is released with wrong user-mode visible nub
If you want you can drop the [PATCH 3/5] which is not important and can go in next
cycle, but the others are real BUG fixes.
I will drop the 3/5 patch (do that one separately) and queue up and test.
--
Jens Axboe
Boaz Harrosh
2014-09-28 10:45:20 UTC
Permalink
On 09/09/2014 05:25 PM, Jens Axboe wrote:
<>
Post by Jens Axboe
I will drop the 3/5 patch (do that one separately) and queue up and test.
Jens hi

I'm reminding you of these brd bug fixes. Yes they are bug fixes for a driver
no one cares about. But it is a great pity if an unwanted user-mode accessible
parameter will hit a release. Lets solve this now please?

Is there anything I can help with?

Thanks
Boaz
Boaz Harrosh
2014-10-19 17:46:50 UTC
Permalink
<>
Post by Jens Axboe
Post by Boaz Harrosh
Please tell me what you want to do with these patches. They must go into this Kernel
cycle before brd is released with wrong user-mode visible nub
<>
Post by Jens Axboe
I will drop the 3/5 patch (do that one separately) and queue up and test.
Hi Jens

So the Kernel closed with a wrong knob and Even with the new merge window
these patches are not merged.

Could you at least tell me what you do not like about them so I can fix
it?

Thanks
Boaz

Loading...