Streaming Uploads Discussion

View: New views
16 Messages — Rating Filter:   Alert me  
< Prev | 1 - 2 | Next >

Re: Streaming Uploads Discussion

by Malcolm Tredinnick :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message



On Mon, 2008-03-24 at 11:02 -0700, Mike Axiak wrote:
> Now that we actually have a working patch [1], there are a few details
> I'd like to raise here.

I haven't had time to sit down and devote to reading through the whole
patch, but I have a possibly very easy question that I can't answer from
the docs at the moment.

I'm a simple guy. I like simple stuff. So if I'm deploying a Django
system and it will handle file uploads from some forms and all I want to
do is ensure that large file uploads aren't held in memory, how do I do
that? In other words, I want to avoid the problem that originally
prompted this ticket and the related one.

Does this Just Work(tm) out of the box?

Malcolm

--
Remember that you are unique. Just like everyone else.
http://www.pointy-stick.com/blog/


--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "Django developers" group.
To post to this group, send email to django-developers@...
To unsubscribe from this group, send email to django-developers-unsubscribe@...
For more options, visit this group at http://groups.google.com/group/django-developers?hl=en
-~----------~----~----~----~------~----~------~--~---


Re: Streaming Uploads Discussion

by Mike Axiak :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message


Hey,

On Mar 25, 3:32 pm, Malcolm Tredinnick <malc...@...>
wrote:
> Does this Just Work(tm) out of the box?
Yes. :-)

-Mike
--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "Django developers" group.
To post to this group, send email to django-developers@...
To unsubscribe from this group, send email to django-developers-unsubscribe@...
For more options, visit this group at http://groups.google.com/group/django-developers?hl=en
-~----------~----~----~----~------~----~------~--~---


Re: Streaming Uploads Discussion

by Mike Axiak :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message


On Mar 25, 3:38 pm, Mike Axiak <mcax...@...> wrote:
> Yes. :-)
Doh. I didn't realize you actually wanted me to do more work.

The newly updated doc can be found at the same URL:
http://orodruin.axiak.net/upload_handling.html

Be sure to refresh as the browser caching is high on that page.

-Mike
--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "Django developers" group.
To post to this group, send email to django-developers@...
To unsubscribe from this group, send email to django-developers-unsubscribe@...
For more options, visit this group at http://groups.google.com/group/django-developers?hl=en
-~----------~----~----~----~------~----~------~--~---


Re: Streaming Uploads Discussion

by Malcolm Tredinnick :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message



On Tue, 2008-03-25 at 22:59 -0700, Mike Axiak wrote:
> On Mar 25, 3:38 pm, Mike Axiak <mcax...@...> wrote:
> > Yes. :-)
> Doh. I didn't realize you actually wanted me to do more work.

I didn't either. I thought I may have just been missing something
obvious. But now that you've written the extra bits at the top of the
document, it makes more sense to me as a user. Thanks.

Okay, so this all has to go on the review pile now I guess. That should
be fun, for some value of "fun". Nice work, Mike.

Regards,
Malcolm

--
No one is listening until you make a mistake.
http://www.pointy-stick.com/blog/


--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "Django developers" group.
To post to this group, send email to django-developers@...
To unsubscribe from this group, send email to django-developers-unsubscribe@...
For more options, visit this group at http://groups.google.com/group/django-developers?hl=en
-~----------~----~----~----~------~----~------~--~---


Re: Streaming Uploads Discussion

by Jacob Kaplan-Moss-3 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message


On Mon, Mar 24, 2008 at 1:02 PM, Mike Axiak <mcaxiak@...> wrote:
>  Now that we actually have a working patch [1], there are a few details
>  I'd like to raise here.

Woo! Thanks for your hard work. My thoughts on your questions follow inline:

>  Supporting dictionaries in form code
>  ------------------------------------
>  [...]
>    TextFileForm(data={'description': u'Assistance'}, files={'file':
>  {'filename': 'test1.txt', 'content': 'hello world'}})

What would an equivalent line look like under the new system? That is, what do
folks need to change their tests to?

>  I see three options to deal with this:
>  [...]
>   2. Modify the form code to access the attributes of the
>  UploadedFile, but on AttributeError use the old dict-style interface
>  and emit a DeprecationWarning.

Yes, this is correct approach. Something along these lines:

   if isinstance(uploadedfile, dict):
        warn(...)
        uploadedfile = uploadedfile_from_dict(uploadedfile)

Option #3 is unacceptable: if at all possible we want people's tests to
not break. Warnings are fine; breakage if avoidable is a bad idea.

>  The other issue is what we should do with the tests. Should we
>  leave them? Should we copy them and create a copy of them for the new
>  style? Should we replace them with the new style?

The latter -- fix all the tests to use the new syntax as a demo of how it's
supposed to be done.

>  Having upload_handlers be a list object
>  ---------------------------------------
>  [...]
>  Therefore, I decided to just leave it as a plain old list. Thus, to
>  add an upload handler, you'd just write::
>
>    request.upload_handlers.append(some_upload_handler)
>
>  And to replace them all::
>
>    request.upload_handlers = [some_upload_handler]

If we do indeed need this -- see below -- then this is the right way to do it.

>  What do people think about this?

I'm thinking YAGNI here. Why would I need multiple upload handlers? I think you
need to talk me through your thinking here, because at first glance this smacks
of overegineering. Remember that multiple handlers can always be accomplished by
composition anyway, so unless there's a good reason I think set_upload_handler()
is just much cleaner.

Similarly, I'm a bit suspicious of FILE_UPLOAD_HANDLERS. Couldn't you just write
a simple middleware to do the same thing? As a general principle, you should try
as hard as you can to avoid introducing new settings; if there's another way
just do it that way. In this case, I'd just document that if you want to use a
custom upload handler globally that you should write a middleware class.

>  (Mis)Handling Content-Length
>  ----------------------------
>  [...]
>  There's probably not much room for argument here, but it's worth
>  asking a larger group. Currently when you try uploading Large files
>  (~2GB and greater), you will get a weird Content-Length header (less
>  than zero, overflowing).

Personally, I can't see any reason to care too much about people trying
to upload 2GB files through the web, anyway. Anyone allowing uploads should
have a sane upload size limit set at the web server level. Anyone who's allowing
uploads of over 2GB is just asking to get DOSed.

I think this is a place where we just add a note to the docs about setting the
Apache/lighttpd/etc. upload limit and move on.

>  Revised API
>  ===========
>  [...]
>  Let me know if you have any comments on the API. (Things like how it
>  deals with multiple handlers could be up for discussion.)

I quite like the API. I'm not sure why you'd need to return a custom
UploadedFile from an upload handler, but props for documenting the interface
anyway :)

Thanks again for the hard work -- this looks very good!

Jacob

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "Django developers" group.
To post to this group, send email to django-developers@...
To unsubscribe from this group, send email to django-developers-unsubscribe@...
For more options, visit this group at http://groups.google.com/group/django-developers?hl=en
-~----------~----~----~----~------~----~------~--~---


Re: Streaming Uploads Discussion

by Mike Axiak :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message


Thanks for the review!

On Apr 4, 12:28 pm, "Jacob Kaplan-Moss" <jacob.kaplanm...@...>
wrote:
> >    TextFileForm(data={'description': u'Assistance'}, files={'file':
> >  {'filename': 'test1.txt', 'content': 'hello world'}})
>
> What would an equivalent line look like under the new system? That is, what do
> folks need to change their tests to?
The patch does this for all the tests in tests/, but here's an
example:

    TextFileForm(data={'description': u'Assistance'}, files={'file':
SimpleUploadedFile('test1.txt', 'hello world')})

Note the SimpleUploadedFile, which takes the filename and the content.

> Yes, this is correct approach. Something along these lines:
>
>    if isinstance(uploadedfile, dict):
>         warn(...)
>         uploadedfile = uploadedfile_from_dict(uploadedfile)

I like this way better than special-casing it. I'll update this in the
patch.

> I'm thinking YAGNI here. Why would I need multiple upload handlers? I think you
> need to talk me through your thinking here, because at first glance this smacks
> of overegineering. Remember that multiple handlers can always be accomplished by
> composition anyway, so unless there's a good reason I think set_upload_handler()
> is just much cleaner.

Composition gets a little tricky. I realized this when I wrote the
stuff that handles the composition.
Right now we have (InMemoryUploadHandler, TemporaryFileUploadHandler)
in a sequence, and now the memory handler doesn't have to worry about
what happens nor does the TemporaryFileUploadHandler need to worry
about what's before.
Basically every upload handler I can think of writing benefits from
this. For example, the UploadProgressUploadHandler would be wedged
between:

    (InMemoryUploadHandler, UploadProgressUploadHandler,
TemporaryFileUploadHandler)

So that file progress only gets computed for uploads that go into the
file system. Avoiding the overhead for small uploads.

> Similarly, I'm a bit suspicious of FILE_UPLOAD_HANDLERS. Couldn't you just write
> a simple middleware to do the same thing? As a general principle, you should try
> as hard as you can to avoid introducing new settings; if there's another way
> just do it that way. In this case, I'd just document that if you want to use a
> custom upload handler globally that you should write a middleware class.

This took a while for me to get sold on. I can only think of 5 or 6
useful upload handlers, but that's still 1956 possible orderings. It'd
be tough to make a handler easy to install when it has to be aware of
where it belongs. This functionality could be replicated in a
middleware which reads the settings, but I'm not sure that's the best
place for something like that.

> I quite like the API. I'm not sure why you'd need to return a custom
> UploadedFile from an upload handler, but props for documenting the interface
> anyway :)

If the upload handlers don't alter the content or represent any
storage change then sure they wouldn't need an additional UploadedFile
class.
However, the instant they pass the file to some remote location
(think: S3UploadHandler) or alter the content (think:
GZipUploadHandler) they will need their own way of defining what is
content and how it should be "saved".

Thanks for the thorough review.

Cheers,
Mike

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "Django developers" group.
To post to this group, send email to django-developers@...
To unsubscribe from this group, send email to django-developers-unsubscribe@...
For more options, visit this group at http://groups.google.com/group/django-developers?hl=en
-~----------~----~----~----~------~----~------~--~---


Re: Streaming Uploads Discussion

by Jacob Kaplan-Moss-3 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message


On Fri, Apr 4, 2008 at 1:04 PM, Mike Axiak <mcaxiak@...> wrote:

>  Composition gets a little tricky. I realized this when I wrote the
>  stuff that handles the composition.
>  Right now we have (InMemoryUploadHandler, TemporaryFileUploadHandler)
>  in a sequence, and now the memory handler doesn't have to worry about
>  what happens nor does the TemporaryFileUploadHandler need to worry
>  about what's before.
>  Basically every upload handler I can think of writing benefits from
>  this. For example, the UploadProgressUploadHandler would be wedged
>  between:
>
>     (InMemoryUploadHandler, UploadProgressUploadHandler,
>  TemporaryFileUploadHandler)
>
>  So that file progress only gets computed for uploads that go into the
>  file system. Avoiding the overhead for small uploads.

OK, I think I understand; thanks. It's hard sometimes figuring out a
thought process from its final result. I think you're probably right
to make this a list, then.

>  This [FILE_UPLOAD_HANDLERS] took a while for me to get sold on. I can
>  only think of 5 or 6 useful upload handlers, but that's still 1956
>  possible orderings. It'd be tough to make a handler easy to install
>  when it has to be aware of where it belongs. This functionality could
>  be replicated in a middleware which reads the settings, but I'm not
>  sure that's the best place for something like that.

Hrm, good point. I'll chew a bit more, but I can't think of a good way
to avoid the extra setting (as much as I dislike creeping settings).

Jacob

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "Django developers" group.
To post to this group, send email to django-developers@...
To unsubscribe from this group, send email to django-developers-unsubscribe@...
For more options, visit this group at http://groups.google.com/group/django-developers?hl=en
-~----------~----~----~----~------~----~------~--~---


Re: Streaming Uploads Discussion

by Mike Axiak :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message


On Apr 4, 2:46 pm, "Jacob Kaplan-Moss" <jacob.kaplanm...@...>
wrote:
> Hrm, good point. I'll chew a bit more, but I can't think of a good way
> to avoid the extra setting (as much as I dislike creeping settings).

I didn't want to use the extra setting either, but I finally caved in
after working with it (and discussing with Ivan). I will certainly
explore any other possibilities.

>
> Jacob
--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "Django developers" group.
To post to this group, send email to django-developers@...
To unsubscribe from this group, send email to django-developers-unsubscribe@...
For more options, visit this group at http://groups.google.com/group/django-developers?hl=en
-~----------~----~----~----~------~----~------~--~---


Re: Streaming Uploads Discussion

by Marty Alchin-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message


On Fri, Apr 4, 2008 at 2:04 PM, Mike Axiak <mcaxiak@...> wrote:
>  However, the instant they pass the file to some remote location
>  (think: S3UploadHandler) or alter the content (think:
>  GZipUploadHandler) they will need their own way of defining what is
>  content and how it should be "saved".

Maybe I'm missing something obvious, but why would there ever be an
S3UploadHandler? Shouldn't that be handled by a file storage backend?

As for GZipUploadHandler, if you're talking about gzipping it as part
of the save process, shouldn't that also be a file storage backend? Of
course, if you're talking about *un*gzipping it during upload, so it
can be processed by Python, I withdraw the question.

I admit I haven't been following this terribly closely, but now that
both #5361 and #2070 are nearing completion, I'm trying to get a good
handle on all of this in case there are any interactions between the
two that I can help with.

-Gul

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "Django developers" group.
To post to this group, send email to django-developers@...
To unsubscribe from this group, send email to django-developers-unsubscribe@...
For more options, visit this group at http://groups.google.com/group/django-developers?hl=en
-~----------~----~----~----~------~----~------~--~---


Re: Streaming Uploads Discussion

by Jacob Kaplan-Moss-3 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message


On Fri, Apr 4, 2008 at 2:29 PM, Marty Alchin <gulopine@...> wrote:
>  Maybe I'm missing something obvious, but why would there ever be an
>  S3UploadHandler? Shouldn't that be handled by a file storage backend?

Yeah, one thing we'll need to figure out PDQ is what's appropriate for
an upload handler, and what's appropriate for a storage backend.
Hopefully the two of you can work out the breakdown.

FYI, I plan to merge 2070 first (sorry, Marty!) since I think it works
a bit better that way from a dependancy POV.

Jacob

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "Django developers" group.
To post to this group, send email to django-developers@...
To unsubscribe from this group, send email to django-developers-unsubscribe@...
For more options, visit this group at http://groups.google.com/group/django-developers?hl=en
-~----------~----~----~----~------~----~------~--~---


Re: Streaming Uploads Discussion

by Marty Alchin-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message


On Fri, Apr 4, 2008 at 3:43 PM, Jacob Kaplan-Moss
<jacob.kaplanmoss@...> wrote:
>  Yeah, one thing we'll need to figure out PDQ is what's appropriate for
>  an upload handler, and what's appropriate for a storage backend.
>  Hopefully the two of you can work out the breakdown.

I'll read over the patch and the docs and see if I can get a better
handle on how it works, so I can be of more use there. Also, Mike and
I put our heads together in IRC sometimes, so we should be able to get
it sorted out soon.

>  FYI, I plan to merge 2070 first (sorry, Marty!) since I think it works
>  a bit better that way from a dependancy POV.

No worries. I still have another of David's suggestions to integrate
before I can have it "finished" anyway. Plus, there's a 3291-ticket
age difference. The youngest always gets the shaft. :) I can live with
that.

-Gul

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "Django developers" group.
To post to this group, send email to django-developers@...
To unsubscribe from this group, send email to django-developers-unsubscribe@...
For more options, visit this group at http://groups.google.com/group/django-developers?hl=en
-~----------~----~----~----~------~----~------~--~---


Re: Streaming Uploads Discussion

by Mike Axiak :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message


On Apr 4, 3:29 pm, "Marty Alchin" <gulop...@...> wrote:
> I admit I haven't been following this terribly closely, but now that
> both #5361 and #2070 are nearing completion, I'm trying to get a good
> handle on all of this in case there are any interactions between the
> two that I can help with.
Yeah, it's an interesting interaction. We can talk about this more on
IRC, but let me try to outline a few ideas here.
I see both 2070 (one half of it) and 5361 both doing the same thing:
representing files. However, they are representing files in different
stages of life.
One is tied to the database (#5361), and one can potentially be
half-written (#2070). I forsee a time when we actually merge these
into one, but I think should do this incrementally.

Let's look at the S3 uploading process. There are two ways we can
handle the upload and send to S3:

   1. Stream the data directly to S3.
   2. Stream the data to disk, then send to S3.

I think a lot of people might opt to do option (1). In this case, the
upload handler will need to stream it to the S3 directly, but it needs
to give request.FILES *something* that the user/forms/db layer can
interact with. In this case, I gave the example of an S3UploadedFile.
It has a size, a way to read data from it, and a filename. In light of
#5361, if your backend is S3FileBackend, I would imagine it would
notice the S3UploadedFile and say "Oh, saving this is trivial. Let me
just save attribute ____". But that doesn't mean you can't save a
normal file in the S3FileBackend, does it?

Again, this is a little tricky, and to get right you and I will have
to work out some more of the details. But I do see them as usefully
somewhat separate identities.

Cheers,
Mike

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "Django developers" group.
To post to this group, send email to django-developers@...
To unsubscribe from this group, send email to django-developers-unsubscribe@...
For more options, visit this group at http://groups.google.com/group/django-developers?hl=en
-~----------~----~----~----~------~----~------~--~---


Re: Streaming Uploads Discussion

by Malcolm Tredinnick :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message



On Fri, 2008-04-04 at 13:23 -0700, Mike Axiak wrote:
[...]
> Let's look at the S3 uploading process. There are two ways we can
> handle the upload and send to S3:
>
>    1. Stream the data directly to S3.
>    2. Stream the data to disk, then send to S3.
>
> I think a lot of people might opt to do option (1).

Then those people deserve to be beaten heavily about the head and
shoulders. S3 is NOT a reliable upload endpoint. They (Amazon) say
there'll be approximately a 1% failure rate for attempted uploads. As 37
signals have noted in the past (they use it for their whiteboard file
storage), that doesn't mean that when you try again it will work,
either. It means, in practice, that periodically for 1% of the time, all
your uploads will fail. Then, a little later, they'll start to work. But
there could be minutes on end where you cannot upload. So unless your
web application is designed to intentionally lose data (in which case I
can think of a big optimisation on the saving end to lose it even faster
and more reliably), you must save to disk before trying to upload to S3.

In short, I can't see any reason the file upload handler should care
about storage systems like this.

Regards,
Malcolm

--
Experience is something you don't get until just after you need it.
http://www.pointy-stick.com/blog/


--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "Django developers" group.
To post to this group, send email to django-developers@...
To unsubscribe from this group, send email to django-developers-unsubscribe@...
For more options, visit this group at http://groups.google.com/group/django-developers?hl=en
-~----------~----~----~----~------~----~------~--~---


Re: Streaming Uploads Discussion

by Ivan Sagalaev :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message


Mike Axiak wrote:
> I didn't want to use the extra setting either, but I finally caved in
> after working with it (and discussing with Ivan). I will certainly
> explore any other possibilities.

My original reason for settings was that a single upload handler can't
possibly know the semantics of other upload handlers and can't decide
where to put itself in the list. This should be a decision of a
developer who compiles a project from third-party apps with upload
handlers how to arrange them. It's very similar to middleware: the order
is decided in settings. The only exception would be when a handler
really depends on some other handler which it can require be raising an
Exception somewhere.

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "Django developers" group.
To post to this group, send email to django-developers@...
To unsubscribe from this group, send email to django-developers-unsubscribe@...
For more options, visit this group at http://groups.google.com/group/django-developers?hl=en
-~----------~----~----~----~------~----~------~--~---


Re: Streaming Uploads Discussion

by Ivan Sagalaev :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message


Malcolm Tredinnick wrote:
> Then those people deserve to be beaten heavily about the head and
> shoulders. S3 is NOT a reliable upload endpoint. They (Amazon) say
> there'll be approximately a 1% failure rate for attempted uploads. As 37
> signals have noted in the past (they use it for their whiteboard file
> storage), that doesn't mean that when you try again it will work,
> either. It means, in practice, that periodically for 1% of the time, all
> your uploads will fail.

Nobody stops a developer from doing both things in parallel: storing on
disk and streaming to S3. Then when S3 fails a stored file can be
scheduled for repeating uploading.

The reason for not doing it only over-the-disk way is speed. Since 99%
of uploads do succeed they will gain heavily from not doing writes and
reads of the whole file on local disk.

Anyway S3 is just an example. It could be some local media-server
instead. I think the main reason between an upload handler and a file
backend is that the latter is generally simpler to write and the former
is generally more flexible.

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "Django developers" group.
To post to this group, send email to django-developers@...
To unsubscribe from this group, send email to django-developers-unsubscribe@...
For more options, visit this group at http://groups.google.com/group/django-developers?hl=en
-~----------~----~----~----~------~----~------~--~---


Re: Streaming Uploads Discussion

by Ivan Sagalaev :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message


Ivan Sagalaev wrote:
> Nobody stops a developer from doing both things in parallel: storing on
> disk and streaming to S3.
<skip>
> they will gain heavily from not doing writes and
> reads of the whole file on local disk.

Looks like I'm contradicting myself a bit here :-). But it's not the
point actually :-)

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "Django developers" group.
To post to this group, send email to django-developers@...
To unsubscribe from this group, send email to django-developers-unsubscribe@...
For more options, visit this group at http://groups.google.com/group/django-developers?hl=en
-~----------~----~----~----~------~----~------~--~---

< Prev | 1 - 2 | Next >