Batch items download

Mar 27, 2013 at 5:05 PM
Hi,
I noticed that cloning a large TFS repository takes a long time. I started investigating the reason in the git-tf source code and saw that in CreateCommitForChangesetVersionSpecTask - createBlob each item is downloaded separately.

I changed the code to batch-download the items to the temp directory and removed the downloadFile call in createBlob.

It cut the download time from 12 seconds to 4 seconds (!).
What is the reason for not using batch download?
Developer
Mar 27, 2013 at 5:14 PM
Hello yosy,

This is on our backlog to improve in the future. There are 2 things we need to do to make this scenario faster and perform better:

1) Batch download as you specified above.

2) When performing a deep clone today, we call QueryItems for every changeset we need to download (we only download the changed items, though, which is good). But QueryItems brings in more data from the server than is needed; we need to change that call to QueryChangeset, which only brings down info on the changed items.

But this scenario surely needs some work.

Thanks for your suggestion,

Thanks,
Youhana
Mar 27, 2013 at 5:28 PM
Hi Youhana,

Can you explain the second point? Where are we calling QueryChangeset? Do you mean QueryHistory?

I am wondering, where can I see the roadmap/backlog of the project?



Thanks,
Yosy
Developer
Mar 27, 2013 at 5:40 PM
Hello,

For example, to do a deep clone today we call the following:

QueryHistory - to get all the changesets to create commits for
For each changeset in QueryHistory, we call QueryItems at that version to see what the tree looks like
Then for each item returned, we check whether we have downloaded it before, using the version. If yes, we just reuse that blob id; if no, we call DownloadItem.

The main advantage of the logic above is that it is simple, but it suffers from performance bottlenecks.
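
The simple logic can be sketched roughly as follows. This is only an illustration, not the git-tf source: the Server interface, the Item class and cloneAll are hypothetical stand-ins for the QueryHistory, QueryItems and DownloadItem calls mentioned above, and blob ids are plain strings.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SimpleDeepClone {
    /** Hypothetical versioned item; not a TFS SDK type. */
    static final class Item {
        final String path;
        final int version;

        Item(String path, int version) {
            this.path = path;
            this.version = version;
        }

        /** Cache key: the same path at the same version is the same content. */
        String key() {
            return path + "@" + version;
        }
    }

    /** Hypothetical stand-in for the server calls named in the post. */
    interface Server {
        List<Integer> queryHistory();          // changeset ids, oldest first
        List<Item> queryItems(int changeset);  // the full tree at that changeset
        String downloadItem(Item item);        // downloads content, returns a blob id
    }

    /** Returns changeset -> (path -> blob id), downloading each item version once. */
    static Map<Integer, Map<String, String>> cloneAll(Server server) {
        Map<String, String> blobByItemVersion = new HashMap<>();
        Map<Integer, Map<String, String>> trees = new LinkedHashMap<>();
        for (int changeset : server.queryHistory()) {
            Map<String, String> tree = new LinkedHashMap<>();
            // The whole tree is queried for every changeset: simple, but chatty.
            for (Item item : server.queryItems(changeset)) {
                String blobId = blobByItemVersion.get(item.key());
                if (blobId == null) {
                    // Not seen at this version before, so download it.
                    blobId = server.downloadItem(item);
                    blobByItemVersion.put(item.key(), blobId);
                }
                tree.put(item.path, blobId);
            }
            trees.put(changeset, tree);
        }
        return trees;
    }
}
```

Note that the cache only saves downloads; the full tree listing is still fetched once per changeset, which is the extra data mentioned earlier.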

The alternative, more complex logic would roughly look like this:

QueryHistory - to get all the changesets to create commits for
Call QueryItems for the first changeset only
For each changeset after that, call QueryChangesetDetails
Then replay the changes that happened in every changeset
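
A minimal sketch of that replay approach, under stated assumptions: the Server interface, Change class and ChangeType enum below are hypothetical, the change model is reduced to add/edit/delete, and renames, branches and merges are ignored. It is not how git-tf is implemented.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ReplayDeepClone {
    /** Simplified change kinds (an assumption; real changesets carry more). */
    enum ChangeType { ADD, EDIT, DELETE }

    /** Hypothetical description of one changed item in a changeset. */
    static final class Change {
        final ChangeType type;
        final String path;
        final int version;

        Change(ChangeType type, String path, int version) {
            this.type = type;
            this.path = path;
            this.version = version;
        }
    }

    /** Hypothetical stand-in for the server calls named in the post. */
    interface Server {
        List<Integer> queryHistory();                       // oldest first
        Map<String, Integer> queryItems(int changeset);     // path -> version
        List<Change> queryChangesetDetails(int changeset);  // changed items only
        String downloadItem(String path, int version);      // returns a blob id
    }

    /** Returns changeset -> (path -> blob id), built by replaying changes. */
    static Map<Integer, Map<String, String>> cloneAll(Server server) {
        Map<Integer, Map<String, String>> trees = new LinkedHashMap<>();
        Map<String, String> tree = new LinkedHashMap<>();
        boolean first = true;
        for (int changeset : server.queryHistory()) {
            if (first) {
                // Only the first changeset needs the full tree listing.
                for (Map.Entry<String, Integer> e : server.queryItems(changeset).entrySet()) {
                    tree.put(e.getKey(), server.downloadItem(e.getKey(), e.getValue()));
                }
                first = false;
            } else {
                // Later changesets only touch the items that changed.
                for (Change c : server.queryChangesetDetails(changeset)) {
                    if (c.type == ChangeType.DELETE) {
                        tree.remove(c.path);
                    } else {
                        tree.put(c.path, server.downloadItem(c.path, c.version));
                    }
                }
            }
            // Snapshot the running tree for this changeset's commit.
            trees.put(changeset, new LinkedHashMap<>(tree));
        }
        return trees;
    }
}
```

The extra complexity comes from keeping the running tree correct across changesets, which is what the simpler per-changeset QueryItems approach avoids.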

We use CodePlex to track all of the work for git-tf; there are a couple of issues around improving clone performance for large repos.

Thanks,
Youhana
Mar 30, 2013 at 2:32 PM
Edited Mar 30, 2013 at 2:32 PM
Hi Youhana,
I am doing some experiments with downloadItems, trying to run multiple clones at the same time (using an ExecutorService with 15 threads).

As far as I can see, for some reason, the method gets slower and slower as more items are downloaded:
Downloading 3098 items for changeset 14 , TOOK - 72113 ms
Downloading 3098 items for changeset 14 , TOOK - 98789 ms
Downloading 3098 items for changeset 14 , TOOK - 95447 ms
Downloading 3098 items for changeset 14 , TOOK - 96630 ms
Downloading 3098 items for changeset 14 , TOOK - 95786 ms
Downloading 3098 items for changeset 14 , TOOK - 95399 ms
Downloading 3098 items for changeset 14 , TOOK - 96376 ms
Downloading 3098 items for changeset 14 , TOOK - 95063 ms
Downloading 3098 items for changeset 14 , TOOK - 93769 ms
Downloading 3098 items for changeset 14 , TOOK - 94540 ms
Downloading 3098 items for changeset 14 , TOOK - 93962 ms
Downloading 3098 items for changeset 14 , TOOK - 94702 ms
Downloading 3098 items for changeset 14 , TOOK - 93103 ms
Downloading 3098 items for changeset 14 , TOOK - 88811 ms
Downloading 3098 items for changeset 14 , TOOK - 49782 ms
Downloading 3098 items for changeset 14 , TOOK - 33543 ms
Downloading 3098 items for changeset 14 , TOOK - 58334 ms
Downloading 3098 items for changeset 14 , TOOK - 36073 ms
Downloading 3098 items for changeset 14 , TOOK - 21014 ms
Downloading 3098 items for changeset 14 , TOOK - 21544 ms
With a single download at a time, it takes 4 to 5 seconds.
As the documentation says, VersionControlClient is thread-safe. Do you know why this happens, or where I should ask?
Developer
Mar 30, 2013 at 6:21 PM
Here is fine. Another appropriate forum would be the TFS Cross-Platform / Team Explorer Everywhere forum.

When we talk about thread safety and thread compatibility, we talk about the guarantees that you can make when calling from multiple threads with regard to race conditions, locking, memory, etc. Nothing in your post suggests that we're violating thread safety guarantees.

Posting your sample code here would be elucidating. There are a number of factors at play, but I suspect that you're maxing out the HTTP worker thread pool. We limit the number of simultaneous connections to a single host, like most HTTP clients. Thus there's an upper limit on the number of downloads that can occur at once, so throwing more threads at the problem is only going to make them wait.

That's just my hunch, though.
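
To illustrate the hunch, here is a small simulation, not the real client: a Semaphore stands in for the per-host connection limit, and Thread.sleep stands in for a download. The class name, the run method and all of the numbers are assumptions made up for the sketch.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class ConnectionLimitDemo {
    /**
     * Runs `tasks` fake downloads on `threads` workers that share
     * `maxConnections` permits. Returns the peak number of downloads
     * that were actually in flight at the same time.
     */
    static int run(int threads, int maxConnections, int tasks, long millisPerDownload) {
        Semaphore connections = new Semaphore(maxConnections);
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < tasks; i++) {
            pool.submit(() -> {
                try {
                    // Workers beyond the limit block here, doing nothing useful.
                    connections.acquire();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                try {
                    int now = inFlight.incrementAndGet();
                    peak.accumulateAndGet(now, Math::max);
                    Thread.sleep(millisPerDownload); // simulated download
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    inFlight.decrementAndGet();
                    connections.release();
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return peak.get();
    }
}
```

Calling ConnectionLimitDemo.run(15, 2, 20, 5) never reports more than 2 downloads in flight, however many worker threads are used; the other threads just queue for a permit.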

-ed
Mar 30, 2013 at 6:24 PM
Hi ethomson,
Thanks for the detailed answer. How can I increase the number of threads in the HTTP worker thread pool?