Original Post: 9/13/2017 9:09:46 AM
Search indexing time can vary depending on the volume of Code and Work Items under the Collection. In another post, I briefly explained the two types of indexing that happen:
- Bulk Indexing (BI), where all code and work item artifacts in all projects/repositories under a Collection are indexed. This is a time-consuming operation whose duration depends on the size of the artifacts under the collection.
- Continuous Indexing (CI), which handles all incremental updates to the artifacts (add/update/delete) and indexes them. This is a notification-based model where the indexer listens to TFS events and acts on those event notifications. CI handles almost all update operations, including CRUD operations at the Project/Repository/Collection layer (such as Repository renames, Project adds/deletes, etc.). The time a CI operation takes again depends on the size of the incremental update. BI always precedes CI, i.e. a CI will never execute on a project/repository until BI has completed for it.
Irrespective of whether it's BI or CI, the indexer always does an incremental metadata fetch of the documents to be indexed, followed by the document content fetch, and then feeds the documents to Elasticsearch in batches for indexing. Since there is no strict bound on the number of documents in a repository (for BI) or on the number of document updates in a Git commit or TFVC check-in (for CI), the indexer does not know this count upfront (evaluating it is itself an extremely expensive operation).
In addition, the throughput of each indexer job run (i.e. docs/sec indexed) may vary depending on the TFS load, the Application Tier CPU load, and parameters such as network throughput (when Elasticsearch is running on a remote machine).
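The fetch-then-feed flow described above can be sketched roughly as follows. This is only an illustration of streaming documents in batches without knowing the total count upfront; the function names (crawl_metadata, fetch_content, index_all) and the batch size are hypothetical and are not taken from the actual indexer.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batched(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most batch_size items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def crawl_metadata(paths: Iterable[str]) -> Iterator[str]:
    # Hypothetical stand-in for the incremental metadata fetch: paths are
    # produced lazily, so the total count is never needed upfront.
    for path in paths:
        yield path


def fetch_content(path: str) -> str:
    # Hypothetical stand-in for the document content fetch.
    return f"contents of {path}"


def index_all(paths: Iterable[str], batch_size: int = 3) -> List[int]:
    """Feed documents in batches and return the size of each batch sent."""
    sent = []
    for batch in batched(crawl_metadata(paths), batch_size):
        documents = [fetch_content(p) for p in batch]
        # A real feeder would send `documents` to Elasticsearch here.
        sent.append(len(documents))
    return sent


print(index_all([f"$/Project/file{i}.cs" for i in range(7)]))
```

Because the crawl is a generator, the feeder only ever holds one batch in memory and never needs the overall document count.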
So the question arises: how can I track the indexing progress and completion? Here are a few approaches you can try to get insight into it:
Check recent indexing status through script
Run this script. It reports data such as the number of repositories for which BI ("fresh indexing", as the script calls it) has completed, is in progress, or has failed in the last 'N' days. It also shows similar stats for CI.
(Make sure to pick the script under the correct TFS release version folder)
Checking indexing progress for a specific repository (for Code) or project (for Work Item)
This is a bit trickier. For a large repository, all the documents may not be indexed in a single job instance. Indexer jobs are time-boxed. After a certain time (5-10 minutes), they are supposed to yield, requeue, and resume from where they left off in the previous run for that repository. This applies to BI (say, a large repository) as well as CI (a large commit).
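As a rough illustration of the time-boxing idea (not the indexer's actual implementation), a job can be modeled as a loop that stops once its time budget is spent and returns a position to resume from. The function run_timeboxed, its parameters, and the position-based checkpoint are all assumptions made for this sketch.

```python
import time
from typing import Callable, List, Tuple


def run_timeboxed(
    pending: List[str],
    start_at: int,
    budget_seconds: float,
    index_one: Callable[[str], None],
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[int, bool]:
    """Index documents from start_at until done or the time budget is spent.

    Returns (next_position, finished). When finished is False, the caller
    requeues the job and passes next_position back in as start_at.
    """
    deadline = clock() + budget_seconds
    position = start_at
    while position < len(pending):
        if clock() >= deadline:
            return position, False  # yield: partially indexed
        index_one(pending[position])
        position += 1
    return position, True  # everything indexed


# Requeue until the whole (tiny, made-up) repository has been indexed.
documents = ["a.cs", "b.cs", "c.cs"]
indexed: List[str] = []
position, finished, runs = 0, False, 0
while not finished:
    position, finished = run_timeboxed(documents, position, 600.0, indexed.append)
    runs += 1
print(runs, indexed)
```

The clock is injectable only so the yield path can be exercised without waiting for real time to pass.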
- Get the AssociatedJobId for the repository/project you want to track.
SELECT [AssociatedJobId]
FROM [<CollectionDB>].[Search].[tbl_IndexingUnit]
WHERE TFSEntityAttributes like '%{Provide your RepositoryName (for Code) or ProjectName (for WorkItem)}%'
and EntityType = ' ' -- Provide 'Code' or 'WorkItem'
and IndexingUnitType = ' ' -- Provide 'TFVC_Repository' or 'Git_Repository' (for code) or 'Project' (for WorkItem)
- Use the AssociatedJobId to filter the results from the JobHistory table.
SELECT [StartTime]
,[EndTime]
,[Result]
,[ResultMessage]
FROM [Tfs_Configuration].[dbo].[tbl_JobHistory]
where JobId = ' ' -- AssociatedJobId from tbl_IndexingUnit
order by StartTime desc
You will see ResultMessage values such as:
Events (9) completed with status Succeeded. Event 9 completed with message 'BeginBulkIndex-AccountFaultIn: TimeboxSupportedCPF : Crawled 1000 documents for TFSEntityId ... Successfully indexed TFVC_Repository with Id ... Feeder phase Documents Updated ...
Events (12) completed with status Succeeded. Event 12 completed with message 'UpdateIndex-PushEventNotification: TimeboxSupportedCPF : Crawled 75 documents for TFSEntityId ... Successfully indexed TFVC Repository Id ... Feeder phase Documents Updated ...
Events (6) completed with status Succeeded. Event 6 completed with message 'BeginBulkIndex-AccountFaultIn: Successfully Crawled 50 workItems revisions or discussions and Indexed ... Successfully indexed WorkItems in Project Id ... Feeder phase Documents Updated ... '.
As explained earlier in this post, depending on whether this job instance completed the indexing of the repository or yielded after indexing it partially, the ResultMessage should mention either "Successfully indexed" or "Partially indexed".
The documents crawled/updated counts from the Crawler/Feeder phases indicate how many documents were indexed in this job instance. Across all job instances, these should finally add up to the total number of documents that are supposed to be indexed for the repository/project. So if you have an approximate count of the total code files/work items in your collection, you can estimate how many documents are already indexed and how many remain.
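If you export the ResultMessage column, the adding-up described above can be scripted. The sketch below assumes the messages contain the "Crawled N documents" / "Crawled N workItems" and "Successfully indexed" wording seen in the samples in this post; the real messages are longer (they are elided here), so the pattern may need adjusting. The helper names and the estimated total are hypothetical.

```python
import re
from typing import Iterable, Tuple

# Matches the counts seen in the sample messages, e.g.
# "Crawled 1000 documents" or "Crawled 50 workItems".
CRAWLED = re.compile(r"Crawled (\d+) (?:documents|workItems)")


def summarize(result_messages: Iterable[str]) -> Tuple[int, bool]:
    """Return (total crawled, whether any run reported full completion)."""
    total = 0
    completed = False
    for message in result_messages:
        total += sum(int(count) for count in CRAWLED.findall(message))
        if "Successfully indexed" in message:
            completed = True
    return total, completed


def percent_done(total_crawled: int, estimated_total: int) -> float:
    """Rough progress, capped at 100; estimated_total is your own estimate."""
    if estimated_total <= 0:
        raise ValueError("estimated_total must be positive")
    return min(100.0, 100.0 * total_crawled / estimated_total)


messages = [
    "Crawled 1000 documents for TFSEntityId ... Partially indexed ...",
    "Crawled 75 documents for TFSEntityId ... Successfully indexed ...",
]
total, completed = summarize(messages)
print(total, completed, percent_done(total, 4300))
```

Note that the job history for a repository can include both BI and CI runs, so pass in only the rows for the operation you are tracking.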
One final point. For Code indexing in large TFVC Repositories, and specifically for BI, you can also check how many folders are still pending indexing by counting the entries in [<CollectionDB>].[Search].[tbl_JobYield] for that Repository. The query below uses Tfs_DefaultCollection as the collection database; substitute your own.
SELECT BatchId, Content
FROM [Tfs_DefaultCollection].[Search].[tbl_JobYield] as JY
INNER JOIN [Tfs_DefaultCollection].[Search].[tbl_IndexingUnit] as IU ON
JY.TFSEntityId = IU.TFSEntityId WHERE
IU.TFSEntityAttributes like '%{Provide your RepositoryName or ProjectName}%'
and IU.EntityType = '' -- Provide 'Code' or 'WorkItem'
and IU.IndexingUnitType = '' -- Provide 'TFVC_Repository', 'Git_Repository' (for code) or 'Project' (for WorkItem)
If there are multiple rows, each of them indicates close to 10,000 folders that are still to be crawled for indexing. As these folders are picked up and crawled, they can uncover more sub-folders, which are indexed recursively (essentially a BFS traversal of the TFVC folder hierarchy). As indexing progresses, you should observe monotonically increasing BatchIds being processed and removed from the table. When indexing is over, the above query should return zero rows.
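To picture why new batches keep appearing while older ones are removed, here is a toy breadth-first crawl that processes pending folders in fixed-size batches. It is only a sketch of the traversal idea: the folder tree, the batch size of 2 (instead of roughly 10,000), and the function crawl_in_batches are made up, and nothing here reflects the real tbl_JobYield layout.

```python
from collections import deque
from typing import Dict, List


def crawl_in_batches(
    children: Dict[str, List[str]], root: str, batch_size: int
) -> List[List[str]]:
    """Breadth-first crawl that processes pending folders in batches.

    Returns the batches in the order they were processed.
    """
    pending = deque([root])  # plays the role of the pending-folder entries
    processed: List[List[str]] = []
    while pending:
        take = min(batch_size, len(pending))
        batch = [pending.popleft() for _ in range(take)]
        for folder in batch:
            # Crawling a folder can uncover sub-folders, which join the queue.
            pending.extend(children.get(folder, []))
        processed.append(batch)
    return processed


tree = {
    "$/": ["$/A", "$/B", "$/C"],
    "$/A": ["$/A/x"],
    "$/C": ["$/C/y", "$/C/z"],
}
for batch_id, batch in enumerate(crawl_in_batches(tree, "$/", 2), start=1):
    print(batch_id, batch)
```

The queue only empties once every discovered sub-folder has been crawled, which corresponds to the query above returning no rows.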