Re: Sitemap gen Apache log technique coupled with an already existing sitemap: am I right?
On my server sitemap_gen has split the sitemaps based on the number of
URLs, because:
- the uncompressed sitemap file is around 5MB
- testing sitemap_gen in verbose mode I get something like this (I have
replaced paths with /.../):
Reading configuration file: /.../sitemapgen/config.xml
Opened ACCESSLOG file: /.../log/apache/access.log
Opened SITEMAP file: /.../sitemap.xml.gz
Sorting and normalizing collected URLs.
Writing Sitemap file "/.../sitemap.xml.gz" with 50000 URLs
Opened SITEMAP file: /.../sitemap1.xml.gz
Sorting and normalizing collected URLs.
Writing Sitemap file "/.../sitemap1.xml.gz" with 3638 URLs
Writing index file "/.../sitemap_index.xml" with 2 Sitemaps
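For reference, the split in the log above follows sitemap_gen's 50,000-URLs-per-file cap: the first file is filled to the limit, the remainder spills into sitemap1.xml.gz, and an index lists both. A minimal sketch reproducing that layout (`plan_sitemaps` is my own illustration, not sitemap_gen's actual API):

```python
SITEMAP_URL_LIMIT = 50000  # per-file URL cap enforced by sitemap_gen

def plan_sitemaps(urls, basename="sitemap"):
    """Return (filename, url_count) pairs mirroring how sitemap_gen
    splits a URL list: sitemap.xml.gz, sitemap1.xml.gz, ..."""
    plan = []
    for i in range(0, len(urls), SITEMAP_URL_LIMIT):
        suffix = "" if i == 0 else str(i // SITEMAP_URL_LIMIT)
        plan.append(("%s%s.xml.gz" % (basename, suffix),
                     min(SITEMAP_URL_LIMIT, len(urls) - i)))
    return plan
```

With the 53,638 URLs from the log, this yields exactly the two files shown: ("sitemap.xml.gz", 50000) and ("sitemap1.xml.gz", 3638).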
>Do you have some idea of how many URLs in total
>your sitemaps will have after you will run the script for some time,
>and hence some idea of how many sitemaps
>you will have and their size uncompressed?
>(I suppose there must be some sort of limit towards which
>these iterations are going).
Google's Sitemap help says that "A Sitemap index file cannot list more
than 1,000 Sitemaps."
At 50,000 URLs per sitemap, that would make 50,000,000 URLs!
But would there be any limit at all if we could list sitemap index
files inside other sitemap index files?
I have around 500k pages indexed by Google.
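The arithmetic, spelled out (the two caps are the ones from Google's documentation; everything else follows from them):

```python
import math

URLS_PER_SITEMAP = 50000    # Google's per-sitemap URL limit
SITEMAPS_PER_INDEX = 1000   # Google's limit on entries in one index file

# Total URLs a single sitemap index can cover:
capacity = URLS_PER_SITEMAP * SITEMAPS_PER_INDEX
print(capacity)  # 50000000, i.e. 50 million

# Sitemap files needed for roughly 500k indexed pages:
needed = math.ceil(500000 / URLS_PER_SITEMAP)
print(needed)  # 10
```

So a single index is far more than enough for my 500k pages; about 10 sitemap files would do.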
For me, the practical limit is the point where the sitemap_gen run
causes a server overload.
As mentioned above, that has already happened to me when I first tried
to run it on a 1.6GB Apache access log file, lol!
Apart from this, I don't know what the limits are with Python and with
my web server.
Currently, with 2 sitemap files and one day's Apache access log, the
script takes around 1 minute to execute, during which the server stays
up and accessible.
Maybe I should seriously monitor server load to see if there is a peak.
If I run into problems, I'll schedule the sitemap_gen.py execution only
once or twice a day, during sleep hours.
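If it comes to that, a small wrapper could also refuse to launch the script while the box is busy. A minimal sketch, assuming a Unix host; the threshold, the `safe_to_run` helper, and the config path are my own, not part of sitemap_gen (only the `--config=` flag is the script's real option):

```python
import os
import subprocess  # used to launch sitemap_gen.py when load permits

def safe_to_run(max_load=2.0):
    """True if the 1-minute load average is below max_load (Unix only)."""
    one_min, _, _ = os.getloadavg()
    return one_min < max_load

def run_sitemap_gen(config="/path/to/config.xml"):
    # Hypothetical path -- point it at the real config.xml.
    if safe_to_run():
        subprocess.call(["python", "sitemap_gen.py", "--config=" + config])
```

A cron job could then call this wrapper at sleep hours and simply skip the run whenever the load is already high.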