
Regarding Catalog Optimal Output File Format #16

Open
JPGlaser opened this issue Jul 17, 2015 · 15 comments

Comments
@JPGlaser

Hey Guys,

So I am noticing that as I generate larger datasets from larger FoVs, CatSim's write_catalog command seems to run rather slowly over large sets of observations. I have begun writing a catalog writer that stores each observation's instance catalog as a table in a SQLite3 database. However, it occurred to me that I should ask whether such a command already exists within the InstanceCatalog class. Is this the case?

~ Joe
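[Editor's note] A minimal sketch of the kind of writer described above, using Python's built-in sqlite3 module. The function name and the table/column names (obs_*, ra, dec, magnorm) are illustrative stand-ins, not CatSim's actual schema:

```python
import sqlite3

def write_obs_to_db(db_path, obs_id, rows):
    """Store one observation's instance catalog as its own table.

    rows is an iterable of (ra, dec, magnorm) tuples; returns the
    number of rows written.
    """
    conn = sqlite3.connect(db_path)
    table = "obs_%d" % obs_id
    conn.execute(
        "CREATE TABLE IF NOT EXISTS %s (ra REAL, dec REAL, magnorm REAL)"
        % table
    )
    cur = conn.executemany(
        "INSERT INTO %s VALUES (?, ?, ?)" % table, list(rows)
    )
    n = cur.rowcount
    conn.commit()
    conn.close()
    return n

# Example: one tiny observation written to an in-memory database
n = write_obs_to_db(":memory:", 42, [(10.1, -5.2, 21.3), (10.2, -5.1, 23.9)])
```

One database file holding a table per observation keeps everything in a single place and avoids the per-file open/close overhead of thousands of flat text files.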

@JPGlaser changed the title from "Regarding Catalog Output File Format" to "Regarding Catalog Optimal Output File Format" on Jul 17, 2015
@rbiswas4
Contributor

Hi Joe,

So I am noticing that as I generate larger datasets from larger FoVs, CatSim's write_catalog command seems to be running rather slow over large sets of observations.

Yes, this is something I run into as well, and I think it is expected. So it is helpful to break the desired FoV into a number of smaller FoVs.

I have begun creating a catalog writer that stores each observation's instance catalog into a table in a SQLite3 database. However, the thought occurred to me that I should probably ask to see if such a command already existed within the InstanceCatalog class. Is this the case?

Do you mean you are trying to do just that by storing your instance catalogs for smaller fields of view in a SQLite3 database, for convenience of use? I don't think there is such a function, but @danielsf would know better.

@danielsf
Contributor

Hi all,

  1. The photometry mixins are slow. This is something we are going to try to fix in the next month or so.

  2. There is not currently a method to output InstanceCatalogs as sqlite databases. @rbiswas4, do I remember correctly that you put something together for your SNIa catalogs? Should we try to make it an "official" part of InstanceCatalog?

-- Scott

@JPGlaser
Author

Hey @rbiswas4,

This might give you a better idea of what I am up to:
https://github.com/JPGlaser/LSST_CatGen/blob/master/Tests/genLSST_AGNData.ipynb

Basically, I want to automate the process of generating instance catalogs for observed AGN in a selected set of unique pointings over the 10 year period of LSST. The end result is about 8338 observations for 9 unique pointings, which contain a total of 624 AGNs.

It takes about 10 hours to create all of these files. I suspect this is due to the fact that opening the files and writing them via python is the bottleneck, which is why I was curious to see if there were any quicker methods out there. Considering I will be increasing the FoV to at least a radius of 1 deg from 0.05 deg very soon, longterm storage of both observation meta-data and the instance catalogs is something of a concern.

~ Joe

EDIT: Thanks for the comment @danielsf. So the likely bottleneck lies in the photometry mixins?

@rbiswas4
Contributor

@danielsf

  1. There is not currently a method to output InstanceCatalogs as sqlite databases. @rbiswas4, do I remember correctly that you put something together for your SNIa catalogs? Should we try to make it an "official" part of InstanceCatalog?

I had forgotten that :). It seemed so unlikely that I would be writing to a database that I did not think I might have done it.

https://github.com/rbiswas4/SNIacatalogs/blob/cadence_notebook/examples/FindLCsFromOpSim.ipynb

@rbiswas4
Contributor

I don't know how much of the bottleneck is in 'python'. Even though the time is taken by the write method of the instance catalog class, this is really where the database is being queried. And increasing the field of view simply means querying through more rows.

@danielsf
Contributor

I agree with Rahul. I just did a quick test with a field of view of radius 0.05 degrees (which took 4 seconds) and one of radius 0.1 degrees (which took 60 seconds). The amount of time spent in the mixins quadrupled from 1 second to 4 seconds. The amount of time querying the database went from 1 second to 57 seconds. The galaxies in the database are pretty dense, so you're pulling in a lot of them for any field of view.

There might be a way to speed up the database query by cutting out objects that do not have AGN filenames specified at the SQL level. It will require some patches to the code, though. Give me a day or two.

@danielsf
Contributor

Actually, I was wrong. You might be able to speed it up now. Try changing your write_catalog statement so that it says:

yourCatalog.write_catalog('yourCatalogName.txt', constraint='sedname_agn is not NULL')

The constraint kwarg adds an extra constraint to the SQL query. sedname_agn is what the agn SED name is called on fatboy. Applying the "it must have an AGN attached to it" cut at the SQL level, rather than at the python level with the cannot_be_null variable in InstanceCatalog, ought to be faster.

Let me know if that does not work.

@JPGlaser
Author

Thanks @rbiswas4, I will look into your ipynb for a possible way to store the set of instance catalogs in a database for long-term storage.

Thanks to you as well @danielsf. I will try that out with the next run later today to see if it does in fact speed things up. Also, is there a way to run a loop with write_catalog inside it in parallel? That might be a good thing for us to look into to speed up writing large groups of instance catalogs for independent observations.

~ Joe
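[Editor's note] On the parallel question: one hedged possibility is a multiprocessing.Pool over the observation list, assuming each worker opens its own database connection (connections generally cannot be shared across processes). write_one here is a stand-in for the real per-observation work, not CatSim code:

```python
from multiprocessing import Pool

def write_one(obs_id):
    """Stand-in for instantiating an InstanceCatalog and calling
    write_catalog for a single observation; here it just returns the
    filename it would have written."""
    filename = "catalog_%d.txt" % obs_id
    # Real code would open its own DB connection here, build the
    # catalog for this observation, and call write_catalog(filename).
    return filename

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        written = pool.map(write_one, range(8))
    print(written)
```

Note that if the database query is the real bottleneck (as suggested above), parallel workers may simply contend for the same server, so the speedup is not guaranteed.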

@JPGlaser
Author

Hey Scott @danielsf,
Was constraint a recent addition to the lsst_sims package? I am currently getting this error:

SQLRules = '(sedname_agn IS NOT NULL) AND (agn_sfu < 24.0)'
variableAgn.write_catalog(WorkingDir+'/'+CatFileName, write_mode='a', constraint=SQLRules)

TypeError: write_catalog() got an unexpected keyword argument 'constraint'

~ Joe

@danielsf
Contributor

Sorry, my mistake. constraint is a kwarg for InstanceCatalog's __init__, so you need to pass it in when you instantiate the catalog.
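[Editor's note] For reference, the corrected call pattern would look something like the sketch below. This is not runnable as-is: VariableAgnCatalog, dbConnection, and obs are hypothetical stand-ins for the class and objects used earlier in this thread; only the placement of the constraint kwarg is the point.

```python
# Sketch: 'constraint' goes to the InstanceCatalog constructor,
# not to write_catalog.
SQLRules = '(sedname_agn IS NOT NULL) AND (agn_sfu < 24.0)'
variableAgn = VariableAgnCatalog(dbConnection, obs_metadata=obs,
                                 constraint=SQLRules)
variableAgn.write_catalog(WorkingDir + '/' + CatFileName, write_mode='a')
```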

@JPGlaser
Author

Yeah, I just found it. Also, it seems that running the catalog with:
SQLRules = '(sedname_agn IS NOT NULL) AND (magnorm_agn < 24.0)'

This makes a very good cut and reduces the time dramatically. On a 0.1 degree radius aperture, one catalog of 15 objects is created in 0.38 seconds with the full SQLRules. Dropping the magnitude cut, a catalog of 272 objects is created in 1.74 seconds. That works out to a difference of 0.0189 seconds per object written. Given that I am writing several thousand files, this may be helpful when we increase the FoV.

~ Joe
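[Editor's note] For the record, the 0.0189 s figure above is the difference of the two per-object averages from the quoted runs; a quick check:

```python
# Timings quoted above: 0.38 s for 15 objects (with the magnitude cut),
# 1.74 s for 272 objects (without it).
per_obj_cut = 0.38 / 15     # average s/object with the cut
per_obj_full = 1.74 / 272   # average s/object without it
diff = per_obj_cut - per_obj_full
print(round(diff, 4))       # 0.0189
```

Note the per-object average is actually higher with the cut (fixed query overhead amortized over fewer objects); the total time drops because far fewer objects are written.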

@JPGlaser
Author

Just ran the code with the SQLRules value described above for a FoV of 1.75 degrees in radius (i.e., LSST's full FoV). It took 34.082017 seconds to generate one observation of 5589 AGNs with mags under 24.0. So it should take 3.291 days running on this iMac to generate a dataset of 9 unique pointings covering a 10x10 square degree swath of the night sky for the entire 10-year span of observations.

This is a significant improvement, so it seems that SQL querying is likely the best long-term way of dealing with large datasets. Maybe we should look into storing the catalogs in a similar manner. My current idea is to make a database of object classes, within which there are time-series tables for each object observed. Each row is then one observation of an object, including observation metadata as well as observed magnitudes, errors, etc.

This sounds reasonable, yes?

~ Joe
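[Editor's note] A sketch of that layout, assuming one light-curve table per object class, one row per observation of an object; the table and column names are illustrative, not a settled schema:

```python
import sqlite3

# One table per object class; each row is one observation of one object.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE agn_lightcurves (
        object_id   INTEGER,
        mjd         REAL,   -- observation epoch
        filter      TEXT,   -- LSST band
        mag         REAL,   -- observed magnitude
        mag_err     REAL
    )
""")
conn.execute("INSERT INTO agn_lightcurves VALUES (1, 59580.1, 'r', 21.4, 0.02)")

# Pulling one object's light curve is then a single ordered query.
rows = conn.execute(
    "SELECT mag FROM agn_lightcurves WHERE object_id = 1 ORDER BY mjd"
).fetchall()
```

An index on (object_id, mjd) would keep per-object light-curve queries fast as the table grows.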

@connolly

Have you looked at what the SQL is doing on the MSSQL server side? If I understand your numbers, a 4-fold increase in area is a 60-fold increase in database access time. This seems odd, as I wouldn't have expected the database to scale much worse than the raw IO. Is it the query or the manipulation of the galaxy tiling (i.e., if you took a pointing within a single tile of the galaxies, is it any faster)?

cheers
Andy


@danielsf
Contributor

I ran the test again and, while the bigger field of view was still slower, it was not 60 times slower. Sometimes things are slower when you run them for the first time. I'm not sure if this is python initializing itself or some extra overhead when first connecting to fatboy.

Regardless, it is not as bad as I made it sound.

@connolly

That is probably because the database is caching the outputs.

