Hi,

Is it possible to combine queries?

I have a TextNumericSearch:
    from ccdc.search import TextNumericSearch

    text_numeric_search = TextNumericSearch()
    text_numeric_search.add_citation(journal='Acta Crystallogr.,Sect.E:Struct.Rep.Online')
    text_numeric_search.add_citation(year=range(2013, 2017))

And a SMARTS substructure search:
    from ccdc.search import SMARTSSubstructure, SubstructureSearch

    pattern = ['[CH2][CH2][CH2][CH2][CH2]']
    q = SMARTSSubstructure(pattern[0])
    s = SubstructureSearch()
    s.add_substructure(q)

Can I combine the two queries to find entries that satisfy both of them, as in ConQuest?

Hi Pascal,

It is not currently possible to make a combined search directly, though this is under consideration for a future release.  Instead you can perform a second search on the results of the first search.  For example:

    text_hits = text_numeric_search.search()
    # This gives 3234 hits
    smarts_hits = s.search([h.identifier for h in text_hits])
    # This gives 231 hits, of which 112 are different structures
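
The difference between the number of hits and the number of different structures arises because one structure can match the substructure more than once. A minimal sketch of how the two counts relate, assuming only that each hit exposes an `identifier` attribute as in the snippet above; the `Hit` tuple and the identifiers are made-up stand-ins, not real search results:

```python
from collections import namedtuple

# Hypothetical stand-in for a search hit; only `identifier` is assumed.
Hit = namedtuple("Hit", ["identifier"])


def count_distinct_structures(hits):
    """Return the number of unique identifiers among the hits."""
    return len({h.identifier for h in hits})


# Made-up identifiers: the first structure matches twice.
example_hits = [Hit("AAAAAA"), Hit("AAAAAA"), Hit("BBBBBB")]

print(len(example_hits), "hits,",
      count_distinct_structures(example_hits), "different structures")
```

The same set comprehension should work on a real hit list, since it relies only on `identifier`.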

I hope this is helpful.  Please get back in touch if anything is unclear.

Best wishes
Richard

 

 

Thanks.

So if I understand correctly, I can pass a list of identifiers and the search will run on just those entries?

At the moment I am just testing the text field manually in the loop while processing the hits, which is more or less the same approach.

Hi Pascal,

Yes, all the searches except TextNumericSearch will accept a list of identifiers, a molecule, a crystal, an entry or a file.  By default they will search the CSD.  You will find it faster to do the TextNumericSearch first, since that is much faster than the substructure search.

Best wishes
Richard

The two queries are fine independently, but when I combine them it is extremely slow.

I have done a few searches with a limit on the number of hits from the first one:
1000 hits: 10 s
2000 hits: 21 s
4000 hits: 42 s

The first search returns ~39000 hits, so it would take more than 5 min...
If I do the second search on the full database it takes 20s.
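
As a rough check of that extrapolation, here is a small sketch that fits the three timings above with a straight line through the origin and scales it to the full hit count. It assumes the cost stays linear in the number of identifiers, which the three measurements suggest but do not prove:

```python
# Timings reported above: (number of identifiers searched, seconds taken).
timings = [(1000, 10.0), (2000, 21.0), (4000, 42.0)]

# Least-squares slope of a line through the origin: seconds per identifier.
seconds_per_hit = sum(n * t for n, t in timings) / sum(n * n for n, _ in timings)

# Assumption: the cost keeps growing linearly up to the full ~39000 hits.
total_hits = 39000
estimated_seconds = seconds_per_hit * total_hits

print(f"{seconds_per_hit * 1000:.1f} ms per identifier")
print(f"about {estimated_seconds / 60:.1f} minutes for {total_hits} identifiers")
```

Under that assumption the estimate comes out at just under seven minutes, against about 20 s for the same substructure search on the full database.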

 

    import sys

    from ccdc.search import QuerySubstructure, SubstructureSearch, TextNumericSearch

    print("Text search...")
    text_numeric_search = TextNumericSearch()
    text_numeric_search.add_citation(journal='Acta Crystallogr.,Sect.E:Struct.Rep.Online')
    #text_numeric_search.settings.max_hit_structures = 1000
    texthits=text_numeric_search.search()

    s = SubstructureSearch()
    cf3_substructure = QuerySubstructure()
    c = cf3_substructure.add_atom('C')
    F1 = cf3_substructure.add_atom('F')
    b1 = cf3_substructure.add_bond('Single', c, F1)
    F2 = cf3_substructure.add_atom('F')
    b2 = cf3_substructure.add_bond('Single', c, F2)
    F3 = cf3_substructure.add_atom('F')
    b3 = cf3_substructure.add_bond('Single', c, F3)
    c1 = cf3_substructure.add_atom('C')
    b4 = cf3_substructure.add_bond('Single', c, c1)

    search_settings = s.Settings()
    search_settings.has_3d_coordinates = True
    search_settings.max_r_factor = 5
    search_settings.no_errors = True
    search_settings.no_disorder = True    
    search_settings.no_powder = True

    s.add_substructure(cf3_substructure)    
    s.settings=search_settings
    print("Substructure search...")
    hits = s.search([h.identifier for h in texthits], max_hit_structures=500, max_hits_per_structure=1)    
    print(len(hits))
    sys.exit()

Hi Pascal,

I'm afraid the substructure search over the list of identifiers will be slow, since it will not be able to make use of the screens for the database and so will attempt the substructure match on each of the 39000 structures.

You may find it faster in this case to perform both searches on the full database and combine them by hand:

    substructure_hits = substructure_search.search()
    text_hits = text_numeric_search.search()
    text_ids = set(h.identifier for h in text_hits)
    both_hits = [h for h in substructure_hits if h.identifier in text_ids]

Let me know if this is better.

Best wishes
Richard

 

I think the best option is to run just the substructure search and then filter on the text fields while processing the hits.
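
A minimal sketch of that filter-while-processing idea. The `Hit` tuple and its `journal` attribute are hypothetical placeholders for illustration; with real hits the journal would have to be read from the entry behind each hit, and the exact attribute path is not shown in this thread:

```python
from collections import namedtuple

# Hypothetical stand-in for a search hit; `journal` is a placeholder
# attribute, not a documented property of real CSD hit objects.
Hit = namedtuple("Hit", ["identifier", "journal"])


def filter_by_journal(hits, journal):
    """Yield only the hits whose journal field matches exactly."""
    for hit in hits:
        if hit.journal == journal:
            yield hit


# Made-up identifiers and journal names, for illustration only.
substructure_hits = [
    Hit("AAAAAA", "Journal A"),
    Hit("BBBBBB", "Journal B"),
    Hit("CCCCCC", "Journal A"),
]

kept = list(filter_by_journal(substructure_hits, "Journal A"))
print([h.identifier for h in kept])
```

Because the filter is a generator, each hit is checked only when the processing loop reaches it, so no second search and no intermediate identifier list are needed.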

 
