Update develop from master for v3.7 #13011

adrianeboyd · 2023-09-25T09:24:46Z

Description

Update develop from master for v3.7.

Types of change

Chore.

Checklist

I confirm that I have the right to submit this contribution under the project's MIT license.
I ran the tests, and all new and existing tests passed.
My changes don't require a change to the documentation, or if they do, I've added all required information.

* Update most recent release * Switch from azure to GHA CI tests badge * Remove link to survey * Format

* initial commit * update for v0.4.0 * Apply suggestions from code review * Fix formatting * Apply suggestions from code review * Update website/docs/api/large-language-models.mdx * Update website/docs/api/large-language-models.mdx * update usage page * Apply suggestions from review * Apply suggestions from review * fix links * fix relative links * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <[email protected]> * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <[email protected]> * Apply suggestions from review * Add section on Llama 2. Format. --------- Co-authored-by: Raphael Mitsch <[email protected]> Co-authored-by: Sofie Van Landeghem <[email protected]>

Additionally remove outdated `is_new_osx` check and settings.

* Added OdyCy to spaCy Universe * Replaced template tags Co-authored-by: Adriane Boyd <[email protected]> --------- Co-authored-by: Adriane Boyd <[email protected]>

* Add cli for finding locations of registered func * fixes: naming and typing * isort * update naming * remove to find-function * remove file:// bit * use registry name if given and exit gracefully if a registry was not found * clean up failure msg * specify registry_name options * mypy fixes * return location for internal usage * add documentation * more mypy fixes * clean up example * add section to menu * add tests --------- Co-authored-by: svlandeg <[email protected]>

* Add data structures to docs * Adjusted descriptions for more consistency * Add _optional_ flag to parameters * Add tests and adjust optional title key in doc * Add title to dep visualizations * fix typo --------- Co-authored-by: thomashacker <[email protected]>

* Update universe.json added entry for Sayswho * Update universe.json updated sayswho entry * Update universe.json * Update website/meta/universe.json * Update website/meta/universe.json --------- Co-authored-by: Adriane Boyd <[email protected]>

…osion#12173) * remove migration support form * initial test commit * add fixture * add combo test * pull out parameter example data * fix formatting on examples * remove unused import * remove unnecessary fmt:off instructions * only set logger level if verbose flag is explicitly set --------- Co-authored-by: svlandeg <[email protected]>

…xplosion#12857)

A mistake in the regex pattern kept it from matching all the desired tokens. With the `r` raw-string prefix, a backslash should be written as a single backslash, not doubled.
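The raw-string pitfall can be shown with a small sketch. The pattern and the sample text here are hypothetical, chosen only for illustration; they are not the pattern that was fixed in spaCy.

```python
import re

# Hypothetical sample text, not taken from the spaCy code base.
text = "3.14 and 42"

# In a raw string, "\\d" is an escaped backslash followed by the letter "d",
# so this pattern only matches a literal backslash followed by one or more "d".
doubled = re.compile(r"\\d+")

# With a single backslash, "\d" is the digit character class as intended.
single = re.compile(r"\d+")

doubled_matches = doubled.findall(text)
single_matches = single.findall(text)

print(doubled_matches)
print(single_matches)
```

The doubled form is only right when the goal is to match a literal backslash in the input.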

* Fix displacy br tag * Prefer , also update package CLI

* Add `cuda12x` for `cupy-cuda12x`. * Drop `cuda-autodetect` from quickstart, set default to `cuda11x` instead.

Add: example sentences to improve the Turkish model. Let's get the tr_web_core_sm out in the world yaa

* Update universe.json added hobbit-spacy to the universe json * Update universe.json removed displacy from hobbit-spacy and added a default text.

spaCy's HashEmbedCNN layer performs convolutions over tokens to produce contextualized embeddings using a `MaxoutWindowEncoder` layer. These convolutions are implemented using Thinc's `expand_window` layer, which concatenates `window_size` neighboring sequence items on either side of the sequence item being processed. This is repeated across `depth` convolutional layers.

For example, consider the sequence "ABCDE" and a `MaxoutWindowEncoder` layer with a context window of 1 and a depth of 2. We'll focus on the token "C". We can visually represent the contextual embedding produced for "C" as:

```mermaid
flowchart LR
A0(A0)
B0(B0)
C0(C0)
D0(D0)
E0(E0)
B1(B1)
C1(C1)
D1(D1)
C2(C2)

A0 --> B1
B0 --> B1
C0 --> B1
B0 --> C1
C0 --> C1
D0 --> C1
C0 --> D1
D0 --> D1
E0 --> D1
B1 --> C2
C1 --> C2
D1 --> C2
```

Described in words, this graph shows that before the first layer of the convolution, the "receptive field" centered at each token consists only of that same token. That is, we have a receptive field of 1. The first layer of the convolution adds one neighboring token on either side to the receptive field. Since this is done on both sides, the receptive field increases by 2, giving the first layer a receptive field of 3. The second layer of the convolution adds an _additional_ neighboring token on either side to the receptive field, giving a final receptive field of 5.

However, this doesn't match the formula currently given in the docs, which reads:

> The receptive field of the CNN will be
> `depth * (window_size * 2 + 1)`, so a 4-layer network with a window
> size of `2` will be sensitive to 20 words at a time.

Substituting in our depth of 2 and window size of 1, this formula gives us a receptive field of:

```
depth * (window_size * 2 + 1)
= 2 * (1 * 2 + 1)
= 2 * (2 + 1)
= 2 * 3
= 6
```

This not only doesn't match our computations from above, it's also an even number! This is suspicious, since the receptive field is supposed to be centered on a token, and not between tokens. Generally, this formula results in an even number for any even value of `depth`.

The error in this formula is that the adjustment for the center token is multiplied by the depth, when it should occur only once. The corrected formula, `depth * window_size * 2 + 1`, gives the correct value for our small example from above:

```
depth * window_size * 2 + 1
= 2 * 1 * 2 + 1
= 4 + 1
= 5
```

These changes update the docs to correct the receptive field formula and the example receptive field size.
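The two formulas can also be checked against a brute-force count. The sketch below is not spaCy or Thinc code: the function names are made up for illustration, and the expansion loop only mimics how each layer widens the set of input positions that feed one output token.

```python
def receptive_field_old(depth: int, window_size: int) -> int:
    """Formula previously given in the docs."""
    return depth * (window_size * 2 + 1)


def receptive_field(depth: int, window_size: int) -> int:
    """Corrected formula: the center token is counted only once."""
    return depth * window_size * 2 + 1


def receptive_field_by_expansion(depth: int, window_size: int) -> int:
    """Count the input positions that reach one output token.

    Starts from the center position 0 and, once per layer, adds every
    position within `window_size` of a position already in the set.
    """
    positions = {0}
    for _ in range(depth):
        positions = {
            p + offset
            for p in positions
            for offset in range(-window_size, window_size + 1)
        }
    return len(positions)


# The "ABCDE" example: depth 2, window size 1.
small = (receptive_field_old(2, 1), receptive_field(2, 1), receptive_field_by_expansion(2, 1))

# The docs example: a 4-layer network with a window size of 2.
docs = (receptive_field_old(4, 2), receptive_field(4, 2), receptive_field_by_expansion(4, 2))

print(small)
print(docs)
```

For the docs example of a 4-layer network with a window size of 2, the corrected formula and the brute-force count both give 17, while the old formula gives 20.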

* fix typo in link * fix REL.v1 parameter

* fix usage example * revert back to v2 to allow hot fix on main

* Update incorrect example config. (explosion#12893) * spacy-llm docs cleanup (explosion#12945) * Shorten NER section * fix template references * simplify sections * set temperature to 0.0 in examples * condense model information * fix parameters for REST models * set temperature to 0.0 * spelling fix * trigger preview * fix quotes * add small note on noop.v1 * move up example noop config * set appropriate model example configs * explain config * fix Co-authored-by: Raphael Mitsch <[email protected]> --------- Co-authored-by: Raphael Mitsch <[email protected]> * Docs for ner.v3 and spancat.v3 spacy-llm tasks (explosion#12949) * formatting * update usage table with NER.v3 * fix typo in links * v3 overview of parameters * add spancat.v3 * add further v3 explanations * remove TODO comment * few more small fixes * Add doc section on LLM + task factories (explosion#12905) * Add section on LLM + task factories. * Apply suggestions from code review --------- Co-authored-by: Sofie Van Landeghem <[email protected]> * add default config to openai models (explosion#12961) * Docs for spacy-llm 0.5.0 (explosion#12967) * simplify Python example * simplify Python example * Refer only to latest OpenAI model versions from usage doc * Typo fix Co-authored-by: Raphael Mitsch <[email protected]> * clarify accuracy claim --------- Co-authored-by: Raphael Mitsch <[email protected]> --------- Co-authored-by: Raphael Mitsch <[email protected]>

* fix construction example * shorten task-specific factory list * small edits to HF models * small edit to API models * typo * fix space Co-authored-by: Raphael Mitsch <[email protected]> --------- Co-authored-by: Raphael Mitsch <[email protected]>

* fix BertWordPieceTokenizer constructor call * fix * Update website/docs/usage/linguistic-features.mdx --------- Co-authored-by: Adriane Boyd <[email protected]>

…lop-from-master-v3.7-1

* add span key option for CLI evaluation * Rephrase CLI help to refer to Doc.spans instead of spancat * Rephrase docs to refer to Doc.spans instead of spancat --------- Co-authored-by: Adriane Boyd <[email protected]>

…lop-from-master-v3.7-1

adrianeboyd · 2023-09-25T09:44:59Z

Let me redo this in a bit...

adrianeboyd and others added 30 commits July 24, 2023 10:41

Update README for v3.6 (explosion#12844)

1d216a7

* Update most recent release * Switch from azure to GHA CI tests badge * Remove link to survey * Format

Switch from distutils to setuptools/sysconfig (explosion#12853)

f8f489b

Additionally remove outdated `is_new_osx` check and settings.

SpanCat: Remove invalid threshold config argument (explosion#12860)

98799d8

Added OdyCy to spaCy Universe (explosion#12826)

51b9655

* Added OdyCy to spaCy Universe * Replaced template tags Co-authored-by: Adriane Boyd <[email protected]> --------- Co-authored-by: Adriane Boyd <[email protected]>

Display model's full base version string in incompatibility warning (e…

222bd3c

…xplosion#12857)

fix (explosion#12881)

3b7faf4

Update br tags (explosion#12882)

45af8a5

* Fix displacy br tag * Prefer , also update package CLI

Allow pydantic v2 using transitional v1 support (explosion#12888)

245e2dd

Update CuPy extras (explosion#12890)

c4e378d

* Add `cuda12x` for `cupy-cuda12x`. * Drop `cuda-autodetect` from quickstart, set default to `cuda11x` instead.

Set version to v3.6.1 (explosion#12892)

458bc5f

Update examples.py (explosion#12895)

d50b8d5

Add: example sentences to improve the Turkish model. Let's get the tr_web_core_sm out in the world yaa

Update universe.json (explosion#12904)

64b8ee2

* Update universe.json added hobbit-spacy to the universe json * Update universe.json removed displacy from hobbit-spacy and added a default text.

Docs: clarify abstract spacy.load examples (explosion#12889)

76a9f9c

docs: fix ngram_range_suggester max_size description (explosion#12939)

d8a32c1

Add headers to netlify.toml [ci skip]

52758e1

Update large-language-models.mdx (explosion#12944)

3e42648

updated add_pipe docs (explosion#12947)

065ead4

fix typo in link (explosion#12948)

5c1f926

* fix typo in link * fix REL.v1 parameter

Fix LLM usage example (explosion#12950)

6d1f6d9

* fix usage example * revert back to v2 to allow hot fix on main

fix training.batch_size example (explosion#12963)

cc78847

Fix in BertTokenizer docs (explosion#12955)

8f0d6b0

* fix BertWordPieceTokenizer constructor call * fix * Update website/docs/usage/linguistic-features.mdx --------- Co-authored-by: Adriane Boyd <[email protected]>

Merge remote-tracking branch 'upstream/master' into chore/update-deve…

e9f0485

…lop-from-master-v3.7-1

adrianeboyd added the v3.7 Related to v3.7 label Sep 25, 2023

evornov and others added 2 commits September 25, 2023 11:25

add --spans-key option for CLI spancat evaluation (explosion#12981)

4e3360a

* add span key option for CLI evaluation * Rephrase CLI help to refer to Doc.spans instead of spancat * Rephrase docs to refer to Doc.spans instead of spancat --------- Co-authored-by: Adriane Boyd <[email protected]>

Merge remote-tracking branch 'upstream/master' into chore/update-deve…

7db189d

…lop-from-master-v3.7-1

adrianeboyd closed this Sep 25, 2023

17 participants